Candidate Genes

This past week we put our iterative algorithm to the test. Alex and I each ran our positive (and negative) lists for 150 iterations using the 0.200 threshold GIANT brain network. As a reminder, the 0.200 threshold network is a trimmed version of the GIANT network in which every edge has a probability score of 0.200 or greater.

I used all 541 of my positives, a combination of two sublists: cell motility genes and cell motility genes implicated in schizophrenia. In the upcoming weeks, I will experiment with running these two lists separately and seeing how the results are affected.

I also used Alex’s list of negatives, which are genes expressed in other organs but not the brain. We will need to closely consider whether it is wise to use this set of negatives alongside cell motility positives.

Even with these two potential issues, our preliminary lists of candidates looked promising – for example, one gene we found is Hes Related Family BHLH Transcription Factor With YRPW Motif 2, also known as HEY2. HEY2 was not only a schizophrenia positive but was also moderately high in the list of cell motility candidates, with a score of 0.56. Another gene we uncovered was 1-Acylglycerol-3-Phosphate O-Acyltransferase 4, or AGPAT4. This was not in either list of positives, but it was given the maximum score of 1.0 in each set of candidates. These two genes illustrate the different ways we can analyze our results and what it means to be a promising candidate – do we care about genes that are positives in the schizophrenia set and are determined by the algorithm to likely be involved in cell motility, or do we care about genes that are determined by the algorithm to be likely involved in both cell motility and schizophrenia? Do we care about both? What is the difference? Is one more promising than the other? In the upcoming weeks, we will be working through these questions in order to refine our list of candidates.

We will also be extending the time that our algorithm runs. 150 iterations may seem like a lot, but cutting the algorithm short may lead to “stunted” results. The iterative method can be thought of as dyes from the positive and negative nodes spreading throughout the network, staining the surrounding nodes in a way that reflects their distances from the positives and negatives. If the iterations are stopped before they converge (i.e., before subsequent iterations stop changing the scores, or color, of the nodes), then the color of the nodes might be closer to the color of the positives and negatives than they should be. Our next step will be to let it run for 1500 iterations and see how our results are affected.
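To make the dye picture concrete, here is a minimal sketch of one common way to write this kind of iterative scoring. It is not necessarily the exact update rule our program uses. It assumes positives are held at 1.0, negatives at 0.0, unlabeled nodes start at 0.5 (an arbitrary choice), and each unlabeled node is repeatedly set to the weighted average of its neighbors' scores. The function name, the tolerance, and the toy graph are all made up for illustration.

```python
def iterate_scores(neighbors, positives, negatives, max_iters=1500, tol=1e-6):
    """Spread scores from labeled nodes until they stop changing.

    neighbors maps each node to a list of (neighbor, edge_weight) pairs.
    Returns (scores, number_of_iterations_run).
    """
    scores = {node: 0.5 for node in neighbors}  # assumed neutral start
    for node in positives:
        scores[node] = 1.0
    for node in negatives:
        scores[node] = 0.0
    fixed = set(positives) | set(negatives)

    for step in range(1, max_iters + 1):
        new_scores = dict(scores)
        max_change = 0.0
        for node, nbrs in neighbors.items():
            if node in fixed or not nbrs:
                continue  # labeled nodes keep their color
            total_weight = sum(w for _, w in nbrs)
            updated = sum(w * scores[nbr] for nbr, w in nbrs) / total_weight
            max_change = max(max_change, abs(updated - scores[node]))
            new_scores[node] = updated
        scores = new_scores
        if max_change < tol:  # converged: no score moved noticeably
            return scores, step
    return scores, max_iters


# Toy path graph: P - a - b - N, all edge weights 1.0 (hypothetical).
toy = {
    "P": [("a", 1.0)],
    "a": [("P", 1.0), ("b", 1.0)],
    "b": [("a", 1.0), ("N", 1.0)],
    "N": [("b", 1.0)],
}
converged, steps = iterate_scores(toy, positives=["P"], negatives=["N"])
early, _ = iterate_scores(toy, positives=["P"], negatives=["N"], max_iters=1)
print(steps, converged["a"], early["a"])
```

On this toy graph the score for node a after a single iteration differs from its converged value, which is the kind of gap we want to rule out by running longer.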

Out of my own curiosity, I also decided to check whether all of the cell motility positives were in the GIANT network, and it turned out that they were not. About 40 genes (out of 541) are missing from the GIANT network. However, I don’t believe this to be a problem; if there are positive cell motility genes not in the GIANT network, it is probably because not all genes involved in cell motility are expressed in every type of cell and are therefore not relevant in a brain-specific network.
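A check like this comes down to a set difference. The sketch below assumes the network is available as a list of (gene, gene, weight) edges and the positives as a set of gene names; the gene names and the function name are placeholders, not our real data or code.

```python
def split_by_membership(positives, edges):
    """Return (present, missing): positives found in the network and those not."""
    network_genes = set()
    for gene_a, gene_b, _weight in edges:
        network_genes.update((gene_a, gene_b))
    present = positives & network_genes
    missing = positives - network_genes
    return present, missing


# Placeholder gene names, for illustration only.
toy_edges = [("GENE_A", "GENE_B", 0.4), ("GENE_B", "GENE_C", 0.7)]
toy_positives = {"GENE_A", "GENE_C", "GENE_X"}
present, missing = split_by_membership(toy_positives, toy_edges)
print(sorted(present), sorted(missing))
```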

Week 15

We have a couple of goals for next week: fix a subtle yet major bug in our iterative method program and then run it on our gene lists to get a preliminary list of candidate genes. I will also write a short program to compare the genes in the GIANT network to the list of cell motility genes.

Week 14

At last week’s meeting, we dug into the computational portion of the project, discussing different ways we could run our algorithm on the gene lists and how that would affect our results. We also discussed two different methods for running on our gene lists – one method involves a support vector machine and the other is an iterative method of computing scores, shown below.

Murali et al. Network-based prediction and analysis of HIV dependency factors. PLOS Computational Biology, 2011.

Alex was able to track down some software for the former method from the autism paper that inspired our project. One of our tasks was to try to get this software to work on our current gene lists – unfortunately, we hit a bit of a wall with this method.

For the second method, we were able to successfully run it on Alex’s gene list. I also had the task of running the iterative method on a small toy example, which was a small graph consisting of 10 nodes, 11 edges, 1 positive, and 1 negative. Anna informed me that when she did it, the scores converged after about 27 time steps, which matches the results that I got. We will go over the results in this week’s meeting and discuss what our goals for this week are.

GIANT Network

This past week I continued looking at the underlying structure of the GIANT network, especially how it changes when we “trim” edges off to make it a more manageable size.

Following up on the path length distribution of the previous week, I calculated this statistic on (almost) all the trimmed networks – from probability thresholds 0.150 to 0.900. The path lengths change as we expected. In the first half of the networks, the most common shortest path length is around 3, and the distribution shifts outward toward longer paths as the networks get smaller.

(Color and corresponding probability threshold: Green = 0.150, red = 0.175, cyan = 0.200, blue = 0.300, magenta = 0.400, yellow = 0.500, pink = 0.600, black = 0.700, olive = 0.800, orange = 0.900)

I also did a quick search of the nodes in the networks with the highest degree in order to determine which ones were “hubs” in the network. In the three largest networks, the same gene was the node with the highest degree (ranging from 3283 to 1969 as the trimmed networks got smaller). This gene is called neurotrophic receptor tyrosine kinase 3, or NTRK3.

Another task I was given was to determine shortest paths considering the weights of the edges. There was a trick to this – because our most important edges have larger weights (close to 1) and weighted shortest path algorithms favor lower-weight edges, we needed to adjust for this. We agreed to fix this by taking the negative log (base 10) of each probability weight. Since our trimmed networks do not contain edges with probability 0, we don’t have to worry about log(0). I am currently working on debugging this code, and will write another post interpreting the results when I have figured this out.
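The negative-log trick can be sketched as follows. This is not the code I am debugging, just an illustration using Dijkstra's algorithm from the Python standard library, with a made-up three-node graph. Because every probability is in (0, 1], each transformed cost is non-negative, which is what Dijkstra requires. Adding costs corresponds to multiplying probabilities, so the "shortest" path is the one with the highest product of edge probabilities.

```python
import heapq
import math


def neg_log_shortest_paths(edges, source):
    """Dijkstra over costs -log10(p); edges are (u, v, p) with 0 < p <= 1."""
    adjacency = {}
    for u, v, p in edges:
        cost = -math.log10(p)  # high probability -> low cost
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))

    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, cost in adjacency.get(node, []):
            candidate = d + cost
            if candidate < dist.get(nbr, math.inf):
                dist[nbr] = candidate
                heapq.heappush(heap, (candidate, nbr))
    return dist


# Hypothetical weights: the two-step route A-B-C (0.9 * 0.9 = 0.81)
# beats the direct edge A-C (0.5).
toy_edges = [("A", "B", 0.9), ("B", "C", 0.9), ("A", "C", 0.5)]
dist = neg_log_shortest_paths(toy_edges, "A")
path_probability = 10 ** -dist["C"]  # undo the transform
print(dist["C"], path_probability)
```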

Happy almost end of the semester!!

GIANT Statistics

Computing network statistics on the GIANT network turned out to be somewhat of a challenge! The first hurdle that Alex and I discovered is that the brain-specific network we are working with is enormous, containing around 43 million edges. This network size was simply too big to run any program on efficiently, so Anna stepped in and created a file of “trimmed” networks that we could quickly use for our computations. Essentially, the edges in the network are weighted by a certain probability, so the network was trimmed by choosing a probability threshold; edges with weights less than that probability would not be included in the trimmed network. The largest trimmed network had about 6.4 million edges and a probability threshold of 0.125. I ran my statistics on networks with probability thresholds of 0.150, 0.175, 0.200, and 0.300 to see if there were differences in the statistics and what those differences might reveal about the structure of the network.
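Trimming by threshold is simple to express in code. The sketch below is only an illustration of the idea, not the script Anna used, and the edge list is invented. It keeps an edge when its weight is at least the threshold.

```python
def trim_network(edges, threshold):
    """Keep only edges whose probability weight is at least `threshold`."""
    return [(u, v, w) for (u, v, w) in edges if w >= threshold]


# Invented edges for illustration.
toy_edges = [
    ("A", "B", 0.95),
    ("A", "C", 0.20),
    ("B", "C", 0.12),
    ("C", "D", 0.31),
]
trimmed = trim_network(toy_edges, 0.200)
print(trimmed)
```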

The statistics I ran on the trimmed networks were degree distribution, average AND, and shortest path length distribution. The degree distribution is the most straightforward – the degree of a node in a graph is the number of nodes it is connected to, so the degree distribution is a histogram of the number of nodes in the network with a certain degree. The shape of the curve provides information about the structure of the graph. If you plot the degree distribution on log-log axes, a roughly straight, downward-sloping line suggests that the network is scale-free, meaning its degree distribution follows what is known as a “power law distribution.” Scale-free networks generally contain a small number of nodes with a high degree and a large number of nodes with a small degree.
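As a small illustration of the degree distribution (not my actual script), the sketch below counts degrees from an edge list and converts the histogram to log-log points. It assumes an undirected graph with no duplicate edges or self-loops; the toy edges are invented.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree k to the number of nodes that have degree k."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Invented graph: a hub with three neighbors, plus one extra edge.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
distribution = degree_distribution(toy_edges)

# Points for a log-log plot: (log10 degree, log10 node count).
loglog_points = sorted(
    (math.log10(k), math.log10(count)) for k, count in distribution.items()
)
print(dict(distribution))
```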

Below is the degree distribution for the trimmed network with a probability threshold of 0.150. All degree distributions calculated on the trimmed networks looked the same.

AND is short for the average neighbor degree – this looks at a node and sees how many neighbors (nodes a node is connected to) its neighbors have. Average AND answers the following question: On average, what is the degree of the neighbors of nodes with a certain degree? This question essentially investigates whether there is a pattern in the degree of neighbors of nodes with a certain degree. Once again, the slope of the line reveals a piece of information about the structure of the network. A negative correlation means high degree nodes tend to be connected to low degree nodes, which is known as a disassortative network. A positive correlation means high degree nodes tend to be connected to other high degree nodes and low degree nodes tend to be connected to other low degree nodes, which is known as an assortative network. The following figure from a paper on biological network connectivity demonstrates this concept quite clearly:
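Here is a minimal sketch of how average AND could be computed, assuming an undirected edge list with no self-loops. The function name and the star-shaped toy graph are invented for illustration. In a star, the high degree hub only touches degree-1 leaves, so the result slopes downward (disassortative).

```python
from collections import defaultdict


def average_and_by_degree(edges):
    """For each degree k, average the neighbor degree (AND) over nodes of degree k."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}

    # AND of one node: mean degree of its neighbors.
    and_by_node = {
        node: sum(degree[n] for n in nbrs) / len(nbrs)
        for node, nbrs in neighbors.items()
    }

    # Average AND: group nodes by their own degree and average their ANDs.
    grouped = defaultdict(list)
    for node, value in and_by_node.items():
        grouped[degree[node]].append(value)
    return {k: sum(values) / len(values) for k, values in grouped.items()}


star_edges = [("hub", "a"), ("hub", "b"), ("hub", "c")]
avg_and = average_and_by_degree(star_edges)
print(avg_and)
```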

Overall, the average AND plots of the trimmed networks appear to be assortative, though the shape differs slightly. For example, compare the average AND plots of trimmed networks with 0.150 (top), 0.175 (middle), and 0.300 (bottom) threshold probabilities:

The final statistic is the path length distribution. This statistic is calculated using the breadth-first search (BFS) algorithm to determine the length of shortest paths between all nodes. However, due to the size of the networks, my program doesn’t look at all possible paths between all nodes. Instead, it runs the BFS algorithm twice: once until it hits 100,000 paths and again until it hits 200,000 paths. This was mainly done to see if there was a huge difference in the distribution of path lengths. There is a slight difference, as demonstrated by the distributions of the trimmed network with 0.150 probability threshold:
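The capped BFS idea can be sketched like this. It is a simplified stand-in for my program, and the details (the order in which source nodes are visited, counting each pair once per direction, the toy graph) are assumptions made for the example. The `normalize` helper shows the normalization step mentioned as a follow-up: dividing each count by the total so runs with different caps can be compared.

```python
from collections import Counter, deque


def path_length_distribution(neighbors, max_paths):
    """Count shortest path lengths via BFS, stopping after `max_paths` paths.

    neighbors maps each node to a list of adjacent nodes. Each ordered
    (source, target) pair counts as one path.
    """
    counts = Counter()
    total = 0
    for source in neighbors:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in neighbors[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    counts[dist[nbr]] += 1
                    total += 1
                    if total >= max_paths:
                        return counts  # cap reached
                    queue.append(nbr)
    return counts


def normalize(counts):
    """Convert raw counts to fractions of all counted paths."""
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}


# Invented path graph: a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
full = path_length_distribution(toy, max_paths=100)
capped = path_length_distribution(toy, max_paths=3)
print(dict(full), dict(capped), normalize(full))
```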

The next step for this statistic is to normalize the number of paths and see what difference this makes. Over the next couple of weeks, my goal is to refine my positive set of genes and check the GIANT brain-specific network to see if any of these genes appear.

 

A Quick Update & Overview of Goals

Fall break is over and we’re back to work!

Before break, I found a great resource that summarizes cell motility proteins by grouping them by function; it includes chemotaxis, receptors, growth factors, rho family GTPases, adhesion, integrin-mediated signaling, cellular projections, cell polarity, and proteolysis. This resource constitutes a significant portion of the positive set of proteins involved in cell motility.

Over the past week, I have entered the next step in creating my positive set of proteins known to be participants in cell motility and schizophrenia. Because many papers cited pathways (in addition to specific proteins), it’s crucial to look at these pathways and comb proteins from them to add to the positive set. These pathways, which include the CAM pathway, FAK pathway, and Reelin pathway, were taken from KEGG, a pathway database. Unfortunately, the KEGG pathways download as unreadable XML files, so I must parse these files; I am currently using a parser developed by Anna Ritz. After I have parsed these pathways, my next small step is to see which, if any, proteins are involved in multiple pathways.
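To show what the parsing step is after, here is a rough sketch using Python's built-in XML parser. It is not the parser I am using. It assumes the files follow a KGML-style layout in which each gene appears as an `entry` element with `type="gene"` and a space-separated `name` attribute. The two inline pathways and their IDs are invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented stand-ins for downloaded pathway files.
PATHWAY_A = """<pathway name="path:toyA" title="Toy pathway A">
  <entry id="1" name="hsa:1111 hsa:2222" type="gene"/>
  <entry id="2" name="cpd:C00000" type="compound"/>
</pathway>"""

PATHWAY_B = """<pathway name="path:toyB" title="Toy pathway B">
  <entry id="1" name="hsa:2222" type="gene"/>
  <entry id="2" name="hsa:3333" type="gene"/>
</pathway>"""


def genes_in_pathway(xml_text):
    """Return (pathway title, set of gene IDs) from one pathway's XML."""
    root = ET.fromstring(xml_text)
    genes = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "gene":
            genes.update(entry.get("name", "").split())
    return root.get("title"), genes


def shared_genes(xml_texts):
    """Map each gene found in more than one pathway to those pathways."""
    membership = defaultdict(set)
    for text in xml_texts:
        title, genes = genes_in_pathway(text)
        for gene in genes:
            membership[gene].add(title)
    return {g: titles for g, titles in membership.items() if len(titles) > 1}


shared = shared_genes([PATHWAY_A, PATHWAY_B])
print(shared)
```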

Once my positive set has come together, I will begin analyzing the GIANT network. This includes comparing KEGG proteins to the GIANT network as well as generating summary statistics of the GIANT network. I will go into more detail about what this entails as I complete this portion of my analysis, but it will include generating statistics such as degree distribution, average neighbor degree (AND), average AND, and possibly a few others.

Weeks 4 and 5: Building a Gold Standard List

Over the past couple of weeks, Alex and I have been given the task of building our project’s gold standard list. This list will be composed of genes known to be associated with schizophrenia and genes known to be associated with cell motility; Alex was given the schizophrenia half to research and I was given the cell motility half. Overwhelmed by the sheer volume of genes and pathways associated with cell motility, I began by looking up genes known to be associated with both schizophrenia and cell motility. In our meeting, we briefly discussed how these genes might serve us later as known positives we could use to test our algorithm.

I read through tons of articles and papers, many of which I found through references in previous papers we have read. In reading these papers, I was hunting for genes, pathways, and resources that were studied and employed by the researchers, and I found several that will be useful to us. The CAM pathway and associated genes appeared multiple times, and the resource KEGG was used by the majority of the researchers. My task over the past week has been to download data from KEGG, continue accumulating genes for our gold standard list, and begin adding “confidence details” to the genes – this is an important weight that will be added to the genes in our list and will be a measure of how “confident” we are that the gene is truly associated with either schizophrenia or cell motility.