Week 14

At last week’s meeting, we dug into the computational portion of the project, discussing different ways we could run our algorithm on the gene lists and how each would affect our results. We considered two methods: one uses a support vector machine, and the other is an iterative method of computing scores, shown below.

Murali et al. Network-based prediction and analysis of HIV dependency factors. PLOS Computational Biology, 2011.

Alex was able to track down some software for the former method from the autism paper that inspired our project. One of our tasks was to try to get this software to work on our current gene lists – unfortunately, we hit a bit of a wall with this method.

For the second method, we were able to successfully run it on Alex’s gene list. I also had the task of running the iterative method on a toy example: a small graph with 10 nodes and 11 edges, in which 1 node is a positive and 1 is a negative. Anna informed me that when she did it, the scores converged after about 27 time steps, which matches the results that I got. We will go over the results in this week’s meeting and discuss what our goals for this week are.
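To make the idea of an iterative scoring method concrete, here is a minimal sketch. It is not the exact algorithm from the paper or from our code: it assumes positives are pinned at a score of 1, negatives at 0, and every other node repeatedly takes the average of its neighbors' scores until the largest change falls below a tolerance. The function name, the starting score of 0.5, the tolerance, and the four-node example graph are all made up for illustration.

```python
def propagate_scores(edges, positives, negatives, tol=1e-6, max_steps=1000):
    """Score unlabeled nodes as the mean of their neighbors' scores (assumed rule)."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Assumption: positives stay at 1.0, negatives at 0.0, the rest start at 0.5.
    scores = {node: 0.5 for node in neighbors}
    scores.update({node: 1.0 for node in positives})
    scores.update({node: 0.0 for node in negatives})
    fixed = set(positives) | set(negatives)

    for step in range(1, max_steps + 1):
        new_scores = dict(scores)
        for node, nbrs in neighbors.items():
            if node not in fixed:
                new_scores[node] = sum(scores[m] for m in nbrs) / len(nbrs)
        change = max(abs(new_scores[n] - scores[n]) for n in scores)
        scores = new_scores
        if change < tol:
            return scores, step  # converged after `step` time steps
    return scores, max_steps


# Hypothetical toy graph: a path a - b - c - d with one positive and one negative.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d")]
toy_scores, steps_taken = propagate_scores(toy_edges, positives={"a"}, negatives={"d"})
```

On this path graph the unlabeled node closer to the positive ends up with the higher score, which is the behavior we want to check on a toy example before trusting results on a real gene list.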

GIANT Network

This past week I continued looking at the underlying structure of the GIANT network, especially how it changes when we “trim” edges off to make it a more manageable size.

Following up on the path length distribution of the previous week, I calculated this statistic on (almost) all the trimmed networks, with probability thresholds from 0.150 to 0.900. The path lengths change as we expected. In the first half of the networks, the most common shortest path length is around 3, and the distribution shifts toward longer paths as the networks get smaller.

(Color and corresponding probability threshold: Green = 0.150, red = 0.175, cyan = 0.200, blue = 0.300, magenta = 0.400, yellow = 0.500, pink = 0.600, black = 0.700, olive = 0.800, orange = 0.900)

I also did a quick search for the nodes with the highest degree in each network in order to determine which ones were “hubs.” In the three largest networks, the same gene was the node with the highest degree (its degree fell from 3283 to 1969 as the trimmed networks got smaller). This gene is called neurotrophic receptor tyrosine kinase 3, or NTRK3.

Another task I was given was to determine shortest paths while taking the edge weights into account. There was a trick to this: our most important edges have the largest weights (close to 1), but weighted shortest path algorithms favor edges with lower weights, so we needed to adjust for this. We agreed to fix this by taking the negative log (base 10) of each probability weight, so that high-probability edges become low-cost edges. Since our trimmed networks do not contain edges with probability 0, we don’t have to worry about log(0). I am currently debugging this code and will write another post interpreting the results when I have figured this out.
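As an illustration of the transformation (not my actual code, which is still being debugged), here is a minimal sketch that converts probability weights to costs with a negative log base 10 and then runs Dijkstra's algorithm. Because the costs along a path add up, the lowest-cost path is the one whose probabilities have the largest product. The function names and the three-edge example are hypothetical.

```python
import heapq
import math


def to_cost_graph(weighted_edges):
    """Build an undirected adjacency list with cost = -log10(probability)."""
    adjacency = {}
    for u, v, prob in weighted_edges:
        cost = -math.log10(prob)  # requires prob > 0; prob near 1 gives cost near 0
        adjacency.setdefault(u, []).append((v, cost))
        adjacency.setdefault(v, []).append((u, cost))
    return adjacency


def dijkstra(adjacency, source):
    """Return the lowest total cost from source to every reachable node."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale heap entry
        for nbr, cost in adjacency.get(node, []):
            new_d = d + cost
            if new_d < dist.get(nbr, math.inf):
                dist[nbr] = new_d
                heapq.heappush(heap, (new_d, nbr))
    return dist


# Made-up example: the direct A-C edge is weak, the two-step route via B is strong.
example_edges = [("A", "B", 0.9), ("B", "C", 0.9), ("A", "C", 0.2)]
costs = dijkstra(to_cost_graph(example_edges), "A")
```

In this example the path A-B-C (probabilities 0.9 and 0.9) is preferred over the direct A-C edge (probability 0.2), even though it has more edges.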

Happy almost end of the semester!!

GIANT Statistics

Computing network statistics on the GIANT network turned out to be somewhat of a challenge! The first hurdle that Alex and I discovered is that the brain-specific network we are working with is enormous, containing around 43 million edges. This network size was simply too big to run any program on efficiently, so Anna stepped in and created a file of “trimmed” networks that we could quickly use for our computations. Essentially, the edges in the network are weighted by a certain probability, so the network was trimmed by choosing a probability threshold; edges with weights less than that probability would not be included in the trimmed network. The largest trimmed network had about 6.4 million edges and a probability threshold of 0.125. I ran my statistics on networks with probability thresholds of 0.150, 0.175, 0.200, and 0.300 to see if there were differences in the statistics and what those differences might reveal about the structure of the network.
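The trimming step is simple to sketch. The snippet below is not Anna's code; it assumes the edges are available as (node, node, probability) tuples and keeps only those at or above the threshold. It is written as a generator so that a very large edge list would not have to be held in memory at once. The gene names and weights are invented.

```python
def trim_edges(weighted_edges, threshold):
    """Yield only the edges whose probability is at least the threshold."""
    for u, v, prob in weighted_edges:
        if prob >= threshold:
            yield (u, v, prob)


# Made-up edges for illustration only.
sample_edges = [
    ("geneA", "geneB", 0.95),
    ("geneA", "geneC", 0.16),
    ("geneB", "geneC", 0.12),
]
trimmed = list(trim_edges(sample_edges, threshold=0.150))
```

Raising the threshold can only remove edges, which is why the trimmed networks shrink as the threshold goes from 0.150 to 0.300.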

The statistics I ran on the trimmed networks were degree distribution, average AND, and shortest path length distribution. The degree distribution is the most straightforward – the degree of a node in a graph is the number of nodes it is connected to, so the degree distribution is a histogram of the number of nodes in the network with each degree. The shape of the curve provides information about the structure of the graph. If you plot the degree distribution on log-log axes, a roughly straight, downward-sloping line suggests that the network is scale-free, meaning its degree distribution follows what is known as a “power law distribution.” Scale-free networks generally contain a small number of nodes with a high degree and a large number of nodes with a low degree.
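Here is a minimal sketch of how a degree distribution can be computed from an edge list. It is an illustration rather than my actual program, and it assumes an undirected network with no duplicate edges or self-loops. The function names and the small star-shaped example are made up.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Map each degree to the number of nodes that have that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


def log_log_points(distribution):
    """Return (log10 degree, log10 count) pairs, sorted by degree, for plotting."""
    return [(math.log10(k), math.log10(n)) for k, n in sorted(distribution.items())]


# Made-up star graph: one hub connected to four leaves.
star = [("hub", "n1"), ("hub", "n2"), ("hub", "n3"), ("hub", "n4")]
distribution = degree_distribution(star)
points = log_log_points(distribution)
```

For the star graph, four nodes have degree 1 and one node has degree 4, a tiny version of the "few hubs, many low-degree nodes" pattern.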

Below is the degree distribution for the trimmed network with a probability threshold of 0.150. All degree distributions calculated on the trimmed networks looked the same.

AND is short for the average neighbor degree – for a given node, it is the average degree of that node’s neighbors (the nodes it is connected to). Average AND answers the following question: On average, what is the degree of the neighbors of nodes with a certain degree? This question essentially investigates whether there is a pattern in the degree of neighbors of nodes with a certain degree. Once again, the slope of the line reveals a piece of information about the structure of the network. A negative correlation means high degree nodes tend to be connected to low degree nodes, also known as a disassortative network. A positive correlation means high degree nodes tend to be connected to other high degree nodes and low degree nodes tend to be connected to other low degree nodes, also known as an assortative network. The following figure from a paper on biological network connectivity demonstrates this concept quite clearly:

Overall, the average AND plots of the trimmed networks appear to be assortative, though the shape differs slightly. For example, compare the average AND plots of trimmed networks with 0.150 (top), 0.175 (middle), and 0.300 (bottom) threshold probabilities:
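For readers who want to see the calculation behind these plots, here is a minimal sketch of average AND, assuming an undirected edge list. It is an illustration, not my actual program; the function name and the star-shaped example are made up.

```python
from collections import defaultdict


def average_and_by_degree(edges):
    """For each degree k, average the neighbor degree (AND) over nodes of degree k."""
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    degree = {node: len(nbrs) for node, nbrs in neighbors.items()}

    # AND of a node: the mean degree of its neighbors.
    and_per_node = {
        node: sum(degree[m] for m in nbrs) / len(nbrs)
        for node, nbrs in neighbors.items()
    }

    # Average AND: group nodes by their own degree and average their AND values.
    grouped = defaultdict(list)
    for node, value in and_per_node.items():
        grouped[degree[node]].append(value)
    return {k: sum(values) / len(values) for k, values in grouped.items()}


# Made-up star graph: one hub connected to four leaves.
star = [("hub", "n1"), ("hub", "n2"), ("hub", "n3"), ("hub", "n4")]
result = average_and_by_degree(star)
```

A star graph is an extreme disassortative case: the degree-4 hub has neighbors of degree 1, and the degree-1 leaves have a neighbor of degree 4, so average AND goes down as degree goes up.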

The final statistic is the path length distribution. This statistic is calculated using the breadth-first search (BFS) algorithm to determine the length of the shortest path between each pair of nodes. However, due to the size of the networks, my program doesn’t look at the paths between all pairs of nodes. Instead, it runs the BFS algorithm twice: once until it hits 100,000 paths and again until it hits 200,000 paths. This was mainly done to see if there was a huge difference in the distribution of path lengths. There is a slight difference, as demonstrated by the distributions of the trimmed network with 0.150 probability threshold:
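The capped BFS approach can be sketched roughly as follows. This is an assumption about how such a program might be organized rather than my exact code: it runs BFS from one source node after another, records the length of every shortest path it finds, and stops once a cap on the number of recorded paths is reached. Each path is counted once per source, so a pair of nodes can be counted in both directions. The function name and the three-node example are hypothetical.

```python
from collections import Counter, deque


def path_length_distribution(neighbors, max_paths):
    """Count shortest path lengths found by BFS, stopping after max_paths paths."""
    counts = Counter()
    recorded = 0
    for source in neighbors:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in neighbors[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    counts[dist[nbr]] += 1
                    recorded += 1
                    if recorded >= max_paths:
                        return counts  # cap reached, stop early
                    queue.append(nbr)
    return counts


# Made-up path graph a - b - c, stored as an adjacency list.
tiny = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
full = path_length_distribution(tiny, max_paths=100)
capped = path_length_distribution(tiny, max_paths=2)
```

Because the cap cuts the search off partway through, the result depends on which source nodes are visited first, which is one reason to compare the 100,000 and 200,000 runs.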

The next step for this statistic is to normalize the number of paths and see what difference this makes. Over the next couple of weeks, my goal is to refine my positive set of genes and check the GIANT brain-specific network to see if any of these genes appear.

 

A Quick Update & Overview of Goals

Fall break is over and we’re back to work!

Before break, I found a great resource that summarizes cell motility proteins by grouping them by function; it includes chemotaxis, receptors, growth factors, rho family GTPases, adhesion, integrin-mediated signaling, cellular projections, cell polarity, and proteolysis. This resource constitutes a significant portion of the positive set of proteins involved in cell motility.

Over the past week, I have started the next step in creating my positive set of proteins known to be participants in cell motility and schizophrenia. Because many papers cited pathways (in addition to specific proteins), it’s crucial to look at these pathways and comb through them for proteins to add to the positive set. These pathways, which include the CAM pathway, FAK pathway, and Reelin pathway, were taken from KEGG, a pathway database. Unfortunately, the KEGG pathways download as XML files that are hard to read by eye, so I must parse these files; I am currently using a parser developed by Anna Ritz. After I have parsed these pathways, my next small step is to see which, if any, proteins are involved in multiple pathways.
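To give a rough idea of what this parsing and overlap check involves, here is a minimal sketch. It is not Anna's parser. It assumes the XML follows the KGML layout as I understand it, where each gene appears as an entry element with type="gene" and a name attribute holding one or more space-separated gene IDs. The function names, pathway labels, and gene IDs below are placeholders, not real data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def genes_in_pathway(kgml_text):
    """Collect gene IDs from entry elements of type 'gene' (assumed KGML layout)."""
    root = ET.fromstring(kgml_text)
    genes = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "gene":
            genes.update(entry.get("name", "").split())
    return genes


def genes_in_multiple_pathways(pathway_genes):
    """Return each gene found in more than one pathway, with those pathways."""
    membership = defaultdict(set)
    for pathway, genes in pathway_genes.items():
        for gene in genes:
            membership[gene].add(pathway)
    return {gene: paths for gene, paths in membership.items() if len(paths) > 1}


# Placeholder XML fragments with invented IDs.
pathway_a = '<pathway><entry id="1" name="hsa:0001 hsa:0002" type="gene"/></pathway>'
pathway_b = (
    '<pathway><entry id="1" name="hsa:0002" type="gene"/>'
    '<entry id="2" name="cpd:C00000" type="compound"/></pathway>'
)
parsed = {
    "pathway_A": genes_in_pathway(pathway_a),
    "pathway_B": genes_in_pathway(pathway_b),
}
shared = genes_in_multiple_pathways(parsed)
```

Genes that show up in several of the pathways cited in the literature may be good candidates for higher confidence in the positive set.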

Once my positive set has come together, I will begin analyzing the GIANT network. This includes comparing KEGG proteins to the GIANT network as well as generating summary statistics of the GIANT network. I will go into more detail about what this entails as I complete this portion of my analysis, but it will include generating statistics such as degree distribution, average neighbor degree (AND), average AND, and possibly a few others.

Weeks 4 and 5: Building a Gold Standard List

Over the past couple of weeks, Alex and I have been given the task of building our project’s gold standard list. This list will be composed of genes known to be associated with schizophrenia and genes known to be associated with cell motility; Alex was given the schizophrenia half to research and I was given the cell motility half. Overwhelmed by the sheer volume of genes and pathways associated with cell motility, I began by looking up genes known to be associated with both schizophrenia and cell motility. In our meeting, we briefly discussed how these genes might serve us later as known positives we could use to test our algorithm.

I read through tons of articles and papers, many of which I found through references in previous papers we have read. In reading these papers, I was hunting for genes, pathways, and resources that the researchers studied and used, and I found several that will be useful to us. The CAM pathway and associated genes appeared multiple times, and the resource KEGG was used by the majority of the researchers. My task over the past week has been to download data from KEGG, continue accumulating genes for our gold standard list, and begin adding “confidence details” to the genes – this is an important weight that will be added to the genes in our list and will be a measure of how “confident” we are that the gene is truly associated with either schizophrenia or cell motility.

 

Week 2: FAK and Schizophrenia

This week we took a closer look at a paper that investigated cell adhesion, cell motility, and focal adhesion dynamics in schizophrenia patients.

These migration functions are regulated by focal adhesion kinase (FAK) proteins. The focal adhesion kinase signaling pathway involves the expression of integrin genes; integrins are proteins that detect cell adhesion as well as attach the cell to the extracellular matrix during cell migration. A previous study showed that the expression of two integrin genes (ITGA8, ITGA3) was altered in schizophrenia-derived cells.

The study was conducted on olfactory neurosphere-derived cells (accessible via biopsy of the olfactory mucosa) from 9 healthy male subjects and 9 male schizophrenia patients using two different assays. The assays involved seeding the cells on fibronectin-coated plates or chambers, allowing them to attach for 4 hours, washing away non-adherent cells, and allowing remaining cells to migrate for different periods of time. These experiments were repeated with the presence of FAK phosphorylation inhibitors, and then again while blocking antibodies to two different types of integrins. Several pieces of data were analyzed, including levels of pFAK, migration distance, speed, and size and number of focal adhesions present in the cells.

The results of this study demonstrated several important disparities that we are paying close attention to. While there was no difference in levels of FAK between patients and control subjects, patients had significantly lower levels of phosphorylated FAK (pFAK). In addition, patient cells had fewer adhesions, were less adherent, and were more motile than control cells, with a higher percentage of patient cells migrating further and with greater speed. When pFAK was inhibited, and when either of two different types of integrins was blocked with antibodies (three separate experiments), patient cell motility was reduced to control levels but control cell motility was unchanged. This last result led us to conclude that the phosphorylation of FAK does not work like an on/off switch; rather, phosphorylation of FAK alters its behavior.

This paper gave us lots of food for thought; it provided us with a starting place to form our list of schizophrenia genes to investigate. Next week we will be investigating cell motility further by finding cell motility gene databases and papers studying the link between cell motility and schizophrenia.