GIANT Statistics

Computing network statistics on the GIANT network turned out to be somewhat of a challenge! The first hurdle that Alex and I discovered is that the brain-specific network we are working with is enormous, containing around 43 million edges. This network size was simply too big to run any program on efficiently, so Anna stepped in and created a file of “trimmed” networks that we could quickly use for our computations. Essentially, the edges in the network are weighted by a certain probability, so the network was trimmed by choosing a probability threshold; edges with weights less than that probability would not be included in the trimmed network. The largest trimmed network had about 6.4 million edges and a probability threshold of 0.125. I ran my statistics on networks with probability thresholds of 0.150, 0.175, 0.200, and 0.300 to see if there were differences in the statistics and what those differences might reveal about the structure of the network.
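To make the trimming idea concrete, here is a minimal sketch in Python. It is not Anna's actual script; the function name `trim_network` and the toy edge list are made up for illustration, and the real network would be read from a file rather than typed in by hand.

```python
def trim_network(edges, threshold):
    """Keep only edges whose weight is at least `threshold`.

    `edges` is a list of (node_u, node_v, weight) tuples. Edges with
    weights below the threshold are dropped, as described above.
    """
    return [(u, v, w) for u, v, w in edges if w >= threshold]


# Toy weighted edge list (node names and weights are invented).
toy_edges = [
    ("A", "B", 0.10),
    ("A", "C", 0.16),
    ("B", "C", 0.25),
    ("C", "D", 0.31),
]

# The same thresholds I used on the real trimmed networks.
for threshold in (0.150, 0.175, 0.200, 0.300):
    trimmed = trim_network(toy_edges, threshold)
    print(threshold, len(trimmed))
```

Raising the threshold can only remove edges, so each trimmed network is a subset of the one before it.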

The statistics I ran on the trimmed networks were degree distribution, average AND, and shortest path length distribution. The degree distribution is the most straightforward – the degree of a node in a graph is the number of nodes it is connected to, so the degree distribution is a histogram of the number of nodes in the network with a certain degree. The shape of the curve provides information about the structure of the graph. If you plot the degree distribution on log-log axes (taking the log of both the degree and the node count), a nice downward sloping line suggests that the network is scale-free, meaning its degree distribution follows what is known as a “power law distribution.” Scale-free networks generally contain a small number of nodes with a high degree and a much larger number of nodes with a small degree.

Below is the degree distribution for the trimmed network with a probability threshold of 0.150. All degree distributions calculated on the trimmed networks looked the same.
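For anyone curious how a degree distribution is computed, here is a small sketch on a toy graph. It assumes an undirected, unweighted edge list; the function names and the toy edges are my own for illustration and are not taken from my actual analysis code.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree: number of nodes with that degree}.

    `edges` is a list of (u, v) pairs for an undirected graph.
    Nodes that never appear in an edge are not counted.
    """
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    return Counter(len(nbrs) for nbrs in neighbors.values())


def log_log_points(dist):
    """Convert the distribution to (log10 degree, log10 count) points."""
    return [(math.log10(k), math.log10(n)) for k, n in sorted(dist.items())]


# Toy graph: one hub connected to four nodes, plus one extra edge.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("a", "b")]
dist = degree_distribution(toy_edges)
print(sorted(dist.items()))
print(log_log_points(dist))
```

Plotting the points from `log_log_points` and checking for a roughly straight, downward sloping line is the scale-free check described above.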

AND is short for average neighbor degree – for a given node, it is the average degree of that node's neighbors (the nodes it is connected to). Average AND answers the following question: On average, what is the degree of the neighbors of nodes with a certain degree? This question essentially investigates if there is a pattern in the degree of neighbors of nodes with a certain degree. Once again, the slope of the line reveals a piece of information about the structure of the network. A negative correlation means high degree nodes tend to be connected to low degree nodes, also known as a disassortative network. A positive correlation means high degree nodes tend to be connected to other high degree nodes and low degree nodes tend to be connected to other low degree nodes, also known as an assortative network. The following figure from a paper on biological network connectivity demonstrates this concept quite clearly:

Overall, the average AND plots of the trimmed networks appear to be assortative, though the shape differs slightly. For example, compare the average AND plots of trimmed networks with 0.150 (top), 0.175 (middle), and 0.300 (bottom) threshold probabilities:
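As a rough sketch of how average AND can be computed (again on a toy graph, with function names I made up for this post): first compute each node's AND, then average those values over all nodes that share the same degree.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build {node: set of neighbors} from an undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def average_and_by_degree(edges):
    """Return {degree k: mean AND over all nodes with degree k}."""
    adj = build_adjacency(edges)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    # AND of a node = mean degree of its neighbors.
    and_of = {
        node: sum(degree[m] for m in nbrs) / len(nbrs)
        for node, nbrs in adj.items()
    }
    grouped = defaultdict(list)
    for node, k in degree.items():
        grouped[k].append(and_of[node])
    return {k: sum(vals) / len(vals) for k, vals in sorted(grouped.items())}


# Toy star graph: a hub with three leaves.
star = [("hub", "a"), ("hub", "b"), ("hub", "c")]
print(average_and_by_degree(star))
```

In the star graph the degree-1 leaves all neighbor the degree-3 hub and vice versa, so average AND falls as degree rises. That is the disassortative pattern; an assortative network would show the opposite trend.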

The final statistic is the path length distribution. This statistic is calculated using the breadth-first search (BFS) algorithm to determine the length of the shortest paths between all pairs of nodes. However, due to the size of the networks, my program doesn’t look at all possible paths between all nodes. Instead, it runs the BFS algorithm twice: once until it hits 100,000 paths and again until it hits 200,000 paths. This was mainly done to see if there was a huge difference in the distribution of path lengths. There is a slight difference, as demonstrated by the distributions of the trimmed network with 0.150 probability threshold:
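Here is a simplified sketch of the capped BFS idea. It is an illustration rather than my actual program: the order in which source nodes are visited, the function name, and the `max_paths` parameter are all assumptions made for this example.

```python
from collections import Counter, deque


def path_length_distribution(adj, max_paths):
    """Count shortest path lengths, stopping after `max_paths` paths.

    `adj` maps each node to an iterable of its neighbors. BFS is run
    from each node in turn; every (source, target) pair reached counts
    as one path, so a pair is counted once in each direction if the
    cap is never hit.
    """
    counts = Counter()
    total = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = dist[node] + 1
                    counts[dist[nbr]] += 1
                    total += 1
                    if total >= max_paths:
                        return counts
                    queue.append(nbr)
    return counts


# Toy path graph: a - b - c - d
toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
print(path_length_distribution(toy_adj, max_paths=3))
print(path_length_distribution(toy_adj, max_paths=100))
```

With a small cap, only the first few source nodes contribute, which is one reason the 100,000-path and 200,000-path runs can give slightly different distributions.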

The next step for this statistic is to normalize the number of paths and see what difference this makes. Over the next couple of weeks, my goal is to refine my positive set of genes and check the GIANT brain-specific network to see if any of these genes appear.


A Quick Update & Overview of Goals

Fall break is over and we’re back to work!

Before break, I found a great resource that summarizes cell motility proteins by grouping them by function; it includes chemotaxis, receptors, growth factors, rho family GTPases, adhesion, integrin-mediated signaling, cellular projections, cell polarity, and proteolysis. This resource constitutes a significant portion of the positive set of proteins involved in cell motility.

Over the past week, I have entered the next step in creating my positive set of proteins known to be participants in cell motility and schizophrenia. Because many papers cited pathways (in addition to specific proteins), it’s crucial to look at these pathways and comb proteins from them to add to the positive set. These pathways, which include the CAM pathway, FAK pathway, and Reelin pathway, were taken from KEGG, a pathway database. Unfortunately, the KEGG pathways download as unreadable XML files, so I must parse these files; I am currently using a parser developed by Anna Ritz. After I have parsed these pathways, my next small step is to see which, if any, proteins are involved in multiple pathways.
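To give a feel for the parsing step, here is a small sketch that uses Python's built-in XML library. It is not Anna's parser. It assumes the KEGG XML stores each gene as an `entry` element with `type="gene"` and a space-separated `name` attribute, and the pathway titles and gene IDs below are invented placeholders rather than real KEGG data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def genes_in_pathway(kgml_text):
    """Return (pathway title, set of gene IDs) from one XML string."""
    root = ET.fromstring(kgml_text)
    genes = set()
    for entry in root.iter("entry"):
        if entry.get("type") == "gene":
            # One entry may list several gene IDs separated by spaces.
            genes.update(entry.get("name", "").split())
    return root.get("title"), genes


def shared_genes(pathways):
    """Return {gene: set of pathway titles} for genes in 2+ pathways."""
    membership = defaultdict(set)
    for title, genes in pathways:
        for gene in genes:
            membership[gene].add(title)
    return {g: t for g, t in membership.items() if len(t) > 1}


# Two toy pathway files (IDs and titles are made up).
toy_a = (
    '<pathway title="Toy pathway A">'
    '<entry id="1" name="hsa:0001 hsa:0002" type="gene"/>'
    '<entry id="2" name="cpd:C00001" type="compound"/>'
    "</pathway>"
)
toy_b = (
    '<pathway title="Toy pathway B">'
    '<entry id="1" name="hsa:0002" type="gene"/>'
    '<entry id="2" name="hsa:0003" type="gene"/>'
    "</pathway>"
)

parsed = [genes_in_pathway(toy_a), genes_in_pathway(toy_b)]
print(shared_genes(parsed))
```

The second function is the "proteins involved in multiple pathways" check: any gene ID that shows up under more than one pathway title is reported along with the pathways it belongs to.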

Once my positive set has come together, I will begin analyzing the GIANT network. This includes comparing KEGG proteins to the GIANT network as well as generating summary statistics of the GIANT network. I will go into more detail about what this entails as I complete this portion of my analysis, but it will include generating statistics such as degree distribution, average neighbor degree (AND), average AND, and possibly a few others.

Weeks 4 and 5: Building a Gold Standard List

Over the past couple of weeks, Alex and I have been given the task of building our project’s gold standard list. This list will be composed of genes known to be associated with schizophrenia and genes known to be associated with cell motility; Alex was given the schizophrenia half to research and I was given the cell motility half. Overwhelmed by the sheer volume of genes and pathways associated with cell motility, I began by looking up genes known to be associated with both schizophrenia and cell motility. In our meeting, we briefly discussed how this might serve us later as a known positive we could use to test our algorithm.

I read through tons of articles and papers, many of which I found through references in previous papers we have read. In reading these papers, I was hunting for genes, pathways, and resources that were studied and employed by the researchers and found several that will be useful to us. The CAM pathway and associated genes appeared multiple times, and the resource KEGG was used by the majority of the researchers. My task over the past week has been to download data from KEGG, continue accumulating genes for our gold standard list, and begin adding “confidence details” to the genes – this is an important weight that will be added to the genes in our list and will be a measure of how “confident” we are that the gene is truly associated with either Schizophrenia or cell motility.


Week 3

No blog post this week! We didn’t have our regular meeting due to the PacNow QB Meeting hosted by Anna and Derek. This week Alex and I are presenting an overview of our project as well as the papers we’ve read so far.

Week 2: FAK and Schizophrenia

This week we took a closer look at a paper that investigated cell adhesion, cell motility, and focal adhesion dynamics in Schizophrenia patients.

These migration functions are regulated by focal adhesion kinase (FAK) proteins. The focal adhesion kinase signaling pathway involves the expression of integrin genes; integrins are proteins that detect cell adhesion as well as attach the cell to the extracellular matrix during cell migration. A previous study showed that the expression of two integrin genes (ITGA8, ITGA3) was altered in schizophrenia-derived cells.

The study was conducted on olfactory neurosphere-derived cells (accessible via biopsy of the olfactory mucosa) from 9 healthy male subjects and 9 male schizophrenia patients using two different assays. The assays involved seeding the cells on fibronectin-coated plates or chambers, allowing them to attach for 4 hours, washing away non-adherent cells, and allowing remaining cells to migrate for different periods of time. These experiments were repeated with the presence of FAK phosphorylation inhibitors, and then again while blocking antibodies to two different types of integrins. Several pieces of data were analyzed, including levels of pFAK, migration distance, speed, and size and number of focal adhesions present in the cells.

The results of this study demonstrated several important disparities that we are paying close attention to. While there was no difference in levels of FAK between patients and control subjects, patients had significantly lower levels of phosphorylated FAK (pFAK). In addition, patient cells had fewer adhesions, were less adherent, and were more motile than control cells, with a higher percentage of patient cells migrating further and with greater speed. When pFAK was inhibited, and when each of two different types of integrins was blocked with antibodies (three separate experiments), patient cell motility was reduced to control levels but control cell motility was not changed. This last result led us to conclude that the phosphorylation of FAK does not work like an on/off switch; rather, phosphorylation of FAK alters its behavior.

This paper gave us lots of food for thought; it provided us with a starting place to form our list of schizophrenia genes to investigate. Next week we will be investigating cell motility further by finding cell motility gene databases and papers studying the link between cell motility and schizophrenia.




Week 1: A Little Context

Week 1! We’re starting off the year, and our research, with some literature review to provide ourselves with a little context. Our proposed research plan mainly draws on two papers that laid out the methodological groundwork for us.

The first paper provides pertinent results and insights into the connection between cell motility and schizophrenia. The researchers concluded that cells from patients with schizophrenia are less adhesive and more motile than the cells of healthy control subjects, a difference that was reduced by inhibiting the focal adhesion kinase (FAK) protein. The results of this paper showed that there is a correlation between the altered motility of patient cells and dysregulated gene expression in the FAK signaling pathway within these cells. This paper provided us with the informational basis necessary to choose what kinds of genes we will focus on – schizophrenia- and migration-associated genes.

The second paper essentially laid out the framework for our methodology. In this paper, the researchers used a machine-learning algorithm to predict genes associated with autism spectrum disorder (ASD). They then validated these predicted genes experimentally using an independent-case sequencing study and were further able to demonstrate that this large set of ASD genes played roles in key pathways and brain development.

Pulling from both these papers, we plan to develop a network diffusion algorithm to identify candidate schizophrenia- and migration-associated genes. We are going to computationally build a list of predicted genes and (hopefully) validate them experimentally. However, before we do that, we need to delve into the nitty-gritty of how the autism paper’s machine-learning algorithm was created.

The paper outlines how the approach was based upon a human brain-specific gene functional-interaction network nicknamed GIANT (Genome-scale Integrated Analysis of gene Networks in Tissues). We began with a few initial questions: How is a functional-interaction standard set up? How was GIANT constructed? These questions will no doubt open up a whole new corridor of doors to explore; over the next couple of weeks, our goal is to investigate these questions through literature review and learn more about how we can utilize these tools in the coming year.