It works! Our program produced a list of candidate genes. After 150 iterations, the SZ positives yielded a novel list of genes that may be involved in schizophrenia, ranging from potassium gating to Golgi-associated proteins to proteins involved in cellular motility. The program can still be refined, so we will spend the upcoming weeks narrowing the list and making the process as precise and accurate as possible.
I also ranked the SZ genes myself, since I didn’t want to include all 2,700 implicated genes. First, for every dataset (Psychiatric Genomics Consortium, Common Mind Consortium, SZGene, hiPSC rtRNA results, chromosome conformation predictions), I gathered the magnitude of the change in expression or the frequency of mutation, along with the p value, for each gene. Then I took the log base 2 of the change and the negative log base 2 of the p value and multiplied them together to get a score for each gene.
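Here is a minimal sketch of that scoring step. The function name, the gene labels, and the numbers are all made up for illustration, and the sketch covers a single dataset only, since how scores from different datasets get combined is not something this example tries to settle.

```python
import math

def rank_score(change, p_value):
    """Score = log2(change) * -log2(p value), as described above."""
    if change <= 0 or not (0 < p_value <= 1):
        raise ValueError("change must be positive and p_value in (0, 1]")
    return math.log2(change) * -math.log2(p_value)

# Hypothetical records for illustration only: (gene, change, p value).
records = [
    ("GENE_A", 4.0, 0.25),
    ("GENE_B", 2.0, 0.5),
    ("GENE_C", 8.0, 0.125),
]

# Highest score first.
ranked = sorted(records, key=lambda r: rank_score(r[1], r[2]), reverse=True)
ranked_genes = [gene for gene, _, _ in ranked]
print(ranked_genes)
```

One thing to watch: a change below 1 gives a negative log, so the score goes negative. Taking the absolute value of the log change would be one way around that, but that is an assumption on my part and not part of the formula above.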
There are a couple of important things to note. Schizophrenia is primarily a regulation problem. As the Psychiatric Genomics Consortium study pointed out, there is very little genetic variation between SZ patients and control patients in coding regions; only a dozen mutations in coding regions are significant. Of all the mutations, the one with the highest confidence, with a p value of 10^-15, has a frequency of 0.87 in SZ patients and 0.85 in control patients. SZ is undeniably a confluence of many tiny genetic variations, possibly hundreds or thousands.
Apparently, many of these mutations are common among several major psychiatric disorders. A recent article in Science showed that schizophrenia, bipolar disorder, depression, and autism all have highly correlated patterns of cortical gene expression (http://science.sciencemag.org/content/359/6376/693.full). The tool we are developing will hopefully be powerful enough to help us identify the underlying causes of these diseases.
We got our program built! It works with the tiny network example that we gave it, with a couple of bugs, which we will fix this week. Also by next week, we will have a preliminary list of candidate genes.
This week, we made our plans and are now coding the program. We hope to have an MVP classifier ready by next week.
The main goal for this semester is going to be the implementation of the program and the production of a list of candidate genes.
However, we first need to figure out the best way to build the program. This week, I’m going to gain a deeper understanding of the algorithm that we’re basing our project on. I’m also going to look for software packages that can help us implement the program, which uses support vector machines, and research a possible alternative method that doesn’t use them: logistic regression. Finally, I will learn more about integrating our data sets with the functional interaction network by finding nodes with a large number of SZ-positive and Focal Adhesion-positive neighbors.
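The neighbor-counting idea can be sketched in plain Python. This is only an illustration under my own assumptions: the network is an unweighted edge list, the node names are placeholders, and "positive" means membership in one combined set of known genes.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set of neighbors map."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def positive_neighbor_counts(adj, positives):
    """For every node, count how many of its neighbors are positives."""
    return {node: len(neighbors & positives) for node, neighbors in adj.items()}

# Placeholder network and positive set, for illustration only.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
positives = {"B", "C", "E"}

counts = positive_neighbor_counts(build_adjacency(edges), positives)

# Candidates are non-positive nodes, sorted by positive-neighbor count.
candidates = sorted(
    (node for node in counts if node not in positives),
    key=lambda node: counts[node],
    reverse=True,
)
print(candidates, counts)
```

In a real run the raw count would probably need to be normalized by the node's degree, since hub genes have many neighbors of every kind.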
The spring semester starts on Monday and blog posts will resume this coming week!
It’s still winter break, and will continue to be for 2 weeks.
No posts this week: it’s Thanksgiving.
This week, I ran some statistics on how our existing SZ gene dataset fits within the GIANT network. I found that most of our genes had a posterior probability of about 0.3-0.5, which makes sense, given that many of our genes should be at least 0.2.
Next/this week is Thanksgiving, but soon we’ll be setting parameters for how we want to judge each gene in the completed network given the qualities of the GIANT network as a whole.
This week, I got more familiar with the NetworkX package, which is built for graph-based programming. It’s very powerful, but the complete GIANT network is far too large for it to handle efficiently. Even at a 0.1 threshold for edges, the network has 41 million edges. However, Anna showed that the number of edges decreases exponentially as the threshold goes up. In our final program, we will need to balance carefully how much weight we put on these edges against how many of them we include.
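To see how the edge count falls off as the threshold rises without loading the whole graph, the edges can be streamed one at a time and tallied. The sketch below is plain Python, not NetworkX, and the edge weights and thresholds are invented for illustration.

```python
def edge_counts_by_threshold(weighted_edges, thresholds):
    """Count how many edges have weight >= each threshold in one pass."""
    counts = {t: 0 for t in thresholds}
    for _, _, weight in weighted_edges:
        for t in thresholds:
            if weight >= t:
                counts[t] += 1
    return counts

# Invented (gene, gene, weight) edges, for illustration only.
sample_edges = [
    ("G1", "G2", 0.15),
    ("G1", "G3", 0.35),
    ("G2", "G3", 0.55),
    ("G3", "G4", 0.95),
    ("G4", "G5", 0.12),
]

counts = edge_counts_by_threshold(sample_edges, [0.1, 0.3, 0.5, 0.9])
print(counts)
```

Because `weighted_edges` can be any iterable, the same function would work with a generator that reads an edge file line by line, so memory use stays flat no matter how many edges there are.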
This week, I compared the brain tissue-specific network from HumanBase (http://hb.flatironinstitute.org) with the set of genes I collected that are associated with a higher risk of schizophrenia. This tissue-specific network gives the probability that 2 genes interact with each other specifically in the brain. Gene pairs that interact in the brain but also interact in other regions get a lower probability. As expected, nearly all of the associated genes had at least a 0.1 probability, which is relatively high in terms of bioinformatic confidence. Notably, several of the gene interactions with a probability above 0.9 involved cell adhesion genes.
Of the genes below the 0.1 confidence, most do have neural roles but also have roles in other parts of the body. For instance, mir-137 is involved in neural development but is also involved in tumor suppression for several cancers.
This upcoming week, I will be learning how to use NetworkX and will be gathering statistics from the HumanBase network with it.
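One way to run this kind of comparison is to record, for each gene in the SZ set, the highest interaction probability on any of its edges, then split the set at 0.1. That reading of "a gene's probability" is an assumption made for this sketch, and the gene names and values are placeholders, not real HumanBase output.

```python
def best_probability_per_gene(weighted_edges, genes_of_interest):
    """Highest edge probability seen for each gene of interest (0.0 if none)."""
    best = {gene: 0.0 for gene in genes_of_interest}
    for a, b, prob in weighted_edges:
        for gene in (a, b):
            if gene in best and prob > best[gene]:
                best[gene] = prob
    return best

# Placeholder (gene, gene, probability) edges, for illustration only.
sample_edges = [
    ("G1", "G2", 0.92),
    ("G1", "X", 0.3),
    ("G3", "Y", 0.05),
    ("G2", "Y", 0.4),
]
sz_genes = {"G1", "G2", "G3"}

best = best_probability_per_gene(sample_edges, sz_genes)
below_cutoff = sorted(gene for gene, prob in best.items() if prob < 0.1)
print(best, below_cutoff)
```

The genes that land in `below_cutoff` are the ones worth checking by hand for roles outside the brain.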