We now have a BLASTer to see which human genes are homologous to genes in Drosophila melanogaster, our model organism for cell motility. It works; it just takes a couple of hours to produce an output file for 20 genes. It should be no problem to run overnight.
Miriam has been working on a way to verify that the network is working. Currently, she removes some positives from the input while keeping others, then looks at how high the removed positives rank in the result. It should be fruitful.
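A minimal sketch of that hold-out check, assuming the ranker's output is an ordered list of gene names with the best candidate first. The function names and the gene labels are hypothetical, not taken from our program.

```python
def held_out_ranks(ranked_genes, held_out):
    """Return the 1-based rank of each held-out positive (None if absent)."""
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}
    return {gene: position.get(gene) for gene in held_out}


def mean_rank_fraction(ranked_genes, held_out):
    """Average rank of the held-out positives as a fraction of the list length.

    Values near 0 mean the held-out positives came back near the top.
    """
    ranks = [r for r in held_out_ranks(ranked_genes, held_out).values()
             if r is not None]
    if not ranks:
        return None
    return sum(ranks) / len(ranks) / len(ranked_genes)


# Hypothetical example: five ranked genes, two of which were held out.
ranked = ["G1", "G2", "G3", "G4", "G5"]
print(held_out_ranks(ranked, ["G2", "G5"]))
print(mean_rank_fraction(ranked, ["G2", "G5"]))
```

If the network is working, held-out positives should land much closer to the top than a random gene would.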
We also checked gene pairs that we were certain are functionally similar. All of the interactions we thought of as functionally similar had strong edges in the network. This supports the network as a whole as biologically sound and will help validate our results.
Although we had the bio qual, we still managed to do a little bit of work. One of the sources for our positive set was a little suspect, so we decided to take it out and see the impact on the results. Our reasoning: since we were relatively confident that the other positives were good, the results shouldn’t change much if the suspect source was also good. Instead, the change was drastic: the output file was dramatically different, which suggests that the source was indeed not very good. We still need a way to quantify the change.
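One possible way to put a number on the change, sketched here under the assumption that each run produces an ordered list of unique gene names. Top-k Jaccard overlap and Spearman rank correlation are our suggestions, not something we have settled on, and the function names are hypothetical.

```python
def top_k_jaccard(list_a, list_b, k):
    """Jaccard overlap between the top-k genes of two ranked lists."""
    top_a, top_b = set(list_a[:k]), set(list_b[:k])
    union = top_a | top_b
    return len(top_a & top_b) / len(union) if union else 1.0


def spearman_on_shared(list_a, list_b):
    """Spearman rank correlation over the genes present in both lists.

    Assumes no duplicate genes, so there are no tied ranks.
    """
    shared = set(list_a) & set(list_b)
    n = len(shared)
    if n < 2:
        return None
    # Re-rank each list using only the shared genes.
    order_a = [g for g in list_a if g in shared]
    rank_b = {g: i for i, g in enumerate(g for g in list_b if g in shared)}
    d_squared = sum((i - rank_b[g]) ** 2 for i, g in enumerate(order_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical outputs with and without the suspect source.
with_source = ["A", "B", "C", "D"]
without_source = ["D", "C", "B", "A"]
print(top_k_jaccard(with_source, without_source, 2))
print(spearman_on_shared(with_source, without_source))
```

A Jaccard value near 1 and a correlation near 1 would mean the removed source barely mattered; values near 0 (or negative correlation) would match the drastic change we saw.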
This upcoming week, we will also be developing a program to find how conserved our output genes are in fruit flies, our model organism.
I took a pathway that I know pretty well, NMDA-based long-term potentiation (the process by which neurons become more sensitive to neurotransmitters in response to glutamatergic activity). I made some positives for both the first part of the pathway, which involves the channel proteins and Ca2+-associated proteins, and the last part of the pathway, which involves nuclear proteins that transcribe growth factors and new channel proteins.
CaMKII, a protein centrally implicated in LTP, was used as the reference and ranked 21 out of ~15,000. That’s good, but it could be improved. It might be better if we allowed for more precision.
Upcoming tasks will include further verification, but these will probably be postponed, since both of us are taking the biology junior qualifying exam, a cumulative exam that we need to pass to graduate.
This week was spring break.
We needed to find Drosophila analogs of the genes high on our list, so this week we built a program that converts human gene names to human protein ID numbers from the NCBI. The ID number is then submitted to BLAST, an NCBI tool that sorts Drosophila proteins by amino acid sequence similarity.
There are also things we can do to optimize the gene ranker program and improve the run time. We will be converting some sets to integers and will log the run time against the progress.
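A rough sketch of what the integer conversion could look like, assuming the sets in question hold gene-name strings. The helper names and gene labels are hypothetical; the real ranker may store things differently.

```python
import time


def build_index(gene_names):
    """Assign each distinct gene name a stable integer ID (sorted order)."""
    return {name: i for i, name in enumerate(sorted(set(gene_names)))}


def encode(genes, index):
    """Convert a collection of gene names into a set of integer IDs."""
    return {index[g] for g in genes}


# Hypothetical gene names, for illustration only.
index = build_index(["GENE_B", "GENE_A", "GENE_C", "GENE_B"])
positives = encode({"GENE_A", "GENE_C"}, index)

# Logging elapsed time per iteration gives the run time versus progress.
start = time.perf_counter()
for iteration in range(3):
    _ = positives & set(index.values())  # stand-in for one ranking step
    print(iteration, round(time.perf_counter() - start, 6))
```

Whether integer sets are actually faster than string sets here is something the timing log would have to confirm.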
Anna noted that we need to run the program for quite a bit longer to get a more accurate result, so we’ll publish those results in a couple of days.
Also, we need to find out if our candidate genes are conserved in Drosophila. We can do this with BLAST, an NCBI tool that lets us compare sequence similarity across species. Rather than do it manually, we’ll hook it up to run directly from our terminal.
It works! Our program made a list of candidate genes. After 150 iterations, the SZ positives produced a novel list of genes that may be involved in schizophrenia, ranging from potassium gating to Golgi-associated proteins to proteins involved in cellular motility. The program can still be refined, so we will spend the upcoming weeks narrowing the list and making the process as precise and accurate as we can.
Another thing that I did myself was rank the SZ genes, since I didn’t want to include all 2700 genes that are implicated. First, for every dataset (Psychiatric Genomics Consortium, CommonMind Consortium, SZGene, hiPSC rtRNA results, chromosome conformation predictions), I gathered the magnitude of the change in expression or the frequency of mutation, along with the p value, for each gene. Then I took the log base 2 of the change and the negative log base 2 of the p value and multiplied them together.
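The scoring step can be sketched as follows. The formula is the one described above; the function names, the gene labels, and the numbers are hypothetical, and the sketch assumes both the change and the p value are positive.

```python
import math


def sz_score(change, p_value):
    """Score = log2(change) * -log2(p value), as described above."""
    if change <= 0 or p_value <= 0:
        raise ValueError("change and p_value must be positive")
    return math.log2(change) * -math.log2(p_value)


def rank_genes(records):
    """Sort (gene, change, p_value) records by score, highest first."""
    scored = [(gene, sz_score(change, p)) for gene, change, p in records]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical records: (gene, magnitude of change, p value).
records = [("GENE_A", 2.0, 0.25), ("GENE_B", 4.0, 0.125)]
print(rank_genes(records))
```

Note that a change below 1 gives a negative log, so a gene with a change below 1 and a small p value would score strongly negative. Taking the absolute value of the log is one option if only the size of the change should count.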
There are a couple of important things to note. Schizophrenia is primarily a regulation problem. As the Psychiatric Genomics Consortium study pointed out, there is very little genetic variation in coding regions between SZ patients and controls; there are only a dozen significant mutations in coding regions. Of all the mutations, the one with the highest confidence, with a p value of 10^-15, has a frequency of 0.87 in SZ patients and 0.85 in controls. SZ appears to be a confluence of many tiny genetic variations, possibly hundreds or thousands.
Apparently, many of these mutations are common among several major psychiatric disorders. An article was recently published in Science that showed that Schizophrenia, Bipolar Disorder, Depression, and Autism all have highly correlated patterns of cortical gene expression (http://science.sciencemag.org/content/359/6376/693.full). The tool we are developing will hopefully be powerful enough to help us identify the underlying causes of these diseases.
We got our program built! It works with the tiny network example that we gave it, with a couple of bugs, which we will fix this week. Also by next week, we will have a preliminary list of candidate genes.
This week, we made our plans and are now coding the program. We hope to have an MVP classifier ready by next week.
The main goal for this semester is going to be the implementation of the program and the production of a list of candidate genes.
However, we first need to figure out the best way to build the program. This week, I’m going to gain a deeper understanding of the algorithm that we’re basing our project on. I’m also going to look for software packages that can help us implement the program, which uses support vector machines, and research a possible alternative method that doesn’t use them: logistic regression. Finally, I will learn more about integrating our data sets with the functional interaction network by finding nodes with a high number of SZ-positive and focal adhesion-positive neighbors.
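A minimal sketch of the neighbor-counting idea, assuming the functional interaction network can be read as a list of undirected gene pairs. The function names, node labels, and thresholds are hypothetical.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn undirected (gene_a, gene_b) pairs into a neighbor-set lookup."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def positive_neighbor_counts(adjacency, sz_positives, fa_positives):
    """For each node, count its SZ-positive and focal adhesion-positive neighbors."""
    return {
        node: (len(neighbors & sz_positives), len(neighbors & fa_positives))
        for node, neighbors in adjacency.items()
    }


def nodes_with_both(counts, min_sz=1, min_fa=1):
    """Nodes that meet both neighbor thresholds, in sorted order."""
    return sorted(node for node, (sz, fa) in counts.items()
                  if sz >= min_sz and fa >= min_fa)


# Hypothetical toy network: X touches two SZ positives and one FA positive.
edges = [("X", "S1"), ("X", "S2"), ("X", "F1"), ("Y", "S1")]
counts = positive_neighbor_counts(build_adjacency(edges), {"S1", "S2"}, {"F1"})
print(nodes_with_both(counts))
```

Nodes that sit next to many positives from both sets are natural places to look for overlap between the two data sets, though the right thresholds would depend on the real network's density.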