We needed to find drosophila analogs to the genes high on our list, so this week, we built a program that converts human gene names to human protein ID numbers from the NCBI. Then, the ID number is put into BLAST, a tool from the NCBI that sorts drosophila proteins by amino residue sequence similarity.
There are also things we can do to optimize the gene ranker program to improve the run time. We will be converting some sets to integers and will catalog the runtime versus the progress.
Anna noted that we need to run the program for quite a bit longer to get a more accurate result, so we’ll publish those results in a couple of days.
Also, we need to find out if our candidate genes are conserved in drosophila. We can use this with BLAST, an NCBI tool that lets us compare sequence similarity across species. Rather than do it manually, we’ll hook it up directly to our terminal.
We have two goals to accomplish for our meeting next week:
Run our iterative method for 100,000 iterations, or until the scores change by 0.001 or less. See how this affects the ranking compared to 150 iterations. Will it just affect the scores but not the relative rankings?
See if the candidates are conserved in drosophila melanogaster. The second part of this project is to experimentally validate the candidates, which we cannot do if they aren’t conserved in drosophila. As a quick sanity check, we’re going to BLAST CTNNB1 (the well-studied Catenin Beta 1 gene that codes for a protein involved in the formation of adherens junctions) in humans against drosophila which should be ARM.
This previous week we put our iterative algorithm to the test. Alex and I each ran our positive (and negative) lists for 150 iterations using the 0.200 threshold GIANT brain network. As a reminder, the 0.200 threshold network is a trimmed version of the GIANT network where all the edges have probability scores of 0.200 or greater.
I used all 541 of my positives, a combination of two sublists: cell motility genes and cell motility genes implicated in schizophrenia. In the upcoming weeks, I will experiment with running these two lists separately and seeing how the results are affected.
I also used Alex’s list of negatives, which are genes expressed in other organs but not the brain. We will need to closely consider whether it is wise to use this set of negatives alongside cell motility positives.
Even with these two potential issues, our preliminary lists of candidates looked promising – for example, one gene we found is Hes Related Family BHLH Transcription Factor With YRPW Motif 2, also know as HEY2. HEY2 was not only a schizophrenia positive but was also moderately high in the list of cell motility candidates, with a score of 0.56. Another gene we uncovered was 1-Acylglycerol-3-Phosphate O-Acyltransferase 4, or AGPAT4. This was not in either list of positives, but it was given the maximum score of 1.0 in each set of candidates. Both of these genes point to the different ways we can analyze our results and what it means to be a promising candidate – do we care about genes that are positives in the schizophrenia set and are determined by the algorithm to likely be involved in cell motility or do we care about genes that are determined by the algorithm to be likely involved in cell motility and schizophrenia? Do we care about both? What is the difference? Is one more promising than the other? In the upcoming weeks, we will be figuring out these questions in order to refine our list of candidates.
We will also be extending the time that our algorithm runs. 150 iterations may seem like a lot, but cutting the algorithm short may lead to “stunted” results. The iterative method can be thought of as dyes from the positive and negative nodes spreading throughout the network, staining the surrounding nodes in a way that reflects their distances from the positives and negatives. If the iterations are stopped before they converge (i.e. subsequent iterations don’t change the scores, or color of the node), then the color of the nodes might be closer to the color of the positives and negatives than they should be. Our next step will be to let it run for 1500 iterations and see how our results our affected.
Out of my own curiosity, I also decided to see if all of the cell motility positives were in the GIANT network, which turned out to be false. There are about 40 genes (out of 541) that are not in the GIANT network. However, I don’t believe this to be a problem; if there are positive cell motility genes not in the GIANT network, it is probably because not all genes involved in cell motility are expressed in every type of cell and are therefore not relevant in a brain-specific network.
It works! Our program made a list of candidate genes. After 150 iterations, the SZ positives made a novel list of genes that may be involved in Schizophrenia, ranging from potassium gating to Golgi associated proteins to proteins involved in cellular motility. This program can be refined, and so we will spend the upcoming weeks narrowing the list and making sure that this process can be as precise and accurate as it can be.
Another thing that I did myself was rank SZ genes, since I didn’t want to include all 2700 genes that are implicated. First, for every dataset (Psychiatric Genomics Consortium, Common Mind Consortium, SZGene, hiPSC rtRNA results, chromosome conformation predictions), I gathered the magnitude of the change in expression/ the frequency of mutation and the p value for each gene. Then I took the log base 2 of the change and the negative log base 2 of the p value and multiplied them together.
There are a couple of important things to note. Schizophrenia is primarily a regulation problem. As the Psychiatric Genomics Consortium study pointed out, there is very little genetic variation in SZ patients from control patients in coding regions; there are only a dozen significant mutations in coding regions. Of all the mutations, the one with the highest confidence, with a p value of 10^-15, has a frequency in SZ patients at 0.87 and a frequency in control patients at 0.85. SZ is undeniably a confluence of many, possibly hundreds or thousands, of tiny genetic variations.
Apparently, many of these mutations are common among several major psychiatric disorders. An article was recently published in Science that showed that Schizophrenia, Bipolar Disorder, Depression, and Autism all have highly correlated patterns of cortical gene expression (http://science.sciencemag.org/content/359/6376/693.full). The tool we are developing will hopefully be powerful enough to help us identify the underlying causes of these diseases.
We have a couple of goals for next week: fix a subtle yet major bug in our iterative method program and then run it on our gene lists to get a preliminary list of candidate genes. I will also write a short program to compare the genes in the GIANT network to the list of cell motility genes.
We got our program built! It works with the tiny network example that we gave it, with a couple of bugs, which we will fix this week. Also by next week, we will have a preliminary list of candidate genes.
At last week’s meeting, we dug into the computational portion of the project, discussing different ways we could run our algorithm on the gene lists and how that would affect our results. We also discussed a few different methods for running on our gene lists – one method involves a support vector machine and the other is an iterative method of computing scores, shown below.
Alex was able to track down some software for the former method from the autism paper that inspired our project. One of our tasks was to try to get this software to work on our current gene lists – unfortunately, we hit a bit of a wall with this method.
For the second method, we were able to successfully run it on Alex’s gene list. I also had the task of running the iterative method on a small toy example, which was a small graph consisting of 10 nodes, 11 edges, 1 positive, and 1 negative. Anna informed me that when she did it, the scores converged after about 27 time steps, which matches the results that I got. We will go over the results in this week’s meeting and discuss what our goals for this week are.
The main goal for this semester is going to be the implementation of the program and the production of a list of candidate genes.
However, we first need to figure out the best way to build the program. This week, I’m going to gain a deeper understanding of the algorithm that we’re basing our project on. Also, I’m going to try to find some software packages that can help us implement the program, which uses support vector machines. I will also research a possible alternative method that doesn’t use support vector machines: logistic regression. Finally, I will gain more information about the integration of our data sets with the functional interaction network by finding nodes with a high amount of SZ and Focal Adhesion positive neighbors.