This week, my goals have been relatively simple and short. There are a couple of bugs in my code that I need to sort out. I haven’t been able to collect any data about how “good” our method is due to these bugs, but once I do, we will hopefully be able to see that the results we are getting are not random.
Since the last time I wrote a blog post, I have mainly been working on one way of verifying our algorithm; essentially, we need to make sure that our method is actually good at doing what it’s supposed to be doing. Although there are several ways to verify an algorithm, some of which Alex has been working on, I am working on something called “k-fold cross validation.”
This method of verification works by removing (or “hiding”) 1/k-th of your positives, then running your method and seeing where the hidden positives are ranked. You randomly choose the 1/k positives to hide and do this several times. You can also use the distribution of scores you get from each run to see how your method (with all positives) compares – are you getting significant results or are you getting what is expected from randomly choosing positives?
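To make the procedure concrete, here is a minimal sketch of hiding 1/k of the positives and recording where they rank. It is an illustration under stated assumptions, not the project's actual code: `k_fold_hidden_ranks`, `toy_score_fn`, and the ring-shaped toy network are hypothetical names and data, and the scoring function is a stand-in for the real method. The sketch splits the shuffled positives into k disjoint folds so that every positive is hidden exactly once; repeatedly drawing a fresh random 1/k sample would be a reasonable alternative.

```python
import random


def k_fold_hidden_ranks(positives, score_fn, k=4, seed=0):
    """Hide each of k random folds of positives in turn; return ranks of hidden genes.

    score_fn(visible_positives) must return a dict gene -> score,
    where a higher score means a stronger candidate.
    """
    rng = random.Random(seed)
    shuffled = list(positives)
    rng.shuffle(shuffled)  # random assignment of positives to folds
    folds = [shuffled[i::k] for i in range(k)]  # k roughly equal, disjoint folds

    runs = []
    for hidden in folds:
        hidden_set = set(hidden)
        visible = [g for g in shuffled if g not in hidden_set]
        scores = score_fn(visible)
        # Rank every gene in the network, best score first (rank 1 = top).
        ranked = sorted(scores, key=scores.get, reverse=True)
        rank_of = {gene: i + 1 for i, gene in enumerate(ranked)}
        runs.append({gene: rank_of[gene] for gene in hidden})
    return runs


def toy_score_fn(network):
    """Hypothetical stand-in scorer: fraction of a gene's neighbors that are visible positives."""
    def score(visible):
        vis = set(visible)
        return {
            gene: 1.0 if gene in vis else sum(n in vis for n in nbrs) / max(len(nbrs), 1)
            for gene, nbrs in network.items()
        }
    return score


# Toy data: 12 genes arranged in a ring, the first 8 treated as positives.
network = {f"g{i}": [f"g{(i - 1) % 12}", f"g{(i + 1) % 12}"] for i in range(12)}
positives = [f"g{i}" for i in range(8)]
runs = k_fold_hidden_ranks(positives, toy_score_fn(network), k=4, seed=1)
```

Each entry of `runs` maps the hidden genes of one fold to their ranks. Collecting those ranks across folds gives the distribution that can be compared against what randomly chosen genes would produce.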
My first attempt at this was just a test run, and I made a few mistakes. First, I chose to hide half of the positives instead of a more reasonable fraction such as 1/4 or 1/5. I also didn’t randomly choose positives to hide – I simply deleted the second half of the positives, and in doing so, might have removed an entire cell motility pathway or two from the positives. This would have produced biased results.
This coming week, my goal is to randomly sample 1/4 of positives multiple times, as well as implement some plotting functions in order to visualize the distribution of scores.
So we have a BLASTer to see which genes in humans are homologous to genes in Drosophila melanogaster, our model organism for cell motility. It works; it just takes a couple of hours to get an output file for 20 genes. It should be no problem to run overnight.
Miriam has been working on a way to verify that the network is working. She has been eliminating some positives while keeping others, then looking at how high the eliminated positives rank in the results. It should be fruitful.
We also looked at pairs of genes that we were certain are functionally similar to see how they are connected in the network. We found that all of the gene interactions we thought of as functionally similar had strong edges in the networks. This supports the network as a whole being biologically sound and will further validate our results.
Although we had the bio qual, we still managed to do a little bit of work. One of the sources for our positive set was a little suspect, so we decided to take it out and see the impact on the results. Since we were relatively confident that the other positives were good, our reasoning was that if the suspect source was also good, removing it shouldn’t change the results much when we ran the whole network through again. Instead, the output file was dramatically different, which suggests that the source was indeed not very good. We still need a way to quantify the change, though.
This upcoming week, we will also be developing a program to find how conserved our output genes are in fruit flies, which are our model organism.
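As a rough sketch of one way such a change could be quantified (an assumption for illustration, not a measure chosen in this project), the overlap between the top N genes of two ranked output lists can be computed as a Jaccard index. The function name `top_n_overlap` and the gene labels are hypothetical.

```python
def top_n_overlap(ranking_a, ranking_b, n):
    """Jaccard overlap of the top-n entries of two ranked gene lists (1.0 = identical sets)."""
    top_a = set(ranking_a[:n])
    top_b = set(ranking_b[:n])
    union = top_a | top_b
    if not union:
        return 1.0  # two empty lists are trivially identical
    return len(top_a & top_b) / len(union)


# Hypothetical rankings with and without the suspect source in the positive set.
with_source = ["a", "b", "c", "d"]
without_source = ["a", "c", "e", "b"]
overlap = top_n_overlap(with_source, without_source, n=2)
```

A value near 1 would mean the top of the list barely moved, while a value near 0 would mean the top candidates were almost entirely replaced. A rank correlation over the full lists would be another option.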
I am taking the biology junior qualifying exam this weekend, so there will be no blog post. The junior qualifying exam is a cumulative exam each student must take their junior year for their major. Each student must pass it in order to be able to move on to writing a thesis their senior year and graduate.
I took a pathway that I know pretty well, NMDA-based long term potentiation (the process by which neurons become more sensitive to neurotransmitters in response to glutamatergic activity). I made some positives for both the first part of the pathway, the channel proteins that bind to and Ca2+ associated proteins, and the last part of the pathway, which involves nuclear proteins involved with transcribing growth factors and new channel proteins.
A protein centrally implicated in LTP, CaMKII, was used as the reference, and it ranked 21 out of ~15,000. It’s good, but it could be improved. It might be better if we allowed for more precision.
Upcoming tasks will include further verifications, but will probably be postponed, since both of us are taking the biology junior qualifying exam, which is a cumulative exam that we need to pass to graduate.
One of our tasks last week was to make a diagram that provides an overview of our project’s methods, pictured below. The blue represents inputs, the yellow represents a function (our semi-supervised method), the green represents outputs, and the purple is the final output.
At the top of the diagram, the blue boxes go through the process of creating the positive list of cell motility genes. The genes were gathered from different signaling pathways in the KEGG database as well as from primary literature. From these resources, I created two separate positive lists: one purely of cell motility genes and one of cell motility genes implicated in schizophrenia. I combined these together (collapsing the duplicates) in order to create a positive list of 541 cell motility genes. At the bottom of the diagram, the two blue boxes go through the process of creating the schizophrenia positive list. Alex pulled schizophrenia genes from genome-wide association studies (GWAS), then filtered the genes down to the top 300 positives by taking into account their p-values from the literature. We each ran the semi-supervised iterative method on the GIANT brain interactome with the same negative set, differing only in the positive sets we used. From there, the iterative method spit out ranked lists of schizophrenia candidates and cell motility candidates with scores ranging from 0 to 1 (the green boxes). Finally, we combined the scores by multiplying them to take into account each gene’s probability of being a “good” candidate for both cell motility and schizophrenia (the purple box).
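The final combining step can be sketched in a few lines. This is a minimal illustration, assuming each run produces a dictionary of gene to score and that genes missing from either list are simply dropped; `combine_scores` and the example genes and values are hypothetical.

```python
def combine_scores(motility_scores, schizophrenia_scores):
    """Multiply the two per-gene scores and return (gene, score) pairs, best first."""
    # Assumption: only genes scored in both runs are kept.
    shared = motility_scores.keys() & schizophrenia_scores.keys()
    combined = {g: motility_scores[g] * schizophrenia_scores[g] for g in shared}
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Made-up scores in the 0 to 1 range.
motility = {"A": 0.9, "B": 0.5, "C": 0.2}
schizophrenia = {"A": 0.1, "B": 0.8, "D": 0.7}
ranked = combine_scores(motility, schizophrenia)
```

Because both scores lie between 0 and 1, the product is high only when a gene scores well in both runs. In the toy data, gene "A" drops below "B" despite its high motility score.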
Our runtime-shortening strategies seem to have worked well! We plotted runtime vs. iteration number and found that each iteration took about 3-4 seconds, with no increase like the one we saw last time. However, we did find that the scores stop changing after a surprisingly small number of iterations, so we will be tweaking a few things to see what changes.
Because it’s hard to tell if a semi-supervised machine learning method is actually good at what it’s supposed to accomplish, we will also be looking into ways to test our method.
Our goals for next time include:
Run on a larger portion of the network. So far we have been using the 0.200 probability threshold network, but we will move down to the 0.150 threshold.
Change the number of positives used – see what the effect is of using 1 vs. 300+ positives on the graph.
Plot a distribution of the ranked candidate scores.
Plot the absolute value of the sum of changes made during each iteration.
Look into cross validation to check if our method is doing a good job at what it’s supposed to do. This involves hiding some positives from your positive list and seeing if your method correctly identifies these hidden positives as having a high probability (if not the highest) of being involved in the pathway.
Look into how other papers were able to be convincing about the accuracy of their method.
Next week is spring break so there will be no blog post!
Last week we had the task of running our algorithm for 100,000 iterations. We ran into a couple of problems:
It took a very long time. We actually stopped running it at about 1300 iterations (which took 12 hours) because the estimated time kept increasing. There are a few lines in our code that we are going to change to improve running time – we will convert some variables we’re using to track changes between iterations from lists of nodes to integer counts.
The difference in score we required before stopping iterations was far too big. We allowed the code to run until the difference in scores between iterations was no more than 0.001 (or it hit 100,000 iterations), and to our surprise, it took only around 130 and 180 iterations for cell motility and schizophrenia positives, respectively. Because the scores range from 0 to 1, a change of that size is bigger than we initially thought. After speeding up the code, we are going to run it until we see changes less than 10^-9.
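The two fixes above can be sketched together: stop when the change between iterations falls to a tolerance (or an iteration cap is hit), and track changes with an integer count rather than a list of nodes. This is a toy under stated assumptions, not the project's method. The update rule (each unlabeled gene takes the average of its neighbors' scores), the use of the largest per-gene change as the stopping measure, and all names are hypothetical.

```python
def iterate_until_stable(network, positives, negatives, tol=1e-9, max_iters=100_000):
    """Run a toy iterative scoring until the largest per-gene change is <= tol.

    Returns (scores, iterations_used, changed_counts), where changed_counts[i]
    is the number of genes whose score moved in iteration i + 1.
    """
    scores = {gene: 0.5 for gene in network}  # unlabeled genes start in the middle
    scores.update({gene: 1.0 for gene in positives})
    scores.update({gene: 0.0 for gene in negatives})
    fixed = set(positives) | set(negatives)  # labeled genes keep their scores

    changed_counts = []
    iteration = 0
    for iteration in range(1, max_iters + 1):
        new_scores = dict(scores)
        max_change = 0.0
        n_changed = 0  # integer count instead of a growing list of changed nodes
        for gene, neighbors in network.items():
            if gene in fixed or not neighbors:
                continue
            updated = sum(scores[n] for n in neighbors) / len(neighbors)
            change = abs(updated - scores[gene])
            if change > 0:
                n_changed += 1
            max_change = max(max_change, change)
            new_scores[gene] = updated
        scores = new_scores
        changed_counts.append(n_changed)
        if max_change <= tol:
            break
    return scores, iteration, changed_counts


# Toy path network a - b - c - d with one positive and one negative.
toy_network = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
loose_scores, loose_iters, _ = iterate_until_stable(toy_network, ["a"], ["d"], tol=1e-3)
strict_scores, strict_iters, counts = iterate_until_stable(toy_network, ["a"], ["d"], tol=1e-9)
```

On the toy network the stricter tolerance needs more iterations before stopping, which is the same effect described above, only at a much smaller scale.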
We are also going to plot run time against progress (each iteration) to track our algorithm and see if there is a separate problem that causes the code to slow down as the number of iterations increases.
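Collecting the per-iteration timings for such a plot could look like the sketch below. It assumes the work of one iteration can be wrapped in a function; `time_iterations` and `slow_step` are hypothetical names, and the toy step deliberately does more work each time so that a slowdown shows up in the timings.

```python
import time


def time_iterations(step, n_iters):
    """Call step() n_iters times and return the wall-clock duration of each call in seconds."""
    durations = []
    for _ in range(n_iters):
        start = time.perf_counter()
        step()
        durations.append(time.perf_counter() - start)
    return durations


# Toy step whose cost grows with every call, because it rescans a growing list.
history = []


def slow_step():
    history.append(len(history))
    sum(history)


durations = time_iterations(slow_step, 200)
early_mean = sum(durations[:20]) / 20
late_mean = sum(durations[-20:]) / 20
```

Plotting `durations` against the iteration index, or just comparing `early_mean` with `late_mean`, would show whether iterations get slower as the run progresses.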
Our continued goals for this week are to BLAST candidates to see if they are conserved in Drosophila.