I finally managed to download all of the GO Terms and spent some time processing the data and editing my previous code so that everything fit together. Once I had a working version of the code, an initial run revealed that weighting the entire HIPPIE Interactome would take approximately 101 days. This is well outside the duration of the CREU, so clearly something needed to be done. To fully explain the issues that led to such a terrible run time, it is important to first have a rough understanding of the structure of the code.
This iteration of the code first produces a list of true positives and a list of true negatives based on GO Term co-annotation. This section of the code takes about 4 minutes (and can potentially be sped up later), but it contributes negligibly to the 3-month run time. The next portion implements functions that together calculate the cost of a given node. The necessary inputs are:
A list of all the evidence types.
A dictionary whose keys are the evidence type labels and whose values are lists (evidence lists) of the protein edges (each represented as a list) confirmed to occur by that particular evidence type.
A list of true positives and a list of true negatives.
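As a concrete illustration, a miniature version of these inputs might look like the following. The labels and edges here are hypothetical; the real data comes from HIPPIE and the GO annotation file.

```python
# Hypothetical miniature of the three inputs (real labels and edges
# come from HIPPIE and the GO annotations, not from this sketch).
evidence_types = ["Y2H", "coIP"]

# evidence type label -> evidence list of protein edges (edges as lists)
evidence_dict = {
    "Y2H":  [["A", "B"], ["B", "C"]],
    "coIP": [["A", "B"]],
}

# edges deemed true positives / true negatives by GO Term co-annotation
true_positives = [["A", "B"]]
true_negatives = [["C", "D"]]
```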
Finding the cost involves calculating the intersection and symmetric difference of the positives with each evidence list, and then the intersection and symmetric difference of the negatives with each evidence list. This was done for every evidence list every time a new edge was weighted, taking roughly 2.5+ minutes per edge. With 93,138 edges, this was the main contributor to the 3-month run time. Further inspection reveals that the time complexity of performing these operations on sets s and t is O(min(len(s), len(t))), versus O(max(len(s), len(t))) for lists. Performing these operations from scratch for every single edge was therefore very costly. Simply calculating these numbers once at the outset and then appending them in a particular order to the end of each evidence list reduced the time needed to weight each edge to approximately 0.45 seconds, shaving the total run time down to about 11 hours.
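The precomputation can be sketched roughly as follows. The function and variable names are illustrative, not the ones in my code, and the edges are converted to tuples so they can live in sets:

```python
def precompute_counts(evidence_dict, positives, negatives):
    """For each evidence type, compute the four set-operation sizes once,
    instead of redoing them every time an edge is weighted."""
    pos = {tuple(e) for e in positives}
    neg = {tuple(e) for e in negatives}
    counts = {}
    for ev_type, edges in evidence_dict.items():
        ev = {tuple(e) for e in edges}
        counts[ev_type] = (
            len(ev & pos),  # intersection with positives
            len(ev ^ pos),  # symmetric difference with positives
            len(ev & neg),  # intersection with negatives
            len(ev ^ neg),  # symmetric difference with negatives
        )
    return counts
```

Weighting an edge then just reads these four stored numbers instead of recomputing the set operations, which is where the drop from 2.5+ minutes to roughly 0.45 seconds per edge comes from.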
Another change that should significantly reduce the run time further is using sets instead of lists for the protein edges in the dictionary and representing the edges themselves as tuples (which, unlike lists, are hashable). This will mean creating a separate dictionary with identical keys whose values are the four calculated values mentioned above. This should shave off a large chunk of time, because element lookup in a set is significantly faster than the same operation in a list. Another idea is to reorder the calculations so that the exact same calculation is executed less often, or to iterate over the roughly 230 evidence types rather than the 93,138 edges.
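A quick sketch of why the container type matters (the edges below are made up solely to show the lookup difference):

```python
# Membership tests: a set hashes the tuple and checks in O(1) on average,
# while a list must scan its elements one by one (O(n)).
edges_as_list = [("P%05d" % i, "P%05d" % (i + 1)) for i in range(93138)]
edges_as_set = set(edges_as_list)

edge = ("P90000", "P90001")
assert edge in edges_as_set   # fast: constant time on average
assert edge in edges_as_list  # slow: scans ~90,000 elements first
```

Representing edges as tuples is what makes the set possible in the first place, since Python lists are unhashable and cannot be set elements.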
I spent the majority of the week combing through various databases and their associated tools, trying to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins, or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is to integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.
The achievements and goals from last week were laid out in this progress report. Accordingly, I spent the majority of this week working on an implementation of the Bayesian Weighting Scheme I discussed last week. This involved carefully fleshing out all the minutiae of the relevant paper (which I laid out and explained in this document from last week). I then applied the code to a small toy network that I made myself and hand-calculated a few edges to verify my implementation.
I am currently working to implement the weighting scheme on the HIPPIE Interactome. Thus far, this has meant simply writing code to read the relevant text files and format the data efficiently in a way that meshes well with the code I wrote for the toy example. Eventually, I will compare the scores of edges from this weighting scheme to the ones on the actual HIPPIE Interactome, and further analysis of the differences between the two weighting schemes will probably be among the next steps. Besides implementing the Bayesian Weighting Scheme, I have also been working to understand the underlying statistics behind the methods used to weight the HIPPIE Interactome. I will try to produce a document that explains the mathematics for this in a similar fashion to the one for the Bayesian Weighting Scheme.
Week 2 was spent looking further into Bayesian Weighting Schemes, and in particular how they were applied in a paper titled Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes by Poirel et al. This was the original LINKER paper that PATHLINKER was built upon. I also spent a good amount of time looking at other interactomes and curating a working list of interactomes and how each weights its interactions. So far I have only done this in detail for the HIPPIE interactome, but more work is still needed. Besides research, I also worked on a document about Bayesian Weighting that aims to elucidate some of the mathematical setup and notation of the paper mentioned above, which suffers from some convoluted mathematical notation. Hopefully, this will be useful to people trying to read the paper in the future. Other tasks included debugging code, presenting learned material, and sorting through papers to find relevant ones.
My research project aims to extend the original PATHLINKER paper. In particular, I hope to investigate different weighting schemes, both from the perspective of utilizing different data sets and data-summary methods and potentially from that of different statistical tools for manipulating the data in meaningful ways. This is motivated by the fact that there appears to be little overlap between interactomes consisting of the same proteins but weighted differently. This project will therefore allow us to analyze different schemes and develop a heuristic for weighting interactomes. As a result, we will be able to apply algorithms like PATHLINKER to graphs such that they provide results with valid biological meaning, rather than results shaped by superficial factors such as trends in scientific research or biases of the scientific community. Right now, I am focusing on Bayesian Weighting Schemes, which use experimental evidence and Bayes' theorem to produce a weight representing the probability that a given interaction occurs in a given signaling pathway. I will talk more about this in future posts.
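In its simplest form, the idea can be illustrated with Bayes' theorem directly. This is a generic toy calculation with made-up numbers, not the actual scheme from the paper, which combines many evidence types:

```python
def bayes_weight(p_ev_given_true, p_ev_given_false, prior_true):
    """P(true interaction | evidence observed) via Bayes' theorem."""
    numerator = p_ev_given_true * prior_true
    denominator = numerator + p_ev_given_false * (1 - prior_true)
    return numerator / denominator

# Made-up numbers: evidence seen for 80% of true interactions, 5% of
# false ones, with a 10% prior that any given edge is a true interaction.
weight = bayes_weight(0.8, 0.05, 0.1)  # = 0.64
```

Even weak evidence can move a small prior substantially, which is why combining many evidence types per edge is powerful.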
Week 1 Progress
Week 1 was spent mostly acquainting myself with the prerequisite knowledge required for research in the coming weeks, such as Dijkstra's algorithm (and its implementation) and Bayesian Weighting Schemes, and reading some relevant papers in that regard. For the sake of keeping track, the particular papers were:
The first was the original PATHLINKER paper, which our research hopes to extend. The next two provided some information on Bayesian Weighting but not much else, due to some ambiguous mathematical notation. I hope to write a document that annotates the relevant portions of the papers for the sake of clarity and future reference.
This week, my goals have been relatively simple and short. There are a couple of bugs in my code that I need to sort out. I haven’t been able to collect any data about how “good” our method is due to these bugs, but once I do, we will hopefully be able to see that the results we are getting are not random.
Since the last time I wrote a blog post, I have mainly been working on one way of verifying our algorithm; essentially, we need to make sure that our method is actually good at doing what it’s supposed to be doing. Although there are several ways to verify an algorithm, some of which Alex has been working on, I am working on something called “k-fold cross validation.”
This method of verification works by removing (or “hiding”) 1/k-th of your positives, then running your method and seeing where the hidden positives are ranked. You randomly choose which positives to hide and repeat this several times. You can also use the distribution of scores from each run to see how your method (with all positives) compares: are you getting significant results, or are you getting what you would expect from randomly chosen positives?
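The hiding step might be sketched like this (a generic illustration; the ranking step depends on the actual method being validated):

```python
import random

def hide_positives(positives, k, seed=None):
    """Randomly hide 1/k of the positives; return (kept, hidden)."""
    rng = random.Random(seed)
    hidden = rng.sample(positives, len(positives) // k)
    kept = [p for p in positives if p not in hidden]
    return kept, hidden

# Repeat the split several times; on each run, re-rank the network
# and record where the hidden positives land.
kept, hidden = hide_positives(list(range(100)), 4, seed=0)
```

Seeding the random generator makes each split reproducible, which is handy when comparing runs.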
My first attempt at this was just a test run, and I made a few mistakes. First, I sampled half of the positives instead of a more reasonable fraction such as 1/4 or 1/5. I also didn't choose the positives to hide randomly; I simply deleted the second half of the positives, and in doing so might have removed an entire cell motility pathway or two from the positives. This would have produced biased results.
This coming week, my goal is to randomly sample 1/4 of positives multiple times, as well as implement some plotting functions in order to visualize the distribution of scores.
So we have a BLASTer to see which genes in humans are homologous to genes in Drosophila melanogaster, our model organism for cell motility. It works; it just takes a couple of hours to produce an output file for 20 genes. It should be no problem to run overnight.
Miriam has been working on a way to verify that the network is working. Currently, she has been eliminating some positives while keeping others and looking at how highly the eliminated positives rank in the results. It should be fruitful.
We also looked at gene pairs we were already certain were functionally similar. We found that all of the gene interactions we considered functionally similar had strong edges in the network. This validates the network as a whole as biologically sound and will further support our results.
Although we had the bio qual, we still managed to do a little bit of work. One of the sources for our positive set was a little suspect, so we decided to take it out and see the impact on the results. If the suspect source were good, then, since we were relatively confident that the other positives were good, the results wouldn't change much when we ran the whole network through. What we found was a drastic change: the output file was dramatically different, so the source was indeed not very good. But we still need a way to quantify the change.
This upcoming week, we will also be developing a program to find how conserved our output genes are in fruit flies, our model organism.