This week I began to write code to weight the Pathlinker Interactome using the Bayesian Weighting Scheme. This will be useful to compare the HIPPIE and the Pathlinker Interactome but it also serves as a check to confirm whether the code accurately weights the interactomes. The Pathlinker Interactome was initially weighted using the same scheme and so the weights generated should match the pre-existing weights of the interactome. Next steps will involve comparing the two interactomes.
I spent some time writing code that allows conversion of a list of Uniprot IDs to the common name of the genes. This code will probably prove useful at some later point. I extended this to convert from Ensemble IDS to Uniprot IDs as well. However, the Ensemble and Uniprot databases do not entirely overlap and so there are some Ensemble IDs with no corresponding Uniprot IDs. I currently have no ideas about how to solve this problem but it does not appear to be one that needs to be urgently solved.
This week I spent the majority of my time at the GCC/BOSC Conference both volunteering at and attending talks. Most of the talks concerned the software suite Galaxy and how people were adapting it or extending it for different use cases and new applications. Not having used Galaxy before nor intending to, these talks did not prove very useful to me. There were some talks that concerned more general biological problems and issues within the scientific community. These were accessible to a wider audience and I was able to learn some aspects of statistics and experimental design from them. This included talks on the importance of talking to the community of people who have the particular condition being studied as well as some concerning best practices on handling data.
I spent some portion of time fixing minor bugs in my code for the Bayesian Weighting Scheme. As a follow-up to a previous post, the run time of the code is now down to 2 or 3 minutes at the most. Using sets and tuples reduced the run time by several orders of magnitude and some other time saving measures also contributed to this much lower number. This is now sufficiently within the time frame of the CREU so the code will probably need much more speeding up. Next steps involve changing the code to weight the PathLinker Interactome to check if the code works accurately as well as parsing through the math carefully.
I finally managed to download all of the GO Terms and spent some time processing the data and editing previous code so that it all fit together. After having done so and obtained a working version of the code, an initial run revealed that the run time of weighting the entire HIPPIE Interactome would be approximately 101 days. This is outside the duration of the CREU so clearly something needed to be done. To fully explain the issues that lead to such a terrible run-time it is important to first have a rough understanding of the structure of the code.
This iteration of the code first produced a list of true positives and a list of true negatives on the criteria of GO Term co-annotation. This section of the code takes about 4 minutes (which can also be potentially sped up later on) but this negligibly contributes to the 3 month run time. The next portion implements functions that in conjunction calculate the cost of a given node. The necessary inputs are
A list of all the evidence types.
A dictionary where the keys are the evidence type labels and the values are lists (evidence lists) whose elements are the protein edges (represented as lists) which are confirmed to occur by that particular evidence type
A list of true positives and a list of true negatives.
Finding the cost involves calculating the intersection and symmetric difference of the positives with the evidence lists and then the intersection and symmetric difference of the negatives with the evidence lists. This was done each for each evidence list “every” time a new edge was weighted. This approach took roughly 2.5+ minutes per edge. With 93138 edges, this was the main contributor to the 3 month run time. Further inspection reveals that the time-complexity of performing these operations on a set is O(min(len(s), len(t)) and O(max(len(s), len(t)) for a list. Performing this operation every single time was therefore very costly. Simply calculating these numbers for each edge at the outset and then appending them in a particular order to the end of each evidence list reduced the time needed to weight each edge down to approximately 0.45 seconds thus shaving down the total run-time to about 11 hours.
Another change that will significantly reduce the run time further is using sets instead of lists for the protein edges in the dictionary and representing the edges themselves as tuples. This will mean creating a different dictionary with identical keys but the values will the four calculated values mentioned above. This will supposedly shave off a large chunk of time because element look up in a set is significantly faster than the same operation in a list. Another idea is to change the order of calculations in a way that reduces the amount of time the exact same calculation is executed or to calculate the edges in a way that iterates over the approximately 230 evidence types rather than the 93138 edges.
I spent the majority of the week parsing through various databases and their associated tools to try to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is the integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.
The achievements from last week as well as the goals were laid out in this progress report. Accordingly, I spent the majority of this week working on an implementation of the Bayesian Weighting Scheme I discussed last week. This involved carefully fleshing out all the minutiae of the relevant paper (which I laid out and explained in this document from last week). I then applied the code onto a small toy-network that I made myself and then hand-calculated a few edges to verify my implementation.
I am currently working to implement the weighting scheme onto the HIPPIE Interactome. Thus far this has meant simply writing code to read the relevant text files and format the data efficiently in a way that meshes well with code I wrote for the toy example. Eventually, I will compare the scores of edges from this weighting scheme to the ones on the actual HIPPIE Interactome and further analysis on the differences between the two weighting schemes will probably be among the next steps. Besides implementing the Bayesian Weighting Scheme I have also been working to understand the underlying statistics behind the methods used to weight the HIPPIE Interactome. I will try to have a document that explains the mathematics for this in similar fashion to the one for the Bayesian Weighting Scheme.
Week 2 was spent further looking into Bayesian Weighting Schemes and in particular how they were applied in a specific paper titled Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes by Poirel et. al. This was the original LINKER paper that PATHLINKER was built upon. I also spent a very decent amount of time looking at other interactomes and curating a working list of interactomes and how they weight their interactomes. So far I have only done this in detail for the HIPPIE interactome but more work is still needed. Besides research, I also worked on a document about Bayesian Weighting that aims to elucidate some of the mathematical set up and notation for the paper mentioned above as it suffered from some convoluted mathematical notation. Hopefully, this will be useful to people trying to read this paper in the future. Other tasks involved debugging code, presenting learned material and sorting through papers to find relevant papers.
My research project aims to extend upon the original PATHLINKER paper. In particular, I hope to investigate different weighting schemes both from the perspective of utilizing different data sets and data-summary methods as well as potentially maybe different statistical tools for manipulating the data in meaningful ways. This is motivated by the fact that there appears to be little overlap between interactomes consisting of the same proteins but weighted differently. Therefore, this project will allow us to analyze different schemes and develop an heuristic for weighting interactomes. As a result, we will be able to apply algorithms like PATHFINDER on graphs such that they provide results that have valid biological meaning rather than being affected by superficial factors such as trends in scientific research or biases of the scientific community. Right now, I am focusing on Bayesian Weighting schemes which attempt to use experimental evidence to provide a weight that represent the probability of a given interaction occurring in a given signaling pathway by leveraging Bayes’ theorem. I will talk more about this in future posts.
Week 1 Progress
Week 1 was spent mostly acquainting myself with the necessary prerequisite knowledge required for research during coming weeks such as Dijkstra’s algorithm (and its implementation) and Bayesian Weighting Schemes and in that regard reading some relevant papers. For the sake of keeping track the particular papers were titled :-
The first was the original PATH LINKER paper which our research hopes to extend upon. The next two provided some information on Bayesian Weighting but did not provide too much else due to some ambiguous mathematical notation. I hope to attempt to write a document that annotates the relevant portions of the papers for the sake of clarity and future reference.