Last week, we were lucky enough to attend the GCC/BOSC conference hosted at Reed. Although I spent the majority of the week volunteering and attending sessions, we also had a deadline for a two-page extended abstract of our CNB-MAC paper to be submitted to the main conference proceedings. We successfully cut down our original paper into two pages and submitted the abstract on Friday.
As of now, we’re almost ready to dig into the experimental portion of this project, but there are still a few things to iron out. At the end of last week, Alex noticed a “bug” in how we’ve been calculating the AUC of our multi-layer algorithm, so one of our priorities this week is to find a more accurate AUC calculation that better represents our method. Once that is done, we will move on to assembling a list of candidates for experimental validation. Hopefully by the end of the week we will have chosen 8-10 candidates that are not only biologically interesting but also conserved in Drosophila.
This week I spent the majority of my time at the GCC/BOSC conference, both volunteering and attending talks. Most of the talks concerned the software suite Galaxy and how people were adapting or extending it for different use cases and new applications. Since I have not used Galaxy before and do not intend to, these talks were not very useful to me. Some talks, however, addressed more general biological problems and issues within the scientific community. These were accessible to a wider audience, and I was able to learn some aspects of statistics and experimental design from them, including the importance of talking to the community of people who have the particular condition being studied, as well as best practices for handling data.
I spent some portion of my time fixing minor bugs in my code for the Bayesian Weighting Scheme. As a follow-up to a previous post, the run time of the code is now down to 2 or 3 minutes at most. Using sets and tuples reduced the run time by several orders of magnitude, and some other time-saving measures also contributed to this much lower number. This is now comfortably within the time frame of the CREU, so the code probably won’t need much more speeding up. Next steps involve changing the code to weight the PathLinker Interactome to check that it works accurately, as well as parsing through the math carefully.
This week, my coworkers and I attended the GCC-BOSC (Galaxy Community Conference in association with the Bioinformatics Open Source Conference), so there was little to no research done.
This week we spent most of our time attending and volunteering at GCC/BOSC (Galaxy Community Conference / Bioinformatics Open Source Conference). While the conference wasn’t extremely relevant to what we’re working on right now, we learned how to use software like Galaxy and InterMine, which might be useful to us in future projects. Also, this was the first conference I’ve ever attended, so it was very interesting to learn how conferences work and to meet people.
This week, there is a conference on campus for Galaxy and open source bioinformatics projects. Also, we’re refining the code.
This week, we scrambled to make figures for the paper we are submitting to the conference. Mostly just writing and figure making.
This week, I discovered how to download files from the TCGA database, and explored their structure.
So this week, I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database). These were compressed into a tar file, which unpacked into a directory tree that looked like this:
Each tiny blue dot is a folder, and inside each folder is one gzipped (compressed) file. Each of these files is a sample from a patient, listing genes and their expression levels (from either a tumor sample or a healthy one).
So my main task has been to parse these files into a data matrix, ideally with sample IDs across the top and genes down the side. So far, I’ve been able to group the files by patient and the samples relating to them, because ideally we want to compare gene expression in a healthy sample and a tumor sample from the same patient. However, I haven’t yet been able to write the sample IDs and gene expression data into a single file because of multiple bugs and errors. My goal for next week is to get these errors fixed.
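As a sketch of the kind of parser I’m aiming for, here is a minimal stdlib-only version. It assumes each gzipped file holds tab-separated `gene<TAB>value` lines and that the file name begins with the sample ID; both are assumptions about the download layout, and all names are illustrative rather than the actual code.

```python
import gzip
import os

def build_expression_matrix(root_dir):
    """Walk the unpacked directory tree and collect
    {sample_id: {gene: expression_value}}.

    Assumes each .gz file holds tab-separated `gene<TAB>value` lines and
    that the file name begins with the sample ID (adjust to the real layout).
    """
    matrix = {}
    for dirpath, _, filenames in os.walk(root_dir):
        for name in filenames:
            if not name.endswith(".gz"):
                continue
            sample_id = name.split(".")[0]
            with gzip.open(os.path.join(dirpath, name), "rt") as fh:
                matrix[sample_id] = dict(
                    line.rstrip("\n").split("\t")
                    for line in fh if line.strip()
                )
    return matrix

def write_matrix(matrix, out_path):
    """Write a TSV with sample IDs across the top and genes down the side."""
    samples = sorted(matrix)
    genes = sorted({g for expr in matrix.values() for g in expr})
    with open(out_path, "w") as out:
        out.write("gene\t" + "\t".join(samples) + "\n")
        for gene in genes:
            values = [matrix[s].get(gene, "NA") for s in samples]
            out.write(gene + "\t" + "\t".join(values) + "\n")
```

Genes missing from a given sample are written as `NA`, which keeps the matrix rectangular even when samples don’t share identical gene lists.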
I finally managed to download all of the GO Terms and spent some time processing the data and editing previous code so that it all fit together. After doing so and obtaining a working version of the code, an initial run revealed that weighting the entire HIPPIE Interactome would take approximately 101 days. This is well outside the duration of the CREU, so clearly something needed to be done. To fully explain the issues that led to such a terrible run time, it is important to first have a rough understanding of the structure of the code.
This iteration of the code first produces a list of true positives and a list of true negatives based on GO Term co-annotation. This section takes about 4 minutes (and can potentially be sped up later), but it contributes negligibly to the 3-month run time. The next portion implements the functions that together calculate the cost of a given edge. The necessary inputs are
- A list of all the evidence types.
- A dictionary whose keys are the evidence type labels and whose values are lists (“evidence lists”) of the protein edges (each represented as a list) confirmed to occur by that particular evidence type.
- A list of true positives and a list of true negatives.
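To make those inputs concrete, here is a toy sketch of their shapes; the protein IDs and evidence type names are all invented for illustration:

```python
# Toy sketch of the three inputs (all protein and evidence names invented):
evidence_types = ["coexpression", "two_hybrid", "literature"]

# evidence type label -> "evidence list" of protein edges, each edge a list
evidence_lists = {
    "coexpression": [["P1", "P2"], ["P2", "P3"]],
    "two_hybrid":   [["P1", "P3"]],
    "literature":   [["P2", "P3"], ["P1", "P2"]],
}

true_positives = [["P1", "P2"]]
true_negatives = [["P1", "P3"]]
```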
Finding the cost involves calculating the intersection and symmetric difference of the positives with the evidence lists, and then the intersection and symmetric difference of the negatives with the evidence lists. This was done for each evidence list every time a new edge was weighted. This approach took roughly 2.5+ minutes per edge; with 93,138 edges, it was the main contributor to the 3-month run time. Further inspection reveals that intersecting two sets takes O(min(len(s), len(t))) time, whereas the same operation on lists requires a linear membership scan for every element, making it far more expensive. Performing these operations from scratch every single time was therefore very costly. Simply calculating these numbers once at the outset and then appending them in a particular order to the end of each evidence list reduced the time needed to weight each edge to approximately 0.45 seconds, shaving the total run time down to about 11 hours.
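A minimal illustration of the fix, using made-up evidence types and edges: the four overlap counts are computed once per evidence type, up front, instead of inside the per-edge weighting loop.

```python
def overlap_counts(evidence, positives, negatives):
    """The four numbers needed per evidence type:
    (|P & E|, |P ^ E|, |N & E|, |N ^ E|) -- intersections and
    symmetric differences with the positives and negatives."""
    e = set(evidence)
    p, n = set(positives), set(negatives)
    return (len(p & e), len(p ^ e), len(n & e), len(n ^ e))

# Made-up evidence lists, positives, and negatives (edges as tuples):
evidence_lists = {
    "coexpression": [("P1", "P2"), ("P2", "P3")],
    "two_hybrid": [("P1", "P3")],
}
positives = [("P1", "P2")]
negatives = [("P1", "P3")]

# Compute once, at the outset, instead of once per weighted edge:
precomputed = {etype: overlap_counts(edges, positives, negatives)
               for etype, edges in evidence_lists.items()}
# Each edge weighting then just reads precomputed[etype].
```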
Another change that should significantly reduce the run time further is using sets instead of lists for the protein edges in the dictionary and representing the edges themselves as tuples (lists are not hashable, so they cannot be set elements). This will mean creating a separate dictionary with identical keys whose values are the four calculated numbers mentioned above. This should shave off a large chunk of time because element lookup in a set is significantly faster than the same operation in a list. Another idea is to reorder the calculations to reduce how often the exact same computation is repeated, or to iterate over the approximately 230 evidence types rather than the 93,138 edges.
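A quick, self-contained demonstration of why this matters (the edge names are synthetic): membership testing in a set is constant time on average, while a list must be scanned element by element.

```python
import timeit

# Edges as tuples -- lists are unhashable and cannot go in a set:
edges_list = [("P%d" % i, "P%d" % (i + 1)) for i in range(100000)]
edges_set = set(edges_list)

probe = ("P99998", "P99999")  # near the end: worst case for a list scan

t_list = timeit.timeit(lambda: probe in edges_list, number=100)
t_set = timeit.timeit(lambda: probe in edges_set, number=100)

# Average-case set membership is O(1); the list scan is O(len(edges_list)).
print("list: %.4fs  set: %.6fs" % (t_list, t_set))
```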
With our big CNB-MAC paper deadline passed last Friday, my tasks this week have been relatively simple. My first goal this week was to take the bar chart figures in the paper and turn them into box and whisker plots.
For example, I took this figure:
and turned it into this:
The bar chart figure shows the mean AUC of each positive set for varying numbers of layers, while my figure is a little more descriptive: it shows the distribution of AUCs, including the median, interquartile range, and outliers.
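A box plot like this takes only a few lines of matplotlib. The AUC values below are invented placeholders, not our results; `showmeans=True` overlays a mean marker so the plot conveys both the mean (as the bar chart did) and the distribution.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt

# Placeholder AUC values per layer count (invented numbers, not our results)
aucs = {
    "1 layer": [0.71, 0.74, 0.69, 0.73, 0.72],
    "2 layers": [0.76, 0.78, 0.75, 0.80, 0.77],
    "3 layers": [0.78, 0.79, 0.81, 0.83, 0.60],  # 0.60 plots as an outlier
}

fig, ax = plt.subplots()
ax.boxplot(list(aucs.values()), showmeans=True)
ax.set_xticks(range(1, len(aucs) + 1))
ax.set_xticklabels(aucs.keys())
ax.set_ylabel("AUC")
fig.savefig("auc_boxplot.png")
```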
Next, we got some surprising news! We were notified that after a preliminary pass of the paper submissions, our paper was accepted to the CNB-MAC conference as either a talk or a poster (we’re still waiting to hear which one). This gave us the opportunity to submit a two-page abstract by the end of next week, a more challenging task. I wrote up a quick rough draft that I will continue working on next week with Alex.
Finally, with our results generated and our paper submitted, it was time to clean up our GitHub repo. This included deleting a lot of old code and useless output files, as well as restructuring the code and writing README.md files. It’s looking a lot better now!