More about DAGs

Hi, I am Tunc. If you have read this recent post by Jiarong Li, you might remember that we work on signaling pathway reconstruction. This is a better way of saying that we try to determine the many different orders of protein-protein interactions within a cell, from a receptor to a transcription factor, that make up a pathway. In this post, I want to talk about why and how we have combined PathLinker and our work on Directed Acyclic Graphs (DAGs).

Numbers represent proteins and directed edges represent interactions. PathLinker can find 1->3->4->5 and 2->3->4->5 as two different paths, but only 4 of the 6 edges are unique.
This only gets worse on larger-scale real data.

PathLinker finds the k shortest paths from a receptor to a transcription factor, where k is user-defined. In our graphs, the length of an edge represents the cost of a protein-protein interaction. We assume that, evolutionarily, an interaction that is important for the pathway will be optimized to happen easily, that is, to have a lower cost, which makes its edge “shorter”. Hence, one could imagine that the chains of interactions (the paths) that are important for a pathway will also be shorter, which lets us frame the reconstruction problem as a shortest-paths problem. PathLinker finds these k shortest paths very quickly, which is nice. However, as k increases, many of the paths start to reuse the same edges over and over again, so each additional path adds less new information.
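To make the edge-reuse problem concrete, here is a minimal Python sketch that counts how many previously unseen edges each path contributes. It is not PathLinker code: the function names are made up for illustration, and the two paths are the toy example 1->3->4->5 and 2->3->4->5.

```python
def edges_of(path):
    """Return the directed edges (u, v) along a path given as a list of nodes."""
    return list(zip(path, path[1:]))


def new_edges_per_path(paths):
    """For each path, in order, count the edges not seen in any earlier path."""
    seen = set()
    counts = []
    for path in paths:
        fresh = set(edges_of(path)) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts, seen


# Toy example: two paths that share the edges 3->4 and 4->5.
paths = [[1, 3, 4, 5], [2, 3, 4, 5]]
counts, unique_edges = new_edges_per_path(paths)
total_edges = sum(len(edges_of(p)) for p in paths)

print(total_edges, len(unique_edges), counts)  # 6 4 [3, 1]
```

The second path contributes only one new edge (2->3), even though it counts as a full additional path.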

With DAGs, we start with some ground-truth network, find a collection of shortest paths that are not in our network, evaluate them according to a cost function we can customize, pick the path that the cost function scores as the best, add it to our network, and repeat this process many times. This process takes a lot more time than PathLinker, and because it has so many parts, changing one thing, like the cost function or the ground-truth network, can affect the result greatly. This also means that for our program to work properly, we need to be very careful that every part works as we intend it to, and that how we intend it to work makes sense biologically.

Even though our long-term goal might be to get all of those parts working, one quick idea we could try was to impose some of these criteria on the paths PathLinker quickly generates. To do that, we first ran PathLinker to get 50,000 paths, which seemed like a big enough number, and then we wrote a script to iterate over these paths and choose the ‘acceptable’ ones. Here, the definition of acceptable is customizable, and we have tested several definitions to observe how our results change.

On the EGFR1 pathway, a stricter definition of acceptable increased recall while precision stayed roughly the same, for networks built from 1000 acceptable paths.

In the figure above, x denotes the minimum number of new edges a path must add for it to be acceptable. Increasing x resulted in an increase of recall without a significant effect on precision. This result wasn’t very surprising: repeated use of false edges does not decrease precision, because each edge is only counted once. The increase in recall also wasn’t very surprising, as all we did was require the addition of new edges, some of which had to be true protein-protein interactions. The really surprising observation came when we used this method on different pathways. Epidermal Growth Factor Receptor (EGFR) is a very big pathway, at least in the NetPath database; the data we had for the other pathways were significantly smaller. Consequently, when we used the same method on other pathways with x = 3, or sometimes even x = 2, we could never find 1000 acceptable paths. To reiterate, out of the 50,000 shortest paths we had originally, there weren’t even 1000 that each added enough new edges to our network. This meant that the reuse of edges in PathLinker was on a very big scale, underlining the potential for our new method.
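The filtering step can be sketched roughly as follows. This is not our actual script: the function names, the `limit` parameter, and the choice to start from an empty network are assumptions made for illustration, and the ranked paths and "true" edges are made-up toy data. Precision and recall are computed over sets of edges, so an edge that is reused by many paths is only counted once.

```python
def path_edges(path):
    """Set of directed edges (u, v) along a path given as a list of nodes."""
    return set(zip(path, path[1:]))


def select_acceptable(paths, x, limit=None):
    """Scan ranked paths in order; keep a path only if it adds at least x
    edges that are not yet in the growing network (assumed to start empty)."""
    network = set()
    kept = []
    for path in paths:
        new = path_edges(path) - network
        if len(new) >= x:
            kept.append(path)
            network |= new
            if limit is not None and len(kept) == limit:
                break
    return kept, network


def precision_recall(predicted_edges, true_edges):
    """Edge-level precision and recall; both arguments are sets of edges."""
    tp = len(predicted_edges & true_edges)
    precision = tp / len(predicted_edges) if predicted_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    return precision, recall


# Toy data (made up): three ranked paths and a small set of "true" edges.
ranked_paths = [[1, 3, 4, 5], [2, 3, 4, 5], [1, 2, 6, 5]]
true_edges = {(1, 3), (3, 4), (2, 6), (6, 5)}

kept_x1, net_x1 = select_acceptable(ranked_paths, x=1, limit=2)
kept_x2, net_x2 = select_acceptable(ranked_paths, x=2, limit=2)

print(kept_x1, precision_recall(net_x1, true_edges))
print(kept_x2, precision_recall(net_x2, true_edges))
```

With x = 2, the second toy path is skipped because it only adds one new edge, and the third path is taken instead. On real data the same scan would run over the 50,000 PathLinker paths with `limit=1000`; if fewer than `limit` paths are returned, there were not enough acceptable paths.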

Ecology Modeling: Thermal Variation and Phytoplankton Fitness

My name is Amy Rose, and I’m a post-bac in Anna’s lab this summer. I graduated last month with an Alt. Biology degree with an emphasis in Computer Science. Taking Anna’s classes in my first two years at Reed was the start of my interest in computational bio. I spent my junior year studying computer science at The University of Sussex, and after this summer I will be starting as a software engineer at Puppet here in Portland.

When it came time to find a thesis project, I thought it would be interesting to explore an area of biology that I hadn’t had time to study while at Reed. I was coadvised by Anna and Sam Fey, who is an ecologist. Sam’s research on thermal variation led me to my project, which focused on modeling the effect of thermal variation on freshwater phytoplankton using real world data.

Phytoplankton are ectothermic, which means that they are not able to regulate their own body temperature. Additionally, because of their small size, it is difficult to empirically measure how their body temperature varies as they move through thermally variable environments. My thesis began to resolve the impact of movement on body temperature and fitness. In this context, fitness represents the overall change in population size of phytoplankton based on temperature-dependent birth and mortality rates.

Temperature data was collected from Sparkling Lake in Vilas County, Wisconsin at intervals from 0.5 to 3 m throughout the lake, with a frequency as high as every minute, over a period of 26 years. We interpolated the collected data to estimate temperatures at depths that were not measured, as seen in the figure below.

Interpolation of data across space. Data was collected at discrete intervals, but linearly interpolated to fill in gaps.
Sparkling Lake temperature data from the 1989 season before and after interpolation. The left figure shows the recorded temperatures collected at each measured depth. The right figure was made through interpolating the temperature at each 0.01 meters given the actual data.
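As a rough illustration of the interpolation step, here is a minimal pure-Python sketch that linearly interpolates one temperature profile onto a 0.01 m depth grid. It is not the thesis pipeline: the function name is hypothetical, and the depths and temperatures below are made-up values, not Sparkling Lake measurements.

```python
def interpolate_profile(depths, temps, step=0.01):
    """Linearly interpolate a temperature profile onto a regular depth grid.

    depths must be strictly ascending and the same length as temps.
    Returns (grid, values), with grid running from depths[0] to depths[-1].
    """
    if len(depths) != len(temps) or len(depths) < 2:
        raise ValueError("need at least two matching depth/temperature points")
    n_steps = int(round((depths[-1] - depths[0]) / step))
    grid, values = [], []
    j = 0  # index of the measured interval [depths[j], depths[j + 1]]
    for i in range(n_steps + 1):
        # Clamp to the deepest measurement to guard against float drift.
        d = min(depths[0] + i * step, depths[-1])
        while j < len(depths) - 2 and d > depths[j + 1]:
            j += 1
        d0, d1 = depths[j], depths[j + 1]
        t0, t1 = temps[j], temps[j + 1]
        frac = (d - d0) / (d1 - d0)
        grid.append(d)
        values.append(t0 + frac * (t1 - t0))
    return grid, values


# Made-up profile: depth in meters, temperature in degrees Celsius.
measured_depths = [0.5, 1.0, 2.0, 3.0]
measured_temps = [22.0, 21.0, 17.0, 12.0]

grid, values = interpolate_profile(measured_depths, measured_temps)
print(len(grid), values[0], values[-1])
```

Repeating this for every timestamp gives a temperature estimate at each 0.01 m of depth, which is the kind of filled-in grid a simulated phytoplankton can then be moved through.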

We created five algorithms representing different theoretical patterns of phytoplankton movement throughout the water column, which we plotted against the data. This gave us a framework to understand the limits of what body temperatures phytoplankton may be experiencing. The second stage of the project was to plot these simulated body temperatures against a function representing phytoplankton fitness.

This summer, we hope to extend my thesis research over space and time. For my thesis, we focused on a single season, but we’re currently looking at extending the movement algorithms over all 26 years of data. We’re also interested in exploring more datasets sourced from lakes in different geographical locations. Additionally, we’re analyzing the effects of changes to the fitness function.

Summer Research 2019 – here we go!

Reed has finished for the year, but that doesn’t mean that students are done. Last week kicked off a slew of undergraduate researchers doing all kinds of research. In no particular order, here’s a taste of what people will be working on in the compbio lab. Stay tuned for occasional group updates.

Math-CS major Jiarong (Lee) Li ’21 and biology major Tunc Kose ’22 are going to develop algorithms to analyze a cell’s response to external signals (called signaling pathways). They will be working to extend ideas based on the original PathLinker paper and Ibrahim Youssef’s Localized-PathLinker paper.

Recent graduate Amy Rose Lazarte ’19 (alt. bio major with a CS emphasis) will continue to develop a resource and modeling framework for understanding the effect of thermal variation on freshwater phytoplankton. Co-advised by ecologist Sam Fey, she has developed a computational pipeline to analyze longitudinal lake temperature data using simulations of phytoplankton swimming strategies.

Biology major Tayla Isensee ’20 is working on identifying targets of retinoic acid signaling in zebrafish eye development. She has a hand in the wetlab work with developmental biologist Kara Cerveny, and she will be building a zebrafish protein-protein interaction network to find potential regulators to test. First, though, she’s going to hunt for retinoic acid response elements (RAREs) in the zebrafish genome to identify direct targets of retinoic acid.

Another recent graduate, neuroscience major Alex King ’19, will be wrapping up his thesis work to build a network that integrates gene, transcript, and protein relationships in order to identify dysregulated pathways in polygenic diseases based on genome-wide association study (GWAS) data.

Biology major Karl Young ’20 will be reading up on computational modeling in neuroscience, and figuring out the intersection of my world (algorithms for biological networks) and neurobiologist Erik Zornik’s world (neural circuits and how they affect behavior).

Last but not least, CS graduate Ananthan Nambiar ’19 will be getting his thesis ready to present as a poster at ISMB/ECCB in Basel later this summer. He modeled proteins as language with the help of his main advisor, natural language processing (NLP) expert Mark Hopkins in CS.