Graphery

Hi, I am Larry Zeng, a rising sophomore in Math-CS. I've been working with Anna on a website called Graphery, which provides interactive graph algorithm tutorials. A beta version is up and running, hosted at https://graphery.reedcompbio.org. The software is open source and can be found here. This project was also presented as a poster at ACM-BCB.

In this post, I will do my best to describe what this website is about and what I did.

Graphery: an interactive graph algorithm website. Tutorial

This page demonstrates the basic structure of a tutorial. It has three sections: a graph section, a code section, and a text section. The graph section displays real biological graphs that the algorithm described in the tutorial can run on. The code section not only displays code but also walks through it step by step while watching variables, so that changes can be visualized. The orange box and the arrow right next to it indicate which line is being "executed" (more on that later). On the right side of the editor section is a watched-variable list, which shows the value each variable is referencing. If a variable references a graph object, that is, a node or an edge, the object is highlighted on the graph accordingly. The text section contains the written description of the algorithm.

To provide a pure experience for those who want to practice, or who are interested in a particular graph and want to see different algorithms applied to it, we also provide a playground. It's very similar to a tutorial page, except there is no text section.

Graphery: an interactive graph algorithm website. Playground

The intended users of this website are biologists who are new to computing. We designed the website for them, and there are two highlights: real biological graphs, and interaction. These should help the audience in ways that traditional algorithm tutorials do not.

I can't talk much about the first point since I am not a biology major. But I can say that real biological graphs should reduce the anxiety and unfamiliarity.

As for interaction, which I spent a relatively large amount of time thinking about, it's not just about dragging components of a graph or the ability to zoom in and out. It includes proper input interfaces and feedback.

The current input interface is provided by this control strip.

Graphery: an interactive graph algorithm website. Control Strip

Users can either click the buttons or use the slider to pinpoint the exact position they want. This e-book-like control system should be clear and easy to get started with. However, it also has some drawbacks, extensibility being one of them: adding more buttons to the strip would increase its complexity, and on devices with slightly smaller screens the buttons are already too cramped to use.

Feedback is also important, and most feedback is provided through the control strip. We are glad that we found a way to tie code execution and the graph together. On the surface, it works really well and provides enough information about the algorithm. However, some details are not nailed down yet. For example, in the picture above, you can (barely) see on the slider right above the code editor that a simple degree-counting algorithm takes around 500 steps to finish. Who would click 500 times? The visualization is not perfect either. In the variable list, for example, the contents of the variables are just laid out in plain text, which is hard to read and may add to the anxiety. There is a long way to go, and I hope you will get to see a better version.

More feedback comes from customizing code and running it. This is where things get a little tricky: the audience of this website should be beginners who, if I picture them right, just know what if statements are and how to write loops, and this functionality is not really tailored to their learning experience, since reaching the system's full potential requires knowing what decorators are and reading the system's documentation or even its source code. However, I think it's a key part of the experience that makes this kind of tutorial unique: traditional tutorials usually don't let you visualize the results of your own code. I hope things will become clearer when we have real-world users.

It will be quite interesting to see how this system works with real users and how we can improve it to provide a better experience.

The last part of this post is just some notes on what I did. Overall, the codebase should be improved to make future maintenance easier. If you find anything wrong in the code, please let me know: file an issue or send a PR. I really appreciate your help.

I used Vue.js and Quasar as the frontend frameworks. With the help of Cytoscape.js, I was able to build the graph display section. The good news is that the frontend looks good and the UI is complete. However, the final result is not everything I hoped for. I am looking forward to rewriting a better version in the next few months, after the new version of the framework comes out.

The backend is powered by Django and serves a GraphQL API. I also modified PySnooper so that I can inspect Python code. I promised to talk about the "virtual execution", so here are some details. Python allows users to pass a function to sys.settrace; whenever the interpreter executes a line, it hands that function the stack frame of the line and an event describing what the line is doing (see the docs for more details). The function I provide looks at the stack frame, evaluates each watched variable, compares the values against those recorded from previous frames, records the changes, and finally updates the variable list.
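Graphery's real tracer is a modified PySnooper, but the core mechanism can be sketched in a few lines (the watched-variable names and the toy graph below are hypothetical):

```python
import sys

WATCHED = {"node", "degrees"}  # hypothetical watched-variable names

def trace(frame, event, arg):
    # The interpreter reports 'call', 'line', 'return', ... events here.
    if event == "line":
        # Snapshot each watched variable visible in the current frame.
        snapshot = {name: frame.f_locals[name]
                    for name in WATCHED if name in frame.f_locals}
        print(frame.f_lineno, snapshot)
    return trace  # keep tracing inside this frame

def count_degrees(graph):
    degrees = {}
    for node in graph:
        degrees[node] = len(graph[node])
    return degrees

sys.settrace(trace)
count_degrees({"a": ["b"], "b": ["a"]})
sys.settrace(None)
```

The real tracer records snapshots rather than printing them, and diffs consecutive snapshots to produce the step-by-step timeline that the slider walks through.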

Thank you, Anna, for providing such an opportunity. And big thanks to all the contributors who, directly and indirectly, made this project happen.

Cloud Computing Platforms Benchmarking

Hi everyone! My name is Jiarong Li and I'm a senior majoring in math and computer science at Reed. This summer, to offer reference information that can help computational researchers choose a suitable cloud computing platform, I worked with Anna and Trina to compare four cloud computing platforms: Google Cloud, Amazon Web Services (AWS), the Extreme Science and Engineering Discovery Environment (XSEDE), and the Institute for Bioinformatics and Evolutionary Studies at the University of Idaho (iBEST CRC).

By running three cloud computing projects from different fields, I analyzed the practical behavior of these platforms in terms of execution time, computing capacity, storage, customizability, compatibility, and convenience. I designed a scoring rubric, then tabulated and summarized the testing results with various plots.

Projects and Testing Results

1. Computational Biology: Augmenting signaling pathway reconstructions (PRAUG) [1]

By running PRAUG to reconstruct signaling pathways and recording the runtime, I focused on testing the computing ability of the same types of virtual machines across the different platforms.

Testing Result

The trend shown in the graph above tells us that virtual machines with more vCPUs take less execution time. Since the PRAUG code was originally set up to run on a 4-core machine and does not use additional threads, the decrease in execution time levels off once the number of vCPUs rises above 6.

In the graphs above, I mainly compared the performance of virtual machines from Google Cloud and AWS, since they have comparable customizability. With the same virtual machine settings, we can see that AWS outperforms Google Cloud.

XSEDE and iBEST offer less customizability and fewer virtual machine options. I chose several login nodes to run the program, and their testing results are:
14 CPUs & 128 GB: 13m33.907s (XSEDE)
16 CPUs & 3 TB: 23m5.172s (XSEDE)
40 CPUs & 192 GB: 31m28.872s (iBEST)

2. Large Image File Subsampling and Transfer: Tabular data files for 3D still images

This project runs multithreaded R code that randomly generates a specified number of 3D image subsamples, each with a given number of pixels. It mainly focuses on testing virtual machines with multiple vCPUs, as well as each platform's capability and convenience for transferring and storing data.

Testing Result

The graph above shows the difference in execution time when we run the single-threaded and multithreaded code with the same inputs on a virtual machine with 8 cores and 64 GB of memory.

In this project, by running multithreaded code, we can clearly see the influence of the number of vCPUs on execution time.

When I set the number of subsamples to 20 and the pixels per sample to 100, the runtimes of this R code on VMs from Google Cloud and AWS are as shown above. Again, AWS outperforms Google Cloud to some extent. Moreover, I tested more of the resources provided by XSEDE and iBEST in this project, and Jetstream from XSEDE performed really well.
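The benchmark itself is written in R, but the single-threaded versus multithreaded setup can be sketched in Python with a trivial stand-in workload (all names and sizes here are made up):

```python
import random
import time
from multiprocessing import Pool

def subsample(seed, n_pixels=100):
    # Stand-in for the real work: draw n_pixels random voxel values.
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_pixels)]

def run(n_subsamples, workers):
    start = time.perf_counter()
    if workers == 1:
        results = [subsample(i) for i in range(n_subsamples)]
    else:
        with Pool(workers) as pool:
            results = pool.map(subsample, range(n_subsamples))
    return time.perf_counter() - start

if __name__ == "__main__":
    for workers in (1, 8):
        print(f"{workers} worker(s): {run(20, workers):.3f}s")
```

With a workload this small, the parallel version may even be slower because of process startup costs; the effect in the plots above only shows up once each subsample is expensive to compute.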

In addition to multithreaded computing, I also compared the storage and data transfer options of these four platforms. I personally like AWS's S3 buckets best, since files stored in S3 are very convenient to access from other platforms and the security settings for private files are highly customizable. Data transfer among these four platforms and other applications can be achieved easily with Globus.
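For example, pulling input data onto a VM and pushing results back takes only a couple of boto3 calls (the bucket and file names here are hypothetical, and credentials come from the usual AWS configuration):

```python
import boto3

s3 = boto3.client("s3")

# Download an input file from a bucket, then upload a result file.
s3.download_file("my-imaging-bucket", "inputs/image_table.csv", "image_table.csv")
s3.upload_file("subsamples.csv", "my-imaging-bucket", "results/subsamples.csv")
```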

3. Machine Learning: Running a machine learning experiment in a specific Docker Image

This project mainly focuses on testing GPU virtual machines and their compatibility with other applications. 

Testing Result

The graph above compares the runtimes of training the ML model on VMs with an NVIDIA Tesla T4 GPU. We can see that VMs from AWS run faster than those from Google Cloud.

I also tried VMs with different types of GPUs and got the results shown in the graph above. See NVIDIA's site for more information about Tesla GPUs.
I didn't run this project on GPU nodes from XSEDE or iBEST, since I did not have permission to set up Docker on those VMs. Generally, on all four platforms, users need to request access to GPU VMs and will be granted or denied access depending on their specific situation.

Summary

Based on all the testing results and the user experience I gained during this process, I evaluated the four platforms according to the scoring rubric designed at the beginning.

Thanks to Anna, Tobias, Kelly, and Mark for providing cloud computing projects tested in this research.

References

[1] Rubel, Tobias, and Anna Ritz. “Augmenting Signaling Pathway Reconstructions.” BioRxiv. January 01, 2020. Accessed August 29, 2020. https://www.biorxiv.org/content/10.1101/2020.06.16.155853v1.full.

Summer 2020 Research Recap

We have passed the 90th consecutive night of Portland protests for Black Lives Matter. That’s three full months. That’s a whole summer.

There have been over 25,000 COVID-19 cases and 438 deaths in Oregon. The number of infected individuals totals the entire population of Forest Grove, Happy Valley, or Wilsonville.

And yet, somehow, the students and staff who did research with me did phenomenally. Let me be clear: our success as a research group was driven by the collective effort of everyone involved.


Integrating Cancer Data into Interactomes

Hi all! My name is Aryeh Stahl and I’m a math and computer science dual major at Reed. I’ll be a senior this fall and will be starting my thesis. I had a great time working with Anna this summer!

An interactome is a weighted graph with nodes representing proteins and directed edges representing interactions between these proteins. Anna Ritz et al. previously developed PathLinker, an algorithm that finds the k shortest paths from a set of source nodes to a set of target nodes. It aims "to reconstruct the interactions in a signaling pathway of interest [2]."

In my project, I aimed to integrate cancer data into the previously generated PathLinker interactome to find significant pathways and/or proteins involved in a given cancer. I obtained driver p-values for most proteins in the interactome. These p-values "indicate [genes'] statistical significance as drivers, according to diverse methods that account for positive selection, functional impact of mutations [and] regional mutation rates [1]."

Once I had mapped these p-values to the corresponding proteins in the PathLinker interactome, I created new interactomes by diffusing the p-values to outgoing edges and calculating a weighted average with the old edge scores. For example, if a node had a p-value of 0.5 and 5 outgoing edges, the corresponding contribution would be 0.1 for each of the 5 outgoing edges. I then set a parameter, beta, which balances the weight of the new values against the old ones. If beta = 0, then all of the new edge weight comes from the p-value, while beta = 1 gives back the original interactome. The following GIF shows the distribution of edge weights as beta changes (beta is 0 in the frames where most of the scores are close to 0, and 1 where the distribution is very discrete).


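Concretely, the reweighting can be sketched with NetworkX (the node names and the default p-value for unmapped proteins are hypothetical):

```python
import networkx as nx

def reweight(interactome, pvals, beta):
    """Blend original edge weights with diffused node p-values.

    beta = 1 keeps the original interactome; beta = 0 replaces each edge
    weight with the source node's p-value split evenly over its outgoing
    edges (a p-value of 0.5 with 5 out-edges contributes 0.1 per edge).
    """
    g = interactome.copy()
    for u, v, data in g.edges(data=True):
        contribution = pvals.get(u, 1.0) / g.out_degree(u)
        data["weight"] = beta * data["weight"] + (1 - beta) * contribution
    return g

# Toy usage with made-up proteins and weights:
g = nx.DiGraph()
g.add_edge("WNT3", "FZD1", weight=0.8)
g.add_edge("WNT3", "LRP6", weight=0.6)
print(reweight(g, {"WNT3": 0.5}, beta=0.5).edges(data=True))
```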
The input to PathLinker is an interactome, as well as two sets of nodes: targets and sources. The output is a list of proteins and interactions, ranked "by their first occurrence in the k shortest paths from any [source] to any [target] [2]." So I can run PathLinker on my new interactomes and get a list of significant edges as output. Ideally, these edges would be significant both for the pathway in question (such as the WNT signaling pathway) and for the cancer. Thus, we could identify potential interactions that could be disrupted and contribute to the cancer.

My next issue was to determine whether my output was actually significant. I tried to do this by fabricating some data and seeing if my method recovered the edges that I expected. I created randomly weighted interactomes and chose a previously known signaling pathway as a subgraph (in this case, WNT); I shall call this subgraph A. I then chose a scale factor and upweighted the edges in that subgraph, e.g. multiplying each edge weight by a factor of 1.5. Next, I chose a random subset of the nodes of subgraph A (of a specified size, say 50% of the nodes) to serve as an intersection with another subgraph, which I shall call subgraph B; B corresponds to nodes with a high p-value in the cancer data. I then added some more nodes that are connected to this intersection and lie outside A (thus making subgraph B the same size as subgraph A). I finally gave each of the nodes in subgraph B a high p-value and created a new interactome. A rough sketch of the construction follows.
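(NetworkX again; the parameter values are just the examples from the text, and the real experiment chooses nodes and weights more carefully.)

```python
import random
import networkx as nx

def make_contrived(interactome, pathway_nodes, scale=1.5, overlap=0.5, high_p=0.99):
    g = interactome.copy()
    a = set(pathway_nodes)                      # subgraph A: a known pathway
    # Upweight every edge whose endpoints both lie in subgraph A.
    for u, v, data in g.edges(data=True):
        if u in a and v in a:
            data["weight"] *= scale
    # Subgraph B: a random half of A's nodes, padded with neighbors
    # outside A until B matches A's size.
    b = set(random.sample(sorted(a), int(overlap * len(a))))
    for u in list(b):
        for v in g.successors(u):
            if v not in a and len(b) < len(a):
                b.add(v)
    pvals = {n: high_p for n in b}              # B's nodes get a high p-value
    return g, pvals
```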

Running PathLinker on this contrived data with the WNT sources and targets, I would expect a high percentage of the resulting significant edges to fall in the intersection of subgraph A and subgraph B. I would further expect more of the significant edges to land in the intersection as I upweight subgraph A by a larger factor. This is exactly what I got. The following plot shows the results from several trials of PathLinker on new interactomes with various upweighting factors, finding the k = 100 shortest paths.






References

[1] Reyna, M.A., Haan, D., Paczkowska, M. et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat Commun 11, 729 (2020). https://doi.org/10.1038/s41467-020-14367-0


[2] Ritz, A., Poirel, C., Tegge, A. et al. Pathways on demand: automated reconstruction of human signaling networks. npj Syst Biol Appl 2, 16002 (2016). https://doi.org/10.1038/npjsba.2016.2

Hypergraph Algorithms

Hey everyone! I am Alex Richter, a rising junior in math and computer science working with Anna Ritz this summer on hypergraph algorithms. The algorithms that I implemented this summer come from a paper co-authored by Nick Franzese, Anna Ritz, and Adam Groce [1]. An implementation of the algorithms already existed in Python; however, I learned Java and reimplemented them in Java so they could be incorporated into a plugin that is written in Java.

We are working on incorporating the algorithms into the application Cytoscape via a plugin called ReactomeFIViz. In doing so, we hope that scientists will begin to examine hypergraph topology when checking whether two molecules are connected within a cell. By viewing a different topology, as opposed to simply directed or bipartite graphs, different pathways can be discovered; moreover, some nodes that are connected in a graph topology might not be connected in a hypergraph topology. The potential of hypergraph connectivity for better understanding connectivity within a cell deserves further exploration.

The algorithm that I implemented relaxes the definition of B-connectivity and determines if two nodes are connected in a hypergraph topology. Here are some definitions from the paper: 

  • Directed hypergraphs represent reactions with many-to-many relationships, where each hyperedge e = (Te, He) has a set of entities in the tail Te and a set of entities in the head He (here, the tail denotes the hyperedge inputs and the head denotes the hyperedge outputs). [1]
  • B-connectivity: requires all the nodes in the tail of a hyperedge to be visited before it can be traversed. This definition has a natural biological meaning in reaction networks: B-connectivity requires that all reactants of a reaction must be present in order for any product of that reaction to be reachable. Unlike the compound graph rules, B-connectivity describes the strictest version of connectivity, and it is the only rule used to traverse hyperedges. [1]
    • Given a directed hypergraph H = (V, E) and a source set S ⊆ V, a node u ∈ V is B-connected to S if either (a) u ∈ S or (b) there exists a hyperedge e = (Te, He) where u ∈ He and each element in Te is B-connected to S. We use B(S) to denote the set of nodes that are B-connected to S in H. [1]

To be B-connected means that in order to traverse a hyperedge, all of the nodes in its tail (input) set must be present. B-connectivity can be a little difficult to understand at first glance; it is a harsh restriction on hypergraph connectivity. A node is B-connected to the source set if one of two things is true: either the node is in the source set itself, in which case it is trivially B-connected, or the node is in the head (output) set of some hyperedge where every element of the tail (input) set is B-connected to the source set.

b_visit Algorithm 

The algorithm that I implemented has two parts. The first part, called b_visit, takes in a hypergraph and a set of source nodes, s. The b_visit algorithm returns three sets: the set of hyperedges traversed from s, the set of nodes B-connected to s, and the set of restricted hyperedges. Restricted hyperedges are edges that can be reached but not traversed, because they are missing an input in the tail that would make them traversable. For example, given the source set {A, B} for this hypergraph, the set of restricted hyperedges would be:

Tail Set        Head Set
G, D            L
D, H            M
I, N            O
Set of Restricted Hyperedges

The set of traversable hyperedges would be: 

Tail Set        Head Set
A               C
A, B            D
B               E
B               F
E, F            I
Set of Traversable Hyperedges

The set of nodes B-connected to {A, B} would then be: C, D, E, F, I. (A code sketch of b_visit appears after the figure.)

Image taken from paper [1].
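My implementation is in Java, but the idea fits in a short Python sketch. This is a simple fixed-point version rather than the paper's more efficient algorithm, with the hypergraph represented as a list of (tail, head) set pairs:

```python
def b_visit(hyperedges, sources):
    """Return (traversed, b_connected, restricted) for a source set."""
    connected = set(sources)
    traversed, restricted = set(), set()
    changed = True
    while changed:
        changed = False
        for i, (tail, head) in enumerate(hyperedges):
            if i in traversed:
                continue
            if tail <= connected:       # every input present: traverse it
                traversed.add(i)
                restricted.discard(i)
                connected |= head
                changed = True
            elif tail & connected:      # reached, but an input is missing
                restricted.add(i)
    return traversed, connected, restricted

# The toy example from the tables above; sources are A and B.
toy = [({"A"}, {"C"}), ({"A", "B"}, {"D"}), ({"B"}, {"E"}), ({"B"}, {"F"}),
       ({"E", "F"}, {"I"}), ({"G", "D"}, {"L"}), ({"D", "H"}, {"M"}),
       ({"I", "N"}, {"O"})]
print(b_visit(toy, {"A", "B"}))
```

Note that the returned B-connected set also contains A and B themselves, since nodes in the source set are trivially B-connected.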

b_relaxation Algorithm 

The b_visit algorithm helps when trying to find out which nodes are B-connected to the source set, i.e., the set of nodes the algorithm is called on. This is helpful for finding those immediate nodes; however, since the requirements for B-connectivity are so strict, it does not include all of the nodes that can be reached from the source set in the hypergraph. By calling b_visit multiple times and relaxing the requirements for B-connectivity, more nodes can be viewed as connected in the hypergraph.

The second algorithm from the paper is called b_relaxation, and it makes repeated calls to b_visit. The input to b_relaxation is once again the hypergraph and a set of source nodes, s. Through multiple calls to b_visit, b_relaxation relaxes the strict conditions of B-connectivity and runs over the entire hypergraph to determine which nodes can be reached from the source set. The return value is a dictionary of k-distances, where each key is a node and the value is its distance. The k-distance is easiest to understand as the number of relaxation rounds needed before the algorithm can reach the node in question; if a node is in the source set, for instance, its k-distance is 0. By relaxing the harsh conditions of B-connectivity, the algorithm essentially traverses restricted hyperedges to find all of the nodes that are connected in the hypergraph. (A sketch follows the example below.)

Using the same toy example, b_relaxation would return the following dictionary of distances.

Key     Value
A       0 (source set)
B       0 (source set)
C       0
D       0
E       0
F       0
G       INFINITY
H       INFINITY
I       0
J       INFINITY
K       INFINITY
L       1
M       1
N       INFINITY
O       1
P       2
Q       2
R       3
Dictionary of distances for the toy hypergraph after running the b_relaxation algorithm
Image taken from paper [1].
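Here is a matching Python sketch that builds on b_visit above, again simplified relative to the paper's algorithm. Each round treats the heads of the restricted hyperedges as reachable and re-runs b_visit; a node first reached in round k gets k-distance k. On the toy edges listed earlier it assigns A–F and I distance 0 and L, M, O distance 1, in agreement with the table (the full figure contains additional hyperedges that reach P, Q, and R):

```python
def b_relaxation(hyperedges, sources):
    """Map every node to its k-distance (float('inf') if unreachable)."""
    nodes = set()
    for tail, head in hyperedges:
        nodes |= tail | head
    dist = {v: float("inf") for v in nodes}
    frontier = set(sources)
    k = 0
    while True:
        _, connected, restricted = b_visit(hyperedges, frontier)
        for v in connected:
            dist[v] = min(dist[v], k)   # first round that reaches v wins
        # Relax: pretend the restricted hyperedges' heads are reachable.
        next_frontier = frontier | connected
        for i in restricted:
            next_frontier |= hyperedges[i][1]
        if next_frontier == frontier:   # nothing new to relax: done
            break
        frontier = next_frontier
        k += 1
    return dist
```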

Anna and I are working with our collaborators on incorporating this into ReactomeFIViz before adding it to Cytoscape. The next steps include cleaning up the code, making sure it works properly in the plugin, and considering further hypergraph algorithms to implement.

References: 

[1] Franzese N, Groce A, Murali TM, Ritz A (2019) Hypergraph-based connectivity measures for signaling pathway topologies. PLOS Computational Biology 15(10): e1007384. https://doi.org/10.1371/journal.pcbi.1007384

Nets Full of Fish Brains

Hey folks, my name is Gabe Preising and I'm a research assistant for Suzy Renn and Anna Ritz. I've always been a bit all over the place with my research interests, and because of that I've always been drawn to interdisciplinary research (either that or I'm just indecisive). I graduated from Reed this past May with a degree in biology; my research interests include behavioral neuroscience, computational biology, and bioinformatics.

I spent my time in the Renn lab studying the African cichlid fish Astatotilapia burtoni. These fish are native to Lake Tanganyika, one of the African great lakes on the eastern side of the continent. They engage in complex social dynamics and because of that, people who study A. burtoni have historically studied the males in the context of dominance hierarchies and aggression. However, the Renn lab focuses their research on female behaviors, and in the case of my research, maternal care behavior. A. burtoni mothers engage in mouthbrooding, which is a ~2 week period where a mother will hold her developing offspring in her mouth. During this time, she will churn them around to circulate fresh air while simultaneously protecting them from predators. The really interesting part of this behavior though is that during mouthbrooding, a mother will forgo eating throughout the brooding period. This implies that somehow, a mother is making a trade-off where she puts the energetic needs of her offspring before her own.

Image taken from Suzy’s website

The way signals from the brain can influence behavior and vice versa is fascinating to me, so I decided to do my senior thesis on the neural signals of mouthbrooding. As I was deciding on a project though, I became interested in bioinformatics and computational biology and wanted to tie each of these fields together. In Anna’s computational systems biology class, we learned about protein-protein interaction networks (i.e. interactomes) and how powerful they can be for investigating a variety of biological questions such as predicting disease genes. We also learned about how you could weight interactomes with external data to alter properties of the network (depending on what you’re doing). So in one sentence, my thesis (and current research) is: use a protein-protein interactome to figure out what’s going on in the brains of mouthbrooding fish. 


Since there is currently no interactome for A. burtoni, I made my own from an existing zebrafish interactome. To do this, I took advantage of two publicly available resources: orthologous relationships [1] and a zebrafish interactome [2]. Orthologous genes are related genes in different species that originated from a common ancestor. Using the program OrthoFinder [3], I generated orthologs between zebrafish and A. burtoni. I then created new edges between the corresponding orthologs of the two endpoints of each zebrafish edge. Finally, I weighted the network with gene expression data from the brain, such that redder nodes have higher expression in brooding fish and bluer nodes have lower expression. The following subnetwork focuses on neurotensin, a differentially upregulated gene in brooding fish.

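The projection step itself is simple; here is a rough NetworkX sketch (the function and variable names are my own, and the real pipeline is more careful about gene identifiers and edge weights):

```python
import networkx as nx

def project_interactome(zebrafish_net, orthologs):
    """Project a zebrafish interactome onto A. burtoni via orthologs.

    orthologs maps each zebrafish gene to a list of its A. burtoni
    orthologs (one-to-many relationships are allowed).
    """
    burtoni = nx.Graph()
    for u, v, data in zebrafish_net.edges(data=True):
        for bu in orthologs.get(u, []):
            for bv in orthologs.get(v, []):
                # Every ortholog pair inherits the zebrafish edge.
                burtoni.add_edge(bu, bv, **data)
    return burtoni
```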
Neurotensin is involved with both feeding and parental care, and seeing it connected to other neuropeptides like prodynorphin (pdyn) tells me that I mapped everything correctly. I'm going to keep refining my network in the coming weeks and try a different method of constructing it, but I'm excited to finally have a usable interactome. Future goals include looking into NetworkX algorithms for finding clusters of differentially expressed genes.

References:
[1] Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., & Kasprzyk, A. (2009). BioMart–biological queries made easy. BMC genomics, 10(1), 22.
[2] Ogris, C., Guala, D., Kaduk, M., & Sonnhammer, E. L. (2018). FunCoup 4: new species, data, and visualization. Nucleic acids research, 46(D1), D601-D607.
[3] Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.

Summer 2020 Research

“Stay Home, Save Lives” ad campaign by Oregon Gov. Kate Brown.

This summer will not look like any other. Yet, undergraduate research continues! At a time when experimental labs are shut down and figuring out how to reopen, computational biology can provide an opportunity for trainees to learn a new skill while at home.

This summer, there are a bunch of projects in the compbio lab, with a large focus on developing tools for public use.

CS junior Alex Richter will be implementing our recent hypergraph connectivity algorithm for use with the ReactomeFIViz app, which analyzes and visualizes signaling pathways from Reactome.

CS senior Aryeh Stahl will be exploring how to best weight protein-protein interaction networks (such as the HIPPIE interactome among others) and how to integrate high-throughput experimental data in these networks.

Biology junior Frank Zhuang will be picking up Tayla’s project from last summer, working with me and Kara Cerveny to computationally identify retinoic acid response elements (RAREs) in zebrafish.

CS senior Jiarong Li will be returning to the lab, but unlike last summer she will be working alongside Reed’s Computing & Information Services to explore feasible cloud compute platforms for college research across disciplines.

CS sophomore Larry Zeng will be developing a tutorial about network algorithms for molecular systems biology. The tutorial, aimed at biologists, will be designed for researchers to learn more about networks visualized by GraphSpace, an interactive graph-sharing platform.

Finally, three post-baccalaureate researchers will be working on computational biology research.

  • Tobias Rubel Janssen (Fall ’19) develops new algorithms for signaling pathway reconstruction from protein-protein interaction networks.
  • Gabe Preising (Spring ’20) will extend his undergraduate thesis project with Suzy Renn to identify differentially expressed groups of genes related to mouthbrooding in cichlid fish.
  • Maham Zia (Spring ’20) will develop new measures for quantifying microscopy images of cells according to specific phenotypes that are studied in Derek Applewhite’s lab.

This will make for a full and fun summer! Looking forward to establishing a summer research group, even if it’s virtual.

New compbio logo, courtesy of the pandemic.