The Pathway Not Taken – Page 2 – Computational Biology Research at Reed College

Automate Image analysis using Matlab

Hey everyone! I am Maham Zia and this summer I worked as a post-bacc for Anna Ritz (primary advisor) and Derek Applewhite (co-advisor). I graduated from Reed in May 2020 with a degree in Physics, but during my undergraduate career I worked in a cell biology and cellular biophysics lab. I am interested in how physics interacts with other disciplines such as biology and chemistry. I believe that interdisciplinary research transcends boundaries and aids the scientific community to think about a variety of problems in creative ways.
This summer I put my coding skills to use by working on a computational project that involved analyzing images of cells. For her senior thesis, Madelyn O’ Kelley-Bangsberg ’19 used the punctate/diffuse index to measure the distribution of phosphorylated NMII Sqh. Punctate/diffuse index is widely used to measure the spread of fluorescently tagged cytochrome c in cells during apoptosis. Measuring this index involves determining standard deviation of the average brightness of the pixels using time lapse microscopy [1]. O’Kelley-Bangsberg’19 used the same idea to determine whether the cell has punctate phosphorylated myosin (high pixel intensity standard deviation) or diffuse phosphorylated myosin (low pixel intensity standard deviation). She did this in ImageJ by outlining each cell, obtaining an X-Y plot of the intensity and finally calculating the relative standard deviation ((standard deviation/mean pixel intensity) x 100 ) by setting the highest intensity value to 1 and lowest intensity value to 0. She used relative standard deviation instead of the absolute standard deviation to draw comparisons because she observed that mean pixel intensity values changed drastically between treatments [1]. Under Anna’s guidance I worked on automating this process in MATLAB by making use of image processing techniques which made the entire process a lot more time efficient.

I was working with images that each had two channels:
(1) An actin channel that shows the distribution of actin- a double helical polymer that aids in cell locomotion and gives the cell its shape. Since actin is distributed throughout the cell, staining it allows us to see the entire cell [2].
(2) A channel showing the distribution of phosphorylated NMII Sqh.

The general idea was to use the actin image to detect/outline the cells and somehow use that information to plot the cell outline on the second image which has the distribution of the phosphorylated NMII Sqh. Plotting them would allow me to obtain the intensity values of the pixels within each cell and determine the standard deviation of the pixel intensity values for each cell.

Initially, my code read in the image with the actin staining and used the inbuilt imfindcircles function in MATLAB to detect cells in the actin image. One of the input arguments of the function is radiusRange which can be used to detect circles with radii within a certain range. However, the function doesn’t work well for ranges bigger than approximately 50 pixels which means it doesn’t work well for images with both small and large cells. Moreover, the function has an internal sensitivity threshold used for detecting the cells. Sensitivity (number between 0 and 1) can be modified by putting it in as an optional argument when calling the function, but increasing it too much can lead to false detections [3]. Using imfindcircles function is not a robust way to detect cells and therefore I decided to switch to the drawfreehand function in MATLAB 2020 that allows the user to interactively create a region of interest (ROI) object.

After creating the ROI object (in simple terms that means outlining the cell) as shown in the figure above, I created a binary mask and used regionprops to get the centroid and the equivalent diameter of the object.
The code then read in the second image as shown below with the distribution of the phosphorylated NMII Sqh and used outputs from regionprops and the impixel function to get the pixel intensity values for each cell. This image mostly looks dark because it is only showing the distribution of phosphorylated NMII Sqh which is the small cluster of bright pixels we see.

Impixel function takes the column and row indices of the pixels to be sampled and gives their pixel intensity values. However, lengths of column and row vectors need to be the same and since I was approximating the cells as circles, the only way to extract the intensity values of the pixels was to think of a circle as enclosed in a square. So, I used the center coordinates and radii of the circles to get coordinates of the top left corner of the square and used the linspace function to get equally spaced vectors for column and row. Finally, std2 and insertText functions were used for calculating the standard deviation values and displaying them on the image showing the actin distribution respectively.

Future goals could involve analyzing a number of images to determine reasonable standard deviation values and finding a way to extract pixel intensity values for pixels only within the ROI object instead of a square enclosing the object to calculate standard deviation values.

Now that I am done with this project, for this upcoming year I will be working as a research assistant in a lab part of the department of Genetics, Cell Biology, and Development at the University of Minnesota.

References:

[1] M. O’Kelley-Bangsberg, Reed undergraduate thesis (2019) [2] J. Wilson and T. Hunt, Molecular Biology of the Cell: The Problems Book (Gar- land Science, New York, NY, 2008), 5th ed., ISBN 978-0-8153-4110-9, oCLC: 254255562.
[3] MathWorks, “Detect and Measure Circular Objects in an Image”, https://www.mathworks.com/help/images/detect-and-measure-circular-objects-in-an-image.html

Cloud Computing Platforms Benchmarking

Hi everyone! My name is Jiarong Li and I’m a senior majoring in math and computer science at Reed. During this summer, in order to offer reference information for computational researchers to choose a suitable cloud computing platform, I worked with Anna and Trina to compare mainly four cloud computing platforms: Google Cloud, Amazon Web Service (AWS), Extreme Science and Engineering Discovery Environment (XSEDE), and the Institute for Bioinformatics and Evolutionary Studies from UIdaho (iBEST CRC).

By running three cloud computing projects related to different fields, I analyzed the real application of these platforms in terms of the execution time, computing capacity, storage, customizability, compatibility, and convenience. I designed a scoring rubric and tabulated and summarized the testing results with various plots.

Projects and Testing Results

1. Computational Biology: Augmenting signaling pathway reconstructions (PRAUG) [1]

By running PRAUG to reconstruct signaling pathways and recoding the runtime, I focused on testing the computing ability of the same type of virtual machines from different platforms.

Testing Result

The trend shown in the graph above tells us that virtual machines with more vCPUs take less execution time. Since PRAUG code was originally set to run on a 4-core machine, without setting up multithread, the trend of the decreasing execution time cannot be seen when the number of vCPUs getting above 6.

In the graphs above, I mainly compared the performance of virtual machines from Google Cloud and AWS since they have comparable customizability. With the same settings of virtual machines, we can see that AWS outperforms Google Cloud.

XSEDE and iBEST have less customizability and option of virtual machines. I chose several log-in nodes to run the program and their testing results are:
14 CPUs & 128 GB: 13m33.907s (XSEDE)
16 CPUs & 3 TB: 23m5.172s (XSEDE)
40 CPUs & 192 GB: 31m28.872s (iBEST)

2. Large Image Files Subsampling and Transferring: Tabular data files for 3D still images

By running multithreaded R code to randomly generate a specified number of 3D image subsamples with a certain number of pixels, this project mainly focuses on testing virtual machines with multiple vCPUs as well as the capability and convenience of data transferring and storing of different platforms.

Testing Result

The graph above shows the difference of execution time when we input the same variables and run the single-threaded code and multi-threaded code respectively on a virtual machine with 8 cores and 64GB memory.

In this project, by running multithreaded code, we can clearly see the influence of the number of vCPUs to the execution time.

When I set the number of subsamples to 20 and pixels/sample to 100, the runtime of this R code on VMs from Google Cloud and AWS are shown above. Still, AWS outperforms Google Cloud in some extent. Moreover, I tested more resources provided by XSEDE and iBEST in this project and Jetstream from XSEDE has really good performance.

In addition to multithreaded computing, I also compared the storage and data transferring of these four platforms. I personally like the s3 bucket from AWS the best since it is very convenient to access the files stored in s3 through other platforms and the security settings of private files have great customizability. Data transferring can be easily achieved by Globus among these four platforms and other applications.

3. Machine Learning: Running a machine learning experiment in a specific Docker Image

This project mainly focuses on testing GPU virtual machines and their compatibility with other applications.

Testing Result

The graph above shows the comparison of runtime when training the ML model on VMs with NVIDIA Tesla T4 GPU. We can see that VMs from AWS run faster than those from Google Cloud.

I also tried to adopt VMs with different types of GPUs and got results shown in the graph above. For more information about NVIDIA Tesla.
I didn’t run this project on GPU nodes from XSEDE or iBEST since I have no permission to set up docker on these VMs. Generally, for all four platforms, users need to request access to GPU VMs and will get either permission or rejection depending on specific situation.

Summary

Based on all the testing results and the user experience I got during this process, I evaluated the four platforms according to the scoring rubric designed at the beginning.

Thanks to Anna, Tobias, Kelly, and Mark for providing cloud computing projects tested in this research.

References

[1] Rubel, Tobias, and Anna Ritz. “Augmenting Signaling Pathway Reconstructions.” BioRxiv. January 01, 2020. Accessed August 29, 2020. https://www.biorxiv.org/content/10.1101/2020.06.16.155853v1.full.

Summer 2020 Research Recap

We have passed the 90th consecutive night of Portland protests for Black Lives Matter. That’s three full months. That’s a whole summer.

There have been over 25,000 COVID-19 cases and 438 deaths in Oregon. The number of infected individuals total the entire population of Forest Grove, Happy Valley, or Wilsonville.

And yet, somehow, the students and staff who did research with me did phenomenally. Let me be clear: our success as a research group was driven by the collective effort of everyone involved.

Integrating Cancer Data into Interactomes

Hi all! My name is Aryeh Stahl and I’m a math and computer science dual major at Reed. I’ll be a senior this fall and will be starting my thesis. I had a great time working with Anna this summer!

An interactome is a weighted graph with nodes representing proteins and directed edges representing interactions between these proteins. Anna Ritz et al. (2015) previously developed Pathlinker, an algorithm that finds k-shortest paths from a set of source nodes to a set of target nodes. It aims to “to reconstruct the interactions in a signaling pathway of interest [2].”

In my project, I aimed to integrate cancer data into the previously generated Pathlinker interactome to find significant pathways and/or proteins involved in a given cancer. I obtained driver p-values for most proteins in the interactome. These p-values “indicate [genes’] statistical significance as drivers, according to diverse methods that account for positive selection, functional impact of mutations [and] regional mutation rates [1].”

Once I had mapped these p-values to the corresponding proteins in the Pathlinker interactome, I created new interactomes by diffusing the p-values to outgoing edges and calculating a weighted average with the old edge scores. For example, if a node had p-value of 0.5 and had 5 outgoing edges, the corresponding contribution would be 0.1 for each of the 5 outgoing edges. I then set a parameter, beta, which indicated the weight of the new values and the weight of the old values. If beta = 0, then all of the new edge weight is obtained from the p-value, while beta = 1 will give the original interactome. The following GIF shows the distribution of edge weights as beta changes (beta is 0 when most of the scores are close to 0, while beta is 1 when it is very discrete).

The input to Pathlinker is an interactome, as well as two sets of nodes: targets and sources. The output is a list of proteins and interactions, ranked “by their first occurrence in the k shortest paths from any [source] to any [target] [2].” So, I can run Pathlinker on my new interactomes and get as output a list of significant edges. Ideally, these edges would be significant in the pathway in question (such as the WNT signaling pathway), and for the cancer. Thus, we could identify potential interactions that could be disrupted and contribute to the cancer.

My next issue was to determine if my output was actually significant. I tried to do this by fabricating some data and seeing if my method recovered the edges that I expected. I created randomly weighted interactomes and chose a previously known signaling pathway as a subgraph (in this case I used WNT). I shall call this subgraph A. I then chose some scale factor to upweight the edges in that subgraph, e.g. multiplying each edge weight by a factor of 1.5. Next, I chose a random subset of the nodes of subgraph A (of a specified size, say 50% of the nodes) to serve as an intersection with another subgraph, which I shall call subgraph B. B corresponds to nodes with a high p-value in the cancer data. I then added some more nodes that are connected to this intersection and are outside A (Thus making subgraph B have the same number of nodes as subgraph A). I finally gave each of the nodes in subgraph B a high p-value and created a new interactome.

Running Pathlinker on this contrived data with the WNT sources and targets, I would expect a high percentage of the resultant significant edges would be in the intersection between subgraph A and subgraph B. I further would expect to have a larger number of the significant edges in the intersection as I upweight subgraph by a larger factor. This is exactly what I got. The following plot shows the results from several trials of Pathlinker on new interactomes with a various upweight factor, finding the k = 100 shortest paths.

References

[1] Reyna, M.A., Haan, D., Paczkowska, M. et al. Pathway and network analysis of more than 2500
whole cancer genomes. Nat Commun 11, 729 (2020). https://doi.org/10.1038/s41467-020-14367-0

[2] Ritz, A., Poirel, C., Tegge, A. et al. Pathways on demand: automated reconstruction of human
signaling networks. npj Syst Biol Appl 2, 16002 (2016). https://doi.org/10.1038/npjsba.2016.2

Hypergraph Algorithms

Hey everyone! I am Alex Richter. I am a rising junior in math and computer science working with Anna Ritz this summer on hypergraph algorithms. The algorithms that I implemented this summer were from a paper co-authored by Nick Franzese, Anna Ritz, and Adam Groce [1]. The implementation of the algorithms already existed in Python; however, I learned Java and implemented them in Java to be incorporated into a plugin written in Java.

We are working on incorporating the algorithm into the application Cytoscape via a plugin called ReactomeFIViz. In doing so, we hope that scientists will begin to examine hypergraph topology when checking to see if two molecules are connected within a cell. In viewing a different topology as opposed to simply directed or bipartite graphs, different pathways can be discovered as well as some nodes in a graph topology might be connected within a cell but not in a hypergraph topology. The potential for hypergraph connectivity in better understanding connectivity within a cell is a field that deserves to be further explored.

The algorithm that I implemented relaxes the definition of B-connectivity and determines if two nodes are connected in a hypergraph topology. Here are some definitions from the paper:

Directed hypergraphs represent reactions with many-to-many relationships, where each hyperedge e = (T_e, H_e) has a set of entities in the tail T_eand a set of entities in the head H_e (here, the tail denotes the hyperedge inputs and the head denotes the hyperedge outputs). [1]
B-connectivity: requires all the nodes in the tail of a hyperedge to be visited before it can be traversed. This definition has a natural biological meaning in reaction networks: B-connectivity requires that all reactants of a reaction must be present in order for any product of that reaction is reachable. Unlike the compound graph rules, B-connectivity describes the strictest version of connectivity, and it is the only rule used to traverse hyperedges. [1]
- Given a directed hypergraph and a source set S ⊆ V, a node u ∈ V is B-connected to S if either (a) u ∈ S or (b) there exists a hyperedge e = (T_e, H_e) where u ∈ H_e and each element in T_e is B-connected to S. We use to denote the set of nodes that are B-connected to S in . [1]

To be B-connected implies that in order to traverse a hyperedge all of the nodes in the tail set (input set) must be present. B-connectivity can be a little bit difficult to understand at first glance. It is a harsh restriction on hypergraph connectivity. Two nodes are B-connected if one of two things is true: either both nodes are in the source set so obviously they are B-connected or the node we are seeing if is B-connected is located in the head (output) set of a hyperedge that where each element in the tail (input) set is connected to original node.

b_visit Algorithm

The algorithm that I implemented is two parts. The first algorithm is called b_visit and takes in a hypergraph and a set of source nodes, s. The b_visit algorithm will then return three sets: the set of traversed hyperedges from s, the set of B-connected nodes to s, and the set of restricted hyperedges. Restricted hyperedges are edges that can be reached but not traversed because it is missing an input in the tail to make the hyperedge traversable. For example, given a source set of A, B for this hypergraph, the set of restrictive hyperedges would be:

Tail Set	Head Set
G, D	L
D, H	M
I, N	O

Set of Restrictive Hyperedges

The set of traversable hyperedges would be:

Tail Set	Head Set
A	C
A, B	D
B	E
B	F
E, F	I

Set of Traversable Hyperedges

The set of B-Connected nodes to A, B would then be: C, D, E, F, I.

b_relaxation Algorithm

The b_visit algorithm helps when trying to find out which nodes are B-connected to the source set, or the set of nodes the algorithm is called on. This is helpful to find those immediate nodes; however, since the requirements for B-connectivity are so strict, this does not include all of the nodes that can be reached from the source set in the hypergraph. Through calling b_visit multiple times and replacing the requirements for B-connectivity, more nodes can be viewed as connected in the hypergraph.

The second algorithm from the paper is called b_relaxation which makes calls to the b_visit algorithm. The input for b_relaxation is once again the hypergraph and a set of source nodes, s, on which the algorithm will be called. Through multiple calls to b_visit, b_relaxation relaxes the strict conditions of B-connectivity. The algorithm runs on the entire hypergraph to determine which hypernodes are B-connected to the source set. The return value of the algorithm is a dictionary of k-distances where the key is the node and the value is the distance. K-distance is easiest understood as the number of times the algorithm was run until it could reach the node in question. So if it was in the source set for instance, the k-distance for the node would be 0. In relaxing the harsh conditions of B-connectivity, the algorithm is essentially traversing restrictive hyperedges to find all of the nodes that are connected in the hypergraph.

Using the same toy example, at the end of b_relaxation, it would return a dictionary of distances.

Key	Value
A	0 (source set)
B	0 (source set)
C	0
D	0
E	0
F	0
G	INFINITY
H	INFINITY
I	0
J	INFINITY
K	INFINITY
L	1
M	1
N	INFINITY
O	1
P	2
Q	2
R	3

Dictionary of distances for toy hypergraph after running b_relaxation algorithm

Anna and I are working with our collaborators in incorporating this into ReactomeFIViz before adding it to Cytoscape. The next steps include cleaning up the code, making sure it works properly in the plugin, and considering implementing further hypergraph algorithms.

References:

[1] Franzese N, Groce A, Murali TM, Ritz A (2019) Hypergraph-based connectivity measures for signaling pathway topologies. PLOS Computational Biology 15(10): e1007384. https://doi.org/10.1371/journal.pcbi.1007384

Nets Full of Fish Brains

Hey folks, my name is Gabe Preising and I’m a research assistant for Suzy Renn and Anna Ritz. I’ve always been a bit all over the place with my research interests, and because of that I’ve always been drawn to interdisciplinary research (either that or I’m just indecisive). I graduated from Reed this past May with a degree in Biology, my research interests including behavioral neuroscience, computational biology, and bioinformatics.

I spent my time in the Renn lab studying the African cichlid fish Astatotilapia burtoni. These fish are native to Lake Tanganyika, one of the African great lakes on the eastern side of the continent. They engage in complex social dynamics and because of that, people who study A. burtoni have historically studied the males in the context of dominance hierarchies and aggression. However, the Renn lab focuses their research on female behaviors, and in the case of my research, maternal care behavior. A. burtoni mothers engage in mouthbrooding, which is a ~2 week period where a mother will hold her developing offspring in her mouth. During this time, she will churn them around to circulate fresh air while simultaneously protecting them from predators. The really interesting part of this behavior though is that during mouthbrooding, a mother will forgo eating throughout the brooding period. This implies that somehow, a mother is making a trade-off where she puts the energetic needs of her offspring before her own.

The way signals from the brain can influence behavior and vice versa is fascinating to me, so I decided to do my senior thesis on the neural signals of mouthbrooding. As I was deciding on a project though, I became interested in bioinformatics and computational biology and wanted to tie each of these fields together. In Anna’s computational systems biology class, we learned about protein-protein interaction networks (i.e. interactomes) and how powerful they can be for investigating a variety of biological questions such as predicting disease genes. We also learned about how you could weight interactomes with external data to alter properties of the network (depending on what you’re doing). So in one sentence, my thesis (and current research) is: use a protein-protein interactome to figure out what’s going on in the brains of mouthbrooding fish.

Since there is currently no interactome for A. burtoni, I made my own from an existing zebrafish interactome. To do this, I took advantage of two publicly available resources: orthologous relationships [1] and a zebrafish interactome [2]. Orthologous genes are related genes in different species that originated from a common ancestor. Using the program OrthoFinder [3], I generated orthologs between zebrafish and A. burtoni. I then created new edges between all of the corresponding orthologs for the pair of nodes in each zebrafish edge. Finally, I weighted the network with gene expression data from the brain such that redder nodes have a higher expression level in brooding fish and bluer nodes have a lower expression level. The following subnetwork focuses on neurotensin, a differentially upregulated gene in brooding fish.

Neurotensin is involved with both feeding and parental care, and seeing it connected to other neuropeptides like prodynorphin (pdyn) tells me that I mapped everything correctly. I’m going to keep refining my network in the coming weeks and try a different method of constructing it but I’m excited to have a usable interactome finally. Future goals include looking into NetworkX algorithms to run to find clusters of differentially expressed genes.

References:
[1] Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., & Kasprzyk, A. (2009). BioMart–biological queries made easy. BMC genomics, 10(1), 22.
[2] Ogris, C., Guala, D., Kaduk, M., & Sonnhammer, E. L. (2018). FunCoup 4: new species, data, and visualization. Nucleic acids research, 46(D1), D601-D607.
[3] Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.

Summer 2020 Research

“Stay Home, Save Lives” ad campaign by Oregon Gov. Kate Brown.

This summer will not look like any other. Yet, undergraduate research continues! At a time when experimental labs are shut down and figuring out how to reopen, computational biology can provide an opportunity for trainees to learn a new skill while at home.

This summer, there are a bunch of projects in the compbio lab, with a large focus on developing tools for public use.

CS junior Alex Richter will be implementing our recent hypergraph connectivity algorithm for use with the ReactomeFIViz app, which analyzes and visualizes signaling pathways from Reactome.

CS senior Aryeh Stahl will be exploring how to best weight protein-protein interaction networks (such as the HIPPIE interactome among others) and how to integrate high-throughput experimental data in these networks.

Biology junior Frank Zhuang will be picking up Tayla’s project from last summer, working with me and Kara Cerveny to computationally identify retinoic acid response elements (RAREs) in zebrafish.

CS senior Jiarong Li will be returning to the lab, but unlike last summer she will be working alongside Reed’s Computing & Information Services to explore feasible cloud compute platforms for college research across disciplines.

CS sophomore Larry Zeng will be developing a tutorial about network algorithms for molecular systems biology. The tutorial, designed for biologists, will be designed for researchers to learn more about networks visualized by GraphSpace, an interactive graph sharing platform.

Finally, three post-baccalaureate researchers will be working on computational biology research.

Tobias Rubel Janssen (Fall ’19) develops new algorithms for signaling pathway reconstruction from protein-protein interaction networks.
Gabe Preising (Spring ’20) will extend his undergraduate thesis project with Suzy Renn to identify differentially expressed groups of genes related to mouthbrooding in cichlid fish.
Maham Zia (Spring ’20) will develop new measures for quantifying microscopy images of cells according to specific phenotypes that are studied in Derek Applewhite’s lab.

This will make for a full and fun summer! Looking forward to establishing a summer research group, even if it’s virtual.

New compbio logo, courtesy of the pandemic.

More about DAGs

Hi, I am Tunc. If you have read this recent post by Jiarong Li, you might remember that we work on signaling pathway reconstruction. This is a better way of saying that we try to determine the many different orders of protein-protein interactions within a cell, from a receptor to a transcription factor, that make up a pathway. In this post, I want to talk about why and how we have combined PathLinker and our work on Directed Acyclic Graphs (DAGs).

Numbers represent proteins and directed edges represent interactions. PathLinker can find 1->3->4->5 and 2->3->4->5 as two different paths, but only 4 of the 6 edges are unique. — PathLinker can find 1->3->4->5 and 2->3->4->5 as two different paths, but only 4 of the 6 edges are unique. This gets only worse on the larger scale real data.

PathLinker can find k-many shortest paths from a receptor to a transcription factor. In our graphs, the length of an edge represents the cost of a protein-protein interaction. We assume that evolutionarily, if an interaction is important for the pathway, it will be optimized to happen easily or have a lower cost, meaning that it will be “shorter”. Hence, one could imagine that the chains of interactions or the paths, that are important for the pathways will be shorter, allowing us to frame the reconstruction problem as a shortest-paths problem. PathLinker can find k-many shortest paths very quickly, where k is user-defined, which is nice. However, as k increases, a lot of these paths start to reuse the same edges over and over again, resulting in a decreased amount of information added.

With DAGs, we start with some ground-truth network, find a collection of shortest paths that are not in our network, evaluate them according to some cost function we can customize, pick the path that the cost function thinks is the best, add it to our network and repeat this process many times. This process takes a lot more time than PathLinker and because it has so many parts, changing one thing like the cost function or the ground-truth network can affect the result greatly. This also means that for our program to work properly, we need to be very careful that every part works as we intend it to AND how we intend it to work makes sense biologically.

Even though our long term goal might be to get all of those parts working, one quick idea we could try was to impose some of these criteria to the paths PathLinker quickly generates. To do that, we first ran PathLinker to get 50,000 paths, which seemed like a big enough number of paths, and then we wrote a script to iterate over these paths and choose the ‘acceptable’ ones. Here, the definition of acceptable is customizable, and we have tested it with several definitions to observe how our results change.

On the EGFR1 pathway, having a stricter definition of acceptable increased recall with precision being roughly the same for networks created by 1000 acceptable paths.

In the figure above, x denotes the minimum number of new edges a path must add for it to be acceptable. Increasing x resulted in an increase of recall without a significant effect on precision. This result wasn’t very surprising because repeated use of false edges does not decrease precision values, each edge is only counted once. Recall increasing also wasn’t very surprising as all we did was to require the addition of new edges, some of which had to be true protein-protein interactions. The real surprising observation we made was when we used this method on different pathways. Epidermal Growth Factor Receptor (EGFR) is a very big pathway, at least in the NetPath database, the other pathways’ data we had were significantly smaller. Consequently, when we used the same method on other pathways, with x = 3 or sometimes even 2, we could never find 1000 acceptable paths. To reiterate, out of 50,000 shortest paths we had originally, there weren’t even 1000 of them that continually added enough new edges to our network. This meant that the reuse of edges in PathLinker was on a very big scale, underlining the potential for our new method.

Reconstructing Signaling Pathways by DAGs

My name is Jiarong Li, and I’m a rising junior majoring in Math-CS. In this summer, I’m working on a project related to computational biology with Tunc Kose, Ibrahim Youssef, and Anna Ritz. In the study of protein-protein interactions, we take each protein as a node and the interaction between two nodes as an edge. By building a graph of paths starting from receptors and ending at transcriptional factors, we aim to reconstruct better signaling pathways.

The former method we used to build pathways is PathLinker, which uses k-shortest path (KSP) to get k shortest paths from receptors to transcriptional factors in the collection of protein-protein interactome within a cell. This method, however, reuses a lot of edges and nodes in the process of building the graph. Thus, after a certain point, no new information will be added to the result. In order to solve this problem and get a better reconstruction, we adopt directed acyclic graph (DAG) to grow signaling pathways. DAG is a graph without cycles and all nodes in DAG can be topologically sorted. The algorithm starts from a ‘ground-truth’ graph G_0, which is the first path with more than 2 nodes generated from PathLinker, to grow DAG by introducing a new path that minimizes the total weights of all paths in the graph at every iteration. Here is an example about how the algorithm works.

By using the algorithm described above, we get reconstructions of Wnt pathways generated by DAG below. And a graph got from PathLinker for comparison.

*LocPL Wnt Reconstruction with Compartments*
http://graphspace.org/graphs/26786?user_layout=6719

As we can see, in the graph generated by DAG, all of the directed edges do not have reverse arrows directions. The triangles are receptors and squares are transcriptional factors. Yellow nodes represent nodes that are in both Wnt pathway and the interactome and blue nodes represent nodes that are only in the interactome.

With these results, we use Precision and Recall to test the performance of DAG. This method shows us the fraction of relevant instances among the retrieved instances and the fraction of relevant instances that have been retrieved over the total amount of relevant instances. Here is graphs of precision and recall for nodes and edges in the reconstruction of RANKL pathway.

We expect curves to approach to the upper right corner with high precision and recall. However, we cannot really show the obvious advantage of DAG through the this method, which means that we need to explore more ways for testing.

For future researching direction, we are looking for better G_0 options other than the result got from PathLinker. Also, for now, we’re trying adopting different algorithms to grow DAG. Instead of minimizing the total weights of paths in the graph, we’ll minimize the total weights of the graph and compare the result got from different algorithms. Moreover, we realize that there’re certain limits of precision and recall, so we’re looking for a more suitable way to test our current results to show the benefits of the DAG.

Sources:

Ritz, Anna, Christopher L. Poirel, Allison N. Tegge, Nicholas Sharp, Kelsey Simmons, Allison Powell, Shiv D. Kale, and Tm Murali. “Pathways on Demand: Automated Reconstruction of Human Signaling Networks.” Npj Systems Biology and Applications2, no. 1 (2016). doi:10.1038/npjsba.2016.2.

Youssef, Ibrahim, Jeffrey Law, and Anna Ritz. “Integrating Protein Localization with Automated Signaling Pathway Reconstruction.” 2019. doi:10.1101/609149.

Retinoic Acid, Development, and Motif Finding

My name is Tayla, and I’m a rising junior Biology major working on a research project co-advised by Anna Ritz and Kara Cerveny this summer. Overall, my project is trying to understand a vitamin A-dependent biological signaling pathway that is part of the process of stem cells differentiating into neurons.

We’re interested in this process because previous studies have shown that vitamin A is essential to proper embryonic eye development because it alters gene expression at the transcription level via specialized receptor proteins. Understanding this developmental process will provide insight into the complex differentiation process and identifying the involved genes in silico may open avenues of inquiry for in vivo studies. We hope to search for the genes that are affected by this pathway through sequence analysis and analyze how those genes might fit into this regulatory network.

But first, some background: the pathway that I’m looking at is the retinoic acid pathway which is especially significant in the retina of developing zebrafish embryos. Retinoic acid is an active metabolite of vitamin A that allows proteins to bind to DNA and alter the transcription of certain genes (Figure 1). These proteins, called retinoic acid receptors, alter in the presence of retinoic acid to bind to very specific DNA sequences called retinoic acid response elements (RAREs).

Figure 1. Heterodimerization occurs upon nuclear retinoic acid receptors (RARs) and retinoid X receptors (RXRs) recognizing a tandemly repeated hexad motif called a retinoic acid response element (RARE) usually upstream of the direct target gene. In the presence of retinoic acid (all *trans* and 9-*cis*), the complex becomes active. This complex can either encourage transcription of its target gene by cleaving the co-repressors or inhibit transcription through repressor factor recruitment.

One of my goals for this project is to find zebrafish genes that are responsive to retinoic acid influxes. To do this, I have to scan through parts of the genome and look for a tandem repeat of a six base pair motif. Retinoic acid receptors bind to these RAREs within the sequence upstream of the affected gene. I can build a program that takes these upstream regions of zebrafish genes, finds this repeated motif, and tells me all the genes that were found.

Figure 2. The RARE motif is composed of 6 base pairs of conserved sequence followed by a space of 1-5 base pairs and then a repeat of the same motif. The spacing in between motifs is used to classify it; for example the motif in the figure is a direct repeat spaced 5 base pairs apart and would be called a DR5.

While it sounds pretty simple, there are actually a lot of moving parts. First I have to read in a big file of sequence and identifying information, and preferably do it quickly. Then I have to find a six base pair motif repeated 1-5 base pairs downstream and score it according to what’s allowed by its documented variation (Figure 2). Finally I have to return the gene IDs of genes containing the repeat. All of this is run on 65,171 annotated zebrafish transcripts’ upstream regions.

Luckily, at this point in my project (about 6 weeks in), I’ve written a program that will do this in about half an hour. Now comes the interpretation: finding out where and at what stage the genes I identified with my program are expressed in zebrafish. Hopefully we’ll find some genes that we expect to be regulated by retinoic acid in the final set of candidates to validate our method. The most exciting prospect is perhaps finding novel genes regulated by this pathway, or better yet a confirmation that the genes we’re testing in the lab as direct targets of retinoic acid exhibit the canonical response site.

Sources:

Al Tanoury Z, Piskunov A, Rochette-Egly C. Vitamin A and retinoid signaling: genomic and nongenomic effects. J Lipid Res. 2013;54(7):1761-1775. doi:10.1194/jlr.R030833

Cunningham TJ, Duester G. Mechanisms of retinoic acid signalling and its roles in organ and limb development. Nat Rev Mol Cell Biol. 2015;16(2):110-123. doi:10.1038/nrm3932

Lalevée S, Anno YN, Chatagnon A, et al. Genome-wide in Silico Identification of New Conserved and Functional Retinoic Acid Receptor Response Elements (Direct Repeats Separated by 5 bp). J Biol Chem. 2011;286(38):33322-33334. doi:10.1074/jbc.M111.263681

Predki PF, Zamble D, Sarkar B, Giguère V. Ordered binding of retinoic acid and retinoid-X receptors to asymmetric response elements involves determinants adjacent to the DNA-binding domain. Mol Endocrinol. 1994;8(1):31-39. doi:10.1210/mend.8.1.8152429