Hi all! My name is Aryeh Stahl and I’m a math and computer science dual major at Reed. I’ll be a senior this fall and will be starting my thesis. I had a great time working with Anna this summer!
An interactome is a weighted graph with nodes representing proteins and directed edges representing interactions between these proteins. Anna Ritz et al. (2015) previously developed Pathlinker, an algorithm that finds k-shortest paths from a set of source nodes to a set of target nodes. It aims to “to reconstruct the interactions in a signaling pathway of interest .”
In my project, I aimed to integrate cancer data into the previously generated Pathlinker interactome to find significant pathways and/or proteins involved in a given cancer. I obtained driver p-values for most proteins in the interactome. These p-values “indicate [genes’] statistical significance as drivers, according to diverse methods that account for positive selection, functional impact of mutations [and] regional mutation rates .”
Once I had mapped these p-values to the corresponding proteins in the Pathlinker interactome, I created new interactomes by diffusing the p-values to outgoing edges and calculating a weighted average with the old edge scores. For example, if a node had p-value of 0.5 and had 5 outgoing edges, the corresponding contribution would be 0.1 for each of the 5 outgoing edges. I then set a parameter, beta, which indicated the weight of the new values and the weight of the old values. If beta = 0, then all of the new edge weight is obtained from the p-value, while beta = 1 will give the original interactome. The following GIF shows the distribution of edge weights as beta changes (beta is 0 when most of the scores are close to 0, while beta is 1 when it is very discrete).
The input to Pathlinker is an interactome, as well as two sets of nodes: targets and sources. The output is a list of proteins and interactions, ranked “by their first occurrence in the k shortest paths from any [source] to any [target] .” So, I can run Pathlinker on my new interactomes and get as output a list of significant edges. Ideally, these edges would be significant in the pathway in question (such as the WNT signaling pathway), and for the cancer. Thus, we could identify potential interactions that could be disrupted and contribute to the cancer.
My next issue was to determine if my output was actually significant. I tried to do this by fabricating some data and seeing if my method recovered the edges that I expected. I created randomly weighted interactomes and chose a previously known signaling pathway as a subgraph (in this case I used WNT). I shall call this subgraph A. I then chose some scale factor to upweight the edges in that subgraph, e.g. multiplying each edge weight by a factor of 1.5. Next, I chose a random subset of the nodes of subgraph A (of a specified size, say 50% of the nodes) to serve as an intersection with another subgraph, which I shall call subgraph B. B corresponds to nodes with a high p-value in the cancer data. I then added some more nodes that are connected to this intersection and are outside A (Thus making subgraph B have the same number of nodes as subgraph A). I finally gave each of the nodes in subgraph B a high p-value and created a new interactome.
Running Pathlinker on this contrived data with the WNT sources and targets, I would expect a high percentage of the resultant significant edges would be in the intersection between subgraph A and subgraph B. I further would expect to have a larger number of the significant edges in the intersection as I upweight subgraph by a larger factor. This is exactly what I got. The following plot shows the results from several trials of Pathlinker on new interactomes with a various upweight factor, finding the k = 100 shortest paths.
 Reyna, M.A., Haan, D., Paczkowska, M. et al. Pathway and network analysis of more than 2500 whole cancer genomes. Nat Commun 11, 729 (2020). https://doi.org/10.1038/s41467-020-14367-0
 Ritz, A., Poirel, C., Tegge, A. et al. Pathways on demand: automated reconstruction of human signaling networks. npj Syst Biol Appl 2, 16002 (2016). https://doi.org/10.1038/npjsba.2016.2
Hey everyone! I am Alex Richter. I am a rising junior in math and computer science working with Anna Ritz this summer on hypergraph algorithms. The algorithms that I implemented this summer were from a paper co-authored by Nick Franzese, Anna Ritz, and Adam Groce . The implementation of the algorithms already existed in Python; however, I learned Java and implemented them in Java to be incorporated into a plugin written in Java.
We are working on incorporating the algorithm into the application Cytoscape via a plugin called ReactomeFIViz. In doing so, we hope that scientists will begin to examine hypergraph topology when checking to see if two molecules are connected within a cell. In viewing a different topology as opposed to simply directed or bipartite graphs, different pathways can be discovered as well as some nodes in a graph topology might be connected within a cell but not in a hypergraph topology. The potential for hypergraph connectivity in better understanding connectivity within a cell is a field that deserves to be further explored.
The algorithm that I implemented relaxes the definition of B-connectivity and determines if two nodes are connected in a hypergraph topology. Here are some definitions from the paper:
Directed hypergraphs represent reactions with many-to-many relationships, where each hyperedge e = (Te, He) has a set of entities in the tail Teand a set of entities in the head He (here, the tail denotes the hyperedge inputs and the head denotes the hyperedge outputs). 
B-connectivity: requires all the nodes in the tail of a hyperedge to be visited before it can be traversed. This definition has a natural biological meaning in reaction networks: B-connectivity requires that all reactants of a reaction must be present in order for any product of that reaction is reachable. Unlike the compound graph rules, B-connectivity describes the strictest version of connectivity, and it is the only rule used to traverse hyperedges. 
Given a directed hypergraph and a source set S ⊆ V, a node u ∈ V is B-connected to S if either (a) u ∈ S or (b) there exists a hyperedge e = (Te, He) where u ∈ He and each element in Te is B-connected to S. We use to denote the set of nodes that are B-connected to S in . 
To be B-connected implies that in order to traverse a hyperedge all of the nodes in the tail set (input set) must be present. B-connectivity can be a little bit difficult to understand at first glance. It is a harsh restriction on hypergraph connectivity. Two nodes are B-connected if one of two things is true: either both nodes are in the source set so obviously they are B-connected or the node we are seeing if is B-connected is located in the head (output) set of a hyperedge that where each element in the tail (input) set is connected to original node.
The algorithm that I implemented is two parts. The first algorithm is called b_visit and takes in a hypergraph and a set of source nodes, s. The b_visit algorithm will then return three sets: the set of traversed hyperedges from s, the set of B-connected nodes to s, and the set of restricted hyperedges. Restricted hyperedges are edges that can be reached but not traversed because it is missing an input in the tail to make the hyperedge traversable. For example, given a source set of A, B for this hypergraph, the set of restrictive hyperedges would be:
Set of Restrictive Hyperedges
The set of traversable hyperedges would be:
Set of Traversable Hyperedges
The set of B-Connected nodes to A, B would then be: C, D, E, F, I.
Theb_visit algorithm helps when trying to find out which nodes are B-connected to the source set, or the set of nodes the algorithm is called on. This is helpful to find those immediate nodes; however, since the requirements for B-connectivity are so strict, this does not include all of the nodes that can be reached from the source set in the hypergraph. Through calling b_visit multiple times and replacing the requirements for B-connectivity, more nodes can be viewed as connected in the hypergraph.
The second algorithm from the paper is called b_relaxation which makes calls to the b_visit algorithm. The input for b_relaxation is once again the hypergraph and a set of source nodes, s, on which the algorithm will be called. Through multiple calls to b_visit, b_relaxation relaxes the strict conditions of B-connectivity. The algorithm runs on the entire hypergraph to determine which hypernodes are B-connected to the source set. The return value of the algorithm is a dictionary of k-distances where the key is the node and the value is the distance. K-distance is easiest understood as the number of times the algorithm was run until it could reach the node in question. So if it was in the source set for instance, the k-distance for the node would be 0. In relaxing the harsh conditions of B-connectivity, the algorithm is essentially traversing restrictive hyperedges to find all of the nodes that are connected in the hypergraph.
Using the same toy example, at the end of b_relaxation, it would return a dictionary of distances.
0 (source set)
0 (source set)
Dictionary of distances for toy hypergraph after running b_relaxation algorithm
Anna and I are working with our collaborators in incorporating this into ReactomeFIViz before adding it to Cytoscape. The next steps include cleaning up the code, making sure it works properly in the plugin, and considering implementing further hypergraph algorithms.
Hey folks, my name is Gabe Preising and I’m a research assistant for Suzy Renn and Anna Ritz. I’ve always been a bit all over the place with my research interests, and because of that I’ve always been drawn to interdisciplinary research (either that or I’m just indecisive). I graduated from Reed this past May with a degree in Biology, my research interests including behavioral neuroscience, computational biology, and bioinformatics.
I spent my time in the Renn lab studying the African cichlid fish Astatotilapia burtoni. These fish are native to Lake Tanganyika, one of the African great lakes on the eastern side of the continent. They engage in complex social dynamics and because of that, people who study A. burtoni have historically studied the males in the context of dominance hierarchies and aggression. However, the Renn lab focuses their research on female behaviors, and in the case of my research, maternal care behavior. A. burtoni mothers engage in mouthbrooding, which is a ~2 week period where a mother will hold her developing offspring in her mouth. During this time, she will churn them around to circulate fresh air while simultaneously protecting them from predators. The really interesting part of this behavior though is that during mouthbrooding, a mother will forgo eating throughout the brooding period. This implies that somehow, a mother is making a trade-off where she puts the energetic needs of her offspring before her own.
The way signals from the brain can influence behavior and vice versa is fascinating to me, so I decided to do my senior thesis on the neural signals of mouthbrooding. As I was deciding on a project though, I became interested in bioinformatics and computational biology and wanted to tie each of these fields together. In Anna’s computational systems biology class, we learned about protein-protein interaction networks (i.e. interactomes) and how powerful they can be for investigating a variety of biological questions such as predicting disease genes. We also learned about how you could weight interactomes with external data to alter properties of the network (depending on what you’re doing). So in one sentence, my thesis (and current research) is: use a protein-protein interactome to figure out what’s going on in the brains of mouthbrooding fish.
Since there is currently no interactome for A. burtoni, I made my own from an existing zebrafish interactome. To do this, I took advantage of two publicly available resources: orthologous relationships  and a zebrafish interactome . Orthologous genes are related genes in different species that originated from a common ancestor. Using the program OrthoFinder , I generated orthologs between zebrafish and A. burtoni. I then created new edges between all of the corresponding orthologs for the pair of nodes in each zebrafish edge. Finally, I weighted the network with gene expression data from the brain such that redder nodes have a higher expression level in brooding fish and bluer nodes have a lower expression level. The following subnetwork focuses on neurotensin, a differentially upregulated gene in brooding fish.
Neurotensin is involved with both feeding and parental care, and seeing it connected to other neuropeptides like prodynorphin (pdyn) tells me that I mapped everything correctly. I’m going to keep refining my network in the coming weeks and try a different method of constructing it but I’m excited to have a usable interactome finally. Future goals include looking into NetworkX algorithms to run to find clusters of differentially expressed genes.
References:  Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., & Kasprzyk, A. (2009). BioMart–biological queries made easy. BMC genomics, 10(1), 22.  Ogris, C., Guala, D., Kaduk, M., & Sonnhammer, E. L. (2018). FunCoup 4: new species, data, and visualization. Nucleic acids research, 46(D1), D601-D607.  Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.
This summer will not look like any other. Yet, undergraduate research continues! At a time when experimental labs are shut down and figuring out how to reopen, computational biology can provide an opportunity for trainees to learn a new skill while at home.
This summer, there are a bunch of projects in the compbio lab, with a large focus on developing tools for public use.
CS senior Aryeh Stahl will be exploring how to best weight protein-protein interaction networks (such as the HIPPIE interactome among others) and how to integrate high-throughput experimental data in these networks.
Biology junior Frank Zhuang will be picking up Tayla’s project from last summer, working with me and Kara Cerveny to computationally identify retinoic acid response elements (RAREs) in zebrafish.
CS senior Jiarong Li will be returning to the lab, but unlike last summer she will be working alongside Reed’s Computing & Information Services to explore feasible cloud compute platforms for college research across disciplines.
CS sophomore Larry Zeng will be developing a tutorial about network algorithms for molecular systems biology. The tutorial, designed for biologists, will be designed for researchers to learn more about networks visualized by GraphSpace, an interactive graph sharing platform.
Finally, three post-baccalaureate researchers will be working on computational biology research.
Tobias Rubel Janssen (Fall ’19) develops new algorithms for signaling pathway reconstruction from protein-protein interaction networks.
Gabe Preising (Spring ’20) will extend his undergraduate thesis project with Suzy Renn to identify differentially expressed groups of genes related to mouthbrooding in cichlid fish.
Maham Zia (Spring ’20) will develop new measures for quantifying microscopy images of cells according to specific phenotypes that are studied in Derek Applewhite’s lab.
This will make for a full and fun summer! Looking forward to establishing a summer research group, even if it’s virtual.
Hi, I am Tunc. If you have read this recent post by Jiarong Li, you might remember that we work on signaling pathway reconstruction. This is a better way of saying that we try to determine the many different orders of protein-protein interactions within a cell, from a receptor to a transcription factor, that make up a pathway. In this post, I want to talk about why and how we have combined PathLinker and our work on Directed Acyclic Graphs (DAGs).
PathLinker can find k-many shortest paths from a receptor to a transcription factor. In our graphs, the length of an edge represents the cost of a protein-protein interaction. We assume that evolutionarily, if an interaction is important for the pathway, it will be optimized to happen easily or have a lower cost, meaning that it will be “shorter”. Hence, one could imagine that the chains of interactions or the paths, that are important for the pathways will be shorter, allowing us to frame the reconstruction problem as a shortest-paths problem. PathLinker can find k-many shortest paths very quickly, where k is user-defined, which is nice. However, as k increases, a lot of these paths start to reuse the same edges over and over again, resulting in a decreased amount of information added.
With DAGs, we start with some ground-truth network, find a collection of shortest paths that are not in our network, evaluate them according to some cost function we can customize, pick the path that the cost function thinks is the best, add it to our network and repeat this process many times. This process takes a lot more time than PathLinker and because it has so many parts, changing one thing like the cost function or the ground-truth network can affect the result greatly. This also means that for our program to work properly, we need to be very careful that every part works as we intend it to AND how we intend it to work makes sense biologically.
Even though our long term goal might be to get all of those parts working, one quick idea we could try was to impose some of these criteria to the paths PathLinker quickly generates. To do that, we first ran PathLinker to get 50,000 paths, which seemed like a big enough number of paths, and then we wrote a script to iterate over these paths and choose the ‘acceptable’ ones. Here, the definition of acceptable is customizable, and we have tested it with several definitions to observe how our results change.
In the figure above, x denotes the minimum number of new edges a path must add for it to be acceptable. Increasing x resulted in an increase of recall without a significant effect on precision. This result wasn’t very surprising because repeated use of false edges does not decrease precision values, each edge is only counted once. Recall increasing also wasn’t very surprising as all we did was to require the addition of new edges, some of which had to be true protein-protein interactions. The real surprising observation we made was when we used this method on different pathways. Epidermal Growth Factor Receptor (EGFR) is a very big pathway, at least in the NetPath database, the other pathways’ data we had were significantly smaller. Consequently, when we used the same method on other pathways, with x = 3 or sometimes even 2, we could never find 1000 acceptable paths. To reiterate, out of 50,000 shortest paths we had originally, there weren’t even 1000 of them that continually added enough new edges to our network. This meant that the reuse of edges in PathLinker was on a very big scale, underlining the potential for our new method.
My name is Jiarong Li, and I’m a rising junior majoring in Math-CS. In this summer, I’m working on a project related to computational biology with Tunc Kose, Ibrahim Youssef, and Anna Ritz. In the study of protein-protein interactions, we take each protein as a node and the interaction between two nodes as an edge. By building a graph of paths starting from receptors and ending at transcriptional factors, we aim to reconstruct better signaling pathways.
The former method we used to build pathways is PathLinker, which uses k-shortest path (KSP) to get k shortest paths from receptors to transcriptional factors in the collection of protein-protein interactomewithin a cell. This method, however, reuses a lot of edges and nodes in the process of building the graph. Thus, after a certain point, no new information will be added to the result. In order to solve this problem and get a better reconstruction, we adopt directed acyclic graph (DAG) to grow signaling pathways. DAG is a graph without cycles and all nodes in DAG can be topologically sorted. The algorithm starts from a ‘ground-truth’ graph G_0, which is the first path with more than 2 nodes generated from PathLinker, to grow DAG by introducing a new path that minimizes the total weights of all paths in the graph at every iteration. Here is an example about how the algorithm works.
By using the algorithm described above, we get reconstructions of Wnt pathways generated by DAG below. And a graph got from PathLinker for comparison.
As we can see, in the graph generated by DAG, all of the directed edges do not have reverse arrows directions. The triangles are receptors and squares are transcriptional factors. Yellow nodes represent nodes that are in both Wnt pathway and the interactome and blue nodes represent nodes that are only in the interactome.
With these results, we use Precision and Recall to test the performance of DAG. This method shows us the fraction of relevant instances among the retrieved instances and the fraction of relevant instances that have been retrieved over the total amount of relevant instances. Here is graphs of precision and recall for nodes and edges in the reconstruction of RANKL pathway.
We expect curves to approach to the upper right corner with high precision and recall. However, we cannot really show the obvious advantage of DAG through the this method, which means that we need to explore more ways for testing.
For future researching direction, we are looking for better G_0 options other than the result got from PathLinker. Also, for now, we’re trying adopting different algorithms to grow DAG. Instead of minimizing the total weights of paths in the graph, we’ll minimize the total weights of the graph and compare the result got from different algorithms. Moreover, we realize that there’re certain limits of precision and recall, so we’re looking for a more suitable way to test our current results to show the benefits of the DAG.
Ritz, Anna, Christopher L. Poirel, Allison N. Tegge, Nicholas Sharp, Kelsey Simmons, Allison Powell, Shiv D. Kale, and Tm Murali. “Pathways on Demand: Automated Reconstruction of Human Signaling Networks.” Npj Systems Biology and Applications2, no. 1 (2016). doi:10.1038/npjsba.2016.2.
Youssef, Ibrahim, Jeffrey Law, and Anna Ritz. “Integrating Protein Localization with Automated Signaling Pathway Reconstruction.” 2019. doi:10.1101/609149.
My name is Tayla, and I’m a rising junior Biology major working on a research project co-advised by Anna Ritz and Kara Cerveny this summer. Overall, my project is trying to understand a vitamin A-dependent biological signaling pathway that is part of the process of stem cells differentiating into neurons.
We’re interested in this process because previous studies have shown that vitamin A is essential to proper embryonic eye development because it alters gene expression at the transcription level via specialized receptor proteins. Understanding this developmental process will provide insight into the complex differentiation process and identifying the involved genes in silico may open avenues of inquiry for in vivo studies. We hope to search for the genes that are affected by this pathway through sequence analysis and analyze how those genes might fit into this regulatory network.
But first, some background: the pathway that I’m looking at is the retinoic acid pathway which is especially significant in the retina of developing zebrafish embryos. Retinoic acid is an active metabolite of vitamin A that allows proteins to bind to DNA and alter the transcription of certain genes (Figure 1). These proteins, called retinoic acid receptors, alter in the presence of retinoic acid to bind to very specific DNA sequences called retinoic acid response elements (RAREs).
One of my goals for this project is to find zebrafish genes that are responsive to retinoic acid influxes. To do this, I have to scan through parts of the genome and look for a tandem repeat of a six base pair motif. Retinoic acid receptors bind to these RAREs within the sequence upstream of the affected gene. I can build a program that takes these upstream regions of zebrafish genes, finds this repeated motif, and tells me all the genes that were found.
While it sounds pretty simple, there are actually a lot of moving parts. First I have to read in a big file of sequence and identifying information, and preferably do it quickly. Then I have to find a six base pair motif repeated 1-5 base pairs downstream and score it according to what’s allowed by its documented variation (Figure 2). Finally I have to return the gene IDs of genes containing the repeat. All of this is run on 65,171 annotated zebrafish transcripts’ upstream regions.
Luckily, at this point in my project (about 6 weeks in), I’ve written a program that will do this in about half an hour. Now comes the interpretation: finding out where and at what stage the genes I identified with my program are expressed in zebrafish. Hopefully we’ll find some genes that we expect to be regulated by retinoic acid in the final set of candidates to validate our method. The most exciting prospect is perhaps finding novel genes regulated by this pathway, or better yet a confirmation that the genes we’re testing in the lab as direct targets of retinoic acid exhibit the canonical response site.
Al Tanoury Z, Piskunov A, Rochette-Egly C. Vitamin A and retinoid signaling: genomic and nongenomic effects. J Lipid Res. 2013;54(7):1761-1775. doi:10.1194/jlr.R030833
Cunningham TJ, Duester G. Mechanisms of retinoic acid signalling and its roles in organ and limb development. Nat Rev Mol Cell Biol. 2015;16(2):110-123. doi:10.1038/nrm3932
Lalevée S, Anno YN, Chatagnon A, et al. Genome-wide in Silico Identification of New Conserved and Functional Retinoic Acid Receptor Response Elements (Direct Repeats Separated by 5 bp). J Biol Chem. 2011;286(38):33322-33334. doi:10.1074/jbc.M111.263681
Predki PF, Zamble D, Sarkar B, Giguère V. Ordered binding of retinoic acid and retinoid-X receptors to asymmetric response elements involves determinants adjacent to the DNA-binding domain. Mol Endocrinol. 1994;8(1):31-39. doi:10.1210/mend.8.1.8152429
My name is Amy Rose, and I’m a post-bac in Anna’s lab this summer. I graduated last month with an Alt. Biology degree with an emphasis in Computer Science. Taking Anna’s classes in my first two years at Reed was the start of my interest in computational bio. I spent my junior year studying computer science at The University of Sussex, and after this summer I will be starting as a software engineer at Puppet here in Portland.
When it came time to find a thesis project, I thought it would be interesting to explore an area of biology that I hadn’t had time to study while at Reed. I was coadvised by Anna and Sam Fey, who is an ecologist. Sam’s research on thermal variation led me to my project, which focused on modeling the effect of thermal variation on freshwater phytoplankton using real world data.
Phytoplankton are ectothermic, which means that they are not able to regulate their own body temperature. Additionally, due to their small size it is difficult to empirically measure the variance in their body temperature due to movement through thermally variable environments. My thesis began to resolve the impact on movement on body temperature and fitness. In this context, fitness represents the overall change in population size of phytoplankton based on temperature-dependent birth and mortality rates.
Temperature data was collected from Sparkling Lake in Vilas County, Wisconsin at intervals from .5 to 3m throughout the lake with a frequency as high as every minute over a period of 26 years. We interpolated the collected data to fill in estimated temperatures over depths which were not collected, as seen in the figure below.
We created five algorithms representing different theoretical patterns of phytoplankton movement throughout the water column, which we plotted against the data. This gave us a framework to understand the limits of what body temperatures phytoplankton may be experiencing. The second stage of the project was to plot these simulated body temperatures against a function representing phytoplankton fitness.
This summer, we hope to extend my thesis research over space and time. For my thesis, we focused on a single season, but we’re currently looking at extending the movement algorithms over all 26 years of data. We’re also interested in exploring more datasets sourced from lakes in different geographical locations. Additionally, we’re analyzing the effects of changes to the fitness function.
Reed has finished for the year, but that doesn’t mean that students are done. Last week kicked off a slew of undergraduate researchers doing all kinds of research. In no particular order, here’s a taste of what people will be working on in the compbio lab. Stay tuned for occaisonal group updates.
Math-CS major Jiarong (Lee) Li ’21 and biology major Tunc Kose ’22 are going to develop algorithms to analyze a cell’s response to external signals (called signaling pathways). They will be working to extend ideas based on the original PathLinker paper and Ibrahim Youssef’s Localized-PathLinker paper.
Recent graduate Amy Rose Lazarte ’19 (alt. bio major with a CS emphasis) will continue to develop a resource and modeling framework for understanding the effect of thermal variation on freshwater phytoplankton. Co-advised by ecologist Sam Fey, she has developed a computational pipeline to analyze longitudinal lake temperature data using simulations of phytoplankton swimming strategies.
Biology major Tayla Isensee ’20 is working on identifying targets of retinoic acid signaling in zebrafish eye development. She has a hand in the wetlab work with developmental biologist Kara Cerveny, and she will be building a zebrafish protein-protein interaction network to find potential regulators to test. First, though, she’s going to hunt for retinoic acid response elements (RAREs) in the zebrafish genome to identify direct targets of retinoic acid.
Another recent graduate, neuroscience major Alex King ’19, will be wrapping up his thesis work to build a network that integrates gene, transcript, and protein relationships in order to identify dysregulated pathways in polygenic diseases based on genome-wide association study (GWAS) data.
Biology major Karl Young ’20 will be reading up on computational modeling in neuroscience, and figuring out the intersection of my world (algorithms for biological networks) and neurobiologist Erik Zornik’s world (neural circuits and how they affect behavior).
Last but not least, CS graduate Ananthan Nambiar ’19 will be getting his thesis ready to present as a poster at ISMB/ECCB in Basel later this summer. He modeled proteins as language with the help of his main advisor, natural language processing (NLP) expert Mark Hopkins in CS.