Hey folks, my name is Gabe Preising and I’m a research assistant for Suzy Renn and Anna Ritz. I’ve always been a bit all over the place with my research interests, and because of that I’ve always been drawn to interdisciplinary research (either that or I’m just indecisive). I graduated from Reed this past May with a degree in Biology, my research interests including behavioral neuroscience, computational biology, and bioinformatics.
I spent my time in the Renn lab studying the African cichlid fish Astatotilapia burtoni. These fish are native to Lake Tanganyika, one of the African great lakes on the eastern side of the continent. They engage in complex social dynamics and because of that, people who study A. burtoni have historically studied the males in the context of dominance hierarchies and aggression. However, the Renn lab focuses their research on female behaviors, and in the case of my research, maternal care behavior. A. burtoni mothers engage in mouthbrooding, which is a ~2 week period where a mother will hold her developing offspring in her mouth. During this time, she will churn them around to circulate fresh air while simultaneously protecting them from predators. The really interesting part of this behavior though is that during mouthbrooding, a mother will forgo eating throughout the brooding period. This implies that somehow, a mother is making a trade-off where she puts the energetic needs of her offspring before her own.
The way signals from the brain can influence behavior and vice versa is fascinating to me, so I decided to do my senior thesis on the neural signals of mouthbrooding. As I was deciding on a project though, I became interested in bioinformatics and computational biology and wanted to tie each of these fields together. In Anna’s computational systems biology class, we learned about protein-protein interaction networks (i.e. interactomes) and how powerful they can be for investigating a variety of biological questions such as predicting disease genes. We also learned about how you could weight interactomes with external data to alter properties of the network (depending on what you’re doing). So in one sentence, my thesis (and current research) is: use a protein-protein interactome to figure out what’s going on in the brains of mouthbrooding fish.
Since there is currently no interactome for A. burtoni, I made my own from an existing zebrafish interactome. To do this, I took advantage of two publicly available resources: orthologous relationships  and a zebrafish interactome . Orthologous genes are related genes in different species that originated from a common ancestor. Using the program OrthoFinder , I generated orthologs between zebrafish and A. burtoni. I then created new edges between all of the corresponding orthologs for the pair of nodes in each zebrafish edge. Finally, I weighted the network with gene expression data from the brain such that redder nodes have a higher expression level in brooding fish and bluer nodes have a lower expression level. The following subnetwork focuses on neurotensin, a differentially upregulated gene in brooding fish.
Neurotensin is involved with both feeding and parental care, and seeing it connected to other neuropeptides like prodynorphin (pdyn) tells me that I mapped everything correctly. I’m going to keep refining my network in the coming weeks and try a different method of constructing it but I’m excited to have a usable interactome finally. Future goals include looking into NetworkX algorithms to run to find clusters of differentially expressed genes.
References:  Smedley, D., Haider, S., Ballester, B., Holland, R., London, D., Thorisson, G., & Kasprzyk, A. (2009). BioMart–biological queries made easy. BMC genomics, 10(1), 22.  Ogris, C., Guala, D., Kaduk, M., & Sonnhammer, E. L. (2018). FunCoup 4: new species, data, and visualization. Nucleic acids research, 46(D1), D601-D607.  Emms, D. M., & Kelly, S. (2019). OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome biology, 20(1), 1-14.
My name is Tayla, and I’m a rising junior Biology major working on a research project co-advised by Anna Ritz and Kara Cerveny this summer. Overall, my project is trying to understand a vitamin A-dependent biological signaling pathway that is part of the process of stem cells differentiating into neurons.
We’re interested in this process because previous studies have shown that vitamin A is essential to proper embryonic eye development because it alters gene expression at the transcription level via specialized receptor proteins. Understanding this developmental process will provide insight into the complex differentiation process and identifying the involved genes in silico may open avenues of inquiry for in vivo studies. We hope to search for the genes that are affected by this pathway through sequence analysis and analyze how those genes might fit into this regulatory network.
But first, some background: the pathway that I’m looking at is the retinoic acid pathway which is especially significant in the retina of developing zebrafish embryos. Retinoic acid is an active metabolite of vitamin A that allows proteins to bind to DNA and alter the transcription of certain genes (Figure 1). These proteins, called retinoic acid receptors, alter in the presence of retinoic acid to bind to very specific DNA sequences called retinoic acid response elements (RAREs).
One of my goals for this project is to find zebrafish genes that are responsive to retinoic acid influxes. To do this, I have to scan through parts of the genome and look for a tandem repeat of a six base pair motif. Retinoic acid receptors bind to these RAREs within the sequence upstream of the affected gene. I can build a program that takes these upstream regions of zebrafish genes, finds this repeated motif, and tells me all the genes that were found.
While it sounds pretty simple, there are actually a lot of moving parts. First I have to read in a big file of sequence and identifying information, and preferably do it quickly. Then I have to find a six base pair motif repeated 1-5 base pairs downstream and score it according to what’s allowed by its documented variation (Figure 2). Finally I have to return the gene IDs of genes containing the repeat. All of this is run on 65,171 annotated zebrafish transcripts’ upstream regions.
Luckily, at this point in my project (about 6 weeks in), I’ve written a program that will do this in about half an hour. Now comes the interpretation: finding out where and at what stage the genes I identified with my program are expressed in zebrafish. Hopefully we’ll find some genes that we expect to be regulated by retinoic acid in the final set of candidates to validate our method. The most exciting prospect is perhaps finding novel genes regulated by this pathway, or better yet a confirmation that the genes we’re testing in the lab as direct targets of retinoic acid exhibit the canonical response site.
Al Tanoury Z, Piskunov A, Rochette-Egly C. Vitamin A and retinoid signaling: genomic and nongenomic effects. J Lipid Res. 2013;54(7):1761-1775. doi:10.1194/jlr.R030833
Cunningham TJ, Duester G. Mechanisms of retinoic acid signalling and its roles in organ and limb development. Nat Rev Mol Cell Biol. 2015;16(2):110-123. doi:10.1038/nrm3932
Lalevée S, Anno YN, Chatagnon A, et al. Genome-wide in Silico Identification of New Conserved and Functional Retinoic Acid Receptor Response Elements (Direct Repeats Separated by 5 bp). J Biol Chem. 2011;286(38):33322-33334. doi:10.1074/jbc.M111.263681
Predki PF, Zamble D, Sarkar B, Giguère V. Ordered binding of retinoic acid and retinoid-X receptors to asymmetric response elements involves determinants adjacent to the DNA-binding domain. Mol Endocrinol. 1994;8(1):31-39. doi:10.1210/mend.8.1.8152429
This week Kathy, Usman and I met to discuss how we would combine the projects we’ve been working on into a cohesive pathway and how we would analyze the output of CancerLinker as compared to PathLinker.
I have been working on ways to visualize the data. One thing I wanted to look at was how incorporating gene expression data would change the overall distribution of edge weights in the interactome.
The original interactome has a reasonable distribution with two values that seem to appear frequently around 0.4 and 0.8.
Gene expression data was incorporated into the original interactome with a beta value. The beta value determines the weight of the original edge weight when including the gene expression data. So the higher the beta score, the lower the importance of the gene expression data. I made three histograms one for a beta=0.25, one for beta = 0.5 (equal contribution between original edge weight and gene expression data) and one for 0.75
As can be seen in the histograms above, when gene expression data is weighted more heavily, edges weights are more closely clustered. To investigate why this occurred I made two additional histograms: one of the gene expression data before it was transformed and one of the gene expression data after it was transformed.
After the gene expression data is transformed, there is extremely little variation. Additionally all the gene expression data is greater than or equal to 0.5 which should be the median. This would explain why weighting the gene expression data more heavily causes more closely-clustered edge weights. I’m not sure how to fix this. It seems like an error, but I’ve been over the code multiple times and the math seems right to me. So my next step is to figure out what’s going on there.
In the meantime I took the output from the pipeline that Kathy put together and put graphs for the top 1000 Wnt paths with β=0.25, 0.5 and 0.75 up on GraphSpace. If I figure out what’s wrong with the function that transforms the edgeweights, I will run it again and re-upload the updated graphs.
I have spent most of this week trying to transform the gene expression data and implement the equations that Ibrahim came up with to incorporate gene expression data onto edge scores.
I have had several major hold-ups to accomplishing this. The first is pretty simple, I just don’t know how to get the kind of data output I want from the CDF function because the scipy.norm.cdf() function outputs an array and I want a single value for each node.
The biggest issue is that the TCGA data on gene expression and PathLinker use different gene nomenclature systems. The TCGA relies on Ensembl IDs whereas PathLinker relies on UniProt IDs. Originally I tried to convert all the Ensembl IDs to common names to UniProt IDs because I have a file that contains UniProt IDs and common names, and another file that contains Ensembl IDs and common names. However, this didn’t work because a single gene may have multiple common names and may refer to multiple UniProt IDs (for example there are about 20 UniProt IDs that correspond to the HLA-A gene).
Therefore, to avoid the loss of data due to various common names, I tried to make a dictionary of Ensembl IDs directly to UniProt IDs. I was able to obtain a file that contained both UniProt and Ensembl IDs from the HUGO Gene Nomenclature Committee website. However, converting between Ensemble and UniProt came with its own problems. First of all, there are many genes that either only have UniProt IDs or only have Ensembl IDs. 997 genes in the interactome were unable to be converted from one to the other because of this reason. In addition, many of the Ensembl IDs in the TCGA file (over 16,000) don’t line up with any of the Ensembl IDs in the dictionary I constructed. I think this might be because the Ensembl IDs in the TCGA file include version numbers at the end of each ID. The version number is the decimal point at the end of the ID name. For example for the gene, “ENSG00000242268.2”, the “.2” means that this is the 2nd version of that gene. I think one way to fix this problem might be to just take all the version numbers out of the gene name when constructing the dictionary from the TCGA file. However, I can’t figure out how to do this without making the code super slow. If every version number were a single decimal place (ie: .1, .2, .3 … etc), I would just cut off the last two digits of each name which wouldn’t be that slow. However different IDs have different number of decimal places. For example, if I cut off the last two digits of “ENSG00000167578.15” I would still be left with the decimal point. Therefore the only way I can currently think of to get rid of the decimal and everything following it is to use a for-loop that goes through every character in the name and if the character is a decimal point to cut the string there. However, if the program has to go through every letter in every gene name, it’s going to be extremely slow. Maybe something I could do is pre-process the data to create a text file that contains gene IDs without the decimal places so it only has to do it one time and won’t slow the whole code down, but I feel like there has to be a faster way to do it within the program.
This week I have continued trying to visualize and integrate gene expression data from GDC TCGA onto nodes in Graphspace. This is the graph I have so far:
The color of the nodes on the graph represent the mean foldchange of the gene. The foldchange is the ratio of gene expression of a cancerous tissue sample to a normal tissue sample from the same patient. There were 41 patients with colon adenocarcinoma who had gene expression data from both a tumor and a normal sample. The foldchange for each of these 41 patients was averaged to find the mean foldchange of each gene. Light blue indicates that the gene is slightly underexpressed in cancerous samples (less than 1 standard deviation), dark blue(not shown) indicates that the gene is underexpressed by 2 standard deviations, purple (not shown) indicates that the gene is underexpressed by greater than 2 standard deviations. Orange indicates that the gene is slightly overexpressed (within one standard deviation), dark orange indicates that the gene is overexpressed by 2 standard deviations, and red indicates that the gene is overexpressed by greater than 2 standard deviations.
One problem that came up was that some genes have a lot more variance in patient data than others. (Data on variance is available by clicking on the nodes in Graphspace). Additionally, patients that no gene expression or close to no gene expression in their normal samples had extremely high foldchanges that threw off the mean. It’s tempting to just throw away those samples as outliers, however several samples had multiple “outliers”. As an example I’ve included three sets of line graphs I made that show the gene expression data and foldchange of each patient for three different genes:
In the first example, CFTR, one can see that cancerous samples tend to have lower gene regulation than healthy samples and most of the fold change values hover around 1.0, meaning that the change is fairly low.
The second example, FZD9, is slightly different. The FZD9 expression data is much less uniform. In some patients, FZD9 is largely overregulated in cancerous samples. In other, it’s largely under regulated. The foldchange data shows foldchanges that range from zero to extremely high values (greater than 30,0000). This occurs because those patients had normal sample values of 0.0. In this case, it looks like dismissing the two extremely high foldchanges as outliers would yield a more realistic data set.
However, in ZSCAN4, there are many patients with extremely high or low foldchanges. Instead of ignoring these patients, I think I need to find a way to normalize the data so that a few large foldchanges don’t completely throw off the data.
The next step in this project is to work with Usman and Kathy to integrate the gene expression data onto the edge weights so that PathLinker can calculate the most disregulated protein pathways. To begin this process, I’ve written a program that produces a text file that contains for each gene the ID, common name, mean foldchange, standard deviation, and either a +1, 0, or -1, which represents whether it’s under-regulated, unchanged, or over-regulated in cancerous samples. Right now the +1, 0, and -1 are really just placeholder values. I need to meet with Ibrahim and Anna to discuss how to better weight the nodes.
I spent the majority of the week parsing through various databases and their associated tools to try to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is the integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.
This week our main goal has been to find a pipeline to obtain TCGA data in a neat form. We discovered UCSC’s Xena Browser, which has files from the TCGA and a number of other databases.
Last week, we used the data from FireBrowse to make a graph of the genes that have patients with abnormally high or low levels of expression.
This week we changed that graph slightly by showing the difference between the number of patients with high expression and the number of patients with low expression by gene.
It is interesting to me that there are generally more patients with severe under-expression rather than severe-overexpression. I wonder if this is because these genes play a role in suppressing tumors, and that therefore maybe under-expression is more likely to cause cancer than overexpression?
I also worked on integrating gene expression data from Xena into our graph of the Wnt pathway.
Kathy figured out what was wrong with PathLinker the first time we ran it and re-ran it. I am working on turning it into a graph, but the input data is very different because it’s coming from a different version of NetPath so I need to change the program to be able to process the new data.
I also noticed while I was processing the expression data from Xena that there was a large amount of variability in gene expression between patients. I’m currently working on several things. Instead of just averaging gene expression for genes I’m comparing gene expression patient-by-patient so I’m comparing a tumor sample to a normal tissue sample for every patient. I also want to come up with a way to visualize the variance of expression among patients, because the more variance there is the less significant differences in expression between cancerous and normal tissue are. Anna suggested I do this by making the borders on nodes with high variance thicker. I am also going back and checking my math on gene expression to make sure that it is actually statistically significant and is conducted in a way that is similar to how other researchers have done similar research in the past.
On Monday we all met with Anna to agree on some goals for the week. Kathy and I set out to acheive several things: A. to figure out how to download data from FireBrowse on colon adenocarcinomas, to perform several statistical analyses on that data and to visualize it in some form using pylab, B. to write a code that produced informative graphs of the different signaling pathways from text files containing information about the edges and nodes and C. to implement PathLinker and LocPL and to create graphs of the top k paths on GraphSpace.
Originally to acheive our first goal, we tried to use Nicholas Egan’s pepper pathway code to download and format the data. However, while we were able to decipher how it worked, it ended up not being all that useful to us, so Kathy downloaded all the gene files related to colon adenocarcinoma onto her computer . Unfortunately, we’re still not really sure how to use them. We wrote a code that took text files containing information on edges and nodes and spit out a graph with color-coded nodes that were labeled with the node name and type and edges that were labeled with the type of interaction that was occuring between the two nodes for example “physical” or “phosphorylation”. I have attached to examples of the graphs we produced below, one very complex and the other very simple.
We also ran PathLinker on the Wnt pathway and made a graph of the top 200 best paths produced.
Moving forward we have several goals. Kathy is trying to figure out how to implement LocPL. I am still trying to figure out how to process and use the FireBrowse data. Because I will need to integrate methylation data from FireBrowse in my project, it is essential I figure out how to use it.
Overall, this week I have learned a lot about data processing, writing code to read text files, and using NetPath, PathLinker, FireBrowse, and GraphSpace. These skills should come in handy as I move forward in my project.
Week 2 was spent further looking into Bayesian Weighting Schemes and in particular how they were applied in a specific paper titled Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes by Poirel et. al. This was the original LINKER paper that PATHLINKER was built upon. I also spent a very decent amount of time looking at other interactomes and curating a working list of interactomes and how they weight their interactomes. So far I have only done this in detail for the HIPPIE interactome but more work is still needed. Besides research, I also worked on a document about Bayesian Weighting that aims to elucidate some of the mathematical set up and notation for the paper mentioned above as it suffered from some convoluted mathematical notation. Hopefully, this will be useful to people trying to read this paper in the future. Other tasks involved debugging code, presenting learned material and sorting through papers to find relevant papers.