Retinoic Acid, Development, and Motif Finding

 

My name is Tayla, and I’m a rising junior Biology major working on a research project co-advised by Anna Ritz and Kara Cerveny this summer. Overall, my project is trying to understand a vitamin A-dependent biological signaling pathway that is part of the process of stem cells differentiating into neurons.

We’re interested in this process because previous studies have shown that vitamin A is essential to proper embryonic eye development: it alters gene expression at the transcription level via specialized receptor proteins. Understanding this developmental process will provide insight into how differentiation works, and identifying the genes involved in silico may open avenues of inquiry for in vivo studies. We hope to find the genes affected by this pathway through sequence analysis and to analyze how those genes might fit into this regulatory network.

But first, some background: the pathway that I’m looking at is the retinoic acid pathway, which is especially significant in the retina of developing zebrafish embryos. Retinoic acid is an active metabolite of vitamin A that allows proteins to bind to DNA and alter the transcription of certain genes (Figure 1). These proteins, called retinoic acid receptors, change in the presence of retinoic acid so that they bind to very specific DNA sequences called retinoic acid response elements (RAREs).

Figure 1. Heterodimerization occurs upon nuclear retinoic acid receptors (RARs) and retinoid X receptors (RXRs) recognizing a tandemly repeated hexad motif called a retinoic acid response element (RARE) usually upstream of the direct target gene. In the presence of retinoic acid (all trans and 9-cis), the complex becomes active. This complex can either encourage transcription of its target gene by cleaving the co-repressors or inhibit transcription through repressor factor recruitment.

One of my goals for this project is to find zebrafish genes that are responsive to retinoic acid influxes. To do this, I have to scan through parts of the genome and look for a tandem repeat of a six base pair motif. Retinoic acid receptors bind to these RAREs within the sequence upstream of the affected gene. I can build a program that takes these upstream regions of zebrafish genes, finds this repeated motif, and tells me all the genes that were found.

Figure 2. The RARE motif is composed of 6 base pairs of conserved sequence followed by a space of 1-5 base pairs and then a repeat of the same motif. The spacing in between motifs is used to classify it; for example the motif in the figure is a direct repeat spaced 5 base pairs apart and would be called a DR5.

While it sounds pretty simple, there are actually a lot of moving parts. First I have to read in a big file of sequence and identifying information, and preferably do it quickly. Then I have to find a six base pair motif repeated 1-5 base pairs downstream and score it according to what’s allowed by its documented variation (Figure 2). Finally I have to return the gene IDs of genes containing the repeat. All of this is run on 65,171 annotated zebrafish transcripts’ upstream regions.
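The scanning step can be sketched roughly as follows. This is a minimal illustration, not the program described above: it assumes the commonly cited half-site consensus (A/G)G(G/T)TCA with exact matching, it skips the scoring of documented variation, and the function names and the dictionary of upstream sequences are hypothetical.

```python
import re

# Assumed half-site consensus (A/G)G(G/T)TCA; the real program may allow more variation.
HALF_SITE = "[AG]G[GT]TCA"
HALF_SITE_RE = re.compile(HALF_SITE)
# Zero-width lookahead so that overlapping candidate start positions are all visited.
START_RE = re.compile(f"(?=({HALF_SITE}))")


def find_direct_repeats(seq, min_spacer=1, max_spacer=5):
    """Return (position, label) pairs for each half-site repeated 1-5 bp downstream."""
    seq = seq.upper()
    hits = []
    for match in START_RE.finditer(seq):
        start = match.start()
        for spacer in range(min_spacer, max_spacer + 1):
            second = start + 6 + spacer
            # The second half-site must occupy exactly the 6 bp after the spacer.
            if HALF_SITE_RE.fullmatch(seq, second, second + 6):
                hits.append((start, f"DR{spacer}"))
    return hits


def genes_with_rare(upstream_by_gene):
    """Map each gene ID to its hits, keeping only genes with at least one repeat."""
    results = {}
    for gene_id, seq in upstream_by_gene.items():
        hits = find_direct_repeats(seq)
        if hits:
            results[gene_id] = hits
    return results


# Made-up sequences for illustration only.
example = {
    "gene_a": "TTAGGTCACCAGGAGTTCATT",
    "gene_b": "TTTTTTTTTTTTTTTTTTTTT",
}
print(genes_with_rare(example))
```

Looping over spacer lengths explicitly, rather than using one regular expression with a variable-length gap, reports every spacing that matches from a given start position and makes the DR label easy to read off.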

Luckily, at this point in my project (about 6 weeks in), I’ve written a program that will do this in about half an hour. Now comes the interpretation: finding out where and at what stage the genes I identified with my program are expressed in zebrafish. Hopefully we’ll find some genes that we expect to be regulated by retinoic acid in the final set of candidates to validate our method. The most exciting prospect is perhaps finding novel genes regulated by this pathway, or better yet a confirmation that the genes we’re testing in the lab as direct targets of retinoic acid exhibit the canonical response site.

Sources:

Al Tanoury Z, Piskunov A, Rochette-Egly C. Vitamin A and retinoid signaling: genomic and nongenomic effects. J Lipid Res. 2013;54(7):1761-1775. doi:10.1194/jlr.R030833

Cunningham TJ, Duester G. Mechanisms of retinoic acid signalling and its roles in organ and limb development. Nat Rev Mol Cell Biol. 2015;16(2):110-123. doi:10.1038/nrm3932

Lalevée S, Anno YN, Chatagnon A, et al. Genome-wide in Silico Identification of New Conserved and Functional Retinoic Acid Receptor Response Elements (Direct Repeats Separated by 5 bp). J Biol Chem. 2011;286(38):33322-33334. doi:10.1074/jbc.M111.263681

Predki PF, Zamble D, Sarkar B, Giguère V. Ordered binding of retinoic acid and retinoid-X receptors to asymmetric response elements involves determinants adjacent to the DNA-binding domain. Mol Endocrinol. 1994;8(1):31-39. doi:10.1210/mend.8.1.8152429

Ecology Modeling: Thermal Variation and Phytoplankton Fitness

My name is Amy Rose, and I’m a post-bac in Anna’s lab this summer. I graduated last month with an Alt. Biology degree with an emphasis in Computer Science. Taking Anna’s classes in my first two years at Reed was the start of my interest in computational bio. I spent my junior year studying computer science at The University of Sussex, and after this summer I will be starting as a software engineer at Puppet here in Portland.

When it came time to find a thesis project, I thought it would be interesting to explore an area of biology that I hadn’t had time to study while at Reed. I was co-advised by Anna and Sam Fey, who is an ecologist. Sam’s research on thermal variation led me to my project, which focused on modeling the effect of thermal variation on freshwater phytoplankton using real-world data.

Phytoplankton are ectothermic, which means that they are not able to regulate their own body temperature. Additionally, because they are so small, it is difficult to empirically measure how much their body temperature varies as they move through thermally variable environments. My thesis began to resolve the impact of movement on body temperature and fitness. In this context, fitness represents the overall change in population size of phytoplankton based on temperature-dependent birth and mortality rates.

Temperature data was collected from Sparkling Lake in Vilas County, Wisconsin at depth intervals of 0.5 to 3 m throughout the lake, with a frequency as high as every minute, over a period of 26 years. We interpolated the collected data to estimate temperatures at depths that were not measured, as seen in the figure below.

Interpolation of data across space. Data was collected at discrete intervals, but linearly interpolated to fill in gaps.
Sparkling Lake temperature data from the 1989 season before and after interpolation. The left figure shows the recorded temperatures collected at each measured depth. The right figure was made by interpolating the temperature at every 0.01 meters from the recorded data.
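The interpolation over depth can be sketched as below. This is a simplified, assumed version for a single time point: it takes depths sorted from shallow to deep, interpolates linearly between neighboring measurements, and uses made-up readings; the function name and the example values are hypothetical.

```python
from bisect import bisect_right


def interpolate_profile(depths, temps, step=0.01):
    """Linearly interpolate temperature every `step` meters between measured depths.

    Assumes `depths` is sorted in increasing order and pairs up with `temps`.
    """
    n_steps = round((depths[-1] - depths[0]) / step)
    profile = []
    for i in range(n_steps + 1):
        d = depths[0] + i * step
        # Index of the measured depth just below d, clamped to a valid interval.
        j = max(1, min(bisect_right(depths, d), len(depths) - 1))
        d0, d1 = depths[j - 1], depths[j]
        t0, t1 = temps[j - 1], temps[j]
        frac = (d - d0) / (d1 - d0)
        profile.append((round(d, 2), t0 + frac * (t1 - t0)))
    return profile


# Made-up readings: temperature (degrees C) at three measured depths (m).
measured_depths = [0.5, 1.0, 2.0]
measured_temps = [20.0, 18.0, 14.0]
profile = interpolate_profile(measured_depths, measured_temps)
print(len(profile), profile[0], profile[-1])
```

Repeating this for every time point in the record would give a depth-by-time grid like the one in the right-hand figure.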

We created five algorithms representing different theoretical patterns of phytoplankton movement throughout the water column, which we plotted against the data. This gave us a framework to understand the limits of what body temperatures phytoplankton may be experiencing. The second stage of the project was to plot these simulated body temperatures against a function representing phytoplankton fitness.

This summer, we hope to extend my thesis research over space and time. For my thesis, we focused on a single season, but we’re currently looking at extending the movement algorithms over all 26 years of data. We’re also interested in exploring more datasets sourced from lakes in different geographical locations. Additionally, we’re analyzing the effects of changes to the fitness function.

Summer Research 2019 – here we go!

Reed has finished for the year, but that doesn’t mean that students are done. Last week kicked off a slew of undergraduate researchers doing all kinds of research. In no particular order, here’s a taste of what people will be working on in the compbio lab. Stay tuned for occasional group updates.

Math-CS major Jiarong (Lee) Li ’21 and biology major Tunc Kose ’22 are going to develop algorithms to analyze a cell’s response to external signals (called signaling pathways). They will be working to extend ideas based on the original PathLinker paper and Ibrahim Youssef’s Localized-PathLinker paper.

Recent graduate Amy Rose Lazarte ’19 (alt. bio major with a CS emphasis) will continue to develop a resource and modeling framework for understanding the effect of thermal variation on freshwater phytoplankton. Co-advised by ecologist Sam Fey, she has developed a computational pipeline to analyze longitudinal lake temperature data using simulations of phytoplankton swimming strategies.

Biology major Tayla Isensee ’20 is working on identifying targets of retinoic acid signaling in zebrafish eye development. She has a hand in the wetlab work with developmental biologist Kara Cerveny, and she will be building a zebrafish protein-protein interaction network to find potential regulators to test. First, though, she’s going to hunt for retinoic acid response elements (RAREs) in the zebrafish genome to identify direct targets of retinoic acid.

Another recent graduate, neuroscience major Alex King ’19, will be wrapping up his thesis work to build a network that integrates gene, transcript, and protein relationships in order to identify dysregulated pathways in polygenic diseases based on genome-wide association study (GWAS) data.

Biology major Karl Young ’20 will be reading up on computational modeling in neuroscience, and figuring out the intersection of my world (algorithms for biological networks) and neurobiologist Erik Zornik’s world (neural circuits and how they affect behavior).

Last but not least, CS graduate Ananthan Nambiar ’19 will be getting his thesis ready to present as a poster at ISMB/ECCB in Basel later this summer. He modeled proteins as language with the help of his main advisor, natural language processing (NLP) expert Mark Hopkins in CS.

Week 10: Fixing Up Nodescoring

Anna and Ibrahim came up with two new ways to weight the nodes, both of which produce a far greater range of node weights than the original nodescoring.py program did. The histograms of the new node weights are shown below:

Convolution weighted nodes

Empirical weighted nodes

Beyond that, I have spent most of this week making small tweaks to nodescoring.py so that it runs more smoothly.

Week 9: Fixing Nodescoring.py

In my last blog post, I talked about how I was concerned about how small the range of normalized node scores is. This week I’ve been trying to figure out why that is. To do this I’ve been making histograms of each step of the process from foldchange to Xv to Cv. This is an example of that process for one gene:

Distribution of foldchanges across patient samples for a single gene.

Distribution of Xvs across patient samples for a single gene

The Cv of the gene above was 0.5000000003. Unfortunately, it looks like a lot of genes even with fairly different fold change and Xv distributions end up with very similar Cvs.

Ibrahim realized that this is occurring because there is an error in the equations we were using so we will have to rethink the way we normalize the data.

Week 9: Cell tracking

So now we are in the experimental validation phase. This week, we did a trial run of the cell tracking software using newly cultured cells from a Drosophila melanogaster (fruit fly) line. These lines are suitable for validating that the genes are involved in cell motility because basic cellular mechanisms are highly conserved between humans and fruit flies.

Week 7: Matrix Files

This week, I’ve been creating a tabular file with all TCGA-COAD samples at the top of the file as column names, with genes at the sides. I should have a 512 x 60484 matrix file when done. However, with Sol’s help I realized that the file I initially output was actually 512 samples at the top, with only 443 columns after, but still with 60484 lines to the file.

Therefore, I think there’s something wrong in how I’m categorizing/organizing the samples.
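One quick way to narrow down a problem like this is to count how many fields each line of the file actually has and compare that with the header. The sketch below assumes a tab-delimited file with a single header row; the function name and the toy data are hypothetical.

```python
import csv
import io
from collections import Counter


def column_count_report(handle, delimiter="\t"):
    """Return the header width and a tally of how many fields each data row has."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    row_widths = Counter(len(row) for row in reader)
    return len(header), row_widths


# Toy file: the header has 4 fields, but the second data row is one field short.
toy = io.StringIO("gene\ts1\ts2\ts3\nA\t1\t2\t3\nB\t1\t2\n")
header_width, row_widths = column_count_report(toy)
print(header_width, dict(row_widths))
```

If every row has the same width as the header, the tally has a single entry; any other widths point at the rows where samples were dropped or misaligned.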

Week 8: Putting Everything Together

This week Kathy, Usman and I met to discuss how we would combine the projects we’ve been working on into a cohesive pathway and how we would analyze the output of CancerLinker as compared to PathLinker.

I have been working on ways to visualize the data. One thing I wanted to look at was how incorporating gene expression data would change the overall distribution of edge weights in the interactome.

The original interactome has a reasonable distribution with two values that seem to appear frequently around 0.4 and 0.8.

Original Interactome

Gene expression data was incorporated into the original interactome with a beta value. The beta value determines how much weight the original edge weight receives when the gene expression data is included, so the higher the beta value, the lower the importance of the gene expression data. I made three histograms: one for beta = 0.25, one for beta = 0.5 (equal contribution from the original edge weight and the gene expression data), and one for beta = 0.75.

Distribution of edge weights with gene expression incorporated.
Left: β=0.25, Middle: β=0.5, Right: β=0.75
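To make the role of beta concrete, here is a minimal sketch that assumes the two values are mixed as a simple weighted average. The exact formula used in the pipeline is not given in this post, so the function and the example numbers are illustrative only.

```python
def combine_weight(edge_weight, expression_score, beta):
    """Mix the original edge weight with a gene expression score.

    Assumed form: beta * original + (1 - beta) * expression, so beta = 1
    ignores gene expression and beta = 0 ignores the original weight.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must be between 0 and 1")
    return beta * edge_weight + (1.0 - beta) * expression_score


# Illustrative values: an original edge weight of 0.8 and an expression score of 0.5.
for beta in (0.25, 0.5, 0.75):
    print(beta, combine_weight(0.8, 0.5, beta))
```

Under this assumed form, if the expression scores all sit close to one value, a low beta pulls every edge weight toward that value, which would produce tightly clustered edge weights.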

As can be seen in the histograms above, when gene expression data is weighted more heavily, edge weights are more closely clustered. To investigate why this occurred, I made two additional histograms: one of the gene expression data before it was transformed and one of the gene expression data after it was transformed.

Left: Gene expression data before transformation
Right: Gene expression data after being transformed

After the gene expression data is transformed, there is extremely little variation. Additionally, all of the transformed gene expression values are greater than or equal to 0.5, which should be the median. This would explain why weighting the gene expression data more heavily causes more closely clustered edge weights. I’m not sure how to fix this. It seems like an error, but I’ve been over the code multiple times and the math seems right to me. So my next step is to figure out what’s going on there.

In the meantime I took the output from the pipeline that Kathy put together and put graphs for the top 1000 Wnt paths with β=0.25, 0.5 and 0.75 up on GraphSpace. If I figure out what’s wrong with the function that transforms the edgeweights, I will run it again and re-upload the updated graphs.

 

Summer Week 8: Positive and Negative Sets

I was curious about what our AUC would look like if instead of comparing hidden positives to unlabeled, we compared hidden positives to hidden negatives, wondering if that would get us a more valid representation of the network’s ability to predict genes associated with schizophrenia. What if the unlabeled genes ranked above the hidden positives were simply actual centers of genetic perturbations in schizophrenia?

Oddly enough, the scores actually decreased from 0.65-0.70 to 0.60-0.65. I found that the AUCs for comparing hidden negatives to unlabeled nodes hovered around 0.53, meaning that the hidden negatives were ranked mostly in the top half of the list.
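For reference, an AUC of this kind can be read as the probability that a randomly chosen gene from one group is ranked above a randomly chosen gene from the other. The sketch below computes it by direct pairwise comparison; it is a generic illustration with made-up scores, not the evaluation code used for these results.

```python
def auc(positive_scores, negative_scores):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. A value of 0.5 means the two groups are
    ranked no differently from chance.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Made-up node scores for illustration.
hidden_positives = [0.9, 0.2]
comparison_group = [0.5, 0.1]
print(auc(hidden_positives, comparison_group))
```

Swapping in hidden negatives as the first group and unlabeled nodes as the second gives the kind of comparison described above, where a value above 0.5 means the negatives tend to rank in the upper half.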

This somewhat makes sense, though. The negative set is based on genes in non-neurological diseases, but that doesn’t necessarily exclude them from being involved in a more subtle, polygenic disorder that depends on the additive effects of hundreds of genetic perturbations. It also makes sense that genes involved in one disease might have a slightly higher probability of being associated with some other disease.

I decided to make a new negative set. I started by gathering every single node in the 0.150 network. There is a resource called SZGR 2.0 (https://bioinfo.uth.edu/SZGR/) that collects all sorts of evidence for genes being associated with schizophrenia. Using this list, I excluded any gene that had any evidence for being associated with schizophrenia. I then took differential gene expression data from autism spectrum disorder, bipolar disorder, and major depressive disorder as well as schizophrenia (http://science.sciencemag.org/content/359/6376/693). If a gene was not differentially expressed in any of these diseases (FDR>0.5 for schizophrenia, FDR>0.2 for others) and was in the other list, I added that gene to my list of negatives, which was 1561 genes long. I found that this list, when hidden negatives were compared to unlabeled genes, had an AUC of about 0.44, which means that our negatives are finally oriented around schizophrenia rather than non-neurological diseases. I also found that the AUC for hidden positives vs hidden negatives spiked up to 0.75, meaning that our program is legitimately good at predicting genes associated with schizophrenia.
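The filtering logic can be sketched as follows. The data structures, the disease labels, and the choice to drop genes that are missing from an expression dataset are all assumptions made for illustration; only the FDR cutoffs come from the description above.

```python
def build_negative_set(network_genes, szgr_genes, fdr_by_disease,
                       sz_label="SCZ", sz_cutoff=0.5, other_cutoff=0.2):
    """Keep network genes with no SZGR evidence and no differential expression.

    `fdr_by_disease` maps a disease label to {gene: FDR}. A gene is kept only
    if its FDR is above the cutoff in every disease. Genes missing from a
    dataset are dropped here, which is an assumption of this sketch.
    """
    candidates = set(network_genes) - set(szgr_genes)
    negatives = set()
    for gene in candidates:
        keep = True
        for disease, fdrs in fdr_by_disease.items():
            cutoff = sz_cutoff if disease == sz_label else other_cutoff
            fdr = fdrs.get(gene)
            if fdr is None or fdr <= cutoff:
                keep = False
                break
        if keep:
            negatives.add(gene)
    return negatives


# Made-up genes and FDR values for illustration.
network = {"A", "B", "C", "D"}
szgr = {"B"}
fdr = {
    "SCZ": {"A": 0.9, "C": 0.3, "D": 0.8},
    "ASD": {"A": 0.5, "C": 0.9, "D": 0.1},
}
print(build_negative_set(network, szgr, fdr))
```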

I found no significant performance differences from factoring in evidence levels.