Week 4: Obtaining and Processing Data

This week our main goal has been to find a pipeline to obtain TCGA data in a neat form. We discovered UCSC’s Xena Browser, which has files from the TCGA and a number of other databases.

Last week, we used the data from FireBrowse to make a graph of the genes that have patients with abnormally high or low levels of expression.

Number of patients that have abnormally high or low levels of expression

This week we changed that graph slightly by showing the difference between the number of patients with high expression and the number of patients with low expression by gene.

Number of patients by gene that have abnormally high or abnormally low levels of gene expression.
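The per-gene difference plotted above can be sketched in a few lines of Python. This is only a minimal illustration, not our actual script: the function name `high_low_difference`, the `(gene, patient, label)` input format, and the example calls are all made up for this post, and it assumes each patient has already been labeled "high", "low", or "normal" for each gene.

```python
from collections import Counter

def high_low_difference(calls):
    """Return {gene: (# patients called "high") - (# patients called "low")}.

    `calls` is an iterable of (gene, patient, label) tuples, where label is
    "high", "low", or "normal". This input format is an assumption made for
    the example, not the format of the FireBrowse files.
    """
    high = Counter()
    low = Counter()
    genes = set()
    for gene, _patient, label in calls:
        genes.add(gene)
        if label == "high":
            high[gene] += 1
        elif label == "low":
            low[gene] += 1
    # Counter returns 0 for missing keys, so genes with no calls get 0.
    return {gene: high[gene] - low[gene] for gene in genes}

# Made-up example calls, for illustration only.
example_calls = [
    ("APC", "P1", "low"),
    ("APC", "P2", "low"),
    ("APC", "P3", "high"),
    ("MYC", "P1", "high"),
    ("MYC", "P2", "normal"),
]
difference = high_low_difference(example_calls)
```

A negative value means more patients under-express the gene than over-express it, which is the pattern most genes in the graph show.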

It is interesting to me that there are generally more patients with severe under-expression than with severe over-expression. I wonder whether this is because these genes play a role in suppressing tumors, which would make under-expression more likely than over-expression to contribute to cancer.

I also worked on integrating gene expression data from Xena into our graph of the Wnt pathway.

Wnt Pathway from PathLinker with Gene Expression Data. Red = high, orange = medium-high, yellow = medium, green = low, blue (not shown) = very low, white (not shown) = no expression. Triangles are transcription factors, squares are receptors, and circles are intermediate proteins.

Kathy figured out what went wrong with PathLinker the first time we ran it, and she re-ran it. I am working on turning the new output into a graph, but the input data is very different because it comes from a different version of NetPath, so I need to change my program to process the new data.

While processing the expression data from Xena, I also noticed a large amount of variability in gene expression between patients, so I’m currently working on several things. Instead of just averaging expression for each gene, I’m comparing expression patient by patient, so that each patient’s tumor sample is compared with that same patient’s normal tissue sample. I also want to come up with a way to visualize the variance of expression among patients, because the more variance there is, the less significant the differences in expression between cancerous and normal tissue are. Anna suggested I do this by making the borders thicker on nodes with high variance. Finally, I am going back and checking my math on gene expression to make sure that the differences I report are actually statistically significant and that the analysis is conducted in a way similar to how other researchers have done it in the past.
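Here is a rough sketch of the patient-by-patient comparison and the border-width idea described above. It is not the code I am actually running: the function names, the patient IDs, the numbers, and the border-width range are all hypothetical, and it assumes the expression values are already on a log scale so that subtracting normal from tumor makes sense.

```python
import statistics

def paired_differences(tumor, normal):
    """Tumor minus normal expression for every patient that has both samples.

    `tumor` and `normal` map patient ID -> expression value for one gene
    (assumed to be log-scale values; that is an assumption of this sketch).
    """
    shared = sorted(set(tumor) & set(normal))
    return {patient: tumor[patient] - normal[patient] for patient in shared}

def summarize(differences):
    """Return (mean, sample variance) of the per-patient differences."""
    values = list(differences.values())
    mean = statistics.mean(values)
    variance = statistics.variance(values) if len(values) > 1 else 0.0
    return mean, variance

def border_width(variance, max_variance, min_width=1.0, max_width=8.0):
    """Scale a node's border width linearly with its variance.

    The width range is an arbitrary choice for illustration.
    """
    if max_variance <= 0:
        return min_width
    fraction = min(variance / max_variance, 1.0)
    return min_width + fraction * (max_width - min_width)

# Made-up values for one gene; P4 has no tumor sample and is skipped.
tumor = {"P1": 9.0, "P2": 7.0, "P3": 6.0}
normal = {"P1": 5.0, "P2": 5.0, "P3": 6.0, "P4": 4.0}

differences = paired_differences(tumor, normal)
mean_diff, var_diff = summarize(differences)
width = border_width(var_diff, max_variance=8.0)
```

A paired test on the same per-patient differences (for example `scipy.stats.ttest_rel`) would be one way to check significance, though I have not settled on which test to use.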

Week 2: Understanding Wnt/β-catenin Signaling and the Ryk-CFTR-Dab2 Path

This week, Kathy, Usman, and I split up to learn more about different aspects of PathLinker. I researched the Wnt/β-catenin signaling pathway, precision-recall curves, and the role of CFTR and Dab2. On Wednesday, we all presented the topics we had researched to each other and Anna and Ibrahim.

To briefly summarize what I learned, the Wnt/β-catenin pathway regulates the transcription of certain genes related to cell proliferation, cell attachment, and growth. When the Wnt/β-catenin signaling pathway is dysregulated, a number of pathologies can develop, including cancer and heart disease. In fact, in a study on colon adenocarcinoma by The Cancer Genome Atlas, 93% of tested tumors had a mutation that affected the Wnt/β-catenin signaling pathway (TCGA, 2012). At the most basic level, the Wnt/β-catenin signaling pathway is turned on when Wnt proteins bind to Frizzled, a seven-pass transmembrane receptor, which halts the destruction of β-catenin in the cell. Usually, when the pathway is off, β-catenin is constantly being produced and destroyed. When β-catenin destruction is interrupted, β-catenin builds up in the cytoplasm and moves into the nucleus, where it binds to TCF/LEF transcription factors and promotes the transcription of specific genes that were previously being repressed by co-repressors bound to TCF. The Wnt/β-catenin signaling pathway involves many proteins and interactions, some of which we still do not understand.

The PathLinker algorithm successfully identified a path in the Wnt/β-catenin signaling pathway that was not in the KEGG or NetPath databases: the Ryk-CFTR-Dab2 path (Ritz et al., 2016). The authors of the paper hypothesized that Ryk would interact with CFTR (Cystic Fibrosis Transmembrane-conductance Regulator), a chloride ion channel previously studied for its role in cystic fibrosis, which would activate Dab2 to inhibit β-catenin activity. This hypothesis was experimentally confirmed by silencing Ryk, CFTR, and Dab2 individually using RNA interference, and then measuring the transcription of β-catenin-controlled genes with a TCF/LEF luciferase reporter assay and the levels of β-catenin in the cell with a Western blot.

We also attended a data management workshop taught by David Isaak, Reed’s data science librarian, to learn how to use Git and GitHub. Because I’ve never really used the terminal before (I usually use repl.it), I worked through a command line tutorial to learn more about it.

On Thursday, Kathy, Usman, and I went through our Dijkstra code with Anna, and finally got it to work the way we wanted it to. Anna added us to the cancer-linker repository on GitHub so we can all begin to actually work on PathLinker now.

My next project is to figure out how to take data from FireBrowse, a database containing data from TCGA on 38 different cancer types, put it into a format we can use, and apply several statistical analyses to it. I’m hoping to base it on PepperPathway, a program written by Nicholas Egan, a student in Anna’s lab last summer, that uses data from FireBrowse, GeneCards, and NetPath to create a visual representation on GraphSpace. I’m struggling to make sense of the PepperPathway program because there are a lot of files in the GitHub repository and I’m not really sure where to begin. I’m going to try to work through it this weekend.


  1. TCGA Research Network. Comprehensive Molecular Characterization of Human Colon and Rectal Tumors. July 19, 2012. Nature. DOI: 10.1038/nature11252.
  2. Ritz, Anna et al. “Pathways on Demand: Automated Reconstruction of Human Signaling Networks.” npj Systems Biology and Applications 2 (2016): 16002. Web.

Week 1

This is the first week that Kathy, Usman, and I are working on our summer research projects, all of which involve modifying the PathLinker algorithm. For my project, I am interested in integrating data about protein methylations related to colon adenocarcinoma from the FireBrowse database into PathLinker. Protein methylation, which is only one of many ways that cell signaling pathways can be altered in cancerous cells, can either inhibit or enhance protein-protein interactions. I intend to devise a way to use this information to change the weights of edges connected to methylated proteins, so that the graph more accurately models cancerous cell signaling pathways. I am hoping this method could eventually be generalized so that the algorithm could be modified for multiple types of cancerous mutations. I believe that this work could help identify key proteins for further research.

So far, we have mostly been trying to learn about pathway reconstruction methods before we dive in. To do this, we have been re-reading the original PathLinker paper as well as a more recent paper about an alteration to PathLinker that allows it to integrate protein localization information.

We have also been working on understanding and implementing two algorithms that calculate the shortest path from a source to a target in a graph: breadth-first search and Dijkstra’s algorithm. Breadth-first search finds the path with the fewest edges from a source to a target. Dijkstra’s algorithm is more advanced and more relevant to our work because it takes the weights of the edges between nodes into account. It finds the path with the lowest sum of edge weights from the source to every node in the graph that can be reached from it.
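To make the difference between the two algorithms concrete, here is a small sketch of both. This is not the code Kathy, Usman, and I wrote; the function names and the toy graph are invented for this post, the graph is assumed to be a dictionary of dictionaries (`graph[u][v]` is the weight of the edge from `u` to `v`), and Dijkstra’s algorithm assumes the weights are non-negative.

```python
import heapq
from collections import deque

def walk_back(parents, target):
    """Rebuild the path to `target` by following parent pointers."""
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = parents[node]
    return path[::-1]

def bfs_path(graph, source, target):
    """Path with the fewest edges from source to target (weights ignored)."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return walk_back(parents, target)
        for neighbor in graph.get(node, {}):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # target is not reachable from source

def dijkstra(graph, source):
    """Lowest total edge weight from source to every reachable node."""
    dist = {source: 0.0}
    parents = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry; a shorter path was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_dist = d + weight
            if new_dist < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_dist
                parents[neighbor] = node
                heapq.heappush(heap, (new_dist, neighbor))
    return dist, parents

# Toy graph: the direct edge s -> t is expensive, the detour is cheap.
toy_graph = {
    "s": {"a": 1, "t": 10},
    "a": {"b": 1},
    "b": {"t": 1},
    "t": {},
}

fewest_edges = bfs_path(toy_graph, "s", "t")
dist, parents = dijkstra(toy_graph, "s")
lowest_weight = walk_back(parents, "t")
```

On this toy graph, breadth-first search returns the one-edge path from `s` straight to `t`, while Dijkstra’s algorithm prefers the three-edge detour through `a` and `b` because its total weight is 3 instead of 10.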

This weekend and this coming week I have several goals. Primarily, Usman, Kathy, and I need to figure out how to get the separate pieces of code that we wrote for Dijkstra’s algorithm (which all work individually) to work together: first read a text file, then run Dijkstra’s algorithm, and finally produce a visual graph that shows the best paths. I also need to do more research on the original PathLinker paper to better understand three things: precision-recall curves, which were used to evaluate its efficacy; Ryk-CFTR-Dab2, a new path that PathLinker discovered in the Wnt/β-catenin signaling pathway; and the experimental methods that were used to verify the importance of CFTR.
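As a starting point for understanding precision-recall curves, here is a minimal sketch of how the points on such a curve can be computed from a ranked list of predictions. It reflects my current understanding rather than the exact evaluation in the PathLinker paper, and the function name, the item labels, and the set of known positives are all invented for illustration.

```python
def precision_recall_points(ranked, positives):
    """Return a (precision, recall) pair after each prediction in `ranked`.

    `ranked` lists predictions from most to least confident; `positives` is
    the set of items known to be correct. Precision is the fraction of
    predictions so far that are correct, and recall is the fraction of all
    known positives that have been found so far.
    """
    positives = set(positives)
    if not positives:
        raise ValueError("need at least one known positive")
    points = []
    hits = 0
    for k, item in enumerate(ranked, start=1):
        if item in positives:
            hits += 1
        points.append((hits / k, hits / len(positives)))
    return points

# Made-up ranking of four predicted edges against three known edges.
ranked_edges = ["e1", "e2", "e3", "e4"]
known_edges = {"e1", "e3", "e5"}
points = precision_recall_points(ranked_edges, known_edges)
```

Plotting recall on the x-axis against precision on the y-axis gives the curve; a method that ranks the known positives near the top keeps precision high as recall grows.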

Kicking Off the Collaborative REU

As the summer winds down and classes begin at Reed College, we are excited to begin a new project that sits at the intersection of computer science and biology.  With mentoring expertise on both sides of the aisle (Anna is a computer scientist, and Derek is a cell biologist), our interdisciplinary team will apply computer science techniques to predict potential players in disease.

The Biological Question: How is cell migration regulated in patients with schizophrenia?

Schizophrenia is a psychiatric disorder that affects how a person thinks, feels, and behaves, with potentially severe symptoms.  While we know that susceptibility to this disease runs in families, there are many mysteries about which genes, or “instructions” encoded in DNA, drive schizophrenia.  A paper recently demonstrated that cell migration patterns are altered in patients with schizophrenia – the cells become more motile and less “attached” compared to the same type of cells from healthy patients.  Since genes associated with cell migration have also been implicated in other diseases, we want to identify genes that may be involved in both altered cell migration and schizophrenia.

The Computational Approach: Machine learning to predict disease genes

While experiments can test whether a particular gene is associated with cell migration, we can’t simply test all 20,000 possible genes – it would take way too long, be way too expensive, and the vast majority of the experiments would be uninformative.  Instead, we will develop computational approaches to predict a small subset of candidate genes for further experimental testing.  These in silico experiments (which is just a fancy term for computer-simulated experiments) may not be incredibly accurate, but they will sure be fast!

How do we go about developing a computational method to predict candidate cell migration and schizophrenia-associated genes? As we’ll detail in future blog posts, we will search for these genes within large, publicly available datasets.  We will build a list of the genes that are known to be associated with cell migration or schizophrenia, and then look for other genes that have similar properties to the known genes.  This general technique is called machine learning, where we design instructions for a computer to make predictions.  In our case, we wish to predict whether an unknown gene could be associated with cell migration, schizophrenia, or both.

Experimental Validation: Testing the computational predictions

An important aspect of computational biology research is to experimentally test the predictions to see if we discovered new players involved in schizophrenia and cell migration. In Derek’s lab, the team will test the top candidates in two ways.  First, we will see whether each candidate gene affects cell migration in fly cells by “knocking down” the gene product in the cells and observing the change in cell movement.  Next, we will take the top candidates from the first step and observe migration patterns in fly neuroblasts (cells that are destined to become neurons). Candidate genes that alter migration patterns in fly neuroblasts may also affect neuron cell migration in humans.

There is lots to learn and lots to do!  It will be a fun year – stay tuned.