Week 4: Exploring Data

I spent this week developing an understanding of the command line and version control, as well as exploring the data gathered from the TCGA cancer genome data.

First, I discovered that an application I had downloaded to make my life easier (Anaconda) had deleted python 2.7 from my machine and wouldn’t allow me to run things appropriately. This was a huge problem because currently, I need to run programs using python 2.7 and networkx 1.9.1 for dependencies to work out. So, I deleted the installation files and made sure that all of the caches and file folders they resided in were deleted, and then used Homebrew to reinstall python 2 and python 3.  The moral of this story is to always understand what you’re installing.

Secondly, I developed the framework for the bar graph that we will eventually use to analyze which genes are present in each of the datasets we are looking at. We are measuring the number of genes with different levels of gene expression(high, low, or none) in each of the datasets that we have. Each dataset will then have 3 different bars expressing this data.

We’ll be using  the TCGA Genes database,  the number of genes from the PathLinker-2015 interactome, PathLinker-2018 interactome, the Wnt Pathway from NetPath, the PathLinker-2015 interactome’s top 1000 paths, the PathLinker-2018 interactome’s top 1000 paths, and finally the Localized PathLinker-2015 interactome’s top 1000 paths, as well as the Loc_PL 2018 interactome’s top 1000 paths.

Yen’s k shortest paths

This week, we split up to delve further into the PathLinker paper according to our aims. I was assigned to research Yen’s k-shortest paths algorithm and present to the lab on Wednesday.

I’m now going to attempt to explain this algorithm–bear with me.

Some caveats: first, this is an algorithm for computing the k-shortest loopless paths from one node to another, and this algorithm only considers simple paths, where nodes may not be repeated.

To compute the k-shortest paths (where k is a user-defined parameter, say 10 or 4), we first find the shortest path (k=1) using some sort of algorithm, say Dijkstra’s.

Here’s our wonderful graph from Graphspace:

Start Node End Node Weight
a b 1
a c 1
a e 9
b e 3
b c 4
c e 2

So, using our graph, let’s find the shortest path from a to e (P_1). Looking at the graph, we can see that it is a->c->e, with cost 4. Now what we do is loop through all paths using this first path as a guide, because intuitively, there might be some obvious edges that we use in many of the shortest paths.

Let’s call what we’re using from P_1 the root path, and call what we use when we diverge from it the spur path. To find P_2 (the 2nd shortest path), we must go through the graph differently this time to make sure we just don’t get the shortest path again. Because we’re using P_1 as a guide, let’s pretend that the edge from a->c has infinite cost, so that we can never take that edge.

The paths that result from the root path being (a) is then [a ->b->e (4), a->e(9), a->b->c->e(7)].

Then, we expand our root path to be a->c,  and pretend that the edge c->e has infinite cost, so the next path is [a->c->b->e (8)].

We combine the two lists of potential paths and look at their costs. [a->b->e (4), a->e(9),a->b->c->e(7), a->c->b->e(8)]. a->b->e is the 2nd shortest path (P_2)!

Then, we can simply search through the rest of the list for the next shortest path (P_3,…,P_k).

 

Week 2: Understanding Wnt/β-catenin Signaling and the Ryk-CFTR-Dab2 Path

This week, Kathy, Usman, and I split up to learn more about different aspects of PathLinker. I researched the Wnt/β-catenin signaling pathway, precision-recall curves, and the role of CFTR and Dab2. On Wednesday, we all presented the topics we had researched to each other and Anna and Ibrahim.

To briefly summarize what I learned, the Wnt/β-catenin pathway regulates the transcription of certain genes related to cell proliferation, cell attachment, and growth. When the Wnt/β-catenin signaling pathway is dysregulated, a number of pathologies can develop including cancer and heart disease. In fact, in a study on colon adenocarcinoma by the Cancer Genome Atlas, 93% of tested tumors had a mutation that affected the Wnt/β-catenin signaling pathway (TCGA, 2012). At the most basic level theWnt/β-catenin signaling pathway is turned on when Wnt proteins bind to “frizzled” a 7-pass transmembrane receptor which halts the destruction of β-catenin in the cell. Usually, when the pathway is off, β-catenin is constantly being produced and destroyed. When β-catenin destruction is interrupted, β-catenin will build up in the cytoplasm and move into the nucleus where it will bind to LEF and the TCF promoter to promote the transcription of specific genes that were previously being inhibited by a transcription factor that was bound to TCF. The Wnt/β-catenin signaling pathway involves many proteins and interactions, some of which we still do not understand. The PathLinker algorithm succesfully identified a path in the Wnt/β-catenin signaling pathway that was not in the KEGG or NetPath database: the Ryk-CFTR-Dab2 path (Ritz et al, 2016).  The authors of the paper hypothesized that Ryk would interact with CFTR (Cystic Fibrosis Transmembrane-conductance Regulator), a chlorine ion channel previously studied for its role in cystic fibrosis, which would activate Dab2 to inhibit β-catenin activity. This hypothesis was experimentally confirmed by silencing Ryk, CFTR and Dab2 individually using RNA interference and then testing transcription levels of β-catenin controlled genes using a TCF/LEF luciferase activity and levels of β-catenin in the cell using a Western Blot assay.

We also attended a data management workshop taught by David Isaak, Reed’s data science librarian to learn how to use Git and GitHub. Because I’ve never really used terminal before (I usually use repl.it), I worked through a command line tutorial to learn more about it.

On Thursday, Kathy, Usman, and I went through our Dijkstra code with Anna, and finally got it to work the way we wanted it to. Anna added us to the cancer-linker repository on GitHub so we can all begin to actually work on PathLinker now.

My next project is to figure out how to take data from FireBrowse, a database containing data from TCGA on 38 different cancer types, put it into a format we can use, and apply several statistical analyses to the information. I’m hoping to base it off of PepperPathway, a program written by Nicholas Egan, a student in Anna’s lab last summer, that uses data from FireBrowse, GeneCards and NetPath and to create a visual representation on GraphSpace. I’m struggling to make sense of the PepperPathway program because there are a lot of files on the GitHub repository and I’m not really sure where to begin. I’m going to try to make sense of it this weekend.

References

  1. TCGA Research Network.  Comprehensive Molecular Characterization of         Human Colon and Rectal Tumors.  July 19, 2012. Nature. DOI:           10.1038/nature11252.
  2. Ritz, Anna et al. “Pathways on Demand: Automated Reconstruction of  Human Signaling Networks.” Npj Systems Biology And Applications 2 (2016): 16002. Web.

Week 1

Sol, Usman, and I have started working on summer research to modify the PathLinker algorithm.

This week, we plan to finish the code implementing Dijkstra’s algorithm on toy datasets, which was used as a warmup exercise in order to acclimatize us back into working and coding. In addition, I’ve been assigned to understand Yen’s k-shortest paths algorithm, which is one of the foundations of PathLinker.

Kicking Off the Collaborative REU

As the summer winds down and classes begin at Reed College, we are excited to begin a new project that sits at the intersection of computer science and biology.  With mentoring expertise on both sides of the aisle (Anna is a computer scientist, and Derek is a cell biologist), our interdisciplinary team will apply computer science techniques to predict potential players in disease.

The Biological Question: How is cell migration regulated in patients with schizophrenia?

Schizophrenia is a psychiatric disorder that affects how a person thinks, feels, and behaves, with potentially severe symptoms.  While we know that susceptibility of this disease runs in families, there are many mysteries about which genes, or “instructions” encoded in DNA, drive schizophrenia.  A paper recently demonstrated that cell migration patterns are altered in patients with schizophrenia – the cells become more motile and less “attached” compared to the same type of cells from healthy patients.  Since genes associated with cell migration have also been implicated in other diseases, we want to identify genes that may be potentially involved in altered cell migration and schizophrenia.

The Computational Approach: Machine learning to predict disease genes

While experiments can test whether a particular gene is associated with cell migration, we can’t simply test all 20,000 possible genes – it would take way too long, be way too expensive, and a vast majority of the experiments will be uninformative.  Instead, we will develop computational approaches to predict a small subset of candidate genes for further experimental testing.  These in silico experiments (which is just a fancy word for computer-simulated experiments) may not be incredibly accurate, but they will sure be fast!

How do we go about developing a computational method to predict candidate cell migration and schizophrenia-associated genes? As we’ll detail in future blog posts, we will search for these genes within large, publicly-available datasets.  We will build a list of the genes that are known to be associated with cell migration or schizophrenia, and then look for other genes that have similar properties to the known genes.  This general technique is called machine learning, where we design instructions for a computer to make predictions.  In our case, we wish to predict whether an unknown gene could be associated with cell migration, schizophrenia, or both.

Experimental Validation: Testing the computational predictions

An important aspect of computational biology research is to experimentally test the predictions to see if we discovered new players involved in schizophrenia and cell migration. In Derek’s lab, the team will test the top candidates in two ways.  First, will see whether each candidate gene affects cell migration in fly cells by “knocking down” the gene product in the cells and observing the change in cell movement.  Next, we will take the top candidates from the first step and observe migration patterns in fly neuroblasts (cells that are destined to become neurons). From these experiments, candidate genes that alter migration patterns in fly neuroblasts may affect neuron cell migration in humans.

There is lots to learn and lots to do!  It will be a fun year – stay tuned.