GCC/BOSC

Last week, we were lucky enough to attend the GCC/BOSC conference hosted at Reed. Although I spent the majority of the week volunteering and attending sessions, we also had a deadline for a two-page extended abstract of our CNB-MAC paper to be submitted to the main conference proceedings. We successfully cut down our original paper into two pages and submitted the abstract on Friday.

As of now, we’re almost ready to dig into the experimental portion of this project, but there are still a few things we need to iron out.  At the end of last week, Alex noticed a “bug” in how we’ve been calculating the AUC of our multi-layer algorithm. Therefore, one of our priorities this week is to find a more accurate AUC calculation that better represents our method. Once that is completed, we will move on to assembling a list of candidates for experimentally validation. Hopefully by the end of the week we will have chosen 8-10 candidates that are not only biologically interesting but conserved in Drosophila as well.

GCC/BOSC & Bug Fixing

This week I spent the majority of my time at the GCC/BOSC Conference both volunteering at and attending talks. Most of the talks concerned the software suite Galaxy and how people were adapting it or extending it for different use cases and new applications.  Not having used Galaxy before nor intending to, these talks did not prove very useful to me. There were some talks that concerned more general biological problems and issues within the scientific community. These were accessible to a wider audience and I was able to learn some aspects of statistics and experimental design from them. This included talks on the importance of talking to the community of people who have the particular condition being studied as well as some concerning best practices on handling data.

I spent some portion of time fixing minor bugs in my code for the Bayesian Weighting Scheme. As a follow-up to a previous post, the run time of the code is now down to 2 or 3 minutes at the most. Using sets and tuples reduced the run time by several orders of magnitude and some other time saving measures also contributed to this much lower number. This is now sufficiently within the time frame of the CREU so the code will probably need much more speeding up.  Next steps involve changing the code to weight the PathLinker Interactome to check if the code works accurately as well as parsing through the math carefully.

Week 4: Parsing Files

This week, I discovered how to download files from the TCGA database, and explored their structure.

So this week, I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database.) These were compressed into a Tar file, which opened into a directory tree that looked like this:

 

Each tiny blue dot is a folder, and inside each folder is one gzipped file (compressed). And each one of these files is a sample from a patient with gene, and its expression (from either a tumor sample or a healthy one.)

So my main issue has been to try and parse these files into a data matrix, ideally with sample-ids on the top and gene expression on the sides. So far, I’ve been able to compress these files into patient and samples relating to them because ideally, we want to look at gene expression in a healthy and tumor sample from the same patient. However, I haven’t been able to write the sample-ids with gene expression data into a file because of multiple bugs and errors. My goal for next week is to get these errors fixed.

Implementing Bayesian Weighting Schemes – Part 2

I spent the majority of the week parsing through various databases and their associated tools to try to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is the integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.

Week 4: Exploring Data

I spent this week developing an understanding of the command line and version control, as well as exploring the data gathered from the TCGA cancer genome data.

First, I discovered that an application I had downloaded to make my life easier (Anaconda) had deleted python 2.7 from my machine and wouldn’t allow me to run things appropriately. This was a huge problem because currently, I need to run programs using python 2.7 and networkx 1.9.1 for dependencies to work out. So, I deleted the installation files and made sure that all of the caches and file folders they resided in were deleted, and then used Homebrew to reinstall python 2 and python 3.  The moral of this story is to always understand what you’re installing.

Secondly, I developed the framework for the bar graph that we will eventually use to analyze which genes are present in each of the datasets we are looking at. We are measuring the number of genes with different levels of gene expression(high, low, or none) in each of the datasets that we have. Each dataset will then have 3 different bars expressing this data.

We’ll be using  the TCGA Genes database,  the number of genes from the PathLinker-2015 interactome, PathLinker-2018 interactome, the Wnt Pathway from NetPath, the PathLinker-2015 interactome’s top 1000 paths, the PathLinker-2018 interactome’s top 1000 paths, and finally the Localized PathLinker-2015 interactome’s top 1000 paths, as well as the Loc_PL 2018 interactome’s top 1000 paths.

Week 4: Obtaining and Processing Data

This week our main goal has been to find a pipeline to obtain TCGA data in a neat form. We discovered UCSC’s Xena Browser, which has files from the TCGA and a number of other databases.

Last week, we used the data from FireBrowse to make a graph of the genes that have patients with abnormally high or low levels of expression.

Number of patients that have abnormally high or low levels of expression

This week we changed that graph slightly by showing the difference between the number of patients with high expression and the number of patients with low expression by gene.

Number of patients by gene that have abnormally high or abnormally low levels of gene expression.

It is interesting to me that there are generally more patients with severe under-expression rather than severe-overexpression. I wonder if this is because these genes play a role in suppressing tumors, and that therefore maybe under-expression is more likely to cause cancer than overexpression?

I also worked on integrating gene expression data from Xena into our graph of the Wnt pathway.

Wnt Pathway from PathLinker with Gene Expression Data. Red = high, orange = medium-high, yellow= medium, green = low, blue(not shown) = very low, white(not shown) = no expression. Triangles are transcription factors, squares are receptors, circles are intermediate proteins.

Kathy figured out what was wrong with PathLinker the first time we ran it and re-ran it. I am working on turning it into a graph, but the input data is very different because it’s coming from a different version of NetPath so I need to change the program to be able to process the new data.

I also noticed while I was processing the expression data from Xena that there was a large amount of variability in gene expression between patients. I’m currently working on several things. Instead of just averaging gene expression for genes I’m comparing gene expression patient-by-patient so I’m comparing a tumor sample to a normal tissue sample for every patient. I also want to come up with a way to visualize the variance of expression among patients, because the more variance there is the less significant differences in expression between cancerous and normal tissue are. Anna suggested I do this by making the borders on nodes with high variance thicker. I am also going back and checking my math on gene expression to make sure that it is actually statistically significant and is conducted in a way that is similar to how other researchers have done similar research in the past.

Yen’s k shortest paths

This week, we split up to delve further into the PathLinker paper according to our aims. I was assigned to research Yen’s k-shortest paths algorithm and present to the lab on Wednesday.

I’m now going to attempt to explain this algorithm–bear with me.

Some caveats: first, this is an algorithm for computing the k-shortest loopless paths from one node to another, and this algorithm only considers simple paths, where nodes may not be repeated.

To compute the k-shortest paths (where k is a user-defined parameter, say 10 or 4), we first find the shortest path (k=1) using some sort of algorithm, say Dijkstra’s.

Here’s our wonderful graph from Graphspace:

Start Node End Node Weight
a b 1
a c 1
a e 9
b e 3
b c 4
c e 2

So, using our graph, let’s find the shortest path from a to e (P_1). Looking at the graph, we can see that it is a->c->e, with cost 4. Now what we do is loop through all paths using this first path as a guide, because intuitively, there might be some obvious edges that we use in many of the shortest paths.

Let’s call what we’re using from P_1 the root path, and call what we use when we diverge from it the spur path. To find P_2 (the 2nd shortest path), we must go through the graph differently this time to make sure we just don’t get the shortest path again. Because we’re using P_1 as a guide, let’s pretend that the edge from a->c has infinite cost, so that we can never take that edge.

The paths that result from the root path being (a) is then [a ->b->e (4), a->e(9), a->b->c->e(7)].

Then, we expand our root path to be a->c,  and pretend that the edge c->e has infinite cost, so the next path is [a->c->b->e (8)].

We combine the two lists of potential paths and look at their costs. [a->b->e (4), a->e(9),a->b->c->e(7), a->c->b->e(8)]. a->b->e is the 2nd shortest path (P_2)!

Then, we can simply search through the rest of the list for the next shortest path (P_3,…,P_k).

 

Week 2: Understanding Wnt/β-catenin Signaling and the Ryk-CFTR-Dab2 Path

This week, Kathy, Usman, and I split up to learn more about different aspects of PathLinker. I researched the Wnt/β-catenin signaling pathway, precision-recall curves, and the role of CFTR and Dab2. On Wednesday, we all presented the topics we had researched to each other and Anna and Ibrahim.

To briefly summarize what I learned, the Wnt/β-catenin pathway regulates the transcription of certain genes related to cell proliferation, cell attachment, and growth. When the Wnt/β-catenin signaling pathway is dysregulated, a number of pathologies can develop including cancer and heart disease. In fact, in a study on colon adenocarcinoma by the Cancer Genome Atlas, 93% of tested tumors had a mutation that affected the Wnt/β-catenin signaling pathway (TCGA, 2012). At the most basic level theWnt/β-catenin signaling pathway is turned on when Wnt proteins bind to “frizzled” a 7-pass transmembrane receptor which halts the destruction of β-catenin in the cell. Usually, when the pathway is off, β-catenin is constantly being produced and destroyed. When β-catenin destruction is interrupted, β-catenin will build up in the cytoplasm and move into the nucleus where it will bind to LEF and the TCF promoter to promote the transcription of specific genes that were previously being inhibited by a transcription factor that was bound to TCF. The Wnt/β-catenin signaling pathway involves many proteins and interactions, some of which we still do not understand. The PathLinker algorithm succesfully identified a path in the Wnt/β-catenin signaling pathway that was not in the KEGG or NetPath database: the Ryk-CFTR-Dab2 path (Ritz et al, 2016).  The authors of the paper hypothesized that Ryk would interact with CFTR (Cystic Fibrosis Transmembrane-conductance Regulator), a chlorine ion channel previously studied for its role in cystic fibrosis, which would activate Dab2 to inhibit β-catenin activity. This hypothesis was experimentally confirmed by silencing Ryk, CFTR and Dab2 individually using RNA interference and then testing transcription levels of β-catenin controlled genes using a TCF/LEF luciferase activity and levels of β-catenin in the cell using a Western Blot assay.

We also attended a data management workshop taught by David Isaak, Reed’s data science librarian to learn how to use Git and GitHub. Because I’ve never really used terminal before (I usually use repl.it), I worked through a command line tutorial to learn more about it.

On Thursday, Kathy, Usman, and I went through our Dijkstra code with Anna, and finally got it to work the way we wanted it to. Anna added us to the cancer-linker repository on GitHub so we can all begin to actually work on PathLinker now.

My next project is to figure out how to take data from FireBrowse, a database containing data from TCGA on 38 different cancer types, put it into a format we can use, and apply several statistical analyses to the information. I’m hoping to base it off of PepperPathway, a program written by Nicholas Egan, a student in Anna’s lab last summer, that uses data from FireBrowse, GeneCards and NetPath and to create a visual representation on GraphSpace. I’m struggling to make sense of the PepperPathway program because there are a lot of files on the GitHub repository and I’m not really sure where to begin. I’m going to try to make sense of it this weekend.

References

  1. TCGA Research Network.  Comprehensive Molecular Characterization of         Human Colon and Rectal Tumors.  July 19, 2012. Nature. DOI:           10.1038/nature11252.
  2. Ritz, Anna et al. “Pathways on Demand: Automated Reconstruction of  Human Signaling Networks.” Npj Systems Biology And Applications 2 (2016): 16002. Web.

Week 1

Aim of Research Project

My research project aims to extend upon the original PATHLINKER paper. In particular, I hope to investigate different weighting schemes both from the perspective of utilizing different data sets and data-summary methods as well as potentially maybe different statistical tools for manipulating the data in meaningful ways. This is motivated by the fact that there appears to be little overlap between interactomes consisting of the same proteins but weighted differently. Therefore, this project will allow us to analyze different schemes and develop an heuristic for weighting interactomes. As a result, we will be able to apply algorithms like PATHFINDER on graphs such that they provide results that have valid biological meaning rather than being affected by superficial factors such as trends in scientific research or biases of the scientific community. Right now, I am focusing on Bayesian Weighting schemes which attempt to use experimental evidence to provide a weight that represent the probability of a given interaction occurring in a given signaling pathway by leveraging Bayes’ theorem. I will talk more about this in future posts.

Week 1 Progress

Week 1 was spent mostly acquainting myself with the necessary prerequisite knowledge required for research during coming weeks such as Dijkstra’s algorithm (and its implementation) and Bayesian Weighting Schemes and in that regard reading some relevant papers. For the sake of keeping track the particular papers were titled :-

  1. Pathways on demand: automated reconstruction of human signalling networks
  2. Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes
  3. Bridging high-throughput genetic and transactional data reveals cellular responses to alpha-synuclein toxicity.

The first was the original PATH LINKER paper which our research hopes to extend upon. The next two provided some information on Bayesian Weighting but did not provide too much else due to some ambiguous mathematical notation. I hope to attempt to write a document that annotates the relevant portions of the papers for the sake of clarity and future reference.