Week 8: Putting Everything Together

This week Kathy, Usman and I  met to discuss how we would combine the projects we’ve been working on into a cohesive pathway and how we would analyze the output of CancerLinker as compared to PathLinker.

I have been working on ways to visualize the data. One thing I wanted to look at was how incorporating gene expression data would change the overall distribution of edge weights in the interactome.

The original interactome has a reasonable distribution with two values that seem to appear frequently around 0.4 and 0.8.

Gene expression data was incorporated into the original interactome with a beta value. The beta value determines the weight of the original edge weight when including the gene expression data. So the higher the beta score, the lower the importance of the gene expression data. I made three histograms one for a beta=0.25, one for beta = 0.5 (equal contribution between original edge weight and gene expression data) and one for 0.75

As can be seen in the histograms above, when gene expression data is weighted more heavily, edges weights are more closely clustered. To investigate why this occurred I made two additional histograms: one of the gene expression data before it was transformed and one of the gene expression data after it was transformed.

After the gene expression data is transformed, there is extremely little variation. Additionally all the gene expression data is greater than or equal to 0.5 which should be the median. This would explain why weighting the gene expression data more heavily causes more closely-clustered edge weights.  I’m not sure how to fix this. It seems like an error, but I’ve been over the code multiple times and the math seems right to me. So my next step is to figure out what’s going on there.

In the meantime I took the output from the pipeline that Kathy put together and put graphs for the top 1000 Wnt paths with β=0.25, 0.5 and 0.75 up on GraphSpace. If I figure out what’s wrong with the function that transforms the edgeweights, I will run it again and re-upload the updated graphs.

Week 7: Node Scoring

I have spent most of this week trying to transform the gene expression data and implement the equations that Ibrahim came up with to incorporate gene expression data onto edge scores.

I have had several major hold-ups to accomplishing this.  The first is pretty simple, I just don’t know how to get the kind of data output I want from the CDF function because the scipy.norm.cdf() function outputs an array and I want a single value for each node.

The biggest issue is that the TCGA data on gene expression and PathLinker use different gene nomenclature systems. The TCGA relies on Ensembl IDs whereas PathLinker relies on UniProt IDs. Originally I tried to convert all the Ensembl IDs to common names to UniProt IDs because I have a file that contains UniProt IDs and common names, and another file that contains Ensembl IDs and common names. However, this didn’t work because a single gene may have multiple common names and may refer to multiple UniProt IDs (for example there are about 20 UniProt IDs that correspond to the HLA-A gene).

Therefore, to avoid the loss of data due to various common names, I tried to make a dictionary of Ensembl IDs directly to UniProt IDs. I was able to obtain a file that contained both UniProt and Ensembl IDs from the HUGO Gene Nomenclature Committee website. However, converting between Ensemble and UniProt came with its own problems. First of all, there are many genes that either only have UniProt IDs or only have Ensembl IDs. 997 genes in the interactome were unable to be converted from one to the other because of this reason. In addition, many of the Ensembl IDs in the TCGA file (over 16,000) don’t line up with any of the Ensembl IDs in the dictionary I constructed.  I think this might be because the Ensembl IDs in the TCGA file include version numbers at the end of each ID. The version number is the decimal point at the end of the ID name. For example for the gene, “ENSG00000242268.2”, the “.2” means that this is the 2nd version of that gene. I think one way to fix this problem might be to just take all the version numbers out of the gene name when constructing the dictionary from the TCGA file. However, I can’t figure out how to do this without making the code super slow. If every version number were a single decimal place (ie: .1, .2, .3 … etc), I would just cut off the last two digits of each name which wouldn’t be that slow. However different IDs have different number of decimal places. For example, if I cut off the last two digits of “ENSG00000167578.15” I would still be left with the decimal point. Therefore the only way I can currently think of to get rid of the decimal and everything following it is to use a for-loop that goes through every character in the name and if the character is a decimal point to cut the string there. However, if the program has to go through every letter in every gene name, it’s going to be extremely slow. Maybe something I could do is pre-process the data to create a text file that contains gene IDs without the decimal places so it only has to do it one time and won’t slow the whole code down, but I feel like there has to be a faster way to do it within the program.

Week 6: GCC-BOSC

This week, my coworkers and I attended the GCC-BOSC (Galaxy Community Conference in association with the Bioinformatics Open Source Conference), so there was little to no research done.

Week 4: Parsing Files

This week, I discovered how to download files from the TCGA database, and explored their structure.

So this week, I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database.) These were compressed into a Tar file, which opened into a directory tree that looked like this:

Each tiny blue dot is a folder, and inside each folder is one gzipped file (compressed). And each one of these files is a sample from a patient with gene, and its expression (from either a tumor sample or a healthy one.)

So my main issue has been to try and parse these files into a data matrix, ideally with sample-ids on the top and gene expression on the sides. So far, I’ve been able to compress these files into patient and samples relating to them because ideally, we want to look at gene expression in a healthy and tumor sample from the same patient. However, I haven’t been able to write the sample-ids with gene expression data into a file because of multiple bugs and errors. My goal for next week is to get these errors fixed.

Implementing Bayesian Weighting Schemes – Part 2

I spent the majority of the week parsing through various databases and their associated tools to try to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is the integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.

Week 4: Exploring Data

I spent this week developing an understanding of the command line and version control, as well as exploring the data gathered from the TCGA cancer genome data.

First, I discovered that an application I had downloaded to make my life easier (Anaconda) had deleted python 2.7 from my machine and wouldn’t allow me to run things appropriately. This was a huge problem because currently, I need to run programs using python 2.7 and networkx 1.9.1 for dependencies to work out. So, I deleted the installation files and made sure that all of the caches and file folders they resided in were deleted, and then used Homebrew to reinstall python 2 and python 3.  The moral of this story is to always understand what you’re installing.

Secondly, I developed the framework for the bar graph that we will eventually use to analyze which genes are present in each of the datasets we are looking at. We are measuring the number of genes with different levels of gene expression(high, low, or none) in each of the datasets that we have. Each dataset will then have 3 different bars expressing this data.

We’ll be using  the TCGA Genes database,  the number of genes from the PathLinker-2015 interactome, PathLinker-2018 interactome, the Wnt Pathway from NetPath, the PathLinker-2015 interactome’s top 1000 paths, the PathLinker-2018 interactome’s top 1000 paths, and finally the Localized PathLinker-2015 interactome’s top 1000 paths, as well as the Loc_PL 2018 interactome’s top 1000 paths.

Week 4: Obtaining and Processing Data

This week our main goal has been to find a pipeline to obtain TCGA data in a neat form. We discovered UCSC’s Xena Browser, which has files from the TCGA and a number of other databases.

Last week, we used the data from FireBrowse to make a graph of the genes that have patients with abnormally high or low levels of expression.

This week we changed that graph slightly by showing the difference between the number of patients with high expression and the number of patients with low expression by gene.

It is interesting to me that there are generally more patients with severe under-expression rather than severe-overexpression. I wonder if this is because these genes play a role in suppressing tumors, and that therefore maybe under-expression is more likely to cause cancer than overexpression?

I also worked on integrating gene expression data from Xena into our graph of the Wnt pathway.

Kathy figured out what was wrong with PathLinker the first time we ran it and re-ran it. I am working on turning it into a graph, but the input data is very different because it’s coming from a different version of NetPath so I need to change the program to be able to process the new data.

I also noticed while I was processing the expression data from Xena that there was a large amount of variability in gene expression between patients. I’m currently working on several things. Instead of just averaging gene expression for genes I’m comparing gene expression patient-by-patient so I’m comparing a tumor sample to a normal tissue sample for every patient. I also want to come up with a way to visualize the variance of expression among patients, because the more variance there is the less significant differences in expression between cancerous and normal tissue are. Anna suggested I do this by making the borders on nodes with high variance thicker. I am also going back and checking my math on gene expression to make sure that it is actually statistically significant and is conducted in a way that is similar to how other researchers have done similar research in the past.

Yen’s k shortest paths

This week, we split up to delve further into the PathLinker paper according to our aims. I was assigned to research Yen’s k-shortest paths algorithm and present to the lab on Wednesday.

I’m now going to attempt to explain this algorithm–bear with me.

Some caveats: first, this is an algorithm for computing the k-shortest loopless paths from one node to another, and this algorithm only considers simple paths, where nodes may not be repeated.

To compute the k-shortest paths (where k is a user-defined parameter, say 10 or 4), we first find the shortest path (k=1) using some sort of algorithm, say Dijkstra’s.

Here’s our wonderful graph from Graphspace:

Start Node End Node Weight
a b 1
a c 1
a e 9
b e 3
b c 4
c e 2

So, using our graph, let’s find the shortest path from a to e (P_1). Looking at the graph, we can see that it is a->c->e, with cost 4. Now what we do is loop through all paths using this first path as a guide, because intuitively, there might be some obvious edges that we use in many of the shortest paths.

Let’s call what we’re using from P_1 the root path, and call what we use when we diverge from it the spur path. To find P_2 (the 2nd shortest path), we must go through the graph differently this time to make sure we just don’t get the shortest path again. Because we’re using P_1 as a guide, let’s pretend that the edge from a->c has infinite cost, so that we can never take that edge.

The paths that result from the root path being (a) is then [a ->b->e (4), a->e(9), a->b->c->e(7)].

Then, we expand our root path to be a->c,  and pretend that the edge c->e has infinite cost, so the next path is [a->c->b->e (8)].

We combine the two lists of potential paths and look at their costs. [a->b->e (4), a->e(9),a->b->c->e(7), a->c->b->e(8)]. a->b->e is the 2nd shortest path (P_2)!

Then, we can simply search through the rest of the list for the next shortest path (P_3,…,P_k).

Week 2: Understanding Wnt/β-catenin Signaling and the Ryk-CFTR-Dab2 Path

This week, Kathy, Usman, and I split up to learn more about different aspects of PathLinker. I researched the Wnt/β-catenin signaling pathway, precision-recall curves, and the role of CFTR and Dab2. On Wednesday, we all presented the topics we had researched to each other and Anna and Ibrahim.

To briefly summarize what I learned, the Wnt/β-catenin pathway regulates the transcription of certain genes related to cell proliferation, cell attachment, and growth. When the Wnt/β-catenin signaling pathway is dysregulated, a number of pathologies can develop including cancer and heart disease. In fact, in a study on colon adenocarcinoma by the Cancer Genome Atlas, 93% of tested tumors had a mutation that affected the Wnt/β-catenin signaling pathway (TCGA, 2012). At the most basic level theWnt/β-catenin signaling pathway is turned on when Wnt proteins bind to “frizzled” a 7-pass transmembrane receptor which halts the destruction of β-catenin in the cell. Usually, when the pathway is off, β-catenin is constantly being produced and destroyed. When β-catenin destruction is interrupted, β-catenin will build up in the cytoplasm and move into the nucleus where it will bind to LEF and the TCF promoter to promote the transcription of specific genes that were previously being inhibited by a transcription factor that was bound to TCF. The Wnt/β-catenin signaling pathway involves many proteins and interactions, some of which we still do not understand. The PathLinker algorithm succesfully identified a path in the Wnt/β-catenin signaling pathway that was not in the KEGG or NetPath database: the Ryk-CFTR-Dab2 path (Ritz et al, 2016).  The authors of the paper hypothesized that Ryk would interact with CFTR (Cystic Fibrosis Transmembrane-conductance Regulator), a chlorine ion channel previously studied for its role in cystic fibrosis, which would activate Dab2 to inhibit β-catenin activity. This hypothesis was experimentally confirmed by silencing Ryk, CFTR and Dab2 individually using RNA interference and then testing transcription levels of β-catenin controlled genes using a TCF/LEF luciferase activity and levels of β-catenin in the cell using a Western Blot assay.

We also attended a data management workshop taught by David Isaak, Reed’s data science librarian to learn how to use Git and GitHub. Because I’ve never really used terminal before (I usually use repl.it), I worked through a command line tutorial to learn more about it.

On Thursday, Kathy, Usman, and I went through our Dijkstra code with Anna, and finally got it to work the way we wanted it to. Anna added us to the cancer-linker repository on GitHub so we can all begin to actually work on PathLinker now.

My next project is to figure out how to take data from FireBrowse, a database containing data from TCGA on 38 different cancer types, put it into a format we can use, and apply several statistical analyses to the information. I’m hoping to base it off of PepperPathway, a program written by Nicholas Egan, a student in Anna’s lab last summer, that uses data from FireBrowse, GeneCards and NetPath and to create a visual representation on GraphSpace. I’m struggling to make sense of the PepperPathway program because there are a lot of files on the GitHub repository and I’m not really sure where to begin. I’m going to try to make sense of it this weekend.

References

1. TCGA Research Network.  Comprehensive Molecular Characterization of         Human Colon and Rectal Tumors.  July 19, 2012. Nature. DOI:           10.1038/nature11252.
2. Ritz, Anna et al. “Pathways on Demand: Automated Reconstruction of  Human Signaling Networks.” Npj Systems Biology And Applications 2 (2016): 16002. Web.

Week 1

Aim of Research Project

My research project aims to extend upon the original PATHLINKER paper. In particular, I hope to investigate different weighting schemes both from the perspective of utilizing different data sets and data-summary methods as well as potentially maybe different statistical tools for manipulating the data in meaningful ways. This is motivated by the fact that there appears to be little overlap between interactomes consisting of the same proteins but weighted differently. Therefore, this project will allow us to analyze different schemes and develop an heuristic for weighting interactomes. As a result, we will be able to apply algorithms like PATHFINDER on graphs such that they provide results that have valid biological meaning rather than being affected by superficial factors such as trends in scientific research or biases of the scientific community. Right now, I am focusing on Bayesian Weighting schemes which attempt to use experimental evidence to provide a weight that represent the probability of a given interaction occurring in a given signaling pathway by leveraging Bayes’ theorem. I will talk more about this in future posts.

Week 1 Progress

Week 1 was spent mostly acquainting myself with the necessary prerequisite knowledge required for research during coming weeks such as Dijkstra’s algorithm (and its implementation) and Bayesian Weighting Schemes and in that regard reading some relevant papers. For the sake of keeping track the particular papers were titled :-

The first was the original PATH LINKER paper which our research hopes to extend upon. The next two provided some information on Bayesian Weighting but did not provide too much else due to some ambiguous mathematical notation. I hope to attempt to write a document that annotates the relevant portions of the papers for the sake of clarity and future reference.