Implementing Bayesian Weighting – Part 3

I finally managed to download all of the GO Terms and spent some time processing the data and editing previous code so that it all fit together. After having done so and obtained a working version of the code, an initial run revealed that the run time of weighting the entire HIPPIE Interactome would be approximately 101 days. This is outside the duration of the CREU so clearly something needed to be done. To fully explain the issues that lead to such a terrible run-time it is important to first have a rough understanding of the structure of the code.

This iteration of the code first produced a list of true positives and a list of true negatives on the criteria of GO Term co-annotation. This section of the code takes about 4 minutes (which can also be potentially sped up later on) but this negligibly contributes to the 3 month run time. The next portion implements functions that in conjunction calculate the cost of a given node. The necessary inputs are

  1.   A list of all the evidence types.
  2.   A dictionary where the keys are the evidence type labels and the values are lists (evidence lists) whose elements are the protein edges (represented as lists) which are confirmed to occur by that particular evidence type
  3.   A list of true positives and a list of true negatives.

Finding the cost involves calculating the intersection and symmetric difference of the positives with the evidence lists and then the intersection and symmetric difference of the negatives with the  evidence lists. This was done each for each evidence list “every” time a new edge was weighted. This approach took roughly 2.5+ minutes per edge. With 93138 edges, this was the main contributor to the 3 month run time. Further inspection reveals that the time-complexity of performing these operations on a set is  O(min(len(s), len(t)) and O(max(len(s), len(t)) for a list. Performing this operation every single time was therefore very costly. Simply calculating these numbers for each edge at the outset and then appending them in a particular order to the end of each evidence list reduced the time needed to weight each edge down to approximately 0.45 seconds thus shaving down the total run-time to about 11 hours.

Another change that will significantly reduce the run time further is using sets instead of lists for the protein edges in the dictionary and representing the edges themselves as tuples. This will mean creating a different dictionary with identical keys but the values will the four calculated values mentioned above. This will supposedly shave off a large chunk of time because element look up in a set is significantly faster than the same operation in a list. Another idea is to change the order of calculations in a way that reduces the amount of time the exact same calculation is executed or to calculate the edges in a way that iterates over the approximately 230 evidence types rather than the 93138 edges.

Week 5: Visualizing Data

This week I have continued trying to visualize and integrate gene expression data from GDC TCGA onto nodes in Graphspace. This is the graph I have so far:

Wnt Pathway with Gene Expression Data

The color of the nodes on the graph represent the mean foldchange of the gene. The foldchange is the ratio of gene expression of a cancerous tissue sample to a normal tissue sample from the same patient. There were 41 patients with colon adenocarcinoma who had gene expression data from both a tumor and a normal sample. The foldchange for each of these 41 patients was averaged to find the mean foldchange of each gene. Light blue indicates that the gene is slightly underexpressed in cancerous samples (less than 1 standard deviation), dark blue(not shown) indicates that the gene is underexpressed by 2 standard deviations, purple (not shown) indicates that the gene is underexpressed by greater than 2 standard deviations. Orange indicates that the gene is slightly overexpressed (within one standard deviation), dark orange indicates that the gene is overexpressed by 2 standard deviations, and red indicates that the gene is overexpressed by greater than 2 standard deviations.

One problem that came up was that some genes have a lot more variance in patient data than others. (Data on variance is available by clicking on the nodes in Graphspace). Additionally, patients that no gene expression or close to no gene expression in their normal samples had extremely high foldchanges that threw off the mean. It’s tempting to just throw away those samples as outliers, however several samples had multiple “outliers”. As an example I’ve included three sets of line graphs I made that show the gene expression data  and foldchange of each patient for three different genes:

CFTR Expression and Foldchange Line Graphs for 41 patients

In the first example, CFTR, one can see that cancerous samples tend to have lower gene regulation than healthy samples and most of the fold change values hover around 1.0, meaning that the change is fairly low.

FZD9 Gene expression and foldchange for 41 patients

The second example, FZD9, is slightly different. The FZD9 expression data is much less uniform. In some patients, FZD9 is largely overregulated in cancerous samples. In other, it’s largely under regulated. The foldchange data shows foldchanges that range from zero to extremely high values (greater than 30,0000). This occurs because those patients had normal sample values of 0.0. In this case, it looks like dismissing the two extremely high foldchanges as outliers would yield a more realistic data set.

ZSCAN4 Gene Expression and Foldchanges

However, in ZSCAN4, there are many patients with extremely high or low foldchanges. Instead of ignoring these patients, I think I need to find a way to normalize the data so that a few large foldchanges don’t completely throw off the data.

The next step in this project is to work with Usman and Kathy to integrate the gene expression data onto the edge weights so that PathLinker can calculate the most disregulated protein pathways. To begin this process, I’ve written a program that produces a text file that contains for each gene the ID, common name, mean foldchange, standard deviation, and either a +1, 0, or -1, which represents whether it’s under-regulated, unchanged, or over-regulated in cancerous samples. Right now the +1, 0, and -1 are really just placeholder values. I need to meet with Ibrahim and Anna to discuss how to better weight the nodes.

Implementing Bayesian Weighting Schemes – Part 2

I spent the majority of the week parsing through various databases and their associated tools to try to download GO Terms to form the set of positives and negatives required for the Bayesian Weighting Scheme. There appeared to be no easy way to download the GO annotations for specific proteins or to download specific GO Terms. Therefore, I simply downloaded a file containing all the GO annotations and stripped it of the unnecessary portions to form the positives and negatives. All that remains now is the integrate this with the remaining code for the weighting scheme. That will be the main objective of Week 5.

Week 4: Exploring Data

I spent this week developing an understanding of the command line and version control, as well as exploring the data gathered from the TCGA cancer genome data.

First, I discovered that an application I had downloaded to make my life easier (Anaconda) had deleted python 2.7 from my machine and wouldn’t allow me to run things appropriately. This was a huge problem because currently, I need to run programs using python 2.7 and networkx 1.9.1 for dependencies to work out. So, I deleted the installation files and made sure that all of the caches and file folders they resided in were deleted, and then used Homebrew to reinstall python 2 and python 3.  The moral of this story is to always understand what you’re installing.

Secondly, I developed the framework for the bar graph that we will eventually use to analyze which genes are present in each of the datasets we are looking at. We are measuring the number of genes with different levels of gene expression(high, low, or none) in each of the datasets that we have. Each dataset will then have 3 different bars expressing this data.

We’ll be using  the TCGA Genes database,  the number of genes from the PathLinker-2015 interactome, PathLinker-2018 interactome, the Wnt Pathway from NetPath, the PathLinker-2015 interactome’s top 1000 paths, the PathLinker-2018 interactome’s top 1000 paths, and finally the Localized PathLinker-2015 interactome’s top 1000 paths, as well as the Loc_PL 2018 interactome’s top 1000 paths.

Week 3 – Implementing Bayesian Weighting Schemes

The achievements from last week as well as the goals were laid out in this progress report. Accordingly, I spent the majority of this week working on an implementation of the Bayesian Weighting Scheme I discussed last week. This involved carefully fleshing out all the minutiae of the relevant paper (which I laid out and explained in this document from last week). I then applied the code onto a small toy-network that I made myself and then hand-calculated a few edges to verify my implementation.

Toy Network from Bayesian Weighting Scheme Implementation

I am currently working to implement the weighting scheme onto the HIPPIE Interactome. Thus far this has meant simply writing code to read the relevant text files and format the data efficiently in a way that meshes well with code I wrote for the toy example. Eventually, I will compare the scores of edges from this weighting scheme to the ones on the actual HIPPIE Interactome and further analysis on the differences between the two weighting schemes will probably be among the next steps. Besides implementing the Bayesian Weighting Scheme I have also been working to understand the underlying statistics behind the methods used to weight the HIPPIE Interactome. I will try to have a document that explains the mathematics for this in similar fashion to the one for the Bayesian Weighting Scheme.

Week 3: Processing Data from FireBrowse and PathLinker

On Monday we all met with Anna to agree on some goals for the week. Kathy and I set out to acheive several things: A. to figure out how to download data from FireBrowse on colon adenocarcinomas, to perform several statistical analyses on that data and to visualize it in some form using pylab, B. to write a code that produced informative graphs of the different signaling pathways from text files containing information about the edges and nodes and C. to implement PathLinker and LocPL and to create graphs of the top k paths on GraphSpace.

Originally to acheive our first goal, we tried to use Nicholas Egan’s pepper pathway code to download and format the data. However, while we were able to decipher how it worked, it ended up not being all that useful to us, so Kathy downloaded all the gene files related to colon adenocarcinoma onto her computer . Unfortunately, we’re still not really sure how to use them. We wrote a code that took text files containing information on edges and nodes and spit out a graph with color-coded nodes that were labeled with the node name and type and edges that were labeled with the type of interaction that was occuring between the two nodes for example “physical” or “phosphorylation”. I have attached to examples of the graphs we produced below, one very complex and the other very simple.

Left: Alpha-6-Beta-4 Integrin Graph. Right: Interleukin-7 Graph

We also ran PathLinker on the Wnt pathway and made a graph of the top 200 best paths produced.

Top 200 Wnt Signaling Pathways

Moving forward we have several goals. Kathy is trying to figure out how to implement LocPL. I am still trying to figure out how to process and use the FireBrowse data. Because I will need to integrate methylation data from FireBrowse in my project, it is essential I figure out how to use it.

Overall, this week I have learned a lot about data processing, writing code to read text files, and using NetPath, PathLinker, FireBrowse, and GraphSpace. These skills should come in handy as I move forward in my project.

Bayesian Weighting Schemes

Week 2 Progress Report

Week 2 was spent further looking into Bayesian Weighting Schemes and in particular how they were applied in a specific paper titled Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes by Poirel et. al. This was the original LINKER paper that PATHLINKER was built upon. I also spent a very decent amount of time looking at other interactomes and curating a working list of interactomes and how they weight their interactomes. So far I have only done this in detail for the HIPPIE interactome but more work is still needed. Besides research, I also worked on a document about Bayesian Weighting that aims to elucidate some of the mathematical set up and notation for the paper mentioned above as it suffered from some convoluted mathematical notation. Hopefully, this will be useful to people trying to read this paper in the future. Other tasks involved debugging code, presenting learned material and sorting through papers to find relevant papers.

Yen’s k shortest paths

This week, we split up to delve further into the PathLinker paper according to our aims. I was assigned to research Yen’s k-shortest paths algorithm and present to the lab on Wednesday.

I’m now going to attempt to explain this algorithm–bear with me.

Some caveats: first, this is an algorithm for computing the k-shortest loopless paths from one node to another, and this algorithm only considers simple paths, where nodes may not be repeated.

To compute the k-shortest paths (where k is a user-defined parameter, say 10 or 4), we first find the shortest path (k=1) using some sort of algorithm, say Dijkstra’s.

Here’s our wonderful graph from Graphspace:

Start Node End Node Weight
a b 1
a c 1
a e 9
b e 3
b c 4
c e 2

So, using our graph, let’s find the shortest path from a to e (P_1). Looking at the graph, we can see that it is a->c->e, with cost 4. Now what we do is loop through all paths using this first path as a guide, because intuitively, there might be some obvious edges that we use in many of the shortest paths.

Let’s call what we’re using from P_1 the root path, and call what we use when we diverge from it the spur path. To find P_2 (the 2nd shortest path), we must go through the graph differently this time to make sure we just don’t get the shortest path again. Because we’re using P_1 as a guide, let’s pretend that the edge from a->c has infinite cost, so that we can never take that edge.

The paths that result from the root path being (a) is then [a ->b->e (4), a->e(9), a->b->c->e(7)].

Then, we expand our root path to be a->c,  and pretend that the edge c->e has infinite cost, so the next path is [a->c->b->e (8)].

We combine the two lists of potential paths and look at their costs. [a->b->e (4), a->e(9),a->b->c->e(7), a->c->b->e(8)]. a->b->e is the 2nd shortest path (P_2)!

Then, we can simply search through the rest of the list for the next shortest path (P_3,…,P_k).

 

Week 2: Understanding Wnt/β-catenin Signaling and the Ryk-CFTR-Dab2 Path

This week, Kathy, Usman, and I split up to learn more about different aspects of PathLinker. I researched the Wnt/β-catenin signaling pathway, precision-recall curves, and the role of CFTR and Dab2. On Wednesday, we all presented the topics we had researched to each other and Anna and Ibrahim.

To briefly summarize what I learned, the Wnt/β-catenin pathway regulates the transcription of certain genes related to cell proliferation, cell attachment, and growth. When the Wnt/β-catenin signaling pathway is dysregulated, a number of pathologies can develop including cancer and heart disease. In fact, in a study on colon adenocarcinoma by the Cancer Genome Atlas, 93% of tested tumors had a mutation that affected the Wnt/β-catenin signaling pathway (TCGA, 2012). At the most basic level theWnt/β-catenin signaling pathway is turned on when Wnt proteins bind to “frizzled” a 7-pass transmembrane receptor which halts the destruction of β-catenin in the cell. Usually, when the pathway is off, β-catenin is constantly being produced and destroyed. When β-catenin destruction is interrupted, β-catenin will build up in the cytoplasm and move into the nucleus where it will bind to LEF and the TCF promoter to promote the transcription of specific genes that were previously being inhibited by a transcription factor that was bound to TCF. The Wnt/β-catenin signaling pathway involves many proteins and interactions, some of which we still do not understand. The PathLinker algorithm succesfully identified a path in the Wnt/β-catenin signaling pathway that was not in the KEGG or NetPath database: the Ryk-CFTR-Dab2 path (Ritz et al, 2016).  The authors of the paper hypothesized that Ryk would interact with CFTR (Cystic Fibrosis Transmembrane-conductance Regulator), a chlorine ion channel previously studied for its role in cystic fibrosis, which would activate Dab2 to inhibit β-catenin activity. This hypothesis was experimentally confirmed by silencing Ryk, CFTR and Dab2 individually using RNA interference and then testing transcription levels of β-catenin controlled genes using a TCF/LEF luciferase activity and levels of β-catenin in the cell using a Western Blot assay.

We also attended a data management workshop taught by David Isaak, Reed’s data science librarian to learn how to use Git and GitHub. Because I’ve never really used terminal before (I usually use repl.it), I worked through a command line tutorial to learn more about it.

On Thursday, Kathy, Usman, and I went through our Dijkstra code with Anna, and finally got it to work the way we wanted it to. Anna added us to the cancer-linker repository on GitHub so we can all begin to actually work on PathLinker now.

My next project is to figure out how to take data from FireBrowse, a database containing data from TCGA on 38 different cancer types, put it into a format we can use, and apply several statistical analyses to the information. I’m hoping to base it off of PepperPathway, a program written by Nicholas Egan, a student in Anna’s lab last summer, that uses data from FireBrowse, GeneCards and NetPath and to create a visual representation on GraphSpace. I’m struggling to make sense of the PepperPathway program because there are a lot of files on the GitHub repository and I’m not really sure where to begin. I’m going to try to make sense of it this weekend.

References

  1. TCGA Research Network.  Comprehensive Molecular Characterization of         Human Colon and Rectal Tumors.  July 19, 2012. Nature. DOI:           10.1038/nature11252.
  2. Ritz, Anna et al. “Pathways on Demand: Automated Reconstruction of  Human Signaling Networks.” Npj Systems Biology And Applications 2 (2016): 16002. Web.

Week 1

Sol, Usman, and I have started working on summer research to modify the PathLinker algorithm.

This week, we plan to finish the code implementing Dijkstra’s algorithm on toy datasets, which was used as a warmup exercise in order to acclimatize us back into working and coding. In addition, I’ve been assigned to understand Yen’s k-shortest paths algorithm, which is one of the foundations of PathLinker.