Week 7: Node Scoring

I have spent most of this week trying to transform the gene expression data and implement the equations that Ibrahim came up with to incorporate gene expression data into the edge scores.

I have had several major hold-ups in accomplishing this. The first is pretty simple: I don’t know how to get the kind of output I want from the CDF function, because scipy.stats.norm.cdf() outputs an array and I want a single value for each node.
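A CDF evaluated at a single number gives a single number back, so one option is to call it once per node. Below is a minimal sketch of that idea, using Python's standard-library `statistics.NormalDist` in place of SciPy. The node names and expression values are made up, and fitting one normal distribution to all the values is an assumption for illustration, not necessarily what the equations call for.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical expression values, one per node (not real TCGA data).
expression = {"nodeA": 2.0, "nodeB": 4.0, "nodeC": 6.0}

# Assumption: compare each value against a normal distribution
# fitted to all of the observed values.
values = list(expression.values())
dist = NormalDist(mu=mean(values), sigma=stdev(values))

# Evaluate the CDF once per node: one float in, one float out.
node_cdf = {node: dist.cdf(x) for node, x in expression.items()}
```

With SciPy, `scipy.stats.norm.cdf` works elementwise in the same way: passing an array returns one value per input, and passing a single number returns a single value.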

The biggest issue is that the TCGA data on gene expression and PathLinker use different gene nomenclature systems. The TCGA relies on Ensembl IDs whereas PathLinker relies on UniProt IDs. Originally I tried to convert all the Ensembl IDs to common names to UniProt IDs because I have a file that contains UniProt IDs and common names, and another file that contains Ensembl IDs and common names. However, this didn’t work because a single gene may have multiple common names and may refer to multiple UniProt IDs (for example there are about 20 UniProt IDs that correspond to the HLA-A gene).

Therefore, to avoid the loss of data due to various common names, I tried to make a dictionary mapping Ensembl IDs directly to UniProt IDs. I was able to obtain a file that contained both UniProt and Ensembl IDs from the HUGO Gene Nomenclature Committee website. However, converting between Ensembl and UniProt came with its own problems. First of all, there are many genes that either only have UniProt IDs or only have Ensembl IDs. 997 genes in the interactome could not be converted from one to the other for this reason. In addition, many of the Ensembl IDs in the TCGA file (over 16,000) don’t line up with any of the Ensembl IDs in the dictionary I constructed.

I think this might be because the Ensembl IDs in the TCGA file include version numbers at the end of each ID. The version number is the part after the decimal point at the end of the ID. For example, for the gene “ENSG00000242268.2”, the “.2” means that this is the 2nd version of that gene. I think one way to fix this problem might be to just take all the version numbers out of the gene names when constructing the dictionary from the TCGA file. However, I can’t figure out how to do this without making the code super slow. If every version number were a single digit (i.e., .1, .2, .3, etc.), I would just cut off the last two characters of each name, which wouldn’t be that slow. However, different IDs have version numbers of different lengths. For example, if I cut off the last two characters of “ENSG00000167578.15” I would still be left with the decimal point. Therefore the only way I can currently think of to get rid of the decimal point and everything following it is to use a for-loop that goes through every character in the name and, if the character is a decimal point, cuts the string there. However, if the program has to go through every character in every gene name, it’s going to be extremely slow.
Maybe something I could do is pre-process the data to create a text file that contains gene IDs without the decimal places so it only has to do it one time and won’t slow the whole code down, but I feel like there has to be a faster way to do it within the program.
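One way to drop the version suffix without a hand-written character loop is Python's built-in `str.partition`, which splits at the first decimal point no matter how many digits follow it. The sketch below combines that with a dictionary whose values are sets, so an Ensembl ID that maps to several UniProt IDs keeps all of them. The ID pairs are invented placeholders standing in for the HGNC file, and the variable names are illustrative only.

```python
from collections import defaultdict

def strip_version(ensembl_id):
    """Return the Ensembl ID without a trailing version such as '.15'."""
    # partition splits at the first '.'; if there is none, the
    # whole string comes back unchanged as the first element.
    return ensembl_id.partition(".")[0]

# Hypothetical (Ensembl ID, UniProt ID) pairs; the UniProt IDs are placeholders.
hgnc_pairs = [
    ("ENSG00000242268", "UP_EXAMPLE_1"),
    ("ENSG00000167578", "UP_EXAMPLE_2"),
    ("ENSG00000167578", "UP_EXAMPLE_3"),
]

# A set per key keeps every UniProt ID for a gene instead of overwriting.
ensembl_to_uniprot = defaultdict(set)
for ensembl_id, uniprot_id in hgnc_pairs:
    ensembl_to_uniprot[ensembl_id].add(uniprot_id)

# Versioned IDs as they appear in the TCGA file (the last one is made up).
tcga_ids = ["ENSG00000242268.2", "ENSG00000167578.15", "ENSG00000000000.1"]

matched = {}
unmatched = []
for versioned_id in tcga_ids:
    base_id = strip_version(versioned_id)
    if base_id in ensembl_to_uniprot:
        matched[versioned_id] = ensembl_to_uniprot[base_id]
    else:
        unmatched.append(versioned_id)
```

Built-in string methods run in compiled code, so calling `strip_version` on every ID should add very little to the run time, and the `unmatched` list makes it easy to count how many IDs still fail to convert.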

GCC/BOSC

Last week, we were lucky enough to attend the GCC/BOSC conference hosted at Reed. Although I spent the majority of the week volunteering and attending sessions, we also had a deadline for a two-page extended abstract of our CNB-MAC paper to be submitted to the main conference proceedings. We successfully cut down our original paper into two pages and submitted the abstract on Friday.

As of now, we’re almost ready to dig into the experimental portion of this project, but there are still a few things we need to iron out. At the end of last week, Alex noticed a “bug” in how we’ve been calculating the AUC of our multi-layer algorithm. Therefore, one of our priorities this week is to find a more accurate AUC calculation that better represents our method. Once that is completed, we will move on to assembling a list of candidates for experimental validation. Hopefully by the end of the week we will have chosen 8-10 candidates that are not only biologically interesting but also conserved in Drosophila.

GCC/BOSC & Bug Fixing

This week I spent the majority of my time at the GCC/BOSC Conference, both volunteering and attending talks. Most of the talks concerned the software suite Galaxy and how people were adapting or extending it for different use cases and new applications. Since I have not used Galaxy before and do not intend to, these talks did not prove very useful to me. There were some talks that concerned more general biological problems and issues within the scientific community. These were accessible to a wider audience and I was able to learn some aspects of statistics and experimental design from them. These included talks on the importance of talking to the community of people who have the particular condition being studied, as well as some on best practices for handling data.

I spent some portion of time fixing minor bugs in my code for the Bayesian Weighting Scheme. As a follow-up to a previous post, the run time of the code is now down to 2 or 3 minutes at the most. Using sets and tuples reduced the run time by several orders of magnitude, and some other time-saving measures also contributed to this much lower number. This is now comfortably within the time frame of the CREU, so the code will probably not need much more speeding up. Next steps involve changing the code to weight the PathLinker interactome to check that the code works accurately, as well as going through the math carefully.
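As a generic illustration of why sets and tuples help (this is not the actual weighting code, and the edges are invented): a tuple is hashable, so edges stored as tuples can go into a set, and checking whether an edge exists no longer requires scanning a list.

```python
# Hypothetical directed edges; the real interactome is far larger.
edge_list = [("A", "B"), ("B", "C"), ("C", "D")]

# Tuples are hashable, so they can be set members. Membership tests on a
# set take roughly constant time on average, versus a full scan for a list.
edge_set = set(edge_list)

def has_edge(u, v):
    """Return True if the directed edge (u, v) is in the edge set."""
    return (u, v) in edge_set
```

When the same membership check runs inside nested loops over a large interactome, this kind of change can account for a large share of the speed-up.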

Week 6: GCC/BOSC

This week we spent most of our time attending and volunteering at GCC/BOSC (Galaxy Community Conference / Bioinformatics Open Source Conference). While the conference wasn’t especially relevant to what we’re working on right now, we learned how to use software like Galaxy and InterMine, which might be useful to us in future projects. Also, this was the first conference I’ve ever attended, so it was very interesting to learn how conferences work and to meet people.

Week 4: Parsing Files

This week, I discovered how to download files from the TCGA database, and explored their structure.

I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database). These were compressed into a tar file, which opened into a directory tree that looked like this:

[Image: directory tree of the extracted archive]

Each tiny blue dot is a folder, and inside each folder is one gzipped (compressed) file. Each of these files is one sample from a patient, from either a tumor or a healthy tissue, listing genes and their expression.

My main issue has been trying to parse these files into a data matrix, ideally with sample IDs across the top and genes down the side. So far, I’ve been able to group these files by patient, with each patient’s samples together, because ideally we want to look at gene expression in a healthy sample and a tumor sample from the same patient. However, I haven’t been able to write the sample IDs with their gene expression data into a file because of multiple bugs and errors. My goal for next week is to get these errors fixed.
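A rough sketch of one way to build such a matrix with only the standard library is below. It rests on several assumptions that may not match the real download: each folder holds one gzipped, tab-separated file with a gene ID in the first column and an expression value in the second, and the folder name is used as the sample ID. The function names are hypothetical.

```python
import csv
import gzip
from pathlib import Path

def read_expression(path):
    """Read one gzipped file of 'gene<TAB>value' lines into a dict."""
    expression = {}
    with gzip.open(path, "rt", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2:  # skip blank or malformed lines
                expression[row[0]] = row[1]
    return expression

def write_matrix(root_dir, out_path):
    """Write a tab-separated matrix: genes as rows, samples as columns."""
    samples = {}
    for gz_path in sorted(Path(root_dir).glob("*/*.gz")):
        # Assumption: the enclosing folder name serves as the sample ID.
        samples[gz_path.parent.name] = read_expression(gz_path)

    sample_ids = sorted(samples)
    # Union of all gene IDs, so a gene missing from one sample still gets a row.
    genes = sorted(set().union(*(s.keys() for s in samples.values())))

    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["gene"] + sample_ids)
        for gene in genes:
            # "NA" marks a gene that is absent from a given sample.
            writer.writerow([gene] + [samples[s].get(gene, "NA") for s in sample_ids])
    return sample_ids, genes
```

Calling `write_matrix("path/to/extracted_tar", "matrix.tsv")` (both paths are placeholders) would write one header row of sample IDs followed by one row per gene. Pairing tumor and healthy samples from the same patient would need an extra lookup from sample ID to patient, which is not shown here.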