We found an algorithm that solves the edge case problem. Sinksource+ removes the negatives and adds one negative to every single node. The weight of the edge is also a constant, which means that it’s the same for a node of any degree. This significantly devalues nodes that aren’t very well connected and would not be suitable for polygenic analysis.
However, it works horrendously in a single layer method with no prime nodes. The AUC sticks at around 0.5, which is the worst possible AUC you can have. However, it works great in a two layer network, giving our SZ network a score of 0.65. If you add negatives back in with the addition of sinksource+ and reduce the sinksource+ constant, we get an SZ AUC of 0.675 and a cell motility AUC of 0.801, which is the best AUC we have seen for an SZ gene identifier. I’m also running the program on the autism positives right now, and it appears that the AUC is running at about similar levels to where they were with E1 and E2.
Krishnan, A. et al.Genome-wide prediction and functional characterization of the genetic basis of autism spectrum disorder. Nature Neuroscience19,1454 (2016).
I think we are at the point where we will begin writing the paper. These results are a satisfying in that it’s probably just the nature of SZ that genes implicated are going to be difficult and somewhat disparately related. That’s not necessarily a bad thing. Polygenic positives are by definition only contributing slightly towards a disease. The strongest p-value for a SNP in an SZ GWAS has a frequency of 0.765 in SZ participants and a frequency of 0.75 in control participants. A low AUC in an SZ gene labeler is almost expected.
On Monday we all met with Anna to agree on some goals for the week. Kathy and I set out to acheive several things: A. to figure out how to download data from FireBrowse on colon adenocarcinomas, to perform several statistical analyses on that data and to visualize it in some form using pylab, B. to write a code that produced informative graphs of the different signaling pathways from text files containing information about the edges and nodes and C. to implement PathLinker and LocPL and to create graphs of the top k paths on GraphSpace.
Originally to acheive our first goal, we tried to use Nicholas Egan’s pepper pathway code to download and format the data. However, while we were able to decipher how it worked, it ended up not being all that useful to us, so Kathy downloaded all the gene files related to colon adenocarcinoma onto her computer . Unfortunately, we’re still not really sure how to use them. We wrote a code that took text files containing information on edges and nodes and spit out a graph with color-coded nodes that were labeled with the node name and type and edges that were labeled with the type of interaction that was occuring between the two nodes for example “physical” or “phosphorylation”. I have attached to examples of the graphs we produced below, one very complex and the other very simple.
We also ran PathLinker on the Wnt pathway and made a graph of the top 200 best paths produced.
Moving forward we have several goals. Kathy is trying to figure out how to implement LocPL. I am still trying to figure out how to process and use the FireBrowse data. Because I will need to integrate methylation data from FireBrowse in my project, it is essential I figure out how to use it.
Overall, this week I have learned a lot about data processing, writing code to read text files, and using NetPath, PathLinker, FireBrowse, and GraphSpace. These skills should come in handy as I move forward in my project.
Week 2 was spent further looking into Bayesian Weighting Schemes and in particular how they were applied in a specific paper titled Top-Down Network Analysis to Drive Bottom-Up Modeling of Physiological Processes by Poirel et. al. This was the original LINKER paper that PATHLINKER was built upon. I also spent a very decent amount of time looking at other interactomes and curating a working list of interactomes and how they weight their interactomes. So far I have only done this in detail for the HIPPIE interactome but more work is still needed. Besides research, I also worked on a document about Bayesian Weighting that aims to elucidate some of the mathematical set up and notation for the paper mentioned above as it suffered from some convoluted mathematical notation. Hopefully, this will be useful to people trying to read this paper in the future. Other tasks involved debugging code, presenting learned material and sorting through papers to find relevant papers.
This week, I got caught up with the state of the project. Anna and Miriam had cleaned up and optimized our code to the point where it could be submitted and run in a reasonable amount of time. However, there still remained the problem of nodes only connected to positives. Since positives are held at a constant score of 1.0, and negatives at a constant 0.0, any nodes only connected to a positive will have a score of 1.0, which misrepresents how a gene fits within the polygenic nature of schizophrenia and cell motility.
To solve this, we decided to make three networks, each one identical to the one we were previously testing. However, there are two key differences.
The positives are divided between the three networks.
The three nodes in each networks representing their gene are also connected to a prime node that is only connected to those three nodes. The prime node’s score is the node by which we judge a gene.
This reduces the edge cases where a gene is only connected to a positive or a negative, since the gene cannot be connected to only one node. This also allows us to evaluate positive nodes within the context of all other positives. If a positive node is found to be positive in other networks, then the score will be high. Even the neighbor positives within the same network will diffuse the positivity through their prime nodes down to other networks, increasing the score for the neighbor of the positive and therefore increasing the score for the positive.
One thing to note is that the AUC seems to be relatively stable despite the network threshold or the prime node implementation. However, my guess is that this is probably due to the positives generally being only vaguely functionally related. Also, only 116 SZ genes are in the 0.200 network, reducing the polygenic power of our method. One thing to try is to add more genes from other studies. I will research other studies to find more gene lists and p-values.
Sol, Usman, and I have started working on summer research to modify the PathLinker algorithm.
This week, we plan to finish the code implementing Dijkstra’s algorithm on toy datasets, which was used as a warmup exercise in order to acclimatize us back into working and coding. In addition, I’ve been assigned to understand Yen’s k-shortest paths algorithm, which is one of the foundations of PathLinker.
This week, my goals have been relatively simple and short. There are a couple of bugs in my code that I need to sort out. I haven’t been able to collect any data about how “good” our method is due to these bugs, but once I do, we will hopefully be able to see that the results we are getting are not random.
Since the last time I wrote a blog post, I have mainly been working on one way of verifying our algorithm; essentially, we need to make sure that our method is actually good at doing what it’s supposed to be doing. Although there are several ways to verify an algorithm, some of which Alex has been working on, I am working on something called “k-fold cross validation.”
This method of verification works by removing (or “hiding”) 1/k-th of your positives, then running your method and seeing where the hidden positives are ranked. You randomly choose the 1/k positives to hide and do this several times. You can also use the distribution of scores you get from each run to see how your method (with all positives) compares – are you getting significant results or are you getting what is expected from randomly choosing positives?
My first attempt at this was just a test run, and there were a few mistakes that I made. First, I chose to sample half instead of choosing a more reasonable number of positives such as 1/4 or 1/5. I also didn’t randomly choose positives to hide – I simply deleted the second half of the positives, and in doing so, might have removed an entire cell motility pathway or two from the positives. This would have produced biased results.
This coming week, my goal is to randomly sample 1/4 of positives multiple times, as well as implement some plotting functions in order to visualize the distribution of scores.
So we have a BLASTer to see which genes in humans are homologous to genes in drosophila melanogaster, our model organism for cell motility. It works- it just takes a couple of hours to get an output file for 20 genes. Should be no problem to run overnight.
Miriam has been working on a way to verify that the network is working. Currently, she has been eliminating some positives while keeping others, looking at how high up the eliminated positives are in the result. It should be fruitful.
We also looked at how other functionally similar genes we were certain about were similar. We found that all of the gene interactions we thought of as functionally similar had strong edges in the networks. This validates the network as a whole as biologically sound and will further validate our results.
Although we had the bio qual, we still managed to do a little bit of work. One of the sources for our positive set was a little suspect, so we decided to take it out and see the impact on the results. We found that there was a drastic change. Since we were relatively confident that the other positives were good, if the suspect one was good, then the results wouldn’t change as much if we ran the whole network through. What we found was that it was indeed not very good. The output file was dramatically different. But we need a way to quantify the change
This upcoming week, we will also be developing a program to find how conserved our output genes are in fruit flies, which are our model organism