Week 22

So we have a BLASTer to see which genes in humans are homologous to genes in drosophila melanogaster, our model organism for cell motility. It works- it just takes a couple of hours to get an output file for 20 genes. Should be no problem to run overnight.

Miriam has been working on a way to verify that the network is working. Currently, she has been eliminating some positives while keeping others, looking at how high up the eliminated positives are in the result. It should be fruitful.

We also looked at how other functionally similar genes we were certain about were similar. We found that all of the gene interactions we thought of as functionally similar had strong edges in the networks. This validates the network as a whole as biologically sound and will further validate our results.

Week 21

Although we had the bio qual, we still managed to do a little bit of work. One of the sources for our positive set was a little suspect, so we decided to take it out and see the impact on the results. We found that there was a drastic change. Since we were relatively confident that the other positives were good, if the suspect one was good, then the results wouldn’t change as much if we ran the whole network through. What we found was that it was indeed not very good. The output file was dramatically different. But we need a way to quantify the change

This upcoming week, we will also be developing a program to find how conserved our output genes are in fruit flies, which are our model organism

No Post

I am taking the biology junior qualifying exam this weekend, so there will be no blog post. The junior qualifying exam is a cumulative exam each student must take their junior year for their major. Each student must pass it in order to be able to move on to writing a thesis their senior year and graduate.

Week 20

I took a pathway that I know pretty well, NMDA-based long term potentiation (the process by which neurons become more sensitive to neurotransmitters in response to glutamatergic activity). I made some positives for both the first part of the pathway, the channel proteins that bind to and Ca2+ associated proteins, and the last part of the pathway, which involves nuclear proteins involved with transcribing growth factors and new channel proteins.

A protein centrally implicated in LTP, CaMKII, was used as the reference ranked 21 out of ~15,000. It’s good, but it could be improved. It might be better if we allowed for more precision.

Upcoming tasks will include further verifications, but will probably be postponed, since both of us are taking the biology junior qualifying exam, which is a cumulative exam that we need to pass to graduate.

Week 19

One of our tasks last week was to make a diagram that provides an overview of our project’s methods, pictured below. The blue represents inputs, the yellow represents a function (our semi-supervised method), the green represents outputs, and the purple is the final output.

Overview of the CREU project.

At the top of the diagram, the blue boxes go through the process of creating the positive list of cell motility genes. The genes were gathered from different signaling pathways in the KEGG database as well as studying primary literature. From these resources, I created two separate positive lists: one purely of cell motility genes and one of cell motility genes implicated in schizophrenia. I combined these together (collapsing the duplicates) in order to create a positive list of 541 cell motility genes. At the bottom of the diagram, the two blue boxes go through the process of creating the schizophrenia positive list. Alex pulled schizophrenia genes from genome-wide association studies (GWAS), then filtered the genes down to the top 300 positives by taking into account their p-values from the literature. We each ran the semi-supervised iterative method on the GIANT brain interactome with the same negative set, differing only in the positive sets we used. From there, the iterative method spit out ranked lists of schizophrenia candidates and cell motility candidates with scores ranging from 0-1 (the green boxes.) Finally, we combined the scores by multiplying them to take into account their probability of being “good” candidates for both cell motility and schizophrenia (the purple box.)

The CREU diagram can also be viewed here.

Our runtime-shortening strategies seemed to have worked well! We plotted runtime vs. iteration number and found that each iteration took about 3-4 seconds, so there’s no increase as we saw last time. However, we did find that the scores do not change after a surprisingly small number of iterations, so we will be tweaking a few things to see what changes.

Because it’s hard to tell if a semi-supervised machine learning method is actually good at what it’s supposed to accomplish, we will also be looking into ways to test our method.

Our goals for next time include:

  1. Run on a larger portion of the network. So far we have been using the 0.200 probability threshold network, but we will move down to 0.150 threshold.
  2. Change the number of positives used – see what the effect is of using 1 vs. 300+ positives on the graph.
  3. Plot a distribution of the ranked candidate scores.
  4. Plot the absolute value of the sum of changes made during each iteration.
  5. Look into cross validation to check if our method is doing a good job at what it’s supposed to do. This involves hiding some positives from your positive list and seeing if your method correctly identifies these hidden positives as having a high probability (if not the highest) of being involved in the pathway.
  6. Look  into how other papers were able to be convincing about the accuracy of their method.

Next week is spring break so there will be no blog post!

 

Week 18

Last week we had the task of running our algorithm for 100,000 iterations. We ran into a couple of problems:

  1. It took a very long time. We actually stopped running it at about 1300 iterations (which took 12 hours) because the estimated time kept increasing. There are a few lines in our code that we are going to change to improve running time – we will convert some variables we’re using to track changes between iterations from lists of nodes to integer counts.
  2. The difference in score we wanted to see before stopping iterations was far too big. We allowed the code to run until the difference in scores between iterations was no more than 0.001 (or it hit 100,000 iterations), and to our surprise, it took around 130 and 180 iterations for cell motility and schizophrenia positives, respectively. Because the scores range from 0 to 1, a change of 10^-4 is bigger than we initially thought. After speeding up the code, we are going to run it until we see changes less than 10^-9.

We are also going to create a histogram of run time versus progress (each iteration) to track our algorithm and see if there is a separate problem that causes the code to slow down as the number of iterations increases.

Our continued goals for this week are to BLAST candidates to see if they are conserved in Drosophila. 

Week 18

We needed to find drosophila analogs to the genes high on our list, so this week, we built a program that converts human gene names to human protein ID numbers from the NCBI. Then, the ID number is put into BLAST, a tool from the NCBI that sorts drosophila proteins by amino residue sequence similarity.

There are also things we can do to optimize the gene ranker program to improve the run time. We will be converting some sets to integers and will catalog the runtime versus the progress.

Week 17

Anna noted that we need to run the program for quite a bit longer to get a more accurate result, so we’ll publish those results in a couple of days.

Also, we need to find out if our candidate genes are conserved in drosophila. We can use this with BLAST, an NCBI tool that lets us compare sequence similarity across species. Rather than do it manually, we’ll hook it up directly to our terminal.

Week 16

We have two goals to accomplish for our meeting next week:

  1. Run our iterative method for 100,000 iterations, or until the scores change by 0.001 or less. See how this affects the ranking compared to 150 iterations. Will it just affect the scores but not the relative rankings?
  2. See if the candidates are conserved in drosophila melanogaster. The second part of this project is to experimentally validate the candidates, which we cannot do if they aren’t conserved in drosophila. As a quick sanity check, we’re going to BLAST CTNNB1 (the well-studied Catenin Beta 1 gene that codes for a protein involved in the formation of adherens junctions) in humans against drosophila which should be ARM.

The results will be posted in a couple of days.