Week 11

This week, I ran some statistics on how our existing SZ gene dataset fits within the GIANT network. I found that most of our genes had a posterior probability of about 0.3-0.5, which makes sense, given that many of our genes should be at least 0.2.

This coming week is Thanksgiving, but soon we’ll be setting parameters for how we want to judge each gene in the completed network, given the qualities of the GIANT network as a whole.

Week 10

This week, I got more familiar with the NetworkX package, which is built for graph-based programming. It’s very powerful, but the complete GIANT network is far too large for it to run efficiently. Even with a 0.1 threshold on edge weight, the network has 41 million edges. However, Anna showed that the number of edges decreases exponentially as the threshold goes up. In our final program, we will need to carefully balance how many edges we include against how much weight we put on them.
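To make the threshold trade-off concrete, here is a minimal sketch of counting how many edges survive each cutoff in one pass over an edge list, without loading the whole graph into memory. The tab-separated layout (gene, gene, weight), the sample values, and the function names are all assumptions for illustration, not the actual GIANT file format.

```python
import io

# Hypothetical tab-separated edge list: gene_a, gene_b, posterior probability.
# These four rows are made up; the real file would be streamed from disk.
SAMPLE = io.StringIO(
    "TCF4\tPRICKLE2\t0.92\n"
    "TCF4\tEFHD1\t0.35\n"
    "PRICKLE2\tEFHD1\t0.12\n"
    "EFHD1\tGENE_X\t0.05\n"
)

def iter_edges(handle, threshold):
    """Yield (gene_a, gene_b, weight) for edges at or above the threshold."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        gene_a, gene_b, weight = line.split("\t")
        weight = float(weight)
        if weight >= threshold:
            yield gene_a, gene_b, weight

def count_edges(handle, thresholds):
    """Count the edges surviving each threshold in a single pass."""
    counts = {t: 0 for t in thresholds}
    # Anything below the smallest threshold cannot count toward any of them.
    for _, _, weight in iter_edges(handle, min(thresholds)):
        for t in thresholds:
            if weight >= t:
                counts[t] += 1
    return counts

counts = count_edges(SAMPLE, [0.1, 0.3, 0.9])
print(counts)
```

Because `iter_edges` yields (gene, gene, weight) triples, its output for a single chosen threshold could be passed to NetworkX's `Graph.add_weighted_edges_from` to build only the reduced graph.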

Week 9

This week, I compared the brain-tissue-specific network from HumanBase (http://hb.flatironinstitute.org) with the set of genes I collected that are associated with a higher risk of schizophrenia. This tissue-specific network gives the probability that two genes interact with each other specifically in the brain. Gene pairs that interact in the brain but also interact in other tissues get a lower probability. As expected, nearly all of the associated genes had at least a 0.1 probability, which is relatively high in terms of bioinformatic confidence. Notably, several of the gene interactions with a probability above 0.9 involved cell adhesion genes.
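One way to read "had at least a 0.1 probability" is that each gene's strongest interaction with another gene in the set clears 0.1. The sketch below follows that reading, which is an assumption on my part; the edge values are made up and the function name is hypothetical.

```python
# Hypothetical brain-network edges (gene_a, gene_b, probability).
# The gene symbols appear in these notes, but the numbers are invented.
BRAIN_EDGES = [
    ("TCF4", "PRICKLE2", 0.92),
    ("TCF4", "EFHD1", 0.35),
    ("MIR137", "EFHD1", 0.04),
]
SZ_GENES = {"TCF4", "PRICKLE2", "EFHD1", "MIR137"}

def best_edge_per_gene(edges, genes):
    """Map each gene to its highest edge probability with another gene in the set."""
    best = {gene: 0.0 for gene in genes}
    for gene_a, gene_b, prob in edges:
        if gene_a in genes and gene_b in genes:
            best[gene_a] = max(best[gene_a], prob)
            best[gene_b] = max(best[gene_b], prob)
    return best

best = best_edge_per_gene(BRAIN_EDGES, SZ_GENES)

# Genes whose strongest in-set interaction falls below the 0.1 cutoff.
below_cutoff = sorted(gene for gene, prob in best.items() if prob < 0.1)
print(below_cutoff)
```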

Of the genes below the 0.1 confidence, most do have neural roles but simply have roles common to other parts of the body. For instance, mir-137 is involved in neural development but is also involved in tumor suppression for several cancers.

This upcoming week, I will be learning how to use NetworkX and will be gathering statistics from the Humanbase network with it.

Week 8

This past week, I created a unified standard for gene names. Regardless of the naming preferences of the various research databases, I can now manipulate the genes as if they all followed the same naming convention. This resulted in about 10 new genes being added to the gene overlap sets across the different collections of genes.
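A minimal sketch of the idea, assuming the unified standard boils down to a lookup table from each alias to one canonical symbol. The table contents and function names here are hypothetical; the real mapping would come from the alias file, whose format is not described in these notes.

```python
# Hypothetical alias table: alias -> canonical symbol (illustrative entries only).
ALIASES = {
    "ITF2": "TCF4",
    "SEF2": "TCF4",
}

def normalize(symbol, aliases):
    """Return the canonical symbol for a gene name, ignoring case and whitespace."""
    key = symbol.strip().upper()
    # Names without a known alias are assumed to already be canonical.
    return aliases.get(key, key)

def normalize_set(symbols, aliases):
    """Normalize every name in a gene list and drop duplicates."""
    return {normalize(symbol, aliases) for symbol in symbols}

# Two studies naming the same gene differently now overlap.
study_a = normalize_set(["TCF4", "prickle2"], ALIASES)
study_b = normalize_set(["ITF2 ", "EFHD1"], ALIASES)
shared = study_a & study_b
print(shared)
```

Before normalization these two toy lists share nothing; afterwards they share one gene, which is the kind of effect that could account for newly recovered overlaps.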

This week, I am going to compare all of these genes to those in the GIANT network. That way, we can see where specifically these genes sit in the network and what processes they are involved in.

I am also going to be hunting for negatives to compare against when we build the more complex program. Genes completely uninvolved in schizophrenia are actually really difficult to find, given that the genes in this collection make up about 8% of the protein-coding genome. I’ll look at computational biology papers on mental disorders similar to schizophrenia, like autism.

Week 7

The week before break, at the group meeting, we discussed the gene lists Miriam and I created. Because of the lack of overlap on my part, my next job is to modify my program to account for genes that go by different names. Anna sent me a giant text file for it, and I will get it done by the Friday meeting.
I did not spend all of break dormant: I learned more about Bayesian statistics, in addition to getting a brief overview from one of Anna’s post-docs. I think one of the difficulties in understanding how Bayes’ theorem works is that it’s so ingrained in our normal thought. Given B, what is the probability of A? It’s the probability of B given A, times the probability of A, divided by the probability of B: P(A|B) = P(B|A) P(A) / P(B). The first thing to note is the P(B) dividing the whole expression. It accounts for how the context of the problem, namely knowing that B happened, warps the probabilities. The second factor is the probability of B given A, which represents the relationship we already know. This is multiplied by the probability of A, so the numerator is the probability that A and B both happen, which can also be described as the total probability that the specified relationship occurs. Dividing by P(B) to account for the warping gives the actual probability of A given B.
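As a small worked example of the theorem, the sketch below computes P(A|B) from P(B|A), P(A), and P(B|not A), expanding P(B) over the two cases "A" and "not A". The numbers are made up purely for illustration, and the function name is my own.

```python
def posterior(p_b_given_a, p_a, p_b_given_not_a):
    """Return P(A|B) via Bayes' theorem, expanding P(B) by total probability."""
    # Numerator: P(B|A) * P(A), the probability that A and B both happen.
    joint = p_b_given_a * p_a
    # Denominator: P(B), summed over the two mutually exclusive cases A and not-A.
    p_b = joint + p_b_given_not_a * (1.0 - p_a)
    return joint / p_b

# Made-up inputs: P(B|A) = 0.8, P(A) = 0.1, P(B|not A) = 0.2.
p = posterior(p_b_given_a=0.8, p_a=0.1, p_b_given_not_a=0.2)
print(round(p, 4))
```

With these inputs the numerator is 0.08 and P(B) is 0.08 + 0.18 = 0.26, so B raises the probability of A from 0.1 to roughly 0.31.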
Bayesian probability is the core of the functional interaction network and the integrated network we will make. I can already kind of see how gene interaction probabilities could be derived from this given interaction data.
However, the mutual exclusivity clause in the theorem might be tricky. I’ll have to look closely at the supplemental data to see how the builders of the functional interaction network accounted for this.

Week 5

This week, Anna drafted me to make a program to find the common genes in my schizophrenia dataset. It was a little tricky, given that all of the datasets from the different studies were organized differently, but it wasn’t anything I couldn’t handle.


108 LOCI

Schizophrenia Working Group of the Psychiatric Genomics Consortium (2014). Biological insights from 108 schizophrenia-associated genetic loci. Nature 511, 421–427.

Allen, N.C., Bagade, S., McQueen, M.B., Ioannidis, J.P.A., Kavvoura, F.K., Khoury, M.J., Tanzi, R.E., and Bertram, L. (2008). Systematic meta-analyses and field synopsis of genetic association studies in schizophrenia: the SzGene database. Nat Genet 40, 827–834.

Brennand, K.J., Simone, A., Jou, J., Gelboin-Burkhart, C., Tran, N., Sangar, S., Li, Y., Mu, Y., Chen, G., Yu, D., et al. (2011). Modelling schizophrenia using human induced pluripotent stem cells. Nature 473, 221–225.

Fromer, M., Roussos, P., Sieberts, S.K., Johnson, J.S., Kavanagh, D.H., Perumal, T.M., Ruderfer, D.M., Oh, E.C., Topol, A., Shah, H.R., et al. (2016). Gene expression elucidates functional impact of polygenic risk for schizophrenia. Nat Neurosci 19, 1442–1453.

Won, H., de la Torre-Ubieta, L., Stein, J.L., Parikshak, N.N., Huang, J., Opland, C.K., Gandal, M.J., Sutton, G.J., Hormozdiari, F., Lu, D., et al. (2016). Chromosome conformation elucidates regulatory relationships in developing human brain. Nature 538, 523–527.


The first thing to note is that there was one gene that was in every data set except SZGene that couldn’t be included in this diagram because the Venn diagram program I found online deemed it geometrically impossible. This gene is TCF4, also known as immunoglobulin transcription factor 2. It is implicated in Pitt-Hopkins Syndrome and is involved in initiating neural differentiation.

One overlap between hiPSC, DeadExpression, and Chromosome Conformation is PRICKLE2, or Prickle Planar Cell Polarity Protein 2. It seems to be involved in the growth of postsynaptic densities and in neurite outgrowth.

Another pertinent overlap is EFHD1. EFHD1 was found in 108 Loci, hiPSC, and Chromosome Conformation. It seems to be calcium dependent and involved in apoptosis and neural differentiation, and it’s probably mitochondria dependent. However, its family is involved in cytoskeletal rearrangement.

One possible reason for the dearth of overlap is simply the diversity of methods. Each collection got its genes from a different kind of source, and it’s very likely that gene expression is totally different in each of these contexts. I’ll probably ask Anna more about this later. Until then, I will restart my Bayesian studies.

Week 4

I am currently in the process of figuring out which genes are common to each of the data sets. This will involve building a program to find the common genes. I have a good idea of how to do it; it’s a relatively simple problem. The only hang-up will probably be translating all the datasets into one universal, comparable format, but given enough time to iron out the kinks in all 5 data sets, I think I can manage easily. Eventually, I’ll make a neat Venn diagram illustrating which genes are common to which datasets.
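Once every dataset is reduced to a plain set of gene symbols, the overlap step can be sketched as below. The dataset labels follow the ones used in these notes, the toy gene lists only mirror the overlaps reported here (GENE_A is a placeholder), and the function names are hypothetical.

```python
from collections import defaultdict

# Toy gene sets keyed by dataset label; real lists would be parsed from each study.
DATASETS = {
    "108 Loci": {"TCF4", "EFHD1"},
    "SZGene": {"GENE_A"},
    "hiPSC": {"TCF4", "PRICKLE2", "EFHD1"},
    "DeadExpression": {"TCF4", "PRICKLE2"},
    "Chromosome Conformation": {"TCF4", "PRICKLE2", "EFHD1"},
}

def membership(datasets):
    """Map each gene to the list of datasets that contain it."""
    found_in = defaultdict(list)
    # Visit datasets in sorted order so each gene's list is deterministic.
    for name in sorted(datasets):
        for gene in datasets[name]:
            found_in[gene].append(name)
    return dict(found_in)

def shared_genes(datasets, min_datasets=2):
    """Keep only genes that appear in at least `min_datasets` datasets."""
    return {
        gene: names
        for gene, names in membership(datasets).items()
        if len(names) >= min_datasets
    }

overlap = shared_genes(DATASETS, min_datasets=3)
for gene in sorted(overlap):
    print(gene, overlap[gene])
```

Keeping the per-gene list of datasets, rather than only pairwise intersections, gives exactly the regions a Venn diagram needs.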

I’m also learning more and more about Bayesian probability, which is the basis of the functional interaction network we will eventually use. Anna’s postdoc Ibrahim explained the basics to me. It’s a new concept to me, so I’ll spend part of this week learning more about it.

Week 3

This week, we searched for genes related to both schizophrenia and cell motility in order to build our gold standard of genes on which to base our program.

My job was to find collections of genes related to Schizophrenia (SZ), while Miriam searched for cell motility genes. Our program is planned to have the capability to weight evidence for the strength of the data, so even uncertain genes will be really helpful.

The first set of associated genetic information is from the Schizophrenia Working Group of the Psychiatric Genomics Consortium. In a genome-wide association study (GWAS) of 36,989 cases, they identified 108 loci containing SNPs that are significantly more likely to be present in people with SZ. The threshold for significance is a p value of less than 5×10^-8, making this study one of the strongest sets of SZ gene data. However, while most of these SNPs are located near protein-coding genes, all but 10 are located within non-coding regions. A regulatory region being near a protein-coding gene does not necessarily mean that the gene is actually regulated by that region. Instead, the region might regulate some other gene in some other area. Still, proximity to a regulatory region does imply a higher probability of being regulated by it, so this locational information might be useful under the weaker evidence standards.


The next set of data is from the Schizophrenia Research Forum on SZGene.org. It contains every genetic association study paper for SZ genes that’s available. It’s a nice collection, but none of the genes reached the GWAS p value threshold of 5×10^-8. This evidence will have to be weighted in proportion to the strength of the studies themselves.

Another set of data comes from an expression study on human induced pluripotent stem cells. The cells were differentiated to neurons and underwent qPCR. This data, however, is in vitro, whereas the developmental aspect of SZ requires a degree of specificity and communication. https://www.nature.com/nature/journal/v473/n7346/full/nature09915.html

There is another gene expression dataset, but this time, it’s from dead people. While the region of the brain is much more specific (prefrontal cortex), this expression data comes long after development. However, they did find 1 gene, inhibited it in a pluripotent stem cell, and found abnormal cell motility. Therefore, this data may be useful to our study. http://www.nature.com/neuro/journal/v19/n11/full/nn.4399.html

The last genetic data set covers the conformational structure of the chromatin in the nucleus. Brain-specific intrachromatin contacts can upregulate protein-coding genes whose promoters are involved in the contact. The authors established a strong correlation between gene expression and chromatin contact using 3 neonatal brain slices from separate subjects. They were able to locate the points of contact for the remaining 98 loci from the initial GWAS, giving a broader picture of genetic interaction. http://www.nature.com/nature/journal/v538/n7626/full/nature19847.html

We’ll probably use all of these data sets. The next question is how.