{"id":681,"date":"2022-08-16T19:08:35","date_gmt":"2022-08-17T02:08:35","guid":{"rendered":"https:\/\/blogs.reed.edu\/compbio\/?p=681"},"modified":"2022-08-16T19:08:35","modified_gmt":"2022-08-17T02:08:35","slug":"clustering-with-graphlets","status":"publish","type":"post","link":"https:\/\/blogs.reed.edu\/compbio\/2022\/08\/16\/clustering-with-graphlets\/","title":{"rendered":"Clustering with Graphlets"},"content":{"rendered":"\n<p>Hi! My name is Hannah Kuder, I&#8217;m a rising math-physics senior working in Anna Ritz&#8217;s lab this summer. My project has focused on implementing a graphlet-based version of the Markov Clustering (MCL) algorithm on various biological interactomes in order to compare those clusterings to normal MCL. In particular, whether any of the graphlets cluster disease genes &#8220;better&#8221;.  <\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Graphlets<\/h3>\n\n\n\n<p>To put it simply, graphlets are connected induced subgraphs. Starting at G<sub>0<\/sub>, we have all of the connected induced subgraphs that can be created from 2 nodes, which is just an edge. I focused this project on graphlets up to 5 nodes.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-1024x348.png\" alt=\"\" class=\"wp-image-686\" width=\"631\" height=\"215\" srcset=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-1024x348.png 1024w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-300x102.png 300w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-768x261.png 768w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-1536x521.png 1536w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM-1200x407.png 1200w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/Screen-Shot-2022-08-11-at-4.47.53-PM.png 1844w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><figcaption>Taken from Anna and Pramesh&#8217;s graphlet grant proposal.<\/figcaption><\/figure>\n\n\n\n<p>The &#8220;induced&#8221; part of the definition means that given a subgraph that looks exactly like G<sub>29<\/sub>, a 5 node clique, we would not say that G<sub>15<\/sub>, the 5 node ring is also present. Given 5 connected nodes, there is only one 5 node graphlet present. The graphlet must contain all of the edges between the selected nodes.<\/p>\n\n\n\n<p>So why graphlets? They are a way to investigate patterns and the higher order structure of the network. In between the very local level of two nodes being connected by an edge, which doesn&#8217;t tell us about any greater trends, but much less global than a statistic such as density, which misses much of the network-specific intricacies. <\/p>\n\n\n\n<h3 class=\"has-text-align-left wp-block-heading\">Markov Clustering<\/h3>\n\n\n\n<p>Markov Clustering essentially simulates a random walk. Given an edge list, MCL creates a transition matrix, where every column and row represent a node with the entry  <code>[i,j]<\/code>  being the probability a transition from one node to the other could occur.  MCL then cycles between two steps: expansion and inflation. <\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong>Expansion<\/strong> simulates a normal random walk (taking the power of the probability matrix using the normal matrix product)<\/li><li><strong>Inflation<\/strong> accentuates high probabilities and diminishes lower probabilities (taking the power entrywise aka the Hadamard power)<\/li><\/ul>\n\n\n\n<p>In the implementation of MCL, there is a parameter you can choose called the &#8220;inflation parameter&#8221; which controls how far those probabilities are exaggerated. A lower inflation parameter means less &#8220;inflation&#8221; and thus a coarser clustering and larger clusters, while a higher inflation parameter produces more, smaller clusters. Which inflation parameter to use is dependent on both the specific network and your goals.<\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-full is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/mcl_pic.png\" alt=\"\" class=\"wp-image-685\" width=\"644\" height=\"150\" srcset=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/mcl_pic.png 564w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/mcl_pic-300x70.png 300w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 984px) 61vw, (max-width: 1362px) 45vw, 600px\" \/><figcaption>With each cycle of inflation and expansion, clusters are found. Taken from the official MCL website, <a href=\"https:\/\/micans.org\/mcl\/index.html\" data-type=\"URL\" data-id=\"https:\/\/micans.org\/mcl\/index.html\">https:\/\/micans.org\/mcl\/index.html<\/a>.<\/figcaption><\/figure><\/div>\n\n\n<h3 class=\"wp-block-heading\">Graphlet-Based MCL<\/h3>\n\n\n\n<p>So the main idea behind this project is to combine the two and see if any graphlet-based versions of MCL uncover anything insightful that normal MCL could not. To combine the two, for a given graphlet, all of the edges that did not participate in the given graphlet were deleted. MCL then performed clustering on the graph that remained. Thus we are left with 30 different clusterings of the same network. <\/p>\n\n\n\n<p>Specifically, I have been focusing on a protein-protein interactome (ppi) with 21,559 proteins and 342,353 edges. This interactome contains genes associated with 519 diseases. Each of the 30 different versions of MCL produced different clusterings of these genes. I am currently trying to compare the results of the different higher-order graphlets to G<sub>0<\/sub>, which is normal MCL. <\/p>\n\n\n\n<p>Here are a couple Venn diagrams comparing the &#8220;best&#8221; clusters from G<sub>0<\/sub> and G<sub>29<\/sub> for a couple diseases. &#8220;Best&#8221; means the cluster had the most disease genes in it. <\/p>\n\n\n<div class=\"wp-block-image\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-1024x580.png\" alt=\"\" class=\"wp-image-698\" width=\"590\" height=\"333\" srcset=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-1024x580.png 1024w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-300x170.png 300w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-768x435.png 768w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-2048x1161.png 2048w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_leukemia-1200x680.png 1200w\" sizes=\"auto, (max-width: 590px) 85vw, 590px\" \/><figcaption>Each of the dots is a gene, while the orange dots are disease genes.<\/figcaption><\/figure><\/div>\n\n<div class=\"wp-block-image is-style-default\">\n<figure class=\"aligncenter size-large is-resized\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-973x1024.png\" alt=\"\" class=\"wp-image-702\" width=\"432\" height=\"455\" srcset=\"https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-973x1024.png 973w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-285x300.png 285w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-768x808.png 768w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-1460x1536.png 1460w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-1946x2048.png 1946w, https:\/\/blogs.reed.edu\/compbio\/files\/2022\/08\/venn_heatstroke-1-1200x1263.png 1200w\" sizes=\"auto, (max-width: 432px) 85vw, 432px\" \/><\/figure><\/div>\n\n\n<p>We see that in these diseases, the best cluster of G<sub>29<\/sub> is much more enriched than the best cluster of G<sub>0<\/sub>. <\/p>\n\n\n\n<p>I have also calculated the hypergeometric score for each of these best clusters across all of the diseases, which is the probability that given <em>N<\/em> disease genes in a pool of <em>M <\/em>total genes, that a cluster with <em>m<\/em> genes would contain <em>n<\/em> disease genes. Essentially, is the number of disease genes in that cluster significant? Or is it likely to happen just by chance.<\/p>\n\n\n\n<p>A problem I am facing with these hypergeometric scores is how to compare the probabilities across different graphlets, and hopefully I&#8217;ll figure something out before my research ends next week.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hi! My name is Hannah Kuder, I&#8217;m a rising math-physics senior working in Anna Ritz&#8217;s lab this summer. My project has focused on implementing a graphlet-based version of the Markov Clustering (MCL) algorithm on various biological interactomes in order to compare those clusterings to normal MCL. In particular, whether any of the graphlets cluster disease &hellip; <a href=\"https:\/\/blogs.reed.edu\/compbio\/2022\/08\/16\/clustering-with-graphlets\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Clustering with Graphlets&#8221;<\/span><\/a><\/p>\n","protected":false},"author":2569,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-681","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/681","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/users\/2569"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/comments?post=681"}],"version-history":[{"count":11,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/681\/revisions"}],"predecessor-version":[{"id":703,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/681\/revisions\/703"}],"wp:attachment":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/media?parent=681"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/categories?post=681"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/tags?post=681"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}