{"id":371,"date":"2018-06-27T15:47:24","date_gmt":"2018-06-27T22:47:24","guid":{"rendered":"http:\/\/blogs.reed.edu\/compbio\/?p=371"},"modified":"2018-06-27T15:55:29","modified_gmt":"2018-06-27T22:55:29","slug":"week-4-parsing-files","status":"publish","type":"post","link":"https:\/\/blogs.reed.edu\/compbio\/2018\/06\/27\/week-4-parsing-files\/","title":{"rendered":"Week 4: Parsing Files"},"content":{"rendered":"<p>This week, I learned how to download files from the <a href=\"https:\/\/portal.gdc.cancer.gov\/\">TCGA database<\/a> and explored their structure.<\/p>\n<p>I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database). These came bundled in a tar archive, which unpacked into a directory tree that looked like this:<img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-376\" src=\"https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM.png\" alt=\"\" width=\"2254\" height=\"750\" srcset=\"https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM.png 2254w, https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM-300x100.png 300w, https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM-768x256.png 768w, https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM-1024x341.png 1024w, https:\/\/blogs.reed.edu\/compbio\/files\/2018\/06\/Screen-Shot-2018-06-27-at-3.36.46-PM-1200x399.png 1200w\" sizes=\"auto, (max-width: 709px) 85vw, (max-width: 909px) 67vw, (max-width: 1362px) 62vw, 840px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>Each tiny blue dot is a folder, and inside each folder is one gzipped (compressed) file. 
Each of these files holds one sample from a patient, listing genes and their expression levels (from either a tumor sample or a healthy one).<\/p>\n<p>My main challenge has been parsing these files into a data matrix, ideally with sample IDs across the top and gene expression down the side. So far, I&#8217;ve been able to organize these files by patient and the samples relating to them, because ideally we want to compare gene expression in a healthy sample and a tumor sample from the same patient. However, I haven&#8217;t yet been able to write the sample IDs with their gene expression data to a file because of several bugs. My goal for next week is to fix them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This week, I learned how to download files from the TCGA database and explored their structure. I downloaded 512 files relating to colorectal cancer (COAD in the TCGA database). These came bundled in a tar archive, which unpacked into a directory tree that looked like this: &nbsp; Each tiny blue dot is &hellip; <a href=\"https:\/\/blogs.reed.edu\/compbio\/2018\/06\/27\/week-4-parsing-files\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;Week 4: Parsing 
Files&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1725,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,3,11],"tags":[],"class_list":["post-371","post","type-post","status-publish","format-standard","hentry","category-cancer","category-computer-science","category-summer-research-2018"],"_links":{"self":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/371","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/users\/1725"}],"replies":[{"embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/comments?post=371"}],"version-history":[{"count":2,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/371\/revisions"}],"predecessor-version":[{"id":380,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/posts\/371\/revisions\/380"}],"wp:attachment":[{"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/media?parent=371"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/categories?post=371"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/blogs.reed.edu\/compbio\/wp-json\/wp\/v2\/tags?post=371"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}