Raw Sequence Data
•  4 bases: A, C, G, T + other (e.g. N = any, R = G or A (purine), Y = T or C (pyrimidine))
–  kb (= kbp) = kilo base pairs = 1,000 bp
–  Mb = mega base pairs = 1,000,000 bp
–  Gb = giga base pairs = 1,000,000,000 bp
•  Size:
–  E. coli 4.6 Mbp (4,600,000)
–  Fish 130 Gbp (130,000,000,000)
–  Paris japonica (Plant) 150 Gbp
–  Human 3.2 Gbp

Fasta File
•  A sequence in FASTA format begins with a single-line description, followed by lines of sequence data (file extension is .fa).
•  It is recommended that all lines of text be shorter than 80 characters in length.

8/26/13

Fastq File
•  A record typically contains 4 lines:
–  Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description.
–  Line 2 is the sequence.
–  Line 3 is the delimiter '+', with an optional description.
–  Line 4 holds the quality scores, one character per base in Line 2.
–  File extension is .fq

@SEQ_ID
GATTTGGGGTTCAAAGCT...
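
The FASTA and FASTQ layouts described above can be sketched as two small Python readers. This is a minimal illustration, not a reference parser: the function names read_fasta and read_fastq and the sample records are hypothetical, and the sketch assumes every FASTQ record occupies exactly 4 lines. The '>' marker that opens a FASTA description line is not stated in the text above; it is the usual convention for the format.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA text."""
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]
            chunks = []
        elif description is not None:
            # Sequence data may be wrapped over several lines; join them.
            chunks.append(line)
    if description is not None:
        yield description, "".join(chunks)


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (identifier, description, sequence, quality) per 4-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # skip blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        delimiter = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError("line 1 must begin with '@'")
        if not delimiter.startswith("+"):
            raise ValueError("line 3 must begin with '+'")
        if len(quality) != len(sequence):
            raise ValueError("quality and sequence lengths differ")
        # The identifier runs up to the first space; the rest is optional.
        identifier, _, description = header[1:].partition(" ")
        yield identifier, description, sequence, quality


# Hypothetical sample data, made up for this sketch.
fasta_text = ">seq1 example\nGATT\nACA\n"
fastq_text = "@read1 demo\nGATTACA\n+\nIIIIIII\n"

for desc, seq in read_fasta(io.StringIO(fasta_text)):
    print(desc, len(seq), "bp")
for ident, desc, seq, qual in read_fastq(io.StringIO(fastq_text)):
    print(ident, seq, qual)
```

Both readers are generators, so a large file can be processed one record at a time instead of being loaded into memory at once.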
