File Formats

What is a File Format
•  From Wikipedia:
  –  A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free and may be either unpublished or open.

Aspects of Formats
•  Proprietary or open?
•  Text or binary
•  Single type of data or multiple types of data
•  Fixed or extensible
•  Independent from, but may be accompanied by, an application programming interface (API)

How do we know what format we have?
•  File extension
  –  Typically 3 characters: .txt, .jpg, .mov, .mp3, …
•  Internal metadata
  –  Header or magic number
  –  Often at start of file, but can be elsewhere
•  Attributes managed by the file system
  –  OS X Uniform Type Identifiers (UTIs), OS/2 extended attributes, POSIX extended attributes
•  External metadata
  –  MIME types
  –  Descriptor file
•  File-content-based format identification

Types of Formats
•  Raw
  –  Direct representation of computer memory
•  Chunked
  –  File format contains one or more data “chunks”, along with rules that describe the structure within each chunk
•  Directory
  –  Hierarchical structuring of data

Text Files
•  What is the format for a sequence of “text characters”?
•  Coding of characters:
  –  ASCII: 1-byte character coding, Latin only
  –  Unicode
    •  Encodings built on 1-, 2-, and 4-byte code units (UTF-8, UTF-16, UTF-32)
    •  Support for many different languages

Unicode
•  Abstract character set
  –  Capital “A”, Ω, …
  –  Separate character from glyph
  –  Define mapping of abstract character into 21-bit “code point”
    •  1,114,112 code points in the range 0 to 10FFFF
•  Example:
  –  “e” U+0065 (LATIN SMALL LETTER E)
  –  “é” U+0065 U+0301 (base letter followed by a combining acute accent)
  –  “é” U+00E9 (LATIN SMALL LETTER E WITH ACUTE)
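The two spellings of “é” in the Unicode example above can be compared directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `code_points` is an illustrative assumption, not something from the slides.

```python
import unicodedata

# Two ways to write "é": a base letter plus a combining accent,
# or a single precomposed code point.
decomposed = "e\u0301"    # U+0065 followed by U+0301
precomposed = "\u00e9"    # U+00E9


def code_points(text):
    """Return the code points of text in U+XXXX notation (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The strings render alike but are different code point sequences.
same_raw = decomposed == precomposed

# NFC normalization composes the pair into the single code point.
same_nfc = unicodedata.normalize("NFC", decomposed) == precomposed

print(code_points(decomposed), code_points(precomposed))
print(same_raw, same_nfc)
print(unicodedata.name(precomposed))
```

Software that compares or searches text usually normalizes both sides first, since a plain string comparison treats the two forms as different.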
Universal Character Set Transformation Format: UTF-8

Bits of code point | First code point | Last code point | Bytes in sequence | Byte 1 | Bytes 2-N
7  | U+0000    | U+007F     | 1 | 0XXXXXXX |
11 | U+0080    | U+07FF     | 2 | 110XXXXX | 10XXXXXX
16 | U+0800    | U+FFFF     | 3 | 1110XXXX | 10XXXXXX
21 | U+10000   | U+1FFFFF   | 4 | 11110XXX | 10XXXXXX
26 | U+200000  | U+3FFFFFF  | 5 | 111110XX | 10XXXXXX
31 | U+4000000 | U+7FFFFFFF | 6 | 1111110X | 10XXXXXX

•  Because Unicode ends at U+10FFFF, current UTF-8 (RFC 3629) uses at most 4 bytes; the 5- and 6-byte forms belong to the original design only.
•  Backward compatibility: the one-byte encoding of U+0000 to U+007F has the same value as the ASCII code.
•  Self-synchronization
•  Clear indication of code sequence length

UTF-16
•  U+0000 to U+D7FF and U+E000 to U+FFFF
  –  Represent directly as a 16-bit number
•  U+010000 to U+10FFFF
  –  Subtract 0x10000 to make a 20-bit number
  –  Code into two 16-bit numbers
    •  Top 10 bits offset by 0xD800
    •  Bottom 10 bits offset by 0xDC00

Byte Order
•  In what order do we place the bytes of each 16-bit unit?
  –  Two 16-bit units with bytes A B and C D
    •  Big endian: A B C D
    •  Little endian: B A D C
•  Unicode recommends prepending a Byte Order Mark (BOM) to the string, representing the character U+FEFF.
  –  If the file starts with FE FF, the encoding is UTF-16BE.
  –  If it starts with FF FE, the encoding is UTF-16LE.

"Example" in different encodings (UTF-16 with BOM):
ASCII:    45 78 61 6d 70 6c 65
UTF-16BE: FE FF 00 45 00 78 00 61 00 6d 00 70 00 6c 00 65
UTF-16LE: FF FE 45 00 78 00 61 00 6d 00 70 00 6c 00 65 00

Gulliver’s Travels
Besides, our Histories of six thousand Moons make no mention of any other Regions, than the two great Empires of Lilliput and Blefuscu. Which two mighty Powers have, as I was going to tell you, been engaged in a most obstinate War for six and thirty Moons past. It began upon the following Occasion. It is allowed on all Hands, that the primitive way of breaking Eggs, before we eat them, was upon the larger End: But his present Majesty's Grand-father, while he was a Boy, going to eat an Egg, and breaking it according to the ancient Practice, happened to cut one of his Fingers.
Whereupon the Emperor his Father published an Edict, commanding all his Subjects, upon great Penaltys, to break the smaller End of their Eggs. The People so highly resented this Law, that our Histories tell us there have been six Rebellions raised on that account; wherein one Emperor lost his Life, and another his Crown. These civil Commotions were constantly fomented by the Monarchs of Blefuscu; and when they were quelled, the Exiles always fled for Refuge to that Empire. It is computed, that eleven thousand Persons have, at several times, suffered Death, rather than submit to break their Eggs at the smaller End. Many hundred large Volumes have been published upon this Controversy: But the books of the Big-Endians have been long forbidden, and the whole Party rendered incapable by Law of holding Employments. During the Course of these Troubles, the Emperors of Blefuscu did frequently expostulate by their Ambassadors, accusing us of making a Schism in Religion, by offending against a fundamental Doctrine of our great Prophet Lustrog, in the fifty-fourth Chapter of the Brundrecal (which is their Alcoran.) This, however, is thought to be a meer Strain upon the Text: For the Words are these: That all true Believers shall break their Eggs at the convenient End: and which is the convenient End, seems, in my humble Opinion, to be left to every Man's Conscience, or at least in the power of the Chief Magistrate to determine. Now the Big-Endian Exiles have found so much Credit in the Emperor of Blefuscu's Court, and so much private Assistance and Encouragement from their Party here at home, that a bloody War has been carried on between the two Empires for six and thirty Moons with various Success; during which time we have lost forty Capital Ships, and a much greater number of smaller Vessels, together with thirty thousand of our best Seamen and Soldiers; and the Damage received by the Enemy is reckon'd to be somewhat greater than Ours.
However, they have now equipped a numerous Fleet, and are just preparing to make a Descent upon us; and his Imperial Majesty, placing great Confidence in your Valour and Strength, has commanded me to lay this Account of his affairs before you.

Unicode Text Encoding Examples

Character | Code point | UTF-16 | UTF-8
a  | U+0061  | 0061      | 61
ä  | U+00E4  | 00E4      | C3 A4
σ  | U+03C3  | 03C3      | CF 83
א  | U+05D0  | 05D0      | D7 90
٣  | U+0663  | 0663      | D9 A3
カ | U+30AB  | 30AB      | E3 82 AB
退 | U+9000  | 9000      | E9 80 80
(CJK Extension B ideograph) | U+21BC1 | D846 DFC1 | F0 A1 AF 81

Unicode Overview, 9/9/14
Emoji: U+1F36D LOLLIPOP, U+1F36E CUSTARD, U+1F36F HONEY POT, and U+1F370 SHORTCAKE

Let's look at some files…

What about new line (end of line)?
•  LF (line feed, '\n', 0x0A, 10 in decimal)
•  CR (carriage return, '\r', 0x0D, 13 in decimal)
•  Different systems represent newline differently
  –  Windows: \r\n
  –  Unix based: \n
  –  Old Mac: \r

Unicode
The Unicode standard defines a number of characters that conforming applications should recognize as line terminators:
  –  LF: Line Feed, U+000A
  –  VT: Vertical Tab, U+000B
  –  FF: Form Feed, U+000C
  –  CR: Carriage Return, U+000D
  –  CR+LF: CR (U+000D) followed by LF (U+000A)
  –  NEL: Next Line, U+0085
  –  LS: Line Separator, U+2028
  –  PS: Paragraph Separator, U+2029

Comma Separated Values
•  Model
  –  Plain text file using a character set
  –  Set of records (one per line)
  –  Separated into fields by a reserved character
    •  Often a comma
    •  Every record has the same sequence of fields.

There is no CSV standard
•  What about:
  –  Values with commas in them
  –  Values with new lines in them
  –  Character coding?
  –  Whitespace
  –  Header or not?

RFC 4180: MIME type “text/csv”
•  DOS-style lines that end with CR/LF characters
  –  Optional for the last line
•  An optional header record
  –  How do we know if there is a header?
•  Each record "should" contain the same number of comma-separated fields
•  Whitespace is preserved
•  Any field may be quoted (with double quotes).
•  Fields containing a line break, double quote, and/or commas should be quoted.
•  A (double) quote character in a field must be represented by two (double) quote characters.

Example

Name | Favorite | Cost | Notes
[email protected] | Pliny the elder | > 8 | Hoppy, hoppy, hoppy Beer advocate “best beer”
Flinstone, Fred | Sculpin | 3 | It’s a good beer
August Busch | Budvar | $2 | SVĚTLÝ LEŽÁK

File Header
•  Identification bytes (magic number)
•  Header checksum
•  Version number
•  Offset to data

Portable Gray Map
•  Header (fields separated by whitespace: blank, TAB, CR, LF)
  –  Magic number identifying the file type: "P5"
  –  A width, formatted as ASCII characters in decimal
  –  A height, again in ASCII decimal
  –  Maximum gray value, again in ASCII decimal
•  A raster
  –  Height rows, width columns
  –  0 is black, the maximum gray value is white
  –  One byte per pixel if the maximum gray value is less than 256, otherwise 2 bytes, big endian

PGM Example
P2 is the plain (ASCII) variant of PGM: every number is stored as text, so the value “15 ” is the bytes 0x31 0x35 0x20.

P2
# feep.pgm
24 7
15
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
0  3  3  3  3  0  0  7  7  7  7  0  0 11 11 11 11  0  0 15 15 15 15  0
0  3  0  0  0  0  0  7  0  0  0  0  0 11  0  0  0  0  0 15  0  0 15  0
0  3  3  3  0  0  0  7  7  7  0  0  0 11 11 11  0  0  0 15 15 15 15  0
0  3  0  0  0  0  0  7  0  0  0  0  0 11  0  0  0  0  0 15  0  0  0  0
0  3  0  0  0  0  0  7  7  7  7  0  0 11 11 11 11  0  0 15  0  0  0  0
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

PGM Example (Binary)
The raster is 24x7 bytes, one byte per pixel, shown here as one hex digit per byte.

P5
# feep.pgm
24 7
15
000000000000000000000000
0333300777700BBBB00FFFF0
0300000700000B00000F00F0
0333000777000BBB000FFFF0
0300000700000B00000F0000
0300000777700BBBB00F0000
000000000000000000000000

Portable Network Graphics (PNG)
•  Chunk-based file format
•  Header
•  Two types of chunks
  –  Critical (must be able to decode)
  –  Ancillary (not required to decode)
•  Layout: Header | Chunk 1 | Chunk 2 | … | Chunk N

PNG File Header

Bytes | Purpose
89 | Has the high bit set to detect transmission systems that do not support 8-bit data and to reduce the chance that a text file is mistakenly interpreted as a PNG, or vice versa.
50 4E 47 | ASCII letters “PNG”
0D 0A | DOS-style line ending (CR LF), to detect line-ending conversion of the data
1A | End-of-file byte that stops display of the file under DOS
0A | Unix-style line ending (LF), to detect line-ending conversion in the other direction

PNG Chunk Format
Length (4 bytes) | Type (4 bytes) | Data (length bytes) | CRC (4 bytes)

Critical Chunks
•  IHDR
  –  Must be the first chunk; it contains the image's width, height, color type and bit depth.
•  PLTE
  –  Contains the palette: a list of colors.
•  IDAT
  –  Contains the image, which may be split among multiple IDAT chunks.
•  IEND
  –  Marks the image end.

IHDR
•  The IHDR chunk must appear FIRST. It contains:
  –  Width: 4 bytes
  –  Height: 4 bytes
  –  Bit depth: 1 byte
  –  Color type: 1 byte
  –  Compression method: 1 byte
  –  Filter method: 1 byte
  –  Interlace method: 1 byte

Previous Example in PNG
•  Header
•  IHDR
  –  Width: 24
  –  Height: 7
  –  Bit depth: 4
  –  Color type: 0
  –  Compression method: 0
  –  Filter method: 0
  –  Interlace method: 0
•  IDAT
•  IEND

MP3
•  Stores audio coded with a specific compression
  –  Like PNG in that manner
•  Structure
  –  File consists of a set of chunks (called frames).
  –  Each frame has a header, followed by frame data
    •  Bitrate may change between frames
    •  Coding allows data to span frames
  –  Descriptive metadata may be at the beginning or end of the file
    •  Use ID3 tagging

ID3 Tagging
•  “Trick” MP3 into having metadata
  –  Ignored by
•  Defines a chunk based format for

ID3v2 Format
Header format:

Field | Value
ID3v2/file identifier | "ID3"
ID3v2 version | 0x03 00
ID3v2 flags | abc00000 (bit flags)
ID3v2 size | 4 * 0xxxxxxx (four bytes, each with its high bit clear)

Frame format:

Field | Value
Frame ID | xx xx xx xx
Size | xx xx xx xx
Flags | xx xx
Frame data | …

Example ID3 Frames
•  TCOM
  –  The 'Composer(s)' frame is intended for the name of the composer(s). They are separated with the "/" character.
•  TFLT
  –  The 'File type' frame indicates which type of audio this tag defines, e.g. MPEG/1, MPEG/2, MPEG/3.
•  TIT1
  –  The 'Content group description' frame is used if the sound belongs to a larger category of sounds/music.
     For example, classical music is often sorted in different musical sections (e.g. "Piano Concerto", "Weather - Hurricane").
•  TIT2
  –  The 'Title/Songname/Content description' frame is the actual name of the piece (e.g. "Adagio", "Hurricane Donna").
•  TIT3
  –  The 'Subtitle/Description refinement' frame is used for information directly related to the content's title (e.g. "Op. 16" or "Performed live at Wembley").

What is HDF5?
HDF = Hierarchical Data Format
•  Hierarchical file format
•  Binary representation
  –  Optimized for big, scientific data
•  Data model, library and file format for managing data
•  Tools for accessing data in the HDF5 format
(10/15/08, HDF & HDF-EOS Workshop XII)

… varied data …
(LCI Tutorial; thanks to Mark Miller, LLNL)

… and complex relationships …
[Figure: related genomics data sets: SNP score, contig summaries, discrepancies, contig qualities, coverage depth, trace, reads, aligned bases, read quality, contig, percent match]

… on big computers … and small computers …

How do we…
•  Describe our data?
•  Read it? Store it? Find it? Share it? Mine it?
•  Move it into, out of, and between computers and repositories?
•  Achieve storage and I/O efficiency?
•  Give applications and tools easy access to our data?

Structure of HDF5 Library
Layers, from top to bottom:
•  Applications
•  Object API (C, F90, C++, Java)
•  Library internals
•  Virtual file I/O
•  File or other “storage”

HDF Tools
•  HDFView and Java products
•  Command-line utilities (h5dump, h5ls, h5cc, h5diff, h5repack)

An HDF5 file is a container… into which you can put your data objects.
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6

HDF5 Structures for Organizing Objects
[Figure: the root group “/” and a group “foo” holding a 3-D array, a 2-D array, a table (the lat/lon/temp table above), a palette, and raster images]

HDF5 Data Model
•  Primary objects
  –  Groups
  –  Datasets
•  Additional ways to organize and annotate data
  –  Attributes
  –  Storage and access properties
•  Everything else is built from these parts.

HDF5 Dataset
•  Data
•  Metadata
  –  Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  –  Datatype: Integer
  –  Storage info: Chunked, Compressed
  –  Attributes: Time = 32.4, Pressure = 987, Temp = 56

Dataspaces
Two roles:
  –  Dataspace contains spatial info about a dataset stored in a file
    •  Rank and dimensions
    •  Permanent part of dataset definition
    •  Example: Rank = 2, Dimensions = 4x6
  –  Partial I/O: dataspace describes the application’s data buffer and the data elements participating in I/O
    •  Example: Rank = 1, Dimension = 10

Datatypes (array elements)
•  Datatype: how to interpret a data element
  –  Permanent part of the dataset definition
  –  Two classes: atomic and compound

Datatypes
•  HDF5 atomic types include:
  –  integer & float
  –  user-definable (e.g., 13-bit integer)
  –  variable length types (e.g., strings)
  –  references to objects/dataset regions
  –  enumeration: names mapped to integers
•  HDF5 compound types
  –  Comparable to C structs (“records”)
  –  Members can be atomic or compound types

HDF5 dataset: array of records
•  Dimensionality: 5 x 3
•  Datatype: record with members int8, int4, int16, and a 2x3x2 array of float32

Properties
•  Properties are characteristics of HDF5 objects that can be modified
•  Default properties handle most needs
•  By changing properties, you can
   take advantage of the more powerful features in HDF5

Special Storage Properties
•  Chunked: better subsetting access time; extensible
•  Compressed: improves storage efficiency, transmission speed
•  Extensible: arrays can be extended in any direction
•  Split file: metadata in one file, raw data in another
  –  Example for dataset “Fred”: File A holds the metadata for Fred, File B holds the data for Fred

Attributes (optional)
•  Attribute: data of the form “name = value”, attached to an object
•  Operations similar to dataset operations, but …
  –  Not extensible
  –  No compression or partial I/O
•  Can be overwritten, deleted, added during the “life” of a dataset

HDF5 Dataset (again)
•  Data
•  Metadata
  –  Dataspace: Rank = 3; Dimensions: Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
  –  Datatype: Integer
  –  Storage info: Chunked, Compressed
  –  Attributes: Time = 32.4, Pressure = 987, Temp = 56

Groups
•  A mechanism for organizing collections
•  Every file starts with a root group
•  Similar to UNIX directories
•  Can have attributes
[Figure: root group “/” with members labeled A, B, C, k, l, m]

Path to HDF5 Object in a File
•  / (root)
•  /x
•  /foo
•  /foo/temp
•  /foo/bar/temp
[Figure: root group “/” containing x and foo; foo contains temp and bar; bar contains temp]

eXtensible Markup Language
•  Hierarchical file structure
•  Text based
  –  Built on Unicode

Markup and Content
•  Markup
  –  Provides structure to data
  –  Strings that begin with the character < and end with a >, or
  –  begin with the character & and end with a ;
•  Content
  –  Actual data within the file

Tag
•  A markup construct that begins with < and ends with >.
•  Three types of tags:
  –  start-tags; for example: <section>
  –  end-tags; for example: </section>
  –  empty-element tags; for example: <line-break />

Element
•  A logical document component that either begins with a start-tag and ends with a matching end-tag, or consists only of an empty-element tag.
•  Characters between the start- and end-tags, if any, are the element's content
  –  May be nested (child elements).
•  Example
  –  <Greeting>Hello, world.</Greeting>
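The tag and element definitions above can be checked with a parser. This is a minimal sketch using Python's standard `xml.etree.ElementTree`; the wrapper element `Section` is a made-up assumption added so the two sample tags from the slides have a common parent.

```python
import xml.etree.ElementTree as ET

# "Section" is a hypothetical wrapper; Greeting and line-break are the
# slide examples of a start/end-tag pair and an empty-element tag.
document = "<Section><Greeting>Hello, world.</Greeting><line-break /></Section>"

root = ET.fromstring(document)

# Child elements are nested inside the root element's content.
children = [child.tag for child in root]

greeting = root.find("Greeting")      # element with text content
line_break = root.find("line-break")  # empty element: no text, no children

print(root.tag, children)
print(greeting.text)
print(line_break.text, len(line_break))
```

An empty-element tag yields an element whose text is `None`, which is how a parser distinguishes it from an element that holds content.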
Attribute
•  A name/value pair that exists within a start-tag or empty-element tag.
  –  <img src="madonna.jpg" alt='Foligno Madonna, by Raphael' />
  –  <step number="3">Connect A to B.</step>
•  An attribute can only have a single value and each attribute can appear at most once on an element, but a list can be coded inside the value:
  –  <div class="inner greeting-box">Hello!</div>

XML Declaration
•  XML documents may begin by declaring some information about themselves
  –  <?xml version="1.0" encoding="UTF-8"?>

Element vs. Attribute (IBM)
•  Principle of core content
  –  If you consider the information in question to be part of the essential material that is being expressed or communicated in the XML, put it in an element.
•  Principle of structured information
  –  If the information is expressed in a structured form, especially if the structure may be extensible, use elements. I hope to expand on the treatment of people's names in markup in a future article.
•  Principle of readability
  –  If the information is intended to be read and understood by a person, use elements.
•  Principle of element/attribute binding
  –  Use an element if you need its value to be modified by another attribute.

Binary in XML
•  XML is a “text format”
•  Binary data can be included by converting to text
  –  Issue: making data 7-bit clean and printable
•  Solution
  –  Convert 3x8 bits into 4x6 bits
  –  Code 6 bits into one of 64 printable characters (e.g.
     Base64 encoding: A-Z, a-z, 0-9, +, /)

JavaScript Object Notation (RFC 7159)
•  An object is a list of members (name/value pairs)
  –  {"foo": 12, "bar": "a string", "boz": [1, 2, 3]}
•  A JSON value MUST be an object, array, number, or string, or one of the following three literal names: false null true

Example: JSON Object
    {
      "Image": {
        "Width": 800,
        "Height": 600,
        "Title": "View from 15th Floor",
        "Thumbnail": {
          "Url": "http://…",
          "Height": 125,
          "Width": 100
        },
        "Animated": false,
        "IDs": [116, 943, 234, 38793]
      }
    }

Example: JSON Array with Two Objects
    [
      {
        "precision": "zip",
        "Latitude": 37.7668,
        "Longitude": -122.3959,
        "Address": "",
        "City": "SAN FRANCISCO",
        "State": "CA",
        "Zip": "94107",
        "Country": "US"
      },
      {
        "precision": "zip",
        "Latitude": 37.371991,
        "Longitude": -122.026020,
        "Address": "",
        "City": "SUNNYVALE",
        "State": "CA",
        "Zip": "94085",
        "Country": "US"
      }
    ]

ZIP File Structure
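Returning to the Base64 rule (3x8 bits become 4x6 bits) and the JSON value types described earlier, the two can be seen together in a small round trip. This is a minimal sketch using Python's standard `base64` and `json` modules; the member names (`name`, `payload`, `ids`, `animated`, `note`) are illustrative assumptions, and the three input bytes are simply the first three bytes of the PNG signature.

```python
import base64
import json

# Three bytes (3 x 8 bits): 89 50 4E, the start of the PNG signature.
raw = bytes([0x89, 0x50, 0x4E])

# Base64 regroups the 24 bits into four 6-bit values, each mapped to
# one printable character.
encoded = base64.b64encode(raw).decode("ascii")

# A JSON object holding a string, an array, a boolean literal and null.
record = {
    "name": "sample",
    "payload": encoded,
    "ids": [1, 2, 3],
    "animated": False,
    "note": None,
}

text = json.dumps(record)       # serialize to JSON text
parsed = json.loads(text)       # parse it back
restored = base64.b64decode(parsed["payload"])

print(encoded, len(encoded))
print(text)
print(restored == raw)
```

The same approach works for binary data inside XML, since both formats can only carry it as printable text.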
  • Fall '14
  • Hierarchical Data Format, NetCDF, file  format, character  set
