One or more continuation bytes all start UTF 8 Features Clear

One or more continuation bytes, all of which start with the bits ‘10’
UTF-8 Features
  • Clear indication of code sequence length: given by the number of 1s at the start of the leading byte (for multi-byte sequences)
  • Self-synchronization: the start of a character can be found by backing up at most 3 bytes (5 in the original design)
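A minimal Python sketch of the two features above, assuming the modern four-byte limit. The helper names `utf8_sequence_length`, `is_continuation`, and `char_start` are hypothetical and chosen only for illustration.

```python
def utf8_sequence_length(lead: int) -> int:
    """Return the sequence length implied by a UTF-8 leading byte."""
    if lead & 0b1000_0000 == 0:
        return 1  # 0xxxxxxx: single byte (ASCII)
    if lead & 0b1110_0000 == 0b1100_0000:
        return 2  # 110xxxxx: two leading 1s
    if lead & 0b1111_0000 == 0b1110_0000:
        return 3  # 1110xxxx: three leading 1s
    if lead & 0b1111_1000 == 0b1111_0000:
        return 4  # 11110xxx: four leading 1s
    raise ValueError("not a valid leading byte")


def is_continuation(byte: int) -> bool:
    """Continuation bytes have the form 10xxxxxx."""
    return byte & 0b1100_0000 == 0b1000_0000


def char_start(data: bytes, i: int) -> int:
    """Back up from index i to the first byte of the character containing it."""
    while i > 0 and is_continuation(data[i]):
        i -= 1
    return i


data = "a€b".encode("utf-8")          # bytes: 61 E2 82 AC 62
print(utf8_sequence_length(data[1]))  # 3
print(char_start(data, 3))            # 1
```

Because every continuation byte is marked with ‘10’, `char_start` never needs to step back more than 3 bytes on well-formed input.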
Example: encode ‘€’ using UTF-8
  • Code point = U+20AC
  • U+20AC lies in the range U+0800 to U+FFFF, so it needs 3 bytes in UTF-8
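The example can be worked through in Python as a sketch. It assumes the standard three-byte layout `1110xxxx 10xxxxxx 10xxxxxx`; the function name `encode_utf8_3byte` is hypothetical and not taken from the slides.

```python
def encode_utf8_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF as three UTF-8 bytes."""
    if not 0x0800 <= cp <= 0xFFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("code point does not take the three-byte form")
    b1 = 0b1110_0000 | (cp >> 12)                # top 4 bits
    b2 = 0b1000_0000 | ((cp >> 6) & 0b11_1111)   # middle 6 bits
    b3 = 0b1000_0000 | (cp & 0b11_1111)          # low 6 bits
    return bytes([b1, b2, b3])


encoded = encode_utf8_3byte(0x20AC)
print(encoded.hex())                   # e282ac
print(encoded == "€".encode("utf-8"))  # True
```

U+20AC is `0010 0000 1010 1100` in binary; splitting it into groups of 4, 6, and 6 bits gives the payloads of the bytes E2, 82, and AC.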
UTF-16
  • Code unit = 16 bits
  • Variable-length encoding: a code point is encoded as one or two code units
  • Not compatible with ASCII
UTF-16
  • Plane 0 (the BMP): encoded using one 16-bit code unit
  • All other planes: two code units
UTF-16 Encoding
  • U+0000 to U+D7FF and U+E000 to U+FFFF: one code unit, or 2 bytes
  • U+D800 to U+DFFF: reserved (for surrogate code units)
  • U+10000 to U+10FFFF: two code units, or 4 bytes
Encoding planes 1 to 16
Code points: U+10000 to U+10FFFF
  1. Subtract 0x10000 from the code point, giving a 20-bit value in the range 0x00000 to 0xFFFFF
  2. Add the first (high) 10 bits to 0xD800 to get the first code unit
  3. Add the second (low) 10 bits to 0xDC00 to get the second code unit
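The three steps above can be sketched in Python as follows. The function name `utf16_surrogate_pair` and the sample code point U+1F600 are assumptions chosen for illustration; they do not come from the slides.

```python
def utf16_surrogate_pair(cp: int) -> tuple:
    """Encode a code point in planes 1..16 as two UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside planes 1..16")
    v = cp - 0x10000                # step 1: 20-bit value, 0x00000..0xFFFFF
    first = 0xD800 + (v >> 10)      # step 2: high 10 bits
    second = 0xDC00 + (v & 0x3FF)   # step 3: low 10 bits
    return first, second


first, second = utf16_surrogate_pair(0x1F600)
print(f"{first:04X} {second:04X}")  # D83D DE00
```

The result can be cross-checked against Python's built-in codec: `"\U0001F600".encode("utf-16-be")` yields the same four bytes, D8 3D DE 00.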
Examples
  • Encoding code points in the BMP is easy
Big-Endian (BE) and Little-Endian (LE)
  • Two ways of ordering the bytes of a multi-byte word (code unit)
  • Not a problem in UTF-8, whose code unit is a single byte
  • UTF-16BE (big end first, i.e. most significant byte first): no change to the order, so 0xABCD is stored as “AB CD”
  • UTF-16LE: the byte order is reversed, so 0xABCD is stored as “CD AB”
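A short Python sketch of the two byte orders, using the built-in `int.to_bytes`. The helper name `code_unit_bytes` is hypothetical.

```python
def code_unit_bytes(unit: int, byteorder: str) -> bytes:
    """Serialize one 16-bit code unit as 2 bytes in the given byte order."""
    return unit.to_bytes(2, byteorder)


print(code_unit_bytes(0xABCD, "big").hex())     # abcd
print(code_unit_bytes(0xABCD, "little").hex())  # cdab

# The same holds for a real character: '€' is the single code unit 0x20AC.
print("€".encode("utf-16-be").hex())  # 20ac
print("€".encode("utf-16-le").hex())  # ac20
```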
Byte Order Mark (BOM)
  • Unicode recommends adding a BOM at the beginning of a text
  • The BOM tells which byte order the text follows
  • BOM: U+FEFF
  • “FE FF” => big endian
  • “FF FE” => little endian
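One way to check for a UTF-16 BOM in Python is sketched below, using the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants from the standard library. The function name `detect_utf16_order` is hypothetical, and the sketch assumes the input is UTF-16: it does not tell UTF-16LE apart from UTF-32LE, whose BOM also begins with FF FE.

```python
import codecs
from typing import Optional


def detect_utf16_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' if data starts with a UTF-16 BOM, else None."""
    if data.startswith(codecs.BOM_UTF16_BE):  # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # FF FE
        return "little"
    return None  # no BOM: the byte order must be known some other way


print(detect_utf16_order(b"\xfe\xff\x20\xac"))  # big
print(detect_utf16_order(b"\xff\xfe\xac\x20"))  # little
print(detect_utf16_order(b"\x20\xac"))          # None
```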
Gulliver’s Travels
Besides, our Histories of six thousand Moons make no mention of any other Regions, than the two great Empires of Lilliput and Blefuscu. Which two mighty Powers have, as I was going to tell you, been engaged in a most obstinate War for six and thirty Moons past. It began upon the following Occasion. It is allowed on all Hands, that the primitive way of breaking Eggs, before we eat them, was upon the larger End: But his present Majesty's Grand-father, while he was a Boy, going to eat an Egg, and breaking it according to the ancient Practice, happened to cut one of his Fingers. Whereupon the Emperor his Father published an Edict, commanding all his Subjects, upon great Penaltys, to break the smaller End of their Eggs. The People so highly resented this Law, that our Histories tell us there have been six Rebellions raised on that account; wherein one Emperor lost his Life, and another his Crown. These civil Commotions were constantly fomented by the Monarchs of Blefuscu; and when they were quelled, the Exiles always fled for Refuge to that Empire. It is computed, that eleven thousand Persons have, at several times, suffered Death, rather than submit to break their Eggs at the smaller End.


  • Fall '14
  • Character encoding, ASCII, Carriage return, Unicode, Blefuscu