FileFormats.pdf - File Formats, INF 551, Wensheng Wu

File Formats
• Specify what information the bits in a file encode
• Example: text file
  – String of characters with a particular encoding scheme, e.g., ASCII or Unicode
  – E.g., TXT, HTML, JSON, XML
• Others: xls, ppt, pdf, jpg, gif, mp3, png, etc.

Roadmap
• Character encoding
  – ASCII
  – Unicode
• JSON (done earlier)
• XML (will talk about it next)
Note: in u'\u20ac', the leading u marks a Unicode string and \u introduces a hexadecimal code point.

Code space & points
• Code space
  – A range of numerical values available for encoding characters
  – E.g., 0 to 10FFFF for Unicode, 0 to 7F for ASCII
• Code point
  – A value for a character in a code space
• Unicode code point
  – U+ followed by its hexadecimal value, e.g., U+0058 for capital letter 'X'
Note: a UTF-8 encoding always has one leading byte, followed by zero or more continuation bytes.
  – Every continuation byte starts with '10' and carries 6 bits of the code point.
  – A single-byte (ASCII) encoding starts with '0' and carries 7 bits of the code point.
  – The leading byte of a multi-byte encoding starts with a run of 1's (a unary count) followed by a 0. The number of leading 1's equals the number of bytes used for the encoding.

Encoding (of code points)
• Code unit: the smallest unit (comprising a number of bits) used to construct an encoding for a code point
  – Code unit for UTF-8: 8 bits
  – Code unit for UTF-16: 16 bits
• UTF (Unicode Transformation Format) encoding
  – E.g., UTF-8 and UTF-16

Variable-length encoding
• Characters are encoded using codes of different lengths
• In Unicode, a code point may be represented using multiple code units
  – E.g., 1-4 in UTF-8, 1-2 in UTF-16

ASCII
• American Standard Code for Information Interchange
• 128 characters: 7-bit code (code points: 0~7F)
  – Digits: 0-9 (0x30 – 0x39)
  – Uppercase letters: A-Z (0x41 – 0x5A)
  – Lowercase letters: a-z (0x61 – 0x7A)
  – White space (0x20)
  – Punctuation symbols
  – Control characters (e.g., Ctrl-C: 0x03)

ASCII
• 2^7 codes

Windows-1253
• Windows code page for Latin + Greek characters
• Uses 8 bits
  – 0x00 ~ 0xFF

Unicode
• Unicode supports more characters than ASCII and the various code pages
• Unicode separates code points from encoding
  – In contrast to ASCII, where code point = encoding

Unicode
• Code space is divided into 17 planes
• Each plane = 2^16 contiguous code points
• Recall that code points range from 0 to 10FFFF
  ⇒ Total code points = 17 * 2^16 = 1,114,112
• Note: 2^16 = 65,536

Planes in Unicode

Plane 0: BMP (Basic Multilingual Plane)
• Block 00 represents 0000~00FF
• Each block represents 256 code points

UTF-8
• Encoding scheme for the Unicode code space
• Code unit = 8 bits
• Variable length
  – A code point may be represented using 1-4 code units

UTF-8 Design
• ASCII characters use one code unit
  – First bit is zero
• Other Unicode characters use up to 4 units
Note: if more than one byte is used, the number of leading 1 bits in the leading byte equals the number of bytes used. If only one byte is used, it starts with 0 (this is ASCII).

UTF-8 Features
• Backward compatibility
  – One byte for ASCII; the leading bit of the byte is zero (if a byte starts with 0, it is ASCII)
• Clear distinction between single- and multi-byte characters
  – Bytes of single-byte/multi-byte characters start with 0/1 respectively
• Multi-byte sequences
  – A leading byte starts with 2 or more 1's, followed by a 0, e.g., '110', '1110', etc.
  – One or more continuation bytes, all starting with '10'

UTF-8 Features
• Clear indication of code sequence length
  – By the number of 1's in the leading byte (for multi-byte)
• Self-synchronization
  – Can find the start of a character by backing up at most 3 bytes

Example
• Encode '€' using UTF-8
• Code point = U+20AC
• Need 3 bytes in UTF-8
Worked steps:
  Step 1) Convert each hex digit to binary: 2: 0010, 0: 0000, A: 1010, C: 1100
  Step 2) Figure out the number of bits needed: the four hex digits give 16 bits, which is more than the 11 payload bits a 2-byte sequence can hold.
  Step 3) Figure out the number of bytes needed, keeping in mind that the leading byte starts with as many 1's as there are bytes and that each continuation byte starts with 10. Here we need 3 bytes: 1110 _ _ _ _ , 10 _ _ _ _ _ _ , 10 _ _ _ _ _ _. The blanks are filled with the binary digits from right to left, giving 1110 0010, 10 000010, 10 101100.
  Step 4) Split the result into groups of 4 bits and convert each group back to a hex digit: 1110 0010 1000 0010 1010 1100 = E2 82 AC.

Unicode in Python
  >>> a = u'\u20AC'    # note: need u before the opening quote
  >>> print a
  €
  >>> e = u'€'
  >>> e
  u'\u20ac'
• The u prefix indicates a Unicode string (Python 2 syntax)

Unicode in Python
  >>> b = '€'
  >>> b
  '\xe2\x82\xac'       # UTF-8 encoding of €
  >>> u'€'.encode('utf-8')
  '\xe2\x82\xac'

Resources
• UTF-8
• UTF-16
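
The byte layout described in the UTF-8 slides (a leading byte whose count of 1's gives the sequence length, continuation bytes starting with '10') can be sketched by hand as follows. This is a minimal illustration written for Python 3, not the Python 2 used in the slides. The function name utf8_encode is hypothetical and not part of the course material; in practice the built-in str.encode('utf-8') does this job.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch).

    Hypothetical helper for teaching purposes; it does not reject
    surrogate code points (U+D800 to U+DFFF) as a strict encoder would.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode code space")

    # One byte: ASCII, leading bit is 0, 7 payload bits.
    if code_point <= 0x7F:
        return bytes([code_point])

    # Choose the sequence length from the number of payload bits available:
    # 2 bytes hold 11 bits, 3 bytes hold 16 bits, 4 bytes hold 21 bits.
    if code_point <= 0x7FF:
        n_bytes = 2
    elif code_point <= 0xFFFF:
        n_bytes = 3
    else:
        n_bytes = 4

    # Fill continuation bytes from right to left, 6 bits at a time,
    # each prefixed with '10'.
    out = []
    remaining = code_point
    for _ in range(n_bytes - 1):
        out.append(0b10000000 | (remaining & 0b00111111))
        remaining >>= 6

    # Leading byte: n_bytes ones, then a zero, then the remaining bits.
    prefix = (0xFF << (8 - n_bytes)) & 0xFF
    out.append(prefix | remaining)

    return bytes(reversed(out))


# The euro sign from the slides: U+20AC needs 3 bytes.
print(utf8_encode(0x20AC))  # b'\xe2\x82\xac'
```

The result matches the '\xe2\x82\xac' shown in the Python slides, and the same function can be compared against chr(cp).encode('utf-8') for other code points.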

  • Fall '14
  • Character encoding, ASCII, Unicode, UTF-8
