CS lecture, Nov 4

Unicode
  Codepoints: Unicode assigns values to over 100,000 characters from many languages (living and dead).
  Range: 0 -> 100,000 (or more). 2^17 = 131,072, so about 17 bits, which rounds up to 3 bytes. (Is 3 bytes per character too much?)
  "a": codepoint 97 (U+0061), binary 1100001. The codepoint is the same as the ASCII value.
  日 (sun): codepoint 26085 (U+65E5), binary 110010111100101.

Representing codepoints: one way is UTF-8
  UTF-8 is backwards compatible with ASCII.
  Variable-byte encoding: uses 1-4 bytes depending on the value of the codepoint.
  (Other encodings exist, e.g. UTF-16.)

1-byte characters (UTF-8)
  Unicode codepoints 0 -> 127 are exactly the ASCII characters.
  00000000 -> 01111111
  ^ when this leading bit is zero, it means a one-byte character

2-byte characters (UTF-8)
  Codepoint xxxxxyyyyyy (up to 11 bits) in binary becomes: 110xxxxx 10yyyyyy
  Encodes codepoints up to 11111111111 = 2^11 - 1 = 2047.

3-byte characters (UTF-8)
  Codepoint xxxxyyyyyyzzzzzz (up to 16 bits) becomes: 1110xxxx 10yyyyyy 10zzzzzz
  Encodes codepoints up to 2^16 - 1 = 65535.

4-byte characters (UTF-8)...
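
The 1-, 2-, and 3-byte bit layouts in these notes can be tried out with a small Python sketch. The function name utf8_encode is hypothetical (it is not from the lecture), and the sketch makes two simplifying assumptions: it stops at 16-bit codepoints, and it does not reject the surrogate range U+D800 to U+DFFF the way a real encoder would. It checks itself against Python's built-in str.encode.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Hand-rolled UTF-8 for one codepoint (hypothetical helper, 1- to 3-byte forms only)."""
    if codepoint < 0:
        raise ValueError("codepoint must be non-negative")
    if codepoint <= 0x7F:
        # Up to 7 bits: 0xxxxxxx (identical to ASCII)
        return bytes([codepoint])
    if codepoint <= 0x7FF:
        # Up to 11 bits: 110xxxxx 10yyyyyy
        return bytes([
            0b11000000 | (codepoint >> 6),
            0b10000000 | (codepoint & 0b111111),
        ])
    if codepoint <= 0xFFFF:
        # Up to 16 bits: 1110xxxx 10yyyyyy 10zzzzzz
        return bytes([
            0b11100000 | (codepoint >> 12),
            0b10000000 | ((codepoint >> 6) & 0b111111),
            0b10000000 | (codepoint & 0b111111),
        ])
    # Assumption: larger codepoints are out of scope for this sketch.
    raise ValueError("codepoints above 0xFFFF are not covered in this sketch")


for ch in ["a", "é", "日"]:
    encoded = utf8_encode(ord(ch))
    # Compare with Python's built-in encoder.
    assert encoded == ch.encode("utf-8")
    print(ch, ord(ch), " ".join(f"{b:08b}" for b in encoded))
```

For 日 (26085) the 16 bits 0110 010111 100101 are split 4/6/6 and dropped into the 1110xxxx 10yyyyyy 10zzzzzz template, giving the bytes 11100110 10010111 10100101.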
- Fall '10
- Computer Science