CS lecture, Nov 4

Unicode / "codepoints"
- assigns values to over 100,000 characters from many languages (living & dead)
- range 0 -> 100,000 (or more); 2^17 = 131,072, so ~17 bits, i.e. 3 bytes if stored raw!! (3 bytes per character is too much???)
- 'a' - codepoint: 97 (U+0061), binary: 1100001; codepoint is the same as the ASCII value
- '日' ~ "sun" - codepoint: 26085 (U+65E5), binary: 110010111100101

Representing codepoints one way: UTF-8
- UTF-8 is backwards compatible with ASCII
- variable-byte encoding: uses 1-4 bytes depending on the value of the codepoint
- (other encodings exist, e.g. UTF-16, UTF-32)

1-byte characters (UTF-8)
- codepoints 0 -> 127 are exactly the ASCII characters
- 00000000 -> 01111111
- when the leading bit is zero, it means "one-byte character"

2-byte characters (UTF-8)
- codepoint xxxxxyyyyyy in binary (up to 11 bits) becomes: 110xxxxx 10yyyyyy
- encodes codepoints up to 11111111111 = 2^11 - 1 = 2047

3-byte characters (UTF-8)
- codepoint xxxxyyyyyyzzzzzz (up to 16 bits) becomes: 1110xxxx 10yyyyyy 10zzzzzz
- codepoints up to 2^16 - 1 = 65535

4-byte characters (UTF-8)...
Fall '10