Lecture 14.1 (Tuesday, April 26)
- Quiz 5 on Friday
- A3 due today
- P5 next week!
Demos will be on Thursday, May 5 from 11AM to 1:30 PM.
Prepare a 10 minute demo of your game.
Needs to involve live play w/ a team member.
Talk about an interesting (or frustrating) design challenge.
- A smattering of things
- Big emphasis on security
- Return of the CSS Selectors
- Something about today's material
Reviewing the quiz scores, it seems there's a bit of confusion about the way that our technologies interact across client/server boundaries.
Draw a picture.
Encodings and Data Representation
We have to represent data somehow. We have abstract concepts, data types:
In order for the computer to store and transmit these, they must be represented somehow.
We call this encoding.
Strings: Characters and Encodings
In modern computing systems, we model strings as sequences of characters, typically defined by Unicode code points. A code point is a numeric identifier of a character, or character modifier, in the Unicode table.
Code points are just numbers. We have to store them somehow.
The simplest encoding is UCS-4. This just stores the string as an array of 32-bit integers (4 byte, hence UCS-4), each storing a code point. It is big, but it is simple. It is easy to count how many code points: look at the length of the array.
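To make the "array of 32-bit integers" idea concrete, here is a minimal sketch in Python 3. It assumes Python's `utf-32-le` codec as a stand-in for UCS-4, since it uses the same fixed 4-bytes-per-code-point layout; the sample string is made up for illustration.

```python
import struct

text = "héllo"  # hypothetical sample string

# Encode with a fixed 4 bytes per code point (little-endian, no byte-order mark).
raw = text.encode("utf-32-le")

# Reinterpret the bytes as an array of unsigned 32-bit integers.
code_points = struct.unpack("<%dI" % (len(raw) // 4), raw)

# Counting code points is just taking the length of that array.
print(len(code_points), code_points)
```

Each integer in `code_points` is exactly what `ord()` returns for the corresponding position in the string.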
This is not exactly a count of the number of characters, because code points can compose to form a character, such as building ö (‘o’ with an umlaut) out of an ‘o’ followed by a combining ‘add umlaut’ code point, instead of using the single ‘o with umlaut’ code point. Both representations are valid.
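The difference between code points and rendered characters can be seen in a short Python 3 sketch. It uses the standard `unicodedata` module; the variable names are just for illustration.

```python
import unicodedata

precomposed = "\u00f6"   # 'ö' as a single code point
combined = "o\u0308"     # 'o' followed by COMBINING DIAERESIS

# Both render as one character, but they contain different numbers of code points.
print(len(precomposed), len(combined))

# The two strings do not compare equal until they are normalized to the same form.
same_after_nfc = unicodedata.normalize("NFC", combined) == precomposed
same_after_nfd = unicodedata.normalize("NFD", precomposed) == combined
print(precomposed == combined, same_after_nfc, same_after_nfd)
```

If you compare or count user-visible text, normalizing both sides to one form first (NFC is a common choice) avoids surprises.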
Python (usually) uses UCS-4 internally. If you have a unicode object, it is stored in UCS-4 in the program's memory (unless your Python is compiled to use UCS-2).
UCS-2 is ancient history. Don't use it. 2 bytes (16 bits) is no longer enough to represent all Unicode code points, because there are around a million. Sadly, a lot of programs think they're using it when they are really using…
…UTF-16. UTF-16 is a 16-bit variable width encoding. Unlike UCS-4, where every code point takes 4 bytes, most code points take 16 bits (2 bytes) in UTF-16. But some of them take 4 bytes.
Many APIs were developed back when Unicode could be represented in UCS-2, and this still shows up sometimes. For example, Java's String.length() method returns the number of chars in a string, which are 16-bit UTF-16 code units. It is not the number of code points, because it is not aware of how two code units (a surrogate pair) combine to represent a single code point outside the 16-bit range. It is also not the same as the number of characters that would be rendered.
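A rough way to see the gap between code points and UTF-16 units from Python 3 is to encode a string as UTF-16 and count 2-byte units. The helper name `utf16_units` is made up for this sketch.

```python
def utf16_units(s):
    """Count 16-bit UTF-16 code units (the quantity a UTF-16-based length reports)."""
    # utf-16-le writes no byte-order mark, so every 2 bytes is one code unit.
    return len(s.encode("utf-16-le")) // 2

s = "a\U0001F600"  # 'a' plus an emoji that lies outside the 16-bit range

print(len(s))          # code points, as Python 3 counts them
print(utf16_units(s))  # UTF-16 code units: the emoji needs a surrogate pair
```

The emoji is one code point but two UTF-16 code units, so the two counts differ by one.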
For data storage and network transfer, as well as in-memory representation in many Linux or Unix C programs, we generally use UTF-8. UTF-8 works on 8-bit units, and can represent US English with basic punctuation and numbers with one byte per character. For such characters, it is identical to the ASCII character set, which defines 7-bit representations for characters.
Outside the ASCII character set, UTF-8 requires multiple bytes per code point: anywhere from 2 to 4 (the original design allowed up to 6, but the current standard stops at 4). The first byte indicates how many bytes are used, and every continuation byte starts with the bits 10, which lets you figure out whether you're in the middle of a code point.
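As a sketch of how the first byte announces the length, the Python 3 snippet below inspects the leading bits of each encoded character. The function names `utf8_sequence_length` and `is_continuation` are hypothetical helpers, not part of any library.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence uses, judging only by its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: plain ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx
        return 4
    raise ValueError("continuation byte or invalid leading byte")

def is_continuation(byte):
    """Continuation bytes always look like 10xxxxxx."""
    return byte & 0xC0 == 0x80

for ch in ["A", "é", "€", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print(repr(ch), len(encoded), utf8_sequence_length(encoded[0]))
```

Because continuation bytes are recognizable on their own, a reader dropped into the middle of a stream can skip forward to the next leading byte and resynchronize.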
So, we have a ‘string’, a sequence of characters. But it isn't so simple. We need to encode it, typically in UTF-8.
Avoid non-UTF-8 encodings. ASCII is fine, since any ASCII text is also valid UTF-8. LATIN-1 is the typical historic encoding for the U.S. and western Europe. Windows has a similar encoding, CP12??.
Also: make sure you always know what the encoding is.
For historical compatibility, though, most Windows programs default to the Latin-1 code page! This is what caused a problem back on some early assignments, since I gave you UTF-8 data. Opening the file as UTF-8 (Python knows how to translate) fixes it.
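The effect of guessing the wrong encoding can be reproduced in a few lines of Python 3. This is a minimal sketch: the file name and sample text are made up, and the file is written to a temporary directory.

```python
import os
import tempfile

text = "café"
path = os.path.join(tempfile.mkdtemp(), "sample.txt")  # hypothetical file

# Write the data as UTF-8, the way the course data files were encoded.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading with the wrong encoding silently produces garbage instead of an error.
with open(path, encoding="latin-1") as f:
    wrong = f.read()

# Naming the encoding explicitly recovers the original text.
with open(path, encoding="utf-8") as f:
    right = f.read()

print(repr(wrong), repr(right))
```

Passing `encoding=` explicitly to `open` means the result does not depend on the platform's default.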
For more reading, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Historically, Python had strings. They were stored in whatever encoding the system happened to use.
Later, they added Unicode objects, with the idea being that a Unicode object represents the concept of a string, in a standardized fashion, and a str stores a string (or any sequence of bytes, really) in a particular encoding.
In Python 3, they switched to Unicode-by-default. Strings are now all Unicode, and if you want a sequence of bytes to store some binary data or an encoded string, you use a bytes object.
Strings have an encode method that can produce bytes, and bytes have a decode method to get a Unicode string.
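A minimal round trip in Python 3 looks like the sketch below; the sample word is arbitrary.

```python
s = "naïve"                 # a Unicode string (str in Python 3)

b = s.encode("utf-8")       # str -> bytes, in a specific encoding
print(type(b).__name__, len(s), len(b))  # 'ï' takes two bytes in UTF-8

back = b.decode("utf-8")    # bytes -> str, using the same encoding
print(back == s)

# Decoding with an encoding that cannot represent the data fails loudly.
try:
    b.decode("ascii")
    decoded_as_ascii = True
except UnicodeDecodeError:
    decoded_as_ascii = False
print(decoded_as_ascii)
```

The length difference is a reminder that `len` on a string counts code points, while `len` on bytes counts bytes.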
Encoding Other Things
So we have some dictionaries and lists and strings and things.
JSON is a way to encode these data structures for transmission.
HTML is a way to encode trees of content nodes for transmission.
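For the JSON case, a short Python 3 sketch using the standard `json` module shows the data structure, its text form, and the bytes that would actually be transmitted. The record contents are invented for illustration.

```python
import json

# In-memory data: dictionaries, lists, strings, numbers, booleans, None.
record = {"name": "Zoë", "scores": [1, 2.5], "active": True, "partner": None}

text = json.dumps(record)         # data structure -> JSON text
wire = text.encode("utf-8")       # JSON text -> bytes for transmission

restored = json.loads(wire.decode("utf-8"))  # and back again on the other side

print(text)
print(restored == record)
```

There are two encoding steps here: the data structure becomes JSON text, and the text becomes bytes.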
We have a couple of things:
- Abstract model
- An abstract idea of the data; in JSON, this is a simple model consisting of strings, numbers, booleans, arrays/lists, objects/dictionaries, and null. For HTML, this is the Document Object Model, or DOM.
- Concrete syntax