Lecture 14.1 (Tuesday, April 26)

Quiz 5 on Friday
A3 due today
P5 next week!

Demos

Demos will be on Thursday, May 5 from 11 AM to 1:30 PM.

Prepare a 10-minute demo of your game.

Needs to involve live play w/ a team member.

Talk about an interesting (or frustrating) design challenge.

Quiz 5

A smattering of things
Big emphasis on security
Return of the CSS Selectors
Something about today's material

Architecture

Reviewing the quiz scores, it seems there's a bit of confusion about the way that our technologies interact across client/server boundaries.

Draw a picture.

Encodings and Data Representation

We have to represent data somehow. We have abstract concepts, data types:

strings
trees
numbers
objects
collections

In order for the computer to store and transmit these, they must be represented somehow.

We call this encoding.

Strings: Characters and Encodings

In modern computing systems, we model strings as sequences of characters, typically defined by Unicode code points. A code point is a numeric identifier of a character, or character modifier, in the Unicode table.

Code points are just numbers. We have to store them somehow.

The simplest encoding is UCS-4. This just stores the string as an array of 32-bit integers (4 bytes each, hence UCS-4), one per code point. It is big, but it is simple. It is easy to count the code points: look at the length of the array.
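As a rough illustration, Python's utf-32-le codec produces the same fixed-width layout (four bytes per code point, little-endian, no byte order mark). The sample string below is an arbitrary choice, not something from the assignments.

```python
import struct

text = "h\u00e9llo"  # arbitrary sample: 'h', 'e with acute', 'l', 'l', 'o'

# Encode as fixed-width 32-bit units (little-endian, no byte order mark).
raw = text.encode("utf-32-le")

# Unpack the bytes into a tuple of 32-bit integers, one per code point.
code_points = struct.unpack("<%dI" % (len(raw) // 4), raw)

print(len(raw))      # 20: four bytes for each of the five code points
print(code_points)   # (104, 233, 108, 108, 111)
```

Counting code points is then just a matter of dividing the byte length by four.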

This is not exactly a count of the number of characters, because code points can combine to form a single character. For example, ö (‘o’ with an umlaut) can be built from an ‘o’ followed by a combining umlaut code point, instead of the single precomposed ‘o with umlaut’ code point. Both methods are fine.
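The two ways of building an accented letter can be compared directly in Python 3. This sketch uses ö (U+00F6) and the combining diaeresis (U+0308) as the example pair, and the standard unicodedata module to convert between the two forms.

```python
import unicodedata

precomposed = "\u00f6"   # single code point: o with diaeresis (umlaut)
combining = "o\u0308"    # two code points: 'o' + combining diaeresis

# Both render as the same character, but the code point counts differ.
print(len(precomposed), len(combining))   # 1 2
print(precomposed == combining)           # False: different code point sequences

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", combining)    # compose where possible
nfd = unicodedata.normalize("NFD", precomposed)  # decompose fully
print(nfc == precomposed, nfd == combining)      # True True
```

Normalizing both sides to the same form before comparing strings avoids surprises when the two spellings meet.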

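Python 2 (usually) uses UCS-4 internally. If you have a unicode object, it is stored in UCS-4 in the program's memory (unless your Python is compiled to use UCS-2). Python 3.3 and later instead pick 1, 2, or 4 bytes per code point for each string, depending on the largest code point it contains, but a string still behaves like an array of code points.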

UCS-2 is ancient history. Don't use it. 2 bytes (16 bits) is no longer enough to represent all Unicode code points, because there are around a million. Sadly, a lot of programs think they're using it when they are really using…

…UTF-16. UTF-16 is a 16-bit variable-width encoding. Unlike UCS-4, where every code point takes 4 bytes, the most commonly used code points take 16 bits (2 bytes) in UTF-16. The rest take 4 bytes, stored as a pair of 16-bit units called a surrogate pair.
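A small sketch of the variable width, using Python's utf-16-le codec (little-endian, no byte order mark). The three sample characters are arbitrary picks: two that fit in 16 bits and one that does not.

```python
samples = {
    "A": "A",                         # U+0041
    "euro sign": "\u20ac",            # U+20AC, still fits in 16 bits
    "musical G clef": "\U0001d11e",   # U+1D11E, does not fit in 16 bits
}

# Number of bytes each single code point occupies in UTF-16.
utf16_sizes = {name: len(ch.encode("utf-16-le")) for name, ch in samples.items()}

for name, size in utf16_sizes.items():
    print(name, size)   # 2, 2, and 4 bytes respectively
```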

Many systems use UTF-16 internally. Java, JavaScript, the Qt toolkit, and the OS X APIs all use UTF-16. I believe Unicode-aware Windows APIs also use UTF-16. When you save a text file as Unicode in Notepad, it is saved in UTF-16.

Note

Many of these APIs were developed back when Unicode could be represented in UCS-2, and this still shows up sometimes. For example, Java's String.length() method returns the number of chars in a string, which are 16-bit UTF-16 code units. It is not the number of code points, because it does not account for how two code units (a surrogate pair) combine to represent a single code point. It is also not the same as the number of characters that would be rendered.
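The same distinction can be reproduced in Python 3, where len() counts code points. The helper name utf16_code_units is made up for this sketch; it computes the count that a UTF-16-based length method would report.

```python
def utf16_code_units(s: str) -> int:
    """Count 16-bit UTF-16 code units (hypothetical helper for illustration)."""
    return len(s.encode("utf-16-le")) // 2

text = "a\U0001f600"   # 'a' followed by one emoji outside the 16-bit range

print(len(text))               # 2 code points
print(utf16_code_units(text))  # 3 code units: the emoji needs a surrogate pair
```

Neither number is guaranteed to match the count of rendered characters once combining code points are involved.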

For data storage and network transfer, as well as in-memory representation in many Linux or Unix C programs, we generally use UTF-8. UTF-8 works on 8-bit units, and can represent US English with basic punctuation and numbers with one byte per character. For such characters, it is identical to the ASCII character set, which defines 7-bit representations for characters.

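Outside the ASCII character set, UTF-8 requires multiple bytes per code point: anywhere from 2 to 4. (The original design allowed sequences of up to 6 bytes, but the current standard stops at 4, which is enough to cover every Unicode code point.) The first byte indicates how many bytes are used, and every continuation byte has a distinctive bit pattern, so you can tell whether you're in the middle of a code point.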
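A minimal sketch of reading the sequence length from the first byte, restricted to the sequences of up to four bytes that Python's utf-8 codec produces. The function name utf8_sequence_length is an assumption for this example, and it does no validation beyond the lead byte.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Length implied by a UTF-8 lead byte; 0 means a continuation byte."""
    if first_byte < 0x80:
        return 1   # 0xxxxxxx: plain ASCII
    if first_byte < 0xC0:
        return 0   # 10xxxxxx: middle of a code point
    if first_byte < 0xE0:
        return 2   # 110xxxxx
    if first_byte < 0xF0:
        return 3   # 1110xxxx
    if first_byte < 0xF8:
        return 4   # 11110xxx
    raise ValueError("not a lead byte used by current UTF-8")

for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    encoded = ch.encode("utf-8")
    # The length read from the first byte matches the actual encoded length.
    print(hex(encoded[0]), utf8_sequence_length(encoded[0]), len(encoded))
```

Because continuation bytes are recognizable on their own, a reader dropped into the middle of a stream can skip forward to the next lead byte.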

So, we have a ‘string’, a sequence of characters. But it isn't so simple. We need to encode it, typically in UTF-8.

Avoid non-UTF-8 encodings. ASCII is fine, as it is a subset of UTF-8: any ASCII file is also a valid UTF-8 file. Latin-1 is the typical historic encoding for the U.S. and western Europe. Windows has a similar encoding CP12??.

Also: make sure you always know what the encoding is.

For historical compatibility, though, most Windows programs default to the Latin-1 code page! This is what caused a problem back on some early assignments, since I gave you UTF-8 data. Opening the file as UTF-8 (Python knows how to translate) fixes it.
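The failure and the fix can be reproduced with a throwaway file. This is a sketch only: the file name sample.txt and the sample text are invented here, and latin-1 stands in for whatever non-UTF-8 default a system might apply.

```python
import os
import tempfile

original = "caf\u00e9"   # 'café', with one non-ASCII character

# Write the text as UTF-8 into a temporary file (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(original)

# Reading with the wrong encoding silently produces garbage.
with open(path, encoding="latin-1") as f:
    wrong = f.read()

# Naming the encoding explicitly gives back the original text.
with open(path, encoding="utf-8") as f:
    right = f.read()

print(repr(wrong))   # 'cafÃ©': each UTF-8 byte was read as its own character
print(repr(right))   # 'café'
```

Passing encoding= explicitly on every open() call removes the dependence on the platform default.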

Note

For more reading, see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

Python History

Historically, Python had strings (str). They were sequences of bytes, stored in whatever encoding the system or the program happened to use.

Later, they added unicode objects. The idea was that a unicode object represents the concept of a string in a standardized fashion, while a str stores a string (or any sequence of bytes, really) in a particular encoding.

In Python 3, they switched to Unicode-by-default. Strings are now all Unicode, and if you want a sequence of bytes to store some binary data or an encoded string, you use a bytes object.

Strings have an encode method that can produce bytes, and bytes have a decode method to get a Unicode string.
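A short Python 3 round trip, with an arbitrary sample word, shows both directions and what happens when the wrong codec is used for decoding.

```python
text = "na\u00efve"            # 'naïve': 5 code points

data = text.encode("utf-8")    # str -> bytes
print(type(data).__name__)     # bytes
print(len(text), len(data))    # 5 6: the accented letter takes two bytes

restored = data.decode("utf-8")   # bytes -> str
print(restored == text)           # True

# Decoding with a codec that cannot represent the bytes raises an error.
try:
    data.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True
print(ascii_failed)               # True
```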

Encoding Other Things

So we have some dictionaries and lists and strings and things.

JSON is a way to encode these data structures for transmission.

HTML is a way to encode trees of content nodes for transmission.

We have a couple of things:

Abstract model: An abstract idea of the data; in JSON, this is a simple model consisting of strings, numbers, booleans, arrays/lists, objects/dictionaries, and null. For HTML, this is the Document Object Model, or DOM.
Concrete syntax: A particular way of writing out the data, represented in the model, as a sequence of characters or bytes. For JSON, this is the JSON syntax, which is a restricted version of the JavaScript syntax for representing data literals. For HTML, this is our tags and things. There is also an XML-based syntax for HTML, which is stricter about the rules and therefore easier to parse.
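The split between abstract model and concrete syntax can be sketched with Python's standard json module. The dictionary contents are invented for illustration; the point is the three stages: Python objects (the model), a JSON string (the syntax), and UTF-8 bytes (what actually goes over the network).

```python
import json

# Abstract model: strings, numbers, booleans, lists, dictionaries, and null.
model = {"name": "Zo\u00eb", "scores": [1, 2.5], "active": True, "nickname": None}

# Concrete syntax: a sequence of characters in JSON notation.
syntax = json.dumps(model, ensure_ascii=False)
print(syntax)   # True becomes true and None becomes null

# Encoded for transmission: a sequence of bytes.
wire = syntax.encode("utf-8")

# The receiver reverses both steps to recover an equal model.
restored = json.loads(wire.decode("utf-8"))
print(restored == model)   # True
```

HTML follows the same pattern, with the DOM as the model and tags as the syntax.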