Let’s Talk About Text
A string is a data type that stores text. Strings are more complicated than you might think, which is the reason for this episode.
“In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding.” ~Wikipedia
Characters are minimal units of text that contain semantic value.
Character Sets are collections of characters.
Coded Character Sets are character sets where each character has a corresponding unique number.
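Code Point is any value in the code space of a coded character set. Unicode code points run from U+0000 to U+10FFFF, which fits in 21 bits. Therefore in a 32-bit integer data type the upper 11 bits are 0.
Code Unit is the fixed-size bit sequence an encoding works in (8 bits for UTF-8, 16 for UTF-16, 32 for UTF-32). One or more code units encode each character.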
Character Repertoire is the full set of abstract characters that a system supports. Some are open (you can extend them), some are not (unless you want to create a new standard).
Everything breaks down to the ones and zeros eventually. To review, a byte is 8 bits, where a bit is a unit of either 1 or 0 (on or off). In x86 terminology a word is two bytes or 16 bits and a double word is four bytes or 32 bits, although the word size varies by architecture. Just to be interesting, a nibble is half a byte or 4 bits.
There are several standards for encoding characters. ASCII is a well-known encoding that uses 7 bits. When an ASCII character is stored in an 8-bit byte, the high bit is 0. Extended Binary Coded Decimal Interchange Code (EBCDIC) is an older 8-bit encoding found in a lot of old systems and mainframe code.
Unicode encodings use various code unit sizes, from UTF-8 (8 bits) to UTF-32 (32 bits). In UTF-8 each code point maps to a sequence of 1 to 4 code units, and it is backward compatible with ASCII. In UTF-16 each code point is encoded in one or two code units. The two-unit form is known as a surrogate pair. With UTF-32 every code point gets a single code unit.
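To make the code point versus code unit distinction concrete, here is a minimal Python sketch that counts how many code units a few sample characters need in each encoding. The helper name `code_units` and the sample characters are our own choices for illustration.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character needs in each Unicode encoding."""
    # The "-be" codec variants are used so that no byte order mark is added.
    return {
        "utf-8": len(ch.encode("utf-8")),             # 8-bit code units
        "utf-16": len(ch.encode("utf-16-be")) // 2,   # 16-bit code units
        "utf-32": len(ch.encode("utf-32-be")) // 4,   # 32-bit code units
    }

# A, e with acute accent, euro sign, grinning face emoji
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", code_units(ch))

# The largest Unicode code point fits in 21 bits.
max_bits = (0x10FFFF).bit_length()
print(max_bits)  # 21

# A code point above U+FFFF becomes a surrogate pair in UTF-16.
surrogates = "\U0001F600".encode("utf-16-be").hex()
print(surrogates)  # d83dde00
```

Only the emoji needs two UTF-16 code units, and every character needs exactly one UTF-32 code unit.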
“Endianness refers to the order of the bytes comprising a digital word in computer memory.” – Wikipedia
The etymology of the term comes from a bit in Gulliver’s Travels, in which a civil war erupts over whether the big or small end of a soft-boiled egg is the proper end to crack open.
In big-endian the most significant byte (the one with the most significant bit) is stored first. Little-endian is the opposite with the most significant byte stored last.
Intel x86 processors use little-endian. Internet protocols tend to use big-endian, which is why it is also called network byte order. There is also mixed-endian, where the ordering of 16-bit words may differ from the ordering of 32-bit words.
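Byte order is easy to see with Python's standard `struct` module. This is a small sketch; the value `0x12345678` is just a sample chosen so each byte is recognizable.

```python
import struct
import sys

value = 0x12345678

big = struct.pack(">I", value)     # big-endian, i.e. network byte order
little = struct.pack("<I", value)  # little-endian, as on x86

print(big.hex())      # 12345678
print(little.hex())   # 78563412
print(sys.byteorder)  # byte order of the machine running this script

# Text is affected too: the same two bytes are a different character
# depending on which byte order the reader assumes.
data = "A".encode("utf-16-be")        # bytes 00 41
misread = data.decode("utf-16-le")    # read as code unit 0x4100
print(f"U+{ord(misread):04X}")        # U+4100
```

This is why UTF-16 and UTF-32 data often starts with a byte order mark, while UTF-8, with its single-byte code units, has no byte order to get wrong.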
A null-terminated string marks its end with a NUL (character value 0). These are also known as C strings.
A length-prefixed string starts with byte(s) indicating the length. Also known as Pascal strings or P-strings.
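The two layouts can be sketched over raw bytes in Python. The function names are hypothetical, and the sketch assumes UTF-8 text and a single length byte (the classic Pascal layout, which caps a string at 255 bytes).

```python
def to_c_string(text: str) -> bytes:
    """Encode as UTF-8 and append the terminating NUL byte."""
    data = text.encode("utf-8")
    if b"\x00" in data:
        raise ValueError("a C string cannot contain an embedded NUL")
    return data + b"\x00"

def from_c_string(buf: bytes) -> str:
    """Read up to, but not including, the first NUL byte."""
    end = buf.index(b"\x00")
    return buf[:end].decode("utf-8")

def to_pascal_string(text: str) -> bytes:
    """Encode as UTF-8 and prefix with a one-byte length."""
    data = text.encode("utf-8")
    if len(data) > 255:
        raise ValueError("too long for a one-byte length prefix")
    return bytes([len(data)]) + data

def from_pascal_string(buf: bytes) -> str:
    """Read the length byte, then exactly that many bytes."""
    length = buf[0]
    return buf[1:1 + length].decode("utf-8")

print(to_c_string("hi"))       # b'hi\x00'
print(to_pascal_string("hi"))  # b'\x02hi'
```

Finding the length of a C string means scanning for the NUL, while a Pascal string knows its length up front but cannot grow past what the prefix can count. Note that the prefix counts bytes, not characters.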
23:52 Other Nastiness
Casing is critical in English. However, some letters only exist in one case. The German “ß” (“Eszett”), for instance, is traditionally lower-case only. It is used after long vowels and diphthongs, while “ss” is written after short vowels.
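The Eszett shows why changing case is not a simple one-to-one character swap. A quick Python sketch, using the German word "straße" as our sample:

```python
word = "stra\u00dfe"  # "straße"

upper = word.upper()
print(upper)                  # STRASSE
print(len(word), len(upper))  # 6 7

# Lower-casing both sides is not enough for a caseless comparison.
lower_match = word.lower() == "STRASSE".lower()
print(lower_match)  # False

# casefold() is meant for caseless matching and maps the Eszett to "ss".
fold_match = word.casefold() == "STRASSE".casefold()
print(fold_match)  # True
```

Upper-casing changed the length of the string, which matters to any code that assumes a buffer stays the same size after a case conversion.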
Diacritics are glyphs added to a letter that change the sound of the letter. The two dots on the “ï” in “naïve” in English are an example. Diacritics also indicate vowels in some languages (Arabic, Hebrew) and tonality in others (Pinyin).
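In Unicode a letter with a diacritic can often be stored two ways: as one precomposed code point, or as a base letter followed by a combining mark. Here is a sketch using Python's standard `unicodedata` module; the helper `strip_marks` is our own name.

```python
import unicodedata

composed = "na\u00efve"     # i with diaeresis as a single code point
decomposed = "nai\u0308ve"  # plain i followed by a combining diaeresis

# They render the same but do not compare equal.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 5 6

# Normalizing both to the same form (NFC here) makes them comparable.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)  # True

def strip_marks(text: str) -> str:
    """Decompose, then drop the combining marks."""
    parts = unicodedata.normalize("NFD", text)
    return "".join(c for c in parts if not unicodedata.combining(c))

print(strip_marks(composed))  # naive
```

Stripping marks is lossy and changes meaning in many languages, so treat it as something for search keys rather than for stored text.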
Ligatures are two or more characters joined. The “æ” in English is one of these. A lot of ligatures eventually become characters. Ampersand is one of these. It was originally a ligature of the two letters of the Latin word “et”, but became a single English character.
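Unicode reflects this split between "still a ligature" and "now a letter". As a sketch with Python's `unicodedata` module: the typographic "fi" ligature (U+FB01) folds back to two letters under compatibility normalization, while "æ" (U+00E6) is treated as a letter in its own right and is left alone.

```python
import unicodedata

fi = "\ufb01"
print(unicodedata.name(fi))  # LATIN SMALL LIGATURE FI
fi_folded = unicodedata.normalize("NFKC", fi)
print(fi_folded)             # fi

ae = "\u00e6"
print(unicodedata.name(ae))  # LATIN SMALL LETTER AE
ae_folded = unicodedata.normalize("NFKC", ae)
print(ae_folded == ae)       # True

# A plain substring search misses text that was typeset with the ligature.
typeset = "\ufb01nd"
found_raw = "find" in typeset
found_folded = "find" in unicodedata.normalize("NFKC", typeset)
print(found_raw, found_folded)  # False True
```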
Operations include concatenation, splitting, casing, sorting, serialization and deserialization, searching, length checking, and packed bits used to compress data. Each of these has its own issues. Higher-level languages handle most of them in the compiler, runtime, or standard library, but at lower levels they can be devastating if not understood.
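Two of these operations, sorting and length checking, are easy to get wrong even in a high-level language. A short Python sketch with made-up sample data:

```python
words = ["b", "a", "B"]

# The default sort compares code points, so every upper-case ASCII
# letter sorts before every lower-case one.
by_code_point = sorted(words)
print(by_code_point)  # ['B', 'a', 'b']

# Sorting on a case-folded key is closer to what people expect.
by_casefold = sorted(words, key=str.casefold)
print(by_casefold)    # ['a', 'b', 'B']

# "Length" depends on what you count.
text = "na\u00efve \U0001F600"  # naive with a diaeresis, a space, an emoji
code_points = len(text)
utf8_bytes = len(text.encode("utf-8"))
utf16_units = len(text.encode("utf-16-le")) // 2
print(code_points, utf8_bytes, utf16_units)  # 7 11 8
```

A field that is "7 characters" to Python needs 11 bytes of UTF-8 storage, which is the kind of mismatch that overflows fixed-size buffers and database columns. Proper language-aware sorting needs locale collation rules, which this sketch does not attempt.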
44:30 Security Issues
Security becomes an issue whenever strings are stored or passed between systems. Unsanitized input can lead to SQL injection, and encoding errors can open security holes. Character similarity (look-alike characters from different scripts) can even be used in social engineering attacks.
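Both problems can be sketched in a few lines of Python. The table, the hostile input, and the spoofed name are invented for illustration, and the `scripts` helper is a crude heuristic of our own based on Unicode character names, not a real confusables check.

```python
import sqlite3
import unicodedata

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])

hostile = "x' OR '1'='1"

# Building SQL by string concatenation lets the input rewrite the query.
unsafe_sql = f"SELECT name FROM users WHERE name = '{hostile}'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()
print(len(unsafe_rows))  # 2, every row matched

# A parameterized query treats the input purely as data.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (hostile,)
).fetchall()
print(safe_rows)  # []

# Look-alike characters: the second letter here is CYRILLIC SMALL LETTER A.
spoof = "p\u0430ypal"
print(spoof == "paypal")  # False

def scripts(text: str) -> set:
    """Hypothetical heuristic: first word of each character's Unicode name."""
    return {unicodedata.name(c).split()[0] for c in text}

mixed = scripts(spoof)
print(sorted(mixed))  # ['CYRILLIC', 'LATIN']
```

A name that mixes scripts is not proof of an attack, but it is a reasonable signal to flag for a closer look.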
IoTease: In The News
NFL to Place Computer Chips into Footballs
In an effort to be even more precise and completely remove refs from the game the NFL is replacing them with data chips and IT departments to interpret the data. They are looking at trying out some IoT data collection in conjunction with referees at games to add more accuracy to ball placement and calls.
Tricks of the Trade
Not-invented-here syndrome is the attitude that code that wasn’t written in house is not good, and it is a bad attitude to have. If a library does what you want, it is better to use it and wrap that complexity than to rewrite it yourself.
We experienced some technical difficulties with Will’s microphone during this episode.