Character Encoding Issues

Recently, Core Security announced a vulnerability in VMware Workstation (Server and ESX are unaffected) that allows a guest operating system to break out of its virtualized environment and interact with the host operating system. They discovered it was possible to break out of the virtualized environment by using a directory traversal attack on a shared folder designed to allow data to be passed between the guest operating system(s) and the host operating system. This attack is possible despite attempts to sanitize the path string for dangerous characters, because the sanitization routine is called before the path string is normalized by a Microsoft library call that converts characters from UTF-8 to UTF-16. It is better practice to normalize a string before sanitizing it for dangerous characters, but the complexity of character encoding has caused other vulnerabilities in the past.
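
To make the ordering problem concrete, here is a minimal Python sketch of the general bug class. It is not VMware's code; percent-decoding stands in for the UTF-8 to UTF-16 conversion described in the advisory, and the function names are mine, but the lesson is the same: a check performed before normalization can be bypassed by an encoded form of exactly the thing the check looks for.

    from urllib.parse import unquote

    def sanitize(path):
        # Naive filter: reject obvious traversal sequences.
        if ".." in path:
            raise ValueError("directory traversal attempt")
        return path

    def broken_share_lookup(path):
        # BUG: sanitize first, normalize second.
        checked = sanitize(path)
        return unquote(checked)        # "%2e%2e" only becomes ".." after the check

    def fixed_share_lookup(path):
        # Correct order: normalize first, then sanitize the canonical form.
        return sanitize(unquote(path))

    print(broken_share_lookup("shared/%2e%2e/%2e%2e/etc/passwd"))  # escapes the share
    # fixed_share_lookup() raises ValueError for the same input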

Looking into this vulnerability made me curious about the variety of encoding schemes that are in common use today. I had a basic grasp of ASCII, and knew vaguely about Latin-1 and Unicode, but I didn't know the kind of in-depth details that are needed in the security world. I decided that in order to really understand the issues involved I would have to go back and learn how various encoding standards were used historically and how we got from them to our most current batch of standards.

It all started with that familiar mainstay of the computer world: ASCII, or the American Standard Code for Information Interchange, first published as an American Standard back in 1963. The standard is only 12 pages long, and the Standard Code itself described only 100 characters filling 7 bits. The first standard didn't even include lower case letters. The Standard Code was designed for information interchange between such varied devices as punch card reader/writers, tape (perforated and magnetic) machines, telegraphs and other devices. One device that ASCII was not designed for was the computer monitor. In the '60s, computer time was much too valuable to use just to display information on a screen to a user. In a statement oddly presaging the legendary quote attributed to Bill Gates about 640K of memory, Appendix A of the ASCII-1963 standard states that "A 7-bit set is the minimum size that will meet requirements." Both a 6-bit and an 8-bit set were considered and rejected, the 8-bit set because it provided far more characters than were needed in general applications at the time. A later version of ASCII would add lower case letters and would also merge the ASCII standard with ISO-646 (also ECMA-6).

Once the CRT revolution swept the computing industry and we moved away from punch cards and teletypes, people found that there was an extra, unused bit in every byte used to record a character. This led to many different custom 'extended ASCII' character sets. IBM's version, used on IBM-compatible PCs, was probably the most popular of these. Among other things, a primary goal of many of these extensions was to provide the accented characters used in Latin-script languages other than English.

In order to have a better way to represent a larger number of languages, the ISO developed the ISO-8859 standard. The ISO-8859 standard defines 15 different character sets designed to represent different language groups; each ISO-8859 character is encoded in one byte, so there are 256 possible characters per set. The most commonly used of the ISO-8859 family is ISO-8859-1 (also known as ISO-8859-Latin-1, or just Latin-1). The Latin-1 standard is designed for Western European languages. The ISO-8859-1 standard supports the following languages: Afrikaans, Albanian, Breton, Danish, Dutch, English, Estonian, Faroese, French, Finnish, Galician, German, Icelandic, Irish, Italian, Latin, Luxembourgish, Norwegian, Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swahili, Swedish, Walloon, Welsh and Basque. Other ISO-8859 standards cover other languages, such as ISO-8859-5 for Cyrillic-based languages and ISO-8859-8 for the Hebrew language.

In the same way that IBM extended ASCII, the ISO-8859-1 character set consists of the printable ASCII characters in their original positions plus additional characters filling the extra space made available by using the 8th bit. The ISO-8859-1 standard was first published in 1985. The ISO-8859 standards allow a large number of languages to be encoded in one family of standards. This allowed much greater interoperability and standardization between speakers of different languages, but there were still some things that could not be done by these standards, such as capturing whether a language is written left to right or right to left. There was also the issue of the multi-byte sequences needed by symbol-based languages. There was a desire to have one universal character encoding system. The desire for this universal character encoding system was so great that there ended up being two universal systems.
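
A quick way to see both points, that the low half of every ISO-8859 part matches printable ASCII while the high half differs from part to part, is to decode the same two bytes with a few different ISO-8859 codecs. A small Python sketch (the codec names are simply the ones Python uses for these standards):

    raw = bytes([0x41, 0xE4])            # 'A' plus one high-bit byte
    print(raw.decode("iso-8859-1"))      # 'Aä'  (Latin-1)
    print(raw.decode("iso-8859-5"))      # 'Aф'  (Cyrillic)
    print(raw.decode("iso-8859-8"))      # 'Aה'  (Hebrew)
    # 0x41 decodes to 'A' everywhere; 0xE4 means something different in each part.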

The ISO's solution to these problems was the Universal Character Set, or ISO 10646. This extremely ambitious undertaking planned to create a truly universal character set. With over a million possible characters (earlier drafts had space for over two million), the goal was to encode every historic, current or future language in one standard. This massive code space is broken up into different planes to simplify things. Almost all commonly used characters inhabit the Basic Multilingual Plane (BMP), but there are other planes for symbol-based languages and also a private code space, analogous to RFC 1918 private IP space, for imaginary languages. There is even an organization, ConScript, devoted to unofficially sharing this space to make sure every imaginary language has unique code points. With over 130,000 code points allocated to private use, there should be plenty of space for languages like J.R.R. Tolkien's Tengwar and Cirth (better known to all but his most die-hard fans as Elvish and Dwarvish), and not just the expected Klingon, but Ferengi as well. I think my favorite of the made-up languages is Dr. Seuss' extensions to the Latin character set. What's not to like about a character called ABCDEFGHIJKLMNOPQRSTUVWXYZ?
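
For the curious, the plane of any code point is easy to compute, since each plane holds 0x10000 code points; a short sketch:

    for ch in ("A", "€", "😀"):
        cp = ord(ch)
        print(f"U+{cp:06X} is in plane {cp >> 16}")
    # 'A' and '€' land in plane 0 (the BMP); '😀' lands in plane 1.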

In addition to the UCS, there was a second universal encoding scheme known as Unicode. Its goals were merely to encode all modern languages, and it was designed to use a smaller two-byte (64K character) code space. Thankfully the two organizations realized that the world didn't need more than one universal encoding system and have merged the two. Like any merger of two preexisting projects, this created some complications, but it's still probably for the best.

ISO 10646 originally defined two different encoding options for using the UCS: UCS-2 and UCS-4, which encode the UCS in two or four bytes per character. UCS-2 can only address the Basic Multilingual Plane directly; special escape sequences are needed to reach characters outside it. One big problem with these UCS encodings is that the most common ASCII characters are encoded with leading NULL bytes. This causes all kinds of issues for Unix operating systems, which traditionally use NULLs to terminate strings.
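
The NULL-byte problem is easy to demonstrate. Here is a small sketch using Python's fixed-width UTF-16 and UTF-32 codecs as stand-ins for UCS-2 and UCS-4:

    s = "cat"
    print(s.encode("utf-16-be"))   # b'\x00c\x00a\x00t'  (two bytes per character)
    print(s.encode("utf-32-be"))   # b'\x00\x00\x00c\x00\x00\x00a\x00\x00\x00t'
    # A C-style string routine would stop at the very first byte.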

Unicode was originally designed to be encoded in a simple two-byte format known as UTF-16. However, once Unicode decided to address the entire UCS code space, it needed a trick to let UTF-16 reach code points beyond what can be stored in two bytes. That trick is known as surrogate characters.

Surrogate characters are special two-byte values used in pairs to indicate a single four-byte character outside the Basic Multilingual Plane. To ensure compatibility between ISO 10646 and Unicode, the surrogate code points have been officially reserved in the UCS code space and will not be used for any other purpose. Surrogate characters solved the problem of allowing what was originally a two-byte code to address four bytes' worth of code space.
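
The arithmetic behind surrogate pairs is straightforward; here is a sketch of the calculation (the helper name is mine, not anything from the standard):

    def to_surrogate_pair(cp):
        # Subtract 0x10000, then split the result into two 10-bit halves and
        # offset them into the reserved high (D800) and low (DC00) ranges.
        v = cp - 0x10000
        return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

    high, low = to_surrogate_pair(0x1F600)           # U+1F600, an emoji
    print(f"{high:04X} {low:04X}")                   # D83D DE00
    print("😀".encode("utf-16-be").hex(" "))         # d8 3d de 00 -- the same pair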

There was another problem with Unicode. A large part of the Unix world was (and still is) based around the concept of one character per byte and binary compatibility with the old ASCII standard. One evening a gentleman named Ken Thompson (whom you may be familiar with for such modest innovations as the B programming language and the UNIX operating system) decided to scribble a better encoding standard on a New Jersey diner placemat. This new standard was called UTF-8 and was a variable length encoding which preserved the original one byte encoding of printable characters from the ASCII set. The entire UCS could be addressed by using up to 6 bytes.
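
Here is a quick sketch of what that variable-length property looks like in practice: ASCII stays one byte, and higher code points grow to two, three, or four bytes (modern UTF-8 is capped at four):

    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        print(f"U+{ord(ch):05X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
    # U+00041 -> 1 byte(s): 41
    # U+000E9 -> 2 byte(s): c3 a9
    # U+020AC -> 3 byte(s): e2 82 ac
    # U+1F600 -> 4 byte(s): f0 9f 98 80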

There is also an encoding called UTF-7, which was designed to allow Unicode characters to be used in places (such as SMTP) that are required to be 7-bit clean. It is not a Unicode or ISO standard, but it did have an RFC devoted to it. Another UTF encoding is UTF-32, which simply uses four bytes to represent each code point, trading space for simplicity.
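
Both are easy to poke at from Python, which ships codecs for them; a small sketch:

    print("£1 café".encode("utf-7"))   # b'+AKM-1 caf+AOk-': non-ASCII shifted into base64 runs
    print("A".encode("utf-32-le"))     # b'A\x00\x00\x00': four bytes for every character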

We now have quite an assortment of encoding options: ASCII, ISO-8859-1, UCS-2, UCS-4, UTF-7, UTF-8, UTF-16 and UTF-32. It gets even more interesting when you add in the World Wide Web. Some regular ASCII characters have special meaning in URLs, so a scheme was developed to encode those in a special way on top of a normal ASCII encoding. This is known as URL encoding and is specified in RFC 1738. In this encoding format a special character is represented by a percent symbol followed by two hex digits, which specify the code of the character being represented. For example, a space is represented as %20. But because one standard will never do when you can have two, there is another way to represent characters that aren't allowed in a URL. The ISO also specifies Numeric Character References as a way to represent characters from the UCS in various markup languages. An NCR is a sequence of decimal or hex digits prefixed by an ampersand and a hash mark; a literal x follows the hash if the digits are hexadecimal. It's also possible to encode UCS characters in URLs by prefixing the code point with %u.
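
Here is a small sketch of those web-layer encodings sitting on top of the character encodings, using Python's standard urllib and html modules:

    from urllib.parse import quote, unquote
    import html

    print(quote("a path/with spaces"))        # a%20path/with%20spaces
    print(unquote("%2e%2e%2f"))               # ../  (the traversal sequence from earlier)

    # Numeric character references for U+20AC, decimal and hexadecimal:
    print(html.unescape("&#8364; &#x20AC;"))  # € €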

With all these different encoding schemes it's easy to see how the seemingly simple task of interpreting text can become very complex. Areas of complexity make it hard for programmers to understand exactly what their programs are doing, and that lack of understanding leads to vulnerabilities. Until the encoding landscape gets simpler and robust libraries for translating from one encoding scheme to another are commonplace, security professionals will have to keep an eye on this area.
