I use this blog as a soap box to preach (ahem... to talk :-) about subjects that interest me.

Tuesday, September 10, 2013

Converting UTF-8 to Unicode

To be stored digitally, each character of a piece of text is encoded into a particular bit pattern.

For example, according to the ASCII (American Standard Code for Information Interchange) standard, which has been around for half a century, the letter 'A' is encoded with the 7 bits 1000001 or, in hexadecimal notation, 41, which I prefer to write with the Java/C syntax: 0x41.  With 7 bits, only 128 patterns (i.e., 2^7) can be encoded, just enough for plain Latin characters, numbers, and a few special symbols.

Over the past couple of decades, a different type of encoding called UTF-8, based on a variable number of bytes, has established itself as the most common encoding used in HTML pages.

Often, UTF-8 is confused with Unicode, but while UTF-8 is a way of encoding characters, Unicode is a character set.  That is, a list of characters.  This means that the same Unicode character can be encoded in UTF-8, UTF-16, ISO-8859, and other formats.  You will find that most people on the Internet refer to Unicode as an encoding.  Now you know that they are not completely correct, although, to be fair, the distinction is usually irrelevant.
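To see the difference concretely, here is a quick sketch (plain Java, nothing beyond the standard library) that encodes the same Unicode character, 'é' (U+00E9), with three different encodings.  The character set entry is the same; the byte patterns are not:

```java
import java.nio.charset.StandardCharsets;

public class SameCharacterThreeEncodings {
    public static void main(String[] args) {
        String e = "\u00E9";  // 'é', the Unicode character U+00E9
        // One character, three different byte sequences:
        byte[] utf8  = e.getBytes(StandardCharsets.UTF_8);      // 0xC3 0xA9
        byte[] utf16 = e.getBytes(StandardCharsets.UTF_16BE);   // 0x00 0xE9
        byte[] latin = e.getBytes(StandardCharsets.ISO_8859_1); // 0xE9
        System.out.printf("UTF-8: %d bytes, UTF-16BE: %d bytes, ISO-8859-1: %d byte%n",
                          utf8.length, utf16.length, latin.length);
    }
}
```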

The Wikipedia pages on Unicode and UTF-8 are very informative.  Therefore, I don't want to repeat them here.  But I would like to show you a couple of examples taken from the UTF-8 encoding table and Unicode characters.

The character 'A', which was encoded as 0x41 in ASCII, is character U+0041 in Unicode and is encoded as 0x41 in UTF-8.  "Wait a minute", you might say, "what's the point of all the fuss if the number 0x41 stays the same everywhere?"

The answer is simple: the ASCII and UTF-8 encodings of all Unicode characters from U+0000 to U+007F are identical.  This makes sense for backward compatibility.  But while ASCII only encodes 128 characters, UTF-8 can encode the many thousands of characters of Unicode.  To see the differences, you have to go beyond U+007F.

For example, U+00A2, the cent sign '¢', which doesn't exist in ASCII, is encoded as 0xC2A2 in UTF-8.  Note that U+C2A2 is a valid Unicode character, but it has nothing to do with the UTF-8 code 0xC2A2.  Don't get confused!  U+C2A2 is the character '슢' (a syllable of the Korean alphabet that, according to Google Translate, is called Syup...).  This is the first hint at why we might need to convert UTF-8 to Unicode even though Unicode is not an encoding!
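You can check both claims with a few lines of Java (getBytes with an explicit charset is standard library behaviour):

```java
import java.nio.charset.StandardCharsets;

public class CentVersusSyup {
    public static void main(String[] args) {
        byte[] cent = "\u00A2".getBytes(StandardCharsets.UTF_8);  // the cent sign U+00A2
        byte[] syup = "\uC2A2".getBytes(StandardCharsets.UTF_8);  // the Korean syllable U+C2A2
        System.out.printf("U+00A2 -> %02X %02X%n",
                          cent[0] & 0xFF, cent[1] & 0xFF);                    // C2 A2
        System.out.printf("U+C2A2 -> %02X %02X %02X%n",
                          syup[0] & 0xFF, syup[1] & 0xFF, syup[2] & 0xFF);    // EC 8A A2
    }
}
```

The cent sign needs two UTF-8 bytes, while the Korean syllable that happens to share the number 0xC2A2 needs three completely different ones.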

The problem arises when you want to work in Java with text that you have 'grabbed' from a web page: the web page is encoded in UTF-8, while Java strings (i.e., objects of type java.lang.String) consist of Unicode characters.  If you grab a piece of text from the Web, store it into a Java string, and display it, only the "ASCII-like" characters are displayed correctly.

For example, the Wikipedia page about North Africa contains "Mizrāḥîm", but if you display it without any conversion, you get "MizrƒÅ·∏•√Æm".

In the rest of this article, I will explain how you can correctly store text grabbed from the Web into a Java string.  There are probably better ways to do it, but my way works.  If you find a better algorithm and would like to share it, I would welcome it.

To help you understand my code, before I show it to you, I would like you to observe that when you match Unicode code points (that's what the U+hexbytes codes are called) with UTF-8 codes, there are discontinuities.  For example, U+007F is encoded in UTF-8 as the single byte 0x7F, but U+0080 (the following character) corresponds in UTF-8 to 0xC280.  Another example of discontinuity: while U+00BF corresponds to 0xC2BF, U+00C0 corresponds to 0xC380.
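The jumps are easy to see if you print the UTF-8 bytes of the code points around those boundaries.  A small sketch (the utf8Hex method is just a helper for this example):

```java
import java.nio.charset.StandardCharsets;

public class Boundaries {
    // Helper for this example: the UTF-8 bytes of one code point, as a hex string
    static String utf8Hex(int codePoint) {
        byte[] b = new String(Character.toChars(codePoint))
                       .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02X", x & 0xFF));
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x7F));  // 7F   - last one-byte code
        System.out.println(utf8Hex(0x80));  // C280 - first two-byte code
        System.out.println(utf8Hex(0xBF));  // C2BF - continuation byte at its maximum
        System.out.println(utf8Hex(0xC0));  // C380 - lead byte steps, continuation wraps to 0x80
    }
}
```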

One last thing: all bytes of UTF-8, with the exception of the first 128 (used for the good old ASCII codes), have the most significant bit set.  For example, the cent sign is encoded as 0xC2A2, which in binary is 11000010 and 10100010.
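Because Java's byte type is signed, a set most significant bit simply makes the byte negative, which is why testing b[i] < 0 picks out exactly the non-ASCII bytes.  A quick illustration with the "Mizrāḥîm" string from the Wikipedia example (ā is U+0101, ḥ is U+1E25, î is U+00EE):

```java
import java.nio.charset.StandardCharsets;

public class MsbCheck {
    public static void main(String[] args) {
        byte[] bytes = "Mizr\u0101\u1E25\u00EEm".getBytes(StandardCharsets.UTF_8);
        for (byte b : bytes) {
            // Signed bytes: "most significant bit set" is simply b < 0
            System.out.printf("%02X %s%n", b & 0xFF,
                              b < 0 ? "<- part of a multi-byte code" : "   plain ASCII");
        }
    }
}
```

The nine characters become twelve bytes, and the seven bytes belonging to the three accented letters are exactly the negative ones.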
Here is how you can read a web page into a Java string:

    final int     BUF_SIZE = 5000;
    URL           url = new URL("http://en.wikipedia.org/wiki/North_Africa");
    URLConnection con = url.openConnection();
    InputStream   resp = con.getInputStream();
    byte[]        b = new byte[BUF_SIZE];
    int n = 0;
    String s = "";
    do {
      n = resp.read(b);
      if (n > 0) s += new String(b, 0, n);
      } while (n > 0);
    resp.close();

Pretty straightforward.  But if you do so, when you display the string, all multi-byte UTF-8 characters will show up as rubbish.  Here is how I fixed it:

  final int     BUF_SIZE = 5000;
  URL           url = new URL("http://en.wikipedia.org/wiki/North_Africa");
  URLConnection con = url.openConnection();
  InputStream   resp = con.getInputStream();
  byte[]        b = new byte[BUF_SIZE];
  final int[]   uniBases = {-1, 0, 0x80, 0x800, 0x10000};

  int n = 0;
  int[] utf = new int[4];
  int nUtf = 0;
  int kUtf = 0;
  int kar = 0;
  String s = "";
  do {
    n = resp.read(b);
    if (n > 0) {
      int i1 = -1;  // index of the last byte already handled (-1: none yet)
      for (int i = 0; i < n; i++) {
        if (b[i] < 0) {
          kar = b[i];
          kar &= 0xFF;
          kUtf++;
          if (kUtf == 1) {
            if (kar >= 0xF0) {
              nUtf = 4;
              utf[0] = kar - 0xF0;
              }
            else if (kar >= 0xE0) {
              nUtf = 3;
              utf[0] = kar - 0xE0;
              }
            else {
              nUtf = 2;
              utf[0] = kar - 0xC2;
              }
            i1++;
            if (i > 0) s += new String(b, i1, i - i1);
            }
          else {
            utf[kUtf - 1] = kar - 0x80;
            if (kUtf == nUtf) {
              kar = uniBases[nUtf] + utf[nUtf - 1] + (utf[nUtf - 2] << 6);
              if (nUtf == 3) {
                if (utf[0] > 0) kar += ((64 - 0x20) << 6) + ((utf[0] - 1) << 12);
                else kar -= 0x800;   // lead byte 0xE0: the second byte starts at 0xA0, not 0x80
                }
              else if (nUtf == 4) {
                kar += utf[1] << 12;
                if (utf[0] > 0) kar += ((64 - 0x10) << 12) + ((utf[0] - 1) << 18);
                else kar -= 0x10000; // lead byte 0xF0: the second byte starts at 0x90, not 0x80
                }

              // Characters beyond U+FFFF need a surrogate pair in a Java string
              s += new String(Character.toChars(kar));

              // Prepare for the next UTF multi-byte code
              kUtf = 0;
              nUtf = 0;
              i1 = i;
              }
            }
          } // if (b[i] ..
        } // for (int i..

      // Save the remaining characters if any
      if (kUtf == 0) {
        i1++;
        if (i1 < n) s += new String(b, i1, n - i1);
        }
      } // if (n > 0..
    } while (n > 0);
  resp.close();

Clearly, I only need to process the incoming bytes that have the most significant bit set (i.e., those for which b[i] < 0).  First of all, I store the byte into an integer, so that I can work more comfortably with it.  When I encounter the first of these "non-ASCII" bytes (i.e., when kUtf == 1), I check its value to determine how many bytes the UTF-8 code requires (four, three, or two).  This tells me how many bytes I still have to collect before I can determine the corresponding Unicode character.

I accumulate the bytes into the utf integer array.  While I do so, I also do some pre-processing to remove the discontinuities.  When I have all the necessary bytes, I just shift them appropriately into the variable kar to form the Unicode character, which I then store into the Java string.
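For the record, the standard library can also do this decoding for you: if you name the charset when constructing the String, Java performs the UTF-8-to-Unicode conversion itself.  A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class BuiltInDecoding {
    public static void main(String[] args) {
        // The bytes for "A¢" as they would arrive from a UTF-8 web page: 0x41, then 0xC2 0xA2
        byte[] raw = {0x41, (byte) 0xC2, (byte) 0xA2};
        // Naming the charset makes String do the multi-byte decoding itself
        String s = new String(raw, StandardCharsets.UTF_8);
        System.out.println(s);  // A¢
    }
}
```

The same idea applies while reading: wrapping the stream in new InputStreamReader(resp, StandardCharsets.UTF_8) decodes on the fly.  The hand-rolled conversion above is still useful if, like me, you want to see the mechanics of UTF-8 at work.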

Sunday, September 8, 2013

John Kerry's Negationism

Less than an hour ago, I saw the American Secretary of State John Kerry on TV.  Talking about Bashar Al Assad, he stated something like "Since the use of poisonous gases was banned after WWI, only Hitler and Saddam Hussein used them".

He "forgot" Italy and Japan.  Everybody can condemn Nazi Germany and Saddam.  But it wouldn't be proper to criticise two modern allies, would it?

Italy dropped mustard gas on Ethiopia in 1935, when Mussolini decided to give Italy's king the additional title of Emperor of Ethiopia.  According to Wikipedia, 150,000 people were killed, but even if you don't consider Wikipedia a reliable source of information, it is clear that Italy killed many Ethiopians with chemical warfare.

Still according to Wikipedia, Japan used chemical warfare in China on many occasions.

Apparently, for fear of retaliation, Germany made very limited use of gases during WWII.  This doesn't seem entirely convincing because, by the time the Allies had landed in Normandy, Nazi Germany had little to lose.  In any case, Zyklon B was used extensively in concentration camps to kill scores of people.

In conclusion, all three Axis powers used chemical warfare before or during WWII.  Because of their racist ideologies or perhaps to avoid retaliation in kind, the gases were only used on blacks, Asians, and what the Germans classified as Untermenschen (subhumans: Jews, homosexuals, Romani people, and others).

I don't know about Japan, but I know that, even before the end of WWII, Italy was seen as a key piece of the frontier between Capitalism in the West and Communism in the East.  It would not have been convenient for the Allies to institute an Italian version of the Nuremberg trials, especially considering that Italy's population included many Socialists and Communists.

That's why all atrocities committed by Fascist Italy before and during WWII were quietly ignored, including the gassing of thousands of Ethiopians (or the atrocities committed in Albania).  The myth of the "good Italian soldier" was created, and most Italians were happy to believe it.

John Kerry is only continuing the tradition of neatly dividing the world into goodies and baddies according to what is convenient.  He could have just stayed quiet, though...

Saturday, September 7, 2013

Australian Federal Elections

This morning I went to vote.

The Australian Federal Parliament consists of two houses: the House of Representatives, with 150 members, and the Senate, with 76 members.

There are two major political blocks, the Australian Labor Party (the ALP) and the coalition of two conservative parties (the Liberal Party of Australia and the National Party of Australia, which in Queensland have actually merged into the Liberal National Party of Queensland).  The third largest party is The Australian Greens.

As the Representatives are elected for three years in 150 electoral divisions, it is difficult for members of groupings other than the ALP and the Coalition to get elected.  In 2010, at the last elections, only five independents and one Green were elected to the House of Representatives.

Things are more complicated with the Senate.  There are twelve Senators for each state elected every six years, plus two senators for the Australian Capital Territory (ACT) and two for the Northern Territory elected every three years.  Half of the six-year senators plus the four three-year senators are elected together with the representatives.  Because of the size of the electorates, minor parties and independents can get elected more easily than in the House of Representatives.  In the current Senate, there are nine Greens senators and three senators not belonging to either of the two major groupings.

I live in the ACT in the electoral division of Fraser.

The ballot paper to elect my Representative for the next three years had a list of seven candidates.  To vote, I had to number all the candidates in order of preference, from 1 to 7.  That was not so bad.

But the ballot paper to elect my two senators for the next three years had a list of 27 (yes, 27) names: 13 parties with two names each plus one independent.  Now, you can vote for the Senate in two ways: either you write a 1 beside your preferred party or you "grade" all candidates.

This is not nice.  I would have liked to be able to give my preference to more than one party but deny my preference to some of the parties.  Unreasonably, this is not possible when voting for the Australian Senate: either you choose one party (and that party passes it on to parties of their choice if they cannot use it) or you give a preference to all candidates (including the parties you hate, which might then get your preference).

My best solution was to give the top four preferences to the Greens and the ALP and my bottom two preferences to the Liberals (with the last one, number 27, to the top Liberal candidate).  Then, I assigned the preferences 5 to 25 to the remaining candidates from left to right and from top to bottom, without even checking who they were.  After all, I know that the two senators will be elected among the top candidates of Greens, ALP, and Liberals.  The ALP candidate, Kate Lundy, most likely, and the Greens candidate, Simon Sheikh, hopefully.  Simon has a chance, although both the ALP and the Liberals have advised their voters to place the Greens below their main opponent, which is nonsensical, especially for the ALP.

The other thing that I don't like about the Australian elections is that no proper identification of the voters is done.  Imagine: you are not asked to show any form of identification!  When you go to collect the ballot paper, you state your name and address.  If that combination is in the big book containing the list of all registered voters, you are asked whether you have already voted somewhere else.  If you answer "no", you get your ballot paper and vote.

This is simply ridiculous.  I don't suggest that we dip a finger in indelible ink like in many third-world countries, but, at the very least, we should show our driver's licence.

In any case, even if we were required to prove our identity, who's going to check whether our name was ticked in two different big books (actually, I'm not even sure that they tick anything)?  What would prevent me from going to vote several times in different voting places?  In this day and age (don't you just love clichés?), anything short of flagging your name in a centralised database in real time is simply not good enough.

And why do we still have to use pencil and paper?  When are we going to vote electronically at the federal elections?  Actually, I would like to be able to vote from home, with my identity proven via an electronic certificate.  Come on!