
Tuesday, September 10, 2013

Converting UTF-8 to Unicode

To be stored digitally, each character of a piece of text is encoded into a particular bit pattern.

For example, according to the ASCII (American Standard Code for Information Interchange) standard, which has been around for half a century, the letter 'A' is encoded with the 7 bits 1000001 or, in hexadecimal notation, 41, which I prefer to write with the Java/C syntax: 0x41.  With 7 bits, only 128 patterns can be encoded (i.e., 2^7), just enough for plain Latin characters, numbers, and a few special symbols.
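
If you want to check this in Java, you can print the numeric value of a character yourself (a two-line sketch using nothing but the standard library):

  char c = 'A';
  System.out.printf("0x%02X%n", (int) c);   // prints 0x41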

Over the past couple of decades, a different type of encoding called UTF-8, based on a variable number of bytes, has established itself as the most common encoding used in HTML pages.

Often, UTF-8 is confused with Unicode, but while UTF-8 is a way of encoding characters, Unicode is a character set.  That is, a list of characters.  This means that the same Unicode character can be encoded in UTF-8, UTF-16, ISO-8859, and other formats.  You will find that most people on the Internet refer to Unicode as an encoding.  Now you know that they are not completely correct, although, to be fair, the distinction is usually irrelevant.
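
To make the distinction concrete, here is a little Java sketch (it assumes Java 7 or later for java.nio.charset.StandardCharsets) that encodes the same Unicode character, U+00E9 ('é'), with three different encodings and prints the resulting bytes:

  // Requires: import java.nio.charset.StandardCharsets;
  String e = "\u00E9";                                    // the Unicode character U+00E9: 'é'
  for (byte x : e.getBytes(StandardCharsets.UTF_8))       // prints C3 A9
    System.out.printf("%02X ", x & 0xFF);
  System.out.println();
  for (byte x : e.getBytes(StandardCharsets.UTF_16BE))    // prints 00 E9
    System.out.printf("%02X ", x & 0xFF);
  System.out.println();
  for (byte x : e.getBytes(StandardCharsets.ISO_8859_1))  // prints E9
    System.out.printf("%02X ", x & 0xFF);

One character set, one character, three different byte patterns: that is the whole difference between Unicode and its encodings.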

The Wikipedia pages on Unicode and UTF-8 are very informative.  Therefore, I don't want to repeat them here.  But I would like to show you a couple of examples taken from the UTF-8 encoding table and Unicode characters.

The character 'A', which was encoded as 0x41 in ASCII, is character U+0041 in Unicode and is encoded as 0x41 in UTF-8.  "Wait a minute", you might say, "what's the point of all the fuss if the number 0x41 stays the same everywhere?"

The answer is simple: the ASCII and UTF-8 encodings for all Unicode characters from U+0000 to U+007F are identical.  This makes sense for backward compatibility.  But while ASCII only encodes 128 characters, UTF-8 can encode all of Unicode's many thousands of characters.  To see the differences, you have to go beyond U+007F.

For example, U+00A2, the cent sign '¢', which doesn't exist in ASCII, is encoded as 0xC2A2 in UTF-8.  Note that U+C2A2 is a valid Unicode character, but it has nothing to do with the UTF-8 code 0xC2A2.  Don't get confused!  U+C2A2 is the character '슢' (a syllable of the Korean alphabet that, according to Google Translate, is called Syup...).  This is the first hint at why we might need to convert UTF-8 to Unicode although Unicode is not an encoding!
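
You can verify this in Java (again assuming Java 7's StandardCharsets; with older Javas, pass the charset name "UTF-8" instead):

  char syup = 0xC2A2;                                       // the Unicode character U+C2A2: '슢'
  byte[] utf8 = { (byte)0xC2, (byte)0xA2 };
  String cent = new String(utf8, StandardCharsets.UTF_8);   // the UTF-8 code 0xC2A2: '¢'
  System.out.println(syup + " vs. " + cent);                // prints: 슢 vs. ¢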

The problem arises when you want to work in Java with text that you have 'grabbed' from a web page: the web page is encoded in UTF-8, while Java strings (i.e., objects of type java.lang.String) consist of Unicode characters.  If you grab from the Web a piece of text, store it into a Java string, and display it, only the "ASCII-like" characters are displayed correctly.

For example, the Wikipedia page about North Africa contains "Mizrāḥîm", but if you display it without any conversion, you get "MizrƒÅ·∏•√Æm".

In the rest of this article, I will explain how you can correctly store into a Java string text grabbed from the Web.  There are probably better ways to do it, but my way works.  If you find a better algorithm and would like to share it, I would welcome it.
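
One such better way, for what it's worth: if all you need is the decoded text, the standard library can already do the conversion for you.  Wrapping the stream in a java.io.InputStreamReader configured for UTF-8 (a minimal sketch, using only standard classes) also takes care of multi-byte codes that straddle buffer boundaries:

  // Requires: import java.io.BufferedReader;
  //           import java.io.InputStreamReader;
  //           import java.net.URL;
  URL url = new URL("http://en.wikipedia.org/wiki/North_Africa");
  BufferedReader in = new BufferedReader(
      new InputStreamReader(url.openConnection().getInputStream(), "UTF-8"));
  StringBuilder sb = new StringBuilder();
  int c;
  while ((c = in.read()) >= 0) sb.append((char)c);
  in.close();
  String s = sb.toString();

Doing the conversion by hand, as in the rest of this article, shows what actually goes on inside such a reader.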

To help you understand my code, before I show it to you, I would like you to observe that when you match Unicode code points (that's what the U+hexbytes codes are called) against UTF-8 codes, there are discontinuities.  For example, U+007F is encoded in UTF-8 as 0x7F, but U+0080 (the following character) corresponds in UTF-8 to 0xC280.  Another example of discontinuity: while U+00BF corresponds to 0xC2BF, U+00C0 corresponds to 0xC380.
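
You can see these jumps for yourself with a few lines of Java (same StandardCharsets assumption as above):

  for (int cp : new int[] {0x7F, 0x80, 0xBF, 0xC0}) {
    System.out.printf("U+%04X ->", cp);
    for (byte x : String.valueOf((char)cp).getBytes(StandardCharsets.UTF_8))
      System.out.printf(" %02X", x & 0xFF);
    System.out.println();   // prints: U+007F -> 7F, U+0080 -> C2 80, U+00BF -> C2 BF, U+00C0 -> C3 80
    }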

One last thing: all bytes of UTF-8, with the exception of the first 128 (used for the good old ASCII codes), have the most significant bit set.  For example, the cent sign is encoded as 0xC2A2, which in binary is 11000010 and 10100010.

Here is how you can read a web page into a Java string:

    // Requires: import java.io.InputStream;
    //           import java.net.URL;
    //           import java.net.URLConnection;
    final int     BUF_SIZE = 5000;
    URL           url = new URL("http://en.wikipedia.org/wiki/North_Africa");
    URLConnection con = url.openConnection();
    InputStream   resp = con.getInputStream();
    byte[]        b = new byte[BUF_SIZE];
    int           n = 0;
    String        s = "";
    do {
      n = resp.read(b);
      if (n > 0) s += new String(b, 0, n);
      } while (n > 0);
    resp.close();

Pretty straightforward.  But if you do so, when you display the string, all multi-byte UTF-8 characters will show up as rubbish.  Here is how I fixed it:

  // Requires: import java.io.InputStream;
  //           import java.net.URL;
  //           import java.net.URLConnection;
  final int     BUF_SIZE = 5000;
  URL           url = new URL("http://en.wikipedia.org/wiki/North_Africa");
  URLConnection con = url.openConnection();
  InputStream   resp = con.getInputStream();
  byte[]        b = new byte[BUF_SIZE];

  // First Unicode code point encoded with 1, 2, 3, and 4 UTF-8 bytes
  final int[]   uniBases = {-1, 0, 0x80, 0x800, 0x10000};

  int n = 0;
  int[] utf = new int[4];   // the bytes of the current multi-byte UTF-8 code
  int nUtf = 0;             // how many bytes the current UTF-8 code requires
  int kUtf = 0;             // how many of those bytes have been collected so far
  int kar = 0;
  String s = "";
  do {
    n = resp.read(b);
    if (n > 0) {
      int i1 = -1;          // the last buffer position already copied to the string
      for (int i = 0; i < n; i++) {
        if (b[i] < 0) {     // only non-ASCII bytes have the most significant bit set
          kar = b[i];
          kar &= 0xFF;      // treat the byte as unsigned
          kUtf++;
          if (kUtf == 1) {  // the first byte tells us the length of the code
            if (kar >= 0xF0) {
              nUtf = 4;
              utf[0] = kar - 0xF0;
              }
            else if (kar >= 0xE0) {
              nUtf = 3;
              utf[0] = kar - 0xE0;
              }
            else {
              nUtf = 2;
              utf[0] = kar - 0xC2;
              }

            // Save the ASCII characters collected so far
            i1++;
            if (i > 0) s += new String(b, i1, i - i1);
            }
          else {
            utf[kUtf - 1] = kar - 0x80;
            if (kUtf == nUtf) {
              kar = uniBases[nUtf] + utf[nUtf - 1] + (utf[nUtf - 2] << 6);
              if (nUtf == 3) {
                if (utf[0] > 0) kar += ((64 - 0x20) << 6) + ((utf[0] - 1) << 12);
                else            kar -= 0x20 << 6;   // lead byte 0xE0: continuations start at 0xA0
                }
              else if (nUtf == 4) {
                kar += utf[1] << 12;
                if (utf[0] > 0) kar += ((64 - 0x10) << 12) + ((utf[0] - 1) << 18);
                else            kar -= 0x10 << 12;  // lead byte 0xF0: continuations start at 0x90
                }

              // Code points beyond U+FFFF need two Java chars (a surrogate pair)
              s += new String(Character.toChars(kar));

              // Prepare for the next UTF multi-byte code
              kUtf = 0;
              nUtf = 0;
              i1 = i;
              }
            }
          } // if (b[i] ..
        } // for (int i..

      // Save the remaining characters if any
      if (kUtf == 0) {
        i1++;
        if (i1 < n) s += new String(b, i1, n - i1);
        }
      } // if (n > 0..
    } while (n > 0);
  resp.close();

Clearly, I only need to process the incoming bytes that have the most significant bit set; because Java's byte type is signed, those are exactly the bytes for which b[i] < 0.  First of all, I store the byte into an integer and mask it with 0xFF, so that I can work with its unsigned value more comfortably.  When I encounter the first of these "non-ASCII" bytes (i.e., when kUtf == 1), I check its value to determine how many bytes the UTF-8 code requires (four, three, or two).  This tells me how many bytes I still have to collect before I can determine the corresponding Unicode character.
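
For reference, here is how the value of the first byte maps to the length of the code and to the range of Unicode code points (the ranges are those of valid UTF-8; I ignore invalid first bytes):

  First byte     Code length    Unicode code points
  0xC2 - 0xDF    2 bytes        U+0080 - U+07FF
  0xE0 - 0xEF    3 bytes        U+0800 - U+FFFF
  0xF0 - 0xF4    4 bytes        U+10000 - U+10FFFF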

I accumulate the bytes into the utf integer array.  While I do so, I also do some pre-processing to remove the discontinuities.  When I have all the necessary bytes, I shift them appropriately into the variable kar to form the Unicode code point, which I then append to the Java string (as a pair of chars, a so-called surrogate pair, when the code point exceeds U+FFFF).
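
As a worked example, take the cent sign again.  Its UTF-8 code is 0xC2 0xA2, a two-byte code (nUtf == 2), and the arithmetic goes like this:

  utf[0] = 0xC2 - 0xC2 = 0x00     // the first byte, stripped of its 2-byte marker
  utf[1] = 0xA2 - 0x80 = 0x22     // the continuation byte, stripped of its marker
  kar    = uniBases[2] + utf[1] + (utf[0] << 6)
         = 0x80 + 0x22 + 0x00
         = 0xA2                   // U+00A2, the cent sign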
