To be stored digitally, each character of a piece of text is encoded into a particular bit pattern.
For example, according to the ASCII (American Standard Code for Information Interchange) standard, which has been around for half a century, the letter 'A' is encoded with the 7 bits 1000001₂ or, in hexadecimal notation, 41₁₆, which I prefer to write with the Java/C syntax: 0x41. With 7 bits, only 128 patterns (i.e., 2⁷) can be encoded, just enough for plain Latin characters, numbers, and a few special symbols.
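If you want to see this for yourself, here is a minimal sketch (the class name is mine, just for illustration) that prints the numeric code of 'A':

public class AsciiDemo {
    public static void main(String[] args) {
        char c = 'A';
        // Prints: A = 0x41 (65)
        System.out.println(c + " = 0x" + Integer.toHexString(c) + " (" + (int) c + ")");
    }
}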
Over the past couple of decades, a different type of encoding called UTF-8, based on a variable number of bytes, has established itself as the most common encoding used in HTML pages.
Often, UTF-8 is confused with Unicode, but while UTF-8 is a way of encoding characters, Unicode is a character set, that is, a list of characters. This means that the same Unicode character can be encoded in UTF-8, UTF-16, and (for the characters they cover) the ISO-8859 encodings, among other formats.
You will find that most people on the Internet refer to Unicode as an
encoding. Now you know that they are not completely correct, although,
to be fair, the distinction is usually irrelevant.
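To make the distinction concrete, here is a small sketch (my own illustration; it uses the standard Java charset constants) that encodes the single Unicode character 'A' with three different encodings:

import java.nio.charset.StandardCharsets;

public class OneCharacterThreeEncodings {
    static void dump(String label, byte[] bytes) {
        StringBuilder sb = new StringBuilder(label + ":");
        for (byte b : bytes) sb.append(String.format(" %02X", b & 0xFF));
        System.out.println(sb);
    }

    public static void main(String[] args) {
        String a = "A";
        dump("UTF-8     ", a.getBytes(StandardCharsets.UTF_8));      // 41
        dump("UTF-16BE  ", a.getBytes(StandardCharsets.UTF_16BE));   // 00 41
        dump("ISO-8859-1", a.getBytes(StandardCharsets.ISO_8859_1)); // 41
    }
}

Same character, different byte sequences: the character set tells you which character it is, the encoding tells you which bytes represent it.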
The Wikipedia pages on Unicode and UTF-8 are very informative, and I don't want to repeat their content here. But I would like to show you a couple of examples taken from the UTF-8 encoding table and the Unicode character set.
The character 'A', which was encoded as 0x41 in ASCII, is character
U+0041 in Unicode and is encoded as 0x41 in UTF-8. "Wait a minute", you
might say, "what's the point of all the fuss if the number 0x41 stays the
same everywhere?"
The answer is simple: the ASCII and UTF-8 encodings for all Unicode
characters from U+0000 to U+007F are identical. This makes sense for
backward compatibility. But while ASCII only encodes 128 characters, UTF-8
can encode all of the many thousands of Unicode characters. To see the
differences, you have to go beyond U+007F.
For example, U+00A2, the cent sign '¢', which doesn't exist in ASCII, is
encoded as 0xC2A2 in UTF-8. Note that U+C2A2 is also a valid Unicode
character, but it has nothing to do with the UTF-8 byte sequence 0xC2A2. Don't get
confused! U+C2A2 is the character '슢' (a syllable of the Korean
alphabet that, according to Google Translate, is called Syup...). This
is the first hint at why we might need to convert UTF-8 to Unicode
although Unicode is not an encoding!
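Here is a small sketch (again, my own illustration) that shows both things at once: the two UTF-8 bytes of the cent sign, and the completely unrelated character that lives at code point U+C2A2:

import java.nio.charset.StandardCharsets;

public class CentSignDemo {
    public static void main(String[] args) {
        byte[] utf8 = "\u00A2".getBytes(StandardCharsets.UTF_8);
        // Prints "C2 A2": the two-byte UTF-8 encoding of U+00A2 '¢'
        System.out.printf("%02X %02X%n", utf8[0] & 0xFF, utf8[1] & 0xFF);
        // Prints the Korean syllable at code point U+C2A2, not the cent sign
        System.out.println("\uC2A2");
    }
}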
The problem arises when you want to work in Java with text that you have
'grabbed' from a web page: the web page is encoded in UTF-8, while Java
strings (i.e., objects of type java.lang.String) consist of Unicode
characters. If you grab a piece of text from the Web, store it in a
Java string, and display it, only the "ASCII-like" characters are
displayed correctly.
For example, the Wikipedia page about North Africa contains "Mizrāḥîm", but if you display it without any conversion, you get "MizrƒÅ·∏•√Æm".
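The garbled string above is what you get when the UTF-8 bytes are interpreted with the wrong single-byte charset (in this case it looks like MacRoman, the old Mac default). You can reproduce the effect with a sketch like this one, which deliberately decodes the UTF-8 bytes as ISO-8859-1; the exact rubbish you see depends on which wrong charset does the decoding:

import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        String original = "Mizr\u0101\u1E25\u00EEm";   // "Mizrāḥîm"
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        // Decoding UTF-8 bytes with a single-byte charset produces mojibake
        String garbled = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);   // prints rubbish instead of "Mizrāḥîm"
    }
}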
In the rest of this article, I will explain how you can correctly store
text grabbed from the Web into a Java string. There are probably better
ways to do it, but my way works. If you find a better algorithm and would
like to share it, I would welcome it.
To help you understand my code, before I show it to you, I would like
you to observe that when you match Unicode code points (that's what the
U+hex codes are called) with UTF-8 codes, there are
discontinuities. For example, U+007F is encoded in UTF-8 as the single byte
0x7F, but U+0080 (the following character) corresponds in UTF-8 to the two
bytes 0xC280. Another example of discontinuity: while U+00BF corresponds to
0xC2BF, U+00C0 corresponds to 0xC380.
One last thing: all the bytes of a multi-byte UTF-8 code, unlike the first 128
single-byte codes (used for the good old ASCII characters), have the most
significant bit set. For example, the cent sign is encoded as 0xC2A2, which in binary is 11000010₂ followed by 10100010₂.
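To make the two-byte layout concrete, here is a minimal sketch (my own illustration, not part of the code below) that reassembles U+00A2 from the bytes 0xC2 and 0xA2 using the standard 110xxxxx 10xxxxxx pattern:

public class TwoByteDecodeDemo {
    public static void main(String[] args) {
        int b1 = 0xC2;   // 11000010: leading byte, 5 payload bits
        int b2 = 0xA2;   // 10100010: continuation byte, 6 payload bits
        // Keep the payload bits of each byte and put them together
        int codePoint = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        System.out.printf("U+%04X%n", codePoint);   // prints U+00A2
    }
}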
Here is how you can read a web page into a Java string:
// Requires java.io.InputStream, java.net.URL, and java.net.URLConnection;
// the surrounding method must handle or declare IOException.
final int BUF_SIZE = 5000;
URL url = new URL("http://en.wikipedia.org/wiki/North_Africa");
URLConnection con = url.openConnection();
InputStream resp = con.getInputStream();
byte[] b = new byte[BUF_SIZE];
int n = 0;
String s = "";
do {
n = resp.read(b);
if (n > 0) s += new String(b, 0, n);
} while (n > 0);
resp.close();
Pretty straightforward. But if you do so, when you display the string, all multi-byte UTF-8 characters will show up as rubbish. Here is how I fixed it:
final int BUF_SIZE = 5000;
URL url = new URL("http://en.wikipedia.org/wiki/North_Africa");
URLConnection con = url.openConnection();
InputStream resp = con.getInputStream();
byte[] b = new byte[BUF_SIZE];
final int[] uniBases = {-1, 0, 0x80, 0x800, 0x10000};
int n = 0;
int[] utf = new int[4];
int nUtf = 0;
int kUtf = 0;
int kar = 0;
s = "";
do {
n = resp.read(b);
if (n > 0) {
int i1 = -1;
for (int i = 0; i < n; i++) {
if (b[i] < 0) {
kar = b[i];
kar &= 0xFF;
kUtf++;
if (kUtf == 1) {
if (kar >= 0xF0) {
nUtf = 4;
utf[0] = kar - 0xF0;
}
else if (kar >= 0xE0) {
nUtf = 3;
utf[0] = kar - 0xE0;
}
else {
nUtf = 2;
utf[0] = kar - 0xC2;
}
i1++;
if (i > 0) s += new String(b, i1, i - i1);
}
else {
utf[kUtf - 1] = kar - 0x80;
if (kUtf == nUtf) {
kar = uniBases[nUtf] + utf[nUtf - 1] + (utf[nUtf - 2] << 6);
if (nUtf == 3) {
if (utf[0] > 0) kar += ((64 - 0x20) << 6) + ((utf[0] - 1) << 12);
else kar -= 0x20 << 6; // lead byte 0xE0: the first continuation byte starts at 0xA0, not 0x80
}
else if (nUtf == 4) {
kar += utf[1] << 12;
if (utf[0] > 0) kar += ((64 - 0x10) << 12) + ((utf[0] - 1) << 18);
else kar -= 0x10 << 12; // lead byte 0xF0: the first continuation byte starts at 0x90, not 0x80
}
// Append the decoded code point (values above U+FFFF need a surrogate pair)
s += new String(Character.toChars(kar));
// Prepare for the next UTF multi-byte code
kUtf = 0;
nUtf = 0;
i1 = i;
}
}
} // if (b[i] ..
} // for (int i..
// Save the remaining characters if any
if (kUtf == 0) {
i1++;
if (i1 < n) s += new String(b, i1, n - i1);
}
} // if (n > 0..
} while (n > 0);
resp.close();
Clearly, I only need to process the incoming bytes that have the most significant bit set (i.e., those for which b[i] < 0).
First of all, I store the byte into an integer, so that I can work more
comfortably with it. When I encounter the first of these "non-ASCII"
bytes (i.e., when kUtf == 1), I check its value to determine
how many bytes the UTF-8 code requires (four, three, or two). This
tells me how many bytes I still have to collect before I can determine
the corresponding Unicode character.
I accumulate the bytes into the utf integer array. While I do
so, I also do some pre-processing to remove the discontinuities. When I
have all the necessary bytes, I just shift them appropriately into the
variable kar to form the Unicode character, which I then store into the Java string.
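For comparison, here is a much shorter way to get the same result by letting the standard library do the UTF-8 decoding. It is only a sketch, and it assumes (as my code above does) that the page really is served as UTF-8:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class Utf8PageReader {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://en.wikipedia.org/wiki/North_Africa");
        URLConnection con = url.openConnection();
        StringBuilder sb = new StringBuilder();
        // InputStreamReader converts the incoming UTF-8 bytes to Unicode characters
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(con.getInputStream(), StandardCharsets.UTF_8))) {
            char[] buf = new char[5000];
            int n;
            while ((n = in.read(buf)) > 0) sb.append(buf, 0, n);
        }
        String s = sb.toString();
        System.out.println(s.length() + " characters read");
    }
}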