Unicode support is built into Java from the ground up. A char is a Unicode character and a String is a string of Unicode text. The Unicode support is so intuitive that programmers without the slightest Unicode experience can easily read or write Unicode data files.
The most common Java task that requires some Unicode know-how is opening a file that contains Unicode data. Tip #1: Use Readers and Writers to work with Unicode data. Java's standard InputStream and OutputStream objects are for reading and writing binary bytes; they are not Unicode-aware. Readers and Writers are Unicode-aware and will perform all the necessary encoding and decoding for you.
If you're programming entirely in a closed Java environment on a single computer, this might be all the information you need. The Java I/O classes will read and write Unicode data in the "default character encoding" by default. The default encoding depends on some combination of the JVM, the underlying OS, and the OS-level locale settings. For example, on my Windows 2000 computer, the default encoding is CP-1252, a Microsoft-specific variant of Latin-1.
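If you want to see what your own platform's default is, later JDKs (5.0 and up) let you ask for it directly through java.nio.charset.Charset; older ones expose it via the file.encoding system property. A minimal sketch (the class and method names are just for illustration):

```java
import java.nio.charset.Charset;

public class DefaultEncodingDemo {
    // Returns the name of the JVM's default character encoding.
    public static String defaultEncoding() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default encoding: " + defaultEncoding());
        System.out.println("file.encoding property: "
            + System.getProperty("file.encoding"));
    }
}
```

On a Western-locale Windows machine of that era, both lines typically report some variant of CP-1252 or windows-1252.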
If depending on the default encoding sounds sketchy to you, you're right. So Tip #2 is: keep track of which encoding you're using. Consider the following code, which opens a file and reads a line of text from it.
// WARNING: does not specify the file's encoding
BufferedReader r = new BufferedReader(new FileReader("infile.txt"));
String line = r.readLine();
This code implicitly assumes that the file is in the default encoding. If the file is, say, UTF-8 instead, then the non-ASCII characters in line will likely be garbled. The programmer should instead specify the name of the encoding.
BufferedReader rdr = new BufferedReader(
    new InputStreamReader(new FileInputStream("infile.txt"), "ISO-8859-1"));
String line = rdr.readLine();
Now Java knows what encoding to expect. It's that simple; the Reader object decodes the file for you on the fly. Output is similar:
BufferedWriter out = new BufferedWriter(
    new OutputStreamWriter(new FileOutputStream("outfile.xml"), "ISO2022KR"));
out.write("<?xml version=\"1.0\" encoding=\"ISO-2022-KR\"?>\n"
    + "<greeting>\uc5ec\ubcf4\uc138\uc694 "
    + "\uc138\uacc4!</greeting>\n");
out.close();
(See the resulting outfile.xml file.) Note that in this example, Java's name for the Korean encoding is "ISO2022KR", but the XML name for the encoding is "ISO-2022-KR". XML uses the international standard names for character encodings. In version 1.3 of the Java 2 platform, Sun used non-standard names for many encodings. (This seems to have been corrected in version 1.4.)
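Starting with version 1.4, the java.nio.charset.Charset class gives you each encoding's canonical name and its accepted aliases, which is a handy way to check what a given name maps to. A small sketch (the exact alias set depends on your JDK's charset table):

```java
import java.nio.charset.Charset;

public class CharsetNamesDemo {
    public static void main(String[] args) {
        // Look a charset up by one of its historical aliases.
        Charset latin1 = Charset.forName("latin1");
        System.out.println(latin1.name());     // canonical name: ISO-8859-1
        System.out.println(latin1.aliases());  // the other accepted names
    }
}
```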
This example also demonstrates how to type Unicode characters in an actual Java program. The Java compiler itself assumes that Java source code files use the default encoding; but you can also enter escape sequences of the form \u0000; for example, \u263A specifies the character U+263A.
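Because the compiler translates \u escapes before anything else, they work anywhere in the source, even inside character literals. A tiny illustration:

```java
public class EscapeDemo {
    public static void main(String[] args) {
        char smiley = '\u263A';            // the character U+263A
        System.out.println((int) smiley);  // prints 9786, i.e. 0x263A
    }
}
```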
Reading an XML file in Java is even easier. The standard XML parsing API for Java, SAX, automatically chooses the right encoding when it opens an XML file. SAX provides the parsed data to you in String objects.
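A minimal sketch of this, using the SAXParserFactory entry point from JAXP (the class and handler names are just for illustration). One caveat: the encoding auto-detection happens when SAX reads raw bytes; here the input arrives pre-decoded through a Reader to keep the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDemo {
    // Parses an XML document and collects all of its character data.
    public static String textOf(String xml) throws Exception {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);  // SAX hands us decoded chars
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
            + "<greeting>\uc5ec\ubcf4\uc138\uc694!</greeting>";
        System.out.println(textOf(xml));
    }
}
```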
Programs also occasionally need to encode or decode data directly. The class java.lang.String provides both of these capabilities. For example, the following method converts data from one encoding to another.
public static byte[] convert(byte[] data, String srcEncoding, String targetEncoding)
        throws UnsupportedEncodingException {
    // First, decode the data using the source encoding.
    // The String constructor does this.
    String str = new String(data, srcEncoding);

    // Next, encode the data using the target encoding.
    // The String.getBytes() method does this.
    byte[] result = str.getBytes(targetEncoding);
    return result;
}
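To see the idea in action, here is a self-contained version with a tiny round trip (the demo class name and the sample bytes are mine):

```java
import java.io.UnsupportedEncodingException;

public class ConvertDemo {
    // Transcodes bytes from one encoding to another: decode, then re-encode.
    public static byte[] convert(byte[] data, String src, String target)
            throws UnsupportedEncodingException {
        return new String(data, src).getBytes(target);
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = {(byte) 0xFC};  // 'ü' as a single ISO-8859-1 byte
        byte[] utf8 = convert(latin1, "ISO-8859-1", "UTF-8");
        System.out.println(utf8.length);  // 2: U+00FC takes two bytes in UTF-8
    }
}
```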
You can also type Unicode data directly into a Java program. The javac compiler can read Java source files using the encoding of your choice. So ordinarily, you can type non-ASCII characters directly into a string, or even a variable name. Here is a complete "Hello, world!" program in German.
public class Hallo {
    public static void main(String[] args) {
        String Gruß = "Hallo, verrückte Welt!";
        System.out.println(Gruß);
    }
}
When you compile this program with the command javac Hallo.java, the compiler does not know the encoding of the source file. Therefore it uses your platform's default encoding. You might wish to tell javac explicitly which encoding to use, instead. Use the -encoding option to do this: javac -encoding ISO-8859-1 Hallo.java. If you do not specify the right encoding, javac will misread the non-ASCII characters and may generate a lot of syntax errors as a result.
However, once the program is properly compiled, there is another problem. When I run this on my computer, the letter ü is incorrectly displayed:
Hallo, verrnckte Welt!
This is because Java produces its output in Latin-1, which my computer's console does not understand. To properly display Unicode data, a program must use a Unicode-capable GUI. Client applications can use Swing. Java servlets and JSPs can usually count on the client's browser to be a Unicode-capable display.
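One partial workaround, if your console does understand some Unicode encoding, is to wrap System.out in a PrintStream with an explicit encoding (that constructor exists as of 1.4). The sketch below writes to a buffer instead of the real console so the effect of the encoding is visible as a byte count; the class and method names are just for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

public class ConsoleDemo {
    // Writes a string through a PrintStream using an explicit encoding
    // and returns the raw bytes it produced.
    public static byte[] encode(String s, String encoding) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, encoding);
        out.print(s);
        out.flush();
        return buf.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        // 'ü' becomes one byte in Latin-1 but two in UTF-8.
        System.out.println(encode("\u00FC", "ISO-8859-1").length);  // 1
        System.out.println(encode("\u00FC", "UTF-8").length);       // 2
    }
}
```

In a real program you would pass System.out rather than a buffer, e.g. new PrintStream(System.out, true, "UTF-8").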
Note that Swing works beautifully with Unicode strings. Sun's JDK even ships with some special Unicode fonts to ensure that international text has a consistent look. There are no special tips or techniques to learn; just try it!