Unicode in Java

Unicode support is built into Java from the ground up. A char is a 16-bit UTF-16 code unit, which holds most Unicode characters directly (characters outside the Basic Multilingual Plane take two chars), and a String is a string of Unicode text. The Unicode support is intuitive enough that programmers with little Unicode experience can read and write Unicode data files.

Unicode I/O in Java

The most common Java task that requires some Unicode know-how is opening a file that contains Unicode data. Tip #1: Use Readers and Writers to work with Unicode data. Java's standard InputStream and OutputStream objects are for reading and writing binary bytes; they are not Unicode-aware. Readers and Writers are Unicode-aware and will perform all the necessary encoding and decoding for you.

If you're programming entirely in a closed Java environment on a single computer, this might be all the information you need. Unless told otherwise, the Java I/O classes read and write Unicode data in the "default character encoding". The default encoding depends on some combination of the JVM, the underlying OS, and the OS-level locale settings. For example, on my Windows 2000 computer, the default encoding is Cp1252 (also known as windows-1252), a Microsoft-specific variant of Latin-1.
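To check which default your own JVM has picked, you can ask the runtime directly. The sketch below assumes Java 5 or later, where java.nio.charset.Charset.defaultCharset() is available; the class name ShowDefaultEncoding is made up for this illustration. Note that on JDK 18 and later the default is UTF-8 unless it is overridden, so the result may differ from the Windows example above.

```java
import java.nio.charset.Charset;

public class ShowDefaultEncoding {
    // Returns the canonical name of the JVM's default charset.
    public static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default encoding: " + defaultEncodingName());
        // The file.encoding system property usually reflects the same setting.
        System.out.println("file.encoding:    " + System.getProperty("file.encoding"));
    }
}
```

The output varies from machine to machine, which is exactly why relying on it is risky.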

If you think it sounds kind of sketchy to depend on the default encoding too much, you're right. So Tip #2 is: keep in mind which encoding you're using. Consider the following code, which opens a file and reads a line of text from it.

  // WARNING: does not specify the file's encoding
  BufferedReader r =
      new BufferedReader(new FileReader("infile.txt"));
  String line = r.readLine();

This code implicitly assumes that the file is in the default encoding. If the file is, say, UTF-8 instead, then the non-ASCII characters in line will likely be garbled.
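The garbling can be reproduced without any files. The following sketch encodes a string as UTF-8 and then decodes the bytes as ISO-8859-1, which is roughly what happens when a UTF-8 file is read under a Latin-1 default. It assumes Java 7 or later for java.nio.charset.StandardCharsets; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

public class WrongEncodingDemo {
    // Encodes text as UTF-8, then decodes those bytes as if they were ISO-8859-1.
    public static String misread(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "verr\u00fcckte";   // 9 characters, one of them U+00FC
        String garbled = misread(original);
        // U+00FC is two bytes in UTF-8 (0xC3 0xBC); each byte is decoded
        // as a separate Latin-1 character, so the string grows by one.
        System.out.println(original.length()); // 9
        System.out.println(garbled.length());  // 10
    }
}
```

ASCII characters survive the mix-up because UTF-8 and ISO-8859-1 agree on them, which is why this kind of bug often goes unnoticed until non-ASCII data shows up.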

The programmer should instead specify the name of the encoding.

  BufferedReader rdr = 
      new BufferedReader(
          new InputStreamReader(new FileInputStream("infile.txt"),
                                "ISO-8859-1"));
  String line = rdr.readLine();

Now Java knows what encoding you expect. It's that simple; the Reader object decodes the file for you on the fly. Output is similar:

Sample code to output some Unicode text in an XML document:
  BufferedWriter out = 
      new BufferedWriter(
          new OutputStreamWriter(new FileOutputStream("outfile.xml"),
                                 "ISO2022KR"));
  out.write("<?xml version=\"1.0\" encoding=\"ISO-2022-KR\"?>\n" +
            "<greeting>\uc5ec\ubcf4\uc138\uc694 " +
            "\uc138\uacc4!</greeting>\n");
  out.close();

Note that in this example, Java's name for the Korean encoding is "ISO2022KR", but the XML name for the encoding is "ISO-2022-KR". XML uses the international standard names for character encodings. In version 1.3 of the Java 2 platform, Sun used non-standard names for many encodings. (This seems to have been corrected in version 1.4.)
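If you are unsure how a Java encoding name relates to the standard name, the java.nio.charset API (added in version 1.4) can translate for you. This is a small sketch, assuming a JVM that recognizes the historical names as aliases; it uses ISO8859_1 and UTF8 rather than the Korean encoding because every Java platform is required to support ISO-8859-1 and UTF-8, while ISO-2022-KR may not be installed. The class name EncodingNames is hypothetical.

```java
import java.nio.charset.Charset;

public class EncodingNames {
    // Maps any name the JVM recognizes (historical name or alias)
    // to the charset's canonical name.
    public static String canonicalName(String javaName) {
        return Charset.forName(javaName).name();
    }

    public static void main(String[] args) {
        System.out.println(canonicalName("ISO8859_1")); // ISO-8859-1
        System.out.println(canonicalName("UTF8"));      // UTF-8
        // Lists the other names this JVM accepts for the same charset.
        System.out.println(Charset.forName("ISO-8859-1").aliases());
    }
}
```

Charset.forName throws an unchecked UnsupportedCharsetException if the JVM does not know the name, so a misspelled encoding fails fast instead of silently falling back to the default.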

This example also demonstrates how to type Unicode characters in an actual Java program. The Java compiler itself assumes that Java source code files use the default encoding, but you can also enter escape sequences of the form \uXXXX, where XXXX is four hexadecimal digits; for example, \u263A specifies the character U+263A.
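As a quick check that an escape really produces a single character, here is a minimal sketch. The class name and constant are invented for the example; because the source uses only the escape, the file stays pure ASCII and compiles the same way under any source encoding.

```java
public class UnicodeEscapeDemo {
    // Written with an escape, so the source file contains only ASCII.
    public static final String SMILEY = "\u263A";

    public static void main(String[] args) {
        System.out.println(SMILEY.length());                       // 1
        System.out.println(Integer.toHexString(SMILEY.charAt(0))); // 263a
    }
}
```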

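Reading an XML file in Java is even easier. A SAX parser, Java's standard event-based XML parsing API, determines the encoding itself from the file's byte order mark or XML declaration when it is handed the raw bytes. SAX provides the parsed data to you as Java characters, so your code never deals with the file's encoding.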
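The following sketch shows the idea with an in-memory document rather than a file. It assumes the JAXP classes bundled with the JDK (javax.xml.parsers and org.xml.sax); the class SaxEncodingDemo and its textOf method are made up for illustration. The point to notice is that the parsing code never names an encoding: the parser picks up ISO-8859-1 from the XML declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class SaxEncodingDemo {
    // Parses XML bytes and returns all character data concatenated.
    // No encoding is passed in; the parser reads it from the document.
    public static String textOf(byte[] xmlBytes) {
        final StringBuilder text = new StringBuilder();
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xmlBytes),
                new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        text.append(ch, start, length);
                    }
                });
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<greeting>Hallo, verr\u00fcckte Welt!</greeting>";
        // Encode the document the way its declaration says it is encoded.
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(latin1Bytes).equals("Hallo, verr\u00fcckte Welt!")); // true
    }
}
```

Passing the parser an InputStream (bytes) rather than a Reader matters here: with a Reader, the decoding has already happened and the declaration can no longer influence it.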

Encoding and Decoding Java Strings

Programs also occasionally need to encode or decode data directly. The class java.lang.String provides both of these capabilities. For example, the following method converts data from one encoding to another.

  public static byte[] convert(byte[] data, String srcEncoding, String targetEncoding)
          throws java.io.UnsupportedEncodingException {
      // First, decode the data using the source encoding.
      // The String constructor does this.
      String str = new String(data, srcEncoding);

      // Next, encode the data using the target encoding.
      // The String.getBytes() method does this.
      // Both calls throw UnsupportedEncodingException for an unknown name.
      byte[] result = str.getBytes(targetEncoding);

      return result;
  }
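To see the conversion at work, here is a self-contained sketch that follows the same decode-then-encode steps. It is a variation on the method above, not a replacement: it looks the names up with Charset.forName (Java 6 or later for these String overloads), so no checked exception is involved. The class name ConvertDemo is hypothetical.

```java
import java.nio.charset.Charset;

public class ConvertDemo {
    // Decode with the source charset, then encode with the target charset.
    public static byte[] convert(byte[] data, String srcEncoding, String targetEncoding) {
        String str = new String(data, Charset.forName(srcEncoding));
        return str.getBytes(Charset.forName(targetEncoding));
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xC3, (byte) 0xBC };   // U+00FC encoded as UTF-8
        byte[] latin1 = convert(utf8, "UTF-8", "ISO-8859-1");
        System.out.println(latin1.length);            // 1
        System.out.println(latin1[0] & 0xFF);         // 252, i.e. 0xFC

        // U+263A has no ISO-8859-1 representation; getBytes substitutes '?'.
        byte[] smileyUtf8 = { (byte) 0xE2, (byte) 0x98, (byte) 0xBA };
        byte[] lossy = convert(smileyUtf8, "UTF-8", "ISO-8859-1");
        System.out.println(lossy[0]);                 // 63, the code for '?'
    }
}
```

The second case is worth remembering: converting to an encoding that cannot represent a character loses data silently rather than raising an error.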

Unicode Text in Java Source Code

You can also type Unicode data directly into a Java program. The javac compiler can read Java source files using the encoding of your choice. So ordinarily, you can type non-ASCII characters directly into a string—or even a variable name. Here is a complete "Hello, world!" program in German.

Hallo.java
  public class Hallo {
      public static void main(String[] args) {
          String Gruß = "Hallo, verrückte Welt!";
          System.out.println(Gruß);
      }
  }

When you compile this program with the command javac Hallo.java, the compiler does not know the encoding of the source file, so it uses your platform's default encoding. You can instead tell javac explicitly which encoding to use with the -encoding option: javac -encoding ISO-8859-1 Hallo.java. If you do not specify the right encoding, javac may misread the non-ASCII characters, and the result can be anything from a silently wrong string to a series of syntax errors.

However, once the program is properly compiled, there is another problem. When I run this on my computer, the letter ü is incorrectly displayed:

  Hallo, verrnckte Welt!

This is because Java encodes its console output in Latin-1 (or a close variant of it), while my computer's console expects a different code page and so shows the wrong character for that byte. The most reliable way to display Unicode data is a Unicode-capable GUI. Client applications can use Swing. Java servlets and JSPs can usually count on the client's browser to be a Unicode-capable display.
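If you know which encoding your console expects, one workaround is to wrap System.out in a PrintStream that encodes with that charset. The sketch below is only an illustration of that idea: the class name is invented, and "UTF-8" in main is a placeholder assumption, since the right value depends on your terminal (a Windows console may need an OEM code page instead). The printAs helper writes to a byte buffer so the effect of the encoding is visible.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleEncodingDemo {
    // Prints text through a PrintStream that uses the named encoding
    // and returns the bytes that would have reached the console.
    public static byte[] printAs(String text, String encoding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, encoding);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("unknown encoding: " + encoding, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Assumption: this console understands UTF-8. Adjust for your terminal.
        PrintStream console = new PrintStream(System.out, true, "UTF-8");
        console.println("Hallo, verr\u00fcckte Welt!");

        System.out.println(printAs("\u00fc", "UTF-8").length);      // 2
        System.out.println(printAs("\u00fc", "ISO-8859-1").length); // 1
    }
}
```

This only helps when the console's font and code page can actually represent the characters; for arbitrary Unicode text, a GUI remains the safer choice.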

Note that Swing works beautifully with Unicode strings. Sun's JDK even ships with some special Unicode fonts to ensure that international text has a consistent look. There are no special tips or techniques to learn; just try it!
