Thursday, November 4, 2010

Understanding text encoding for Properties Files: Latin1 vs. UTF-8

Folks often want to have readable properties files that look like:
textInEng=this a sample of the japanese language's character set
textInJap=日本語の文字は、このサンプルでは、設定
But anyone who actually tries to load a human-readable file like the one above in Java will be surprised to see corrupted string values. The reason is that Java's Properties class, when loading from an input stream, decodes the bytes as Latin-1 (also known as ISO-8859-1). Latin-1 maps every single byte to one character, while UTF-8 encodes each Japanese character as a sequence of several bytes, so reading UTF-8 as Latin-1 puts the character boundaries in the wrong places: each multi-byte character is split into several unrelated Latin-1 characters. Java then stores those characters in a String (which uses UTF-16 internally), and since the String only knows the wrongly decoded characters, not the raw bytes, the text looks corrupted. This is why most properties files end up looking more like this:
textInEng=this a sample of the japanese language's character set
textInJap=\u65E5\u672C\u8A9E\u306E\u6587\u5B57\u306F\u3001\u3053\u306E\u30B5\u30F3\u30D7\u30EB\u3067\u306F\u3001\u8A2D\u5B9A
The patterns you see above (\uXXXX) are called Unicode escapes. They are written in plain ASCII, so they survive the Latin-1 decoding untouched, and the Properties class converts each one into the corresponding Unicode character before handing the value to Java, whose String class stores text internally as UTF-16.
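To make the difference concrete, here is a minimal sketch that feeds the same text to Properties.load(InputStream) once as raw UTF-8 and once as Unicode escapes. The class name, the helper method, and the keys raw and escaped are made up for this illustration, and the Japanese text is written with Java source escapes so the example does not depend on how the source file itself is saved.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; none of these names come from the original post.
class EscapeVersusRawDemo {

    // Simulates a properties file saved as UTF-8 and loaded the default way.
    // Properties.load(InputStream) decodes the bytes as ISO-8859-1.
    static Properties loadFromUtf8Bytes(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E"; // three characters
        // The doubled backslashes put literal escape sequences into the "file".
        String fileContent = "raw=" + japanese + "\n"
                + "escaped=\\u65E5\\u672C\\u8A9E\n";

        Properties props = loadFromUtf8Bytes(fileContent);
        String raw = props.getProperty("raw");
        String escaped = props.getProperty("escaped");

        System.out.println(raw.length());             // 9: one char per UTF-8 byte
        System.out.println(raw.equals(japanese));     // false: the value is garbled
        System.out.println(escaped.length());         // 3
        System.out.println(escaped.equals(japanese)); // true
    }
}
```

The raw value comes back as nine meaningless Latin-1 characters because each of the three Japanese characters takes three bytes in UTF-8, while the escaped value arrives intact.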

But what if you wanted more? What if you absolutely must have human-readable properties files and they must be loaded into Java correctly?
  • Well, if you are using Spring, there is an easy way out: have a look at ReloadableResourceBundleMessageSource, which lets you declare the encoding of your properties files (its defaultEncoding property), and the problem is solved!
  • Otherwise here's how you can get the best of both worlds yourself:
    1. Let Java's Properties class read in the UTF-8 values incorrectly with the Latin-1 encoding.
    2. Then go over each value and convert it back into bytes by telling the code to pass it through Latin-1 for the reverse transformation.
    3. Now that you have the raw-bytes again, you can read them in properly with the appropriate UTF-8 encoding!
    4. Set them back into your Properties object and you are good to go!
    5. See the code:
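          import java.io.FileInputStream;
          import java.io.IOException;
          import java.io.PrintStream;
          import java.util.Map;
          import java.util.Properties;

          private static void fourthTry() throws IOException {
              FileInputStream inputStream = null;
              try {
                  inputStream = new FileInputStream("utf8.properties");
                  // PrettyPrintProperties is a Properties subclass that is not shown in this post.
                  Properties defaultProperties = new PrettyPrintProperties();
                  // Step 1: load() decodes the UTF-8 bytes as Latin-1, so values come out garbled.
                  defaultProperties.load(inputStream);
                  for (Map.Entry<Object, Object> keyValuePair : defaultProperties.entrySet()) {
                      // Step 2: encoding back to Latin-1 recovers the original raw bytes.
                      byte[] rawBytes = ((String) keyValuePair.getValue()).getBytes("ISO-8859-1");
                      // Step 3: decode those bytes with the encoding the file was really saved in.
                      String recoveredString = new String(rawBytes, "UTF-8");
                      // Step 4: store the repaired value back into the Properties object.
                      keyValuePair.setValue(recoveredString);
                  }
                  // Print through a UTF-8 stream so the console output is not garbled again.
                  PrintStream customConsoleOutput = new PrintStream(System.out, true, "UTF-8");
                  customConsoleOutput.println(defaultProperties.toString());
              } finally {
                  if (inputStream != null) inputStream.close();
              }
          }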
    6. If you want to test it in Eclipse, make sure to: Open Run Dialog > "your application" > Common Tab > Encoding > Other dropdown > set it to UTF-8
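The round trip in steps 1 to 3 works because Latin-1 assigns a character to every possible byte value, so no byte is lost when Java misreads the file. Here is a minimal sketch of that property on a single string; the class and method names are invented for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names are not from the original post.
class Latin1RoundTripDemo {

    // Step 1: what happens to UTF-8 bytes when they are decoded as Latin-1.
    static String misreadAsLatin1(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Steps 2 and 3: go back to the raw bytes, then decode them as UTF-8.
    static String recover(String misread) {
        byte[] rawBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        String misread = misreadAsLatin1(utf8Bytes);
        byte[] bytesAgain = misread.getBytes(StandardCharsets.ISO_8859_1);

        // Every byte survives the detour through the garbled String.
        System.out.println(Arrays.equals(utf8Bytes, bytesAgain)); // true
        System.out.println(recover(misread).equals(original));    // true
    }
}
```

The same trick would not be safe with an encoding that rejects or replaces some byte sequences, since the original bytes could then not be recovered from the misread String.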
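    But keep in mind, you cannot be inconsistent. You must choose to have the key-value pairs in your properties files be either raw UTF-8 or Latin-1 with Unicode escapes. With the conversion above, an escaped value has already been decoded into characters that Latin-1 cannot represent, so converting it back to Latin-1 bytes destroys it. Do NOT mix them like this: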
    textInEng=this a sample of the japanese language's character set
    textInJap=日本語の文字は、このサンプルでは、設定
    escapedJp=\u65E5\u672C\u8A9E\u306E\u6587\u5B57\u306F\u3001\u3053\u306E\u30B5\u30F3\u30D7\u30EB\u3067\u306F\u3001\u8A2D\u5B9A
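A small sketch of what goes wrong with such a mixed file, assuming the Latin-1 to UTF-8 round trip is applied to every value. The class and helper names are hypothetical, the values are shortened, and the file content is built in memory so the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical demo class; the names are not from the original post.
class MixedEncodingDemo {

    // Loads UTF-8 content the default (Latin-1) way, then repairs every value
    // with the Latin-1 -> bytes -> UTF-8 round trip.
    static Properties loadWithRoundTrip(String fileContent) {
        Properties props = new Properties();
        try {
            props.load(new ByteArrayInputStream(fileContent.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // stringPropertyNames() returns a snapshot, so updating props here is safe.
        for (String key : props.stringPropertyNames()) {
            byte[] rawBytes = props.getProperty(key).getBytes(StandardCharsets.ISO_8859_1);
            props.setProperty(key, new String(rawBytes, StandardCharsets.UTF_8));
        }
        return props;
    }

    public static void main(String[] args) {
        String fileContent = "textInEng=hello\n"
                + "textInJap=\u65E5\u672C\u8A9E\n"           // raw UTF-8 in the file
                + "escapedJp=\\u65E5\\u672C\\u8A9E\n";       // literal escapes in the file

        Properties props = loadWithRoundTrip(fileContent);
        System.out.println(props.getProperty("textInEng")); // hello
        System.out.println(props.getProperty("textInJap").equals("\u65E5\u672C\u8A9E")); // true
        System.out.println(props.getProperty("escapedJp")); // ???
    }
}
```

The escaped entry is decoded correctly by Properties, but its characters have no Latin-1 equivalent, so getBytes substitutes a question mark for each of them and the text is lost.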
    Happy Coding!


1 comment:

  1. ... OR you could keep them ISO-8859-1-encoded, and use a *real* IDE like IntelliJ that knows how to display them in human-readable format.

    Of course, if you depend on external vendors and have to ship .properties files to them in human-readable encoding, just transcode to UTF-8 when you send them the files, and unicode-escape them when you get them back....

    I could be mistaken, but I think that's what I used to do at HP... no, wait, I COULDN'T be mistaken ;-)
