Mobile Study: How to Find Default Charset/Encoding in Java?

The obvious answer is to use Charset.defaultCharset() but we recently found out that this might not be the right answer. I was told that the result is different from real default charset used by java.io classes in several occasions. Looks like Java keeps 2 sets of default charset. Does anyone have any insights on this issue?

We were able to reproduce one fail case. It's kind of user error but it may still expose the root cause of all other problems. Here is the code,

public class CharSetTest {

    public static void main(String[] args) {
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.setProperty("file.encoding", "Latin-1");
        System.out.println("file.encoding=" + System.getProperty("file.encoding"));
        System.out.println("Default Charset=" + Charset.defaultCharset());
        System.out.println("Default Charset in Use=" + getDefaultCharSet());
    }

    private static String getDefaultCharSet() {
        OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
        String enc = writer.getEncoding();
        return enc;
    }
}

Our server requires default charset in Latin-1 to deal with some mixed encoding (ANSI/Latin-1/UTF-8) in a legacy protocol. So all our servers run with this JVM parameter,

-Dfile.encoding=ISO-8859-1

Here is the result on Java 5,

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=UTF-8
Default Charset in Use=ISO8859_1

Someone tries to change the encoding runtime by setting the file.encoding in the code. We all know that doesn't work. However, this apparently throws off defaultCharset() but it doesn't affect the real default charset used by OutputStreamWriter.

Is this a bug or feature?

EDIT: The accepted answer shows the root cause of the issue. Basically, you can't trust defaultCharset() in Java 5, which is not the default encoding used by I/O classes. Looks like Java 6 corrects this issue.

This is really strange... Once set, the default Charset is cached and it isn't changed while the class is in memory. Setting the "file.encoding" property with System.setProperty("file.encoding", "Latin-1"); does nothing. Every time Charset.defaultCharset() is called it returns the cached charset.

Here are my results:

Default Charset=ISO-8859-1
file.encoding=Latin-1
Default Charset=ISO-8859-1
Default Charset in Use=ISO8859_1

I'm using JVM 1.6 though.

(update)

Ok. I did reproduce your bug with JVM 1.5.

Looking at the source code of 1.5, the cached default charset isn't being set. I don't know if this is a bug or not but 1.6 changes this implementation and uses the cached charset:

JVM 1.5:

public static Charset defaultCharset() {
synchronized (Charset.class) {
    if (defaultCharset == null) {
        java.security.PrivilegedAction pa =
            new GetPropertyAction("file.encoding");
        String csn = (String)AccessController.doPrivileged(pa);
        Charset cs = lookup(csn);
        if (cs != null)
            return cs;
        return forName("UTF-8");
    }
    return defaultCharset;
}
}

JVM 1.6:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
    synchronized (Charset.class) {
        java.security.PrivilegedAction pa =
            new GetPropertyAction("file.encoding");
        String csn = (String)AccessController.doPrivileged(pa);
        Charset cs = lookup(csn);
        if (cs != null)
            defaultCharset = cs;
            else 
            defaultCharset = forName("UTF-8");
        }
}
return defaultCharset;
}

When you set the file encoding to file.encoding=Latin-1 the next time you callCharset.defaultCharset(), what happens is, because the cached default charset isn't set, it will try to find the appropriate charset for the name Latin-1. This name isn't found, because it's incorrect, and returns the default UTF-8.

As for why the IO classes such as OutputStreamWriter return an unexpected result,
the implementation of sun.nio.cs.StreamEncoder (witch is used by these IO classes) is different as well for JVM 1.5 and JVM 1.6. The JVM 1.6 implementation is based in theCharset.defaultCharset() method to get the default encoding, if one is not provided to IO classes. The JVM 1.5 implementation uses a different method Converters.getDefaultEncodingName(); to get the default charset. This method uses it's own cache of the default charset that is set upon JVM initialization:

JVM 1.6:

   public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                                     Object lock,
                                                     String charsetName)
       throws UnsupportedEncodingException
   {
       String csn = charsetName;
       if (csn == null)
           csn = Charset.defaultCharset().name();
       try {
           if (Charset.isSupported(csn))
               return new StreamEncoder(out, lock, Charset.forName(csn));
       } catch (IllegalCharsetNameException x) { }
       throw new UnsupportedEncodingException (csn);
   }

JVM 1.5:

public static StreamEncoder forOutputStreamWriter(OutputStream out,
                                              Object lock,
                                              String charsetName)
throws UnsupportedEncodingException
{
String csn = charsetName;
if (csn == null)
    csn = Converters.getDefaultEncodingName();
if (!Converters.isCached(Converters.CHAR_TO_BYTE, csn)) {
    try {
        if (Charset.isSupported(csn))
            return new CharsetSE(out, lock, Charset.forName(csn));
    } catch (IllegalCharsetNameException x) { }
}
return new ConverterSE(out, lock, csn);
}

But I agree with the comments. You shouldn't rely on this property. It's an implementation detail.

http://java.sun.com/javase/technologies/core/basic/intl/faq_ja.jsp

■摘要

デフォルトエンコーディングとは何ですか。

デフォルトエンコーディングは、ホストオペレーティングシステムおよびそのロケールに基づく JRE により選択されます。たとえば、Windows の US ロケールでは、windows-1252 が使用されます。Solaris の簡体字中国語ロケールでは、Solaris へのログイン時の選択に基づき、GB2312、GBK、GB18030、UTF-8 のいずれかがデフォルトエンコーディングになります。

デフォルトエンコーディングは、通常、JRE とホストオペレーティングシステム間でやりとりされるテキストに使用される、重要なエンコーディングです。正確な相互変換を保証するために、デフォルトエンコーディングを、ホストオペレーティングシステムが使用するエンコーディングに一致させる必要があります。

アプリケーションは、J2SE 5 から提供されている Charset.defaultCharset メソッドを呼び出すことにより、デフォルトエンコーディングを特定します。なお、J2SE 5 より古いバージョンの Java プラットフォームでは、次のような式が使用されていました。

(new OutputStreamWriter(new ByteArrayOutputStream())).getEncoding()

Java アプリケーションでは複数のロケールを使用できますか。

ロケールの影響を受ける API では、多くの場合、ロケールの指定が可能です。このため、複数の異なったロケールを同時に処理できるアプリケーションを作成できます。たとえば、Java で作成された Web アプリケーションでは、複数のロケールを同時に使用して、異なった要求を処理できます。

アプリケーションの外部からデフォルトのロケールを設定できますか。

使用している Java プラットフォームの実装によって異なります。通常、初期のデフォルトロケールは、ホストオペレーティングシステムのロケールによって決まります。Sun の JRE のバージョン 1.4 以降では、コマンド行から user.language、user.country、および user.variant の各システムプロパティを設定することで、初期のデフォルトロケールを変更できます。たとえば、初期のデフォルトロケールとして Locale("th", "TH", "TH") を選択するには、次のコマンドを使用します。

java -Duser.language=th -Duser.country=TH -Duser.variant=TH MainClass

この機能を使用できない実行環境もあるため、この機能はテスト目的だけに使用してください。

Mobile Study

2012年5月8日火曜日

How to Find Default Charset/Encoding in Java?

0 件のコメント:

コメントを投稿