Passing command line unicode argument to Java code


Solution 1

Unfortunately you cannot reliably use non-ASCII characters with command-line apps that use the Windows C runtime's stdlib, like Java (and pretty much all non-Windows-specific scripting languages really).

This is because they read their input and write their output using a locale-specific code page by default, and that code page is never a UTF encoding. Most other modern operating systems use UTF-8 here.

Whilst you can change the code page of a terminal to something else using the chcp command, the support for the UTF-8 encoding under chcp 65001 is broken in a few ways that are likely to trip apps up fatally.

If you only need Japanese you could switch to code page 932 (similar to Shift-JIS) by setting your locale (‘language for non-Unicode applications’ in the Regional settings) to Japan. This will still fail for characters that aren't in that code page though.
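To see in advance which characters a legacy code page cannot carry, a small check along the following lines may help. This is only a sketch: the class name `CodePageCheck` and the sample string are made up for illustration, and `windows-1252` stands in for whatever code page your system actually uses (code page 932 may not be bundled with every Java runtime).

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, not part of any library.
final class CodePageCheck {

    /** Returns the characters of text that the given charset cannot encode. */
    static String unencodable(String text, Charset charset) {
        CharsetEncoder encoder = charset.newEncoder();
        StringBuilder missing = new StringBuilder();
        // Iterate by code point so that surrogate pairs stay together.
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (!encoder.canEncode(ch)) {
                missing.append(ch);
            }
        });
        return missing.toString();
    }

    public static void main(String[] args) {
        // Assumption: windows-1252 as an example of a locale-specific code page.
        Charset codePage = Charset.forName("windows-1252");
        // Unicode escapes keep the source file independent of its own encoding.
        String sample = "\u00fc\u016f\u00df\u03b2\u03b1a\u4e80";
        System.out.println(unencodable(sample, codePage));
    }
}
```

Note that Java's encoder only reports a character as unencodable, whereas the Windows conversion layer may silently substitute a similar-looking character instead, so an empty result here is necessary but not a guarantee that the argument arrives intact.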

If you need to get non-ASCII characters through the command line reliably on Windows, you need to call the Win32 API function GetCommandLineW directly to avoid the encode-to-system-code-page layer. Probably you'd want to do that using JNA.

Solution 2

Unfortunately the standard Java launcher has a known, long-standing bug in handling Unicode command-line arguments on Windows, and possibly on some other platforms too. It was still present in Java 7 update 1.

If you are comfortable programming in C/C++, you could try writing your own launcher. A specialized launcher need not be a big deal: see the initial example on the JNI Invocation API page.

Another possibility is to use a combination of a Java wrapper and a temporary file for passing Unicode parameters to a Java app. See my blog Java, Xalan, Unicode command line arguments... for more comments and the wrapper code.
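The temporary-file idea can be sketched roughly as follows. This is not the wrapper code from the blog, just a minimal illustration under stated assumptions: the class name `ArgFile`, the one-argument-per-line file layout, and the usage text are all invented here. The caller writes the real arguments to a UTF-8 file and passes only the file's path on the command line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: passes arguments through a UTF-8 file instead of the command line.
final class ArgFile {

    /** Caller side: writes one argument per line to a UTF-8 temp file. */
    static Path write(List<String> args) {
        try {
            Path file = Files.createTempFile("args", ".txt");
            Files.write(file, args, StandardCharsets.UTF_8);
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Callee side: reads the arguments back without involving any code page. */
    static List<String> read(Path file) {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java ArgFile <utf8-argument-file>");
            return;
        }
        // args[0] is the path of the argument file; the real arguments are inside it.
        for (String arg : read(Paths.get(args[0]))) {
            System.out.println(arg);
        }
    }
}
```

Two caveats with this layout: an argument that itself contains a line break would be split in two, and the temp file's path must survive the command line, which can fail if the temp directory name contains non-ASCII characters.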

Solution 3

https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows#UTF-8

With insider build 17035 and the April 2018 update (nominal build 17134) for Windows 10, a "Beta: Use Unicode UTF-8 for worldwide language support" checkbox appeared for setting the locale code page to UTF-8

This actually works for me. Without it, no matter what I set chcp to or what I supplied as -Dsun.jnu.encoding, the argument was always garbled.

I had a test class that would just print the argument that is passed to it:

Before:

> java test "üůßβαa"
üußßaa

Interesting that with sun.jnu.encoding=Cp1252, U+03B2 (beta, β) will become a German sharp s (ß) and the Czech ů will become a plain u.

> chcp 65001
Active code page: 65001
> java test "üůßβαa"
uaa

Hmm…

> java -Dsun.jnu.encoding=utf-8 test "üůßβαa"
?u??aa

This is not better. And it becomes worse when CJK characters come into play, for example U+4E80 (亀):

> java test "üůßβαa亀"
uaa?
Exception in thread "main" java.nio.file.InvalidPathException: Illegal char <?> at index 6: uaa?
        at sun.nio.fs.WindowsPathParser.normalize(Unknown Source)
        at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
        at sun.nio.fs.WindowsPathParser.parse(Unknown Source)
        at sun.nio.fs.WindowsPath.parse(Unknown Source)
        at sun.nio.fs.WindowsFileSystem.getPath(Unknown Source)
        at java.nio.file.Paths.get(Unknown Source)
        at test.urify(test.java:33)
        at test.urify(test.java:43)
        at test.main(test.java:13)

The class that I used not only prints its argument, it also tries to convert it to a file: URI, and it crashed.
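For anyone who wants to reproduce this kind of test, a class along these lines would do. It is a hypothetical stand-in, not the class used above (the name `ArgDump` and its output format are assumptions). It prints each argument as a list of code points, so that the console's own encoding cannot hide or fake damage, and then attempts the path-to-URI conversion.

```java
import java.nio.file.InvalidPathException;
import java.nio.file.Paths;
import java.util.stream.Collectors;

// Hypothetical diagnostic class, written for illustration only.
final class ArgDump {

    /** Formats every code point as U+XXXX, separated by spaces. */
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // The encoding the JVM uses for platform strings such as arguments and file names.
        System.out.println("sun.jnu.encoding=" + System.getProperty("sun.jnu.encoding"));
        for (String arg : args) {
            System.out.println(arg + " -> " + codePoints(arg));
            try {
                // Mirrors the conversion that failed in the stack trace above.
                System.out.println(Paths.get(arg).toUri());
            } catch (InvalidPathException e) {
                System.out.println("not a valid path: " + e.getMessage());
            }
        }
    }
}
```

If an argument was replaced during conversion, the dump shows U+003F (a literal question mark) where the original character should be, regardless of what the terminal font can display.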

Setting the Windows locale to UTF-8 with the approach quoted above solved this issue.

Unfortunately, it didn’t fix encoding issues with arguments passed to another Java program, the XProc processor XML Calabash. A sample pipeline that takes a value from the command line and inserts it as an attribute into a document yielded this mojibake:

> calabash.bat Untitled3.xpl foo='rαaßβöů亊'
<doc xmlns:c="http://www.w3.org/ns/xproc-step" foo="rαaßβöů亊">Hello world!</doc>

Adding -Dsun.jnu.encoding=UTF-8 to the Java invocation fixed this:

<doc xmlns:c="http://www.w3.org/ns/xproc-step" foo="rαaßβöů亊">Hello world!</doc>

For completeness, before switching the Windows locale to UTF-8, depending on whether the code page was 1252 or 65001, the invocation yielded different variations of mojibake that -Dsun.jnu.encoding=UTF-8 couldn’t fix.

So the beta feature to switch the Windows locale finally seems to solve this issue. Some applications might need an additional -Dsun.jnu.encoding=UTF-8, for reasons not thoroughly researched.

This doesn’t solve your years-old issue with Windows 2000. But maybe you have switched to Windows 10 in the meantime.

Ah, btw, I ran your program and it works with the Windows UTF-8 locale setting.

> java test t=r_ä亀
> type C:\Temp\abc.txt
t=r_ä亀
Author by

Pankaj Agrawal

Java, J2EE, Microservice, Spring Boot, Docker, Cloud, Database Design, Big Data, Streams, ETL

Updated on June 05, 2022

Comments

  • Pankaj Agrawal
    Pankaj Agrawal about 2 years

    I have to pass a command-line argument containing Japanese text to a Java main method. If I type Unicode characters in the command-line window, it displays '?????', which is OK, but the value passed to the Java program is also '?????'. How do I get the correct value of the argument passed from the command window? Below is a sample program that writes the value supplied as the command-line argument to a file.

    public static void main(String[] args) {
        String input = args[0];
        try {
            String filePath = "C:/Temp/abc.txt";
            File file = new File(filePath);
            OutputStream out = new FileOutputStream(file);
            byte[] buf = new byte[1024];
            int len;
            // getBytes() without an argument uses the platform default charset
            InputStream is = new ByteArrayInputStream(input.getBytes());
            while ((len = is.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
            out.close();
            is.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
    
  • Pankaj Agrawal
    Pankaj Agrawal over 12 years
    Shouldn't we be able to pass any Unicode value, whether Japanese or Korean, without changing the system locale? Right now I don't have the resources to try it, but I will give it a shot.
  • Sergey Karpushin
    Sergey Karpushin over 7 years
    This is completely irrelevant to the question. Question is about invoking application with unicode symbols in args. It is NOT about compiling source code which has unicode symbols in it.
  • Sergey Karpushin
    Sergey Karpushin over 7 years
    This is just a workaround for a single language. What if a person has more than one non-English language on his/her computer? If other applications (like Notepad) can handle non-English letters, then a Java application must also be able to do it without changing the system locale. See the answer below, stackoverflow.com/a/41923480/285060, which does not require changing the OS locale.
  • senderj
    senderj over 2 years
    I am able to add jna jar into my project but I am not able to find OsNative in your coding. please help.
  • Sergey Karpushin
    Sergey Karpushin over 2 years
    @senderj OsNative is just an interface that this class implements, you can create it based on the only public method in this class.