Which encoding does Process.getInputStream() use?

Solution 1

As I understand it, operating system streams are byte streams; there are no characters in them. The InputStreamReader constructor that takes only a stream uses the JVM default character set (java.nio.charset.Charset#defaultCharset()). You can use another constructor to specify a character set explicitly.

Solution 2

An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).

If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.

If you know what encoding to use (and you really have to know):

new InputStreamReader(process.getInputStream(), "UTF-8") // for example
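To see why the charset passed to the reader matters, here is a minimal sketch that decodes the same byte with two charsets. It assumes the child process wrote 'é' as the single byte 0xE9 (as ISO-8859-1 and windows-1252 do). The class name CharsetDemo and the helper decode are hypothetical, and a ByteArrayInputStream stands in for process.getInputStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Decodes the first line of raw bytes with the given charset, the same way
    // an InputStreamReader wrapped around process.getInputStream() would.
    public static String decode(byte[] raw, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(raw), charset))) {
            String line = reader.readLine();
            return line == null ? "" : line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE9}; // assumed output of the child process
        // Matching charset: the byte decodes to U+00E9.
        System.out.printf("U+%04X%n", (int) decode(raw, StandardCharsets.ISO_8859_1).charAt(0));
        // Wrong charset: a lone 0xE9 is malformed UTF-8, so the reader substitutes U+FFFD.
        System.out.printf("U+%04X%n", (int) decode(raw, StandardCharsets.UTF_8).charAt(0));
    }
}
```

If the second case matches what you see, the reader's charset is probably not the one the child process used to write its output.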

Solution 3

Interestingly enough, when running on Windows:

ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();

Then the CP437 code page works quite well when reading the output:

new InputStreamReader(process.getInputStream(), "CP437");
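As a rough illustration of why the console code page matters, the sketch below decodes one byte with three charsets. It assumes the console wrote 'é' as the byte 0x82, which is its position in both CP437 and CP850. The class name ConsoleCodePageDemo is hypothetical, and the charset names "IBM437" and "IBM850" are assumed to be available in your JDK (the code checks before using them).

```java
import java.nio.charset.Charset;

public class ConsoleCodePageDemo {
    // Decodes raw bytes with the named charset, as an InputStreamReader
    // constructed with that charset name would.
    public static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0x82}; // assumed: 'é' as written by a CP437 or CP850 console
        for (String name : new String[] {"IBM437", "IBM850", "windows-1252"}) {
            // Skip charsets this JDK does not ship.
            if (Charset.isSupported(name)) {
                String s = decode(raw, name);
                System.out.printf("%s -> U+%04X%n", name, (int) s.charAt(0));
            }
        }
    }
}
```

The two console code pages map 0x82 to U+00E9, while windows-1252 maps it to U+201A, a low quotation mark. Running chcp in the command prompt shows which code page the console actually uses.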

Solution 4

According to http://www.fileformat.info/info/unicode/char/e9/index.htm '\uFFFD' is a unicode code for character 'é'. It actually means that you are reading the stream correctly. Your problem is in writing.

The Windows console does not support Unicode by default. So, to test your code, open a file and write your stream there, and do not forget to set the encoding to UTF-8.

Solution 5

If you, like me, know which encoding you want to use for all input/output, you can specify it in the Java API calls that create readers (some of them, not all, accept a charset), as other answers have pointed out.

But this hard-codes it in the source, which might or might not be OK.

I found a better way after reading this answer, which reveals that you can set the default encoding you need when the JVM starts up.

java -Dfile.encoding=ISO-8859-1 ...
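To check what the flag actually did on your JVM, a small sketch like the following prints the default charset that InputStreamReader falls back to. The class name DefaultCharsetCheck is hypothetical. How file.encoding is honored has varied between JDK versions, so treat the printed value, not the flag, as the authority.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Returns the name of the charset that InputStreamReader uses
    // when no charset argument is given.
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + defaultCharsetName());
        // May be null or differ from the default charset on some JVMs.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Run it once with and once without -Dfile.encoding=ISO-8859-1 and compare the output.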
Author by

rds

Updated on September 17, 2022

Comments

  • rds
    rds about 1 year

    In a Java program, I spawn a new Process via ProcessBuilder.

    args[0] = directory.getAbsolutePath() + File.separator + program;
    ProcessBuilder pb = new ProcessBuilder(args);
    pb.directory(directory);
    final Process process = pb.start();
    

    Then, I read the process standard output with a new Thread

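    new Thread() {
        @Override
        public void run() {
            // No charset is given, so the platform default is used for decoding.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }.start();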
    

    However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.

    What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?

    How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?

    Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD

  • hansvb
    hansvb almost 12 years
    And as @AlexR points out, the same reasoning applies to writing data, too.
  • Joop Eggen
    Joop Eggen almost 12 years
    Correct. new PrintWriter(new OutputStreamWriter(..., "Cp1252")) where Cp1252 is Latin-1 with Windows extensions, as used in part of western Europe (France, Germany and some others).
  • rds
    rds almost 12 years
    Why do you point to character 0xE9 (which is the one I want) when I have character 0xFFFD, aka 'REPLACEMENT CHARACTER'? fileformat.info/info/unicode/char/fffd/index.htm
  • rds
    rds almost 12 years
    Yes, I had to new InputStreamReader(...,"ISO-8859-1")
  • rds
    rds almost 12 years
    UTF-8 is the default encoding. So, this does not help.
  • rds
    rds almost 12 years
    UTF-8 is the default encoding in Java, so "UTF-8" cannot help. The solution is close, it just needs "Cp1252" or "ISO-8859-1" (depending on what getInputStream() returns)
  • hansvb
    hansvb almost 9 years
    UTF-8 is not the default encoding in Java. There is no default at all, it always uses something platform dependent (which can be controlled by environment variables and system properties). Not something an application developer should usually rely on. Better to always be explicit in what encoding you want.
  • Matthew Oakley
    Matthew Oakley over 8 years
    UTF-16 is Java's standard internal representation of characters, hence the unsigned 16-bit 'char' primitive. The InputStreamReader will ALWAYS convert to UTF-16. Although the InputStream is a binary stream, if it represents characters, the bytes will follow whatever encoding was used to create the resource. The InputStreamReader constructor mentioned by Thilo includes an argument to specify the encoding of that resource, that is, how the stream should be treated.
  • rds
    rds over 8 years
    As others said, the InputStream contains characters in the platform encoding. Since I have a modern operating system, I have UTF-8; since you have Windows, you have CP437.
  • IvanRF
    IvanRF about 8 years
    Thanks, CP437 was the only charset name that worked for me (Windows + Spanish characters)
  • Etienne Delavennat
    Etienne Delavennat about 7 years
    Actually, nowadays, that should be CP850. The odd thing is that the whole Windows system seems to be set to windows-1252/cp1252 (at least in western Europe), but the console specifically uses CP850 instead. CP437 is the ancestor of CP850. Opening the command prompt and running "chcp" should tell you exactly which encoding it is using to print character data.
  • Etienne Delavennat
    Etienne Delavennat about 7 years
    Also, the encoding to use for parsing the InputStream depends on which program the ProcessBuilder is built around. For example: CP850 for cmd, windows-1252 for some other Windows tools you might invoke directly (without wrapping them in cmd), and possibly UTF-8 if the program you're calling outputs UTF-8. This is program-specific and should be looked up in the program's documentation.
  • jan.supol
    jan.supol almost 7 years
    Nice! I have checked some Windows 10 settings. For various European settings, it's CP850, but for defaultians (US settings), it still remains CP437.
  • Franz D.
    Franz D. almost 6 years
    Hmmm... nice idea, but it actually doesn't work on my system (Windows 7 SP1, 64-bit, Java 8 build 71) -- none of the available encodings produces the original string. The problem seems to be that the given example string is not even correctly transferred to the system, producing "?" characters instead. Apart from that, I also get an additional "\r\n" line ending in the output.