Which encoding does Process.getInputStream() use?
Solution 1
As I understand it, operating system streams are byte streams; there are no characters at that level. The single-argument InputStreamReader constructor uses the JVM default character set (java.nio.charset.Charset#defaultCharset()). You can use another constructor to specify a character set explicitly.
Solution 2
An InputStream is a binary stream, so there is no encoding. When you create the Reader, you need to know what character encoding to use, and that would depend on what the program you called produces (Java will not convert it in any way).
If you do not specify anything for InputStreamReader, it will use the platform default encoding, which may not be appropriate. There is another constructor that allows you to specify the encoding.
If you know what encoding to use (and you really have to know):
new InputStreamReader(process.getInputStream(), "UTF-8") // for example
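To make the effect of the charset argument concrete, here is a minimal sketch that decodes the same byte with two different charsets. It does not spawn a process; the byte array stands in for what a child program might write, and the class name DecodeDemo and the helper decode are hypothetical names chosen for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answers.
class DecodeDemo {
    // Decodes all bytes through an InputStreamReader using the given charset.
    static String decode(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 0xE9 is 'é' in ISO-8859-1 (and windows-1252), but it is not valid UTF-8 on its own.
        byte[] data = { (byte) 0xE9 };
        String asLatin1 = decode(data, StandardCharsets.ISO_8859_1);
        String asUtf8 = decode(data, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 gives U+00E9: " + asLatin1.equals("\u00E9"));
        System.out.println("UTF-8 gives U+FFFD: " + asUtf8.contains("\uFFFD"));
    }
}
```

If the child process writes ISO-8859-1 bytes and the reader is told to expect UTF-8, the malformed input is replaced by U+FFFD, which matches the symptom described in the question.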
Solution 3
Interestingly enough, when running on Windows:
ProcessBuilder pb = new ProcessBuilder("cmd", "/c dir");
Process process = pb.start();
Then decoding with the CP437 code page works quite well:
new InputStreamReader(process.getInputStream(), "CP437");
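Putting the pieces together, a sketch along these lines might read a child process's output with an explicitly chosen charset. It assumes a Windows machine where cmd is available and where the console code page really is CP437 (running chcp in a command prompt reports the actual one). The class name ProcessOutputReader and the helper readLines are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class ProcessOutputReader {
    // Reads every line from the stream, decoding bytes with the given charset.
    static List<String> readLines(InputStream in, Charset charset) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: Windows only. The charset name can be overridden on the command line.
        String charsetName = args.length > 0 ? args[0] : "CP437";
        Process process = new ProcessBuilder("cmd", "/c", "dir")
                .redirectErrorStream(true) // merge stderr so the child cannot block on it
                .start();
        for (String line : readLines(process.getInputStream(), Charset.forName(charsetName))) {
            System.out.println(line);
        }
    }
}
```

Taking the charset as a parameter keeps the choice out of the reading logic, which helps when different child programs emit different encodings.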
Solution 4
According to http://www.fileformat.info/info/unicode/char/e9/index.htm '\uFFFD' is a unicode code for character 'é'. It actually means that you are reading the stream correctly. Your problem is in writing.
The Windows console does not support Unicode by default. So, if you want to test your code, open a file and write your stream there. But do not forget to set the encoding to UTF-8.
Solution 5
If you, like me, know which encoding you want to use for all input/output, you can pass it explicitly in the Java API calls that create readers (some of them, not all, accept a charset), as other answers have pointed out.
But this hard-codes the encoding in the source, which may or may not be OK.
I found a better way after reading this answer, which reveals that you can set the encoding you need before the JVM starts up.
java -Dfile.encoding=ISO-8859-1 ...
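To check whether such a startup flag actually took effect, a small sketch like the following can print what the running JVM reports. How the JVM honours -Dfile.encoding has varied across Java releases, so verifying is safer than assuming. The class name DefaultCharsetCheck is hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class for illustration only.
class DefaultCharsetCheck {
    // Describes the charset that InputStreamReader uses when none is specified,
    // alongside the raw value of the file.encoding system property.
    static String describe() {
        return "defaultCharset=" + Charset.defaultCharset().name()
                + ", file.encoding=" + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with and once without the -Dfile.encoding flag shows whether the default charset changes on a given Java version.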
rds
Updated on September 17, 2022
Comments
-
rds about 1 year
In a Java program, I spawn a new Process via ProcessBuilder.
args[0] = directory.getAbsolutePath() + File.separator + program;
ProcessBuilder pb = new ProcessBuilder(args);
pb.directory(directory);
final Process process = pb.start();
Then, I read the process standard output with a new Thread:
new Thread() {
    public void run() {
        // Note: no charset is given here, so the JVM default is used.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}.start();
However, when the process outputs non-ASCII characters (such as 'é'), the line has character '\uFFFD' instead.
What is the encoding in the InputStream returned by getInputStream (my platform is Windows in Europe)?
How can I change things so that line contains the expected data (i.e. '\u00E9' for 'é')?
Edit: I tried new InputStreamReader(...,"UTF-8"): é becomes \uFFFD
-
hansvb almost 12 years
And as @AlexR points out, the same reasoning applies to writing data, too.
-
Joop Eggen almost 12 years
Correct. Use new PrintWriter(new OutputStreamWriter(..., "Cp1252")), where Cp1252 is Latin-1 with the Windows extensions, as used in part of western Europe (France, Germany and some others).
-
rds almost 12 years
Why do you point to character 0xE9 (the one I want) when I have character 0xFFFD, aka 'REPLACEMENT CHARACTER'? fileformat.info/info/unicode/char/fffd/index.htm
-
rds almost 12 years
Yes, I had to use new InputStreamReader(...,"ISO-8859-1")
-
rds almost 12 years
UTF-8 is the default encoding. So, this does not help.
-
rds almost 12 years
UTF-8 is the default encoding in Java, so "UTF-8" cannot help. The solution is close, it just needs "Cp1252" or "ISO-8859-1" (depending on what getInputStream() returns)
-
hansvb almost 9 years
UTF-8 is not the default encoding in Java. There is no default at all; it always uses something platform dependent (which can be controlled by environment variables and system properties). Not something an application developer should usually rely on. Better to always be explicit about what encoding you want.
-
Matthew Oakley over 8 years
UTF-16 is Java's standard internal representation of characters. Hence the unsigned 16-bit 'char' primitive. The InputStreamReader will ALWAYS convert to UTF-16. Although the InputStream is a binary stream, if it represents characters the bytes will follow whatever encoding was used to create the resource. The InputStreamReader constructor mentioned by Thilo includes an argument to specify the encoding of that resource, that is, how the stream should be treated.
-
rds over 8 years
As others said, the InputStream contains characters in the platform encoding. Since I have a modern operating system, I have UTF-8; since you have Windows, you have CP437.
-
IvanRF about 8 years
Thanks, CP437 was the only charset name that worked for me (Windows + Spanish characters)
-
Etienne Delavennat about 7 years
Actually, nowadays, that should be CP850. The odd thing is that the whole Windows system seems to be set to windows-1252/cp1252 (at least in western Europe), but the console specifically uses CP850 instead. CP437 is the ancestor of CP850. Opening the command prompt and running "chcp" should tell you exactly which encoding it is using to print character data.
-
Etienne Delavennat about 7 years
Also, the encoding to use for parsing the InputStream depends on what program the ProcessBuilder is built around. For example: CP850 for cmd, windows-1252 for some other Windows tools you might invoke directly (without wrapping them in cmd), and possibly UTF-8 if the program you're calling outputs UTF-8. This is program-specific and should be looked up in the program's documentation.
-
jan.supol almost 7 years
Nice! I have checked some Windows 10 settings. For various European settings, it's CP850, but for the default (US) settings, it still remains CP437.
-
Franz D. almost 6 years
Hmmm... nice idea, but it actually doesn't work on my system (Windows 7 SP1, 64-bit, Java 8 build 71) -- none of the available encodings produces the original string. The problem seems to be that the given example string is not even correctly transferred to the system, producing "?" characters instead. Apart from that, I also get an additional "\r\n" line ending in the output.