Files.readAllBytes vs Files.lines getting MalformedInputException

27,812

Solution 1

This has to do with character encoding. Computers only deal with numbers. To store text, the characters in the text have to be converted to and from numbers, using some scheme. That scheme is called the character encoding. There are many different character encodings; some of the well-known standard character encodings are ASCII, ISO-8859-1 and UTF-8.
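As a small illustration of the idea, here is a sketch showing that the same character becomes different bytes under different encodings. The class name `EncodingDemo` and the helper `bytesOf` are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class EncodingDemo {
    // Encode the text with the given charset and list the resulting (signed) byte values.
    static String bytesOf(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent), written as an escape so that the
        // encoding of this source file does not matter.
        String text = "\u00e9";

        System.out.println("ISO-8859-1: " + bytesOf(text, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8:      " + bytesOf(text, StandardCharsets.UTF_8));

        // Decoding the single ISO-8859-1 byte as UTF-8 does not give the character back.
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        String wrong = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("Round trip intact: " + wrong.equals(text));
    }
}
```

ISO-8859-1 stores the character as the single byte 0xE9, while UTF-8 needs the two bytes 0xC3 0xA9, so bytes written with one encoding cannot simply be read back with the other.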

In the first example, you read all the bytes (numbers) in the file and then convert them to characters by passing them to the constructor of class String. Because no charset is specified, this uses your platform's default character encoding, which depends on the operating system, the locale settings and the JDK version.

In the second example, where you use Files.lines(...), the UTF-8 character encoding will be used, according to the documentation. When a sequence of bytes is found in the file that is not a valid UTF-8 sequence, you'll get a MalformedInputException.

The default character encoding of your system may or may not be UTF-8, so that can explain a difference in behaviour.
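To see which default applies on your machine, a quick check along these lines should do. The class name `DefaultCharsetCheck` is only for this sketch. On JDK 18 and later the default is UTF-8 unless it is overridden (JEP 400); older JDKs derive it from the operating system and locale.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Name of the charset that new String(byte[]) uses when no charset is given.
    static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultCharsetName());
    }
}
```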

You'll have to find out what character encoding is used for the file, and then explicitly use that. For example:

String content = new String(Files.readAllBytes(Paths.get("_template.txt")),
        StandardCharsets.ISO_8859_1);

Second example:

Stream<String> lines = Files.lines(Paths.get("_template.txt"),
        StandardCharsets.ISO_8859_1);
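Putting this together, a self-contained sketch might look like the following. It assumes the file really is ISO-8859-1, and it creates its own temporary file so that it can be run anywhere. The class name `ReadWithCharset` and the helper `writeThenReadLines` are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ReadWithCharset {
    // Write the text to a temporary file in the given charset, then read it
    // back line by line with the same charset.
    static List<String> writeThenReadLines(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("template", ".txt");
            try {
                Files.write(file, text.getBytes(charset));
                // Files.lines keeps the file open until the stream is closed,
                // so use try-with-resources.
                try (Stream<String> lines = Files.lines(file, charset)) {
                    return lines.collect(Collectors.toList());
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two lines containing non-ASCII characters, written with escapes.
        String text = "caf\u00e9\nna\u00efve\n";
        System.out.println(writeThenReadLines(text, StandardCharsets.ISO_8859_1));
    }
}
```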

Solution 2

To complement Jesper's answer, what happens here (and is undocumented!) is that Files.lines() creates a CharsetDecoder whose policy is to reject invalid byte sequences; that is, its CodingErrorAction is set to REPORT.

This is unlike what happens for nearly all other Reader implementations provided by the JDK, whose standard policy is to REPLACE. Under that policy, every malformed or unmappable byte sequence is decoded as the replacement character (U+FFFD) instead of causing an exception.
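The difference between the two policies can be shown with a CharsetDecoder configured by hand. This is only a sketch of the behaviour, not the code Files.lines() uses internally. The class name `DecoderPolicyDemo` and its helpers are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DecoderPolicyDemo {
    // Decode as UTF-8, applying the same action to malformed and unmappable input.
    static String decodeUtf8(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Describe the outcome as text so both policies can be printed side by side.
    static String describe(byte[] bytes, CodingErrorAction action) {
        try {
            return "decoded: " + decodeUtf8(bytes, action);
        } catch (CharacterCodingException e) {
            return "failed: " + e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // 0xE9 followed by 'b' is not a valid UTF-8 sequence.
        byte[] bytes = {'a', (byte) 0xE9, 'b'};
        System.out.println("REPLACE -> " + describe(bytes, CodingErrorAction.REPLACE));
        System.out.println("REPORT  -> " + describe(bytes, CodingErrorAction.REPORT));
    }
}
```

With REPLACE the invalid byte silently becomes U+FFFD. With REPORT the decoder throws a MalformedInputException, which is the exception seen in the question.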

Solution 3

Files.lines by default uses the UTF-8 encoding, whereas instantiating a new String from bytes will use the default system encoding. It appears that your file is not in UTF-8, which is why it is failing.

Check what encoding your file is using, and pass it as the second parameter.
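If you don't know the encoding, one rough way to narrow it down is to try a few candidate charsets with a strict decoder and see which ones accept the bytes. The sketch below uses a hypothetical class `CharsetProbe` and an arbitrary candidate list. In practice you would pass it the result of Files.readAllBytes(...). A clean decode does not prove the guess is right: ISO-8859-1, for instance, accepts every possible byte sequence.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CharsetProbe {
    // Return the names of the candidate charsets that decode the data without errors.
    static List<String> decodableWith(byte[] data, List<Charset> candidates) {
        List<String> result = new ArrayList<>();
        for (Charset charset : candidates) {
            try {
                charset.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                result.add(charset.name());
            } catch (CharacterCodingException e) {
                // This charset cannot decode the data; leave it out of the result.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data: "caf" followed by the byte 0xE9.
        byte[] data = {'c', 'a', 'f', (byte) 0xE9};
        List<Charset> candidates = Arrays.asList(
                StandardCharsets.US_ASCII,
                StandardCharsets.UTF_8,
                StandardCharsets.ISO_8859_1);
        System.out.println(decodableWith(data, candidates));
    }
}
```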

Author: Angelo.Hannes

Updated on July 22, 2022

Comments

  • Angelo.Hannes
    Angelo.Hannes almost 2 years

    I would have thought that the following two approaches to read a file should behave equally. But they don't. The second approach is throwing a MalformedInputException.

    public static void main(String[] args) {    
        try {
            String content = new String(Files.readAllBytes(Paths.get("_template.txt")));
            System.out.println(content);
        } catch (IOException e) {
            e.printStackTrace();
        }
    
        try(Stream<String> lines = Files.lines(Paths.get("_template.txt"))) {
            lines.forEach(System.out::println);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    

    This is the stack trace:

    Exception in thread "main" java.io.UncheckedIOException: java.nio.charset.MalformedInputException: Input length = 1
        at java.io.BufferedReader$1.hasNext(BufferedReader.java:574)
        at java.util.Iterator.forEachRemaining(Iterator.java:115)
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
        at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
        at Test.main(Test.java:19)
    Caused by: java.nio.charset.MalformedInputException: Input length = 1
        at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at java.io.BufferedReader$1.hasNext(BufferedReader.java:571)
        ... 4 more
    

    What is the difference here, and how do I fix it?

  • fge
    fge about 9 years
    Interesting... This means that the CharsetDecoder used in Files.lines() uses CodingErrorAction.REPORT by default. This is not documented, and is unlike any default Reader provided by the JDK...
  • Jesper
    Jesper about 9 years
    @fge Files.lines() creates a CharsetDecoder from the Charset that you pass it, and the API docs of CharsetDecoder say: "The default action for malformed-input and unmappable-character errors is to report them."
  • fge
    fge about 9 years
    Well, this default action is pretty much overridden everywhere in the JDK :)
  • Jesper
    Jesper about 9 years
    @fge I guess that older Reader classes do this differently for backward compatibility. (It's also curious that Files.lines() uses UTF-8 by default while most other I/O classes use the system's default charset.) The standard classes are not very consistent in many ways...