Read Unicode text files with Java


Solution 1

You wouldn't wrap the Reader. Instead, wrap the underlying stream in an InputStreamReader, passing the encoding, and then wrap that in the BufferedReader you currently use:

BufferedReader in = new BufferedReader(new InputStreamReader(stream, encoding));
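
As a fuller illustration of that one-liner, here is a minimal sketch that reads lines from a UTF-16 stream. The class name `Utf16LineReader` and the helper `readLines` are made up for this example, and the byte array is only a stand-in for a file that starts with FF FE, as in the question.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
class Utf16LineReader {
    // Decode the byte stream with the given charset and collect its lines.
    static List<String> readLines(InputStream stream, Charset encoding) throws IOException {
        List<String> lines = new ArrayList<>();
        // try-with-resources closes the reader (and the wrapped stream) when done.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(stream, encoding))) {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real file: FF FE (little-endian BOM), then "hi\nok" as UTF-16LE.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0, '\n', 0, 'o', 0, 'k', 0};
        for (String line : readLines(new ByteArrayInputStream(data), StandardCharsets.UTF_16)) {
            System.out.println(line);
        }
    }
}
```

For a file on disk, passing `new FileInputStream(path)` as the stream should work the same way. `StandardCharsets.UTF_16` honours the BOM when decoding, so it handles a file that starts with FF FE.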

Solution 2

Check https://docs.oracle.com/javase/1.5.0/docs/api/java/io/InputStreamReader.html.

I would read the source file with something like:

Reader in = new InputStreamReader(new FileInputStream("file"), "UTF-8");

Solution 3

Some notes:

  • the "UTF-16" encoding can read either little- or big-endian files, provided they are marked with a BOM (see the list of Java 6 supported encodings). When writing, "UTF-16" produces big-endian output with a BOM (the java.nio.charset.Charset documentation specifies this), so you might want to use "UnicodeLittle" when saving little-endian data
  • be careful when using the String class encode/decode methods, especially with a BOM-marked, variable-width encoding like UTF-16: use them only on the complete data, never on arbitrary chunks, because a chunk may lack the BOM or split a surrogate pair
  • as others have said, it is often best to read character data by wrapping your InputStream with an InputStreamReader; you can concatenate your input into a single String using a StringBuilder or a similar buffer.
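
The first two notes can be checked with a small experiment. This is a sketch only: the class `Utf16EndiannessDemo` and its helper names are invented for illustration, and the sample bytes stand for a little-endian file beginning with FF FE.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named for this example only.
class Utf16EndiannessDemo {
    // Encoding with "UTF-16" writes a big-endian BOM (FE FF), then big-endian code units.
    static byte[] encodeUtf16(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // Decoding with "UTF-16" uses a leading BOM to pick the byte order and drops it;
    // without a BOM it assumes big-endian.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    // Render bytes as space-separated hex pairs for display.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "hi" as little-endian UTF-16 with BOM.
        byte[] littleEndian = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};

        System.out.println(decodeUtf16(littleEndian)); // hi
        System.out.println(hex(encodeUtf16("hi")));    // FE FF 00 68 00 69

        // Decoding in chunks: the second chunk has no BOM, so it is read as big-endian.
        String first = decodeUtf16(Arrays.copyOfRange(littleEndian, 0, 4));
        String second = decodeUtf16(Arrays.copyOfRange(littleEndian, 4, 6));
        System.out.println((first + second).equals("hi")); // false
    }
}
```

The last line prints false because the bytes 69 00, decoded without a BOM, come out as U+6900 rather than "i". Decoding through a single InputStreamReader over the whole stream avoids this.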

Solution 4

I would recommend using UnicodeReader from the Google Data API (see this answer to a similar question). It automatically detects the encoding from the byte order mark (BOM).

You may also consider BOMInputStream in Apache Commons IO, which does essentially the same thing but does not cover every BOM variant.
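
To show the idea behind BOM detection without either library, here is a hand-rolled sketch using only the JDK. It is not the API of UnicodeReader or BOMInputStream; the class `BomSniffer` and its methods are hypothetical names. It recognises only the UTF-8, UTF-16LE and UTF-16BE marks, so a UTF-32LE file (which also starts with FF FE) would be misdetected.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, named for this example only.
class BomSniffer {
    // Peek at up to three bytes, pick a charset from the BOM, and push back
    // everything that is not part of the BOM. Falls back if no BOM is found.
    static Reader open(InputStream in, Charset fallback) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {
            int r = pb.read(head, n, head.length - n);
            if (r < 0) break; // stream is shorter than three bytes
            n += r;
        }

        Charset charset = fallback;
        int bomLength = 0;
        if (n >= 3 && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF) {
            charset = StandardCharsets.UTF_8;
            bomLength = 3;
        } else if (n >= 2 && head[0] == (byte) 0xFF && head[1] == (byte) 0xFE) {
            charset = StandardCharsets.UTF_16LE;
            bomLength = 2;
        } else if (n >= 2 && head[0] == (byte) 0xFE && head[1] == (byte) 0xFF) {
            charset = StandardCharsets.UTF_16BE;
            bomLength = 2;
        }

        // Return the non-BOM bytes to the stream so the reader sees them.
        if (n > bomLength) {
            pb.unread(head, bomLength, n - bomLength);
        }
        return new InputStreamReader(pb, charset);
    }

    // Drain a reader into a single String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int k;
        while ((k = reader.read(buf)) != -1) {
            sb.append(buf, 0, k);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a file that starts with FF FE.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 'h', 0, 'i', 0};
        try (Reader reader = open(new ByteArrayInputStream(data), StandardCharsets.ISO_8859_1)) {
            System.out.println(readAll(reader)); // hi
        }
    }
}
```

The fallback charset is a design choice for this sketch: a file without a BOM carries no marker to detect, so the caller has to supply a guess.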

Author: Ron Tuffin

Husband, Father and Science geek. I ask the questions that you think are too stupid to ask :)

Updated on June 03, 2020

Comments

  • Ron Tuffin almost 4 years

    Real simple question really. I need to read a Unicode text file in a Java program.

    I am used to reading plain ASCII text with a BufferedReader/FileReader combo, which is obviously not working here :(

    I know that I can read a String in the 'traditional' way using a BufferedReader and then convert it using something like:

    temp = new String(temp.getBytes(), "UTF-16");
    

    But is there a way to wrap the Reader in a 'Converter'?

    EDIT: the file starts with FF FE

  • Roger C S Wernersson about 14 years
    Thanks for the link to the encoding types. I found the right one for me.
  • CodyBugstein over 10 years
    I want to read Hebrew letters. What would I replace "encoding" with?
  • CodyBugstein over 10 years
    To answer my own question, it's "UTF-8".
  • BradleyDotNET almost 10 years
    Is the Scanner class specific to Unicode? Just reading the code (and not being aware of such things), it is difficult to ascertain whether this actually answers the question. For issues where the OP may need some conceptual understanding as well as code, it is useful to include a short text description of why the code works in your answer. Such a description would be beneficial here. Also, I have edited your post to put the code in "Code Markup". Please do the same in the future, as it makes it much easier to read. Welcome to StackOverflow!
  • Squareoot over 7 years
    'The constructor BufferedReader(InputStreamReader) is undefined'?