How to detect file encoding in NodeJS?

Solution 1

I used the encoding-japanese package, and it worked well.

Example:

var encoding = require('encoding-japanese');
var fs = require('fs');

var fileBuffer = fs.readFileSync('file.txt');
console.log(encoding.detect(fileBuffer));

Available Encodings:

  • 'UTF32' (detect only)
  • 'UTF16'
  • 'UTF16BE'
  • 'UTF16LE'
  • 'BINARY' (detect only)
  • 'ASCII' (detect only)
  • 'JIS'
  • 'UTF8'
  • 'EUCJP'
  • 'SJIS'
  • 'UNICODE' (JavaScript Unicode Array)

It can be used both in Node.js and in the browser. Oh, and it has zero dependencies.

Solution 2

You can use an npm module that does exactly this: https://www.npmjs.com/package/detect-character-encoding

You can use it like this:

const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');

const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);

console.log(charsetMatch);
// {
//   encoding: 'UTF-8',
//   confidence: 60
// }
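If the native module fails to install (the comments below mention problems on Windows and Linux), a dependency-free sanity check for UTF-8 is possible with Node's built-in TextDecoder. This is only a sketch, and isValidUtf8 is a helper name I made up, not part of any package:

```javascript
// Hypothetical helper: strict UTF-8 validation using the built-in
// TextDecoder. Decoding with { fatal: true } throws on any byte
// sequence that is not well-formed UTF-8.
function isValidUtf8(buf) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(Buffer.from('héllo')));            // true
console.log(isValidUtf8(Buffer.from([0xff, 0xfe, 0xfd]))); // false (0xFF can never appear in UTF-8)
```

This only tells you whether the bytes *could* be UTF-8, not what the encoding actually is, so it complements rather than replaces a real detector.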

Solution 3

I don't think there is a "native Node.js function" that can do this.

The simplest solution I know is using an npm module like detect-file-encoding-and-language. As long as the input file is not too small it should work fine.

// Install the package using npm:
//   $ npm install detect-file-encoding-and-language

// Sample code:

const languageEncoding = require("detect-file-encoding-and-language");

const pathToFile = "/home/username/documents/my-text-file.txt";

languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: 'japanese', encoding: 'Shift-JIS', confidence: { language: 0.97, encoding: 0.97 } }

Solution 4

This is what I've been using for a while now. YMMV. Hope it helps.


var fs = require('fs');
...
function getFileEncoding(f) {

    var d = Buffer.alloc(5); // Buffer.alloc zero-fills by default
    var fd = fs.openSync(f, 'r');
    fs.readSync(fd, d, 0, 5, 0);
    fs.closeSync(fd);

    // https://en.wikipedia.org/wiki/Byte_order_mark
    var e = false;
    if ( !e && d[0] === 0xEF && d[1] === 0xBB && d[2] === 0xBF)
        e = 'utf8';
    if (!e && d[0] === 0xFE && d[1] === 0xFF)
        e = 'utf16be';
    if (!e && d[0] === 0xFF && d[1] === 0xFE)
        e = 'utf16le';
    if (!e)
        e = 'ascii';

    return e;

}



Author: Hemã Vidal

Updated on June 07, 2022

Comments

  • Hemã Vidal, almost 2 years ago

    How can I detect which encoding was used for a file?

    I want something like this:

    fs.getFileEncoding('C:/path/to/file.txt') // it returns 'UTF-8', 'CP-1252', ...
    

    Is there a simple way to do it using a nodejs native function?

    • Muhammad Usman, almost 6 years ago
      fs is a native module of Node.
  • Joshua Dannemann, over 4 years ago
    The project you mention does not work on Windows. Is there another tool that works well?
  • LUser, over 2 years ago
    It doesn't appear to install well on Linux either. That bug was reported in 2017. I would consider this a waste of time.
  • ruffin, about 2 years ago
    Depending on the use case and how sure I need to be (BOM sniffing suggests not very), I'd probably start with e = 'utf8', remove the utf8 check, then run the rest of the ladder without the !e && preamble (adding some elses/ternaries). Duck typing by BOM is a very practical idea for, say, reading files! @Falaen's answer, for the case with no BOM or obvious tipoff, sniffs the whole file looking for telltale signs, which is clever, but perhaps overkill.
  • ruffin, about 2 years ago
    Yeah, since UTF-8 is essentially a superset of at least 7-bit ASCII, if you're just looking for a practical "how should I read this?", I don't think you lose any utility with return d[0] === 0xfe && d[1] === 0xff ? "utf16be" : d[0] === 0xff && d[1] === 0xfe ? "utf16le" : "utf8";