How to detect file encoding in NodeJS?

Solution 1

I used the encoding-japanese package, and it worked well.

Example:

var encoding = require('encoding-japanese');
var fs = require('fs');

var fileBuffer = fs.readFileSync('file.txt');
console.log(encoding.detect(fileBuffer));

Available Encodings:

  • 'UTF32' (detect only)
  • 'UTF16'
  • 'UTF16BE'
  • 'UTF16LE'
  • 'BINARY' (detect only)
  • 'ASCII' (detect only)
  • 'JIS'
  • 'UTF8'
  • 'EUCJP'
  • 'SJIS'
  • 'UNICODE' (JavaScript Unicode Array)

It can be used both in Node.js and in the browser. Oh, and it has zero dependencies.

Solution 2

You can use an npm module that does exactly this: https://www.npmjs.com/package/detect-character-encoding

You can use it like this:

const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');

const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);

console.log(charsetMatch);
// {
//   encoding: 'UTF-8',
//   confidence: 60
// }
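If the native module fails to install (the comments below mention problems on Windows and Linux), a dependency-free sanity check for UTF-8 is possible with Node's built-in TextDecoder. This is only a sketch, and isValidUtf8 is a helper name I made up, not part of any package:

```javascript
// Hypothetical helper: strict UTF-8 validation using the built-in
// TextDecoder. Decoding with { fatal: true } throws on any byte
// sequence that is not well-formed UTF-8.
function isValidUtf8(buf) {
  try {
    new TextDecoder('utf-8', { fatal: true }).decode(buf);
    return true;
  } catch {
    return false;
  }
}

console.log(isValidUtf8(Buffer.from('héllo')));            // true
console.log(isValidUtf8(Buffer.from([0xff, 0xfe, 0xfd]))); // false (0xFF can never appear in UTF-8)
```

This only tells you whether the bytes *could* be UTF-8, not what the encoding actually is, so it complements rather than replaces a real detector.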

Solution 3

I don't think there is a "native Node.js function" that can do this.

The simplest solution I know is using an npm module like detect-file-encoding-and-language. As long as the input file is not too small it should work fine.

// Install the package using npm:
//   $ npm install detect-file-encoding-and-language

// Sample code:

const languageEncoding = require("detect-file-encoding-and-language");

const pathToFile = "/home/username/documents/my-text-file.txt";

languageEncoding(pathToFile).then(fileInfo => console.log(fileInfo));
// Possible result: { language: 'japanese', encoding: 'Shift-JIS', confidence: { language: 0.97, encoding: 0.97 } }

Solution 4

This is what I've been using for a while now. YMMV. Hope it helps.


var fs = require('fs');
...
function getFileEncoding(f) {

    var d = Buffer.alloc(5); // Buffer.alloc zero-fills by default
    var fd = fs.openSync(f, 'r');
    fs.readSync(fd, d, 0, 5, 0);
    fs.closeSync(fd);

    // https://en.wikipedia.org/wiki/Byte_order_mark
    var e = false;
    if ( !e && d[0] === 0xEF && d[1] === 0xBB && d[2] === 0xBF)
        e = 'utf8';
    if (!e && d[0] === 0xFE && d[1] === 0xFF)
        e = 'utf16be';
    if (!e && d[0] === 0xFF && d[1] === 0xFE)
        e = 'utf16le';
    if (!e)
        e = 'ascii';

    return e;

}



Author: Hemã Vidal

Updated on June 07, 2022

Comments

  • Hemã Vidal, almost 2 years ago

    How can I detect which encoding was used for a file?

    I want something like this:

    fs.getFileEncoding('C:/path/to/file.txt') // it returns 'UTF-8', 'CP-1252', ...
    

    Is there a simple way to do it using a nodejs native function?

    • Muhammad Usman, almost 6 years ago
      fs is a native module of Node.
  • Joshua Dannemann, over 4 years ago
    The project you mention does not work on Windows. Is there another tool that works well?
  • LUser, over 2 years ago
    It doesn't appear to install well on Linux either. That bug was reported in 2017. I would consider this a waste of time.
  • ruffin, about 2 years ago
    Depending on the use case and how sure I need to be (BOM sniffing suggests not very), I'd probably start with e = 'utf8', remove the utf8 check, then run the rest of the ladder without the !e && preamble (adding some elses/ternaries). Duck typing by BOM is a very practical idea for, say, reading files! @Falaen's answer, for the case with no BOM or obvious tipoff, sniffs the whole file looking for telltale signs, which is clever, but perhaps overkill.
  • ruffin, about 2 years ago
    Yeah, since UTF-8 is essentially a superset of at least 7-bit ASCII, if you're just looking for a practical "how should I read this?", I don't think you lose any utility with return d[0] === 0xfe && d[1] === 0xff ? "utf16be" : d[0] === 0xff && d[1] === 0xfe ? "utf16le" : "utf8";