How to do a regex search in a UTF-16LE file while in a UTF-8 locale?

text-processing grep regular-expression perl unicode

9,400

Solution 1

My answer is essentially the same as in your other question on this topic:

$ iconv -f UTF-16LE -t UTF-8 myfile.txt | grep pattern

As in the other question, you might need line ending conversion as well, but the point is that you should convert the file to the local encoding so you can use native tools directly.
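
A minimal sketch of that pipeline as a reusable shell function, assuming iconv, tr and grep are on the PATH. The name grep_utf16le is made up for illustration and is not a standard tool; the tr step covers the line ending conversion mentioned above by dropping carriage returns.

```shell
# Hypothetical helper (the name is an assumption, not a standard tool):
# search a UTF-16LE file with an ordinary UTF-8 grep.
# Usage: grep_utf16le PATTERN FILE
grep_utf16le() {
    # Decode to UTF-8, drop CRs left over from CRLF line endings, then grep.
    iconv -f UTF-16LE -t UTF-8 "$2" | tr -d '\r' | grep -e "$1"
}

# Example: keep the matches in UTF-16LE by re-encoding grep's output.
# grep_utf16le 'pattern' myfile.txt | iconv -f UTF-8 -t UTF-16LE > matches.txt
```

The file on disk is never modified; the conversion only happens inside the pipe.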

Solution 2

Install the ripgrep utility, which supports UTF-16.

For example:

rg pattern filename

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)

To print all lines, run: rg -N . filename.

Solution 3

ugrep (Universal grep) supports Unicode, UTF-8/16/32 files, detects invalid Unicode to ensure proper results, displays text and binary files, and is fast and free:

ugrep searches UTF-8/16/32 input and other formats. Option -Q permits many other file formats to be searched, such as ISO-8859-1 to 16, EBCDIC, code pages 437, 850, 858, 1250 to 1258, MacRoman, and KOI8.

Simply give it a pattern of Unicode characters to match:

ugrep -QUTF-16LE "ऊपर" filename

or with the code points in hex:

ugrep -QUTF-16LE "\x{090A}\x{092A}\x{0930}" filename

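See ugrep on GitHub for details. Note that in newer ugrep releases -Q starts the interactive query mode, and the encoding is selected with --encoding=UTF-16LE instead.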

Solution 4

I believe that Warren's answer is a better general *nix solution, but this perl script works exactly as I wanted (for my somewhat non-standard situation). It does require that I change the search-pattern's current format slightly:
from \x09\x0A\x09\x2A\x09\x30\x00\x09
to \x{090A}\x{092A}\x{0930}\x{0009}

It does everything in one process which is particularly what I was after.

#! /usr/bin/env perl
use strict;
use warnings;
die "3 args are required" if scalar @ARGV != 3;
my $if =$ARGV[0];
my $of =$ARGV[1];
my $pat=$ARGV[2];
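# Decode the input from UTF-16LE and encode the output back to UTF-16LE.
# Abort if either file cannot be opened, rather than carrying on with a bad handle.
open(my $ifh, '<:encoding(UTF-16LE)', $if) or die "Can't open $if: $!";
open(my $ofh, '>:encoding(UTF-16LE)', $of) or die "Can't open $of: $!";
# Print every line that starts with the pattern.
while (<$ifh>) { print $ofh $_ if /^$pat/; }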


Peter.O

Updated on September 18, 2022

Comments

Peter.O over 1 year
EDIT: A comment from Warren Young made me realize that I was not clear on one quite relevant point. My search string is already in UTF-16LE order (not in Unicode Codepoint order, which is UTF-16BE), so perhaps the Unicode issue is somewhat moot.

Perhaps my issue is a question of how do I grep for bytes (not chars) in groups of 2-bytes, ie. so that UTF-16LE \x09\x0A is not treated as TAB,newline, but just as 2 bytes which happen to be UTF-16LE ऊ? ... Note: I do not need to be concerned about UTF-16 surrogate pairs, so 2-byte blocks is fine.

Here is a sample pattern for this 3-character string ऊपर:
- \x09\x0A\x09\x2A\x09\x30
  
  but it returns nothing, though the string is in the file.
(here is the original post)
When searching a UTF-16LE file with a pattern in \x00\x01\x...etc format, I have encountered problems for some values. I've been using sed (and experimented with grep), but being in the UTF-8 locale they recognize some UTF-16LE values as ASCII characters. I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option.

eg. In this text ऊ (UNICODE 090A), though it is a single character, ऊ is perceived as two ASCII chars \x09 and \x0A.
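
To see the clash concretely, this sketch dumps the bytes of a string as stored in UTF-16LE, assuming iconv and od are available. The function name utf16le_bytes is made up for illustration.

```shell
# Hypothetical helper: show the UTF-16LE bytes of a UTF-8 string in hex.
utf16le_bytes() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
}

# U+090A is stored low byte first, i.e. as the bytes 0a 09.
utf16le_bytes 'ऊ'
```

A byte-oriented tool reads that 0a as a line terminator and the 09 as a tab, which is why a line containing ऊ gets split in the middle of the character.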

grep has a -P (perl) option which can search for \x00\x... patterns, but I'm getting the same ASCII interpretation.

Is there some way to use grep -P to search in a UTF-16 mode, or perhaps better, how can this be done in perl or some other script?

grep seems to be the most appealing because of its compactness, but whatever gets the job done will override that preference.

PS: My ऊ example uses a literal string, but my actual usage needs a regex style search. So this perl example is not quite what I'm after, though it does process the file as UTF-16... I'd prefer to not have to open and close the file... I think perl has more compact ways for basic things like a regex search. I'm after something with that type of compact syntax.
- vonbrand over 11 years
  
I'm not so sure regexp machinery is really up to snuff with respect to UTF-8, much less other Unicode encodings. They will mostly work on UTF-8, as long as characters that are represented by several bytes do not appear in character sets or as arguments to repetition. E.g., [ña-z] will probably do surprising stuff, and so will gñ* or g[ñn]u, but g\(ñ\)* and g\(n\|ñ\)u should work fine (it just means something different than you see ;-). The machinery is 8-bit clean nowadays, and swallows the UTF-8 bytes without complaint, but doesn't combine them up to characters.
Peter.O almost 12 years

Thanks Warren, but as I mentioned in the question: "I'm locked-in to using UTF-16, so recoding to UTF-8 is not an option." ... I could, if all else fails, do something like what you suggest, but I'm certainly trying to avoid it, because all the search criteria are already in '\xXX' UTF-16 format and that would mean converting them also, plus I need to re-convert the result back into UTF-16. So a more direct (probably/possibly perl) way is preferred... and also, I'd just like to learn how to do it without re-encoding...
Warren Young almost 12 years

I think you may be borrowing trouble. If you provide grep a Unicode code point, it should find it, if the input is in its native Unicode encoding. The only way I see it not working is if you are searching for hex byte pairs instead, and they are byte swapped as compared to how grep sees the data. Keep in mind that internally, grep is processing the input as 32-bit Unicode characters, not as a raw byte stream. Anyway, try it before you reject the answer. You might be surprised and find that it works.
Peter.O almost 12 years

As the Codepoint for @ is 0x0040, the Codepoint for ऊ is 0x090A (U+090A). My patterns are flipped into Little-Endian order \x0A\x09 which is how they are stored. This basically works fine for most patterns, but produces unexpected results when the UTF-16 representation of the codepoint(s) clashes with grep's UTF-8 interpretation of the pattern and data; especially with the \x0A\x09 combination, which I do encounter.
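
The flip described here can be sketched as a small shell function. The name cp_to_le is hypothetical, and it only handles code points up to U+FFFF (no surrogate pairs), in line with the 2-byte assumption in the question.

```shell
# Hypothetical helper: print the little-endian \xLO\xHI escape
# for one code point in the range U+0000..U+FFFF.
cp_to_le() {
    printf '\\x%02X\\x%02X' $(( $1 & 0xFF )) $(( ($1 >> 8) & 0xFF ))
}

cp_to_le 0x090A   # prints \x0A\x09
```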
Peter.O almost 12 years

Your method will certainly work, and I'll mark it up once the dust has settled. At the moment, I'm just hanging out for a method which doesn't need to re-encode data.. (I'm currently diving into perl. The last time I did that I think I drowned :) ... Perhaps what I am looking for is to grep raw byte data, but I'm not sure yet.
Warren Young almost 12 years

I don't see any virtue in not re-coding the data. The Perl answer to your other question also re-coded it on the fly. It's not like I'm asking you to change your files on disk; we're just performing a bit of a transform to the data in order to get it into the form we need to process it. This is what computers are best at. Input-process-output.
Warren Young almost 12 years

Your main loop can be rewritten as while (<$ifh>) { print $ofh $_ if /^$pat/; } You won't get the diagnostic on bad readline, but that's not going to happen on a modern OS unless the hardware is failing while you read the file.
Peter.O almost 12 years

@Warren, thanks for the help. I've changed the script to the simpler loop.