How can I find non-ASCII characters in text files?

java groovy ascii

12,573

Solution 1

Well, it's still here after an hour, so I may as well answer it. Here's a simple filter that prints only non-ASCII characters from its input, and gives exit code 0 if there weren't any and 1 if there were. Reads from standard input only.

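#include <stdio.h>
#include <ctype.h>   /* isascii() is a POSIX extension, not ISO C */

int main(void)
{
    int c, flag = 0;   /* flag becomes the exit status */

    /* getchar() returns each byte as an unsigned char value, or EOF */
    while ((c = getchar()) != EOF)
        if (!isascii(c)) {   /* true for any byte above 0x7F */
            putchar(c);      /* echo the offending byte */
            flag = 1;
        }

    return flag;   /* 0: input was all ASCII, 1: non-ASCII was found */
}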
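
Since the question asks for Java, here is a rough Java counterpart of the filter above. It is only a sketch: the class name NonAsciiScan and the helper countNonAscii are made up for illustration, and unlike the C version it counts the offending bytes instead of echoing them. It treats the input as raw bytes, so a multi-byte UTF-8 character is counted once per byte.

```java
import java.io.IOException;

// Hypothetical names; a sketch of the same idea as the C filter, not a port of it.
class NonAsciiScan {
    // Counts the bytes in buf[0..len) that fall outside the ASCII range 0x00-0x7F.
    static long countNonAscii(byte[] buf, int len) {
        long count = 0;
        for (int i = 0; i < len; i++) {
            // Java bytes are signed, so 0x80-0xFF show up as negative values.
            if (buf[i] < 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        // Stream standard input one buffer at a time instead of loading it all.
        while ((n = System.in.read(buf)) != -1) {
            total += countNonAscii(buf, n);
        }
        System.err.println(total + " non-ASCII byte(s) found");
        // Same convention as the C filter: 0 if clean, 1 otherwise.
        System.exit(total == 0 ? 0 : 1);
    }
}
```

Usage would be along the lines of `java NonAsciiScan < file.txt`, then checking the exit status.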

Solution 2

Just run $JDK_HOME/bin/native2ascii on the text file and search for "\u" in the output file. I'm assuming you want to find these characters so you can escape them anyway, and this will save you a step. ;)


Marcus Leon


Updated on September 18, 2022

Comments

Marcus Leon over 1 year

Is there a tool that can scan a small text file and look for any character not in the simple ASCII character set?

A simple Java or Groovy script would also do.
- Marcus Leon over 12 years
  
  It can be moved there, though I would think this would be of direct interest to programmers in the middle of certain programming tasks (such as where I am right now).
- Ken White over 12 years
  
  It's not a programming question, and therefore is off-topic. You've been here long enough to know that, but if not please read the FAQ for info on what questions are on-topic here. :)
- Tom Zych over 12 years
  
  You could of course use grep with a negated character class.
- Tom Zych over 12 years
  
  @tchrist: Doesn't ASCII run from 00 to 7F?
Marcus Leon over 12 years

Thanks, happen to have a Java version? :)
Tom Zych over 12 years

Nope, don't do Java, sorry.
tchrist over 12 years

@Marcus: Monolingualism is about as environmentally healthy as any other monoculture.
tchrist over 12 years

It doesn’t make any sense to read the whole file into memory. Note that EVERY SINGLE STRING EVER CREATED matches something like /[\x00-\xFF]*/, just as every single string also matches /a*/, even "xxx". Zero or more means you’re content with 0. And /[\x80-\xFF]/ is not ASCII! You need to match /^[\x00-\x7F]+$/ to be all ASCII. A normal regex engine with the very most basic Unicode support would simply use \p{ASCII} vs \P{ASCII}.
OverZealous over 12 years

@tchrist I appreciate the feedback. Of course, it would be more efficient to stream the file. However, the original question specifically asked about scanning a small file. Your comment about the regex is incorrect, simply due to the fact that I actually tested my code before I posted it. Sorry if my range is incorrect - that might be a valid comment, but your comment is unnecessarily aggressive and rude. I was simply providing a working Groovy-based example, since the question mentioned it.
OverZealous over 12 years

Also, you have to match the empty string, or empty files will show up as non-ASCII. I think that is incorrect behavior.
tchrist over 12 years

Nope, ASCII is code points 0 through 127. Your pattern matches 0 through 255. It is therefore wrong.
OverZealous over 12 years

I shouldn't bother responding, but I need to point out two things: First, you could have simply pointed that out, and suggested a fix, and I would have updated my suggestion. That's how StackExchange works - answers can be edited and cleaned up. Second, it's funny you are making such a big deal about the range, since that's the exact same range you suggested above! It's OK though, I understand that you would rather knock someone down than be helpful.
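
For anyone skimming the thread, the regex points argued over above can be checked with a few lines of Java. This is a sketch under assumptions: the class and method names (AsciiRegex, isAllAscii, firstNonAscii) are invented here, and it relies on java.util.regex treating \p{ASCII} as the range 0x00-0x7F. Searching for one \P{ASCII} character avoids both the 0-255 range mistake and the empty-string question, since an empty input simply has nothing to find.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class AsciiRegex {
    // \P{ASCII} matches any single character outside 0x00-0x7F.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    // True when s has no character above U+007F; the empty string counts as ASCII.
    static boolean isAllAscii(String s) {
        return !NON_ASCII.matcher(s).find();
    }

    // Index of the first non-ASCII character in s, or -1 if there is none.
    static int firstNonAscii(String s) {
        Matcher m = NON_ASCII.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(isAllAscii("plain text"));        // true
        System.out.println(isAllAscii(""));                  // true
        System.out.println(firstNonAscii("na\u00efve"));     // 2

        // An unanchored "zero or more" pattern finds an empty match in any string,
        // so it says nothing about the content.
        System.out.println(
            Pattern.compile("[\\x00-\\xFF]*").matcher("\u20ac").find()); // true

        // The anchored 0x00-0x7F form with + rejects the empty string.
        System.out.println("".matches("[\\x00-\\x7F]+"));    // false
    }
}
```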