How can I find non-ASCII characters in text files?


Solution 1

Well, it's still here after an hour, so I may as well answer it. Here's a simple filter that prints only the non-ASCII characters from its input and exits with status 0 if there weren't any and 1 if there were. It reads from standard input only.

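#include <stdio.h>

int main(void)
{
    int c, flag = 0;

    /* getchar() returns each byte as 0..255, or EOF at end of input */
    while ((c = getchar()) != EOF)
        if (c > 0x7F) {     /* outside ASCII (0x00-0x7F); portable stand-in for POSIX isascii() */
            putchar(c);
            flag = 1;
        }

    return flag;            /* exit status: 0 = all ASCII, 1 = non-ASCII found */
}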
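
Since a Java version was asked for in the comments, here is a rough Java counterpart of the C filter above. It is only a sketch: the class name NonAsciiFilter and the method copyNonAscii are made up for illustration, and it works on raw bytes just like the C code, so a multi-byte UTF-8 character comes out as its individual bytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical class name; a byte-for-byte port of the C filter, not anyone's official tool.
final class NonAsciiFilter {

    // Copies every byte above 0x7F from in to out and reports whether any was seen.
    static boolean copyNonAscii(InputStream in, OutputStream out) throws IOException {
        boolean found = false;
        int b;
        while ((b = in.read()) != -1) {   // read() yields 0..255, or -1 at end of stream
            if (b > 0x7F) {               // ASCII is 0x00 through 0x7F
                out.write(b);
                found = true;
            }
        }
        out.flush();
        return found;
    }

    public static void main(String[] args) throws IOException {
        // Same contract as the C version: exit status 0 if all ASCII, 1 otherwise.
        boolean found = copyNonAscii(System.in, System.out);
        System.exit(found ? 1 : 0);
    }
}
```

It would be run the same way as the C filter, for example `java NonAsciiFilter < file.txt`, and the exit status can then be checked by the calling script.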

Solution 2

Just run $JDK_HOME/bin/native2ascii on the text file and search for "\u" in the output file. I'm assuming you want to find these characters so you can escape them anyway, and this will save you a step. ;)
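
If the native2ascii tool isn't at hand, the same idea can be sketched in a few lines of Java: escape everything above 0x7F and then look for "\u" in the result. This is not the tool's implementation, just an illustration of the approach; the class name UnicodeEscaper and the method escapeNonAscii are invented for the example, and a "\u" that was already present literally in the input would also show up in the search.

```java
// Hypothetical helper; mimics the escape-then-search idea, not native2ascii itself.
final class UnicodeEscaper {

    // Replaces each UTF-16 code unit above 0x7F with a backslash, the letter u, and four hex digits.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char ch = text.charAt(i);
            if (ch > 0x7F) {
                sb.append(String.format("\\u%04x", (int) ch));
            } else {
                sb.append(ch);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String escaped = escapeNonAscii("caf\u00e9 au lait");
        System.out.println(escaped);
        // Searching the escaped text for the two-character marker reveals non-ASCII input.
        System.out.println(escaped.contains("\\u") ? "non-ASCII found" : "pure ASCII");
    }
}
```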


Author: Marcus Leon

Director Clearing Technology, Intercontinental Exchange. Develop the clearing systems that power ICE/NYSE's derivatives markets.

Updated on September 18, 2022

Comments

  • Marcus Leon over 1 year

    Is there a tool that can scan a small text file and look for any character not in the simple ASCII character set?

    A simple Java or Groovy script would also do.

    • Marcus Leon over 12 years
      It can be moved there, though I would think this would be of direct interest to programmers in the middle of certain programming tasks (such as where I am right now).
    • Ken White over 12 years
      It's not a programming question, and therefore is off-topic. You've been here long enough to know that, but if not please read the FAQ for info on what questions are on-topic here. :)
    • Tom Zych over 12 years
      You could of course use grep with a negated character class.
    • Tom Zych over 12 years
      @tchrist: Doesn't ASCII run from 00 to 7F?
  • Marcus Leon over 12 years
    Thanks, happen to have a Java version? :)
  • Tom Zych over 12 years
    Nope, don't do Java, sorry.
  • tchrist over 12 years
    @Marcus: Monolingualism is about as environmentally healthy as any other monoculture.
  • tchrist over 12 years
    It doesn’t make any sense to read the whole file into memory. Note that EVERY SINGLE STRING EVER CREATED matches something like /[\x00-\xFF]*/, just as every single string also matches /a*/, even "xxx". Zero or more means you’re content with 0. And /[\x80-\xFF]/ is not ASCII! You need to match /^[\x00-\x7F]+$/ to be all ASCII. A normal regex engine with the very most basic Unicode support would simply use \p{ASCII} vs \P{ASCII}.
  • OverZealous over 12 years
    @tchrist I appreciate the feedback. Of course, it would be more efficient to stream the file. However, the original question specifically asked about scanning a small file. Your comment about the regex is incorrect, simply due to the fact that I actually tested my code before I posted it. Sorry if my range is incorrect - that might be a valid comment, but your comment is unnecessarily aggressive and rude. I was simply providing a working Groovy-based example, since the question mentioned it.
  • OverZealous over 12 years
    Also, you have to match the empty string, or empty files will show up as non-ASCII. I think that is incorrect behavior.
  • tchrist over 12 years
    Nope, ASCII is code points 0 through 127. Your pattern matches 0 through 255. It is therefore wrong.
  • OverZealous over 12 years
    I shouldn't bother responding, but I need to point out two things: First, you could have simply pointed that out, and suggested a fix, and I would have updated my suggestion. That's how StackExchange works - answers can be edited and cleaned up. Second, it's funny you are making such a big deal about the range, since that's the exact same range you suggested above! It's OK though, I understand that you would rather knock someone down than be helpful.
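
The regex points argued over in the comments above are easy to check in Java, whose regex engine supports \p{ASCII} and \P{ASCII}. The sketch below is only an illustration under that assumption; the class name AsciiRegexCheck and its methods are made up for the example. It searches for a single non-ASCII character rather than matching the whole text, which sidesteps both the zero-or-more pitfall and the empty-file question: an empty string simply contains no non-ASCII character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the regex discussion; not code from any answer.
final class AsciiRegexCheck {

    // \P{ASCII} matches any single character outside U+0000..U+007F.
    private static final Pattern NON_ASCII = Pattern.compile("\\P{ASCII}");

    // True when no character is above U+007F; the empty string counts as ASCII.
    static boolean isAllAscii(String text) {
        return !NON_ASCII.matcher(text).find();
    }

    // Index of the first non-ASCII character, or -1 if there is none.
    static int firstNonAscii(String text) {
        Matcher m = NON_ASCII.matcher(text);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String sample = "na\u00efve";   // contains i-diaeresis, U+00EF

        // A range up to 0xFF accepts Latin-1 characters, so it is not an ASCII test.
        System.out.println(sample.matches("[\\x00-\\xFF]*"));   // true
        // Restricting the range to 0x7F rejects the sample.
        System.out.println(sample.matches("[\\x00-\\x7F]*"));   // false

        System.out.println(firstNonAscii(sample));              // 2
        System.out.println(isAllAscii(""));                     // true
    }
}
```

For a small file, the text could be read with Files.readString (Java 11 or later) and passed to firstNonAscii; for large files a streaming check like the C filter is the better fit.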