Find Unique Characters in a File

Solution 1

Here's a PowerShell example:

gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique

which produces:

D
Y
a
b
o

I like that it's easy to read.

EDIT: Here's a faster version that uses a hashtable as a set, so it avoids sorting the whole character stream:

$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys

Solution 2

BASH shell script version (no sed/awk):

while read -r -n 1 char; do echo "$char"; done < entry.txt | tr '[A-Z]' '[a-z]' | sort -u

UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using std::set. If run time is important, this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.

#include <cctype>    /* isspace, tolower */
#include <iostream>
#include <set>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while ( std::cin.get(ch) ) {
        if (! isspace(static_cast<unsigned char>(ch)) ) {
            seen_chars.insert(tolower(static_cast<unsigned char>(ch)));
        }
    }

    for( iter = seen_chars.begin(); iter != seen_chars.end(); ++iter ) {
        std::cout << *iter << std::endl;
    }

    return 0;
}

Note that whitespace is ignored and the matching is case-insensitive, as requested.

For a 450,000+ entry file (chars.txt), here's a sample run time:

[user@host]$ g++ -o unique_chars unique_chars.cpp 
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y

real    0m0.638s
user    0m0.612s
sys     0m0.017s

Solution 3

As requested, a pure shell-script "solution":

sed -e "s/./\0\n/g" inputfile | sort -u

It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.

For even more ridiculousness, I present the version that dumps the output on one line:

sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done

Solution 4

Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or, more generally, a dictionary) implementation and just omit the value field; use your characters as keys. These data structures filter out duplicate entries by design (hence the name set, from its mathematical usage: sets have no particular order and contain only unique values).
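
For instance, here is a minimal C++ sketch of that idea; it assumes input arrives on standard input and that case and whitespace should be ignored, as in the question, and it uses std::unordered_set, i.e. a hash table with the value field omitted:

#include <cctype>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<char> seen;   /* hash table used purely as a set: keys only, no values */
    char ch;

    while (std::cin.get(ch)) {
        if (!std::isspace(static_cast<unsigned char>(ch))) {
            seen.insert(static_cast<char>(std::tolower(static_cast<unsigned char>(ch))));
        }
    }

    for (char c : seen) {            /* sets carry no particular order, so neither does the output */
        std::cout << c;
    }
    std::cout << '\n';
    return 0;
}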

Solution 5

Quick and dirty C program that's blazingly fast:

#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[c] = 1;
  for(c = 32; c < 127; c++)  // printable chars only
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}

Compile it, then do

cat file | ./a.out

to get a list of the unique printable characters in file.
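
As pointed out in the comments below, this program is case-sensitive. A minimal sketch of a case-folding variant (assuming plain ASCII input is enough, as above) would call tolower() on each character before marking it:

#include <ctype.h>
#include <stdio.h>

int main(void)
{
  int chars[256] = {0}, c;
  while((c = getchar()) != EOF)
    chars[tolower(c)] = 1;     /* getchar() yields 0..255 or EOF, so tolower() is safe here */
  for(c = 32; c < 127; c++)    /* printable chars only; upper-case slots are never set now */
  {
    if(chars[c])
      putchar(c);
  }

  putchar('\n');

  return 0;
}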

Comments

  • Admin
    Admin almost 2 years

    I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.

    For instance, if my file were the following;

    Entry
    -----
    Yabba
    Dabba
    Doo
    

    Then the result would be

    Unique characters: {abdoy}

    Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.

    Update

    I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.

    Update 2

By fast, I mean fast to implement... not necessarily fast to run.

  • Joachim Sauer
    Joachim Sauer over 15 years
    Using a set is slightly faster than using a list and checking for contains every time, as @Jough did.
  • Dustin
    Dustin over 15 years
    Which alphabet? What if the alphabet you use is significantly smaller than the file? What if it's considerably larger?
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    Not only “slightly” actually. For large files, the difference is truly significant, i.e. in the order of O(n^2) vs. O(n).
  • Russ Bradberry
    Russ Bradberry over 15 years
Well, he already gave his file size in the question. I'm not sure of the performance differences in each; I imagine it would depend on how much memory you have.
  • Joachim Sauer
    Joachim Sauer over 15 years
This only works if you assume an 8-bit encoding and therefore don't support Unicode characters. Or at least you'll need to modify the bitfield size.
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    You're missing a wc -l at the end. Other than that, nice solution. I tried something similar but didn't get it to work (I forgot the g option on sed).
  • Joachim Sauer
    Joachim Sauer over 15 years
To me sed/awk are part of a shell: if the shell is available and sed/awk are not, then I'm in bizarro-world ;-)
  • Joachim Sauer
    Joachim Sauer over 15 years
But that's some nice work there ... always good to learn some new tricks.
  • Joachim Sauer
    Joachim Sauer over 15 years
    But: "sort -u" should be a lot more efficient than "sort | uniq"
  • Jay
    Jay over 15 years
    @saua - Noted, I removed the call to uniq after I checked the manpage to sort. Old habit ;)
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    But my solution works, the array doesn't (of course I'm using Unicode files). ;-)
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    I also claim that the array solution is in fact an optimized and specialized set.
  • Joachim Sauer
    Joachim Sauer over 15 years
I'm afraid the while-loop at the beginning really slows things down, as this is still by far the slowest bash alternative in this thread, it seems.
  • Joachim Sauer
    Joachim Sauer over 15 years
    "wc -l"? As I understand he's not interested in the number of unique characters but in which ones there are. Did I get that wrong?
  • Joachim Sauer
    Joachim Sauer over 15 years
    @Konrad: but I stole your sed-magic using "\0" it looks much nicer that way. Thanks ;-)
  • Konrad Rudolph
    Konrad Rudolph over 15 years
    You're welcome. I'll probably delete my other answer in due course since it's now redundant.
  • Greg Hewgill
    Greg Hewgill over 15 years
    Or what if the file is very large but only contains a few characters from the alphabet? This will scan through the whole input file multiple times.
  • Jay
    Jay over 15 years
    @saua - yes, not exactly speedy as requested, but then if speed is the major concern I'd probably want to write this in C anyway.
  • Russ Bradberry
    Russ Bradberry over 15 years
Point taken. Then maybe after finding the character you remove all instances of it from the search string; this way it speeds up as it finds more.
  • EvilTeach
    EvilTeach over 15 years
He is talking characters, not Unicode. He also could probably have implemented an adequate solution by now, without help.
  • Cervo
    Cervo over 15 years
Why not just use a dictionary? It should be of the same order of magnitude as a set, and it is as integrated into the language as lists.
  • codelogic
    codelogic over 15 years
    Nice, very fast, however it doesn't handle accented characters (anything non-ASCII). Try this test case: cat /usr/share/dict/american-english | ./a.out
  • Adam Rosenfield
    Adam Rosenfield over 15 years
    I assumed that Brig only cared about ASCII. If not, that's easy to fix - just change the 127 in the loop bound to 256.
  • Jay
    Jay over 15 years
    I agree, nice and fast! Doesn't ignore the case of the characters as requested, but easy enough to fix with a call to tolower() in the while loop.
  • Yuliy
    Yuliy over 15 years
You're confusing n's here. The n in question for this algorithm is the number of unique characters, not the size of the file, so it's not that big a performance hit. That said, if you only care about ASCII characters, then a bitfield is the easiest approach.
  • Jay
    Jay over 15 years
This doesn't ignore case; if you pipe the sed output through "tr [A-Z] [a-z]" before passing it to sort, it'll ignore case too, and it's only a few tenths of a second slower.
  • Lasse V. Karlsen
    Lasse V. Karlsen over 15 years
There are ways to make this super fast by using assembly code piped to an assembler, thus keeping in line with the "script" part of the question, but my solution is quick to implement and runs decently fast. Yes, it can be improved, but is it necessary?
  • Admin
    Admin over 15 years
    As much as I like PowerShell, it is not a good solution for this many rows. The script has been running for at least 5 min using 1.7 GB of memory.
  • Jay Bazuzi
    Jay Bazuzi over 15 years
    Bummer. I wonder if PowerShell v2 will do any better? I bet replacing the sort with an accumulation set (like some of the other answers) would improve the memory performance. I will ponder it.
  • Jay Bazuzi
    Jay Bazuzi over 15 years
    There, I wrote one that uses a hashtable as a set.
  • Abdul
    Abdul over 14 years
The while at the end could be replaced by tr -d "\n".