Find Unique Characters in a File
Solution 1
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
Solution 2
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr '[A-Z]' '[a-z]' | sort -u
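The one-character `read` loop is the slow part of the pipeline above. If GNU grep is available (an assumption; `-o` is not POSIX), it can do the character splitting instead, still without sed/awk. A sketch, with an illustrative sample file:

```shell
# Split input into one character per line with GNU grep's -o flag,
# lowercase it, de-duplicate, and drop whitespace-only lines.
printf 'Yabba Dabba Doo\n' > entry.txt   # sample input for illustration
grep -o . entry.txt | tr '[:upper:]' '[:lower:]' | sort -u | grep -v '^[[:space:]]*$'
```

On the sample input this prints a, b, d, o, y, one character per line.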
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
#include <cctype>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while (std::cin.get(ch)) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }

    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << std::endl;
    }
    return 0;
}
Note that it ignores whitespace and is case-insensitive, as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
Solution 3
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Solution 4
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or, more generally, a dictionary) implementation and just omit the value field, using your characters as keys. These data structures filter out duplicate entries (hence the name set, from its mathematical usage: sets have no particular order and hold only unique values).
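Since the other answers here are shell-based, the same idea can be sketched in Bash 4+ (an assumption: associative arrays and `${var,,}` require it), using an associative array whose keys act as the set:

```shell
#!/usr/bin/env bash
# Sketch: an associative array used as a set -- the keys are the characters
# seen so far, the values are ignored. Filename and input are illustrative.
printf 'Yabba Dabba Doo\n' > entry.txt   # sample input

declare -A seen
while IFS= read -r -n 1 ch; do
    [ -z "$ch" ] && continue                 # read yields '' at each newline
    [[ "$ch" == [[:space:]] ]] && continue   # ignore whitespace, as requested
    seen["${ch,,}"]=1                        # ${ch,,} lowercases (Bash 4+)
done < entry.txt

# Associative-array key order is arbitrary, so sort for readable output.
printf '%s\n' "${!seen[@]}" | sort
```

Inserting into the array is the whole deduplication step; no explicit "have I seen this before" check is needed.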
Solution 5
Quick and dirty C program that's blazingly fast:
#include <stdio.h>

int main(void)
{
    int chars[256] = {0}, c;

    while ((c = getchar()) != EOF)
        chars[c] = 1;

    for (c = 32; c < 127; c++) // printable chars only
    {
        if (chars[c])
            putchar(c);
    }
    putchar('\n');
    return 0;
}
Compile it, then do
cat file | ./a.out
to get a list of the unique printable characters in file.
Admin
Updated on June 03, 2022
Comments
-
Admin almost 2 years
I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry ----- Yabba Dabba Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
-
Joachim Sauer over 15 yearsUsing a set is slightly faster than using a list and checking for contains every time, as @Jough did.
-
Dustin over 15 yearsWhich alphabet? What if the alphabet you use is significantly smaller than the file? What if it's considerably larger?
-
Konrad Rudolph over 15 yearsNot only “slightly” actually. For large files, the difference is truly significant, i.e. in the order of O(n^2) vs. O(n).
-
Russ Bradberry over 15 yearsWell, he already gave his file size in the question. I'm not sure of the performance differences in each; I imagine it would depend on how much memory you have.
-
Joachim Sauer over 15 yearsThis only works if you assume an 8-bit encoding and therefore don't support Unicode characters. Or at least you'll need to modify the bitfield size.
-
Konrad Rudolph over 15 yearsYou're missing a wc -l at the end. Other than that, nice solution. I tried something similar but didn't get it to work (I forgot the g option on sed).
-
Joachim Sauer over 15 yearsTo me sed/awk are part of a shell: If the shell is available and sed/awk are not, then I'm in bizzaro-world ;-)
-
Joachim Sauer over 15 yearsBut thats some nice work there ... always good to learn some new tricks.
-
Joachim Sauer over 15 yearsBut: "sort -u" should be a lot more efficient than "sort | uniq"
-
Jay over 15 years@saua - Noted, I removed the call to uniq after I checked the manpage to sort. Old habit ;)
-
Konrad Rudolph over 15 yearsBut my solution works, the array doesn't (of course I'm using Unicode files). ;-)
-
Konrad Rudolph over 15 yearsI also claim that the array solution is in fact an optimized and specialized set.
-
Joachim Sauer over 15 yearsI'm afraid the while-loop at the beginning really slows things down, as this is still by far the slowest bash alternative in this thread, it seems.
-
Joachim Sauer over 15 years"wc -l"? As I understand he's not interested in the number of unique characters but in which ones there are. Did I get that wrong?
-
Joachim Sauer over 15 years@Konrad: but I stole your sed-magic using "\0" it looks much nicer that way. Thanks ;-)
-
Konrad Rudolph over 15 yearsYou're welcome. I'll probably delete my other answer in due course since it's now redundant.
-
Greg Hewgill over 15 yearsOr what if the file is very large but only contains a few characters from the alphabet? This will scan through the whole input file multiple times.
-
Jay over 15 years@saua - yes, not exactly speedy as requested, but then if speed is the major concern I'd probably want to write this in C anyway.
-
Russ Bradberry over 15 yearsPoint taken. Then maybe after finding the character you remove all instances of it from the search string; this way it speeds up as it finds more.
-
EvilTeach over 15 yearsHe is talking characters, not Unicode. He also could have probably implemented an adequate solution by now, without help.
-
Cervo over 15 yearsWhy not just use a dictionary? It should be of the same order of magnitude as a set, and is as integrated into the language as lists.
-
codelogic over 15 yearsNice, very fast, however it doesn't handle accented characters (anything non-ASCII). Try this test case: cat /usr/share/dict/american-english | ./a.out
-
Adam Rosenfield over 15 yearsI assumed that Brig only cared about ASCII. If not, that's easy to fix - just change the 127 in the loop bound to 256.
-
Jay over 15 yearsI agree, nice and fast! Doesn't ignore the case of the characters as requested, but easy enough to fix with a call to tolower() in the while loop.
-
Yuliy over 15 yearsYou're confusing n's here. the n in question for this algorithm is the number of unique characters, not the size of the file, so it's not that big a performance hit. That said, if you only care about ASCII characters then a bitfield is the easiest approach.
-
Jay over 15 yearsThis doesn't ignore case; if you pipe the sed output through "tr [A-Z] [a-z]" before passing to sort, it'll ignore case too, and it's only a few tenths of a second slower.
-
Lasse V. Karlsen over 15 yearsThere's ways to make this superfast by using assembly code piped to an assembler, thus keeping in line with the "script" part of the question, but my solution is quick to implement and runs decently fast. Yes, it can be improved, but is it necessary?
-
Admin over 15 yearsAs much as I like PowerShell, it is not a good solution for this many rows. The script has been running for at least 5 min using 1.7 GB of memory.
-
Jay Bazuzi over 15 yearsBummer. I wonder if PowerShell v2 will do any better? I bet replacing the sort with an accumulation set (like some of the other answers) would improve the memory performance. I will ponder it.
-
Jay Bazuzi over 15 yearsThere, I wrote one that uses a hashtable as a set.
-
Abdul over 14 yearsThe while at the end could be replaced by tr -d "\n".