Find Unique Characters in a File
Solution 1
Here's a PowerShell example:
gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | sort -CaseSensitive -Unique
which produces:
D
Y
a
b
o
I like that it's easy to read.
EDIT: Here's a faster version:
$letters = @{} ; gc file.txt | select -Skip 2 | % { $_.ToCharArray() } | % { $letters[$_] = $true } ; $letters.Keys
Solution 2
BASH shell script version (no sed/awk):
while read -n 1 char; do echo "$char"; done < entry.txt | tr '[A-Z]' '[a-z]' | sort -u
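The one-character `read` loop is the slow part of the pipeline above. If GNU grep is available (an assumption; `-o` is not POSIX), it can do the character splitting instead, still without sed/awk. A sketch, with an illustrative sample file:

```shell
# Split input into one character per line with GNU grep's -o flag,
# lowercase it, de-duplicate, and drop whitespace-only lines.
printf 'Yabba Dabba Doo\n' > entry.txt   # sample input for illustration
grep -o . entry.txt | tr '[:upper:]' '[:lower:]' | sort -u | grep -v '^[[:space:]]*$'
```

On the sample input this prints a, b, d, o, y, one character per line.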
UPDATE: Just for the heck of it, since I was bored and still thinking about this problem, here's a C++ version using set. If run time is important this would be my recommended option, since the C++ version takes slightly more than half a second to process a file with 450,000+ entries.
#include <iostream>
#include <set>
#include <cctype>

int main() {
    std::set<char> seen_chars;
    std::set<char>::const_iterator iter;
    char ch;

    /* ignore whitespace and case */
    while (std::cin.get(ch)) {
        if (!isspace(ch)) {
            seen_chars.insert(tolower(ch));
        }
    }

    for (iter = seen_chars.begin(); iter != seen_chars.end(); ++iter) {
        std::cout << *iter << std::endl;
    }
    return 0;
}
Note that it ignores whitespace and is case-insensitive, as requested.
For a 450,000+ entry file (chars.txt), here's a sample run time:
[user@host]$ g++ -o unique_chars unique_chars.cpp
[user@host]$ time ./unique_chars < chars.txt
a
b
d
o
y
real 0m0.638s
user 0m0.612s
sys 0m0.017s
Solution 3
As requested, a pure shell-script "solution":
sed -e "s/./\0\n/g" inputfile | sort -u
It's not nice, it's not fast and the output is not exactly as specified, but it should work ... mostly.
For even more ridiculousness, I present the version that dumps the output on one line:
sed -e "s/./\0\n/g" inputfile | sort -u | while read c; do echo -n "$c" ; done
Solution 4
Use a set data structure. Most programming languages / standard libraries come with one flavour or another. If they don't, use a hash table (or, more generally, a dictionary) implementation and just omit the value field, using your characters as keys. These data structures filter out duplicate entries (hence the name set, from its mathematical usage: sets have no particular order and hold only unique values).
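Since the other answers here are shell-based, the same idea can be sketched in Bash 4+ (an assumption: associative arrays and `${var,,}` require it), using an associative array whose keys act as the set:

```shell
#!/usr/bin/env bash
# Sketch: an associative array used as a set -- the keys are the characters
# seen so far, the values are ignored. Filename and input are illustrative.
printf 'Yabba Dabba Doo\n' > entry.txt   # sample input

declare -A seen
while IFS= read -r -n 1 ch; do
    [ -z "$ch" ] && continue                 # read yields '' at each newline
    [[ "$ch" == [[:space:]] ]] && continue   # ignore whitespace, as requested
    seen["${ch,,}"]=1                        # ${ch,,} lowercases (Bash 4+)
done < entry.txt

# Associative-array key order is arbitrary, so sort for readable output.
printf '%s\n' "${!seen[@]}" | sort
```

Inserting into the array is the whole deduplication step; no explicit "have I seen this before" check is needed.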
Solution 5
Quick and dirty C program that's blazingly fast:
#include <stdio.h>

int main(void)
{
    int chars[256] = {0}, c;

    while ((c = getchar()) != EOF)
        chars[c] = 1;

    for (c = 32; c < 127; c++) // printable chars only
    {
        if (chars[c])
            putchar(c);
    }
    putchar('\n');
    return 0;
}
Compile it, then do
cat file | ./a.out
to get a list of the unique printable characters in file.
Admin
Updated on June 03, 2022
Comments
-
Admin almost 2 years
I have a file with 450,000+ rows of entries. Each entry is about 7 characters in length. What I want to know is the unique characters of this file.
For instance, if my file were the following;
Entry ----- Yabba Dabba Doo
Then the result would be
Unique characters: {abdoy}
Notice I don't care about case and don't need to order the results. Something tells me this is very easy for the Linux folks to solve.
Update
I'm looking for a very fast solution. I really don't want to have to create code to loop over each entry, loop through each character...and so on. I'm looking for a nice script solution.
Update 2
By Fast, I mean fast to implement...not necessarily fast to run.
-
Joachim Sauer over 15 yearsUsing a set is slightly faster than using a list and checking for contains every time, as @Jough did.
-
Dustin over 15 yearsWhich alphabet? What if the alphabet you use is significantly smaller than the file? What if it's considerably larger?
-
Konrad Rudolph over 15 yearsNot only “slightly” actually. For large files, the difference is truly significant, i.e. in the order of O(n^2) vs. O(n).
-
Russ Bradberry over 15 yearsWell, he already gave his file size in the question. I'm not sure of the performance differences in each; I imagine it would depend on how much memory you have.
-
Joachim Sauer over 15 yearsThis only works if you assume an 8-bit encoding and therefore don't support Unicode characters. Or at least you'll need to modify the bitfield size.
-
Konrad Rudolph over 15 yearsYou're missing a wc -l at the end. Other than that, nice solution. I tried something similar but didn't get it to work (I forgot the g option on sed).
-
Joachim Sauer over 15 yearsTo me sed/awk are part of a shell: If the shell is available and sed/awk are not, then I'm in bizzaro-world ;-)
-
Joachim Sauer over 15 yearsBut thats some nice work there ... always good to learn some new tricks.
-
Joachim Sauer over 15 yearsBut: "sort -u" should be a lot more efficient than "sort | uniq"
-
Jay over 15 years@saua - Noted, I removed the call to uniq after I checked the manpage to sort. Old habit ;)
-
Konrad Rudolph over 15 yearsBut my solution works, the array doesn't (of course I'm using Unicode files). ;-)
-
Konrad Rudolph over 15 yearsI also claim that the array solution is in fact an optimized and specialized set.
-
Joachim Sauer over 15 yearsI'm afraid the while-loop at the beginning really slows things down, as this is still by far the slowest bash alternative in this thread, it seems.
-
Joachim Sauer over 15 years"wc -l"? As I understand he's not interested in the number of unique characters but in which ones there are. Did I get that wrong?
-
Joachim Sauer over 15 years@Konrad: but I stole your sed-magic using "\0" it looks much nicer that way. Thanks ;-)
-
Konrad Rudolph over 15 yearsYou're welcome. I'll probably delete my other answer in due course since it's now redundant.
-
Greg Hewgill over 15 yearsOr what if the file is very large but only contains a few characters from the alphabet? This will scan through the whole input file multiple times.
-
Jay over 15 years@saua - yes, not exactly speedy as requested, but then if speed is the major concern I'd probably want to write this in C anyway.
-
Russ Bradberry over 15 yearsPoint taken. Then maybe after finding the character you remove all instances of it from the search string; this way it speeds up as it finds more.
-
EvilTeach over 15 yearsHe is talking characters, not Unicode. He also could have probably implemented an adequate solution by now, without help.
-
Cervo over 15 yearsWhy not just use a dictionary? It should be of the same order of magnitude as a set, and is as integrated into the language as lists.
-
codelogic over 15 yearsNice, very fast, however it doesn't handle accented characters (anything non-ASCII). Try this test case: cat /usr/share/dict/american-english | ./a.out
-
Adam Rosenfield over 15 yearsI assumed that Brig only cared about ASCII. If not, that's easy to fix - just change the 127 in the loop bound to 256.
-
Jay over 15 yearsI agree, nice and fast! Doesn't ignore the case of the characters as requested, but easy enough to fix with a call to tolower() in the while loop.
-
Yuliy over 15 yearsYou're confusing n's here. the n in question for this algorithm is the number of unique characters, not the size of the file, so it's not that big a performance hit. That said, if you only care about ASCII characters then a bitfield is the easiest approach.
-
Jay over 15 yearsThis doesn't ignore case; if you pipe the sed output through "tr [A-Z] [a-z]" before passing to sort, it'll ignore case too, and it's only a few tenths of a second slower.
-
Lasse V. Karlsen over 15 yearsThere's ways to make this superfast by using assembly code piped to an assembler, thus keeping in line with the "script" part of the question, but my solution is quick to implement and runs decently fast. Yes, it can be improved, but is it necessary?
-
Admin over 15 yearsAs much as I like PowerShell, it is not a good solution for this many rows. The script has been running for at least 5 min using 1.7 GB of memory.
-
Jay Bazuzi over 15 yearsBummer. I wonder if PowerShell v2 will do any better? I bet replacing the sort with an accumulation set (like some of the other answers) would improve the memory performance. I will ponder it.
-
Jay Bazuzi over 15 yearsThere, I wrote one that uses a hashtable as a set.
-
Abdul over 14 yearsThe while at the end could be replaced by tr -d "\n".