Create a term-document matrix from files
Solution 1
With Python, you can install textmining through

sudo pip install textmining

Then, create a new file – let's call it matrix.py – and add the following:
#!/usr/bin/env python
import textmining
import glob

# Build a term-document matrix from all text files in the directory
tdm = textmining.TermDocumentMatrix()
files = glob.glob("/Users/foo/files/*.txt")
print(files)
for f in files:
    content = open(f).read()
    content = content.replace('\n', ' ')  # join all lines into one document
    tdm.add_doc(content)
tdm.write_csv('matrix.csv', cutoff=1)
Save it, make it executable with chmod +x matrix.py, then simply run it with ./matrix.py.
This program searches the directory specified in glob() and writes the output matrix to matrix.csv in your current directory.
The only drawback is that it doesn't output the document names. We can prepend this list, though, using a couple of bash commands – we only need a list of the file names:
echo "" > files.txt; find /Users/foo/files/ -type f -iname "*.txt" >> files.txt
And then, paste this together with matrix.csv:
paste -d , files.txt matrix.csv > matrix2.csv
Voilà, our complete term-document matrix:
I could imagine there are less convoluted solutions, but this is Python and I don't know it well enough to change the code to output the entire correct matrix.
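For completeness, here is one less convoluted way this could look: a sketch that writes the file name as the first column directly from Python, using only the standard library (collections.Counter) instead of the textmining package. The function name and the directory path are my own examples.

```python
#!/usr/bin/env python
# Sketch: build a term-document matrix with the standard library only,
# writing each file name as the first column so no bash post-processing
# is needed. The directory path below is just an example.
import csv
import glob
from collections import Counter

def build_matrix(paths):
    counts = {}    # file path -> Counter of terms in that file
    vocab = set()  # every term seen in any file
    for path in sorted(paths):
        with open(path) as f:
            terms = f.read().split()
        counts[path] = Counter(terms)
        vocab.update(terms)
    header = [''] + sorted(vocab)
    rows = [[path] + [counts[path][term] for term in sorted(vocab)]
            for path in sorted(counts)]
    return [header] + rows

if __name__ == '__main__':
    matrix = build_matrix(glob.glob('/Users/foo/files/*.txt'))
    with open('matrix.csv', 'w') as out:
        csv.writer(out).writerows(matrix)
```

A Counter returns 0 for terms missing from a file, which fills in the zero cells of the matrix automatically.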
Solution 2
It is almost the same as slhck's solution. I just added the bash commands inside the Python script, executed via os.system, to put everything in one Python script without the necessity of switching between the Python and bash consoles.
#!/usr/bin/env python
import textmining
import glob
import os

tdm = textmining.TermDocumentMatrix()
files = glob.glob("/Users/andi/Desktop/python_nltk/dane/*.txt")
# Write the list of file names to files.txt (the leading blank line
# from echo "" lines up with the header row of the matrix)
os.system("""echo "" > files.txt; find /Users/andi/Desktop/python_nltk/dane -type f -iname "*.txt" >> files.txt""")
print(files)
for f in files:
    content = open(f).read()
    content = content.replace('\n', ' ')
    tdm.add_doc(content)
tdm.write_csv('matrix.csv', cutoff=1)
# Prepend the file names as the first column
os.system("""paste -d , files.txt matrix.csv > matrix2.csv""")
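If you would rather not shell out at all, the paste step can also be done in Python itself. A sketch of a stand-in for the second os.system() call; the function name and file names are my own examples:

```python
# Sketch: a pure-Python stand-in for `paste -d , files.txt matrix.csv`,
# replacing the second os.system() call. File names are examples.

def paste_csv(names_path, matrix_path, out_path):
    # Join the two files line by line, separated by a comma, just as
    # paste -d , does. The leading blank line written by echo ""
    # lines up with the header row of matrix.csv.
    with open(names_path) as names, open(matrix_path) as matrix, \
         open(out_path, 'w') as out:
        for name, row in zip(names, matrix):
            out.write(name.rstrip('\n') + ',' + row)
```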
Solution 3
I can't give you something as pretty as slhck's Python solution, but here's a pure bash one:
printf "\t" &&
for file in ex*; do \
printf "%-15s" "$file ";
done &&
echo "" &&
while read fruit; do \
printf "$fruit\t";
for file in ex*; do \
printf "%-15s" `grep -wc $fruit $file`;
done;
echo "";
done < superset.txt
If you copy/paste that horrible thing into a terminal, assuming your list of fruits is in a file called superset.txt
with one fruit per line, you get:
example1 example2 example3
apple 1 2 2
banana 1 1 2
mango 0 1 1
orange 1 1 2
pear 0 1 1
plum 0 0 1
EXPLANATION:

- printf "\t": print a TAB to have the filenames aligned to the end of the fruit names.
- for file in ex*; [...] done: print the file names (assuming they are the only files whose names start with ex).
- echo "": print a new line.
- while read fruit; do [...]; done < list: list must be a text file containing the superset you mentioned, i.e., all fruit, one fruit per line. This file is read in this loop and each fruit is saved as $fruit.
- printf "$fruit\t": print the fruit name and a TAB.
- for file in ex*; do [...]; done: here we go through each file again and use grep -wc $fruit $file to get the number of lines on which the fruit we are currently processing was found in that file (-w matches whole words only, -c counts matching lines).
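Since the question mentions being happy with Python, the grep -wc check at the heart of that loop can also be sketched in Python; the function name is mine:

```python
import re

def count_matching_lines(word, path):
    # Python analogue of `grep -wc word path`: count the lines in which
    # `word` occurs as a whole word (\b marks word boundaries, mirroring
    # grep's -w flag). Note this counts lines, not total occurrences.
    pattern = re.compile(r'\b' + re.escape(word) + r'\b')
    with open(path) as f:
        return sum(1 for line in f if pattern.search(line))
```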
You might also be able to use column, but I never have, so I did not try:

The column utility formats its input into multiple columns. Rows are filled before columns. Input is taken from file operands, or, by default, from the standard input. Empty lines are ignored unless the -e option is used.
And here's a Perl one. Technically, this is a one liner, albeit a LONG one:
perl -e 'foreach $file (@ARGV){open(F,"$file"); while(<F>){chomp; $fruits{$_}{$file}++}} print "\t";foreach(sort @ARGV){printf("%-15s",$_)}; print "\n"; foreach $fruit (sort keys(%fruits)){print "$fruit\t"; do {$fruits{$fruit}{$_}||=0; printf("%-15s",$fruits{$fruit}{$_})} for @ARGV; print "\n";}' ex*
Here it is in commented script form which might actually be intelligible:
#!/usr/bin/env perl
foreach $file (@ARGV){  ## cycle through the files
    open(F,"$file");
    while(<F>){
        chomp;  ## remove newlines
        ## Count the fruit. This is a hash of hashes
        ## where the fruit is the first key and the file
        ## the second. For each fruit then, we will end up
        ## with something like this: $fruits{apple}{example1}=1
        $fruits{$_}{$file}++;
    }
}
print "\t";  ## pretty formatting
## Print each of the file names
foreach(sort @ARGV){
    printf("%-15s",$_)
}
print "\n";  ## pretty formatting
## Now, cycle through each of the "fruit" we
## found when reading the files and print its
## count in each file.
foreach $fruit (sort keys(%fruits)){
    print "$fruit\t";  ## print the fruit names
    do {
        $fruits{$fruit}{$_}||=0;             ## Count should be 0 if none were found
        printf("%-15s",$fruits{$fruit}{$_})  ## print the value for each fruit
    } for @ARGV;
    print "\n";  ## pretty formatting
}
This has the benefit of coping with arbitrary "fruit"; no superset is needed. Also, both of these solutions use native *nix tools and do not require the installation of additional packages. That said, the Python solution in slhck's answer is more concise and gives prettier output.
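For comparison, the Perl script's hash of hashes maps almost line for line onto a Python defaultdict of Counters. A sketch under the same assumptions (one fruit per line in each input file; the function name is mine):

```python
from collections import Counter, defaultdict

def tally(paths):
    # Mirror of the Perl script's %fruits hash of hashes:
    # fruits[fruit][file] = number of lines in `file` equal to `fruit`.
    fruits = defaultdict(Counter)
    for path in paths:
        with open(path) as f:
            for line in f:
                fruit = line.strip()  # chomp
                if fruit:
                    fruits[fruit][path] += 1
    return fruits
```

Printing the matrix is then two nested loops over sorted(fruits) and the file list; a Counter already returns 0 for missing entries, which is what the ||=0 line does in the Perl version.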
Updated on September 18, 2022

Comments
- Joe, almost 2 years: I have a set of files from example001.txt to example100.txt. Each file contains a list of keywords from a superset (the superset is available if we want it). So example001.txt might contain

  apple banana ... otherfruit

  I'd like to be able to process these files and produce something akin to a matrix, so there is the list of examples* on the top row, the fruit down the side, and a '1' in a column if the fruit is in the file. An example might be:

  x        example1 example2 example3
  Apple    1        1        0
  Babana   0        1        0
  Coconut  0        1        1

  Any idea how I might build some sort of command-line magic to put this together? I'm on OSX and happy with perl or python...
- slhck, over 11 years: What you're looking for is a so-called "term-document matrix", usually found in information retrieval and text mining applications. This might help you search for solutions. What OS are you on?
- terdon, over 11 years: Will a Perl solution be OK?
- andilabs, over 10 years: +1, that's cool. What about using os.system() and then calling the bash commands from inside the Python script? Wouldn't it be more handy if you just run one Python script?
- slhck, over 10 years: Well, you did it already :) As I said, I wasn't that familiar with Python back then and was more at home in the shell.