Create a term-document matrix from files


Solution 1

With Python, you can install the textmining package via

sudo pip install textmining

Then, create a new file – let's call it matrix.py, and add the following:

#!/usr/bin/env python
import textmining
import glob

tdm = textmining.TermDocumentMatrix()

files = glob.glob("/Users/foo/files/*.txt")
print(files)
for f in files:
    with open(f) as fh:  # close the file when done
        content = fh.read()
    content = content.replace('\n', ' ')  # join lines into one document
    tdm.add_doc(content)
tdm.write_csv('matrix.csv', cutoff=1)

Save it and make it executable with chmod +x matrix.py. Now, simply run it with ./matrix.py. This program searches the directory specified in glob() and writes the output matrix to matrix.csv in your current directory.

The only drawback is that it doesn't output the document names. We can prepend them, though, using a couple of bash commands – all we need is a list of the file names:

echo "" > files.txt; find /Users/foo/files/ -type f -iname "*.txt" >> files.txt

And then, paste this together with the matrix.csv:

paste -d , files.txt matrix.csv > matrix2.csv 

Voilà, our complete term-document matrix:

There may well be less convoluted solutions, but this is Python and I don't know it well enough to change the code so that it outputs the complete matrix directly.

Solution 2

This is almost the same as slhck's solution. I just added the bash commands to the Python script, executed via os.system, so that everything runs from one Python script without having to switch between the Python and bash consoles.

#!/usr/bin/env python
import textmining
import glob
import os

tdm = textmining.TermDocumentMatrix()

files = glob.glob("/Users/andi/Desktop/python_nltk/dane/*.txt")
os.system("""echo "" > files.txt; find /Users/andi/Desktop/python_nltk/dane -type f -iname "*.txt" >> files.txt""")
print(files)
for f in files:
    with open(f) as fh:
        content = fh.read()
    content = content.replace('\n', ' ')
    tdm.add_doc(content)
tdm.write_csv('matrix.csv', cutoff=1)

os.system("""paste -d , files.txt matrix.csv > matrix2.csv""")
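The two shell calls can also be replaced by a pure-Python step, which avoids shelling out entirely. Below is a hedged sketch; prepend_filenames is a hypothetical helper that mimics what the echo/find/paste pipeline produces (an empty cell before the header row, a filename before each data row):

```python
def prepend_filenames(filenames, matrix_rows):
    """Mimic `paste -d , files.txt matrix.csv`.

    Hypothetical helper. matrix_rows are the CSV lines that
    textmining wrote: a header row of terms, then one row per
    document, in the same order as filenames.
    """
    # echo "" put an empty first line in files.txt, so the header
    # row gets an empty leading cell.
    rows = ["," + matrix_rows[0]] if matrix_rows else []
    # Each subsequent line is joined with its filename.
    rows += [name + "," + row
             for name, row in zip(filenames, matrix_rows[1:])]
    return rows
```

Note this assumes the filenames are listed in the same order in which the documents were added to the matrix, just as the find/paste approach does.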

Solution 3

I can't give you something as pretty as slhck's Python solution, but here's a pure bash one:

printf "\t"
for file in ex*; do
    printf "%-15s" "$file"
done
echo ""
while read fruit; do
    printf "%s\t" "$fruit"
    for file in ex*; do
        printf "%-15s" "$(grep -wc "$fruit" "$file")"
    done
    echo ""
done < superset.txt

If you copy/paste that horrible thing into a terminal, assuming your list of fruits is in a file called superset.txt with one fruit per line, you get:

        example1       example2       example3       
apple   1              2              2              
banana  1              1              2              
mango   0              1              1              
orange  1              1              2              
pear    0              1              1              
plum    0              0              1              

EXPLANATION:

  • printf "\t" : print a TAB to have the filenames aligned to the end of the fruit names.
  • for file in ex*; [...] done : print the file names (assuming they are the only files whose names start with ex).
  • echo "" : print a new line
  • while read fruit; do [...]; done < superset.txt : superset.txt must be a text file containing the superset you mentioned, i.e., all fruit, one fruit per line. This file is read in the loop and each fruit is saved as $fruit.
  • printf "$fruit\t"; : print the fruit name and a TAB.
  • for file in ex*; do [...]; done : Here we go through each file again and use grep -wc $fruit $file to get the number of times the fruit we are currently processing was found in that file.

You might also be able to use column, but I have never used it myself so I did not try:

 The column utility formats its input into multiple columns.
 Rows are filled before columns. Input is taken from file
 operands, or, by default, from the standard input. Empty
 lines are ignored unless the -e option is used.

And here's a Perl one. Technically, this is a one-liner, albeit a LONG one:

perl -e 'foreach $file (@ARGV){open(F,"$file"); while(<F>){chomp; $fruits{$_}{$file}++}} print "\t";foreach(sort @ARGV){printf("%-15s",$_)}; print "\n"; foreach $fruit (sort keys(%fruits)){print "$fruit\t"; do {$fruits{$fruit}{$_}||=0; printf("%-15s",$fruits{$fruit}{$_})} for @ARGV; print "\n";}' ex*

Here it is in commented script form which might actually be intelligible:

#!/usr/bin/env perl
foreach $file (@ARGV){ ## cycle through the files
    open(F,"$file");
    while(<F>){
        chomp; ## remove newlines
        ## Count the fruit. This is a hash of hashes
        ## where the fruit is the first key and the file
        ## the second. For each fruit then, we will end up
        ## with something like this: $fruits{apple}{example1}=1
        $fruits{$_}{$file}++;
    }
}
print "\t"; ## pretty formatting

## Print each of the file names
foreach(sort @ARGV){
    printf("%-15s",$_)
}
print "\n";  ## pretty formatting

## Now, cycle through each of the "fruit" we 
## found when reading the files and print its
## count in each file.
foreach $fruit (sort keys(%fruits)){
    print "$fruit\t"; ## print the fruit names
    do {
        $fruits{$fruit}{$_}||=0; ## Count should be 0 if none were found
        printf("%-15s",$fruits{$fruit}{$_}) ## print the value for each fruit
    } for @ARGV;
    print "\n"; ## pretty formatting
} 
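For comparison, the Perl hash-of-hashes translates naturally to a Python dict of dicts. The sketch below is hypothetical (count_lines is not from any of the answers above) and, like the Perl version, it treats each input line as one fruit:

```python
from collections import defaultdict


def count_lines(files_contents):
    """files_contents maps filename -> file text.

    Returns {fruit: {filename: count}}, mirroring the %fruits
    hash of hashes in the Perl script, e.g.
    result["apple"]["example1"] == 1.
    """
    fruits = defaultdict(lambda: defaultdict(int))
    for fname, text in files_contents.items():
        for line in text.splitlines():
            line = line.strip()  # the equivalent of chomp
            if line:
                fruits[line][fname] += 1
    return fruits
```

As with the Perl version, any fruit/file pair never seen simply counts as zero, so no superset file is required.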

This has the benefit of coping with arbitrary "fruit"; no superset file is needed. Also, both of these solutions use native *nix tools and do not require the installation of additional packages. That said, the Python solution in slhck's answer is more concise and gives prettier output.



Updated on September 18, 2022

Comments

  • Joe
    Joe almost 2 years

    I have a set of files from example001.txt to example100.txt. Each file contains a list of keywords from a superset (the superset is available if we want it).

    So example001.txt might contain

    apple
    banana
    ...
    otherfruit
    

    I'd like to be able to process these files and produce something akin to a matrix so there is the list of examples* on the top row, the fruit down the side, and a '1' in a column if the fruit is in the file.

    An example might be...

    x           example1    example2   example3
    Apple         1            1          0
    Banana        0            1          0
    Coconut       0            1          1
    

    Any idea how I might build some sort of command-line magic to put this together? I'm on OSX and happy with perl or python...

    • slhck
      slhck over 11 years
      What you're looking for is a so-called "term document matrix", usually found in information retrieval and text mining applications. This might help you search for a solution. What OS are you on?
    • terdon
      terdon over 11 years
      Will a Perl solution be OK?
  • andilabs
    andilabs over 10 years
    +1, that's cool. What about using os.system() and then calling the bash commands from inside the Python script? Wouldn't it be handier if you just ran one Python script?
  • slhck
    slhck over 10 years
    Well, you did it already :) As I said, I wasn't that familiar with Python back then and was more at home in the shell.