Saving huge bigram dictionary to file using pickle


Solution 1

Pickle is meant for writing complete objects in one go, which works best when they are small. Your dictionary is large enough that even holding it in memory is a stretch, so you would be better off using a database: it lets you store and retrieve entries one by one instead of all at once.

Some good single-file database formats that are easy to integrate and usable from Python are SQLite and the DBM variants. A DBM database acts much like a dictionary (you read and write key/value pairs) but uses the disk for storage rather than 1.2 GB of memory.
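As a rough illustration of the DBM idea, here is a minimal sketch using the `dbm` module from the Python 3 standard library. The key layout (both words joined by a tab) and the helper names `add_bigram` and `get_count` are assumptions made for this example, not something from the question.

```python
import dbm
import os
import tempfile


def make_key(first, second):
    # Assumed key layout: "first<TAB>second". dbm keys and values are bytes.
    return (first + "\t" + second).encode("utf-8")


def add_bigram(db, first, second):
    """Increment the on-disk count for the pair (first, second)."""
    key = make_key(first, second)
    current = int(db[key]) if key in db else 0
    db[key] = str(current + 1).encode("utf-8")


def get_count(db, first, second):
    """Return the stored count, or 0 if the pair was never seen."""
    key = make_key(first, second)
    return int(db[key]) if key in db else 0


# A temporary location for the demo; a real run would use a fixed path.
path = os.path.join(tempfile.mkdtemp(), "bigrams")
with dbm.open(path, "c") as db:  # "c" creates the database if needed
    pairs = [("Hello", "World"), ("World", "domination"), ("World", "domination")]
    for first, second in pairs:
        add_bigram(db, first, second)
    hello_world = get_count(db, "Hello", "World")
    world_domination = get_count(db, "World", "domination")
```

Only the entries being touched are read or written, so memory use stays small regardless of how many bigrams end up on disk.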

Solution 2

Do you really need all the data in memory at once? If you want to keep the dictionary/pickle approach, you could split it in naive ways, such as one file per year or per month.
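A minimal sketch of that splitting idea in Python 3, assuming the counts for each period have already been built separately. The helpers `dump_chunk` and `merge_chunks`, the file naming, and the per-year sample counts are all hypothetical.

```python
import os
import pickle
import tempfile
from collections import Counter


def dump_chunk(counts, directory, name):
    """Pickle one chunk's bigram counts to its own file."""
    path = os.path.join(directory, name + ".pkl")
    with open(path, "wb") as f:  # pickle data must be written in binary mode
        pickle.dump(counts, f, protocol=pickle.HIGHEST_PROTOCOL)
    return path


def merge_chunks(paths):
    """Load chunk files one at a time and add their counts together."""
    total = Counter()
    for path in paths:
        with open(path, "rb") as f:
            total.update(pickle.load(f))  # Counter.update adds counts
    return total


out_dir = tempfile.mkdtemp()
# Hypothetical per-year counts, keyed by (first_word, second_word).
chunk_2008 = Counter({("Hello", "World"): 1, ("World", "domination"): 1})
chunk_2009 = Counter({("World", "domination"): 1, ("Total", "World"): 1})
paths = [
    dump_chunk(chunk_2008, out_dir, "2008"),
    dump_chunk(chunk_2009, out_dir, "2009"),
]
merged = merge_chunks(paths)
```

Merging every chunk brings the full dictionary back into memory, so in practice you would load only the periods a given query needs.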

Also, remember that dictionaries are not sorted, so sorting that amount of data can become a problem, if you need to search or sort the data at all.

Anyway, I think the database approach already suggested is the most flexible one, especially in the long run.
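To make the flexibility point concrete, here is a small sketch with the standard-library `sqlite3` module (Python 3). The table name `bigram`, its columns, and the helpers `add_bigrams` and `followers` are assumptions for this example. The sorted lookup is done by the database rather than by sorting a dictionary in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for real data
conn.execute(
    "CREATE TABLE bigram ("
    " first TEXT NOT NULL,"
    " second TEXT NOT NULL,"
    " n INTEGER NOT NULL,"
    " PRIMARY KEY (first, second))"
)


def add_bigrams(conn, pairs):
    """Increment counts for a batch of (first, second) pairs in one transaction."""
    with conn:  # commits once at the end of the batch
        for first, second in pairs:
            cur = conn.execute(
                "UPDATE bigram SET n = n + 1 WHERE first = ? AND second = ?",
                (first, second),
            )
            if cur.rowcount == 0:  # pair not seen before
                conn.execute(
                    "INSERT INTO bigram (first, second, n) VALUES (?, ?, 1)",
                    (first, second),
                )


def followers(conn, first):
    """Return (second_word, count) rows for a word, most frequent first."""
    rows = conn.execute(
        "SELECT second, n FROM bigram WHERE first = ? ORDER BY n DESC, second",
        (first,),
    )
    return rows.fetchall()


add_bigrams(conn, [
    ("Hello", "World"), ("Hello", "munde"),
    ("World", "domination"), ("Total", "World"), ("World", "domination"),
])
```

Grouping many inserts into one transaction usually matters a lot for SQLite insert speed, since committing after every row is a common cause of slow bulk loads.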

Solution 3

One solution is to use buzhug instead of pickle. It's a pure Python solution, and retains very Pythonic syntax. I think of it as the next step up from shelve and its ilk. It will handle the data sizes you're talking about. Its size limit is 2 GB per field (each field is stored in a separate file).

Author by

João Portela

:)

Updated on April 15, 2022

Comments

  • João Portela
    João Portela about 2 years

    A friend of mine wrote this little program. The textFile is 1.2 GB in size (7 years' worth of newspapers). He successfully manages to create the dictionary, but he cannot write it to a file using pickle (the program hangs).

    import sys
    import string
    import cPickle as pickle
    
    biGramDict = {}
    
    textFile = open(str(sys.argv[1]), 'r')
    # pickle protocol 2 is a binary format, so open the output in binary mode
    biGramDictFile = open(str(sys.argv[2]), 'wb')
    
    
    for line in textFile:
       if (line.find('<s>')!=-1):
          old = None
          for line2 in textFile:
             if (line2.find('</s>')!=-1):
                break
             else:
                line2=line2.strip()
                if line2 not in string.punctuation:
                   if old != None:
                      if old not in biGramDict:
                         biGramDict[old] = {}
                      if line2 not in biGramDict[old]:
                         biGramDict[old][line2] = 0
                      biGramDict[old][line2]+=1
                   old=line2
    
    textFile.close()
    
    print "going to pickle..."    
    pickle.dump(biGramDict, biGramDictFile,2)
    
    print "pickle done. now load it..."
    
    biGramDictFile.close()
    # reopen in binary mode to read the pickle back
    biGramDictFile = open(str(sys.argv[2]), 'rb')
    
    newBiGramDict = pickle.load(biGramDictFile)
    

    Thanks in advance.

    EDIT
    For anyone interested, I will briefly explain what this program does. Assume you have a file formatted roughly like this:

    <s>
    Hello
    ,
    World
    !
    </s>
    <s>
    Hello
    ,
    munde
    !
    </s>
    <s>
    World
    domination
    .
    </s>
    <s>
    Total
    World
    domination
    !
    </s>
    
    • <s> and </s> mark the start and end of each sentence.
    • There is one word (or punctuation mark) per line.

    A bigram dictionary is generated for later use, mapping each word to the words that follow it and how often (punctuation is skipped).
    Something like this:

    {
     "Hello": {"World": 1, "munde": 1}, 
     "World": {"domination": 2},
     "Total": {"World": 1},
    }
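For readers on Python 3, the same counting logic can be sketched roughly as below. This is a simplified restatement, not the original program: `count_bigrams` is a hypothetical name, it assumes the `<s>` and `</s>` markers sit alone on their lines, and unlike the original it does not ignore tokens that appear outside a sentence.

```python
import string
from collections import defaultdict


def count_bigrams(lines):
    """Build {first_word: {second_word: count}} from one-token-per-line input."""
    bigrams = defaultdict(lambda: defaultdict(int))
    previous = None
    for raw in lines:
        token = raw.strip()
        if token in ("<s>", "</s>"):
            previous = None  # sentence boundary: no bigram spans it
        elif not token or token in string.punctuation:
            continue  # skip blank lines and punctuation marks
        else:
            if previous is not None:
                bigrams[previous][token] += 1
            previous = token
    # Convert to plain dicts so the result pickles and compares simply.
    return {first: dict(seconds) for first, seconds in bigrams.items()}


sample = """<s>
Hello
,
World
!
</s>
<s>
Hello
,
munde
!
</s>
<s>
World
domination
.
</s>
<s>
Total
World
domination
!
</s>
""".splitlines()

result = count_bigrams(sample)
```

Run on the sample file above, `result` holds the same three entries as the dictionary shown in the question.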
    

    Hope this helps. The strategy has now changed to using MySQL, because SQLite just wasn't working (probably because of the size).

    • Messa
      Messa
      ghostdog74, you see 2 for statements, but there is only one loop over the file :) Iterating over a file is just reading lines (from actual position), it does not seek to the beginning of the file.
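A small Python 3 demonstration of that point, using `io.StringIO` as a stand-in for the real text file (an assumption for the sake of a self-contained example; a file object opened with `open()` iterates the same way):

```python
import io

text = io.StringIO("a\n<s>\nb\nc\n</s>\nd\n")

outer_seen = []
inner_seen = []
for line in text:  # outer loop
    outer_seen.append(line.strip())
    if line.strip() == "<s>":
        for line2 in text:  # inner loop continues from the current position
            if line2.strip() == "</s>":
                break
            inner_seen.append(line2.strip())
```

The inner loop consumes `b`, `c` and `</s>`, so the outer loop resumes at `d`. Both loops share one iterator, and every line is read exactly once.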
  • Khelben
    Khelben over 14 years
    SQLite is a fully relational database, while Berkeley DB is not; it is just key/value. If it's just for storing, I think Berkeley DB is a better option, while if you want to run queries and store the information in a more organized way, SQLite is more appropriate.
  • Thomas Wouters
    Thomas Wouters over 14 years
    BerkeleyDB is rather fickle and difficult to manage, especially with larger amounts of data. Even for a single string->string store (which is what BerkeleyDB would be) I would use SQLite, which will take care of all the BerkeleyDB management.
  • Thomas Wouters
    Thomas Wouters over 14 years
    SQLite does not act like a dictionary.
  • new name
    new name over 14 years
    The Python page for the bsddb module (python.org/doc/2.6/library/bsddb.html) says that it is deprecated. Is there another non-deprecated Python option for a BSD DB?
  • Wim
    Wim over 14 years
    python.org/doc/2.6/library/persistence.html lists a number of data persistence modules. The gdbm module looks very similar and is still supported, so I'd go for that one.
  • João Portela
    João Portela over 14 years
    Your database suggestion was adequate, although he had to use MySQL because SQLite just wasn't cutting it.
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 3 years
    I'm thinking of using SQLite3 in python down the road. I'm not sure what the relationship is to sqlalchemy in your answer though.
  • WinEunuuchs2Unix
    WinEunuuchs2Unix over 3 years
    @JoãoPortela It would be good to update your question with why "SQLite just wasn't cutting it." and I wonder if a decade later SQLite3 is still playing second fiddle now that it is used in Google Chrome and Firefox on Linux platforms.
  • João Portela
    João Portela over 3 years
    I would love to, but this was so long ago I'm not sure anymore. I think the whole process (inserting and later querying) ended up being too slow.