How to reduce the size of an sqlite3 database for iphone?

11,441

Solution 1

I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.

Solution 2

Have you tried typing the "vacuum" command to make sure you don't have extra space in the db you forgot to reclame?

Solution 3

Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.

You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.

Solution 4

As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.

Solution 5

Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.

Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.

Share:
11,441
frankodwyer
Author by

frankodwyer

Updated on July 05, 2022

Comments

  • frankodwyer
    frankodwyer almost 2 years

    edit: many thanks for all the answers. Here are the results after applying the optimisations so far:

    • Switching to sorting the characters and run length encoding - new DB size 42M
    • Dropping the indexes on the booleans - new DB size 33M

    The really nice part is this hasn't required any changes in the iphone code

    I have an iphone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.

    Here is the number of entries and resulting size of the sqlite DB:

    franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db
    -rw-r--r--  1 frank  staff    59M  8 Oct 23:08 dictionary.db
    franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt
      453154 dictionary.txt
    

    ...an average of about 135 bytes per entry.

    Here is my DB schema:

    create table words (word text primary key, sowpods boolean, twl boolean, signature text)
    create index sowpods_idx on words(sowpods)
    create index twl_idx on words(twl)
    create index signature_idx on words(signature)
    

    Here is some sample data:

    photoengrave|1|1|10002011000001210101010000
    photoengraved|1|1|10012011000001210101010000
    photoengraver|1|1|10002011000001210201010000
    photoengravers|1|1|10002011000001210211010000
    photoengraves|1|1|10002011000001210111010000
    photoengraving|1|1|10001021100002210101010000
    

    The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.

    I need to do queries such as:

    select signature from words where word = 'foo'
    select word from words where signature = '10001021100002210101010000' order by word asc
    select word from words where word like 'foo' order by word asc
    select word from words where word = 'foo' and (sowpods='1' or twl='1')
    

    One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.

    Also is there any way to get stats on the DB so I can see what is using the most space?

  • frankodwyer
    frankodwyer over 15 years
    the difficulty with an extension would be getting it loaded on the iphone side
  • frankodwyer
    frankodwyer over 15 years
    I haven't tried it, however as I never delete anything from the database (it is created in a single import from a text file), I wouldn't expect it to save any space. I will try it though.
  • codelogic
    codelogic over 15 years
    I think this approach would save the OP a lot of space. It will require the use of the BLOB data type in SQLite.
  • frankodwyer
    frankodwyer over 15 years
    Yes, about 450K words. An in memory approach probably won't work as the raw file data is 18M and memory on the phone is pretty limited. Nor will fulltext search work as I need to run the queries shown - i.e. find all words matching a pattern, or having the same signature (for anagrams).
  • frankodwyer
    frankodwyer over 15 years
    True, that might be smaller and also works for anagram searches (which is the only use case). I will try that.
  • codelogic
    codelogic over 15 years
    Also, rounding up the 13 bytes to 16, this can even be split into two 8 byte integers, which SQLite supports, allowing the use of indexes on the tables. This approach would require the signature to be split into two though and the query would need to check equality on both.
  • frankodwyer
    frankodwyer over 15 years
    Yes I was thinking of something along these lines...each position in the signature field is 0..9 so could be packed into 4 bits. Coding the signature as a large binary integer I think would be smaller again.
  • Marc Novakowski
    Marc Novakowski over 15 years
    Good idea! The only enhancement I can think of for this method would be to replace sequences of 3 or more letters in a row with a number and the letter. For example if your signature for "mississippi" is "iiiimppssss" you could shorten it to "4impp4s"
  • codelogic
    codelogic over 15 years
    frankodwyer, I'm not sure about indexing blobs.
  • frankodwyer
    frankodwyer over 15 years
    also a nice idea...plus it will probably work without changing my current iphone code at all (still would just look for equality). I will try that too. extra points for ruby code that computes that signature given a word :-)
  • Robert Simmons
    Robert Simmons over 15 years
    Please excuse the crudity of this solution as I don't really know Ruby: "mississippi".split( // ).sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
  • frankodwyer
    frankodwyer over 15 years
    thanks - that is a really interesting idea and new to me. I am going to pursue the idea of sorting the letters and run length encoding first, but will also look at this approach. A post-processing step could be used to remove false positives (easily identified in this case).
  • frankodwyer
    frankodwyer over 15 years
    Sounds reasonable however quite complex to implement.
  • frankodwyer
    frankodwyer over 15 years
    That's true, but I want this to be able to work offline. Also it's just a free app and I don't want to maintain a server for it.
  • frankodwyer
    frankodwyer over 15 years
    I've set up another question to explore this option a bit more. stackoverflow.com/questions/401834/…
  • frankodwyer
    frankodwyer over 15 years
    re the primary key, I understand it is automatically indexed in sqlite
  • lImbus
    lImbus over 15 years
    I am highly interested to see the new file size after this optimization.
  • Bill
    Bill almost 13 years
    Important word of warning for people reading this: if you use implicit rowids on your sqlite3 tables, VACUUM can (and will!) remove some of your rows.
  • Alix Axel
    Alix Axel over 10 years
    @frankodwyer: Actually, that extension call all be done with whatever SQLite API using simple SQLite UDFs.