How to reduce the size of an sqlite3 database for iphone?
Solution 1
I'm not clear on all the use cases for the signature field but it seems like storing an alphabetized version of the word instead would be beneficial.
Solution 2
Have you tried typing the "vacuum" command to make sure you don't have extra space in the db you forgot to reclame?
Solution 3
Remove the indexes on sowpods and twl -- they are probably not helping your query times and are definitely taking lots of space.
You can get stats on the database using sqlite3_analyzer from the SQLite downloads page.
Solution 4
As a totally different approach, you could try using a bloom filter instead of a comprehensive database. Basically, a bloom filter consists of a bunch of hash functions, each of which is associated with a bitfield. For each legal word, each hash function is evaluated, and the corresponding bit in the corresponding bit field is set. Drawback is it's theoretically possible to get false positives, but those can be minimized/practically eliminated with enough hashes. Plus side is a huge space savings.
Solution 5
Your best bet is to use compression, which unfortunately SQLite does not support natively at this point. Luckily, someone took the time to develop a compression extension for it which could be what you need.
Otherwise I'd recommend storing your data mostly in compressed format and uncompressing on the fly.
frankodwyer
Updated on July 05, 2022Comments
-
frankodwyer almost 2 years
edit: many thanks for all the answers. Here are the results after applying the optimisations so far:
- Switching to sorting the characters and run length encoding - new DB size 42M
- Dropping the indexes on the booleans - new DB size 33M
The really nice part is this hasn't required any changes in the iphone code
I have an iphone application with a large dictionary held in sqlite format (read only). I'm looking for ideas to reduce the size of the DB file, which is currently very large.
Here is the number of entries and resulting size of the sqlite DB:
franks-macbook:DictionaryMaker frank$ ls -lh dictionary.db -rw-r--r-- 1 frank staff 59M 8 Oct 23:08 dictionary.db franks-macbook:DictionaryMaker frank$ wc -l dictionary.txt 453154 dictionary.txt
...an average of about 135 bytes per entry.
Here is my DB schema:
create table words (word text primary key, sowpods boolean, twl boolean, signature text) create index sowpods_idx on words(sowpods) create index twl_idx on words(twl) create index signature_idx on words(signature)
Here is some sample data:
photoengrave|1|1|10002011000001210101010000 photoengraved|1|1|10012011000001210101010000 photoengraver|1|1|10002011000001210201010000 photoengravers|1|1|10002011000001210211010000 photoengraves|1|1|10002011000001210111010000 photoengraving|1|1|10001021100002210101010000
The last field represents the letter frequencies for anagram retrieval (each position is in the range 0..9). The two booleans represent sub dictionaries.
I need to do queries such as:
select signature from words where word = 'foo' select word from words where signature = '10001021100002210101010000' order by word asc select word from words where word like 'foo' order by word asc select word from words where word = 'foo' and (sowpods='1' or twl='1')
One idea I have is to encode the letter frequencies more efficiently, e.g. binary encode them as a blob (perhaps with RLE as there are many zeros?). Any ideas for how best to achieve this, or other ideas to reduce the size? I am building the DB in ruby, and reading it on the phone in objective C.
Also is there any way to get stats on the DB so I can see what is using the most space?
-
frankodwyer over 15 yearsthe difficulty with an extension would be getting it loaded on the iphone side
-
frankodwyer over 15 yearsI haven't tried it, however as I never delete anything from the database (it is created in a single import from a text file), I wouldn't expect it to save any space. I will try it though.
-
codelogic over 15 yearsI think this approach would save the OP a lot of space. It will require the use of the BLOB data type in SQLite.
-
frankodwyer over 15 yearsYes, about 450K words. An in memory approach probably won't work as the raw file data is 18M and memory on the phone is pretty limited. Nor will fulltext search work as I need to run the queries shown - i.e. find all words matching a pattern, or having the same signature (for anagrams).
-
frankodwyer over 15 yearsTrue, that might be smaller and also works for anagram searches (which is the only use case). I will try that.
-
codelogic over 15 yearsAlso, rounding up the 13 bytes to 16, this can even be split into two 8 byte integers, which SQLite supports, allowing the use of indexes on the tables. This approach would require the signature to be split into two though and the query would need to check equality on both.
-
frankodwyer over 15 yearsYes I was thinking of something along these lines...each position in the signature field is 0..9 so could be packed into 4 bits. Coding the signature as a large binary integer I think would be smaller again.
-
Marc Novakowski over 15 yearsGood idea! The only enhancement I can think of for this method would be to replace sequences of 3 or more letters in a row with a number and the letter. For example if your signature for "mississippi" is "iiiimppssss" you could shorten it to "4impp4s"
-
codelogic over 15 yearsfrankodwyer, I'm not sure about indexing blobs.
-
frankodwyer over 15 yearsalso a nice idea...plus it will probably work without changing my current iphone code at all (still would just look for equality). I will try that too. extra points for ruby code that computes that signature given a word :-)
-
Robert Simmons over 15 yearsPlease excuse the crudity of this solution as I don't really know Ruby: "mississippi".split( // ).sort.join.gsub(/(.)\1{2,}/) { |s| s.length.to_s + s[0,1] }
-
frankodwyer over 15 yearsthanks - that is a really interesting idea and new to me. I am going to pursue the idea of sorting the letters and run length encoding first, but will also look at this approach. A post-processing step could be used to remove false positives (easily identified in this case).
-
frankodwyer over 15 yearsSounds reasonable however quite complex to implement.
-
frankodwyer over 15 yearsThat's true, but I want this to be able to work offline. Also it's just a free app and I don't want to maintain a server for it.
-
frankodwyer over 15 yearsI've set up another question to explore this option a bit more. stackoverflow.com/questions/401834/…
-
frankodwyer over 15 yearsre the primary key, I understand it is automatically indexed in sqlite
-
lImbus over 15 yearsI am highly interested to see the new file size after this optimization.
-
Bill almost 13 yearsImportant word of warning for people reading this: if you use implicit rowids on your sqlite3 tables, VACUUM can (and will!) remove some of your rows.
-
Alix Axel over 10 years@frankodwyer: Actually, that extension call all be done with whatever SQLite API using simple SQLite UDFs.