Performance of Arrays and Hashes in Ruby

24,514

Solution 1

Hashes are much faster for lookups:

require 'benchmark'
Document = Struct.new(:id,:a,:b,:c)
documents_a = []
documents_h = {}
1.upto(10_000) do |n|
  d = Document.new(n)
  documents_a << d
  documents_h[d.id] = d
end
searchlist = Array.new(1000){ rand(10_000)+1 }

Benchmark.bm(10) do |x|
  x.report('array'){searchlist.each{|el| documents_a.any?{|d| d.id == el}} }
  x.report('hash'){searchlist.each{|el| documents_h.has_key?(el)} }
end

#                user     system      total        real
#array       2.240000   0.020000   2.260000 (  2.370452)
#hash        0.000000   0.000000   0.000000 (  0.000695)

Solution 2

When using unique values, you can use the Ruby Set which has been previously mentioned. Here are benchmark results. It's slightly slower than the hash though.

                 user     system      total        real
array        0.460000   0.000000   0.460000 (  0.460666)
hash         0.000000   0.000000   0.000000 (  0.000219)
set          0.000000   0.000000   0.000000 (  0.000273)

I simply added to @steenslag's code which can be found here https://gist.github.com/rsiddle/a87df54191b6b9dfe7c9.

I used ruby 2.1.1p76 for this test.

Solution 3

Ruby has a set class in its standard library, have you considering keeping an (additional) set of IDs only?

http://stdlib.rubyonrails.org/libdoc/set/rdoc/index.html

To quote the docs: "This is a hybrid of Array’s intuitive inter-operation facilities and Hash’s fast lookup".

Solution 4

  1. Use a Set of Documents. It has most of the properties you want (constant-time lookup and does not allow duplicates),. Smalltalkers would tell you that using a collection that already has the properties you want is most of the battle.

  2. Use a Hash of Documents by document id, with ||= for conditional insertion (rather than has_key?).

Hashes are designed for constant-time insertion and lookup. Ruby's Set uses a Hash internally.

Be aware that your Document objects will need to implement #hash and #eql? properly in order for them to behave as you would expect as Hash keys or members of a set, as these are used to define hash equality.

Share:
24,514

Related videos on Youtube

slhck
Author by

slhck

Video quality guy and researcher, PhD student in computer science. Founder/CEO of AVEQ. I offer personal consulting and help with video encoding, especially with FFmpeg. Send a mail to werner.robitza at gmail.com. More info on my website.

Updated on October 03, 2020

Comments

  • slhck
    slhck over 3 years

    I have a program that will store many instances of one class, let's say up to 10.000 or more. The class instances have several properties that I need from time to time, but their most important one is the ID.

    class Document
      attr_accessor :id
      def ==(document)
        document.id == self.id
      end
    end
    

    Now, what is the fastest way of storing thousands of these objects?

    I used to put them all into an array of Documents:

    documents = Array.new
    documents << Document.new
    # etc
    

    Now an alternative would be to store them in a Hash:

    documents = Hash.new
    doc = Document.new
    documents[doc.id] = doc
    # etc
    

    In my application, I mostly need to find out whether a document exists at all. Is the Hash's has_key? function significantly faster than a linear search of the Array and the comparison of Document objects? Are both within O(n) or is has_key? even O(1). Will I see the difference?

    Also, sometimes I need to add Documents when it is already existing. When I use an Array, I would have to check with include? before, when I use a Hash, I'd just use has_key? again. Same question as above.

    What are your thoughts? What is the fastest method of storing large amounts of data when 90% of the time I only need to know whether the ID exists (not the object itself!)

  • Elad
    Elad about 13 years
    Just a little note: the implementation of Set is based on Hash internally, so performance-wise using Set.include? is the same as Hash.has_key?
  • Michael Kohl
    Michael Kohl about 13 years
    I know, it says so in the docs. I just thought that maybe instead of keeping all documents around, it may make sense to have a set of only IDs and retrieve documents on demand (depending on the cost of document retrieval). Granted I wasn't very explicit about that, but I'm writing from work...
  • sawa
    sawa about 13 years
    That's a surprising and useful result. I expected it, but but didn't expect the difference to be that big.
  • Rein Henrichs
    Rein Henrichs about 13 years
    Given that the purpose of a hash is to provide constant time lookup, while Array#include? has O(n) worst-case performance, this is not at all surprising.
  • cdunn2001
    cdunn2001 almost 12 years
    Yes, but try upto(10), or even upto(1). The hash is always faster, and becomes "fasterer" for larger N.
  • Michael Mior
    Michael Mior about 10 years
    Of course this wasn't the question, but a better comparison for membership testing would be sets and hashes.
  • steenslag
    steenslag about 10 years
    @MichaelMior Sets use a hash as storage ,according to the docs.
  • Michael Mior
    Michael Mior about 10 years
    That may be the case, but that doesn't mean the performance is the same. I did some quick tests and storing a nil value in a set and checking membership with has_key? outperforms sets somewhat.
  • Quolonel Questions
    Quolonel Questions almost 10 years
  • Brad Nauta
    Brad Nauta over 8 years
    Note that while Hash is fastest for lookups, if you're stuck with arrays Ruby 2.0 introduced Array#bsearch which is much faster than any?, include?, find/detect, etc. Added to the above benchmark, yielded ~6x slower than Hash, any? being ~ 2000x slower. x.report('array bsearch'){searchlist.each{|el| documents_a.bsearch {|d| d.id == el}} }
  • steenslag
    steenslag over 8 years
    @BradNauta bsearch however only works on sorted arrays (which happens to be the case here).
  • Brad Nauta
    Brad Nauta over 8 years
    @steenslag True - thanks for highlighting this gotcha.
  • jBeas
    jBeas almost 8 years
    If you know the key will be present, and you are simply accessing the key ex: documents_a[el], arrays are x14 faster. Benchmark: array ( 0.000247) VS hash ( 0.003607)
  • Magne
    Magne over 6 years
    @jBeas yeah, but its not actually documents_a[el] but documents_a[index] which will return el, so you'd have to know the index of the element in the array, which you oftentimes don't know.