Matrix Multiplication in Clojure vs Numpy

Solution 1

The Python version is compiling down to a loop in C, while the Clojure version is building a new intermediate sequence for each of the calls to map in this code. The performance difference you see is likely coming from this difference in data structures.

To do better than this, you could play with a library like Incanter or write your own version as explained in this SO question (see also this one, Neanderthal, or nd4j). If you really want to stay with sequences to keep the lazy-evaluation properties etc., then you may get a real boost by looking into transients for the internal matrix calculations.
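
As a rough illustration (my own sketch, not code from the question), here is what a transient-based inner step might look like: one row of the result is accumulated into a transient vector, so no fresh persistent vector is allocated on every step.

;; A minimal sketch of a transient accumulator for one row of the result.
;; `mat-cols` is assumed to be the right-hand matrix already transposed
;; into a sequence of columns.
(defn row-times-cols [row mat-cols]
  (persistent!
   (reduce (fn [acc col]
             (conj! acc (reduce + (map * row col))))
           (transient [])
           mat-cols)))

;; e.g. (row-times-cols [1 2] [[1 0] [0 1]]) ;=> [1 2]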

EDIT: I forgot to add the first step in tuning Clojure: turn on "warn on reflection".
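
Enabling it is a one-liner at the REPL or at the top of a file:

(set! *warn-on-reflection* true)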

Solution 2

Numpy links to BLAS/LAPACK routines that have been optimized for decades at the level of machine architecture, while the Clojure code implements the multiplication in the most straightforward and naive manner.

Any time you have non-trivial matrix/vector operations to perform, you should probably link to BLAS/LAPACK.

The only time this won't be faster is for small matrices in languages where the overhead of translating the data representation between the language runtime and LAPACK exceeds the time spent doing the calculation.

Solution 3

I've just staged a small shootout between Incanter 1.3 and jBLAS 1.2.1. Here's the code:

(ns ml-class.experiments.mmult
  (:use [incanter core])
  (:import [org.jblas DoubleMatrix]))

(defn -main [m]
  (let [n 23 m (Integer/parseInt m)
        ai (matrix (vec (double-array (* m n) (repeatedly rand))) n)
        ab (DoubleMatrix/rand m n)
        ti (copy (trans ai))
        tb (.transpose ab)]
    (dotimes [i 20]
      (print "Incanter: ") (time (mmult ti ai))
      (print "   jBLAS: ") (time (.mmul tb ab)))))

In my test, Incanter is consistently slower than jBLAS by about 45% in plain matrix multiplication. However, Incanter's trans function does not create a new copy of the matrix, so (.mmul (.transpose ab) ab) in jBLAS takes twice as much memory and is only 15% faster than (mmult (trans ai) ai) in Incanter.

Given Incanter's rich feature set (especially its plotting library), I don't think I'll switch to jBLAS any time soon. Still, I would love to see another shootout between jBLAS and Parallel Colt, and maybe it's worth considering replacing Parallel Colt with jBLAS in Incanter? :-)


EDIT: Here are the absolute numbers (in msec) I got on my (rather slow) PC:

Incanter: 665.362452
   jBLAS: 459.311598
   numpy: 353.777885

For each library, I've picked the best time out of 20 runs, matrix size 23x400000.

PS. Haskell's hmatrix results are close to numpy's, but I am not sure how to benchmark it correctly.

Solution 4

The Numpy code uses built-in libraries, written in Fortran over the last few decades and optimized by the authors, your CPU vendor, and your OS distributor (as well as the Numpy people) for maximal performance. You just took the completely direct, obvious approach to matrix multiplication. It's no surprise, really, that performance differs.

But if you insist on doing it in Clojure, consider looking up better algorithms, using direct loops instead of higher-order functions like reduce, or finding a proper matrix algebra library for Java (I doubt there are good ones in Clojure, but I don't really know) written by a competent mathematician.

Finally, look up how to write fast Clojure properly. Use type hints, run a profiler on your code (surprise! your dot-product function is using up the most time), and drop the high-level features inside your tight loops.
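
To make the "direct loops plus type hints" advice concrete, here is a rough sketch of my own (not code from the question): a dot product over primitive double arrays, with no seqs and no boxed math.

;; Primitive-array dot product. The ^doubles and ^double hints keep the
;; arithmetic unboxed, and loop/recur avoids intermediate sequences.
(defn dot ^double [^doubles xs ^doubles ys]
  (let [n (alength xs)]
    (loop [i 0, acc 0.0]
      (if (< i n)
        (recur (unchecked-inc i) (+ acc (* (aget xs i) (aget ys i))))
        acc))))

;; (dot (double-array [1 2 3]) (double-array [4 5 6])) ;=> 32.0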

Solution 5

As @littleidea and others have pointed out, your numpy version is using LAPACK/BLAS/ATLAS, which will be much faster than anything you do in Clojure since it has been finely tuned for years. :)

That said, the biggest problem with the Clojure code is that it is using Doubles, as in boxed doubles. I call this the "lazy Double" problem and I've run into it at work a number of times. As of now, even with 1.3, Clojure's collections are not primitive friendly. (You can create a vector of primitives, but it won't help you any, since all of the seq functions will end up boxing them! I should also say that the primitive improvements in 1.3 are quite nice and end up helping... we just aren't 100% there WRT primitive support in collections.)
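
A quick illustration of the boxing problem (my own example): even a primitive-backed vector loses its primitives as soon as a seq function touches it.

;; vector-of stores unboxed doubles internally, but map goes through the
;; seq abstraction, so every element is boxed to java.lang.Double anyway.
(def v (vector-of :double 1.0 2.0 3.0))
(map #(* % %) v) ; each element is boxed on the way through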

When doing any kind of matrix math in Clojure you really need to use Java arrays or, better yet, matrix libraries. Incanter does use Parallel Colt, but you need to be careful about which Incanter functions you use... since a lot of them make the matrices seqable, which ends up boxing the doubles, giving you performance similar to what you are currently seeing. (BTW, I have my own set of Parallel Colt wrappers that I could release if you think they would be helpful.)

In order to use the BLAS libraries you have a couple of options in Java-land. With all of these options you have to pay a JNA tax: all of your data has to be copied before it can be processed. This tax makes sense when you are doing CPU-bound operations, like matrix decompositions, whose processing time takes longer than the time it takes to copy the data. For simpler operations on small matrices, staying in Java-land will probably be faster. You just need to do a few tests like you've done above to see what works best for you.

Here are your options for using BLAS from java:

http://jblas.org/

http://code.google.com/p/netlib-java/

I should point out that Parallel Colt uses the netlib-java project, which means, I believe, that if you set it up correctly it will use BLAS. However, I have not verified this. For an explanation of the differences between jBLAS and netlib-java, see this thread I started about it on jBLAS's mailing list:

http://groups.google.com/group/jblas-users/browse_thread/thread/c9b3867572331aa5

I should also point out the Universal Java Matrix Package library:

http://sourceforge.net/projects/ujmp/

It wraps all of the libraries I have mentioned, and then some! I haven't looked at the API enough to know how leaky their abstraction is, though. It seems like a nice project. I've ended up using my own Parallel Colt Clojure wrappers, since they were fast enough and I actually quite liked the Colt API. (Colt uses function objects, which means I was able to just pass Clojure functions with little trouble!)
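
To show what I mean by that last point, here is a rough sketch (my own example; the package paths assume Parallel Colt's tdouble layout) of wrapping a Clojure fn as a Colt function object and applying it elementwise:

(import '[cern.colt.matrix.tdouble.impl DenseDoubleMatrix2D]
        '[cern.colt.function.tdouble DoubleFunction])

;; reify lets a plain Clojure fn satisfy Colt's DoubleFunction interface.
(defn colt-fn [f]
  (reify DoubleFunction
    (apply [_ x] (double (f x)))))

;; Apply an arbitrary Clojure function to every element of a matrix:
(def m (DenseDoubleMatrix2D. 2 2))
(.assign m (colt-fn #(+ % 1.0)))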


Comments

  • Sam Ritchie
    Sam Ritchie almost 2 years

    I'm working on an application in Clojure that needs to multiply large matrices and am running into some large performance issues compared to an identical Numpy version. Numpy seems to be able to multiply a 1,000,000x23 matrix by its transpose in under a second, while the equivalent Clojure code takes over six minutes. (I can print out the resulting matrix from Numpy, so it's definitely evaluating everything.)

    Am I doing something terribly wrong in this Clojure code? Is there some trick of Numpy that I can try to mimic?

    Here's the python:

    import numpy as np
    import time
    
    def test_my_mult(n):
        A = np.random.rand(n*23).reshape(n,23)
        At = A.T
    
        t0 = time.time()
        res = np.dot(At, A)
        print time.time() - t0
        print np.shape(res)
    
        return res
    
    # Example (returns a 23x23 matrix):
    # >>> results = test_my_mult(1000000)
    # 
    # 0.906938076019
    # (23, 23)
    

    And the Clojure:

    (defn feature-vec [n]
      (map (partial cons 1)
           (for [x (range n)]
             (take 22 (repeatedly rand)))))
    
    (defn dot-product [x y]
      (reduce + (map * x y)))
    
    (defn transpose
      "returns the transposition of a `coll` of vectors"
      [coll]
      (apply map vector coll))
    
    (defn matrix-mult
      [mat1 mat2]
      (let [row-mult (fn [mat row]
                       (map (partial dot-product row)
                            (transpose mat)))]
        (map (partial row-mult mat2)
             mat1)))
    
    (defn test-my-mult
      [n afn]
      (let [xs  (feature-vec n)
            xst (transpose xs)]
        (time (dorun (afn xst xs)))))
    
    ;; Example (yields a 23x23 matrix):
    ;; (test-my-mult 1000 i/mmult) => "Elapsed time: 32.626 msecs"
    ;; (test-my-mult 10000 i/mmult) => "Elapsed time: 628.841 msecs"
    
    ;; (test-my-mult 1000 matrix-mult) => "Elapsed time: 14.748 msecs"
    ;; (test-my-mult 10000 matrix-mult) => "Elapsed time: 434.128 msecs"
    ;; (test-my-mult 1000000 matrix-mult) => "Elapsed time: 375751.999 msecs"
    
    
    ;; Test from wikipedia
    ;; (def A [[14 9 3] [2 11 15] [0 12 17] [5 2 3]])
    ;; (def B [[12 25] [9 10] [8 5]])
    
    ;; user> (matrix-mult A B)
    ;; ((273 455) (243 235) (244 205) (102 160))
    

    UPDATE: I implemented the same benchmark using the jBLAS library and found massive, massive speed improvements. Thanks to everyone for their input! Time to wrap this sucker in Clojure. Here's the new code:

    (import '[org.jblas FloatMatrix])
    
    (defn feature-vec [n]
      (FloatMatrix.
       (into-array (for [x (range n)]
                     (float-array (cons 1 (take 22 (repeatedly rand))))))))
    
    (defn test-mult [n]
      (let [xs  (feature-vec n)
            xst (.transpose xs)]
        (time (let [result (.mmul xst xs)]
                [(.rows result)
                 (.columns result)]))))
    
    ;; user> (test-mult 10000)
    ;; "Elapsed time: 6.99 msecs"
    ;; [23 23]
    
    ;; user> (test-mult 100000)
    ;; "Elapsed time: 43.88 msecs"
    ;; [23 23]
    
    ;; user> (test-mult 1000000)
    ;; "Elapsed time: 383.439 msecs"
    ;; [23 23]
    
    (defn matrix-stream [rows cols]
      (repeatedly #(FloatMatrix/randn rows cols)))
    
    (defn square-benchmark
      "Times the multiplication of a square matrix."
      [n]
      (let [[a b c] (matrix-stream n n)]
        (time (.mmuli a b c))
        nil))
    
    ;; forma.matrix.jblas> (square-benchmark 10)
    ;; "Elapsed time: 0.113 msecs"
    ;; nil
    ;; forma.matrix.jblas> (square-benchmark 100)
    ;; "Elapsed time: 0.548 msecs"
    ;; nil
    ;; forma.matrix.jblas> (square-benchmark 1000)
    ;; "Elapsed time: 107.555 msecs"
    ;; nil
    ;; forma.matrix.jblas> (square-benchmark 2000)
    ;; "Elapsed time: 793.022 msecs"
    ;; nil
    
  • Sam Ritchie
    Sam Ritchie over 12 years
    Ah, those links are really helpful. I'll try to see what I can get from some of the matrix packages.
  • Joe Kington
    Joe Kington over 12 years
    Actually, numpy isn't even doing the calculation in C, necessarily. It's just calling a BLAS routine. If you have MKL or ATLAS installed, (and if numpy is configured to use them) it will call their BLAS routines for matrix multiplication, a portion of which are basically hand-tuned assembly. At any rate, my point is that numpy (if configured properly) would beat the equivalent of the OP's clojure example written in C, so the best bet is probably something more like Incanter, etc (as you suggested, I'm just rambling).
  • dnolen
    dnolen over 12 years
    Incanter uses parallel-colt which can also leverage MKL / ATLAS
  • John Cromartie
    John Cromartie over 12 years
    A good matrix algebra lib for Java is just a good matrix algebra lib for Clojure waiting for a wrapper.
  • Sam Ritchie
    Sam Ritchie over 12 years
    You said it. I've found JBLAS and am working on wrapping it up right now.
  • pavpanchekha
    pavpanchekha over 12 years
    Not C; much of BLAS and LAPACK are Fortran. The Python version is also not "compiling to" a loop in C; it is simply using a library written in C/Fortran. Finally, the algorithms used are completely different. It is rather unlikely that a description of Numpy's approach is at all related to the code our original poster displayed.
  • gurney alex
    gurney alex about 12 years
    @pavpanchekha: also, the underlying algorithm of the BLAS/LAPACK matrix multiplication is certainly not the naive one implemented in Clojure above, which has O(n³) complexity for square matrices. See en.wikipedia.org/wiki/…
  • matanster
    matanster about 6 years
    Being a great answer, I've taken the liberty adding a couple of additional modern options.