reading csv in Julia is slow compared to Python


Solution 1

The best answer is probably that I'm not as good a programmer as Wes.

In general, the code in DataFrames is much less well-optimized than the code in Pandas. I'm confident that we can catch up, but it will take some time as there's a lot of basic functionality that we need to implement first. Since there's so much that needs to be built in Julia, I tend to focus on doing things in three parts: (1) build any version, (2) build a correct version, (3) build a fast, correct version. For the work I do, Julia often doesn't offer any versions of essential functionality, so my work gets focused on (1) and (2). As more of the tools I need get built, it'll be easier to focus on performance.

As for memory usage, I think the answer is that we use a set of data structures when parsing tabular data that's much less efficient than those used by Pandas. If I knew the internals of Pandas better, I could list off places where we're less efficient, but for now I'll just speculate that one obvious failing is that we're reading the whole dataset into memory rather than grabbing chunks from disk. This certainly can be avoided and there are issues open for doing so. It's just a matter of time.
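The chunked, from-disk approach described above is what pandas exposes through `read_csv`'s `chunksize` parameter; a minimal Python sketch (the file name and sizes are illustrative):

```python
import pandas as pd

# Create a small sample file so the example is self-contained.
with open("sample.csv", "w") as f:
    f.write("a,b\n")
    for i in range(10_000):
        f.write(f"{i},{i * 2}\n")

# Process the file chunk by chunk instead of materializing the whole
# table: only one 1,000-row chunk is resident in memory at a time.
total = 0
for chunk in pd.read_csv("sample.csv", chunksize=1_000):
    total += chunk["b"].sum()

print(total)  # 99990000, same as summing the whole column at once
```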

On that note, the readtable code is fairly easy to read. The most certain way to get readtable to be faster is to whip out the Julia profiler and start fixing the performance flaws it uncovers.

Solution 2

There is a relatively new Julia package called CSV.jl by Jacob Quinn that provides a much faster CSV parser, in many cases on par with Pandas: https://github.com/JuliaData/CSV.jl

Solution 3

Note that the "n bytes allocated" output from @time is the total size of all allocated objects, ignoring how many of them might have been freed. This number is often much higher than the final size of live objects in memory. I don't know if this is what your memory size estimate is based on, but I wanted to point this out.
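The same distinction exists in other runtimes; as a Python analogue (not Julia's accounting, just the same idea), `tracemalloc` separates the peak allocation during a computation from what remains live afterwards:

```python
import tracemalloc

tracemalloc.start()
# Build a large temporary list, reduce it to one number, then discard it.
result = sum([i * i for i in range(200_000)])
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# 'peak' reflects what was allocated during the call (in the spirit of
# Julia's "n bytes allocated"); 'current' is what is still live afterwards.
print(current, peak)
```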

Solution 4

I've found a few things that can partially help this situation.

  1. Using the readdlm() function in Julia seems to work considerably faster (e.g., 3x in a recent trial) than readtable(). Of course, if you want the DataFrame object type, you will then need to convert to it, which may eat up most or all of the speed improvement.

  2. Specifying dimensions of your file can make a BIG difference, both in speed and in memory allocations. I ran this trial reading in a file that is 258.7 MB on disk:

    julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1);
    19.072266 seconds (221.60 M allocations: 6.573 GB, 3.34% gc time)
    
    julia> @time Data = readdlm("MyFile.txt", '\t', Float32, skipstart = 1, dims = (File_Lengths[1], 62));
    10.309866 seconds (87 allocations: 528.331 MB, 0.03% gc time)
    
  3. The type specification for your object matters a lot. For instance, if your data has strings in it, the array you read in will be of type Any, which is expensive memory-wise. If memory is really an issue, consider preprocessing your data by first converting the strings to integers, doing your computations, and then converting back. Also, if you don't need a ton of precision, using the Float32 type instead of Float64 can save a LOT of space. You can specify this when reading the file in, e.g.:

    Data = readdlm("file.csv", ',', Float32)

  4. Regarding memory usage, I've found in particular that the PooledDataArray type (from the DataArrays package) can be helpful in cutting down memory usage if your data has a lot of repeated values. The time to convert to this type is relatively large, so this isn't a time saver per se, but at least helps reduce the memory usage somewhat. E.g. when loading a data set with 19 million rows and 36 columns, 8 of which represented categorical variables for statistical analysis, this reduced the memory allocation of the object from 5x its size on disk to 4x its size. If there are even more repeated values, the memory reduction can be even more significant (I've had situations where the PooledDataArray cuts memory allocation in half).

  5. It can also sometimes help to run the gc() function (GC.gc() in Julia 0.7 and later) after loading and formatting data to clear out unneeded RAM, though Julia generally does this well automatically.
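A rough pandas analogue of points 3 and 4, using the `category` dtype as the counterpart of PooledDataArray (the column names and data here are made up for illustration):

```python
import numpy as np
import pandas as pd

n = 100_000
df = pd.DataFrame({
    "group": np.random.choice(["a", "b", "c", "d"], size=n),  # repeated values
    "value": np.random.rand(n),                               # float64
})

# Point 4: pooling repeated values (pandas' 'category' dtype) shrinks the
# column from one pointer-plus-string-object per row to one small code per row.
object_bytes = df["group"].memory_usage(deep=True)
cat_bytes = df["group"].astype("category").memory_usage(deep=True)

# Point 3: dropping from Float64 to Float32 halves the data payload.
f64_bytes = df["value"].memory_usage()
f32_bytes = df["value"].astype(np.float32).memory_usage()

print(object_bytes, cat_bytes, f64_bytes, f32_bytes)
```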

Still, despite all of this, I look forward to further developments in Julia that enable faster loading and more efficient memory usage for large data sets.

Solution 5

Let us first create a file you are talking about to provide reproducibility:

open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+644, '|')), 1:153895)
end
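For readers following along from Python, an equivalent generator (shrunk to 1,000 rows here; the benchmark file above has 153,895 rows of 644 pipe-separated integers, about 0.5 GB):

```python
# Python analogue of the Julia snippet above. Row i holds the integers
# i+1 .. i+cols, joined with '|'. Bump rows to 153_895 to reproduce
# the full benchmark file.
rows, cols = 1_000, 644

with open("myFile.txt", "w") as f:
    for i in range(1, rows + 1):
        f.write("|".join(str(j) for j in range(i + 1, i + cols + 1)) + "\n")
```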

Now I read this file in Julia 1.4.2 with CSV.jl 0.7.1.

Single threaded:

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.747160 seconds (1.55 M allocations: 1.281 GiB, 4.29% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  2.780213 seconds (13.72 k allocations: 1.206 GiB, 5.80% gc time)

and using e.g. 4 threads:

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  4.546945 seconds (6.02 M allocations: 1.499 GiB, 5.05% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.812742 seconds (47.28 k allocations: 1.208 GiB)

In R it is:

> system.time(myData<-read.delim("myFile.txt",sep="|",header=F,
+                                stringsAsFactors=F,na.strings=""))
   user  system elapsed 
 28.615   0.436  29.048 

In Python (Pandas) it is:

>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
25.95710587501526

Now if we test fread from R (which is fast) we get:

> system.time(fread("myFile.txt", sep="|", header=F,
                    stringsAsFactors=F, na.strings="", nThread=1))
   user  system elapsed 
  1.043   0.036   1.082 
> system.time(fread("myFile.txt", sep="|", header=F,
                    stringsAsFactors=F, na.strings="", nThread=4))
   user  system elapsed 
  1.361   0.028   0.416 

So in this case the summary is:

  • despite the compilation cost of CSV.File in Julia on the first run, it is significantly faster than base R or Python
  • it is comparable in speed to fread in R (slightly slower in this case, but other benchmarks run here show cases where it is faster)

EDIT: Following a request, I have added a benchmark for a small file (10 columns, 100,000 rows): Julia vs. Pandas.

Data preparation step:

open("myFile.txt", "w") do io
    foreach(i -> println(io, join(i+1:i+10, '|')), 1:100_000)
end

CSV.jl, single threaded:

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  1.898649 seconds (1.54 M allocations: 93.848 MiB, 1.48% gc time)

julia> @time CSV.File("myFile.txt", delim='|', header=false);
  0.029965 seconds (248 allocations: 17.037 MiB)

Pandas:

>>> import pandas as pd
>>> import time
>>> start=time.time()
>>> myData=pd.read_csv("myFile.txt",sep="|",header=None,low_memory=False)
>>> print(time.time()-start)
0.07587623596191406

Conclusions:

  • the compilation cost is a one-time cost that has to be paid, and it is roughly constant (it does not depend on the size of the file being read)
  • for small files CSV.jl is faster than Pandas (once the compilation cost is excluded)

Now, if you would like to avoid paying the compilation cost on every fresh Julia session, this is doable with https://github.com/JuliaLang/PackageCompiler.jl.

From my experience, if you are doing data science work where you read in, say, thousands of CSV files, waiting 2 seconds for compilation is not a problem if it later saves hours. Writing the code that reads in the files takes more than 2 seconds.

Of course, if you write a script that does little work and terminates when done, that is a different use case, as compilation time would then make up the majority of the computational cost. In that case, using PackageCompiler.jl is the strategy I use.



Author by

uday

Updated on July 02, 2020

Comments

  • uday
    uday almost 4 years

    Reading large text/CSV files in Julia takes a long time compared to Python. Here are the times to read a file that is 486.6 MB, with 153,895 rows and 644 columns.

    Python 3.3 example

    import pandas as pd
    import time
    start=time.time()
    myData=pd.read_csv("C:\\myFile.txt",sep="|",header=None,low_memory=False)
    print(time.time()-start)
    
    Output: 19.90
    

    R 3.0.2 example

    system.time(myData<-read.delim("C:/myFile.txt",sep="|",header=F,
       stringsAsFactors=F,na.strings=""))
    
    Output:
    User    System  Elapsed
    181.13  1.07    182.32
    

    Julia 0.2.0 (Julia Studio 0.4.4) example # 1

    using DataFrames
    timing = @time myData = readtable("C:/myFile.txt",separator='|',header=false)
    
    Output:
    elapsed time: 80.35 seconds (10319624244 bytes allocated)
    

    Julia 0.2.0 (Julia Studio 0.4.4) example # 2

    timing = @time myData = readdlm("C:/myFile.txt",'|',header=false)
    
    Output:
    elapsed time: 65.96 seconds (9087413564 bytes allocated)
    
    1. Julia is faster than R, but quite slow compared to Python. What can I do differently to speed up reading a large text file?

    2. A separate issue: the in-memory size is 18x the on-disk file size in Julia, but only 2.5x for Python. In Matlab, which I have found to be the most memory-efficient for large files, it is 2x the on-disk size. Any particular reason for the large in-memory footprint in Julia?

    • baptiste
      baptiste
      btw, in R I would recommend fread from the data.table package, it's much faster.
  • uday
    uday about 10 years
    thanks, John. keep up the good work. I will surely continue to monitor Julia
  • skan
    skan over 7 years
    Julia should have been built to use a database transparently for the user and be able to do all operations streaming from it.
  • Oscar Smith
    Oscar Smith almost 4 years
    What do you mean by this?
