Fast text file reading in C++

Solution 1

Updates: Be sure to check the (surprising) updates below the initial answer


Memory mapped files have served me well [1]:

#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <iostream>   // for std::cout
#include <cstdint>    // for uintmax_t
#include <cstring>    // for memchr

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

This should be rather quick.

Update

In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru

#include <iostream>
#include <cstdint>    // for uintmax_t
#include <cstdio>     // for perror
#include <cstdlib>    // for exit
#include <cstring>    // for memchr

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f!=l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l-f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg); 
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}

Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise using the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:

static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if(fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern.  */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while(size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if(bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for(char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}

[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

Solution 2

4000 * 400,000 = 1.6 GB. If your hard drive isn't an SSD, you're likely getting ~100 MB/s sequential read. That's 16 seconds just in I/O.

Since you don't elaborate on the specific code you're using or how you need to parse these files (do you need to read them line by line? does the system have enough RAM to read a whole file into a large buffer and then parse it?), there's little you can do to speed up the process.

Memory mapped files won't offer any performance improvement when reading a file sequentially. Perhaps manually parsing large chunks for new lines rather than using "getline" would offer an improvement.

EDIT: After doing some learning (thanks @sehe), here's the memory mapped solution I would likely use.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>

int main() {
    char* fName = "big.txt";
    //
    struct stat sb;
    long cntr = 0;
    int fd, lineLen;
    char *data;
    char *line;
    // map the file
    fd = open(fName, O_RDONLY);
    fstat(fd, &sb);
    //// int pageSize;
    //// pageSize = getpagesize();
    //// data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_PRIVATE, fd, pageSize);
    data = mmap((caddr_t)0, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    line = data;
    // get lines
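    while(cntr < sb.st_size) {
        lineLen = 0;
        line = data;
        // find the end of the current line (check bounds before dereferencing)
        while(cntr < sb.st_size && *data != '\n') {
            data++;
            cntr++;
            lineLen++;
        }
        /***** PROCESS LINE *****/
        // ... processLine(line, lineLen);
        // step over the '\n' itself, otherwise the loop never advances
        if(cntr < sb.st_size) {
            data++;
            cntr++;
        }
    }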
    return 0;
}

Solution 3

Neil Kirk, unfortunately I cannot reply to your comment (not enough reputation), but I did a performance test on ifstream and stringstream, and the performance when reading a text file line by line is exactly the same.

std::stringstream stream;
std::string line;
while(std::getline(stream, line)) {
}

This takes 1426ms on a 106MB file.

std::ifstream stream;
std::string line;
while(stream.good()) {
    getline(stream, line);
}

This takes 1433ms on the same file.

The following code, however, is faster:

const int MAX_LENGTH = 524288;
char* line = new char[MAX_LENGTH];
// Note: the strlen() check also ends the loop at the first empty line.
while (iStream.getline(line, MAX_LENGTH) && strlen(line) > 0) {
}
delete[] line;

This takes 884ms on the same file. It is just a little tricky since you have to set the maximum size of your buffer (i.e. maximum length for each line in the input file).

Solution 4

As someone with a little background in competitive programming, I can tell you: At least for simple things like integer parsing the main cost in C is locking the file streams (which is by default done for multi-threading). Use the unlocked_stdio versions instead (fgetc_unlocked(), fread_unlocked()). For C++, the common lore is to use std::ios::sync_with_stdio(false) but I don't know if it's as fast as unlocked_stdio.

For reference here is my standard integer parsing code. It's a lot faster than scanf, as I said mainly due to not locking the stream. For me it was as fast as the best hand-coded mmap or custom buffered versions I'd used previously, without the insane maintenance debt.

int readint(void)
{
        int n, c;
        n = getchar_unlocked() - '0';
        while ((c = getchar_unlocked()) > ' ')
                n = 10*n + c-'0';
        return n;
}

(Note: This one only works if there is precisely one non-digit character between any two integers).

And of course avoid memory allocation if possible...

Solution 5

Do you have to read all files at the same time? (at the start of your application for example)

If you do, consider parallelizing the operation.

Either way, consider using binary streams, or unbuffered reads for blocks of data.
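
As a rough sketch of the parallel idea, each file can be handed to its own std::async task. The function names below are hypothetical, one task per file is an assumption that only makes sense for a modest number of files, and on a single spinning disk this may well be slower than reading sequentially.

```cpp
#include <algorithm>
#include <cstdint>
#include <fstream>
#include <future>
#include <iterator>
#include <string>
#include <vector>

// Count newlines in one file; a file that cannot be opened counts as 0 lines.
std::uintmax_t count_lines_in_file(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    return static_cast<std::uintmax_t>(
        std::count(std::istreambuf_iterator<char>(in),
                   std::istreambuf_iterator<char>(), '\n'));
}

// Launch one asynchronous task per file and collect the results in order.
std::vector<std::uintmax_t> count_lines_parallel(const std::vector<std::string>& paths)
{
    std::vector<std::future<std::uintmax_t>> jobs;
    for (const auto& p : paths)
        jobs.push_back(std::async(std::launch::async, count_lines_in_file, p));

    std::vector<std::uintmax_t> counts;
    for (auto& job : jobs)
        counts.push_back(job.get()); // blocks until that file is done
    return counts;
}
```

On some toolchains this needs to be built with -pthread.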

Author: Arne

Updated on July 05, 2022

Comments

  • Arne
    Arne almost 2 years

    I am currently writing a program in c++ which includes reading lots of large text files. Each has ~400.000 lines with in extreme cases 4000 or more characters per line. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus.com. It took around 60 seconds, which is way too long. Now I was wondering, is there a straightforward way to improve reading speed?

    edit: The code I am using is more or less this:

    string tmpString;
    ifstream txtFile(path);
    if(txtFile.is_open())
    {
        while(txtFile.good())
        {
            m_numLines++;
            getline(txtFile, tmpString);
        }
        txtFile.close();
    }
    

    edit 2: The file I read is only 82 MB big. I mainly said that it could reach 4000 because I thought it might be necessary to know in order to do buffering.

    edit 3: Thank you all for your answers, but it seems like there is not much room to improve given my problem. I have to use readline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can, that should work at least.

    edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)

  • sehe
    sehe almost 11 years
    +1 for beer coaster calculations. An SSD could reach ~500 MB/s though. Memory mapping could be more efficient depending on the usage scenarios.
  • Arne
    Arne almost 11 years
    I need to read it line by line, because they don't contain a header which tells me how long they are. I could put them into a RAM buffer because I can discard each one after reading it, but then again, i thought that was what ifstream did. Is there a way to tell a program to just throw the whole thing into RAM?
  • Louis Ricci
    Louis Ricci almost 11 years
    @sehe - I was always under the impression that memory mapping files was more of a convenience abstraction for overlapping I/O than a performance boost, especially for a sequential read task. My guess is the OP is using "getline", which is reading 1 byte at a time looking for \n and causing a lot of unnecessarily small file reads. Using a larger read buffer in a sequential ifstream would offer the exact same performance as a mapped file (but I am very open to being proven wrong).
  • Louis Ricci
    Louis Ricci almost 11 years
    @ArneRecknagel - if you have enough RAM to handle it, you can get the file size, allocate a buffer large enough, and do one read operation into the buffer. This will of course have the hefty delay I mentioned. A better way would probably be to allocate a ~16MB buffer, read into it, parse the lines you can, move the last (possibly unparsable at this time) line to the beginning of the buffer, and continue your read loop into the rest of it.
  • Louis Ricci
    Louis Ricci almost 11 years
    @ArneRecknagel - the underlying caching and abstraction of a mapped file would make the task I described in my last comment a bit easier, but probably not any faster.
  • sehe
    sehe almost 11 years
    @LastCoder mmap are a convenience too, but also: prevent paging all the pages you don't access, work in binary mode implicitly, only require virtual address space (as opposed to copying it to a local buffer). Some filesystem drivers may even have zero-copy paths, especially on readonly maps
  • ogni42
    ogni42 almost 11 years
    parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
  • sehe
    sehe almost 11 years
    @ArneRecknagel it uses Boost Iostreams for convenience, but you could use mmap (POSIX) or MapViewOfFileEx function (Win32) if you prefer.
  • utnapistim
    utnapistim almost 11 years
    You are probably right (I hadn't considered that a single HDD could cause further delays). If the OP combines this with an unbuffered read (say, moving rdbuf() into a separate ostringstream and reading from there), it may still be faster. I guess once the OP decides on an implementation, he (she?) will have to measure and find out.
  • Louis Ricci
    Louis Ricci almost 11 years
    @sehe - Thanks sehe, zero copy gave me something to look into. Seems for sequential read mmap offers an order of magnitude improvement. My previous bias was from work on large file encryption in the past where toggling between an optimal amount of reads and writes was an issue.
  • sehe
    sehe almost 11 years
    @ArneRecknagel I've added a version not using Boost in case it helps. See it live on Coliru (counting the lines in its own main.cpp)
  • sehe
    sehe almost 11 years
    @ArneRecknagel I've updated my code after benchmarking with an 8.9GiB file. It turned out that using memchr instead of std::count made it run in 2.3s instead of 8.4s (over 3x faster). Next, using a read loop on the fd turned out to be marginally faster than using the mmap. I show my adapted wc() version here
  • Void
    Void almost 11 years
    Does calling madvise(addr, 0, MADV_SEQUENTIAL) after your call to mmap() help with performance? That would at least make it more comparable to the wc() implementation, which uses posix_fadvise().
  • sehe
    sehe almost 11 years
    @Void nope, no visible improvements. Thanks for pointing out madvise exists too :)
  • Peter Cordes
    Peter Cordes about 8 years
    Reading in 16kiB chunks reuses the same 4 pages of address space in your process. You won't have TLB misses, and 16kiB is smaller than L1 cache. The memcpy from page-cache (inside read(2)) goes very fast, and the memchr only touches memory that's hot in L1. The mmap version has to fault each page, because mmap doesn't wire all the pages (unless you use MAP_POPULATE, but that won't work well when file size is a large fraction of RAM size).