Need a way to sort a 100 GB log file by date


Solution 1

If a string sort will work for you, then just use the Windows SORT command. Sort the file and be done with it. It'll happily sort your 100GB file, and it's simple to use.

If you need to filter and convert the file, specifically the date field, then I would simply write a small conversion program that converts the date field into a zero-filled integer (like the number of seconds since 1970, or whatever you like) and rewrites the record. Then you can pipe (|) the output into the SORT command, and you end up with a final, sorted file that's more readily parsed by your utility program.
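
To make that concrete, here is a minimal sketch of such a reformatter. It prepends a zero-padded "seconds since 1970" key rather than editing the field in place, and the field offset and date format are made-up placeholders -- adapt them to your actual record layout:

    // Hypothetical sketch: read records on stdin, prepend a zero-padded sort key,
    // write to stdout. Assumes the timestamp starts at column 5 as "YYYY-MM-DD hh:mm:ss".
    using System;

    class Reformat
    {
        static void Main()
        {
            var epoch = new DateTime(1970, 1, 1);
            string line;
            while ((line = Console.ReadLine()) != null)
            {
                DateTime ts = DateTime.ParseExact(line.Substring(5, 19),
                                                  "yyyy-MM-dd HH:mm:ss", null);
                long key = (long)(ts - epoch).TotalSeconds;
                // Zero-padding the key lets a plain string sort order it correctly.
                Console.WriteLine("{0:D12} {1}", key, line);
            }
        }
    }

You would then pipe the log through it and into SORT (something along the lines of: type 2008.log | reformat | sort > 2008.sorted.log), so the key travels at the front of each record.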

I think the mistake you're making is simply trying to do this all in one go. 100GB of data is a lot, and it takes some time to copy, but it doesn't take THAT long. Since you have to sort it, you already have to deal with a copy of the file at some point (i.e. you need as much free space on your machine to handle both copies at some time), even with an external sorting routine like merge sort.

Writing a simple reformatter and piping it into SORT will save you a couple of trips through the file and save space on disk, since you'll inevitably need just the two copies.

I would also tweak the formatter to pull only the fields I'm really interested in, and do all of the "heavy" parsing at that point, so that what you end up with is essentially a formatted file that's easily handled by your reporting routines. That way you'll save time later when potentially running your reports more than once.

Use a simple CSV or, even better, a fixed length file format for output if possible.

Make sure your date information, if you choose to use an integer, has all of the fields the same length. Otherwise the SORT utility won't order them correctly (you end up with 1 10 2 3 instead of 1 2 3 10; you're better off with 01 02 03 10).

Edit --

Let's approach it from a different tack.

The biggest question is "do you need all this data?". This relates to the earlier suggestion about doing the heavy parsing first. Obviously, the more you can reduce the initial set the better. For example, trimming away just 10% of the data is 10 GB you never have to process.

Something I like to think about as a rule of thumb, especially when dealing with a lot of data: "If you have 1 million of something, then every millisecond saved is 20 minutes off the bottom line."

Normally we really don't think in terms of milliseconds for our work; it's more "seat of the pants", "that feels faster". But 1 ms == 20 min/million (strictly closer to 17 minutes, but 20 is easier to remember) is a good measure for getting a grasp of how much data you're dealing with and how long things should or could take.

For your case: 100 GB of data. With a swag of 100 bytes per record, you're talking 1 billion rows. At the rule-of-thumb rate, that's 20,000 minutes for every millisecond of per-record work -- roughly two weeks. Gulp. (It's a rule of thumb; the exact math comes out a bit different, but the order of magnitude is the point.)

So, you can appreciate the desire to reduce the raw data if at all possible.

That was one reason I deferred to the Windows SORT command. It's a basic process, but one affected by nuance, and one that can use some optimization. The folks who wrote SORT had the time and opportunity to make it "optimal" in many ways. Whether they did or did not, I can't say. But it's a fair assumption that they would put more time and attention into this process to make their SORT as good as practical, versus you, who are under a tight deadline.

There are third-party sorting utilities for large data sets that would probably work better for this case. But those are unavailable to you (you could get them, but I don't think you wanted to rush out and get some other utility right away). So SORT is our best guess for now.

That said, reducing the data set will gain more than any sort utility.

How much detail do you really need? And how much information are you really tracking? For example, if it were, say, web statistics, you may have 1000 pages on your site. But even with hourly numbers for a year, 365 * 24 * 1000, that's only 8.7M "buckets" of information -- a far cry from 1B.

So, is there any preprocessing you can do that does not require sorting? Summarizing the information into a coarser granularity? You can do that without sorting, simply using memory-based hash maps. Even if you don't have "enough memory" to process all 100 GB of data in one pass, you probably have enough to do it in chunks (5 chunks, 10 chunks) and write out the intermediary results.
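
As a rough sketch of that idea (the bucket key, field offsets, flush threshold, and file names are placeholders -- adapt them to whatever your reports actually need):

    // Hypothetical sketch: one pass, no sorting. Counts records per hour bucket
    // and flushes intermediate results whenever the in-memory map gets large.
    using System;
    using System.Collections.Generic;
    using System.IO;

    class Summarize
    {
        static void Main()
        {
            var counts = new Dictionary<string, long>();
            int chunk = 0;
            foreach (string line in File.ReadLines("2008.log"))
            {
                string hour = line.Substring(5, 13);   // assumed "YYYY-MM-DD hh" at column 5
                long n;
                counts.TryGetValue(hour, out n);
                counts[hour] = n + 1;

                if (counts.Count > 5000000)            // cap the map, flush a chunk, keep going
                {
                    Flush(counts, "partial" + (chunk++) + ".txt");
                    counts.Clear();
                }
            }
            Flush(counts, "partial" + chunk + ".txt");
        }

        static void Flush(Dictionary<string, long> counts, string path)
        {
            using (var w = new StreamWriter(path))
                foreach (var kv in counts)
                    w.WriteLine("{0}\t{1}", kv.Key, kv.Value);
        }
    }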

You may also have better luck splitting the data into monthly or weekly file chunks. Maybe that's not easily done because the data is "mostly" sorted. But in that case, if it's by date, the offenders (i.e. the rows that are out of order) may well be clustered within the file, with the out-of-order stuff mixed up only at the boundaries of the time periods (like around day transitions; maybe you have rows like 11:58pm, 11:59pm, 00:00am, 00:01am, 11:58pm, 00:02am). You might be able to leverage that heuristic as well.

The goal being that if you can somewhat deterministically identify the subset that's out of order, and break the file up into chunks of "in order" data and "out of order" data, your sorting task may be MUCH, MUCH smaller. Sort the few rows that are out of order, and then you have a merge problem (much simpler than a sorting problem).
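
Here is a rough sketch of that split, assuming the timestamp compares correctly as text (an ISO-style date does) and with made-up file names and offsets:

    // Hypothetical sketch: one pass that separates the (mostly sorted) bulk from
    // the few out-of-order records. Only "outoforder.log" then needs a real sort,
    // after which the two files can be merged.
    using System;
    using System.IO;

    class Split
    {
        static void Main()
        {
            string last = "";
            using (var inOrder = new StreamWriter("inorder.log"))
            using (var outOfOrder = new StreamWriter("outoforder.log"))
            {
                foreach (string line in File.ReadLines("2008.log"))
                {
                    string key = line.Substring(5, 19);      // assumed "YYYY-MM-DD hh:mm:ss"
                    if (string.CompareOrdinal(key, last) >= 0)
                    {
                        inOrder.WriteLine(line);
                        last = key;                          // only advance on in-order rows
                    }
                    else
                    {
                        outOfOrder.WriteLine(line);
                    }
                }
            }
        }
    }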

So, those are tactics you can take in approaching the problem. Summarization is obviously the best one, as anything that reduces the data load in any measurable way is likely worth the trouble. Of course, it all boils down to what you really want from the data; clearly the reports will drive that. This is also a nice point about "premature optimization": if they're not reporting on it, don't process it :).

Solution 2

Code like this is completely bound by how fast you can get the data off the disk. The file simply can never fit in the file system cache, so you're always waiting on the disk to supply the data. You're doing fairly well at 10 MB/sec; optimizing the code is never going to have a discernible effect.

Get a faster disk. Defrag the one you've got as an intermediate step.

Solution 3

Short answer - load the data into a relational database (e.g. SQL Server Express), create an index on the date column, and use a cursor-based solution (e.g. a DataReader) to read each record back in order and write it to disk.
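
A minimal sketch of the read-back side, assuming the file has already been bulk-loaded into an indexed table (the table, column, and connection string below are made up for illustration):

    // Hypothetical sketch: stream the rows back out in date order with a SqlDataReader,
    // writing them to disk as they arrive so nothing large is held in memory.
    using System.Data.SqlClient;
    using System.IO;

    class Export
    {
        static void Main()
        {
            using (var conn = new SqlConnection(
                @"Server=.\SQLEXPRESS;Database=Logs;Integrated Security=true"))
            using (var cmd = new SqlCommand(
                "SELECT RawLine FROM LogEntries ORDER BY EntryDate", conn))
            using (var writer = new StreamWriter("sorted.log"))
            {
                conn.Open();
                using (SqlDataReader reader = cmd.ExecuteReader())
                    while (reader.Read())
                        writer.WriteLine(reader.GetString(0));
            }
        }
    }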

Solution 4

Why don't you try this relatively unknown tool from Microsoft called LogParser? It basically allows you to run an SQL query over a CSV file (or any other formatted text file).

It saves you the trouble of pumping the data into a database, doing your sort, and pumping it back out again.

Solution 5

Just to answer your question about sorting a long file that doesn't fit into memory - you'll need to use an external sorting algorithm such as merge sort. The process is roughly as follows:

  • Partition the input into several parts that fit into memory and can be sorted using standard in-memory sorting algorithms (e.g. 100 MB or larger - you'll need to keep ~4 parts in memory at once). Sort all the parts and write them back to disk.

  • Read two parts from the disk (they are both sorted) and merge them, which can be done just by simultaneously iterating over the two inputs. Write the merged data set to another place on the disk. Note that you don't need to read the whole part into memory - just read and write it in blocks as you go.

  • Repeat merging of parts until you have only a single part (which will be a sorted file with all the data from your original input data set).

You mentioned that the data is partially sorted already, so it would be a good idea to pick an algorithm for in-memory sorting (in the first phase) that is efficient in this case. You can see some suggestions in this question (though I'm not sure if the answer will be the same for very large data sets - it also depends on how nearly sorted the input is).
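
The merge step in particular is simple to write. Here is a rough sketch of merging two already-sorted chunk files line by line, assuming whole-line string comparison gives the order you want (e.g. the date key is at the front or zero-padded):

    // Hypothetical sketch: merge two sorted chunk files into one, streaming both
    // inputs so memory use stays constant regardless of file size.
    using System.IO;

    class Merge
    {
        static void MergeFiles(string leftPath, string rightPath, string outPath)
        {
            using (var left = new StreamReader(leftPath))
            using (var right = new StreamReader(rightPath))
            using (var output = new StreamWriter(outPath))
            {
                string a = left.ReadLine();
                string b = right.ReadLine();
                while (a != null && b != null)
                {
                    if (string.CompareOrdinal(a, b) <= 0)
                    {
                        output.WriteLine(a);
                        a = left.ReadLine();
                    }
                    else
                    {
                        output.WriteLine(b);
                        b = right.ReadLine();
                    }
                }
                // Drain whichever input still has lines left.
                for (; a != null; a = left.ReadLine()) output.WriteLine(a);
                for (; b != null; b = right.ReadLine()) output.WriteLine(b);
            }
        }
    }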


Comments

  • Gleno (author), almost 2 years ago:

    So, for some strange reason I end up with a 100GB log file that is unsorted (actually it's partially sorted), while the algorithms that I'm attempting to apply require sorted data. A line in the log file looks like so

    data <date> data data more data
    

    I have access to C# 4.0 and about 4 GB of RAM on my workstation. I would imagine that a merge sort of some kind would be best here, but short of implementing these algorithms myself, I want to ask if there's some kind of a shortcut I could take.

    Incidentally, parsing the date string with DateTime.Parse() is very slow and takes up a lot of CPU time - the chugging rate is a measly 10 MB/sec. Is there a faster way than the following?

        // Parses only the leading "YYYY-MM-DD" part of the field; the time of day
        // is ignored, and TryParse failures silently leave the component at zero.
        public static DateTime Parse(string data)
        {
            int year, month, day;

            int.TryParse(data.Substring(0, 4), out year);
            int.TryParse(data.Substring(5, 2), out month);
            int.TryParse(data.Substring(8, 2), out day);

            return new DateTime(year, month, day);
        }
    

    I wrote that to speed up DateTime.Parse() and it actually works well, but it is still taking a bucket-load of cycles.

    Note that for the current log file I'm interested in hours, minutes and seconds as well. I know that I can use DateTime.ParseExact() with an explicit format, but that doesn't seem to speed it up all that much.
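
    (For reference, here is a minimal sketch of the kind of character-arithmetic parser often suggested for this; it assumes a fixed "YYYY-MM-DD hh:mm:ss" layout starting at a known offset and does no validation -- the offset parameter is illustrative.)

        // Sketch only: reads the digits directly instead of allocating substrings.
        // Assumes well-formed "YYYY-MM-DD hh:mm:ss" text at the given offset.
        public static DateTime ParseFast(string s, int offset)
        {
            int year  = (s[offset]      - '0') * 1000 + (s[offset + 1]  - '0') * 100
                      + (s[offset + 2]  - '0') * 10   + (s[offset + 3]  - '0');
            int month = (s[offset + 5]  - '0') * 10   + (s[offset + 6]  - '0');
            int day   = (s[offset + 8]  - '0') * 10   + (s[offset + 9]  - '0');
            int hour  = (s[offset + 11] - '0') * 10   + (s[offset + 12] - '0');
            int min   = (s[offset + 14] - '0') * 10   + (s[offset + 15] - '0');
            int sec   = (s[offset + 17] - '0') * 10   + (s[offset + 18] - '0');
            return new DateTime(year, month, day, hour, min, sec);
        }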

    I'm looking for a nudge in the right direction, thanks in advance.

    EDIT: Some people have suggested that I use string comparison in order to compare dates. That would work for the sorting phase, but I do need to parse the dates for the algorithms. I still have no idea how to sort a 100 GB file on 4 GB of free RAM without doing it manually.

    EDIT 2: Well, thanks to several suggestions that I use Windows SORT, I found out that there's a similar tool for Linux. Basically you call sort and it fixes everything for you. As we speak it's doing something, and I hope it'll finish soon. The command I'm using is

    sort -k 2b 2008.log > 2008.sorted.log
    

    The -k option specifies that I want to sort on the second field, which is a date-time string in the usual YYYY-MM-DD hh:mm:ss.msek format. I must admit that the man pages are lacking when it comes to explaining all the options, but I found a lot of examples by running info coreutils 'sort invocation'.

    I'll report back with results and timings. This part of the log is about 27GB. I am thinking of sorting 2009 and 2010 separately and then merging the results into a single file with the sort -m option.

    Edit 3 Well, checking iotop suggests that it's reading in small chunks of the data file and then furiously doing something in order to process them. This process seems to be quite slow. =(

    sort isn't using any memory, and only a single core. When it does read data from the drive it's not processing anything. Am I doing something wrong?

    Edit 4 Three hours in and it's still doing the same thing. Now I'm at that stage where I want to try playing with the parameters of the function, but I'm three hours invested... I'll abort it in about 4 hours, and try to put it on for an overnight computation with smarter memory and space parameters...

    Edit 5 Before I went home, I restarted the process with the following command:

    sort -k 2b --buffer-size=60% -T ~/temp/ -T "/media/My Passport" 2010.log -o 2010.sorted.log
    

    It returned this, this morning:

    sort: write failed: /media/My Passport/sortQAUKdT: File too large
    

    Wraawr! I thought I would just add as many hard drives as possible to speed this process up. Apparently adding a USB-drive was the worst idea ever. At the moment I can't even tell if it's about FAT/NTFS or some such, because fdisk is telling me that the USB drive is a "wrong device"... no kidding. I'll try to give it another go later, for now let's put this project into the maybe failed pile.

    Final Notice This time it worked, with the same command as above, but without the problematic external hard drive. Thank you all for your help!

    Benchmarking

    Using two workstation-grade hard disks (at least 70 MB/sec read/write IO) on the same SATA controller, it took me 162 minutes to sort a 30 GB log file. I will need to sort another 52 GB file tonight; I'll post how that goes.