Linux sort all data in memory

Solution 1

You can point the temporary directory at a nonexistent path and raise the main-memory buffer size. sort will then fail, rather than spill to disk, if the buffer is too small for the input:

$ sort -S 1000 -T /nonexistant/dir /usr/share/dict/words | wc -l 
sort: cannot create temporary file in `/nonexistant/dir': No such file or directory
0
$ sort -S 10000 -T /nonexistant/dir /usr/share/dict/words | wc -l
98569

The unit for the -S option is kB (1024 bytes) unless you add a suffix (see the comment below).
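
The same fail-fast approach can be wrapped in a small shell function. This is only a sketch, assuming GNU sort; the function name sort_in_memory and the path /nonexistent/dir are made up for illustration, and the size argument takes the suffixes described in the comment below (for example 10M, 2G or 50%).

```shell
# Hypothetical helper: sort a file using only the in-memory buffer.
# Assumes GNU sort. $1 = buffer size for -S (e.g. 10M, 2G, 50%), $2 = input file.
sort_in_memory() {
    # The temporary directory deliberately does not exist, so sort exits
    # with an error instead of spilling to disk when the buffer is too small.
    sort -S "$1" -T /nonexistent/dir "$2"
}
```

It could be called as sort_in_memory 50% /usr/share/dict/words > sorted.txt. As a comment below points out, -S limits the memory sort uses rather than the amount of input, so the buffer has to be larger than the file itself.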

Solution 2

Read side

Barring very non-standard filesystems, the whole shebang will be read-cached anyway (this is easy to observe in htop).

You can also see the amount of buffering in the output of vmstat 1. Observe how Linux will simply take all available memory (even memory that is not addressable by a single client process, e.g. when running a PAE kernel on 32-bit, or a 64-bit kernel with a 32-bit userland).

You can force the cache to be cleared by issuing echo 3 > /proc/sys/vm/drop_caches as root in another terminal (this clears the page cache as well as the inode and dentry caches).
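
The read caching can also be checked from a script instead of watching htop. A rough sketch, assuming a Linux system with /proc/meminfo; the function names page_cache_kb and warm_cache are hypothetical.

```shell
# Print the kernel's current page-cache size in kB, as reported in /proc/meminfo.
page_cache_kb() {
    awk '/^Cached:/ { print $2 }' /proc/meminfo
}

# Hypothetical helper: read a file once so it lands in the page cache,
# and report the cache size before and after the read.
warm_cache() {
    before=$(page_cache_kb)
    cat "$1" > /dev/null
    after=$(page_cache_kb)
    echo "page cache: ${before} kB -> ${after} kB"
}
```

When clearing the cache through sudo rather than from a root shell, sync; echo 3 | sudo tee /proc/sys/vm/drop_caches is the usual form, because the redirection in sudo echo 3 > /proc/sys/vm/drop_caches is performed by the unprivileged shell.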

Write side

On the write side, the tmpfs feature in Linux 2.4+ is perfect. It does the analogue of read caching, and you can manually limit its size. This is my default /tmp mount:

sudo mount -t tmpfs -o nodev,noexec,size=6g none /tmp

I'll usually work on /tmp for longer periods of the day and use version control to push things into a (nonvolatile) repository.
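
To tie this back to sort: its temporary files can be pointed at a tmpfs mount with -T. A minimal sketch, assuming GNU sort and that /dev/shm is a tmpfs (as mentioned in the comments below); the function name sort_via_tmpfs is hypothetical.

```shell
# Hypothetical helper: let sort spill its temporary files to a tmpfs
# instead of a disk-backed directory.
# $1 = input file, $2 = tmpfs directory (defaults to /dev/shm, an assumption).
sort_via_tmpfs() {
    sort -T "${2:-/dev/shm}" "$1"
}
```

Running df -h /dev/shm beforehand shows how much room the mount has. If the temporary files outgrow the tmpfs, sort fails with an out-of-space error, so the mount has to be sized for the input.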

Takeaway

So shy away from write-it-yourself solutions and just use the kernel features that are already there.

[1] I also symlink things like ~/.cache, ~/.opera/cache etc. into /tmp/. This really lifts the burden of cleaning up, makes things fly performance-wise and keeps my SSDs in healthy condition.

Author by

studiohack

Updated on September 18, 2022

Comments

  • studiohack
    studiohack almost 2 years

    For Linux command sort, how do I force sort to load all input into memory and sort assuming I have enough memory? Or is it best to use a RAMDISK to store the input before feeding it to sort?

    • sehe
      sehe about 13 years
      "sort all data in memory": something like od /dev/mem -An | sort came to mind immediately
    • Gilles 'SO- stop being evil'
      Gilles 'SO- stop being evil' about 13 years
      What are you trying to achieve? If you want sort to be as fast as possible, let it do what it wants to do. If you don't want it to access the filesystem, don't give it a filesystem to access, as shown by viraptor. I'm having trouble coming up with a use case though.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' about 13 years
    The unit for -S is kB unless you add a suffix, e.g. -S 1000 = 1024000 bytes, -S 1000b = 1000 bytes, -S 1% = 1% of physical memory. See the description of SIZE just below the option list, or the hypertext manual.
  • user1686
    user1686 about 13 years
    All Linux distros already have a writable tmpfs mounted on /dev/shm, which can be abused for sorting. Some now have /run, too.
  • Raza
    Raza over 10 years
    @grawity, but tmpfs has a set size, so you can run out of "disk space" on your RAM disk. For example, I have 4GiB of RAM, but sorting a 3GiB file fails because I only have a 2GiB RAM disk. It is also less efficient because it has to worry about "writing it out" and "reading it back".
  • user1686
    user1686 over 10 years
    @KevinCox It has a maximum size, yes. But not all systems use a tmpfs for /tmp; when 'sort' was written, tmpfs did not even exist. And there is another directory, /var/tmp, which is explicitly required by FHS not to be a tmpfs and programs can use it for storing huge files. // As for writing data out, don't forget mmap
  • sehe
    sehe over 10 years
    @KevinCox You can just on-the-fly resize. On my systems I have sometimes done mount -o remount,size=24g /tmp - and voila :) You can also size back down (assuming you free the space first)
  • Raza
    Raza over 10 years
    Note that this is the amount of memory used, not the amount of input sorted at once. If you are sorting a 2GiB file you will need more than -S2G.
  • CMCDragonkai
    CMCDragonkai over 7 years
    What is the default buffer size for sort?