How to speed up the extraction of a large tgz file with lots of small files?


I suppose you have a Linux laptop or desktop, and that your hugearchive.tgz file sits on some local disk (not on a remote network filesystem, which could be too slow). If possible, put that hugearchive.tgz file on a fast disk (preferably an SSD, not a magnetic rotating hard disk) with a fast Linux-native file system (Ext4, XFS, BTRFS; not FAT32 or NTFS).

Notice that a .tgz file is a gzip-compressed .tar archive.

Next time you get a huge archive, consider asking for it in the afio archive format, which has the big advantage of compressing not-too-small files individually (or perhaps ask for some SQL dump, e.g. for PostgreSQL, SQLite, or MariaDB, in compressed form).
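
Creating such an afio archive could look like the sketch below (afio reads the file names to archive on its stdin; -o writes the archive and -Z compresses each file individually; the paths here are made up, check afio(1) for the details):

cd /some/data/dir && find . -type f -print | afio -oZ /tmp/data.afio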

First, you should make a list of the file names in that hugearchive.tgz gzipped tar archive and ask for the total byte count:

 tar -tzv --totals -f hugearchive.tgz > /tmp/hugearchive-list.txt

That command will run gunzip to uncompress the .tgz file through a pipe (so it won't consume a lot of disk space), write the table of contents into /tmp/hugearchive-list.txt, and print on your stderr something like

  Total bytes read: 340048000 (331MiB, 169MiB/s)

Of course the figures are fictitious; you'll get much bigger ones. But you'll know the total cumulative size of the archive, and you'll have its table of contents. Use wc -l /tmp/hugearchive-list.txt to get the number of lines in that table of contents, which is the number of files in the archive, unless some files are weirdly and maliciously named (with e.g. a newline in their filename, which is possible but weird).
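
The same listing can also be produced with an explicit pipe, which is essentially what tar does internally (-f - makes tar read the archive from its stdin); then count the files:

gunzip -c hugearchive.tgz | tar -tv --totals -f - > /tmp/hugearchive-list.txt
wc -l /tmp/hugearchive-list.txt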

My guess is that you'll process your huge archive in less than one hour. Details depend on the computer, notably the hardware (if you can afford it, use an SSD, and get at least 8 GB of RAM).

Then you can decide whether you are able to extract all the files or not, since you know how much total space they need. And since you have the table of contents in /tmp/hugearchive-list.txt, you can easily extract only the useful files, if needed.
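
For example (a sketch with made-up member names; GNU tar accepts member names on the command line, or a list of names, one per line, via -T; names must match exactly what the listing shows):

# extract a single file or directory, named as in the listing
tar -xzf hugearchive.tgz some/subdir/wanted-file.txt
# or extract a whole batch of names listed one per line in a file
tar -xzf hugearchive.tgz -T /tmp/wanted-names.txt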


For what it is worth, on my i7-3770K desktop with 16 GB of RAM and both SSD and disk storage, I made (for experimenting) a useless huge archive (made specifically for the purpose of answering this question, since I don't have your hugearchive.tgz file ...) with

sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var 

and it took this time to create that archive (with all these file systems on SSD):

 719.63s user 60.44s system 102% cpu 12:40.87 total

and the produced /tmp/hugefile.tgz is 5.4 gigabytes (notice that it probably still sits in the page cache).
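
(If you want timings that don't benefit from the page cache, the usual Linux trick is to flush it first; this needs root and temporarily slows everything down:)

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches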

I then tried:

time tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt

and got:

Total bytes read: 116505825280 (109GiB, 277MiB/s)
tar -tzv --totals -f /tmp/hugefile.tgz > /tmp/hugefile-list.txt
    395.77s user 26.06s system 104% cpu 6:42.43 total

and the produced /tmp/hugefile-list.txt is 2.3 MB (for about 23K files), not a big deal.

Don't use z in your tar commands if your tar archive is not gzipped.
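
Check first what the archive really is; as it turned out in the comments below, the asker's file was a plain POSIX tar despite its .tgz name, so it must be listed without z:

file Stage1_Articles.tgz
tar -tv --totals -f Stage1_Articles.tgz > /tmp/Stage1_Articles-list.txt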

Read the documentation of tar(1) (and also of time(1) if you use it, and more generally of every command you are using!), and of course use the command line (not some GUI interface); also learn some shell scripting.

BTW, you could later segregate the very small files (less than 64 KB) and e.g. put them inside some database (perhaps an SQLite, Redis, PostgreSQL, or MongoDB database, filled with e.g. a small script), or maybe some GDBM indexed file. Notice that most file systems have significant overhead for a large number of small files.
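
As a mere illustration (a sketch assuming the sqlite3 shell with its built-in readfile() function, and file names without quotes or newlines; the database and directory names are made up):

sqlite3 small-files.db 'CREATE TABLE IF NOT EXISTS files(name TEXT PRIMARY KEY, data BLOB);'
find extracted/ -type f -size -64k | while IFS= read -r f; do
  sqlite3 small-files.db "INSERT OR REPLACE INTO files VALUES('$f', readfile('$f'));"
done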

Learning shell scripting and some scripting language (Python, Lua, Guile, OCaml, Common Lisp), and basic database techniques, is not a waste of time. If e.g. you are starting a PhD, it is almost a required skill set.

I don't know, don't use, and dislike Windows, so I am obviously biased (my first Linux was some Slackware with a 0.99.12 kernel, circa 1993 or early 1994), but I strongly recommend you do all your NLP work on Linux (and keep Windows only for playing video games, when you have time for that), because scripting and combining the many useful existing free software tools is so much easier on Linux.


Comments

  • Vulcan
    Vulcan almost 2 years

    I have a tar archive (17GB) which consists of many small files (all files < 1MB). How do I use this archive?

    1. Do I extract it? Using 7-zip on my laptop says it will take 20hrs (and I think it will take even more)
    2. Can I read/browse the contents of the file without extracting it? If yes, then how?
    3. Is there any other option?

    It is actually a processed wikipedia dataset on which I am supposed to perform some Natural Language Processing.

    Platform (Windows/Linux) is not an issue; anything will do, as long as it gets the job done as quickly as possible.

    • vlp
      vlp over 8 years
      So it is a .tgz file which contains many .zip files? Or just a .tgz file which contains many text files?
    • Vulcan
      Vulcan over 8 years
      a .tgz with many text files
    • Matteo Italia
      Matteo Italia over 8 years
      How many files are in there? It sounds strange that such a small file would take so much time...
    • Vulcan
      Vulcan over 8 years
      @MatteoItalia I don't know how many, but have a look: imgur.com/fOiSHLq
    • Vulcan
      Vulcan over 8 years
      I have a feeling I am doing something Completely Wrong here
    • Basile Starynkevitch
      Basile Starynkevitch over 8 years
      IMHO using Windows is completely wrong. See my answer.
    • Jason Hu
      Jason Hu over 8 years
      If you want all the files, then you decompress the tarball before you go to sleep. Small files are a pain in the ass, especially for a mechanical disk. If you have an SSD, that would be better.
    • Jason Hu
      Jason Hu over 8 years
      BTW, I think this question is actually quite valuable, since it happens quite often that there are too many small files to move around, and they end up taking too much time to compress and decompress. This question doesn't deserve a downvote.
    • Vulcan
      Vulcan over 8 years
      Put on hold as off-topic. So should I delete this question and copy it to Super User ASAP, or should I wait for the moderators to do it?
  • Vulcan
    Vulcan over 8 years
    How much time would that take?
  • vlp
    vlp over 8 years
    I don't know, depends on many factors. The data won't be stored to disk, so it might be reasonably fast. Of course it depends on the way you will process the data...
  • Vulcan
    Vulcan over 8 years
    $ zcat Stage1_Articles.tgz gives gzip: Stage1_Articles.tgz: not in gzip format. Now what?
  • Matteo Italia
    Matteo Italia over 8 years
    @Vulcan: it means that it's not actually a tgz. What's the output if you do file Stage1_Articles.tgz?
  • Vulcan
    Vulcan over 8 years
    $ file Stage1_Articles.tgz gives Stage1_Articles.tgz: POSIX tar archive (GNU)
  • Vulcan
    Vulcan over 8 years
    It's just a tar then. Correct?
  • Matteo Italia
    Matteo Italia over 8 years
    Yes, you can omit the zcat part.
  • Vulcan
    Vulcan over 8 years
    OK, that does work... but the total combined time for extracting the files is the same.
  • Vulcan
    Vulcan over 8 years
    And now suppose I use head -n10000; if I then want the next 10000 files, can I do that?
  • Jason Hu
    Jason Hu over 8 years
    I especially love the paragraphs after BTW :)
  • Vulcan
    Vulcan over 8 years
    sudo time tar czf /tmp/hugefile.tgz /bin /usr/bin /usr/local/bin /var: I tried my best but could not figure out what these extra paths (/bin /usr/bin /usr/local/bin /var) specify.
  • Vulcan
    Vulcan over 8 years
    And yes, I have Windows only for playing games... dual boot with Lubuntu for everything else... and I am not doing a PhD, it's a college project :P
  • Basile Starynkevitch
    Basile Starynkevitch over 8 years
    Don't repeat that exact command!!! It is just an example to create a big .tgz archive. I don't have your hugefile.tgz on my machine, so I created a stupid example for you...
  • Basile Starynkevitch
    Basile Starynkevitch over 8 years
    But you should learn basic shell & scripting skills, and read the documentation of tar before using it. BTW, do you know that college has a very different meaning in various countries? In France (where I live), it is some kind of junior high school, for pupils around 13 years old!
  • Vulcan
    Vulcan over 8 years
    Yes, yes, I know basic shell scripting (an amateur though!). But after the part where you give the filename of the tgz, you also have paths to other folders, specifically this part: /bin /usr/bin /usr/local/bin /var. What does this specify? Is it a part of time or of tar?
  • Basile Starynkevitch
    Basile Starynkevitch over 8 years
    It is just a stupid example to make a huge archive file from my system files under /bin, /usr/bin, etc. ... I don't have your archive, and I won't download it. Read more about tar and shell scripting.
  • Basile Starynkevitch
    Basile Starynkevitch over 8 years
    You absolutely need to RTFM. You won't understand my answer if you don't follow the links.
  • Vulcan
    Vulcan over 8 years
    Ohhh boy! Yes, I got it now. I overlooked the fact that you can combine files from multiple directories... a really, really silly mistake, sorry :(