Is there a smarter tar or cpio out there for efficiently retrieving a file stored in the archive?


Solution 1

tar (and cpio and afio and pax and similar programs) use stream-oriented formats: they are intended to be streamed directly to tape or piped into another process. While, in theory, it would be possible to add an index at the end of the file/stream, I don't know of any version that does (it would be a useful enhancement, though).

It won't help with your existing tar or cpio archives, but there is another tool, dar ("disk archive"), that creates archive files containing such an index and can give you fast direct access to individual files within the archive.

If dar isn't included with your Unix/Linux distribution, you can find it at:

http://dar.linux.free.fr/
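
If you do go the dar route, a minimal sketch of the workflow might look like this (archive and file names here are hypothetical; -c/-l/-x/-R/-g are dar's create/list/extract/root/go-into options, but check the man page for your version):

    # create an archive; dar writes numbered slices, e.g. backup.1.dar
    dar -c backup -R /path/to/data

    # list the contents using the archive's built-in catalogue
    dar -l backup

    # extract a single file directly, without scanning the whole archive
    dar -x backup -g path/to/myFileOfInterest.bz2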

Solution 2

You could use SquashFS for such archives. It is

  • designed to be accessed using a fuse driver (although a traditional interface exists)
  • compressed (the larger the block size, the more efficient)
  • included in the Linux kernel
  • stores UIDs/GIDs and creation time
  • endianness-aware, therefore quite portable

The only drawback I know of is that it is read-only.

http://squashfs.sourceforge.net/
http://www.tldp.org/HOWTO/SquashFS-HOWTO/whatis.html
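
A minimal sketch of that workflow, assuming the squashfs-tools package is installed (directory and file names here are hypothetical):

    # pack a directory into a compressed, indexed image (1 MiB blocks)
    mksquashfs /path/to/data archive.squashfs -b 1048576

    # list the contents without unpacking anything
    unsquashfs -l archive.squashfs

    # extract a single file into ./out
    unsquashfs -d out archive.squashfs path/to/myFileOfInterest.bz2

You can also loop-mount the image (mount -t squashfs archive.squashfs /mnt -o loop) and read individual files in place.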

Solution 3

While it doesn't store an index, star is purported to be faster than tar. It also supports longer filenames and has better support for file attributes.

As I'm sure you're aware, decompressing the file takes time and would likely be a factor in the speed of extraction even if there were an index.

Edit: You might also want to take a look at xar. It has an XML header that contains information about the files in the archive.

From the referenced page:

Xar's XML header allows it to contain arbitrary metadata about files contained within the archive. In addition to the standard unix file metadata such as the size of the file and its modification and creation times, xar can store information such as ext2fs and hfs file bits, unix flags, references to extended attributes, Mac OS X Finder information, Mac OS X resource forks, and hashes of the file data.
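
A rough sketch of using that header for fast lookups (archive and file names are hypothetical; -c, -t, -x and -f are xar's standard create/list/extract/file flags):

    # create an archive with an XML table of contents at the front
    xar -cf archive.xar /path/to/data

    # list the contents from the header, without scanning the file data
    xar -tf archive.xar

    # extract a single file by name
    xar -xf archive.xar path/to/myFileOfInterest.bz2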

Solution 4

The only archive format I know of that stores an index is ZIP, because I've had to reconstruct corrupted indexes more than once.
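
Since ZIP's central directory sits at the end of the archive, tools can seek straight to it to list or pull out single entries. A minimal sketch (archive and file names hypothetical):

    # list entries by reading only the central directory
    unzip -l archive.zip

    # extract one file without decompressing the rest
    unzip archive.zip path/to/myFileOfInterest.bz2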

Solution 5

Thorbjørn Ravn Andersen is right. GNU tar creates "seekable" archives by default, but it does not use that information when reading them unless the -n option is given. With the -n option, I just extracted a 7 GB file from a 300 GB archive in the time required to read/write 7 GB. Without -n, it took more than an hour and produced no result.

I'm not sure how compression affects this; my archive was not compressed. Compressed archives are not "seekable" because current (1.26) GNU tar offloads compression to an external program.
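
A sketch of the difference on an uncompressed archive (archive and file names hypothetical; -n is short for GNU tar's --seek):

    # without -n: tar reads the archive sequentially from the start
    tar -xf archive.tar path/to/myFileOfInterest.bz2

    # with -n: tar still walks the entry headers, but lseek()s past
    # each file's data instead of reading it
    tar -n -xf archive.tar path/to/myFileOfInterest.bz2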




Comments

  • Alex Reynolds
    Alex Reynolds almost 2 years

    I am using tar to archive a group of very large (multi-GB) bz2 files.

    If I use tar -tf file.tar to list the files within the archive, this takes a very long time to complete (~10-15 minutes).

    Likewise, cpio -t < file.cpio takes just as long to complete, plus or minus a few seconds.

Accordingly, retrieving a file from the archive (via tar -xf file.tar myFileOfInterest.bz2, for example) is just as slow.

    Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly?

    For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (as well as any other filesystem-specific particulars).

    Is there a tool (or argument to tar or cpio) that allows efficient retrieval of a file within the archive?

    • Admin
      Admin almost 4 years
As others have said, most archive formats other than tar use an index; you can also build an external index for uncompressed tar archives: serverfault.com/a/1023249/254756
  • Alex Reynolds
    Alex Reynolds almost 15 years
    Is there a way to pipe an extraction to standard output? It looks like there's a way to make an archive from standard input, but not a way (at least not directly) to extract to standard output. It's not clear from the documentation if there is a way to do this. Do you know how this might be accomplished?
  • cas
    cas almost 15 years
Nope, I don't know. I don't actually use dar myself... I just know that it exists. I'm happy enough with tar, and tend to just create text files listing the contents of large tar files that I might want to search later. You can do this at the same time as creating the tar archive by using the v option twice (e.g. "tar cvvjf /tmp/foo.tar.bz2 /path/to/backup > /tmp/foo.txt")
  • cas
    cas almost 15 years
+1 for alerting me to a useful-sounding tool I'd never heard of before.
  • Brian Minton
    Brian Minton over 9 years
According to the tar man page (man7.org/linux/man-pages/man1/tar.1.html), GNU tar will by default use the seekable format when writing and, if the archive is seekable, will use it when reading (for list or extract). If you are using GNU tar and still seeing the issue, you should file a bug report with GNU.
  • Pacerier
    Pacerier over 9 years
    ZIP files can grow big.
  • Pacerier
    Pacerier over 9 years
The star link is down.
  • icando
    icando over 9 years
If I read the manual correctly, it never says tar keeps any sort of index that lets it jump to a given file by name. --seek just means the underlying media is seekable, so that when tar reads from the beginning it can skip over file contents, but it still needs to read every entry header from the start. So if you have an archive with 1M files and try to extract the last one, with --no-seek you need to read the contents of all the files; with --seek you only need to read 1M headers, one per file, but it is still super slow.
  • user1089802
    user1089802 over 9 years
    @Pacerier To my understanding the ZIP64 format allows for very large files, but the original ZIP format doesn't.
  • Pacerier
    Pacerier over 9 years
@ThorbjørnRavnAndersen, A single 4 GB file is big, dude.
  • alexandre
    alexandre over 5 years
@Pacerier 4 GB hasn't been big since DVD ISOs came on the scene almost twenty years ago. Terabytes is big nowadays.
  • user1089802
    user1089802 over 5 years
For Linux, it would be fine to 7zip a tar file.
  • tao_oat
    tao_oat over 4 years
    @ThorbjørnRavnAndersen that would defeat the point of being indexable