Generating ZIP files with PHP + Apache on-the-fly in high speed?

Solution 1

This may be what you need: http://pablotron.org/software/zipstream-php/

This lib allows you to build a dynamic streaming zip file without swapping to disk.
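For illustration, here is a minimal sketch of what streaming with such a library looks like. It assumes a ZipStream-PHP-style API (the exact class and method names differ between versions of the library, so treat it as a pattern rather than a copy-paste recipe), and $filtered_files stands in for the result of the user's filter query:

require 'vendor/autoload.php';   // Composer autoloader, assuming the library is installed that way

// The archive is written straight to the output stream, so the download
// starts immediately and nothing is staged on disk.
$zip = new ZipStream\ZipStream('collection.zip');

foreach ($filtered_files as $path) {
    $zip->addFileFromPath(basename($path), $path);
}

// Writes the central directory and terminates the stream.
$zip->finish();

Because entries are read and emitted one at a time, memory use stays flat no matter how large the final archive gets.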

Solution 2

You're going to have to store the generated zip file if you want users to be able to resume downloads.

Basically you generate the zip file and chuck it in a /tmp directory with a repeatable filename (a hash of the search filters, maybe). Then you send the correct headers and echo the result of file_get_contents() to the user.
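As a rough sketch of that idea, using PHP's bundled ZipArchive extension (here $filters and $files are hypothetical placeholders for the user's filter criteria and the matching file paths):

// Repeatable cache filename derived from the filter criteria.
$zip_file = '/tmp/export-' . md5(serialize($filters)) . '.zip';

if (!file_exists($zip_file)) {
    $zip = new ZipArchive();
    $zip->open($zip_file, ZipArchive::CREATE);
    foreach ($files as $path) {
        $zip->addFile($path, basename($path)); // store each file under its base name
    }
    $zip->close();
}

A repeated request with the same filters then finds the archive already on disk and only has to serve it.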

To support resuming you need to look at the $_SERVER['HTTP_RANGE'] value; its format is detailed in the HTTP/1.1 specification, and once you've parsed it you'll need to run something like this.

$size = filesize($zip_file);

if (isset($_SERVER['HTTP_RANGE'])) {
    // Parse a header of the form "bytes=start-end" (the end may be empty;
    // suffix ranges like "bytes=-500" are not handled here).
    list($start, $end) = explode('-', substr($_SERVER['HTTP_RANGE'], strlen('bytes=')), 2);
    $start  = intval($start);
    $end    = ($end === '') ? $size - 1 : intval($end);
    $length = $end - $start + 1;

    header("HTTP/1.1 206 Partial Content");
    header("Content-Length: $length");
    header("Content-Range: bytes $start-$end/$size");
    echo file_get_contents($zip_file, false, null, $start, $length);
} else {
    header("Content-Length: $size");
    echo file_get_contents($zip_file);
}

This is very sketchy code; you'll probably need to play around with the headers and the contents of the HTTP_RANGE variable a bit. You can also use fopen() and fread() rather than file_get_contents() if you wish, and just fseek() to the right place.
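For large archives the fopen/fseek route is the safer one, since file_get_contents() pulls the whole requested range into memory at once. A minimal sketch, reusing the $zip_file, $start and $length values from the snippet above:

$fp = fopen($zip_file, 'rb');
fseek($fp, $start);                 // jump to the first requested byte
$remaining = $length;

while ($remaining > 0 && !feof($fp)) {
    $chunk = fread($fp, min(8192, $remaining));
    echo $chunk;
    $remaining -= strlen($chunk);
    flush();                        // push each chunk out to the client
}

fclose($fp);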

Now to your questions

  • PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?

You can remove it if you want to. However, if something goes pear-shaped and your code gets stuck in an infinite loop, that can lead to interesting problems, should that infinite loop be logging an error somewhere and you don't notice until a rather grumpy sysadmin wonders why their server ran out of hard disk space ;)

  • With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?

Caching the file to the hard disk means you won't have this problem.

  • Will passing large amounts of file data through PHP not be a performance hit in itself?

Yes, it won't be as fast as a regular download served directly by the web server, but it shouldn't be too slow.

Solution 3

Use e.g. the PhpConcept Library Zip (PclZip) library.

Resuming has to be supported by your web server, except in the case where you don't make the zip files directly accessible. If a PHP script acts as a mediator, then pay attention to sending the right headers to support resuming.

The script creating the files shouldn't ever time out; just make sure the users can't select thousands of files at once. Also keep something in place to remove "old" zip files, and watch out that some malicious user doesn't use up your disk space by requesting many different file collections.
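A simple cleanup pass, run from cron or at the start of each request, could look like this sketch (the /tmp/export-*.zip pattern and the one-hour lifetime are arbitrary assumptions, adjust to taste):

// Delete cached archives that haven't been modified for an hour.
foreach (glob('/tmp/export-*.zip') as $file) {
    if (filemtime($file) < time() - 3600) {
        unlink($file);
    }
}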

Author: Vilx-

Just your average everyday programmer. #SOreadytohelp

Updated on July 19, 2022

Comments

  • Vilx-
    Vilx- almost 2 years

    To quote some famous words:

    “Programmers… often take refuge in an understandable, but disastrous, inclination towards complexity and ingenuity in their work. Forbidden to design anything larger than a program, they respond by making that program intricate enough to challenge their professional skill.”

    While solving some mundane problem at work I came up with this idea, which I'm not quite sure how to solve. I know I won't be implementing this, but I'm very curious as to what the best solution is. :)


    Suppose you have this big collection of JPG files and a few odd SWF files. By "big" I mean "a couple thousand". Every JPG file is around 200KB, and the SWFs can be up to a few MB in size. Every day there are a few new JPG files. The total size of all the stuff is thus around 1 GB, and it is slowly but steadily increasing. Files are VERY rarely changed or deleted.

    The users can view each of the files individually on the webpage. However, there is also a wish to allow them to download a whole bunch of them at once. The files have some metadata attached to them (date, category, etc.) that the user can filter the collection by.

    The ultimate implementation would then be to allow the user to specify some filter criteria and then download the corresponding files as a single ZIP file.

    Since the number of possible criteria combinations is large, I cannot pre-generate all the possible ZIP files and must do it on the fly. Another problem is that the download can be quite large, and for users with slow connections it's quite likely that it will take an hour or more. Support for "resume" is therefore a must-have.

    On the bright side however the ZIP doesn't need to compress anything - the files are mostly JPEGs anyway. Thus the whole process shouldn't be more CPU-intensive than a simple file download.

    The problems then that I have identified are thus:

    • PHP has execution timeout for scripts. While it can be changed by the script itself, will there be no problems by removing it completely?
    • With the resume option, there is the possibility of the filter results changing for different HTTP requests. This might be mitigated by sorting the results chronologically, as the collection is only getting bigger. The request URL would then also include a date when it was originally created and the script would not consider files younger than that. Will this be enough?
    • Will passing large amounts of file data through PHP not be a performance hit in itself?

    How would you implement this? Is PHP up to the task at all?


    Added:

    By now two people have suggested storing the requested ZIP files in a temporary folder and serving them from there as ordinary files. While this is indeed an obvious solution, there are several practical considerations which make it infeasible.

    The ZIP files will usually be pretty large, ranging from a few tens of megabytes to hundreds of megabytes. It's also completely normal for a user to request "everything", meaning that the ZIP file will be over a gigabyte in size. Also, there are many possible filter combinations, and many of them are likely to be selected by the users.

    As a result, the ZIP files will be pretty slow to generate (due to the sheer volume of data and disk speed), and will contain the whole collection many times over. I don't see how this solution would work without some mega-expensive SCSI RAID array.

  • Vilx-
    Vilx- almost 15 years
    No, I don't want to store the generated ZIP files. As I said, the combinations are many, and even under normal circumstances they could become very large. Also, I want to offer the option to download "everything". If I wanted to create the ZIP file first and then give the user a link to it, the user would never get it - the disk speed alone would mean that the ZIP file would take forever to build.
  • Vilx-
    Vilx- almost 15 years
    Why construct it all? If I know the byte at which I have to restart, I can also calculate which file was at that point in the ZIP file and skip reading all the files that came before (or after) it. Just generate the ZIP archive "by hand", "in place", and output whatever you have generated immediately to the user. There's no point in generating it all anywhere (a sketch of this offset calculation follows after the comments).
  • Vilx-
    Vilx- almost 15 years
    Nice! That's a solution in the right direction! :) It doesn't support resume or caching of CRC values, but it demonstrates that this is possible and I can build on top of it! :)
  • Frosty Z
    Frosty Z over 12 years
    There appears to be a new version of ZipStream here: github.com/Grandt/PHPZip
  • grorel
    grorel about 3 years
    404 not found on the link
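To make the offset calculation mentioned in the "Why construct it all?" comment above a bit more concrete, here is a rough sketch of the bookkeeping it implies. It assumes every entry is stored uncompressed with no extra fields or data descriptors (so each entry costs a fixed 30-byte local file header plus the filename plus the raw file data), and it ignores the central directory that has to be appended at the end; $files and $range_start are hypothetical inputs.

$offset    = 0;
$resume_at = $range_start;   // first byte requested by the resuming client
$entries   = [];

foreach ($files as $path) {
    $name = basename($path);
    $entry_size = 30 + strlen($name) + filesize($path); // local header + name + stored data
    $entries[] = ['path' => $path, 'start' => $offset, 'size' => $entry_size];
    $offset += $entry_size;
}

// Skip every entry that ends before the resume point; only the rest
// (plus the central directory) has to be regenerated and sent.
$todo = array_filter($entries, function ($e) use ($resume_at) {
    return $e['start'] + $e['size'] > $resume_at;
});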