Using Lambda functions to unzip archives in S3 is really sloooooow


I suspect that the unzip module you are using is a pure-JavaScript implementation of zip extraction, which is very slow.

I recommend compressing the files with gzip instead and decompressing them with Node's built-in zlib library, which is C-compiled and should give much better performance.
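For illustration, here is a minimal sketch of that approach, assuming the producer uploads a single gzipped object instead of a zip archive; the bucket names match the question, but the key handling and everything else here is an assumption, not code from the post:

    var aws = require('aws-sdk');
    var zlib = require('zlib');
    var s3 = new aws.S3({apiVersion: '2006-03-01'});

    exports.handler = function(event, context) {
        var key = event.Records[0].s3.object.key;

        // Stream the object from S3 instead of buffering the whole body in memory
        var input = s3.getObject({Bucket: 'xxx-zip', Key: key}).createReadStream();

        // createGunzip() uses Node's native zlib bindings, not a JavaScript implementation
        var gunzip = zlib.createGunzip();

        // s3.upload() accepts a stream of unknown length and manages the multipart upload
        s3.upload({Bucket: 'xxx-data', Key: key.replace(/\.gz$/, ''), Body: input.pipe(gunzip)},
        function(err, data) {
            if (err) {
                console.log('error writing ' + key + ': ' + err);
            }
            context.done(err, '');
        });
    };

Note that gzip compresses a single stream, so this only applies if each file is uploaded as its own .gz object (or the files are tarred together first); it is not a drop-in replacement for a multi-file zip archive.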

If you choose to stick with zip, you could contact Amazon support and ask to have the 60-second limit on your Lambda function increased.

Comments

  • russell (almost 2 years ago)

    My company is uploading large archive files to S3, and now wants them to be unzipped on S3. I wrote a lambda function based on unzip, triggered by arrival of a file to the xxx-zip bucket, which streams the zip file from S3, unzips the stream, and then streams the individual files to the xxx-data bucket.

    It works, but I find it much slower than I expected: even on a test file, a zip of about 500 KB holding around 500 files, it times out with a 60-second timeout set. Does this seem right? On my local system running with node it is faster than this. It seems to me that since the files are being moved inside Amazon's cloud, latency should be short, and since the files are being streamed, the actual time taken should be about the time it takes to unzip the stream.

    Is there an inherent reason why this won't work, or is there something in my code that is causing it to be so slow? This is the first time I've worked with node.js, so I could be doing something badly. Or is there a better way to do this that I couldn't find with Google?

    Here is an outline of the code (BufferStream is a class I wrote that wraps the Buffer returned by s3.getObject() into a readStream):

    var aws = require('aws-sdk');
    var s3 = new aws.S3({apiVersion: '2006-03-01'});
    var unzip = require('unzip');
    var stream = require('stream');
    var util = require( "util" );
    var fs = require('fs');
    
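    // SOURCE_BUCKET, DEST_BUCKET and the BufferStream class are defined elsewhere; this is an outline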
    exports.handler = function(event, context) {
        var zipfile = event.Records[0].s3.object.key;
        s3.getObject({Bucket:SOURCE_BUCKET, Key:zipfile}, 
        function(err, data) {
            var errors = 0;
            var total = 0;
            var successful = 0;
            var active = 0;
            if (err) {
                console.log('error: ' + err);
            }
            else {
                console.log('Received zip file ' + zipfile);
                new BufferStream(data.Body)
                .pipe(unzip.Parse()).on('entry', function(entry) {
                    total++;
                    var filename = entry.path;
                    var in_process = ' (' + ++active + ' in process)';
                    console.log('extracting ' + entry.type + ' ' + filename + in_process );
                    s3.upload({Bucket:DEST_BUCKET, Key: filename, Body: entry}, {},
                    function(err, data) {
                        var remaining = ' (' + --active + ' remaining)';
                        if (err) {
                            // if for any reason the file cannot be read, discard it
                            errors++;
                            console.log('Error pushing ' + filename + ' to S3' + remaining + ': ' + err);
                            entry.autodrain();
                        }
                        else {
                            successful++;
                            console.log('successfully wrote ' + filename + ' to S3' + remaining);
                        }
                    });
                });
                console.log('Completed, ' + total + ' files processed, ' + successful + ' written to S3, ' + errors + ' failed');
                context.done(null, '');
            }
        });
    }
    
  • russell (about 9 years ago)
    Thanks for the suggestion. I am trying to do this with zlib to compare times, but ran into a problem with uploading to S3 (stackoverflow.com/questions/28688490/…)
  • russell (about 9 years ago)
    After more measurements I found that it is the size of the files, not the number, that causes the problem. If the archive contained a 5 MB text file, that alone took most of the time. So it looks like this is not the right way to do it. I'm going to write it to run on EC2, polling from SQS, instead.