Read and parse CSV file in S3 without downloading the entire file

Solution 1

You should just be able to use the createReadStream method and pipe it into fast-csv:

const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').fromStream(s3Stream)
  .on('data', (data) => {
    // do something here
  })
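For reference, here is what a self-contained version of that answer might look like. This is a sketch assuming the v2 aws-sdk client and fast-csv 2.x, the last major line that shipped fromStream (it was removed in v3, as the next answer notes); the bucket and key names are placeholders:

const AWS = require('aws-sdk');
const csv = require('fast-csv'); // 2.x; fromStream was removed in v3

const s3 = new AWS.S3();
// Placeholder bucket/key; createReadStream() streams the object instead of
// buffering it in memory
const params = { Bucket: 'my-bucket', Key: 'data.csv' };
const s3Stream = s3.getObject(params).createReadStream();

csv.fromStream(s3Stream, { headers: true })
  .on('data', (row) => {
    // each parsed row arrives here as the object streams in
  })
  .on('end', () => {
    console.log('done');
  });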

Solution 2

I do not have enough reputation to comment, but as of now the accepted answer's fromStream method no longer exists in fast-csv. You'll need to use the parseStream method instead:

const s3Stream = s3.getObject(params).createReadStream()
require('fast-csv').parseStream(s3Stream)
  .on('data', (data) => {
    // use rows
  })
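As the comments on the accepted answer note, you can also listen for the end and error events to know when parsing finishes or fails. A sketch, assuming fast-csv v3+ and the same s3Stream as above:

const csv = require('fast-csv');

csv.parseStream(s3Stream, { headers: true })
  .on('error', (err) => console.error('CSV parse failed:', err))
  .on('data', (row) => {
    // use rows
  })
  .on('end', (rowCount) => console.log(`Parsed ${rowCount} rows`));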

Solution 3

For me, this is what solved the issue:

  const csv = require('@fast-csv/parse');

  const params = {
    Bucket: srcBucket,
    Key: srcKey,
  };
  const csvFile = s3.getObject(params).createReadStream();

  // Wrap the event-based parser in a promise so that it can be awaited
  const parserPromise = new Promise((resolve, reject) => {
    csv
      .parseStream(csvFile, { headers: true })
      .on("data", function (data) {
        console.log("Data parsed: ", data);
      })
      .on("end", function () {
        resolve("csv parse process finished");
      })
      .on("error", function () {
        reject("csv parse process failed");
      });
  });

  try {
    // await requires an enclosing async function, e.g. an async Lambda handler
    await parserPromise;
  } catch (error) {
    console.log("Get Error: ", error);
  }
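The same pattern carries over to the v3 AWS SDK, where getObject becomes sending a GetObjectCommand and, in Node.js, the response Body is already a readable stream. A sketch, assuming @aws-sdk/client-s3 and @fast-csv/parse; the helper name, bucket, and key are hypothetical:

const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const csv = require('@fast-csv/parse');

const s3 = new S3Client({});

// Hypothetical helper: streams the object and resolves with the row count
async function parseCsvFromS3(bucket, key) {
  const { Body } = await s3.send(
    new GetObjectCommand({ Bucket: bucket, Key: key })
  );
  return new Promise((resolve, reject) => {
    csv.parseStream(Body, { headers: true })
      .on('data', (row) => {
        // handle one parsed row at a time
      })
      .on('end', (rowCount) => resolve(rowCount))
      .on('error', reject);
  });
}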

Comments

  • changingrainbows over 3 years

    Using Node.js, with the intention of running this module as an AWS Lambda function (a handler along these lines is sketched after this thread).

    Using s3.getObject() from aws-sdk, I am able to successfully pick up a very large CSV file from Amazon S3. The intention is to read each line in the file and emit an event with the body of each line.

    In all examples I could find, it looks like the entire CSV file in S3 has to be buffered or streamed, converted to a string and then read line by line.

    s3.getObject(params, function(err, data) {
      var body = data.Body.toString('utf-8');
    });
    

    This operation takes a very long time, given the size of the source CSV file. Also, the CSV rows are of varying length, and I'm not certain if I can use the buffer size as an option.

    Question

    Is there a way to pick up the S3 file in Node.js and read/transform it line by line, avoiding stringifying the entire file in memory first?

    Ideally, I'd prefer to use the better capabilities of fast-csv and/or node-csv, instead of looping manually.

  • Deepak G M almost 5 years
    This works pretty well. Just to add to it: if you want to know when the parsing ends, add .on('end' () => { //Your handling })
  • ChristoKiwi almost 5 years
    @DeepakGM you forgot a comma .on('end', () => { })
  • Hoon over 3 years
    Thanks for adding this. I was looking for it. :)
  • Kai Durai over 3 years
    This method is deprecated; see my answer below for using the parseStream method instead.
  • pta over 2 years
    Which version of the library contains fromStream? parseStream throws an Error...
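Since the question targets an AWS Lambda function, here is one way the pieces above might fit together in a handler. A sketch, assuming aws-sdk v2 and fast-csv v3+; the event shape (bucket and key passed by the invoker) is hypothetical:

const AWS = require('aws-sdk');
const csv = require('fast-csv');

const s3 = new AWS.S3();

exports.handler = async (event) => {
  // Hypothetical event shape: the invoker passes bucket and key directly
  const s3Stream = s3
    .getObject({ Bucket: event.bucket, Key: event.key })
    .createReadStream();

  // Wrap the event-based parser in a promise so the handler can await it
  const rowCount = await new Promise((resolve, reject) => {
    csv.parseStream(s3Stream, { headers: true })
      .on('error', reject)
      .on('data', (row) => {
        // emit/handle one row at a time here
      })
      .on('end', resolve);
  });

  return { rowCount };
};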