s3 - how to get fast line count of file? wc -l is too slow


Solution 1

Here are two methods that might work for you.

Amazon S3 has a new feature called S3 Select that allows you to query files stored on S3.

You can perform a count of the number of records (lines) in a file and it can even work on GZIP files. Results may vary depending upon your file format.

S3 Select
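
For example, here is a minimal boto3 sketch of such a count (the bucket and key names are placeholders, and the file is assumed to be a GZIP-compressed CSV):

import boto3

s3 = boto3.client('s3')

# Placeholder bucket/key; the object is assumed to be a GZIP-compressed CSV.
resp = s3.select_object_content(
    Bucket='my-bucket',
    Key='logs/big-file.csv.gz',
    ExpressionType='SQL',
    Expression='SELECT COUNT(*) FROM s3object',
    InputSerialization={'CSV': {}, 'CompressionType': 'GZIP'},
    OutputSerialization={'CSV': {}},
)

# The response is an event stream; the count arrives in 'Records' events.
for event in resp['Payload']:
    if 'Records' in event:
        print('Line count:', event['Records']['Payload'].decode('utf-8').strip())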

Amazon Athena is a similar option that might also be suitable. It can query files stored in Amazon S3.
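
Since the question requires a non-interactive run, here is a hedged boto3 sketch of an Athena count (the database, table, and output location are all assumptions; an external table over the S3 data must already be defined):

import time
import boto3

athena = boto3.client('athena')

# Hypothetical database/table and results location; a table over the S3 data
# must already exist (e.g. created with CREATE EXTERNAL TABLE).
qid = athena.start_query_execution(
    QueryString='SELECT COUNT(*) FROM my_database.my_table',
    ResultConfiguration={'OutputLocation': 's3://my-query-results/'},
)['QueryExecutionId']

# Poll until the query finishes (simplified; a real batch job should also
# handle the FAILED and CANCELLED states).
while True:
    state = athena.get_query_execution(QueryExecutionId=qid)['QueryExecution']['Status']['State']
    if state not in ('QUEUED', 'RUNNING'):
        break
    time.sleep(1)

# The first result row is the header; the second holds the count.
rows = athena.get_query_results(QueryExecutionId=qid)['ResultSet']['Rows']
print('Line count:', rows[1]['Data'][0]['VarCharValue'])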

Solution 2

Yes, Amazon S3 has the S3 Select feature. Also keep an eye on the cost while executing queries. For example, here is the pricing as of June 2018 (this may vary): S3 Select pricing is based on the size of the input, the output, and the data transferred. Each query costs $0.002 per GB scanned, plus $0.0007 per GB returned.
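
At those rates, a COUNT(*) over the asker's 20 GB file would scan about 20 GB and return almost nothing, so each run would cost roughly 20 × $0.002 ≈ $0.04.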

Solution 3

You can do it using python/boto3. Define the bucket name and object key:

import os
import boto3

colsep = ','
s3 = boto3.client('s3')
bucket_name = 'my-data-test'
s3_key = 'in/file.parquet'

Note that S3 SELECT can access only one file at a time.

Now you can open an S3 SELECT cursor:

sql_stmt = """SELECT count(*) FROM s3object S"""
req_fact = s3.select_object_content(
    Bucket=bucket_name,
    Key=s3_key,
    ExpressionType='SQL',
    Expression=sql_stmt,
    InputSerialization={'Parquet': {}},
    OutputSerialization={'CSV': {
        'RecordDelimiter': os.linesep,
        'FieldDelimiter': colsep}},
)

Now iterate through the returned records:

for event in req_fact['Payload']:
    if 'Records' in event:
        rr = event['Records']['Payload'].decode('utf-8')
        for i, rec in enumerate(rr.split(os.linesep)):
            if rec:
                row = rec.split(colsep)
                if row:
                    print('File line count:', row[0])

If you want to count records in all parquet files in a given S3 directory, check out this python/boto3 script: S3-parquet-files-row-counter
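
For reference, here is a minimal sketch of that idea (the bucket, prefix, and .parquet suffix filter are assumptions) that sums the S3 Select counts across files:

import boto3

s3 = boto3.client('s3')
bucket_name = 'my-data-test'   # assumed bucket, as above
prefix = 'in/'                 # assumed prefix holding the parquet files

total = 0
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
    for obj in page.get('Contents', []):
        if not obj['Key'].endswith('.parquet'):
            continue
        resp = s3.select_object_content(
            Bucket=bucket_name,
            Key=obj['Key'],
            ExpressionType='SQL',
            Expression='SELECT count(*) FROM s3object S',
            InputSerialization={'Parquet': {}},
            OutputSerialization={'CSV': {}},
        )
        # The count for each file comes back in a 'Records' event.
        for event in resp['Payload']:
            if 'Records' in event:
                total += int(event['Records']['Payload'].decode('utf-8').strip())

print('Total row count:', total)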

Author: tooptoop4

Updated on June 09, 2022

Comments

  • tooptoop4 almost 2 years

    Does anyone have a quick way of getting the line count of a file hosted in S3? Preferably using the CLI or s3api, but I am open to python/boto as well. Note: the solution must run non-interactively, i.e. in an overnight batch.

    Right now I am doing this; it works, but it takes around 10 minutes for a 20 GB file:

     aws s3 cp s3://foo/bar - | wc -l
    
  • John Rotenstein about 6 years
    Amazon Athena has an SDK or can be called from the AWS CLI. There is apparently an S3 Select SDK for Java and Python but I'm unable to locate it.
  • kdgregory about 6 years
    One limitation of S3 Select is that it only works on CSV or JSON files, not arbitrary "lines". Re the SDK, it does take a short while for new APIs to be released.
  • Hrushikesh Patel about 4 years
    This is not working for files larger than 128 MB.
  • John Rotenstein about 4 years
    @HrushikeshPatel What do you mean by "not working"? What is the error message you receive?
  • Hrushikesh Patel about 4 years
    AWS saying "The maximum input file size of 128 MB is exceeded. This file is 397.6 MB. To work with larger files, use the API."
  • pramod singh about 3 years
    @HrushikeshPatel did you find any solution, using the AWS SDK or the CLI?