Is it possible to loop through an Amazon S3 bucket and count the number of lines in its files/keys using Python?
Solution 1
Using boto3, you can do the following:
import boto3

# create the s3 resource
s3 = boto3.resource('s3')

# get the file object
obj = s3.Object('bucket_name', 'key')

# read the file contents into memory (boto3 returns bytes)
file_contents = obj.get()["Body"].read()

# count the newline characters to get the number of lines
print(file_contents.count(b'\n'))
If you want to do this for all objects in a bucket, you can use the following code snippet:
bucket = s3.Bucket('bucket_name')
for obj in bucket.objects.all():
    file_contents = obj.get()["Body"].read()
    print(file_contents.count(b'\n'))
Here is the reference to boto3 documentation for more functionality: http://boto3.readthedocs.io/en/latest/reference/services/s3.html#object
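As a variant of the loop above, here is a sketch that lists objects with a client-side paginator, so buckets with more than 1000 objects are fully covered. The bucket name and prefix are placeholders you would supply yourself, not names from the question:

```python
def count_newlines(data):
    # boto3 returns object bodies as bytes, so count the b'\n' byte
    return data.count(b'\n')


def count_bucket_lines(bucket_name, prefix=''):
    """Total newline count across every object under a prefix."""
    import boto3  # imported here so the pure helper above has no dependency

    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    total = 0
    for page in paginator.paginate(Bucket=bucket_name, Prefix=prefix):
        for item in page.get('Contents', []):
            body = s3.get_object(Bucket=bucket_name, Key=item['Key'])['Body']
            total += count_newlines(body.read())
    return total
```

Like the loop above, this still reads each object fully into memory; it only changes how the objects are listed.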
Update: (Using boto 2)
import boto

s3 = boto.connect_s3()  # establish connection
bucket = s3.get_bucket('bucket_name')  # get bucket
for key in bucket.list(prefix='key'):  # list objects at a given prefix
    file_contents = key.get_contents_as_string()  # get file contents as bytes
    print(file_contents.count(b'\n'))  # count the newline characters to get the number of lines
Solution 2
Reading large files into memory is sometimes far from ideal. Instead, you may find the following more useful:
s3 = boto3.client('s3')
obj = s3.get_object(Bucket='bucketname', Key=fileKey)
nlines = 0
for _ in obj['Body'].iter_lines():
    nlines += 1
print(nlines)
Author: Renukadevi
Updated on June 04, 2022

Comments
-
Renukadevi almost 2 years
Is it possible to loop through the file/key in Amazon S3 bucket, read the contents and count the number of lines using Python?
For example:
1. My bucket: "my-bucket-name"
2. File/Key: "test.txt"
I need to loop through the file "test.txt" and count the number of lines in the raw file.
Sample Code:
for bucket in conn.get_all_buckets():
    if bucket.name == "my-bucket-name":
        for file in bucket.list():
            # need to count the number of lines in each file and print to a log
-
Renukadevi almost 8 years: Hi, thanks. Maybe I didn't phrase my question properly. I want to iterate through specific files in S3 and count the number of rows in each.
-
mootmoot almost 8 years: @Renukadevi: Please clarify the meaning of "specific". Do you mean files with a given prefix?
-
Renukadevi almost 8 years: Trouble is, I am not using boto 3.0. My version of boto is 2.38.0, so I cannot try the s3.Object methods. Another issue is that my files are all in .gz format, and it gets even worse when I try to use Key.open_read as an fd for gzip.GzipFile: it errors with AttributeError: 'str' object has no attribute 'tell' (or 'seek'). I was wondering if there is any workaround.
-
tamjd1 almost 8 years: @Renukadevi, I updated my post to add an example for boto 2. To decompress gzip data, you can probably use the zlib library; see the example here: stackoverflow.com/a/2695575/4072706. Hope this helps.
-
Renukadevi almost 8 years: Thanks a ton, that was so simple. I am new to AWS and your solution helped a lot.