Load S3 Data into AWS SageMaker Notebook

Solution 1

According to the linked documentation, it seems you can specify the S3 location in the InputDataConfig. Search the document for "S3DataSource" (ref). The first hit is even a Python example, on pages 25-26.
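For orientation, here is a minimal sketch of what an InputDataConfig entry with an S3DataSource can look like when passed to a SageMaker training job. The bucket name, prefix, channel name and content type are placeholder assumptions, and the helper `build_input_data_config` is hypothetical. Note that this configures the input of a training job; it does not by itself load data into a DataFrame in the notebook.

```python
def build_input_data_config(bucket, prefix, channel_name="train"):
    """Return an InputDataConfig list with a single S3-backed channel.

    Hypothetical helper: bucket, prefix and channel_name are placeholders.
    """
    return [
        {
            "ChannelName": channel_name,
            "DataSource": {
                "S3DataSource": {
                    # Use every object under the given prefix
                    "S3DataType": "S3Prefix",
                    "S3Uri": "s3://{}/{}".format(bucket, prefix),
                    # Copy the full dataset to each training instance
                    "S3DataDistributionType": "FullyReplicated",
                }
            },
            # Assumed content type for CSV training data
            "ContentType": "text/csv",
        }
    ]


config = build_input_data_config("my-bucket", "train/")
```

The resulting list is the shape expected by the `InputDataConfig` argument of a training job request.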

Solution 2

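import boto3  # not used below; pandas reads the s3:// path via s3fs
import pandas as pd
from sagemaker import get_execution_role

# IAM role attached to the notebook instance (not passed to read_csv)
role = get_execution_role()
bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# Read the CSV straight from S3 into a DataFrame
df = pd.read_csv(data_location)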
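For comparison, this is roughly what the explicit boto3 route looks like, the one the question hoped to avoid. It is only a sketch: the helper name `fetch_s3_object` and its optional `s3_client` parameter are hypothetical, added so that a stub client can be injected for testing.

```python
import io


def fetch_s3_object(bucket, key, s3_client=None):
    """Download one S3 object and return its contents as an in-memory buffer."""
    if s3_client is None:
        # Imported lazily so that a stub client can be passed in instead
        import boto3
        s3_client = boto3.client("s3")
    response = s3_client.get_object(Bucket=bucket, Key=key)
    # The response body is a stream; read it fully into memory
    return io.BytesIO(response["Body"].read())
```

The returned buffer can be handed to pandas, for example `pd.read_csv(fetch_s3_object('my-bucket', 'train.csv'))`. For a plain CSV read, passing the s3:// path to pd.read_csv is shorter.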

Solution 3

In the simplest case you need neither boto3 nor the execution role, because you are only reading an object.
The code then becomes even simpler:

import pandas as pd

bucket = 'my-bucket'
data_key = 'train.csv'
data_location = 's3://{}/{}'.format(bucket, data_key)

# Read the CSV straight from S3 into a DataFrame
df = pd.read_csv(data_location)

But as Prateek stated, make sure your SageMaker notebook instance is configured to have access to S3. This is set at the configuration step, under Permissions > IAM role.
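If the file sits in a folder inside the bucket, the folder is part of the object key, so it goes between the bucket name and the file name in the path. A small sketch of building such a path follows; the helper `s3_uri` and the folder name `data` are hypothetical.

```python
def s3_uri(bucket, *key_parts):
    """Join a bucket name and key segments into an s3:// URI.

    Leading and trailing slashes on each segment are dropped so the
    result never contains doubled slashes.
    """
    cleaned = [part.strip("/") for part in key_parts if part.strip("/")]
    return "s3://{}/{}".format(bucket.strip("/"), "/".join(cleaned))


data_location = s3_uri("my-bucket", "data/", "train.csv")
```

The resulting string can be passed to pd.read_csv exactly like the `data_location` built with `format` above.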

Solution 4

You can also access your bucket as if it were a file system, using s3fs:

import s3fs
from IPython.display import display
from PIL import Image

fs = s3fs.S3FileSystem()

# List the first 5 files under a prefix in a bucket you can access
fs.ls('s3://bucket-name/data/')[:5]

# Open an object directly as a file-like object
with fs.open('s3://bucket-name/data/image.png') as f:
    display(Image.open(f))

Solution 5

Make sure the Amazon SageMaker role has a policy attached that grants it access to S3. This can be done in IAM.
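As a rough illustration, a read-only policy for a single bucket might look like the document built below. This is a sketch under assumptions: the bucket name is a placeholder, the helper `s3_read_policy` is hypothetical, and your setup may need broader or narrower permissions than listing and reading objects.

```python
import json


def s3_read_policy(bucket):
    """Build a minimal IAM policy document allowing read access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": ["arn:aws:s3:::{}".format(bucket)],
            },
            {
                # Reading applies to the objects inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": ["arn:aws:s3:::{}/*".format(bucket)],
            },
        ],
    }


policy = s3_read_policy("my-bucket")
policy_json = json.dumps(policy, indent=2)
```

The JSON text in `policy_json` is what you would attach to the notebook's execution role in the IAM console.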

Author by

A555h55

Updated on July 09, 2022

Comments

  • A555h55
    A555h55 almost 2 years

    I've just started to experiment with AWS SageMaker and would like to load data from an S3 bucket into a pandas dataframe in my SageMaker python jupyter notebook for analysis.

    I could use boto to grab the data from S3, but I'm wondering whether there is a more elegant method as part of the SageMaker framework to do this in my python code?

    Thanks in advance for any advice.

  • Hack-R
    Hack-R almost 5 years
    What are the advantages / disadvantages over the other way, I wonder
  • CircleOnCircles
    CircleOnCircles almost 5 years
    @Hack-R The pro is that you are able to use the python file pointer interface/object throughout the code. The con is that this object operates per file which might not be performance efficient.
  • Binyamin Even
    Binyamin Even over 4 years
    why did you import boto3?
  • ivankeller
    ivankeller over 4 years
    why do you need the role? (see my answer to the question below)
  • Iakovos Belonias
    Iakovos Belonias over 4 years
    With that solution you avoid the credential headache, it's exactly what I was looking for, thank you.
  • Mabyn
    Mabyn over 4 years
    @Ben Thanks for this answer; however it's not working for me. I'm getting this error: AttributeError: type object 'Image' has no attribute 'open'. Can you share what library you're using for Image or any other details? Thanks!
  • Mabyn
    Mabyn over 4 years
    Never mind, I just figured it out: from IPython.display import display; from PIL import Image. After that, the above worked great. Thanks!
  • Zach Oakes
    Zach Oakes almost 4 years
    I'm getting either a timeout or an Access Denied -- I have a folder between the file and bucket, so added that to end of bucket or begin of file -- I'm using root access, and don't think I have any protection on this bucket ? Does this (execution role) require an IAM?
  • Zach Oakes
    Zach Oakes almost 4 years
    Got it -- removing execution_role() fixed it -- great call. I was hoping something like this was available : )