Persisting data in Google Colaboratory

30,203

Solution 1

Put this line at the top of your notebook so the file is always downloaded before the rest of your code runs.

!wget -q http://www.yoursite.com/file.csv
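If you would rather not re-download on every run within the same session, the same idea can be sketched in plain Python. This is only an illustration: `fetch_if_missing` is a made-up helper name, and the URL in the usage comment is the placeholder from the command above.

```python
import os
import urllib.request


def fetch_if_missing(url, dest):
    """Download url to dest unless dest already exists; return dest."""
    if not os.path.exists(dest):
        urllib.request.urlretrieve(url, dest)
    return dest


# In the first cell of the notebook (placeholder URL, replace with your own):
# fetch_if_missing("http://www.yoursite.com/file.csv", "file.csv")
```

Once the VM is recycled the file is gone, so the first call in a new session downloads it again.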

Solution 2

Your interpretation is correct. VMs are ephemeral and recycled after periods of inactivity. There's no mechanism for persistent data on the VM itself right now.

In order for data to persist, you'll need to store it somewhere outside of the VM, e.g., Drive, GCS, or any other cloud hosting provider.

Some recipes for loading and saving data from external sources are available in the I/O example notebook.
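As a rough sketch of the Drive option, you can mount Drive and copy anything worth keeping into it before the session ends. `mount_drive` and `persist` are hypothetical helper names, and the `MyDrive/colab_data` folder in the usage comment is an assumption about where you want the files; `google.colab.drive.mount` is only available inside Colab.

```python
import os
import shutil


def mount_drive(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; return the mount point, else None."""
    try:
        from google.colab import drive  # importable only inside Colab
    except ImportError:
        return None
    drive.mount(mount_point)  # prompts for authorization on first use
    return mount_point


def persist(local_path, persistent_dir):
    """Copy a file from the ephemeral VM disk into a directory that outlives it."""
    os.makedirs(persistent_dir, exist_ok=True)
    return shutil.copy2(local_path, persistent_dir)


# Typical use in a notebook (folder name is an assumption):
# root = mount_drive()
# if root:
#     persist("results.csv", os.path.join(root, "MyDrive", "colab_data"))
```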

Solution 3

Not sure whether this is the best solution, but you can sync your data between Colab and Drive with automated authentication like this: https://gist.github.com/rdinse/159f5d77f13d03e0183cb8f7154b170a

Solution 4

Clouderizer may provide some data persistence, at the cost of a long setup (because you use Google Colab only as a host) and little space to work with.

Still, in my opinion that's better than having your file(s) "recycled" when you forget to save your progress.

Solution 5

As you pointed out, Google Colaboratory's file system is ephemeral. There are workarounds, though there's a network latency penalty and code overhead: e.g. you can use boilerplate code in your notebooks to mount external file systems like GDrive (see their example notebook).
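One way to soften the latency penalty, assuming Drive is already mounted, is to copy the input files onto the VM's local disk once per session and read them from there. The helper name `stage_locally` and the paths in the usage comment are illustrative, not part of any Colab API.

```python
import os
import shutil


def stage_locally(remote_dir, local_dir):
    """Copy files from remote_dir into local_dir, skipping ones already present.

    Returns the names of the files copied during this call.
    """
    os.makedirs(local_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(remote_dir)):
        src = os.path.join(remote_dir, name)
        dst = os.path.join(local_dir, name)
        if os.path.isfile(src) and not os.path.exists(dst):
            shutil.copy2(src, dst)
            copied.append(name)
    return copied


# Example (paths are assumptions about your Drive layout):
# stage_locally("/content/drive/MyDrive/colab_data", "/content/data")
```

Only top-level files are copied; subdirectories are ignored to keep the sketch short.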

Alternatively, while this is not supported in Colaboratory, other Jupyter hosting services – like Jupyo – provision dedicated VMs with persistent file systems so the data and the notebooks persist across sessions.

Author by

Admin

Updated on February 17, 2021

Comments

  • Admin
    Admin about 3 years

    Has anyone figured out a way to keep files persisted across sessions in Google's newly open sourced Colaboratory?

    Using the sample notebooks, I'm successfully authenticating and transferring csv files from my Google Drive instance and have stashed them in /tmp, my ~, and ~/datalab. Pandas can read them just fine off of disk too. But once the session times out, it looks like the whole filesystem is wiped and a new VM is spun up, without the downloaded files.

    I guess this isn't surprising given Google's Colaboratory FAQ:

    Q: Where is my code executed? What happens to my execution state if I close the browser window?

    A: Code is executed in a virtual machine dedicated to your account. Virtual machines are recycled when idle for a while, and have a maximum lifetime enforced by the system.

    Given that, maybe this is a feature (i.e., "go use Google Cloud Storage, which works fine in Colaboratory")? When I first used the tool, I was hoping that any .csv files in the My File/Colab Notebooks Google Drive folder would also be loaded onto the VM instance that the notebook was running on :/