Amazon EC2 + S3 + Python + Scraping - The cheapest way of doing this?

The basic premise of your setup seems fine; however, there are a few items you may want to factor in.

Firstly, EC2 network (and I/O) bandwidth is dependent on instance type. If you are hoping to use t1.micro instances, do not expect 'super fast internet connectivity' - even with an m1.small, you may not see the performance you are looking for. Also, keep in mind that you pay for bandwidth used on EC2 (not just for instance time).

With regard to your first point, there should be no real difficulty in setting up Python on an EC2 instance. The potential difficulty, however, lies in coordinating your instances. For example, if you have 2 instances running, how will you split the task between them? How will each instance 'know' what the other has done (presuming you aren't going to manually partition a list of URLs)? Moreover, if you are launching a new instance, will one of the EC2 instances be responsible for handling that, or will your local machine deal with it? If it is one of the EC2 instances, how do you determine which instance will be responsible (i.e. to prevent the 'launch' task being executed by every instance), and how do you redistribute the tasks to include the new instance? How do you determine which instance(s) to terminate automatically?
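
To illustrate the baseline that those questions are contrasted against, the simplest 'manual partitioning' approach is just to split the URL list up front and give each instance its own slice. A minimal sketch (the file name, instance count, and index below are placeholders, not anything from your setup):

    # Naive static partitioning: instance i fetches every N-th URL.
    # urls.txt, NUM_INSTANCES and INSTANCE_INDEX are illustrative only.
    NUM_INSTANCES = 2
    INSTANCE_INDEX = 0  # 0..NUM_INSTANCES-1, set differently on each instance

    with open("urls.txt") as f:
        urls = [line.strip() for line in f if line.strip()]

    my_urls = [u for i, u in enumerate(urls) if i % NUM_INSTANCES == INSTANCE_INDEX]
    print(f"this instance will fetch {len(my_urls)} of {len(urls)} URLs")

This works for a fixed, known-in-advance list, but it breaks down as soon as instances are added or removed mid-run - which is exactly the coordination problem described above.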

Undoubtedly, all of the above are possible (corosync/heartbeat, pacemaker, auto-scaling, etc.), but they are easy to overlook initially. Regardless, if you are looking for the 'best price', you will probably want to go with spot instances (as opposed to on-demand); however, for that to work you do need a fairly robust architecture. (It is worth noting that spot prices fluctuate significantly - at times exceeding the on-demand price; depending on the time-scale over which you are working, you will either want to set a low upper spot price, or determine the best approach (spot/on-demand) on a regular (hourly) basis to minimize your costs.) Although I can't confirm it at the moment, the simplest (and cheapest) option may be AWS' auto-scaling. You need to set up CloudWatch alarms (but CloudWatch does provide 10 free alarms), and auto-scaling itself does not have a cost associated with it (other than the cost of the new instances and the CloudWatch costs).
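
If you do go the spot route, requesting instances with an upper price cap is straightforward from Python. A minimal sketch using boto3 (boto3 rather than the original boto library is assumed here, and the AMI ID, instance type, and bid price are placeholders):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Request one spot instance, capped at a maximum hourly price.
    # ami-12345678 and the $0.05 cap are purely illustrative.
    response = ec2.request_spot_instances(
        SpotPrice="0.05",
        InstanceCount=1,
        Type="one-time",
        LaunchSpecification={
            "ImageId": "ami-12345678",
            "InstanceType": "m1.small",
        },
    )
    for req in response["SpotInstanceRequests"]:
        print(req["SpotInstanceRequestId"], req["State"])

If the current spot price rises above your cap, the request simply stays unfulfilled (or the instance is terminated), which is why the surrounding architecture needs to tolerate instances disappearing.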

Given that I really have no idea of the scope of your undertaking, I might ask why not simply use EC2 for the parsing and processing as well. Especially if the parsing is complex, the pages can be fetched faster than they can be processed, and you have a large number of pages (presumably, otherwise you wouldn't be going through the effort of setting up AWS), it might be more efficient to simply process the pages on EC2 and, when everything is done, download a dump of the database. Arguably, this might simplify things a bit: have one instance running MySQL (with the data stored on an EBS volume); each instance queries the MySQL instance for the next set of records (and perhaps marks those as reserved), fetches and processes the pages, and saves the results back to MySQL.
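
A rough sketch of the 'claim the next batch of records' idea, using pymysql (the table layout - urls with id, url, status columns - and the connection details are assumptions for illustration):

    import pymysql

    conn = pymysql.connect(host="mysql-instance", user="scraper",
                           password="secret", database="scrape")

    def claim_batch(size=50):
        """Atomically reserve a batch of unclaimed URLs for this instance."""
        with conn.cursor() as cur:
            conn.begin()
            # Lock a batch of unclaimed rows so no other instance grabs them.
            cur.execute(
                "SELECT id, url FROM urls WHERE status = 'new' "
                "LIMIT %s FOR UPDATE", (size,))
            rows = cur.fetchall()
            if rows:
                ids = [r[0] for r in rows]
                placeholders = ",".join(["%s"] * len(ids))
                cur.execute(
                    f"UPDATE urls SET status = 'reserved' WHERE id IN ({placeholders})",
                    ids)
            conn.commit()
        return rows

Each worker instance would call claim_batch() in a loop, fetch and parse the claimed pages, and then write the results back (and mark the rows 'done') in a similar fashion.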

If you are not going to run MySQL on EC2, you can either store your HTML files on S3, as you have mentioned, or save them on an EBS volume. The advantage of S3 is that you don't need to pre-allocate storage (especially useful if you don't know the size of the data you are dealing with) - you pay for PUTs/GETs and storage. The downside is speed - S3 is not meant to be used as a filesystem, and (even though you can mount it as a filesystem) it would be fairly inefficient to save each individual file to S3 (you will want to accumulate a few pages and then upload them to S3). Additionally, if you have a large number of files (tens of thousands), the process of fetching all the filenames, etc. can be slow.

EBS volumes are meant to be used as storage attached to an instance. The advantage is speed - both transfer rates and the fact that there is a 'filesystem' (so reading a list of files, etc. is quick) - and EBS volumes persist beyond instance termination (except for EBS root volumes, which do not by default, though they can be made to). The downsides of EBS volumes are that you have to pre-allocate a quantity of storage (which cannot be modified on the fly) and you pay for that amount of storage regardless of whether all of it is in use; you also pay for I/O operations (and the performance of EBS volumes is dependent on network speed, so larger instances get better EBS performance). The other advantage of EBS is that, being a filesystem, you can perform a task like gzipping the files very easily (and I imagine that if you are downloading a lot of HTML pages you will not want to be fetching individual files off S3 later on).
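
If you do go with S3, batching is worth the small amount of extra code: accumulate a set of fetched pages locally, tar+gzip them, and upload the archive as a single object rather than one PUT per page. A minimal boto3 sketch (the bucket name, key, and paths are placeholders):

    import tarfile
    import boto3

    def upload_batch(page_paths,
                     bucket="my-scrape-bucket",
                     key="batches/batch-0001.tar.gz"):
        """Bundle locally saved pages into one gzipped archive and upload it
        as a single S3 object; bucket and key are illustrative only."""
        archive = "/tmp/batch.tar.gz"
        with tarfile.open(archive, "w:gz") as tar:
            for path in page_paths:
                tar.add(path)

        s3 = boto3.client("s3")
        s3.upload_file(archive, bucket, key)

This keeps the PUT count (and cost) down and gives you a small number of large objects to download later, instead of tens of thousands of individual files.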

I am not really going to speculate on the possibilities (keeping in mind that, at a very large scale, something like MapReduce/Hadoop would be used to manage this kind of task), but as long as you have an approach for partitioning the task (e.g. a MySQL instance) and managing the scaling of instances (e.g. auto-scaling), the idea you have should work fine.

Comments

  • ThinkCode
    ThinkCode over 1 year

    I have tapped into Amazon's AWS offerings - please tell me, at a high level, whether I am thinking about this right.

    So I have a few Python scraping scripts on my local machine. I want to use AWS for super fast internet connectivity and a cheaper price - win/win!

    • I understand that I can deploy a CentOS/Ubuntu instance on EC2, install the necessary Python libraries, and start and stop instances using boto (Python) to save costs (see the start/stop sketch after the comments below). Am I thinking right so far? (Is it feasible?)

    • I will CRON some scripts that will start fetching (scraping) HTML files for parsing later on. So these HTML files are copied over to S3 for storage (or shall I dump them to my local machine since that is how I will be parsing and storing in MySQL?).

    Please advise if I make any sense with my assumptions and the little knowledge I have of AWS with my few hours of reading/Googling about the service.

  • Hristo Deshev
    Hristo Deshev over 12 years
    Can you point me to the documentation about limiting bandwidth according to instance type? This is the first time I see it, and I'd like to confirm it.
  • cyberx86
    cyberx86 over 12 years
    Just to clarify, by 'bandwidth' I mean transfer rate/throughput. It is mentioned in a few places; two were easy to find. "Because Amazon EBS volumes require network access, you will see faster and more consistent throughput performance with larger instances." (AWS EBS) Also, the 'I/O performance' noted on the instance types page refers to network I/O performance (see, for example, the cluster compute instance types, which note '10GbE' against I/O performance).
  • Hristo Deshev
    Hristo Deshev over 12 years
    Thanks! The "instance types" page clears it for me. Something else crossed my mind - if you are running small instances you face the chance of being run on a hardware host with other small instances. And maybe those can saturate the host network interfaces and hurt EBS I/O performance. I remember reading somewhere that by using large instances you are least likely to run on a host machine that's under load from other users.
  • ThinkCode
    ThinkCode over 12 years
    Thank you so much for such an elaborate response. It definitely makes me think about and explore different areas and possibilities. I was initially thinking of downloading a text file of URLs onto EC2, since MySQL can be expensive and I guess can be avoided. The HTML files can range from a few thousand to tens of thousands. Each instance is independent because the text file with URLs takes care of it. An EBS volume makes sense - gzipping a nightly batch of files and downloading them to my local machine for further processing, since the parsing is pretty elaborate. I didn't know Hadoop could be applied here - I've never used Hadoop.
  • ThinkCode
    ThinkCode over 12 years
    I didn't get a straightforward answer on the web about this - can IP addresses be rotated on the fly? Since this is scraping, the IPs can be throttled, I suppose. (I am a little confused by the free tier on AWS - 720 hours, some bandwidth, etc. Spot instances make sense too, since these aren't mission-critical tasks and the emphasis is more on cost. $1 an hour is still a lot of money for me! Thorough testing will answer the pricing questions, I guess.)
  • ThinkCode
    ThinkCode over 12 years
    The URL pool is generated on a daily basis for the various instances, so they don't overlap. Can you please briefly explain how Hadoop could be used (or what it does / its benefits) for, say, 100,000 URLs/HTML files? Just want to understand! Can you also address the IP issue? Thank you so much!
  • cyberx86
    cyberx86 over 12 years
    The difference comes down to whether you are working with a single, growing large list or multiple discrete lists of URLs. In the discrete case, you may manually assign each instance one list (or part of a list) - on completion, the instance can gzip the data, upload it to S3, and terminate. On the other hand, if you have a single list of URLs that you start with, and new URLs are being added as you find them (like a search engine), you need to dynamically coordinate your task (because your new instances are being added to take over some of the work allocated to your existing instances).
  • cyberx86
    cyberx86 over 12 years
    @ThinkCode: Hadoop is an implementation of MapReduce - the idea being that if you have a very large task which cannot practically be executed on a single machine, it is necessary to break the task into smaller pieces (so that it can be distributed) and then combine the results. If you had a very large set of URLs, which would take months/years for a single instance to handle, Hadoop could divide the job into multiple small tasks, distribute each task to its own instance, and then join all the results into a unified product.
  • cyberx86
    cyberx86 over 12 years
    IP addresses are not typically rotated 'on demand' on EC2 instances. Off-hand, the only times that IP addresses change are when an instance starts/resumes and if you assign/unassign an elastic IP (eIP) address. There is a charge ($0.10) for each eIP assignment after the first 100. As a point of mention, depending on which site(s) you are scraping, be careful to limit the rate yourself (e.g. rotate between sites). Especially on smaller sites, you will be able to scrape faster than the site can serve pages, and may unintentionally bring down the server.
  • ThinkCode
    ThinkCode over 12 years
    Yeah, I do the scraping minimally, conforming to the robots.txt file. I didn't know about the 100 eIPs - I thought it was only 16 IPs and that I had to place a request giving them a reason to increase the number of IPs?
  • ThinkCode
    ThinkCode over 12 years
    Got it, thank you so much for patiently answering all my questions! I guess you may see me here as I start tinkering with AWS. Thanks again!
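
Regarding the start/stop point in the question above: starting and stopping instances from Python is a couple of API calls. A minimal sketch using boto3 (the current library, rather than the original boto mentioned in the question; the instance ID is a placeholder):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # i-0123456789abcdef0 is a placeholder instance ID
    instance_ids = ["i-0123456789abcdef0"]

    # Start before a scraping run, stop afterwards to avoid paying for idle time.
    ec2.start_instances(InstanceIds=instance_ids)
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # ... run the scraping job ...

    ec2.stop_instances(InstanceIds=instance_ids)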