Programmatic Form Submit

python forms screen-scraping submit

13,703

Solution 1

You'll need to generate an HTTP request containing the data for the form.

The form will look something like:

<form action="submit.php" method="POST"> ... </form>

This tells you the url to request is www.example.com/submit.php and your request should be a POST.

The form will contain several input items, e.g.:

<input type="text" name="itemnumber"> ... </input>

You need to build a string of all these input name=value pairs, URL-encoded, and append it to the end of the requested URL, which then becomes www.example.com/submit.php?itemnumber=5234&otherinput=othervalue and so on. This works fine for GET. POST is a little trickier: the encoded pairs are sent in the request body rather than in the URL.
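As a rough sketch of the encoding step, here is how the pairs could be built with Python 3's standard library. The field values and the http:// scheme on the example URL are assumptions for illustration, and nothing is actually sent over the network here.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical field values; the names mirror the example inputs above.
fields = {"itemnumber": "5234", "otherinput": "othervalue"}
base_url = "http://www.example.com/submit.php"

# urlencode escapes each name and value and joins the pairs with "&".
query = urlencode(fields)

# GET: the encoded pairs go in the URL after a "?".
get_request = Request(base_url + "?" + query)

# POST: the same encoded pairs go in the request body, as bytes.
post_request = Request(
    base_url,
    data=query.encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
)

print(get_request.full_url)
print(post_request.get_method())
```

Passing either request object to urllib.request.urlopen would actually send it; that call is left out so the sketch runs offline.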

</motivation>

Just follow S.Lott's links for some much easier to use library support :P

Solution 2

Using Python, I think it takes the following steps:

Parse the web page that contains the form, and find the form's submit address and submit method ("post" or "get").

this explains form elements in html file

Use urllib2 to submit the form. You may need functions like "urlencode" and "quote" from urllib to generate the URL and the data for the post method. (In Python 3, urllib2 became urllib.request, and urlencode and quote live in urllib.parse.) Read the library docs for details.
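The steps above could be sketched roughly like this in Python 3, using only the standard library. FormFinder and build_form_request are hypothetical names made up for this example, it only looks at the first form and its input tags (no select or textarea handling), and the page URL and markup are placeholders.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin
from urllib.request import Request


class FormFinder(HTMLParser):
    """Record the first form's action, method, and input names/values."""

    def __init__(self):
        super().__init__()
        self.action = ""
        self.method = "get"  # HTML default when the attribute is missing
        self.fields = {}
        self._in_form = False
        self._done = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and not self._done:
            self._in_form = True
            self.action = attrs.get("action") or ""
            self.method = (attrs.get("method") or "get").lower()
        elif tag == "input" and self._in_form and attrs.get("name"):
            # Keep any default value the page supplies for this input.
            self.fields[attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form" and self._in_form:
            self._in_form = False
            self._done = True


def build_form_request(page_url, html, values):
    """Build (but do not send) a request that submits the first form."""
    finder = FormFinder()
    finder.feed(html)
    fields = dict(finder.fields)
    fields.update(values)  # caller-supplied values override page defaults
    target = urljoin(page_url, finder.action)  # resolve a relative action
    encoded = urlencode(fields)
    if finder.method == "post":
        return Request(target, data=encoded.encode("ascii"))
    separator = "&" if "?" in target else "?"
    return Request(target + separator + encoded)


page = '<form action="submit.php" method="POST"><input type="text" name="itemnumber"></form>'
request = build_form_request(
    "http://www.example.com/index.html", page, {"itemnumber": "5234"}
)
print(request.get_method(), request.full_url, request.data)
```

To actually submit, you would fetch the page with urllib.request.urlopen, pass its decoded HTML in as the html argument, and then call urlopen on the returned request. Real forms often also need cookies or hidden fields, which this sketch only partly covers (hidden inputs are picked up, cookies are not).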

Solution 3

From a similar question - options-for-html-scraping - you can learn that with Python you can use Beautiful Soup.

Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping. Three features make it powerful:

Beautiful Soup won't choke if you give it bad markup. It yields a parse tree that makes approximately as much sense as your original document. This is usually good enough to collect the data you need and run away.

Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. You don't have to create a custom parser for each application.

Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings, unless the document doesn't specify an encoding and Beautiful Soup can't autodetect one. Then you just have to specify the original encoding.

The unusual name caught the attention of our host, November 12, 2008.

Author by

user1066101

Software Architect, aspiring writer. Programmer for well over 30 years, about 70% of my working life. Blog: S.Lott-Software Architect. Books: Building Skills. Technorati: SLott. LinkedIn: Profile. Ohloh: s_lott.

Updated on June 04, 2022

Comments

user1066101 about 2 years

I want to scrape the contents of a webpage. The contents are produced after a form on that site has been filled in and submitted.

I've read about how to scrape the end result content/webpage, but how do I programmatically submit the form?

I'm using python and have read that I might need to get the original webpage with the form, parse it, get the form parameters and then do X?

Can anyone point me in the right direction?
- user1066101 over 15 years
  
  Read about urllib2. stackoverflow.com/questions/301924/… stackoverflow.com/questions/120061/… Indeed, almost every question with urllib or urllib2 has an example you can use.
- Ali Afshar over 15 years
  
  Can I recommend voidspace.org.uk/python/articles/urllib2.shtml if you go down the urllib2 route. Basic Urllib also has enough to handle the naive case quite easily.