Scraping JavaScript-generated data using Python


Solution 1

There's also dryscrape (a library I wrote, so the recommendation is obviously a bit biased :), which uses a fast WebKit-based in-memory browser to navigate around. It understands JavaScript too, but is a lot more lightweight than Selenium.

Solution 2

If you need to scrape page content that is updated with AJAX, and you do not control that AJAX interface, I would use the Selenium browser automation tool for the task:

http://code.google.com/p/selenium/

  • Selenium has Python bindings

  • It launches a real browser instance, so it can do and scrape exactly what you see with your own eyes

  • Get the HTML document content after the AJAX updates through the Selenium API

  • Use lxml with XPath or CSS selectors to parse the relevant parts out of the document
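
The steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested scraper for the site in the question: it assumes Firefox and its driver are installed, and the `table` selector, the 10 second timeout, and the helper names `fetch_rendered_html` and `extract_table_rows` are placeholders made up for the example.

```python
from lxml import html as lxml_html


def fetch_rendered_html(url, wait_css="table", timeout=10):
    """Load `url` in a real browser and return the HTML after scripts have run."""
    # Selenium is imported here so the parsing helper below can be used
    # (and tested) on machines without a browser driver installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Firefox()  # assumption: Firefox; any supported browser works
    try:
        driver.get(url)
        # Wait until the AJAX-loaded element shows up in the DOM.
        # `wait_css` is a placeholder selector; adjust it to the real page.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, wait_css))
        )
        return driver.page_source
    finally:
        driver.quit()


def extract_table_rows(page_html):
    """Return every non-empty table row as a list of stripped cell texts."""
    tree = lxml_html.fromstring(page_html)
    rows = []
    for tr in tree.xpath("//table//tr"):
        cells = [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        if cells:
            rows.append(cells)
    return rows
```

Typical use would be `rows = extract_table_rows(fetch_rendered_html(url))`. Keeping the browser step and the parsing step separate means the XPath part can be tried out on a saved HTML file before a browser is involved at all.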

Author by

trigger

Updated on June 13, 2022

Comments

  • trigger
    trigger almost 2 years

    I want to scrape some data from the following URL using Python: http://www.hankyung.com/stockplus/main.php?module=stock&mode=stock_analysis_infomation&itemcode=078340

    It's about a summary of company information.

    What I want to scrape is not shown on the first page. Clicking the tab named "재무제표" (Financial Statements) opens the financial statements, and clicking the tab named "현금흐름표" (Cash Flow Statement) opens the cash flow data.

    I want to scrape the "Cash Flow" data.

    However, the cash flow data is generated by JavaScript from a different URL. This is the hidden URL: http://stock.kisline.com/compinfo/financial/main.action?vhead=N&vfoot=N&vstay=&omit=&vwidth=

    The cash flow data is generated by submitting some option values and a cookie to this URL.

    As you may have noticed, itemcode=078340 in the first link is the stock code, and there are as many as 1680 stocks for which I want to gather cash flow data. I want to do this in a loop.

    Is there a good way to scrape the cash flow data? I tried Scrapy, but it is difficult to combine with the other scraping code I am already using.
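
For the loop over stock codes described in the question, one possible shape is sketched below. It only builds the per-stock URLs from the first link in the question; the helper names `build_stock_url` and `iter_stock_urls` are made up for this example, and the real list of 1680 codes would have to come from somewhere else.

```python
from urllib.parse import urlencode

# Base address and query parameters are taken from the first link in the question.
BASE_URL = "http://www.hankyung.com/stockplus/main.php"


def build_stock_url(item_code):
    """Build the company summary URL for one stock code."""
    params = {
        "module": "stock",
        "mode": "stock_analysis_infomation",  # spelled as in the original URL
        "itemcode": item_code,
    }
    return BASE_URL + "?" + urlencode(params)


def iter_stock_urls(item_codes):
    """Yield (code, url) pairs, one per stock code."""
    for code in item_codes:
        # Codes are kept as strings so a leading zero (078340) is not lost.
        yield code, build_stock_url(str(code))


if __name__ == "__main__":
    # "078340" is from the question; "000000" is only a placeholder code.
    for code, url in iter_stock_urls(["078340", "000000"]):
        print(code, url)
```

Each generated URL can then be handed to whatever fetching step is used (a browser-driven one, for instance), so the loop itself stays independent of the scraping tool.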

  • trigger
    trigger about 12 years
    Thanks a lot. I'm gonna try selenium.
  • abbood
    abbood about 11 years
    Can I substitute jQuery for the lxml + XPath part at the end (and follow the rest of the steps)?
  • Mikko Ohtamaa
    Mikko Ohtamaa about 11 years
    Selenium comes with its own CSS selector engine (which probably uses the underlying browser), so you don't need either jQuery or lxml anymore