Webscraping Financial Data from Morningstar

16,987

Solution 1

Scraping some modern web pages is a real problem, particularly pages generated by single-page applications, where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response.

The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.

It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
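A minimal sketch of the approach, assuming Selenium 4 and a local Chrome install are available. The helper names `build_ratios_url` and `fetch_rendered_html` are made up for illustration, and the `div.r_bodywrap table` selector used in the note below is only a guess based on the class name in the question's code; the page layout may have changed since.

```python
from urllib.parse import urlencode

BASE_URL = "http://financials.morningstar.com/ratios/r.html"


def build_ratios_url(ticker, region="USA", culture="en_US"):
    """Build the ratios page URL from the question for an arbitrary ticker."""
    return f"{BASE_URL}?{urlencode({'t': ticker, 'region': region, 'culture': culture})}"


def fetch_rendered_html(url, css_selector, timeout=15):
    """Load url in headless Chrome and return the HTML once css_selector appears."""
    # Imported here so build_ratios_url stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument("--headless")  # no visible browser window
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the scripts have inserted the element we care about.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()
```

Calling `fetch_rendered_html(build_ratios_url("IBM"), "div.r_bodywrap table")` should return the page as the browser sees it after the scripts have run, and that string can then be handed to `bs4.BeautifulSoup` as usual. The `--headless` argument keeps the browser in the background.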

Solution 2

Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML, which can change at any time.

So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.

If the search turns up nothing, an approach that usually works is to investigate which AJAX calls the page makes to retrieve its data and then issue them directly. The browser's developer tools help here: the "Network" tab lists each request and lets you inspect it in detail in a very friendly UI.
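Once a request has been spotted in the network tab, replaying it might look roughly like the sketch below. It uses only the standard library. The function names are hypothetical, the endpoint and parameters must be copied from what the browser actually shows, and the extra headers are an assumption about what a server might check rather than a known requirement.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def decode_body(content_type, text):
    """Parse JSON responses; return anything else (e.g. an HTML fragment) unchanged."""
    if "json" in content_type.lower():
        return json.loads(text)
    return text


def replay_xhr(endpoint, params, referer):
    """Re-issue a GET request observed in the browser's network tab."""
    request = Request(
        f"{endpoint}?{urlencode(params)}",
        headers={
            # Assumed headers: some servers want the call to look like
            # it came from the original page.
            "Referer": referer,
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0",
        },
    )
    with urlopen(request, timeout=15) as response:
        content_type = response.headers.get("Content-Type", "")
        charset = response.headers.get_content_charset() or "utf-8"
        return decode_body(content_type, response.read().decode(charset))
```

If the endpoint returns JSON, the result is a Python dict or list. If it returns an HTML fragment, the string can be passed to `bs4.BeautifulSoup` just like a full page.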

Author by

user5841014

Updated on June 04, 2022

Comments

  • user5841014
    user5841014 almost 2 years

    I am trying to scrape data from the morningstar website below:

    http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US

    I am currently trying to do just IBM, but I hope to eventually be able to type in the code of another company and do the same with that one. My code so far is below:

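    import requests
    import bs4

    url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'

    page = requests.get(url)
    page.raise_for_status()  # stop early on an HTTP error

    soup = bs4.BeautifulSoup(page.content, "html.parser")

    # find() returns None when the div is missing from the static HTML
    summary = soup.find("div", {"class": "r_bodywrap"})
    tables = summary.find_all('table') if summary is not None else []

    if tables:
        print(tables[0])
    else:
        print("No tables found in the static HTML")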

    The problem I am experiencing at the moment is that, unlike with a simpler webpage I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.

    In researching this problem, the closest Stack Overflow question I found is below:

    Python webscraping - NoneObeject Failure - broken HTML?

    In that one, they explained that Morningstar's tables are dynamically loaded. They used some JSON code I am unfamiliar with and somehow generated a different web link, which managed to scrape the data, but I don't understand where that link came from.

  • user5841014
    user5841014 about 8 years
    Hey holdenweb, thanks for that. I have played with Selenium a little in the past and will probably take a closer look. I thought about it when I first started looking at this problem, but my main reservation was that, as far as I was aware, Selenium cannot just be run in the background. But in light of what you and others have said about modern pages not being straight HTML and that HTML constantly changing, it might be worth taking a look at it. Thanks!
  • user5841014
    user5841014 about 8 years
    Hey flaschbier, thanks for finding that API. I will have to take a closer look at it. This will solve my immediate problem, but I was also hoping to use this project as an opportunity to learn more about web scraping. I looked at the API you suggested but couldn't find anything to explain what was going on in the background.
  • user5841014
    user5841014 about 8 years
    Regarding your last paragraph: I have never looked at how a page retrieves its data and don't really know where to begin. I use Chrome. Is there any material or tutorial you could recommend as an introduction to this? Thanks
  • flaschbier
    flaschbier about 8 years
    Not really. There are docs, but I never read them. Maybe I should have, but just pressing F12, switching to the network tab, reloading the page and seeing what happens already did the trick. (Click on the URLs to see the request details ;)
  • user5841014
    user5841014 about 8 years
    Thanks Alexander. I have never used Java, but I will have to check it out to see if the advantages would be worth learning it.