pandas read_html ValueError: No tables found


Solution 1

You can use requests and avoid opening a browser.

You can get current conditions by using:

https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15

and strip off 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. Then handle the remaining JSON string.

You can get the summary and history by calling the API with the following URL:

https://api-ak.wunderground.com/api/606f3f6977348613/history_20170201null/units:both/v:2.0/q/pws:KMAHADLE7.json?callback=jQuery1720724027235122559_1542743885015&_=1542743886276

You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the end, which leaves a JSON string you can parse.
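
As a minimal sketch of that unwrapping step (the payload below is made up for illustration; the real response wraps the full station JSON):

import json

# Illustrative JSONP string standing in for the API response
raw = 'jQuery1720724027235122559_1542743885015({"response": {"version": "2.0"}});'

# Slice between the first '(' and the last ')' to recover the bare JSON
payload = raw[raw.find('(') + 1 : raw.rfind(')')]

data = json.loads(payload)
print(data)  # {'response': {'version': '2.0'}}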


You can find these URLs by using the F12 dev tools in your browser and inspecting the Network tab for the traffic created during page load.

An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":

import requests
import json
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup

url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml")

# The response is JSONP: callbackName({...}). Slice between the outermost
# parentheses to recover the bare JSON (str.strip removes a set of
# characters rather than a prefix, so slicing is more reliable here).
text = soup.select_one('html').text
s = text[text.find('(') + 1 : text.rfind(')')]

# nulls in the feed seem to cause problems, so substitute a placeholder
s = s.replace('null', '"placeholder"')

data = json.loads(s)
df = json_normalize(data)  # flatten the nested JSON into a DataFrame
print(df)
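
For reference, json_normalize is what flattens the nested JSON into columns. Here is a small sketch with a made-up nested record (the keys are hypothetical, not the actual API schema):

from pandas.io.json import json_normalize

# Hypothetical nested record standing in for the station JSON
record = {"station": {"id": "KMAHADLE7", "temperature": 25.5, "humidity": 75}}

flat = json_normalize(record)
print(flat.columns.tolist())
# ['station.id', 'station.temperature', 'station.humidity']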

Solution 2

Here's a solution using selenium for browser automation:

from selenium import webdriver
import pandas as pd

# Path to your chromedriver executable
driver = webdriver.Chrome('/path/to/chromedriver')
# Wait up to 30 seconds for elements to appear before failing
driver.implicitly_wait(30)

driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]
print(df)

       Time Temperature Dew Point Humidity  Wind  Speed   Gust   Pressure Precip. Rate. Precip. Accum. UV   Solar
0  12:02 AM     25.5 °C   18.7 °C     75 %  East  0 kph  0 kph   29.3 hPa          0 mm           0 mm  0  0 w/m²
1  12:07 AM     25.5 °C     19 °C     76 %  East  0 kph  0 kph  29.31 hPa          0 mm           0 mm  0  0 w/m²
2  12:12 AM     25.5 °C     19 °C     76 %  East  0 kph  0 kph  29.31 hPa          0 mm           0 mm  0  0 w/m²
3  12:17 AM     25.5 °C   18.7 °C     75 %  East  0 kph  0 kph   29.3 hPa          0 mm           0 mm  0  0 w/m²
4  12:22 AM     25.5 °C   18.7 °C     75 %  East  0 kph  0 kph   29.3 hPa          0 mm           0 mm  0  0 w/m²

Edit: here's a breakdown of exactly what's happening, since the one-liner above is not very self-documenting code:

After setting up the driver, we select the table by its ID value (thankfully, this site actually uses reasonable and descriptive IDs):

tab = driver.find_element_by_id("history_table")

Then, from that element, we get the HTML instead of the web driver element object:

tab_html = tab.get_attribute('outerHTML')

We use pandas to parse the HTML:

tab_dfs = pd.read_html(tab_html)

From the docs:

"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"

So we index into that list to get the only table we have, at index zero:

df = tab_dfs[0]
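
You can see that list behaviour with a tiny, self-contained example (the HTML string below is made up purely for illustration):

import pandas as pd

# A minimal page containing exactly one table
html = '<table><tr><th>Time</th><th>Temp</th></tr><tr><td>12:02 AM</td><td>25.5</td></tr></table>'

dfs = pd.read_html(html)
print(len(dfs))  # 1 -- still a list, even though there is only one table
print(dfs[0])    # the DataFrame itself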

Comments

  • Noman Bashir over 3 years

    I am trying to scrape the historical weather data from the "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" Weather Underground page. I have the following code:

    import pandas as pd 
    
    page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
    df = pd.read_html(page_link)
    print(df)
    

    I get the following response:

    Traceback (most recent call last):
      File "weather_station_scrapping.py", line 11, in <module>
        result = pd.read_html(page_link)
      File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
        displayed_only=displayed_only)
      File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
        raise_with_traceback(retained)
      File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
        raise exc.with_traceback(traceback)
    ValueError: No tables found
    

    Although this page clearly has a table, it is not being picked up by read_html. I have tried using Selenium so that the page can be loaded before I read it.

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    
    driver = webdriver.Firefox()
    driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
    elem = driver.find_element_by_id("history_table")
    
    head = elem.find_element_by_tag_name('thead')
    body = elem.find_element_by_tag_name('tbody')
    
    list_rows = []
    
    for items in body.find_element_by_tag_name('tr'):
        list_cells = []
        for item in items.find_elements_by_tag_name('td'):
            list_cells.append(item.text)
        list_rows.append(list_cells)
    driver.close()
    

    Now, the problem is that it cannot find "tr". I would appreciate any suggestions.

  • Noman Bashir over 5 years
    Hi, thanks a lot. This works wonders, but I would really appreciate it if you could shed a little light on why we selected an attribute and picked the value at index 0.
  • G. Anderson over 5 years
    Edited with breakdown
  • Noman Bashir over 5 years
    Thanks a lot. It was really helpful.