pandas read_html ValueError: No tables found
Solution 1
You can use requests and avoid opening a browser. You can get the current conditions by requesting the station lookup URL (the one used in the example below), then stripping 'jQuery1720724027235122559_1542743885014(' from the left and ')' from the right. Then parse the resulting JSON string.
You can get summary and history by calling the API with the following
You then need to strip 'jQuery1720724027235122559_1542743885015(' from the front and ');' from the right. You then have a JSON string you can parse.
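The wrapper-stripping step can be sketched with only the standard library. This is a minimal sketch rather than anything the site documents: the helper name unwrap_jsonp and the sample payload are made up for illustration, and it assumes the response has the form callback(...) optionally followed by a semicolon.

```python
import json
import re


def unwrap_jsonp(text):
    """Return the parsed JSON payload from a JSONP response such as 'cb({...});'.

    Hypothetical helper: it assumes a single callback wrapper around the payload.
    """
    # Match: callback name, '(', payload (greedy, up to the last ')'), optional ';'
    match = re.fullmatch(r'\s*[\w$.]+\s*\((.*)\)\s*;?\s*', text, flags=re.DOTALL)
    if match is None:
        raise ValueError('response does not look like JSONP')
    return json.loads(match.group(1))


# Made-up sample payload, not real station data
sample = 'jQuery1720724027235122559_1542743885015({"station": "KMAHADLE7", "temp": null});'
print(unwrap_jsonp(sample))
```

Matching the whole wrapper avoids depending on the exact callback name, which appears to change between requests (note the differing suffixes 014 and 015 above).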
Sample of JSON:
You can find these URLs by opening the F12 dev tools in your browser and inspecting the network tab for the traffic created during page load.
An example for current conditions, noting there seems to be a problem with nulls in the JSON, so I am replacing them with "placeholder":
import json
import requests
import pandas as pd
# JSONP endpoint found via the browser's network tab
url = 'https://stationdata.wunderground.com/cgi-bin/stationlookup?station=KMAHADLE7&units=both&v=2.0&format=json&callback=jQuery1720724027235122559_1542743885014&_=15'
res = requests.get(url)
# Keep only the JSON between the first '(' and the last ')' of the JSONP wrapper.
# str.strip() removes a set of characters rather than a prefix, so slicing is safer.
s = res.text
s = s[s.index('(') + 1:s.rindex(')')]
# Replace nulls with a placeholder string, since they seemed to cause problems
s = s.replace('null', '"placeholder"')
data = json.loads(s)
# pd.json_normalize (pandas 1.0+) flattens the nested JSON and returns a DataFrame
df = pd.json_normalize(data)
print(df)
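If the nulls only matter after parsing, one alternative is to let json.loads turn them into None and substitute the placeholder afterwards, which avoids altering string values that happen to contain the text null. The sketch below is an illustration under assumptions, not the method used above: fill_nulls is a hypothetical helper and the payload is invented.

```python
import json


def fill_nulls(value, placeholder='placeholder'):
    """Recursively replace None (JSON null) with a placeholder. Hypothetical helper."""
    if value is None:
        return placeholder
    if isinstance(value, dict):
        return {key: fill_nulls(item, placeholder) for key, item in value.items()}
    if isinstance(value, list):
        return [fill_nulls(item, placeholder) for item in value]
    return value


# Invented payload: real nulls plus a string that merely contains "null"
raw = '{"temp": null, "note": "nullable field", "history": [1, null]}'
cleaned = fill_nulls(json.loads(raw))
print(cleaned)
```

The cleaned dictionary could then be passed to json_normalize in the same way as the parsed data in the example above.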
Solution 2
Here's a solution using Selenium for browser automation:
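from selenium import webdriver
import pandas as pd
# Path to your chromedriver executable (placeholder value, adjust for your system)
chromedriver = '/path/to/chromedriver'
driver = webdriver.Chrome(chromedriver)
# Wait up to 30 seconds for elements to appear while the page builds the table
driver.implicitly_wait(30)
driver.get('https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html')
# find_element_by_id is the Selenium 3 API; Selenium 4 uses driver.find_element(By.ID, "history_table")
# read_html returns a list of DataFrames, so take the first (and only) one
df = pd.read_html(driver.find_element_by_id("history_table").get_attribute('outerHTML'))[0]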
   Time      Temperature  Dew Point  Humidity  Wind  Speed  Gust   Pressure   Precip. Rate.  Precip. Accum.  UV  Solar
0  12:02 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
1  12:07 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
2  12:12 AM  25.5 °C      19 °C      76 %      East  0 kph  0 kph  29.31 hPa  0 mm           0 mm            0   0 w/m²
3  12:17 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
4  12:22 AM  25.5 °C      18.7 °C    75 %      East  0 kph  0 kph  29.3 hPa   0 mm           0 mm            0   0 w/m²
Editing with breakdown of exactly what's happening, since the above one-liner is actually not very good self-documenting code:
After setting up the driver, we select the table with its ID value (Thankfully this site actually uses reasonable and descriptive IDs)
tab=driver.find_element_by_id("history_table")
Then, from that element, we get the HTML instead of the web driver element object
tab_html=tab.get_attribute('outerHTML')
We use pandas to parse the html
tab_dfs=pd.read_html(tab_html)
From the docs:
"read_html returns a list of DataFrame objects, even if there is only a single table contained in the HTML content"
So we index into that list with the only table we have, at index zero
df=tab_dfs[0]
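As for the original ValueError, it most likely arises because the table is added by JavaScript after the page loads, so the HTML that read_html downloads contains no table element. That is an inference from the fact that the browser-driven version works, not something the site documents. A rough standard-library check of whether some HTML text contains any tables might look like the following; count_tables and both HTML snippets are made up for illustration.

```python
from html.parser import HTMLParser


class TableCounter(HTMLParser):
    """Counts <table> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = 0

    def handle_starttag(self, tag, attrs):
        if tag == 'table':
            self.tables += 1


def count_tables(html):
    """Return how many <table> elements start in the given HTML text."""
    counter = TableCounter()
    counter.feed(html)
    counter.close()
    return counter.tables


# Invented snippets: a page before and after a script has built the table
static_html = '<html><body><div id="history"></div><script>/* builds the table later */</script></body></html>'
rendered_html = '<table id="history_table"><tr><td>12:02 AM</td></tr></table>'
print(count_tables(static_html), count_tables(rendered_html))
```

Running the same check on the text returned by requests.get(page_link) and on driver.page_source would show whether the table only exists once a browser has rendered the page.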
Noman Bashir
Updated on November 21, 2020
Comments
-
Noman Bashir over 3 years
I am trying to scrape the historical weather data from the "https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html" Weather Underground page. I have the following code:
import pandas as pd
page_link = 'https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html'
df = pd.read_html(page_link)
print(df)
I have the following response:
Traceback (most recent call last):
  File "weather_station_scrapping.py", line 11, in <module>
    result = pd.read_html(page_link)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 987, in read_html
    displayed_only=displayed_only)
  File "/anaconda3/lib/python3.6/site-packages/pandas/io/html.py", line 815, in _parse
    raise_with_traceback(retained)
  File "/anaconda3/lib/python3.6/site-packages/pandas/compat/__init__.py", line 403, in raise_with_traceback
    raise exc.with_traceback(traceback)
ValueError: No tables found
Although this page clearly has a table, read_html does not pick it up. I have tried using Selenium so that the page can be loaded before I read it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("https://www.wunderground.com/personal-weather-station/dashboard?ID=KMAHADLE7#history/tdata/s20170201/e20170201/mcustom.html")
elem = driver.find_element_by_id("history_table")
head = elem.find_element_by_tag_name('thead')
body = elem.find_element_by_tag_name('tbody')
list_rows = []
for items in body.find_element_by_tag_name('tr'):
    list_cells = []
    for item in items.find_elements_by_tag_name('td'):
        list_cells.append(item.text)
    list_rows.append(list_cells)
driver.close()
Now, the problem is that it cannot find "tr". I would appreciate any suggestions.
-
Noman Bashir over 5 years
Hi, thanks a lot. This works wonders, but I would highly appreciate it if you could shed a little light on why we selected an attribute and picked the value at index 0.
-
G. Anderson over 5 years
Edited with breakdown
-
Noman Bashir over 5 years
Thanks a lot. It was really helpful.