Python: Get html table data by xpath

13,507

Solution 1

There is a fairly general pattern which you could use to parse many, though not all, tables.

import lxml.html as LH
import requests
import pandas as pd
def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

url = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
r = requests.get(url)
root = LH.fromstring(r.content)

for table in root.xpath('//table[@id="sortabletable"]'):
    header = [text(th) for th in table.xpath('//th')]        # 1
    data = [[text(td) for td in tr.xpath('td')]  
            for tr in table.xpath('//tr')]                   # 2
    data = [row for row in data if len(row)==len(header)]    # 3 
    data = pd.DataFrame(data, columns=header)                # 4
    print(data)
  1. You can use table.xpath('//th') to find the column names.
  2. table.xpath('//tr') returns the rows, and for each row, tr.xpath('td') returns the element representing one "cell" of the table.
  3. Sometimes you may need to filter out certain rows, such as in this case, rows with fewer values than the header.
  4. What you do with the data (a list of lists) is up to you. Here I use Pandas for presentation only:

        Pris                               Adresse       Tidspunkt
0       8.04           Brovejen 18 5500 Middelfart   3 min 38 sek 
1       7.88         Hovedvejen 11 5500 Middelfart   4 min 52 sek 
2       7.88         Assensvej 105 5500 Middelfart   5 min 56 sek 
3       8.23    Ejby Industrivej 111 2600 Glostrup   6 min 28 sek 
4       8.15            Park Alle 125 2605 Brøndby  25 min 21 sek 
5       8.09           Sletvej 36 8310 Tranbjerg J  25 min 34 sek 
6       8.24      Vindinggård Center 29 7100 Vejle   27 min 6 sek 
7     7.99 *         Søndergade 116 8620 Kjellerup  31 min 27 sek 
8     7.99 *   Gertrud Rasks Vej 1 9210 Aalborg SØ  31 min 27 sek 
9     7.99 *              Sorøvej 13 4200 Slagelse  31 min 27 sek 

Solution 2

If you mean all the text:

from bs4 import BeautifulSoup

url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
import requests

r = requests.get(url_str).content

print([x.text for x in BeautifulSoup(r).find_all("table",attrs={"id":"sortabletable"})]

['Pris\nAdresse\nTidspunkt\n\n\n\n\n* Denne pris er indberettet af selskabet Indberet pris\n\n\n\n\n\n\xa08.24\n\xa0Gladsaxe Møllevej 33 2860 Søborg\n7 min 4 sek \n\n\n\n\xa08.89\n\xa0Frederikssundsvej 356 2700 Brønshøj\n9 min 10 sek \n\n\n\n\xa07.98\n\xa0Gartnerivej 1 7500 Holstebro\n14 min 25 sek \n\n\n\n\xa07.99 *\n\xa0Søndergade 116 8620 Kjellerup\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Gertrud Rasks Vej 1 9210 Aalborg SØ\n15 min 7 sek \n\n\n\n\xa07.99 *\n\xa0Sorøvej 13 4200 Slagelse\n15 min 7 sek \n\n\n\n\xa08.08 *\n\xa0Tørholmsvej 95 9800 Hjørring\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Nordvej 6 9900 Frederikshavn\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Skelmosevej  89 6980 Tim\n15 min 7 sek \n\n\n\n\xa08.09 *\n\xa0Højgårdsvej 2 4000 Roskilde\n15 min 7 sek']
Share:
13,507
Norfeldt
Author by

Norfeldt

Always learning something new 👨‍🔬

Updated on July 30, 2022

Comments

  • Norfeldt
    Norfeldt over 1 year

    I feel that extracting data from html tables is extremely difficult and requires custom build for each site.. I would very much like to be proved wrong here..

    Is there an simple pythonic way to extract strings and numbers out of a website by just using the url and xpath of the table of interest?

    Example:

    url_str = 'http://www.fdmbenzinpriser.dk/searchprices/5/'
    xpath_str = //*[@id="sortabletable"]
    

    I once had a script that could fetch data from this site. But lost it. As I recall it I was using the tag '' and some string logic.. not very pretty

    I know that sites like thingspeak can do these things..

  • Timothy Lombard
    Timothy Lombard almost 6 years
    very useful pattern but curious on how would you extract a href link from a table row using your general pattern?
  • unutbu
    unutbu almost 6 years
    @TimothyLombard: In the text function, instead of elt.text_content().... you might use something like elt.xpath('//a/@href') depending on HTML. For more specific help, it might be best to post a new question with an example of the HTML you are processing and the desired output.
  • reservoirinvest
    reservoirinvest over 5 years
    Nice code. What if there are two headers, with the top one having a row-span?