BeautifulSoup extracting data from multiple tables

10,124

Solution 1

Using bs4, but this should work:

from bs4 import BeautifulSoup as bsoup

ofile = open("htmlsample.html")
soup = bsoup(ofile)
soup.prettify()

tables = soup.find_all("tbody")

storeTable = tables[0].find_all("tr")
storeValueRows = tables[2].find_all("tr")

storeRank = []
for row in storeTable:
    storeRank.append(row.get_text().strip())

storeMatrix = []
for row in storeValueRows:
    storeMatrixRow = []
    for cell in row.find_all("td")[::2]:
        storeMatrixRow.append(cell.get_text().strip())
    storeMatrix.append(", ".join(storeMatrixRow))

for record in zip(storeRank, storeMatrix):
    print " ".join(record)

The above will print out:

# of countries - rank 1 reached 0, 0, 1, 9
# of countries - rank 5 reached 0, 8, 49, 29
# of countries - rank 10 reached 25, 31, 49, 32
# of countries - rank 100 reached 49, 49, 49, 32
# of countries - rank 500 reached 49, 49, 49, 32
# of countries - rank 1000 reached 49, 49, 49, 32
[Finished in 0.5s]

Changing storeTable to tables[1] and storeValueRows to tables[3] will print out:

Country 
Canada 6, 5, 2, 1
Brazil 7, 5, 2, 1
Hungary 7, 6, 2, 2
Sweden 9, 5, 1, 1
Malaysia 10, 5, 2, 1
Mexico 10, 5, 2, 2
Greece 10, 6, 2, 1
Israel 10, 6, 2, 1
Bulgaria 10, 6, 2, -
Chile 10, 6, 2, -
Vietnam 10, 6, 2, -
Ireland 10, 6, 2, -
Kuwait 10, 6, 2, -
Finland 10, 7, 2, -
United Arab Emirates 10, 7, 2, -
Argentina 10, 7, 2, -
Slovakia 10, 7, 2, -
Romania 10, 8, 2, -
Belgium 10, 9, 2, 3
New Zealand 10, 13, 2, -
Portugal 10, 14, 2, -
Indonesia 10, 14, 2, -
South Africa 10, 15, 2, -
Ukraine 10, 15, 2, -
Philippines 10, 16, 2, -
United Kingdom 11, 5, 2, 1
Denmark 11, 6, 2, 2
Australia 12, 9, 2, 3
United States 13, 9, 2, 2
Austria 13, 9, 2, 3
Turkey 14, 5, 2, 1
Egypt 14, 5, 2, 1
Netherlands 14, 8, 2, 2
Spain 14, 11, 2, 4
Thailand 15, 10, 2, 3
Singapore 16, 10, 2, 2
Switzerland 16, 10, 2, 3
Taiwan 17, 12, 2, 4
Poland 17, 13, 2, 5
France 18, 8, 2, 3
Czech Republic 18, 13, 2, 6
Germany 19, 11, 2, 3
Norway 20, 14, 2, 5
India 20, 14, 2, 5
Italy 20, 15, 2, 7
Hong Kong 26, 21, 2, -
Japan 33, 16, 4, 5
Russia 33, 17, 2, 7
South Korea 46, 27, 2, 5
[Finished in 0.6s]

Not the best of code and can be improved further. However, the logic applies well.

Hope this helps.

EDIT:

If you want the format South Korea, 46, 27, 2, 5 instead of South Korea 46, 27, 2, 5 (note the , after the country name), just change this:

storeRank.append(row.get_text().strip())

to this:

storeRank.append(row.get_text().strip() + ",")

Solution 2

Looks like you are scraping data from http://www.appannie.com.

Here is the code to get the data. I am sure some parts of the code can be improved or written in a pythonic way. But it gets what you want. Also, I used Beautiful Soup 4 instead of 3.

from bs4 import BeautifulSoup

html_file = open('test2.html')
soup = BeautifulSoup(html_file)

countries = []
countries_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[1]
countries_body = countries_table.find_all('tbody')[1]
countries_row = countries_body.find_all('tr', attrs={"class": "ranks"})
for row in countries_row:
    countries.append(row.div.a.text)

data = []
data_table = soup.find_all('table', attrs={'class':'data-table table-rank'})[3]
data_body = data_table.find_all('tbody')[1]
data_row = data_body.find_all('tr', attrs={"class": "ranks"})
for row in data_row:
    tds = row.find_all('td')
    sublist = []
    for td in tds[::2]:    
        sublist.append(td.text)
    data.append(sublist)

for element in zip(countries, data):
    print element

Hope this helps :)

Share:
10,124
user2447387
Author by

user2447387

Updated on June 04, 2022

Comments

  • user2447387
    user2447387 almost 2 years

    I'm trying to extract some data from two html tables in a html file with BeautifulSoup.

    This is actually the first time I'm using it and I'searched a lot of questions/example but none seem to work in my case. The html contains two tables, the first with the headers of the first column (which are always text) and the second, containing the data of the following columns. Moreover, the table contains text, numbers and also symbols. This makes for a novice like me everything more complicated. Here's the layout of the html, copied from the browser I was able to extract the whole html content of the rows, but only for the first tables, so in reality I am not getting any data but only the content of the first column.

    The output I'm trying to obtain is a string containing the "joint" information of the tables (Col1= text, Col2=number, Col3=number, Col4=number, Col5=number), such as:

    Canada, 6, 5, 2, 1
    

    Here's the list of the Xpaths for each item:

    "Canada": /html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a
    "6": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1] 
    "5": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3] 
    "2": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]
    "1": /html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]
    

    I would be also happy with strings in "rough" html format, as long as there is one string per row, so that I'll be able to parse it further with the methods I already know. Here's the code I have so far. Thanks!

    from BeautifulSoup import BeautifulSoup
    html=""" 
    my html code
    """
    soup = BeautifulSoup(html)
    table=soup.find("table")
    for row in table.findAll('tr'):
        col = row.findAll('td')
        print row, col
    
    • jfs
      jfs about 10 years
      Could you put in your answer the html corresponding to Canada, 6, 5, 2, 1? If you don't know then you could use your browser to get xpath for each of the items for a start.
    • user2447387
      user2447387 about 10 years
      Well in order to get that result I should be able to "join" the two tables. Because "Canada" is included in the first table, while the other numbers are in the second one. I can't really figure out a way to join them, there is no single html string that combines all information. Here's a sample of the html layout copied from my browser (unfortunately the webpage requires a login so I cannot post the direct link)
    • user2447387
      user2447387 about 10 years
      Here's the list of the Xpaths: Canada "/html/body/div/div[1]/table/tbody[2]/tr[2]/td/div/a", 6 "/html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[1]", 5 "/html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[3]", 2 "/html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[5]", 1 "/html/body/div/div[2]/div/table/tbody[2]/tr[2]/td[7]"
    • jfs
      jfs about 10 years
      don't put relevant to your question info in the comments, update your question instead (as a bonus, you could format the info properly)
  • user2447387
    user2447387 about 10 years
    @warunsl: Yep exactly, thanks. I'm getting this error though "countries.append(row.div.a.text) AttributeError: 'NoneType' object has no attribute 'a'"
  • user2447387
    user2447387 about 10 years
    Thanks! You've been extremely helpful!
  • WitchGod
    WitchGod about 10 years
    Thanks for accepting and good luck. The above is very easy to manipulate to get either of the two tables that you want, so I hope you try to understand the logic underneath it. :)
  • user2447387
    user2447387 about 10 years
    Yes, thank you very much. It is indeed quite difficult for newbies getting to properly master Bs. I haven't found a lot of helpful resources/example, but it might also be that my case was somewhat a bit more complicated than the average :)
  • WitchGod
    WitchGod about 10 years
    It was more complicated. In this case, it's not BS that was the problem, really. If you check the code, it's more of list and string manipulations, which is basic Python. The key functions above were zip and join, which are highly essential when dealing with lists, tuples, etc IMHO. :)
  • shaktimaan
    shaktimaan about 10 years
    @user2447387 My code works with respect to the HTML saved from here - appannie.com/apps/google-play/app/radiotime.player/app-ranki‌​ng/… I did not use the HTML you posted in the question