Scraping Wikipedia tables with Python selectively

13,103

Solution 1

If I understand your question correctly, the following should get you there:

import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"

res = requests.get(URL).text
soup = BeautifulSoup(res, 'lxml')
# Skip the header row, then read the first two cells of each remaining row.
for items in soup.find('table', class_='wikitable').find_all('tr')[1:]:
    data = items.find_all(['th', 'td'])
    try:
        country = data[0].a.text
        title = data[1].a.text
        name = data[1].a.find_next_sibling().text
    except (IndexError, AttributeError):
        # The row lacks the expected cells or links; skip it instead of
        # printing stale values left over from the previous row.
        continue
    print("{}|{}|{}".format(country, title, name))

Output:

Afghanistan|President|Ashraf Ghani
Albania|President|Ilir Meta
Algeria|President|Abdelaziz Bouteflika
Andorra|Episcopal Co-Prince|Joan Enric Vives Sicília
Angola|President|João Lourenço
Antigua and Barbuda|Queen|Elizabeth II
Argentina|President|Mauricio Macri

And so on.
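Since the question asks for a pandas DataFrame rather than printed lines, the same idea can be sketched so that it collects the rows first. This is only an illustration under stated assumptions: `heads_of_state` is a hypothetical helper name, `SAMPLE_HTML` is made-up markup standing in for the real page (so the example runs offline), and it assumes the first two links in the second cell are the title and the name, which may not hold for every row of the live table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up markup that imitates the layout of the Wikipedia table (assumption).
SAMPLE_HTML = """
<table class="wikitable plainrowheaders">
  <tr><th>State</th><th>Head of state</th><th>Head of government</th></tr>
  <tr>
    <th><a href="/wiki/Examplia">Examplia</a></th>
    <td><a href="/wiki/President">President</a> - <a href="/wiki/Jane_Doe">Jane Doe</a></td>
    <td><a href="/wiki/Prime_Minister">Prime Minister</a> - <a href="/wiki/John_Roe">John Roe</a></td>
  </tr>
  <tr>
    <th><a href="/wiki/Samplestan">Samplestan</a></th>
    <td><a href="/wiki/King">King</a> - <a href="/wiki/Sam_I">Sam I</a><br>
        <a href="/wiki/Regent">Regent</a> - <a href="/wiki/Pat_Poe">Pat Poe</a></td>
  </tr>
  <tr><th>No links here</th><td>vacant</td></tr>
</table>
"""


def heads_of_state(html):
    """Return a DataFrame with the state and the first title/name pair per row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    records = []
    for row in table.find_all("tr")[1:]:  # skip the header row
        cells = row.find_all(["th", "td"])
        if len(cells) < 2:
            continue
        state_link = cells[0].find("a")
        links = cells[1].find_all("a")
        # Only the first two links are used, so later mentions in the
        # same cell (e.g. a regent) are ignored.
        if state_link is None or len(links) < 2:
            continue
        records.append({
            "State": state_link.get_text(strip=True),
            "Title": links[0].get_text(strip=True),
            "Name": links[1].get_text(strip=True),
        })
    return pd.DataFrame(records, columns=["State", "Title", "Name"])


df = heads_of_state(SAMPLE_HTML)
print(df.to_string(index=False))
```

Checking the number of links before indexing is what avoids the IndexError: list index out of range from the question, since rows without the expected links are skipped instead of indexed blindly.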

Solution 2

I found a short and easy way to do this: fetch the page with the wikipedia Python module, then use pandas' read_html to load the table into a DataFrame.

From there you can apply any amount of analysis you wish.

from io import StringIO

import pandas as pd
import wikipedia as wp

html = wp.page("List_of_video_games_considered_the_best").html()
# Parse once; newer pandas versions expect a file-like object, not a raw HTML string.
tables = pd.read_html(StringIO(html))
try:
    df = tables[1]  # Try the 2nd table first, as most pages contain a contents table first
except IndexError:
    df = tables[0]
print(df.to_string())
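Guessing the table by position can pick the wrong one on some pages. As an alternative sketch, read_html also accepts a match argument that keeps only tables containing the given text. The helper name `pick_table` and the `SAMPLE_HTML` snippet below are made up for illustration, and the example assumes an HTML parser that pandas supports (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with a contents-style table followed by the table we want (assumption).
SAMPLE_HTML = """
<table><tr><th>Contents</th></tr><tr><td>1 History</td></tr></table>
<table class="wikitable">
  <tr><th>Game</th><th>Year</th></tr>
  <tr><td>Tetris</td><td>1984</td></tr>
  <tr><td>Doom</td><td>1993</td></tr>
</table>
"""


def pick_table(html, match):
    """Return the first table whose text matches `match` (a string or regex)."""
    # read_html raises ValueError if no table matches.
    tables = pd.read_html(StringIO(html), match=match)
    return tables[0]


df = pick_table(SAMPLE_HTML, "Game")
print(df.to_string(index=False))
```

Matching on a header you expect to see tends to be more robust than a fixed index when a page gains or loses tables.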

Or, if you would like to run it from the command line, save the script below and call it like this:

python yourfile.py -p Wikipedia_Page_Article_Here

import argparse
from io import StringIO

import pandas as pd
import wikipedia as wp

parser = argparse.ArgumentParser()
parser.add_argument("-p", "--wiki_page", help="Give a wiki page to get table", required=True)
args = parser.parse_args()
html = wp.page(args.wiki_page).html()
# Parse once; newer pandas versions expect a file-like object, not a raw HTML string.
tables = pd.read_html(StringIO(html))
try:
    df = tables[1]  # Try the 2nd table first, as most pages contain a contents table first
except IndexError:
    df = tables[0]
print(df.to_string())
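If the second-table guess does not suit a particular page, one possible extension is to let the caller choose the table. The sketch below is an assumption rather than part of the original script: the `-t/--table_index` option and the `build_parser` and `choose_table` helpers are hypothetical names, and the arguments are passed as an explicit list so the example runs without a real command line.

```python
import argparse


def build_parser():
    """Build the command-line parser with an extra, hypothetical table index option."""
    parser = argparse.ArgumentParser(description="Print a table from a Wikipedia page")
    parser.add_argument("-p", "--wiki_page", required=True,
                        help="Give a wiki page to get table")
    parser.add_argument("-t", "--table_index", type=int, default=1,
                        help="Zero-based index of the table to print (default: 1)")
    return parser


def choose_table(tables, index):
    """Return tables[index], falling back to the first table if index is out of range."""
    if not tables:
        raise ValueError("no tables were found on the page")
    return tables[index] if 0 <= index < len(tables) else tables[0]


# Explicit argument list for demonstration; a real script would call parse_args().
args = build_parser().parse_args(["-p", "Some_Page", "-t", "0"])
print(args.wiki_page, args.table_index)
```

In the script above, `choose_table(tables, args.table_index)` would then take the place of the try/except block around the table lookup.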

Hope this helps someone out there!

Author: aviss

Updated on June 08, 2022

Comments

  • aviss
    aviss almost 2 years

    I am having trouble scraping a wiki table and hope someone who has done this before can give me advice. From List_of_current_heads_of_state_and_government I need the countries (this works with the code below) and then only the first mention of the head of state plus their name. I am not sure how to isolate the first mention, as they all come in one cell. My attempt to pull the names gives me this error: IndexError: list index out of range. I would appreciate your help!

    import requests
    from bs4 import BeautifulSoup
    
    wiki = "https://en.wikipedia.org/wiki/List_of_current_heads_of_state_and_government"
    website_url = requests.get(wiki).text
    soup = BeautifulSoup(website_url,'lxml')
    
    my_table = soup.find('table',{'class':'wikitable plainrowheaders'})
    #print(my_table)
    
    states = []
    titles = []
    names = []
    for row in my_table.find_all('tr')[1:]:
        state_cell = row.find_all('a')[0]  
        states.append(state_cell.text)
    print(states)
    for row in my_table.find_all('td'):
        title_cell = row.find_all('a')[0]
        titles.append(title_cell.text)
    print(titles)
    for row in my_table.find_all('td'):
        name_cell = row.find_all('a')[1]
        names.append(name_cell.text)
    print(names)
    

    The desired output would be a pandas DataFrame:

    State | Title | Name |
    
  • SIM
    SIM about 6 years
    Or you can print it like this as well: print(country, title, name, sep=" | "). Thanks.
  • aviss
    aviss about 6 years
    Yes, that's exactly what I was looking for. Thanks!
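The sep suggestion from the comments can be sketched as follows, using the first row of the output from Solution 1 as sample values. The `line` variable is only added here to show the equivalent string built with join.

```python
country, title, name = "Afghanistan", "President", "Ashraf Ghani"

# print() joins its positional arguments with the given separator.
print(country, title, name, sep=" | ")

# Equivalent string built explicitly, e.g. for writing to a file.
line = " | ".join([country, title, name])
print(line)
```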