Python - Web Scraping HTML table and printing to CSV


To try this on the table you pasted above, wrap the whole HTML element in a triple-quoted string assigned to html (html = ''' ... '''), then run the code. It extracts the table's data, writes it to table_data.csv, and prints each row.

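import csv
from bs4 import BeautifulSoup

# html holds the pasted markup, e.g. html = '''<table>...</table>'''
tree = BeautifulSoup(html, "lxml")
table_tag = tree.select("table")[0]
# One inner list per <tr>, holding the text of each header/data cell
tab_data = [[item.text for item in row_data.select("th,td")]
            for row_data in table_tag.select("tr")]

# newline='' stops csv from writing blank lines between rows on Windows;
# the with block closes the file once all rows are written
with open("table_data.csv", "w", newline='') as outfile:
    writer = csv.writer(outfile)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))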

The code above packs a nested for loop into a single list comprehension. To make it easier to follow, here is the same logic written out step by step:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html,"lxml")
table = soup.find('table')

list_of_rows = []
for row in table.find_all('tr'):
    list_of_cells = []
    for cell in row.find_all(["th","td"]):
        text = cell.text
        list_of_cells.append(text)
    list_of_rows.append(list_of_cells)

for item in list_of_rows:
    print(' '.join(item))

Result:

Date Open High Low Close Volume Market Cap
Sep 14, 2017 3875.37 3920.60 3153.86 3154.95 2,716,310,000 64,191,600,000
Sep 13, 2017 4131.98 4131.98 3789.92 3882.59 2,219,410,000 68,432,200,000
Sep 12, 2017 4168.88 4344.65 4085.22 4130.81 1,864,530,000 69,033,400,000
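
One detail worth checking is that the Date, Volume and Market Cap cells contain commas. The following is a minimal sketch using only the standard library, with an in-memory buffer standing in for table_data.csv (the buffer and the hard-coded sample rows are assumptions for illustration). It suggests that csv.writer quotes such fields, so they come back as single values when the file is read:

```python
import csv
import io

# Sample rows copied from the table in the question
rows = [
    ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"],
    ["Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
     "2,716,310,000", "64,191,600,000"],
]

# io.StringIO stands in for the real file; newline='' mirrors the open() call
buffer = io.StringIO(newline='')
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()

# Reading the text back should give the same cells, commas included
buffer.seek(0)
round_trip = list(csv.reader(buffer))

print(csv_text)
print(round_trip == rows)
```

The printed CSV text shows the comma-containing fields wrapped in double quotes, which is why a spreadsheet or csv.reader still sees seven columns per row.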
Author by

user8508478

Updated on June 11, 2022

Comments

  • user8508478 almost 2 years

    I'm pretty much brand new to Python, but I'm looking to build a webscraping tool that will rip data from an HTML table online and print it into a CSV in the same format.

    Here's a sample of the HTML table (it's enormous, so I'm going to provide only a few rows).

    <div class="col-xs-12 tab-content">
            <div id="historical-data" class="tab-pane active">
              <div class="tab-header">
                <h2 class="pull-left bottom-margin-2x">Historical data for Bitcoin</h2>
    
                <div class="clear"></div>
    
                <div class="row">
                  <div class="col-md-12">
                    <div class="pull-left">
                      <small>Currency in USD</small>
                    </div>
                    <div id="reportrange" class="pull-right">
                        <i class="glyphicon glyphicon-calendar fa fa-calendar"></i>&nbsp;
                        <span>Aug 16, 2017 - Sep 15, 2017</span> <b class="caret"></b>
                    </div>
                  </div>
                </div>
    
                <table class="table">
                  <thead>
                  <tr>
                    <th class="text-left">Date</th>
                    <th class="text-right">Open</th>
                    <th class="text-right">High</th>
                    <th class="text-right">Low</th>
                    <th class="text-right">Close</th>
                    <th class="text-right">Volume</th>
                    <th class="text-right">Market Cap</th>
                  </tr>
                  </thead>
                  <tbody>
    
                    <tr class="text-right">
                      <td class="text-left">Sep 14, 2017</td>
                      <td>3875.37</td>     
                      <td>3920.60</td>
                      <td>3153.86</td>
                      <td>3154.95</td>
                      <td>2,716,310,000</td>
                      <td>64,191,600,000</td>
                    </tr>
    
                    <tr class="text-right">
                      <td class="text-left">Sep 13, 2017</td>
                      <td>4131.98</td>     
                      <td>4131.98</td>
                      <td>3789.92</td>
                      <td>3882.59</td>
                      <td>2,219,410,000</td>
                      <td>68,432,200,000</td>
                    </tr>
    
                    <tr class="text-right">
                      <td class="text-left">Sep 12, 2017</td>
                      <td>4168.88</td>     
                      <td>4344.65</td>
                      <td>4085.22</td>
                      <td>4130.81</td>
                      <td>1,864,530,000</td>
                      <td>69,033,400,000</td>
                    </tr>                
                  </tbody>
                </table>
              </div>
    
            </div>
        </div>
    

    I'm particularly interested in recreating the table with the same column headers: "Date," "Open," "High," "Low," "Close," "Volume," "Market Cap." So far I've written a simple script that goes to the URL, downloads the HTML, parses it with BeautifulSoup, and then uses 'for' statements to get the td elements. Below is a sample of my code (URL omitted) and the result:

    from bs4 import BeautifulSoup
    import requests
    import pandas as pd
    import csv
    
    url = "enterURLhere"
    page = requests.get(url)
    pagetext = page.text
    
    pricetable = {
        "Date" : [],
        "Open" : [],
        "High" : [],
        "Low" : [],
        "Close" : [],
        "Volume" : [],
        "Market Cap" : []
    }
    
    soup = BeautifulSoup(pagetext, 'html.parser')
    
    file = open("test.csv", 'w')
    
    for row in soup.find_all('tr'):
        for col in row.find_all('td'):
            print(col.text)
    

    sample output

    Anyone have any pointers on how to at least reformat the data pull into the table? Thanks.
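
One possible way to regroup the flat stream of cells that the loop in the question prints is to chunk it by the number of columns. This is a minimal sketch, assuming the cell texts have already been collected into a list; cells is a hypothetical stand-in for the col.text values, hard-coded here from the sample table, and headers mirrors the keys of pricetable:

```python
headers = ["Date", "Open", "High", "Low", "Close", "Volume", "Market Cap"]

# Hypothetical stand-in for the col.text values printed one per line
cells = [
    "Sep 14, 2017", "3875.37", "3920.60", "3153.86", "3154.95",
    "2,716,310,000", "64,191,600,000",
    "Sep 13, 2017", "4131.98", "4131.98", "3789.92", "3882.59",
    "2,219,410,000", "68,432,200,000",
]

# Slice the flat list into rows of len(headers) cells each
width = len(headers)
rows = [cells[i:i + width] for i in range(0, len(cells), width)]

# Fill a column-oriented dict shaped like pricetable in the question
pricetable = {name: [] for name in headers}
for row in rows:
    for name, value in zip(headers, row):
        pricetable[name].append(value)

print(rows[0])
print(pricetable["Close"])
```

Chunking relies on every row having exactly seven cells. Collecting the cells per tr element, as the answer's loop does, avoids that assumption.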