requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

Your mistake is the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

ref['href'] gives you a single URL (a string), but you use it as a list in the next for loop.

So you have

for link in ref['href']:

and iterating over a string gives you its characters one at a time, so the first value is the first character of the URL http://properties.kimcore..., which is h.
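
A minimal sketch of what happens (using the page URL from the code below):

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"
# a string is an iterable of its characters, so this loop
# visits 'h', 't', 't', 'p', ... one at a time
for link in url:
    print(link)   # first iteration prints 'h'
    # requests.get('h') then raises MissingSchema,
    # because 'h' has no http:// scheme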

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # Find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list item (li) tags.
            lis = anchor.find_all('li')
            # Clean up the text from the group of 3 li tags and add it as a list to our data list.
            data.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data for this page, so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(data, columns=headers)
    print(df)

BTW: if you add a comma, as in (ref['href'], ), then you get a tuple, and the second for loop works correctly.
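
A quick demonstration of the difference (the URL is a stand-in value for ref['href']):

href = "http://example.com"    # stand-in value for illustration
print(type((href)))            # <class 'str'> - parentheses alone don't make a tuple
print(type((href,)))           # <class 'tuple'> - the trailing comma does
for link in (href,):
    print(link)                # prints the full URL once, not character by character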


EDIT: the code below creates a list table_data at the start and adds all the data into this list, converting it into a DataFrame at the end.

But now I see it reads the same page a few times, because in every row the same URL appears in every column. You would have to get the URL from only one column.

EDIT: now it doesn't read the same URL many times.

EDIT: now it gets the text and href from the first link and adds them to every element in the list when append() is used.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in the table except the first ([1:]) - these are headers
rows = soup.select('table tr')[1:]
for row in rows:

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)

Comments

  • snappers almost 2 years

    I am working on a web scraping project and have run into the following error.

    requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

    Below is my code. I retrieve all of the links from the HTML table and they print out as expected. But when I try to loop through them (links) with requests.get, I get the error above.

    from bs4 import BeautifulSoup
    import requests
    import unicodedata
    from pandas import DataFrame
    
    page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
    soup = BeautifulSoup(page.content, 'html.parser')
    
    table = soup.find('table')
    for ref in table.find_all('a', href=True):
        links = (ref['href'])
        print (links)
        for link in links:
            page = requests.get(link)
            soup = BeautifulSoup(page.content, 'html.parser')
            table = []
            # Find all the divs we need in one go.
            divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
            for div in divs:
                # find all the enclosing a tags.
                anchors = div.find_all('a')
                for anchor in anchors:
                    # Now we have groups of 3 list items (li) tags
                    lis = anchor.find_all('li')
                    # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                    table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
            # We have all the data so we add it to a DataFrame.
            headers = ['Number', 'Tenant', 'Square Footage']
            df = DataFrame(table, columns=headers)
            print (df)
    
    • furas over 6 years
      Always put the full error message (traceback) in the question (as text, not a screenshot). It contains other useful information - for example, it shows which line causes the problem.
    • P.hunter over 6 years
      As per my understanding there isn't any error, they are just not getting what they want, to be exact, right?
    • furas over 6 years
      Your mistake is the double for loop - use print to display the values in your variables and you will see what silly mistake you made.
    • Galen over 6 years
      @P.hunter The question indicates requests.exceptions.MissingSchema.
    • P.hunter over 6 years
      yea i got it thanks
    • furas over 6 years
      btw: pandas can read tables directly from web pages - all_tables = pandas.read_html(url) ; df = all_tables[0] (a short sketch follows after these comments)
  • snappers over 6 years
    Now I need to get all of the data frame outputs into a single df
  • furas over 6 years
    You can create many DataFrames and use merge, join, or concatenate to create one df with all the data.
  • furas over 6 years
    You can also create a single df at the start and append() new rows in the loop.
  • furas over 6 years
    You can also create a single list at the start, append() data to the list, and at the end convert that single list into a single df.
  • snappers over 6 years
    Thanks so much, this is really helping me learn. Is it also possible to add the property name to each list as a column? The property name is contained between each 'a' tag along with the first href value we scraped.
  • furas over 6 years
    Yes, you can get the text from the <a> tag when you get the href, and later add it to the list when you do append(). See the new code above.
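
A footnote to the comments above - a minimal sketch of the pandas.read_html approach furas mentions, combined with pandas.concat (one of the merge/join/concatenate options) to turn several DataFrames into one. Note that read_html needs an HTML parser such as lxml or html5lib installed, and it only captures the visible table text, not the href attributes:

import pandas as pd

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# read_html parses every <table> on the page into its own DataFrame
all_tables = pd.read_html(url)
df = all_tables[0]
print(df.head())

# several DataFrames - e.g. one per property page - can then be
# combined into a single frame
combined = pd.concat(all_tables, ignore_index=True)
print(combined.shape)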