requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied

Your mistake is the second for loop in your code:

for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:

ref['href'] gives you a single URL (a string), but you use it as a list in the next for loop.

So you have

for link in ref['href']:

and iterating over a string gives you its characters one at a time, so the first value is the first character of the URL http://properties.kimcore..., which is h.
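
A minimal sketch of what happens (using the page URL from the code below):

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"
# a string is an iterable of its characters, so this loop
# visits 'h', 't', 't', 'p', ... one at a time
for link in url:
    print(link)   # first iteration prints 'h'
    # requests.get('h') then raises MissingSchema,
    # because 'h' has no http:// scheme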

Full working code:

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)
    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    data = []
    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id': ['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # Find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list item (li) tags.
            lis = anchor.find_all('li')
            # Clean up the text from the group of 3 li tags and add it as a list to our data list.
            data.append([unicodedata.normalize("NFKD", lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
    # We have all the data for this page, so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(data, columns=headers)
    print(df)

BTW: if you add a comma, as in (ref['href'], ), then you get a tuple, and the second for loop works correctly.
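
A quick demonstration of the difference (the URL is a stand-in value for ref['href']):

href = "http://example.com"    # stand-in value for illustration
print(type((href)))            # <class 'str'> - parentheses alone don't make a tuple
print(type((href,)))           # <class 'tuple'> - the trailing comma does
for link in (href,):
    print(link)                # prints the full URL once, not character by character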


EDIT: the code below creates a list table_data at the start and adds all the data into this list, converting it into a DataFrame at the end.

But now I see it reads the same page a few times, because in every row the same URL appears in every column. You would have to get the URL from only one column.

EDIT: now it doesn't read the same URL many times.

EDIT: now it gets the text and href from the first link and adds them to every element in the list when append() is used.

from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in the table except the first ([1:]) - these are headers
rows = soup.select('table tr')[1:]
for row in rows:

    # link in the first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')

    link_href = link['href']
    link_text = link.text

    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)

Comments

  • snappers almost 2 years

    I am working on a web scraping project and have run into the following error.

    requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?

    Below is my code. I retrieve all of the links from the HTML table and they print out as expected. But when I try to loop through them (links) with requests.get, I get the error above.

    from bs4 import BeautifulSoup
    import requests
    import unicodedata
    from pandas import DataFrame
    
    page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
    soup = BeautifulSoup(page.content, 'html.parser')
    
    table = soup.find('table')
    for ref in table.find_all('a', href=True):
        links = (ref['href'])
        print (links)
        for link in links:
            page = requests.get(link)
            soup = BeautifulSoup(page.content, 'html.parser')
            table = []
            # Find all the divs we need in one go.
            divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
            for div in divs:
                # find all the enclosing a tags.
                anchors = div.find_all('a')
                for anchor in anchors:
                    # Now we have groups of 3 list items (li) tags
                    lis = anchor.find_all('li')
                    # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                    table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])
            # We have all the data so we add it to a DataFrame.
            headers = ['Number', 'Tenant', 'Square Footage']
            df = DataFrame(table, columns=headers)
            print (df)
    
    • furas over 6 years
      Always put the full error message (traceback) in the question (as text, not a screenshot). It contains other useful information - for example, it shows which line causes the problem.
    • P.hunter over 6 years
      As per my understanding there isn't any error, they are just not getting what they want, to be exact, right?
    • furas over 6 years
      Your mistake is the double for loop - use print to display the values in your variables and you will see what silly mistake you made.
    • Galen over 6 years
      @P.hunter The question indicates requests.exceptions.MissingSchema.
    • P.hunter over 6 years
      yea i got it thanks
    • furas over 6 years
      btw: pandas can read tables directly from web pages - all_tables = pandas.read_html(url) ; df = all_tables[0] (a short sketch follows after these comments)
  • snappers over 6 years
    Now I need to get all of the data frame outputs into a single df
  • furas over 6 years
    You can create many DataFrames and use merge, join, or concatenate to create one df with all the data.
  • furas over 6 years
    You can also create a single df at the start and append() new rows in the loop.
  • furas over 6 years
    You can also create a single list at the start, append() data to the list, and at the end convert that single list into a single df.
  • snappers over 6 years
    Thanks so much, this is really helping me learn. Is it also possible to add the property name to each list as a column? The property name is contained between each 'a' tag along with the first href value we scraped.
  • furas over 6 years
    Yes, you can get the text from the <a> tag when you get the href, and later add it to the list when you do append(). See the new code above.
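
A footnote to the comments above - a minimal sketch of the pandas.read_html approach furas mentions, combined with pandas.concat (one of the merge/join/concatenate options) to turn several DataFrames into one. Note that read_html needs an HTML parser such as lxml or html5lib installed, and it only captures the visible table text, not the href attributes:

import pandas as pd

url = "http://properties.kimcorealty.com/property/output/find/search4/view:list/"

# read_html parses every <table> on the page into its own DataFrame
all_tables = pd.read_html(url)
df = all_tables[0]
print(df.head())

# several DataFrames - e.g. one per property page - can then be
# combined into a single frame
combined = pd.concat(all_tables, ignore_index=True)
print(combined.shape)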