requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied
Your mistake is the second for loop in the code:
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
ref['href'] gives you a single url (a string), but you use it as a list in the next for loop.
So you effectively have
for link in ref['href']:
and iterating over a string gives you one char at a time, so the first value is the first char from the url http://properties.kimcore... which is h
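A minimal sketch of why the loop produces 'h', using only the standard library and no network access. The url below is a made-up placeholder (not one of the real links from the page), and has_scheme is a hypothetical helper added only for illustration.

```python
from urllib.parse import urlparse

# Placeholder url for illustration only; it is not taken from the scraped page.
href = "http://example.com/property/1"


def has_scheme(url):
    """Return True if the string starts with a scheme such as http."""
    return urlparse(url).scheme != ""


# Iterating over a string yields its characters one by one.
chars = [c for c in href]
first = chars[0]

print(first)              # the first "link" the loop would see
print(has_scheme(first))  # a single character carries no scheme
print(has_scheme(href))   # the whole string does
```

Printing the loop variable like this before calling requests.get() is usually enough to spot the problem.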
Full working code
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    link = ref['href']
    print(link)

    page = requests.get(link)
    soup = BeautifulSoup(page.content, 'html.parser')
    table = []

    # Find all the divs we need in one go.
    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        # find all the enclosing a tags.
        anchors = div.find_all('a')
        for anchor in anchors:
            # Now we have groups of 3 list items (li) tags
            lis = anchor.find_all('li')
            # we clean up the text from the group of 3 li tags and add them as a list to our table list.
            table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])

    # We have all the data so we add it to a DataFrame.
    headers = ['Number', 'Tenant', 'Square Footage']
    df = DataFrame(table, columns=headers)
    print (df)
BTW: if you use a comma, as in (ref['href'], ), then you get a tuple and then the second for loop works correctly.
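The difference the comma makes can be checked without any scraping. A small sketch, assuming a made-up placeholder url:

```python
# Placeholder url for illustration only.
href = "http://example.com/property/1"

plain = (href)    # parentheses alone change nothing: still a str
single = (href,)  # the trailing comma makes a one-element tuple

print(type(plain).__name__)
print(type(single).__name__)

# Looping over the tuple yields the whole url exactly once.
for link in single:
    print(link)
```

A one-element tuple works, but assigning the url to a single variable and dropping the inner loop is the simpler fix.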
EDIT: it creates the list table_data at the start and adds all data to this list. It converts the list into a DataFrame at the end.
But now I see it reads the same page a few times, because in every row the same url is in every column. You would have to get the url from only one column.
EDIT: now it doesn't read the same url many times.
EDIT: now it gets the text and href from the first link and adds them to every element in the list when you use append().
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table_data = []

# all rows in table except first ([1:]) - headers
rows = soup.select('table tr')[1:]
for row in rows:
    # link in first column (td[0])
    #link = row.select('td')[0].find('a')
    link = row.find('a')
    link_href = link['href']
    link_text = link.text
    print('text:', link_text)
    print('href:', link_href)

    page = requests.get(link_href)
    soup = BeautifulSoup(page.content, 'html.parser')

    divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
    for div in divs:
        anchors = div.find_all('a')
        for anchor in anchors:
            lis = anchor.find_all('li')
            item1 = unicodedata.normalize("NFKD", lis[0].text).strip()
            item2 = lis[1].text
            item3 = lis[2].text.strip()
            table_data.append([item1, item2, item3, link_text, link_href])

    print('table_data size:', len(table_data))

headers = ['Number', 'Tenant', 'Square Footage', 'Link Text', 'Link Href']
df = DataFrame(table_data, columns=headers)
print(df)
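The code above avoids repeated requests by taking only one link per row. Another way to avoid reading the same url many times is to remember which urls were already seen. This is only a sketch under assumptions: the hrefs are made-up placeholders and unique_in_order is a hypothetical helper, not part of requests or BeautifulSoup.

```python
def unique_in_order(urls):
    """Return urls without repeats, keeping the first occurrence of each."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Placeholder hrefs standing in for a table where each row repeats one url.
hrefs = [
    "http://example.com/property/1",
    "http://example.com/property/1",
    "http://example.com/property/2",
    "http://example.com/property/2",
]

for url in unique_in_order(hrefs):
    print(url)  # each url would be requested only once
```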
snappers
Updated on June 04, 2022
Comments
-
snappers almost 2 years
I am working on a web scraping project and have run into the following error.
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
Below is my code. I retrieve all of the links from the HTML table and they print out as expected. But when I try to loop through them (links) with requests.get, I get the error above.
from bs4 import BeautifulSoup
import requests
import unicodedata
from pandas import DataFrame

page = requests.get("http://properties.kimcorealty.com/property/output/find/search4/view:list/")
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')
for ref in table.find_all('a', href=True):
    links = (ref['href'])
    print (links)
    for link in links:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        table = []

        # Find all the divs we need in one go.
        divs = soup.find_all('div', {'id':['units_box_1', 'units_box_2', 'units_box_3']})
        for div in divs:
            # find all the enclosing a tags.
            anchors = div.find_all('a')
            for anchor in anchors:
                # Now we have groups of 3 list items (li) tags
                lis = anchor.find_all('li')
                # we clean up the text from the group of 3 li tags and add them as a list to our table list.
                table.append([unicodedata.normalize("NFKD",lis[0].text).strip(), lis[1].text, lis[2].text.strip()])

        # We have all the data so we add it to a DataFrame.
        headers = ['Number', 'Tenant', 'Square Footage']
        df = DataFrame(table, columns=headers)
        print (df)
-
furas over 6 years
Always put the full error message (Traceback) in the question (as text, not a screenshot). It contains other useful information. For example, it shows which line causes the problem.
-
P.hunter over 6 years
As per my understanding there isn't any error; they are just not getting what they want, to be exact, right?
-
furas over 6 years
Your mistake is the double for loop - use print to display the values in variables and you will see what silly mistake you made.
-
Galen over 6 years
@P.hunter The question indicates requests.exceptions.MissingSchema.
-
P.hunter over 6 years
Yeah, I got it, thanks.
-
furas over 6 years
BTW: pandas can read tables directly from web pages:
all_tables = pandas.read_html(url) ; df = all_tables[0]
-
-
Galen over 6 years
-
snappers over 6 years
Now I need to get all of the data frame outputs into a single df.
-
furas over 6 years
-
furas over 6 years
You can also create a single df at the start and append() new rows in the loop.
-
furas over 6 years
You can also create a single list at the start and append() data to the list, and at the end convert the single list into a single df.
-
snappers over 6 years
Thanks so much, this is really helping me learn. Is it also possible to add the property name to each list as a column? The property name is contained in each 'a' tag, along with the first href value we scraped.
-
furas over 6 years
Yes, you can get the text from <a> when you get href and later add it to the list when you do append(). See new code.