Get only the first link of a URLs list with BeautifulSoup
Solution 1
You can do it with a one-liner:
import re
soup.find('a', href=re.compile('^http://get.cm/get'))['href']
To assign it to a variable:
variable=soup.find('a', href=re.compile('^http://get.cm/get'))['href']
I don't know exactly what you are doing, so here is the full code from scratch. NB: if you use bs4, change the import to from bs4 import BeautifulSoup.
import urllib2
from BeautifulSoup import BeautifulSoup
import re
request = urllib2.Request("http://download.cyanogenmod.com/?device=p970")
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)
variable=soup.find('a', href=re.compile('^http://get.cm/get'))['href']
print variable
>>>
http://get.cm/get/4jj
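For readers on Python 3 with bs4, the same idea can be sketched roughly as follows. This is only an illustration: SAMPLE_HTML is a made-up stand-in for the downloaded page (the real markup may differ, and the site may no longer be reachable), and first_short_url is a name chosen here. With a live page you would pass the response from urllib.request.urlopen to BeautifulSoup instead of the sample string.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a href="/get/jenkins/1/cm-10-p970.zip">cm-10-p970.zip</a></td>
    <td><a href="http://get.cm/get/4jj">short link</a></td>
  </tr>
  <tr>
    <td><a href="http://get.cm/get/4ab">short link</a></td>
  </tr>
</table>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching tag in document order, or None.
first = soup.find("a", href=re.compile(r"^http://get\.cm/get"))

# Guard against None so a page without short links does not raise.
first_short_url = first["href"] if first is not None else None
print(first_short_url)
```

The dot in get.cm is escaped here so that it matches a literal dot; the unescaped pattern above also works, it is just slightly more permissive.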
Solution 2
You can do this more easily and clearly in BeautifulSoup without loops.
Assuming your parsed BeautifulSoup object is named soup:
output = soup.find(lambda tag: tag.name=='a' and "condition" in tag).attrs['href']
print output
Note that the find method returns only the first result, while find_all returns all of them.
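To make the difference concrete, here is a small sketch with bs4 on Python 3. The HTML snippet and the "Download" text test are invented for illustration; they stand in for the unspecified "condition" and should be replaced by whatever actually identifies the links you want.

```python
from bs4 import BeautifulSoup

# Made-up markup, only to have something to search.
SAMPLE_HTML = """
<p>
  <a href="http://example.com/about">About</a>
  <a href="http://example.com/dl/1">Download nightly</a>
  <a href="http://example.com/dl/2">Download stable</a>
</p>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")


def is_download_link(tag):
    # Hypothetical condition: an <a> tag with an href whose text mentions "Download".
    return tag.name == "a" and tag.has_attr("href") and "Download" in tag.get_text()


first_match = soup.find(is_download_link)      # a single Tag, or None
all_matches = soup.find_all(is_download_link)  # a list-like result, possibly empty

output = first_match["href"] if first_match is not None else None
all_hrefs = [tag["href"] for tag in all_matches]

print(output)
print(all_hrefs)
```

The condition is checked against tag.get_text() rather than the tag itself, so it is an ordinary substring test on the link text.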
Gabriele Salvatori
I'm a Computer Science student at Tor Vergata University of Rome, Italy. Interested in engineering, computer science, electronics, and everything about scientific disciplines.
Updated on September 15, 2022
Comments
-
Gabriele Salvatori over 1 year
I parsed an entire HTML file, extracting some URLs with the BeautifulSoup module in Python, using this piece of code:
for link in soup.find_all('a'):
    for line in link:
        if "condition" in line:
            print link.get("href")
and I get in the shell a series of links that satisfy the condition in the if statement:
- http:// ..link1
- http:// ..link2
- .
- .
- http:// ..linkn
How can I put only the first link of this list in a variable "output"?
EDIT:
The web page is http://download.cyanogenmod.com/?device=p970; the script has to return the first short URL (http://get.cm/...) in the HTML page.
-
Gabriele Salvatori over 11 years
I tried ending the loop with break and assigning the link to a variable inside the line loop, but now the shell does not print anything.
-
Gabriele Salvatori over 11 years
Yep, after the end of the loop, using break with print output.
-
Gabriele Salvatori over 11 years
This solution gives me this error:
AttributeError: 'NoneType' object has no attribute 'attrs'
-
Gabriele Salvatori over 11 years
I've fixed the indentation, but the shell returns nothing.
-
Gabriele Salvatori over 11 years
The shell still returns nothing.
-
jdotjdot over 11 years
That's because you didn't implement the lambda correctly. What I wrote for you was a sample using the obviously incorrect "condition" in tag. You're getting the AttributeError because soup.find isn't finding any objects for which the lambda returns True, and so you're then attempting to call attrs on None. I would have been able to provide a better answer had you given the website you were pulling from originally.