Get only the first link of a URLs list with BeautifulSoup


Solution 1

You can do it with a one-liner:

import re

soup.find('a', href=re.compile('^http://get.cm/get'))['href']

To assign the result to a variable:

variable=soup.find('a', href=re.compile('^http://get.cm/get'))['href']

I don't know exactly what you are doing, so here is the full code from scratch. It is Python 2 (urllib2 and the print statement). Note: if you use bs4, change the BeautifulSoup import to "from bs4 import BeautifulSoup".

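import re
import urllib2

from BeautifulSoup import BeautifulSoup  # with bs4: from bs4 import BeautifulSoup

# Download the page and parse it.
request = urllib2.Request("http://download.cyanogenmod.com/?device=p970")
response = urllib2.urlopen(request)
soup = BeautifulSoup(response)

# find() returns only the first <a> whose href matches the pattern.
variable = soup.find('a', href=re.compile('^http://get.cm/get'))['href']
print variable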

>>> 
http://get.cm/get/4jj
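
For readers on Python 3 with bs4, the same lookup might be sketched as follows. The HTML string is a made-up stand-in for the download page (its real markup is not reproduced here), and the "html.parser" parser and the None check are assumptions added for illustration, not part of the original answer.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the download page.
html = """
<table>
  <tr>
    <td><a href="/get/jenkins/1/cm-10.zip">cm-10.zip</a></td>
    <td><a href="http://get.cm/get/4jj">short link</a></td>
  </tr>
  <tr>
    <td><a href="http://get.cm/get/9xy">short link</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# The dot is escaped so it matches a literal "." only.
pattern = re.compile(r"^http://get\.cm/get")

# find() stops at the first <a> whose href matches the pattern.
first = soup.find("a", href=pattern)
variable = first["href"] if first is not None else None
print(variable)
```

With the sample markup above, the relative link is skipped and only the first short URL is kept.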

Solution 2

You can do this more easily and clearly in BeautifulSoup without loops.

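Assuming your parsed BeautifulSoup object is named soup (the "condition" in tag test below is only a placeholder; replace it with a check that fits your page):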

output = soup.find(lambda tag: tag.name=='a' and "condition" in tag).attrs['href']
print output

Note that the find method returns only the first result, while find_all returns all of them.
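
As a rough sketch of what a real condition could look like, the snippet below filters on the link text and contrasts find with find_all. It assumes Python 3 with bs4; the markup, the is_wanted helper, and the "match" keyword are all hypothetical. As far as I can tell, in bs4 an expression like "condition" in tag checks the tag's direct children for an exact match rather than searching its text, so a substring test on get_text() is used here instead.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two of the three links contain the word "match".
html = """
<a href="http://example.com/a">first match</a>
<a href="http://example.com/b">other</a>
<a href="http://example.com/c">second match</a>
"""

soup = BeautifulSoup(html, "html.parser")


def is_wanted(tag):
    """Return True for <a> tags whose visible text contains 'match'."""
    return tag.name == "a" and "match" in tag.get_text()


first = soup.find(is_wanted)      # a single Tag, or None if nothing matches
every = soup.find_all(is_wanted)  # a list of all matching Tags

output = first["href"]
print(output)
print([tag["href"] for tag in every])
```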


Author: Gabriele Salvatori

I'm a Computer Science student at Tor Vergata University of Rome, Italy. Interested in engineering, computer science, electronics, and everything about scientific disciplines.

Updated on September 15, 2022

Comments

  • Gabriele Salvatori over 1 year

    I parsed an entire HTML file and extracted some URLs with the BeautifulSoup module in Python, using this piece of code:

    for link in soup.find_all('a'):
        for line in link:
            if "condition" in line:
                print link.get("href")

    and in the shell I get a series of links that satisfy the condition in the if statement:

    • http:// ..link1
    • http:// ..link2
    • .
    • .
    • http:// ..linkn

    How can I put only the first link of this list into a variable "output"?

    EDIT:

    The web page is http://download.cyanogenmod.com/?device=p970, and the script has to return the first short URL (http://get.cm/...) in the HTML page.

  • Gabriele Salvatori over 11 years
    I tried ending the loop with break and assigning the link to a variable inside the inner loop, but now the shell does not print anything.
  • Gabriele Salvatori over 11 years
    Yep, after the end of the loop, using break, with print output.
  • Gabriele Salvatori over 11 years
    This solution gives me this error: AttributeError: 'NoneType' object has no attribute 'attrs'
  • Gabriele Salvatori over 11 years
    I've fixed the indentation, but the shell returns nothing.
  • Gabriele Salvatori over 11 years
    The shell still returns nothing.
  • jdotjdot over 11 years
    That's because you didn't implement the lambda correctly. What I wrote for you was a sample using the obviously incorrect "condition" in tag. You're getting the AttributeError because soup.find isn't finding any objects for which the lambda returns True, and so then you're attempting to call attrs on None. I would have been able to provide a better answer had you given the website you were pulling originally.
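
The failure described in the last comment can be avoided by checking the result of find before reading attributes from it. The following is a minimal sketch assuming Python 3 with bs4; the one-link markup and the "condition" keyword are hypothetical and chosen so that nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup in which no link text contains "condition".
html = '<a href="http://example.com/x">nothing relevant</a>'
soup = BeautifulSoup(html, "html.parser")

# find() returns None when no tag satisfies the filter function.
match = soup.find(lambda tag: tag.name == "a" and "condition" in tag.get_text())

if match is None:
    # Reading match.attrs here would raise the AttributeError from the comments.
    output = None
    print("no matching link found")
else:
    output = match["href"]
    print(output)
```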