Fetch all href links using Selenium in Python


Solution 1

Well, you have to simply loop through the list:

elems = driver.find_elements_by_xpath("//a[@href]")
for elem in elems:
    print(elem.get_attribute("href"))

find_elements_by_* returns a list of elements (note the plural 'elements'). Loop through the list and fetch the required attribute value from each element (in this case, href).

Solution 2

There is a function named find_elements_by_tag_name() you can use. This example works fine for me:

elems = driver.find_elements_by_tag_name('a')
for elem in elems:
    href = elem.get_attribute('href')
    if href is not None:
        print(href)
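The None check matters because an anchor without an href attribute returns None from get_attribute. A minimal sketch of just that filtering step, using a stand-in element class (FakeElement is hypothetical, for illustration only) so it runs without a browser:

```python
class FakeElement:
    """Stand-in for a Selenium WebElement, for illustration only."""
    def __init__(self, href):
        self._href = href

    def get_attribute(self, name):
        # A real WebElement returns None when the attribute is absent.
        return self._href if name == "href" else None

def hrefs_of(elements):
    """Collect non-empty href attributes, skipping anchors without one."""
    return [h for h in (e.get_attribute("href") for e in elements)
            if h is not None]

elems = [FakeElement("https://example.com"), FakeElement(None)]
print(hrefs_of(elems))  # ['https://example.com']
```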

Solution 3

You can match every link on the page by searching for the empty string, since every link text contains it:

links = driver.find_elements_by_partial_link_text('')

Solution 4

You can parse the HTML DOM using the htmldom library in Python. You can find it here and install it using pip:

https://pypi.python.org/pypi/htmldom/2.0

from htmldom import htmldom
dom = htmldom.HtmlDom("https://www.github.com/")  
dom = dom.createDom()

The above code creates a HtmlDom object. HtmlDom takes a default parameter, the URL of the page. Once the dom object is created, you need to call the createDom method of HtmlDom. This parses the HTML data and constructs the parse tree, which can then be used for searching and manipulating the HTML data. The only restriction the library imposes is that the data, whether HTML or XML, must have a root element.

You can query the elements using the "find" method of HtmlDom object:

p_links = dom.find("a")
for link in p_links:
    print("URL: " + link.attr("href"))

The above code will print all the links/URLs present on the web page.

Solution 5

Unfortunately, the original link posted by OP is dead...

If you're looking for a way to scrape links on a page, here's how you can scrape all of the "Hot Network Questions" links on this page with gazpacho:

from gazpacho import Soup

url = "https://stackoverflow.com/q/34759787/3731467"

soup = Soup.get(url)
a_tags = soup.find("div", {"id": "hot-network-questions"}).find("a")

[a.attrs["href"] for a in a_tags]
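If you'd rather avoid a third-party dependency for this kind of static extraction, the standard library's html.parser can collect hrefs the same way (a minimal sketch; the sample HTML string below is made up for illustration):

```python
from html.parser import HTMLParser

class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None if absent.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)

collector = HrefCollector()
collector.feed('<div><a href="/q/1">one</a><a>no link</a></div>')
print(collector.hrefs)  # ['/q/1']
```

Unlike Selenium, this only sees the raw HTML you feed it, so links injected by JavaScript will not appear.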
Author: Xonshiz

Software Developer, Occasional Gamer, Foodie, Blogger, YouTuber, Otaku and a wanna be Cyber Security Expert.

Updated on July 09, 2022

Comments

  • Xonshiz
    Xonshiz almost 2 years

    I am practicing Selenium in Python and I wanted to fetch all the links on a web page using Selenium.

    For example, I want all the links in the href attribute of all the <a> tags on http://psychoticelites.com/

    I've written a script and it works, but it gives me the object address. I've tried using the id tag to get the value, but it doesn't work.

    My current script:

    from selenium import webdriver
    from selenium.webdriver.common.keys import Keys
    
    
    driver = webdriver.Firefox()
    driver.get("http://psychoticelites.com/")
    
    assert "Psychotic" in driver.title
    
    continue_link = driver.find_element_by_tag_name('a')
    elem = driver.find_elements_by_xpath("//*[@href]")
    #x = str(continue_link)
    #print(continue_link)
    print(elem)
    
  • Ywapom
    Ywapom about 6 years
    why is it that all the documentation says xpath is "not recommended" but most of the answers on stackoverflow use xpath?
  • Xonshiz
    Xonshiz about 6 years
    XPath is NOT reliable. If the DOM of the website changes, so does the XPath, and your script is bound to crash. After working on multiple scraping scripts, I've come to the conclusion that XPath should be a last resort.
  • MortenB
    MortenB almost 5 years
    Short XPaths like the one in this example are reliable; I use lots of driver.find_element_by_xpath("//*[@id='<my identifier>']"). If XPaths become long strings that depend on columns/rows/divs and rely on layout, they should not be used.
  • GodSaveTheDucks
    GodSaveTheDucks over 4 years
    What if I need to return href's that belong to a specific class?
  • ItIsEntropy
    ItIsEntropy about 3 years
    This creates a StaleElementReferenceException for me on the line href = elem.get_attribute('href'). I tried printing elem to the console before accessing the attribute, but that just moves the exception to the print line. The exact message: stale element reference: element is not attached to the page document.
  • Xonshiz
    Xonshiz about 3 years
    You can use driver.find_elements_by_class_name("content") to get elements based on their class name, where "content" is the name of the class you're looking for.
  • mrk
    mrk almost 3 years
    I think the hint to the sleep command is helpful otherwise it is redundant to the accepted answer.