Parsing HTML5 data-* attribute values with Selenium in Python

python html parsing selenium custom-data-attribute

17,463

Solution 1

If you have elements like the following:

<rect rx="3" ry="3" width="76%" height="40" transform="translate(0,40)" data-value="75" class="bar">bar1</rect>
<rect rx="3" ry="3" width="76%" height="40" transform="translate(0,40)" data-value="76" class="bar">bar2</rect>

You can get the text value and the attribute value as follows:

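from selenium.webdriver.common.by import By

# The find_elements_by_* helpers were deprecated and later removed in Selenium 4
elements = driver.find_elements(By.CLASS_NAME, 'bar')
for element in elements:
    print(element.text)                         # text content of the element
    print(element.get_attribute('data-value'))  # value of the data-value attribute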

This prints out:

bar1
75
bar2
76
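
Note that get_attribute() returns a string (or None when the attribute is absent), so the values usually need converting before you can do arithmetic on them. A minimal sketch, assuming the values are plain integers as in the markup above; collect_bar_values is a hypothetical helper name, and it only relies on the .text and .get_attribute() members already used above:

```python
def collect_bar_values(elements, attribute="data-value"):
    """Return (text, int_value) pairs for elements carrying the attribute.

    Hypothetical helper: assumes each attribute value is an integer string
    such as "75"; elements without the attribute are skipped.
    """
    pairs = []
    for element in elements:
        raw = element.get_attribute(attribute)  # a str, or None if missing
        if raw is None:
            continue
        pairs.append((element.text, int(raw)))
    return pairs
```

Passing it the list of bar elements found above should give [('bar1', 75), ('bar2', 76)] for the two rect elements shown.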

Solution 2

You mention you tried:

for text in driver.find_elements_by_class_name('bar'): 
    print(data_value.text)

Since data_value is not defined anywhere, this raises a NameError. If you did print(text.text) instead, you would get the text of each element that has a bar class. (This is essentially what you do in your first snippet.)

You also mention this:

for data in driver.find_elements_by_xpath('//*[contains(@data-value)]/@data-value'): 
    print(data.text)

This cannot work because Selenium's find_element(s)... functions can only return elements or lists of elements. You are trying to get it to return an attribute, which it cannot do. XPath itself allows selecting attributes, but when you use XPath through Selenium you can only get elements back.
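
What does work is an XPath that selects the elements carrying the attribute, with the attribute read afterwards in Python. (XPath's contains() also needs two arguments, so //*[@data-value] is the way to test for the attribute's presence.) A minimal sketch, assuming a Selenium 4 style driver.find_elements(by, value); the helper name attribute_values is hypothetical, and the literal "xpath" is the string that By.XPATH stands for:

```python
def attribute_values(driver, attribute="data-value"):
    """Return the given attribute from every element that has it."""
    # Select the elements, not the attribute node: //*[@data-value]
    xpath = "//*[@{}]".format(attribute)
    elements = driver.find_elements("xpath", xpath)  # "xpath" == By.XPATH
    return [element.get_attribute(attribute) for element in elements]
```

This still costs one round-trip per element for get_attribute(), on top of the initial lookup.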

You could do what Jessamyn Smith suggested or:

results = driver.execute_script("""
    var els = document.getElementsByClassName("bar");
    var ret = [];
    for (var i = 0, el; (el = els[i]); ++i) {
        ret.push([el.textContent, el.attributes["data-value"].value]);
    }
    return ret;
""")
for r in results:
    print(r[0], r[1])

This will take one round-trip between your script and the browser. Looping and using .text and .get_attribute() involves two round-trips per iteration. The JavaScript builds a list of pairs of results. Each pair contains the text of the element in the first position, and the value of data-value in the second position.
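
The same idea can be wrapped in a function that takes the class and attribute names as parameters. This is a sketch only, with a hypothetical helper name; it relies on execute_script exposing extra positional arguments to the script as arguments[0], arguments[1], and so on, and it uses getAttribute() so that a missing attribute comes back as null (None in Python) instead of raising an error:

```python
_COLLECT_JS = """
    var els = document.getElementsByClassName(arguments[0]);
    var ret = [];
    for (var i = 0; i < els.length; ++i) {
        ret.push([els[i].textContent, els[i].getAttribute(arguments[1])]);
    }
    return ret;
"""


def text_and_attribute(driver, class_name, attribute):
    """Fetch [text, attribute] pairs for a class in a single round-trip."""
    return driver.execute_script(_COLLECT_JS, class_name, attribute)
```

Calling text_and_attribute(driver, "bar", "data-value") would then return the same list of pairs as the inline script.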


Author by

metersk

Updated on July 01, 2022

Comments

metersk almost 2 years

I am parsing a JS generated webpage like so:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


driver = webdriver.Firefox()
driver.get('https://www.consumerbarometer.com/en/graph-builder/?question=M1&filter=country:singapore,canada,mexico,brazil,argentina,united_states,bulgaria,austria,belgium,croatia,czech_republic,denmark,estonia,finland,france,germany,greece,hungary,italy,ireland,latvia,lithuania,norway,netherlands,poland,portugal,russia,romania,serbia,slovakia,spain,slovenia,sweden,switzerland,ukraine,united_kingdom,australia,china,israel,hong_kong_sar,japan,korea,new_zealand,malaysia,taiwan,turkey,vietnam')

# wait for svg to appear
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'svg')))

for text in driver.find_elements_by_class_name('bar-text-label'):
    print(text.text)

driver.close()

Besides getting the text from the class bar-text-label, I would also like to get values from an HTML5 data-* attribute. For example, given <rect rx="3" ry="3" width="76%" height="40" transform="translate(0,40)" data-value="76" class="bar"></rect>, I would like to be able to parse 76 from it.

Is this possible to do in Selenium?

I tried both of the following, with no success:

for text in driver.find_elements_by_class_name('bar'): 
    print(data_value.text)

for data in driver.find_elements_by_xpath('//*[contains(@data-value)]/@data-value'): 
    print(data.text)

metersk about 9 years

This is very interesting. I did not know you could execute JS like that.

Louis about 9 years

I did not know either at first. If you run everything locally, the difference is not great, but if you use Sauce Labs, BrowserStack or something similar to run tests remotely, the round-trips add up a lot. I've cut the time it takes to run large test suites in half by combining multiple Selenium calls into a single execute_script (or execute_async_script) call.
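
To put rough numbers on the round-trips: the loop costs one call to find the elements plus two calls per element, while the script costs one call in total. A back-of-the-envelope sketch; the 100 ms latency and the 50 elements are made-up figures for illustration, not measurements:

```python
def loop_round_trips(n_elements):
    """find_elements once, then .text and .get_attribute() per element."""
    return 1 + 2 * n_elements


def script_round_trips(n_elements):
    """A single execute_script call, however many elements there are."""
    return 1


latency_ms = 100  # hypothetical latency to a remote browser
n = 50            # hypothetical number of .bar elements
print(loop_round_trips(n) * latency_ms)    # 10100
print(script_round_trips(n) * latency_ms)  # 100
```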