How to get text of an element in Selenium WebDriver, without including child element text?

python html selenium selenium-webdriver

140,677

Solution 1

Here's a general solution:

def get_text_excluding_children(driver, element):
    return driver.execute_script("""
    return jQuery(arguments[0]).contents().filter(function() {
        return this.nodeType == Node.TEXT_NODE;
    }).text();
    """, element)

The element passed to the function can be something obtained from the find_element...() methods (i.e. it can be a WebElement object).

Or, if you don't have jQuery or don't want to use it, you can replace the body of the function above with this:

return driver.execute_script("""
var parent = arguments[0];
var child = parent.firstChild;
var ret = "";
while(child) {
    if (child.nodeType === Node.TEXT_NODE)
        ret += child.textContent;
    child = child.nextSibling;
}
return ret;
""", element)

I'm actually using this code in a test suite.

Solution 2

In the HTML which you have shared:

<div id="a">This is some
   <div id="b">text</div>
</div>

The text This is some is within a text node. To depict the text node in a structured way:

<div id="a">
    This is some
   <div id="b">text</div>
</div>

This use case

To extract and print the text This is some from the text node using Selenium's Python client, you have two options:

Using splitlines(): You can identify the parent element i.e. <div id="a">, extract the innerHTML and then use splitlines() as follows:

using xpath:

print(driver.find_element_by_xpath("//div[@id='a']").get_attribute("innerHTML").splitlines()[0])

using css_selector:

print(driver.find_element_by_css_selector("div#a").get_attribute("innerHTML").splitlines()[0])

Using execute_script(): You can also use the execute_script() method which can synchronously execute JavaScript in the current window/frame as follows:

using xpath and firstChild:

parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].firstChild.textContent;', parent_element).strip())

using xpath and childNodes[n]:

parent_element = driver.find_element_by_xpath("//div[@id='a']")
print(driver.execute_script('return arguments[0].childNodes[0].textContent;', parent_element).strip())
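
The splitlines() approach above can be checked without a browser, since it is plain string handling on the innerHTML value. The sketch below uses a hypothetical helper name, first_line_of_inner_html, and two hand-written innerHTML strings. It suggests that the approach depends on a line break separating the parent's text from the child's markup:

```python
def first_line_of_inner_html(inner_html):
    """Return the first line of an innerHTML string, stripped of whitespace."""
    lines = inner_html.splitlines()
    return lines[0].strip() if lines else ""


# innerHTML as it would look for the markup in the question.
multi_line = 'This is some\n   <div id="b">text</div>\n'

# Same content, but serialized on a single line (hypothetical variant).
single_line = 'This is some <div id="b">text</div>'

print(first_line_of_inner_html(multi_line))   # This is some
print(first_line_of_inner_html(single_line))  # This is some <div id="b">text</div>
```

When the markup may be on one line, the execute_script() variants are likely the safer choice.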

Solution 3

def get_true_text(tag):
    children = tag.find_elements_by_xpath('*')
    original_text = tag.text
    for child in children:
        original_text = original_text.replace(child.text, '', 1)
    return original_text

Solution 4

You don't have to do a replace. Instead, you can get the length of the children's text, subtract that from the overall length, and slice into the original text. That should be substantially faster.
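
One possible reading of this suggestion is sketched below. It is not the answerer's code: the names strip_trailing_child_text and get_true_text_by_slicing are made up for illustration. It assumes that all of the children's text comes after the parent's own text, as in the question's markup. Slicing by one summed length would miscount the whitespace between several children, so this version slices one child at a time from the end:

```python
def strip_trailing_child_text(full_text, child_texts):
    """Slice each child's text off the end of the parent's text.

    Assumption: the children's text follows the parent's own text.
    A child whose text is not found at the end is left in place.
    """
    remaining = full_text
    for child_text in reversed(child_texts):
        remaining = remaining.rstrip()
        if child_text and remaining.endswith(child_text):
            # Slice by length instead of searching and replacing.
            remaining = remaining[:len(remaining) - len(child_text)]
    return remaining.strip()


def get_true_text_by_slicing(tag):
    """Apply the slicing helper to a Selenium WebElement (or a look-alike)."""
    # "xpath" is the string value of selenium's By.XPATH constant.
    children = tag.find_elements("xpath", "*")
    return strip_trailing_child_text(tag.text, [child.text for child in children])
```

For the question's markup, where the parent's text is "This is some\ntext" and the only child's text is "text", the helper returns "This is some".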

Solution 5

Unfortunately, Selenium was only built to work with Elements, not Text nodes.

If you try to use a function like find_element_by_xpath to target the text nodes, Selenium will throw an InvalidSelectorException.

One workaround is to grab the relevant HTML with Selenium and then use an HTML parsing library like BeautifulSoup that can handle text nodes more elegantly.

import bs4
from bs4 import BeautifulSoup

inner_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("innerHTML")
inner_soup = BeautifulSoup(inner_html, 'html.parser')

outer_html = driver.find_elements_by_css_selector('#a')[0].get_attribute("outerHTML")
outer_soup = BeautifulSoup(outer_html, 'html.parser')

From there, there are several ways to search for the Text content. You'll have to experiment to see what works best for your use case.

Here's a simple one-liner that may be sufficient:

inner_soup.find(text=True)

If that doesn't work, then you can loop through the element's child nodes with the .contents attribute and check their object type.

BeautifulSoup has four kinds of objects, and the one that you'll be interested in is the NavigableString type, which is produced by Text nodes. By contrast, Elements will have a type of Tag.

contents = inner_soup.contents

for bs4_object in contents:

    if (type(bs4_object) == bs4.Tag):
        print("This object is an Element.")

    elif (type(bs4_object) == bs4.NavigableString):
        print("This object is a Text node.")

Note that BeautifulSoup doesn't support XPath expressions. If you need those, then you can use some of the workarounds in this thread.
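
If adding BeautifulSoup is not an option, the same idea (keep only the top-level text nodes of the innerHTML) can be sketched with the standard library's html.parser module. This is an illustrative alternative rather than part of the answer above. The names DirectTextCollector and direct_text are made up for this example, and it assumes the innerHTML is well-formed, as browser-serialized markup normally is:

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class DirectTextCollector(HTMLParser):
    """Collect text that sits at the top level of an innerHTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # how many child elements we are currently inside
        self.chunks = []    # text found at depth 0

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def direct_text(inner_html):
    """Return the fragment's own text, excluding text inside child elements."""
    parser = DirectTextCollector()
    parser.feed(inner_html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks).strip()
```

With Selenium, the input would be the value of get_attribute("innerHTML") for the parent element. For the markup in the question, direct_text returns "This is some".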

Author by

josh

Updated on April 15, 2020

Comments

josh about 4 years
```
<div id="a">This is some
   <div id="b">text</div>
</div>
```
Getting "This is some" is non-trivial. For instance, this returns "This is some text":
```
driver.find_element_by_id('a').text
```
How does one, in a general way, get the text of a specific element without including the text of its children?

(I'm providing an answer below but will leave the question open in case someone can come up with a less hideous solution).
josh over 11 years

this runs disgustingly slowly, though... there has to be a better way??
Arran over 11 years

You should always try to get the most specific child element you can. In this case, if you've got a lot of child elements, it'll run slowly. Why don't you check whether the element actually has text before returning, i.e. make the XPath *[string-length(text()) > 1], or make the for loop check that child.text is not null and not empty? Also, what about a CSS selector? XPath queries are very slow anyway, so maybe a CSS selector will be faster.
josh over 10 years

right, what I basically realized is... don't use selenium's search methods, just use jquery
wlingke over 10 years

@josh, I would disagree with that... Selenium's methods are meant to mock interactions from a user's POV, whereas jQuery's are not. Yes, you can use both to grab elements, but in general there should be relatively few situations where you'd need to execute JavaScript.
Louis about 8 years

The first code snippet assumes jQuery is loaded in the page. The 2nd code snippet works whether or not jQuery is loaded.