Select li elements from ul with xpath

Solution 1

The code below should produce the required output:

items = fixed_content.xpath('//ul/li//span | //ul/li//div[@class="subClass-1"]')
for item in items:
    # text_content() returns all text inside the element; strip() trims the surrounding whitespace
    print(item.text_content().strip())

The output is

....text1....
....text2....
....text3....

Alternatively, if you want each text value in its own variable, select the li elements and query relative to each one (note the leading dot):

items = fixed_content.xpath('//ul/li')
for item in items:
    text1 = item.xpath('.//a[@class="name"]/span')[0].text_content().strip()
    text2 = item.xpath('.//div[@class="subClass-1"]')[0].text_content().strip()
    text3 = item.xpath('.//span[@class="subClass-2"]')[0].text_content().strip()
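
To try the per-item approach above end to end, here is a minimal sketch. It assumes lxml is installed; the sample HTML, its values (Alice, 42, ...) and the helper name extract_rows are made up for illustration and only mimic the structure from the question.

```python
from lxml.html import fromstring

# Hypothetical sample data: a trimmed-down copy of the structure in the question.
SAMPLE_HTML = """
<html><body>
<ul>
  <li>
    <div id="div-1">
      <div id="subdiv-1"><a class="name"><span> Alice </span></a></div>
      <div id="subdiv-2">
        <div class="class-1">
          <div class="subClass-1"><div> 42 </div></div>
          <span class="subClass-2"> active </span>
        </div>
      </div>
    </div>
  </li>
  <li>
    <div id="div-1">
      <div id="subdiv-1"><a class="name"><span> Bob </span></a></div>
      <div id="subdiv-2">
        <div class="class-1">
          <div class="subClass-1"><div> 7 </div></div>
          <span class="subClass-2"> inactive </span>
        </div>
      </div>
    </div>
  </li>
</ul>
</body></html>
"""


def extract_rows(html):
    """Return one (text1, text2, text3) tuple per li element."""
    root = fromstring(html)
    rows = []
    for item in root.xpath('//ul/li'):
        # The leading dot keeps each query inside the current li.
        text1 = item.xpath('.//a[@class="name"]/span')[0].text_content().strip()
        text2 = item.xpath('.//div[@class="subClass-1"]')[0].text_content().strip()
        text3 = item.xpath('.//span[@class="subClass-2"]')[0].text_content().strip()
        rows.append((text1, text2, text3))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_HTML):
        print(row)
```

Indexing with [0] raises IndexError when an li lacks one of the elements, so real pages may need a check on the result list first.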

Solution 2

Your XPath queries seem to give the wanted output for me when they are written out in full for text1, text2 and text3. Using the string() function (as a trailing /string() step, which is XPath 2.0 syntax) you can select the inner text value of the matched element:

//ul/li/div[@id="div-1"]/div[@id="subdiv-1"]/a[@class="name"]/span/string(),
//ul/li/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-1"]/div/string(),
//ul/li/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/span[@class="subClass-2"]/string()

Does writing them out and using the string() method not provide the expected text1-3 values for you?
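
lxml evaluates XPath 1.0, where string() wraps the whole path rather than being appended as a final step. The sketch below shows that form; it assumes lxml is installed, and the sample HTML and its values are hypothetical. normalize-space() is used as a variant that also trims the surrounding whitespace.

```python
from lxml.html import fromstring

# Hypothetical sample data: one li with the structure from the question.
SAMPLE_HTML = """
<html><body>
<ul>
  <li>
    <div id="div-1">
      <div id="subdiv-1"><a class="name"><span> Alice </span></a></div>
      <div id="subdiv-2">
        <div class="class-1">
          <div class="subClass-1"><div> 42 </div></div>
          <span class="subClass-2"> active </span>
        </div>
      </div>
    </div>
  </li>
</ul>
</body></html>
"""

root = fromstring(SAMPLE_HTML)
item = root.xpath('//ul/li')[0]

# string(path) returns the string value of the first node matched by path.
text1 = item.xpath('string(.//a[@class="name"]/span)').strip()

# normalize-space(path) does the same and also collapses/trims whitespace.
text2 = item.xpath('normalize-space(.//div[@class="subClass-1"]/div)')
text3 = item.xpath('normalize-space(.//span[@class="subClass-2"])')

print(text1, text2, text3)
```

Both functions return an empty string when nothing matches, so no [0] indexing is needed.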

Author by

MinionAttack

Master and degree on computer science (software developer). I'm always learning something new :D

Updated on June 04, 2022

Comments

  • MinionAttack
    MinionAttack about 2 years

    I'm starting out with XPath in lxml on Python 3 and I can't find the right syntax to select all li elements of a ul together with their content. I'm working with this structure:

    <body>
     <div> ..... </div>
     <div> ..... </div>
     <div id="div-A">
      <div id="subdiv-1">
       <form> ... </form>
       <div> ..... </div>
       <div> ..... </div>
       <ul>
        <li>
         <div id="div-1">
          <div> ..... </div>
          <div> ..... </div>
          <div id="subdiv-1">
           <a class="name">
            <span>
              ....text1....
            </span>
           </a>
          </div>
          <div id="subdiv-2">
           <div class="class-1">
            <div class="subClass-1">
             <div> ....text2.... </div>
            </div>
            <span class="subClass-2">
             ....text3....
            </span>
           </div>
          </div>
         </div>
        </li>
        ... x23...
       </ul>
      </div>
     </div>
    </body>
    

    My goal is to get text1, text2 and text3.

    So first, I try to get all li elements with their content:

    content = html_response.content
    fixed_content = fromstring(content)  # parse the HTML and correct malformed HTML
    items = fixed_content.xpath('//ul/li/*')
    

    Then I pass items to a function that loops over the 23 li elements and tries to get the texts:

    for item in items:
     text1 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-1"]/a[@class="name"]/span').text_content()
     text2 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-1"]/div').text_content()
     text3 = item.xpath('/div[@id="div-1"]/div[@id="subdiv-2"]/div[@class="class-1"]/div[@class="subClass-2"]/span[@class="subClass-2"]').text_content()
    

    But in every case I get an empty result with no content. What am I doing wrong?

    Regards.

  • MinionAttack
    MinionAttack almost 6 years
    Thanks, so my mistake was using '//ul/li/*' and not starting the inner queries with the dot, "xpath('. " :)
  • MinionAttack
    MinionAttack almost 6 years
    Yes, but I think my problem came from not writing the expression in one piece. I had '//ul/li/*' instead of '//ul/li', and in the for loop I didn't start with the dot, " xpath('. ", as Andersson pointed out. Thanks for your answer anyway :)
  • Lesleyvdp
    Lesleyvdp almost 6 years
    Ah I see, my bad. Glad you did get it resolved though! :)
  • Andersson
    Andersson almost 6 years
    '//ul/li/*' selects all child elements of the li nodes, not the li nodes themselves. If an XPath starts with a dot, it means you want to search among the children/descendants of the current node (item in your case), while no dot means searching the whole HTML DOM...
  • MinionAttack
    MinionAttack almost 6 years
    Thanks for the explanation!
  • BlueCacti
    BlueCacti almost 6 years
    While this code snippet may solve the question, including an explanation really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion.
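
Andersson's explanation of the leading dot can be checked with a small sketch. It assumes lxml is installed; the sample HTML is hypothetical and only large enough to show the difference between '//ul/li' and '//ul/li/*', and between queries with and without the dot.

```python
from lxml.html import fromstring

# Hypothetical sample data: one span outside the list, one inside each li.
SAMPLE_HTML = """
<html><body>
<div><span>outside</span></div>
<ul>
  <li><div id="div-1"><span>first</span></div></li>
  <li><div id="div-1"><span>second</span></div></li>
</ul>
</body></html>
"""

root = fromstring(SAMPLE_HTML)

li_nodes = root.xpath('//ul/li')        # the li elements themselves
li_children = root.xpath('//ul/li/*')   # the child elements of each li (the divs)

item = li_nodes[0]

# No dot: '//' starts at the document root, even when called on item.
whole_document = item.xpath('//span')

# Leading dot: the search is limited to descendants of item.
inside_item = item.xpath('.//span')

# A single leading slash is also absolute: it looks for a div directly under
# the document root, where only the html element lives.
absolute = item.xpath('/div[@id="div-1"]')

print([el.tag for el in li_nodes], [el.tag for el in li_children])
print(len(whole_document), len(inside_item), absolute)
```

The empty result of the last query matches what the question describes: the paths in the original loop started with '/div[...]', so they were resolved from the document root instead of from the current item.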