Why does XPath contains() select an unexpected node?

Time:01-05

I'm trying to find an XPath expression that returns only the URLs from all my documents, whatever tag they appear in. I'm testing with this one:

<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>

The following expression gives me these results:

//*[contains(.,'http')]//text()
  1. https://url
  2. 2019-08-07T15:01:51 00:00

I want to get rid of the second result. I need to be able to get only the URLs from any XML file.

CodePudding user response:

Well, let's ignore the fact that not all URLs contain "http" and not everything that contains "http" is a URL...

To find all text nodes containing "http", just use //text()[contains(., 'http')].

CodePudding user response:

The reason that your XPath,

//*[contains(.,'http')]//text()

selects a surprising second result is that it says to select every element whose string-value contains the substring "http", and then to return all of that element's descendant text nodes. The matching elements include not just the immediate parent of the targeted text node but its ancestors as well:

  1. The loc element, as you expected.
  2. The urlset and url elements too, which you did not expect. (Their string-values include the loc text, so they pass the contains() test, and their descendant text nodes include the 2019-08-07T15:01:51 00:00 text under lastmod.)
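
The matching step can be checked outside XPath. The sketch below is an emulation, not an XPath evaluation: Python's standard xml.etree.ElementTree does not support contains(), so it computes each element's string-value by joining its descendant text with itertext() and tests for the substring. The helper names local_name and string_value are illustrative, not part of any library.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (timestamp kept exactly as posted).
SITEMAP = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.rsplit("}", 1)[-1]


def string_value(element):
    """XPath string-value of an element: all descendant text, concatenated."""
    return "".join(element.itertext())


root = ET.fromstring(SITEMAP)

# Emulates the predicate in //*[contains(., 'http')], in document order.
matching = [local_name(el) for el in root.iter() if "http" in string_value(el)]
print(matching)  # ['urlset', 'url', 'loc']
```

Only lastmod fails the test, but its text is still returned by the original expression because it is a descendant of urlset and url, which both pass.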

Alternatives to achieve desired result

  • Narrow the * all-elements wildcard to a single, named element:

    //loc[contains(.,'http')]/text()
    
  • Narrow the * all-elements wildcard to multiple, named elements (e2 is a placeholder for a second element name):

    //*[(self::loc or self::e2) and contains(.,'http')]/text()
    
  • Select all text nodes containing the substring "http", as noted by Michael Kay:

    //text()[contains(., 'http')]
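
As a rough check of that last alternative, here is a minimal Python sketch using only the standard library. It is an approximation rather than a real XPath run: ElementTree's XPath subset has no text() or contains(), so the sketch walks the text pieces with itertext() and filters them in Python. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Sample document from the question (timestamp kept exactly as posted).
SITEMAP = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# itertext() yields each piece of character data in document order, which
# stands in for //text(); the 'in' test stands in for contains(., 'http').
# strip() only trims the surrounding whitespace and newline for display.
urls = [text.strip() for text in root.itertext() if "http" in text]
print(urls)  # ['https://url']
```

Because the test is applied to each text node on its own, the lastmod text is never pulled in through an ancestor. No element names are tested either, so the default namespace declared on urlset does not come into play.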
    
