Why does XPath contains() select an unexpected node?

I'm trying to find an XPath expression that returns only the URLs in my documents, whatever the enclosing tag is. Here is a sample document:

<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>

The following expression:

//*[contains(.,'http')]//text()

gives me these results:

https://url
2019-08-07T15:01:51 00:00

I want to get rid of the second line. I need to be able to select only the URLs, from any XML file.

CodePudding user response：

Well, let's ignore the fact that not all URLs contain "http" and not everything that contains "http" is a URL...

To find all text nodes containing "http", just use //text()[contains(., 'http')].
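
As a rough illustration of what that expression selects, here is a minimal Python sketch. It does not run XPath: the standard library's ElementTree supports only a small XPath subset with no contains() function or text() step, so the filter is mimicked by walking the text nodes directly. The variable names (SITEMAP, urls) are made up for this example.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question.
SITEMAP = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# itertext() yields every piece of text in document order, including the
# whitespace-only text between elements. Keeping only the pieces that
# contain "http" mimics //text()[contains(., 'http')].
urls = [text.strip() for text in root.itertext() if "http" in text]

print(urls)  # ['https://url']
```

The test is applied to each text node on its own, so the lastmod text never qualifies.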

CodePudding user response：

The reason that your XPath,

//*[contains(.,'http')]//text()

selects a surprise second result is that it selects all elements whose string-value contains the substring "http" and then returns all of their descendant text nodes. The matching elements include not just the immediate parent element of the targeted text node but its ancestors as well:

The loc element, as you expected.
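The urlset and url elements too, which you did not expect. (The string-value of an element is the concatenation of all of its descendant text nodes, so theirs contain "http" as well. Because these elements also have the 2019-08-07T15:01:51 00:00 text node as a descendant, //text() returns it too.)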
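
The point about string-values can be checked with a small Python sketch. It assumes that joining an element's itertext() output is a fair stand-in for its XPath string-value, and it uses a hypothetical helper, local_name, to strip the namespace that ElementTree puts in front of tag names. It is an approximation of the predicate, not an XPath evaluation.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question.
XML = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>"""


def local_name(tag):
    # ElementTree stores namespaced tags as "{namespace-uri}name";
    # keep only the part after the closing brace.
    return tag.rsplit("}", 1)[-1]


root = ET.fromstring(XML)

# Approximate *[contains(., 'http')]: an element matches when the
# concatenation of all its descendant text contains "http".
matching = [
    local_name(el.tag)
    for el in root.iter()
    if "http" in "".join(el.itertext())
]

print(matching)  # ['urlset', 'url', 'loc']
```

Only lastmod fails the test. Each of the three matching elements then contributes all of its descendant text nodes, which is how the lastmod text ends up in the result.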

Alternatives to achieve desired result

Narrow the * all-elements wildcard to a single, named element:
```
//loc[contains(.,'http')]/text()
```
Narrow the * all-elements wildcard to multiple, named elements:
```
//*[(self::loc or self::e2) and contains(.,'http')]/text()
```
Select all text nodes containing the substring "http", as noted by Michael Kay:
```
//text()[contains(., 'http')]
```
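
One caveat: the sample document declares a default namespace. In an XPath 1.0 engine, an unprefixed name test such as loc generally matches only elements in no namespace, so the named-element variants may need a namespace prefix binding or a local-name() test. The Python sketch below mimics the second alternative by comparing local names. It is not an XPath evaluation, the names DOC, wanted and results are illustrative, and e2 is just the placeholder name from the expression above.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the question.
DOC = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://url
    </loc>
    <lastmod>2019-08-07T15:01:51 00:00
    </lastmod>
  </url>
</urlset>"""

root = ET.fromstring(DOC)

# Element names to accept, compared by local name so that the default
# namespace does not get in the way.
wanted = {"loc", "e2"}

results = [
    el.text.strip()  # only the element's leading text is considered here
    for el in root.iter()
    if el.tag.rsplit("}", 1)[-1] in wanted  # name test, ignoring namespace
    and el.text                             # skip elements with no text
    and "http" in "".join(el.itertext())    # approximate contains(., 'http')
]

print(results)  # ['https://url']
```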
