I'm trying to find an XPath expression that returns only the URLs in my documents, whatever tag they appear in. I'm testing with this document:
<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://url</loc>
        <lastmod>2019-08-07T15:01:51 00:00</lastmod>
    </url>
</urlset>
The following expression gives me these results:
//*[contains(.,'http')]//text()
https://url
2019-08-07T15:01:51 00:00
What I'm looking for is to get rid of the second line. I need to be able to get only URLs from any XML file.
CodePudding user response:
Well, let's ignore the fact that not all URLs contain "http" and not everything that contains "http" is a URL...
To find all text nodes containing "http", just use //text()[contains(., 'http')].
CodePudding user response:
The reason that your XPath,
//*[contains(.,'http')]//text()
selects a surprise second result is that this XPath says to select all elements whose string-value contains an "http" substring, and return all descendant text nodes. These elements include not just the immediate parent element of the targeted text node but its ancestors as well:
- The loc element, as you expected.
- The urlset and url elements too, as you did not expect. (Their string-values also contain "http", and the 2019-08-07T15:01:51 00:00 text node is one of their descendants, so it is returned as well.)
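To see why the ancestors match, here is a minimal Python sketch (standard library only) that computes each element's string-value for the sample document from the question. It emulates the XPath semantics by hand rather than evaluating the XPath itself, because ElementTree's XPath subset does not support contains() or text(). The helper names local_name and string_value are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://url</loc>
        <lastmod>2019-08-07T15:01:51 00:00</lastmod>
    </url>
</urlset>"""


def local_name(tag):
    # ElementTree stores namespaced tags as "{namespace-uri}local-name".
    return tag.rsplit("}", 1)[-1]


def string_value(elem):
    # XPath string-value of an element: all descendant text, concatenated.
    return "".join(elem.itertext())


root = ET.fromstring(XML)

# Emulates //*[contains(., 'http')]: every element whose string-value has "http".
matching = [local_name(e.tag) for e in root.iter() if "http" in string_value(e)]
print(matching)  # ['urlset', 'url', 'loc']

# Emulates the trailing //text() applied to the outermost match (urlset):
# all of its descendant text nodes, with whitespace-only nodes dropped here.
texts = [t.strip() for t in root.itertext() if t.strip()]
print(texts)  # ['https://url', '2019-08-07T15:01:51 00:00']
```

Because urlset and url are in the matching list, every text node beneath them is returned, including the lastmod text.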
Alternatives to achieve desired result
- Narrow the * all-elements wildcard to a single, named element:
  //loc[contains(.,'http')]/text()
- Narrow the * all-elements wildcard to multiple, named elements:
  //*[(self::loc or self::e2) and contains(.,'http')]/text()
- Select all text nodes containing the substring "http", as noted by Michael Kay:
  //text()[contains(., 'http')]

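As a rough illustration of the last alternative, the sketch below tests each text node on its own instead of testing element string-values. It is a hand-written equivalent of //text()[contains(., 'http')] in standard-library Python, not an XPath evaluation, and the strip() call is an extra convenience that the XPath itself does not perform.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<urlset xmlns="https://www.sitemaps.org/schemas/sitemap/0.9">
    <url>
        <loc>https://url</loc>
        <lastmod>2019-08-07T15:01:51 00:00</lastmod>
    </url>
</urlset>"""

root = ET.fromstring(XML)

# itertext() walks the text chunks in document order; each one is checked
# individually, so text belonging to sibling elements is never pulled in.
urls = [t.strip() for t in root.itertext() if "http" in t]
print(urls)  # ['https://url']
```

This form never refers to an element by name, so it works whatever the tag is. The name-based alternatives (//loc and self::loc) may need a namespace prefix binding in XPath 1.0 processors, since the sample document declares a default namespace.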