I am trying to select all of the first occurrences of a specific type in the following structure:
<div >
<div >
<h3>Title1</h3>
<span >
<a href="https://www.domain1.org/" target="_blank">Org1</a>
</span>
<span >Loc1</span>
<div >
desc1
<a href="https://www.domain1-1.org/" target="_blank">https://www.domain1-1.org/</a>
<span >Posted on: 01/19/2022</span>
</div>
</div>
<div >
<h3>Title2</h3>
<span >
<a href="https://www.domain2.org/" target="_blank">Org2</a>
</span>
<span >Loc2</span>
<div >
desc2
<a href="https://www.domain2.org/" target="_blank">https://www.domain2.org/</a>
<span >Posted on: 01/18/2022</span>
</div>
</div>
<div >
<h3>Title3</h3>
<span >
<a href="https://www.domain3.org/" target="_blank">Org3</a>
</span>
<span >Loc3</span>
<div >
desc3
<a href="mailto:[email protected]">[email protected]</a>
<span >Posted on: 01/19/2022</span>
</div>
</div>
<div >
<h3>TItle4</h3>
<span >Org4</span>
<span >Loc4</span>
<div >
desc4
<a href="mailto:[email protected]">[email protected]</a>
<a href="https://www.domain4.org/" target="_blank">https://www.domain4.org/</a>
<a href="https://www.domain4-1.org/" target="_blank">https://www.domain4-1.org/</a>
<span >Posted on: 01/06/2022</span>
</div>
</div>
</div>
Specifically, I need the result to be the following:
https://www.domain1.org/
https://www.domain2.org/
https://www.domain3.org/
https://www.domain4.org/
These should be the first a/@href under each div[@class='job-listing'], but I'm not sure how to express that. Some things to note:
- The <a> is always two nodes under the root (job-listing).
- The first <a> isn't always correct (I'm only looking for http), but I can filter those out easily enough; I'm caught up on how to select the node, not on filtering for the content or anything like that.
- I need the value of a/@href, not the contents of <a>.
Thanks!
CodePudding user response:
//div[@class='job-listing']/descendant::a[1] gives you the first a descendant of each of those divs. If you want to add the http check, use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1].
If you need the href attribute node, use //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href. Note that the default serialization in some XSLT or XQuery processors doesn't allow you to serialize a sequence of standalone attribute nodes, but in XPath 2 or 3 you can use e.g. //div[@class='job-listing']/descendant::a[starts-with(@href, 'http')][1]/@href/string() to get a sequence of attribute values instead.
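For anyone who wants to check the selection logic outside an XPath engine, here is a minimal Python sketch. It is not the XPath above run as-is: the standard library's xml.etree.ElementTree only supports a small XPath subset (no descendant:: axis and no starts-with()), so the http filter and the "first match per listing" step are done in Python instead. The sample markup is trimmed, and I'm assuming class="job-listing" sits on each per-job div, since the class attributes are missing from the markup as posted. The mailto address and the function name first_http_hrefs are placeholders of mine.

```python
import xml.etree.ElementTree as ET

# Trimmed sample. Assumption: each per-job <div> carries class="job-listing".
# The mailto address is a placeholder, not the one from the question.
SAMPLE = """
<div>
  <div class="job-listing">
    <h3>Title1</h3>
    <span><a href="https://www.domain1.org/">Org1</a></span>
    <div>desc1 <a href="https://www.domain1-1.org/">https://www.domain1-1.org/</a></div>
  </div>
  <div class="job-listing">
    <h3>Title4</h3>
    <span>Org4</span>
    <div>desc4
      <a href="mailto:someone@example.org">someone@example.org</a>
      <a href="https://www.domain4.org/">https://www.domain4.org/</a>
      <a href="https://www.domain4-1.org/">https://www.domain4-1.org/</a>
    </div>
  </div>
</div>
"""


def first_http_hrefs(markup):
    """Return, per job-listing div, the href of its first http(s) link."""
    root = ET.fromstring(markup)
    hrefs = []
    # ElementTree's XPath subset handles the attribute-equality predicate.
    for listing in root.iterfind(".//div[@class='job-listing']"):
        # iter() walks descendants in document order, like descendant::a.
        for a in listing.iter("a"):
            href = a.get("href", "")
            if href.startswith("http"):
                hrefs.append(href)
                break  # stop at the first match, like the trailing [1]
    return hrefs


print(first_http_hrefs(SAMPLE))
```

The inner loop with break plays the role of descendant::a[starts-with(@href, 'http')][1]: filter first, then keep only the first survivor within each listing. With a library that implements full XPath 1.0 (lxml is one such third-party option), the expression ending in /@href should be usable directly, but the /string() variant needs an XPath 2 or 3 processor.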
