<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>
I'm scraping an online dictionary, and the main dictionary part has roughly the above structure.
How do I get all of those children? With the usual Selenium approach I see online, list_elem.find_elements(By.XPATH, ".//*"), I only get the element children, not the text nodes (sorry if my word choice is off). That is, I would like len(children) == 6 instead of len(children) == 4.
I would like to get all children for further analysis.
CodePudding user response:
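In XPath, elements (*), comments (comment()), text nodes (text()), and processing instructions (processing-instruction()) are all nodes.
To select all descendant nodes (use ./node() instead if you only want direct children):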
.//node()
To restrict the selection to elements and text nodes, add a predicate:
.//node()[self::* or self::text()]
CodePudding user response:
I'm not a Selenium expert, but I've read StackOverflow answers in which apparently knowledgeable people assert that Selenium's XPath queries must return elements (so text nodes are not supported as a query result type), and I'm pretty sure that's correct.
So a query like //* (return every element in the document) will work fine in Selenium, but //text() (return every text node in the document) won't: although it's a valid XPath query, it returns text nodes rather than elements.
I suggest you consider using a different XPath API to execute your XPath queries, e.g. lxml, which doesn't have that limitation.
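As a rough illustration of the lxml route, here is a minimal sketch that parses the snippet from the question and collects both element and text children. It assumes lxml is installed and that the markup is available as a string (with Selenium you could take it from list_elem.get_attribute("outerHTML")); the names snippet, li and children are just placeholders for this example. Whitespace-only text nodes between the tags are dropped in Python, since the raw markup contains more text nodes than the two translations.

```python
from lxml import html

# The <li> from the question, as a string. With Selenium this could come from
# list_elem.get_attribute("outerHTML") (assumption: list_elem is the <li> element).
snippet = """<li>
<b>word</b>
<i>type</i>
<b>1.</b>
"translation 1"
<b>2.</b>
"translation 2"
</li>"""

# Parse the fragment; the result is the <li> element itself.
li = html.fragment_fromstring(snippet)

children = []
# ./node() selects direct children only; the predicate keeps elements and text nodes.
for node in li.xpath("./node()[self::* or self::text()]"):
    if isinstance(node, str):
        # Text nodes come back as string objects; skip whitespace-only ones.
        text = node.strip()
        if text:
            children.append(text)
    else:
        children.append(node)

for child in children:
    # Elements have a .tag attribute; text children are plain strings here.
    if isinstance(child, str):
        print("text:", child)
    else:
        print("element:", child.tag, child.text)
```

With the snippet above this should yield six children: four elements (b, i, b, b) and the two translation strings, in document order.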
