I have a project that fetches XML files from URLs, scrapes them, extracts the data, and then processes it. The URLs are built from user input, so I need to check whether a given URL actually points to an XML file before scraping it. How can I check whether a URL contains an XML file or not?
CodePudding user response:
Ways to know whether a GET request to a URL will retrieve XML:
Before retrieving the file
- Have an out-of-band guarantee.
- Inspect the Content-Type HTTP header of the response to a HEAD request [1].
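The HEAD-request check could be sketched roughly as follows, assuming Java 11+ and its built-in java.net.http client. The class and method names are hypothetical, and the timeouts and redirect policy are arbitrary choices rather than anything the answer prescribes.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

// Hypothetical helper; names are illustrative only.
final class HeadContentTypeCheck {

    // Builds a HEAD request, so only headers (no body) are transferred.
    static HttpRequest headRequest(URI uri) {
        return HttpRequest.newBuilder(uri)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .timeout(Duration.ofSeconds(10)) // assumed timeout
                .build();
    }

    // Returns the raw Content-Type header of a 2xx response,
    // or empty if the status is not 2xx or the header is missing.
    static Optional<String> fetchContentType(URI uri)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(10)) // assumed timeout
                .build();
        HttpResponse<Void> response =
                client.send(headRequest(uri), HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() / 100 != 2) {
            return Optional.empty();
        }
        return response.headers().firstValue("Content-Type");
    }
}
```

Some servers reject HEAD or report a different Content-Type than they do for GET, so an empty or unexpected result here is a hint rather than proof that the resource is not XML.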
After retrieving the file
- Inspect the Content-Type HTTP header of the response [1].
- Sniff the root element.
- Files.probeContentType(path)
- Parse via a conforming XML parser without getting any well-formedness errors.
Note: Only parsing via a conforming XML parser is guaranteed to provide 100% determination.
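As a minimal sketch of the parser-based check, the JDK's built-in SAX parser can be run over the retrieved bytes, treating any SAXException as "not well-formed". The class name is hypothetical, and the entity-related features are an added precaution for untrusted input (an assumption on my part, not something the answer requires).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper; names are illustrative only.
final class WellFormednessCheck {

    // Returns true if the stream parses without a fatal (well-formedness) error.
    static boolean isWellFormedXml(InputStream in) throws IOException {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            // Precautions for untrusted input: do not fetch external entities or DTDs.
            factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
            factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
            factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("XML parser could not be configured", e);
        }
        try {
            // DefaultHandler ignores all content but rethrows fatal errors.
            parser.parse(in, new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        }
    }

    // Convenience overload for text already in memory (encoded as UTF-8 here).
    static boolean isWellFormedXml(String text) {
        try {
            return isWellFormedXml(
                    new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, isWellFormedXml("<a><b/></a>") should be accepted, while "<a><b></a>" or plain text should be rejected. Note that this only establishes well-formedness, not that the document matches the schema your scraper expects.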
[1] MIME assignments for XML data:
- application/xml (RFC 7303, previously RFC 3023)
- text/xml (RFC 7303, previously RFC 3023)
- Other MIME assignments used with XML applications.
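A Content-Type value could be matched against these assignments along the following lines. This is a sketch with a hypothetical class name. Treating any media type ending in "+xml" (such as application/rss+xml or image/svg+xml) as XML is my reading of "other MIME assignments", based on the structured-suffix convention in RFC 7303.

```java
import java.util.Locale;

// Hypothetical helper; names are illustrative only.
final class XmlMediaTypes {

    // Returns true if a Content-Type header value names an XML media type.
    static boolean looksLikeXml(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" and normalize case.
        int semicolon = contentTypeHeader.indexOf(';');
        String mediaType = (semicolon >= 0
                ? contentTypeHeader.substring(0, semicolon)
                : contentTypeHeader).trim().toLowerCase(Locale.ROOT);
        return mediaType.equals("application/xml")
                || mediaType.equals("text/xml")
                || mediaType.endsWith("+xml"); // assumed: "+xml" suffix types count as XML
    }
}
```

Servers are often misconfigured (XML served as text/plain or application/octet-stream, for instance), so a negative result here does not rule out XML content.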
