I am working on a personal project and want to parse this HTML and retrieve information from it.
Basically, I want to get all the information that is given inside the
tags. For this I am using jsoup in Java.
<html>
</head>
<table border="0" cellspacing="0" cellpadding="0" style="">
<tbody>
<tr style="">
<td>
<p >
<span style="">
<span style="font-size:9.0pt; font-family:"Arial",sans-serif">
<br>
<br>Information:
<br>
<br>Legal Business Name
<br>Asfdsf
<br>
<br>Phone
<br>(718) 43543
<br>
<br>Principle Name 1
<br>afdsgsfgsg df
<br>
<br>Bus Street Address
<br>sdfdsf
<br>
<br>Bus City
<br>sdfdsf
<br>
<br>Bus State
<br>ny
<br>
<br>Bus Zip Code
<br>4324324
<br>
<br>Email Address
<br>[email protected]
<br>
<br>Tertiary Email Address
<br>--- No answer ---
<br>
<br>Business Website Address
<br>dsfdsf.com
<br>
<br>DBA info same as Business
<br>
<br>DBA information is same as Business.
<br>
<br>DBA Name
<br>Awqeewd gdfg
<br>
<br>DBA Street Address
<br>dsfdsf 3432 fdgdf
<br>
<br>DBA City
<br>NORTH
<br>
<br>Attachments:
</span>
</span>
</p>
<p >
<span style="">
</span>
</p>
</div>
</body>
</html>
I am using this code to fetch the values, but it returns all of them as a single paragraph.
Document doc = Jsoup.parse(htmlString);
List<String> valueList = new ArrayList<>();
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
String value = keyElement.text();
// store in value list
}
I also tried
doc.getElementsByTag("br");
but this returns empty values.
Can someone please help me to get this data in a better way?
CodePudding user response:
it must be getElementsByTagName . T.T
CodePudding user response:
You can use this solution:
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);
doc.select("br").before("\\n");
doc.select("p").before("\\n");
String str = doc.html().replaceAll("\\\\n", "\n");
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
System.out.println(strWithNewLines);
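Once the text has one value per line, as in strWithNewLines above, it can be split on line breaks to get the list of values the question asks for. The following is only a minimal sketch using java.util; LineSplitter and nonBlankLines are hypothetical names, and the sample string is a stand-in for the real jsoup output.

```java
import java.util.ArrayList;
import java.util.List;

class LineSplitter {

    // Splits text on line breaks, trims each line, and skips lines that are empty.
    static List<String> nonBlankLines(String text) {
        List<String> lines = new ArrayList<>();
        for (String line : text.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                lines.add(trimmed);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Stand-in for strWithNewLines; the real value would come from jsoup.
        String sample = "\n\nInformation:\n\nLegal Business Name\nAsfdsf\n\nPhone\n(718) 43543\n";
        for (String value : nonBlankLines(sample)) {
            System.out.println(value);
        }
    }
}
```

Dropping the blank lines also drops the visual grouping of the fields, so this fits best when only the flat list of values is needed.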
CodePudding user response:
I suppose you can try this:
If the HTML String was this:
String html = "<html>\n"
+ " </head>\n"
+ "<table class=\"MsoNormalTable\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"\">\n"
+ " <tbody>\n"
+ " <tr style=\"\">\n"
+ " <td>\n"
+ " <p class=\"MsoNormal\">\n"
+ " <span style=\"\">\n"
+ " <span style=\"font-size:9.0pt; font-family:\"Arial\",sans-serif\">\n"
+ " <br>\n"
+ " <br>Information: \n"
+ " <br>\n"
+ " <br>Legal Business Name\n"
+ " <br>Asfdsf\n"
+ " <br>\n"
+ " <br>Phone\n"
+ " <br>(718) 43543\n"
+ " <br>\n"
+ " <br>Principle Name 1\n"
+ " <br>afdsgsfgsg df\n"
+ " <br>\n"
+ " <br>Bus Street Address\n"
+ " <br>sdfdsf\n"
+ " <br>\n"
+ " <br>Bus City\n"
+ " <br>sdfdsf\n"
+ " <br>\n"
+ " <br>Bus State\n"
+ " <br>ny\n"
+ " <br>\n"
+ " <br>Bus Zip Code\n"
+ " <br>4324324\n"
+ " <br>\n"
+ " <br>Email Address\n"
+ " <br>[email protected]\n"
+ " <br>\n"
+ " <br>Tertiary Email Address\n"
+ " <br>--- No answer ---\n"
+ " <br>\n"
+ " <br>Business Website Address\n"
+ " <br>dsfdsf.com\n"
+ " <br>\n"
+ " <br>DBA info same as Business\n"
+ " <br>\n"
+ " <br>DBA information is same as Business.\n"
+ " <br>\n"
+ " <br>DBA Name\n"
+ " <br>Awqeewd gdfg\n"
+ " <br>\n"
+ " <br>DBA Street Address\n"
+ " <br>dsfdsf 3432 fdgdf\n"
+ " <br>\n"
+ " <br>DBA City\n"
+ " <br>NORTH\n"
+ " <br>\n"
+ " <br>Attachments:\n"
+ " </span>\n"
+ " </span>\n"
+ " </p>\n"
+ " <p class=\"MsoNormal\">\n"
+ " <span style=\"\"> \n"
+ " </span>\n"
+ " </p>\n"
+ " </div>\n"
+ " </body>\n"
+ " </html>";
And you pass this string to the method provided below:
String[] values = getTextAfterHtmlStartEndTags(html, "br");
// Display the discovered values...
for (String str : values) {
System.out.println(str);
}
The console window will display:
Information:
Legal Business Name
Asfdsf
Phone
(718) 43543
Principle Name 1
afdsgsfgsg df
Bus Street Address
sdfdsf
Bus City
sdfdsf
Bus State
ny
Bus Zip Code
4324324
Email Address
[email protected]
Tertiary Email Address
--- No answer ---
Business Website Address
dsfdsf.com
DBA info same as Business
DBA information is same as Business.
DBA Name
Awqeewd gdfg
DBA Street Address
dsfdsf 3432 fdgdf
DBA City
NORTH
Attachments:
The getTextAfterHtmlStartEndTags() method:
/**
*
* To be used with the JSoup API<br><br>
* <b>Example Usage:</b><br><pre>
*
* <b>Required Imports:</b>
*
* import org.jsoup.Jsoup;
* import org.jsoup.nodes.Document;
* import org.jsoup.nodes.Element;
* import org.jsoup.nodes.Node;
* import org.jsoup.select.Elements;
*
* <b>Example Code:</b>
*
* {@code String html = "<td>\n"
* + " <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
* + " <span class=\"detailh2\">Total: </span> 31 704 \n"
* + " <span class=\"detailh2\">Last: </span> 30.12.2021 \n"
* + "</td>";
*
* String[] values = getTextAfterHtmlStartEndTags(html, "span");
* for (String str : values) {
* System.out.println(str);
* }}</pre><br>
* <p>
* The console window will display:
* <pre>
*
* 2 145
* 31 704
* 30.12.2021</pre><br>
* <p>
* If you want the data from a specific HTML tag element then you can supply
* one or more text elements within those HTML tags in the optional
* 'specificTo' parameter as a string array or as args, for example:
* <pre>
*
* {@code String[] values = getTextAfterHtmlStartEndTags(html, "span", "This month:", "Total:");
* for (String str : values) {
* System.out.println(str);
* }}</pre><br>
* <p>
* The console window will display:
* <pre>
*
* This month: --> 2 145
* Total: --> 31 704</pre>
*
* @param htmlString (String) The HTML string to parse.<br>
*
* @param htmlStartTagString (String) The HTML start tag to get data
* from.<br>
*
* @param specificTo (String - args) The desired data from multiple
* HTML tags of the same type (see the above
* example code).<br>
*
* @return (String[] Array) A single Dimensional String Array containing the
* desired data (if properly parsed and found).
*/
public static String[] getTextAfterHtmlStartEndTags(String htmlString,
        String htmlStartTagString, String... specificTo) {
    String html = htmlString;
    List<String> list = new ArrayList<>();
    String value = "N/A";
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select(htmlStartTagString);
    for (Element a : elements) {
        // Node that directly follows the tag; null when the tag has no next sibling.
        Node node = a.nextSibling();
        if (node == null) {
            continue;
        }
        if (specificTo.length > 0) {
            for (int i = 0; i < specificTo.length; i++) {
                if (a.before("</" + htmlStartTagString + ">").text().contains(specificTo[i])) {
                    value = specificTo[i] + " --> " + node.toString().trim();
                    list.add(value);
                }
            }
        }
        else {
            value = node.toString().trim();
            list.add(value);
        }
    }
    return list.toArray(new String[list.size()]);
}
CodePudding user response:
You can use the Element.wholeText() method to preserve line separators.
Unfortunately, it also appears to preserve the indentation, so you need to remove the leading spaces or tabs from each line.
Demo:
String htmlString = "..."; // <--- replace with your HTML
Document doc = Jsoup.parse(htmlString);
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
String value = keyElement
.wholeText()
.trim()
.replaceAll("(?m)^[ \t]+",""); //remove leading spaces and tabs from each line
System.out.println(value);
System.out.println("---");
}
Output (based on HTML from question):
Information:
Legal Business Name
Asfdsf
Phone
(718) 43543
Principle Name 1
afdsgsfgsg df
Bus Street Address
sdfdsf
Bus City
sdfdsf
Bus State
ny
Bus Zip Code
4324324
Email Address
[email protected]
Tertiary Email Address
--- No answer ---
Business Website Address
dsfdsf.com
DBA info same as Business
DBA information is same as Business.
DBA Name
Awqeewd gdfg
DBA Street Address
dsfdsf 3432 fdgdf
DBA City
NORTH
Attachments:
---
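If the extracted text keeps a blank line between fields, as the empty <br> lines in the question's HTML suggest, the labels and values can be paired into a map. This is only a sketch under that assumption: BrFieldParser and parseFields are hypothetical names, the first line of each block is treated as the label and the rest as the value, and the sample string is a stand-in for the real wholeText() result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BrFieldParser {

    // Treats each blank-line-separated block as one field:
    // first line = label, remaining lines = value (empty if there are none).
    static Map<String, String> parseFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>();
        String normalized = text.replace("\r\n", "\n").trim();
        for (String block : normalized.split("\n\\s*\n")) {
            String[] parts = block.trim().split("\n", 2);
            String label = parts[0].trim();
            if (label.isEmpty()) {
                continue; // nothing usable in this block
            }
            String value = parts.length > 1 ? parts[1].trim() : "";
            fields.put(label, value);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Stand-in for the cleaned text; assumes blank lines separate the fields.
        String sample = "Information:\n\nLegal Business Name\nAsfdsf\n\nPhone\n(718) 43543\n\nBus State\nny\n";
        parseFields(sample).forEach((label, value) -> System.out.println(label + " = " + value));
    }
}
```

Headings such as "Information:" have no value line, so they end up in the map with an empty value. A LinkedHashMap is used so the fields keep the order in which they appear in the HTML.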
