Jsoup to fetch data from HTML between two <br> tags

I am working on a personal project and want to parse this HTML and retrieve information from it.

Basically, I want to get all the information that is given between the
<br> tags. For this I am using Jsoup in Java.

<html>
  </head>
<table  border="0" cellspacing="0" cellpadding="0" style="">
                            <tbody>
                              <tr style="">
                                <td>
                                  <p >
                                    <span style="">
                                      <span style="font-size:9.0pt; font-family:"Arial",sans-serif">
                                        <br>
                                        <br>Information: 
                                        <br>
                                        <br>Legal Business Name
                                        <br>Asfdsf
                                        <br>
                                        <br>Phone
                                        <br>(718) 43543
                                        <br>
                                        <br>Principle Name 1
                                        <br>afdsgsfgsg df
                                        <br>
                                        <br>Bus Street Address
                                        <br>sdfdsf
                                        <br>
                                        <br>Bus City
                                        <br>sdfdsf
                                        <br>
                                        <br>Bus State
                                        <br>ny
                                        <br>
                                        <br>Bus Zip Code
                                        <br>4324324
                                        <br>
                                        <br>Email Address
                                        <br>[email protected]
                                        <br>
                                        <br>Tertiary Email Address
                                        <br>--- No answer ---
                                        <br>
                                        <br>Business Website Address
                                        <br>dsfdsf.com
                                        <br>
                                        <br>DBA info same as Business
                                        <br>
                                        <br>DBA information is same as Business.
                                        <br>
                                        <br>DBA Name
                                        <br>Awqeewd gdfg
                                        <br>
                                        <br>DBA Street Address
                                        <br>dsfdsf 3432 fdgdf
                                        <br>
                                        <br>DBA City
                                        <br>NORTH
                                        <br>
                                        <br>Attachments:
                                      </span>
                                    </span>
                                  </p>
        <p >
          <span style=""> 
          </span>
        </p>
      </div>
      </body>
    </html>

I am using this code to fetch the data, but it returns all the values as a single paragraph.

Document doc = Jsoup.parse(htmlString);
List<String> valueList = new ArrayList<>();
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement.text();
    // store in value list
}

I also tried

doc.getElementsByTag("br");

but this gives an empty value.

Can someone please help me to get this data in a better way?

CodePudding user response：

it must be getElementsByTagName . T.T

CodePudding user response：

You can use this solution:


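// Turn off pretty-printing so Jsoup does not re-indent the output.
Document.OutputSettings outputSettings = new Document.OutputSettings();
outputSettings.prettyPrint(false);
doc.outputSettings(outputSettings);
// Put a literal "\n" marker (backslash + n) in front of every <br> and <p>.
doc.select("br").before("\\n");
doc.select("p").before("\\n");
// Turn the markers into real line breaks, then strip all remaining tags.
String str = doc.html().replaceAll("\\\\n", "\n");
String strWithNewLines = Jsoup.clean(str, "", Safelist.none(), outputSettings);
System.out.println(strWithNewLines);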

CodePudding user response：

I suppose you can try this:

If the HTML String was this:

String html = "<html>\n"
            + "  </head>\n"
            + "<table class=\"MsoNormalTable\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\" style=\"\">\n"
            + "                            <tbody>\n"
            + "                              <tr style=\"\">\n"
            + "                                <td>\n"
            + "                                  <p class=\"MsoNormal\">\n"
            + "                                    <span style=\"\">\n"
            + "                                      <span style=\"font-size:9.0pt; font-family:\"Arial\",sans-serif\">\n"
            + "                                        <br>\n"
            + "                                        <br>Information: \n"
            + "                                        <br>\n"
            + "                                        <br>Legal Business Name\n"
            + "                                        <br>Asfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Phone\n"
            + "                                        <br>(718) 43543\n"
            + "                                        <br>\n"
            + "                                        <br>Principle Name 1\n"
            + "                                        <br>afdsgsfgsg df\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Street Address\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus City\n"
            + "                                        <br>sdfdsf\n"
            + "                                        <br>\n"
            + "                                        <br>Bus State\n"
            + "                                        <br>ny\n"
            + "                                        <br>\n"
            + "                                        <br>Bus Zip Code\n"
            + "                                        <br>4324324\n"
            + "                                        <br>\n"
            + "                                        <br>Email Address\n"
            + "                                        <br>[email protected]\n"
            + "                                        <br>\n"
            + "                                        <br>Tertiary Email Address\n"
            + "                                        <br>--- No answer ---\n"
            + "                                        <br>\n"
            + "                                        <br>Business Website Address\n"
            + "                                        <br>dsfdsf.com\n"
            + "                                        <br>\n"
            + "                                        <br>DBA info same as Business\n"
            + "                                        <br>\n"
            + "                                        <br>DBA information is same as Business.\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Name\n"
            + "                                        <br>Awqeewd gdfg\n"
            + "                                        <br>\n"
            + "                                        <br>DBA Street Address\n"
            + "                                        <br>dsfdsf 3432 fdgdf\n"
            + "                                        <br>\n"
            + "                                        <br>DBA City\n"
            + "                                        <br>NORTH\n"
            + "                                        <br>\n"
            + "                                        <br>Attachments:\n"
            + "                                      </span>\n"
            + "                                    </span>\n"
            + "                                  </p>\n"
            + "        <p class=\"MsoNormal\">\n"
            + "          <span style=\"\"> \n"
            + "          </span>\n"
            + "        </p>\n"
            + "      </div>\n"
            + "      </body>\n"
            + "    </html>";

And you run this string through the method provided below:

String[] values = getTextAfterHtmlStartEndTags(html, "br");

// Display the discovered values...
for (String str : values) {
    System.out.println(str);
}

The console window will display:

Information:

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
[email protected]

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:

The getTextAfterHtmlStartEndTags() method:

/**
 *
 * To be used with the JSoup API<br><br>
 * <b>Example Usage:</b><br><pre>
 *
 * <b>Required Imports:</b>
 *
 *  import org.jsoup.Jsoup;
 *  import org.jsoup.nodes.Document;
 *  import org.jsoup.nodes.Element;
 *  import org.jsoup.nodes.Node;
 *  import org.jsoup.select.Elements;
 *
 * <b>Example Code:</b>
 *
 * {@code    String html = "<td>\n"
 *           + "    <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
 *           + "    <span class=\"detailh2\">Total: </span> 31 704                         \n"
 *           + "    <span class=\"detailh2\">Last: </span> 30.12.2021                      \n"
 *           + "</td>";
 *
 *     String[] values = getTextAfterHtmlStartEndTags(html, "span");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      2 145
 *      31 704
 *      30.12.2021</pre><br>
 * <p>
 * If you want the data from a specific HTML tag element then you can supply
 * one or more text elements within those HTML tags in the optional
 * 'specificTo' parameter as a string array or as args, for example:
 * <pre>
 *
 *  {@code   String[] values = getTextAfterHtmlStartEndTags(html, "span", "This month:", "Total:");
 *     for (String str : values) {
 *         System.out.println(str);
 *     }}</pre><br>
 * <p>
 * The console window will display:
 * <pre>
 *
 *      This month: --> 2 145
 *      Total: --> 31 704</pre>
 *
 * @param htmlString         (String) The HTML string to parse.<br>
 *
 * @param htmlStartTagString (String) The HTML start tag to get data
 *                           from.<br>
 *
 * @param specificTo         (String - args) The desired data from multiple
 *                           HTML tags of the same type (see the above
 *                           example code).<br>
 *
 * @return (String[] Array) A single Dimensional String Array containing the
 *         desired data (if properly parsed and found).
 */
public static String[] getTextAfterHtmlStartEndTags(String htmlString,
        String htmlStartTagString, String... specificTo) {
    String html = htmlString;
    List<String> list = new ArrayList<>();
    String value = "N/A";
    Document doc = Jsoup.parse(html);
    Elements elements = doc.select(htmlStartTagString);
    for (Element a : elements) {
        if (specificTo.length > 0) {
            for (int i = 0; i < specificTo.length; i++) {
                if (a.before("</" + htmlStartTagString + ">").text().contains(specificTo[i])) {
                    Node node = a.nextSibling();
                    value = specificTo[i] + " --> " + node.toString().trim();
                    list.add(value);
                }
            }
        }
        else {
            Node node = a.nextSibling();
            value = node.toString().trim();
            list.add(value);
        }
    }
    return list.toArray(new String[list.size()]);
}

CodePudding user response：

You can use the Element.wholeText() method to preserve line separators.

Unfortunately, it also seems to preserve the indentation, so you would need to remove the leading spaces or tabs from each line.
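
The per-line cleanup can be tried on its own, without Jsoup. The following is only a minimal sketch in plain Java: the class name IndentStripDemo, the method stripIndentation and the sample string are made up for illustration, and the sample merely imitates what wholeText() might return for indented HTML.

```java
import java.util.regex.Pattern;

class IndentStripDemo {

    // (?m) turns on multiline mode, so ^ matches at the start of every line
    // instead of only at the start of the whole input.
    private static final Pattern LEADING_BLANKS = Pattern.compile("(?m)^[ \\t]+");

    // Hypothetical helper: trims the whole text, then removes the leading
    // spaces and tabs of each line. Line breaks themselves are kept.
    static String stripIndentation(String text) {
        return LEADING_BLANKS.matcher(text.trim()).replaceAll("");
    }

    public static void main(String[] args) {
        // Made-up input that imitates indented text with line separators.
        String raw = "\n      Information: \n      \n      Phone\n      (718) 43543\n";
        System.out.println(stripIndentation(raw));
    }
}
```

Note that only leading blanks are removed, so trailing spaces (as after "Information:") stay in place.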

Demo:

String htmlString = "..."; // <--- replace with your HTML

Document doc = Jsoup.parse(htmlString);
Elements keyElements = doc.getElementsByTag("td");
for (Element keyElement : keyElements) {
    String value = keyElement
            .wholeText()
            .trim()
            .replaceAll("(?m)^[ \t]+", ""); // remove leading spaces and tabs from each line
    System.out.println(value);
    System.out.println("---");
}

Output (based on HTML from question):

Information: 

Legal Business Name
Asfdsf

Phone
(718) 43543

Principle Name 1
afdsgsfgsg df

Bus Street Address
sdfdsf

Bus City
sdfdsf

Bus State
ny

Bus Zip Code
4324324

Email Address
[email protected]

Tertiary Email Address
--- No answer ---

Business Website Address
dsfdsf.com

DBA info same as Business

DBA information is same as Business.

DBA Name
Awqeewd gdfg

DBA Street Address
dsfdsf 3432 fdgdf

DBA City
NORTH

Attachments:
---
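
If the goal is to end up with label/value pairs rather than a block of text, the blank lines in output like the above could serve as separators. Here is a rough sketch in plain Java, assuming every block starts with a label line that is optionally followed by one or more value lines. The class BrBlockParser and the method toFields are hypothetical names, not part of Jsoup.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BrBlockParser {

    // Splits the text into blocks separated by blank lines. The first line of
    // a block becomes the key, the remaining lines (joined by a space) the value.
    static Map<String, String> toFields(String text) {
        Map<String, String> fields = new LinkedHashMap<>(); // keeps document order
        List<String> block = new ArrayList<>();
        for (String rawLine : text.split("\\R")) {
            String line = rawLine.trim();
            if (line.isEmpty()) {
                flush(block, fields);   // a blank line closes the current block
            } else {
                block.add(line);
            }
        }
        flush(block, fields);           // close the last block
        return fields;
    }

    private static void flush(List<String> block, Map<String, String> fields) {
        if (block.isEmpty()) {
            return;
        }
        String label = block.get(0);
        String value = String.join(" ", block.subList(1, block.size()));
        fields.put(label, value);
        block.clear();
    }

    public static void main(String[] args) {
        // Shortened sample in the same shape as the output above.
        String text = "Information: \n\nLegal Business Name\nAsfdsf\n\n"
                + "Phone\n(718) 43543\n\nAttachments:";
        toFields(text).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

A block with a single line, such as "Attachments:", gets an empty value. In the question's data, "DBA info same as Business" and "DBA information is same as Business." are separated by a blank line, so this sketch would treat them as two labels with empty values. Handling that case would need extra rules that depend on the real data.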