java - Extract text between two (different) HTML tags using jsoup - TagMerge

Extract text between two (different) HTML tags using jsoup

Asked 1 year ago
5 answers

Use the Element.nextSibling() method. In the example code below, the text that follows each span is collected into a List<String>:

String html = "<td>\n"
            + "    <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
            + "    <span class=\"detailh2\">Total: </span> 31 704                         \n"
            + "    <span class=\"detailh2\">Last: </span> 30.12.2021                      \n"
            + "</td>";

List<String> valuesList = new ArrayList<>();

Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
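for (Element a : elements) {
    Node node = a.nextSibling();
    // nextSibling() returns null when the span is the last node in its parent.
    if (node != null) {
        valuesList.add(node.toString().trim());
    }
}

// Display valuesList in the Console window: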
for (String value : valuesList) {
    System.out.println(value);
}

It will display the following into the Console Window:

2 145
31 704
30.12.2021

If you prefer to just get the value for Total: then you can try this:

String html = "<td>\n"
            + "    <span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 \n"
            + "    <span class=\"detailh2\">Total: </span> 31 704                         \n"
            + "    <span class=\"detailh2\">Last: </span> 30.12.2021                      \n"
            + "</td>";
String totalValue = "N/A";
Document doc = Jsoup.parse(html);
Elements elements = doc.select("span");
for (Element a : elements) {
    // Match on the span's own text; the value is the node right after it.
    if (a.text().contains("Total:")) {
        Node node = a.nextSibling();
        if (node != null) {
            totalValue = "Total: --> " + node.toString().trim();
        }
        break;
    }
}

// Display the value in the Console window:
System.out.println(totalValue);

The above code will display the following within the Console Window:

 Total: --> 31 704
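If you need every label together with its value, the same nextSibling() idea can fill a map instead of a list. The following is only a sketch: the class name LabelValues and the helper extract are made up for illustration, and it assumes each value is a plain text node sitting directly after its span.detailh2 label, as in the HTML above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class LabelValues {

    // Hypothetical helper: maps each span label to the text node that follows it.
    static Map<String, String> extract(String html) {
        Map<String, String> values = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        for (Element span : doc.select("span.detailh2")) {
            Node next = span.nextSibling();
            // Skip spans that are not followed by plain text.
            if (next instanceof TextNode) {
                String label = span.text().replace(":", "").trim();
                values.put(label, ((TextNode) next).text().trim());
            }
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<td>"
                + "<span class=\"detailh2\" style=\"margin:0px\">This month: </span>2 145 "
                + "<span class=\"detailh2\">Total: </span> 31 704 "
                + "<span class=\"detailh2\">Last: </span> 30.12.2021 "
                + "</td>";
        extract(html).forEach((label, value) -> System.out.println(label + " = " + value));
    }
}
```

With a map you can then look up a single value, for example extract(html).get("Total"), without looping a second time.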

Source: link


I want to get the names of all the links between these two h2 tags:
<h2><span class="mw-headline" id="People">People</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=1" title="Edit section: People">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
<ul>
<li><a href="/wiki/George_H._W._Bush" title="George H. W. Bush">George H. W. Bush</a> (born 1924), the 41st president of the United States of America</li>
<li><a href="/wiki/George_W._Bush" title="George W. Bush">George W. Bush</a> (born 1946), the 43rd president of the United States of America</li>
<li><a href="/wiki/Jeb_Bush" title="Jeb Bush">Jeb Bush</a> (born 1953), the former governor of Florida and also a member of the Bush family</li>
<li><a href="/wiki/Bush_family" title="Bush family">Bush family</a>, the political family that includes both presidents</li>
<li><a href="/wiki/Bush_(surname)" title="Bush (surname)">Bush (surname)</a>, a surname (including a list of people with the name)    </li>
</ul>
<h2><span class="mw-headline" id="Places.2C_United_States">Places, United States</span><span class="mw-editsection"><span class="mw-editsection-bracket">[</span><a href="/w/index.php?title=Bush&action=edit&section=2" title="Edit section: Places, United States">edit</a><span class="mw-editsection-bracket">]</span></span></h2>
Neither this attempt worked:
Elements h2next = docx.select("span.mw-headline#People");
    do 
    {
     ul = h2next.select("ul").first();
     System.out.println(ul.text());
    } 
    while (h2next!=null && ul==null);
nor this one:
//String content = docx.getElementById("People").outerHtml();
I also tried this:
Document docx = Jsoup.connect("https://en.wikipedia.org/wiki/Bush").get();
Element contentDiv = docx.select("span#mw-headlinePeople").first();
String printMe = contentDiv.toString(); // The result
I tried that because I noticed that the data I want lives in the section that starts with:
<h2><span class="mw-headline" id="People">

Source: link


For example:
String html = "<p>An <a href='http://example.com/'>example</a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text();       // "An example link."
String linkHref = link.attr("href");   // "http://example.com/"
String linkText = link.text();         // "example"
String linkOuterH = link.outerHtml();  // "<a href="http://example.com/">example</a>"
String linkInnerH = link.html();       // "example"

Source: link


If you select both the headlines and the links with one selector by combining them with a comma (,), the results come back in the order they appear on the page. You can therefore keep track of whether you are in the "People" section while looping through the results, like this:
Elements elements = docx.select("span.mw-headline, li > a");

boolean inPeopleSection = false;
for (Element elem : elements) {
    if (elem.className().equals("mw-headline")) {
        // It's a headline
        inPeopleSection = elem.id().equals("People");
    } else {
        // It's a link
        if (inPeopleSection) {
            System.out.println(elem.text());
        }
    }
}
Output:
George H. W. Bush
George W. Bush
Jeb Bush
Bush family
Bush (surname)
A simpler alternative is the selector h2:contains(people) + ul a, for example:
Elements els = doc.select("h2:contains(people) + ul a");
Which gives these elements:
0 <a href="/wiki/George_H._W._Bush" title="George H. W. Bush">
George H. W. Bush
1 <a href="/wiki/George_W._Bush" title="George W. Bush">
George W. Bush
2 <a href="/wiki/Jeb_Bush" title="Jeb Bush">
Jeb Bush
3 <a href="/wiki/Bush_family" title="Bush family">
Bush family
4 <a href="/wiki/Bush_(surname)" title="Bush (surname)">
Bush (surname)
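Note that the + combinator only matches a ul that comes immediately after the h2. If a section can hold several elements between the two headings, another option is to walk the siblings of the first h2 until the next h2 is reached. This is a rough sketch, not a tested Wikipedia scraper: SectionLinks and linksInSection are hypothetical names, the sample HTML is a trimmed-down stand-in for the page, and it assumes the mw-headline span is a direct child of the h2 and that the h2 and the lists are siblings, as in the HTML quoted in the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SectionLinks {

    // Hypothetical helper: link texts between the h2 containing the element
    // with the given id and the next h2 on the same level.
    static List<String> linksInSection(Document doc, String headlineId) {
        List<String> names = new ArrayList<>();
        Element headline = doc.getElementById(headlineId);
        if (headline == null || headline.parent() == null) {
            return names; // no such section
        }
        Element heading = headline.parent(); // assumed to be the enclosing <h2>
        for (Element sib = heading.nextElementSibling();
             sib != null && !sib.tagName().equals("h2");
             sib = sib.nextElementSibling()) {
            for (Element a : sib.select("a")) {
                names.add(a.text());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Trimmed-down sample; the ids and links are illustrative only.
        String html = "<h2><span class=\"mw-headline\" id=\"People\">People</span></h2>"
                + "<ul><li><a href=\"/wiki/Jeb_Bush\">Jeb Bush</a> (born 1953)</li>"
                + "<li><a href=\"/wiki/Bush_family\">Bush family</a></li></ul>"
                + "<p>See also <a href=\"/wiki/Bush_(surname)\">Bush (surname)</a>.</p>"
                + "<h2><span class=\"mw-headline\" id=\"Places\">Places</span></h2>"
                + "<ul><li><a href=\"/wiki/Example\">Not in the People section</a></li></ul>";
        Document doc = Jsoup.parse(html);
        for (String name : linksInSection(doc, "People")) {
            System.out.println(name);
        }
    }
}
```

The loop stops at the first following h2, so links from later sections are never collected.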

Source: link


I have the following HTML...
<h3 class="number">
<span class="navigation">
6:55 <a href="/results/result.html" class="under">»</a>
</span>**This is the text I need to parse!**</h3>
I can use the following code to select the h3 tag and then read its text with text().
Element h3 = doc.select("h3").get(0);
Unfortunately, that gives me everything in that tag.
6:55 » This is the text I need to parse!
Try this:
Element h3 = doc.select("h3").get(0);
String h3Text = h3.text();
String spanText = h3.select("span").get(0).text();
// Remove the span's text from the full text; trim() drops the leftover leading space.
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "").trim();
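The string replacement can go wrong if the span's text also happens to occur in the part you want to keep. jsoup's Element.ownText() may be a more direct fit here, since it returns only the text that belongs to the element itself and skips the text of child elements. A minimal sketch, with OwnTextExample and textOutsideChildren as made-up names and the bold markers left out of the sample HTML:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class OwnTextExample {

    // Hypothetical helper: text owned directly by the first element matching
    // cssQuery, ignoring the text of its child elements.
    static String textOutsideChildren(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element el = doc.select(cssQuery).first();
        return el == null ? "" : el.ownText().trim();
    }

    public static void main(String[] args) {
        String html = "<h3 class=\"number\">"
                + "<span class=\"navigation\">6:55 "
                + "<a href=\"/results/result.html\" class=\"under\">&raquo;</a></span>"
                + "This is the text I need to parse!</h3>";
        System.out.println(textOutsideChildren(html, "h3"));
    }
}
```

Because the span and its link are child elements, their text (the time and the arrow) is not part of the result.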

Source: link
