How to Extract Website Data with Java using Jsoup Library

Sovary, May 8, 2022

If you would like to learn how to extract data from a public webpage and have a basic understanding of Java, CSS and HTML, then this tutorial is for you. You will inspect the HTML structure of your target site with your browser's developer tools. I wrote this article FOR EDUCATIONAL PURPOSES ONLY, as I will work with the IMDb website.

Prerequisites

  • Eclipse IDE, NetBeans IDE, or your favorite IDE for writing Java code. For this tutorial, I will use Eclipse IDE.
  • The Jsoup Java library (download the jar file), which helps fetch URLs and extract data using HTML5 DOM methods and CSS selectors.
  • Chrome browser, to inspect HTML elements.
  • A basic understanding of HTML and CSS.

Warm-up 

Before we move on to the real web scraping section, I will show how to extract data from a simple DOM. For example, take the very simple element structure below.

HTML

<div id="block">
    <h1>My First Heading</h1>
    <p>My first paragraph.</p>
    <p>My second paragraph.</p>
    <img src="http://example.com/example.jpg"/>
</div>

To get the text inside a p tag, we have two options for the CSS selector. We can start from the root id or select the p tag directly.

Java

// We can select p directly; first() keeps only the first of the two p elements
Element p1 = doc.select("p").first();
String tx1 = p1.text();

// This is more specific: start from the root id, then match the p elements inside it
Element p2 = doc.select("#block p").first();
String tx2 = p2.text();

// Print both results
System.out.println(tx1);
System.out.println(tx2);

  • doc is the variable that stores the Document (the HTML DOM); ignore where it comes from for the moment.
  • select() finds everything that matches the CSS selector and returns it as Elements.
  • first() returns the first match as a single Element.
  • text() extracts the text from the element.

The Java code prints My first paragraph. twice, once for each selector, because first() returns only the first match. If you want the text of the second p, replace first() with get(1): select() returns an Elements collection, and we can access its items by index.
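To make the snippet above runnable on its own, here is a minimal sketch that builds doc from the sample markup with Jsoup.parse() and reads both paragraphs by index. The class name WarmUpText and the helper paragraphText are made up for illustration, and the sketch assumes the Jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class WarmUpText {
    // The sample markup from this section, held in a string.
    static final String HTML =
        "<div id=\"block\">"
        + "<h1>My First Heading</h1>"
        + "<p>My first paragraph.</p>"
        + "<p>My second paragraph.</p>"
        + "<img src=\"http://example.com/example.jpg\"/>"
        + "</div>";

    // Hypothetical helper: text of the p element at the given index inside #block.
    static String paragraphText(String html, int index) {
        Document doc = Jsoup.parse(html);              // build the DOM from the string
        Elements paragraphs = doc.select("#block p");  // both p elements, in document order
        return paragraphs.get(index).text();           // Elements is a List, so get(i) works
    }

    public static void main(String[] args) {
        System.out.println(paragraphText(HTML, 0)); // My first paragraph.
        System.out.println(paragraphText(HTML, 1)); // My second paragraph.
    }
}
```

Note that get(i) throws an IndexOutOfBoundsException when fewer than i + 1 elements match, so checking size() first is a sensible habit on real pages.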

As the next step, we will get the image source from the img element. As before, we can start from the root id or select the img tag directly.

Java

// We can select img directly
Element img1 = doc.select("img").first();
String src1 = img1.attr("src");

// This is more specific: start from the root id, then match the img element inside it
Element img2 = doc.select("#block img").first();
String src2 = img2.attr("src");

// Print both results
System.out.println(src1);
System.out.println(src2);

Everything is almost the same, but there is one new method.

attr() extracts the value of an attribute from an element. The img element has a src attribute, so we pass "src" to the method.
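The following sketch shows attr() on the sample markup, including the two edge cases worth knowing: first() returns null when nothing matches, and attr() returns an empty string when the attribute is absent. The class name WarmUpAttr and the helper imageAttr are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class WarmUpAttr {
    // The img part of the sample markup from this section.
    static final String HTML =
        "<div id=\"block\"><img src=\"http://example.com/example.jpg\"/></div>";

    // Hypothetical helper: value of the named attribute on the first img inside #block.
    static String imageAttr(String html, String name) {
        Document doc = Jsoup.parse(html);
        Element img = doc.select("#block img").first();
        if (img == null) {
            return "";            // first() returns null when the selector matches nothing
        }
        return img.attr(name);    // attr() returns "" when the attribute is not present
    }

    public static void main(String[] args) {
        System.out.println(imageAttr(HTML, "src"));           // http://example.com/example.jpg
        System.out.println(imageAttr(HTML, "alt").isEmpty()); // true
    }
}
```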

I hope this section gave you the idea, so let's start scraping a real website.

Implementation

Step 1: Create a new Java project in Eclipse (File -> New -> Java Project -> name the project -> Finish)

Step 2: Create a new class for running the code. Right-click on src -> New -> Class -> name the class -> tick public static void main(String[] args) -> Finish

Step 3: Add the Jsoup library to the Java project. Copy the Jsoup jar file and paste it into the src folder. Then right-click on the jar file -> Build Path -> Add to Build Path


Step 4: We will fetch https://www.imdb.com/chart/top and parse it into a DOM, with a connection timeout of 6000 milliseconds.
Document doc = Jsoup.connect("https://www.imdb.com/chart/top").timeout(6000).get();
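Because get() performs a network request, it can throw an IOException (for example on a timeout or an HTTP error), so the call has to be wrapped or declared. A minimal sketch of Step 4 as a complete program might look like this; the class name FetchPage, the helper prepare, and the user-agent string are assumptions for illustration and not part of the original code.

```java
import java.io.IOException;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FetchPage {
    static final String URL = "https://www.imdb.com/chart/top";

    // Hypothetical helper: configures the request but does not send it yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup tutorial)") // assumption: optional, made-up UA string
                .timeout(6000);                            // timeout in milliseconds (6 seconds)
    }

    public static void main(String[] args) {
        try {
            Document doc = prepare(URL).get();   // sends the request and parses the response
            System.out.println(doc.title());
        } catch (IOException e) {
            // Covers timeouts, HTTP error statuses and other connection failures
            System.err.println("Could not fetch " + URL + ": " + e.getMessage());
        }
    }
}
```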

Step 5: We will get the movie titles by extracting data from the HTML. Before we do that, we have to inspect the elements in the browser (Ctrl+Shift+I in Chrome). Then we test the number of movies in the list with the CSS selector below.

Java

Elements body = doc.select("tbody.lister-list");
System.out.println(body.select("tr").size());

The selector tbody.lister-list finds the table body that has the class lister-list, and we print the number of tr elements inside it. Selecting tr returns a collection of elements, and a collection is a natural fit for a loop.
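The selector can be tried offline before touching the live site. The sketch below runs it against a small stand-in table whose class names follow the selectors used in this tutorial; the markup, the class name CountRows and the helper countRows are made up for illustration, and the live page's markup may differ, so confirm the selectors in the developer tools.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class CountRows {
    // Stand-in markup (assumption): two rows shaped like the selectors in this tutorial.
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"posterColumn\"><img src=\"a.jpg\" alt=\"Movie A\"/></td></tr>"
        + "<tr><td class=\"posterColumn\"><img src=\"b.jpg\" alt=\"Movie B\"/></td></tr>"
        + "</tbody></table>";

    // Hypothetical helper: number of tr rows inside tbody.lister-list.
    static int countRows(String html) {
        Document doc = Jsoup.parse(html);
        Elements body = doc.select("tbody.lister-list");
        return body.select("tr").size();   // 0 when the table body is not found
    }

    public static void main(String[] args) {
        System.out.println(countRows(HTML));                // 2
        System.out.println(countRows("<p>no table</p>"));   // 0
    }
}
```

A count of 0 on the real page is a useful signal that the selector no longer matches and needs to be rechecked.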

Step 6: We will loop over the tr elements and read the src attribute of each image.

for(Element e : body.select("tr"))
{
    String img = e.select("td.posterColumn img").attr("src");
    System.out.println(img);
}

This prints the thumbnail source of every movie in the list. Let's analyze one element in the for loop. The purpose of step 6 is to get the image link, so we start from a tr element, which contains a td element that we can pick out by its class name, posterColumn. Inside that td is an image element, so we add the img tag, which gives the selector td.posterColumn img. Finally, we call attr() to read the value of the src attribute.
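Here is the same loop as a self-contained sketch that collects the links into a list instead of printing them, and skips rows that have no poster image. It runs against made-up stand-in markup; the class name PosterLinks and the helper posterUrls are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class PosterLinks {
    // Stand-in markup (assumption): the third row deliberately has no image.
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"posterColumn\"><img src=\"a.jpg\" alt=\"Movie A\"/></td></tr>"
        + "<tr><td class=\"posterColumn\"><img src=\"b.jpg\" alt=\"Movie B\"/></td></tr>"
        + "<tr><td class=\"posterColumn\"></td></tr>"
        + "</tbody></table>";

    // Hypothetical helper: src of the poster image in each row, in document order.
    static List<String> posterUrls(String html) {
        Document doc = Jsoup.parse(html);
        List<String> urls = new ArrayList<>();
        for (Element row : doc.select("tbody.lister-list tr")) {
            // On an Elements result, attr() reads from the first match, or returns ""
            String src = row.select("td.posterColumn img").attr("src");
            if (!src.isEmpty()) {
                urls.add(src);   // skip rows without a poster image
            }
        }
        return urls;
    }

    public static void main(String[] args) {
        System.out.println(posterUrls(HTML)); // [a.jpg, b.jpg]
    }
}
```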

Step 7: We want the movie title. It appears both in the alt attribute of the image and in an a tag, so we have options. In this case I will get the title from the alt attribute of the image, as in the code below:
for(Element e : body.select("tr"))
{
    String title = e.select("td.posterColumn img").attr("alt");
    System.out.println(title);
}
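Since the title is available in two places, one option is to prefer the alt attribute and fall back to the link text when alt is missing. The sketch below illustrates that idea on made-up stand-in markup; the class name td.titleColumn for the link cell is an assumption that should be confirmed in the developer tools, and the names MovieTitles, titleOf and titles are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class MovieTitles {
    // Stand-in markup (assumption): the second row's image has no alt attribute.
    static final String HTML =
        "<table><tbody class=\"lister-list\">"
        + "<tr><td class=\"posterColumn\"><img src=\"a.jpg\" alt=\"Movie A\"/></td>"
        + "<td class=\"titleColumn\"><a href=\"/a\">Movie A</a></td></tr>"
        + "<tr><td class=\"posterColumn\"><img src=\"b.jpg\"/></td>"
        + "<td class=\"titleColumn\"><a href=\"/b\">Movie B</a></td></tr>"
        + "</tbody></table>";

    // Hypothetical helper: title from the image alt, else from the link text.
    static String titleOf(Element row) {
        String alt = row.select("td.posterColumn img").attr("alt");
        if (!alt.isEmpty()) {
            return alt;
        }
        return row.select("td.titleColumn a").text(); // assumption: titleColumn holds the link
    }

    // Hypothetical helper: titles of all rows, in document order.
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element row : doc.select("tbody.lister-list tr")) {
            result.add(titleOf(row));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(titles(HTML)); // [Movie A, Movie B]
    }
}
```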

In conclusion, the main point of scraping in this section is CSS selectors. If you already understand basic CSS, HTML and Java, you are ready to go.

Author

Founder of CamboTutorial.com. I am happy to share programming knowledge that can help other people. I love writing tutorials about PHP, Laravel, Python, Java and Android development, and every published post is kept simple and easy for beginners to understand.