Simplify Your Web Scraping with Scraper API
This library allows you to get up and running fast and provides impressive features.
Web scraping can be challenging. While you can certainly use a module such as Axios and handle the details yourself, sometimes it is better to stand on the shoulders of giants.
Scraper API is a giant. This multi-language product makes basic to advanced web scraping a breeze. Scraper API supports Bash, Node, Python/Scrapy, PHP, Ruby and Java. Its features include JavaScript rendering, POST/PUT requests, custom headers, sessions, geographic location targeting and a proxy mode.
In this article, we will cover basic usage and rendering JavaScript. All of the code examples use Node.js to demonstrate Scraper API.
package.json
The package.json lists what is needed in order to run our application. The only detail we need to concern ourselves with is the inclusion of the scraperapi-sdk npm module in the dependencies section.
The Web Page to Be Scraped
The HTML of the web page that we will scrape using Scraper API mainly contains the paragraph: “I’ve been scraped!”. But if you visit the target URL, http://examples.kevinchisholm.com/scrape-me, you will see the text: “This text was added with JavaScript.” The reason is that a JavaScript file is executed in the page. That script looks for the element with the ID “Main” and then replaces the HTML text “I’ve been scraped!” with “This text was added with JavaScript.” This is an important detail that we will touch upon in the next two examples.
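The page's own script is not reproduced here, but the behavior just described could be sketched roughly as follows. The function name replaceMainText and its document-like parameter are assumptions made for illustration; only the element ID and the replacement text come from the description.

```javascript
// Hypothetical sketch of the in-page script described above.
// `doc` is any object exposing getElementById, such as the browser's `document`.
function replaceMainText(doc) {
  const main = doc.getElementById('Main');
  if (main) {
    // Swap the original markup for the text a browser visitor sees.
    main.innerHTML = 'This text was added with JavaScript.';
  }
  return main;
}
```

In a browser this would run as replaceMainText(document) once the page has loaded. A scraper that only downloads the HTML never executes it, which is why the downloaded markup and the visible text can differ.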
Basic Scraper API Usage
Example # 1-A
Example # 1-A is where the web scraping begins. We set the variable scraperapiClient, which is a reference to the imported scraperapi-sdk npm module. We have created a scrapePage() function because we want to use the JavaScript await expression, so that our script pauses until the asynchronous call to the Scraper API client returns.
Let’s break down what is happening here:
- We set the scrapeUrl variable, which is the page to be scraped.
- We set the scrapeResponse variable, which will be the HTML returned when we scrape the page.
- We call the scraperapiClient.get() method, passing it the scrapeUrl variable (the page to be scraped).
- We use console.log to output the HTML returned from the scraped page (the scrapeResponse variable).
NOTE: See Example # 1-B below for a discussion of the HTML that is returned from the scraped page.
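The steps listed above can be sketched in a few lines. This is a minimal illustration, not the original listing: it assumes the SDK is initialized as require('scraperapi-sdk')('YOUR_API_KEY'), where the key is a placeholder, and it passes the client into scrapePage() as a parameter (an illustrative choice) so that the function can be tried with a stand-in client.

```javascript
// Minimal sketch of the steps listed above.
// `client` is assumed to expose get(url), returning a promise for the page HTML.
async function scrapePage(client, scrapeUrl) {
  // await pauses this function until the asynchronous call settles.
  const scrapeResponse = await client.get(scrapeUrl);
  console.log(scrapeResponse);
  return scrapeResponse;
}

// Assumed real-world wiring (needs network access and a valid API key):
// const scraperapiClient = require('scraperapi-sdk')('YOUR_API_KEY');
// scrapePage(scraperapiClient, 'http://examples.kevinchisholm.com/scrape-me');
```

Because await is only valid inside an async function in a CommonJS script, wrapping the call in scrapePage() is what allows the code to read top to bottom.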
Example # 1-B
In Example # 1-B, we see the HTML that is returned from the scraped page. There is not much to discuss here: the HTML is identical to the HTML in the web page's source. This demonstrates the simplicity of Scraper API: you get exactly what you ask for, which is the HTML of the URL that you pass to it.
NOTE: Earlier we discussed the fact that the HTML content is “I’ve been scraped!”, but what you see when you visit the URL is “This text was added with JavaScript.” In the next example, we will discuss the reason for this.
Rendering JavaScript
Example # 2-A
Example # 2-A is identical to the code in Example # 1-A with one exception: when we call scraperapiClient.get(), we pass a 2nd argument: {render: true}. This 2nd argument tells Scraper API to wait until the JavaScript in the page has finished executing. The benefit of this is demonstrated in Example # 2-B: if JavaScript alters the page content, then the HTML that we get back includes the changes that the JavaScript has made.
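As a rough sketch of that one-argument difference, the function below forwards {render: true} to the client's get() method. The name scrapeRenderedPage and the injected client parameter are assumptions for illustration; a real client would be assumed to come from require('scraperapi-sdk')('YOUR_API_KEY').

```javascript
// Sketch of the rendering variant; `client` is assumed to expose get(url, options).
async function scrapeRenderedPage(client, scrapeUrl) {
  // The second argument asks Scraper API to let the page's JavaScript finish first.
  const scrapeResponse = await client.get(scrapeUrl, { render: true });
  console.log(scrapeResponse);
  return scrapeResponse;
}
```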
Example # 2-B
In Example # 2-B, you will see that the HTML content returned by Scraper API is: “This text was added with JavaScript.” instead of “I’ve been scraped!”. This is because we passed {render: true} as the 2nd argument when we called scraperapiClient.get().
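One way to confirm which version of the page a scrape returned is to check the HTML for the two known strings. The helper below is a hypothetical sketch; its name and return values are not part of Scraper API.

```javascript
// Hypothetical helper: reports which version of the page a scrape returned.
function describeScrape(html) {
  // Text that only appears after the page's JavaScript has run.
  if (html.includes('This text was added with JavaScript.')) return 'rendered';
  // Fragment of the original paragraph (apostrophe avoided on purpose).
  if (html.includes('been scraped!')) return 'static';
  return 'unknown';
}
```

Passing it the output of a plain scrape and of a {render: true} scrape should make the difference between the two results easy to spot.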
With Scraper API, you can implement web scraping in minutes. The process is simple and the features are well thought out. Whether your project is for research or commercial purposes, this product provides a robust and reliable way to fetch the source code of any web page.