Web Scraping with Node and Cheerio.js

Cheerio.js allows you to traverse the DOM of a web page that you fetch behind the scenes, and easily scrape that page.

There are security rules that limit the reach of client-side JavaScript, and if any of these rules are relaxed the user may be susceptible to malicious activity. On the server side, however, JavaScript is not subject to these kinds of limitations. In their absence there’s a great deal of power, and web scraping is one of the most useful things that freedom makes possible.

To get started, clone the following GitHub repository: Basic web scraping with Node.js and Cheerio.js.

You’ll find instructions on how to run this code in the GitHub repository.

The page we will target for web scraping

Let’s take a moment to look at the example web page that we will scrape: http://output.jsbin.com/xavuga. If you use your web developer tools to inspect the DOM, you’ll see that there are three main sections to the page: a HEADER element, a SECTION element, and a FOOTER element. We will target those three sections later, in some of the code examples.

The request NPM module

One of our key tools is the request NPM module, which allows you to make an HTTP request and work with the response in a callback function.

The cheerio NPM module

The cheerio NPM module provides a server-side jQuery implementation, and its functionality mirrors the most common tasks associated with jQuery. It does not replicate every jQuery method 1:1; that was never the project’s goal. The key point is: you can parse HTML with JavaScript on the server side.

Caching an entire web page – Example # 1

In Example # 1, we set some variables. The fs variable references the file system node module, which provides access to the local file system. We’ll need this to write files to disk. The request variable refers to the request node module, which we discussed earlier, and the cheerio variable refers to the cheerio node module that we also discussed. The pageUrl variable is the URL of the web page that we will scrape. At the highest level, two things happen in this code: we define a function named scrapePage, and then we execute that function. Let’s take a look at what happens inside of this function.

First, we call the request function, passing it two arguments, the first of which is the URL of the request. The second argument is a callback function, which takes three arguments. The first argument is an error object, and this “error first” pattern is common in Node.js. The second argument is the response object, and the third argument is the body of the response, which in this case is HTML.
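The callback shape described above can be sketched roughly as follows. This is not the repository’s code: fetchHtml, its done callback and the optional requestFn parameter are hypothetical names introduced here, and requestFn exists only so the function can be tried with a stand-in instead of a live network call. The status code check is also an addition of this sketch.

```javascript
// Fetches a URL and passes the HTML to `done`, itself an error-first callback.
// `requestFn` is a hypothetical optional parameter for this sketch; when it is
// omitted, the real request module is loaded (npm install request).
function fetchHtml(pageUrl, done, requestFn) {
  const request = requestFn || require('request');

  request(pageUrl, function (error, response, responseHtml) {
    // Error first: a non-null error means the request itself failed.
    if (error) {
      return done(error);
    }
    // The response object carries metadata such as the HTTP status code.
    if (response.statusCode !== 200) {
      return done(new Error('Unexpected status code: ' + response.statusCode));
    }
    // The third argument is the response body, here the HTML of the page.
    done(null, responseHtml);
  });
}
```

A call would look like fetchHtml(pageUrl, function (error, html) { ... }), with the caller checking error before it touches html.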

Inside of the request callback, we leverage the file-system module’s writeFile method. The first argument we pass is the full path of the file, which tells the fs module what file to write. For the second argument we pass the responseHtml variable, which is the content that we want to write to the file; this is the response body that the request function passed to our callback. The third argument is a callback function, which we are using to log a message indicating that the file write to disk was successful. When you run Example # 1, you should see a new file in the HTML folder: content.html. This file contains the entire contents of the web page that we make a request to.
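Put together, the steps described for Example # 1 might look something like the sketch below. The real code lives in the repository; here the output path html/content.html, the done callback and the optional requestFn parameter are assumptions made for illustration, and the html folder is assumed to exist already.

```javascript
const fs = require('fs');

const pageUrl = 'http://output.jsbin.com/xavuga';
// Assumed output location; the html folder must already exist.
const outputPath = 'html/content.html';

// `done` and `requestFn` are hypothetical additions for this sketch.
// When `requestFn` is omitted, the real request module is loaded.
function scrapePage(url, filePath, done, requestFn) {
  const request = requestFn || require('request');

  request(url, function (error, response, responseHtml) {
    if (error) {
      return done(error);
    }
    // Write the whole response body to disk, unmodified.
    fs.writeFile(filePath, responseHtml, function (writeError) {
      if (writeError) {
        return done(writeError);
      }
      console.log('Page written to ' + filePath);
      done(null);
    });
  });
}
```

To run it against the live page, call scrapePage(pageUrl, outputPath, function (error) { if (error) { console.error(error); } }).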

Caching only a part of a web page – Example # 2

In Example # 2, we have an updated version of the scrapePage function, and for the sake of brevity, I have omitted the parts of the code that have not changed. The first change to the scrapePage function is the use of the cheerio.load method, and I assigned it to the $ variable. Now we can use the $ variable much the same way we would jQuery. We create the $header variable, which contains the HTML of the HTML header element. We then use the file-system module’s writeFile method to write the HTML header element to the file: header.html.

Now, when you run Example # 2, you should see another new file in the HTML folder called header.html, which contains only the HTML of the page’s header element rather than the entire page.

Example # 3

In Example # 3, we have updated the scrapePage function again, and the new code follows the same pattern as the one in Example # 2. The difference is that we have also scraped the content and footer sections, and in both cases, we’ve written the associated HTML file to disk. So, now, when you run Example # 3, you should see four files in the HTML folder, and they are entire-page.html, header.html, content.html and footer.html.

Summary

In this article, we took a look at what is possible when scraping web pages. Even though we only scratched the surface, we did cover the main steps: making a request and then parsing the HTML of the response. We used the request module to make the HTTP request, and the cheerio module to parse the returned HTML. We also used the fs (file-system) module to write our scraped HTML to disk.

My hope is that this article has opened up some new possibilities in your work, and has pointed you in the right direction for pulling this all off. So, happy web page scraping!