As a developer I live for APIs. The ability to take structured information from one source and transform it into another is very exciting. Unfortunately, not all information sources provide an API. Many times, the information needed is jammed into an HTML source, which makes extracting it a challenge. Luckily, there's an incredible NodeJS package called Cheerio which makes this task pretty simple.
I'm going to demonstrate creating a NodeJS application which will scrape information from the HTML of GitHub's Showcase Page. This example is taken from the GitHub-Trending project where GitHub's trending and showcase pages are scraped to provide a JSON API which can be viewed here - this project is used to power a few of my mobile applications.
Here's a few prerequisites:
For this demonstration I'm going to create a single-file application which will dump information when executed. This will suffice for an example, but if you're looking to integrate with a larger project, take the proper steps to modularize it.
Run the following commands in the command line:
mkdir cheerio-example - Create a new project directory
cd cheerio-example - Go into the project directory you just created
npm install request - Install the Request package
npm install cheerio - Install the Cheerio package
touch app.js - Create the application file
Great! We've got our dependencies downloaded and the application file created. Now it's time to start populating the app.js file with content:
If we run this (node app.js) it will reach out to GitHub's Showcase page, grab the HTML and print it to the console. Not very exciting, but we're actually halfway there. This is where Cheerio comes in. If you've never used Cheerio before then you're in for a treat. Cheerio takes raw HTML, parses it, and returns a jQuery-like object so you can traverse the DOM.
If we run this we'll see each showcase printed to the console like so:
Icon fonts
Package managers
Science
Machine learning
Web games
Emoji
Projects with great wikis
Productivity tools
Policies
CSS preprocessors
Video tools
Clean code linters
Data visualization
Projects that power GitHub
Projects that power GitHub for Windows
That's all there is to it. Cheerio, combined with Request, makes parsing HTML very easy. With just this example, you can begin scraping HTML into structured data which can be used in practical applications - in my case, mobile applications! The iOS application, CodeHub, calls out to CodeHub-Trending which exposes a structured API of data that is scraped from GitHub's Trending and Showcase web pages!