An in-depth guide to building a minimal, robust web scraper for extracting structured data from the web.
Intro
Node.js provides a perfect, dynamic environment for quickly experimenting and working with data from the web.
While there are more and more visual scraping products these days (import.io, Spider, Scrapinghub, Apify, Crawly, …), there will always be a need for the simplicity and flexibility of writing one-off scrapers manually.
This post is intended as a tutorial for writing these types of data extraction scripts in Node.js, including some subtle best practices I've learned from writing dozens of such crawlers over the years.
In particular, we'll be walking through how to create a scraper for GitHub's list of trending repositories. If you want to follow along with the code, check out the repo scrape-github-trending.
Tutorial
Building Blocks
One of the best features of Node.js is its extremely comprehensive ecosystem of open source modules. For this type of task, we'll be leaning heavily on two of them: got, to robustly download raw HTML, and cheerio, which provides a jQuery-inspired API for parsing and traversing those pages.
Cheerio is really great for quick & dirty web scraping where you just want to operate against raw HTML. If you're dealing with more advanced scenarios where you want your crawler to mimic a real user as closely as possible, or to navigate client-side scripting, you'll likely want to use Puppeteer.
Unlike cheerio, Puppeteer is a wrapper for automating headless Chrome instances, which is really useful for working with modern JS-powered SPAs. Since you're working with Chrome itself, it also has best-in-class support for parsing / rendering / scripting conformance. Headless Chrome is still relatively new, but it will likely supplant older approaches such as PhantomJS in the years to come.
As far as got goes, there are dozens of HTTP fetching libraries available on NPM, with some of the more popular alternatives being superagent, axios, unfetch (isomorphic === usable from Node.js or the browser), and finally request / request-promise-native (by far the most popular library, though its maintainers have officially deprecated any future development).
Getting Started
Alright, for this tutorial we'll be writing a scraper for GitHub's list of trending repositories.
The first thing I do when writing a scraper is to open the target page in Chrome and take a look at how the desired data is structured in dev tools.
Switching back and forth between the Console and Elements tabs, you can use the `$$('.repo-list li')` selector in the console to select all of the trending repos.

What you're looking for in creating these CSS selectors is to keep them as simple as possible while also keeping them as focused as possible. By looking through the Elements tab and selecting the elements you're interested in, you'll usually come up with some potential selectors that may work. The next step is to try them out in the Console tab using the `$$()` syntax to make sure you're only selecting the elements you intended to select. One rule of thumb here is to try to avoid relying on aspects of the HTML's structure or classes that may change more often in refactors or code rewrites.

Writing the Scraper
Now that we have a good idea of some CSS selectors that will target our desired data, let's convert them to a Node.js script:
```js
#!/usr/bin/env node
'use strict'

const pMap = require('p-map')
const cheerio = require('cheerio')
const got = require('got')
const { resolve } = require('url')

const baseUrl = 'https://github.com'

exports.main = async () => {
  const { body } = await got(`https://github.com/trending`)

  const $ = cheerio.load(body)
  const repos = $('.repo-list li').get().map((li) => {
    try {
      const $li = $(li)
      const $link = $li.find('h3 a')
      const url = resolve(baseUrl, $link.attr('href'))
      const linkText = $link.text()
      const repoParts = linkText.split('/').map((p) => p.trim())
      const desc = $li.find('p').text().trim()

      return {
        url,
        userName: repoParts[0],
        repoName: repoParts[1],
        desc
      }
    } catch (err) {
      console.error('parse error', err)
    }
  }).filter(Boolean)

  return repos
}
```
Note that we're using async / await syntax here to handle downloading the external web page asynchronously in a way that looks synchronous.
- Line 12: we download the remote page and extract its text body (HTML).
- Line 14: we load that HTML into cheerio so that it's easy to traverse and manipulate.
- Line 15: we select all the repository li elements using our previous CSS selector and map over them.
- Lines 16–32: we extract the relevant portions of each trending repo into a plain JSON object.
- Line 33: here we're filtering out any repos that failed to parse correctly or threw an error. These will be undefined in the array, and `[].filter(Boolean)` is a shorthand syntax for filtering out any non-truthy values.
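To make those last two idioms concrete, here's a quick standalone sketch of how the link-text splitting and the `filter(Boolean)` cleanup behave on plain data (the sample values are invented for illustration):

```javascript
// Splitting link text like 'user / repo' into trimmed parts,
// just as repoParts does in the scraper above.
const linkText = 'jlevy / the-art-of-command-line'
const repoParts = linkText.split('/').map((p) => p.trim())
console.log(repoParts) // [ 'jlevy', 'the-art-of-command-line' ]

// Items that throw during parsing come back from the map as undefined;
// filter(Boolean) drops any non-truthy values in one step.
const parsed = [{ repoName: 'a' }, undefined, { repoName: 'b' }, undefined]
const repos = parsed.filter(Boolean)
console.log(repos.length) // 2
```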
At this point, we've succeeded in scraping a single web page and extracting some relevant data. Here's some example JSON output at this point:
```json
[
  {
    "url": "https://github.com/jlevy/the-art-of-command-line",
    "userName": "jlevy",
    "repoName": "the-art-of-command-line",
    "desc": "Master the command line, in one page"
  },
  {
    "url": "https://github.com/jackfrued/Python-100-Days",
    "userName": "jackfrued",
    "repoName": "Python-100-Days",
    "desc": "Python - 100天从新手到大师"
  },
  {
    "url": "https://github.com/MisterBooo/LeetCodeAnimation",
    "userName": "MisterBooo",
    "repoName": "LeetCodeAnimation",
    "desc": "Demonstrate all the questions on LeetCode in the form of animation.（用动画的形式呈现解LeetCode题目的思路）"
  },
  {
    "url": "https://github.com/microsoft/terminal",
    "userName": "microsoft",
    "repoName": "terminal",
    "desc": "The new Windows Terminal, and the original Windows console host -- all in the same place!"
  },
  {
    "url": "https://github.com/TheAlgorithms/Python",
    "userName": "TheAlgorithms",
    "repoName": "Python",
    "desc": "All Algorithms implemented in Python"
  },
  ...
]
```
Crawling Deeper
Now that we've explored how to scrape a single page, the next logical step is to branch out and crawl multiple pages. You could even get fancy and crawl links recursively from this point on, but for now we'll just focus on crawling one level down in this data: the repository URLs themselves.
We'll follow a very similar approach to how we scraped the original trending list. First, load up an example GitHub repository in Chrome and look through some of the most useful metadata that GitHub exposes and how you might target those elements via CSS selectors.
Once you have a good handle on what data you want to extract and have some working selectors in the Console, it's time to write a Node.js function to download and parse a single GitHub repository.

```js
async function processDetailPage (repo) {
  console.warn('processing repo', repo.url)

  try {
    const { body } = await got(repo.url)
    const $ = cheerio.load(body)

    const numCommits = parseInt($('.commits span').text().trim().replace(/,/g, ''))

    const [
      numIssues,
      numPRs,
      numProjects
    ] = $('.Counter').map((i, el) => parseInt($(el).text().trim())).get()

    const [
      numWatchers,
      numStars,
      numStarsRedundant, // eslint-disable-line
      numForks
    ] = $('.social-count').map((i, el) => parseInt($(el).text().trim().replace(/,/g, ''))).get()

    const languages = $('.repository-lang-stats-numbers li').map((i, li) => {
      const $li = $(li)
      const lang = $li.find('.lang').text().trim()
      const percentStr = $li.find('.percent').text().trim().replace('%', '')
      const percent = parseFloat(percentStr)

      return {
        language: lang,
        percent
      }
    }).get()

    return {
      ...repo,
      numCommits,
      numIssues,
      numPRs,
      numProjects,
      numWatchers,
      numStars,
      numForks,
      languages
    }
  } catch (err) {
    console.error(err.message)
  }
}
```
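The numeric fields here all come out of the DOM as display strings with thousands separators or percent signs, which is why the code strips commas before parseInt and the "%" before parseFloat. A standalone sketch of that coercion (sample strings invented for illustration):

```javascript
// Counts are rendered with thousands separators, e.g. '12,345';
// strip the commas before parsing to an integer.
function parseCount (text) {
  return parseInt(text.trim().replace(/,/g, ''))
}

console.log(parseCount(' 12,345 ')) // 12345

// Language percentages arrive as strings like '54.2%';
// strip the '%' before parsing to a float.
function parsePercent (text) {
  return parseFloat(text.trim().replace('%', ''))
}

console.log(parsePercent('54.2%')) // 54.2
```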
The only real difference here from our first scraping example is that we're using some different cheerio utility methods like $.find() and also doing some additional string parsing to coerce the data to our needs.

At this point, we're able to extract a lot of the most useful metadata about each repo individually, but we need a way of robustly mapping over all the repos we want to process. For this, we're going to use the excellent p-map module. Most of the time, you'll want to set a practical limit on parallelism, whether to throttle network bandwidth or compute resources. This is where p-map really shines. I use it 99% of the time as a drop-in replacement for Promise.all(…), which doesn't support limiting parallelism.

```js
const pMap = require('p-map')

// ...

return (await pMap(repos, processDetailPage, {
  concurrency: 3
})).filter(Boolean)
```
Here, we're mapping over each repository with a maximum concurrency of 3 requests at a time. This significantly improves your crawler's robustness against random network and server issues.
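If you're curious what that concurrency option is actually doing, here's a rough, hand-rolled sketch of the same mechanism. This is only an illustration, not how p-map is implemented (p-map handles errors, iterables, and stopping early far more carefully):

```javascript
// Minimal concurrency-limited map: run at most `concurrency` mapper
// calls at once, while preserving the order of results.
async function mapLimit (items, mapper, concurrency) {
  const results = new Array(items.length)
  let next = 0

  // Each worker pulls the next unclaimed index until none remain.
  async function worker () {
    while (next < items.length) {
      const i = next++
      results[i] = await mapper(items[i])
    }
  }

  const workers = Array.from(
    { length: Math.min(concurrency, items.length) },
    () => worker()
  )
  await Promise.all(workers)
  return results
}
```

With this helper, `await mapLimit(repos, processDetailPage, 3)` would behave much like the pMap call above: only three detail pages are ever in flight at once.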
If you want to add one more level of robustness here, I would recommend wrapping your sub-scraping async functions in p-retry and p-timeout. This is what got is actually doing under the hood to ensure more robust HTTP requests.
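Both of those modules wrap ideas you can sketch in a few lines. Here's a hand-rolled approximation, just to show the shape of what wrapping processDetailPage in them buys you; the real p-retry adds exponential backoff and the real p-timeout is more careful about cleanup, and the delay and attempt counts below are arbitrary:

```javascript
// Reject if a promise takes longer than `ms` milliseconds
// (the idea behind p-timeout).
function withTimeout (promise, ms) {
  return Promise.race([
    promise,
    new Promise((resolve, reject) =>
      setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms)
    )
  ])
}

// Retry an async function up to `retries` extra times before giving up
// (the idea behind p-retry, minus its exponential backoff).
async function withRetry (fn, retries) {
  let lastErr
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastErr = err
    }
  }
  throw lastErr
}
```

A sub-scraper could then be wrapped as something like `withRetry(() => withTimeout(processDetailPage(repo), 10000), 2)`, so a single slow or flaky page doesn't stall the whole crawl.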
All Together Now
Here is the full executable Node.js code. You can also find the full reproducible project at scrape-github-trending.
```js
#!/usr/bin/env node
'use strict'

const pMap = require('p-map')
const cheerio = require('cheerio')
const got = require('got')
const { resolve } = require('url')

const baseUrl = 'https://github.com'

exports.main = async () => {
  const { body } = await got(`https://github.com/trending`)
  const $ = cheerio.load(body)

  const repos = $('.repo-list li').get().map((li) => {
    try {
      const $li = $(li)
      const $link = $li.find('h3 a')
      const url = resolve(baseUrl, $link.attr('href'))
      const linkText = $link.text()
      const repoParts = linkText.split('/').map((p) => p.trim())
      const desc = $li.find('p').text().trim()

      return {
        url,
        userName: repoParts[0],
        repoName: repoParts[1],
        desc
      }
    } catch (err) {
      console.error('parse error', err)
    }
  }).filter(Boolean)

  return (await pMap(repos, processDetailPage, {
    concurrency: 3
  })).filter(Boolean)
}

async function processDetailPage (repo) {
  console.warn('processing repo', repo.url)

  try {
    const { body } = await got(repo.url)
    const $ = cheerio.load(body)

    const numCommits = parseInt($('.commits span').text().trim().replace(/,/g, ''))

    const [
      numIssues,
      numPRs,
      numProjects
    ] = $('.Counter').map((i, el) => parseInt($(el).text().trim())).get()

    const [
      numWatchers,
      numStars,
      numStarsRedundant, // eslint-disable-line
      numForks
    ] = $('.social-count').map((i, el) => parseInt($(el).text().trim().replace(/,/g, ''))).get()

    const languages = $('.repository-lang-stats-numbers li').map((i, li) => {
      const $li = $(li)
      const lang = $li.find('.lang').text().trim()
      const percentStr = $li.find('.percent').text().trim().replace('%', '')
      const percent = parseFloat(percentStr)

      return {
        language: lang,
        percent
      }
    }).get()

    return {
      ...repo,
      numCommits,
      numIssues,
      numPRs,
      numProjects,
      numWatchers,
      numStars,
      numForks,
      languages
    }
  } catch (err) {
    console.error(err.message)
  }
}

if (!module.parent) {
  exports.main()
    .then((repos) => {
      console.log(JSON.stringify(repos, null, 2))
      process.exit(0)
    })
    .catch((err) => {
      console.error(err)
      process.exit(1)
    })
}
```
And here you can find an example of the corresponding output.
Conclusion
I have used this exact pattern dozens of times for one-off scraping tasks in Node.js. It's simple, robust, and really easy to customize for practically any targeted crawling / scraping scenario.
It's worth mentioning that scrape-it also looks like a very well engineered library that does essentially everything described in this article under the hood.
If your crawling use case requires a more distributed workflow or more complicated client-side parsing, I would highly recommend checking out Puppeteer, which is a game-changing library from Google for automating headless Chrome. You may also want to check out the related crawling resources listed in awesome-puppeteer, such as headless-chrome-crawler, which provides a distributed crawling solution built on top of Puppeteer.
In my experience, however, 95% of the time a simple one-file script like the one in this article tends to do the job just fine. And imho, KISS is the single most important rule in software engineering.
Thanks for your time && I wish you luck on your future scraping adventures!
PS. You can play with the full repo here: scrape-github-trending.