Group 37

Updated: 22/11/2016

Technical Research: NodeJS Cheerio + Request

Node.js does not provide a built-in implementation of the DOM (the Document Object Model); the externally developed library Cheerio fills this gap. It implements a subset of jQuery on the server side, making it possible to parse and query HTML and XML. [1]

Key Features

Built upon htmlparser2, an HTML/XML parser that works in the style of SAX (Simple API for XML): it reads the document in a single pass and reports it as a stream of events [2]

Implements the majority of the core jQuery API on the server [1]

Designed for Node.js

Usage

Cheerio can be used alongside Request, a third-party HTTP request client installed from npm, to scrape markup files from the web. This is a key component of building the form renderer, in which we will extract data from markup files.

Advantages

Easy to learn and use, as it is heavily based on jQuery

Under active development, constantly improving

A Node library, and therefore compatible with Electrode, a technology favoured by the client

[1]

Disadvantages

The HTML parser can sometimes fail to read poorly written or badly formatted HTML, although this will not in itself crash the program. This can make it difficult to tell whether a bug lies in the scraper or in the page. [2]

Alternatives

jsdom: however, the author is no longer maintaining this module, and it suffers from memory leaks in complex use cases [2]

References

[1] "Web Scraping In Node.Js". SitePoint. N.p., 2016. Web. 8 Dec. 2016.
[2] Ogden, Max. "Max Ogden Blogotronz". Maxogden.com. N.p., 2016. Web. 8 Dec. 2016.