Updated: 22/11/2016

Technical Research: Node.js Cheerio + Request

Node.js does not provide a built-in module for working with the DOM (the Document Object Model); however, the externally developed library Cheerio does. It implements a subset of jQuery on the server side, giving the ability to parse HTML and XML. [1]


Key Features

  • Built upon htmlparser2, an HTML/XML parser that processes its input as a stream, in the style of SAX (Simple API for XML) [2]
  • Implements the majority of the core jQuery API on the server [1]
  • Designed for Node.js

Usage

Cheerio can be used alongside Request, a third-party HTTP request client installed from npm, to scrape markup files from the web. This is a key component of building the form renderer, in which we will extract data from markup files.


Advantages [1]

  • Easy to learn and use, as it is heavily based on jQuery
  • Under active development and constantly improving
  • A Node.js library, so it is compatible with Electrode, a technology favoured by the client


Disadvantages

  • The HTML parser can sometimes fail to read poorly written or badly formatted HTML correctly, although this will not in itself crash the program. This can make it difficult to tell whether a bug lies in the scraper or in the page [2]

Alternatives

  • JSDOM, although its author is reported to be no longer maintaining the module, and it is said to suffer from memory leaks in complex use cases [2]

References

    [1] "Web Scraping In Node.Js". SitePoint. N.p., 2016. Web. 8 Dec. 2016.
    [2] Ogden, Max. "Max Ogden Blogotronz". Maxogden.com. N.p., 2016. Web. 8 Dec. 2016.