Updated: 22/11/2016

Technical Research: BeautifulSoup

BeautifulSoup is a third-party Python library (hosted at Crummy.com) designed for extracting data from HTML and XML files. [1]


Key Features

  • Sits atop an HTML or XML parser, letting the user search and iterate over the parse tree with Pythonic idioms [3]
  • Easy to use [3]
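As a rough illustration of the idioms mentioned above, the sketch below parses a small, made-up HTML snippet and reads values back out of the parse tree. The sample markup and variable names are invented for this note; the parser choice ("html.parser", the one bundled with Python) is an assumption, and lxml or html5lib could be used instead if installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented purely for illustration.
SAMPLE_HTML = """
<html><body>
  <h1 id="title">Sign up</h1>
  <ul>
    <li class="item">Name</li>
    <li class="item">Email</li>
  </ul>
</body></html>
"""

# Build the parse tree using Python's bundled parser (an assumption;
# other parsers can be named here if they are installed).
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Attribute-style navigation: the first <h1> in the document.
heading_text = soup.h1.get_text()

# Dictionary-style access to a tag's attributes.
heading_id = soup.h1["id"]

# Search the tree and iterate over the matches like any Python list.
item_texts = [li.get_text() for li in soup.find_all("li", class_="item")]

print(heading_text, heading_id, item_texts)
```

Searching returns ordinary Python objects, so results can be filtered with list comprehensions, loops, and so on.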


Usage

BeautifulSoup is installed as a module and imported into a Python program, which can then pull data out of a markup file by tag. [1] This is key to the first stage of the form renderer, in which components are pulled from a markup file. It can be used alongside urllib2, a Python 2 standard-library module (urllib.request in Python 3), which opens the URL of the target site so that the page can be passed to BeautifulSoup. [3]
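A minimal sketch of how that first stage might look, assuming the components of interest are the form's input, select, and textarea tags. The function names, the sample form, and the dictionary layout are all hypothetical and are not taken from the form renderer itself. The sketch uses urllib.request, the Python 3 counterpart of urllib2, and does not call the download helper, so it runs offline.

```python
from urllib.request import urlopen  # Python 3 counterpart of urllib2

from bs4 import BeautifulSoup

# Hypothetical form markup, invented purely for illustration.
SAMPLE_FORM = """
<form action="/submit" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <select name="country"><option>UK</option></select>
  <textarea name="notes"></textarea>
</form>
"""


def extract_form_components(markup):
    """Return one dict per form control found in the markup (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    components = []
    # Passing a list of tag names matches any of them, in document order.
    for tag in soup.find_all(["input", "select", "textarea"]):
        components.append({
            "tag": tag.name,
            "name": tag.get("name"),  # None if the attribute is absent
            "type": tag.get("type"),
        })
    return components


def fetch_markup(url):
    """Download a page's markup (hypothetical helper, not called below)."""
    with urlopen(url) as response:
        return response.read()


components = extract_form_components(SAMPLE_FORM)
for component in components:
    print(component)
```

For a live page, the result of fetch_markup(url) could be passed to extract_form_components in place of the sample string.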


Advantages

  • Saves time when extracting data from complex or poorly formed HTML or XML [3]
  • Syntax is easy to learn and use [3]


Disadvantages

  • Can fail on large HTML documents [2]
  • Slow performance relative to parsers such as lxml [2]


Alternatives

  • Scrapy (our attempts to set up a Scrapy web scraper had little success)

References

    [1] "Beautiful Soup Documentation — Beautiful Soup 4.4.0 Documentation". Crummy.com. N.p., 2016. Web. 10 Dec. 2016.
    [2] "Parsing HTML In Python - Lxml Or Beautifulsoup? Which Of These Is Better For What Kinds Of Purposes?". Stackoverflow.com. N.p., 2016. Web. 10 Dec. 2016.
    [3] "Scraping Websites With Python". Python For Beginners. N.p., 2016. Web. 10 Dec. 2016.