Group 37

Updated: 22/11/2016

Technical Research: BeautifulSoup

BeautifulSoup is a third-party Python library, distributed via Crummy.com, designed for extracting data from HTML and XML files. [1]

Key Features

Sits atop an HTML or XML parser, letting the user search and iterate over the parse tree with Pythonic idioms

Easy to use

[3]
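
As a rough illustration of searching and iterating over the parse tree, the sketch below parses a small hand-written HTML string. The markup, the "item" class and the variable names are invented for this example, and Python's built-in "html.parser" is assumed as the underlying parser.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for this example.
HTML = """
<html><body>
  <h1>Contact</h1>
  <ul>
    <li class="item">Name</li>
    <li class="item">Email</li>
    <li>Notes</li>
  </ul>
</body></html>
"""

# "html.parser" is the parser bundled with Python's standard library.
soup = BeautifulSoup(HTML, "html.parser")

# Attribute-style access returns the first matching tag.
heading = soup.h1.get_text()

# find_all() searches the whole tree; class_ filters on the CSS class.
items = [li.get_text() for li in soup.find_all("li", class_="item")]

# Passing True matches every tag, giving a walk in document order.
all_tags = [tag.name for tag in soup.find_all(True)]

print(heading)
print(items)
print(all_tags)
```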

Usage

It is installed as a module and imported within Python, allowing a program to pull data out of a markup file by tag. [1] This is key for the first stage of the form renderer, in which components are to be pulled from a markup file. BeautifulSoup does not fetch pages itself, so it can be used alongside urllib2, a Python 2 standard-library module (urllib.request in Python 3), which opens the URL of the target site. [3]
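
A minimal sketch of how the first stage of the form renderer might pull components by tag is given below. The function name extract_form_components, the sample form and the choice of input, select and textarea as the component tags are all assumptions made for illustration, not the project's actual design. Fetching the page is left out; the markup is passed in as a string, which is what a urllib call would ultimately provide.

```python
from bs4 import BeautifulSoup


def extract_form_components(markup):
    """Return one dict per form control found in the markup.

    Hypothetical helper: the tag list below is an assumption about
    which elements count as form components.
    """
    soup = BeautifulSoup(markup, "html.parser")
    components = []
    # A list of names matches any of those tags, in document order.
    for tag in soup.find_all(["input", "select", "textarea"]):
        components.append({
            "tag": tag.name,
            "name": tag.get("name"),  # None if the attribute is absent
            "type": tag.get("type"),  # None for select and textarea here
        })
    return components


# Sample markup invented for this example.
SAMPLE_FORM = """
<form action="/submit">
  <input type="text" name="username">
  <input type="password" name="password">
  <select name="role"><option>admin</option></select>
  <textarea name="notes"></textarea>
</form>
"""

components = extract_form_components(SAMPLE_FORM)
for component in components:
    print(component)
```

Using tag.get() rather than tag["name"] avoids a KeyError when a control has no such attribute.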

Advantages

Saves time when extracting data out of complex or poorly formed HTML or XML

Syntax easy to learn and use

[3]
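
To illustrate the point about poorly formed markup, the sketch below feeds BeautifulSoup a list whose li and b tags are never closed. The broken string is invented for this example. The exact shape of the repaired tree depends on the underlying parser (the built-in "html.parser" is assumed here), so the code only reads the first piece of text inside each li rather than relying on how the tags end up nested.

```python
from bs4 import BeautifulSoup

# Invented example: no closing </li> or </b> tags anywhere.
BROKEN = "<ul><li>One<li>Two<li><b>Three</ul>"

soup = BeautifulSoup(BROKEN, "html.parser")

# All three list items are still found despite the missing end tags.
list_items = soup.find_all("li")

# stripped_strings yields the text inside a tag in document order;
# taking the first one gives each item's own label.
labels = [next(li.stripped_strings) for li in list_items]

print(labels)
```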

Disadvantages

Can fail on large HTML documents

Slow performance compared with alternatives such as lxml

[2]

Alternatives

Scrapy (we had little success setting up a Scrapy web scraper)

References

[1] "Beautiful Soup Documentation — Beautiful Soup 4.4.0 Documentation". Crummy.com. N.p., 2016. Web. 10 Dec. 2016.
[2] "Parsing HTML In Python - Lxml Or Beautifulsoup? Which Of These Is Better For What Kinds Of Purposes?". Stackoverflow.com. N.p., 2016. Web. 10 Dec. 2016.
[3] "Scraping Websites With Python". Python For Beginners. N.p., 2016. Web. 10 Dec. 2016.