In this blog post, we are going to learn about some useful Python packages for web scraping. Web scraping is a technique for extracting information from websites. A scraper fetches a web page directly, without a web browser, and extracts information from it.

Using BeautifulSoup Python package

Before using BeautifulSoup, we need to install the following packages: pip and python-setuptools, the tools used to install Python packages. The command sudo pip install BeautifulSoup4 then installs the BeautifulSoup Python package.

Now we will use BeautifulSoup to scrape a web page. Suppose we need to find how many questions related to C++ have been posted on stackoverflow.com.
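A minimal sketch of such a script could look like this. It assumes the count sits in a div with the class summarycount al (the element found by inspecting the page, as described in the next section) and that the site still serves that markup. The helper names extract_question_count and fetch, and the User-Agent header, are our own choices for illustration.

```python
import re
import urllib.request

from bs4 import BeautifulSoup

# URL taken from this post; the page layout may have changed since it was written.
URL = "https://stackoverflow.com/questions?c++"


def extract_question_count(html):
    """Return the number inside the 'summarycount al' div, or None if it is missing."""
    page = BeautifulSoup(html, "html.parser")
    box = page.find("div", attrs={"class": "summarycount al"})
    if box is None:
        return None
    # Keep only the digits, so a value such as "123,456" becomes 123456.
    digits = re.sub(r"\D", "", box.get_text())
    return int(digits) if digits else None


def fetch(url):
    """Download a page and return its HTML as text (hypothetical helper)."""
    # Assumption: some sites reject requests that carry no User-Agent header.
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling print(extract_question_count(fetch(URL))) would then print the count, or None if the page no longer contains that element.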

How it Works

  • Before programming, inspect the web page in any web browser. For example, we inspected the page https://stackoverflow.com/questions?c++ in the Firefox web browser.
  • To inspect, right-click on the web page and select “Inspect element”.
  • Look at the HTML tag that holds the information, and note the tag name and its CSS selector.

  • We extract the information using box = page.find('div', attrs={'class': 'summarycount al'}).
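The lookup in the last step can be tried offline on a small HTML fragment. The fragment here is made up to mimic the inspected element, so treat it as an assumption about the page rather than a copy of it. The sketch also shows the same lookup written as a CSS selector with select_one.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented to resemble the element seen in the inspector.
SAMPLE_HTML = """
<div id="sidebar">
  <div class="summarycount al">1,234</div>
  <p>questions tagged</p>
</div>
"""

page = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Lookup by tag name and exact class attribute value, as in the step above.
box_by_find = page.find("div", attrs={"class": "summarycount al"})

# The same element located with a CSS selector: a div carrying both classes.
box_by_css = page.select_one("div.summarycount.al")

# Text inside the element, with surrounding whitespace removed.
count_text = box_by_find.get_text(strip=True)
```

The find version matches only when the class attribute is exactly "summarycount al", while the CSS selector matches any div that carries both classes in any order.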

 

Scrapy framework

Scrapy is an open source framework for extracting information from web pages. For more information, see https://scrapy.org/. Here we are going to learn how to set it up and use it to extract data.

  • To use the Scrapy framework, install it with sudo pip install scrapy in an Ubuntu Linux terminal. If the pip installation fails, you can install it with the alternative command sudo apt-get install python-scrapy.
  • Create a project with the command scrapy startproject stackoverflow.

Running the project-creation command creates a folder named stackoverflow. This folder holds the project structure: typically a scrapy.cfg file and an inner stackoverflow package with settings.py, items.py, pipelines.py and a spiders directory.

  • Go into the spiders directory, create a file named stackoverflow_spider.py, and add the spider code to it.

  • Run the command scrapy crawl stackoverflow from the project directory.

Ref:

https://stackoverflow.com/questions/28670554/python-scrapy-issue-with-scrapy-version/28736998#28736998