In this blog we are going to learn some useful Python packages for web scraping. Web scraping is a technique for extracting information from websites: a script fetches a web page directly, without a web browser, and extracts information from it.

Using BeautifulSoup Python package

Before using BeautifulSoup we need to install a few packages first. python-setuptools and pip are tools for installing Python packages; once pip is available, the command sudo pip install BeautifulSoup4 installs the BeautifulSoup4 package.

sudo apt-get install python-setuptools
sudo easy_install pip
sudo pip install BeautifulSoup4
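
To verify the installation, you can try importing the package from the command line (a quick sanity check, assuming a standard Python setup):

python -c "import bs4; print(bs4.__version__)"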

Now we will use BeautifulSoup to scrape a web page. Suppose we need to find how many questions related to C++ have been posted on stackoverflow.com. Let us look at the code below.

from bs4 import BeautifulSoup
import requests

# Use the requests module to obtain a page
res = requests.get('https://stackoverflow.com/questions?c++')

# Create a BeautifulSoup object
page = BeautifulSoup(res.text, 'html.parser')   # the text field contains the source of the page

box = page.find('div', attrs={'class': 'summarycount al'})

print(box.text.strip())  # print the question count

How it Works

  • Before programming, inspect the web page in a web browser. For example, we inspected the page https://stackoverflow.com/questions?c++ in the Firefox web browser.
  • To inspect, right-click on the web page and select “Inspect Element”.
  • Look through the HTML to find the element that holds the information, and note its tag and CSS class.
  • We extract the information using box = page.find('div', attrs={'class': 'summarycount al'}); a slightly more defensive version is sketched below.
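
Since page layouts change over time, find() returns None when the element is missing. Here is a minimal defensive sketch (the URL and the class name summarycount al are taken from the example above as the page looked when inspected):

from bs4 import BeautifulSoup
import requests

res = requests.get('https://stackoverflow.com/questions?c++')
res.raise_for_status()  # stop early on HTTP errors (404, 500, ...)

page = BeautifulSoup(res.text, 'html.parser')
box = page.find('div', attrs={'class': 'summarycount al'})  # element holding the question count

if box is not None:
    print(box.text.strip())
else:
    print('Count element not found; the page layout may have changed.')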


Scrapy framework

Scrapy is an open-source framework for extracting information from web pages. For more information see https://scrapy.org/. Here we are going to learn how to set it up and use it to extract data.

  • To use the Scrapy framework, install it with sudo pip install scrapy on an Ubuntu Linux terminal. If the pip install fails, you can install it with the alternative command sudo apt-get install python-scrapy.
  • Create a project using the following command.
scrapy startproject stackoverflow

Running the above command creates a folder named stackoverflow. This directory contains a project structure like the one below.

.
|-- scrapy.cfg
`-- stackoverflow
    |-- __init__.py
    |-- items.py
    |-- pipelines.py
    |-- settings.py
    `-- spiders
        `-- __init__.py
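
Here scrapy.cfg holds the deployment configuration, settings.py holds the project settings, pipelines.py can post-process scraped items, and spiders/ is where spider code lives. items.py can optionally declare a structured item class; a minimal sketch matching the fields our spider yields might look like this (the spider below simply yields plain dicts, so this file is not required):

import scrapy

class QuestionItem(scrapy.Item):
    # declared fields for a scraped Stack Overflow question
    title = scrapy.Field()
    votes = scrapy.Field()
    body = scrapy.Field()
    tags = scrapy.Field()
    link = scrapy.Field()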

  • Go into the spiders directory, create a file named stackoverflow_spider.py, and paste the code below into it.
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'  # each spider has a unique name
    start_urls = ['http://stackoverflow.com/questions?sort=votes']  # the parsing starts from a specific set of urls

    def parse(self, response):  # for each request this generator yields, its response is sent to parse_question
        for href in response.css('.question-summary h3 a::attr(href)'):  # do some scraping stuff using css selectors to find question urls 
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):  # extract the fields of interest from a question page
        yield {
            'title': response.css('h1 a::text').extract_first(),
            'votes': response.css('.question .vote-count-post::text').extract_first(),
            'body': response.css('.question .post-text').extract_first(),
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
  • Run the command scrapy crawl stackoverflow from the project directory.
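
By default the crawl only logs the scraped items. Scrapy's feed export option (-o) can write them to a file instead; the filename questions.json here is just an illustration:

scrapy crawl stackoverflow -o questions.json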



