How to Scrape Web Pages with Beautiful Soup and Python?

What is Web Scraping?

Web scraping is when a script pretends to be a browser, retrieves web pages, and extracts information from them. In short, web scraping refers to the extraction of data from a website.
Search engines such as Google also scrape web pages, but at that scale the practice is usually called "web crawling".

In Python, web scraping is commonly done with Beautiful Soup, a package for parsing HTML and XML documents.
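To see what "parsing" means in practice, here is a minimal sketch that feeds Beautiful Soup a small hand-written HTML string (hypothetical example data, not a real page) and pulls elements out of it:

```python
from bs4 import BeautifulSoup

# A small HTML snippet standing in for a downloaded page
html = "<html><body><h1>Title</h1><p class='intro'>Hello</p></body></html>"

# Parse the markup with Python's built-in html.parser
soup = BeautifulSoup(html, 'html.parser')

# Access elements by tag name or by attributes
print(soup.h1.text)                          # Title
print(soup.find('p', class_='intro').text)   # Hello
```

The `soup` object lets you navigate the document as a tree instead of hunting through raw text.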

Beautiful Soup (HTML parser)

What is PIP?

PIP is the package manager for Python packages. If you have Python 3.4 or later, PIP is included by default.

How to install Beautiful Soup?

Beautiful Soup is not part of the Python standard library, so we need to install it before using it.
To install Beautiful Soup, run:

pip install beautifulsoup4
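To confirm the installation worked, you can import the package and print its version from a Python shell:

```python
# Quick check that Beautiful Soup installed correctly
import bs4
print(bs4.__version__)
```

If this prints a version number (e.g. 4.x.x) instead of raising ImportError, you are ready to go.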

Web Page Scraper with BeautifulSoup examples

Get all links from a web page

# Get all links from a web page
import urllib.request, urllib.parse, urllib.error

# Import the BeautifulSoup library
from bs4 import BeautifulSoup

# Ignore SSL certificate errors
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter url: ')
# read() returns the whole document at once as bytes
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Get all anchor tags and print each href attribute
tags = soup('a')
for tag in tags:
    print(tag.get('href', None))
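The same pattern works for any tag, not just anchors. As a sketch, here is a variation that parses a local HTML string (hypothetical example data, so it runs without a network connection) and extracts the page title plus every image source:

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for a downloaded page
html = """
<html><head><title>Example</title></head>
<body>
  <img src="logo.png" alt="Logo">
  <img src="banner.jpg" alt="Banner">
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')

# The contents of the <title> tag
print(soup.title.string)

# Every <img> tag's src attribute, same loop as the <a>/href example
for img in soup('img'):
    print(img.get('src', None))
```

Swap the HTML string for the result of `urllib.request.urlopen(...).read()` to run this against a live page.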

Why use a Proxy API?

One of the most frustrating parts of automated web scraping is constantly dealing with IP blocks and CAPTCHAs. Fortunately, solutions exist and are not difficult to use. Read more here about Proxy API for Web Scraping.

