What is Web Scraping?
Web scraping is when a script pretends to be a browser, retrieves web pages, and extracts information from them. In short, web scraping means extracting data from a website.
Search engines such as Google also retrieve web pages automatically, but that is usually called "web crawling".
For web scraping in Python we can use Beautiful Soup, a package for parsing HTML and XML documents.
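As a rough illustration of what "pretends to be a browser" means, the sketch below uses only the standard library to build a request that carries a browser-like User-Agent header. The User-Agent string and the helper names build_request and fetch are assumptions made up for this example, not something a particular site requires.

```python
import urllib.request

# Hypothetical User-Agent string, chosen only for illustration.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleScraper/1.0"


def build_request(url, user_agent=BROWSER_USER_AGENT):
    """Return a Request object that identifies itself with a browser-like header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Download the page at url and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

Calling fetch('https://example.com/') would return the page as bytes, ready to be handed to a parser.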
Beautiful Soup (HTML parser)
What is PIP?
PIP is a package manager for Python packages. If you have Python version 3.4 or later, PIP is included by default.
How to install Beautiful Soup?
Beautiful Soup is not part of the Python standard library, so we need to install it before using it.
To install Beautiful Soup run this:
pip install beautifulsoup4
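One way to check that the installation worked is to parse a small inline HTML string, which needs no network access. This is a minimal sketch: the sample HTML and the helper name parse_sample are made up for illustration.

```python
from bs4 import BeautifulSoup

# A small inline document, so the check runs without any network access.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="/about">About</a>
    <a href="https://example.com/contact">Contact</a>
  </body>
</html>
"""


def parse_sample(html=SAMPLE_HTML):
    """Parse the HTML and return the page title plus every href value."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.string
    links = [a.get("href") for a in soup.find_all("a")]
    return title, links


if __name__ == "__main__":
    title, links = parse_sample()
    print(title)
    print(links)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.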
Web Page Scraper with BeautifulSoup examples
Get all links from a web page
# Get all links from a web page
import ssl
import urllib.request

# Import the BeautifulSoup class
from bs4 import BeautifulSoup

# Ignore SSL certificate errors (fine for a quick demo, unsafe in production)
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = input('Enter url: ')

# read() returns the whole response body as bytes
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

# Get all anchor tags
tags = soup('a')
for tag in tags:
    # Prints None for anchors that have no href attribute
    print(tag.get('href', None))
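Many of the href values collected this way are relative (for example /about) or missing. A possible follow-up step, sketched below, resolves each link against the page URL with urllib.parse.urljoin and drops duplicates. The function name absolute_links and the rule of skipping fragment-only links are choices made for this example.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return the unique absolute URLs linked from html, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    seen = set()
    result = []
    for tag in soup.find_all("a"):
        href = tag.get("href")
        # Skip anchors without an href and links that only point within the page.
        if not href or href.startswith("#"):
            continue
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


if __name__ == "__main__":
    page = '<a href="/about">About</a> <a href="page2.html">Next</a>'
    for link in absolute_links(page, "https://example.com/docs/"):
        print(link)
```

The base URL should be the address the page was actually fetched from, since relative links are resolved against it.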
Why use a Proxy API?
One of the most frustrating parts of automated web scraping is constantly dealing with IP blocks and CAPTCHAs. Fortunately, solutions exist, and they are not very difficult to use. Read more here about Proxy API for Web Scraping.
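As a general illustration of routing requests through a proxy, the sketch below uses the standard library's urllib.request.ProxyHandler. It shows a plain HTTP proxy, not any specific Proxy API service, and the proxy address and the helper name build_proxy_opener are placeholders invented for this example.

```python
import urllib.request

# Hypothetical proxy address; replace it with one you are allowed to use.
EXAMPLE_PROXY = "http://proxy.example.com:8080"


def build_proxy_opener(proxy_url=EXAMPLE_PROXY):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

With a working proxy, build_proxy_opener().open(url, timeout=10).read() would fetch the page through it instead of connecting directly.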