Web Scraping (also termed Screen Scraping, Web Data Extraction, Web Harvesting, etc.) is a technique for extracting large amounts of data from websites and saving the extracted data to a local file or to a database.
In this course, you will learn how to perform web scraping using Python 3 and Beautiful Soup, a free, open-source library written in Python for parsing HTML.
We will use lxml, which is an extensive library for parsing XML and HTML documents very quickly; it can even handle messed up tags. We will also be using the Requests module instead of the built-in urllib module (urllib2 in Python 2) due to improvements in speed and readability.
The course covers the following topics: accessing web pages programmatically; scraping web pages to extract the required data, using Beautiful Soup to parse them; interacting with web pages programmatically; and using Selenium for web scraping, including when it is needed.
By the end of this course, you will understand how websites and servers function, know a range of data extraction techniques, and be able to handle and organize the data you collect.
This Web Scraping course covers the following topics:
- [Instructor] Welcome everybody to the course. I'm really glad that all of you are here. Thank you again for checking out this course. So, I'll start off by explaining what web scraping is, as this course is all about web scraping. In simple and brief terms, web scraping is extracting data from the internet. Why do we need to do that? That's a valid question.
Why do we need to scrape data from the internet?
We live in an age where everything is being uploaded to the internet; you know, terabytes of data are being uploaded every day. And with such an amount of data, you could perform analysis and you could actually improve a lot of things. Like, if you were a businessman launching a product, you could learn more about the market: what do people like? How will they respond to your product launch? How can you improve your product?
All of this information is on the internet, so you need some tool, some way to actually get that information out of the internet. One way would be to hire hundreds of people to do this stuff. The other way, the smart way, would be to write a computer program which does this for you and conserves your resources. So, this is what web scraping is about.
The next question is what is this course about?
This course is about how we can access web pages programmatically. We'll be using Python for this course, so this course will teach us how we can parse a web page and extract the required data from that web page.
Let's say you want to get some photos from some web page. How are you going to do that? Let's say there are thousands of photos. Don't tell me that you're going to download each of them individually. I mean, that is going to take a lot of time, so we will learn how to scrape a web page and extract the required data from it.
Python has this module, known as Beautiful Soup, which is a parser for web pages. We'll learn this too in this course. We'll also learn how we can interact with our web pages in different ways, and how we can move around and play around with them, you know?
You could say that it is an introduction to automation, you know, automating different stuff on the internet. You can say that in simple terms. So we'll be using Selenium for interacting with web pages in this course.
So to sum up the answer to this question, I can say that, briefly, it is about how we extract data from any kind of web page using a programming language; in our case, it is going to be Python.
What will you learn by the end of this course?
At the end of this course, you will have a deep understanding of how web sites and servers function, like how are web sites hosted and how do they send requests to servers? What are servers? What are websites? And how does this communication take place? So you'll have a deeper understanding of how this works, how all of this works.
Then, you'll learn different web scraping and data extraction techniques which are being used worldwide, like the best practices and the worst practices: things you have to worry about, things you have to keep in mind while you are scraping data, and how you can do this in an efficient manner. You will learn this by the end of this course.
You'll also learn how to handle a lot of data and how we can actually parse that data and get it into the format we want.
So I really look forward to teaching you this course. I hope that you're excited about this. I think let's get to our first tutorial, so I'll see you soon in the next video. Thank you!
How to use lists in Python. Revision of Python Lists and commonly used list functions.
How to use dictionaries in Python. Revision of Python Dictionaries and commonly used dict functions.
How to use tuples in Python. Revision of Python Tuples and commonly used tuple functions.
How to write simple list comprehensions. Python supports a concept called "list comprehensions", which lets you construct lists in a natural, concise way, much like set-builder notation in mathematics. Common applications are to make new lists where each element is the result of some operation applied to each member of another sequence or iterable, or to create a subsequence of those elements that satisfy a certain condition.
For example, to create a list of squares, instead of writing a regular for loop, you can use a list comprehension like:
squares = [x**2 for x in range(10)]
Writing complex list comprehensions.
Using if else conditions while writing list comprehensions.
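As a rough illustration of the two comprehension lectures above, the sketch below shows a filtering condition, an if/else expression, and a nested loop. The variable names and sample ranges are made up for this example and are not taken from the course videos.

```python
# Keep only some elements: the "if" goes at the end and acts as a filter.
even_squares = [x**2 for x in range(10) if x % 2 == 0]

# Choose between two values: "if ... else" goes before "for" and is
# evaluated for every element, so nothing is filtered out.
labels = ["even" if x % 2 == 0 else "odd" for x in range(4)]

# A more complex comprehension: two loops plus a filter.
pairs = [(x, y) for x in range(3) for y in range(3) if x < y]

print(even_squares)
print(labels)
print(pairs)
```

The position of the condition is the main thing to remember: a trailing `if` removes elements, while an `if ... else` before `for` transforms each element.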
Introduction to the Python xlsxwriter module, which can be used to create Excel files and write into them. This video discusses creating Excel files and writing data to them using xlsxwriter.
Introduction to the Python xlrd module, which is used to read data from Excel files. This video discusses how we can read Excel files using xlrd.
How to make a website’s server treat your Python code as a browser.
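One common way to do this, sketched here under assumptions, is to send a browser-like User-Agent header with the Requests library. The URL and the User-Agent string below are illustrative placeholders, not values from the course, and the request is only prepared, not sent, so the example runs without network access.

```python
import requests

# Hypothetical target URL and an illustrative browser-like User-Agent string.
url = "https://example.com/"
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

# Build the request without sending it, so the header can be inspected.
prepared = requests.Request("GET", url, headers=headers).prepare()
print(prepared.headers["User-Agent"])

# To actually send it, the usual one-liner would be:
# response = requests.get(url, headers=headers)
```

Without a custom header, Requests identifies itself with its own default User-Agent, which some servers reject or treat differently from a browser.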
Introduction to Beautiful Soup - a Python module used for parsing HTML.
Understanding the HTML parse tree which is created by Beautiful Soup.
How we can access tags in the HTML parse tree generated by Beautiful Soup.
Concept of navigable strings in Beautiful Soup and how we can access them.
How we can navigate our parse tree using tag names.
How we can access a tag’s direct children.
How we can access all of a tag’s descendants.
How to get to the parent tag of a tag.
How to get all the ancestors of a tag.
How to access the next sibling element of a tag in our parse tree with Beautiful Soup.
How to access the previous sibling tag of a tag in the parse tree with Beautiful Soup.
How to access all the next sibling tags and all the previous sibling tags of a tag in our parse tree with Beautiful Soup.
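The Beautiful Soup navigation topics listed above can be sketched in one short example. The HTML snippet and variable names are invented for illustration, and the built-in "html.parser" is used here instead of lxml only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical HTML, kept on one line so there are no whitespace-only
# text nodes between the tags.
html = '<html><body><div id="menu"><p>First</p><p>Second</p><p>Third</p></div></body></html>'
soup = BeautifulSoup(html, "html.parser")

div = soup.div                    # navigate the parse tree by tag name
first, second, third = div.find_all("p")

# Direct children versus all descendants (tags plus their strings).
child_names = [child.name for child in div.children]
descendant_count = len(list(div.descendants))

# Parent and ancestors of a tag.
parent_name = second.parent.name
ancestor_names = [ancestor.name for ancestor in second.parents]

# Single next/previous sibling, then all siblings in each direction.
next_text = second.next_sibling.get_text()
previous_text = second.previous_sibling.get_text()
all_next = [tag.get_text() for tag in first.next_siblings]
all_previous = [tag.get_text() for tag in third.previous_siblings]

# The text inside a tag is a NavigableString.
is_navigable = isinstance(second.string, NavigableString)

print(child_names, descendant_count, parent_name, ancestor_names[:3])
print(next_text, previous_text, all_next, all_previous, is_navigable)
```

With real pages, whitespace between tags shows up as extra NavigableString siblings, so `next_sibling` may return a newline string rather than the next tag.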
GoTrained is an e-learning academy that aims to create useful content in different languages, concentrating on technology and management.
We adopt a special approach for selecting the content we provide; we mainly focus on skills that are frequently requested by clients and in job listings but are covered by only a few videos. We also try to build video series that cover not only the basics, but also the advanced areas.