Practical Web Scraping Course in Python, Scrapy and Selenium

Name: Practical Web Scraping Course in Python, Scrapy and Selenium
Rating: 4.4 (9 reviews)

The core of Python web scraping in less than 60 minutes + GitHub repo + Selenium, Scrapy real-life use-cases

Created by Mykhailo Kushnir

Last updated 7/2022

English

What you'll learn

How to get data for your content, stock info, crypto exchange reserves, etc.
How to scrape sites with popular frameworks
Advanced techniques like scraping images, PDFs, graphics, etc.
Get more information in less time: save yourself hours of research
A working, tested tool that'll get you data from 95% of sites

Coding Exercises

This course includes our updated coding exercises so you can practice your skills as you learn.

Course content

5 sections • 28 lectures • 52m total length

Hello and Welcome (1:04)
Welcome. This course will teach you practical ways of capturing data from the internet.

My name is Mykhailo Kushnir, and I currently work as an ML Engineer in Lviv, Ukraine.

I need data for both my work and my pet projects. I’m sure most of you are here for similar reasons.

Like me, you probably don’t want to commit all the time in the world to it. Because of that, I’ll try to keep the tutorials as short and condensed as possible. That’s my intention.

Throughout this course, you’ll learn many ways to scrape data, store and version-control it efficiently, use Selenium for data capturing, and much more.

Finally, a small disclaimer: Udemy typically asks you to rate a course before you have gone through enough of it to form an opinion. If that is the case, feel free to postpone the rating until you know whether this course gave you enough value for the money it costs. Also, if you face any obstacles while learning, please let me know, and we’ll see if I can help. Otherwise, enjoy the course!
How to use the course (1:25)
Hi, everyone. In this video, I’ll explain how to get the most out of this course.

First of all, I assume I know your problem. You either want to get data for your own pet project, or you’re looking for a side-gig skill, which scraping is.

And you want it now.

I’ve created a course that I would like to watch myself, and I don’t really like long-running stuff. I’ll supply you with reading materials, links and scripts that will help you immediately, but you’ll still have to google. On my end, I promise to pack the content with information and useful tools.

The main part of the code is placed in this GitHub repository. You’ll find the link to it after the video. By the way, that will be a common pattern: whenever you see an external resource on the screen, you’ll find a link to it in the reading materials after the video.

For the best efficiency, follow the course in 3 steps:
Watch the videos
Reproduce the code from them
Extend this code for some real use cases. I’ll give you some ideas.

If something goes wrong, reach out to our Slack community for a potential answer.

Now you’re fully ready for your first tutorial. It won’t be a simple one, but you’ll make it. Good luck!
Development environment setup (2:01)
Hey everyone, in this section I’ll introduce you to the course and give some tips on how to learn with the highest efficiency.

After the initial overview, we will learn how to set up a programming environment for web scraping. When you complete the video part, you’ll find reading materials with links. Make sure you go through them, as they contain details worth grasping.

For this initial setup, you will need Python, Docker and your favourite IDE. I’d suggest VS Code.

First, you have to install Python. There’s no better way of doing it than going to Python’s official website and following the tutorials under the Downloads section.

Next, we have to install a virtual environment package and use it to create a new environment. You’ll be using it to install the requirements.txt of the various projects in this course.

The virtualenv package helps you avoid versioning issues, so it’s definitely a useful tool.

If everything was done correctly, you will be able to create a virtual environment and install the packages from the requirements.txt file. Make sure you’ve pulled the source code for this course from GitHub.
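The environment step can also be scripted. The sketch below is only an illustration and swaps in Python's standard-library venv module for the virtualenv package mentioned above; the helper names create_env and install_requirements are hypothetical and do not come from the course repository.

```python
import subprocess
import sys
import venv
from pathlib import Path


def create_env(env_dir: Path, with_pip: bool = True) -> Path:
    """Create a virtual environment and return the path to its interpreter."""
    venv.EnvBuilder(with_pip=with_pip).create(env_dir)
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def install_requirements(python: Path, requirements: Path) -> None:
    """Install everything listed in a requirements file into that environment."""
    subprocess.run(
        [str(python), "-m", "pip", "install", "-r", str(requirements)],
        check=True,  # raise if pip reports a failure
    )
```

A typical call would be create_env(Path(".venv")) followed by install_requirements(...) with the returned interpreter path and the requirements.txt of the project you pulled from GitHub.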

Go to the Docker install page to see how you can set it up on your specific operating system.
Once Docker is installed, it is enough to start with pulling the Selenium standalone-chrome image for this course.

Then start it with the run command.

Here is a useful link for VS Code installation as well.

Once again, if you face issues with this initial setup, make sure you’ve glanced at the reading materials after the video section. You can also go to our Slack community to ask other students for help.
Reading Materials (0:06)
Last Check Before We Start
[LEARNING TIP] Use captions created for this course (0:03)

Tracking HTTP requests (2:36)
Tracking and reproducing HTTP requests is the primary and most reliable method of getting data from the Internet. You should always aim to use it, either by replicating your browser's requests or by requesting API access from site owners. In this tutorial, I'll show you how to find the necessary requests and replicate them in the blink of an eye with Postman.
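Once a request has been found in the browser's network tab, it can be replayed from Python as well as from Postman. The following is a minimal sketch using only the standard library (urllib) rather than Postman or the requests library; the URL, query parameters and header values are placeholders, and build_request and fetch_json are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Headers copied from the browser's network tab; these values are placeholders.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "application/json",
}


def build_request(url, params=None, headers=None):
    """Rebuild a GET request the way the browser sent it."""
    if params:
        url = f"{url}?{urllib.parse.urlencode(params)}"
    return urllib.request.Request(url, headers=headers or BROWSER_HEADERS)


def fetch_json(request, timeout=10):
    """Send the request and decode a JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not touch the network; fetch_json(request) does.
request = build_request("https://example.com/api/items", {"page": 2, "limit": 50})
```

Keeping the request construction separate from the sending step makes it easy to compare the rebuilt URL and headers against what the browser actually sent before any traffic goes out.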
BeautifulSoup 4 tools in detail (2:10)
Here is the list of topics we will touch on. First, I’ll explain the difference between the select method and the find/find_all methods and why you should prefer the former more often. Next, we’ll look at use cases where you’d access parent and child elements through specific properties, and we’ll review how to get to the text content of tags.

First, let me explain the difference between the select and the find_all methods. Both of them aim to capture all elements matching some predefined criteria. For example, here’s how you can modify the code from the previous tutorial to use the find_all method and still reach the same result. As you see, the find_all method requires you to define selectors in a Pythonic way, through arguments, while the select method accepts CSS selectors, a more natural, JavaScript-ish way. That’s why I prefer select.

The find method does the same thing as find_all, but returns only the first matching element, if it exists. It’s not hard to conclude that the same thing can be achieved through the select method and some Python magic.
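To make the comparison concrete, here is a small sketch on an inline HTML snippet. The markup and variable names are made up for illustration and are not the code from the video; it assumes the beautifulsoup4 package is installed and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

HTML = """
<ul id="posts">
  <li class="post"><a href="/a">First</a></li>
  <li class="post featured"><a href="/b">Second</a></li>
  <li class="ad"><a href="/c">Sponsored</a></li>
</ul>
"""

soup = BeautifulSoup(HTML, "html.parser")

# All matches: keyword arguments with find_all, a CSS selector with select.
via_find_all = soup.find_all("li", class_="post")
via_select = soup.select("li.post")

# First match only: find, select_one, or select plus a little Python.
first_via_find = soup.find("li", class_="post")
first_via_select_one = soup.select_one("li.post")
matches = soup.select("li.post")
first_via_index = matches[0] if matches else None

# Navigating to the parent element and reading a tag's text content.
parent_id = first_via_find.parent["id"]
link_text = first_via_find.get_text(strip=True)
```

Both approaches return the same two list items here; select_one is the CSS-selector counterpart of find, and indexing into the result of select is the "Python magic" variant.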
Basic scraping with BeautifulSoup 4 (1:56)
There are two major use cases for BeautifulSoup that I can think of. First, if you have your data packaged into some local HTML or XML file. In that case, you can load it into BS4 and apply its tools to read the markup’s content.

Second, if you’re trying to parse static websites. By static here I mean sites that don’t use JavaScript for rendering. In other words, if your data is present right away in the page source, BeautifulSoup can be a useful tool. Let us look at an example.
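The first use case can be sketched as follows. The sample markup, the CSS classes and the helper names parse_quotes and parse_quotes_from_file are all invented for this illustration; the file is written to a temporary directory only so that the example is self-contained. It assumes the beautifulsoup4 package is installed.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

SAMPLE_HTML = """<html><body>
<h1>Quotes</h1>
<div class="quote"><span class="text">Stay curious.</span><small class="author">A. Writer</small></div>
<div class="quote"><span class="text">Ship it.</span><small class="author">B. Maker</small></div>
</body></html>"""


def parse_quotes(markup):
    """Extract text/author pairs from markup that is already in memory."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        {
            "text": quote.select_one(".text").get_text(strip=True),
            "author": quote.select_one(".author").get_text(strip=True),
        }
        for quote in soup.select("div.quote")
    ]


def parse_quotes_from_file(path):
    """Use case 1: the data sits in a local HTML file."""
    return parse_quotes(Path(path).read_text(encoding="utf-8"))


# Write the sample page to disk, then parse it back from the file.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "quotes.html"
    page.write_text(SAMPLE_HTML, encoding="utf-8")
    quotes = parse_quotes_from_file(page)
```

For the second use case, a static website, the same parse_quotes function could be fed the body of an HTTP response instead of a file's contents.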
How to get the same result with find() and select() methods of BS4?
Basic scraping with BeautifulSoup 4 (0:06)
Web scraping with Selenium (4:03)
Hi everyone, in this tutorial we will make scraping more dynamic and introduce the Selenium library. It is well-known among programmers as a helpful tool in many automation tasks. It can help you simulate user behaviour and therefore get past certain traps along the way of data capturing.

This library helps you navigate through a site using code that simulates a visitor's behaviour. There's a bunch of stuff you can do with it, like:
Interact with site
Click on elements
Drag-N-Drop simulations
Form filling
JavaScript execution on page
DOM search, and much more
Visual Intro to Selenium tools (2:37)
In this tutorial I’ll show you how to use the Selenium framework in the most visual way possible. The goal is to make you familiar with the tool and the features we will be using in the next lessons. For that purpose, I encourage you to use Jupyter Notebooks, as they allow running code step by step with all variables staying in memory. You should be prepared for this if you completed this course's installation tutorial.
Tackling Pagination with Selenium (2:54)
Pagination Followup (0:10)
Scraping endless pages with Selenium (1:42)
Dealing with authentication and user sessions (2:30)
This method helps speed up your regular scraping tasks. For example, let’s say you want to scrape a site on a daily basis. Instead of logging in each time, you can use a custom scraping profile in your browser to keep a session up and running until the target site expires it.

Be cautious about this method, as some sites block automation scripts. I typically use such methods when I need to perform a single request to the site every once in a while. In later tutorials, I’ll also show you how to make your script behave more like a human to decrease the chance of a ban.
Bypassing Captcha (3:14)
Services like 2captcha.com can help you. They provide human help for solving captchas. For this particular tutorial you’d need to create an account there and preferably deposit at least $1 to reproduce the code I’m going to show. Otherwise, this practical exercise will be only theoretical for you.
Scraping HighCharts.JS plots with Selenium (3:02)
In this tutorial, we’ll learn how to parse data from JavaScript-rendered graphics.
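A possible approach, sketched here under assumptions, is to ask the page itself for the data behind a chart: Highcharts keeps its chart objects in the global Highcharts.charts array, and a Selenium driver can read them through execute_script. The function name extract_series is hypothetical, and the sketch assumes the page exposes the Highcharts global and that each series keeps its points in series.data, which may not hold for very large or grouped series.

```python
# JavaScript run inside the page; arguments[0] is the chart index.
EXTRACT_SERIES_JS = """
const chart = Highcharts.charts[arguments[0]];
return chart.series.map(s => ({
    name: s.name,
    points: s.data.map(p => [p.x, p.y]),
}));
"""


def extract_series(driver, chart_index=0):
    """Return {series name: [(x, y), ...]} for one Highcharts chart on the page.

    `driver` is expected to be a Selenium WebDriver that has already loaded
    the page containing the chart.
    """
    raw = driver.execute_script(EXTRACT_SERIES_JS, chart_index)
    return {
        series["name"]: [tuple(point) for point in series["points"]]
        for series in raw
    }
```

Reading the values from the chart object avoids parsing the rendered SVG, which only contains pixel coordinates rather than the original numbers.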

Requirements

Basic knowledge of Python

Description

With the vast amount of data available on the internet, it's no wonder that web scraping has become such a popular tool for extracting information. Whether you're looking to gather data for research purposes or collect information from a competitor's website, web scraping can be a valuable skill in your toolkit. And with this practical web scraping course, you'll learn everything you need to know to start extracting data from any website. So if you're ready to start learning web scraping, this is the course for you.

Right now, the "Practical Web Scraping Course" is an ongoing project, so it will be updated often and will contain the most recent ways to parse data. You'll also get answers to your questions within a short period. Here's the list of all topics that you'll eventually learn within this course:

Tracking HTTP requests in practice
Basic scraping with BS4 and requests libraries
BS4 tools in detail
Efficient scraping with Selenium
Visual Intro to Selenium tools
Dealing with authentication and user sessions
Bypassing Captcha
Scraping dynamic websites
Selenium and pagination
Scraping HighCharts.JS
Use Heroku to host your spiders
Scrapy Introduction
Scrapy integration with DB
[Items below will be added in the next part of the course]
Hosting Scrapy spiders locally
Use schedulers to run Scrapy spiders locally
Ethical scraping tools
Avoid getting banned
Scraping images and PDFs
Real-time scraping

With this course you will be able to:

- Save time by learning modern methods of data scraping

- Get information about the most up-to-date scraping tools and techniques

- Avoid being scammed by others selling outdated courses

- Get your money's worth with a complete and comprehensive course

Who this course is for:

Data Scientists, Software Engineers, Open Internet Enthusiasts

Course content

Introduction: 5 lectures • 5min

Part 1. Basic Scraping Toolkit: 12 lectures • 27min

Part 2. Scrapy: 4 lectures • 8min

Real Life Use Case #1. Run spider on Heroku: 4 lectures • 5min

Real Life Use Case #2. Parse medium.com with Scrapy: 3 lectures • 8min