Advanced Web Scraping with Python using Scrapy & Splash

Rating: 4.5 (499 reviews)

The most advanced web scraping & crawling course using Scrapy & Splash! Take your web scraping skills to the next level.

Created by Ahmed Rafik

Last updated 8/2020

English

What you'll learn

Advanced web scraping techniques
Best techniques to analyse a website before scraping it
Write clean spiders
Optimize Splash scripts
Bypass 504 HTTP errors
Build a Splash cluster
Bypass Google reCAPTCHA (without solving it)
Build desktop apps for Scrapy spiders (Tkinter)
ScrapyRT
Showcase scraped data using ScrapyRT & Flask
Heavy data processing
Input & Output processors

Course content

6 sections • 48 lectures • 5h 35m total length

Development Environment (Walkthrough) • 5:29
Set up a Python web scraping workflow by installing Anaconda, creating an environment for Scrapy in Anaconda Navigator, installing Scrapy, pep8, and pylint, and using VS Code with the Python extension.
Installing Splash (Windows Pro/Enterprise edition & Mac OS) • 6:32
Installing Splash (Windows Home Edition) • 4:36
Installing Splash (Linux) • 1:23
Udemy 101 • 1:21
Asking questions • 0:26

Project Intro • 9:42
Understanding the API • 7:30
Consuming the API PART 1 • 6:58
Implement the update-query and get-inscriptions requests in Scrapy by overriding start_requests. Send a POST request with a JSON payload, configure the headers, and set a callback to process the response.
Code update (Handle 555 & 403 HTTP responses) • 2:52
Consuming the API PART 2 • 17:28
XHR Pagination • 5:42
Summary Page • 15:15
Determine whether a page needs JavaScript, then use Splash or Selenium to render the HTML and scrape listing addresses and descriptions with Scrapy, building absolute URLs and passing metadata between requests.
Bypass 504 HTTP Error (Method 1) • 7:25
Increase the virtual machine's memory and CPU allocation for Splash in Docker, then run Splash with an "always" restart policy and port mapping to reduce 504 Gateway Timeout errors when scraping.
Bypass 504 HTTP Error (Method 2) • 8:53
Learn to bypass 504 Gateway Timeout errors by optimizing Splash scripts for Scrapy and Splash. Disable images and JavaScript, abort CSS requests, and return the HTML to speed up scraping.
Bypass 504 HTTP Error (Method 3) • 7:56
Project source code • 0:07
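
The script optimization described under Method 2 could look roughly like the sketch below. It is not the course's own code. The Lua source relies on Splash scripting features (splash.images_enabled, splash.js_enabled, splash:on_request, request:abort, splash:html), while the ".css" substring filter, the build_execute_payload helper, the timeout value, and the example URL are assumptions made purely for illustration.

```python
import json

# Lua script for Splash's "execute" endpoint. Assumption: the target page
# still exposes the wanted data with images and JavaScript switched off.
LUA_SCRIPT = """
function main(splash, args)
    splash.images_enabled = false   -- skip image downloads
    splash.js_enabled = false       -- skip JavaScript execution
    splash:on_request(function(request)
        -- crude filter (assumption): drop any request whose URL contains ".css"
        if string.find(request.url, ".css", 1, true) then
            request:abort()
        end
    end)
    assert(splash:go(args.url))
    return splash:html()
end
"""


def build_execute_payload(url, timeout=60):
    """Return the JSON body for a POST to Splash's /execute endpoint.

    Hypothetical helper: the name and the default timeout are illustrative.
    """
    return json.dumps({"lua_source": LUA_SCRIPT, "url": url, "timeout": timeout})


# Placeholder URL, not one of the sites scraped in the course.
payload = build_execute_payload("https://example.com/listing")
```

With the scrapy-splash plugin, the same script would typically be passed as the lua_source argument of a SplashRequest that targets the execute endpoint. Every request that Splash does not have to make shortens rendering time, which is what keeps the response under the gateway timeout.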

Project Intro • 5:52
Explore scraping the official Steam store to pull the top-selling games, including image URL, game name, platform, release date, user reviews, and pricing, by building a Scrapy spider.
Extracting data PART 1 • 12:09
Define each data point as a field in items.py, instantiate the Steam item, and extract the game URL, image URL, game name, release date, and pricing with XPath.
Extracting data PART 2 • 12:20
Extracting data PART 3 • 6:45
Extracting data PART 4 • 17:36
Pagination • 5:56
ItemLoader • 7:35
Data processing PART 1 • 2:25
Data processing PART 2 • 7:07
Process the review summary field by removing HTML tags and converting it to a string, then extract the supported platforms from class attributes using MapCompose and ItemLoader.
Data processing PART 3 • 8:27
Project source code • 0:07
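
The two clean-up steps named under Data processing PART 2 can be sketched with the standard library alone. This is a simplified stand-in, not Scrapy's implementation: map_compose only imitates the idea behind the MapCompose input processor (run each function, in order, over every extracted value), the tag stripping uses a regular expression instead of a library helper, and the class names and sample strings are made-up inputs.

```python
import re


def map_compose(*functions):
    """Simplified stand-in for the idea behind Scrapy's MapCompose: run each
    function, in order, over every value and drop values that become None."""
    def process(values):
        for func in functions:
            values = [func(value) for value in values]
            values = [value for value in values if value is not None]
        return values
    return process


def remove_tags(text):
    """Replace anything that looks like an HTML tag with a space (regex shortcut)."""
    return re.sub(r"<[^>]+>", " ", text)


def squeeze_spaces(text):
    """Collapse runs of whitespace into single spaces."""
    return " ".join(text.split())


def platform_from_class(class_attr):
    """'platform_img win' -> 'win'. The class layout is an assumed example."""
    parts = class_attr.split()
    return parts[-1] if len(parts) > 1 else None


clean_summary = map_compose(remove_tags, squeeze_spaces)
clean_platforms = map_compose(platform_from_class)

# Made-up sample values standing in for what the XPath selectors would return.
summary = clean_summary(["Very Positive<br>85% of the user reviews are positive"])
platforms = clean_platforms(["platform_img win", "platform_img mac", "vr_required"])
```

In a real project the cleaning functions would be attached to item fields as input processors, so the spider code stays free of string handling.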

ScrapyRT • 8:18
Using Flask with ScrapyRT • 7:28
Flask templates PART 1 • 7:16
Learn how to render templates in Flask with render_template, set up the templates folder, and pass variables and lists using Jinja to create dynamic HTML.
Flask templates PART 2 • 6:56
Flask templates PART 3 • 8:29
Modify a Flask app to render index.html with a games list, build a Bootstrap grid of three-card rows, and display each game's image, name, platforms, and URL.
Project source code • 0:07
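
Two small pieces of the web app described above can be sketched without Flask itself: the URL a view would call on a locally running ScrapyRT server, and the grouping of scraped games into rows of three for the Bootstrap grid. The spider name best_selling, the helper names, and the placeholder records are assumptions; port 9080 and the crawl.json endpoint with its spider_name and start_requests parameters follow ScrapyRT's documented defaults.

```python
from urllib.parse import urlencode


def scrapyrt_crawl_url(spider_name, host="localhost", port=9080):
    """Build a ScrapyRT crawl.json URL that runs the spider's start_requests."""
    query = urlencode({"spider_name": spider_name, "start_requests": "true"})
    return f"http://{host}:{port}/crawl.json?{query}"


def rows_of(items, size=3):
    """Split a list into consecutive rows of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder records standing in for the "items" list of a ScrapyRT response.
games = [{"game_name": f"Game {n}"} for n in range(1, 8)]
rows = rows_of(games)
url = scrapyrt_crawl_url("best_selling")  # hypothetical spider name
```

A Flask view could fetch this URL, read the items list from the JSON response, and pass the rows to render_template so the Jinja template loops over rows first and over the cards in each row second.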

Locate the API • 5:04
ReCaptcha Response • 6:44
Testing the API • 4:59
Spoofing Cookie header + Custom Cookie parser • 8:29
Parsing JSON Objects • 10:28
Advanced pagination • 16:25
Master pagination techniques for web scraping listings across pages. Decode and update the search query state to inject the next page number and build the new URL with Scrapy.
Media Pipelines PART 1 • 5:04
Media Pipelines PART 2 • 4:38
Project source code • 0:07
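
The pagination idea described under Advanced pagination (decode the search query state, inject the next page number, rebuild the URL) can be sketched with the standard library. The parameter name searchQueryState, the pagination and currentPage keys, and the example URL are assumptions about what such a state might look like, not details confirmed by the course.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def next_page_url(url, state_param="searchQueryState"):
    """Return `url` with the page number inside its JSON state bumped by one."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)                # values are URL-decoded here
    state = json.loads(query[state_param][0])    # decode the JSON state
    pagination = state.setdefault("pagination", {})
    # Assumption: a missing page number means we are on page 1.
    pagination["currentPage"] = pagination.get("currentPage", 1) + 1
    query[state_param] = [json.dumps(state, separators=(",", ":"))]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


# Placeholder first-page URL with a made-up search term.
first = "https://example.com/search?" + urlencode(
    {"searchQueryState": json.dumps({"usersSearchTerm": "Miami, FL"})}
)
second = next_page_url(first)
```

Inside a spider, the rebuilt URL would simply be yielded as the next request until the API stops returning results.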

Desktop APP PART 1 • 15:19
Design a Tkinter-based desktop app to run Scrapy spiders, allowing dynamic spider selection, feed type choice (JSON or CSV), and inputs for the output path and dataset name.
Desktop APP PART 2 • 9:20
Load the project's spiders dynamically into the desktop app using the spider loader to populate a dropdown, then capture the chosen spider and prepare execution with the execute button.
Desktop APP PART 3 • 7:38
Desktop APP PART 4 (Threading) • 7:04
Project source code • 0:07
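
The logic behind the execute button can be sketched separately from the Tkinter widgets. The version below is an assumption about one way to do it, not the course's implementation: it turns the form inputs into a scrapy crawl command with the -o output option and starts it in a worker thread so the window does not freeze. The helper names, the spider name, and the paths are hypothetical.

```python
import subprocess
import threading
from pathlib import Path


def build_crawl_command(spider, feed_type, folder, dataset_name):
    """Translate the form inputs into a `scrapy crawl` command line."""
    if feed_type not in ("json", "csv"):
        raise ValueError(f"unsupported feed type: {feed_type}")
    output = Path(folder) / f"{dataset_name}.{feed_type}"
    return ["scrapy", "crawl", spider, "-o", str(output)]


def run_in_background(command, project_dir):
    """Start the crawl in a worker thread so the GUI event loop stays responsive."""
    worker = threading.Thread(
        target=subprocess.run,
        args=(command,),
        kwargs={"cwd": project_dir},
        daemon=True,
    )
    worker.start()
    return worker


# Hypothetical form values; nothing is executed here.
command = build_crawl_command("best_selling", "csv", "feeds", "steam_games")
```

A button callback would call run_in_background(command, project_dir) with the Scrapy project folder, leaving the Tkinter main loop free to redraw while the spider runs.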

Requirements

PC or Mac with internet access.
Having completed a couple of projects with Scrapy & Splash is strictly required.
Basic knowledge of element selection with XPath is also strictly required.

Description

Hi there, and welcome to the most advanced online resource on web scraping with Python using Scrapy & Splash. This course is fully project-based: in nearly every section we scrape a different website and tackle a different web scraping problem. Rather than covering the basics of Scrapy & Splash, we dive straight into real-world projects, so this course is not suitable for beginners with no background in web scraping, Scrapy, Splash, and XPath expressions.

This course covers a variety of topics, such as:

Request chaining: some requests must be sent in a specific order, otherwise they won't be fulfilled at all.
How to analyze a website before scraping it. This step matters because it helps you choose the right tools for the site and has a large impact on the performance of your final product.
How to optimize Splash scripts by reducing or aborting all requests that have nothing to do with the data points you want to scrape. This is essential for Splash performance and is the key to bypassing 504 Gateway Timeout errors in Splash.
How to build a cluster of Splash instances behind a load balancer (HAProxy) rather than relying on one overloaded Splash instance. This also helps in bypassing 504 Gateway Timeout errors.
Heavy data processing: you'll understand how input & output processors work so you can use them to clean the scraped data points and ensure the quality of your feeds.
Using ScrapyRT (Scrapy RealTime) to build spiders that fetch data in real time.
Showcasing the scraped data points in a minimalist web app using ScrapyRT & Flask, which is extremely helpful for web scraping freelancers.
Bypassing Google reCAPTCHA. Don't get me wrong on this point: we won't solve it with Scrapy. Instead, I'll show you a technique I use frequently to make a website believe the request was sent from a browser by a human being.
Building clean & well-structured spiders.
Finally, building a desktop app with Tkinter. The app fetches and executes all the spiders available in your Scrapy project, and lets you choose the feed type, location & name. This is especially useful if you're a web scraping freelancer, since delivering a desktop app to your client is usually better than installing Scrapy on their machine.

This course is straight to the point: there's no "foobar" or quotes.toscrape.com as in other courses, so make sure you bring plenty of focus, determination & motivation.

By the end of this course, you'll have sharpened your web scraping skills with Scrapy & Splash and be able to write clean, high-performing spiders that set you apart from others. If you're a web scraping freelancer, this also means you can win more offers by delivering user-friendly spiders with a graphical user interface (GUI) or web apps that fetch data in real time.

So join me in this course & let's harvest the web together!

Who this course is for:

Anyone who wants to learn advanced web scraping techniques
Anyone who wants to learn how to turn Scrapy projects into desktop/web apps
Web scraping freelancers

Course content

Introduction • 6 lectures • 20min

Centris Canada • 11 lectures • 1hr 30min

Steam Store • 11 lectures • 1hr 26min

Build Web App (ScrapyRT + Flask) • 6 lectures • 39min

Zillow • 9 lectures • 1hr 2min

Scrapy & Tkinter for Desktop Apps • 5 lectures • 39min
