
Learn how HTTP status codes classify server responses for web scraping, from 1xx informational to 5xx server errors, with examples like 200 OK, 301 Moved Permanently, and 404 Not Found.
Explore how request headers identify clients, influence access controls, and how simulating a Google crawler demonstrates header-driven scraping concepts and the ethics of access.
Explore how proxies act as intermediaries between client and server, including caching proxies and load balancers, and how residential proxies rotate IP addresses to obfuscate scraping requests.
Master the basics of HTML markup, understand semantic versus structural tags, and learn how proper tagging improves accessibility and search engine understanding for effective web scraping.
Explore how JavaScript adds behavior to static web pages by linking a button to hide a link, using DOM selection, event listeners, and simple show-hide logic.
JavaScript drives dynamic HTML and CSS, enabling single page apps; learn to scrape such pages by using headless browsers to render and extract content.
Explore how to embed CSS and JavaScript directly in HTML using style and script tags, including inline styles and external scripting, to build a single, ready-to-scrape web page.
Learn how authentication verifies identity and authorization grants access, and implement common methods like API keys in headers, bearer tokens, and basic auth for API requests.
Go beyond GET: send DELETE, POST, PUT, PATCH, HEAD, and OPTIONS requests using the requests library and httpbin, which echoes back your requests for testing.
Learn sending data in the request body with the Python requests library, comparing form encoded data and json payloads, and understanding their encoding and content type.
Navigate the parsed HTML tree with tags and attributes in Beautiful Soup, explore multi-valued attributes like class, and learn to select nested elements for data extraction.
Learn how HTML tags form a hierarchy from root to children and descendants, including ul, li, and a, using Beautiful Soup to navigate, filter navigable strings, and access relationships.
Learn how to navigate siblings with Beautiful Soup, moving horizontally between tags at the same tree depth using next_sibling and previous_sibling.
Extract all text content from HTML using Beautiful Soup's .strings and .stripped_strings attributes, compare their results, and understand when to preserve or discard whitespace.
Refine the book data extraction by cleaning price strings to a float, using regex substitutions, and mapping ratings to integers, returning dictionaries with title, price, and rating.
Master pandas by turning a list of dictionaries into a dataframe for efficient data handling. Filter with boolean masks, calculate averages, and export books data to CSV, JSON, or Excel.
Target page elements by id or attribute using Beautiful Soup's find and find_all, then build complex filters with anonymous functions and lambda expressions for flexible scraping.
Explore find and find_all, plus select and select_one, to see how single-tag results differ from list results and how to verify element identity.
Learn how stocks, tickers, and exchanges map to a portfolio, fetch prices from Google Finance, manage multi-exchange pricing, and convert currencies for a unified view.
Use Python dataclasses to define stock, position, and portfolio types with ticker, exchange, currency, and price fields; employ __post_init__ to fetch price data and prepare portfolio valuation.
Define position and portfolio as data classes to pair stocks with quantities and compute the portfolio's total value in USD from stock prices sourced via Google Finance.
Master the network tab to monitor HTTP requests, headers, and responses, then replicate server requests in Python to scrape data without parsing HTML or CSS.
Explore how to discover coffee shop locations by inspecting network requests, identifying GraphQL post queries that return restaurant data in JSON, and reproducing them with a Python workflow.
Replicate a web scraping request to source open positions from a career page using Python, intercept and replicate requests, and return position details with location, type, distance, and requirements.
Explore the adjacent sibling combinator, signified by the plus sign, to select elements that are siblings of a given element. Compare it with the general sibling combinator to choose wisely.
Master simple, compound, and complex selectors and selector lists, and use descendant, child, and sibling combinators to target elements for web scraping with precision.
Write a Python scraper that downloads the highest resolution Unsplash images by keyword, excluding premium watermarked ones, using both HTML and API approaches with pagination, saving images locally.
Prospect the site by analyzing image search results, filtering ads and premium images, and inspecting HTML structure to identify reliable selectors; compare HTML parsing with the API for high‑resolution images.
Explore calling the Unsplash API with Python, fetch images via GET requests, parse photo results to extract full-resolution image URLs, and handle pagination for future downloads.
Learn how to paginate API requests by constructing URLs with page and per_page parameters, filtering by photos, and looping through pages to collect a target number of images.
Learn to render a JavaScript-heavy site with Playwright and selectolax, install Chromium, and extract the budget value by waiting for network idle and parsing the HTML content.
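As a small illustration of the price-cleaning and rating-mapping step described in the book data lecture above, here is a minimal sketch using only the standard library. The function names, the rating-word mapping, and the sample values are assumptions made for this example, not the course's actual code.

```python
import re

# Assumed mapping from rating words to integers; the real page may differ.
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_price(raw: str) -> float:
    """Remove everything except digits and the decimal point, then convert."""
    return float(re.sub(r"[^0-9.]", "", raw))


def to_record(title: str, raw_price: str, rating_word: str) -> dict:
    """Return a dictionary with title, price, and rating keys."""
    return {
        "title": title.strip(),
        "price": clean_price(raw_price),
        "rating": RATING_WORDS.get(rating_word, 0),  # 0 marks an unknown rating
    }


# Made-up sample values standing in for scraped strings.
record = to_record("  Example Book ", "£51.77", "Three")
print(record)
```

A list of such dictionaries can be handed straight to a pandas DataFrame for filtering and export.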
Welcome to the Ultimate Web Scraping With Python Bootcamp, the only course you need to go from a complete beginner in Python to a very competent web scraper.
Web scraping is the process of programmatically extracting data from the web. Scraping agents visit a web resource, extract content from it, and then process the resulting data in order to parse some specific information of interest.
Scraping is the kind of programming skill that offers immediate feedback, and can be used to automate a wide variety of data collection and processing tasks.
Over the next 17+ hours, we will methodically cover everything you need to know to write web scraping agents in Python.
This bootcamp is organized in three parts of increasing difficulty, designed to help you progressively build your skills.
Part I - Begin
We'll start by understanding how the web works by taking a closer look at HTTP, the key application layer communication protocol of the modern web. Next, we'll explore HTML, CSS, and JavaScript from first principles to get a deeper understanding of how websites are built. Finally, we'll learn how to use Python to send HTTP requests and parse the resulting HTML, CSS, and JavaScript to extract the data we need. Our goal in the first part of the course is to build a solid foundation in both web scraping and Python, and to put those skills into practice by building functional web scrapers from scratch. Selected topics include:
a detailed overview of the request-response cycle
understanding user-agents, HTTP verbs, headers and statuses
understanding why custom headers can often be used to bypass paywalls
mastering the requests library to work with HTTP in Python
what stateless means and how cookies work
exploring the role of proxies in modern web architectures
mastering beautifulsoup for parsing and data extraction
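To make the verbs, headers, and statuses from the list above a little more concrete, here is a minimal sketch. It uses the standard library's urllib and http modules rather than the requests library, so that it runs without installing anything, and it only builds a request without sending it. The URL and the User-Agent value are placeholders.

```python
from http import HTTPStatus
from urllib.request import Request


def status_class(code: int) -> str:
    """Return the family a status code belongs to, based on its first digit."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return families.get(code // 100, "unknown")


# Build (but do not send) a HEAD request with a custom User-Agent header.
request = Request(
    "https://example.com/",
    headers={"User-Agent": "my-scraper/0.1"},
    method="HEAD",
)

print(request.get_method())
# urllib stores header names capitalized, hence "User-agent".
print(request.get_header("User-agent"))

for code in (200, 301, 404, 503):
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```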
Part II - Refine
In the second part of the course, we'll build on the foundation we've already laid to explore more advanced topics in web scraping. We'll learn how to scrape dynamic websites that use JavaScript to render their content, by setting up Microsoft Playwright as a headless browser to automate this process. We'll also learn how to identify and emulate API calls to scrape data from websites that don't have formally public APIs. Our projects in this section will include an image scraper that can download a set number of high-resolution images given some keyword, as well as another scraping agent that extracts price and content of discounted video games from a dynamically rendered website. Topics include:
identifying and using hidden APIs and understanding the benefits they offer
emulating headers, cookies, and body content with ease
automatically generating Python code from intercepted API requests using Postman and HTTPie
working with the highly performant selectolax parsing library
mastering CSS selectors
introducing Microsoft Playwright for headless browsing and dynamic rendering
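When emulating a hidden API for something like the image scraper described above, much of the work is building the right sequence of paginated URLs. The sketch below shows one way to do that with the standard library. The endpoint and the query, page, and per_page parameter names are hypothetical, and a real API may use different ones.

```python
import math
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://example.com/api/search/photos"


def page_urls(query: str, target: int, per_page: int = 20) -> list[str]:
    """Build one URL per page needed to collect `target` results."""
    pages = math.ceil(target / per_page)
    urls = []
    for page in range(1, pages + 1):
        params = urlencode({"query": query, "page": page, "per_page": per_page})
        urls.append(f"{BASE_URL}?{params}")
    return urls


# 45 results at 20 per page requires three requests.
for url in page_urls("mountain lake", target=45):
    print(url)
```

Each URL could then be fetched in a loop, stopping early once enough results have been collected.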
Part III - Master
In the final part of the course, we'll introduce scrapy. This will give us an excellent, time-tested framework for building more complex and robust web scrapers. We'll learn how to set up scrapy within a virtual environment and how to create spiders and pipelines to extract data from websites in a variety of formats. Having learned how to use scrapy, we'll then explore how to integrate it with Playwright so that we can tackle the challenge of scraping dynamic websites from right within scrapy. We'll conclude this section by building a scraping agent that executes custom JavaScript code before returning the resulting HTML to scrapy. Some topics from this section:
setting up scrapy and exploring its command line interface ("the scrapy tool")
dynamically exploring response objects using scrapy shell
understanding and defining item schemas and loading data using itemloaders and input/output processors
integrating Playwright into scrapy to tackle dynamically rendered JavaScript sites
writing PageMethods to give precise instructions to the headless browser from right within scrapy
defining custom pipelines for saving into SQL databases and highly customized output formats
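As a rough idea of what a custom pipeline for saving into a SQL database can look like, here is a minimal sketch. A scrapy item pipeline is a plain Python class with a process_item method and optional open_spider and close_spider hooks, so the example below does not import scrapy and drives the hooks by hand. The class name, table name, and item fields are assumptions made for illustration.

```python
import sqlite3


class SQLitePipeline:
    """Pipeline-style class that stores each item as a row in SQLite."""

    def __init__(self, db_path: str = ":memory:"):
        self.db_path = db_path
        self.connection = None

    def open_spider(self, spider):
        # Called once when the crawl starts: open the database, create the table.
        self.connection = sqlite3.connect(self.db_path)
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS games (title TEXT, price REAL)"
        )

    def process_item(self, item, spider):
        # Called for every scraped item: insert it, then pass it along.
        self.connection.execute(
            "INSERT INTO games (title, price) VALUES (?, ?)",
            (item["title"], item["price"]),
        )
        self.connection.commit()
        return item

    def close_spider(self, spider):
        # Called once when the crawl ends.
        self.connection.close()


# Exercise the hooks manually, without running a crawl.
pipeline = SQLitePipeline()
pipeline.open_spider(spider=None)
pipeline.process_item({"title": "Example Game", "price": 9.99}, spider=None)
rows = pipeline.connection.execute("SELECT title, price FROM games").fetchall()
print(rows)
pipeline.close_spider(spider=None)
```

In a real project the class would be enabled through the ITEM_PIPELINES setting, and scrapy would call these methods itself.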
In this bootcamp, I will take you step-by-step through engaging video lectures and teach you everything you need to know to get started with web scraping in Python.
By the end of this course, you will have a complete toolset to conceptualize and implement scraping agents for any website you can imagine.
See you inside!