Lecture description

What Scrapy is, Scrapy vs. other Python-based scraping tools such as BeautifulSoup and Selenium, when you should use Scrapy and when it makes sense to use other tools, pros and cons of Scrapy.

Scrapy is a web crawling framework written in Python. One of its main advantages is that it is built on top of Twisted, an asynchronous networking framework, which means that Scrapy is a) really efficient and b) asynchronous itself. To illustrate why this is a great feature for those of you who don't know what an asynchronous scraping framework is, let's use an example. Imagine you have to call a hundred different people by phone. Normally you'd sit down, dial the first number, and patiently wait for the response on the other end before moving on to the next one. In an asynchronous world, you can dial 20 or 50 numbers at the same time and only process each call once the person on the other end picks up. Hopefully, now it makes sense.
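The phone-call analogy can be sketched in a few lines of Python. Note that this is only an illustration of the asynchronous idea using the standard library's asyncio, not Scrapy or Twisted code; the call, call_everyone, and run_demo names and the made-up phone numbers are assumptions for the example, and the "waiting for pickup" is simulated with a sleep.

```python
import asyncio
import time


async def call(number, pickup_delay):
    # "Dial" and wait for the other person to pick up. While this call
    # is waiting, the event loop is free to make progress on other calls.
    await asyncio.sleep(pickup_delay)
    return f"{number} answered"


async def call_everyone(numbers, pickup_delay):
    # Dial all numbers at once; gather() returns results in input order.
    return await asyncio.gather(*(call(n, pickup_delay) for n in numbers))


def run_demo(count=20, pickup_delay=0.1):
    # Hypothetical phone numbers, used only for this illustration.
    numbers = [f"555-{i:04d}" for i in range(count)]
    start = time.perf_counter()
    answers = asyncio.run(call_everyone(numbers, pickup_delay))
    elapsed = time.perf_counter() - start
    return answers, elapsed


if __name__ == "__main__":
    answers, elapsed = run_demo()
    print(f"{len(answers)} calls finished in {elapsed:.2f}s")
```

Because all the calls wait at the same time, the total time stays close to one pickup delay rather than growing with the number of calls, which is the efficiency gain the analogy describes.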

Scrapy is supported under Python 2.7 and Python 3.3, so depending on your version of Python, you are pretty much good to go. Note that Python 2.6 support was dropped starting with Scrapy 0.20, and Python 3 support was added in Scrapy 1.1.

Scrapy is, in some ways, similar to Django, so if you use or have previously used Django, you will definitely benefit.

Now let's talk about other Python-based web scraping tools. These are specialized libraries with very focused functionality; they are not complete web scraping solutions like Scrapy is. The first two, urllib2 and Requests, are HTTP modules for opening and reading web pages. The other two, Beautiful Soup and lxml, cover the fun part of scraping jobs: extracting data points from the pages loaded with urllib2 or Requests.

First, urllib2's biggest advantage is that it is included in the Python standard library, so as long as you have Python installed, you are good to go. In the past, urllib2 was more popular, but it has since been replaced by another tool called Requests. The documentation for Requests is superb. I think it's even the most popular module for Python, period. If you haven't already, give the docs a read. Unfortunately, Requests doesn't come pre-installed with Python, so you'll have to install it. I personally use it for quick and dirty scraping jobs. Both tools are available on Python 2 and Python 3, although on Python 3 urllib2's functionality lives in the urllib.request module.
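As a rough idea of what "opening a web page" looks like with the standard library, here is a minimal sketch using urllib.request, the Python 3 home of what urllib2 provided. The fetch helper and its timeout default are assumptions for illustration, and the demo uses a data: URL so it runs without network access; in a real job you would pass an http(s) address instead.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    # Open the URL and decode the body using the charset the response
    # declares, falling back to UTF-8 when none is given.
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


if __name__ == "__main__":
    # A data: URL stands in for a real page so the sketch is self-contained.
    print(fetch("data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E"))
```

The returned string is the raw HTML; pulling data points out of it is the job of a parser such as Beautiful Soup or lxml.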

The next tool is called Beautiful Soup and, once again, it's used for extracting data points from the pages that have been loaded. Beautiful Soup is quite robust and handles malformed markup nicely. In other words, if you have a page that does not validate as proper HTML, but you know for a fact that it is an HTML page, then you should try scraping data from it with Beautiful Soup. The name actually comes from the expression 'tag soup', which is used to describe really invalid markup. Beautiful Soup creates a parse tree that can be used to extract data from HTML. The official docs are comprehensive and easy to read, with lots of examples. So Beautiful Soup, just like Requests, is really beginner-friendly, and just like the other scraping tools, it supports both Python 2 and Python 3.
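To give a feel for what tolerating 'tag soup' means, here is a minimal sketch built on the standard library's html.parser module, which is one of the parsers Beautiful Soup can run on top of. This is not Beautiful Soup's own API: the LinkCollector class, the extract_links helper, and the sample markup are all hypothetical, and Beautiful Soup itself gives you a full parse tree rather than this simple callback.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, ignoring how broken the rest is."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(markup):
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# Invalid markup: unclosed <p> and <a> tags, an unquoted attribute,
# and no closing </html>.
TAG_SOUP = "<html><body><p>Menu<a href='/pizza'>Pizza<p><a href=pasta.html>Pasta</body>"

if __name__ == "__main__":
    print(extract_links(TAG_SOUP))
```

Even though the sample markup would not validate, both links are still recovered, which is the kind of leniency the paragraph above credits Beautiful Soup with.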

lxml is similar to Beautiful Soup in that it's used for scraping data. It's the most feature-rich Python library for processing both XML and HTML. It's also really fast and memory efficient. A fun fact is that Scrapy selectors are built on top of lxml, and Beautiful Soup also supports it as a parser. I personally use lxml paired with Requests for quick and dirty jobs. Bear in mind that the official documentation is, to be honest, not that beginner-friendly. So if you haven't used a similar tool in the past, start with examples from blogs or other sites; they will probably make a bit more sense than the official docs.

Another tool for scraping is called Selenium. Selenium is first of all a tool for writing automated tests for web applications. It's used for web scraping mainly because a) it's beginner-friendly and b) it can handle sites that use JavaScript. If a site is heavy on JavaScript, which more and more sites are, Selenium is a good option. It makes extracting data easy if you are a beginner or if the JavaScript interactions are very complex, for example when there is a bunch of GET and POST requests. I sometimes use Selenium on its own and sometimes paired with Scrapy. Most of the time when I'm using it with Scrapy, I iterate over JavaScript-heavy pages and then use Scrapy Selectors to grab the HTML that Selenium produces. Currently, supported Python versions for Selenium are 2.7 and 3.3+. Overall, Selenium support is really extensive, and it provides bindings for languages such as Java, C#, Ruby, Python of course, and JavaScript. The official Selenium docs are great and easy to grasp, and you can give them a read even if you are a complete beginner; in two hours you will have it pretty much figured out. Bear in mind that, in my testing, scraping a thousand pages from Wikipedia was 20 times faster in Scrapy than in Selenium, believe it or not. On top of that, Scrapy consumed a lot less memory, and its CPU usage was a lot lower than Selenium's.

So, back to Scrapy's main pros and when to use it. First and foremost, it's asynchronous. Furthermore, if you are building something robust and want to make it as efficient as possible, with lots of flexibility and lots of options, then you should definitely use Scrapy.

One case where using one of the previously mentioned tools makes sense is a project where you only need to load a single page, say the home page of your favorite restaurant's website, and check whether your favorite dish is on the menu. For this type of case you should not use Scrapy because, to be honest, it would be overkill.

One of the drawbacks of Scrapy is that, since it's a full-fledged framework, it's not that beginner-friendly, and the learning curve is a little steeper than with some other tools. Installing Scrapy can also be a tricky process, especially on Windows. But bear in mind that you have a lot of resources online for this: I'm not even kidding, there are probably a thousand blog posts about installing Scrapy on your specific operating system.

Learn more from the full course

Scrapy: Powerful Web Scraping & Crawling with Python

Python Scrapy Tutorial - Learn how to scrape websites and build a powerful web crawler using Scrapy, Splash and Python

10:33:50 of on-demand video • Updated January 2020

Creating a web crawler in Scrapy

Crawling single or multiple pages and scraping data

Deploying & Scheduling Spiders to ScrapingHub

Logging into Websites with Scrapy

Running Scrapy as a Standalone Script

Integrating Splash with Scrapy to scrape JavaScript rendered websites

Using Scrapy with Selenium in Special Cases, e.g. to Scrape JavaScript Driven Web Pages

Building Scrapy Advanced Spider

More functions that Scrapy offers after Spider is Done with Scraping

Editing and Using Scrapy Parameters

Exporting data extracted by Scrapy into CSV, Excel, XML, or JSON files

Storing data extracted by Scrapy into MySQL and MongoDB databases

Several real-life web scraping projects, including Craigslist, LinkedIn and many others

Python source code for all exercises in this Scrapy tutorial can be downloaded

Q&A board to send your questions and get them answered quickly


Hey there! So today, we are going to learn about Scrapy: what Scrapy is overall, Scrapy versus other Python-based scraping tools, why you should use it and when it makes sense to use some other tools, and the pros and cons of Scrapy. And that would be it. So let's begin!

Scrapy, overall, is a web crawling framework written in Python. One of its main advantages is that it's built on top of Twisted, an asynchronous networking framework, which in other words means that it's a) really efficient, and b) Scrapy is an asynchronous framework. So, to illustrate why this is a great feature, for those of you that don't know what an asynchronous scraping framework means, I'll use an enlightening example. So, imagine you have to call a hundred different people by phone. Well, normally you'd do it by sitting down and then dialing the first number, and then patiently waiting for the response on the other end. In an asynchronous world, you can pretty much dial the first 20 or 50 phone numbers, and then only process those calls once the person on the other end picks up the phone. Hopefully, now it makes sense.

Scrapy is supported under Python 2.7 and Python 3.3. So, depending on your version of Python, you are pretty much good to go. An important thing to note is that Python 2.6 support was dropped starting with Scrapy 0.20. So just bear that in mind. And Python 3 support was added in Scrapy 1.1.

Scrapy, in some ways, is similar to Django. So those of you that use or have previously used Django will definitely benefit.

Now let's talk more about other Python-based scraping tools. And bear in mind that these are specialized libraries with very focused functionality, and they don't claim to be, or they are not really, a complete web scraping solution like Scrapy is. The first two, urllib2 and then Requests, are modules for reading or opening web pages, so HTTP modules. The other two are Beautiful Soup and then lxml. These are for the fun part of the scraping jobs, or really for extracting data points from those pages that are loaded with urllib2 and then Requests.

Let's get back to urllib2. urllib2's biggest advantage is that it's included in the Python standard library, so it's batteries-included, and as long as you have Python installed, you are good to go. In the past, it was more popular, but since then another tool replaced it. And that tool, believe it or not, is called Requests. The docs are superb for Requests. I think it's even the most popular module for Python, period. And once again, the docs are just amazing, so if you haven't already, just give them a read. Requests, unfortunately, doesn't come pre-installed with Python, so you'll have to install it. I personally use it for quick and dirty scraping jobs, and both urllib2 and Requests are supported with Python 2 and Python 3.

The next tool is called Beautiful Soup and, once again, it's used for extracting data points from the pages that are loaded, okay? Beautiful Soup is quite robust and it handles malformed markup nicely. So, in other words, if you have a page that is not getting validated as proper HTML, but you know for a fact that it's a page and that it's specifically an HTML page, then you should give it a try, scraping data from it with Beautiful Soup. So actually the name came from the expression 'tag soup', which is used to describe really invalid markup. Beautiful Soup creates a parse tree that can be used to extract data from HTML. The official docs are comprehensive, easy to read, and with lots of examples. So, just like with Requests, they are really beginner-friendly. And just like the other tools for scraping, Beautiful Soup also works with Python 2 and Python 3.

And now let's talk about lxml. Now, lxml is similar to Beautiful Soup, so once again, it's used for scraping data. It's the most feature-rich Python library for processing both XML and HTML. It's also really fast and memory efficient. A fun fact is that Scrapy selectors are built over lxml and, for example, Beautiful Soup also supports it as a parser. Just like with Requests, I personally use lxml in pair with Requests, of course, as previously mentioned, for quick and dirty jobs. Bear in mind that the official documentation is not that beginner-friendly, to be honest. And so if you haven't already used a similar tool in the past, use examples from blogs or other sites. They'll probably make a bit more sense than the official docs.

The last tool for scraping is called Selenium. So, to paraphrase, Selenium is first of all a tool for writing automated tests for web applications. It's used for web scraping mainly because it's beginner-friendly, and if a site is heavy on JavaScript, which more and more sites are, Selenium is a good option because, once again, it's easy to extract the data if you are a beginner or if JavaScript interactions are very complex, if we have a bunch of GET and POST requests. I use it sometimes solely or in pair with Scrapy. And most of the time when I'm using it with Scrapy, I kind of try to iterate over, once again, JavaScript-heavy pages and then use Scrapy Selectors to grab the HTML that Selenium produces. Currently, supported Python versions for Selenium are 2.7 and 3.5+. Overall, Selenium support is really extensive. And it provides bindings for languages such as Java, C#, Ruby, Python of course, and then JavaScript. Selenium official docs are, once again, great and easy to grasp. And you can probably give them a read even if you are a complete beginner. And in two hours you will pretty much figure it all out.

Bear in mind that, from my testing, for example, scraping a thousand pages from Wikipedia was 20 times faster, believe it or not, in Scrapy than in Selenium. Also, on top of that, it [i.e. Scrapy] consumed a lot less memory, and CPU usage was a lot lower with Scrapy than with Selenium.

So, back to the Scrapy main pros and when to use Scrapy. Of course, first and foremost, it's asynchronous. But if you are building something robust and want to make it as efficient as possible, with lots of flexibility and a bunch of functions, then you should definitely use it.

One case where using some other tools, like the previously mentioned tools, kind of makes sense is if you had a project where you need to load the home page of, let's say, your favorite restaurant and check if they are having your favorite dish on the menu. And then for this type of case, you should not use Scrapy because, to be honest, it would be overkill.

One of the drawbacks of Scrapy is that, since it's really a full-fledged framework, it's not that beginner-friendly, and the learning curve is a little steeper than some other tools. Also, installing Scrapy is a tricky process, especially on Windows. But bear in mind that you have a lot of resources online for this, and this pretty much means that you have, I'm not even kidding, probably a thousand blog posts about installing Scrapy on your specific operating system.

And that's it for this video. So thanks for watching, and I'll see you in the very next video, where I will discuss installing Scrapy. Bye!

More about this course

Scrapy vs. Beautiful Soup vs. Selenium
