Scrapy Simple Spider - Part 2

GoTrained Academy
A free video tutorial from GoTrained Academy
eLearning Professionals
4.2 instructor rating • 9 courses • 50,233 students

Learn more from the full course

Scrapy: Powerful Web Scraping & Crawling with Python

Python Scrapy Tutorial - Learn how to scrape websites and build a powerful web crawler using Scrapy, Splash and Python

10:33:50 of on-demand video • Updated January 2020

  • Creating a web crawler in Scrapy
  • Crawling single or multiple pages and scraping data
  • Deploying & Scheduling Spiders to ScrapingHub
  • Logging into Websites with Scrapy
  • Running Scrapy as a Standalone Script
  • Integrating Splash with Scrapy to scrape JavaScript rendered websites
  • Using Scrapy with Selenium in Special Cases, e.g. to Scrape JavaScript Driven Web Pages
  • Building an Advanced Scrapy Spider
  • More functions that Scrapy offers after the Spider is done scraping
  • Editing and Using Scrapy Parameters
  • Exporting data extracted by Scrapy into CSV, Excel, XML, or JSON files
  • Storing data extracted by Scrapy into MySQL and MongoDB databases
  • Several real-life web scraping projects, including Craigslist, LinkedIn and many others
  • Python source code for all exercises in this Scrapy tutorial can be downloaded
  • Q&A board to send your questions and get them answered quickly
Hey there! So today we are going to cover how to start scraping data with Scrapy, and the data points that we are going to collect. The first one is going to be this one, the header, and the second one is going to be this list of top ten tags. The site that the data is going to be scraped from is called Quotes to Scrape. Let's begin right away.

Let's open our Terminal, maximize the window once again and zoom in a little bit, something like this. The first thing, of course, is scrapy, and the best way to figure out how to extract data with Scrapy is to try it in the Scrapy shell. The Scrapy shell is built on IPython, so it has magic functions, autocomplete, etc. So the command is located here; we type in just scrapy shell. Let's copy and paste this and hit Enter. You will see a bunch of different output. You don't need to worry about all of this stuff at the beginning... it is really just a bunch of INFO and DEBUG messages that print out the different middlewares and extensions that are enabled by default, the overridden settings, etc.

The most important thing you need to know right now is that you should use the previously discussed "fetch" function to actually fetch the URL. To do that, you type in fetch and then open and closed parentheses, and then, in either single or double quotes, you input the URL of the site that you would like to open with Scrapy. So let's copy and paste this URL into our Terminal, paste it in here and hit Enter. Now we get two different rows of data. The first one is just the date and time the message was sent, plus an INFO message that the spider is opened. The second one is a DEBUG message that the page was crawled. This 200 in parentheses indicates that the response was successful. Most of you know that 404 would probably be returned if the page was not found, 301 if it is redirected, etc. So as long as it is in the 200s or 300s, we are pretty much good to go.
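The rule of thumb about status codes can be sketched as a tiny helper. This is just an illustration of the 2xx/3xx-is-fine logic described above; Scrapy itself handles this via its own redirect and HTTP-error middlewares, and the function name here is my own invention.

```python
# Sketch of the instructor's rule of thumb: 2xx and 3xx responses are
# workable, 4xx/5xx are not. (Illustration only -- Scrapy decides this
# internally through its redirect and HTTPERROR middlewares.)
def is_usable(status: int) -> bool:
    """Return True for status codes in the 2xx and 3xx ranges."""
    return 200 <= status < 400

print(is_usable(200))  # successful fetch
print(is_usable(301))  # redirect, still workable
print(is_usable(404))  # page not found
```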
And as you can see, this is the URL that we specified. So let's go back to the site. The way that you normally inspect the source code is by right-clicking and either going into the source code, View page source, or going straight to the element that you would like to scrape, which in our case is going to be this one: right-click, and then click Inspect. You will get the elements here, and you will see that, as I hover over the <h1> and then the <a> HTML node, we get different highlights, as you can see. Headers are mostly located in the <h1> tag.

In the Scrapy shell you will see either "response" or "request". "response" is pretty much what the name indicates: the response that was returned, in this case 200 as a successful one, along with the URL of the response. "request" is for requesting URLs, building a FormRequest, etc.; that will be covered a little bit later on.

"response" offers either XPath or CSS selectors for selecting data. Let's briefly cover CSS selectors. To select this header, the data point I'm highlighting right now, or really this container, which contains the <h1> and then the <a> tag, we type response.css and then open and closed parentheses, and in either single or double quotes we type in just h1. Hit Enter and you will see that a selector is found, and that its XPath is pretty much the HTML node I'm highlighting right now. To actually get to the text you can type pretty much this, and you will see the <h1> text is found in two instances. Let's concentrate first on XPath, for example, because it will probably be easier; the call to it is fairly simple.
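In Scrapy, response.css('h1') and response.xpath('//h1') select the same nodes, since CSS selectors are translated to XPath internally. A rough stdlib approximation using xml.etree.ElementTree, on a hand-made fragment that mimics the quotes site's header (the markup is my assumption, not the live page):

```python
import xml.etree.ElementTree as ET

# Hand-made fragment approximating the site's header markup (assumption).
HTML = """<html><body>
  <h1><a href="/">Quotes to Scrape</a></h1>
</body></html>"""

root = ET.fromstring(HTML)

# CSS "h1" corresponds to XPath "//h1"; ElementTree's limited XPath
# dialect spells the same thing ".//h1".
headers = root.findall(".//h1")
print(len(headers))        # 1
print(headers[0][0].text)  # text of the <a> inside the <h1>
```

In the real Scrapy shell the result is a list of Selector objects rather than Element objects, but the selection logic is the same.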
It's probably best to actually use XPath, not CSS; at least in my experience XPath selectors are more robust, and writing more advanced selectors will probably be a lot easier with XPath than with CSS. Also, CSS selectors actually get transformed into XPaths before they are evaluated. So, if we just type response.xpath and then h1, you will see that we get an error, or really just an empty list, which is even worse in our case. To actually get all instances of the <h1> tag, we type in the double slashes first, and then hit Enter. And, as you will see, the selector located here corresponds to the one we get when we type response.css('h1').

In our XPath, here is the way you go into the <a> tag, for example. Currently we are in this HTML node, and we would like to get to this text, so we go into the <a> tag and then extract just the text. Let's navigate to the Terminal and type in once again a slash and then "a". This will go into every instance of the <h1> tag and then, if found, into the <a> tag. Hit Enter and you will see that the XPath here is different, and the data that is returned is different too. Previously we were in the <h1> tag; right now we are in our <a> tag, as you can see. So we have successfully gone into the <a> tag, and what we would like to extract, once again, is just the text. The way you get to the text is by typing, once again, /text(). Hit Enter and you will see that the data is now equal to "Quotes to Scrape", which corresponds pretty much to this section in the <a> HTML node.

To get rid of the selectors and XPaths, we type .extract(), hit Enter, and you will see that we get just the list itself. The "u" stands for, obviously, Unicode, and we get "Quotes to Scrape". Now, if we want just the string and not this list, we type extract_first().
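The whole chain described above — //h1/a/text(), then .extract() or .extract_first() — can be approximated with the stdlib like this. ElementTree has no text() step, so the sketch grabs the <a> nodes and reads their .text; the fragment is hand-made, not the live page:

```python
import xml.etree.ElementTree as ET

# Hand-made fragment approximating the site's header markup (assumption).
HTML = """<html><body>
  <h1><a href="/">Quotes to Scrape</a></h1>
</body></html>"""
root = ET.fromstring(HTML)

# Scrapy:  response.xpath('//h1/a/text()').extract()        -> list of strings
#          response.xpath('//h1/a/text()').extract_first()  -> first string
links = root.findall(".//h1/a")       # ~ //h1/a
texts = [a.text for a in links]       # ~ /text() + .extract()
first = texts[0] if texts else None   # ~ .extract_first()
print(texts)   # ['Quotes to Scrape']
print(first)   # 'Quotes to Scrape'
```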
Hit Enter and you will see that it is now in single quotes, in the form of a string. So that's how you scrape the first data point. Scraping the second data point is going to be a bit trickier, but it's still fairly simple. We would like to extract all of the different tags from the right side of the page. We either go once again through View page source, which would take some time to figure out where exactly these tags are, or we go to the first instance of a tag and click Inspect. We will see that the format here is obviously different. We don't get any <h1> tags or anything like that; we get a <span>, and that <span> has a class with the value tag-item. Then the very next, let's say, child HTML node is an <a> tag with the class tag, then an href, which goes to a different page, and then the text, which reads love, really our first tag that we would like to get.

There are numerous ways of getting to the data. The one that we will use is pretty simple: we are going to isolate pretty much every element whose class has the value tag, and then just scrape the text. Pretty simple, right? To do that, let's go to the Terminal and type response.xpath, open and closed parentheses, and then, in single quotes, a selector that finds every instance of the class. The way you write that is: double slashes and the tag name, then, in square brackets, the "@" sign, the word class, an equals sign, and the value in quotes, which for now we leave empty. That is how you write a class XPath selector. The logic is the same if, for example, you would like to select an id with some value; you would just type that in instead. So it's fairly customizable, and also fairly simple. So we type in class, and the class that we are looking for has the value tag, as you can see from here. So let's type this in and hit Enter.
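The attribute-predicate selector described above, //a[@class="tag"], matches every <a> whose class is exactly "tag" anywhere on the page, which is exactly why it over-selects in the next step. A stdlib sketch on an assumed fragment that mixes sidebar tags with an in-quote tag:

```python
import xml.etree.ElementTree as ET

# Fragment approximating the sidebar tags plus one tag inside a quote
# (assumption: the live page's markup is similar).
HTML = """<html><body>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/books/">books</a></span>
  <div class="quote"><a class="tag" href="/tag/life/">life</a></div>
</body></html>"""
root = ET.fromstring(HTML)

# Scrapy: response.xpath('//a[@class="tag"]') -- every <a> with class "tag",
# sidebar *and* quote body alike, hence the over-selection.
all_tags = root.findall('.//a[@class="tag"]')
print(len(all_tags))  # 3 -- more than just the two sidebar tags
```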
So we get a bunch of different tags, which doesn't really correspond to the data points we would like to get. As you can see, here we have 10 different tags, but here we have probably 50 or so different selectors returned. So this is no good. The way you further isolate this is by trying the class with the value tag-item, which is probably going to be a lot easier to work with. The reason we get all the extra <a> tags is that, if we go here, you will see that the HTML node is pretty much the same: we still have an <a> tag whose class has the value tag. So the selector that we wrote goes into the entire page and scrapes all of these different tags, which we actually don't want; we only want these first 10 data points.

To fix this, we write a new XPath selector. We go to Inspect, then go one HTML node up, and we see that we should be selecting the <span> whose class has the value tag-item. Let's copy and paste this into our Terminal: we are going to select every element whose class is tag-item. Hit Enter, and you will see that we have a lot fewer data points, or selectors in our case. Let's calculate the length of this; it should be 10. Let's see: the length is 10, which is perfect.

The second thing we need to do is further isolate, or get to, the text. Currently we are selecting 10 instances of the tag items. Here they are in Chrome, and what we need to do is go into each and every one, go into the <a> tag, and then scrape the text. Pretty simple, and really similar to the first data point that we collected from the header. Going into the very next HTML node, which in our case is <a>, works the same as with the first data point.
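The narrowing step — switching from //a[@class="tag"] to //span[@class="tag-item"] and checking the count with len() — looks like this in a stdlib sketch (the fragment is hand-made, so it has 2 sidebar items instead of the live page's 10):

```python
import xml.etree.ElementTree as ET

# Hand-made fragment: two sidebar tag items, one in-quote tag (assumption).
HTML = """<html><body>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/books/">books</a></span>
  <div class="quote"><a class="tag" href="/tag/life/">life</a></div>
</body></html>"""
root = ET.fromstring(HTML)

# Scrapy: len(response.xpath('//span[@class="tag-item"]')) -- sidebar only.
items = root.findall('.//span[@class="tag-item"]')
print(len(items))  # 2 here; 10 on the live page, as in the video
```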
So we type /a and hit Enter, and you will see that the selector actually changed. Then, once we are in the <a> tag, we want to select just the text from each and every <a> tag. To do that, of course, we type in once again /text(), hit Enter, and you will see that our data is a lot cleaner. We can't call extract_first() like last time to get just the data itself, because that would only select love, the first one. So what we need to do is just call extract(), hit Enter, and you will see it is in the form of a list.

So, hopefully this wasn't hard. In the next section I believe we are going to cover how to actually write simple and more advanced XPaths. But you have a lot more details online on how to do this, and if you have any questions about how to extract any data points, or if something was confusing, you have a Q&A here, so you can try it out, and I'll get back to you as soon as possible with an answer. Talk to you in the very next video.
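The full chain for the second data point — //span[@class="tag-item"]/a/text() followed by .extract() — can be sketched with the stdlib like this, again on an assumed two-tag fragment rather than the live page:

```python
import xml.etree.ElementTree as ET

# Hand-made fragment with two sidebar tag items (assumption).
HTML = """<html><body>
  <span class="tag-item"><a class="tag" href="/tag/love/">love</a></span>
  <span class="tag-item"><a class="tag" href="/tag/books/">books</a></span>
</body></html>"""
root = ET.fromstring(HTML)

# Scrapy: response.xpath('//span[@class="tag-item"]/a/text()').extract()
# extract() returns the whole list; extract_first() would stop at "love".
tags = [a.text for a in root.findall('.//span[@class="tag-item"]/a')]
print(tags)  # ['love', 'books']
```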