The Scale Problem
A free video tutorial from Hadelin de Ponteves
Learn more from the full course Deep Learning and Computer Vision A-Z™: OpenCV, SSD & GANs
Become a Wizard of all the latest Computer Vision tools that exist out there. Detect anything and create powerful apps.
10:58:47 of on-demand video • Updated December 2020
- Have a toolbox of the most powerful Computer Vision models
- Understand the theory behind Computer Vision
- Master OpenCV
- Master Object Detection
- Master Facial Recognition
- Create powerful Computer Vision applications
Hello, and welcome back to the course on computer vision. Today we're talking about the scale problem: the problem that the SSD algorithm actually solves, and how it solves it.
So what is the scale problem? Well, let's have a look at this image. This is an image we took in the Czech Republic: just some horses in a field. And what can we see here? Let's do what we have learned so far and see how the algorithm would go. We lay out the grid and apply some boxes, and right away we can see the boxes that have found features of horses and the boxes that haven't. If I remove the green boxes which haven't found features of horses, you will see that somehow the horse right in front, the main horse, the one that's most obvious to us as humans, has been missed. The algorithm did not pick up that horse. It picked up that horse over there, it has seen features of that horse and of that horse, and it has even seen features of this horse that you can barely see, but it does not see this horse, point blank.
And the question is why. What's going on here? Well, the reason is that this horse is too close. It's too big for the algorithm to pick up features, and I'll show you what I mean by that. Let's look at just a couple of rectangles. Look at these ones, and let's single out one of them, that rectangle. In the grand scheme of things it might seem strange: how can you not pick up a horse when it's pretty obvious that there is a horse? But let's remember what we talked about at the start: we said that every single one of these rectangles works as a separate image.
Each rectangle deals in its own right with what it has inside, and it is trying to understand: do I see features of a horse, or do I not see features of a horse? To make it even easier to visualize, let's crop the rest of the image and look at what this rectangle is looking at. Let's imagine that we are this rectangle, and what this rectangle sees is that. Now, by looking at that image, can you identify that there's a horse in there? Probably not. What does it look like, if anything? I don't know, something like a puddle of mud, or maybe some fog, or a fire in the distance. Maybe water falling. It could be anything, right? If you forget about what we saw previously and let your eyes adjust, you'll notice that it doesn't resemble a horse in any way.
And you'll probably say: OK, that's fair enough, but there are other features of the horse that you could identify from this image. You could maybe identify the hooves, or the tail, or the eyes. Well, let's have a look at that as well. Let's look at another rectangle, this one. Here you can see that yes, we can kind of identify that there is an eye. And because of the, what is it called, the part that goes in the horse's mouth, I really should know this, but because of that pink strap we can see that it's a horse, and it's fair that the algorithm should also be leveraging that feature. So yes, it does resemble a horse better.
But remember that SSD doesn't only need to identify that there is indeed a horse in there. It also needs to make a suggestion for the actual box that completely contains the horse. So from here, forgetting about the full image, can you tell where the horse is actually located?
Is it located like this, or like this? Or maybe the horse has turned its head like this, but it's actually standing on this side. Maybe that's one of its hooves, or maybe the horse is standing here. So it's very hard for the algorithm to predict what the horse looks like, what the full image of the horse is like. Maybe there's something else in the way, blocking the horse. So the horse is so big that in most cases, for most rectangles, we cannot even see that it's a horse. And even in those cases where we can identify from certain features that it's a horse, it's very hard to build the rectangle.
This is where the next component of SSD comes to the rescue, so let's have a look at how SSD deals with the scale problem. Let's move back to the architecture, and here we can see that there are many layers. So far we've mostly been looking at examples on the original image, but in reality the image is constantly reduced in size. There are many layers in the network that's working behind the algorithm, in the background, and basically what it does is apply a convolution operation to reduce the image size. Here you can see how it was 300 pixels, then it's 38, 19, 10, 5, 3, and 1 pixel, so it's constantly being reduced in size.
Everything we talked about is then applied at every single step, on all of those images: the rectangles are applied, and the training of what they contain and where they should be happens. As you can see here, there's a huge number of detections in this setup: for a 300 by 300 input image at the start, 8,732 detections per class.
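The detection count quoted for the 300 by 300 setup can be checked with a little arithmetic. The sketch below is only an illustration: the feature-map sizes are the ones read out in the tutorial, while the number of boxes per grid cell (4, 6, 6, 6, 4, 4) is an assumption taken from the standard SSD300 configuration and is not stated in the video.

```python
# Feature-map sizes quoted in the tutorial for a 300x300 input.
feature_map_sizes = [38, 19, 10, 5, 3, 1]

# Assumption: default boxes per grid cell in the standard SSD300 configuration.
boxes_per_cell = [4, 6, 6, 6, 4, 4]

# Every cell of every feature map proposes its own set of boxes.
detections_per_layer = [
    size * size * boxes
    for size, boxes in zip(feature_map_sizes, boxes_per_cell)
]
total_detections = sum(detections_per_layer)

print(detections_per_layer)  # [5776, 2166, 600, 150, 36, 4]
print(total_detections)      # 8732
```

Most of the boxes come from the largest feature map, which is the layer that looks for the smallest objects.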
And as you can imagine, there are many different classes as well. The number is so large because there are many iterations: every time we convolve the image further, we again apply these classifiers. Here you can see classifiers being applied, in our case to detect horses, for instance, then again classifiers being applied to detect horses, and again classifiers being applied. So basically that just shows that everything is being repeated again and again and again. So let's have a look at how that happens, and then I'll provide some additional comments on it.
There is our image. What happens to the image is that it actually goes through a convolution. Here we're going to just resize it, but that's only for visual purposes, to help us better understand what's going on. In reality it's not just resized: it goes through the convolution, so the image won't look anything like itself anymore. It will be a completely random-looking thing to us, but it will preserve the features. Once again, we discuss this more in the convolutional neural networks tutorials, which are in annex number two. But for our purposes, just to better understand the intuition behind things, we're going to keep it as the exact image of the horses, just resized; that will still help us understand what's going on.
So here the image is resized, but the rectangles stay the same size, and now they are applied to this new, resized image. And because the image is smaller, the horse is smaller.
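To put rough numbers on the "same rectangles, smaller image" idea, here is a minimal sketch. The 240-pixel horse width is a made-up figure for illustration, and `cells_spanned` is a hypothetical helper, not part of any library. It estimates how many grid cells the horse stretches across at each of the feature-map sizes mentioned earlier.

```python
def cells_spanned(object_px, image_px, grid_size):
    """Approximate number of grid cells an object of width object_px covers
    on a grid_size x grid_size map laid over an image_px-wide image."""
    return object_px / image_px * grid_size

IMAGE_PX = 300
HORSE_PX = 240  # hypothetical width of the big foreground horse, in pixels

for grid_size in [38, 19, 10, 5, 3, 1]:
    span = cells_spanned(HORSE_PX, IMAGE_PX, grid_size)
    print(f"{grid_size:>2} x {grid_size:<2} map: horse spans about {span:.1f} cells")
```

On the 38 by 38 map the horse stretches over roughly thirty cells, so any single rectangle sees only a fragment of it. On the 3 by 3 map it covers only two or three cells, so a rectangle of the same size now takes in most of the animal.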
And now, if we try to pick up features of horses, you'll see that these rectangles do pick up a horse, whereas the rectangles that used to pick up a horse no longer do. This horse is not picked up anymore, that one isn't, and that one isn't, because they're smaller now. It's similar to what we had in the first example, at the lake, where there were people in the background who were too small to be identified. Same thing here. So, for argument's sake, we're going to say that these horses are not picked up, and that's totally fine because we've already picked them up before, whereas now we've picked up this big horse, which in this new version of the image is smaller. So now it will be identified.
The question that still remains, though, is how we will draw the box for this horse in the original image. From what we discussed in the previous tutorial, we know how the algorithm goes about drawing the box in this version of the image: through training, with lots of examples of horses or other objects with the ground truth, it will learn how to draw these boxes, and so it will also learn how to draw the boxes in this image. But remember that we got this image not just through resizing but through a convolution operation. That means we still have a question: how do we then take this box and move it back?
Well, it's pretty straightforward. Let's look at our architecture again. Wherever we are in the process, the SSD algorithm is constructed in such a way that it preserves information about how to scale back to the original image: how to go from, let's say, this layer to here, or from this layer to here. It will remember that it found the horse, for instance, over here in this part of the image in this layer, and it will also know how it got to this layer, so it will know how to get back.
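As a simplified picture of "knowing how to get back", the sketch below maps one grid cell of a small feature map onto the region it corresponds to in the original 300 by 300 image. This is an assumption-laden illustration rather than the actual SSD code: `cell_to_image_box` is a hypothetical helper, and a real detector predicts offsets relative to default boxes instead of returning whole cells.

```python
def cell_to_image_box(row, col, grid_size, image_px=300):
    """Return (x_min, y_min, x_max, y_max), in original-image pixels,
    for the grid cell at (row, col) of a grid_size x grid_size map."""
    cell_px = image_px / grid_size  # how many original pixels one cell covers
    x_min = col * cell_px
    y_min = row * cell_px
    return (x_min, y_min, x_min + cell_px, y_min + cell_px)

# The centre cell of the 3x3 map corresponds to the middle of the image.
box = cell_to_image_box(row=1, col=1, grid_size=3)
print(box)  # (100.0, 100.0, 200.0, 200.0)
```

Because every layer has a known grid size, a position found in any layer can be scaled back to the original image in the same way.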
So it knows, in simplistic terms, how to undo the convolution: how to get back to the original image and where it should place the horse there. That's an important part: it needs to place the horse in the right place not only in this image but in the original image, and the SSD algorithm is equipped to do that. In this way we are now able to handle objects of different sizes.
The beauty of the SSD algorithm is that it doesn't just resize the image and then proceed to do the convolution operation again. The idea is that all of this happens at the same time: all of the boxes that we're detecting, the positions of the boxes, the classes inside the boxes, and even the resizing, the different scales of the image. It all happens in one neural network, over here, like this. So instead of having this image, then a resized copy of the image, then another smaller copy, and an even smaller one, and running a neural network for each one, we're running all of this in one neural network.
And why is that good? Why is that important? Because we can share the features across the different layers. When we're training the network, it learns how to detect horses when they're at full size, in this layer, and also when they're made smaller, so all of the layers are learning together. That gives it additional power; there's power in numbers. The layers are seeing more different horses at the same time: some horses are seen here, some horses are seen here. The network therefore has more examples to work from and can better adjust its weights to properly detect those horses. So this really facilitates the training.
So that's how the SSD algorithm deals with the scale problem. It's quite a smart solution.
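The "one network, many scales" idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: 2x2 average pooling stands in for the learned convolution layers, the 32 by 32 input is synthetic, and `halve` and `single_pass` are hypothetical names. The point it illustrates is that each smaller map is built from the previous one in a single pass, rather than by resizing the input and starting over for each scale.

```python
def halve(feature_map):
    """2x2 average pooling: a stand-in for a learned stride-2 convolution."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [
        [
            (feature_map[2 * r][2 * c] + feature_map[2 * r][2 * c + 1]
             + feature_map[2 * r + 1][2 * c] + feature_map[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def single_pass(image, num_scales):
    """Run once over the image, keeping the feature map produced at every scale."""
    maps = []
    current = image
    for _ in range(num_scales):
        current = halve(current)  # each map reuses the work done for the previous one
        maps.append(current)
    return maps

# Synthetic 32x32 "image" so the example runs without any external data.
image = [[float((r * 7 + c * 3) % 11) for c in range(32)] for r in range(32)]
maps = single_pass(image, num_scales=4)
shapes = [(len(m), len(m[0])) for m in maps]
print(shapes)  # [(16, 16), (8, 8), (4, 4), (2, 2)]
```

In a real detector, a classifier would be attached to each of these maps, and all of them would be trained together.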
I'm sure you'd agree that putting all of this into one convolutional neural network is pretty exciting and definitely gives us some amazing results. On that note, I hope you enjoyed this tutorial, and I look forward to seeing you next time. Until then, enjoy computer vision.