MapReduce Overview and Background

Packt Publishing
A free video tutorial from Packt Publishing

Lecture description

Use MapReduce when you need to combine different documents into a new representation.

Learn more from the full course

Learning MongoDB

A comprehensive guide to MongoDB for ultra-fast, fault tolerant management of big data, including advanced data analysis

03:25:42 of on-demand video • Updated April 2015

  • Install MongoDB on Linux and Windows, both manually and using packages
  • Configure MongoDB to autostart and access your data using the command line and GUI clients
  • Learn how to manage databases, including creation, pruning, backup, and recovery to fulfil your big data needs
  • Master how to create map and reduce functions using step-by-step diagrams and examples
  • Understand replica sets, failover verification, responsiveness, and load balancing for large scale applications
  • Discover how redundancy and filesystem choices impact security
  • Delve into advanced topics such as monitoring, automated deployment, sharding, and caching to boost your application
In this section, we'll be exploring the powerful MapReduce capability of MongoDB. We'll begin with an overview of what MapReduce is and how we can use it. We'll then go through the creation of a map function and a reduce function, one at a time. Next, we'll explore some advanced MapReduce techniques before moving on to a discussion of when to use MapReduce. This first video provides an overview of MapReduce.

Creating effective MapReduce algorithms requires a solid understanding of the data. The desired end result needs to be defined, and any questions about, or modifications to, existing data should be identified.

We'll need a more diverse set of data in order to complete this section. We can import this data just as we have in previous videos: we place the bulk import data file on the server and call mongoimport to import the data. You can then log into mongo, switch to the course tracker database, and confirm that the collections are available.

In keeping with our focus on students in previous sections, we'll be extending the course tracker example by adding instructor and course documents. We will also be introducing the idea of document references. You can see that an instructor document will reference courses and that course documents will reference students. A proper understanding of relationships is critical in designing MapReduce algorithms.

You may have noticed that our sample data specifies ID values. This is for convenience in referencing, so that we don't have to identify automatically generated object IDs. From the command line, we can query for an instructor document and see that the ID is not an ObjectId, as was already mentioned. We see that there is a courses array, which contains references to the ID values found in the course documents. When we query for a course document, we see that there is a students array, which contains references to the ID values from documents in the student collection.
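To make the reference structure concrete, here is a minimal sketch written as plain JavaScript objects so it can run without a MongoDB server. Only the `courses` and `students` arrays come from the video; the ID values, the `name` and `title` fields, and the helper `studentsOfInstructor` are assumptions made up for illustration.

```javascript
// Hypothetical sample documents. Only the `courses` and `students`
// reference arrays are described in the video; everything else is assumed.
const instructors = [
  { _id: "i1", name: "Instructor One", courses: ["c1", "c2"] },
];
const courses = [
  { _id: "c1", title: "Course One", students: ["s1", "s2"] },
  { _id: "c2", title: "Course Two", students: ["s2", "s3"] },
];
const students = [
  { _id: "s1", name: "Student One" },
  { _id: "s2", name: "Student Two" },
  { _id: "s3", name: "Student Three" },
];

// Follow the references in the only direction they exist:
// instructor -> courses -> students.
function studentsOfInstructor(instructorId) {
  const instructor = instructors.find((i) => i._id === instructorId);
  const ids = new Set();
  for (const courseId of instructor.courses) {
    const course = courses.find((c) => c._id === courseId);
    for (const studentId of course.students) {
      ids.add(studentId);
    }
  }
  return [...ids].sort();
}

// Student documents carry no reference back to courses or instructors.
const hasBackReference = students.some(
  (s) => "courses" in s || "instructors" in s
);

console.log(studentsOfInstructor("i1"), hasBackReference);
```

Because the references only point one way, answering a question that starts from the student side (such as "how many courses has each student signed up for?") means scanning the course documents, which is the kind of reorganization MapReduce is suited to.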
Finally, we can query for a student document and see that there are no backward references to course or instructor. All references go in one direction. As we learned in section 2, the documents have been normalized only far enough to prevent significant duplication of data.

Now that we understand what the data looks like, let's do a very basic MapReduce to find the number of courses for which each student has signed up. All the data we need is in the courses collection, but it's distributed among all documents. What we want to do is take the data from the students array and reorganize it to be sorted by student ID and represent a sum totalling the number of times the student ID appears in any of these.

In the mongo shell, we want to create the map function. The keyword this refers to a single course document, so this.students is the array of student ID values. This function loops over those student ID values and emits the value of one with the student ID as the key. If this isn't perfectly clear right now, don't worry. We're going to go over this in more detail in the next video.

The reduce function catches emitted values by key. It's important to understand that it doesn't catch them one at a time as they are emitted. The reduce function is not called until all emit calls have been made. In this case, we would expect the key coming in to match one of the student ID values. The values variable will be an array containing all the values that were emitted for the given key. In this simple case, these will all be ones, because that's all we emit. This function then loops over that array and obtains a sum. When the reduce function returns that sum, the return value is associated with the key that passed in the array of values.

We can run the MapReduce by identifying the db collection and calling mapReduce. This function expects references to the map function, the reduce function, and information about what to do with the result of the MapReduce operation.
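The map and reduce functions described above might look roughly like the following. This is a sketch, not the code shown in the video: the sample documents, the in-memory `emit` stand-in, and the `runMapReduce` driver are all assumptions added so the example runs in plain JavaScript without a server. In the mongo shell, `emit` is provided for you and the call would take the form `db.courses.mapReduce(map, reduce, { out: "<collection name>" })`.

```javascript
// Hypothetical course documents; only the `students` array matters here.
const courses = [
  { _id: "c1", students: ["s1", "s2"] },
  { _id: "c2", students: ["s2", "s3"] },
  { _id: "c3", students: ["s2"] },
];

// Stand-in for the shell's built-in emit(): collect values per key.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) {
    emitted.set(key, []);
  }
  emitted.get(key).push(value);
}

// Map: `this` is a single course document. Emit a 1 for every
// student ID found in its students array.
function map() {
  for (let i = 0; i < this.students.length; i++) {
    emit(this.students[i], 1);
  }
}

// Reduce: `values` holds everything that was emitted for `key`.
// Here those are all ones, so the sum is the number of courses.
function reduce(key, values) {
  let sum = 0;
  for (let i = 0; i < values.length; i++) {
    sum += values[i];
  }
  return sum;
}

// Assumed in-memory driver: run map on every document first, then
// call reduce once per key with the collected values.
function runMapReduce(docs) {
  emitted.clear();
  docs.forEach((doc) => map.call(doc));
  const result = {};
  for (const [key, values] of emitted) {
    result[key] = reduce(key, values);
  }
  return result;
}

const courseCounts = runMapReduce(courses);
console.log(courseCounts);
```

The `map` and `reduce` functions are written without closures over outside state (apart from `emit`), which is the shape the shell expects, since it passes the functions themselves rather than their results.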
In this case, we overwrite or create a collection for the number of courses. After we run the MapReduce, various results about what the MapReduce did are displayed. We can also show collections and see the new number-of-courses collection. When we do a find on that collection, we see the student ID values along with the total number of courses corresponding to each.

Let's review the diagram again. In this diagram, we see that for each course document we looped through all student ID values. Each student ID value was emitted as a key with a value of 1. All of these emitted values were later collected into an array, which was passed to the reduce function to calculate the sum.

In this video, we learned how important it is to understand the structure of existing data in order to create map and reduce functions. We also discovered that it's important to know ahead of time what the end result should look like, and any calculations required to achieve it. In the next video, we'll look more closely at creating map functions.
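The three stages in the diagram can also be traced step by step. The sketch below uses hypothetical course documents and plain JavaScript arrays rather than a live collection; the `{ _id, value }` shape of the final documents mirrors how MongoDB stores MapReduce output, with the key in `_id` and the reduced result in `value`.

```javascript
// Hypothetical course documents, standing in for the courses collection.
const courses = [
  { _id: "c1", students: ["s1", "s2"] },
  { _id: "c2", students: ["s2", "s3"] },
  { _id: "c3", students: ["s2"] },
];

// Stage 1: for each course, one (studentId, 1) pair per student ID.
const pairs = courses.flatMap((course) =>
  course.students.map((studentId) => [studentId, 1])
);

// Stage 2: collect the emitted values into one array per key.
const grouped = {};
for (const [key, value] of pairs) {
  if (!(key in grouped)) {
    grouped[key] = [];
  }
  grouped[key].push(value);
}

// Stage 3: reduce each array to a sum, shaped like the documents
// a find on the output collection would show.
const output = Object.keys(grouped)
  .sort()
  .map((key) => ({
    _id: key,
    value: grouped[key].reduce((sum, v) => sum + v, 0),
  }));

console.log(grouped);
console.log(output);
```

Looking at `grouped` before the final step makes it easier to see why the reduce function receives an array of ones in this example.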