Analyzing Data with Apache Spark

Todd McGrath
A free video tutorial from Todd McGrath
Data Engineer, Software Developer, Mentor
3.7 instructor rating • 3 courses • 2,272 students

Lecture description

I don't expect you to follow all the details of this video. I want to give you a big picture through source code examples of using Apache Spark and Python to analyze data.

We're going to start by running some Python code examples against the Spark API through a Spark driver program called PySpark.

I don't expect you to follow along with all these Python examples yet. I'll fill in the blanks later in the course.

We'll talk about what a "Spark Driver" program means later.

From PySpark, we're going to analyze some Uber data. Uber is a company that has disrupted the worldwide taxi industry. We're going to use Uber data for New York City from the NYC Taxi & Limousine Commission.

With this data, we can use Python and Spark to determine the total number of Uber trips, the most popular Uber bases in NYC, and more.

In this example, we'll give a glimpse into Spark core concepts such as Resilient Distributed Datasets, Transformations, Actions and Spark drivers. In addition, we'll see code examples of how to use Python with Spark.

Again, I don't expect you to follow all the details here; it's intended as a high-level overview to begin.

Learn more from the full course

Apache Spark with Python - Learn by Doing

50 Python source code examples and multiple deployment scenarios

01:54:51 of on-demand video • Updated February 2016

  • Have confidence using Spark from Python
  • Understand Spark core concepts and processing options
  • Run Spark and Python on your own computer
  • Set up Spark on a new Amazon EC2 cluster
  • Deploy Python programs to a Spark cluster
  • Know what tools to use for Spark administration
  • Certificate of completion
  • 30-day money back guarantee
English [Auto] I always appreciated it when my instructors would start off a course with some code and some examples right away, so I'm going to do that today with you as well. Let's cover some of the things we'll be learning in this course through example. Don't worry if this doesn't make sense to you yet; we're going to cover all of it, and much more, in detail during the course. But let's start off with an example. I'm going to start up a Spark driver program here called PySpark. It takes a few moments, so while it's starting up I want to set the stage for the example.

By now I'm sure you've heard of the company called Uber. It's a way to share rides, and it's really taking on the taxi industry throughout the world. It's been, well, let's say controversial in some cities. One of the cities where it was controversial was New York, and a while back some data about Uber pickups and the amount of activity happening in New York was publicly released. We're going to use the data from this website that I'm showing you here. Just to give you an overview, the data looks like this: in the CSV file we have a base number, a date, the number of active Uber vehicles, and the number of trips. All of this data is described in more detail on this GitHub site, which you can take a look at after the example. But for now, that sets the stage for what we're going to be doing: analyzing this Uber data as it occurred in New York over a period of time, using Python and Spark.

I've already downloaded this file to a location where I can easily load it into Spark, so you'll see in the next example how I load it up. The driver program has started by now, so let's start things off by loading in the CSV file of Uber data from New York. We're going to call the SparkContext, as you see, and call a function on it called textFile.
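To make the first two steps concrete without needing a Spark installation, here is the same pair of actions sketched in plain Python over a few made-up lines in the file's format (the base IDs come from the video, but these particular dates and numbers are invented for illustration, and the file name in the comment is a stand-in). The corresponding PySpark calls are shown in comments:

```python
# A few made-up lines in the shape of the Uber CSV:
# dispatching_base_number,date,active_vehicles,trips
lines = [
    "base,date,active_vehicles,trips",
    "B02512,1/1/2015,190,1132",
    "B02617,1/1/2015,1228,9537",
    "B02764,1/1/2015,3427,29421",
]

# In PySpark this would be:  ut = sc.textFile("uber.csv")
# ut.count() -> the number of rows, header line included
row_count = len(lines)

# ut.first() -> the first row, which here is the column header
first_row = lines[0]

print(row_count)   # 4
print(first_row)   # base,date,active_vehicles,trips
```

Note that `count` and `first` are Spark actions: they actually pull results back to the driver, whereas `textFile` merely sets up the dataset.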
This creates a new Resilient Distributed Dataset, which we'll call ut. Again, concepts like SparkContext and Resilient Distributed Datasets will be covered later in the course. Now that we have a pointer to that CSV file, let's call a Spark action on it. Let's see how many rows are in the file. OK, 355. Let's see what the first row looks like. OK, as expected from the CSV we saw before, we see the column names.

Let's do a little bit more with this data. Let's split it up by comma, creating a new RDD called rows. We use a map transformation and pass in a function, a Python lambda. Now let's see how many distinct bases there are in this data. We map over the rows, paying specific attention to the first column; it's a zero-based index, so we use 0 instead of 1, and we determine that there are seven unique bases in this data. Let's take a look at the names of those bases. OK, we can see all of the distinct names, including the column header, "dispatching base number", here.

Let's dive into this a little more. Let's filter the data and look at a particular base, base B02617. There are 59 rows in this data for that particular base. Let's create a new RDD for this base so we can conveniently use it later, for example to ask how many rows had more than 15,000 trips. It appears there are six rows of data with more than 15,000 trips, and we can confirm that another way. OK.

So we're diving deeper into this data. Let's keep going. Let's create a new RDD that excludes the first row of this CSV file, by passing in a function that excludes any line containing the word "base", and then again let's split on comma. Now we've got a filtered rows RDD that is ready for us to use.
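The split, distinct, and filter steps above can be sketched the same way in plain Python over toy rows (again, the dates and counts here are invented for illustration; only the base IDs and the transformation names come from the video):

```python
# Made-up rows in the Uber CSV's shape, header included.
lines = [
    "base,date,active_vehicles,trips",
    "B02512,1/1/2015,190,1132",
    "B02617,1/1/2015,1228,9537",
    "B02617,1/2/2015,1180,16000",
    "B02764,1/1/2015,3427,29421",
]

# rows = ut.map(lambda line: line.split(","))
rows = [line.split(",") for line in lines]

# rows.map(lambda row: row[0]).distinct().count()
# Column 0 is the base; the header word "base" counts as one value too.
distinct_bases = set(row[0] for row in rows)
print(len(distinct_bases))   # 4 here (3 real bases plus the header)

# rows.filter(lambda row: "B02617" in row)
b02617 = [row for row in rows if "B02617" in row]
print(len(b02617))           # 2

# b02617.filter(lambda row: int(row[3]) > 15000).count()
busy_days = [row for row in b02617 if int(row[3]) > 15000]
print(len(busy_days))        # 1
```

In Spark, `map`, `distinct`, and `filter` are lazy transformations; nothing is computed until an action such as `count` or `collect` is called.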
We can determine the total number of trips for each base by first creating a new RDD from filtered rows with map. We create a pair RDD that contains keys and values, the key being the base name and the value being the trips for that particular day. Next we call reduceByKey, and then collect to display the results. The result is the total number of trips per base across the timeframe in New York. We can see B02617 had over 700,000 trips for this particular timeframe.

I wonder which was the busiest Uber base over this timeframe. Well, I think it would be easier to see if we sorted it. So let's call reduceByKey again on a new RDD created with map, same as before, but let's use takeOrdered to return the top 10, ordered by the value in each key-value pair. There, that makes it easier to read. We can now see that B02764 had the most trips over that timeframe compared to B02617 here, over a million.

OK, so we're off and running. We're using Python in a Spark driver program, we're using Spark transformations and RDD actions, and we're analyzing data. We're going to stop now, review some of the key concepts we've covered in this example, and then get into even more examples. So I'll see you in the next lecture.
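This final aggregation can also be sketched in plain Python: the dictionary accumulation below is exactly what `reduceByKey(lambda a, b: a + b)` computes per key, and the sort mirrors `takeOrdered`. The rows are the same invented toy data, not the real NYC numbers:

```python
lines = [
    "base,date,active_vehicles,trips",
    "B02512,1/1/2015,190,1132",
    "B02617,1/1/2015,1228,9537",
    "B02617,1/2/2015,1180,16000",
    "B02764,1/1/2015,3427,29421",
]

# filtered_rows = ut.filter(lambda line: "base" not in line) \
#                   .map(lambda line: line.split(","))
filtered_rows = [line.split(",") for line in lines if "base" not in line]

# pairs  = filtered_rows.map(lambda row: (row[0], int(row[3])))
# totals = pairs.reduceByKey(lambda a, b: a + b).collect()
totals = {}
for row in filtered_rows:
    base, trips = row[0], int(row[3])
    totals[base] = totals.get(base, 0) + trips
print(totals)      # {'B02512': 1132, 'B02617': 25537, 'B02764': 29421}

# pairs.reduceByKey(lambda a, b: a + b) \
#      .takeOrdered(10, key=lambda pair: -pair[1])
top10 = sorted(totals.items(), key=lambda pair: -pair[1])[:10]
print(top10[0])    # ('B02764', 29421)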