The real power and value of Apache Spark lie in its speed and in its platform for executing data science tasks. Spark's unique strength is that it combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualization, letting data scientists tackle the complexities that come with raw, unstructured datasets. Spark makes the transition from working on a single machine to working on a cluster straightforward, which makes data science tasks far more agile.
In this course, you’ll get a hands-on technical resource that will make you comfortable and confident working with Spark for data science. We won't just tour Spark’s data science libraries; we’ll dive deeper and expand on the topics.
This course starts by taking you through Spark and the steps needed to build machine learning applications. You will learn to collect, clean, and visualize data coming from Twitter with Spark Streaming. Then, you will get acquainted with Spark's machine learning algorithms and different machine learning techniques. You will also learn to apply statistical analysis and mining operations to the tweet dataset. Finally, the course will give you some ideas for further analysis, including graph processing. By the end of the course, you will be able to do your data science job in a way that is visual, comprehensive, and appealing to business and other stakeholders.
About The Author
Eric Charles has 10 years’ experience in the field of Data Science and is the founder of Datalayer, a social network for Data Scientists. He is passionate about using software and mathematics to help companies get insights from data.
His typical day includes building efficient data processing with advanced machine learning algorithms, SQL, streaming, and graph analytics. He also focuses heavily on visualization and sharing results.
He is passionate about open source and is an active Apache Member. He regularly gives talks to corporate clients and at open source events.
Big data solutions such as Spark are hard to set up, time-consuming to learn, and obscure to non-technical users. The aim of this video is to give you just enough information to find your way through this complex ecosystem.
The goal of this video is to show the steps to install Spark on your laptop. This will let you test and prototype easily on a dataset of limited size. If you want to process truly big data, you will need a cluster in the cloud; we will also show you how to achieve this.
This video shows you how to set up Apache Zeppelin, a tool to execute and visualize your Spark analysis.
Load files with Spark's more advanced data structures, the DataFrame and the Dataset, and perform basic operations on the loaded data.
Manipulate the rows and columns of a DataFrame. This is needed to engineer features from your datasets.
Often, a received dataset is not in the shape you expect; you will need to convert its encoding and format.
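As a minimal illustration of the kind of conversion involved, the plain-Python sketch below re-encodes a Latin-1 byte stream as UTF-8 and normalizes line endings; the sample record is made up for the example.

```python
# Minimal sketch: re-encode Latin-1 bytes as UTF-8 and normalize the
# record format before loading into Spark. The sample bytes are made up.
raw = "caf\xe9,42\r\n".encode("latin-1")   # bytes as they might arrive

text = raw.decode("latin-1")               # decode with the source encoding
text = text.replace("\r\n", "\n")          # normalize Windows line endings
utf8 = text.encode("utf-8")                # re-encode as UTF-8

print(utf8)
```

The same decode/normalize/encode pattern applies whatever the source encoding turns out to be.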
Visualization is a powerful way to make sense of data.
Data scientists use complex algorithms to analyze data through high-level libraries. Spark ships with a complete, out-of-the-box solution for distributed machine learning.
DataFrames are not the best-suited data structures for data science algorithms, which mainly deal with matrices and vectors. Here we look at Spark's optimized distributed data structures for this purpose.
Before applying the algorithms, we need to cleanse the datasets and create new features.
With big data, we may have an explosion of features which can cause noise. We need to be able to work on only the relevant features.
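One simple way to keep only the relevant features is variance thresholding: near-constant columns carry little signal and can be dropped. The sketch below is plain Python on a toy matrix; the threshold value is arbitrary.

```python
# Sketch: drop near-constant features by variance thresholding.
# Toy matrix and threshold chosen purely for illustration.
def variance(col):
    m = sum(col) / len(col)
    return sum((x - m) ** 2 for x in col) / len(col)

rows = [
    [1.0, 0.0, 5.0],
    [2.0, 0.0, 5.1],
    [3.0, 0.0, 4.9],
]
cols = list(zip(*rows))                      # column-wise view of the data
keep = [i for i, c in enumerate(cols) if variance(c) > 0.001]
print(keep)                                   # indices of retained features
```

The constant middle column is dropped; the two columns that actually vary are kept.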
Data analysis proceeds in multiple steps that must be executed sequentially; Spark fulfils this need with pipelines.
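The idea behind a pipeline can be sketched in a few lines of plain Python: a list of stages, each transforming the output of the previous one. This mirrors the concept of Spark ML's Pipeline without using the Spark API; the stage names are made up.

```python
# Sketch of the pipeline idea: stages executed sequentially, each one
# transforming the previous stage's output. Not the Spark API.
def lowercase(rows):
    return [r.lower() for r in rows]

def tokenize(rows):
    return [r.split() for r in rows]

def drop_short(rows):
    # keep only words longer than two characters
    return [[w for w in ws if len(w) > 2] for ws in rows]

def run_pipeline(stages, data):
    for stage in stages:
        data = stage(data)
    return data

tweets = ["Spark IS fast", "I love big DATA"]
result = run_pipeline([lowercase, tokenize, drop_short], tweets)
print(result)
```

Spark's Pipeline adds fitting (estimators) and distributed execution on top of this basic chaining pattern.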
We want to connect to the Twitter API and store collected tweets in a persistent way.
Get introduced to streaming information and learn to visualize it.
Build a clean and usable dataset from the raw collected data.
We want an easy and fast view of our dataset.
The aim of this video is to understand your dataset better.
The aim of this video is to validate statistical relevance.
Algorithms sometimes have difficulties with many features, which can be redundant. We will see how SVD and PCA can reduce the number of features for later modeling.
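To give a feel for what PCA computes, the sketch below finds the top principal component of tiny 2-D data by power iteration on the covariance matrix. It is pure Python on made-up points; Spark's ml library provides the distributed implementation used in the course.

```python
# Sketch: top principal component of toy 2-D data via power iteration
# on the covariance matrix. Illustrates the idea behind PCA only.
import math

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
          (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]

# center the data
mx = sum(p[0] for p in points) / len(points)
my = sum(p[1] for p in points) / len(points)
centered = [(x - mx, y - my) for x, y in points]

# 2x2 sample covariance matrix
n = len(points) - 1
cxx = sum(x * x for x, _ in centered) / n
cyy = sum(y * y for _, y in centered) / n
cxy = sum(x * y for x, y in centered) / n

# power iteration: repeated multiplication converges to the
# eigenvector with the largest eigenvalue
v = (1.0, 0.0)
for _ in range(50):
    wx = cxx * v[0] + cxy * v[1]
    wy = cxy * v[0] + cyy * v[1]
    norm = math.hypot(wx, wy)
    v = (wx / norm, wy / norm)

print(v)  # direction of maximum variance
```

Projecting the data onto this direction (and discarding the rest) is exactly the feature reduction PCA performs, in as many dimensions as your dataset has.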
In this video, you will learn to extend the basic statistics for your needs.
Free text is difficult for algorithms. We need to convert words into numerical values.
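A common way to do this conversion is the "hashing trick", the idea behind feature hashers such as Spark's HashingTF: each word is mapped to a slot in a fixed-size vector and counted. The sketch below is plain Python with an arbitrary bucket count.

```python
# Sketch of the hashing trick: map each word to a fixed-size vector
# slot and count occurrences. Bucket count 16 is arbitrary.
def hash_word(w):
    # simple deterministic hash (Python's built-in hash is salted per run)
    h = 0
    for ch in w:
        h = (h * 31 + ord(ch)) % (2 ** 32)
    return h

def hashing_tf(words, num_buckets=16):
    vec = [0] * num_buckets
    for w in words:
        vec[hash_word(w) % num_buckets] += 1
    return vec

tf = hashing_tf("spark makes big data simple".split())
print(tf, sum(tf))
```

The vector length is fixed regardless of vocabulary size, which is what makes the trick practical at big data scale.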
Standard Spark functions do not handle the specifics of tweets. Let's see how we can do a better job.
We need to enrich our dataset with additional generated features. This is a common process in data science analysis. We will add sentiment features based on free-text feature vectors.
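As a minimal sketch of a sentiment feature, the snippet below scores a tweet by counting words from a tiny hand-made lexicon. Both the lexicon and the tweets are hypothetical; real analyses use much larger lexicons or a trained model.

```python
# Minimal sentiment-feature sketch: score text against a tiny,
# hand-made lexicon. Lexicon and sample tweets are made up.
POSITIVE = {"love", "great", "awesome", "good"}
NEGATIVE = {"hate", "bad", "awful", "terrible"}

def sentiment_score(tweet):
    words = tweet.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("I love Spark, it is great"))   # positive score
print(sentiment_score("awful latency, bad docs"))      # negative score
```

The resulting number becomes one more column alongside the other engineered features.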
Once again, we want to generate additional features from the ones we have. Topics can be considered, although they are not easy to extract from tweets.
The aim of this video is to visualize word clouds of data held in a Spark DataFrame.
We want an aggregated view of user locations. Therefore, we will preprocess the users' locations with Geohash.
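Geohash encodes a latitude/longitude pair as a short string whose shared prefixes group nearby points, which is what makes it handy for aggregation. The sketch below is a minimal pure-Python encoder of the standard scheme (interleaved binary partitioning mapped to base-32), written for illustration.

```python
# Minimal Geohash encoder: interleave longitude/latitude bits and map
# groups of 5 bits to the standard base-32 alphabet. Illustrative only.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=11):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True                       # longitude bit comes first
    while len(bits) < precision * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1); lon_lo = mid
            else:
                bits.append(0); lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1); lat_lo = mid
            else:
                bits.append(0); lat_hi = mid
        even = not even
    chars = []
    for i in range(0, len(bits), 5):  # 5 bits -> one base-32 character
        n = 0
        for b in bits[i:i + 5]:
            n = n * 2 + b
        chars.append(BASE32[n])
    return "".join(chars)

print(geohash(57.64911, 10.40744))
```

Truncating hashes to, say, five characters buckets users into roughly city-sized cells, so grouping by the prefix gives the aggregated view we want.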
Your data analysis has to be an iterative and collaborative process.
We want to get feedback as soon as possible from the end-users and from our business stakeholders.
The aim of this video is to review the available Spark algorithms for classification and to prepare the needed datasets.
The aim of this video is to train a classification model.
The aim of this video is to apply the model to the test dataset and evaluate its performance.
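Whatever the classifier, evaluation boils down to comparing predictions with held-out labels. The plain-Python sketch below computes accuracy, precision, and recall from the confusion counts, on made-up labels; Spark's evaluators compute the same quantities at scale.

```python
# Sketch: evaluate binary predictions on a held-out test set by
# deriving accuracy, precision, and recall from confusion counts.
def evaluate(labels, predictions):
    tp = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, predictions) if l == 1 and p == 0)
    tn = sum(1 for l, p in zip(labels, predictions) if l == 0 and p == 0)
    accuracy = (tp + tn) / len(labels)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# made-up ground truth and predictions, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
print(metrics)
```

Precision and recall matter more than raw accuracy when the classes are imbalanced, as tweet labels often are.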
This video aims at automating model selection.
The aim of this video is to introduce clustering algorithms and run K-Means to identify clusters based on users' followers and friends.
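The core of K-Means (Lloyd's algorithm) fits in a few lines: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The sketch below runs on toy 2-D points standing in for follower/friend counts; Spark's ml KMeans does the same thing in a distributed way.

```python
# Sketch of Lloyd's algorithm on toy 2-D points. The data and starting
# centroids are made up for illustration.
def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
                     for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids

points = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers = kmeans(points, centroids=[(0, 0), (5, 5)])
print(centers)
```

On real data the feature scales differ wildly (followers vs. friends), so standardizing the columns before clustering is usually essential.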
This video helps us to Cluster users by location with the 'ml' package.
The aim of this video is to show how to adapt to changing sources.
Recommendation is widely used on websites. It needs to learn from preferences, which we do not collect in our tweet dataset. The aim of this video is to show you alternative techniques to recommend users.
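One such alternative is similarity-based recommendation: suggest users whose interests overlap the most, without any preference ratings. The sketch below uses Jaccard similarity on made-up "following" sets; all names and data are hypothetical.

```python
# Sketch: recommend users by overlap of their "following" sets
# (Jaccard similarity). User names and sets are made up.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

follows = {
    "alice": {"spark", "hadoop", "kafka"},
    "bob":   {"spark", "kafka", "flink"},
    "carol": {"knitting", "cooking"},
}

def recommend(user, k=1):
    # rank all other users by similarity and keep the top k
    scores = [(jaccard(follows[user], f), name)
              for name, f in follows.items() if name != user]
    return [name for _, name in sorted(scores, reverse=True)[:k]]

print(recommend("alice"))
```

The same pattern works with any feature sets extracted from tweets, such as shared hashtags or mentioned users.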
The aim of this video is to have a view on how the user's mentions are linked.
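A mention graph can be sketched as a simple adjacency map: each tweet contributes edges from its author to the users it mentions, and counting in-degrees shows who gets mentioned most. The toy tweets below are made up; Spark's graph libraries do this at scale.

```python
# Sketch: build a mention graph from tweets and count in-degrees.
# The sample tweets and user names are made up.
from collections import defaultdict

tweets = [
    ("alice", "great talk @bob @carol"),
    ("bob", "thanks @alice"),
    ("dave", "agree with @bob"),
]

edges = defaultdict(set)
for author, text in tweets:
    for word in text.split():
        if word.startswith("@"):
            edges[author].add(word[1:])   # edge: author -> mentioned user

in_degree = defaultdict(int)
for author, mentioned in edges.items():
    for m in mentioned:
        in_degree[m] += 1

print(dict(in_degree))                    # who gets mentioned, and how often
```

From the same edge list you can go further, for example computing connected components or PageRank-style influence scores.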
Not all topics could be covered in this course, and new ones pop up every day. Hence, we give you pointers on where to learn more.
Packt has been committed to developer learning since 2004. A lot has changed in software since then, but Packt has remained responsive to these changes, continuing to look forward at the trends and tools defining the way we work and live, and how to put them to work.
With an extensive library of content - more than 4000 books and video courses - Packt's mission is to help developers stay relevant in a rapidly changing world. From new web frameworks and programming languages to cutting-edge data analytics and DevOps, Packt takes software professionals in every field to what's important to them now.
From skills that will help you to develop and future-proof your career to immediate solutions to everyday tech challenges, Packt is a go-to resource to make you a better, smarter developer.
Packt Udemy courses continue this tradition, bringing you comprehensive yet concise video courses straight from the experts.