In this class you will learn:
All our examples are performed on real clusters: no training wheels, no single-node local clusters, and no third-party tools.
Note 1: you will need to know how to SSH into your Amazon AWS instance (I will show how I do it on a Mac, but Windows and Linux aren't covered)
Note 2: There is a small cost involved in using Amazon's AWS instances. The biggest machine we will use is around 5 US cents/hour/machine.
This video and its content are also available on my blog; for the full source, see http://amunategui.github.io/databricks-spark-bayes/index.html
Setting up the Spark launching instance on AWS and controlling it via SSH - part 1
Here is an overview of using PuTTY to connect to your AWS instance. We also transfer the .pem file manually instead of using the scp command (or WinSCP for Windows users) as in the original lecture.
We'll continue setting up our AWS EC2 launching instance and load the latest Spark binaries
We're now going to create a Spark cluster with one master and two worker nodes and do a very brief test run in RStudio Server.
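To make the connection step concrete, here is a minimal sketch of initializing SparkR from RStudio Server against the cluster master; the SPARK_HOME path and the master URL placeholder are assumptions you would replace with your own values.

    # Minimal sketch (Spark 1.6.x): point R at the Spark installation, then
    # connect to the cluster master. The path and master URL are assumptions.
    Sys.setenv(SPARK_HOME = "/root/spark")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)

    sc <- sparkR.init(master = "spark://<master-private-ip>:7077")
    sqlContext <- sparkRSQL.init(sc)

    # Quick sanity check: distribute a small R data frame and look at it
    df <- createDataFrame(sqlContext, faithful)
    head(df)

    sparkR.stop()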
For those who want to try out Spark 2.0, here is a video on how to set it up, as things have changed a bit since 1.6.x. That said, I wholeheartedly recommend going through the original course on Spark 1.6.x first. In the next few months I will update this course with Spark 2.0 material.
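For reference, the main API change that affects this course is how a SparkR session is started. Below is a rough sketch of the Spark 2.0 style next to the 1.6.x style used in the rest of the lectures (the app name and data set are just illustrative choices).

    # Spark 1.6.x style (used throughout the original course):
    #   sc <- sparkR.init(master = "local[*]")
    #   sqlContext <- sparkRSQL.init(sc)

    # Spark 2.0 style: sparkR.session() replaces both calls, and the SparkSQL
    # functions no longer take sqlContext as their first argument.
    library(SparkR)
    sparkR.session(master = "local[*]", appName = "spark2-test")
    df <- as.DataFrame(faithful)
    head(df)
    sparkR.session.stop()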
We'll follow two simple examples from the SparkR documentation to give us a taste of some of the available SparkR functions
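As a preview, this is the kind of thing those examples cover. The sketch below follows the SparkR quick-start pattern; the faithful data set and the specific calls are my choices, not necessarily the exact examples used in the lecture.

    library(SparkR)
    sc <- sparkR.init(master = "local[*]")
    sqlContext <- sparkRSQL.init(sc)

    # Example 1: turn a local R data frame into a distributed SparkR DataFrame
    df <- createDataFrame(sqlContext, faithful)
    head(df)

    # Example 2: select a column and filter rows, then bring a few rows back to R
    head(select(df, df$eruptions))
    head(filter(df, df$waiting > 70))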
We'll look at the SparkR GLM model, do some Gaussian modeling, and extract a mean squared error from our results
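Roughly, the workflow looks like the sketch below, using iris as a stand-in data set (an assumption, not necessarily the data used in the lecture).

    # Gaussian GLM in SparkR 1.6.x; note that dots in R column names
    # (e.g. Sepal.Length) become underscores in the SparkR DataFrame.
    library(SparkR)
    sc <- sparkR.init(master = "local[*]")
    sqlContext <- sparkRSQL.init(sc)

    df <- createDataFrame(sqlContext, iris)
    model <- glm(Sepal_Length ~ Sepal_Width + Petal_Length, data = df, family = "gaussian")
    summary(model)

    # Score the data, pull the predictions back to R, and compute the MSE
    preds <- collect(predict(model, df))
    mse <- mean((preds$Sepal_Length - preds$prediction)^2)
    mse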
Now we'll explore binomial modeling using GLM and how to measure the accuracy of the predictions
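A minimal sketch of that idea, assuming a 0/1 label built from iris (the data set and column names are illustrative, not the lecture's actual data):

    # Build a simple two-class problem in plain R first (0/1 label)
    local_df <- iris[iris$Species != "setosa", ]
    local_df$is_virginica <- ifelse(local_df$Species == "virginica", 1, 0)
    local_df$Species <- NULL

    df <- createDataFrame(sqlContext, local_df)   # assumes sqlContext from sparkRSQL.init()
    model <- glm(is_virginica ~ Sepal_Length + Petal_Length, data = df, family = "binomial")

    # Predictions come back as 0/1 in the 'prediction' column; accuracy is
    # simply the share of rows where label and prediction agree.
    preds <- collect(predict(model, df))
    mean(preds$is_virginica == preds$prediction)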
Here we'll learn how to create an S3 bucket on Amazon Web Services and upload a data set.
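Once the file is in the bucket, it can be pulled straight into SparkR from S3. The sketch below uses made-up bucket and file names and assumes the spark-csv package is loaded and AWS credentials are already configured on the cluster.

    # Hypothetical bucket/file names; S3 credentials must already be set up, and
    # spark-csv loaded, e.g. via
    #   sparkR.init(sparkPackages = "com.databricks:spark-csv_2.10:1.4.0")
    df <- read.df(sqlContext,
                  "s3n://my-example-bucket/my-data-set.csv",
                  source = "com.databricks.spark.csv",
                  header = "true",
                  inferSchema = "true")
    head(df)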
S3cmd is a great tool to automatically break up larger files before uploading them to S3 and reconstruct them into a single file afterwards.
Here we'll look at the basic commands to interact with a SparkR data frame.
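A few commands of that kind, sketched with the built-in mtcars data (the data set and specific calls are my choices):

    df <- createDataFrame(sqlContext, mtcars)   # assumes sqlContext from sparkRSQL.init()

    count(df)                        # number of rows
    printSchema(df)                  # column names and types
    head(df, 3)                      # first rows, returned as a local R data frame
    head(select(df, "mpg", "cyl"))   # pick columns
    head(filter(df, df$mpg > 25))    # filter rows
    head(agg(groupBy(df, df$cyl), avg(df$mpg)))   # group and aggregate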
We continue exploring commands to query SparkR data frames, with a very brief look at the magrittr package and SparkSQL syntax.
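For a flavor of both styles, here is a small sketch (the data set and queries are illustrative):

    library(magrittr)

    df <- createDataFrame(sqlContext, mtcars)   # assumes sqlContext from sparkRSQL.init()

    # magrittr: chain SparkR verbs left to right with the pipe operator
    df %>% filter(df$mpg > 25) %>% select("mpg", "cyl", "hp") %>% head()

    # SparkSQL: register the DataFrame as a temp table and query it with SQL
    registerTempTable(df, "cars")
    sql(sqlContext, "SELECT cyl, COUNT(*) AS n FROM cars GROUP BY cyl") %>% head()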
Quick look at starting the PySpark shell and running a simple word count query
I am a data scientist in the healthcare industry. I have been applying machine learning and predictive analytics to better patients' lives for the past 3 years. Prior to that, I was a developer on a trading desk on Wall Street for 6 years. On the personal side, I love data science competitions and hackathons. People often ask me how one can break into this field, to which I reply: 'Join an online competition!'