
This course is a gentle yet thorough introduction to Data Science, Statistics and R using real life examples.
Q. How do companies make decisions?
A. Using data
We talk about what it takes to go from raw data to a decision based on that data. This sets the agenda for the rest of the course: each step on this journey is covered in the upcoming sections.
Get set up with R and RStudio. All the examples that follow in this course will have source code attached. Download and run them in RStudio.
Bosses are impatient. They often want you to cut to the chase and give them an answer that is good enough, in a short amount of time. Descriptive statistics are the first place to start: they are often the 10-second answer to any question about the data.
Computing a frequency distribution using R
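As a rough illustration of what this lecture covers, here is a minimal sketch of a frequency distribution in base R. The `ages` data and the 10-year bin width are made up for the example.

```r
# Hypothetical data: ages of 10 survey respondents
ages <- c(23, 35, 31, 47, 52, 38, 29, 41, 36, 58)

# Group the ages into 10-year bins, closed on the left: [20,30), [30,40), ...
bins <- cut(ages, breaks = seq(20, 60, by = 10), right = FALSE)

# table() counts how many observations fall into each bin
freq <- table(bins)
print(freq)
```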
A histogram is a good visual summary of your data.
Computing the Mean, Median, Mode in R
The mean, median and mode are single-number summaries of the center of your data. The interquartile range (IQR) measures the spread of the data.
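A small sketch of these summaries in base R, using made-up `scores`. Note that base R's `mode()` reports a storage type rather than the most frequent value, so `stat_mode()` below is a hypothetical helper written for this example.

```r
# Hypothetical data: nine quiz scores
scores <- c(4, 8, 6, 5, 3, 8, 9, 5, 8)

avg <- mean(scores)     # arithmetic mean
mid <- median(scores)   # middle value of the sorted data

# Hypothetical helper: return the most frequent value(s) in x
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[counts == max(counts)])
}
most_common <- stat_mode(scores)

spread <- IQR(scores)   # third quartile minus first quartile
```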
Visualize the IQR and outliers using box and whisker plots
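One possible way to do this in base R is sketched below. The `delivery_days` data are invented. `boxplot.stats()` flags points that lie more than 1.5 times the box height beyond the box, which is the default rule `boxplot()` uses to draw outliers.

```r
# Hypothetical data: delivery times in days, with one unusually slow delivery
delivery_days <- c(2, 3, 3, 4, 4, 5, 5, 6, 21)

# Compute the five-number summary and the outliers without drawing anything
stats <- boxplot.stats(delivery_days)
outliers <- stats$out

# Draw the box and whisker plot; outliers appear as individual points
boxplot(delivery_days, horizontal = TRUE, main = "Delivery time (days)")
```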
The standard deviation measures the spread of a dataset, and it turns out to play a central role in much of the statistics that follows.
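A quick sketch with made-up numbers: R's `sd()` computes the sample standard deviation (dividing by n - 1), which differs slightly from the population formula (dividing by n).

```r
# Hypothetical data
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Sample standard deviation: divides the sum of squared deviations by n - 1
sample_sd <- sd(x)

# Population standard deviation: divides by n instead
population_sd <- sqrt(mean((x - mean(x))^2))
```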
Drawing inferences from data is key to being able to take decisions using data. There is a science to this, whose foundation is in random variables, probability distributions, and performing tests of statistical significance.
Random variables are everywhere. Any data that you'll study can be treated as values of a random variable whose behaviour is governed by a probability distribution.
The Normal Distribution is arguably the most well-known and commonly seen probability distribution. Its probability density function is fully determined by two parameters, the mean and the standard deviation.
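For illustration, base R exposes the normal distribution through `dnorm()` (density), `pnorm()` (cumulative probability) and `qnorm()` (quantiles). The sketch below uses the standard normal, with mean 0 and standard deviation 1.

```r
# Height of the density curve at the mean
peak <- dnorm(0, mean = 0, sd = 1)

# Probability of falling within one standard deviation of the mean
within_one_sd <- pnorm(1) - pnorm(-1)

# Value below which 97.5% of the distribution lies
upper_cutoff <- qnorm(0.975)
```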
Sampling is a little like fishing. Sampling is crucial to induction - drawing conclusions about something by looking at some evidence.
A sample is described by sample statistics like the sample mean. The sampling distribution of the mean is the probability distribution of the sample mean across repeated samples.
Find a point estimate for the average weight of all football players using a sample of football players on one college team.
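The idea can be sketched with a small simulation. Everything here is an assumption for illustration: the population is simulated, and the sample size of 30 and the 2000 repetitions are arbitrary choices.

```r
set.seed(1)  # make the simulation repeatable

# Hypothetical population of 10,000 values
population <- rnorm(10000, mean = 100, sd = 15)

# Draw 2000 samples of size 30 and record the mean of each one
sample_means <- replicate(2000, mean(sample(population, size = 30)))

# The sample means cluster around the population mean,
# with a much smaller spread than the population itself
center <- mean(sample_means)
spread <- sd(sample_means)
```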
Find a point estimate for the % of voters in favor of a candidate.
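Both point estimates come down to a single line of R. The weights and the poll counts below are invented for the sketch.

```r
# Hypothetical sample: weights (kg) of 8 players on one team
weights <- c(98, 105, 110, 92, 101, 115, 99, 108)
estimated_mean_weight <- mean(weights)   # point estimate of the population mean

# Hypothetical poll: 540 of 1200 respondents favor the candidate
in_favor <- 540
polled <- 1200
estimated_share <- in_favor / polled     # point estimate of the population proportion
```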
A test of significance is an important step in building support for your findings and inferences. Here is the first example of a test of significance - is the population mean equal to a given value?
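One common way to run this kind of test in R is a one-sample t-test via `t.test()`. The course may use a different test statistic. The `battery_hours` data and the claimed mean of 50 are hypothetical.

```r
# Hypothetical sample: battery life in hours for 8 devices
battery_hours <- c(48.2, 51.1, 49.5, 47.8, 50.3, 46.9, 49.0, 48.4)

# Null hypothesis: the population mean is 50 hours
result <- t.test(battery_hours, mu = 50)

t_value <- unname(result$statistic)  # the test statistic
p_value <- result$p.value            # small values count as evidence against the null
```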
Perform a test of significance to check whether the population % is equal to a certain value
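A possible sketch using base R's `prop.test()`, which by default applies a continuity correction. The poll counts and the hypothesized 50% are made up.

```r
# Hypothetical poll: 540 of 1200 respondents favor the candidate
# Null hypothesis: the true share of voters in favor is 50%
result <- prop.test(x = 540, n = 1200, p = 0.5)

observed_share <- unname(result$estimate)  # the sample proportion
p_value <- result$p.value
```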
Perform a test of significance to compare 2 population means. The example used is A/B Testing - which is pretty widely used in internet companies to test out product features.
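As a sketch of an A/B comparison, `t.test()` with two samples runs Welch's t-test by default, which does not assume equal variances. The course example may differ. The timing data are invented.

```r
# Hypothetical A/B test: minutes spent on the page under each design
time_a <- c(12.1, 10.4, 11.8, 13.0, 9.7, 12.5)
time_b <- c(13.2, 14.1, 12.9, 15.0, 13.7, 14.4)

# Null hypothesis: both designs have the same mean time on page
result <- t.test(time_a, time_b)

group_means <- unname(result$estimate)  # mean of A, then mean of B
p_value <- result$p.value
```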
Perform a test of significance to compare two population proportions
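`prop.test()` also accepts two groups at once. The conversion counts below are hypothetical.

```r
# Hypothetical conversions: 120 of 1000 visitors for variant A, 150 of 1000 for variant B
conversions <- c(120, 150)
visitors <- c(1000, 1000)

# Null hypothesis: both variants have the same conversion rate
result <- prop.test(x = conversions, n = visitors)

rates <- unname(result$estimate)  # observed conversion rate of each variant
p_value <- result$p.value
```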
The next few sections dive deep into all the data processing, slicing and dicing ability that R provides. The wide variety of R packages available is one reason why R is popular among many data scientists.
Let's start with the basics. What are variables and how do we assign variables in R?
print(), show(), message(), cat() are different ways to print something to screen.
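A minimal sketch of assignment and printing. The variable names are arbitrary.

```r
# Assign with <- (the conventional operator); = also works at the top level
visitors <- 42
site = "example"

print(visitors)                              # prints the value with an index prefix
cat("Visitors today:", visitors, "\n")       # concatenates its arguments, no prefix
message("Finished loading data for ", site)  # writes to stderr, for diagnostics
```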
Numbers in R are of type numeric.
R has built-in datatypes for dates and timestamps.
Logical is a datatype that is the result of conditional tests in R
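The three datatypes above can be sketched briefly in base R. The specific dates are arbitrary.

```r
# Numbers
class(3.14)         # "numeric"
is.numeric(42L)     # integers count as numeric too

# Dates and timestamps
start <- as.Date("2024-01-15")
end <- as.Date("2024-03-01")
days_between <- as.numeric(end - start)   # subtracting dates gives a time difference
stamp <- as.POSIXct("2024-01-15 09:30:00", tz = "UTC")

# Logicals are what conditional tests return
is_later <- end > start
class(is_later)     # "logical"
```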
The wide variety of built-in data structures is what sets R apart from other standard programming languages. These include vectors, arrays, matrices, data frames and lists.
The mode of a vector is the datatype of all its elements.
Finding the sum, product, or mean of a vector
Generate sequences using the : operator, rep() and seq()
Access elements based on their position in the vector.
Access elements based on whether they pass a conditional test.
Assign names to the elements of a vector
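The vector operations from the last few lectures can be sketched together. The values and names are made up.

```r
v <- c(10, 20, 30, 40, 50)

mode(v)                       # "numeric": every element shares one datatype
mode(c(1, "a"))               # "character": mixed input is coerced to one type

total <- sum(v)
product <- prod(v)
average <- mean(v)

one_to_five <- 1:5                       # the : operator
quarters <- seq(0, 1, by = 0.25)         # 0.00 0.25 0.50 0.75 1.00
pattern <- rep(c("a", "b"), times = 2)   # "a" "b" "a" "b"

second <- v[2]                # access by position (R indexes from 1)
large <- v[v > 25]            # access by a conditional test

names(v) <- c("a", "b", "c", "d", "e")
by_name <- v["c"]             # access by name
```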
Creating an array can be done by using a vector and then arranging it along dimensions.
An outer product applies an operation (multiplication by default) to every pair of elements drawn from two arrays.
A matrix is a two-dimensional array, but it has special meaning and can be interpreted in several different ways.
Use rbind() and cbind() to combine matrices by rows or by columns.
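A short sketch of both ideas, with arbitrary dimensions and values.

```r
# Arrange the vector 1:24 along three dimensions: 2 rows, 3 columns, 4 layers
arr <- array(1:24, dim = c(2, 3, 4))
last_element <- arr[2, 3, 4]   # R fills arrays column by column, so this is 24

# Outer product: multiply every element of 1:3 by every element of 1:3
times_table <- outer(1:3, 1:3)

# The same pairing works with another function, for example addition
sums <- outer(1:3, 1:2, FUN = "+")
```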
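A minimal sketch with a small made-up matrix:

```r
# A 2 x 3 matrix, filled column by column
m <- matrix(1:6, nrow = 2)

# rbind() stacks a new row underneath; cbind() attaches a new column on the right
with_row <- rbind(m, c(7, 8, 9))
with_col <- cbind(m, c(10, 11))
```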
A factor is a special type of vector used to represent categorical variables
Lists are fundamentally different from vectors, arrays and matrices, which are all homogeneous data structures: a list can hold elements of different types.
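To make the contrast concrete, here is a sketch of a factor and a list. The category labels and the fields of `person` are invented.

```r
# A factor stores categorical data with a fixed set of levels
sizes <- factor(c("small", "large", "medium", "small"),
                levels = c("small", "medium", "large"))
size_counts <- table(sizes)   # counts per level, in level order

# A list can mix types: a string, a number and a numeric vector
person <- list(name = "Asha", age = 31, scores = c(90, 85))
first_score <- person$scores[1]
```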
Data Frames are how R stores data read from files and databases.
Using the aggregate() and order() functions
Merge data frames based on one or more common columns
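The three data frame operations can be sketched on two small made-up tables, `orders` and `customers`, which share a `customer_id` column.

```r
# Hypothetical data
orders <- data.frame(customer_id = c(1, 2, 1, 3),
                     amount = c(50, 20, 30, 70))
customers <- data.frame(customer_id = c(1, 2, 3),
                        city = c("Pune", "Delhi", "Mumbai"))

# aggregate(): total order amount per customer
totals <- aggregate(amount ~ customer_id, data = orders, FUN = sum)

# order(): sort the totals from largest to smallest
sorted <- totals[order(-totals$amount), ]

# merge(): join the totals to customer details on the common column
merged <- merge(totals, customers, by = "customer_id")
```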
Taught by a Stanford-educated ex-Googler and an IIT- and IIM-educated ex-Flipkart lead analyst. This team has decades of practical experience in quant trading, analytics and e-commerce.
This course is a gentle yet thorough introduction to Data Science, Statistics and R using real life examples.
Let’s parse that.
Gentle, yet thorough: This course does not require a prior quantitative or mathematics background. It starts by introducing basic concepts such as the mean and median, and eventually covers all aspects of an analytics or data science career, from analysing and preparing raw data to visualising your findings.
Data Science, Statistics and R: This course is an introduction to Data Science and Statistics using the R programming language. It covers both the theoretical aspects of Statistical concepts and the practical implementation using R.
Real life examples: Every concept is explained with the help of examples, case studies and source code in R wherever necessary. The examples cover a wide array of topics and range from A/B testing in an Internet company context to the Capital Asset Pricing Model in a quant finance context.
What's Covered:
Data Analysis with R: Datatypes and Data structures in R, Vectors, Arrays, Matrices, Lists, Data Frames, Reading data from files, Aggregating, Sorting & Merging Data Frames
Linear Regression: Regression, Simple Linear Regression in Excel, Simple Linear Regression in R, Multiple Linear Regression in R, Categorical variables in regression, Robust regression, Parsing regression diagnostic plots
Data Visualization in R: Line plot, Scatter plot, Bar plot, Histogram, Scatterplot matrix, Heat map, Packages for Data Visualisation: RColorBrewer, ggplot2
Descriptive Statistics: Mean, Median, Mode, IQR, Standard Deviation, Frequency Distributions, Histograms, Boxplots
Inferential Statistics: Random Variables, Probability Distributions, Uniform Distribution, Normal Distribution, Sampling, Sampling Distribution, Hypothesis testing, Test statistic, Test of significance