
Import and Export in R
You might find that loading data into R can be quite frustrating. Almost every single type of file that you want to get into R seems to require its own function, and even then you might get lost in the functions’ arguments. In short, you might agree that it can be fairly easy to mix up things from time to time, whether you are a beginner or a more advanced R user.
Types of files that we'll import
Importing CSV file
Importing Text file
Importing Excel file
Importing files from Database
Importing files from Web
Importing files from Statistical Tool
And lastly Exporting the Data
Importing CSV file
The utils package, which is automatically loaded in your R session on startup, can import CSV files with the read.csv() function.
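read.csv() reads the file and returns its contents as a data frame.
Use the following commands to import CSV files:
# Importing a CSV file with read.csv()
# file.choose() opens a file picker, so the next line only works in an interactive session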
titanic_train<- read.csv(file.choose())
class(titanic_train)
titanic <- read.csv("titanic_train.csv")
str(titanic)
#Using readr package
install.packages("readr")
library(readr)
titanic <- read_csv("titanic_train.csv")
titanic
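The arguments of read.csv() are easier to follow on a file whose contents you know. The following is a minimal sketch, assuming a small made-up passengers table (not the Titanic data) that is first written to a temporary file and then read back:

```r
# Hypothetical sample data, written to a temporary file so the example is self-contained
passengers <- data.frame(
  name     = c("Allen", "Braund", "Cumings"),
  age      = c(29, NA, 38),
  survived = c(1, 0, 1)
)
csv_path <- tempfile(fileext = ".csv")
write.csv(passengers, csv_path, row.names = FALSE)

# header = TRUE: the first line holds the column names
# na.strings: text values that should be treated as missing
# stringsAsFactors = FALSE: keep text columns as character vectors
imported <- read.csv(csv_path, header = TRUE, na.strings = c("", "NA"),
                     stringsAsFactors = FALSE)
str(imported)
```

Calling str() right after an import is a quick way to check that every column arrived with the type you expected.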
Importing Text File
The utils package, which is automatically loaded in your R session on startup, can import text files with the read.table() function.
read.table() reads the file and returns its contents as a data frame.
If you have a .txt or a tab-delimited text file, you can import it with read.table(). Set sep to the delimiter used in the file and header = TRUE if the first line holds the column names:
# Importing a tab-delimited text file with read.table()
# Open the help page for read.table()
?read.table
# Import the hotdog.txt file into a data frame named hotdogs
hotdogs <- read.table("hotdog.txt", sep = "\t", header = TRUE)
# Call head() on hotdogs
head(hotdogs)
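When the automatic type detection of read.table() is not what you want, you can state the column names and types yourself. The sketch below assumes a small made-up tab-delimited file (the values are illustrative, not the contents of hotdog.txt) and uses the col.names and colClasses arguments:

```r
# Hypothetical tab-delimited data written to a temporary file
lines <- c("type\tcalories\tsodium",
           "Beef\t186\t495",
           "Meat\t173\t458",
           "Poultry\t129\t430")
txt_path <- tempfile(fileext = ".txt")
writeLines(lines, txt_path)

# col.names sets the column names, colClasses sets the type of each column
snacks <- read.table(txt_path, sep = "\t", header = TRUE,
                     col.names  = c("type", "calories", "sodium"),
                     colClasses = c("character", "numeric", "numeric"))
head(snacks)
```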
Importing Excel Files
As most of you know, Excel is a spreadsheet application developed by Microsoft. It is an easily accessible tool for organizing, analyzing and storing data in tables and is widely used in many different application fields all over the world. It should come as no surprise that R offers several ways to read, write and manipulate Excel files (and spreadsheets in general).
How To Import Excel Files
Before you start thinking about how to load your Excel files and spreadsheets into R, you first need to make sure that your data is well prepared for import. If you neglect to do this, you might run into problems when using the R functions.
The readxl package is not loaded automatically: install it once with install.packages("readxl") and load it in each session with library(readxl). It can then import Excel files with the read_excel() function.
read_excel() returns the contents of a sheet as a data frame (a tibble).
Use the following commands to import an Excel file in R:
# Importing an xlsx file using the readxl package - read_excel()
# Install the readxl package (only needed once)
install.packages("readxl")
# Load the readxl package
library(readxl)
# Print out the names of the sheets in the workbook
excel_sheets("urbanpop.xlsx")
# Read the sheets, one by one
pop_1 <- read_excel("urbanpop.xlsx", sheet = 1)
pop_2 <- read_excel("urbanpop.xlsx", sheet = 2)
pop_3 <- read_excel("urbanpop.xlsx", sheet = 3)
# Put pop_1, pop_2 and pop_3 in a list: pop_list
pop_list <- list(pop_1,pop_2,pop_3)
# Display the structure of pop_list
str(pop_list)
# Explore other packages - XLConnect, xlsx, gdata
Export Data in R - Text,CSV,Excel - Text Study Note
Section 7, Lecture 45
Export Data in R - Text,CSV,Excel
In this tutorial, we will learn how to export data from the R environment to different formats.
To export data to the hard drive, you need a file path and a file extension. The path is the location where the data will be stored, and the extension (.txt, .csv, .xlsx) indicates the format of the file.
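Building the path with file.path() avoids typing separators by hand. The following is a minimal sketch, assuming a hypothetical exports folder inside the session's temporary directory:

```r
# Hypothetical output folder inside the session's temporary directory
out_dir <- file.path(tempdir(), "exports")
dir.create(out_dir, showWarnings = FALSE)

# file.path() joins the pieces with "/", which R also accepts on Windows
out_file <- file.path(out_dir, "mydata.txt")

basename(out_file)             # file name with extension
tools::file_ext(out_file)      # extension only
dir.exists(dirname(out_file))  # check that the target folder exists
```

Checking that the folder exists before exporting gives a clearer failure than the error raised by the write function itself.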
Exporting Text File
You can export text files with the write.table(mydata, "Path../../mydata.txt", sep = "\t") function.
Use the following command to export a text file:
# Export data in a text file
write.table(hotdogs, "D:\\Rajib Backup\\Project\\Innovation\\Analytics\\Machine Learning\\Tutorial\\EduCBA\\Chap5 -Import and Export\\NewHotdog.txt", sep = "\t")
Exporting CSV File
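You can export CSV files with the write.csv(mydata, "Path../../mydata.csv") function.
Use the following commands to export a CSV file:
# Export data in csv
# my_df holds the first three rows of the built-in mtcars dataset
my_df <- mtcars[1:3, ]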
write.csv(my_df, "D:\\Rajib Backup\\Project\\Innovation\\Analytics\\Machine Learning\\Tutorial\\EduCBA\\Chap5 -Import and Export\\my_df.csv")
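By default write.csv() also writes the row names as an extra first column, which shows up as an additional column when the file is read back. A small sketch of the difference, using a temporary file instead of the path above:

```r
my_df <- mtcars[1:3, ]
csv_file <- tempfile(fileext = ".csv")

# Default: row names are written as an unnamed first column
write.csv(my_df, csv_file)
with_names <- read.csv(csv_file)
ncol(with_names)      # one more column than my_df

# row.names = FALSE drops that extra column
write.csv(my_df, csv_file, row.names = FALSE)
without_names <- read.csv(csv_file)
ncol(without_names)   # same number of columns as my_df
```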
Exporting Excel File
You can export Excel files with the write_xlsx(mydata, "Path../../mydata.xlsx") function from the writexl package.
Use the following commands to export an Excel file:
# Export data in excel
install.packages("writexl")
library(writexl)
my_df <- mtcars[1:3,]
write_xlsx(my_df,"D:\\Rajib Backup\\Project\\Innovation\\Analytics\\Machine Learning\\Tutorial\\EduCBA\\Chap5 -Import and Export\\Newmtcars.xlsx")
Data Manipulation
The apply() functions form the basis of more complex combinations and help you perform operations with very few lines of code. More specifically, the family is made up of the following functions:
apply()
lapply()
sapply()
vapply()
tapply()
mapply()
by()
How To Use apply() in R
Let’s start with apply(), which operates on arrays.
The R base manual tells you that it’s called as follows: apply(X, MARGIN, FUN, ...)
where:
X is an array, or a matrix if the array has 2 dimensions;
MARGIN defines how the function is applied: with MARGIN = 1 it is applied over rows, whereas with MARGIN = 2 it is applied over columns;
FUN is the function that you want to apply to the data. It can be any R function, including a User Defined Function (UDF).
Use the following commands to try the apply() function:
# Topic 1: Apply Function
###################################################################################
# apply() applies a function to the rows or columns of a matrix and returns a vector, array or list
# Syntax: apply(X, MARGIN, FUN), where MARGIN indicates whether the function is applied to rows or to columns
# MARGIN = 1 indicates that the function is applied to each row
# MARGIN = 2 indicates that the function is applied to each column
# FUN can be any function such as mean, median or sum
m <- matrix(c(1,2,3,4),2,2)
m
apply(m, 1, sum)
apply(m, 2,sum)
apply(m, 1, mean)
apply(m, 2, mean)
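The text above mentions that apply() also accepts a User Defined Function. A minimal sketch, assuming a made-up helper named value_spread:

```r
m <- matrix(c(1, 2, 3, 4), 2, 2)

# Hypothetical user-defined function: gap between the largest and smallest value
value_spread <- function(v) max(v) - min(v)

row_spread <- apply(m, 1, value_spread)   # one result per row
col_spread <- apply(m, 2, value_spread)   # one result per column
row_spread
col_spread

# An anonymous function works as well; because it returns a vector for each row,
# the result is a matrix in which each row's result is stored as a column
apply(m, 1, function(v) v / sum(v))
```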
The lapply() Function
Use lapply() when you want to apply a given function to every element of a list and obtain a list as the result. When you execute ?lapply, you see that the syntax looks like that of the apply() function.
The difference is that:
It can be used for other objects like data frames, lists or vectors; and
The output returned is a list (which explains the “l” in the function name), which has the same number of elements as the object passed to it.
Use the following commands to try the lapply() function:
################################################
#Using sapply and lapply
################################################
#Lapply() function
#lapply is similar to apply, but it takes a list as an input, and returns a list as the output.
# syntax is lapply(list, function)
#example 1:
data <- list(x = 1:5, y = 6:10, z = 11:15)
data
lapply(data, FUN = median)
#example 2:
data2 <- list(a=c(1,1), b=c(2,2), c=c(3,3))
data2
lapply(data2, sum)
lapply(data2, mean)
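Since lapply() also accepts data frames and plain vectors, a short sketch of both cases may help. The data frame below is made up for illustration:

```r
# Hypothetical data frame with two numeric columns
df <- data.frame(height = c(150, 160, 170), weight = c(50, 60, 70))

# A data frame is a list of columns, so lapply() visits each column
col_means <- lapply(df, mean)
col_means

# On a plain vector, lapply() visits each element
squares <- lapply(1:3, function(i) i^2)
squares

length(col_means) == ncol(df)   # one list element per column
```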
The sapply() Function
The sapply() function works like lapply(), but it tries to simplify the output to the most elementary data structure that is possible. Indeed, sapply() is a ‘wrapper’ function for lapply().
An example may help to understand this: let’s say that you have a list of matrices and want to extract a single element from each one, for instance the first element of the second row of each matrix.
lapply() would give you a list. sapply() returns a vector instead, unless you pass simplify = FALSE as a parameter to sapply(); then a list is returned.
Use the following commands to try the sapply() function:
#Sapply function
# sapply is the same as lapply, but returns a vector instead of a list.
# syntax is sapply(list, function)
#example 1 :
data <- list(x = 1:5, y = 6:10, z = 11:15)
data
lapply(data, FUN = sum)
lapply(data, FUN = median)
unlist(lapply(data, FUN = median))
sapply(data, FUN = sum)
sapply(data, FUN = median)
# Note: if the results are all scalars, a vector is returned;
# if the results all have the same length (> 1), a matrix is returned. Otherwise, the result is returned as a list.
sapply(data, FUN = range)
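The extraction of one element from each matrix in a list, described in the text above, can be sketched as follows. The three matrices are made up for illustration:

```r
# Hypothetical list of three small matrices
mats <- list(A = matrix(1:9, 3, 3),
             B = matrix(4:15, 4, 3),
             C = matrix(8:10, 3, 2))

# "[" is the subsetting function; the extra arguments mean row 2, column 1
as_list   <- lapply(mats, "[", 2, 1)                    # list of three values
as_vector <- sapply(mats, "[", 2, 1)                    # named vector
back_list <- sapply(mats, "[", 2, 1, simplify = FALSE)  # list again

as_vector
```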
The vapply() Function
And lastly, the vapply() function. It is called as follows: vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)
Arguments
X: A vector (atomic or list).
FUN: A function to be applied.
FUN.VALUE: A (generalized) vector; a template for the return value from FUN.
... : Optional arguments to FUN.
USE.NAMES: Logical; if TRUE and if X is character, use X as names for the result unless it had names already.
Use the following commands to try the vapply() function:
#vapply function
# vapply() is similar to sapply(), but you explicitly specify the type and length of the return value (integer, double, character).
vapply(data,sum, FUN.VALUE = double(1))
vapply(data,range, FUN.VALUE = double(2))
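The benefit of declaring FUN.VALUE is that vapply() stops with an error when a result does not match the template, instead of silently changing shape as sapply() can. A minimal sketch using tryCatch() to show the failure without interrupting the script:

```r
data <- list(x = 1:5, y = 6:10, z = 11:15)

# Each call to range() returns two numbers, which matches double(2)
ranges <- vapply(data, range, FUN.VALUE = double(2))
ranges

# Declaring a single value per element does not match, so vapply() raises an error
result <- tryCatch(
  vapply(data, range, FUN.VALUE = double(1)),
  error = function(e) "vapply() rejected the result: wrong length"
)
result
```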
Use the following commands to try the tapply() and mapply() functions:
################################################
# Using tapply() and mapply()
################################################
# tapply() works on a vector:
# it applies the function to the groups of values defined by a factor.
# syntax is tapply(X, INDEX, FUN)
#example 1:
age <- c(23,33,28,21,20,19,34)
gender <- c("m" , "m", "m" , "f", "f", "f" , "m")
f <- factor(gender)
f
tapply(age, f, mean)
tapply(age, gender, mean)
#example number 2
#load the datasets
library(datasets)
#you can view all the datasets
data()
View(mtcars)
class(mtcars)
mtcars$wt
mtcars$cyl
f <- factor(mtcars$cyl)
f
tapply(mtcars$wt, f, mean)
##############################################################################
# mapply() - mapply is a multivariate version of sapply. It will apply the specified function
# to the first element of each argument first, followed by the second element, and so on.
# syntax is mapply(FUN, ...)
## example number 1
# create a list:
rep(1,4)
rep(2,3)
rep(3,2)
rep(4,1)
a <- list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
a
# We can see that we are calling the same function (rep) where the first argument
# varies from 1 to 4 and the second argument varies from 4 to 1.
# Instead, we can use the mapply function
b <- mapply(rep, 1:4, 4:1)
b
#####################################################################################
####################################################################################
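mapply() is not limited to built-in functions such as rep(). The sketch below assumes a made-up helper named total_price and shows two arguments that are often needed: MoreArgs for values that stay the same in every call, and SIMPLIFY = FALSE to keep the result as a list.

```r
# Hypothetical helper: total price for a quantity at a unit price, including tax
total_price <- function(quantity, unit_price, tax_rate) {
  quantity * unit_price * (1 + tax_rate)
}

quantities  <- c(2, 5, 1)
unit_prices <- c(10, 4, 100)

# quantities and unit_prices are walked in parallel;
# MoreArgs holds the arguments that stay the same for every call
totals <- mapply(total_price, quantities, unit_prices,
                 MoreArgs = list(tax_rate = 0.5))
totals

# SIMPLIFY = FALSE keeps the results in a list, like lapply()
reps <- mapply(rep, 1:3, 3:1, SIMPLIFY = FALSE)
reps
```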
Selecting columns using select()
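select() keeps only the variables you mention.
Use the following commands to select columns:
#######################################
#select(): Select specific columns from tbl
#######################################
# dplyr provides select(), mutate(), filter(), arrange(), summarise() and glimpse();
# the hflights package provides the hflights data frame used in the examples
library(dplyr)
library(hflights)
tbl <- select(hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay)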
glimpse(tbl)
#starts_with("X"): every name that starts with "X",
#ends_with("X"): every name that ends with "X",
#contains("X"): every name that contains "X",
#matches("X"): every name that matches "X", where "X" can be a regular expression,
#num_range("x", 1:5): the variables named x1, x2, x3, x4 and x5,
#one_of(x): every name that appears in x, which should be a character vector.
#Example: print out only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode columns of hflights
select(hflights, ends_with("Num"))
select(hflights, starts_with("Cancel"))
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
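The helper functions are easier to compare on a table small enough to read at a glance. The following sketch assumes dplyr is installed and uses a made-up miniature flights table rather than hflights:

```r
library(dplyr)

# Hypothetical miniature flights table
flights <- data.frame(
  Carrier   = c("AA", "UA"),
  FlightNum = c(428, 1121),
  TailNum   = c("N576AA", "N557AA"),
  DepDelay  = c(0, 1),
  ArrDelay  = c(-10, -9)
)

select(flights, ends_with("Num"))          # FlightNum, TailNum
select(flights, contains("Delay"))         # DepDelay, ArrDelay
select(flights, Carrier, matches("^Arr"))  # Carrier, ArrDelay
select(flights, -TailNum)                  # everything except TailNum
```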
Create new columns using mutate()
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Use the following commands to create new columns:
#######################################
#mutate(): Add columns from existing data
#######################################
g2 <- mutate(hflights, loss = ArrDelay - DepDelay)
g2
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
g1
# Base R equivalent: hflights$ActualGroundTime <- hflights$ActualElapsedTime - hflights$AirTime
#######################################
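mutate() can create several columns in one call, and a later column may use one created earlier in the same call. A minimal sketch, assuming dplyr is installed and using a made-up three-row table:

```r
library(dplyr)

# Hypothetical table with the same column names as hflights
flights <- data.frame(
  ActualElapsedTime = c(60, 70, 65),
  AirTime           = c(40, 45, 48),
  Distance          = c(224, 224, 224)
)

result <- mutate(flights,
                 GroundTime  = ActualElapsedTime - AirTime,
                 AvgSpeed    = Distance / AirTime * 60,         # miles per hour
                 GroundShare = GroundTime / ActualElapsedTime)  # uses the new column
result

names(flights)   # the original data frame is unchanged
```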
Selecting rows using filter()
Filtering data is one of the most basic operations when you work with data. You may want to remove a part of the data that is invalid or that you’re simply not interested in, or you may want to zero in on a particular part of the data you want to know more about. dplyr provides the filter() function for this kind of filtering. With dplyr you can express filters that would be hard to perform or complicated to construct with tools like SQL and traditional BI tools, in a simple and more intuitive way.
R comes with a set of logical operators that you can use inside filter():
• <
• <=
• ==
• !=
• >=
• >
• %in%
Use the following commands to filter rows:
#filter(): Keep only the rows that match the logical condition
#######################################
#R comes with a set of logical operators that you can use inside filter():
#x < y, TRUE if x is less than y
#x <= y, TRUE if x is less than or equal to y
#x == y, TRUE if x equals y
#x != y, TRUE if x does not equal y
#x >= y, TRUE if x is greater than or equal to y
#x > y, TRUE if x is greater than y
#x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
# All flights that traveled 3000 miles or more
long_flight <- filter(hflights, Distance >= 3000)
View(long_flight)
glimpse(long_flight)
# All flights where taxiing took longer than flying
long_journey <- filter(hflights, TaxiIn + TaxiOut > AirTime)
View(long_journey)
# All flights that departed before 5am or arrived after 10pm
All_Day_Journey <- filter(hflights, DepTime < 500 | ArrTime > 2200)
# All flights that departed late but arrived ahead of schedule
Early_Flight <- filter(hflights, DepDelay > 0, ArrDelay < 0)
glimpse(Early_Flight)
# All flights that were cancelled after being delayed
Cancelled_Delay <- filter(hflights, Cancelled == 1, DepDelay > 0)
#How many weekend flights flew a distance of more than 1000 miles but
#had a total taxiing time below 15 minutes?
w <- filter(hflights, DayOfWeek == 6 |DayOfWeek == 7, Distance >1000, TaxiIn + TaxiOut <15)
nrow(w)
y <- filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)
nrow(y)
#######################################
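Two details of filter() are easy to miss: conditions separated by commas are combined with AND, and rows for which the condition evaluates to NA are dropped. A small sketch, assuming dplyr is installed and using a made-up four-row table:

```r
library(dplyr)

# Hypothetical table; the second row has a missing departure delay
flights <- data.frame(
  DayOfWeek = c(6, 7, 1, 6),
  Distance  = c(1200, 800, 1500, 2000),
  DepDelay  = c(5, NA, -2, 0)
)

# Comma-separated conditions are combined with AND
weekend_long <- filter(flights, DayOfWeek %in% c(6, 7), Distance > 1000)
weekend_long

# | means OR
short_or_early <- filter(flights, Distance < 1000 | DepDelay < 0)
short_or_early

# Rows where the condition evaluates to NA are dropped
on_time_or_late <- filter(flights, DepDelay >= 0)
nrow(on_time_or_late)
```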
Arrange or re-order rows using arrange()
To arrange (or re-order) rows by a particular column, list the name of the column you want to sort the rows by.
Use the following commands to arrange rows:
#######################################
#arrange(): reorders the rows according to single or multiple variables,
#######################################
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay)) #Delay not equal to NA
glimpse(dtc)
# Arrange dtc by departure delays
d <- arrange(dtc, DepDelay)
# Arrange dtc so that cancellation reasons are grouped
c <- arrange(dtc,CancellationCode )
#By default, arrange() arranges the rows from smallest to largest.
#Rows with the smallest value of the variable will appear at the top of the data set.
#You can reverse this behavior with the desc() function.
# Arrange according to carrier and decreasing departure delays
des_Flight <- arrange(hflights, UniqueCarrier, desc(DepDelay))
# Arrange flights by total delay (normal order).
arrange(hflights, ArrDelay + DepDelay)
#######################################
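When several columns are listed, arrange() sorts by the first one and uses the following ones to break ties. A minimal sketch, assuming dplyr is installed and using a made-up table:

```r
library(dplyr)

# Hypothetical table with two carriers
flights <- data.frame(
  Carrier  = c("UA", "AA", "UA", "AA"),
  DepDelay = c(10, 3, 25, 40)
)

arrange(flights, DepDelay)         # smallest delay first
arrange(flights, desc(DepDelay))   # largest delay first

# Sort by carrier, then by decreasing delay within each carrier
by_carrier <- arrange(flights, Carrier, desc(DepDelay))
by_carrier
```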
Selecting columns using select()
select() keeps only the variables you mention
Use This Command To Perform The Above Mentioned Function
#######################################
#select(): Select specific column from tbl
#######################################
tbl <- select (hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay )
glimpse(tbl)
#starts_with("X"): every name that starts with "X",
#ends_with("X"): every name that ends with "X",
#contains("X"): every name that contains "X",
#matches("X"): every name that matches "X", where "X" can be a regular expression,
#num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
#one_of(x): every name that appears in x, which should be a character vector.
#Example: print out only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode columns of hflights
select(hflights, ends_with("Num"))
select(hflights, starts_with("Cancel"))
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
Create new columns using mutate()
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Use This Command To Perform The Above Mentioned Function
#######################################
#mutate(): Add columns from existing data
#######################################
g2 <- mutate(hflights, loss = ArrDelay - DepDelay)
g2
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
g1
#hflights$ActualGroundTime <- hflights$ActualElapsedTime - hflights$AirTime
#######################################
Selecting rows using filter()
Filtering data is one of the very basic operation when you work with data. You want to remove a part of the data that is invalid or simply you’re not interested in. Or, you want to zero in on a particular part of the data you want to know more about. Of course, dplyr has ’filter()’ function to do such filtering, but there is even more. With dplyr you can do the kind of filtering, which could be hard to perform or complicated to construct with tools like SQL and traditional BI tools, in such a simple and more intuitive way.
R comes with a set of logical operators that you can use inside filter():
• <
• <=
• ==
• !=
• !=
• >
Use This Command To Perform The Above Mentioned Function
#filter() : Filter specific rows which matches the logical condition
#######################################
#R comes with a set of logical operators that you can use inside filter():
#x < y, TRUE if x is less than y
#x <= y, TRUE if x is less than or equal to y
#x == y, TRUE if x equals y
#x != y, TRUE if x does not equal y
#x >= y, TRUE if x is greater than or equal to y
#x > y, TRUE if x is greater than y
#x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
# All flights that traveled 3000 miles or more
long_flight <- filter(hflights, Distance >= 3000)
View(long_flight)
glimpse(long_flight)
# All flights where taxing took longer than flying
long_journey <- filter(hflights, TaxiIn + TaxiOut > AirTime)
View(long_journey)
# All flights that departed before 5am or arrived after 10pm
All_Day_Journey <- filter(hflights, DepTime < 500 | ArrTime > 2200)
# All flights that departed late but arrived ahead of schedule
Early_Flight <- filter(hflights, DepDelay > 0, ArrDelay < 0)
glimpse(Early_Flight)
# All flights that were cancelled after being delayed
Cancelled_Delay <- filter(hflights, Cancelled == 1, DepDelay > 0)
#How many weekend flights flew a distance of more than 1000 miles but
#had a total taxiing time below 15 minutes?
w <- filter(hflights, DayOfWeek == 6 |DayOfWeek == 7, Distance >1000, TaxiIn + TaxiOut <15)
nrow(w)
y <- filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)
nrow(y)
#######################################
Arrange or re-order rows using arrange()
To arrange (or re-order) rows by a particular column such as the taxonomic order, list the name of the column you want to arrange the rows
Use This Command To Perform The Above Mentioned Function
#######################################
#arrange(): reorders the rows according to single or multiple variables,
#######################################
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay)) #Delay not equal to NA
glimpse(dtc)
# Arrange dtc by departure delays
d <- arrange(dtc, DepDelay)
# Arrange dtc so that cancellation reasons are grouped
c <- arrange(dtc,CancellationCode )
#By default, arrange() arranges the rows from smallest to largest.
#Rows with the smallest value of the variable will appear at the top of the data set.
#You can reverse this behavior with the desc() function.
# Arrange according to carrier and decreasing departure delays
des_Flight <- arrange(hflights, desc(DepDelay))
# Arrange flights by total delay (normal order).
arrange(hflights, ArrDelay + DepDelay)
#######################################
Selecting columns using select()
select() keeps only the variables you mention
Use This Command To Perform The Above Mentioned Function
#######################################
#select(): Select specific column from tbl
#######################################
tbl <- select (hflights, ActualElapsedTime, AirTime, ArrDelay, DepDelay )
glimpse(tbl)
#starts_with("X"): every name that starts with "X",
#ends_with("X"): every name that ends with "X",
#contains("X"): every name that contains "X",
#matches("X"): every name that matches "X", where "X" can be a regular expression,
#num_range("x", 1:5): the variables named x01, x02, x03, x04 and x05,
#one_of(x): every name that appears in x, which should be a character vector.
#Example: print out only the UniqueCarrier, FlightNum, TailNum, Cancelled, and CancellationCode columns of hflights
select(hflights, ends_with("Num"))
select(hflights, starts_with("Cancel"))
select(hflights, UniqueCarrier, ends_with("Num"), starts_with("Cancel"))
Create new columns using mutate()
mutate() is the second of five data manipulation functions you will get familiar with in this course. mutate() creates new columns which are added to a copy of the dataset.
Use This Command To Perform The Above Mentioned Function
#######################################
#mutate(): Add columns from existing data
#######################################
g2 <- mutate(hflights, loss = ArrDelay - DepDelay)
g2
g1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
g1
#hflights$ActualGroundTime <- hflights$ActualElapsedTime - hflights$AirTime
#######################################
Selecting rows using filter()
Filtering data is one of the very basic operation when you work with data. You want to remove a part of the data that is invalid or simply you’re not interested in. Or, you want to zero in on a particular part of the data you want to know more about. Of course, dplyr has ’filter()’ function to do such filtering, but there is even more. With dplyr you can do the kind of filtering, which could be hard to perform or complicated to construct with tools like SQL and traditional BI tools, in such a simple and more intuitive way.
R comes with a set of logical operators that you can use inside filter():
• <
• <=
• ==
• !=
• !=
• >
Use This Command To Perform The Above Mentioned Function
#filter() : Filter specific rows which matches the logical condition
#######################################
#R comes with a set of logical operators that you can use inside filter():
#x < y, TRUE if x is less than y
#x <= y, TRUE if x is less than or equal to y
#x == y, TRUE if x equals y
#x != y, TRUE if x does not equal y
#x >= y, TRUE if x is greater than or equal to y
#x > y, TRUE if x is greater than y
#x %in% c(a, b, c), TRUE if x is in the vector c(a, b, c)
# All flights that traveled 3000 miles or more
long_flight <- filter(hflights, Distance >= 3000)
View(long_flight)
glimpse(long_flight)
# All flights where taxing took longer than flying
long_journey <- filter(hflights, TaxiIn + TaxiOut > AirTime)
View(long_journey)
# All flights that departed before 5am or arrived after 10pm
All_Day_Journey <- filter(hflights, DepTime < 500 | ArrTime > 2200)
# All flights that departed late but arrived ahead of schedule
Early_Flight <- filter(hflights, DepDelay > 0, ArrDelay < 0)
glimpse(Early_Flight)
# All flights that were cancelled after being delayed
Cancelled_Delay <- filter(hflights, Cancelled == 1, DepDelay > 0)
#How many weekend flights flew a distance of more than 1000 miles but
#had a total taxiing time below 15 minutes?
w <- filter(hflights, DayOfWeek == 6 |DayOfWeek == 7, Distance >1000, TaxiIn + TaxiOut <15)
nrow(w)
y <- filter(hflights, DayOfWeek %in% c(6,7), Distance > 1000, TaxiIn + TaxiOut < 15)
nrow(y)
#######################################
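As a compact illustration of the same ideas, here is a sketch that runs on the built-in mtcars data instead of hflights, so it needs only dplyr. The object names (efficient_fours, small_engines, same_rows, with_gap, kept) are made up for this example.

```r
library(dplyr)

# Conditions separated by commas are combined with AND.
efficient_fours <- filter(mtcars, cyl == 4, mpg > 30)

# %in% is a compact alternative to chaining == with |.
small_engines <- filter(mtcars, cyl %in% c(4, 6))
same_rows <- filter(mtcars, cyl == 4 | cyl == 6)

# filter() drops rows where the condition evaluates to NA.
with_gap <- data.frame(x = c(1, NA, 3))
kept <- filter(with_gap, x > 0)

nrow(efficient_fours)
nrow(small_engines)
nrow(kept)
```

Note that the NA row disappears silently, which is why the hflights examples above test !is.na() explicitly when missing values matter.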
Arrange or re-order rows using arrange()
To arrange (or re-order) rows by a particular column, pass the name of that column to arrange(). You can list several columns; later columns break ties in the earlier ones.
Use This Command To Perform The Above Mentioned Function
#######################################
#arrange(): reorders the rows according to single or multiple variables,
#######################################
dtc <- filter(hflights, Cancelled == 1, !is.na(DepDelay)) #Delay not equal to NA
glimpse(dtc)
# Arrange dtc by departure delays
d <- arrange(dtc, DepDelay)
# Arrange dtc so that cancellation reasons are grouped
c <- arrange(dtc,CancellationCode )
#By default, arrange() arranges the rows from smallest to largest.
#Rows with the smallest value of the variable will appear at the top of the data set.
#You can reverse this behavior with the desc() function.
# Arrange according to carrier and decreasing departure delays
des_Flight <- arrange(hflights, UniqueCarrier, desc(DepDelay))
# Arrange flights by total delay (normal order).
arrange(hflights, ArrDelay + DepDelay)
#######################################
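The sorting rules above can be checked quickly on the built-in mtcars data. This is only a sketch; by_cyl_then_mpg, with_gap and sorted_desc are names invented for the example.

```r
library(dplyr)

# Sort by number of cylinders first, then by mpg from highest to lowest
# within each cylinder group.
by_cyl_then_mpg <- arrange(mtcars, cyl, desc(mpg))
head(by_cyl_then_mpg[, c("cyl", "mpg")])

# Missing values are placed at the end, even when sorting with desc().
with_gap <- data.frame(x = c(2, NA, 1))
sorted_desc <- arrange(with_gap, desc(x))
sorted_desc$x
```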
Create summaries of the data frame using summarise()
The summarise() function will create summary statistics for a given column in the data frame such as finding the mean.
Use This Command To Perform The Above Mentioned Function
#######################################
#summarise(): reduces each group to a single row by calculating aggregate measures.
#######################################
#summarise(), follows the same syntax as mutate(),
#but the resulting dataset consists of a single row instead of an entire new column in the case of mutate()
#min(x) - minimum value of vector x.
#max(x) - maximum value of vector x.
#mean(x) - mean value of vector x.
#median(x) - median value of vector x.
#quantile(x, p) - pth quantile of vector x.
#sd(x) - standard deviation of vector x.
#var(x) - variance of vector x.
#IQR(x) - Inter Quartile Range (IQR) of vector x.
#diff(range(x)) - total range of vector x.
# Print out a summary with variables
# min_dist, the shortest distance flown, and max_dist, the longest distance flown
summarise(hflights, max_dist = max(Distance),min_dist = min(Distance))
# Print out a summary of hflights with max_div: the longest Distance for diverted flights.
# Print out a summary with variable max_div
div <- filter(hflights, Diverted ==1 )
summarise(div, max_div = max(Distance))
summarise(filter(hflights, Diverted == 1), max_div = max(Distance))
###########################################################
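Here is a small sketch of summarise() on the built-in mtcars data, plus a toy data frame showing why na.rm = TRUE often matters for real data such as flight delays. The names car_summary, delays and delay_summary are made up for this example.

```r
library(dplyr)

# Several summary statistics at once; the result is a one-row data frame.
car_summary <- summarise(mtcars,
                         n_cars   = n(),
                         mean_mpg = mean(mpg),
                         max_hp   = max(hp))
car_summary

# Most aggregate functions return NA if any input value is NA,
# unless you ask them to drop missing values with na.rm = TRUE.
delays <- data.frame(delay = c(5, NA, 15))
delay_summary <- summarise(delays,
                           naive   = mean(delay),
                           cleaned = mean(delay, na.rm = TRUE))
delay_summary
```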
Pipe operator: %>%
Before we go any further, let’s introduce the pipe operator: %>%. dplyr imports this operator from another package (magrittr). It passes the output of one function to the input of the next. Instead of nesting functions (and reading from the inside out), piping lets you read the functions from left to right.
Use This Command To Perform The Above Mentioned Function
#######################################
#Chaining function using Pipe Operators
#######################################
hflights %>%
filter(DepDelay>240) %>%
mutate(TaxingTime = TaxiIn + TaxiOut) %>%
arrange(TaxingTime)%>%
select(TailNum )
# Write the 'piped' version of the English sentences.
# Use dplyr functions and the pipe operator to transform the following English sentences into R code:
# Take the hflights data set and then ...
# Add a variable named diff that is the result of subtracting TaxiIn from TaxiOut, and then ...
# Pick all of the rows whose diff value does not equal NA, and then ...
# Summarise the data set with a value named avg that is the mean diff value.
hflights %>%
mutate(diff = TaxiOut - TaxiIn) %>%
filter(!is.na(diff)) %>%
summarise(avg = mean(diff))
# mutate() the hflights dataset and add two variables:
# RealTime: the actual elapsed time plus 100 minutes (for the overhead that flying involves) and
# mph: calculated as Distance / RealTime * 60, then
# filter() to keep observations that have an mph that is not NA and that is below 70, finally
# summarise() the result by creating four summary variables:
# n_less, the number of observations,
# n_dest, the number of destinations,
# min_dist, the minimum distance and
# max_dist, the maximum distance.
# Chain together mutate(), filter() and summarise()
hflights %>%
mutate(RealTime = ActualElapsedTime + 100, mph = Distance / RealTime * 60) %>%
filter(!is.na(mph), mph < 70) %>%
summarise(n_less = n(),
n_dest = n_distinct(Dest),
min_dist = min(Distance),
max_dist = max(Distance))
#######################################
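To see that piping only changes how the code reads, not what it computes, the sketch below writes the same calculation twice on the built-in mtcars data. The column power_to_weight and the objects nested and piped are invented for the example.

```r
library(dplyr)

# Nested version: read from the inside out.
nested <- summarise(
  filter(mutate(mtcars, power_to_weight = hp / wt), cyl == 8),
  avg = mean(power_to_weight)
)

# Piped version: read from top to bottom.
piped <- mtcars %>%
  mutate(power_to_weight = hp / wt) %>%
  filter(cyl == 8) %>%
  summarise(avg = mean(power_to_weight))

# Both give the same one-row result.
all.equal(nested, piped)
```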
Date with R
Dates can be imported from character or numeric formats using the as.Date() function from the base package.
If your data were exported from Excel, the dates may be in numeric format. Otherwise, they will most likely be stored as characters. If your dates are stored as characters, you simply need to give as.Date() your vector of dates and the format they are currently stored in.
There are a number of different formats you can specify, here are a few of them:
%Y: 4-digit year (1982)
%y: 2-digit year (82)
%m: 2-digit month (01)
%d: 2-digit day of the month (13)
%A: weekday (Wednesday)
%a: abbreviated weekday (Wed)
%B: month (January)
%b: abbreviated month (Jan)
Use This Command To Perform The Above Mentioned Function
####################################################################################
####################################################################################
# Lesson 6:
# Topic 3: Date in R
###################################################################################
# Today's date
today <- Sys.Date()
today
class(today)
#Creating date from character
character_date <- "1957-03-04"
class(character_date)
# Convert into date class by as.Date function
sp500_birthday <- as.Date(character_date)
sp500_birthday
class(sp500_birthday)
# Date format
# Default: ISO 8601 standard, year-month-day
as.Date("2017-01-28")
# Alternative form: year/month/day
as.Date("2017/01/28")
# Fails with an error: month/day/year is not one of the formats as.Date() tries by default
as.Date("01/28/2017")
# Explicitly tell R the format
as.Date("01/28/2017", format = "%m/%d/%Y")
#Date format
# %d - Day of the month (01-31)
# %m - Month (01-12)
# %y - Year without century (00-99)
# %Y - Year with century (0-9999)
# %b - Abbreviated month name
# %B - Full month name
# "/" "-" "," Common separators
# Example: September 15, 2008
as.Date("September 15, 2008", format = "%B %d, %Y")
# Extract the Weekdays
dates <- as.Date(c("2017-01-02", "2017-05-03", "2017-08-04", "2017-10-17"))
dates
weekdays(dates)
# Extract the months
months(dates)
# Extract the quarters
quarters(dates)
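The text above mentions dates that arrive from Excel as numbers. A minimal sketch of that case follows, assuming the serial number came from Excel on Windows, where "1899-12-30" is the origin commonly used for the conversion. The value 42763 and the variable names are made up for illustration.

```r
# Excel (Windows) stores dates as day counts; convert with an explicit origin.
excel_serial <- 42763
from_excel <- as.Date(excel_serial, origin = "1899-12-30")
from_excel

# format() turns a Date back into text using the same % codes.
as_text <- format(from_excel, "%d/%m/%Y")
as_text

# Subtracting two dates gives the number of days between them.
days_between <- as.numeric(as.Date("2017-01-28") - as.Date("2017-01-01"))
days_between
```

If the file came from a different spreadsheet program or an old Mac version of Excel, the origin may differ, so check one known date before converting a whole column.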
Data Visualization
Basic Visualization
Scatter Plot
Line Chart
Bar Plot
Pie Chart
Histogram
Density plot
Box Plot
Advanced Visualization
Mosaic Plot
Heat Map
3D charts
Correlation Plot
Word Cloud
Scatter Plot
Scatterplots use a collection of points placed using Cartesian Coordinates to display values from two variables. By displaying a variable in each axis, you can detect if a relationship or correlation between the two variables exists.
Use This Command To Perform Above Mentioned Function:
######################################################################
# Lesson 7
# Topic 1: Types of Graphic in R
######################################################################
#########################################################################
#########################################################################
#Following are the basic types of graphs, which can be chosen based on
#the situation and the data available.
# Basic Visualization
# Scatter Plot
# Line Chart
# Bar Plot
# Pie Chart
# Histogram
# Density plot
# Box Plot
# Advanced Visualization
# Mosaic Plot
# Heat Map
# 3D charts
# Correlation Plot
# Word Cloud
#########################################################################
# Basic plot - Scatter Plot
# Example -1
x <- c (1, 2, 3, 4, 5)
y <- c (1, 5, 3, 2, 0)
plot (x, y)
# Example -2
dose <- c(20, 30, 40, 50, 60)
drugA <- c(16, 20, 27, 40, 60)
drugB <- c(40, 31, 25, 18, 12)
plot(dose, drugA)
plot(dose, drugB)
help(plot)
#type argument
#"p" for points,
#"l" for lines,
#"b" for both,
#"c" for the lines part alone of "b",
#"o" for both 'overplotted',
#"h" for 'histogram' like (or 'high-density') vertical lines,
#"s" for stair steps,
#"S" for other steps, see 'Details' below,
#"n" for no plotting.
#Different types of plot
plot(dose, drugA, type="p")
plot(dose, drugA, type="l")
plot(dose, drugA, type="b")
plot(dose, drugA, type="c")
plot(dose, drugA, type="o")
plot(dose, drugA, type="h")
plot(dose, drugA, type="s")
plot(dose, drugA, type="n")
#Example 3
# mtcars ships with R's built-in datasets package, so no extra package is needed
data(mtcars)
str(mtcars)
# https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html
########################################################
#[, 1] mpg Miles/(US) gallon
#[, 2] cyl Number of cylinders
#[, 3] disp Displacement (cu.in.)
#[, 4] hp Gross horsepower
#[, 5] drat Rear axle ratio
#[, 6] wt Weight (1000 lbs)
#[, 7] qsec 1/4 mile time
#[, 8] vs Engine (0 = V-shaped, 1 = straight)
#[, 9] am Transmission (0 = automatic, 1 = manual)
#[,10] gear Number of forward gears
#[,11] carb Number of carburetors
########################################################
summary(mtcars)
plot(mtcars$hp, mtcars$mpg)
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Gas mileage")
plot(mtcars$hp, mtcars$mpg, xlab = "Horsepower", ylab = "Gas mileage", main = "MPG vs Horsepower")
# Compute max_hp
max_hp <- max(mtcars$hp)
# Compute max_mpg
max_mpg <- max(mtcars$mpg)
plot(mtcars$hp, mtcars$mpg,type = "p",
xlim = c(0, max_hp),
ylim = c(0, max_mpg), xlab = "Horsepower",
ylab = "Miles per gallon", main = "Horsepower vs Mileage")
#################################################################################
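A scatter plot suggests a relationship; a correlation coefficient puts a number on it. The sketch below is an optional extra, not part of the original lesson: it computes the Pearson correlation between horsepower and mileage and overlays a straight trend line from lm(). The names hp_mpg_cor and trend are invented here.

```r
# Pearson correlation between horsepower and miles per gallon
hp_mpg_cor <- cor(mtcars$hp, mtcars$mpg)
hp_mpg_cor

# Scatter plot with the correlation shown in the title
plot(mtcars$hp, mtcars$mpg,
     xlab = "Horsepower", ylab = "Miles per gallon",
     main = paste("Correlation:", round(hp_mpg_cor, 2)))

# Add a linear trend line to make the direction of the relationship visible
trend <- lm(mpg ~ hp, data = mtcars)
abline(trend, col = "red")
```

The negative value matches what the plot shows: cars with more horsepower tend to get fewer miles per gallon.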
Data Visualization – mfrow
Create a multi-paneled plotting window. Setting par(mfrow = c(rows, cols)) is handy for creating a simple multi-paneled plot, while layout() should be used for customized panel plots of varying sizes.
Use This Command To Perform Above Mentioned Function:
# Adding details with par function
#########################################################################
# par function
#View current setting
par()
# Assign the return value from the par() function to plot_pars
plot_pars <- par()
# Display the names of the par() function's list elements
names(plot_pars)
# Display the number of par() function list elements
length(plot_pars)
#########################################################################
#mfrow =c(row,col)
# Creating plot array with mfrow parameter
# Set up a two-by-two plot array
par(mfrow = c(2, 2))
# Plot y1 vs. x1
plot(anscombe$x1, anscombe$y1)
# Plot y2 vs. x2
plot(anscombe$x2, anscombe$y2)
# Plot y3 vs. x3
plot(anscombe$x3, anscombe$y3)
# Plot y4 vs. x4
plot(anscombe$x4, anscombe$y4)
# Define common x and y limits for the four plots
xmin <- min(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4)
xmax <- max(anscombe$x1, anscombe$x2, anscombe$x3, anscombe$x4)
ymin <- min(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
ymax <- max(anscombe$y1, anscombe$y2, anscombe$y3, anscombe$y4)
# Set up a two-by-two plot array
par(mfrow = c(2, 2))
# Plot y1 vs. x1 with common x and y limits, labels & title
plot(anscombe$x1, anscombe$y1,
xlim = c(xmin, xmax),
ylim = c(ymin, ymax),
xlab = "x value", ylab = "y value",
main = "First dataset")
# Do the same for the y2 vs. x2 plot
plot(anscombe$x2, anscombe$y2,
xlim = c(xmin, xmax),
ylim = c(ymin, ymax),
xlab = "x value", ylab = "y value",
main = "Second dataset")
# Do the same for the y3 vs. x3 plot
plot(anscombe$x3, anscombe$y3,
xlim = c(xmin, xmax),
ylim = c(ymin, ymax),
xlab = "x value", ylab = "y value",
main = "Third dataset")
# Do the same for the y4 vs. x4 plot
plot(anscombe$x4, anscombe$y4,
xlim = c(xmin, xmax),
ylim = c(ymin, ymax),
xlab = "x value", ylab = "y value",
main = "Fourth dataset")
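The commands above never reset the plot array, and layout() is mentioned but not shown. The sketch below covers both, assuming a fresh graphics device; old_par, panel_map and n_panels are names made up for the example.

```r
# par() returns the previous settings, so they can be restored afterwards.
old_par <- par(mfrow = c(1, 2))
plot(anscombe$x1, anscombe$y1, main = "First dataset")
plot(anscombe$x2, anscombe$y2, main = "Second dataset")
par(old_par)  # back to the earlier layout

# layout() takes a matrix that says which plot goes in which cell.
# Here plot 1 spans the whole top row; plots 2 and 3 share the bottom row.
panel_map <- matrix(c(1, 1,
                      2, 3), nrow = 2, byrow = TRUE)
n_panels <- layout(panel_map)
plot(anscombe$x1, anscombe$y1, main = "Wide top panel")
plot(anscombe$x2, anscombe$y2, main = "Bottom left")
plot(anscombe$x3, anscombe$y3, main = "Bottom right")

# Return to a single panel for later plots.
layout(1)
```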
Data Visualization - pch
Different plotting symbols are available in R. The graphical argument used to specify point shapes is pch.
Use This Command To Perform Above Mentioned Function:
#######################################################################
# mtcars ships with R's built-in datasets package, so MASS is not needed
data("mtcars")
# pch
# Create plot with type = "n"
plot(mtcars$hp, mtcars$mpg,
type = "n", xlim = c(0, max_hp),
ylim = c(0, max_mpg), xlab = "Horsepower",
ylab = "Miles per gallon")
# Add solid squares to plot
points(mtcars$hp, mtcars$mpg,pch = 15)
# Add open circles to plot
points(mtcars$hp, mtcars$mpg, pch = 1)
# Add open triangles to plot
points(mtcars$hp, mtcars$mpg,pch = 2)
# Create an empty plot using type = "n"
plot(mtcars$hp, mtcars$mpg,
type = "n", xlim = c(0, max_hp),
ylim = c(0, max_mpg), xlab = "Horsepower",
ylab = "Miles per gallon")
# Add points with shapes determined by cylinder number
points(mtcars$hp, mtcars$mpg, pch = mtcars$cyl)
# Create a second empty plot
plot(mtcars$hp, mtcars$mpg, type = "n",
xlab = "Horsepower", ylab = "Gas mileage")
# Add points with shapes as cylinder characters
points(mtcars$hp, mtcars$mpg,
pch = as.character(mtcars$cyl))
# Adjusting text position, size, and font
# Create a second empty plot
plot(mtcars$hp, mtcars$mpg, type = "n",
xlab = "Horsepower", ylab = "Gas mileage")
# Create index6, pointing to 6-cylinder cars
index6 <- which(mtcars$cyl == 6)
# Highlight 6-cylinder cars as solid circles
points(mtcars$hp[index6],
mtcars$mpg[index6],
pch = 19)
# Add car names, offset from points, with larger bold italic text (font = 4)
text(mtcars$hp[index6],
mtcars$mpg[index6],
labels = rownames(mtcars)[index6],
adj = -0.2, cex = 1.2, font = 4)
#################################################################
Data Visualization – Color
Data visualization (visualisation), or the visual communication of data, is the study or creation of data represented visually. A good graph is easy to read, and a goal when creating data visualizations is to convey information in a clear and concise way. One of the most prominent features of most data visualizations is color. Color is important because it lets you set the mood, guide the viewer’s eye, and draw attention to something, and therefore tell a story. Both aspects are important for data visualizations.
In data visualization with R
There are 657 built-in color names (see colors())
R uses hexadecimal strings such as "#FF0000" to represent colors
You can create vectors of n colors using rainbow(n), heat.colors(n), terrain.colors(n), topo.colors(n) and cm.colors(n).
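The color facts above can be checked directly in the console. This is a minimal base R sketch; the variable names are made up for the example.

```r
# Number of built-in color names
n_names <- length(colors())
n_names
head(colors())

# Build a hexadecimal color string from red, green and blue components
red_hex <- rgb(255, 0, 0, maxColorValue = 255)
red_hex

# Go the other way: from a color name to its red/green/blue components
red_rgb <- col2rgb("red")
red_rgb

# A palette function returns a character vector of n colors
palette_5 <- heat.colors(5)
barplot(rep(1, 5), col = palette_5, main = "heat.colors(5)")
```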
Data Visualization -Line Chart
Line charts display information as a series of data points connected by straight line segments on an X-Y axis. They are best used to track changes over time, using equal intervals of time between each data point.
CHARACTERISTICS
INCLUDE A ZERO BASELINE IF POSSIBLE
DON’T PLOT MORE THAN 4 LINES
USE SOLID LINES ONLY
USE THE RIGHT HEIGHT
LABEL THE LINES DIRECTLY
When to use a line chart
Line graphs are useful in that they show data variables and trends very clearly.
They help to make predictions about data not yet recorded. If seeing the trend of your data is the goal, then this is the chart to use.
Line charts show time-series relationships using continuous data.
They allow a quick assessment of acceleration (lines curving upward), deceleration (lines curving downward), and volatility (up/down frequency).
They are excellent for tracking multiple data sets on the same chart to see any correlation in trends.
They can also be used to display several dependent variables against one independent variable.
Line charts are great visualizations to see how a metric changes over time. For example, the exchange rate for GBP to USD.
Use these commands to draw a line chart:
#################################################################################
# Line Chart
plot(AirPassengers,type="l") #Simple Line Plot
#Example 2
# Create the data for the chart.
v <- c(7,12,28,3,41)
# Plot the line chart.
plot(v,type = "o")
# Plot the line chart again with color, axis labels and a title.
plot(v,type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
main = "Rain fall chart")
#Multiple Lines
# More than one line can be drawn on the same chart by using the lines() function
# Create the data for the chart.
t <- c(14,7,6,19,3)
lines(t, type = "o", col = "blue")
#################################################################################
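When several lines share one chart, the axis range is fixed by the first plot() call, so a later series can be cut off. A small sketch of one way around this follows: compute a common y range first and add a legend. The labels "City A" and "City B" and the variable names are invented for illustration.

```r
rain_city_a <- c(7, 12, 28, 3, 41)
rain_city_b <- c(14, 7, 6, 19, 3)

# Use one y range that covers both series
y_limits <- range(c(rain_city_a, rain_city_b))

plot(rain_city_a, type = "o", col = "red", ylim = y_limits,
     xlab = "Month", ylab = "Rain fall", main = "Rain fall chart")
lines(rain_city_b, type = "o", col = "blue")

# Label the lines so the reader does not have to guess
legend("topleft", legend = c("City A", "City B"),
       col = c("red", "blue"), lty = 1, pch = 1)
```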
Data Visualization - Histogram
A Histogram visualizes the distribution of data over a continuous interval or certain time period. Each bar in a histogram represents the tabulated frequency at each interval/bin.
Histograms help give an estimate as to where values are concentrated, what the extremes are and whether there are any gaps or unusual values.
They are also useful for giving a rough view of the probability distribution.
Histogram is a common variation of charts used to present distribution and relationships of a single variable over a set of categories.
Use these commands to draw a histogram:
###############################################################################
###############################################################################
#Histogram
#Simple histogram
hist(mtcars$mpg)
#Colored histogram
?hist
# breaks sets the number of bins; a single number is only a suggestion, and R picks pretty breakpoints near it.
hist(mtcars$mpg, breaks = 4, col = "lightblue", xlab = "mpg", ylab = "freq")
hist(mtcars$mpg, breaks = 15, col=rainbow(7), xlab = "mpg", ylab = "freq")
#Change of bin
hist(AirPassengers, col=rainbow(7))
#Histogram of the AirPassengers dataset with 5 breakpoints
hist(AirPassengers, breaks=5)
# If you want to have more control over the breakpoints between bins,
# you can enrich the breaks argument by giving it a vector of breakpoints.
# You can do this by using the c() function:
# Compute a histogram for the data values in AirPassengers,
# and set the bins such that they run from 100 to 300, 300 to 500 and 500 to 700.
hist(AirPassengers, breaks= c(100, 300, 500, 700))
# We can use the seq(x, y, z) function instead of c()
# x = begin number of the x-axis,
# y = end number of the x-axis
# z = the interval in which these numbers appear.
hist(AirPassengers, breaks= seq(100, 700, 100))
# Note that you can also combine the two functions:
# Make a histogram for the AirPassengers dataset, start at 100 on the x-axis,
# and from 200 onwards make the bins 150 wide (breaks at 100, 200, 350, 500, 650)
hist(AirPassengers, breaks = c(100, seq(200, 700, 150)))
###############################################################################
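To see exactly which bins hist() used, you can ask it to return the numbers without drawing anything. A short sketch, where h is just a name chosen for the example:

```r
# plot = FALSE returns the histogram object instead of drawing it
h <- hist(AirPassengers, breaks = seq(100, 700, 100), plot = FALSE)

h$breaks   # the bin boundaries
h$counts   # how many observations fall in each bin
h$mids     # the midpoint of each bin

# Every observation lands in exactly one bin
sum(h$counts) == length(AirPassengers)
```

This is handy for checking that custom breakpoints cover the whole range of the data, since hist() stops with an error if some values fall outside them.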
Data Visualization - Box Plot
A Box Plot is a convenient way of visually displaying the data distribution through their quartiles.
Box Plots can be drawn either vertically or horizontally.
Although Box Plots may seem primitive in comparison to a Histogram or Density Plot, they have the advantage of taking up less space, which is useful when comparing distributions between many groups or data sets.
The types of observations from viewing a Box Plot:
· What the key values are, such as the median and the 25th and 75th percentiles.
· If there are any outliers and what their values are.
· Is the data symmetrical.
· How tightly is the data grouped?
· If the data is skewed and if so, in what direction.
Two of the most commonly used variation of Box Plot are:
Variable-width Box Plots
Notched Box Plots.
Use these commands to draw a box plot:
###############################################################################
# Boxplot
vec <- c(3,2,5,6,4,8,1,2,3,2,4,30,36)
?boxplot
boxplot(vec)
boxplot(vec, varwidth = TRUE)
# Boxplot of MPG by Car Cylinders
# a formula, such as y ~ grp, where y is a numeric vector of data values
# to be split into groups according to the grouping variable grp (usually a factor).
boxplot(mpg~cyl, data = mtcars)
boxplot(mpg~cyl,data=mtcars, main="Car Mileage Data",
xlab="Number of Cylinders", ylab="Miles Per Gallon",col=(c("gold","darkgreen","Blue")))
###############################################################################
#########################################################################
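The numbers behind a box plot can be read off with boxplot.stats(), and the two variations named above can be combined in one call. This is a sketch using the same vec as above; box_summary is a name made up for the example.

```r
vec <- c(3, 2, 5, 6, 4, 8, 1, 2, 3, 2, 4, 30, 36)

# The five numbers the box is drawn from:
# lower whisker, lower hinge, median, upper hinge, upper whisker
box_summary <- boxplot.stats(vec)
box_summary$stats

# Points beyond 1.5 times the box height are reported as outliers
box_summary$out

# Notched, variable-width box plot: box width reflects group size and
# the notch gives a rough interval around each median.
# R may warn that some notches extend past the hinges for small groups.
boxplot(mpg ~ cyl, data = mtcars, notch = TRUE, varwidth = TRUE,
        xlab = "Number of Cylinders", ylab = "Miles Per Gallon")
```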
Data Visualization - Mosaic Plot
Mosaic plots were introduced by Hartigan and Kleiner in 1981 and expanded on by Friendly in 1994. Mosaic plots are also called Mekko charts due to their resemblance to a Marimekko print.
A mosaic plot summarizes the conditional probabilities of co-occurrence of the categorical values in a list of records of the same length. The list of records is assumed to be a full array, with the columns representing categorical values.
A mosaic plot is a graphical method for visualizing data from two or more qualitative variables.
It is the multidimensional extension of spine plots, which graphically display the same information for only one variable.
It gives an overview of the data and makes it possible to recognize relationships between different variables. For example, independence is shown when the boxes across categories all have the same areas.
Data Visualization - Heat Map
Heatmaps visualize data through variations in coloring.
Heatmaps are useful for cross-examining multivariate data, through placing variables in the rows and columns and coloring the cells within the table.
Heatmaps are good for showing variance across multiple variables, revealing any patterns, displaying whether any variables are similar to each other, and for detecting if any correlations exist in-between them.
Heatmaps can also be used to show the changes in data over time if one of the rows or columns are set to time intervals.
Heatmaps are a chart better suited to displaying a more generalized view of numerical data
Use these commands to draw a mosaic plot and a heat map:
###############################################################################
#########################################################################
# Mosaic Plot
data(HairEyeColor)
mosaicplot(HairEyeColor)
?mosaicplot
###############################################################################
# Heatmap
# A heat map uses a color gradient to make comparisons.
# Use it when you want to compare different categories across two dimensions.
# mtcars ships with R's built-in datasets package, so MASS is not needed
mtcars
heatmap(as.matrix(mtcars))
?heatmap
heatmap(as.matrix(mtcars), Rowv = NA, Colv = NA, scale = "column", col = cm.colors(256),
xlab = "Attributes", main = "heatmap")
#########################################################################
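The full HairEyeColor table has three dimensions, which makes the first mosaic plot busy. A sketch of two simpler views follows: a two-way mosaic plot of hair against eye color, and a heat map of the correlation matrix of mtcars, which is one way to use a heat map for spotting correlated variables. The names hair_eye and car_cor are invented here.

```r
# Collapse the Sex dimension to get a two-way Hair x Eye table
hair_eye <- margin.table(HairEyeColor, c(1, 2))
hair_eye

# Tile area is proportional to the count in each Hair/Eye combination
mosaicplot(hair_eye, main = "Hair and eye color", color = TRUE)

# Heat map of pairwise correlations between the mtcars variables
car_cor <- cor(mtcars)
heatmap(car_cor, symm = TRUE, main = "Correlation heat map")
```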
Data Visualization - 3D Plot
A 3D plot is used when a chart needs to show three variables at once, which a 2D plot cannot do directly.
We use the lattice package, which provides high-level plotting functions such as cloud().
Simply install (if needed) and load the lattice package
Use the cloud function
Plotly is a platform for data analysis, graphing, and collaboration, and it can also make 3D plots. In this post we will show how to make 3D plots with Plotly's R API.
Use these commands to draw 3D plots:
#########################################################################
#3D graph with lattice package
library(lattice)
attach(mtcars)
# Change am column to factor as "Automatic" and "Manual"
mtcars$am[which(mtcars$am == 0)] <- 'Automatic'
mtcars$am[which(mtcars$am == 1)] <- 'Manual'
mtcars$am <- as.factor(mtcars$am)
#3d scatterplot by factor level
cloud(hp~mpg*wt, data = mtcars)
cloud(hp~mpg*wt, data = mtcars, main = "3D Scatterplot")
cloud(hp~mpg*wt, data = mtcars, main = "3D Scatterplot", col = cyl)
cloud(hp~mpg*wt, data = mtcars, main = "3D Scatterplot", col = cyl, pch = 17)
cloud(hp~mpg*wt|am, data = mtcars, main = "3D Scatterplot", col = cyl, pch = 17)
?cloud
##############################################################
# 3D graph with plotly package
install.packages("plotly")
library(plotly)
data(mtcars)
# Basic 3D Scatter Plot
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec)
# Basic 3D Scatter Plot with Color
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec, color = ~am, colors = c('#BF382A', '#0C4B8E')) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'horsepower'),
zaxis = list(title = 'qsec')))
#3D Scatter Plot with color scaling
plot_ly(mtcars, x = ~wt, y = ~hp, z = ~qsec,
marker = list(color = ~mpg, colorscale = c('#FFE1A1', '#683531'), showscale = TRUE)) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Weight'),
yaxis = list(title = 'horsepower'),
zaxis = list(title = 'qsec')),
annotations = list(
x = 1.13,
y = 1.05,
text = 'Miles/(US) gallon',
xref = 'paper',
yref = 'paper',
showarrow = FALSE
))
# Load the `plotly` library
library(plotly)
# Your volcano data
str(volcano)
volcano
# The 3d surface map
plot_ly(z = ~volcano, type = "surface")
#########################################################################
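The lattice commands above rely on attach() and overwrite the am column in place. As an alternative sketch, the version below works on a copy and uses the groups argument so that lattice picks the colors and builds the legend. The names cars_3d and p are made up for this example.

```r
library(lattice)

# Work on a copy so the original mtcars stays untouched
cars_3d <- transform(mtcars,
                     am = factor(am, levels = c(0, 1),
                                 labels = c("Automatic", "Manual")))

# groups is evaluated inside the data, so attach() is not needed;
# auto.key adds a legend for the cylinder groups.
p <- cloud(hp ~ mpg * wt | am, data = cars_3d,
           groups = factor(cyl), pch = 17,
           auto.key = list(title = "Cylinders", columns = 3),
           main = "3D Scatterplot by transmission")
print(p)
```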
Data Visualization - Word Cloud
A visualization method that displays how frequently words appear in a given body of text, by making the size of each word proportional to its frequency.
All the words are arranged in a cluster or cloud of words.
The words can also be arranged in any format: horizontal lines, columns or within a shape.
Word Clouds can also be used to display words that have meta-data assigned to them. For example, in a Word Cloud of all the world's country names, the population could be assigned to each name to determine its size.
Color used on Word Clouds is usually meaningless and is primarily aesthetic, but it can be used to categorize words or to display another data variable.
Typically, Word Clouds are used on websites or blogs to depict keyword or tag usage. Word Clouds can also be used to compare two different bodies of text together.
Although simple and easy to understand, Word Clouds have major flaws:
1. Long words are emphasized over short words.
2. Words whose letters contain many ascenders and descenders may receive more attention.
3. They're not great for analytical accuracy, so used more for aesthetic reasons instead.
Use these commands to draw a word cloud:
########################################################################
# WordCloud
#Install the packages
install.packages("wordcloud")
install.packages("RColorBrewer")
#Load the packages
library("wordcloud")
library("RColorBrewer")
# Create model_table of car model name frequencies (each model name appears once in mtcars)
rownames(mtcars)
model_table <- table(rownames(mtcars))
model_table
# Create the default wordcloud from this table
#scale - range of the size of the word
#freq - frequency of word
wordcloud(words = names(model_table),
freq = as.numeric(model_table),
scale = c(1.5, 0.25))
# Change the minimum word frequency
#min.freq - below min.freq word will not be plotted
wordcloud(words = names(model_table),
freq = as.numeric(model_table),
scale = c(1, 0.25),
min.freq = 1)
##################################################################################
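Because every model name in mtcars occurs exactly once, all words in the cloud above get the same size. One way to get real frequencies is to count manufacturers instead, taking the first word of each row name as the make. That rule is an assumption made for this sketch, as are the names makes and make_table.

```r
# Take everything before the first space as the manufacturer
makes <- sub(" .*", "", rownames(mtcars))

# Count how often each manufacturer appears, most frequent first
make_table <- sort(table(makes), decreasing = TRUE)
head(make_table)

# Draw the cloud only if the wordcloud package is installed
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(make_table),
                       freq = as.numeric(make_table),
                       min.freq = 1,
                       scale = c(3, 0.5))
}
```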
PART 1 Data Visualization - ggplot2
Use these commands to draw plots with ggplot2:
##################################################################################
# Session 7: Data Visualization
# Topic 2: Graphics with ggplot2
install.packages("ggplot2")
library(ggplot2)
# ggplot2 Layer:
###########################################################################
#1. Data Layer
#2. Aesthetic layer: x-axis, y-axis, color, fill, size
#3. Geometric layer: point, line, histogram, barplot, boxplot
#4. Facet layer: column , rows
#5. Statistics layer: binning, smoothing
#6. Coordinates layer: fixed, polar, cartesian
#7. Themes Layer: non data link
###########################################################################
# Scatter plot
ggplot(mtcars, aes(x=wt, y = mpg)) # 2 Layer
ggplot(mtcars, aes(x=wt, y = mpg))+ geom_point() # 3 Layer
# Adding color
ggplot(mtcars, aes(x=wt, y = mpg, col = disp))+geom_point() # 3 Layer
#Adding color based on a factor
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl))) + geom_point()# 3
#Add size
ggplot(mtcars, aes(x = wt, y = mpg, size = disp)) + geom_point()
# Add color and shape (4 aesthetics):
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl), shape = factor(am))) + geom_point()
# Add color shape and size(hp/wt) (5 aesthetics):
ggplot(mtcars, aes(x = wt, y = mpg, col = factor(cyl), shape = factor(am), size = (hp/wt))) + geom_point()
#############################################################
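The scatter plots above use only the data, aesthetic and geometric layers. The sketch below stacks the remaining layers from the list on top of the same plot, one line per layer. The limits and labels are choices made for this example, and p is just a name for the plot object.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +                   # data + aesthetics
  geom_point() +                                              # geometry
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +   # statistics: smoothing
  facet_wrap(~ cyl) +                                         # facets: one panel per cyl
  coord_cartesian(ylim = c(10, 35)) +                         # coordinates
  theme_minimal() +                                           # theme
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon",
       title = "Mileage vs weight by number of cylinders")

print(p)
```

Storing the plot in an object and printing it is optional at the console, but it makes it easy to add or remove one layer at a time while experimenting.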
Introduction To Statistics - Part 1
This course introduces you to sampling and exploring data, as well as basic probability theory and Bayes' rule. You will examine various types of sampling methods, and discuss how such methods can impact the scope of inference. A variety of exploratory data analysis techniques will be covered, including numeric summary statistics and basic data visualization.
R is a language and environment for statistical computing and graphics.
R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
Why Statistics
Data are everywhere
Statistical Techniques are useful in making many crucial decisions that impact our lives
Irrespective of your career, you will make professional decisions that involve data
Knowledge of statistical methods will contribute in making these decisions effectively
Type Of Statistics
Terminologies in Statistics
Four big terms in statistics are population, sample, parameter, and statistic. A population is the entire group of individuals and a sample is a subset of that group. A parameter is a quantitative characteristic of the population, and a statistic is a quantitative characteristic of a sample that often helps estimate or test the population parameter (such as a sample mean or proportion).
Population: This refers to a set of all possible measurements. This is an ideal that can only be approached. Greek letters are used to symbolize population parameters.
In statistics, the term population is used to describe the subjects of a particular study—everything or everyone who is the subject of a statistical observation.
Populations can be large or small in size and defined by any number of characteristics, though these groups are typically defined specifically rather than vaguely—for instance, a population of women over 18 who buy coffee at Starbucks rather than a population of women over 18.
Statistical populations are used to observe behaviors, trends, and patterns in the way individuals in a defined group interact with the world around them, allowing statisticians to draw conclusions about the characteristics of the subjects of study. These subjects are most often humans, animals, and plants, but can also be objects like stars.
Sample: This refers to a set of actual measurements. The distinction between sample and population statistics is most important for a small number of measurements (fewer than 20).
In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure.
The elements of a sample are known as sample points, sampling units or observations. The sample usually represents a subset of manageable size.
Different sampling techniques, such as forming stratified samples, can help in dealing with subpopulations, and many of these techniques assume that a specific type of sample, called a simple random sample, has been selected from the population.
Statistical Investigation
Statistical investigation is part of an information gathering and learning process which is undertaken to seek meaning from and to learn more about observed phenomena as well as to inform decisions and actions. The ultimate goal of statistical investigation is to learn more about a real world situation and to expand the body of contextual knowledge.
Sampling
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Two advantages of sampling are that the cost is lower and data collection is faster than measuring the entire population.
Representative Sample
A representative sample is a small quantity of something that accurately reflects the larger entity. An example is when a small number of people accurately reflect the members of an entire population.
Bias
Statistical bias is a feature of a statistical technique or of its results whereby the expected value of the results differs from the true underlying quantitative parameter being estimated.
Sampling Bias
In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population are less likely to be included than others. It results in a biased sample, a non-random sample of a population (or non-human factors) in which all individuals, or instances, were not equally likely to have been selected. If this is not accounted for, results can be erroneously attributed to the phenomenon under study rather than to the method of sampling.
Types Of Sampling Methods
Simple Random Sampling (SRS)
Stratified Sampling
Cluster Sampling
Systematic Sampling
Simple Random Sampling (SRS):
In a simple random sample (SRS) of a given size, all such subsets of the frame are given an equal probability.
Each element of the frame thus has an equal probability of selection: the frame is not subdivided or partitioned.
Any given pair of elements has the same chance of selection as any other such pair (and similarly for triples, and so on). This minimizes bias and simplifies analysis of results.
In particular, the variance between individual results within the sample is a good indicator of variance in the overall population, which makes it relatively easy to estimate the accuracy of results.
SRS can be vulnerable to sampling error because the randomness of the selection may result in a sample that doesn't reflect the makeup of the population.
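The idea of a simple random sample can be sketched in R with sample(). This is only an illustration under stated assumptions: the built-in mtcars data set plays the role of the sampling frame, and the sample size of 8 and the seed are arbitrary choices.

```r
# Assumption: the 32 car models in the built-in mtcars data are the sampling frame.
set.seed(42)                      # arbitrary seed, only for reproducibility
frame_ids <- rownames(mtcars)     # one identifier per element of the frame
n <- 8                            # desired sample size (arbitrary choice)

# sample() without replacement gives every subset of size n the same chance.
srs_ids <- sample(frame_ids, size = n, replace = FALSE)
srs <- mtcars[srs_ids, ]

# Compare the sample statistic with the population parameter it estimates.
c(sample_mean = mean(srs$mpg), population_mean = mean(mtcars$mpg))
```

Re-running with a different seed gives a different sample mean, which is the sampling error mentioned above.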
Stratified Sampling:
Stratified sampling is possible when it makes sense to partition the population into groups based on a factor that may influence the variable that is being measured. These groups are then called strata. An individual group is called a stratum. With stratified sampling one should:
partition the population into groups (strata)
obtain a simple random sample from each group (stratum)
collect data on each sampling unit that was randomly sampled from each group (stratum)
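The three steps above can be sketched in base R. The choice of mtcars as the population, the number of cylinders as the stratifying factor, and 3 cars per stratum are all assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Step 1: partition the population into strata by number of cylinders (4, 6, 8).
strata <- split(mtcars, mtcars$cyl)

# Step 2: obtain a simple random sample from each stratum.
per_stratum <- 3  # arbitrary choice; every stratum has at least 3 cars
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), size = per_stratum), ]
}))

# Step 3: the collected data now contain the same number of rows per stratum.
table(stratified$cyl)
```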
Cluster Sampling:
Cluster sampling is very different from stratified sampling. With cluster sampling one should:
divide the population into groups (clusters)
obtain a simple random sample of a chosen number of clusters from all possible clusters
obtain data on every sampling unit in each of the randomly selected clusters
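A minimal sketch of these steps in R follows. The population here is entirely hypothetical (200 simulated students in 10 schools), and the number of selected clusters is an arbitrary choice.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Hypothetical population: 200 students grouped into 10 schools (the clusters).
population <- data.frame(
  school = rep(paste0("school_", 1:10), each = 20),
  score  = round(rnorm(200, mean = 70, sd = 10))
)

# Randomly select 3 of the 10 clusters.
chosen <- sample(unique(population$school), size = 3)

# Keep every sampling unit inside the selected clusters.
cluster_sample <- population[population$school %in% chosen, ]

table(cluster_sample$school)
```

Unlike stratified sampling, only some groups appear in the sample, but each selected group is included in full.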
Systematic Sampling:
Systematic sampling (also known as interval sampling) relies on arranging the study population according to some ordering scheme and then selecting elements at regular intervals through that ordered list.
Systematic sampling involves a random start and then proceeds with the selection of every kth element from then onwards.
It is easy to implement, and the stratification induced can make it efficient.
However, systematic sampling is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple or factor of the interval used, the sample is especially likely to be unrepresentative of the overall population, making the scheme less accurate than simple random sampling.
Another drawback of systematic sampling is that even in scenarios where it is more accurate than SRS, its theoretical properties make it difficult to quantify that accuracy.
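The random start and the selection of every kth element can be sketched as follows. The ordered list is assumed to be the rows of mtcars in their stored order, and the interval k = 4 is an arbitrary choice.

```r
set.seed(3)  # arbitrary seed, only for reproducibility

N <- nrow(mtcars)        # size of the ordered list (32 rows)
k <- 4                   # sampling interval (arbitrary choice)
start <- sample(k, 1)    # random start between 1 and k

# Select every kth element from the random start onwards.
idx <- seq(from = start, to = N, by = k)
systematic <- mtcars[idx, ]

idx
```

If the list had a pattern that repeats every 4 rows, this sample would pick the same position in the pattern each time, which is the periodicity problem described above.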
Introduction To Statistic - Part 2
Quantitative and Qualitative data
Quantitative and qualitative data provide different outcomes, and are often used together to get a full picture of a population. For example, if data are collected on annual income (quantitative), occupation data (qualitative) could also be gathered to get more detail on the average annual income for each type of occupation.
Quantitative and qualitative data can be gathered from the same data unit depending on whether the variable of interest is numerical or categorical.
Quantitative data
Information that can be handled numerically. Quantitative variables are numerical variables: counts, percents, or numbers.
Quantitative data always are associated with a scale measure.
Probably the most common scale type is the ratio scale. Observations of this type are on a scale that has a meaningful zero value and equal distances between adjacent values.
Statistics that describe or summarise can be produced for quantitative data and to a lesser extent for qualitative data.
As quantitative data are always numeric they can be ordered, added together, and the frequency of an observation can be counted. Therefore, all descriptive statistics can be calculated using quantitative data.
By making inferences about quantitative data from a sample, estimates or projections for the total population can be produced.
Quantitative data can be used to inform broader understandings of a population.
Qualitative data
Information that refers to the quality of something. Ethnographic research, participant observation, open-ended interviews, etc., may collect qualitative data. However, often there is some element of the results obtained via qualitative research that can be handled numerically, e.g., how many observations, number of interviews conducted, etc.
Qualitative data are measures of 'types' and may be represented by a name, symbol, or a number code.
Qualitative data are data about categorical variables (e.g. what type).
Statistics that describe or summarise can be produced for quantitative data and to a lesser extent for qualitative data.
As qualitative data represent individual (mutually exclusive) categories, the descriptive statistics that can be calculated are limited, as many of these techniques require numeric values which can be logically ordered from lowest to highest and which express a count.
Qualitative data are less suited to inferential statistics, as most techniques are based on numeric values; inference on qualitative data typically works with counts or proportions of categories.
Qualitative data can be used to consider how a population may change or progress into the future.
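The difference between the two kinds of data can be illustrated in R. In this sketch, mpg from the built-in mtcars data is the quantitative variable, and the transmission code am is treated as a qualitative variable; the labels assigned to it are added for readability.

```r
# mpg is quantitative; the transmission code (am) is treated as qualitative.
transmission <- factor(mtcars$am, levels = c(0, 1),
                       labels = c("automatic", "manual"))

# Quantitative data support the full range of descriptive statistics.
mean(mtcars$mpg)
sd(mtcars$mpg)

# Qualitative data are summarised by counts, proportions, and the mode.
counts <- table(transmission)
prop.table(counts)
names(counts)[which.max(counts)]   # most frequent category

# Using both together: average mpg for each transmission type.
tapply(mtcars$mpg, transmission, mean)
```

The last line mirrors the income-by-occupation example: a quantitative variable summarised within each category of a qualitative one.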
Descriptive Statistics
Descriptive statistics give information that describes the data in some manner. For example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold, and 40 out of the 100 were dogs, then one description of the data on the pets sold would be that 40% were dogs.
This same pet shop may conduct a study on the number of fish sold each day for one month and determine that an average of 10 fish were sold each day. The average is an example of descriptive statistics.
Some other measurements in descriptive statistics answer questions such as 'How widely dispersed is this data?', 'Are there a lot of different values?' or 'Are many of the values the same?', 'What value is in the middle of this data?', 'Where does a particular data value stand with respect to the other values in the data set?'
A graphical representation of data is another method of descriptive statistics. Examples of this visual representation are histograms, bar graphs and pie graphs, to name a few. Using these methods, the data is described by compiling it into a graph, table or other visual representation.
This provides a quick method to make comparisons between different data sets and to spot the smallest and largest values and trends or changes over a period of time. If the pet shop owner wanted to know what type of pet was purchased most in the summer, a graph might be a good medium to compare the number of each type of pet sold and the months of the year.
Inferential statistics
Inferential statistics is one of the two main branches of statistics. Inferential statistics use a random sample of data taken from a population to describe and make inferences about the population.
For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.
Properties of samples, such as the mean or standard deviation, are not called parameters, but statistics. Inferential statistics are techniques that allow us to use these samples to make generalizations about the populations from which the samples were drawn.
It is, therefore, important that the sample accurately represents the population. The process of achieving this is called sampling. Inferential statistics arise out of the fact that sampling naturally incurs sampling error and thus a sample is not expected to perfectly represent the population.
The methods of inferential statistics are (1) the estimation of parameter(s) and (2) testing of statistical hypotheses.
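Both methods can be sketched with t.test() from base R. The sketch assumes that the 32 cars in mtcars can be treated as a random sample from a larger population of car models, which is an assumption made only for illustration.

```r
# Assumption: treat the 32 cars in mtcars as a random sample of car models.

# (1) Estimation: point estimate and 95% confidence interval for the mean mpg.
est <- t.test(mtcars$mpg)
est$estimate    # sample mean, the point estimate of the population mean
est$conf.int    # interval estimate of the population mean

# (2) Hypothesis testing: does mean mpg differ between transmission types?
test <- t.test(mpg ~ am, data = mtcars)
test$p.value    # a small p-value suggests the difference is unlikely to be chance
```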
The following R commands compute the descriptive statistics discussed above:
# Lesson 7: Introduction To Statistics
# Topic: Basic Statistics
# Preview the first rows of the built-in mtcars data set
head(mtcars)
# Get mean column-wise (every column of mtcars is numeric)
apply(mtcars, 2, mean)
# Get median column-wise
apply(mtcars, 2, median)
# Get standard deviation column-wise
apply(mtcars, 2, sd)
# Get variance column-wise
apply(mtcars, 2, var)
# Frequency table of the number of cylinders
table(mtcars$cyl)
# Relative frequency
table(mtcars$cyl) / sum(table(mtcars$cyl))
# Quantiles (5th, 10th, 50th, 90th and 95th percentiles)
quantile(mtcars$mpg, probs = c(0.05, 0.1, 0.5, 0.9, 0.95))
# Range (minimum and maximum)
range(mtcars$mpg)
# Default summary
summary(mtcars)
# Using the Hmisc package (install once, then load)
install.packages("Hmisc")
library(Hmisc)
describe(mtcars)
# Compute mean excluding missing values
sapply(mtcars, mean, na.rm = TRUE)
##########################################################################
# Lesson 7: Introduction To Statistics
# Topic: Normal Distribution
##########################################################################
# Example 1: Vehicle Speed
# The average speed of vehicles traveling on a stretch of highway is 67 miles per hour
# with a standard deviation of 3.5 miles per hour. A vehicle is selected at random.
# a. What is the probability that it is violating the 70 mile per hour speed limit?
# Assume that the speeds are normally distributed.
# Solution:
# The random variable X is speed. We are told that X has a normal distribution.
mu <- 67 # The mean
sigma <- 3.5 # The standard deviation
# We are looking for the probability of the event that X > 70.
# Step 1: Convert 70 into a z-score:
z <- (70 - mu) / sigma
z # 0.86 (rounded)
# Step 2: Find the appropriate area between the normal curve and the axis using the table:
# The table contains cumulative areas (to the left of the z-value).
# The area corresponding to a z-score of 0.86 in the table is 0.8051.
# Since we are interested in X > 70, we need the area to the right of the z-score, thus P(X > 70) is approximately
1 - 0.8051 # 0.1949
# Solution 2 - using pnorm function in R
# (close to the table answer; the small difference comes from rounding z to 0.86)
pnorm(70, mean = mu, sd = sigma, lower.tail = FALSE)
########################################################################
Introduction To Statistic - Part 3
In probability and statistics, a random variable, random quantity, aleatory variable, or stochastic variable is a variable whose possible values are outcomes of a random phenomenon. As a function, a random variable is required to be measurable, which rules out certain cases where the quantity which the random variable returns is infinitely sensitive to small changes in the outcome.
It is common that these outcomes depend on some physical variables that are not well understood. For example, when tossing a fair coin, the final outcome of heads or tails depends on the uncertain physics. Which outcome will be observed is not certain. The coin could get caught in a crack in the floor, but such a possibility is excluded from consideration. The domain of a random variable is the set of possible outcomes. In the case of the coin, there are only two possible outcomes, namely heads or tails. Since one of these outcomes must occur, either the event that the coin lands heads or the event that the coin lands tails must have non-zero probability.
A random variable is defined as a function that maps the outcomes of unpredictable processes to numerical quantities (labels), typically real numbers. In this sense, it is a procedure for assigning a numerical quantity to each physical outcome. Contrary to its name, this procedure itself is neither random nor variable. Rather, the underlying process providing the input to this procedure yields random (possibly non-numerical) output that the procedure maps to a real-numbered value.
Discrete random variable
In an experiment a person may be chosen at random, and one random variable may be the person's height. Mathematically, the random variable is interpreted as a function which maps the person to the person's height. Associated with the random variable is a probability distribution that allows the computation of the probability that the height is in any subset of possible values, such as the probability that the height is between 180 and 190 cm, or the probability that the height is either less than 150 or more than 200 cm.
Continuous random variable
A variable is a quantity that has a changing value; the value can vary from one example to the next. A continuous variable is a variable that has an infinite number of possible values. In other words, any value within a range is possible for the variable.
An example of a continuous random variable would be one based on a spinner that can choose a horizontal direction. Then the values taken by the random variable are directions. We could represent these directions by North, West, East, South, Southeast, etc.
Some examples of continuous variables:
Time
A person’s weight
Income
Age
The price of gas
Discrete Variable
A variable is a quantity that has changing values. A discrete variable is a variable that can only take on a countable number of distinct values. In other words, its values can be listed one by one rather than filling a continuous range. If you can count a set of items, then it’s a discrete variable. The opposite of a discrete variable is a continuous variable. Continuous variables can take on an infinite number of possibilities.
Some examples of discrete variables:
Number of quarters in a purse, jar, or bank
The number of cars in a parking lot
Points on a 10-point rating scale
Ages on birthday cards
Aspects of Random Variable
A distribution
A mean
A standard deviation
Distribution
Suppose a variable X can take the values 1, 2, 3, or 4. ... The distribution function of X gives the probability that the random variable X is less than or equal to x, for every value x. For a discrete random variable, the distribution function is found by summing up the probabilities.
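Summing up the probabilities can be shown in a few lines of R. The values 1 to 4 come from the text above; the probabilities are hypothetical and chosen only so that they add up to 1.

```r
x <- 1:4
p <- c(0.1, 0.2, 0.3, 0.4)   # hypothetical probabilities P(X = x)

# The distribution function P(X <= x) is the running sum of the probabilities.
cdf <- cumsum(p)
names(cdf) <- x
cdf

# For example, P(X <= 2) = P(X = 1) + P(X = 2)
cdf[["2"]]
```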
Mean
The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome x_i according to its probability, p_i.
Standard deviation
The standard deviation of a discrete random variable is a measure of spread for its distribution that describes the degree to which the values differ from the expected value. The standard deviation of random variable X is often written as σ or σX.
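The mean and standard deviation of a discrete random variable can be computed directly from their definitions. The values and probabilities below are hypothetical and serve only as a small worked example.

```r
x <- 1:4
p <- c(0.1, 0.2, 0.3, 0.4)   # hypothetical probabilities, must sum to 1
stopifnot(isTRUE(all.equal(sum(p), 1)))

# Mean: each outcome weighted by its probability.
mu <- sum(x * p)                      # 0.1 + 0.4 + 0.9 + 1.6 = 3

# Standard deviation: square root of the probability-weighted squared deviations.
sigma <- sqrt(sum((x - mu)^2 * p))    # sqrt(0.4 + 0.2 + 0 + 0.4) = 1

c(mean = mu, sd = sigma)
```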
Introduction To Statistic Part 4
A parameter is a characteristic of a population. A statistic is a characteristic of a sample. Inferential statistics enables you to make an educated guess about a population parameter based on a statistic computed from a sample randomly drawn from that population.
A population parameter is a value (such as the mean, median, or variance) calculated using all the data of the entire population, whereas a sample statistic is the same kind of value calculated on a sample of the data.
Sample statistic is used to estimate the population parameter as collecting the data for entire population is tedious and sometimes impractical or even not worth the effort.
Now, the difference between Population parameter and Sample statistic clearly depends on how good your sample is, how representative your sample is and how closely it represents the overall population.
Standard Deviation
The standard deviation is a measure of the spread of scores within a set of data. Usually, we are interested in the standard deviation of a population. However, as we are often presented with data from a sample only, we can estimate the population standard deviation from a sample standard deviation. These two standard deviations - sample and population standard deviations - are calculated differently. In statistics, we are usually presented with having to calculate sample standard deviations, and so this is what this article will focus on, although the formula for a population standard deviation will also be shown.
What are the formulas for the standard deviation?
The sample standard deviation formula is:
s = √( Σ(X - X̄)² / (n - 1) )
where,
s = sample standard deviation
Σ = sum of...
X̄ = sample mean
n = number of scores in sample.
The population standard deviation formula is:
σ = √( Σ(X - μ)² / n )
where,
σ = population standard deviation
Σ = sum of...
μ = population mean
n = number of scores in population.
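The two formulas can be compared in R. Note that R's built-in sd() uses the sample formula (dividing by n - 1), so the population version has to be written out by hand. The scores below are hypothetical.

```r
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)   # hypothetical scores
n <- length(scores)

# Sample standard deviation: sd() divides the sum of squares by n - 1.
sample_sd <- sd(scores)

# Population standard deviation: divide the sum of squares by n instead.
population_sd <- sqrt(sum((scores - mean(scores))^2) / n)

c(sample = sample_sd, population = population_sd)
```

The sample value is always slightly larger, and the gap shrinks as n grows.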
Introduction To Statistic - Part 5
Is the data discrete or continuous?
The first and most obvious categorization of data should be on whether the data is restricted to taking on only discrete values or if it is continuous. Consider the inputs into a typical project analysis at a firm. Most estimates that go into the analysis come from distributions that are continuous; market size, market share and profit margins, for instance, are all continuous variables. There are some important risk factors, though, that can take on only discrete forms, including regulatory actions and the threat of a terrorist attack; in the first case, the regulatory authority may dispense one of two or more decisions which are specified up front and in the latter, you are subjected to a terrorist attack or you are not.
With discrete data, the entire distribution can either be developed from scratch or the data can be fitted to a pre-specified discrete distribution. With the former, there are two steps to building the distribution. The first is identifying the possible outcomes and the second is to assign probabilities to each outcome. As we noted in the text, we can draw on historical data or experience as well as specific knowledge about the investment being analyzed to arrive at the final distribution. This process is relatively simple to accomplish when there are a few outcomes with a well-established basis for estimating probabilities but becomes more tedious as the number of outcomes increases. If it is difficult or impossible to build up a customized distribution, the data can instead be fitted to one of the pre-specified discrete distributions listed below.
Types of Distributions
Discrete Probability Distribution
1. Binomial Distribution
2. Negative Binomial Distribution
3. Geometric Distribution
Continuous Probability Distribution
1. Normal Distribution
2. Gamma Distribution
3. Beta Distribution
4. Exponential Distribution
Normal distribution
Normal distribution represents the behavior of many situations in the universe (That is why it’s called a “normal” distribution. I guess!). The sum of a large number of (small) random variables often turns out to be normally distributed, contributing to its widespread application. A normal distribution has the following characteristics:
The mean, median and mode of the distribution coincide.
The curve of the distribution is bell-shaped and symmetrical about the line x=μ.
The total area under the curve is 1.
Exactly half of the values are to the left of the center and the other half to the right.
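A normal distribution is highly different from a binomial distribution: the normal distribution is continuous, while the binomial distribution is discrete. However, as the number of trials approaches infinity, the shape of the binomial distribution becomes quite similar to the normal curve.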
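The similarity for a large number of trials can be checked numerically. This sketch compares an exact binomial probability with its normal approximation; the number of trials, the success probability, and the interval are arbitrary choices, and the half-unit continuity correction is a standard adjustment that the text above does not mention.

```r
n <- 1000   # number of trials (arbitrary, chosen to be large)
p <- 0.5    # probability of success (arbitrary)

# Exact binomial probability of between 480 and 520 successes.
exact <- sum(dbinom(480:520, size = n, prob = p))

# Normal approximation with the same mean and standard deviation.
mu <- n * p
sigma <- sqrt(n * p * (1 - p))
approx_normal <- pnorm(520.5, mean = mu, sd = sigma) -
  pnorm(479.5, mean = mu, sd = sigma)

c(binomial = exact, normal = approx_normal)
```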
Z Score
A z-score (aka, a standard score) indicates how many standard deviations an element is from the mean. A z-score can be calculated from the following formula.
z = (X - μ) / σ
where z is the z-score, X is the value of the element, μ is the population mean, and σ is the standard deviation.
Here is how to interpret z-scores
A z-score less than 0 represents an element less than the mean.
A z-score greater than 0 represents an element greater than the mean.
A z-score equal to 0 represents an element equal to the mean.
A z-score equal to 1 represents an element that is 1 standard deviation greater than the mean; a z-score equal to 2, 2 standard deviations greater than the mean; etc.
A z-score equal to -1 represents an element that is 1 standard deviation less than the mean; a z-score equal to -2, 2 standard deviations less than the mean; etc.
If the number of elements in the set is large and the values are approximately normally distributed, about 68% of the elements have a z-score between -1 and 1; about 95% have a z-score between -2 and 2; and about 99.7% have a z-score between -3 and 3.
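The z-score formula and the percentages above can be reproduced in R. As an assumption for illustration, the mpg column of mtcars is standardized using its own mean and standard deviation in place of the population values μ and σ.

```r
# z-scores for a data set: subtract the mean, divide by the standard deviation.
x <- mtcars$mpg
z <- (x - mean(x)) / sd(x)
head(round(z, 2))

# Share of a normal distribution within 1, 2 and 3 standard deviations of the mean.
within <- sapply(1:3, function(k) pnorm(k) - pnorm(-k))
round(within, 4)   # 0.6827 0.9545 0.9973
```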
Introduction To Statistic - Part 7
The following R commands work through the normal distribution calculations discussed above:
##########################################################################
# Lesson 7: Introduction To Statistics
# Topic: Normal Distribution
##########################################################################
# Normal distribution in R
##########################################
# pnorm(q) - The function pnorm returns the integral from -Inf to q
# of the normal distribution where q is a Z-score.
# pnorm(1.96, 0, 1) Gives the area under the standard normal curve to
# the left of 1.96, i.e. ~0.975
##########################################
pnorm(0) # -Inf to Z = 0 in the standard normal distribution = 50%
pnorm(1) # 50% + 34% = 84%
pnorm(2) # 50% + (95/2)% = 50% + 47.5% = 97.5%
pnorm(3) # 50% + (99.7/2)% = 50% + 49.85% = 99.85%
# Pnorm by default assumes Z-Normal distribution but you can change the mean and std dev
pnorm(0, mean = 0, sd = 1)
pnorm(1, mean = 0, sd = 1)
pnorm(2, mean = 0, sd = 1)
pnorm(3, mean = 0, sd = 1)
# Pnorm by default assumes lower.tail = TRUE,
# but you can calculate the right side area of the normal curve by giving lower.tail = FALSE
1- pnorm(1, mean = 0, sd = 1)
pnorm(1, mean = 0, sd = 1, lower.tail = FALSE)
1 - pnorm(2, mean = 0, sd = 1)
pnorm(2, mean = 0, sd = 1, lower.tail = FALSE)
1 - pnorm(3, mean = 0, sd = 1)
pnorm(3, mean = 0, sd = 1, lower.tail = FALSE)
# With a non-standard mean and sd, pnorm() converts X to a Z-score internally
# and then returns the corresponding probability
pnorm(10, mean = 10, sd = 2)
pnorm(12, mean = 10, sd = 2)
pnorm(14, mean = 10, sd = 2)
pnorm(16, mean = 10, sd = 2)
# Example 2:
# Suppose IQs are normally distributed with a mean of 100 and a standard deviation of 15.
# What percentage of people have an IQ less than 125?
pnorm(125, mean = 100, sd = 15, lower.tail = TRUE)
# What percentage of people have an IQ greater than 110?
pnorm(110, mean = 100, sd = 15, lower.tail = FALSE)
# What percentage of people have an IQ between 110 and 125?
# P(110 < IQ < 125) = P(IQ < 125) - P(IQ < 110)
a <- pnorm(125, mean = 100, sd = 15, lower.tail = TRUE)
b <- pnorm(110, mean = 100, sd = 15, lower.tail = TRUE)
a - b
##########################################
Introduction To Statistic - Part 10
Statistical Inference
The process of making claims about a population based on information from a sample
Aspect | Population | Sample
Definition | Complete enumeration of items is considered | Part of the population chosen for study
Characteristics | Parameters | Statistics
Central Limit Theorem
The central limit theorem (CLT) is a statistical theory that states that
• Given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
• So, for a population with finite mean μ and a finite non-zero variance σ², the sampling distribution of the mean approaches a normal distribution with a mean of μ and a variance of σ²/n as the sample size (n) increases.
• The amazing and counter-intuitive thing about the central limit theorem is that no matter what the shape of the original (parent) distribution, the sampling distribution of the mean approaches a normal distribution.
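The theorem can be illustrated with a short simulation. This is a sketch under stated assumptions: the parent distribution is chosen to be exponential with rate 1 (strongly skewed, with mean 1 and variance 1), and the sample size, number of repetitions, and seed are arbitrary.

```r
set.seed(123)   # arbitrary seed, only for reproducibility
n <- 40         # size of each sample (arbitrary)
reps <- 5000    # number of samples drawn (arbitrary)

# Parent distribution: exponential with rate 1, so mean = 1 and variance = 1.
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means centre on the population mean with variance close to 1/n.
mean(sample_means)
var(sample_means)
1 / n

# The histogram of sample means looks bell-shaped despite the skewed parent.
hist(sample_means, breaks = 40, main = "Sampling distribution of the mean")
```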