
Explore how data analysis sits inside the data science landscape—from research questions and data collection to data cleaning, validation, exploratory analysis, modeling, and communicating findings.
Explore the data analysis pipeline from initial data analysis to exploratory data analysis and confirmatory data analysis, focusing on data quality checks, preprocessing, visualization, and hypothesis testing.
Learn why researchers use samples instead of testing whole populations, and how random, representative samples estimate population parameters with sample statistics.
Explore the tidyverse ecosystem, a suite of data cleaning and preprocessing packages with pipe-based workflows. Master core tools and the readr and tibble concepts for efficient data preparation.
Install and update R and RStudio, run the course scripts to install required packages (base and tidyverse), and explore the car parts, diamonds, and flights datasets used as training data.
Contrast data science with data analysis and the initial and exploratory stages. Grasp population versus sample and the normal distribution while setting up R and RStudio.
Learn to perform initial data analysis by preparing and validating data quality, importing and restructuring oddly shaped data, and applying duplicate removal, missing value imputation, and outlier detection in R.
Learn the succession of data preprocessing steps in R, from importing and tabular restructuring to type classification, duplicate removal, missing value imputation, and outlier handling.
Import tabular data into R using the readr package, skip the X1 column, and classify fields as factors or dates.
Learn how to reshape oddly shaped data in R by using tidyr's gather and spread to convert between long and wide formats, and split concatenated variables with the separate function.
Apply sampling methods in base R to create representative subsets, addressing sample size and sampling error. Use stratified sampling to preserve population proportions and ensure reproducibility with set.seed.
Learn stratified sampling in R with slice_sample and group_by to achieve proportional representation across iris species, using seeds and pipe syntax for reproducible, streamlined sampling.
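As a minimal sketch of the stratified sampling described above, the following draws the same fraction of rows from every iris species. The 20% fraction and the seed value are arbitrary choices for illustration, not values taken from the course.

```r
library(dplyr)

set.seed(42)  # fixed seed so the same rows are drawn on every run

# Draw 20% of the rows within each species (proportional stratified sample)
iris_sample <- iris %>%
  group_by(Species) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

# iris has 50 rows per species, so each stratum contributes 10 rows
count(iris_sample, Species)
```

Grouping before sampling is what keeps the species proportions of the sample equal to those of the full data set.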
Explore core data types and structures in R (vectors, lists, data frames, and tibbles) and how class and structure affect analysis and visualization.
Identify and categorize missing values as MCAR, MAR, or MNAR. Apply deletion, hot deck, mean imputation, interpolation, and multiple imputation by chained equations to mitigate bias.
Explore visual tools from the visdat package to detect patterns in missing data, using color-coded and greyscale plots to relate missingness to data types and overall missing value percentages.
Learn simple missing-value handling in R: delete with na.omit, and impute with zoo package tools (na.fill, na.locf, na.approx, na.spline) and na.aggregate for group-wise means.
Explore missing data with the mice package, performing multiple imputation by chained equations using 24 methods for diverse data types, and inspect patterns with the md.pattern function.
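The helpers named above (na.fill, na.locf, na.approx, na.aggregate) are exported by the zoo package. The sketch below applies them to a short made-up vector so the differences are easy to see. The vector is purely illustrative.

```r
library(zoo)

x <- c(1, NA, 3, NA, NA, 6)  # made-up series with gaps

x_deleted <- as.numeric(na.omit(x))  # deletion: keeps only the observed values
x_const   <- na.fill(x, 0)           # replace every NA with a constant (here 0)
x_locf    <- na.locf(x)              # last observation carried forward
x_linear  <- na.approx(x)            # linear interpolation between neighbours
x_mean    <- na.aggregate(x)         # replace NA with the mean of the observed values
```

na.aggregate also accepts a grouping argument (by) for group-wise means. It is left out here to keep the example small.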
Assess missing values in profit and sales to decide an appropriate imputation strategy, considering outliers and avoiding mean-based or ordered methods, and apply random forest imputation.
Explore a random forest-based multiple imputation method using mice with three imputations and a fixed seed, then validate data for sales and profit with no new outliers.
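A rough sketch of such a call, assuming the nhanes example data that ships with mice as a stand-in for the course data. The seed value is arbitrary, and the "rf" method may require an additional random forest package to be installed.

```r
library(mice)

# nhanes ships with mice and stands in for the course data here
imp <- mice(nhanes, m = 3, method = "rf", seed = 123, printFlag = FALSE)

# Extract the first of the three completed data sets
completed <- complete(imp, 1)

# Check that no missing values remain
colSums(is.na(completed))
```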
Validate numeric variables before outlier detection by ensuring positive sales and profit and that sales exceed profit, then flag validity and remove invalid observations for reliable analysis in R.
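The three rules above can be sketched in base R as follows. The data frame is made up for illustration, and the column names sales and profit simply mirror the wording of the lecture summary.

```r
# Made-up records; not taken from the course data
orders <- data.frame(
  sales  = c(120, 80, -15, 60, 40),
  profit = c( 30, 95,   5, -8, 10)
)

# An observation is valid only if all three rules hold
orders$valid <- orders$sales > 0 &
  orders$profit > 0 &
  orders$sales > orders$profit

# Keep only the valid observations for the later outlier checks
orders_clean <- orders[orders$valid, ]
```

Storing the result as a flag column first, rather than filtering right away, makes it possible to inspect the invalid rows before they are dropped.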
Identify outliers and invalid values after imputing missing data, and assess their impact on analysis accuracy. Explore delete, substitute, or preserve strategies, with one-column and multivariable approaches and plausibility checks.
Explore univariate outlier detection with descriptive statistics, six sigma methods, visual tools, and box plots in a car parts dataset, imputing missing values and examining profit and sales.
Learn how the boxplot method detects outliers in a single variable using Q1, Q3, and 1.5 times the interquartile range, with whiskers and the out element.
Apply the six sigma edit rule to identify outliers using mean plus or minus three standard deviations, with attention to normal distribution and imputation.
Explore outlier detection with hypothesis tests in R using Grubbs' test and the outliers package, including one- and two-sided options, test types, p-values, and practical interpretation of results.
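To make the rule concrete, here is a small base R sketch on made-up values. It computes the fences by hand from Q1, Q3, and the interquartile range (IQR), then compares the result with the out element returned by boxplot.stats.

```r
x <- c(10, 11, 11, 12, 12, 13, 14, 50)  # made-up values

q1  <- quantile(x, 0.25)
q3  <- quantile(x, 0.75)
iqr <- q3 - q1            # interquartile range, same as IQR(x)

lower <- q1 - 1.5 * iqr   # lower fence
upper <- q3 + 1.5 * iqr   # upper fence
manual_out <- x[x < lower | x > upper]

# boxplot.stats() applies the same 1.5 rule, but based on hinges,
# which can differ slightly from quantile() for some sample sizes
bp_out <- boxplot.stats(x)$out
```

In this example both approaches flag only the value 50.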
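A minimal sketch of the mean plus or minus three standard deviations rule on made-up data. Note that the rule presumes roughly normal data, and that the extreme value itself inflates the mean and standard deviation, so very small samples may never produce a flag.

```r
# Made-up values: twenty ordinary observations and one extreme value
x <- c(rep(c(10, 11, 12, 13), times = 5), 100)

m <- mean(x)
s <- sd(x)

lower <- m - 3 * s
upper <- m + 3 * s

# Flag observations outside the three standard deviation band
is_outlier <- x < lower | x > upper
x[is_outlier]
```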
Explore multivariate outlier detection by contrasting model-based and proximity-based approaches. Learn methods that transform data along eigenvectors to account for covariance and high-dimensional data, and compare points to a distribution.
Apply the robust principal component algorithm with the pcout function to detect outliers in numeric car parts data, then assign a binary normal/outlier classifier and visualize the results.
Apply the Box-Cox transformation to right-skewed data using lambda to assess normality and detect outliers. Compare methods after transformation and decide to keep, remove, or impute identified observations.
Explore plausibility checks for non-numeric data using a summary to spot typos and misclassified categories, then reclassify and drop empty levels; adjust dates with the lubridate package year function.
Learn initial data analysis in R by importing, cleaning, and converting data to tibbles, then handle duplicates, missing values, and outliers with simple and multiple imputation.
Master exploratory data analysis with a reusable blueprint, using data visualizations and chart types to reveal factors influencing price and arrival delays, including time-series summaries and a generalized linear model.
Explore how data visualizations empower exploratory data analysis by enabling interactive, rapid insights, telling the analytical story through charts like box plots and a dashboard.
Explore the ggplot2 diamonds data set in R by inspecting its structure, summary statistics, and potential outliers, duplicates, balance, and missing values for a quick quality check.
Explore numeric variable distributions using non-parametric methods, compare histograms and box plots to reveal skewness, center, and variability without assuming a specific distribution.
Compare numeric variable distributions to theoretical models using the Q-Q plot to assess normality and fit. Explore parametric options like lognormal, gamma, and Weibull, with data transformations and outlier considerations.
Explore how numeric variables relate using scatterplots to reveal linear trends, correlation, and regression concepts, and understand why correlation does not imply causation in exploratory data analysis.
Explore the nycflights13 flights data with 336k observations and 19 variables, focusing on delays and fragmented date-time fields. Learn when to use correlation and hypothesis tests, plus sampling.
Analyze the dataset summary and fix variable classes and date time components from the 2013 data with 24-hour time, remove incomplete rows, and prepare for the exploratory phase in R.
Explore grouping variables using count summaries to reveal unbalanced carrier groups and evenly distributed origin airports, then visualize with a bar chart using reorder for clear ordering.
Assemble summary tables using group by and summarize to compute flights, total distance, arrival delays in minutes, and arrival delay per 1,000 miles by carrier, then compare punctuality across carriers.
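A sketch of such a summary table, assuming the flights data from the nycflights13 package. The output column names are hypothetical, and dropping rows with a missing arrival delay is one simple choice among several.

```r
library(dplyr)
library(nycflights13)

carrier_summary <- flights %>%
  filter(!is.na(arr_delay)) %>%            # keep flights with a recorded arrival delay
  group_by(carrier) %>%
  summarise(
    n_flights       = n(),                 # number of flights
    total_distance  = sum(distance),       # miles
    total_arr_delay = sum(arr_delay),      # minutes
    .groups = "drop"
  ) %>%
  mutate(delay_per_1000_miles = total_arr_delay / total_distance * 1000) %>%
  arrange(delay_per_1000_miles)            # most punctual carriers first
```

Normalizing the delay by distance makes carriers with very different route lengths easier to compare.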
Analyze numeric distributions with histograms and box plots to compare flight distance, air time, and departure and arrival delays, reveal skewness, and guide nonparametric testing.
Explore a year's flight data by merging date and time components into a daily timeline, then analyze daily flights and delays with histograms, box plots, and scatterplots to reveal irregularities.
Explore time-based data in R by visualizing daily flights against the date, and analyze weekly and seasonal patterns with line and bar plots, delays, and derived weekday and week variables.
Examine how to assess linear relationships between numeric variables using correlation coefficients and scatterplots, with Spearman and Pearson comparisons showing near zero distance–delay and strong departure–arrival delay correlations.
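The comparison above can be sketched with base R's cor function, assuming the flights data from nycflights13 and dropping rows with missing values in the three variables used.

```r
library(nycflights13)

vars <- c("distance", "dep_delay", "arr_delay")
d <- flights[complete.cases(flights[, vars]), vars]

# Distance versus arrival delay
r_dist_pearson  <- cor(d$distance, d$arr_delay, method = "pearson")
r_dist_spearman <- cor(d$distance, d$arr_delay, method = "spearman")

# Departure delay versus arrival delay
r_dep_pearson  <- cor(d$dep_delay, d$arr_delay, method = "pearson")
r_dep_spearman <- cor(d$dep_delay, d$arr_delay, method = "spearman")
```

Pearson measures linear association, while Spearman works on ranks and is less sensitive to the heavy right tail of the delay variables.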
Learn to measure the association between departure and arrival delays using the odds ratio with binary delay indicators and relative risk, and interpret the results.
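As a worked example with hypothetical counts (not taken from the flights data), relative risk and the odds ratio can be computed from a 2x2 table of binary delay indicators like this:

```r
# Hypothetical counts, for illustration only
tab <- matrix(c(80, 20,
                10, 90),
              nrow = 2, byrow = TRUE,
              dimnames = list(dep = c("delayed", "on_time"),
                              arr = c("delayed", "on_time")))

# Risk of a late arrival in each departure group
risk_dep_delayed <- tab["delayed", "delayed"] / sum(tab["delayed", ])   # 80 / 100
risk_dep_on_time <- tab["on_time", "delayed"] / sum(tab["on_time", ])   # 10 / 100

relative_risk <- risk_dep_delayed / risk_dep_on_time                    # 0.8 / 0.1 = 8

# Odds ratio: cross-product of the table cells
odds_ratio <- (tab["delayed", "delayed"] * tab["on_time", "on_time"]) /
  (tab["delayed", "on_time"] * tab["on_time", "delayed"])               # 7200 / 200 = 36
```

With these made-up numbers a late departure makes a late arrival 8 times as likely, while the odds are 36 times as high. The two measures diverge because late arrivals are common in the delayed group.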
Explore how statistical models in exploratory analysis identify independent and dependent variables, using correlations and regressions to assess relationships, select features, and quantify uncertainty with coefficients and random error.
Explore logistic regression for binary outcomes with grouping variables, using a binomial GLM to model arrival delay. Learn that departure delay and carrier matter, while distance and origin/destination may not.
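A hedged sketch of such a model, assuming the flights data from nycflights13. Defining "delayed" as an arrival delay above zero minutes, the 20,000-row random sample, and the seed are all assumptions made for this illustration, not choices confirmed by the course.

```r
library(nycflights13)

vars <- c("arr_delay", "dep_delay", "carrier", "distance", "origin")
d <- na.omit(as.data.frame(flights[, vars]))

# Assumption: a flight counts as delayed if it arrives more than 0 minutes late
d$arr_late <- as.integer(d$arr_delay > 0)

# Work on a random sample to keep the fit fast
set.seed(1)
d <- d[sample(nrow(d), 20000), ]

# Binomial GLM (logistic regression) with numeric and grouping predictors
fit <- glm(arr_late ~ dep_delay + carrier + distance + origin,
           data = d, family = binomial)

summary(fit)
```

The coefficient table from summary(fit) shows which predictors have estimates that are clearly distinguishable from zero.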
Investigate the factors behind arrival delays in the flights dataset through exploratory analysis. Highlight carrier and departure delay as influential predictors, and note weather context on extreme days.
Celebrate completing the data science course and apply techniques from data analysis courses in Python and Tableau. Explore the portfolio and instructor profile for the latest news, courses, monthly promotions, and discounts.
Are you new to R and data analysis?
Do you ever struggle starting an analysis with a new dataset?
Do you have problems getting the data into shape and selecting the right tools to work with?
Have you ever wondered if a dataset had the information you were interested in and if it was worth the effort?
If some of these questions have occurred to you, then this program might be a good way to start your data analysis journey. In fact, these were the questions I had in mind when I designed the curriculum of this course. As you can see below, the curriculum is divided into three main sections. Although this course doesn't focus on the basic concepts of statistics, some of the most important ones are covered in the first section.
The other two sections focus on the initial and the exploratory data analysis phases, respectively. Initial data analysis (or IDA for short) is where we clean and shape the data into a form suitable for the planned methods. This is also where we make sure the data makes sense from a statistical point of view. In the IDA section I present tools and methods that will help you figure out whether the data was collected properly and whether it is worth analyzing.
The exploratory data analysis (EDA) section, on the other hand, offers techniques to find out whether the data can answer your analytical questions, or in other words, whether the data has a relevant story to tell. This will spare you from investing time and effort in a project that will not deliver the results you hoped for. Ideally, the results of EDA confirm that the planned analysis is worthwhile and that there are insights to be gained from the dataset and project.
If you are interested in statistical methods and R tools that help you bridge the gap between data collection and the confirmatory data analysis (CDA), then this program is for you. Take a look at the curriculum and give this course a try!