
This course includes our updated coding exercises so you can practice your skills as you learn.
Begin a from-scratch course in statistics and biostatistics, building concepts from basics to practical data analysis with theory, coding in R, assignments, and quizzes.
Explore numeric and categorical data concepts, including numeric variables and continuous and discrete numeric data, and learn how these variables underpin data visualization and analysis.
Learn how discretization converts numeric data to categorical data using body mass index as an example, creating underweight, normal, overweight, and obese categories.
Define the population and contrast random sampling in observational studies with random assignment in experimental studies. Show how random sampling yields representative samples and random assignment applies treatments randomly.
Learn to select representative samples from a population for observational studies. Explore sampling strategies such as random, stratified, cluster, and multistage sampling, and why no single method fits all.
Learn how stratified sampling divides a population into strata, then samples from each group to ensure a representative data set with equal or proportional sizes.
Explore cluster and multi-stage sampling strategies for cost-effective surveys in homogeneous populations, using subpopulations and two-stage sampling to balance practicality and representativeness.
Learn how to install R from CRAN and set up RStudio on your system, choosing the right OS options and verifying integration by running a simple calculation.
Explore vectors, the simplest one-dimensional data type in R, and learn how numeric, logical, and character vectors are created with concatenation, with appropriate quoting rules.
Explore matrices in R as two-dimensional arrays where all elements share the same type, and learn to create a 3x2 matrix using the matrix function with data 1 to 6.
Discover the R base package and its automatically loaded functions, including mean, median, and density plots, usable in RStudio without calling library.
Explore Bioconductor, an open source project for rigorous, reproducible analysis of biological data, and learn to install Bioconductor in R using BiocManager to install packages like GenomicFeatures.
Learn the basic definition of statistics as a tool to analyze, interpret, and present data in a meaningful form.
Learn the two basic types of statistics: descriptive statistics describe and summarize data, while inferential statistics use a sample to infer properties of the population, such as its mean.
Understand data distribution, including normal and skewed shapes, identify the central point and the two equal halves, and learn to apply statistics accordingly.
Build and interpret distribution plots in R to analyze a CSV clinical data set using histogram, box plot, density plot, and q-q plot.
Identify what outliers are and why they appear in data, from measurement errors to natural extremes, and learn why they heavily influence statistical analysis.
Learn the concept of the mode, the most frequent value in a data set, illustrated with a bar plot and a clinical example, and its suitability for categorical data.
Learn to calculate the interquartile range in R using the quantile function on the age variable, interpreting Q1, Q2, and Q3 to obtain the IQR.
Explore how variance measures data spread: take each value's deviation from the mean, square it, sum the squared deviations, and divide by n-1 to obtain the sample variance s^2, illustrated with the values 1 to 5.
Learn how standard deviation, the square root of variance, makes variability interpretable. Compare small and large deviations to gauge data spread.
Learn to calculate the standard deviation in R with the sd function on the data set's age, view the 10.85 result in the console, and compare it to variance.
Learn to compute all descriptive statistics for numeric variables at once using the describe function from the psych package, and interpret mean, median, sd, mad, min, max, skew, kurtosis, and range.
Discover how to analyze categorical data by using frequency tables, percentage proportions, mode, contingency tables, and visualizations.
Learn how to create frequency tables for categorical data, describing counts by category (sex, exercise, diabetes, smoking), and visualize with charts, using R's table function.
Learn how to build a frequency table for a categorical variable in R using the table function, with a walkthrough of the sex variable and interpreting the console output.
Learn to analyze categorical data by calculating percentages and proportions using frequency tables and prop.table in R, enabling comparison across datasets of different sizes.
Find the mode of a categorical variable in R with a custom function after loading the clinical data in RStudio, for exercise frequency (0–3 levels) showing 1 as the mode.
Explore contingency tables, or cross tabs, showing joint frequency distributions for two variables, such as gender and pet type, with margins; learn to build them in R with table and addmargins.
Learn to build a contingency table for two categorical variables in R using table and addmargins, store it in a variable, and print the final table with margins.
Build bar plots, pie plots, and dot plots from a single categorical variable in R by creating a frequency table and visualizing the gender data from the sex column.
Build heat maps, mosaic plots, and stacked bar charts to visualize the interaction of two categorical variables using a contingency table in R, with ggplot insights.
Explore the four basic types of probability (classical, empirical, subjective, and conditional) and how equally likely outcomes, from coin flips to genotype crosses, illustrate classical (theoretical) probability.
Explore empirical probability, the experimental probability based on real observations in clinical trials and population studies. Contrast it with subjective probability rooted in judgment when data is scarce.
Explore conditional probability by computing the chance of event a given b occurred, using joint probability and a positive probability of b, illustrated with red and blue balls.
Explore correlation as a statistical measure of the relationship between two variables, from -1 to 1, and interpret positive, negative, and zero correlations using Pearson, Spearman, and Kendall methods.
Apply Pearson correlation to continuous data with normal distribution and a linear relationship, using x_i, y_i, and their means. Compute it in R with the base function.
Learn to calculate Pearson correlation between systolic blood pressure and cholesterol in R, using a CSV dataset, with density and scatter plots and a regression line.
Learn to install and load ggplot2 in R, build a scatter plot with systolic blood pressure vs cholesterol, refine with themes, labels, and a linear trend line for publications.
Explore how to apply Spearman correlation to data with a monotonic relationship by ranking math and science scores, computing rank differences, and using the formula, including an R calculation example.
Compute the Spearman correlation in R with a practical workflow using a data set and ggplot visualizations, interpreting a weak negative relationship between systolic blood pressure and heart rate.
Explore Kendall tau correlation, a non-parametric measure of monotonic relationships for ordinal or continuous data, using concordant and discordant pairs and computing with cor(x, y, method = 'kendall').
Define regression and model the relationship between the independent variable (explanatory) and the dependent variable (response), using nutrition and height as an example with x and y axes.
Explore simple linear regression by modeling y from x with a regression line, using y = β0 + β1x + ε and minimizing the sum of squared differences between observed and fitted values.
Understand R-squared, the coefficient of determination, which measures how much of the dependent variable's variability is explained by the independent variable; values span 0 to 1.
Build a simple linear regression model in R using the lm function from the stats package, specify dependent and independent variables, and view results with summary including coefficients and r-squared.
Build a simple linear regression model in R to explore how age affects heart rate, then visualize with ggplot and interpret the model outputs.
Visualize the regression line and interpret the intercept as about 72.68 beats per minute when age is zero, then compare the BMI-cholesterol model with slope 5.1358, p<0.001, and r-squared 0.5835.
Learn to build a multiple linear regression model in R using lm and data variable, with heart rate as the response and age, systolic blood pressure, and cholesterol as predictors.
Visualize a four-variable multiple linear regression in R with ggplot, plotting heart rate against age, systolic blood pressure, and cholesterol on a single plot with regression lines from model coefficients.
Interpret a multiple linear regression model in R, examining residuals, coefficients, p-values, and R-squared while relating age, systolic blood pressure, and cholesterol to heart rate.
Explore linearity in regression and how a proportional relationship between response and explanatory variables improves model predictability. See additive effects in categorical variables for better models in multiple linear regression.
Understand the independence assumption, which requires observations to be independent of one another, and how violations bias parameter estimates and inflate type I error. Illustrate with a 30-student example and see how random sampling helps ensure independence.
Explore why linear regression fails with a categorical variable and see how logistic regression models the data, emphasizing the shift from continuous to categorical prediction.
Explore the logistic regression formula, including the logit function and odds, and learn how the coefficients β0 and β1 are used to estimate the probability of passing from study hours.
Load the DB2 data into RStudio, convert diabetes_status to a factor, and build a logistic regression model with glm and binomial family to analyze age and BMI effects.
Learn to interpret logistic regression results in RStudio via the age coefficient, p-values, and null vs residual deviance, and visualize the sigmoid probability curve.
Interpret the logistic regression curve to read age-related diabetes probability from color-coded dots (red = non-diabetic, blue = diabetic). Learn estimating and converting a given age’s probability to a percentage.
Define null and alternative hypotheses as the default and opposite statements in hypothesis testing, with battery life and drug efficacy examples, and explain type one error and type two error.
Set up the hypotheses and the alpha level, then choose an appropriate test—t, z, one-way ANOVA, or two-way ANOVA—based on data and design, and decide on the null.
Explore the level of significance, alpha, and how it sets a threshold for rejecting the null hypothesis in hypothesis testing, balancing confidence, evidence, and risk.
Explore parametric and nonparametric hypothesis tests, learn when to use z, t, f tests and ANOVA, and identify chi-square, Mann–Whitney, and Wilcoxon tests for skewed data.
Explore the traditional (critical value) method and the p-value based method of hypothesis testing, decide between one- and two-tailed tests, and apply decision rules using critical values to reject or fail to reject the null hypothesis.
Master the p-value based decision method by comparing p values to alpha, calculate p values via z, t, and f tests, and plan future parametric and nonparametric analyses.
Determine which data require which statistical test by learning the basic assumptions of each test. We will start with the Z test.
Learn to perform a one-sample z-test in R using the BSDA package's z.test, specifying sample data, population mean and standard deviation, choosing one- or two-tailed tests, and interpreting p-values.
This lecture introduces the t test, contrasts it with the z test, and explains when to use it for n<30 samples, including one-sample and two-sample (independent and dependent) cases.
Learn to perform a one-sample t-test in R using the BSDA package's t.test, compare sample data to a population mean, and interpret p-values and confidence levels.
Learn to perform a two-sample dependent (paired) t-test in R with the BSDA t.test function, including setting the alternative, the confidence level, and the correct variable order.
Learn to perform a paired two-sample t-test in R using before-and-after weight data, loading a CSV, running t.test, and interpreting a left-tailed result.
Learn how the f-test, named for Ronald A. Fisher, compares variances between two groups, using an example with two dosages and a 0.05 significance level.
Compare the F test with t and z tests, showing why F handles two or more groups and variance homogeneity better, and guide R-based calculation.
Use the BSDA library in R to perform an F-test with var.test, specifying the high-variance and low-variance variables in the correct order, a right-tailed test, and a chosen confidence level.
Learn to run a two-way ANOVA in R using the BSDA library's aov function, with data in long format, and specify two factors using the tilde (~) and the multiplication sign (*) in RStudio.
Explore nonparametric hypothesis testing for categorical data with the chi-square test, including independence and goodness-of-fit, and learn how to perform calculations in R.
Apply the chi-square test of independence to assess association between two categorical variables, such as genotype and disease resistance, by comparing observed and expected frequencies with a 0.05 significance level.
Learn to compute chi-square statistics in R using the base function chisq.test with sample data and no extra library, ready for use in RStudio.
Use the chi-square goodness of fit test to compare observed versus known population frequencies, set alpha 0.05 and degrees of freedom, and decide acceptance or rejection of the null hypothesis.
Master chi square goodness of fit with the R base function, provide the data variable name, and let R compute the statistics from your data.
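The variance and standard deviation lectures listed above work through the values 1 to 5. A minimal R sketch of that calculation might look like the following; the variable names are illustrative and not taken from the course materials.

```r
# Sample variance and standard deviation for the values 1 to 5.
# Variable names are illustrative assumptions, not the course's own code.
x <- 1:5
n <- length(x)

# Squared deviations from the mean
sq_dev <- (x - mean(x))^2

# Sample variance: sum of squared deviations divided by n - 1
s2 <- sum(sq_dev) / (n - 1)

# Standard deviation: square root of the variance
s <- sqrt(s2)

print(s2)
print(s)

# Base R's var() and sd() use the same n - 1 denominator,
# so both comparisons should report TRUE
print(all.equal(s2, var(x)))
print(all.equal(s, sd(x)))
```

Computing the statistic by hand first and then checking it against var and sd is a simple way to confirm that the formula and the built-in functions agree.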
Welcome to our fourth course "Learn Statistics & Biostatistics Data Analysis From Scratch". In this course, you will start from the very fundamentals of Data and slowly move forward to the analysis of the data using different statistical tools. In an era dominated by big data and machine learning, statistics is the cornerstone that allows us to make sense of the vast amounts of information we collect. It provides the methodologies for the collection, analysis, interpretation, and presentation of data. This course not only makes you literate in the language of data but also empowers you to make informed decisions in business, science, and technology.
In this course, you will also learn R programming to calculate different statistics on your data. R programming is one of the most sought-after skills in the fields of statistics, biostatistics, and data analysis. With its extensive libraries and frameworks, R provides a powerful platform for analyzing and visualizing data, making it an indispensable tool for statisticians and data scientists. This course provides hands-on experience with R, ensuring you can apply statistical methods effectively in real-world scenarios.
This course is divided into eight modules:
What is Data? - Understand the basics of data, its types, and how it's collected and organized.
Introduction to R Programming - Dive into R and RStudio, powerful tools for statistical computing and graphics, essential for modern data analysis.
Descriptive Statistics - Learn to summarize and describe the essential features of numerical data, crucial for initial data exploration. You will also learn how to visualize them.
Handling Categorical Data - Explore techniques for effectively managing and analyzing categorical variables.
Probabilities - Gain insights into the concepts of probability, a foundational pillar for statistical inference. By the end of this module, you will understand classical, empirical, subjective, and conditional probability.
Correlation - Discover the methods to measure the strength and direction of a relationship between two variables. We explain the Pearson, Kendall, and Spearman correlations.
Regression - Understand how to model relationships between variables and make predictions. We will teach you about Simple linear Regression, Multiple Linear Regression, and Logistic Regression.
Hypothesis Testing - Develop the ability to test assumptions and make decisions based on data. You will learn the Z-test, T-test and its types, F-test, ANOVA and its types, and Chi-Sq test and its types.
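As a rough illustration of the Correlation module, the three coefficients it names can all be computed with R's cor function by changing the method argument. The two small vectors below are made-up example data, not the course's clinical data set.

```r
# Pearson, Spearman, and Kendall correlation with cor().
# x and y are hypothetical example data, not the course's data set.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)

# Pearson: linear relationship between the raw values
r_pearson <- cor(x, y, method = "pearson")

# Spearman: Pearson correlation computed on the ranks
r_spearman <- cor(x, y, method = "spearman")

# Kendall: based on concordant and discordant pairs
r_kendall <- cor(x, y, method = "kendall")

print(c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall))
```

All three coefficients fall between -1 and 1; which one is appropriate depends on the data type and on whether the relationship is linear or only monotonic.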
This course is a unique blend of theory and practice. You will learn the theory behind statistical concepts and, alongside it, the R programming needed to apply those concepts to your data. We hope this journey will be enlightening for you. After taking this course, you will be confident in analyzing your data on your own.