This course is intended for R and data science professionals aiming to master R. Intermediate and advanced users will both find that this course sets them apart from other people doing analytics with R. We don't recommend this course for beginners.
We start by explaining how to work with closures, environments, dates, and more advanced topics. We then move into regular expressions and parsing HTML data. We explain how to write R packages, including the documentation that the CRAN team expects if you want to upload your code to R's package repository. After that we introduce the skills needed for profiling your R code.
We then move into C++ and Rcpp, and we show how to write very fast parallel C++ code that uses OpenMP. Understanding and mastering Rcpp will take your R skills to another level. While your colleagues are writing plain R functions, you will be able to write equivalent Rcpp+OpenMP code that can run 4 to 8 times faster.
We then move into Python and Java, and show how these can be called from R and vice versa. This is really helpful for writing code that leverages the excellent object-oriented features of these two languages. You will be able to build your own classes in Java or Python that store the data you get from R. Since the Python community is growing so fast, and producing such wonderful packages, it's great to know that you will be able to call any function from any Python package directly from R.
Finally, we explain how to use sqldf, a wonderful package for doing serious, production-grade data processing in R. It lets us write SQL queries directly in R. It does have limitations, such as its inability to perform full joins, and we will show specific tricks for working around them.
All the files used in this course (R, Java, C++, and .csv) are available for download, and all the lectures can be downloaded as well. Our teaching strategy is to present examples of minimal complexity, so we hope you can easily follow each lecture. If you have questions or comments, feel free to send us a message.
Once we have developed our R code, we would ideally like to encapsulate it in a package. That way, we can distribute it more efficiently (as a single compressed file), upload it to the CRAN servers, and properly document our functions.
Finding our code in the future will certainly be much easier, and understanding what each function does will be trivial, since we will be able to access the documentation via the help() function.
In this lecture we discuss how to write R packages, how to install pdflatex, what the common problems are, and how to use package.skeleton(). We use all these steps to create a real package that we then install from a file.
R functions can both accept functions as arguments and return functions as outputs. We discuss how to code functions that build functions, and how to pass functions to functions.
It should come as no surprise that the well-known apply() function uses these concepts. In fact, we will reuse the idea of passing functions as arguments in a much more advanced section, Rcpp and high performance R-C++ computing, where we pass R functions to C++ code.
Finally, we explain how R keeps references from functions to the variables that exist in the same workspace where those functions live.
Imagine you would like to encapsulate several variables and functions into a common workspace. The main advantage is that you end up with more organized code. You will also be able to invoke that workspace and use the variable values stored there.
Environments in R are designed for that, and are quite useful for protecting a subspace of your workspace from the rest of the code. They end up being used extensively in packages, but their utility need not be restricted to that.
We use a very tricky example involving dates in R, explaining how to properly read dates that come in nonstandard formats. We review several strategies: the first uses as.Date(), and the second uses the ISODate() function, which creates a POSIXct object. We discuss the difference between Date and POSIXct objects, and we show how to use the days() and months() functions on the variables that we create.
When we need to do very complex string pattern matching or replacing, we can't just rely on the traditional string functions. Regular expressions, used through the grep command, allow us to search for patterns within strings in a very convenient way. We can also use these regular expressions in a replacement function to do very complex string replacement operations.
Regular expressions are well known to most programmers, but learning the basics can be quite daunting. Here we show the essential operations that can be used as a basis for doing much more complex work.
We review some extra concepts on regular expressions, especially how to do quite complex replacements.
We combine the results from the previous regular expressions section with an appropriate R package that pulls and parses a web page. We use this approach to retrieve the Wikipedia article on Albert Einstein, and then run some statistical analysis on his impact on science. We end up identifying the key moments in his life, and how his tremendous impact on science has evolved through time. All of this requires no interaction from the user.
After developing our R production functions, we should always try to optimize them. The best way of doing that is to profile the functions, see which other functions they call, and measure how much time each sub-call contributes to the overall running time.
Rcpp allows us to inject C++ code into R seamlessly. What's more, we will be capable of using R code directly in our C++ programs.
A central topic that confuses many practitioners is that we will manipulate both native C++ variables (such as vectors, ints, doubles, etc.) and R-specific variables such as NumericVector. We invest quite a lot of time explaining these two groups.
We show how to write C++ code in our R IDE, and how to load it from an external file via sourceCpp().
With the Rcpp machinery we will know how to write super efficient code, appropriate for high performance computing.
We discuss more advanced topics in Rcpp, such as passing data frames from R to C++, calling R functions in C++, etc.
The Rcpp sugar syntax allows us to write highly optimized vectorized code in C++, which will make us doubt whether we are writing C++ or R code! Remember that vectorized operations of this kind are not part of C++ per se.
We discuss several sugar strategies and operators. By the end, we will be able to write our C++ code much more concisely.
This is by far the most challenging and technical part of this course. We will show how to run parallel C++ code via OpenMP through Rcpp. The result? Code that runs up to 6 times faster, depending on how many cores your PC has.
We introduce one example using parallel for loops. In particular, we show a quite complicated example where we want to do a reduction inside a parallel for. The problem in this case is that we will run into lots of synchronization problems, unless we either:
a) declare the main operation as atomic and use private variables
b) declare everything as a reduction.
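As a rough illustration of the two options above, here is a minimal sketch in plain C++ with OpenMP pragmas. It sums a vector rather than reproducing the lecture's example, the function names sum_reduction and sum_atomic are hypothetical, and the Rcpp wrapper is left out so that the snippet stands alone. It assumes the compiler is run with OpenMP enabled (for example, -fopenmp with GCC); without that flag the pragmas are ignored and the loops simply run serially.

```cpp
#include <vector>

// Option (b): declare the accumulator as a reduction variable.
// Each thread adds into its own private copy of 'total', and OpenMP
// combines the per-thread copies when the loop finishes.
double sum_reduction(const std::vector<double>& x) {
    double total = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += x[i];
    }
    return total;
}

// Option (a): keep one shared accumulator and make each update atomic.
// 'term' is private to each thread because it is declared inside the loop body.
double sum_atomic(const std::vector<double>& x) {
    double total = 0.0;
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        const double term = x[i];
        #pragma omp atomic
        total += term;
    }
    return total;
}
```

Both functions return the same sum. The reduction version is usually the faster of the two, because atomic updates make the threads take turns on the shared variable.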
Mastering OpenMP takes a lot of time, but this chapter will hopefully be enough to teach the basic concepts of designing a parallel C++ program.
Calling Python from R is not so easy on Windows (there are good packages, but they only work on Linux). In this lecture, we develop R code that writes instructions to a file and then calls Python in batch mode. That way we can share files and programs quite easily while working in R.
R can easily be called from Python using an appropriate package. This means that we can leverage some excellent R functions directly in Python.
Python is a more general-purpose programming language, and more powerful in that sense, but it lacks the statistical power that R has.
In this lecture we create Python code that reads and sends data into R, and then receives data from R. We show how the statistical capabilities of R can be taken to the next level when we create the corresponding Python classes. In particular, we develop a histogram() Python class that is fed from the histogram created in R.
We show how to call Java code from R via the rJava package. We create specific classes in Java via Eclipse, and then add them to the Java classpath. We then instantiate multiple Java classes and store them in a list. We also discuss some fundamentals of Java classes. This approach allows us to encapsulate R data in proper Java classes, which can help organize our data and functions in a better way.
We show how to call R directly from Java via the Rserve package. We run this example using Eclipse and Windows 10.
sqldf allows us to use SQL syntax directly on R data frames. Thus, we can easily build very complex queries that would otherwise require a lot of cumbersome R notation. We discuss the basic sqldf operations (filtering, ordering, transposing) and how to execute queries on data frames.
We use sqldf for more realistic applications, such as merging data from different data frames.
We review full, inner, left, and right joins. We end up using a customized approach to simulate a full join in sqldf, which can also be extended to find observations that appear in one table but not in the other.
Let's see how much you know about sqldf.
I have 7+ years of experience as a statistical programmer in industry. I am an expert in programming, statistics, data science, and statistical algorithms, with wide experience in many programming languages. I am a regular contributor to the R community, with 3 published packages, and I am also an expert SAS programmer. I contribute to scientific statistical journals; my latest publication is in the Journal of Statistical Software.