R Datasets and Dataframes

R Level 1 - Data Analytics with R
Use R for Data Analytics and Data Mining
07:22:32 of on-demand video • Updated April 2019
One of the great things about R is the availability of exercise datasets. Those are various datasets that come with R base or with an add-on package. This allows you to easily try out new features of R. It also makes sure that you can communicate a problem to the global R user community using the same dataset. On your path to R mastery, you will encounter many code examples using those exact datasets, like mtcars and iris. I would highly recommend getting familiar with at least those two datasets.
Now in this video I will show you where those datasets are, how you can access them, how you can make yourself familiar with them, and how to manipulate them. And since we are mainly working with data frames here, I will also show you the basics of working with data frames.
To get a list, you can check out the package datasets. This is where all the R base datasets are located. All the famous ones, and most of the ones I use in my tutorials, are housed in this package. If we open it, we get an alphabetical list of the datasets available. For example, if we click on airmiles, we learn what this data is about, which format it has, the source and even the references. If you use one of these datasets for the first time, it is wise to check out the help section. The variables, including the dimensions, are well explained here. By the way, this is the same as typing ?airmiles: you again get to the help section of this dataset. But of course, if you import your own dataset, none of this is available, so you should know what your dataset is about.
To get a quick overview of a dataset, and this of course also applies to your own imported datasets, you can call head(airmiles) or tail(airmiles). This gives you the first and the last six observations.
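The steps described so far can be sketched in a few lines of R. This is a minimal sketch, assuming a standard R installation where the datasets package is available; the variable name available is only for illustration, and the help calls are commented out because they open a viewer.

```r
# List the datasets shipped in the 'datasets' package.
# data() returns an object whose 'results' matrix has an "Item" column
# holding the dataset names.
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Open the help page for a dataset; both forms are equivalent.
# ?airmiles
# help("airmiles")

# Peek at the first and last six observations instead of printing everything.
head(airmiles)
tail(airmiles)

# The same functions work on data frames, where they return rows.
head(mtcars)
tail(mtcars, n = 3)  # n controls how many rows are returned
```

The n argument is handy when six rows are more, or fewer, than you need.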
This is important to at least know how many variables there are and what the dimensions are, and you also get an idea about the class of each variable. Again, this is a quick way to look at the data without printing out the whole dataset, which can slow down your work drastically if you have several thousand observations to work with.
Another way to orient yourself with new data is the summary function. If we run it on the very famous mtcars dataset, we get basic statistics like quartiles, minimum, maximum, median and mean for each variable. In this case, we can see that there are 11 variables.
You could also plot a dataset to get a first impression. In this case, since there are 11 variables, we get a scatter plot matrix, which is not that helpful. There are too many plots to clearly see any patterns. If you have only one variable in your dataset, as in simple time series data, you could also use a histogram. So let's try the hist command of R base to get an idea about the distribution of the airmiles time series data. Here we quickly learn that most observations are in the first bin, between 0 and 5000 miles. I would say that visual impressions are a valuable source of insight into your data. R base has the functions plot and hist, which can be tweaked to get a quick impression of nearly any data out there.
While we are talking about those pre-installed R datasets, let's also take a look at working with them, especially with data frames like mtcars. Let's recall the mtcars dataset by calling head(mtcars), so that we see the variables that are in the data frame. We now want to learn how to manipulate such a data frame, and I will also show you general rules for working with data frames.
If you want to extract a single column, you need to use the dollar sign. That way the computer knows that you are talking about column X of data frame Y.
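As a rough sketch of this overview step, the calls below use only base R functions on the built-in mtcars and airmiles datasets. The dim() call and the three-column subset are additions for illustration and are not shown in the video.

```r
# Basic statistics (minimum, quartiles, median, mean, maximum) per variable.
summary(mtcars)

# Number of rows and columns (an addition, not from the video).
dim(mtcars)

# Plotting a whole data frame draws a scatter plot matrix of all variables.
plot(mtcars)

# Assumption: restricting the plot to a few columns keeps it readable.
plot(mtcars[, c("mpg", "hp", "wt")])

# A single series can be inspected with a histogram.
# hist() also returns the bin breaks and counts it used.
h <- hist(airmiles)
h$breaks
h$counts
```

Keeping the return value of hist() lets you check the bin counts numerically instead of reading them off the plot.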
Let's say we want to get the sum of the weight column in the mtcars dataset. You might already know the function sum, but now we need to specify the name of the data frame, which is mtcars, then a dollar sign, and then the name of the variable, which is wt in this case: sum(mtcars$wt). If we run the line, we learn that this column has a total of 102.952, in units of 1000 lbs, which is the unit stated for this variable in the documentation.
Now this method with the dollar sign is fine if you work with several different data frames at the same time and you want to avoid confusion between the variables. But if you plan on working intensely with a single data frame, you can save yourself some time by attaching the dataset to your environment. That way R knows which data frame a given variable belongs to. If we attach the dataset mtcars, we can get the same sum without using the dollar sign. We can just state the variable name, and R automatically connects this call with the mtcars dataset. Again, this is useful if you have one main data frame. It could, however, lead to confusion if you have several data frames with similar or identical variable names.
By the way, if you want your dataset to be attached only temporarily, you can undo it with the function detach, like I do now. If I now run the same call, sum(wt), R no longer knows which dataset I mean, and I get an error message.
So now that we know how to work with specific variables, I also want to show you how you can extract data from a data frame. As with all manipulations based on index positions, we need the square brackets for that. Let's say we want to extract the value of the variable wt for the second observation from the head of the data. We know that wt is variable number six. Therefore, we write the name of the dataset, mtcars, then the square brackets. First we state the position of the row and then the position of the variable.
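The dollar sign, attach/detach and index extraction described here might look like the sketch below. It assumes a fresh R session with no other object named wt in the workspace; try() and the names total_wt and res are additions for illustration, used so the script keeps running after the expected error.

```r
# Sum of the weight column, addressed as data_frame$column.
total_wt <- sum(mtcars$wt)
total_wt  # in units of 1000 lbs, according to ?mtcars

# attach() puts the columns on the search path, so wt is found directly.
attach(mtcars)
sum(wt)

# detach() undoes the attachment.
detach(mtcars)

# Now wt is no longer visible, so sum(wt) raises an error.
# try() captures it; the result has class "try-error".
res <- try(sum(wt), silent = TRUE)
inherits(res, "try-error")

# Index-based extraction uses [row, column]; wt is column 6.
mtcars[2, 6]

# Assumption: addressing the column by name is an equivalent alternative.
mtcars[2, "wt"]
```

Addressing the column by name avoids having to count column positions, which helps when the column order of a data frame changes.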
That way we can extract very specific info from our dataset. If you want to widen the selection, you can simply use the concatenate function c() to insert a vector, for example mtcars[c(2, 5, 8), 6]. This line gives us the weight values for rows two, five and eight.
All right, guys, let's recap what we learned in this video. I showed you that there is a package called datasets in R base, which houses an array of pre-installed datasets. You can easily use these datasets to learn R and to try out new tools. I showed you how to get help for these pre-installed datasets. For a first impression of newly imported data, we discussed the functions head, tail and summary, and we also took a look at visual impressions with plot and hist. Very importantly, I showed you how to work with data frames. You can use the dollar sign to get access to a specific column, or you can attach and detach the data frame to make coding easier. And lastly, I showed you how to extract not only at the variable level but also at the value level by using the square brackets.
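To round off the bracket-based extraction, here is a small sketch of selecting several rows at once with c(). The first call follows the video; the name-based and multi-column variants, and the name weights, are additions for illustration.

```r
# c() builds a vector of row positions; column 6 is wt.
weights <- mtcars[c(2, 5, 8), 6]
weights

# Assumption: selecting by column name with drop = FALSE keeps the result
# a data frame, so the row names (car models) stay visible.
mtcars[c(2, 5, 8), "wt", drop = FALSE]

# Several rows and several columns at once.
mtcars[c(2, 5, 8), c("mpg", "wt")]
```

Without drop = FALSE, a single-column selection is simplified to a plain numeric vector, which is what the first call returns.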