dplyr tutorial

Dplyr is one of the main packages in the tidyverse universe, and one of the most used packages in R. Without a doubt, dplyr is a very powerful package, since allows you to manipulate data very easily, and it enables you to work with other languages and frameworks, such as SQL, Spark o R’s data.table.

Besides, as it is part of the tidyverse universe, it is very easy to use dplyr with other packages within tidyverse, such as ggplot, which allows, for example, to make very cool graphics in a simple way and without having to create any intermediate objects.

As you can see, dplyr is very powerful. However, do you know exactly how it works, all the features it offers, and why it is so powerful? Well, in this tutorial I will explain everything you need to know about dplyr. And you’ll even have exercises to practice. There is much to learn, so set’s get to it!

Installation and pipe operator

The first thing we will have to do is install dplyr. To do this, you simply have to run the following code:

install.packages("dplyr")
library(dplyr)

dplyr includes the %>% operator, called pipe, which is very useful and applicable to the entire tidyverse ecosystem. This operator allows you to concatenate functions in such a way that the data is passed from one function to another without having to assign it to any variable.

For example, suppose we want to rank cars that have more than 100 horsepower by their gas mileage (mpg variable) in descending order. Let’s see how we would do it without using the pipe operator.

# Without using the pipe operator
mtcars_filtered = filter(mtcars,  hp>100)
mtcars_ordered = arrange(mtcars_filtered, desc(mpg))

head(mtcars_ordered)

Now, let’s see how we would do it with the pipe operator:

mtcars %>%
  filter(hp>100) %>%
  arrange(desc(mpg)) %>%
  head()

As we can see, the result is exactly the same, but by using the pipe we have avoided having to create intermediate objects, so if we wanted to change our transformation it would be very simple. Also, the code is much cleaner if we use the pipe operator.

Now that you know what the pipe operator is, let’s continue with the dplyr tutorial, gradually seeing what are the main functions of dplyr focused on different transformations.

Main functions of dplyr

Basically, dplyr has 5 different groups of functions: summary, grouping, selection/filter, manipulation, and combination functions.

Although the value of dplyr lies in being able to combine all these functions to be able to manipulate data in a simple and clean way, in order to do that it is essential to first know the main functions that the package offers.

So, let’s go step by step knowing each of the functions of each group. Let’s get to it!

dplyr grouping functions

The grouping functions allow grouping the data in such a way that we can apply certain functions for each of the groups. There are basically two types of functions: group_by and ungroup .

group_by function

The group_by function allows grouping the data into different groups based on one or more variables of our dataframe. When a dataframe is grouped, it is still a single dataframe, but the operations we do will apply to each of the groups. This is why the group_by function is rarely used alone, but will most likely be used with other dplyr functions, such as summarize or mutate.

Likewise, it is important to emphasize that the most common is to group based on one or more variables in text or categorical format since it rarely makes sense that several numerical observations have the same (although it may be the case).

In our case, we are going to use the gapminder dataset, which contains data on life expectancy, population, and GDP per Capita of different countries for different years. This is the dataset:

library(gapminder)
glimpse(gapminder)
Rows: 1,704
Columns: 6
$ country   <fct> Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanistan, Afghanis~
$ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Europe, Europe, Europe, Europe, Europe, Europe, Europ~
$ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002, 2007, 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992,~
$ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.822, 41.674, 41.763, 42.129, 43.828, 55.230, 59.280, 64.820, 66.22~
$ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12881816, 13867957, 16317921, 22227415, 25268405, 31889923, 1282697~
$ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, 978.0114, 852.3959, 649.3414, 635.3414, 726.7341, 974.5803, 1601.~

Let’s imagine that we want to obtain the average life expectancy for each continent for each year so that we can visualize how life expectancy has evolved by continent.

To do this, we will group the data by continent and year and use the summarize function (which I will explain later) to obtain the average life expectancy.

gapminder %>%
  group_by(continent, year) %>%
  summarize(mean(lifeExp))

As we can see, with only 3 lines of code we have managed to make a very interesting summary of the data. Thus, almost all the functions that we are going to learn in this dplyr tutorial can be applied with grouped data.

Exercise of the group_by function

To check that you have understood, I propose the following exercise: calculate the average life expectancy worldwide, for each year.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3NcbnJlc3VsdFxuXG4jIFByaW50IHJlc3VsdFxucHJpbnQocmVzdWx0KSIsInNvbHV0aW9uIjoicmVzdWx0ID0gZ2FwbWluZGVyICU+JVxuICBncm91cF9ieSh5ZWFyKSAlPiVcbiAgc3VtbWFyaXplKG1lYW4obGlmZUV4cCkpXG5cbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwiZ3JvdXBfYnlcIiwgXG4gICAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIllvdSBzaG91bGQgdXNlIGdyb3VwX2J5IGZ1bmN0aW9uLiAtIERlYmVyXHUwMGVkYXMgdXNhciBsYSBmdW5jaVx1MDBmM24gZ3JvdXBfYnkuXCIpXG50ZXN0X2Z1bmN0aW9uKFwic3VtbWFyaXplXCIsIFxuICAgICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJZb3Ugc2hvdWxkIHVzZSBzdW1tYXJpemUgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBzdW1tYXJpemVcIilcblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

Ungroup function

Sometimes, we do a grouping to obtain grouped data (the total sum of the variable, for example), but we do not want to continue performing operations with the grouped data. In that case, we will have to ungroup the data, and this can be done with the ungroup function.

For example, suppose that for a specific year (say 1952) we want to show a graph in which we compare each of the countries with the average life expectancy of their continent. To do this, you would simply have to:

  • Filter the data by the year we want.
  • Group the data by continent.
  • Calculate the average life expectancy of the continent.
  • Ungroup the data.
gapminder %>%
  select(-pop, -gdpPercap) %>%
  filter(year == 1952) %>%
  group_by(continent) %>%
  mutate(mean_lifeExp = mean(lifeExp)) %>%
  ungroup() 

As we can see, all the countries of the same continent have the same mean_lifeExp , which makes sense, because the average life expectancy of a continent is the same for all the countries of the continent.

And with this, you already know the grouping functions. As you can see, grouping functions are usually used with other functions, so let’s continue with this dplyr tutorial and learn new dplyr functions!

dplyr filter and selection functions

select () function

The select function is very simple, it allows you to indicate the columns that you want to select or stop using. Usually, when you manipulate data you create quite some auxiliary variables to check something or to calculate new fields. These are variables that generally you won’t want in the final dataset. For this reason, this function is usually used to only keep the variables you want.

Another use case is when the dataset includes more variables than you are going to use for analysis. In those cases, keeping only the variables that interest you from the beginning will allow you to:

  1. Make the code faster.
  2. Make the results more easily interpretable, since there will be no noisy variables.

To select some variables, you just have to pass the name of the columns to the function:

gapminder %>%
  select(country, year, pop)

Likewise, the select function also allows you to drop variables too, that is, to deselect one or more columns. To do this, simply pass the symbol – in front of the variable. Example:

gapminder %>%
  select(-year, -country, -pop)
Exercise select function

To practice the dplyr select function, I propose the following exercise: from the gapminder dataframe, choose the variables continent, year and lifeExp. First, you must do it by selecting the variables that interest you.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3NcbnJlc3VsdCBcblxuIyBQcmludCByZXN1bHRcbnByaW50KHJlc3VsdCkiLCJzb2x1dGlvbiI6InJlc3VsdCA9IGdhcG1pbmRlciAlPiUgc2VsZWN0KGNvbnRpbmVudCwgeWVhciwgbGlmZUV4cClcbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwic2VsZWN0XCIsIFxuICAgICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJZb3Ugc2hvdWxkIHVzZSBzZWxlY3QgZnVuY3Rpb24uIDwvYnI+IERlYmVyXHUwMGVkYXMgdXNhciBsYSBmdW5jaVx1MDBmM24gc2VsZWN0XCIpXG5cbnRlc3Rfb2JqZWN0KFwicmVzdWx0XCIsXG4gICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJJbmNvcnJlY3QuIC0gSW5jb3JyZWN0by5cIlxuICAgICAgICAgICAgKVxuXG50ZXN0X2Vycm9yKClcbnN1Y2Nlc3NfbXNnKFwiV2VsbCBkb25lISAtIEJpZW4gaGVjaG8hXCIpIn0=

Now, get the same result, but in a different way, using dplyr’s select function:

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3NcblxucmVzdWx0IFxuXG4jIFByaW50IHJlc3VsdFxucHJpbnQocmVzdWx0KSIsInNvbHV0aW9uIjoicmVzdWx0ID0gZ2FwbWluZGVyICU+JSBzZWxlY3QoLWNvdW50cnksIC1wb3AsIC1nZHBQZXJjYXApXG5cbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFxuICBcInNlbGVjdFwiLCBcbiAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2Ugc2VsZWN0IGZ1bmN0aW9uLiAtIERlYmVyXHUwMGVkYXMgdXNhciBsYSBmdW5jaVx1MDBmM24gc2VsZWN0XCIpXG5cbnRlc3Rfb3V0cHV0X2NvbnRhaW5zKFwiLWNvdW50cnlcIixcbiAgICAgICAgICAgICAgICAgICAgIGluY29ycmVjdF9tc2c9XCJZb3Ugc2hvdWxkIHVzZSA8Y29kZT4tPC9jb2RlPiB0byBleGNsdWRlIHZhcmlhYmxlcy4gLSBEZWJlclx1MDBlZGFzIHVzYXIgIDxjb2RlPi08L2NvZGU+IHBhcmEgZXhjbHVpciB2YXJpYWJsZXMuXCIpXG50ZXN0X29iamVjdChcInJlc3VsdFwiLFxuICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiSW5jb3JyZWN0LiAtIEluY29ycmVjdG8uXCJcbiAgICAgICAgICAgIClcblxudGVzdF9lcnJvcigpXG5zdWNjZXNzX21zZyhcIldlbGwgZG9uZSEgLSBCaWVuIGhlY2hvIVwiKSJ9

Filter function

The filter function allows you to filter the data for one or more conditions that we pass to it. For example, as we have done in the previous example, we can filter the data to keep only cars with more than 100 horsepower (hp).

mtcars %>%
  filter(hp > 100)

When it comes to numeric variables, there are six filters that we can apply: greater than (>), less than (<), equal to (==), different than (! =), greater than or equal to (> =) and less or equal to (<=).

t1 = mtcars %>%
  filter(hp > 100) 

t2 = mtcars %>%
  filter(hp < 100)

t3 = mtcars %>%
  filter(hp == 100)

t4 = mtcars %>%
  filter(hp != 100)

t5 = mtcars %>%
  filter(hp >= 100)

t6 = mtcars %>%
  filter(hp <= 100) 

In addition, for numeric variables, the filter function also allows you to filter the variables that are text or factor. In this case, there are three options that we can apply: equal to (==), different from (! =), it is included in (% in% c ()), it is not included in (! Value% in% c () ).

Let’s see an example with the Species variable of the iris dataset, which is a factor that has 3 values:

# La especie es setosa 
iris %>%
  filter(Species == "setosa") 

# La especie NO es setosa  
iris %>%
  filter(Species != "setosa") 

# La especie es setosa o virginica
iris %>%
  filter(Species %in% c("setosa", "virginica")) 

# La especie NO es setosa ni virginica
iris %>%
  filter(!Species %in% c("setosa", "virginica")) 

In addition, we can combine several filters, in such a way that we can filter, at the same time, by several variables, each one with its condition. In these cases there are two options: that both options must be met, in which case the filters are joined by the symbol & or that either of the two options must be satisfied, in which case the | symbol is used.

# More than 100 horsepower AND the number of cylinders is equal to 6 
mtcars %>%
  filter(hp > 100 & cyl == 6)


# More than 100 horsepower AND the number of cylinders is equal to 6
mtcars %>%
  filter(hp > 100 | cyl == 6)

As you can see, the filter function is very intuitive and very powerful, since it allows you to filter information in a much more intuitive and clear way than R’s default filters.

Filter function exercises

To practice the filter function, I am going to ask you to keep the observations from the gapminder dataset that: they are from Africa, that the year is 2007 and that the life expectancy is greater than 60 years.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiLFwidGlkeXJcIilcbnNhcHBseShsaWJzWyFsaWJzICVpbiUgaW5zdGFsbGVkLnBhY2thZ2VzKCldLCBpbnN0YWxsLnBhY2thZ2VzKVxuc2FwcGx5KGxpYnMsIHJlcXVpcmUsIGNoYXJhY3Rlci5vbmx5ID0gVClcbnJtKGxpYnMpIiwic2FtcGxlIjoiIyBkcGx5ciBhbmQgZ2FwbWluZGVyIGFyZSBsb2FkZWQgYWxyZWFkeVxuIyBkcGx5ciB5IGdhcG1pbmRlciB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG5yZXN1bHRcblxuIyBQcmludCByZXN1bHRcbnByaW50KHJlc3VsdCkiLCJzb2x1dGlvbiI6InJlc3VsdCA9IGdhcG1pbmRlciAlPiUgXG4gIGZpbHRlcihjb250aW5lbnQgPT0gXCJBZnJpY2FcIiAmXG4gICAgICAgICB5ZWFyID09IDIwMDcgJlxuICAgICAgICAgICBsaWZlRXhwID4gNjApXG5cbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwiZmlsdGVyXCIsIFxuICAgICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJZb3Ugc2hvdWxkIHVzZSBmaWx0ZXIgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBmaWx0ZXJcIilcblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

distinct() function

The distinct function is very simple since it allows us to eliminate duplicated data from our dataset. This is something that does not make sense in a clean dataset, but it becomes handy when we have to clean and transform the data.

To do this, we are going to create a new dataset, named iris_duplicate, which will basically contains the duplicate iris dataset. So, we will see how this function works:

dim(iris)
[1] 150   5
iris %>%
  distinct() %>%
  dim()
[1] 149   5
iris_duplicate = bind_rows(iris, iris)

dim(iris_duplicate)
[1] 300   5
iris_duplicate %>%
  distinct() %>%
  dim()
[1] 149   5

As we can see, the iris dataset has 150 observations, but one of them must be duplicated (or the values coincide), since if we apply the distinct function we will be left with 149 observations.

Also, even though the iris_duplicate dataset has 300 observations, after applying the distinct function we are left with the 149 values that are unique.

slice () function

The slice function allows you to keep the data of the positions that we say. That is if we run slice (1) we will keep the first value, and if we run slice(1:20), we will keep the first 20 values:

iris %>%
  slice(1) %>%
  dim()
[1] 1 5
iris %>%
  slice(1:20) %>%
  dim()
[1] 20  5

Although it seems like a very silly function, the slice function is very useful especially when applied with the group_by function, since it will allow us to obtain the number of values ​​that we want for each of the groups. In fact, this is what I used to create a bar chart race in the post about how to create animations with R.

Likewise, there are other slice functions that allow you to perform the same task, but with some modification. For example, the slice_min and slice_max functions allow you to get the rows with the minimum or maximum value of a variable.

This, combined with the group_by function, is very powerful, since in two lines we are able to obtain the observations with maximum values ​​for each group. With Gapminder, for example, we can obtain the country with the highest life expectancy (LifeExp) by continent and year.

library(gapminder)

gapminder %>%
  group_by(year, continent) %>%
  slice_max(lifeExp)

On the other hand, the slice_sample function allows you to obtain a random amount of data, which can be very useful, for example, to split the data between train, test, and validation, in order to train a machine-learning model.

# We obtain 20 random observations from Gapminder
gapminder %>%
  slice_sample(n = 5)
# We obtain a random sample that represents 1% of all observations
gapminder %>%
  slice_sample(prop = 0.01)

Finally, the slice_head and slice_tail functions allow you to obtain the amount of data you want starting at the top (head) or bottom (tail) and respecting the current order of the data.

Now that you know how it works, let’s see how to use them in practice.

Exercises slice function

Using the Gapminder dataset and some function of the slice family, find the countries with the highest life expectancy each year.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiLFwidGlkeXJcIilcbnNhcHBseShsaWJzWyFsaWJzICVpbiUgaW5zdGFsbGVkLnBhY2thZ2VzKCldLCBpbnN0YWxsLnBhY2thZ2VzKVxuc2FwcGx5KGxpYnMsIHJlcXVpcmUsIGNoYXJhY3Rlci5vbmx5ID0gVClcbnJtKGxpYnMpIiwic2FtcGxlIjoiIyBkcGx5ciBhbmQgZ2FwbWluZGVyIGFyZSBsb2FkZWQgYWxyZWFkeVxuIyBkcGx5ciB5IGdhcG1pbmRlciB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG5yZXN1bHRcblxuIyBQcmludCByZXN1bHRcbnByaW50KHJlc3VsdCkiLCJzb2x1dGlvbiI6ImdhcG1pbmRlciAlPiUgXG4gIGZpbHRlcihjb250aW5lbnQgPT0gXCJBZnJpY2FcIiAmXG4gICAgICAgICB5ZWFyID09IDIwMDcgJlxuICAgICAgICAgICBsaWZlRXhwID4gNjApIiwic2N0IjoidGVzdF9mdW5jdGlvbihcInNsaWNlXCIsIFxuICAgICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJZb3Ugc2hvdWxkIHVzZSBmaWx0ZXIgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBmaWx0ZXJcIilcblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

Now, find the 5 countries with the highest life expectancy of each year.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiLFwidGlkeXJcIilcbnNhcHBseShsaWJzWyFsaWJzICVpbiUgaW5zdGFsbGVkLnBhY2thZ2VzKCldLCBpbnN0YWxsLnBhY2thZ2VzKVxuc2FwcGx5KGxpYnMsIHJlcXVpcmUsIGNoYXJhY3Rlci5vbmx5ID0gVClcbnJtKGxpYnMpIiwic2FtcGxlIjoiIyBkcGx5ciBhbmQgZ2FwbWluZGVyIGFyZSBsb2FkZWQgYWxyZWFkeVxuIyBkcGx5ciB5IGdhcG1pbmRlciB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG5yZXN1bHRcblxuIyBQcmludCByZXN1bHRcbnByaW50KHJlc3VsdCkiLCJzb2x1dGlvbiI6InJlc3VsdCA9IGdhcG1pbmRlciAlPiUgXG4gIGdyb3VwX2J5KHllYXIpICU+JVxuICBzbGljZV9tYXgobGlmZUV4cCwgbiA9IDUpXG5cbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwic2xpY2VcIiwgXG4gICAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIllvdSBzaG91bGQgdXNlIGZpbHRlciBmdW5jdGlvbi4gLSBEZWJlclx1MDBlZGFzIHVzYXIgbGEgZnVuY2lcdTAwZjNuIGZpbHRlclwiKVxuXG50ZXN0X29iamVjdChcInJlc3VsdFwiLFxuICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiSW5jb3JyZWN0LiAtIEluY29ycmVjdG8uXCJcbiAgICAgICAgICAgIClcblxudGVzdF9lcnJvcigpXG5zdWNjZXNzX21zZyhcIldlbGwgZG9uZSEgLSBCaWVuIGhlY2hvIVwiKSJ9

Arrange function

The arrange function allows you to sort the data based on one or more variables, both ascending and descending. By default, the data will be sorted in ascending order, to indicate that it is sorted in descending order, we will have to wrap the variable with the desc function.

For example, suppose we want to sort the cars according to their consumption (mpg):

mtcars %>%
  arrange(desc(mpg))

As we can see, the cars are ordered, but there are cases in which the mpg value coincides, as in the case of the Honda and Lotus (30.4) or the Datsu and Mercedes (22.8).

In those cases, the function will sort according to the position they were in the dataframe. However, we can indicate that it orders them by another variable, such as the horses (hp) in descending order as well:

mtcars %>%
  arrange(desc(mpg), desc(hp))

As we can see, in this case both the Lotus and the Mercedes appear ahead of the Honda and the Datsu, since, with equal consumption, they have more horses.

Arrange function exercise

Let’s put everything we’ve seen about dplyr’s arrange function into practice. To do this, I ask you to order the gapminder dataset by year (newest first), continent (from Z to A) and by life expectancy (from lowest to highest).

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3NcbnJlc3VsdFxuXG4jIFByaW50IHJlc3VsdFxucHJpbnQocmVzdWx0KSIsInNvbHV0aW9uIjoicmVzdWx0ID0gZ2FwbWluZGVyICU+JSBcbiAgYXJyYW5nZShkZXNjKHllYXIpLCBkZXNjKGNvbnRpbmVudCksIGxpZmVFeHApXG5cbnByaW50KHJlc3VsdCkiLCJzY3QiOiJ0ZXN0X2Z1bmN0aW9uKFwiYXJyYW5nZVwiLCBcbiAgICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2UgYXJyYW5nZSBmdW5jdGlvbi4gLSBEZWJlclx1MDBlZGFzIHVzYXIgbGEgZnVuY2lcdTAwZjNuIGFycmFuZ2VcIilcblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

And with this very simple (and practical) function you have already seen the most important filtering functions within dplyr. As you can see, a simple but very interesting and practical function.

However, dplyr hides much more, so let’s continue with the tutorial, in this case with summary functions.

Summary functions

The summary functions are usually used, almost always, after a grouping function ( group_by ), since this allows to obtain metrics for each one of the groups in a very simple way.

summarize function

The summarize (or summarize ) function allows obtaining data from each of the groups, such as the maximum, minimum, mean, standard deviation, number of observations, etc.

To do this, you simply have to:

  1. Indicate the name of the new column that will be created.
  2. Pass the function and the variable that will be used to obtain said column.

For example, we could get the mean, maximum, and mean life expectancy of the entire dataset. We can obtain this in the following way:

gapminder %>%
  summarise(
    lifeExp_min = min(lifeExp),
    lifeExp_max = max(lifeExp),
    lifeExp_mean = mean(lifeExp)
  )

In this case, since the data is not grouped, the calculations are performed on the entire dataset. However, as I mentioned previously, the most common is to use the summarize function together with the group_by function, in such a way that the calculations are performed for each of the groups.

For example, let’s say we want to get the mean, minimum, and maximum of life expectancy, but instead of for the entire dataset, we want to see it for each year. This could be obtained as follows:

gapminder %>%
  group_by(year) %>%
  summarise(
    lifeExp_min = min(lifeExp),
    lifeExp_max = max(lifeExp),
    lifeExp_mean = mean(lifeExp)
  )
Exercise summarize function

Let’s put the summarize function into practice! Using this function and some other function we have seen so far, calculate the maximum, minimum and mean of the population and life expectancy for each continent and year. The variables should be called lifeExp_min, lifeExp_max, lifeExp_mean and pop_min, pop_max and pop_mean, in that order.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3MiLCJzb2x1dGlvbiI6InJlc3VsdCA9IGdhcG1pbmRlciAlPiUgXG4gIGdyb3VwX2J5KHllYXIsIGNvbnRpbmVudCkgJT4lXG4gIHN1bW1hcmlzZShcbiAgICAgbGlmZUV4cF9taW4gPSBtaW4obGlmZUV4cCksIFxuICAgICBsaWZlRXhwX21heCA9IG1heChsaWZlRXhwKSwgXG4gICAgIGxpZmVFeHBfbWVhbiA9IG1lYW4obGlmZUV4cCksXG4gICAgIHBvcF9taW4gPSBtaW4ocG9wKSwgXG4gICAgIHBvcF9tYXggPSBtYXgocG9wKSxcbiAgICAgcG9wX21lYW4gPSBtZWFuKHBvcClcbiAgKVxuXG5wcmludChyZXN1bHQpIiwic2N0IjoidGVzdF9mdW5jdGlvbihcInN1bW1hcmlzZVwiLCBcbiAgICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2Ugc3VtbWFyaXNlIGZ1bmN0aW9uLiAtIERlYmVyXHUwMGVkYXMgdXNhciBsYSBmdW5jaVx1MDBmM24gc3VtbWFyaXNlXCIpXG5cbnRlc3RfZnVuY3Rpb24oXCJncm91cF9ieVwiLCBcbiAgICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2UgZ3JvdXBfYnkgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBncm91cF9ieVwiKVxuXG5cbnRlc3Rfb2JqZWN0KFwicmVzdWx0XCIsXG4gICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJJbmNvcnJlY3QuIC0gSW5jb3JyZWN0by5cIlxuICAgICAgICAgICAgKVxuXG50ZXN0X2Vycm9yKClcbnN1Y2Nlc3NfbXNnKFwiV2VsbCBkb25lISAtIEJpZW4gaGVjaG8hXCIpIn0=

As you can see, the summarize function is very very powerful. Above all, if it is combined with the next function that we are going to see, let’s continue with our complete dplyr tutorial!

Across function

It could be the case that we want to calculate a series of functions (minimum, maximum, mean and number of cases, for example) to several columns. With what we have seen so far, one option would be to indicate one by one each of the operations (maximum, minimum, etc.) for each column. However, this does not seem very efficient …

In those cases, a better option is to use the across function. The across function allows us to apply the functions that we indicate to the columns that we indicate. Although the end result is the same, using the across function is much more optimal, since (1) it will take less time, (2) you are less likely to make mistakes and also (3) the code will remain a lot cleaner.

Let’s imagine that we want to calculate the maximum, minimum and mean for life expectancy and population. Without using the across function we would do it as follows:

gapminder %>%
  group_by(year) %>%
  summarise(

    # Life Expectancy
    lifeExp_min = min(lifeExp),
    lifeExp_max = max(lifeExp),
    lifeExp_mean = mean(lifeExp),

    # Population
    pop_min = min(pop),
    pop_max = max(pop),
    pop_mean = mean(pop),

    # GDP
    gpd_min = min(gdpPercap),
    gpd_max = max(gdpPercap),
    gpd_mean = mean(gdpPercap)

  )

As you can see, the above code works, but it is quite cumbersome. Now, let’s try doing it with the across function:

gapminder %>%
  group_by(year) %>%
  summarise(
    across(
      c(lifeExp,pop, gdpPercap), 
      list(min, max, mean)
         )
  )

As we can see, we obtain the same result in much fewer lines of code, it is much easier to read and modify.

In fact, with the across function we can pass conditionals, such as, for example, that the column is numeric, or factor. Let’s see an example, obtaining the mode of the categorical variables and the mean of the numerical variables:

Note: to use fashion, I have used the function indicated in this post.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

gapminder %>%
  group_by(year) %>%
  summarise(
    across(is.numeric, mean),
    across(is.factor, Mode)
  )
Exercise across function

As you can see, the across function is very useful, and it would have saved us a lot of time in the previous exercise. That is why, I ask you again to calculate the minimum, maximum, and average life expectancy and the population of each continent in each year, but this time using the across function.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicykiLCJzYW1wbGUiOiIjIGRwbHlyIGFuZCBnYXBtaW5kZXIgYXJlIGxvYWRlZCBhbHJlYWR5XG4jIGRwbHlyIHkgZ2FwbWluZGVyIHlhIGVzdFx1MDBlMW4gY2FyZ2Fkb3NcbnJlc3VsdCAgXG5cbiMgUHJpbnQgcmVzdWx0XG5wcmludChyZXN1bHQpIiwic29sdXRpb24iOiJyZXN1bHQgPSBnYXBtaW5kZXIgJT4lIFxuICBncm91cF9ieSh5ZWFyLCBjb250aW5lbnQpICU+JVxuICBzdW1tYXJpc2UoXG4gICAgYWNyb3NzKGMobGlmZUV4cCxwb3ApLCBsaXN0KG1pbiwgbWF4LCBtZWFuKSlcbiAgKVxuXG5wcmludChyZXN1bHQpIiwic2N0IjoidGVzdF9mdW5jdGlvbihcInN1bW1hcmlzZVwiLCBcbiAgICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2Ugc3VtbWFyaXNlIGZ1bmN0aW9uLiAtIERlYmVyXHUwMGVkYXMgdXNhciBsYSBmdW5jaVx1MDBmM24gc3VtbWFyaXNlXCIpXG5cbnRlc3RfZnVuY3Rpb24oXCJncm91cF9ieVwiLCBcbiAgICAgICAgICAgICAgaW5jb3JyZWN0X21zZyA9IFwiWW91IHNob3VsZCB1c2UgZ3JvdXBfYnkgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBncm91cF9ieVwiKVxuXG5cbnRlc3Rfb2JqZWN0KFwicmVzdWx0XCIsXG4gICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJJbmNvcnJlY3QuIC0gSW5jb3JyZWN0by5cIlxuICAgICAgICAgICAgKVxuXG50ZXN0X2Vycm9yKClcbnN1Y2Nlc3NfbXNnKFwiV2VsbCBkb25lISAtIEJpZW4gaGVjaG8hXCIpIn0=

count function

Finally, another very recurring question is to obtain the number of cases/observations in each group. Although this could be obtained with the summarize function, another faster way to obtain it is with the count function.

Let’s look at an example assuming that we want to obtain the number of countries for which we have data for each of the years. Using the summarize function, it would be done as follows:

gapminder %>%
  group_by(year) %>%
  summarise(cases = n())

However, another faster option is to use the count function directly:

gapminder %>%
  group_by(year) %>%
  count()

As we can see, we save a few lines, again, making our code more easily understandable.

Have a look at the summary functions! Let’s continue with this dplyr tutorial looking at the manipulation functions.

dplyr data manipulation functions

The manipulation functions allow you to modify the current columns, either by creating new ones or renaming the current columns.

Within the data manipulation functions we find the functions: mutate , transmute , add_column , rename . Also, although they are not from the dplyr package, but from the tidyr package, I will comment on the pivot_wider and pivot_longer functions, since which are also very frequent in dplyr pipes.

mutate function

The mutate function allows you to create new columns, which will be the result of applying a function. Therefore, within the mutate function, we must always have two elements: the name of the new column and the function that will result in the values ​​of the column.

For example, we can use the mutate function to check if the name of the passenger on the Titanic included Mr .:

titanic = read.csv("https://gist.githubusercontent.com/michhar/2dfd2de0d4f8727f873422c5d959fff5/raw/fa71405126017e6a37bea592440b4bee94bf7b9e/titanic.csv")

titanic %>%
  mutate(
    contiene_mr = grepl("Mr.", Name, fixed = T)
  ) %>%
  select(Name, contiene_mr)

As we can see, the mutate function is very simple, but at the same time very useful, since it avoids having to make many assignments and / or corrections, which later is to know if they have been made or no.

In this sense, a function widely used together with mutate is the case_when function, which allows you to create conditionals in a very simple way.

For example, we could create a new variable called Gender that returns Man if the name includes Mr. & nbsp; or Master, Woman if it includes Ms., Mrs. & nbsp; or Miss, for example.

In addition, it may be the case that a name does not fall into any of the previous categories. In this case we could indicate that it is an NA, for example.

Doing this with the case_when function is very simple:

titanic %>%
  mutate(
    gender = case_when(
      grepl("Mr.|Master", Name) ~ 'Man',
      grepl("Ms.|Miss.|Mrs.", Name) ~ 'Woman',
      TRUE ~ NA_character_ # this indicates missclasified ones
    )
  ) %>%
  select(Name, gender)
Mutate exercises

Using the titanic dataset, create a new Boolean column, called adult, which returns TRUE if the person is of legal age or FALSE if he is not.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicylcblxudGl0YW5pYyA9IHJlYWQuY3N2KFwiaHR0cHM6Ly9naXN0LmdpdGh1YnVzZXJjb250ZW50LmNvbS9taWNoaGFyLzJkZmQyZGUwZDRmODcyN2Y4NzM0MjJjNWQ5NTlmZmY1L3Jhdy9mYTcxNDA1MTI2MDE3ZTZhMzdiZWE1OTI0NDBiNGJlZTk0YmY3YjllL3RpdGFuaWMuY3N2XCIpIiwic2FtcGxlIjoiIyBkcGx5ciBhbmQgZ2FwbWluZGVyIGFyZSBsb2FkZWQgYWxyZWFkeVxuIyBkcGx5ciB5IGdhcG1pbmRlciB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG4jIHRpdGFuaWMgaXMgYWxyZWFkeSBsb2FkZWQgLSB0aXRhbmljIHlhIGVzdFx1MDBlMSBjYXJnYWRvXG5cbnJlc3VsdCAgXG5cbiMgUHJpbnQgcmVzdWx0XG5wcmludChyZXN1bHQpIiwic29sdXRpb24iOiJyZXN1bHQgPSB0aXRhbmljICU+JVxuICBtdXRhdGUoXG4gICAgYWR1bHQgPSBpZmVsc2UoQWdlPj0xOCxULEYpXG4gIClcblxucHJpbnQocmVzdWx0KSIsInNjdCI6InRlc3RfZnVuY3Rpb24oXCJtdXRhdGVcIiwgXG4gICAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIllvdSBzaG91bGQgdXNlIG11dGF0ZSBmdW5jdGlvbi4gLSBEZWJlclx1MDBlZGFzIHVzYXIgbGEgZnVuY2lcdTAwZjNuIG11dGF0ZVwiKVxuXG5cbnRlc3Rfb2JqZWN0KFwicmVzdWx0XCIsXG4gICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJJbmNvcnJlY3QuIC0gSW5jb3JyZWN0by5cIlxuICAgICAgICAgICAgKVxuXG50ZXN0X2Vycm9yKClcbnN1Y2Nlc3NfbXNnKFwiV2VsbCBkb25lISAtIEJpZW4gaGVjaG8hXCIpIn0=

As we can see, making modifications with mutate is very simple, it’s faster than doing a normal assignment and makes the code more understandable.

However, there will be times when we just want to create a couple of new variables and forget about the rest. Although we could do that with a mutate function followed by a select, an easier way is to use the function transmute.

Transmute function

The transmute function allows us to create new variables and get rid of the rest of the variables. For example, imagine that we want to analyze the relationship between the variable sex and the name of the person, to check how many times the default variable Gender is correct (that is, we want to analyze the internal consistency of the data).

For this, we could use, for example, a mutate function followed by a filter:

titanic %>%
  mutate(
        gender_name = case_when(
      grepl("Mr.|Master", Name) ~ 'Man',
      grepl("Ms.|Miss.|Mrs.", Name) ~ 'Woman',
      TRUE ~ NA_character_ # this indicates missclasified ones
    )
  ) %>%
  select(gender_name, Sex)

But, another simpler option and do it all within the same function, using transmute.

titanic %>%
  transmute(
        gender_name = case_when(
      grepl("Mr.|Master", Name) ~ 'Man',
      grepl("Ms.|Miss.|Mrs.", Name) ~ 'Woman',
      TRUE ~ NA_character_ # this indicates missclasified ones
    ),
    Sex
  )

rename function

Another very common case is when we want to change the name of a column. With what we’ve seen so far, this could be done with the mutate function. However, a simpler and clearer way is to do it with the rename function.

Modifying the name with mutate would be done like this:

gapminder %>%
  mutate(life_expectancy = lifeExp, lifeExp = NULL )

Whereas with the rename function, the process is somewhat simpler:

gapminder %>%
  rename(life_expectancy = lifeExp)

pivot_longer and pivot_wider functions

Although these two functions are not from the dplyr library, but from the tidyr library, they are very useful and, generally, widely used in dplyr pipes.

The idea of ​​these functions is simple: go from having several variables in a column to having one variable per column or, vice versa, going from one variable per column to a column with several variables.

In the first case (from several variables in one column to one column per variable), we will use the pivot_wider function.

In this function you simply have to indicate:

  1. The columns that should not be modified.
  2. Which column will create the columns (in our case, the year).
  3. What variable will be used to fill the columns new columns (in our case, lifeExp).
library(tidyr)

gapminder %>%
  select(-pop, gdpPercap) %>%
  pivot_wider(
    id_cols = c("continent", "country"),
    names_from = "year",
    values_from = "lifeExp"
    )

This function is useful, especially when there is redundancy in the data, as in this case it happens with the year column.

On the contrary, we can go from wide data to narrower ones, making several variables are included within the same column.

We can achieve this with the pivot_longer function, in which we only have to choose the columns to pivot, the name of the column where the names of the variables will go and the name of the column where it will go the value:

gapminder %>%
  pivot_longer(
    cols = c("pop","gdpPercap", "lifeExp"),
    names_to = "variable",
    values_to = "value"
    )

Given the data manipulation functions, we would only have one family of functions left: the data combination functions.

Data combination functions

The data combination functions serve precisely for that: to combine information from two dataframes or tibbles. These combinations can be of two types:

  • Horizontal join: when a dataframe adds more columns than already exist. Within these functions we would find the bind_cols function, the joins: left_join , right_join , inner_join and full_join , as well as filter functions based on join, such as semi_join and anti_join.
  • Vertical join: when a dataframe includes more rows. Within this type of functions we would find: bind_rows , union , intersect , setdiff and setequal .

So, let’s see how these functions work!

bind_rows and bind_cols function

The bind_rows and bind_cols functions are similar to the rbind and cbind functions, although they have some of the same features main differences are:

  • bind_rows directly manages the lists, whereas rbind you must use the do.call function.
  • bind_rows is much more efficient than rbind , although bind_cols is not more efficient than cbind.
  • bind_rows handles mismatched columns, including NAs, while rbind will fail.

As an example, we can compare the efficiency of cbind with bind_cols :

microbenchmark(
  cbind = cbind(mtcars, mtcars),
  bind_cols = bind_cols(mtcars, mtcars)
)

And we can do the same with the rbind and bind_rows functions:

microbenchmark(
  rbind = rbind(mtcars, mtcars),
  bind_rows = bind_rows(mtcars, mtcars)
)

As we can see, when it comes to binding rows, dplyr’s bind_rows is much more efficient. Although this is not noticeable with little data, it is something that becomes important when we start working with larger datasets.

Now, let’s see how join functions work.

Join functions: left_join, right_join, inner_join, full_join, semi_join and anti_join

Join functions allow you to join data horizontally. Each of the functions will return a different result. For example:

  • left_join: returns all cells in the table on the left, even if they do not match the table on the right.
  • right_join: returns the data from the table on the right, even if it does not match the data from the table on the left.
  • inner_join: only returns the data that matches in both tables.
  • semi_join: returns the data from X (X only) that match the key of Y.
  • anti_join: displays the X data (X only) that has no match with the Y key.
  • full_join: returns the X and Y data, but does not match.

A good way to look at it is through examples. To do this, we are going to see a typical case of ecommer analysis, since the information is usually stored for each purchase in a table.

clients = read.csv("https://github.com/anderfernandez/datasets/blob/main/telcom-support/clients.csv")
clients
orders = read.csv("https://github.com/anderfernandez/datasets/blob/main/telcom-support/orders.csv")
orders

If we wanted to get the customers and their order data, we should simply do a left_join:

clients %>%
  left_join(orders, by = c("num_client" = "num_client")) %>%
  dim()
[1] 21 38

In this case, we will have 38 columns, (14 from clients, 25 from orders – the join column). In addition, even though there are only 21 customers, we have 22 rows, since the same customer may have made more than one purchase.

Likewise, we could obtain the order data only from those customers who have made purchases, for which we will use the inner_join:

clients %>%
  inner_join(orders, by = c("num_client" = "num_client")) %>%
  dim()
[1] 10 38

It could also be the case that we want the data of the customers who have placed orders, but we are not interested in including the order data, but we simply want to know which customers they are. To do this, we can use the semi_join function.

clients %>%
  semi_join(orders, by = c("num_client" = "num_client")) %>%
  dim()
[1]  9 14

As we can see, there are only 8 customers at our customer table who have placed orders. Or what is the same, there are 9 people at the customer table who have not placed orders. So if we wanted to identify these users, we could use the anti_join function:

clients %>%
  anti_join(orders, by = c("num_client" = "num_client")) %>%
  dim()
[1] 11 14

Indeed, the function returns those 11 customers from the customer table who have not placed orders.

You may not see big differences between using semi_join and / or anti_join compared to the filter function. Although it is true that for these cases we could obtain the same result with the filter function, the filter function only works for those cases in which there is a single key, while The semi_join and anti_join functions serve regardless of the number of keys in the table.

Join exercises

Enrich the calls that have a client, using the inner_join function.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicylcblxuY2xpZW50cyA9IHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vYW5kZXJmZXJuYW5kZXovZGF0YXNldHMvbWFpbi90ZWxjb20tc3VwcG9ydC9jbGllbnRzLmNzdicpXG5cbm9yZGVycyA9IHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vYW5kZXJmZXJuYW5kZXovZGF0YXNldHMvbWFpbi90ZWxjb20tc3VwcG9ydC9vcmRlcnMuY3N2JykiLCJzYW1wbGUiOiIjIGNsaWVudHMgYW5kIG9yZGVycyBhcmUgbG9hZGVkIGFscmVhZHlcbiMgY2xpZW50cyB5IG9yZGVycyB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG5cblxucmVzdWx0XG5cbiMgUHJpbnQgcmVzdWx0XG5wcmludChyZXN1bHQpIiwic29sdXRpb24iOiJyZXN1bHQgPSBjbGllbnRzICU+JVxuICBpbm5lcl9qb2luKG9yZGVycywgIGJ5ID0gYyhcIm51bV9jbGllbnRcIiA9IFwibnVtX2NsaWVudFwiKSkgXG4gIFxuXG5wcmludChyZXN1bHQpIiwic2N0IjoidGVzdF9mdW5jdGlvbihcImlubmVyX2pvaW5cIiwgXG4gICAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIllvdSBzaG91bGQgdXNlIGlubmVyX2pvaW4gZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBpbm5lcl9qb2luXCIpXG5cblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

Now, do the same process, but using the left_join function. Important: the left_join may return data that you are not interested in, so you will have to use another function that we have learned to solve it.

eyJsYW5ndWFnZSI6InIiLCJwcmVfZXhlcmNpc2VfY29kZSI6ImxpYnMgPSBjKFwiZHBseXJcIixcImdhcG1pbmRlclwiKVxuc2FwcGx5KGxpYnNbIWxpYnMgJWluJSBpbnN0YWxsZWQucGFja2FnZXMoKV0sIGluc3RhbGwucGFja2FnZXMpXG5zYXBwbHkobGlicywgcmVxdWlyZSwgY2hhcmFjdGVyLm9ubHkgPSBUKVxucm0obGlicylcblxuY2xpZW50cyA9IHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vYW5kZXJmZXJuYW5kZXovZGF0YXNldHMvbWFpbi90ZWxjb20tc3VwcG9ydC9jbGllbnRzLmNzdicpXG5cbm9yZGVycyA9IHJlYWQuY3N2KCdodHRwczovL3Jhdy5naXRodWJ1c2VyY29udGVudC5jb20vYW5kZXJmZXJuYW5kZXovZGF0YXNldHMvbWFpbi90ZWxjb20tc3VwcG9ydC9vcmRlcnMuY3N2JykiLCJzYW1wbGUiOiIjIGNsaWVudHMgYW5kIG9yZGVycyBhcmUgbG9hZGVkIGFscmVhZHlcbiMgY2xpZW50cyB5IG9yZGVycyB5YSBlc3RcdTAwZTFuIGNhcmdhZG9zXG5cblxucmVzdWx0XG5cbiMgUHJpbnQgcmVzdWx0XG5wcmludChyZXN1bHQpIiwic29sdXRpb24iOiJjbGllbnRzICU+JVxuICBsZWZ0X2pvaW4ob3JkZXJzLCAgYnkgPSBjKFwibnVtX2NsaWVudFwiID0gXCJudW1fY2xpZW50XCIpKSAlPiVcbiAgZmlsdGVyKCFpcy5uYShlbnRyeV9kYXRlLnkpKVxuICBcblxucHJpbnQocmVzdWx0KSIsInNjdCI6InRlc3RfZnVuY3Rpb24oXCJsZWZ0X2pvaW5cIiwgXG4gICAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIllvdSBzaG91bGQgdXNlIGxlZnRfam9pbiBmdW5jdGlvbi4gLSBEZWJlclx1MDBlZGFzIHVzYXIgbGEgZnVuY2lcdTAwZjNuIGxlZnRfam9pblwiKVxuXG50ZXN0X2Z1bmN0aW9uKFwiZmlsdGVyXCIsIFxuICAgICAgICAgICAgICBpbmNvcnJlY3RfbXNnID0gXCJZb3Ugc2hvdWxkIHVzZSBmaWx0ZXIgZnVuY3Rpb24uIC0gRGViZXJcdTAwZWRhcyB1c2FyIGxhIGZ1bmNpXHUwMGYzbiBmaWx0ZXJcIilcblxudGVzdF9vYmplY3QoXCJyZXN1bHRcIixcbiAgICAgICAgICAgIGluY29ycmVjdF9tc2cgPSBcIkluY29ycmVjdC4gLSBJbmNvcnJlY3RvLlwiXG4gICAgICAgICAgICApXG5cbnRlc3RfZXJyb3IoKVxuc3VjY2Vzc19tc2coXCJXZWxsIGRvbmUhIC0gQmllbiBoZWNobyFcIikifQ==

With this, in our complete guide to dplyr we have already seen all the families of functions that the package has. Now let’s see some practical cases where they can be interesting.

Practical examples of the dplyr library

Visualizing gapminder data with dplyr and ggplot

A classic dplyr use case is to create a cool visualization from a dataframe with ggplot . And, the dplyr functions that we have learned in the tutorial together with the pivot_wider and pivot_longer functions offer practically everything you need to be able to make all kinds of visualizations.

For example, in this case, starting from the original gapminder dataset, we will create a dplyr pipe that allows us to visualize the average, maximum and minimum evolution for each of the continents and each of the variables we have.

To do this, we will use the group_by , summarize and across functions to calculate the maximum, minimum and mean for each of the continents in every year.

Once this is done, we will transform the data in long format with pivot_longer . Thus, we can use the mutate function to have the function that has been used and the value of the function separately.

Once we have this, we will pivot the data again, this time in wide-format using pivot_wider , so that we have the data ready to be painted with ggplot2.

library(ggplot2)

gapminder %>%
  group_by(year, continent) %>%
  summarise(across(is.numeric, .fns = list(mean = mean, max = max,min =  min))) %>%
  pivot_longer(
    cols = -c("year", "continent"), 
    names_to = "variable", 
    values_to = "valor" 
    ) %>%
  mutate(
    funcion = gsub(".*_","", variable),
    variable = gsub("_.*", "", variable)
  ) %>%
  pivot_wider(names_from = funcion, values_from = valor) %>%
  ggplot(aes(x= year)) + 
  geom_line(aes(y = mean), group = 1) +
  geom_ribbon(aes(ymax = max, ymin = min), alpha = 0.6, fill = "skyblue") +
  facet_grid(variable~continent, scales = "free_y") +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)
    )

As a result, we see a very interesting and useful graph that allows us to see curious things, such as the growth of the population in Asia in recent years, the convergence of life expectancy and the increase in the differences in terms of GDP per Capita in Europe, the general increase in life expectancy on all continents.

And all this has been obtained using only 4 dplyr functions, 2 tidyr functions and the ggplot2 library, all in a fairly simple and friendly code.

Manipulating stream data with dplyr

A typical case of dplyr pipes is when the observations refer to different actions within a flow, such as the positions of a taxi or the pages visited by users on the web.

Although there are no typical datasets for this type of data, we will use the gapminder dataset to show how it would be used in this type of case. To do this, we will make a Sankey visualization that shows the evolution of the countries with the highest GDP per Capita for different years, the data being the difference with respect to the next country.

For this, we will again use the packages, dplyr , ggplot2 and alluvial , which allows us to make sankey graphics in ggplot2 . We are going to create a Sankey diagram that allows us to see how the deviation from the mean has evolved in terms of GDP per Capita of the ten countries with the highest GDP per Capita each year.

To do this, we are simply going to group the data by year and sort it by GDP per Capita in descending order. In this way, with mutate we can: (1) obtain the position of each country for that year, (2), for each country obtain the deviation with respect to the average GDP per capita of that year.

library(ggalluvial)

gapminder %>%
  select(-pop, -lifeExp) %>%
  group_by(year) %>%
  arrange(desc(gdpPercap)) %>%
  mutate(
    mean_gdp = mean(gdpPercap),
    index = row_number(),
    percent_better = (gdpPercap/mean_gdp) -1
  ) %>%
  slice_min(index, n = 10) %>%
  select(index, country, year, percent_better) %>%
  ggplot(aes(as.factor(year), 
             stratum = reorder(country, -index), 
             y = percent_better,
             label = country, 
             alluvium = index, 
             fill = country
             )) +
  geom_flow(stat = "alluvium", lode.guidance = "frontback",
            color = "darkgray") +
  geom_stratum(alpha = .5) +
  #geom_text(stat = "stratum", size = 2.5) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    legend.key.size = unit(0.5, "cm"),
    legend.key.width = unit(0.5,"cm") 
    ) +
  labs(
    x = "", 
    y = "% better than mean gdpPerCap",
    fill = "") +
  scale_y_reverse()

As we can see, with a few dplyr functions we have been able to transform the data to have it in such a way that I could easily create the Sankey graph.

Conclusions from the dplyr tutorial

If dplyr is one of the most used R packages, it is because it is a very simple library to understand and with which you get a very understandable and very efficient code.

Also, being part of one of the tidyverse world packages, its integration with other packages of the “universe”, such as ggplot2 or tidyr makes it very practical, being able to create complex and visually stunning visualizations without the need to save any intermediate objects.

So, I hope this complete tutorial on dplyr has helped you get to know the package better. Also, don’t forget to take a look at the post about how to create animations in R with gganimate, which complements this post very well. If you want to be aware of the posts I publish, I recommend that you subscribe to my blog. And, if you liked it and can afford it, I would appreciate it if you could contribute a donation to allow me to continue creating more content like this. In any case, see you in the next post!