?Close-up of a knot in a thick rope? by Robert Zunikoff on Unsplash
Often when working with disparate datasets that are perhaps exported from a database or standalone CSVs, you might want to join the data together on a common key or column. This can typically take place within a database, but if you don?t have permissions to do so, or don?t want to ETL for one-off analysis, then utilising dplyr and R to join the data can prove to be more efficient.
As a work-through for joining data in R, let?s look at we might have one table which contains data such as the amount of times a customer has purchased at a store, We then might have a second table which contains demographic information about the customer which we have collected from a loyalty scheme, or surveys.
If this data was within the same data warehouse or database, then of course we could join these tables together directly within the system. However, there will often be times where we are working with separate CSVs or are not able to upload the data to a database in order to join using SQL. Thankfully, that is where dplyr comes in.
Using dplyr within R, we can easily import our data and join these tables, using the following join types.
- Inner Join (inner_join)
- Left Join (left_join)
- Right Join (right_join)
- Full Join (full_join)
- Semi Join (semi_join)
- Anti Join (anti_join)
The general syntax of these joins is as follows:
join_type(firstTable, secondTable, by=columnTojoinOn)
e.g.
innerJoinDf <- inner_join(tableA,tableB,by=”Customer.ID”)
We?ll now run through an example of using each of these join types on our two tables.
Loading dplyr & our tables
library(dplyr)tableA <- read.csv(“.csv”)tableB <- read.csv(“.csv”)
Inner join: This join creates a new table which will combine table A and table B, based on the join-predicate (the column we decide to link the data on). For example, if we decided to join on Customer ID, the new table would contain rows 1 and 2:
innerJoinDf <- inner_join(tableA,tableB,by=”Customer.ID”)View(innerJoinDf)
Left join: This join will take all of the values from the table we specify as left (e.g., the first one) and match them to records from the table on the right (e.g. the second one). If there isn?t a match in the second table, then it will return NULL for the row in question For example, if we left joined table A to table B, our data would look as follows:
leftJoinDf <- left_join(tableA,tableB,by=?Customer.ID?)View(leftJoinDf)
Right join: Perhaps one of the easiest ways to consider a right join is the opposite of a left join! In this instance, the table specified second within the join statement will be the one that the new table takes all of its values from. If there isn?t a match in the first table (the table specified first in the query), then it will return NULL for the row(s) that do not match. In this instance, if we right joined table B to table A, our data would look as follows:
rightJoinDf <- right_join(tableA,tableB,by=?Customer.ID?)View(rightJoinDf)
Full join: The full outer join returns all of the records in a new table, whether it matches on either the left or right tables. If the table rows match, then a join will be executed, otherwise it will return NULL in places where a matching row does not exist.
fullJoinDf <- full_join(tableA,tableB,by=?Customer.ID?)View(fullJoinDf)
Semi join: This is arguably a little more complex than the previous examples of joins, but is still pretty straight-forward. A semi join creates a new table where it will return all rows from the first table where there is a corresponding matching value in second, but instead of the new table combining both the first and second tables, it only contains data from the first table. This can be illustrated as follows:
semiJoinDf <- semi_join(tableA,tableB,by=?Customer.ID?)View(semiJoinDf)
Anti join: As we have seen when looking at creating training & test datasets for machine learning in dplyr, anti joins are super helpful. An anti join will return all of the rows from the first table where there are not matching values from the second. The new anti join table will only contain data from the first table, based on the join predicate listed above. An example of this is shown below:
antiJoinDf <- anti_join(tableA,tableB,by=?Customer.ID?)View(antiJoinDf)
Useful Resources
If you?d like to find out more about joining data with dplyr or SQL, you can check out these great resources:
https://dplyr.tidyverse.org/reference/join.html
http://www.sql-join.com/