Best Practices for Cleaning Data in R

A few months ago, I gave a talk at the conference previously known as the NYC R Conference, now known as New York Data Science and AI. (Watch it here!) My presentation focused on my favorite topic: handling duplicates in data, and the importance of data cleaning.

The saying “90% of data science is cleaning the data” rings especially true for me. I really love digging into the weeds of cleaning data - figuring out what went wrong, whether the errors were systematic or not, whether there were user input errors (always), etc.

I’ve spent the last 10+ years working in the education space, where messy data is everywhere and data-driven decisions have a real impact on a student’s success. The challenge is to get the data as clean as possible so that the analysis those decisions rest on is correct.

Some common data challenges I’ve seen are:

  • Missing/Incomplete data
  • Different data sources without matching IDs
  • Incorrect/overlapping dates
  • (Mis)-alignment of data and data processes across all schools and regions
  • Changing student IDs (not many)
  • Human data reporting error
  • Historical data quality

Tackling these and more messy data challenges is the 90% of the work that drives meaningful outcomes for students.

Duplicates! Oh no! Where did they come from?

Duplicates in data are everywhere. Any dataset has the potential for some level of duplication, and if you’re not on the lookout, they can persist and cause analysis errors.

Most data duplicates that I’ve seen are caused by inadequate processes. If your organization doesn’t have the right data processes to start with, messy data will continue to flow, no matter how much you code. Creating structures and processes, and training people on them, will help reduce errors across the board.

For example, let’s say we have a student named James, who moved from School A to School B mid-year.

Bad process example: School B records James entering the school the week before he officially starts, to get started on his course schedule and other paperwork. School A doesn’t record his exit until a week after he left, because they got busy or wanted to wait to see if he changed his mind. As the data person, you don’t know which school James actually attended during the overlapping two weeks in the database.

Better process example: Make sure the system of record allows forward and back dating such that School B records James in their system with the correct start date and School A records James as having left on his last day. If James comes back to School A, they will start a new record without overlapping dates. To ensure data uniqueness, the system should verify that James has the same ID in School A and School B.
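Even with a better process in place, it’s worth checking for overlapping dates on the data side. Here is a minimal sketch of that check; the enrollments table, its values, and its column names are made up for illustration.

library(dplyr)

# Minimal overlap-audit sketch (the enrollments data here is made up)
enrollments <- tibble(
  student_id = c(1, 1, 2),
  school     = c("A", "B", "A"),
  entrydate  = as.Date(c("2025-09-01", "2025-11-03", "2025-09-01")),
  exitdate   = as.Date(c("2025-11-10", "2026-06-15", "2026-06-15"))
)

# Flag any record that starts before the student's previous record ended
enrollments |>
  arrange(student_id, entrydate) |>
  group_by(student_id) |>
  mutate(previous_exit     = lag(exitdate),
         overlaps_previous = !is.na(previous_exit) & entrydate < previous_exit) |>
  ungroup() |>
  filter(overlaps_previous)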

This example and other data issues can be checked and rechecked through data audits. Even with identifiers, duplicates can still occur (e.g., the same person with two different email addresses), so we use additional fields to audit for duplicates. Names, emails, birthdays, phone numbers, and home addresses are good places to check.
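One way to run that kind of audit with janitor is to call get_dupes() on those identifying fields instead of the ID; the roster data and column names below are stand-ins for whatever fields your data actually has.

library(dplyr)
library(janitor)

# Audit sketch: the same person can appear under two different IDs,
# so check human-identifying fields too (all names/values here are made up)
roster <- tibble(
  student_id = c(101, 102, 103),
  first_name = c("James", "James", "Ana"),
  last_name  = c("Lee",   "Lee",   "Ortiz"),
  birthdate  = as.Date(c("2010-05-04", "2010-05-04", "2011-02-17"))
)

roster |> get_dupes(first_name, last_name, birthdate)
# James Lee shows up twice with different student_ids - a likely duplicate person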

Duplicates you caused!

Duplicates are not always the fault of the data itself. We can cause our own duplicates through incorrectly written code.

Some common culprits:

  • Joining – using the incorrect fields, or fields that still needed edits
  • pivot_wider – including too many columns in the select
  • pivot_longer – including too many columns in the pivot

Integrating validation steps, unit testing, and code reviews into your work will reduce the number of “coder-caused” duplicates.
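Before getting to the main example, here is a quick sketch of the join case: if the lookup table has more than one row per key, every matching row on the left gets duplicated. The tables and column names below are invented for illustration.

library(dplyr)

# Sketch of a coder-caused duplicate: school "A" has two counselor rows,
# so every school "A" student comes back twice after the join
students_small <- tibble(student_id = c(1, 2), school = c("A", "B"))
counselors     <- tibble(school    = c("A", "A", "B"),
                         counselor = c("Smith", "Jones", "Lee"))

students_small |> left_join(counselors, by = "school")
# student 1 now appears in two rows, one per matching counselor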

Let’s take a look at some R code…

janitor was built with beginning-to-intermediate R users in mind and is optimized for user-friendliness. Advanced users can already do everything covered here, but they can do it faster with janitor and save their thinking for more fun tasks. (Sam Firke)

If you’re experienced with the tidyverse in general, you should be able to do everything inside janitor on your own; however, it’s always nice to have a function do it for you.

library(tidyverse)
library(janitor)
library(readxl)

# Set up fake student data
# Your fake data might be different from mine, as it's totally random IDs.
students <- tibble::tibble(student_id = round(runif(10, 1e6, 1e7-1), 0), 
                           grade = round(runif(10, 1, 12)),
                           entrydate = Sys.Date() - 30,
                           exitdate = Sys.Date())

students[3, 1]  <- students[5, 1] 
students[3, 2]  <- students[5, 2] - 1 # set up duplicate

students |>  get_dupes(student_id)
# A tibble: 2 x 5
  student_id dupe_count grade entrydate  exitdate  
       <dbl>      <int> <dbl> <date>     <date>    
1    4137115          2     1 2017-12-02 2018-01-01
2    4137115          2     2 2017-12-02 2018-01-01

Using get_dupes and verify() from the assertr package is a great way to add checks in case the data changes (which it inevitably will).

library(assertr)
# nrow(.) relies on the magrittr pipe (%>%); the native |> pipe does not define `.`
check <- students %>% 
  get_dupes(student_id) %>% 
  verify(nrow(.) == 0)

If a student ID changes or new duplicates occur, the code will stop at this step.

Fixing the duplicates

Option 1:

Correct the dupes individually with if_else or case_when. This method is best for errors in one or two rows, as a quick and easy fix.

# Fill in the duplicated student_id and the correct grade before running
correct_students <- students |>
  mutate(grade = if_else(student_id == ______, CORRECT-GRADE, grade)) |>
  distinct() 
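If a handful of rows need fixing, case_when extends the same idea; the student IDs and grades below are placeholders, not values from the data above.

# Same idea with case_when for a few known fixes
# (the student IDs and grades here are placeholders)
correct_students <- students |>
  mutate(grade = case_when(student_id == 1234567 ~ 5,
                           student_id == 7654321 ~ 9,
                           TRUE ~ grade)) |>
  distinct()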

Option 2:

Systematic errors can be fixed by summarizing over the incorrect column. In this case, we could assume that the lower grade level is the correct one for all duplicate enrollments. This method works better for systematic issues that you know how to correct, such as taking the higher of two homework assignment scores.

# Keep the lower grade for each student; note that summarize drops any
# columns not listed here (entrydate/exitdate would need their own summaries)
correct_students <- students |>
   group_by(student_id) |> 
   summarize(grade = min(grade))

correct_students

Option 3:

Output the duplicates and manually choose which version to keep.

students |> 
  get_dupes(student_id) |>
  write_csv("../data/dupes.csv", na = "")

# Create this file in Excel manually, often by asking your coworkers for help
dupes_remove <- read_csv("../data/dupes_correct.csv") |> 
  filter(delete == 1)

students_correct <- anti_join(students, dupes_remove) 

DOCUMENT DOCUMENT DOCUMENT.

Now that you’ve fixed the duplicates, whether in the database and/or code, DOCUMENT what you did and WHY, so that when the data changes and new duplicates are found, the code still runs.
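Documentation can live right next to the fix. The ticket reference and the explanation below are invented, but the shape is what matters.

# 2025-11-14: student 4137115 appeared twice after a mid-year school transfer.
# The registrar confirmed grade 1 is correct (ticket DATA-123, made-up reference).
# Keep the confirmed grade, then drop the now-identical duplicate row.
correct_students <- students |>
  mutate(grade = if_else(student_id == 4137115, 1, grade)) |>
  distinct()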

Written on November 14, 2025