⇦ Back

The Problem

Let’s say we have the following incomplete dataset:

main <- data.frame(
    Record.Id = c(100, 200, 300, 400, 500, 600),
    Name = c("Alpha", "Bravo", NA, NA, "Echo", "Foxtrot"),
    Age = c(NA, NA, NA, NA, 25, 26),
    Height = c(NA, NA, 173, 174, NA, NA)
)
print(main)
##   Record.Id    Name Age Height
## 1       100   Alpha  NA     NA
## 2       200   Bravo  NA     NA
## 3       300    <NA>  NA    173
## 4       400    <NA>  NA    174
## 5       500    Echo  25     NA
## 6       600 Foxtrot  26     NA

There are a whole bunch of NA values where data is missing.

Now let’s say that we have a supplementary dataset that contains some data that the main dataset doesn’t:

supplementary <- data.frame(
    Record.Id = c(300, 400, 500, 600, 100),
    Name = c("Charlie", "Delta", NA, NA, "Alpha"),
    Age = c(23, 24, NA, NA, 21),
    Height = c(173, 174, NA, 176, 171)
)
print(supplementary)
##   Record.Id    Name Age Height
## 1       300 Charlie  23    173
## 2       400   Delta  24    174
## 3       500    <NA>  NA     NA
## 4       600    <NA>  NA    176
## 5       100   Alpha  21    171

We could use this supplementary data to fill in some of the gaps, except that we have two problems:

  • The supplementary dataset is not in the same order as the main dataset
  • It is not the same size as the main dataset

So we can’t just put the data frames on top of each other and create a simple combination of the two. Essentially, what we want to do is merge the two data frames, but “merge” has a very specific definition in R and it’s not what we want in this situation.

The Solution

We need to go row-by-row and column-by-column and assess for each cell of the main data frame that contains a null whether or not the supplementary data frame contains data that could be used to fill the gap. Then, of course, we need to perform the replacement:

# For each column in main
for (i in seq_len(length(colnames(main)))) {
    # For each row in main
    for (j in seq_len(nrow(main))) {
        # Name of the column
        col <- colnames(main)[i]
        # Get the Record.Id from main
        record_id <- main[[j, "Record.Id"]]
        # Find the row corresponding to this Record.Id in supplementary
        idx <- match(record_id, supplementary$"Record.Id")
        tryCatch({
            # If the cell in main is NA, replace it with the corresponding cell
            # in supplementary
            main[[j, col]] <- ifelse(
                is.na(main[[j, col]]), supplementary[[idx, col]], main[[j, col]]
            )
        }, error = function(e) {
        })
    }
}
print(main)
##   Record.Id    Name Age Height
## 1       100   Alpha  21    171
## 2       200   Bravo  NA     NA
## 3       300 Charlie  23    173
## 4       400   Delta  24    174
## 5       500    Echo  25     NA
## 6       600 Foxtrot  26    176

This final data frame contains all the data we have available to us.

⇦ Back