Data Handling in R: Filter by Number of Times a Value Appears

If you are running an experiment with multiple timepoints and want to select only the paired samples, i.e. the participants who have been tested twice, you need to filter by the number of times a value appears. Take the following example data:

df <- data.frame(
    id = c(
        "101-0001", "101-0001", "101-0002", "101-0002", "101-0003", "101-0004",
        "101-0005"
    ),
    timepoint = c("A", "B", "A", "B", "A", "A", "A")
)
print(df)

##         id timepoint
## 1 101-0001         A
## 2 101-0001         B
## 3 101-0002         A
## 4 101-0002         B
## 5 101-0003         A
## 6 101-0004         A
## 7 101-0005         A
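
Before filtering, it can help to check how often each id occurs. The following is a minimal sketch using base R's table() on the same example data; the variable name counts is just an illustrative choice:

```r
# Same example data as above
df <- data.frame(
    id = c(
        "101-0001", "101-0001", "101-0002", "101-0002", "101-0003", "101-0004",
        "101-0005"
    ),
    timepoint = c("A", "B", "A", "B", "A", "A", "A")
)

# Count how many rows each id has
counts <- table(df$id)
print(counts)

# ids that appear exactly twice
print(names(counts)[counts == 2])
```

The counts for “101-0001” and “101-0002” should be 2, and 1 for the remaining ids.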

We can see that participants “101-0001” and “101-0002” have each been tested at two timepoints. To select only them, group by id and then filter for the ids that appear twice using dplyr:

library(dplyr, warn.conflicts = FALSE)

subset <- df %>%
    group_by(id) %>%
    filter(n() == 2)

print(subset)

## # A tibble: 4 × 2
## # Groups:   id [2]
##   id       timepoint
##   <chr>    <chr>    
## 1 101-0001 A        
## 2 101-0001 B        
## 3 101-0002 A        
## 4 101-0002 B
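
Note that n() == 2 counts rows, so an id recorded twice at the same timepoint would also pass. If that matters for your data, one possible variation (a sketch, not the only way to do it) also requires two distinct timepoints and drops the grouping afterwards. The name paired is an assumption chosen for illustration:

```r
library(dplyr, warn.conflicts = FALSE)

# Same example data as above
df <- data.frame(
    id = c(
        "101-0001", "101-0001", "101-0002", "101-0002", "101-0003", "101-0004",
        "101-0005"
    ),
    timepoint = c("A", "B", "A", "B", "A", "A", "A")
)

paired <- df %>%
    group_by(id) %>%
    # Keep ids with exactly two rows that cover two different timepoints
    filter(n() == 2, n_distinct(timepoint) == 2) %>%
    # Remove the grouping so later operations act on the whole table
    ungroup()

print(paired)
```

For this example data the result contains the same four rows as before, but as an ungrouped tibble.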

…or, using only base R, you can split the data frame by id and count the rows in each group using nrow. Then find the ids that appear exactly twice and keep only the rows with those ids:

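# Count the rows for each id
res <- by(df, df$id, nrow)
# TRUE for ids that appear exactly twice
res <- res == 2
# Keep only the rows whose id appears twice
subset <- df[df$id %in% names(which(res)), ]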

print(subset)

##         id timepoint
## 1 101-0001         A
## 2 101-0001         B
## 3 101-0002         A
## 4 101-0002         B
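
Another base R option, shown here as a hedged sketch, is ave(), which returns the per-id row count aligned with the rows of the data frame, so the filter becomes a single logical index. The names n_per_id and paired are illustrative assumptions:

```r
# Same example data as above
df <- data.frame(
    id = c(
        "101-0001", "101-0001", "101-0002", "101-0002", "101-0003", "101-0004",
        "101-0005"
    ),
    timepoint = c("A", "B", "A", "B", "A", "A", "A")
)

# For every row, the number of rows that share its id
n_per_id <- ave(seq_along(df$id), df$id, FUN = length)

# Keep only the rows whose id appears exactly twice
paired <- df[n_per_id == 2, ]

print(paired)
```

This avoids the intermediate lookup by name, at the cost of being slightly less explicit about which ids were selected.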
