⇦ Back

1 What’s a Data Frame?

Data frames are the most important data type in R. The power that R has as a programming language and its usefulness in statistics and data analysis comes from that fact that it has these. They allow calculations to be rapidly performed on data that is tabular in nature, ie data that can be represented as rows and columns (like Excel, but more powerful). The reason this is possible is because:

A data frame just is a fancy table

It also has rows and columns and, often, data frames are created by importing data from Excel. A more manual way to create them is by using the data.frame() function, which turns vectors into the columns of a ‘table’. Here’s an example using the top 10 most-streamed songs on Spotify:

spotify <- data.frame(
    Rank = c(1:10),
    Song = c(
        "Shape of You", "Rockstar", "One Dance", "Closer", "Thinking Out Loud",
        "God's Plan", "Havana", "Sunflower", "Perfect", "Say You Won't Let Go"
    ),
    Artist = c(
        "Ed Sheeran", "Post Malone", "Drake", "The Chainsmokers", "Ed Sheeran",
        "Drake", "Camila Cabello", "Post Malone", "Ed Sheeran", "James Arthur"
    ),
    Featuring = c(
        NA, "21 Savage", "Wizkid, Kyla", "Halsey", NA, NA, "Young Thug",
        "Swae Lee", NA, NA
    ),
    Label = c(
        "Warner", "Universal", "Universal", "Sony", "Warner", "Universal",
        "Sony", "Universal", "Warner", "Sony"
    ),
    Streams_Millions = c(
        2449, 1851, 1824, 1734, 1504, 1495, 1407, 1400, 1366, 1360
    ),
    Release_Date = c(
        "06/01/2017", "15/09/2017", "05/04/2016", "29/07/2016", "20/06/2014",
        "19/01/2018", "03/08/2017", "18/10/2018", "03/03/2017", "09/09/2016"
    ),
    Weeks_at_no1 = c(14, 17, 14, 11, 2, 11, 1, 2, NA, NA)
)

As you can see above, eight vectors were created using the c() function and each was given a name. This creates a data frame that looks like this:

Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
1 Shape of You Ed Sheeran NA Warner 2449 06/01/2017 14
2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
5 Thinking Out Loud Ed Sheeran NA Warner 1504 20/06/2014 2
6 God’s Plan Drake NA Universal 1495 19/01/2018 11
7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
9 Perfect Ed Sheeran NA Warner 1366 03/03/2017 NA
10 Say You Won’t Let Go James Arthur NA Sony 1360 09/09/2016 NA

Each row and column has a name (the ‘row heading’ or the ‘column heading’) and a number (the ‘index’). So the words “One Dance” are in the ‘Song’ column which is also known as column 2 (ie the column’s heading is ‘Song’ and the column’s index is 2) and they are in row 3. By default, the rows’ names are the same as their indices. You can use either the name or the index when referring to a specific row/column.

2 Quick Info About a Data Frame

Often data frames can be very large, so you don’t want to (or maybe you can’t) view the entire thing at once. Use head() to just see the first few rows and tail() to just see the last few:

head(spotify)
##   Rank              Song           Artist    Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 1    1      Shape of You       Ed Sheeran         <NA>    Warner             2449   06/01/2017           14
## 2    2          Rockstar      Post Malone    21 Savage Universal             1851   15/09/2017           17
## 3    3         One Dance            Drake Wizkid, Kyla Universal             1824   05/04/2016           14
## 4    4            Closer The Chainsmokers       Halsey      Sony             1734   29/07/2016           11
## 5    5 Thinking Out Loud       Ed Sheeran         <NA>    Warner             1504   20/06/2014            2
## 6    6        God's Plan            Drake         <NA> Universal             1495   19/01/2018           11

You can specify exactly how many rows you see with these functions, eg head(spotify, 15).

There are various other functions that can give you an overview of the data that is in a data frame, which may or may not be useful depending on what type of data you have:

Function Returns
colnames(df) The column names of a data frame
rownames(df) The row names of a data frame
ncol(df) Number of columns
nrow(df) Number of rows
describeBy(df) Descriptive statistics
summary(df) Results of various model fitting functions

3 Columns in a Data Frame

3.1 Add Columns

Append a vector as a new column using cbind(). In this case we’re appending the name of the album each song came from:

df <- cbind(
    spotify, Album = c(
        "÷", "Beerbongs & Bentleys", "Views", "Collage", "×", "Scorpion",
        "Camila", "Hollywood's Bleeding", "÷", "Back from the Edge"
    )
)

3.2 Delete Columns

We can remove columns in a few different ways:

# Delete a specific column using its name
df$weeks <- NULL
# Delete multiple columns by listing their names using the subset() function
df <- subset(spotify, select = -c(Rank, Song, Artist))
# Delete multiple columns by listing their names using the select() function
# from the dplyr package
library(dplyr)
df <- select(spotify, -c("Rank", "Song", "Artist"))
# Delete multiple columns by listing their names using indexing
df <- spotify[!names(spotify) %in% c("Rank", "Song", "Artist")]
# Delete columns using their indices
df <- spotify[-c(1, 3:6, 12)]

3.3 Rename Columns

First let’s remind ourselves of what the column names are:

print(colnames(spotify))
## [1] "Rank"             "Song"             "Artist"           "Featuring"        "Label"            "Streams_Millions"
## [7] "Release_Date"     "Weeks_at_no1"

3.3.1 Rename Using the Current Name

Now use the rename() function from the dplyr package. The format to remember with this is "New" = "Old":

# Rename the "Artist" column to "Artist Name"
df <- rename(spotify, "Artist Name" = "Artist")
# Rename the "Song" column to "Song Name"
df <- rename(df, "Song Name" = "Song")
# Rename the "Featuring" column to "ft"
df <- rename(df, "ft" = "Featuring")

3.3.2 Rename Using the Index

Column 5 is called ‘Label’ but really it contains the name of the music group that owns that artist’s record label. So let’s change the name of column 5 to “Group”:

colnames(df)[5] <- "Group"

3.3.3 Rename Using the Current Name from the Index

Let’s rename the column called “Streams_Millions” to “Millions_of_Streams”. Let’s do this by first looking up the index of the column that’s called “Streams_Millions” (it is column 6) and then renaming column 6 to “Millions_of_Streams”.

idx <- which(names(df) == "Streams_Millions")
colnames(df)[idx] <- "Millions_of_Streams"

Check out the new names:

print(colnames(df))
## [1] "Rank"                "Song Name"           "Artist Name"         "ft"                  "Group"              
## [6] "Millions_of_Streams" "Release_Date"        "Weeks_at_no1"

4 Rows in a Data Frame

4.1 Add Rows

To add a new row (or rows) create a new data frame which has the same column headings as your current data frame and include the new information you want to add:

new_row <- data.frame(
    Rank = 11,
    Song = "Love Yourself",
    Artist = "Justin Bieber",
    Featuring = NA,
    Label = "Universal",
    Streams_Millions = 1357,
    Release_Date = "09/11/2015",
    Weeks_at_no1 = 3
)

Use rbind() to do the actual appending:

df <- rbind(spotify, new_row)

4.2 Delete Rows

# Delete multiple rows by index
df <- spotify[-c(2, 4, 6), ]
# Delete the first row
df <- spotify[-1, ]

Note how the above notation works:

  • The square brackets indicate that you are indexing (extracting a subset of) the data frame
  • The minus sign before the c() indicates that you are ‘subtracting’ (deleting) the named columns
  • The lack of any column names after the comma indicates all columns are relevant to this operation, ie you want to delete the values in this row from all columns

4.3 Rename Rows

# Change the row names to be sequential numbers
rownames(df) <- NULL
rownames(df) <- seq_len(nrow(df))

5 Indexing (aka Selecting)

If you know exactly which row(s) and column(s) you want, you can index a data frame in order to select only that portion of the whole object. This can be done in one of three ways:

  • Using a function
  • Using square bracket notation: [ ]
  • Using dollar sign notation: $

5.1 Select Certain Columns

This is essentially the same process as deleting columns, but this time you are choosing which columns to keep instead of which ones to get rid of. Again, this can be done using a function:

# Select columns by name using the subset() function
df <- subset(spotify, select = c("Rank", "Song", "Artist"))
# Select columns by name using the select() function from the dplyr package
df <- select(spotify, c("Rank", "Song", "Artist"))

…or by using square bracket notation:

# Select columns by name using indexing
df <- spotify[c("Rank", "Song", "Artist")]
# Select columns by index
df <- spotify[c(1:3)]
# Select columns by matching names to a list
df <- spotify[names(spotify) %in% c("Rank", "Song", "Artist")]

…or by using dollar sign notation:

# Select a single column by using the dollar sign notation
df <- spotify$Song

Dollar sign notation can only be used to select COLUMNS. To select rows, see below:

5.2 Select Certain Rows

When you use square brackets to index a data frame, R first selects the columns you asked for and then selects the rows. Thus, if you only provide one piece of information, R assumes you are asking for columns. So this code: spotify[c(1:3)] will select COLUMNS 1, 2 and 3 because you only provided one vector. If you want to select ROWS 1, 2 and 3, you need to provide two pieces of information: the columns that you want (in this case, all of them) and the rows that you want (1, 2 and 3) in the format spotify[<rows>, <columns>]:

# Select certain rows by index
df <- spotify[c(1, 2, 3), ]

Notice that, because we wanted all the columns, we left the <columns> portion of the indexing blank. But we still needed to include the blank portion because otherwise R would have assumed we were asking for columns! In summary:

  • spotify[c(1:3)] selects COLUMNS 1 to 3
  • spotify[c(1:3),] selects ROWS 1 to 3

5.3 Select a Single Cell

Looking up a value in a particular cell in the table is essentially a combination of the above: selecting a certain column and a certain row and seeing what is at their intersection. Again, the format to use is spotify[<rows>, <columns>]:

# Select a single cell
df <- spotify[4, "Song"]
print(df)
## [1] "Closer"

Notice that, despite the fact you’ve only selected one cell, R remembers the other values in the column and prints them as ‘Levels’ in the output above. That is because the single cell you’ve selected is still a data frame, it’s just a very small one, and so all the ‘structure’ of the original object is still there. To remove this, explicitly convert the cell’s contents to a character object:

# Select a single cell
df <- as.character(spotify$"Song"[4])
print(df)
## [1] "Closer"

See how only the name of the song is printed. Also note that a slightly different method of indexing was used in that second example: the doller sign $ is used to select an entire column before the square brackets are used to select a single element of that column.

6 Searching, Finding, Filtering

While indexing/selecting is something you do when you know exactly which row(s) and column(s) you want, sometimes you do NOT know exact what you want. Instead, you first need to search through the data frame to find the information you are looking for. Then you can subset the data frame to filter out any information you don’t want:

  • ‘Search’ the data frame to see if the data you are looking for exists
    • The result will be a boolean (ie TRUE/FALSE): the data was either found or it wasn’t
  • ‘Find’ the data if it does exist
    • The result will be the location of the data
  • ‘Filter’ the data frame so that it only contains what you want
    • The result will be a new data frame, smaller than the original one, containing only the data you want

6.1 Searching

Does a column contain a certain value? For example, is there a song by The Chainsmokers in the Spotify top 10?

# Does a column contain a certain value?
bool <- "The Chainsmokers" %in% spotify$"Artist"
print(bool)
## [1] TRUE

You can also search an entire column at once, ie search each row. For example, for each artist in the top 10, are they signed to a Universal Music Group label?

# Does the rows contain a certain value?
bool <- (spotify$Label == "Universal")
print(bool)
##  [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE

Instead of searching through the data in a column you can also search through the names of the columns to see if the one you are looking for has been created yet. For example, has a column been created that contains the number of weeks a song was at number one for?

# Is there a column with a certain name?
bool <- "Weeks_at_no1" %in% colnames(spotify)
print(bool)
## [1] TRUE

6.2 Finding

‘Finding’ is similar to ‘searching’ except you get the locations of the matches, not just whether or not they exist. For example, what position in the top 10 is Camila Cabello’s song “Havana”?

idx <- match("Havana", spotify$"Song")
print(idx)
## [1] 7

6.3 Filtering

Filtering removes all rows whose values do not meet certain criteria.

6.3.1 Filter Using Indexing

Square bracket notation can be used in a similar way to the “Select Certain Rows” section above. This time, however, instead of knowing what rows we are interested in beforehand and simply selecting them, we only know the criteria we are interested in and so need to filter for that. For example, if I want to get all songs with more than 1,500 million streams my criteria is that the value in the ‘Streams_Millions’ column must be greater than 1,500:

# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]
print(df)
##   Rank              Song           Artist    Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 1    1      Shape of You       Ed Sheeran         <NA>    Warner             2449   06/01/2017           14
## 2    2          Rockstar      Post Malone    21 Savage Universal             1851   15/09/2017           17
## 3    3         One Dance            Drake Wizkid, Kyla Universal             1824   05/04/2016           14
## 4    4            Closer The Chainsmokers       Halsey      Sony             1734   29/07/2016           11
## 5    5 Thinking Out Loud       Ed Sheeran         <NA>    Warner             1504   20/06/2014            2

The expression spotify$Streams_Millions > 1500 looks at all values in the Streams_Millions column and evaluates to either TRUE or FALSE depending on whether each value is larger than 1,500 or not. The result is a ‘Boolean vector’ - a vector of Booleans corresponding to which rows in the column meet your criteria:

# For each column, is the number of streams greater than 1,500 million?
vc <- spotify$Streams_Millions > 1500
print(vc)
##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

This vector can then be used to index the full data frame using square bracket notation, but remember that this notation requires you to enter both the rows and columns you are interested in. Because we want all the columns we can leave this blank, which is why there is a comma followed by just a blank space:

# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]

If you forget this comma it will assume you are filtering by column, not by row. Thus, the following code will use the Boolean vector created by the colnames(spotify) == "Streams_Millions" expression to filter for columns called “Streams_Millions”:

# Return all columns with a certain name
df <- spotify[colnames(spotify) == "Streams_Millions"]
print(df)
##    Streams_Millions
## 1              2449
## 2              1851
## 3              1824
## 4              1734
## 5              1504
## 6              1495
## 7              1407
## 8              1400
## 9              1366
## 10             1360

Filter for multiple values that are listed in a vector using the %in% operator:

# Return all rows that match the given stream counts
streams <- c(1851, 1734, 1495)
df <- spotify[df$Streams_Millions %in% streams, ]
print(df)
##   Rank       Song           Artist Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 2    2   Rockstar      Post Malone 21 Savage Universal             1851   15/09/2017           17
## 4    4     Closer The Chainsmokers    Halsey      Sony             1734   29/07/2016           11
## 6    6 God's Plan            Drake      <NA> Universal             1495   19/01/2018           11

6.3.2 Filter Using subset()

You can filter out rows that do not meet a single condition, eg if I am only interested in Ed Sheeran (which I am <3):

df <- subset(spotify, Artist == "Ed Sheeran")

Or you can filter out rows that do not meet multiple conditions, eg if I am only interested in Ed’s most popular songs:

df <- subset(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)

You can keep all rows that meet any one of two+ conditions, eg if I am interested in Ed and all popular songs:

df <- subset(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)

Or you can keep all rows that meet one condition or another condition BUT NOT BOTH (ie you can apply an ‘exclusive or’ condition), eg if I am interested in Post Malone’s songs and songs which feature Swae Lee, but happen to not like Sunflower:

df <- subset(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))

6.3.3 Filter Using filter() from the dplyr library

You can do the exact same as above with the filter() function:

library(dplyr)

df <- filter(spotify, Artist == "Ed Sheeran")
df <- filter(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)
df <- filter(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)
df <- filter(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))

6.4 Counting

How many cells, columns or rows meet certain criteria?

6.4.1 Count Rows with (or without) Missing Data

How many rows have no featuring artist and have never been at number 1?

bool <- (is.na(spotify$Featuring) & is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 2

How many rows have a featuring artist and have been at number 1?

bool <- (!is.na(spotify$Featuring) & !is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 5

6.4.2 Find the Proportion of Rows That Meet a Criteria

What proportion of the top 10 songs have over 1 700 million streams?

df <- subset(spotify, Streams_Millions >= 1700)
proportion <- nrow(df) / nrow(spotify)
percentage <- nrow(df) / nrow(spotify) * 100
sprintf(
    "Number of songs with over 1 700 million streams: %s (%3.1f or %3.1f%%)",
    nrow(df), proportion, percentage
)
## [1] "Number of songs with over 1 700 million streams: 4 (0.4 or 40.0%)"

7 Cleaning

Data cleaning is done to handle incomplete or partially incorrect datasets:

  • An incomplete dataset will have NULL values in some places
  • A partially incorrect dataset might have misspelled column headings or values

7.1 Remove Rows with Missing Data

To remove all songs in the data frame that do not have a featuring artist (ie their value in the “Featuring” column is <NA>):

  • Take the column called “Featuring” from the spotify data frame: spotify["Featuring"].
  • Ask the question “Is the value equal to <NA>?” for each row in that column. This is done by the is.na() function.
  • We want to find the rows that do NOT contain a null, but the function is.na() tells us which ones DO contain a null. So we need to inverse the question. This is done with the ‘not’ operator, which is an exclamation mark: !()
  • Select only the rows from the spotify data frame that passed this logical test (ie those that do not contain a null) but include all columns: spotify[<rows>, <columns>]
  • Assign the output to a new data frame df
df <- spotify[!(is.na(spotify["Featuring"])), ]
print(df)
##   Rank      Song           Artist    Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 2    2  Rockstar      Post Malone    21 Savage Universal             1851   15/09/2017           17
## 3    3 One Dance            Drake Wizkid, Kyla Universal             1824   05/04/2016           14
## 4    4    Closer The Chainsmokers       Halsey      Sony             1734   29/07/2016           11
## 7    7    Havana   Camila Cabello   Young Thug      Sony             1407   03/08/2017            1
## 8    8 Sunflower      Post Malone     Swae Lee Universal             1400   18/10/2018            2

As you can see, 5 of the top 10 most-streamed songs on Spotify have a featuring artist.

7.2 Find Missing Data

Sometimes you don’t want to remove missing data but rather analyse it. For example, is there a music label which releases songs without featuring artists significantly more often than other labels? In order to answer this question you could group the data by “Label” then search for more than 2 instances of missing “Featuring Artist” data in their songs:

library(dplyr)

# Find missing data
df <- group_by(spotify, Label)
df <- summarize(df, "No Featuring Artists" = ifelse(sum(is.na(Featuring)) >= 2, TRUE, FALSE))
print(df)
## # A tibble: 3 x 2
##   Label     `No Featuring Artists`
##   <chr>     <lgl>                 
## 1 Sony      FALSE                 
## 2 Universal FALSE                 
## 3 Warner    TRUE

7.3 Standardise Column Headings

If you are working with multiple data frames it might be the case that they each have slightly different column headings despite having the same type of information. This could arise, for example, due to typos if the data was created manually. It’s good practice to decide on a standard format and use that consistently: for example, you might choose to use full stops in column headings instead of underscores to break up works:

df <- spotify

names(df) <- gsub("_", "\\.", names(df))
print(names(df))
## [1] "Rank"             "Song"             "Artist"           "Featuring"        "Label"            "Streams.Millions"
## [7] "Release.Date"     "Weeks.at.no1"

The columns “Streams_Millions”, “Release_Date” and “Weeks_at_no1” have been changed to “Streams.Millions”, “Release.Date” and “Weeks.at.no1”.

7.4 Change Data Types of Columns

If you want to convert the data type of the elements in a column you can use the as.xxx() functions. For example:

  • as.character() converts a column’s elements to strings
  • as.numeric() converts them to numbers

Let’s test this out by converting the “Weeks at no 1” column from numerical data (specifically, double-precision floats) to text data:

print(typeof(spotify$Weeks_at_no1))
## [1] "double"
spotify$Weeks_at_no1 <- as.character(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "character"

And now convert it back to numerical data:

spotify$Weeks_at_no1 <- as.numeric(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "double"

This can be done for multiple columns at a time by using the lappy() function. To demonstrate this, let’s repeat the above conversions for the “Streams Millions” column in addition to the “Weeks at no 1” column:

print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"
cols <- c("Weeks_at_no1", "Streams_Millions")
spotify[cols] <- lapply(spotify[cols], as.character)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "character" "character"

Now convert them back to double-precision floats:

spotify[cols] <- lapply(spotify[cols], as.numeric)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"

8 Sorting

This section is pretty self-explanatory.

8.1 Ascending Order

Sort alphabetically by artist name using order() together with indexing:

df <- spotify[order(spotify$Artist),]
print(df)
##    Rank                 Song           Artist    Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 7     7               Havana   Camila Cabello   Young Thug      Sony             1407   03/08/2017            1
## 3     3            One Dance            Drake Wizkid, Kyla Universal             1824   05/04/2016           14
## 6     6           God's Plan            Drake         <NA> Universal             1495   19/01/2018           11
## 1     1         Shape of You       Ed Sheeran         <NA>    Warner             2449   06/01/2017           14
## 5     5    Thinking Out Loud       Ed Sheeran         <NA>    Warner             1504   20/06/2014            2
## 9     9              Perfect       Ed Sheeran         <NA>    Warner             1366   03/03/2017           NA
## 10   10 Say You Won't Let Go     James Arthur         <NA>      Sony             1360   09/09/2016           NA
## 2     2             Rockstar      Post Malone    21 Savage Universal             1851   15/09/2017           17
## 8     8            Sunflower      Post Malone     Swae Lee Universal             1400   18/10/2018            2
## 4     4               Closer The Chainsmokers       Halsey      Sony             1734   29/07/2016           11

Sort twice - first alphabetically by artist name, then in order of number of streams - using arrange() from dplyr:

library(dplyr)

df <- arrange(spotify, Artist, Streams_Millions)

8.2 Descending Order

Sort into reverse alphabetical order using the decreasing = TRUE option:

df <- spotify[order(spotify$Artist, decreasing = TRUE),]
print(df)
##    Rank                 Song           Artist    Featuring     Label Streams_Millions Release_Date Weeks_at_no1
## 4     4               Closer The Chainsmokers       Halsey      Sony             1734   29/07/2016           11
## 2     2             Rockstar      Post Malone    21 Savage Universal             1851   15/09/2017           17
## 8     8            Sunflower      Post Malone     Swae Lee Universal             1400   18/10/2018            2
## 10   10 Say You Won't Let Go     James Arthur         <NA>      Sony             1360   09/09/2016           NA
## 1     1         Shape of You       Ed Sheeran         <NA>    Warner             2449   06/01/2017           14
## 5     5    Thinking Out Loud       Ed Sheeran         <NA>    Warner             1504   20/06/2014            2
## 9     9              Perfect       Ed Sheeran         <NA>    Warner             1366   03/03/2017           NA
## 3     3            One Dance            Drake Wizkid, Kyla Universal             1824   05/04/2016           14
## 6     6           God's Plan            Drake         <NA> Universal             1495   19/01/2018           11
## 7     7               Havana   Camila Cabello   Young Thug      Sony             1407   03/08/2017            1

Sort into reverse alphabetical order using arrange() and desc() from dplyr:

library(dplyr)

df <- arrange(spotify, desc(Artist))

9 Creating New Columns From Existing Columns

9.1 Concatenate Strings

spotify["Song and Artist"] <- paste(spotify$Song, "by", spotify$Artist)
print(spotify[c("Song", "Artist", "Song and Artist")])
##                    Song           Artist                      Song and Artist
## 1          Shape of You       Ed Sheeran           Shape of You by Ed Sheeran
## 2              Rockstar      Post Malone              Rockstar by Post Malone
## 3             One Dance            Drake                   One Dance by Drake
## 4                Closer The Chainsmokers           Closer by The Chainsmokers
## 5     Thinking Out Loud       Ed Sheeran      Thinking Out Loud by Ed Sheeran
## 6            God's Plan            Drake                  God's Plan by Drake
## 7                Havana   Camila Cabello             Havana by Camila Cabello
## 8             Sunflower      Post Malone             Sunflower by Post Malone
## 9               Perfect       Ed Sheeran                Perfect by Ed Sheeran
## 10 Say You Won't Let Go     James Arthur Say You Won't Let Go by James Arthur

9.2 Perform Calculations

To estimate the number of days each song was at number one for, multiple the number of weeks it was at number one by 7:

df <- mutate(spotify, Days.at.no1 = Weeks_at_no1 * 7)
print(df[c("Song", "Artist", "Weeks_at_no1", "Days.at.no1")])
##                    Song           Artist Weeks_at_no1 Days.at.no1
## 1          Shape of You       Ed Sheeran           14          98
## 2              Rockstar      Post Malone           17         119
## 3             One Dance            Drake           14          98
## 4                Closer The Chainsmokers           11          77
## 5     Thinking Out Loud       Ed Sheeran            2          14
## 6            God's Plan            Drake           11          77
## 7                Havana   Camila Cabello            1           7
## 8             Sunflower      Post Malone            2          14
## 9               Perfect       Ed Sheeran           NA          NA
## 10 Say You Won't Let Go     James Arthur           NA          NA

10 Iterate over a Data Frame

10.1 Iterate over a Column

Sometimes, instead of performing an operation on an entire column you want to go row-by-row and do something different in each. For this we can iterate over all values in order. For example, if we want to get the average number of streams for each artist:

# Initialise a new vector to contain the output
average_streams <- vector()
# Get a list of all the artists in the data frame
artists <- unique(spotify["Artist"])
# Iterate over the Artist column
for (i in 1:seq_along(artists)) {
    # For each artist in the data frame:
    artist <- artists[i, "Artist"]

    # Option 1 (doesn't use dplyr)
    this_artist_only <- spotify[spotify$Artist == artist, ]
    # Option 2 (uses dplyr)
    this_artist_only <- filter(spotify, Artist == artist)

    average_streams_this_artist <- mean(this_artist_only[, "Streams_Millions"])
    average_streams <- c(average_streams, average_streams_this_artist)
}

# Combine the results into a new data frame
average_streams_df <- data.frame(
    Artist = artists,
    Average.Streams_Millions = average_streams
)

Now let’s take a look at the results:

kable_input <- kable(average_streams_df)
kable_styling(
    kable_input, bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F
)
Artist Average.Streams_Millions
1 Ed Sheeran 1773
2 Post Malone 1773
3 Drake 1773
4 The Chainsmokers 1773
7 Camila Cabello 1773
10 James Arthur 1773

10.2 Iterate over a Column in Place

In this example we will iterate over the cells in a column without extracting the column from the data frame it is in:

for (i in 1:nrow(spotify)) {
    artist <- spotify[i, "Artist"]
    if (is.na(spotify[i, "Featuring"])) {
        spotify[i, "Performed By"] <- artist
    } else {
        featuring <- spotify[i, "Featuring"]
        spotify[i, "Performed By"] <- sprintf("%s ft %s", artist, featuring)
    }
}
print(spotify[["Performed By"]])
##  [1] "Ed Sheeran"                   "Post Malone ft 21 Savage"     "Drake ft Wizkid, Kyla"       
##  [4] "The Chainsmokers ft Halsey"   "Ed Sheeran"                   "Drake"                       
##  [7] "Camila Cabello ft Young Thug" "Post Malone ft Swae Lee"      "Ed Sheeran"                  
## [10] "James Arthur"

10.3 Iterate over Two Columns

# Initialise new vectors to contain the output
groups <- vector()
averages <- vector()

# Set the variables you are looking at
variables <- c("Rank", "Streams_Millions")

# Iterate over the variables vector
for (j in variables) {
    # Iterate over the Study column
    artists <- unique(df["Artist"])
    for (i in seq_along(artists)) {
        artist <- artists[i, "Artist"]

        # Option 1 (doesn't use dplyr)
        this_artist_only <- spotify[spotify$Artist == artist, ]
        # Option 2 (uses dplyr)
        this_artist_only <- filter(spotify, Artist == artist)

        average <- mean(this_artist_only[, j])
        # Update output columns
        groups <- c(groups, as.character(artist))
        averages <- c(averages, average)
    }
}

# Combine
df_out <- data.frame(
    "group" <- groups,
    "average" <- averages
)
kable_input <- kable(df_out)
kable_styling(
    kable_input, bootstrap_options = c("striped", "hover", "condensed"),
    full_width = F
)
X.group…..groups X.average…..averages
Ed Sheeran 5
Ed Sheeran 1773

11 Summary

  • Data frames hold tabular data
    • The have rows and columns
    • Each row and column has a name and an index
    • The columns are actually vectors
  • Information can be added and deleted
  • Data frames can be indexed, cleaned, searched, filtered, sorted, etc
  • Mathematical and programmatic operations can be performed on whole columns at a time
  • All this versatility makes data frames the most important data type in R

⇦ Back