Data frames are the most important data type in R. The power that R has as a programming language and its usefulness in statistics and data analysis comes from that fact that it has these. They allow calculations to be rapidly performed on data that is tabular in nature, ie data that can be represented as rows and columns (like Excel, but more powerful). The reason this is possible is because:
A data frame just is a fancy table
It also has rows and columns and, often, data frames are created by importing data from Excel. A more manual way to create them is by using the data.frame()
function, which turns vectors into the columns of a ‘table’. Here’s an example using the top 10 most-streamed songs on Spotify:
spotify <- data.frame(
Rank = c(1:10),
Song = c(
"Shape of You", "Rockstar", "One Dance", "Closer", "Thinking Out Loud",
"God's Plan", "Havana", "Sunflower", "Perfect", "Say You Won't Let Go"
),
Artist = c(
"Ed Sheeran", "Post Malone", "Drake", "The Chainsmokers", "Ed Sheeran",
"Drake", "Camila Cabello", "Post Malone", "Ed Sheeran", "James Arthur"
),
Featuring = c(
NA, "21 Savage", "Wizkid, Kyla", "Halsey", NA, NA, "Young Thug",
"Swae Lee", NA, NA
),
Label = c(
"Warner", "Universal", "Universal", "Sony", "Warner", "Universal",
"Sony", "Universal", "Warner", "Sony"
),
Streams_Millions = c(
2449, 1851, 1824, 1734, 1504, 1495, 1407, 1400, 1366, 1360
),
Release_Date = c(
"06/01/2017", "15/09/2017", "05/04/2016", "29/07/2016", "20/06/2014",
"19/01/2018", "03/08/2017", "18/10/2018", "03/03/2017", "09/09/2016"
),
Weeks_at_no1 = c(14, 17, 14, 11, 2, 11, 1, 2, NA, NA)
)
As you can see above, eight vectors were created using the c()
function and each was given a name. This creates a data frame that looks like this:
Rank | Song | Artist | Featuring | Label | Streams_Millions | Release_Date | Weeks_at_no1 |
---|---|---|---|---|---|---|---|
1 | Shape of You | Ed Sheeran | NA | Warner | 2449 | 06/01/2017 | 14 |
2 | Rockstar | Post Malone | 21 Savage | Universal | 1851 | 15/09/2017 | 17 |
3 | One Dance | Drake | Wizkid, Kyla | Universal | 1824 | 05/04/2016 | 14 |
4 | Closer | The Chainsmokers | Halsey | Sony | 1734 | 29/07/2016 | 11 |
5 | Thinking Out Loud | Ed Sheeran | NA | Warner | 1504 | 20/06/2014 | 2 |
6 | God’s Plan | Drake | NA | Universal | 1495 | 19/01/2018 | 11 |
7 | Havana | Camila Cabello | Young Thug | Sony | 1407 | 03/08/2017 | 1 |
8 | Sunflower | Post Malone | Swae Lee | Universal | 1400 | 18/10/2018 | 2 |
9 | Perfect | Ed Sheeran | NA | Warner | 1366 | 03/03/2017 | NA |
10 | Say You Won’t Let Go | James Arthur | NA | Sony | 1360 | 09/09/2016 | NA |
Each row and column has a name (the ‘row heading’ or the ‘column heading’) and a number (the ‘index’). So the words “One Dance” are in the ‘Song’ column which is also known as column 2 (ie the column’s heading is ‘Song’ and the column’s index is 2) and they are in row 3. By default, the rows’ names are the same as their indices. You can use either the name or the index when referring to a specific row/column.
Often data frames can be very large, so you don’t want to (or maybe you can’t) view the entire thing at once. Use head()
to just see the first few rows and tail()
to just see the last few:
head(spotify)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
You can specify exactly how many rows you see with these functions, eg head(spotify, 15)
.
There are various other functions that can give you an overview of the data that is in a data frame, which may or may not be useful depending on what type of data you have:
Function | Returns |
---|---|
colnames(df) |
The column names of a data frame |
rownames(df) |
The row names of a data frame |
ncol(df) |
Number of columns |
nrow(df) |
Number of rows |
describeBy(df) |
Descriptive statistics |
summary(df) |
Results of various model fitting functions |
Append a vector as a new column using cbind()
. In this case we’re appending the name of the album each song came from:
df <- cbind(
spotify, Album = c(
"÷", "Beerbongs & Bentleys", "Views", "Collage", "×", "Scorpion",
"Camila", "Hollywood's Bleeding", "÷", "Back from the Edge"
)
)
We can remove columns in a few different ways:
# Delete a specific column using its name
df$weeks <- NULL
# Delete multiple columns by listing their names using the subset() function
df <- subset(spotify, select = -c(Rank, Song, Artist))
# Delete multiple columns by listing their names using the select() function
# from the dplyr package
library(dplyr)
df <- select(spotify, -c("Rank", "Song", "Artist"))
# Delete multiple columns by listing their names using indexing
df <- spotify[!names(spotify) %in% c("Rank", "Song", "Artist")]
# Delete columns using their indices
df <- spotify[-c(1, 3:6, 12)]
First let’s remind ourselves of what the column names are:
print(colnames(spotify))
## [1] "Rank" "Song" "Artist" "Featuring" "Label" "Streams_Millions"
## [7] "Release_Date" "Weeks_at_no1"
Now use the rename()
function from the dplyr
package. The format to remember with this is "New" = "Old"
:
# Rename the "Artist" column to "Artist Name"
df <- rename(spotify, "Artist Name" = "Artist")
# Rename the "Song" column to "Song Name"
df <- rename(df, "Song Name" = "Song")
# Rename the "Featuring" column to "ft"
df <- rename(df, "ft" = "Featuring")
Column 5 is called ‘Label’ but really it contains the name of the music group that owns that artist’s record label. So let’s change the name of column 5 to “Group”:
colnames(df)[5] <- "Group"
Let’s rename the column called “Streams_Millions” to “Millions_of_Streams”. Let’s do this by first looking up the index of the column that’s called “Streams_Millions” (it is column 6) and then renaming column 6 to “Millions_of_Streams”.
idx <- which(names(df) == "Streams_Millions")
colnames(df)[idx] <- "Millions_of_Streams"
Check out the new names:
print(colnames(df))
## [1] "Rank" "Song Name" "Artist Name" "ft" "Group"
## [6] "Millions_of_Streams" "Release_Date" "Weeks_at_no1"
To add a new row (or rows) create a new data frame which has the same column headings as your current data frame and include the new information you want to add:
new_row <- data.frame(
Rank = 11,
Song = "Love Yourself",
Artist = "Justin Bieber",
Featuring = NA,
Label = "Universal",
Streams_Millions = 1357,
Release_Date = "09/11/2015",
Weeks_at_no1 = 3
)
Use rbind()
to do the actual appending:
df <- rbind(spotify, new_row)
# Delete multiple rows by index
df <- spotify[-c(2, 4, 6), ]
# Delete the first row
df <- spotify[-1, ]
Note how the above notation works:
c()
indicates that you are ‘subtracting’ (deleting) the named columns# Change the row names to be sequential numbers
rownames(df) <- NULL
rownames(df) <- seq_len(nrow(df))
If you know exactly which row(s) and column(s) you want, you can index a data frame in order to select only that portion of the whole object. This can be done in one of three ways:
[ ]
$
This is essentially the same process as deleting columns, but this time you are choosing which columns to keep instead of which ones to get rid of. Again, this can be done using a function:
# Select columns by name using the subset() function
df <- subset(spotify, select = c("Rank", "Song", "Artist"))
# Select columns by name using the select() function from the dplyr package
df <- select(spotify, c("Rank", "Song", "Artist"))
…or by using square bracket notation:
# Select columns by name using indexing
df <- spotify[c("Rank", "Song", "Artist")]
# Select columns by index
df <- spotify[c(1:3)]
# Select columns by matching names to a list
df <- spotify[names(spotify) %in% c("Rank", "Song", "Artist")]
…or by using dollar sign notation:
# Select a single column by using the dollar sign notation
df <- spotify$Song
Dollar sign notation can only be used to select COLUMNS. To select rows, see below:
When you use square brackets to index a data frame, R first selects the columns you asked for and then selects the rows. Thus, if you only provide one piece of information, R assumes you are asking for columns. So this code: spotify[c(1:3)]
will select COLUMNS 1, 2 and 3 because you only provided one vector. If you want to select ROWS 1, 2 and 3, you need to provide two pieces of information: the columns that you want (in this case, all of them) and the rows that you want (1, 2 and 3) in the format spotify[<rows>, <columns>]
:
# Select certain rows by index
df <- spotify[c(1, 2, 3), ]
Notice that, because we wanted all the columns, we left the <columns>
portion of the indexing blank. But we still needed to include the blank portion because otherwise R would have assumed we were asking for columns! In summary:
spotify[c(1:3)]
selects COLUMNS 1 to 3spotify[c(1:3),]
selects ROWS 1 to 3Looking up a value in a particular cell in the table is essentially a combination of the above: selecting a certain column and a certain row and seeing what is at their intersection. Again, the format to use is spotify[<rows>, <columns>]
:
# Select a single cell
df <- spotify[4, "Song"]
print(df)
## [1] "Closer"
Notice that, despite the fact you’ve only selected one cell, R remembers the other values in the column and prints them as ‘Levels’ in the output above. That is because the single cell you’ve selected is still a data frame, it’s just a very small one, and so all the ‘structure’ of the original object is still there. To remove this, explicitly convert the cell’s contents to a character object:
# Select a single cell
df <- as.character(spotify$"Song"[4])
print(df)
## [1] "Closer"
See how only the name of the song is printed. Also note that a slightly different method of indexing was used in that second example: the doller sign $
is used to select an entire column before the square brackets are used to select a single element of that column.
While indexing/selecting is something you do when you know exactly which row(s) and column(s) you want, sometimes you do NOT know exact what you want. Instead, you first need to search through the data frame to find the information you are looking for. Then you can subset the data frame to filter out any information you don’t want:
Does a column contain a certain value? For example, is there a song by The Chainsmokers in the Spotify top 10?
# Does a column contain a certain value?
bool <- "The Chainsmokers" %in% spotify$"Artist"
print(bool)
## [1] TRUE
You can also search an entire column at once, ie search each row. For example, for each artist in the top 10, are they signed to a Universal Music Group label?
# Does the rows contain a certain value?
bool <- (spotify$Label == "Universal")
print(bool)
## [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
Instead of searching through the data in a column you can also search through the names of the columns to see if the one you are looking for has been created yet. For example, has a column been created that contains the number of weeks a song was at number one for?
# Is there a column with a certain name?
bool <- "Weeks_at_no1" %in% colnames(spotify)
print(bool)
## [1] TRUE
‘Finding’ is similar to ‘searching’ except you get the locations of the matches, not just whether or not they exist. For example, what position in the top 10 is Camila Cabello’s song “Havana”?
idx <- match("Havana", spotify$"Song")
print(idx)
## [1] 7
Filtering removes all rows whose values do not meet certain criteria.
Square bracket notation can be used in a similar way to the “Select Certain Rows” section above. This time, however, instead of knowing what rows we are interested in beforehand and simply selecting them, we only know the criteria we are interested in and so need to filter for that. For example, if I want to get all songs with more than 1,500 million streams my criteria is that the value in the ‘Streams_Millions’ column must be greater than 1,500:
# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
The expression spotify$Streams_Millions > 1500
looks at all values in the Streams_Millions column and evaluates to either TRUE
or FALSE
depending on whether each value is larger than 1,500 or not. The result is a ‘Boolean vector’ - a vector of Booleans corresponding to which rows in the column meet your criteria:
# For each column, is the number of streams greater than 1,500 million?
vc <- spotify$Streams_Millions > 1500
print(vc)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
This vector can then be used to index the full data frame using square bracket notation, but remember that this notation requires you to enter both the rows and columns you are interested in. Because we want all the columns we can leave this blank, which is why there is a comma followed by just a blank space:
# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]
If you forget this comma it will assume you are filtering by column, not by row. Thus, the following code will use the Boolean vector created by the colnames(spotify) == "Streams_Millions"
expression to filter for columns called “Streams_Millions”:
# Return all columns with a certain name
df <- spotify[colnames(spotify) == "Streams_Millions"]
print(df)
## Streams_Millions
## 1 2449
## 2 1851
## 3 1824
## 4 1734
## 5 1504
## 6 1495
## 7 1407
## 8 1400
## 9 1366
## 10 1360
Filter for multiple values that are listed in a vector using the %in%
operator:
# Return all rows that match the given stream counts
streams <- c(1851, 1734, 1495)
df <- spotify[df$Streams_Millions %in% streams, ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
subset()
You can filter out rows that do not meet a single condition, eg if I am only interested in Ed Sheeran (which I am <3):
df <- subset(spotify, Artist == "Ed Sheeran")
Or you can filter out rows that do not meet multiple conditions, eg if I am only interested in Ed’s most popular songs:
df <- subset(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)
You can keep all rows that meet any one of two+ conditions, eg if I am interested in Ed and all popular songs:
df <- subset(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)
Or you can keep all rows that meet one condition or another condition BUT NOT BOTH (ie you can apply an ‘exclusive or’ condition), eg if I am interested in Post Malone’s songs and songs which feature Swae Lee, but happen to not like Sunflower:
df <- subset(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))
filter()
from the dplyr
libraryYou can do the exact same as above with the filter()
function:
library(dplyr)
df <- filter(spotify, Artist == "Ed Sheeran")
df <- filter(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)
df <- filter(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)
df <- filter(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))
How many cells, columns or rows meet certain criteria?
How many rows have no featuring artist and have never been at number 1?
bool <- (is.na(spotify$Featuring) & is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 2
How many rows have a featuring artist and have been at number 1?
bool <- (!is.na(spotify$Featuring) & !is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 5
What proportion of the top 10 songs have over 1 700 million streams?
df <- subset(spotify, Streams_Millions >= 1700)
proportion <- nrow(df) / nrow(spotify)
percentage <- nrow(df) / nrow(spotify) * 100
sprintf(
"Number of songs with over 1 700 million streams: %s (%3.1f or %3.1f%%)",
nrow(df), proportion, percentage
)
## [1] "Number of songs with over 1 700 million streams: 4 (0.4 or 40.0%)"
Data cleaning is done to handle incomplete or partially incorrect datasets:
To remove all songs in the data frame that do not have a featuring artist (ie their value in the “Featuring” column is <NA>
):
spotify["Featuring"]
.<NA>
?” for each row in that column. This is done by the is.na()
function.is.na()
tells us which ones DO contain a null. So we need to inverse the question. This is done with the ‘not’ operator, which is an exclamation mark: !()
spotify[<rows>, <columns>]
df
df <- spotify[!(is.na(spotify["Featuring"])), ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
As you can see, 5 of the top 10 most-streamed songs on Spotify have a featuring artist.
Sometimes you don’t want to remove missing data but rather analyse it. For example, is there a music label which releases songs without featuring artists significantly more often than other labels? In order to answer this question you could group the data by “Label” then search for more than 2 instances of missing “Featuring Artist” data in their songs:
library(dplyr)
# Find missing data
df <- group_by(spotify, Label)
df <- summarize(df, "No Featuring Artists" = ifelse(sum(is.na(Featuring)) >= 2, TRUE, FALSE))
print(df)
## # A tibble: 3 x 2
## Label `No Featuring Artists`
## <chr> <lgl>
## 1 Sony FALSE
## 2 Universal FALSE
## 3 Warner TRUE
If you are working with multiple data frames it might be the case that they each have slightly different column headings despite having the same type of information. This could arise, for example, due to typos if the data was created manually. It’s good practice to decide on a standard format and use that consistently: for example, you might choose to use full stops in column headings instead of underscores to break up works:
df <- spotify
names(df) <- gsub("_", "\\.", names(df))
print(names(df))
## [1] "Rank" "Song" "Artist" "Featuring" "Label" "Streams.Millions"
## [7] "Release.Date" "Weeks.at.no1"
The columns “Streams_Millions”, “Release_Date” and “Weeks_at_no1” have been changed to “Streams.Millions”, “Release.Date” and “Weeks.at.no1”.
If you want to convert the data type of the elements in a column you can use the as.xxx()
functions. For example:
as.character()
converts a column’s elements to stringsas.numeric()
converts them to numbersLet’s test this out by converting the “Weeks at no 1” column from numerical data (specifically, double-precision floats) to text data:
print(typeof(spotify$Weeks_at_no1))
## [1] "double"
spotify$Weeks_at_no1 <- as.character(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "character"
And now convert it back to numerical data:
spotify$Weeks_at_no1 <- as.numeric(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "double"
This can be done for multiple columns at a time by using the lappy()
function. To demonstrate this, let’s repeat the above conversions for the “Streams Millions” column in addition to the “Weeks at no 1” column:
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"
cols <- c("Weeks_at_no1", "Streams_Millions")
spotify[cols] <- lapply(spotify[cols], as.character)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "character" "character"
Now convert them back to double-precision floats:
spotify[cols] <- lapply(spotify[cols], as.numeric)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"
This section is pretty self-explanatory.
Sort alphabetically by artist name using order()
together with indexing:
df <- spotify[order(spotify$Artist),]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 9 9 Perfect Ed Sheeran <NA> Warner 1366 03/03/2017 NA
## 10 10 Say You Won't Let Go James Arthur <NA> Sony 1360 09/09/2016 NA
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
Sort twice - first alphabetically by artist name, then in order of number of streams - using arrange()
from dplyr:
library(dplyr)
df <- arrange(spotify, Artist, Streams_Millions)
Sort into reverse alphabetical order using the decreasing = TRUE
option:
df <- spotify[order(spotify$Artist, decreasing = TRUE),]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
## 10 10 Say You Won't Let Go James Arthur <NA> Sony 1360 09/09/2016 NA
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 9 9 Perfect Ed Sheeran <NA> Warner 1366 03/03/2017 NA
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
Sort into reverse alphabetical order using arrange()
and desc()
from dplyr:
library(dplyr)
df <- arrange(spotify, desc(Artist))
spotify["Song and Artist"] <- paste(spotify$Song, "by", spotify$Artist)
print(spotify[c("Song", "Artist", "Song and Artist")])
## Song Artist Song and Artist
## 1 Shape of You Ed Sheeran Shape of You by Ed Sheeran
## 2 Rockstar Post Malone Rockstar by Post Malone
## 3 One Dance Drake One Dance by Drake
## 4 Closer The Chainsmokers Closer by The Chainsmokers
## 5 Thinking Out Loud Ed Sheeran Thinking Out Loud by Ed Sheeran
## 6 God's Plan Drake God's Plan by Drake
## 7 Havana Camila Cabello Havana by Camila Cabello
## 8 Sunflower Post Malone Sunflower by Post Malone
## 9 Perfect Ed Sheeran Perfect by Ed Sheeran
## 10 Say You Won't Let Go James Arthur Say You Won't Let Go by James Arthur
To estimate the number of days each song was at number one for, multiple the number of weeks it was at number one by 7:
df <- mutate(spotify, Days.at.no1 = Weeks_at_no1 * 7)
print(df[c("Song", "Artist", "Weeks_at_no1", "Days.at.no1")])
## Song Artist Weeks_at_no1 Days.at.no1
## 1 Shape of You Ed Sheeran 14 98
## 2 Rockstar Post Malone 17 119
## 3 One Dance Drake 14 98
## 4 Closer The Chainsmokers 11 77
## 5 Thinking Out Loud Ed Sheeran 2 14
## 6 God's Plan Drake 11 77
## 7 Havana Camila Cabello 1 7
## 8 Sunflower Post Malone 2 14
## 9 Perfect Ed Sheeran NA NA
## 10 Say You Won't Let Go James Arthur NA NA
Sometimes, instead of performing an operation on an entire column you want to go row-by-row and do something different in each. For this we can iterate over all values in order. For example, if we want to get the average number of streams for each artist:
# Initialise a new vector to contain the output
average_streams <- vector()
# Get a list of all the artists in the data frame
artists <- unique(spotify["Artist"])
# Iterate over the Artist column
for (i in 1:seq_along(artists)) {
# For each artist in the data frame:
artist <- artists[i, "Artist"]
# Option 1 (doesn't use dplyr)
this_artist_only <- spotify[spotify$Artist == artist, ]
# Option 2 (uses dplyr)
this_artist_only <- filter(spotify, Artist == artist)
average_streams_this_artist <- mean(this_artist_only[, "Streams_Millions"])
average_streams <- c(average_streams, average_streams_this_artist)
}
# Combine the results into a new data frame
average_streams_df <- data.frame(
Artist = artists,
Average.Streams_Millions = average_streams
)
Now let’s take a look at the results:
kable_input <- kable(average_streams_df)
kable_styling(
kable_input, bootstrap_options = c("striped", "hover", "condensed"),
full_width = F
)
Artist | Average.Streams_Millions | |
---|---|---|
1 | Ed Sheeran | 1773 |
2 | Post Malone | 1773 |
3 | Drake | 1773 |
4 | The Chainsmokers | 1773 |
7 | Camila Cabello | 1773 |
10 | James Arthur | 1773 |
In this example we will iterate over the cells in a column without extracting the column from the data frame it is in:
for (i in 1:nrow(spotify)) {
artist <- spotify[i, "Artist"]
if (is.na(spotify[i, "Featuring"])) {
spotify[i, "Performed By"] <- artist
} else {
featuring <- spotify[i, "Featuring"]
spotify[i, "Performed By"] <- sprintf("%s ft %s", artist, featuring)
}
}
print(spotify[["Performed By"]])
## [1] "Ed Sheeran" "Post Malone ft 21 Savage" "Drake ft Wizkid, Kyla"
## [4] "The Chainsmokers ft Halsey" "Ed Sheeran" "Drake"
## [7] "Camila Cabello ft Young Thug" "Post Malone ft Swae Lee" "Ed Sheeran"
## [10] "James Arthur"
# Initialise new vectors to contain the output
groups <- vector()
averages <- vector()
# Set the variables you are looking at
variables <- c("Rank", "Streams_Millions")
# Iterate over the variables vector
for (j in variables) {
# Iterate over the Study column
artists <- unique(df["Artist"])
for (i in seq_along(artists)) {
artist <- artists[i, "Artist"]
# Option 1 (doesn't use dplyr)
this_artist_only <- spotify[spotify$Artist == artist, ]
# Option 2 (uses dplyr)
this_artist_only <- filter(spotify, Artist == artist)
average <- mean(this_artist_only[, j])
# Update output columns
groups <- c(groups, as.character(artist))
averages <- c(averages, average)
}
}
# Combine
df_out <- data.frame(
"group" <- groups,
"average" <- averages
)
kable_input <- kable(df_out)
kable_styling(
kable_input, bootstrap_options = c("striped", "hover", "condensed"),
full_width = F
)
X.group…..groups | X.average…..averages |
---|---|
Ed Sheeran | 5 |
Ed Sheeran | 1773 |