Data frames are the most important data type in R. The power that R has as a programming language and its usefulness in statistics and data analysis comes from that fact that it has these. They allow calculations to be rapidly performed on data that is tabular in nature, ie data that can be represented as rows and columns (like Excel, but more powerful). The reason this is possible is because:
A data frame just is a fancy table
It also has rows and columns and, often, data frames are created by
importing data from Excel. A more manual way to create them is by using
the data.frame()
function, which turns vectors into the
columns of a ‘table’. Here’s an example using the top 10 most-streamed
songs on Spotify:
spotify <- data.frame(
Rank = c(1:10),
Song = c(
"Shape of You", "Rockstar", "One Dance", "Closer", "Thinking Out Loud",
"God's Plan", "Havana", "Sunflower", "Perfect", "Say You Won't Let Go"
),
Artist = c(
"Ed Sheeran", "Post Malone", "Drake", "The Chainsmokers", "Ed Sheeran",
"Drake", "Camila Cabello", "Post Malone", "Ed Sheeran", "James Arthur"
),
Featuring = c(
NA, "21 Savage", "Wizkid, Kyla", "Halsey", NA, NA, "Young Thug",
"Swae Lee", NA, NA
),
Label = c(
"Warner", "Universal", "Universal", "Sony", "Warner", "Universal",
"Sony", "Universal", "Warner", "Sony"
),
Streams_Millions = c(
2449, 1851, 1824, 1734, 1504, 1495, 1407, 1400, 1366, 1360
),
Release_Date = c(
"06/01/2017", "15/09/2017", "05/04/2016", "29/07/2016", "20/06/2014",
"19/01/2018", "03/08/2017", "18/10/2018", "03/03/2017", "09/09/2016"
),
Weeks_at_no1 = c(14, 17, 14, 11, 2, 11, 1, 2, NA, NA)
)
As you can see above, eight vectors were created using the
c()
function and each was given a name. This creates a data
frame that looks like this:
Rank | Song | Artist | Featuring | Label | Streams_Millions | Release_Date | Weeks_at_no1 |
---|---|---|---|---|---|---|---|
1 | Shape of You | Ed Sheeran | NA | Warner | 2449 | 06/01/2017 | 14 |
2 | Rockstar | Post Malone | 21 Savage | Universal | 1851 | 15/09/2017 | 17 |
3 | One Dance | Drake | Wizkid, Kyla | Universal | 1824 | 05/04/2016 | 14 |
4 | Closer | The Chainsmokers | Halsey | Sony | 1734 | 29/07/2016 | 11 |
5 | Thinking Out Loud | Ed Sheeran | NA | Warner | 1504 | 20/06/2014 | 2 |
6 | God’s Plan | Drake | NA | Universal | 1495 | 19/01/2018 | 11 |
7 | Havana | Camila Cabello | Young Thug | Sony | 1407 | 03/08/2017 | 1 |
8 | Sunflower | Post Malone | Swae Lee | Universal | 1400 | 18/10/2018 | 2 |
9 | Perfect | Ed Sheeran | NA | Warner | 1366 | 03/03/2017 | NA |
10 | Say You Won’t Let Go | James Arthur | NA | Sony | 1360 | 09/09/2016 | NA |
Each row and column has a name (the ‘row heading’ or the ‘column heading’) and a number (the ‘index’). So the words “One Dance” are in the ‘Song’ column which is also known as column 2 (ie the column’s heading is ‘Song’ and the column’s index is 2) and they are in row 3. By default, the rows’ names are the same as their indices. You can use either the name or the index when referring to a specific row/column.
Often data frames can be very large, so you don’t want to (or maybe
you can’t) view the entire thing at once. Use head()
to
just see the first few rows and tail()
to just see the last
few:
head(spotify)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
You can specify exactly how many rows you see with these functions,
eg head(spotify, 15)
.
There are various other functions that can give you an overview of the data that is in a data frame, which may or may not be useful depending on what type of data you have:
Function | Returns |
---|---|
colnames(df) |
The column names of a data frame |
rownames(df) |
The row names of a data frame |
ncol(df) |
Number of columns |
nrow(df) |
Number of rows |
describeBy(df) |
Descriptive statistics |
summary(df) |
Results of various model fitting functions |
Append a vector as a new column using cbind()
. In this
case we’re appending the name of the album each song came from:
df <- cbind(
spotify, Album = c(
"÷", "Beerbongs & Bentleys", "Views", "Collage", "×", "Scorpion",
"Camila", "Hollywood's Bleeding", "÷", "Back from the Edge"
)
)
We can remove columns in a few different ways:
# Delete a specific column using its name
df$weeks <- NULL
# Delete multiple columns by listing their names using the subset() function
df <- subset(spotify, select = -c(Rank, Song, Artist))
# Delete multiple columns by listing their names using the select() function
# from the dplyr package
library(dplyr)
df <- select(spotify, -c("Rank", "Song", "Artist"))
# Delete multiple columns by listing their names using indexing
df <- spotify[!names(spotify) %in% c("Rank", "Song", "Artist")]
# Delete columns using their indices
df <- spotify[-c(1, 3:6, 12)]
First let’s remind ourselves of what the column names are:
print(colnames(spotify))
## [1] "Rank" "Song" "Artist" "Featuring" "Label" "Streams_Millions"
## [7] "Release_Date" "Weeks_at_no1"
Now use the rename()
function from the
dplyr
package. The format to remember with this is
"New" = "Old"
:
# Rename the "Artist" column to "Artist Name"
df <- rename(spotify, "Artist Name" = "Artist")
# Rename the "Song" column to "Song Name"
df <- rename(df, "Song Name" = "Song")
# Rename the "Featuring" column to "ft"
df <- rename(df, "ft" = "Featuring")
Column 5 is called ‘Label’ but really it contains the name of the music group that owns that artist’s record label. So let’s change the name of column 5 to “Group”:
colnames(df)[5] <- "Group"
Let’s rename the column called “Streams_Millions” to “Millions_of_Streams”. Let’s do this by first looking up the index of the column that’s called “Streams_Millions” (it is column 6) and then renaming column 6 to “Millions_of_Streams”.
idx <- which(names(df) == "Streams_Millions")
colnames(df)[idx] <- "Millions_of_Streams"
Check out the new names:
print(colnames(df))
## [1] "Rank" "Song Name" "Artist Name" "ft" "Group"
## [6] "Millions_of_Streams" "Release_Date" "Weeks_at_no1"
Given the pre-installed iris
data frame:
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Sum all the values in the columns without grouping by
selecting the numeric columns and using colSums()
to add up
all the values in each:
cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
colSums(iris[, cols])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 876.5 458.6 563.7 179.9
Sum all the values in the columns with grouping by selecting
the numeric columns, specifying the groups in the data and using
rowsum()
to add up all the values in each numeric column
for each group. In this example, the groups are the Species
names:
rowsum(iris[, cols], iris$Species)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 250.3 171.4 73.1 12.3
## versicolor 296.8 138.5 213.0 66.3
## virginica 329.4 148.7 277.6 101.3
To add a new row (or rows) create a new data frame which has the same column headings as your current data frame and include the new information you want to add:
new_row <- data.frame(
Rank = 11,
Song = "Love Yourself",
Artist = "Justin Bieber",
Featuring = NA,
Label = "Universal",
Streams_Millions = 1357,
Release_Date = "09/11/2015",
Weeks_at_no1 = 3
)
Use rbind()
to do the actual appending:
df <- rbind(spotify, new_row)
# Delete multiple rows by index
df <- spotify[-c(2, 4, 6), ]
# Delete the first row
df <- spotify[-1, ]
Note how the above notation works:
c()
indicates that you are
‘subtracting’ (deleting) the named columns# Change the row names to be sequential numbers
rownames(df) <- NULL
rownames(df) <- seq_len(nrow(df))
Given the pre-installed randu
data frame:
head(randu)
## x y z
## 1 0.000031 0.000183 0.000824
## 2 0.044495 0.155732 0.533939
## 3 0.822440 0.873416 0.838542
## 4 0.322291 0.648545 0.990648
## 5 0.393595 0.826873 0.418881
## 6 0.309097 0.926590 0.777664
Sum all the values in each row using the rowSums()
function:
rowSums(head(randu))
## 1 2 3 4 5 6
## 0.001038 0.734166 2.534398 1.961484 1.639349 2.013351
If you know exactly which row(s) and column(s) you want, you can index a data frame in order to select only that portion of the whole object. This can be done in one of three ways:
[ ]
$
This is essentially the same process as deleting columns, but this time you are choosing which columns to keep instead of which ones to get rid of. Again, this can be done using a function:
# Select columns by name using the subset() function
df <- subset(spotify, select = c("Rank", "Song", "Artist"))
# Select columns by name using the select() function from the dplyr package
df <- select(spotify, c("Rank", "Song", "Artist"))
…or by using square bracket notation:
# Select columns by name using indexing
df <- spotify[c("Rank", "Song", "Artist")]
# Select columns by index
df <- spotify[c(1:3)]
# Select columns by matching names to a list
df <- spotify[names(spotify) %in% c("Rank", "Song", "Artist")]
…or by using dollar sign notation:
# Select a single column by using the dollar sign notation
df <- spotify$Song
Dollar sign notation can only be used to select COLUMNS. To select rows, see below:
When you use square brackets to index a data frame, R first selects
the columns you asked for and then selects the rows. Thus, if you only
provide one piece of information, R assumes you are asking for columns.
So this code: spotify[c(1:3)]
will select COLUMNS 1, 2 and
3 because you only provided one vector. If you want to select ROWS 1, 2
and 3, you need to provide two pieces of information: the columns that
you want (in this case, all of them) and the rows that you want (1, 2
and 3) in the format
spotify[<rows>, <columns>]
:
# Select certain rows by index
df <- spotify[c(1, 2, 3), ]
Notice that, because we wanted all the columns, we left the
<columns>
portion of the indexing blank. But we still
needed to include the blank portion because otherwise R would have
assumed we were asking for columns! In summary:
spotify[c(1:3)]
selects COLUMNS 1 to 3spotify[c(1:3),]
selects ROWS 1 to 3Looking up a value in a particular cell in the table is essentially a
combination of the above: selecting a certain column and a certain row
and seeing what is at their intersection. Again, the format to use is
spotify[<rows>, <columns>]
:
# Select a single cell
df <- spotify[4, "Song"]
print(df)
## [1] "Closer"
Notice that, despite the fact you’ve only selected one cell, R remembers the other values in the column and prints them as ‘Levels’ in the output above. That is because the single cell you’ve selected is still a data frame, it’s just a very small one, and so all the ‘structure’ of the original object is still there. To remove this, explicitly convert the cell’s contents to a character object:
# Select a single cell
df <- as.character(spotify$"Song"[4])
print(df)
## [1] "Closer"
See how only the name of the song is printed. Also note that a
slightly different method of indexing was used in that second example:
the doller sign $
is used to select an entire column before
the square brackets are used to select a single element of that
column.
While indexing/selecting is something you do when you know exactly which row(s) and column(s) you want, sometimes you do NOT know exact what you want. Instead, you first need to search through the data frame to find the information you are looking for. Then you can subset the data frame to filter out any information you don’t want:
Does a column contain a certain value? For example, is there a song by The Chainsmokers in the Spotify top 10?
# Does a column contain a certain value?
bool <- "The Chainsmokers" %in% spotify$"Artist"
print(bool)
## [1] TRUE
You can also search an entire column at once, ie search each row. For example, for each artist in the top 10, are they signed to a Universal Music Group label?
# Does the rows contain a certain value?
bool <- (spotify$Label == "Universal")
print(bool)
## [1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
Instead of searching through the data in a column you can also search through the names of the columns to see if the one you are looking for has been created yet. For example, has a column been created that contains the number of weeks a song was at number one for?
# Is there a column with a certain name?
bool <- "Weeks_at_no1" %in% colnames(spotify)
print(bool)
## [1] TRUE
‘Finding’ is similar to ‘searching’ except you get the locations of the matches, not just whether or not they exist. For example, what position in the top 10 is Camila Cabello’s song “Havana”?
idx <- match("Havana", spotify$"Song")
print(idx)
## [1] 7
Filtering removes all rows whose values do not meet certain criteria.
Square bracket notation can be used in a similar way to the “Select Certain Rows” section above. This time, however, instead of knowing what rows we are interested in beforehand and simply selecting them, we only know the criteria we are interested in and so need to filter for that. For example, if I want to get all songs with more than 1,500 million streams my criteria is that the value in the ‘Streams_Millions’ column must be greater than 1,500:
# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
The expression spotify$Streams_Millions > 1500
looks
at all values in the Streams_Millions column and evaluates to either
TRUE
or FALSE
depending on whether each value
is larger than 1,500 or not. The result is a ‘Boolean vector’ - a vector
of Booleans corresponding to which rows in the column meet your
criteria:
# For each column, is the number of streams greater than 1,500 million?
vc <- spotify$Streams_Millions > 1500
print(vc)
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
This vector can then be used to index the full data frame using square bracket notation, but remember that this notation requires you to enter both the rows and columns you are interested in. Because we want all the columns we can leave this blank, which is why there is a comma followed by just a blank space:
# Return all rows where the number of streams is greater than 1,500 million and all columns
df <- spotify[spotify$Streams_Millions > 1500, ]
If you forget this comma it will assume you are filtering by
column, not by row. Thus, the following code will use the
Boolean vector created by the
colnames(spotify) == "Streams_Millions"
expression to
filter for columns called “Streams_Millions”:
# Return all columns with a certain name
df <- spotify[colnames(spotify) == "Streams_Millions"]
print(df)
## Streams_Millions
## 1 2449
## 2 1851
## 3 1824
## 4 1734
## 5 1504
## 6 1495
## 7 1407
## 8 1400
## 9 1366
## 10 1360
Filter for multiple values that are listed in a vector using
the %in%
operator:
# Return all rows that match the given stream counts
streams <- c(1851, 1734, 1495)
df <- spotify[df$Streams_Millions %in% streams, ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
subset()
You can filter out rows that do not meet a single condition, eg if I am only interested in Ed Sheeran (which I am <3):
df <- subset(spotify, Artist == "Ed Sheeran")
Or you can filter out rows that do not meet multiple conditions, eg if I am only interested in Ed’s most popular songs:
df <- subset(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)
You can keep all rows that meet any one of two+ conditions, eg if I am interested in Ed and all popular songs:
df <- subset(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)
Or you can keep all rows that meet one condition or another condition BUT NOT BOTH (ie you can apply an ‘exclusive or’ condition), eg if I am interested in Post Malone’s songs and songs which feature Swae Lee, but happen to not like Sunflower:
df <- subset(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))
filter()
from the dplyr
libraryYou can do the exact same as above with the filter()
function:
library(dplyr)
df <- filter(spotify, Artist == "Ed Sheeran")
df <- filter(spotify, Artist == "Ed Sheeran" & Streams_Millions >= 1500)
df <- filter(spotify, Artist == "Ed Sheeran" | Streams_Millions >= 1500)
df <- filter(spotify, xor(Artist == "Post Malone", Featuring == "Swae Lee"))
How many cells, columns or rows meet certain criteria?
How many rows have no featuring artist and have never been at number 1?
bool <- (is.na(spotify$Featuring) & is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 2
How many rows have a featuring artist and have been at number 1?
bool <- (!is.na(spotify$Featuring) & !is.na(spotify$Weeks_at_no1))
count <- length(bool[bool == TRUE])
print(count)
## [1] 5
What proportion of the top 10 songs have over 1 700 million streams?
df <- subset(spotify, Streams_Millions >= 1700)
proportion <- nrow(df) / nrow(spotify)
percentage <- nrow(df) / nrow(spotify) * 100
sprintf(
"Number of songs with over 1 700 million streams: %s (%3.1f or %3.1f%%)",
nrow(df), proportion, percentage
)
## [1] "Number of songs with over 1 700 million streams: 4 (0.4 or 40.0%)"
Data cleaning is done to handle incomplete or partially incorrect datasets:
To remove all songs in the data frame that do not have a featuring
artist (ie their value in the “Featuring” column is
<NA>
):
spotify["Featuring"]
.<NA>
?”
for each row in that column. This is done by the is.na()
function.is.na()
tells us which ones DO contain a null. So
we need to inverse the question. This is done with the ‘not’ operator,
which is an exclamation mark: !()
spotify[<rows>, <columns>]
df
df <- spotify[!(is.na(spotify["Featuring"])), ]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
As you can see, 5 of the top 10 most-streamed songs on Spotify have a featuring artist.
Sometimes you don’t want to remove missing data but rather analyse it. For example, is there a music label which releases songs without featuring artists significantly more often than other labels? In order to answer this question you could group the data by “Label” then search for more than 2 instances of missing “Featuring Artist” data in their songs:
library(dplyr)
# Find missing data
df <- group_by(spotify, Label)
df <- summarize(df, "No Featuring Artists" = ifelse(sum(is.na(Featuring)) >= 2, TRUE, FALSE))
print(df)
## # A tibble: 3 x 2
## Label `No Featuring Artists`
## <chr> <lgl>
## 1 Sony FALSE
## 2 Universal FALSE
## 3 Warner TRUE
If you are working with multiple data frames it might be the case that they each have slightly different column headings despite having the same type of information. This could arise, for example, due to typos if the data was created manually. It’s good practice to decide on a standard format and use that consistently: for example, you might choose to use full stops in column headings instead of underscores to break up works:
df <- spotify
names(df) <- gsub("_", "\\.", names(df))
print(names(df))
## [1] "Rank" "Song" "Artist" "Featuring" "Label" "Streams.Millions"
## [7] "Release.Date" "Weeks.at.no1"
The columns “Streams_Millions”, “Release_Date” and “Weeks_at_no1” have been changed to “Streams.Millions”, “Release.Date” and “Weeks.at.no1”.
If you want to convert the data type of the elements in a column you
can use the as.xxx()
functions. For example:
as.character()
converts a column’s elements to
stringsas.numeric()
converts them to numbersLet’s test this out by converting the “Weeks at no 1” column from numerical data (specifically, double-precision floats) to text data:
print(typeof(spotify$Weeks_at_no1))
## [1] "double"
spotify$Weeks_at_no1 <- as.character(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "character"
And now convert it back to numerical data:
spotify$Weeks_at_no1 <- as.numeric(spotify$Weeks_at_no1)
print(typeof(spotify$Weeks_at_no1))
## [1] "double"
This can be done for multiple columns at a time by using the
lappy()
function. To demonstrate this, let’s repeat the
above conversions for the “Streams Millions” column in addition to the
“Weeks at no 1” column:
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"
cols <- c("Weeks_at_no1", "Streams_Millions")
spotify[cols] <- lapply(spotify[cols], as.character)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "character" "character"
Now convert them back to double-precision floats:
spotify[cols] <- lapply(spotify[cols], as.numeric)
print(c(typeof(spotify$Weeks_at_no1), typeof(spotify$Streams_Millions)))
## [1] "double" "double"
This section is pretty self-explanatory.
Sort alphabetically by artist name using order()
together with indexing:
df <- spotify[order(spotify$Artist),]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 9 9 Perfect Ed Sheeran <NA> Warner 1366 03/03/2017 NA
## 10 10 Say You Won't Let Go James Arthur <NA> Sony 1360 09/09/2016 NA
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
Sort twice - first alphabetically by artist name, then in order of
number of streams - using arrange()
from dplyr:
library(dplyr)
df <- arrange(spotify, Artist, Streams_Millions)
Sort into reverse alphabetical order using the
decreasing = TRUE
option:
df <- spotify[order(spotify$Artist, decreasing = TRUE),]
print(df)
## Rank Song Artist Featuring Label Streams_Millions Release_Date Weeks_at_no1
## 4 4 Closer The Chainsmokers Halsey Sony 1734 29/07/2016 11
## 2 2 Rockstar Post Malone 21 Savage Universal 1851 15/09/2017 17
## 8 8 Sunflower Post Malone Swae Lee Universal 1400 18/10/2018 2
## 10 10 Say You Won't Let Go James Arthur <NA> Sony 1360 09/09/2016 NA
## 1 1 Shape of You Ed Sheeran <NA> Warner 2449 06/01/2017 14
## 5 5 Thinking Out Loud Ed Sheeran <NA> Warner 1504 20/06/2014 2
## 9 9 Perfect Ed Sheeran <NA> Warner 1366 03/03/2017 NA
## 3 3 One Dance Drake Wizkid, Kyla Universal 1824 05/04/2016 14
## 6 6 God's Plan Drake <NA> Universal 1495 19/01/2018 11
## 7 7 Havana Camila Cabello Young Thug Sony 1407 03/08/2017 1
Sort into reverse alphabetical order using arrange()
and
desc()
from dplyr:
library(dplyr)
df <- arrange(spotify, desc(Artist))
spotify["Song and Artist"] <- paste(spotify$Song, "by", spotify$Artist)
print(spotify[c("Song", "Artist", "Song and Artist")])
## Song Artist Song and Artist
## 1 Shape of You Ed Sheeran Shape of You by Ed Sheeran
## 2 Rockstar Post Malone Rockstar by Post Malone
## 3 One Dance Drake One Dance by Drake
## 4 Closer The Chainsmokers Closer by The Chainsmokers
## 5 Thinking Out Loud Ed Sheeran Thinking Out Loud by Ed Sheeran
## 6 God's Plan Drake God's Plan by Drake
## 7 Havana Camila Cabello Havana by Camila Cabello
## 8 Sunflower Post Malone Sunflower by Post Malone
## 9 Perfect Ed Sheeran Perfect by Ed Sheeran
## 10 Say You Won't Let Go James Arthur Say You Won't Let Go by James Arthur
To estimate the number of days each song was at number one for, multiple the number of weeks it was at number one by 7:
df <- mutate(spotify, Days.at.no1 = Weeks_at_no1 * 7)
print(df[c("Song", "Artist", "Weeks_at_no1", "Days.at.no1")])
## Song Artist Weeks_at_no1 Days.at.no1
## 1 Shape of You Ed Sheeran 14 98
## 2 Rockstar Post Malone 17 119
## 3 One Dance Drake 14 98
## 4 Closer The Chainsmokers 11 77
## 5 Thinking Out Loud Ed Sheeran 2 14
## 6 God's Plan Drake 11 77
## 7 Havana Camila Cabello 1 7
## 8 Sunflower Post Malone 2 14
## 9 Perfect Ed Sheeran NA NA
## 10 Say You Won't Let Go James Arthur NA NA
Sometimes, instead of performing an operation on an entire column you want to go row-by-row and do something different in each. For this we can iterate over all values in order. For example, if we want to get the average number of streams for each artist:
# Initialise a new vector to contain the output
average_streams <- vector()
# Get a list of all the artists in the data frame
artists <- unique(spotify["Artist"])
# Iterate over the Artist column
for (i in 1:seq_along(artists)) {
# For each artist in the data frame:
artist <- artists[i, "Artist"]
# Option 1 (doesn't use dplyr)
this_artist_only <- spotify[spotify$Artist == artist, ]
# Option 2 (uses dplyr)
this_artist_only <- filter(spotify, Artist == artist)
average_streams_this_artist <- mean(this_artist_only[, "Streams_Millions"])
average_streams <- c(average_streams, average_streams_this_artist)
}
# Combine the results into a new data frame
average_streams_df <- data.frame(
Artist = artists,
Average.Streams_Millions = average_streams
)
Now let’s take a look at the results:
kable_input <- kable(average_streams_df)
kable_styling(
kable_input, bootstrap_options = c("striped", "hover", "condensed"),
full_width = F
)
Artist | Average.Streams_Millions | |
---|---|---|
1 | Ed Sheeran | 1773 |
2 | Post Malone | 1773 |
3 | Drake | 1773 |
4 | The Chainsmokers | 1773 |
7 | Camila Cabello | 1773 |
10 | James Arthur | 1773 |
In this example we will iterate over the cells in a column without extracting the column from the data frame it is in:
for (i in 1:nrow(spotify)) {
artist <- spotify[i, "Artist"]
if (is.na(spotify[i, "Featuring"])) {
spotify[i, "Performed By"] <- artist
} else {
featuring <- spotify[i, "Featuring"]
spotify[i, "Performed By"] <- sprintf("%s ft %s", artist, featuring)
}
}
print(spotify[["Performed By"]])
## [1] "Ed Sheeran" "Post Malone ft 21 Savage" "Drake ft Wizkid, Kyla"
## [4] "The Chainsmokers ft Halsey" "Ed Sheeran" "Drake"
## [7] "Camila Cabello ft Young Thug" "Post Malone ft Swae Lee" "Ed Sheeran"
## [10] "James Arthur"
# Initialise new vectors to contain the output
groups <- vector()
averages <- vector()
# Set the variables you are looking at
variables <- c("Rank", "Streams_Millions")
# Iterate over the variables vector
for (j in variables) {
# Iterate over the Study column
artists <- unique(df["Artist"])
for (i in seq_along(artists)) {
artist <- artists[i, "Artist"]
# Option 1 (doesn't use dplyr)
this_artist_only <- spotify[spotify$Artist == artist, ]
# Option 2 (uses dplyr)
this_artist_only <- filter(spotify, Artist == artist)
average <- mean(this_artist_only[, j])
# Update output columns
groups <- c(groups, as.character(artist))
averages <- c(averages, average)
}
}
# Combine
df_out <- data.frame(
"group" <- groups,
"average" <- averages
)
kable_input <- kable(df_out)
kable_styling(
kable_input, bootstrap_options = c("striped", "hover", "condensed"),
full_width = F
)
X.group…..groups | X.average…..averages |
---|---|
Ed Sheeran | 5 |
Ed Sheeran | 1773 |