This is Part 2 in a series on SQLite:

Introduction
Worked Example
Cheat Sheet

1 Open SQLite

Open a terminal and run:

$ sqlite3

SQLite version 3.45.1 2024-01-30 16:01:20
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>

As mentioned in the text, a “transient in-memory database” has been created for us to work in. This means that everything we do will be deleted when we close out of SQLite. We can make it persistent so that it can be saved to file with the following:

.open cricketers.db

We’ve used the name cricketers.db as we will be storing information about cricketers.

2 Create Tables

Create a teams table to store cricket team names:

-- Create a new table
CREATE TABLE teams (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    tournament TEXT
);

The PRIMARY KEY constraint ensures that each row’s id value is unique and can be used to identify that row - no two rows will have the same id number. Each time a new row is added, if no id is explicitly specified then SQL will provide the next highest number to be that row’s primary key. The AUTOINCREMENT command does the same thing but is more strict: if a row is ever deleted its id cannot be reused when a new row is added.

2.1 Populate the Table

We’re going to add in some cricket teams to our teams table:

-- Create and populate new rows
INSERT INTO teams(name, tournament) VALUES ('Hampshire Hawks', 'T20 Blast');
INSERT INTO teams(name, tournament) VALUES ('Sydney Sixers', 'Big Bash');
INSERT INTO teams(name, tournament) VALUES ('Yorkshire Vikings', 'T20 Blast');

Let’s have a look to double check that the table has been populated as expected:

-- Display all rows and columns
SELECT * FROM teams;

3 records
id	name	tournament
1	Hampshire Hawks	T20 Blast
2	Sydney Sixers	Big Bash
3	Yorkshire Vikings	T20 Blast

2.2 A Second Table

Create and populate a players table to store information about individual players:

-- Create a new table
CREATE TABLE players (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    runs INTEGER,
    average REAL,
    dob DATE
);

Insert data about players:

-- Create and populate new rows
INSERT INTO players(name, runs, average, dob) VALUES ('Ellyse Perry', 2173, 31.04, '1990-11-03');
INSERT INTO players(name, runs, average, dob) VALUES ('James Vince', 12564, 31.96, '1991-03-14');
INSERT INTO players(name, runs, average, dob) VALUES ('Hollie Armitage', 604, 17.20, '1997-06-14');
INSERT INTO players(name, runs, average, dob) VALUES ('Hilton Cartwright', 2550, 28.02, '1992-02-14');

Check:

-- Display all rows and columns
SELECT * FROM players;

4 records
id	name	runs	average	dob
1	Ellyse Perry	2173	31.04	1990-11-03
2	James Vince	12564	31.96	1991-03-14
3	Hollie Armitage	604	17.20	1997-06-14
4	Hilton Cartwright	2550	28.02	1992-02-14

2.3 A Cross-Reference Table

Now that we have a table of cricket teams and a table of cricket players we can create a relationship between these two sets of data and create team rosters that detail who plays for whom. Of course, each cricket team will have multiple players playing for it and one player can play for multiple teams, so the specific relationship will be many-to-many (many players are associated with each team and many teams are associated with each player).

A many-to-many relationship needs a third table - a cross-reference or join table - to store the associations between pairs of data points. This table should have the following two constraints:

Foreign keys referencing the primary keys of the two member tables (the teams and the players)
A composite primary key made up of the two foreign keys (each player can only appear on a team’s roster once)

With this in mind, here’s how to create the rosters table representing which player plays for which team, linking teams and players together via foreign keys:

-- Create a new table
CREATE TABLE rosters (
    team_id INTEGER,
    player_id INTEGER,
    PRIMARY KEY (team_id, player_id)
    FOREIGN KEY (team_id) REFERENCES teams(id),
    FOREIGN KEY (player_id) REFERENCES players(id)
);

Populate the rows:

-- Create and populate new rows
INSERT INTO rosters VALUES (1, 1);
INSERT INTO rosters VALUES (1, 2);
INSERT INTO rosters VALUES (1, 4);
INSERT INTO rosters VALUES (2, 1);
INSERT INTO rosters VALUES (2, 2);
INSERT INTO rosters VALUES (2, 3);
INSERT INTO rosters VALUES (3, 3);

Check:

-- Display all rows and columns
SELECT * FROM rosters;

7 records
team_id	player_id
1	1
1	2
1	4
2	1
2	2
2	3
3	3

Now that we have three tables we can more easily change and add data:

If information related to a player changes we can update the players table
If a new team forms we can add them to the teams table
If a player joins a new team we can update the rosters table

Our data is organised, manageable and clearly separated from one another with no repetition of information. Each record (row in a table) carries one additional data point.

3 Merging

Of course, we want to be able to see who plays for which team! This is done by merging the data tables with the cross-reference table:

-- Merge the tables
SELECT
    t.name AS team_name,
    p.name AS player_name
FROM rosters r
JOIN teams t ON r.team_id = t.id
JOIN players p ON r.player_id = p.id;

7 records
team_name	player_name
Hampshire Hawks	Ellyse Perry
Hampshire Hawks	James Vince
Hampshire Hawks	Hilton Cartwright
Sydney Sixers	Ellyse Perry
Sydney Sixers	James Vince
Sydney Sixers	Hollie Armitage
Yorkshire Vikings	Hollie Armitage

3.1 Filtering

It takes just one more step to filter the amalgamated table by one of the columns and get - for example - just the Hampshire Hawks players:

-- Filter the tables
SELECT
    t.name AS team_name,
    p.name AS player_name
FROM rosters r
JOIN teams t ON r.team_id = t.id
JOIN players p ON r.player_id = p.id
WHERE t.name = 'Hampshire Hawks';

3 records
team_name	player_name
Hampshire Hawks	Ellyse Perry
Hampshire Hawks	James Vince
Hampshire Hawks	Hilton Cartwright

4 Using Transactions

Let’s now imagine that the Hampshire Hawks men’s team plays a match wherein James Vince scores a hundred and Hilton Cartwright scores a fifty. We want to update the players table to reflect these additional runs but need to be careful that BOTH players have their stats updated. If something goes wrong during the execution of the code and only one player has their run total updated it will mean that our table is inaccurate. In this example where only two records are being updated it might not be such a big problem, but if tens or hundred or thousands of records were being updated it would be a nightmare to figure out what has been updated and what has not. In other words:

The table is accurate before the update (it lists how many runs the players have scored before this latest match)
The table is accurate after the update (it lists how many runs the players have scored after this latest match)
The table is not accurate during the update (some players get their run totals updated before other players)

This is a good use case for a transaction in SQL. Just like with a financial transaction, it is important that all aspects of the update are successful before was can consider it complete. If one part of the update fails we should roll back to the most recent state where we knew that all the data was accurate.

Here’s how to do it:

-- Create a temporary state whereafter all operations will be executed together
BEGIN TRANSACTION;
-- Operations which must *all* succeed
-- (eg subtracting from one bank account and adding to another)
UPDATE players SET runs = runs + 100 WHERE name = 'James Vince';
UPDATE players SET runs = runs + 50 WHERE name = 'Hilton Cartwright';

Check that it has worked:

-- Display all columns and filter the rows
SELECT * FROM players WHERE name = 'James Vince' OR name = 'Hilton Cartwright';

2 records
id	name	runs	average	dob
2	James Vince	12664	31.96	1991-03-14
4	Hilton Cartwright	2600	28.02	1992-02-14

Jame Vince’s career runs tally has increased by 100 and Hilton Cartwright’s by 50, which is what we expected. We can now commit to these changes:

-- Save all changes
COMMIT;

If it had not worked as expected, we could have rolled the database back to its original state:

-- Undo everything that has happened so far during the transaction
ROLLBACK;

5 Using Aggregate Functions

Before we aggregate our data let’s go back to when we merged the three tables and save the output of that to a new table (this is done simply by adding CREATE TABLE merged_tables AS before the code we used to do the merging above):

-- Merge the tables and save the output as a new table
CREATE TABLE merged_tables AS
SELECT
    t.name AS team_name,
    p.name AS player_name,
    p.runs AS runs
FROM rosters r
JOIN teams t ON r.team_id = t.id
JOIN players p ON r.player_id = p.id;

This is identical to the code we ran in the “Merging” section above except we’ve now added CREATE TABLE... at the start and also kept the runs column. Let’s take a look:

-- Display all rows and columns
SELECT * FROM merged_tables;

7 records
team_name	player_name	runs
Hampshire Hawks	Ellyse Perry	2173
Hampshire Hawks	James Vince	12664
Hampshire Hawks	Hilton Cartwright	2600
Sydney Sixers	Ellyse Perry	2173
Sydney Sixers	James Vince	12664
Sydney Sixers	Hollie Armitage	604
Yorkshire Vikings	Hollie Armitage	604

Now we can have some fun calculating the total number of runs and counting the total number of players for each team:

-- Aggregates for each team
SELECT
    team_name,
    SUM(runs) AS total_runs,
    COUNT(player_name) AS num_players
FROM merged_tables
GROUP BY team_name
ORDER BY team_name;

3 records
team_name	total_runs	num_players
Hampshire Hawks	17437	3
Sydney Sixers	15441	3
Yorkshire Vikings	604	1

6 Using Window Functions

Window functions preserve the values in the original table while displaying grouped or summative information. For this example we are going to use the runs that James Vince has scored in his 17 T20Is:

-- Create and populate a new table
CREATE TABLE scores (
    innings INTEGER,
    runs INTEGER,
    year INTEGER,
    month INTEGER
);
INSERT INTO scores VALUES (1, 41, 2015, 11);
INSERT INTO scores VALUES (2, 38, 2015, 11);
INSERT INTO scores VALUES (3, 46, 2015, 11);
INSERT INTO scores VALUES (4, 22, 2016, 3);
INSERT INTO scores VALUES (5, 16, 2016, 7);
INSERT INTO scores VALUES (6, 21, 2018, 2);
INSERT INTO scores VALUES (7, 10, 2018, 2);
INSERT INTO scores VALUES (8, 36, 2019, 5);
INSERT INTO scores VALUES (9, 59, 2019, 11);
INSERT INTO scores VALUES (10, 1, 2019, 11);
INSERT INTO scores VALUES (11, 49, 2019, 11);
INSERT INTO scores VALUES (12, 1, 2019, 11);
INSERT INTO scores VALUES (13, 14, 2022, 1);
INSERT INTO scores VALUES (14, 4, 2022, 1);
INSERT INTO scores VALUES (15, 16, 2022, 1);
INSERT INTO scores VALUES (16, 34, 2022, 1);
INSERT INTO scores VALUES (17, 55, 2022, 1);

-- Display the first 10 rows and all columns
SELECT * FROM scores;

Displaying records 1 - 10
innings	runs	year	month
1	41	2015	11
2	38	2015	11
3	46	2015	11
4	22	2016	3
5	16	2016	7
6	21	2018	2
7	10	2018	2
8	36	2019	5
9	59	2019	11
10	1	2019	11

To calculate the total number of runs per year along with the cumulative number of runs together with the running average, we need to use aggregate functions inside of aggregate functions - once to do the aggregation for each year and once to do the aggregation for all years:

-- Total for each year with running total, average and count
SELECT
    year,
    SUM(runs) AS total_runs,
    SUM(SUM(runs)) OVER (ORDER BY year) AS running_total,
    SUM(COUNT(*)) OVER (ORDER BY year) AS running_innings,
    1.0 * SUM(SUM(runs)) OVER (ORDER BY year)
        / SUM(COUNT(*)) OVER (ORDER BY year) AS running_avg
FROM scores
GROUP BY year
ORDER BY year;

5 records
year	total_runs	running_total	running_innings	running_avg
2015	125	125	3	41.66667
2016	38	163	5	32.60000
2018	31	194	7	27.71429
2019	146	340	12	28.33333
2022	123	463	17	27.23529

We can sanity check the overall accuracy of this table by comparing it to Vince’s CricInfo profile: 463 runs across 17 innings at an average of 27.23 is indeed what we expect to see.

Because we have his scores broken down into years AND months, we can do a similar grouping for each month in which he played T20 matches for England. This is a bit more tricky to understand because the running total of runs resets at the end of each year:

-- Running total for each month, resetting when a new year starts
SELECT
    year,
    month,
    runs,
    SUM(runs) OVER (PARTITION BY year ORDER BY month) AS running_annual_total
FROM scores
WHERE year < 2020
GROUP BY year, month
ORDER BY year;

6 records
year	month	runs	running_annual_total
2015	11	41	41
2016	3	22	22
2016	7	16	38
2018	2	21	21
2019	5	36	36
2019	11	59	95

We’ve also demonstrated the WHERE year < 2020 filter, which has given us an analysis of Vince’s pre-COVID career.

⇦ Back

SQLite:Worked Example