As introduced in the data types page, ‘strings’ are characters that have been ‘strung’ together to form words.
A string literal is a ‘normal string’ - it’s the type you will most often encounter in coding. You can write a string by using single or double quotation marks, or two matched sets of three double or single quotation marks:
# These are strings, representing text
st = "Hello, World!"
st = 'Hello, World!'
st = """Hello, World!"""
st = '''Hello, World!'''
Strings that take up more than one line can only be defined using the latter two methods:
# This is a string that crosses lines
st = '''Hello,
World!'''
print(st)
## Hello,
## World!
Or, alternatively, a ‘newline character’ can be used. In this case, one pair of single or double quotation marks can be used:
# This is a string that crosses lines
st = 'Hello,\nWorld!'
print(st)
## Hello,
## World!
If you want to include quotation marks in a string, you need to define the string using one of the other types of quotation marks. For example, if you want to have double quotation marks in your string you need to use single quotation marks to define it. If you try to use double quotation marks to define it Python will get confused as to whether you have created one string or two:
# Correct
st = 'This string contains a "quote"'
# Incorrect - will create an error! The letters "quote" are outside your string
# st = "This string contains a "quote""
Python 3 strings are sequences of Unicode characters, so they can handle accented letters, other alphabets and emoji:
# Foreign characters entered explicitly
print('À Æ Ç Ë Õ Ω 💩')
## À Æ Ç Ë Õ Ω 💩
One thing to take note of is the backslash character (“\”). In string literals, this is used as an escape character. This means that whatever immediately follows a backslash is NOT simply interpreted as text and, if it has been used correctly, it will actually be interpreted as an instruction to insert a special character:
# These following strings all contain escape characters that cause special
# characters to be inserted
print('\" Quotation marks')
print('\\ A backslash')
print('\t A tab')
print('\n A newline')
print('\xe1 \xA9 Foreign characters entered as hex code')
print('\u00df \u03A9 Foreign characters entered as Unicode')
## " Quotation marks
## \ A backslash
## A tab
##
## A newline
## á © Foreign characters entered as hex code
## ß Ω Foreign characters entered as Unicode
A raw string literal is exactly the same as a regular string literal except that the backslash is NOT an escape character - everything is interpreted as ‘raw’ text. A raw string literal is created by putting an “r” or “R” in front of the string:
# These characters are not escaped
print(r'\" \\ \t \n \xe1 \xA9 \u00df \u03A9 💩')
## \" \\ \t \n \xe1 \xA9 \u00df \u03A9 💩
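One way to convince yourself of the difference between regular and raw string literals is to compare their lengths. The following is only an illustrative sketch and the variable names are made up for this example:

```python
# An escape sequence is one character; in a raw string it stays as two
regular = '\n'
raw = r'\n'
print(len(regular), len(raw))
## 1 2

# A raw string is equivalent to doubling the backslash in a regular string
print(raw == '\\n')
## True

# repr() shows the escape sequences in a string instead of acting on them
print(repr('Hello,\nWorld!'))
## 'Hello,\nWorld!'
```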
Formatted string literals (‘f-strings’) are used for formatting and are created by prepending the letter “f” or “F”. An f-string can display the value of a variable directly in it, and it uses curly brackets to indicate the position of the variable:
st = 'Hello, World!'
print(f'My string is: {st}')
## My string is: Hello, World!
More information about these can be found on the f-string page.
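As a brief taste of what else f-strings can do, here is a small sketch. The variables item, price and quantity are invented for this example:

```python
item = 'coffee'
price = 2.5
quantity = 3

# Any expression can go inside the curly brackets
print(f'{quantity} x {item} = {price * quantity}')
## 3 x coffee = 7.5

# A format specification after a colon controls how the value is shown
print(f'Price: {price:.2f}')
## Price: 2.50

# Double the curly brackets to show them literally
print(f'{{item}} is {item}')
## {item} is coffee
```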
Raw f-strings combine raw strings and f-strings, ie they are f-strings where backslashes are not treated as escape characters. They are created by prepending “fr” or “FR” (“rf” and “RF” also work):
st = 'Hello, World!'
print(fr'\n {st}')
## \n Hello, World!
Combining strings in Python is very easy: simply use “+” to add them together!
st1 = 'Alpha'
st2 = 'bet'
st = st1 + st2
print(st)
## Alphabet
If the strings are in a list, they can be combined using the .join() method:
ls = ['Alpha', 'bet']
word = ''.join(ls)
print(word)
## Alphabet
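The string that .join() is called on does not have to be empty: whatever it contains is placed between the elements. The sketch below uses made-up lists to show this, along with the fact that non-string elements need to be converted first:

```python
words = ['Alpha', 'Beta', 'Gamma']
# The string that .join() is called on is placed between the elements
joined = ', '.join(words)
print(joined)
## Alpha, Beta, Gamma

# .join() only accepts strings, so convert other types first
numbers = [1, 2, 3]
joined_numbers = '-'.join(str(n) for n in numbers)
print(joined_numbers)
## 1-2-3
```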
Note that it is also possible to display two strings as one without actually combining them: in the next example the same strings Alpha and bet are shown side-by-side whilst remaining two objects:
print(st1, st2, sep='')
## Alphabet
The above was achieved by changing the default behaviour of the print() function - which usually displays elements with a space in between them - through the use of the sep keyword argument. This determines what characters separate the elements being printed. Note that this doesn’t affect the strings themselves, it merely changes how they are shown in the console on that particular occasion:
print(st1, st2, sep=', ')
## Alpha, bet
Similarly, the end keyword argument determines what characters end a string when it gets displayed. By default, this is a ‘new line’ character, ie everything gets shown on a new line when you use the print() function. This can be changed to get a result similar to the above:
print(st1, end='')
print(st2)
## Alphabet
If you are dealing with files and folders on your computer you will be using file paths. The temptation will be to do something like this:
# Define the location of a file on your computer
home_dir = 'home'
new_dir = 'New Folder'
file_name = 'My File.py'
# Create a file path
file_path = home_dir + '/' + new_dir + '/' + file_name
print(file_path)
## home/New Folder/My File.py
However, this is bad practice. In addition to being cumbersome, this method hard-codes the path separator, which differs between operating systems: Windows uses backslashes between the elements in a path whereas Unix-based OSs like macOS and Linux use forward slashes. A better way to do this is to import the os module and use a function specially created for this purpose:
import os
# Create a proper file path
file_path = os.path.join(home_dir, new_dir, file_name)
print(file_path)
## home/New Folder/My File.py > this will appear on macOS and Linux
## home\New Folder\My File.py > this will appear on Windows
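The standard library’s pathlib module is another option for building paths. The sketch below uses its ‘pure’ path classes so that both flavours can be shown on any computer; in everyday code you would more likely use pathlib.Path, which picks the flavour that matches your operating system:

```python
from pathlib import PurePosixPath, PureWindowsPath

# Pure paths only manipulate text; they never touch the file system
posix_path = PurePosixPath('home', 'New Folder', 'My File.py')
windows_path = PureWindowsPath('home', 'New Folder', 'My File.py')
print(posix_path)
## home/New Folder/My File.py
print(windows_path)
## home\New Folder\My File.py

# The "/" operator joins path elements
posix_path = PurePosixPath('home') / 'New Folder' / 'My File.py'
# Parts of the path are available as attributes
print(posix_path.name, posix_path.stem, posix_path.suffix, sep=' | ')
## My File.py | My File | .py
```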
Use the .split() method to split a string at a certain character or sub-string. Remember that Python uses zero-indexing and so the number 0 refers to the first element in something whilst 1 refers to the second element and so on.
Split a string into two parts by splitting it once. In this case, split a file named Filename.png at the full stop to get its name and its extension:
st = 'Filename.png'
file_name = st.split('.')[0]
extension = st.split('.')[1]
print('File name:', file_name, ' Extension:', extension)
## File name: Filename Extension: png
Notice that this approach caused us to lose the dot. When we use the .split() method it does not include the character we are splitting on in either of the sections.
Divide an articulated string up into its constituent parts:
st = '/User/username/Downloads'
ls = st.split('/')
print(ls)
## ['', 'User', 'username', 'Downloads']
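The file name example further up only works neatly when there is exactly one full stop. As a sketch of how file names with several dots could be handled (the file name here is invented), the number of splits can be limited, or os.path.splitext() can be used:

```python
import os

st = 'archive.tar.gz'

# A plain split breaks at every full stop
print(st.split('.'))
## ['archive', 'tar', 'gz']

# Limit the number of splits and start from the right to isolate the last part
file_name, extension = st.rsplit('.', 1)
print(file_name, extension)
## archive.tar gz

# os.path.splitext() does the same job for file names and keeps the dot
root, ext = os.path.splitext(st)
print(root, ext)
## archive.tar .gz
```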
Does a string contain the letter(s) you are looking for? Search the string and return a Boolean (true or false) to find out. This is done with the in operator:
# Do the letters "lena" appear in "Filename.png"?
boolean = 'lena' in 'Filename.png'
print(boolean)
## True
There are special methods to search for sub-strings at the start and at the end of strings:
# Does this website use SSL encryption?
address = 'https://docs.python.org/3/library/string.html'
boolean = address.startswith('https')
print(boolean)
## True
# Is this file a PNG?
filename = 'My File.png'
boolean = filename.endswith('.png')
print(boolean)
## True
You can test for multiple prefixes or suffixes by creating a ‘tuple’ of sub-strings to search for:
# Is this file a PNG or a JPG?
filename = 'My File.docx'
# Create a tuple of sub-strings to search for
image_file_extensions = ('.png', '.jpg')
boolean = filename.endswith(image_file_extensions)
print(boolean)
## False
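All of these searches are case-sensitive. A common workaround, sketched here with a made-up file name, is to convert the string to lower case before testing it:

```python
filename = 'HOLIDAY PHOTO.JPG'
image_file_extensions = ('.png', '.jpg')

# The comparison is case-sensitive, so this upper-case extension is missed
print(filename.endswith(image_file_extensions))
## False

# Convert to lower case first for a case-insensitive test
print(filename.lower().endswith(image_file_extensions))
## True

# The same applies to the "in" operator
print('photo' in filename, 'photo' in filename.lower())
## False True
```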
Now that we know our string contains the letter(s) we are looking for, we can get the indices (positions) where they are. You have two options here - .index() and .find() - with the difference being how they behave if they do not find what they are looking for:
.index() will raise a ValueError if it does not find what it is looking for, which (under normal circumstances) will immediately cause your script to stop running
.find() will return “-1” if it does not find what it is looking for, meaning that your script will continue to run
If they do find what they are looking for, they will both return the index of the first occurrence of that sub-string (remembering that Python uses zero-indexing and so the first element is number 0):
st = '/User/username/Downloads'
# First appearance of a sub-string - raise "ValueError" on failure
idx = st.index('/')
# First appearance of a sub-string - return "-1" on failure
idx = st.find('/')
print(idx)
## 0
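Here is a sketch of how each failure could be handled. Bear in mind that -1 is itself a valid index (it refers to the last character), so the result of .find() should be checked before it is used:

```python
st = '/User/username/Downloads'

# .find() signals failure with -1, which needs to be checked explicitly
idx = st.find('?')
if idx == -1:
    print('Not found')
## Not found

# .index() signals failure by raising an exception, which can be caught
try:
    idx = st.index('?')
except ValueError:
    idx = None
    print('Not found')
## Not found
```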
The index of the last occurrence of a sub-string can be found with either .rindex() or .rfind() - the difference between these two is the same as before. As their names suggest, these methods search from the Right:
# Last appearance of a sub-string - raise "ValueError" on failure
idx = st.rindex('/')
# Last appearance of a sub-string - return "-1" on failure
idx = st.rfind('/')
print(idx)
## 14
All occurrences can be found by using a more complicated statement which requires the re (regular expression) module:
import re
# Find all occurrences of a sub-string - return an empty list on failure
idx = [m.start() for m in re.finditer('/', st)]
print(idx)
## [0, 5, 14]
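Regular expressions are not strictly needed for this. As an alternative sketch, the hypothetical helper below (find_all is not a built-in function) repeatedly calls .find() with a start position. If you do stick with re.finditer(), note that characters with a special meaning in regular expressions, such as the full stop, should be escaped:

```python
import re

def find_all(text, sub):
    """Return the start index of every non-overlapping occurrence of sub."""
    if not sub:
        raise ValueError('sub must not be empty')
    indices = []
    start = text.find(sub)
    while start != -1:
        indices.append(start)
        # Continue searching after the end of this occurrence
        start = text.find(sub, start + len(sub))
    return indices

print(find_all('/User/username/Downloads', '/'))
## [0, 5, 14]

# With re.finditer(), escape characters that have a special meaning in regex
print([m.start() for m in re.finditer(re.escape('.'), 'a.b.c')])
## [1, 3]
```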
The above methods which find locations within strings can be used to insert characters:
st = 'Hello World'
# Insert a comma between the words
idx = st.index(' ')
st = st[:idx] + ',' + st[idx:]
print(st)
## Hello, World
You can see what characters are at a certain location within a string by indexing it. Strings in Python are sequences of characters and thus they can be indexed in a similar way to lists:
st = 'Hello, World!'
# Return the first, fourth and eighth characters
print(st[0], st[3], st[7])
## H l W
Use negative numbers to refer to positions of elements starting from the right-hand-side (ie element “-1” is the last character, “-2” is the second-last, etc):
# Return the last, third-last and eighth-last characters
print(st[-1], st[-3], st[-8])
## ! l ,
Use a colon to refer to a range of characters. The end of the range (ie the second number) must be one more than the index of the last character you want. In Python, you index up until but not including a certain index:
# Return the second to fifth characters
print(st[1:5])
## ello
To return all characters from an index until the end of the string, leave the end of the range blank. To return all characters up until an index, leave the start of the range blank:
# Return the first word and the second word
print('First word:', st[:5], ' Second word:', st[7:])
## First word: Hello Second word: World!
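A slice can also take a third number, the step size. The sketch below shows this, together with how slicing and indexing differ when the numbers run past the end of the string:

```python
st = 'Hello, World!'

# A third number in the slice is the step size
print(st[::2])
## Hlo ol!

# A negative step walks backwards, so this reverses the string
print(st[::-1])
## !dlroW ,olleH

# Slices that run past the end are truncated rather than raising an error...
print(st[7:100])
## World!

# ...whereas indexing a single position that does not exist raises IndexError
try:
    st[100]
except IndexError as e:
    print('IndexError:', e)
## IndexError: string index out of range
```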
To delete characters at, before, after or between certain indexes simply slice the string and join the parts you want to keep:
st = 'Hello, World!'
# Delete character at index 5
st1 = st[:5] + st[5 + 1:]
# Delete characters before index 7
st2 = st[7:]
# Delete characters after index 4
st3 = st[:5]
# Delete characters between index 4 and 7
st4 = st[:5] + st[7:]
print(st1, st2, st3, st4, sep=', ')
## Hello World!, World!, Hello, HelloWorld!
To delete characters at, before, after or between certain sub-strings you first need to find the sub-string, get its index, then perform the deletion:
st = 'Hello, World!'
# Find the comma
idx_comma = st.find(',')
# Delete all characters after the comma
st = st[:idx_comma]
print(st)
## Hello
Overwriting characters inside an existing string is not possible in Python. Strings are “immutable” which means that they cannot be changed. The only thing you can do is create a new string that combines the parts you want to keep with the new characters:
st = 'Hello, World!'
st = st[0] + 'i' + st[5:]
print(st)
## Hi, World!
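To see immutability in action, the sketch below tries to assign to a position in a string. It then shows one possible workaround for making several single-character changes, which is to go via a list (lists are mutable):

```python
st = 'Hello, World!'

# Assigning to an index of a string is not allowed
try:
    st[1] = 'i'
except TypeError as e:
    print('TypeError:', e)
## TypeError: 'str' object does not support item assignment

# Convert to a list of characters, edit it, then join it back together
chars = list(st)
chars[7] = 'w'
st_new = ''.join(chars)
print(st_new)
## Hello, world!
```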
Instead of using indexes, sub-strings can be searched for and replaced by other sub-strings using the .replace() method:
st = 'Seven + 8 = 15'
st = st.replace('Seven', '7')
print(st)
## 7 + 8 = 15
This is useful if you have characters that cause problems in certain instances. For example, file names cannot contain slashes, so you can replace them as follows:
st = '01/02/03'
st = st.replace('/', '_')
print(st)
## 01_02_03
Similarly, if you are exporting something to LaTeX you will need to add escape characters:
st = '01_02_03'
st = st.replace('_', r'\_')
print(st)
## 01\_02\_03
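Two further details may be handy here, sketched below with made-up strings: .replace() accepts an optional count that limits the number of replacements, and several different characters can be swapped in one pass with a translation table:

```python
st = '01/02/03'

# An optional third argument limits the number of replacements
print(st.replace('/', '_', 1))
## 01_02/03

# .replace() returns a new string; the original is unchanged
print(st)
## 01/02/03

# Replace several different characters in one pass with a translation table
table = str.maketrans({'/': '_', ':': '-', ' ': '_'})
print('01/02/03 12:30'.translate(table))
## 01_02_03_12-30
```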
The .removeprefix() and .removesuffix() methods (available from Python 3.9 onwards) do exactly what they say on the tin: they remove a given sub-string from the front or from the end of a string, respectively, if it is there. This is useful for things like file extensions:
filename = 'Document.docx'
fileroot = filename.removesuffix('.docx')
print(fileroot)
## Document
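It can be tempting to use .rstrip() for this job instead, but that method removes a set of characters rather than a sub-string. The sketch below shows the pitfall, along with a hypothetical fallback function (remove_suffix is not part of Python) for versions that lack .removesuffix():

```python
filename = 'index.xml'

# .rstrip() removes any trailing characters from the set {'.', 'x', 'm', 'l'},
# so it also eats the "x" at the end of "index"
print(filename.rstrip('.xml'))
## inde

def remove_suffix(text, suffix):
    """Hypothetical fallback for Python versions older than 3.9."""
    if suffix and text.endswith(suffix):
        return text[:-len(suffix)]
    return text

print(remove_suffix(filename, '.xml'))
## index
print(remove_suffix(filename, '.png'))
## index.xml
```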
Leading and trailing white space (spaces, tabs and newline characters) can be removed with the .strip() method:
st = ' Hello '
st = st.strip()
print('|' + st + '|')
## |Hello|
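There are also one-sided versions of this method, and all of them can be told to strip characters other than white space. A short sketch, using repr() to make the remaining spaces visible:

```python
st = '  Hello  '

# Strip only the left-hand side or only the right-hand side
print(repr(st.lstrip()))
## 'Hello  '
print(repr(st.rstrip()))
## '  Hello'

# Tabs and newlines count as white space too
print(repr('\tHello\n'.strip()))
## 'Hello'

# Pass a string of characters to strip those instead of white space
print('--Hello--'.strip('-'))
## Hello
```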
Here’s a string that is formatted to look like a table:
st = """
Year  NDay  Pos0  Pos1  Pos2  Pos3  Pos4
2015     4     8    11    13    14    18
2016     4    18    18    19    18    17
2017     4    17    20    25    26    27
2018     4    27    26    26    26    25
2019     4    25    24    23    22    21
"""
Replace the duplicated spaces with single spaces by using a combination of .join() and .split() like so:
line = ' '.join(st.split())
print(line)
## Year NDay Pos0 Pos1 Pos2 Pos3 Pos4 2015 4 8 11 13 14 18 2016 4 18 18 19 18 17 2017 4 17 20 25 26 27 2018 4 27 26 26 26 25 2019 4 25 24 23 22 21
Here’s how to do it while keeping the linebreaks in:
# Remove duplicate whitespace and convert to list of strings
st = [' '.join(row.split()) for row in st.split('\n')]
# Add linebreak characters back in
st = [row + '\n' for row in st]
# Collapse to string
st = ''.join(st)
print(st)
## Year NDay Pos0 Pos1 Pos2 Pos3 Pos4
## 2015 4 8 11 13 14 18
## 2016 4 18 18 19 18 17
## 2017 4 17 20 25 26 27
## 2018 4 27 26 26 26 25
## 2019 4 25 24 23 22 21
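If the goal is to work with the values rather than to tidy up the text, the same .split() trick can turn the table into a list of rows. This is only a sketch using a cut-down, made-up table:

```python
table = """
Year  NDay  Pos0
2015     4     8
2016     4    18
"""

# Split into lines, then split each line on any run of white space
rows = [line.split() for line in table.strip().splitlines()]
header, data = rows[0], rows[1:]
print(header)
## ['Year', 'NDay', 'Pos0']
print(data)
## [['2015', '4', '8'], ['2016', '4', '18']]

# The values are still strings; convert them to integers if needed
data = [[int(x) for x in row] for row in data]
print(data[1])
## [2016, 4, 18]
```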
A string can be UPPERCASE, lowercase, Sentence case or Title Case:
# lowercase
st = 'Filename.PNG'
filename = st.split('.', 1)[0]
extension = st.split('.', 1)[1].lower()
print(filename + '.' + extension)
## Filename.png
# UPPERCASE
st = 'Hello, World!'
print(st.upper())
## HELLO, WORLD!
# Sentence case
st = 'hello. greetings to the world.'
print('. '.join(x.capitalize() for x in st.split('. ')))
## Hello. Greetings to the world.
# Title Case
st = 'alice in wonderland'
print(st.title())
## Alice In Wonderland
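A few quirks of these methods are worth knowing about. The sketch below, using made-up strings, shows how .title() treats apostrophes, what .capitalize() does to the rest of a string, and how .casefold() can be used for case-insensitive comparisons:

```python
# .title() capitalises the letter after every non-letter, including apostrophes
print("they're here".title())
## They'Re Here

# .capitalize() also converts the rest of the string to lower case
print('hello World'.capitalize())
## Hello world

# .casefold() is a more aggressive .lower() intended for comparing strings
print('Straße'.casefold() == 'STRASSE'.casefold())
## True
```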
‘Pascal case’ is when there are no spaces between words and, instead, every word is capitalised in order to make it readable (LikeThisForExample). Here is a function that converts a string written in this manner into sentence case:
import re
def pascal_to_sentence_case(w):
    """Convert a string in Pascal case into sentence case."""
    new_word = ''
    # Each word is a capital letter followed by any lower-case letters
    for i, word in enumerate(re.findall('([A-Z][a-z]*)', w)):
        if i == 0:
            # The first word keeps its capital letter
            new_word = new_word + word
        else:
            new_word = new_word + ' ' + word.lower()
    return new_word
# Turn Pascal case into sentence case
new_word = pascal_to_sentence_case('PascalCase')
print(new_word)
## Pascal case
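A more compact alternative, offered only as a sketch, is to use re.sub() to insert a space in front of every capital letter except the first. The helper name pascal_to_words is made up for this example, and it assumes there are no acronyms in the string (something like ‘HTTPServer’ would be split letter by letter):

```python
import re

def pascal_to_words(w):
    """Hypothetical helper: insert a space before each inner capital letter."""
    return re.sub(r'(?<!^)(?=[A-Z])', ' ', w)

words = pascal_to_words('LikeThisForExample')
print(words)
## Like This For Example

# Sentence case
print(words.capitalize())
## Like this for example

# Snake case
print(words.lower().replace(' ', '_'))
## like_this_for_example
```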