R is not usually the first choice for text processing. However, it contains basic functions to concatenate, split, and subset strings.
text manipulation
base
- paste, paste0 - concatenate multiple strings
- strsplit - split one string into multiple strings (return value is a list!)
- substr - get a substring
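As a quick illustration of these base functions, here is a minimal sketch on made-up strings. The variable names (first, second and so on) are only for this example and are not the ones used in the challenge.

```r
# Illustrative strings (hypothetical, not the challenge data)
first <- "text processing"
second <- "in base R"

# paste() joins with a space by default; paste0() joins with no separator
joined <- paste(first, second)   # "text processing in base R"
glued <- paste0(first, second)   # "text processingin base R"

# strsplit() always returns a list, one element per input string
pieces <- strsplit(first, " ")
words <- pieces[[1]]             # "text" "processing"

# substr() takes 1-based start and stop positions, both inclusive
part <- substr(first, 6, 15)     # "processing"
```

Because strsplit returns a list, the `[[1]]` step is needed to get at the character vector for a single input string.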
Challenge 1:
If a = "R is awesome" and b = "but sometimes sucks" then
- Combine those strings together
- Split the first one by " " character
- Get “sometimes” from the second one using substr
## [1] "R is awesome but sometimes sucks"
## [1] "R" "is" "awesome"
## [1] "sometimes"
stringr
I have very limited knowledge of stringr, but it is considered a friendlier way to do string manipulation in R. It is now part of the tidyverse collection.
See the chapter about strings in the r4ds book.
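For orientation, here is a rough sketch of the stringr counterparts of paste, strsplit and substr. It assumes the stringr package is installed, and the strings and variable names are made up for this example.

```r
library(stringr)  # assumption: stringr is installed

# Illustrative strings (hypothetical, not the challenge data)
first <- "text processing"
second <- "with stringr"

# str_c() concatenates; unlike paste(), its default separator is ""
joined <- str_c(first, second, sep = " ")  # "text processing with stringr"

# str_split() returns a list, just like strsplit()
pieces <- str_split(first, " ")
words <- pieces[[1]]                       # "text" "processing"

# str_sub() extracts by start and end position (1-based, inclusive)
part <- str_sub(second, 6, 12)             # "stringr"
```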
Challenge 2:
If you like stringr, try to do Challenge 1 with stringr functions.
## [1] "R is awesome but sometimes sucks"
## [1] "R" "is" "awesome"
## [1] "sometimes"
regex
Regular expressions (regex) allow you to search for patterns inside text. The basic idea is demonstrated here:
http://www.dummies.com/programming/r/how-to-use-regular-expressions-in-r/
For more advanced examples, see
http://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions
In R we can use the grep function to search for patterns, for example:
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", strings, value = TRUE)
## [1] "ab" "acb" "accb" "acccb" "accccb"
The pattern “ac*b” stands for first “a”, then any number of “c” (including none), then “b”.
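The same search can also report where the matches are rather than what they are. A small sketch, reusing the vector from above: without value = TRUE, grep returns the positions of the matching elements, and the related base function grepl returns one TRUE/FALSE per element.

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# Positions of the matching elements (the first element "a" has no "b")
idx <- grep("ac*b", strings)     # 2 3 4 5 6

# One logical value per element, handy for subsetting or filtering
hits <- grepl("ac*b", strings)   # FALSE TRUE TRUE TRUE TRUE TRUE

# Subsetting with the logical vector gives the same result as value = TRUE
matched <- strings[hits]
```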
Challenge 3:
Using the gapminder dataset from library(gapminder), search the country column for all countries containing “ee”.
## [1] "Greece"
operators
- . : matches a single character
- [characters] : matches any one of the characters inside the square brackets
- [^characters] : similar to [characters], but matches any character except those inside the square brackets
- \ : suppresses the special meaning of metacharacters in a regular expression, i.e. $ * + . ? [ ] ^ { } | ( ). Inside an R string we usually need to write it doubled, as \\
- | : an “or” operator, matches patterns on either side of the |
- ^ and $ : beginning and end of the string
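A minimal sketch of each operator on a small made-up vector (the vector words is invented for this example and is different from the one in the challenge below):

```r
# Illustrative vector (hypothetical)
words <- c("cat", "cot", "cut", "c.t", "coat")

# . matches any single character between "c" and "t"
any_char <- grep("c.t", words, value = TRUE)      # "cat" "cot" "cut" "c.t"

# [ao] matches exactly one of the listed characters
one_of <- grep("c[ao]t", words, value = TRUE)     # "cat" "cot"

# [^ao] matches one character that is NOT listed
none_of <- grep("c[^ao]t", words, value = TRUE)   # "cut" "c.t"

# \\. matches a literal dot (backslash doubled inside the R string)
literal <- grep("c\\.t", words, value = TRUE)     # "c.t"

# | matches the pattern on either side
either <- grep("cat|cut", words, value = TRUE)    # "cat" "cut"

# ^ anchors to the beginning, $ to the end of the string
starts <- grep("^co", words, value = TRUE)        # "cot" "coat"
ends <- grep("at$", words, value = TRUE)          # "cat" "coat"
```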
Challenge 4:
What will be the output of the code below? (First guess, then try.)
strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)
substitutions
With the sub and gsub functions you can replace the first match or all matches of a pattern, respectively.
sub("a", "b", "balalajka")
## [1] "bblalajka"
gsub("a", "b", "balalajka")
## [1] "bblblbjkb"
There is one additional regex operator for substitutions:
(...): grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to using \N, where N is the number of the (...) group, counted from the left. This is called a backreference.
For example, to add a dash before the number in sample IDs:
ids <- c("DO01", "DO02", "DO03", "DO04", "DO05")
sub("DO([0-9]*)", "DO-\\1", ids)
## [1] "DO-01" "DO-02" "DO-03" "DO-04" "DO-05"
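Backreferences become more useful with more than one group. As a small sketch on made-up data (the vector full_names is hypothetical), the pattern below captures the surname as group 1 and the first name as group 2, and the replacement puts them back in the opposite order:

```r
# Illustrative names in "Surname, Firstname" form (hypothetical data)
full_names <- c("Doe, John", "Smith, Anna")

# \\1 is the first (...) group, \\2 the second; swap them in the replacement
swapped <- sub("([A-Za-z]+), ([A-Za-z]+)", "\\2 \\1", full_names)
# "John Doe" "Anna Smith"
```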
Challenge 5:
In gapminder, find all countries that end with “land”, and replace “land” with “LAND” using a backreference.
countries <- unique(gapminder$country)
lands <- grep("land$", countries, value = TRUE)
sub("(.*)land", "\\1LAND", lands)
## [1] "FinLAND" "IceLAND" "IreLAND" "New ZeaLAND" "PoLAND"
## [6] "SwaziLAND" "SwitzerLAND" "ThaiLAND"