
Regular Expressions and Text Processing

R is not usually the first choice for text processing, but base R does include functions to concatenate, split, and extract strings.

text manipulation

base

  • paste, paste0 - concatenate multiple strings
  • strsplit - split one string into multiple strings (return value is a list!)
  • substr - get a substring
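
As a quick illustration of the three functions above, here is a minimal sketch. The strings and variable names are made up for this example and are not the ones used in the challenge.

```r
# Hypothetical example strings, not taken from the challenges
first <- "regular"
second <- "expressions"

# paste() joins with a space by default; paste0() joins with no separator
joined <- paste(first, second)       # "regular expressions"
glued  <- paste0(first, second)      # "regularexpressions"

# strsplit() always returns a list, one element per input string,
# so use [[1]] to get the character vector for a single input
words <- strsplit(joined, " ")[[1]]  # "regular" "expressions"

# substr() takes 1-based, inclusive start and stop positions
piece <- substr(second, 1, 7)        # "express"
```

Note the `[[1]]` after strsplit: without it you get a list of length one rather than a character vector.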

Challenge 1:

If a = "R is awesome" and b = "but sometimes sucks" then

  • Combine those strings together
  • Split the first one by " " character
  • Get “sometimes” from the second one using substr
## [1] "R is awesome but sometimes sucks"
## [1] "R"       "is"      "awesome"
## [1] "sometimes"

stringr

I have very limited knowledge of stringr, but it is considered a friendlier way to do string manipulation in R. It is now part of the tidyverse collection.

See the chapter on strings in the R for Data Science (r4ds) book.

Challenge 2:

If you like stringr, try to do Challenge 1 with stringr functions.

## [1] "R is awesome but sometimes sucks"
## [1] "R"       "is"      "awesome"
## [1] "sometimes"

regex

Regular expressions (regex) allow you to search for patterns inside text. The basic idea is demonstrated here:

http://www.dummies.com/programming/r/how-to-use-regular-expressions-in-r/

For more advanced examples, see

http://r4ds.had.co.nz/strings.html#matching-patterns-with-regular-expressions

In R we can use the grep function to search for patterns, for example:

strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")
grep("ac*b", strings, value = TRUE)
## [1] "ab"     "acb"    "accb"   "acccb"  "accccb"

The pattern “ac*b” matches an “a”, then any number of “c” (including none), then a “b”.
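
The same pattern can be used in a few other ways. The sketch below reuses the strings vector from above; the variable names idx, hits and anywhere are only illustrative.

```r
strings <- c("a", "ab", "acb", "accb", "acccb", "accccb")

# Without value = TRUE, grep() returns the positions of matching elements
idx <- grep("ac*b", strings)            # 2 3 4 5 6

# grepl() returns one TRUE/FALSE per element, handy for subsetting
hits <- grepl("ac*b", strings)          # FALSE TRUE TRUE TRUE TRUE TRUE
matched <- strings[hits]                # same as grep(..., value = TRUE)

# The pattern may match anywhere inside the string, not only the whole string
anywhere <- grepl("ac*b", "xxabyy")     # TRUE
```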

Challenge 3:

Using the gapminder dataset from library(gapminder), search the country column for all countries containing “ee”.

## [1] "Greece"

operators

  • .: matches any single character
  • [characters]: matches any one of the characters inside the square brackets
  • [^characters]: similar to [characters], but matches any one character except those inside the square brackets
  • \: suppresses the special meaning of a metacharacter in a regular expression, i.e. $ * + . ? [ ] ^ { } | ( ). Inside an R string the backslash itself must be escaped, so we usually write a double \\
  • |: an “or” operator, matches the pattern on either side of the |
  • ^ and $: match the beginning and the end of the string
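
A minimal sketch of these operators in action. The file names below are hypothetical and chosen only so that each pattern gives a different result.

```r
# Hypothetical file names used only for illustration
files <- c("data.csv", "data_csv", "notes.txt", "a+b.csv", "csv.bak")

# An unescaped "." matches any character, so "data_csv" matches too
any_char <- grep("data.csv", files, value = TRUE)     # "data.csv" "data_csv"

# "\\." in an R string is the regex \. and matches a literal dot
literal_dot <- grep("data\\.csv", files, value = TRUE) # "data.csv"

# "$" anchors the match to the end of the string
csv_ext <- grep("\\.csv$", files, value = TRUE)        # "data.csv" "a+b.csv"

# "^" anchors to the start; [a-c] is any one of a, b, c
starts_abc <- grep("^[a-c]", files, value = TRUE)      # "a+b.csv" "csv.bak"

# [^...] negates the class: first character is neither "d" nor "n"
not_dn <- grep("^[^dn]", files, value = TRUE)          # "a+b.csv" "csv.bak"

# "|" is alternation: ends with ".txt" or with ".bak"
txt_or_bak <- grep("\\.txt$|\\.bak$", files, value = TRUE) # "notes.txt" "csv.bak"
```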

Challenge 4:

What will be the output of the code below? (First guess, then try.)

strings <- c("^ab", "ab", "abc", "abd", "abe", "ab 12")
grep("ab.", strings, value = TRUE)
grep("ab[c-e]", strings, value = TRUE)
grep("ab[^c]", strings, value = TRUE)
grep("^ab", strings, value = TRUE)
grep("\\^ab", strings, value = TRUE)
grep("abc|abd", strings, value = TRUE)

substitutions

The sub function replaces only the first match in each string, while gsub replaces all matches.

sub("a", "b", "balalajka")
## [1] "bblalajka"
gsub("a", "b", "balalajka")
## [1] "bblblbjkb"
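
The first argument of sub and gsub is a regular expression, not just a fixed string. A small sketch with a made-up input, where "-+" means one or more dashes in a row:

```r
messy <- "a---b----c"   # hypothetical input with runs of dashes

# sub() collapses only the first run of dashes...
first_run <- sub("-+", "-", messy)    # "a-b----c"

# ...while gsub() collapses every run
all_runs <- gsub("-+", "-", messy)    # "a-b-c"
```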

There is one additional regex operator for substitutions:

  • (...): grouping in regular expressions. This allows you to retrieve the bits that matched various parts of your regular expression so you can alter them or use them for building up a new string. Each group can then be referred to as \N (written \\N inside an R string), where N is the position of the group, counting the opening parentheses from the left. This is called a backreference.

For example, to add a dash before the number in sample IDs:

ids <- c("DO01", "DO02", "DO03", "DO04", "DO05")
sub("DO([0-9]*)", "DO-\\1", ids)
## [1] "DO-01" "DO-02" "DO-03" "DO-04" "DO-05"
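
Backreferences become more useful with more than one group. The sketch below uses made-up names in "Last, First" form and swaps the two parts; the vector people is an assumption for illustration only.

```r
# Hypothetical names in "Last, First" form
people <- c("Lovelace, Ada", "Turing, Alan")

# Group 1 captures the text before ", " and group 2 the text after it;
# the replacement puts group 2 first, then a space, then group 1
swapped <- sub("(.*), (.*)", "\\2 \\1", people)
# "Ada Lovelace" "Alan Turing"
```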

Challenge 5:

In gapminder, find all countries that end with “land”, and replace “land” with “LAND” using a backreference.

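library(gapminder)
countries <- unique(gapminder$country)
# keep only the countries whose name ends with "land"
lands <- grep("land$", countries, value = TRUE)
# group 1 captures everything before the final "land"
sub("(.*)land$", "\\1LAND", lands)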
## [1] "FinLAND"     "IceLAND"     "IreLAND"     "New ZeaLAND" "PoLAND"     
## [6] "SwaziLAND"   "SwitzerLAND" "ThaiLAND"