Chapter 10: Regular Expressions
¶ At various points in the previous chapters, we had to look for patterns in string values. In chapter 4 we extracted date values from strings by writing out the precise positions at which the numbers that were part of the date could be found. Later, in chapter 6, we saw some particularly ugly pieces of code for finding certain types of characters in a string, for example the characters that had to be escaped in HTML output.
¶ Regular expressions are a language for describing patterns in strings. They form a small, separate language, which is embedded inside JavaScript (and in various other programming languages, in one way or another). It is not a very readable language ― big regular expressions tend to be completely unreadable. But, it is a useful tool, and can really simplify string-processing programs.
¶ Just like strings get written between quotes, regular expression
patterns get written between slashes (/
). This means that slashes
inside the expression have to be escaped.
var slash = /\//; show("AC/DC".search(slash));
¶ The search
method resembles indexOf
, but it searches for a
regular expression instead of a string. Patterns specified by regular
expressions can do a few things that strings can not do. For a start,
they allow some of their elements to match more than a single
character. In chapter 6, when extracting mark-up from a document, we
needed to find the first asterisk or opening brace in a string. That
could be done like this:
var asteriskOrBrace = /[\{\*]/; var story = "We noticed the *giant sloth*, hanging from a giant branch."; show(story.search(asteriskOrBrace));
¶ The [
and ]
characters have a special meaning inside a regular
expression. They can enclose a set of characters, and they mean 'any
of these characters'. Most non-alphanumeric characters have some
special meaning inside a regular expression, so it is a good idea to
always escape them with a backslash1 when you use them to refer to
the actual characters.
¶ There are a few shortcuts for sets of characters that are needed
often. The dot (.
) can be used to mean 'any character that is not a
newline', an escaped 'd' (\d
) means 'any digit', an escaped 'w'
(\w
) matches any alphanumeric character (including underscores, for
some reason), and an escaped 's' (\s
) matches any white-space (tab,
newline, space) character.
var digitSurroundedBySpace = /\s\d\s/; show("1a 2 3d".search(digitSurroundedBySpace));
¶ The escaped 'd', 'w', and 's' can be replaced by their capital letter
to mean their opposite. For example, \S
matches any character that
is not white-space. When using [
and ]
, a pattern can be
inverted by starting with a ^
character:
var notABC = /[^ABC]/; show("ABCBACCBBADABC".search(notABC));
¶ As you can see, the way regular expressions use characters to express patterns makes them A) very short, and B) very hard to read.
¶ Write a regular expression that matches a date in the format
"XX/XX/XXXX"
, where the X
s are digits. Test it against the string
"born 15/11/2003 (mother Spot): White Fang"
.
var datePattern = /\d\d\/\d\d\/\d\d\d\d/; show("born 15/11/2003 (mother Spot): White Fang".search(datePattern));
¶ Sometimes you need to make sure a pattern starts at the beginning of a
string, or ends at its end. For this, the special characters ^
and
$
can be used. The first matches the start of the string, the second
the end.
show(/a+/.test("blah")); show(/^a+$/.test("blah"));
¶ The first regular expression matches any string that contains an a
character, the second only those strings that consist entirely of a
characters.
¶ Note that regular expressions are objects, and have methods. Their
test
method returns a boolean indicating whether the given string
matches the expression.
¶ The code \b
matches a 'word boundary', which can be punctuation,
white-space, or the start or end of the string.
show(/cat/.test("concatenate")); show(/\bcat\b/.test("concatenate"));
¶ Parts of a pattern can be allowed to be repeated a number of times.
Putting an asterisk (*
) after an element allows it to be repeated
any number of times, including zero. A plus (+
) does the same, but
requires the pattern to occur at least one time. A question mark (?
)
makes an element 'optional' ― it can occur zero or one times.
var parenthesizedText = /\(.*\)/; show("Its (the sloth's) claws were gigantic!".search(parenthesizedText));
¶ When necessary, braces can be used to be more precise about the amount
of times an element may occur. A number between braces ({4}
) gives
the exact amount of times it must occur. Two numbers with a comma
between them ({3,10}
) indicate that the pattern must occur at least
as often as the first number, and at most as often as the second one.
Similarly, {2,}
means two or more occurrences, while {,4}
means
four or less.
var datePattern = /\d{1,2}\/\d\d?\/\d{4}/; show("born 15/11/2003 (mother Spot): White Fang".search(datePattern));
¶ The pieces /\d{1,2}/
and /\d\d?/
both express 'one or two digits'.
¶ Write a pattern that matches e-mail addresses. For simplicity, assume
that the parts before and after the @
can contain only alphanumeric
characters and the characters .
and -
(dot and dash), while the
last part of the address, the country code after the last dot, may
only contain alphanumeric characters, and must be two or three
characters long.
var mailAddress = /\b[\w\.-]+@[\w\.-]+\.\w{2,3}\b/; show(mailAddress.test("kenny@test.net")); show(mailAddress.test("I mailt kenny@tets.nets, but it didn wrok!")); show(mailAddress.test("the_giant_sloth@gmail.com"));
¶ The \b
s at the start and end of the pattern make sure that the
second string does not match.
¶ Part of a regular expression can be grouped together with parentheses.
This allows us to use *
and such on more than one character. For
example:
var cartoonCrying = /boo(hoo+)+/i; show("Then, he exclaimed 'Boohoooohoohooo'".search(cartoonCrying));
¶ Where did the i
at the end of that regular expression come from?
After the closing slash, 'options' may be added to a regular
expression. An i
, here, means the expression is case-insensitive,
which allows the lower-case B in the pattern to match the upper-case
one in the string.
¶ A pipe character (|
) is used to allow a pattern to make a choice
between two elements. For example:
var holyCow = /(sacred|holy) (cow|bovine|bull|taurus)/i; show(holyCow.test("Sacred bovine!"));
¶ Often, looking for a pattern is just a first step in extracting
something from a string. In previous chapters, this extraction was
done by calling a string's indexOf
and slice
methods a lot. Now
that we are aware of the existence of regular expressions, we can use
the match
method instead. When a string is matched against a regular
expression, the result will be null
if the match failed, or an array
of matched strings if it succeeded.
show("No".match(/Yes/)); show("... yes".match(/yes/)); show("Giant Ape".match(/giant (\w+)/i));
¶ The first element in the returned array is always the part of the string that matched the pattern. As the last example shows, when there are parenthesized parts in the pattern, the parts they match are also added to the array. Often, this makes extracting pieces of string very easy.
var parenthesized = prompt("Tell me something", "").match(/\((.*)\)/); if (parenthesized != null) print("You parenthesized '", parenthesized[1], "'");
¶ Re-write the function extractDate
that we wrote in chapter 4. When
given a string, this function looks for something that follows the
date format we saw earlier. If it can find such a date, it puts the
values into a Date
object. Otherwise, it throws an exception. Make
it accept dates in which the day or month are written with only one
digit.
function extractDate(string) { var found = string.match(/(\d\d?)\/(\d\d?)\/(\d{4})/); if (found == null) throw new Error("No date found in '" + string + "'."); return new Date(Number(found[3]), Number(found[2]) - 1, Number(found[1])); } show(extractDate("born 5/2/2007 (mother Noog): Long-ear Johnson"));
¶ This version is slightly longer than the previous one, but it has the
advantage of actually checking what it is doing, and shouting out when
it is given nonsensical input. This was a lot harder without regular
expressions ― it would have taken a lot of calls to indexOf
to find
out whether the numbers had one or two digits, and whether the dashes
were in the right places.
¶ The replace
method of string values, which we saw in chapter 6, can be
given a regular expression as its first argument.
print("Borobudur".replace(/[ou]/g, "a"));
¶ Notice the g
character after the regular expression. It stands for
'global', and means that every part of the string that matches the
pattern should be replaced. When this g
is omitted, only the first
"o"
would be replaced.
¶ Sometimes it is necessary to keep parts of the replaced strings. For example, we have a big string containing the names of people, one name per line, in the format "Lastname, Firstname". We want to swap these names, and remove the comma, to get a simple "Firstname Lastname" format.
var names = "Picasso, Pablo\nGauguin, Paul\nVan Gogh, Vincent"; print(names.replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));
¶ The $1
and $2
the replacement string refer to the parenthesized
parts in the pattern. $1
is replaced by the text that matched
against the first pair of parentheses, $2
by the second, and so on,
up to $9
.
¶ If you have more than 9 parentheses parts in your pattern, this will
no longer work. But there is one more way to replace pieces of a
string, which can also be useful in some other tricky situations. When
the second argument given to the replace
method is a function value
instead of a string, this function is called every time a match is
found, and the matched text is replaced by whatever the function
returns. The arguments given to the function are the matched elements,
similar to the values found in the arrays returned by match
: The
first one is the whole match, and after that comes one argument for
every parenthesized part of the pattern.
function eatOne(match, amount, unit) { amount = Number(amount) - 1; if (amount == 1) { unit = unit.slice(0, unit.length - 1); } else if (amount == 0) { unit = unit + "s"; amount = "no"; } return amount + " " + unit; } var stock = "1 lemon, 2 cabbages, and 101 eggs"; stock = stock.replace(/(\d+) (\w+)/g, eatOne); print(stock);
¶ That last trick can be used to make the HTML-escaper from chapter 6 more efficient. You may remember that it looked like this:
function escapeHTML(text) { var replacements = [["&", "&"], ["\"", """], ["<", "<"], [">", ">"]]; forEach(replacements, function(replace) { text = text.replace(replace[0], replace[1]); }); return text; }
¶ Write a new function escapeHTML
, which does the same thing, but only
calls replace
once.
function escapeHTML(text) { var replacements = {"<": "<", ">": ">", "&": "&", "\"": """}; return text.replace(/[<>&"]/g, function(character) { return replacements[character]; }); } print(escapeHTML("The 'pre-formatted' tag is written \"<pre>\"."));
¶ The replacements
object is a quick way to associate each character
with its escaped version. Using it like this is safe (i.e. no
Dictionary
object is needed), because the only properties that will
be used are those matched by the /[<>&"]/
expression.
¶ There are cases where the pattern you need to match against is not known while you are writing the code. Say we are writing a (very simple-minded) obscenity filter for a message board. We only want to allow messages that do not contain obscene words. The administrator of the board can specify a list of words that he or she considers unacceptable.
¶ The most efficient way to check a piece of text for a set of words is to use a regular expression. If we have our word list as an array, we can build the regular expression like this:
var badWords = ["ape", "monkey", "simian", "gorilla", "evolution"]; var pattern = new RegExp(badWords.join("|"), "i"); function isAcceptable(text) { return !pattern.test(text); } show(isAcceptable("Mmmm, grapes.")); show(isAcceptable("No more of that monkeybusiness, now."));
¶ We could add \b
patterns around the words, so that the thing about
grapes would not be classified as unacceptable. That would also make
the second one acceptable, though, which is probably not correct.
Obscenity filters are hard to get right (and usually way too annoying
to be a good idea).
¶ The first argument to the RegExp
constructor is a string
containing the pattern, the second argument can be used to add
case-insensitivity or globalness. When building a string to hold the
pattern, you have to be careful with backslashes. Because, normally,
backslashes are removed when a string is interpreted, any backslashes
that must end up in the regular expression itself have to be escaped:
var digits = new RegExp("\\d+"); show(digits.test("101"));
¶ The most important thing to know about regular expressions is that they exist, and can greatly enhance the power of your string-mangling code. They are so cryptic that you'll probably have to look up the details on them the first ten times you want to make use of them. Persevere, and you will soon be off-handedly writing expressions that look like occult gibberish
¶ (Comic by Randall Munroe.)