Chapter 10: Regular Expressions

¶ At various points in the previous chapters, we had to look for patterns in string values. In chapter 4 we extracted date values from strings by writing out the precise positions at which the numbers that were part of the date could be found. Later, in chapter 6, we saw some particularly ugly pieces of code for finding certain types of characters in a string, for example the characters that had to be escaped in HTML output.

¶ Regular expressions are a language for describing patterns in strings. They form a small, separate language, which is embedded inside JavaScript (and in various other programming languages, in one way or another). It is not a very readable language ― big regular expressions tend to be completely unreadable. But, it is a useful tool, and can really simplify string-processing programs.

¶ Just like strings get written between quotes, regular expression patterns get written between slashes (/). This means that slashes inside the expression have to be escaped.

var slash = /\//;
show("AC/DC".search(slash));

¶ The search method resembles indexOf, but it searches for a regular expression instead of a string. Patterns specified by regular expressions can do a few things that strings can not do. For a start, they allow some of their elements to match more than a single character. In chapter 6, when extracting mark-up from a document, we needed to find the first asterisk or opening brace in a string. That could be done like this:

var asteriskOrBrace = /[\{\*]/;
var story =
  "We noticed the *giant sloth*, hanging from a giant branch.";
show(story.search(asteriskOrBrace));

¶ The [ and ] characters have a special meaning inside a regular expression. They can enclose a set of characters, and they mean 'any of these characters'. Most non-alphanumeric characters have some special meaning inside a regular expression, so it is a good idea to always escape them with a backslash1 when you use them to refer to the actual characters.

¶ There are a few shortcuts for sets of characters that are needed often. The dot (.) can be used to mean 'any character that is not a newline', an escaped 'd' (\d) means 'any digit', an escaped 'w' (\w) matches any alphanumeric character (including underscores, for some reason), and an escaped 's' (\s) matches any white-space (tab, newline, space) character.

var digitSurroundedBySpace = /\s\d\s/;
show("1a 2 3d".search(digitSurroundedBySpace));

¶ The escaped 'd', 'w', and 's' can be replaced by their capital letter to mean their opposite. For example, \S matches any character that is not white-space. When using [ and ], a pattern can be inverted by starting with a ^ character:

var notABC = /[^ABC]/;
show("ABCBACCBBADABC".search(notABC));

¶ As you can see, the way regular expressions use characters to express patterns makes them A) very short, and B) very hard to read.

Ex. 10.1

¶ Write a regular expression that matches a date in the format "XX/XX/XXXX", where the Xs are digits. Test it against the string "born 15/11/2003 (mother Spot): White Fang".

var datePattern = /\d\d\/\d\d\/\d\d\d\d/;
show("born 15/11/2003 (mother Spot): White Fang".search(datePattern));

¶ Sometimes you need to make sure a pattern starts at the beginning of a string, or ends at its end. For this, the special characters ^ and $ can be used. The first matches the start of the string, the second the end.

show(/a+/.test("blah"));
show(/^a+$/.test("blah"));

¶ The first regular expression matches any string that contains an a character, the second only those strings that consist entirely of a characters.

¶ Note that regular expressions are objects, and have methods. Their test method returns a boolean indicating whether the given string matches the expression.

¶ The code \b matches a 'word boundary', which can be punctuation, white-space, or the start or end of the string.

show(/cat/.test("concatenate"));
show(/\bcat\b/.test("concatenate"));

¶ Parts of a pattern can be allowed to be repeated a number of times. Putting an asterisk (*) after an element allows it to be repeated any number of times, including zero. A plus (+) does the same, but requires the pattern to occur at least one time. A question mark (?) makes an element 'optional' ― it can occur zero or one times.

var parenthesizedText = /\(.*\)/;
show("Its (the sloth's) claws were gigantic!".search(parenthesizedText));

¶ When necessary, braces can be used to be more precise about the amount of times an element may occur. A number between braces ({4}) gives the exact amount of times it must occur. Two numbers with a comma between them ({3,10}) indicate that the pattern must occur at least as often as the first number, and at most as often as the second one. Similarly, {2,} means two or more occurrences, while {,4} means four or less.

var datePattern = /\d{1,2}\/\d\d?\/\d{4}/;
show("born 15/11/2003 (mother Spot): White Fang".search(datePattern));

¶ The pieces /\d{1,2}/ and /\d\d?/ both express 'one or two digits'.

Ex. 10.2

¶ Write a pattern that matches e-mail addresses. For simplicity, assume that the parts before and after the @ can contain only alphanumeric characters and the characters . and - (dot and dash), while the last part of the address, the country code after the last dot, may only contain alphanumeric characters, and must be two or three characters long.

var mailAddress = /\b[\w\.-]+@[\w\.-]+\.\w{2,3}\b/;

show(mailAddress.test("kenny@test.net"));
show(mailAddress.test("I mailt kenny@tets.nets, but it didn wrok!"));
show(mailAddress.test("the_giant_sloth@gmail.com"));

¶ The \bs at the start and end of the pattern make sure that the second string does not match.

¶ Part of a regular expression can be grouped together with parentheses. This allows us to use * and such on more than one character. For example:

var cartoonCrying = /boo(hoo+)+/i;
show("Then, he exclaimed 'Boohoooohoohooo'".search(cartoonCrying));

¶ Where did the i at the end of that regular expression come from? After the closing slash, 'options' may be added to a regular expression. An i, here, means the expression is case-insensitive, which allows the lower-case B in the pattern to match the upper-case one in the string.

¶ A pipe character (|) is used to allow a pattern to make a choice between two elements. For example:

var holyCow = /(sacred|holy) (cow|bovine|bull|taurus)/i;
show(holyCow.test("Sacred bovine!"));

¶ Often, looking for a pattern is just a first step in extracting something from a string. In previous chapters, this extraction was done by calling a string's indexOf and slice methods a lot. Now that we are aware of the existence of regular expressions, we can use the match method instead. When a string is matched against a regular expression, the result will be null if the match failed, or an array of matched strings if it succeeded.

show("No".match(/Yes/));
show("... yes".match(/yes/));
show("Giant Ape".match(/giant (\w+)/i));

¶ The first element in the returned array is always the part of the string that matched the pattern. As the last example shows, when there are parenthesized parts in the pattern, the parts they match are also added to the array. Often, this makes extracting pieces of string very easy.

var parenthesized = prompt("Tell me something", "").match(/\((.*)\)/);
if (parenthesized != null)
  print("You parenthesized '", parenthesized[1], "'");

Ex. 10.3

¶ Re-write the function extractDate that we wrote in chapter 4. When given a string, this function looks for something that follows the date format we saw earlier. If it can find such a date, it puts the values into a Date object. Otherwise, it throws an exception. Make it accept dates in which the day or month are written with only one digit.

function extractDate(string) {
  var found = string.match(/(\d\d?)\/(\d\d?)\/(\d{4})/);
  if (found == null)
    throw new Error("No date found in '" + string + "'.");
  return new Date(Number(found[3]), Number(found[2]) - 1,
                  Number(found[1]));
}

show(extractDate("born 5/2/2007 (mother Noog): Long-ear Johnson"));

¶ This version is slightly longer than the previous one, but it has the advantage of actually checking what it is doing, and shouting out when it is given nonsensical input. This was a lot harder without regular expressions ― it would have taken a lot of calls to indexOf to find out whether the numbers had one or two digits, and whether the dashes were in the right places.

¶ The replace method of string values, which we saw in chapter 6, can be given a regular expression as its first argument.

print("Borobudur".replace(/[ou]/g, "a"));

¶ Notice the g character after the regular expression. It stands for 'global', and means that every part of the string that matches the pattern should be replaced. When this g is omitted, only the first "o" would be replaced.

¶ Sometimes it is necessary to keep parts of the replaced strings. For example, we have a big string containing the names of people, one name per line, in the format "Lastname, Firstname". We want to swap these names, and remove the comma, to get a simple "Firstname Lastname" format.

var names = "Picasso, Pablo\nGauguin, Paul\nVan Gogh, Vincent";
print(names.replace(/([\w ]+), ([\w ]+)/g, "$2 $1"));

¶ The $1 and $2 the replacement string refer to the parenthesized parts in the pattern. $1 is replaced by the text that matched against the first pair of parentheses, $2 by the second, and so on, up to $9.

¶ If you have more than 9 parentheses parts in your pattern, this will no longer work. But there is one more way to replace pieces of a string, which can also be useful in some other tricky situations. When the second argument given to the replace method is a function value instead of a string, this function is called every time a match is found, and the matched text is replaced by whatever the function returns. The arguments given to the function are the matched elements, similar to the values found in the arrays returned by match: The first one is the whole match, and after that comes one argument for every parenthesized part of the pattern.

function eatOne(match, amount, unit) {
  amount = Number(amount) - 1;
  if (amount == 1) {
    unit = unit.slice(0, unit.length - 1);
  }
  else if (amount == 0) {
    unit = unit + "s";
    amount = "no";
  }
  return amount + " " + unit;
}

var stock = "1 lemon, 2 cabbages, and 101 eggs";
stock = stock.replace(/(\d+) (\w+)/g, eatOne);

print(stock);

Ex. 10.4

¶ That last trick can be used to make the HTML-escaper from chapter 6 more efficient. You may remember that it looked like this:

function escapeHTML(text) {
  var replacements = [["&", "&amp;"], ["\"", "&quot;"],
                      ["<", "&lt;"], [">", "&gt;"]];
  forEach(replacements, function(replace) {
    text = text.replace(replace[0], replace[1]);
  });
  return text;
}

¶ Write a new function escapeHTML, which does the same thing, but only calls replace once.

function escapeHTML(text) {
  var replacements = {"<": "&lt;", ">": "&gt;",
                      "&": "&amp;", "\"": "&quot;"};
  return text.replace(/[<>&"]/g, function(character) {
    return replacements[character];
  });
}

print(escapeHTML("The 'pre-formatted' tag is written \"<pre>\"."));

¶ The replacements object is a quick way to associate each character with its escaped version. Using it like this is safe (i.e. no Dictionary object is needed), because the only properties that will be used are those matched by the /[<>&"]/ expression.

¶ There are cases where the pattern you need to match against is not known while you are writing the code. Say we are writing a (very simple-minded) obscenity filter for a message board. We only want to allow messages that do not contain obscene words. The administrator of the board can specify a list of words that he or she considers unacceptable.

¶ The most efficient way to check a piece of text for a set of words is to use a regular expression. If we have our word list as an array, we can build the regular expression like this:

var badWords = ["ape", "monkey", "simian", "gorilla", "evolution"];
var pattern = new RegExp(badWords.join("|"), "i");
function isAcceptable(text) {
  return !pattern.test(text);
}

show(isAcceptable("Mmmm, grapes."));
show(isAcceptable("No more of that monkeybusiness, now."));

¶ We could add \b patterns around the words, so that the thing about grapes would not be classified as unacceptable. That would also make the second one acceptable, though, which is probably not correct. Obscenity filters are hard to get right (and usually way too annoying to be a good idea).

¶ The first argument to the RegExp constructor is a string containing the pattern, the second argument can be used to add case-insensitivity or globalness. When building a string to hold the pattern, you have to be careful with backslashes. Because, normally, backslashes are removed when a string is interpreted, any backslashes that must end up in the regular expression itself have to be escaped:

var digits = new RegExp("\\d+");
show(digits.test("101"));

¶ The most important thing to know about regular expressions is that they exist, and can greatly enhance the power of your string-mangling code. They are so cryptic that you'll probably have to look up the details on them the first ten times you want to make use of them. Persevere, and you will soon be off-handedly writing expressions that look like occult gibberish

¶ (Comic by Randall Munroe.)

In this case, the backslashes were not really necessary, because the characters occur between [ and ], but it is easier to just escape them anyway, so you won't have to think about it.