Regex Demystified
Podcast: Play in new window | Download (51.5MB) | Embed
Subscribe: Apple Podcasts | Spotify | Email | RSS | More
Per wikipedia, a regular expression (otherwise known as regex or regexp) is a sequence of characters that define a search pattern. Usually, such patters are used by string searching algorithms for “find” or “find and replace” operations on a sequence of text, or for input validation. This technique was developed in theoretical computer science and formal language theory.
Regular expressions are used in search engines, search and replace dialogs of word processors and text editors. They are also used in text processing programs like sed and awk, and in lexical analysis. Most programming languages provide some means of working with regular expressions, either in the core language itself or in libraries.
Regular expressions are a deep subject. Not only is the concept complicated, but the output looks a lot like a cat sat down on the keyboard. A lot of theory underpins the concept of regular expressions and a lot of things need to be taken into account when using them in a professional environment. However, they are powerful once you understand what’s going on under the hood.
Episode Breakdown
History
Originated in 1951 when Stephen Cole Kleen, a mathematician, described regular languages (we’ll get to that in a minute) using his mathematical notation of regular events. These had their origins in theoretical computer science in the fields of automata theory and the description and classification of formal languages. Earlier implementations of pattern matching (such as those in SNOBOL), didn’t use regex, but had their own constructs.
Entered popular use in 1968 for two purposes, text matching in an editor and lexical analysis in a compiler. Ken Thompson built one of the first implementations and used it in the QED editor to match patterns in text files. His implementation used Just-In-Time (JIT) compilation to IBM 7094 code. He would later add to this implementation in the unix text editor ed. This implementation would eventually lead to the implementation in grep, whose name derived from the command for regular expression searching in the ed editor.
Bell labs was heavily involved in regex in the 70s. Vi, lex, sed, awk, expr, and emacs have some of their origins during this period. The POSIX-2 standard in 1992 included these standardized forms.
In the 80s, more complicated regexes arose in Perl. Henry Spencer wrote this implementation in 1986 and later wrote an implementation for TCL. PostgreSQL uses this implementation. Perl has continued to iterate on this, with the newest iteration (Raku) adding a lot of improvements, including the ability to define parsing expression grammars.
Regex also started appearing in a lot of other places in the 80s and following decades. ISO SGML consolidated in the 80s and included it. The kernel of the structure specification language standards consists of regexes (see DTD element group syntax). Regexes also appear in web servers such as apache, nginx, and IIS. In the late 2010s, several companies started to offer hardware implementations of PCRE compatible regex engines that are faster than CPU implementations.
Definitions
A “pattern” is roughly equivalent to a regular expression and is used to specify a set of strings for a particular purpose.
A “match” is defined as a string that meets the specifications of a particular pattern.
All characters in a pattern are either a literal or a metacharacter. A “metacharacter” is a character that has special meaning to a computer program. They are designed to be easy to type on an ascii keyboard. Metacharacters and literals together form pattern specifications that are used in regular expressions.
Wildcard characters are placeholders represented by a single character (like *) that can be interpreted as a sequence of zero or more literal characters.
Globbing is used with filename searches to specify sets of similar filenames. For instance “txt*” matches “txt1” and “txt2”.
A token is a character or group of characters used in a pattern. In the previous example, “txt” might be considered a token.
A quantifier specifies how many instances of a token are expected.
A regular language is a formal language that can be expressed using a regular expression.
A formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific sets of rules.
A set is closed under an operation if performance of that operation always produces a member of that set. For instance, positive integers are closed under addition, but not under subtraction.
Groups are sections of a regular expression enclosed in parenthesis. They can be named or referred to by position. Parentheses create numbered capturing groups. These can be used to extract part of a match.
The basics
A pipe character “|” or vertical bar is used to separate alternatives. Parenthesis are used to define the scope and precedence of operators, like in math. Parenthesis may be omitted if the order of operations is clear.
A quantifier after a token specifies how often that element is allowed to occur. Some common examples are: The question mark indicates zero or one occurrences of the preceding element. The asterisk indicates zero or more occurrences of the element. The plus sign indicates one or more occurrences. You can also specify a specific number of times, or a range.
A wildcard (shown as a period) matches any character. For instance, the string “a.*b” indicates a sequence that has an a, with a b following at some point later (note the wildcard). You can combine these constructs to build much more complex regular expressions in a manner similar to the way that you can build up complex arithmetic formulae from simple arithmetic operators.
Matching and Quantifiers
Now that you have some idea of regex syntax, it’s important to talk about quantifiers which tell you how matches occur. If you recall from early, you add a quantifier after a symbol to indicate how many of the character, subexpression, or token to the left are required for a match. If you have sequence of characters, followed by a quantifier, the quantifier only applies to the last character, unless you have grouped the characters. What makes quantifiers complicated is how many different things in a string might match the expression you give the engine. What happens if there is a wildcard in there and there are multiple places in the remainder of the string that match what follows the wildcard.
Greedy quantifiers cause the algorithm to take the longest possible string that still matches the expression. This is the default. The + and . quantifiers are examples of this.
Docile quantifiers cause the (naturally greedy) algorithm to “give back” matching characters if taking them causes a match to fail later on. The .* quantifier is an example of this. It starts with a greedy match ., then backtracks and gives up matched characters if the following quantifier isn’t met.
Lazy quantifiers cause the algorithm to return the shortest match possible. *? is an example of this. It tells the algorithm to match zero or more characters, but return the shortest successful match.
Helpful quantifiers cause the algorithm to start by matching as few tokens as the quantifier allows, and then expanding as needed. A*B is a good example of this. The * will cause it to start out by matching zero characters, then expand if that’s the only way to get a match.
A++. is an example of possessive. The plus signs cause the algorithm to greedily match all the characters in the string. When it hits the period at the end, there are no characters left to process, causing it not to match.
Replacement
Regexes can also handle character replacements. Replacements happen when a match occurs. They can replace the matched text itself or text before or after it. The match triggers it, but it isn’t necessarily a replacement of the match. This can be anything from simple literals to named backreferences within the expression. You could also do transformations of the text, for instance, finding a pattern and converting it to uppercase.
How regex replacement works is dependent on the system and is another topic of its own. Generally, you are going to either replace with a literal, or with a captured group from the string itself. Captured groups can be referenced by position by leading with a backslash and following with an offset or referenced by name.
Unicode Purgatory
“In theory, any token set can be matched by regular expressions as long as it is predefined” (wikipedia)
Many modern engines have some support for unicode. This can be tricky, as some libraries expect UTF-8, UTF-16, or UTF-32. Many only support the Basic MultiLingual Plane, in other words, 16 bits. Few can handle the full range. There are also interesting wrinkles in extending ASCII-oriented constructs to unicode, especially once you start trying to cross unicode blocks.
Formal Language theory
Regular expressions consist of constants, which denote sets of strings, and operator symbols which denote operations over these sets. The standard definition is as follows. Given a finite alphabet Σ (sigma), the following constants are defined as regular expressions. The empty set, ∅ (or theta) is defined as the set having no elements. An empty string ε (epsilon) is a set containing only the empty string. A literal, such as “a” in Σ denotes the set that contains only the character a.
Given two regular expressions, R and S, the following operations over them are defined to produce regular expressions. Concatenation – RS denotes the set of strings that can be defined by concatenating a string in the set R and the set S. Alternation – R | S denotes the set union of sets described by R and S. In other words, it combines the sets. a | b* denotes “a”, “b”, “bb”, etc. Kleene star (not Mr. Clean) R* denotes the smallest superset of the set described by R that contains ε (the empty string) and is closed under string concatenation (aka, every member of the set of strings that could be produced by concatenation is already in the set).
Expression compactness is built in. Wildcards can be defined in other ways. For instance, in the set Σ, a+ can be defined as aa* and a? can be defined as (a | ε). The Complement Operator (referring to the set of items not in a given set) can also often make such expressions more compact but isn’t strictly necessary.
Equivalence, because of the facts noted above, there is more than one way to create a regular expression that denotes equivalent sets of possible strings. It is possible to write an algorithm that can take two regular expressions and determine whether they are equal by reducing them to minimal deterministic finite state machines and checking for isomorphism (equivalence).
Uses
Obviously used in text processing. Find and Replace. Templating systems. Scraping. Syntax highlight.
Data validation and cleanup. Bulk data processing systems often use a ton of regex.
Compilers, lexers, and data parsing for computing. Finding symbols and incorrect syntax. Removing deadspace, whitespace, and noise before processing.
Heavily used in search. This includes google algorithms, database text search, etc. Most search engines use regex internally. Few offer it to the public.
Web stuff. Route pattern matching in web requests. Validation of incoming data to make sure that it meets validation rules. Cleanup of potential code injection vulnerabilities.
Potential use in genetics and other things that could be tokenized? If you could tokenize financial behavior into a regular language, it might be possible to detect some patterns of fraud with regex.
Notes for Devs
Be careful about lifetime of regexes. If your runtime dynamically generates code or abstract syntax trees for use in processing regular expressions, make sure you understand how long they hang around. Also be aware of what situations can cause a regeneration, as this can hurt performance. You may want to cache regular expressions that are frequently used by your application, especially if code generation is a time-consuming process.
Be careful of escape sequences. Generally you will be encoding regular expressions as string literals. If these are in code, make sure that you are specifying the regular expression that you think you are specifying, and not something that is messed up by your language of choice’s escape sequences. You might be better off specifying regular expressions in a way that doesn’t require dealing with your language’s escape sequences at all, especially if you expect to have to modify them in the future.
Readability is a trick. Comment heavily. Don’t assume that programmers are any good at dealing with regular expressions – that world doesn’t exist any more. Be careful about using long string literals, or worse, concatenation to create regexes in source code.
Book Club
Art Matters: Because Your Imagination Can Change The World
Neil Gaiman
“The world always seems brighter when you’ve made something that wasn’t there before.” ~ Neil Gaiman
The third section is titled “Making a Chair.” In this chapter Gaiman tells a story about assembling an office chair, something many of us have done at one point in time. He then compares assembling the chair to writing a book. He talks about how writing should come with instructions and safety warnings. He implies there are rules to writing but they are fun and useful to break. Such as using an office chair as a step ladder.
Tricks of the Trade
Don’t be afraid to lean on experts for expertise in things that you don’t want to be expert in.