Figure 25-4. Pattern matching with a regular expression.
Regular expressions are used as “templates” that match patterns of text. For example, the regular expression for the pattern matching in Figure 25-4 is[1]
\([Ii]f \|and \)*\(<i>[AC]\+<\/i>.\)\(and\)\?
To understand any URL manipulation solution to the problem of non-search-engine-friendly URLs, you have to get acquainted with Regular Expressions. To get you started, read Using Regular Expressions and Matching Patterns in Text. We can only touch the basics here, for which we use material taken from A Brief Introduction to Regular Expressions:
An expression is a string of characters. Those characters that have an interpretation above and beyond their literal meaning are called metacharacters. A quote symbol, for example, may denote speech by a person, ditto, or a meta-meaning for the symbols that follow. Regular Expressions are sets of characters and/or metacharacters that UNIX endows with special features.
The main uses for Regular Expressions (REs) are text searches and string manipulation. An RE matches a single character or a set of characters (a substring or an entire string).
The asterisk -- * -- matches any number of repeats of the character string or RE preceding it, including zero.
"1133*" matches 11 + one or more 3's + possibly other characters: 113, 1133, 111312, and so forth.
The dot -- . -- matches any one character, except a newline. [2]
"13." matches 13 + at least one of any character (including a space): 1133, 11333, but not 13 (additional character missing).
The caret -- ^ -- matches the beginning of a line, but sometimes, depending on context, negates the meaning of a set of characters in an RE.
The dollar sign -- $ -- at the end of an RE matches the end of a line.
"^$" matches blank lines.
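The asterisk, dot and anchor examples above can be tried out with a short script. This is only a sketch using Python's `re` module, which is an assumption of this example (the text describes UNIX-style REs); for these three metacharacters the two behave the same way. The names `star_hits`, `dot_hits` and `blank_indexes` are made up for illustration.

```python
import re

# "1133*": "113" followed by zero or more further 3's.
star = re.compile(r"1133*")
star_hits = [s for s in ["113", "1133", "111312", "112"] if star.search(s)]

# "13.": "13" followed by one more character of any kind (except a newline).
dot = re.compile(r"13.")
dot_hits = [s for s in ["1133", "11333", "13"] if dot.search(s)]

# "^$": start of line immediately followed by end of line, i.e. a blank line.
blank = re.compile(r"^$")
sample_lines = ["first", "", "third"]
blank_indexes = [i for i, line in enumerate(sample_lines) if blank.search(line)]

print(star_hits)      # strings that contain a match for 1133*
print(dot_hits)       # strings that contain a match for 13.
print(blank_indexes)  # positions of the blank lines in sample_lines
```

Note that "112" is rejected by the first pattern because it contains no "113", and "13" is rejected by the second because the dot needs one more character to consume.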
Brackets -- [...] -- enclose a set of characters to match in a single RE.
"[xyz]" matches the characters x, y, or z.
"[c-n]" matches any of the characters in the range c to n.
"[B-Pk-y]" matches any of the characters in the ranges B to P and k to y.
"[a-z0-9]" matches any lowercase letter or any digit.
"[^b-d]" matches all characters except those in the range b to d. This is an instance of ^ negating or inverting the meaning of the following RE (taking on a role similar to ! in a different context).
Combined sequences of bracketed characters match common word patterns. "[Yy][Ee][Ss]" matches yes, Yes, YES, yEs, and so forth. "[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]" matches any Social Security number.
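The bracket expressions above can be checked with a few lines of Python. This is a minimal sketch, assuming Python's `re` module treats these character classes like the UNIX tools do (it does for simple sets, ranges and negation). The helper `matches_fully` and the sample number "123-45-6789" are invented for this illustration.

```python
import re

def matches_fully(pattern, text):
    """Return True if the whole of text is matched by pattern."""
    return re.fullmatch(pattern, text) is not None

yes_re = r"[Yy][Ee][Ss]"
ssn_re = r"[0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]"

# Every upper/lower case spelling of "yes" matches; "no" does not.
yes_results = [matches_fully(yes_re, w) for w in ["yes", "Yes", "YES", "yEs", "no"]]

# [^b-d] keeps every character that is NOT in the range b to d.
not_b_to_d = re.findall(r"[^b-d]", "abcde")

# [B-Pk-y] keeps characters in either of the two ranges.
in_ranges = re.findall(r"[B-Pk-y]", "AbCkz")

# The pattern checks the digit/dash layout only, not whether the number is real.
ssn_ok = matches_fully(ssn_re, "123-45-6789")
ssn_bad = matches_fully(ssn_re, "123-456-789")

print(yes_results, not_b_to_d, in_ranges, ssn_ok, ssn_bad)
```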
The backslash -- \ -- escapes a special character, which means that character gets interpreted literally.
A "\$" reverts back to its literal meaning of "$", rather than its RE meaning of end-of-line. Likewise a "\\" has the literal meaning of "\".
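The effect of escaping can be seen by comparing an escaped and an unescaped dollar sign. The sketch below assumes Python's `re` module; the sample strings are made up.

```python
import re

text = "cost: $5, tax: 2"

# Escaped: "\$" is a literal dollar sign, so this finds "$" followed by a digit.
literal_dollar = re.findall(r"\$[0-9]", text)

# Unescaped: "$" is the end-of-line anchor, so this finds a digit at the very end.
end_anchor = re.findall(r"[0-9]$", text)

# "\\" in the pattern stands for one literal backslash in the text.
backslashes = re.findall(r"\\", "C:\\temp\\x")

print(literal_dollar, end_anchor, len(backslashes))
```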
Escaped "angle brackets" -- \<...\> -- mark word boundaries. The angle brackets must be escaped, since otherwise they have only their literal character meaning:
"\<the\>" matches the word "the", but not the words "them", "there", "other", etc.
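Word boundaries can be demonstrated in Python as well, with one caveat: Python's `re` module does not support the \< and \> notation, so the sketch below substitutes `\b`, which marks a word boundary on either side. That substitution is an assumption of this example, not something taken from the text above.

```python
import re

# \bthe\b plays the role of \<the\> : "the" as a whole word only.
whole_word = re.compile(r"\bthe\b")

candidates = ["the", "them", "there", "other", "over the hill"]
whole_word_hits = [s for s in candidates if whole_word.search(s)]

# Without boundaries, "the" is found inside longer words too.
substring_hits = [s for s in candidates if re.search(r"the", s)]

print(whole_word_hits)
print(substring_hits)
```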
The question mark -- ? -- matches zero or one of the previous RE. It is generally used for matching single characters.
The plus -- + -- matches one or more of the previous RE. It serves a role similar to the *, but does not match zero occurrences.
Escaped "curly brackets" -- \{ \} -- indicate the number of occurrences of a preceding RE to match. It is necessary to escape the curly brackets since they have only their literal character meaning otherwise.
"[0-9]\{5\}" matches exactly five digits (characters in the range of 0 to 9).
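The three repetition operators (question mark, plus, curly brackets) can be compared side by side. The sketch assumes Python's `re` module, where the curly brackets are written unescaped ({5} rather than \{5\}); the sample words are invented.

```python
import re

# "?" : the preceding "u" may appear zero times or once.
optional_u = re.compile(r"colou?r")
optional_hits = [w for w in ["color", "colour", "colouur"] if optional_u.fullmatch(w)]

# "+" : the preceding "b" must appear at least once.
one_or_more_b = re.compile(r"ab+")
plus_hits = [w for w in ["a", "ab", "abbb"] if one_or_more_b.fullmatch(w)]

# "{5}" : the preceding [0-9] must appear exactly five times in a row.
five_digits = re.compile(r"[0-9]{5}")
found = five_digits.search("zip 90210")
too_short = five_digits.search("zip 9021")

print(optional_hits, plus_hits, found.group(), too_short)
```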
Parentheses -- ( ) -- enclose groups of REs. They are useful with the following "|" operator and in substring extraction using expr.
The -- | -- "or" RE operator matches any of a set of alternate characters.
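Grouping and the "or" operator are usually used together, as the following sketch shows. It assumes Python's `re` module, and the pattern `(cat|dog)s?` is a made-up example rather than one taken from the text.

```python
import re

# The parentheses limit the alternation to "cat" or "dog";
# the optional "s" applies to whichever alternative matched.
pets = re.compile(r"(cat|dog)s?")

pet_hits = [w for w in ["cat", "dogs", "bird", "cats"] if pets.fullmatch(w)]

# The group also captures the matched alternative for later use.
captured = pets.fullmatch("dogs").group(1)

print(pet_hits, captured)
```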
What does the above tell us when we encounter a cryptic mod_rewrite directive that looks like the following?
RewriteEngine on
RewriteRule ^page1\.html$ page2.html [R=301,L]
Of course, the first line is easy: mod_rewrite is not enabled by default, so this line starts the “Rewrite Engine”. The second directive is a “Rewrite Rule” that instructs mod_rewrite to translate whatever URL is matched by the regular expression “^page1\.html$” to “page2.html”.
What URLs does the regular expression “^page1\.html$” match?
In this example, adapted from An Introduction to Redirecting URLs on an Apache Server, we have a caret at the beginning of the pattern, and a dollar sign at the end. These are regex special characters called anchors. The caret anchors the match to the start of the string, so the match must begin with the character that immediately follows it, in this case a "p". The dollar sign anchor tells regex that this is the end of the string we want to match. For the URL "page1.html" itself, "page1\.html" and "^page1\.html$" behave identically: both match it. However, "page1\.html" matches any string containing "page1.html" (apage1.html for example) anywhere in the URL, but "^page1\.html$" matches only a string which is exactly equal to "page1.html". In a more complex redirect, anchors (and other special regex characters) are often essential.
Putting all the above together, we can see that “^page1\.html$” matches URLs that start (the caret -- ^ --) with “page1”, immediately followed by a literal dot (escaped dot --\.--, as opposed to a simple dot, which is a metacharacter that matches any single character except newline), immediately followed by “html” and the end of the URL (dollar sign --$--).
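The difference the anchors and the escaped dot make can be verified outside Apache. The following sketch uses Python's `re` module as a stand-in for mod_rewrite's regex engine, which is an assumption of this example; the list of test URLs is made up.

```python
import re

urls = ["page1.html", "apage1.html", "page1.html.bak", "page1xhtml"]

# No anchors: matches wherever "page1.html" appears inside the URL.
unanchored_hits = [u for u in urls if re.search(r"page1\.html", u)]

# Both anchors: the URL must be exactly "page1.html".
anchored_hits = [u for u in urls if re.search(r"^page1\.html$", u)]

# Anchors but an unescaped dot: the dot now matches any character,
# so "page1xhtml" slips through as well.
unescaped_dot_hits = [u for u in urls if re.search(r"^page1.html$", u)]

print(unanchored_hits)
print(anchored_hits)
print(unescaped_dot_hits)
```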
In our example, we also have an "[R=301,L]". These are called flags in mod_rewrite and they're optional parameters. "R=301" instructs Apache to send the browser an external redirect with a 301 (Moved Permanently) status code; when no code is given, as in [R,L], the status defaults to 302. The "L" flag tells Apache that this is the last rule that it needs to process, IF the RewriteRule pattern is matched. Experts suggest that you get in the habit of including the "L" flag with every RewriteRule to avoid unpleasant surprises.
One powerful option in creating search patterns is specifying that a subexpression that was matched earlier in a regular expression is matched again later in the expression. We do this using backreferences. Backreferences are named by the numbers 1 through 9, preceded by the backslash/escape character when used in this manner (in mod_rewrite, you have to use the dollar sign instead of the backslash, but in PHP you will use the backslash, so don't get confused, it just depends on the context the regular expression is in). These backreferences refer to each successive group in the match pattern, as in /(one)(two)(three)/\1\2\3/ (or $1, $2 and $3 for mod_rewrite). Each numbered backreference refers to the group that has the word corresponding to the number.
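Both uses of backreferences, inside the replacement and inside the pattern itself, can be sketched in Python, which uses the backslash notation (\1, \2, ...) like PHP does. The sample strings are invented for the illustration.

```python
import re

# In a replacement: \1, \2, \3 refer to the three groups of the pattern,
# so the captured words can be emitted in a different order.
reordered = re.sub(r"(one)(two)(three)", r"\3\2\1", "onetwothree")

# Inside the pattern: \1 must match the same text that group 1 matched,
# which makes it easy to spot an accidentally doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")

print(reordered)
print(doubled.group(0), "/", doubled.group(1))
```

In a mod_rewrite substitution the same groups would be written as $1, $2 and $3.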
Thus the following URL translation:
#Your Account
RewriteRule ^userinfo-([a-zA-Z0-9_-]*)\.html modules.php?name=Your_Account&op=userinfo&username=$1
in the .htaccess file (Section 25.4) will match any URL that starts (caret --^--) with “userinfo-”, immediately followed by any number (star --*--) of characters belonging to the alphanumeric class (a-z, A-Z, 0-9), including underscores (_) and dashes (-), followed by a literal dot (an escaped dot --\.--) and “html”. The Rewrite Rule instructs mod_rewrite to translate the URL to
modules.php?name=Your_Account&op=userinfo&username=$1
where $1 is a backreference, referring to the first matched subexpression, the one inside the parentheses (). Since inside the parentheses is a regular expression that matches “any number of characters belonging to the alphanumeric class, including underscores and dashes”, $1 will contain whatever alphanumeric characters were between “userinfo-” and “.html” (including underscores and dashes). In PHP-Nuke, this is the username, so that the URL returned by mod_rewrite will be
modules.php?name=Your_Account&op=userinfo&username=(some matched username)
thus completing the transformation of a static URL (that PHP-Nuke does not understand), to a dynamic one that makes perfect sense to PHP-Nuke (see Section 25.5.1.3 for the complete picture).
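The translation performed by this rule can be imitated in a few lines of Python to see what ends up in the backreference. This is only a rough simulation, not how mod_rewrite is implemented: the function `rewrite` and the sample usernames are hypothetical, Python writes the backreference as \1 where mod_rewrite uses $1, and the URL is assumed to reach the rule without a leading slash, as it does in a .htaccess file.

```python
import re

# Same pattern as in the RewriteRule above.
RULE_PATTERN = re.compile(r"^userinfo-([a-zA-Z0-9_-]*)\.html")

# Same substitution, with mod_rewrite's $1 written as Python's \1.
RULE_TARGET = r"modules.php?name=Your_Account&op=userinfo&username=\1"

def rewrite(url_path):
    """Return the translated URL, or the URL unchanged if the rule does not match."""
    match = RULE_PATTERN.search(url_path)
    if match is None:
        return url_path
    # expand() fills \1 with the text captured by the first group.
    return match.expand(RULE_TARGET)

print(rewrite("userinfo-chris_99.html"))
print(rewrite("index.html"))
```

Because the pattern has no dollar sign at the end, anything that merely starts with a matching "userinfo-....html" prefix is translated too; adding the end anchor would restrict the rule to exact matches.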
[1] The regular expression matches the HTML code for the text shown in Figure 25-4, where the capital letters A and C were enclosed in <i> tags. This makes it look more formidable than it actually is.