Regular expression
From DocForge
Regular expressions are not a complete programming language; they are a compact notation for describing string patterns. They are used in Perl, PHP, JavaScript, and other languages to search for and replace text.
Implementations
Syntax
The details below apply to Perl and Perl-like implementations of regular expressions, including PHP's PCRE library.
Regular expressions are written as search patterns wrapped by a delimiter:
<delimiter><search pattern><delimiter><modifier>
The delimiter is typically a forward slash ("/"), pipe ("|"), or hash ("#"), but some implementations support other characters. The search pattern is written by combining characters, character types, meta characters, quantifiers, escape sequences, and assertions.
The search pattern can contain character classes which define sets of characters for comparison. Character classes are contained within square brackets ("[" and "]"). They consist of individual characters or ranges. For example, [abcde] and [a-e] will both match the character "c" within a search pattern. A character class by itself refers to one individual match, but can be followed by a quantifier to match more.
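As a minimal sketch of the character classes described above, the following uses Python's `re` module, which is assumed here to behave like the Perl-style syntax for this feature. The sample text and variable names are illustrative only.

```python
import re

text = "a cab, a deck"

# [abcde] and [a-e] describe the same set, so they find the same characters.
explicit = re.findall(r"[abcde]", text)
ranged = re.findall(r"[a-e]", text)

# A class on its own matches one character; a quantifier extends it to a run.
runs = re.findall(r"[a-e]+", text)

print(explicit == ranged)  # True
print(runs)                # ['a', 'cab', 'a', 'dec']
```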
Meta Characters
Outside of a character class (square brackets):
\ escape character
^ assert start of subject (or line in multi-line mode)
$ assert end of subject (or line in multi-line mode)
. match any character except newline (by default)
[ start character class definition
] end character class definition
| start of alternative branch
( start subpattern
) end subpattern
Inside of a character class only the following apply:
\ escape character
^ when the first character, negate the class
- indicates character range
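A few of these meta characters can be tried out as follows. This is a sketch in Python's `re` module, assumed to treat these characters the same way as the Perl-like implementations described here; the sample strings are made up for illustration.

```python
import re

# Outside a class, "." matches any character; "\." matches a literal dot.
any_char = re.findall(r"a.c", "abc a.c")
literal_dot = re.findall(r"a\.c", "abc a.c")

# "|" separates alternatives and "(...)" groups them into a subpattern.
# With one subpattern, findall returns what the subpattern captured.
animals = re.findall(r"(cat|dog)s", "cats and dogs")

# Inside a class, a leading "^" negates the class and "-" forms a range.
non_digits = re.findall(r"[^0-9]+", "ab12cd")

print(any_char)     # ['abc', 'a.c']
print(literal_dot)  # ['a.c']
print(animals)      # ['cat', 'dog']
print(non_digits)   # ['ab', 'cd']
```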
Quantifiers
A quantifier makes the preceding item match a given number of times. Quantifiers are greedy by default, matching as much text as possible. Appending ? makes them non-greedy, so they match as little as possible.
* Zero or more
? Zero or one time (i.e. makes the item optional)
+ One or more
{n} Exactly n times
{n,m} Between n and m times
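The difference between greedy and non-greedy matching is easiest to see side by side. The sketch below uses Python's `re` module, which is assumed to follow the same quantifier rules; the sample strings are illustrative.

```python
import re

html = "<b>one</b> <b>two</b>"

# Greedy: ".*" runs to the last "</b>", giving one long match.
greedy = re.findall(r"<b>.*</b>", html)

# Non-greedy: ".*?" stops at the first "</b>", giving two short matches.
lazy = re.findall(r"<b>.*?</b>", html)

# "?" makes the preceding item optional.
optional = re.findall(r"colou?r", "color colour")

# {n} and {n,m} fix the repetition count; fullmatch tests the whole string.
candidates = ["a", "aa", "aaa", "aaaa"]
exactly_two = [s for s in candidates if re.fullmatch(r"a{2}", s)]
two_to_three = [s for s in candidates if re.fullmatch(r"a{2,3}", s)]

print(greedy)        # ['<b>one</b> <b>two</b>']
print(lazy)          # ['<b>one</b>', '<b>two</b>']
print(optional)      # ['color', 'colour']
print(exactly_two)   # ['aa']
print(two_to_three)  # ['aa', 'aaa']
```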
Escape Sequences
\cx control-x (x is any character)
\e escape (hex 1B)
\f formfeed (hex 0C)
\n newline (hex 0A)
\r carriage return (hex 0D)
\t tab (hex 09)
\xhh character with hex code hh
\ddd character with octal code ddd; also backreference
Character Types
\d any decimal digit
\D any character that is not a decimal digit
\s any whitespace character
\S any character that is not a whitespace character
\w any word character
\W any non-word character
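The character types can be exercised with a short sketch in Python's `re` module. One assumption to keep in mind: for text strings Python applies these types to Unicode characters by default, which can differ from a PCRE build without Unicode support. The sample strings are illustrative.

```python
import re

# \d+ picks out runs of decimal digits.
digits = re.findall(r"\d+", "Order 66, aisle 7")

# \w covers letters, digits and the underscore, but not the hyphen.
words = re.findall(r"\w+", "snake_case, kebab-case")

# \s+ matches any run of whitespace, so it works as a field separator.
fields = re.split(r"\s+", "a  b\tc\nd")

# The upper-case forms match the complement of the lower-case ones.
non_digits = re.findall(r"\D+", "12ab34")

print(digits)      # ['66', '7']
print(words)       # ['snake_case', 'kebab', 'case']
print(fields)      # ['a', 'b', 'c', 'd']
print(non_digits)  # ['ab']
```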
Assertions
^ start of subject or line
$ end of subject or line
\b word boundary
\B not a word boundary
\A start of subject (independent of multiline mode)
\Z end of subject or newline at end (independent of multiline mode)
\z end of subject (independent of multiline mode)
\G first matching position in subject
Subpatterns starting with a question mark have special meaning. (?<! ... ) is a negative look-behind, zero-width assertion: the rest of the expression will not match at a given point if the text immediately before that point matches the contained expression. Replacing the exclamation mark with an equals sign, (?<= ... ), makes it a positive look-behind, which requires the preceding text to match.
Zero-width assertions consume no characters. The text they test is examined during matching but is not included in the final match.
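The following sketch shows a word boundary and both look-behind forms in Python's `re` module, which is assumed to share this syntax with Perl and PCRE (it does not support \G, and its \Z behaves like \z above). The price text is an invented example.

```python
import re

# \b anchors at word boundaries, so "cat" inside "concatenate" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concatenate cat.")

prices = "$100 and 200"

# Positive look-behind: digits that ARE preceded by "$".
# The "$" is tested but is not part of the match.
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative look-behind: digits that are NOT preceded by "$".
# The \b stops the engine from matching the tail "00" of "$100".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", prices)

print(whole_words)       # ['cat', 'cat']
print(after_dollar)      # ['100']
print(not_after_dollar)  # ['200']
```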
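Modifiers
Some implementations accept modifiers after the closing delimiter of the regular expression.
i case-insensitive
g global; find every match instead of stopping at the first (Perl and JavaScript; PHP uses preg_match_all() instead)
m multiline; ^ and $ also match at the start and end of each line
s dot metacharacter matches all characters including newline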
For example, /abc/i would match both "abc" and "ABC".
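Python has no delimiter-and-suffix syntax, so the sketch below assumes its nearest equivalents: the flags `re.I`, `re.M` and `re.S` for i, m and s, and `re.findall` for the effect of g. The sample text is illustrative.

```python
import re

text = "abc\nABC"

# i: without the flag only the lower-case line matches.
case_sensitive = re.findall(r"abc", text)
case_insensitive = re.findall(r"abc", text, re.I)

# m: "^" matches at the start of every line, not just the subject.
first_line_only = re.findall(r"^\w+", text)
every_line = re.findall(r"^\w+", text, re.M)

# s: "." normally stops at a newline; with the flag it matches it.
dot_default = re.search(r"abc.ABC", text) is not None
dot_all = re.search(r"abc.ABC", text, re.S) is not None

print(case_sensitive, case_insensitive)  # ['abc'] ['abc', 'ABC']
print(first_line_only, every_line)       # ['abc'] ['abc', 'ABC']
print(dot_default, dot_all)              # False True
```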
Examples
E-Mail Addresses
This pattern matches e-mail addresses in a form close to the RFC definitions. It will not strictly validate an address, but it is close enough for approximate validation and for picking addresses out of text. Note that the {2,4} quantifier at the end does not cover top-level domains longer than four letters. If the pattern is placed inside a quoted string, as in PHP, the quotation mark it contains may need to be escaped.
/[a-z0-9.!#$%&'*+\-\/=?\^_`{|}~]+@[a-z0-9.-]+\.[a-z]{2,4}/i
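As a rough illustration, the same pattern can be used in Python to pull addresses out of a sentence. The delimiters and the i modifier are replaced by a raw string and `re.IGNORECASE`; the addresses are invented for the example.

```python
import re

# The pattern from above, without delimiters; the "i" modifier becomes a flag.
EMAIL = re.compile(
    r"[a-z0-9.!#$%&'*+\-\/=?\^_`{|}~]+@[a-z0-9.-]+\.[a-z]{2,4}",
    re.IGNORECASE,
)

text = "Contact alice@example.com or bob.smith+tag@mail.example.org today."
found = EMAIL.findall(text)

# Because of {2,4}, a longer top-level domain is cut off after four letters.
truncated = EMAIL.findall("carol@example.museum")

print(found)      # ['alice@example.com', 'bob.smith+tag@mail.example.org']
print(truncated)  # ['carol@example.muse']
```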
Hex Color
Six hexadecimal characters, without the leading "#" (useful for CSS color validation)
/^[0-9a-f]{6}$/i
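A small validation helper built on this pattern might look like the sketch below. The function name `is_hex_color` is hypothetical, and Python's `re` module is assumed to treat the anchors the same way for single-line input.

```python
import re

# Same pattern as above; the "i" modifier becomes re.IGNORECASE.
HEX_COLOR = re.compile(r"^[0-9a-f]{6}$", re.IGNORECASE)

def is_hex_color(value: str) -> bool:
    """Return True if value is exactly six hexadecimal characters."""
    return HEX_COLOR.search(value) is not None

print(is_hex_color("FFaa00"))   # True
print(is_hex_color("fff"))      # False: too short
print(is_hex_color("#ffaa00"))  # False: the pattern does not allow "#"
print(is_hex_color("ggaa00"))   # False: "g" is not a hex digit
```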
Replace older-style PHP code
while(list($key, $value) = each($array))
with
foreach ($array as $key => $value)
by using this command (the -r flag selects extended regular expressions in GNU sed):
$ sed -r 's/while\(list\(([^,]*), *([^)]*)\) = each\(([^)]*)\)\)/foreach (\3 as \1 => \2)/g' file.php
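The same substitution can be sketched in Python to make the roles of the three captured groups easier to see. This is an illustrative equivalent rather than the command itself; it allows optional whitespace after the comma, and the sample line is taken from the example above.

```python
import re

# Group 1: key variable, group 2: value variable, group 3: array variable.
OLD_LOOP = re.compile(
    r"while\(list\(([^,]*),\s*([^)]*)\) = each\(([^)]*)\)\)"
)

old_code = "while(list($key, $value) = each($array))"

# \3, \1 and \2 in the replacement refer back to the captured groups.
new_code = OLD_LOOP.sub(r"foreach (\3 as \1 => \2)", old_code)

print(new_code)  # foreach ($array as $key => $value)
```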
HTML
The function below returns all instances of one tag, or all top-level tags, from a string of HTML. It is written in PHP.
The weakness of regular expressions for parsing HTML (and XML) is that they cannot elegantly handle nested elements. They can easily pick out top-level tags, but catching tags at every level calls for an XML parser.
function get_tags($html, $tag = '') {
    if ($tag) {
        // Escape the tag name so it is matched literally
        $tag = preg_quote($tag, '/');
        // Paired tags, then self-closing tags
        preg_match_all("/<(". $tag .")([^>]*)>(.*)<\/". $tag .">/is", $html, $matches1, PREG_SET_ORDER);
        preg_match_all("/<(". $tag .")([^>]*)\/(\s*)>/is", $html, $matches2, PREG_SET_ORDER);
    } else {
        // NOTE: The first regex will only return top level tags; nested tags may come later
        // "\\1" is needed because "\1" in a double-quoted string is an octal escape
        preg_match_all("/<([\w]+)([^>]*)>(.*)<\/\\1>/is", $html, $matches1, PREG_SET_ORDER);
        preg_match_all("/<([\w]+)([^>]*)\/(\s*)>/is", $html, $matches2, PREG_SET_ORDER);
    }
    return array_merge($matches1, $matches2);
}
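For comparison, here is a minimal sketch of the parser-based approach mentioned above, which reaches tags at every nesting level. It uses Python's standard `html.parser` module rather than an XML parser, and the class name `TagCollector` is hypothetical. It assumes well-formed input in which every non-self-closing tag has an end tag.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Records every start tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.tags = []  # list of (depth, tag name, attribute dict)

    def handle_starttag(self, tag, attrs):
        self.tags.append((self.depth, tag, dict(attrs)))
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> do not change the depth.
        self.tags.append((self.depth, tag, dict(attrs)))

collector = TagCollector()
collector.feed('<div id="a"><p>one <b>two</b></p><br/></div>')

for depth, tag, attrs in collector.tags:
    print("  " * depth + tag, attrs)
```

Entries with depth 0 correspond to the top-level tags that the regular expressions can find; deeper entries are the nested tags they miss.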

