

8.4.1.4 Syntax of Regular Expressions

Regular expressions have a syntax in which a few characters are “special constructs” and the rest are “ordinary”.

Special Characters

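The “special characters” are ‘.’, ‘*’, ‘+’, ‘?’, ‘[’, ‘^’, ‘$’, and ‘\’. In addition, ‘]’ is special if it ends a character alternative, and ‘-’ is special inside a character alternative.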

Any other character appearing in a regular expression is ordinary, unless a ‘\’ precedes it.

Things to note:


^

For historical compatibility reasons, ‘^’ can be used only at the beginning of the regular expression, or after ‘\(’, ‘\(?:’ or ‘\|’.

$

For historical compatibility reasons, ‘$’ can be used only at the end of the regular expression, or before ‘\)’ or ‘\|’.
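The context rule for ‘^’ and ‘$’ can be sketched with the standard `string-match` function. The sample strings below are illustrative assumptions, not taken from the manual.

```elisp
;; At the beginning of the regexp, ‘^’ is an anchor.
(string-match "^b" "ab")                 ; => nil
(string-match "^b" "a\nb")               ; => 2 (just after the newline)

;; In the middle of the regexp, ‘^’ and ‘$’ are ordinary characters.
(string-match "a^b" "a^b")               ; => 0
(string-match "a$b" "a$b")               ; => 0

;; At the end of the regexp, ‘$’ is an anchor.
(string-match "a$" "ba")                 ; => 1

;; After ‘\|’ (or ‘\(’, ‘\(?:’), ‘^’ is an anchor again.
(string-match "\\(?:x\\|^b\\)c" "bc")    ; => 0
```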

\

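‘\’ also has special meaning in the read syntax of Lisp strings, so it must itself be quoted with ‘\’ there. The left-hand side below is what you write inside a Lisp string; the right-hand side is the regular expression that results.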

\\   => \
\\\\ => \\
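As a small sketch of the double quoting, the following forms show how many backslashes the Lisp reader leaves in a string and what the resulting regexp matches. The test strings are illustrative assumptions; `regexp-quote` is the standard function for quoting special characters automatically.

```elisp
;; The reader turns each "\\" in the source into one backslash.
(length "\\")                       ; => 1
(length "\\\\")                     ; => 2

;; A regexp that matches one literal backslash is ‘\\’ (two characters),
;; which must be written with four backslashes in a Lisp string.
(string-match "\\\\" "a\\b")        ; => 1

;; `regexp-quote' adds the regexp-level quoting for you.
(regexp-quote "\\")                 ; => "\\\\"
(string-match (regexp-quote "a.b") "axb")   ; => nil
(string-match (regexp-quote "a.b") "a.b")   ; => 0
```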

Character Alternatives

‘[ ... ]’ is a character alternative, which matches any one of the characters listed between the brackets. Examples:

[ad]
[a-z]
[a-z$%.]
[]a-z]

To include a ‘]’ in a character alternative, you must make it the first character.

[]a-z-]

To include a ‘-’, write ‘-’ as the first or last character of the character alternative, or as the upper bound of a range.

^

To include ‘^’ in a character alternative, put it anywhere but at the beginning.
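The placement rules for ‘]’, ‘-’ and ‘^’ can be checked with a few `string-match` calls. This is a minimal sketch; the subject strings are assumptions chosen for illustration.

```elisp
;; ‘]’ placed first is a member of the alternative.
(string-match "[]a-z]" "]")      ; => 0

;; ‘-’ placed last is a member; the ‘a-z’ in the middle is still a range.
(string-match "[]a-z-]" "-")     ; => 0
(string-match "[]a-z-]" "q")     ; => 0
(string-match "[]a-z-]" "5")     ; => nil

;; ‘^’ placed anywhere but first is a member.
(string-match "[a-z^]" "^")      ; => 0
```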

[^…]

‘[^’ begins a “complemented character alternative”. This matches any character except the ones specified. ‘^’ is not special in a character alternative unless it is the first character. A complemented character alternative can match a newline, unless newline is mentioned as one of the characters not to match. This is in contrast to the handling of regexps in programs such as ‘grep’.
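The newline behavior of a complemented character alternative can be sketched as follows. The subject string is an illustrative assumption.

```elisp
;; The newline at index 3 is "not a lower-case letter", so it matches.
(string-match "[^a-z]" "abc\ndef")      ; => 3

;; Listing the newline among the excluded characters prevents that.
(string-match "[^a-z\n]" "abc\ndef")    ; => nil

;; A ‘^’ that is not first is just a member of the set being excluded.
(string-match "[^^]" "^a")              ; => 1
```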

Rules Regarding ]

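The exact rules are that:

  1. At the beginning of a regexp, ‘[’ is special and ‘]’ is not.
  2. This lasts until the first unquoted ‘[’, after which we are in a character alternative: ‘[’ is no longer special (except when it starts a character class), but ‘]’ is special, unless it immediately follows the special ‘[’ or that ‘[’ followed by a ‘^’.
  3. This lasts until the next special ‘]’ that does not end a character class. That ‘]’ ends the character alternative and restores the ordinary syntax of regular expressions: an unquoted ‘[’ is special again and a ‘]’ is not.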

POSIX Features

The following aspects of ranges are specific to Emacs, in that POSIX allows but does not require this behavior and programs other than Emacs may behave differently:

  1. If case-fold-search is non-‘nil’, ‘[a-z]’ also matches upper-case letters.
  2. A range is not affected by the locale’s collation sequence: it always represents the set of characters with codepoints ranging between those of its bounds, so that ‘[a-z]’ matches only ASCII letters, even outside the C or POSIX locale.
  3. As a special case, if either bound of a range is a raw 8-bit byte, the other bound should be a unibyte character, and the range matches only unibyte characters.
  4. If the lower bound of a range is greater than its upper bound, the range is empty and represents no characters. Thus, ‘[b-a]’ always fails to match, and ‘[^b-a]’ matches any character, including newline. However, the lower bound should be at most one greater than the upper bound; for example, ‘[c-a]’ should be avoided.

A character alternative can also specify named character classes. This is a POSIX feature. Using a character class is equivalent to mentioning each of the characters in that class; but the latter is not feasible in practice, since some classes include thousands of different characters. A character class should not appear as the lower or upper bound of a range.
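The range behaviors described above, and the use of a named character class, can be sketched like this. `case-fold-search` is bound explicitly so the results do not depend on the user's settings; the subject strings and the choice of the ‘[:digit:]’ class are illustrative assumptions.

```elisp
;; Case folding extends ‘[a-z]’ to upper-case letters.
(let ((case-fold-search t))
  (string-match "[a-z]" "Q"))            ; => 0
(let ((case-fold-search nil))
  (string-match "[a-z]" "Q"))            ; => nil

;; Ranges go by codepoint, so ‘[a-z]’ does not cover accented letters.
(let ((case-fold-search nil))
  (string-match "[a-z]" "é"))            ; => nil

;; A reversed range is empty; its complement matches anything.
(string-match "[b-a]" "ab")              ; => nil
(string-match "[^b-a]" "\n")             ; => 0

;; A named character class inside a character alternative.
(string-match "[[:digit:]]+" "abc123")   ; => 3
(match-end 0)                            ; => 6
```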
