Syntax of Regular Expressions (Mastering Emacs)

8.4.1.4 Syntax of Regular Expressions

Regular expressions have a syntax in which a few characters are “special constructs” and the rest are “ordinary”.

Special Characters

The “special characters” are:

.
*
+
?
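[ and ] (‘]’ is special only when it ends a character alternative)
[: … :] (delimits a character class inside a character alternative)
^
$
\
- (special only inside a character alternative)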

Any other character appearing in a regular expression is ordinary, unless a ‘\’ precedes it.
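
The difference between special and ordinary characters can be checked directly. The following is a minimal sketch, assuming the forms are evaluated in a recent Emacs (for example with M-: or in the *scratch* buffer); the sample strings are illustrative and not taken from the text above.

```elisp
;; `.' is special: it matches any single character except newline.
(string-match-p "a.c" "abc")      ; => 0

;; Preceded by `\', `.' becomes ordinary and matches only a literal dot.
(string-match-p "a\\.c" "abc")    ; => nil
(string-match-p "a\\.c" "a.c")    ; => 0

;; `(' and `)' are ordinary by themselves ...
(string-match-p "(a)" "x(a)")     ; => 1

;; ... but become a grouping construct when `\' precedes them.
(let ((s "xab"))
  (when (string-match "\\(a\\)b" s)
    (match-string 1 s)))          ; => "a"

;; `regexp-quote' adds the backslashes needed to match a string literally.
(regexp-quote "1+1=2?")           ; => "1\\+1=2\\?"
```

The doubled backslashes come from Lisp string syntax, not from the regexp itself.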

Things to note:

^: For historical compatibility reasons, ‘^’ is special only at the beginning of the regular expression, or after ‘\(’, ‘\(?:’ or ‘\|’; anywhere else it matches a literal ‘^’.
$: For historical compatibility reasons, ‘$’ is special only at the end of the regular expression, or before ‘\)’ or ‘\|’; anywhere else it matches a literal ‘$’.
\: ‘\’ also has special meaning in the read syntax of Lisp strings, so each ‘\’ in a regexp must be written as ‘\\’ inside a Lisp string.

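"\\"   => \    (Lisp string as typed => regexp the matcher sees)
"\\\\" => \\   (a regexp that matches one literal backslash)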
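
The notes on ‘^’, ‘$’ and ‘\’ can be tried out as follows. This is a small sketch, assuming a recent Emacs; the test strings (such as "C:\\temp") are made up for illustration.

```elisp
;; `^' and `$' act as anchors only in the positions listed above.
(string-match-p "^foo" "foo bar")                ; => 0
(string-match-p "\\(?:^foo\\|bar$\\)" "x bar")   ; => 2

;; Anywhere else they are ordinary characters.
(string-match-p "a^b" "a^b")                     ; => 0
(string-match-p "a$b" "a$b")                     ; => 0

;; Lisp read syntax halves the backslashes before the regexp is compiled.
(length "\\")                                    ; => 1
(length "\\\\")                                  ; => 2

;; Four backslashes in the source give the regexp `\\', a literal backslash.
(string-match-p "\\\\" "C:\\temp")               ; => 2

;; A regexp consisting of one lone backslash is incomplete and is rejected.
(condition-case nil
    (string-match-p "\\" "a")
  (invalid-regexp 'invalid))                     ; => invalid
```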

Character Alternatives

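‘[ ... ]’ is a character alternative: it matches any single character from the set written between the brackets.

[ad]: matches either ‘a’ or ‘d’.
[a-z]: matches any lower-case ASCII letter.
[a-z$%.]: matches any lower-case ASCII letter or ‘$’, ‘%’ or period.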
[]a-z]: To include a ‘]’ in a character alternative, you must make it the first character.
[]a-z-]: To include a ‘-’, write ‘-’ as the first or last character of the character alternative, or as the upper bound of a range.
^: To include ‘^’ in a character alternative, put it anywhere but at the beginning.
[^…]: ‘[^’ begins a “complemented character alternative”. This matches any character except the ones specified. ‘^’ is not special in a character alternative unless it is the first character. A complemented character alternative can match a newline, unless newline is mentioned as one of the characters not to match. This is in contrast to the handling of regexps in programs such as ‘grep’.
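
The character alternatives listed above can be exercised with ‘string-match-p’. This is a minimal sketch, assuming a recent Emacs; the subject strings are invented for the example.

```elisp
;; Simple sets and ranges.
(string-match-p "[ad]" "bad")          ; => 1
(string-match-p "[a-z$%.]" "100%")     ; => 3

;; `]' placed first is a member of the set, not the terminator.
(string-match-p "[]a-z]" "]")          ; => 0

;; `-' placed last is a member of the set, not a range operator.
(string-match-p "[]a-z-]" "-")         ; => 0

;; `^' anywhere but first is an ordinary member of the set.
(string-match-p "[a^]" "^")            ; => 0

;; A complemented alternative matches newline unless newline is listed.
(string-match-p "[^a-z]" "abc\n")      ; => 3
(string-match-p "[^a-z\n]" "abc\n")    ; => nil
```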

Rules Regarding ]

The exact rules are:

At the beginning of a regexp, ‘[’ is special and ‘]’ is not.
This lasts until the first unquoted ‘[’,
- after which we are in a character alternative;
- ‘[’ is no longer special
  - (except when it starts a character class)
- but ‘]’ is special,
  - unless it immediately follows the special ‘[’ or
  - that ‘[’ followed by a ‘^’.
- This lasts until the next special ‘]’ that does not end a character class.
This ends the character alternative and restores the ordinary syntax of regular expressions:
an unquoted ‘[’ is special again and ‘]’ is not.
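
The bracket rules above can be traced with a few small patterns. This is an illustrative sketch, assuming a recent Emacs; each pattern was chosen to isolate one rule and none comes from the original text.

```elisp
;; Outside a character alternative, `]' is ordinary.
(string-match-p "]" "a]")              ; => 1

;; Inside an alternative, `[' is ordinary unless it starts a class.
(string-match-p "[[]" "a[")            ; => 1

;; `]' right after the opening `[' (or after `[^') is a set member.
(string-match-p "[]]" "a]")            ; => 1
(string-match-p "[^]]" "]]x")          ; => 2

;; The `]' in `:]' ends the class; only the final `]' ends the alternative.
(string-match-p "[[:space:]x]" "abx")  ; => 2

;; Once the alternative is closed, a following `]' is ordinary again.
(string-match-p "[a]]" "a]")           ; => 0
```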

POSIX Features

The following aspects of ranges are specific to Emacs, in that POSIX allows but does not require this behavior and programs other than Emacs may behave differently:

If case-fold-search is non-‘nil’, ‘[a-z]’ also matches upper-case letters.
A range is not affected by the locale’s collation sequence: it always represents the set of characters with codepoints ranging between those of its bounds, so that ‘[a-z]’ matches only ASCII letters, even outside the C or POSIX locale.
As a special case, if either bound of a range is a raw 8-bit byte, the other bound should be a unibyte character, and the range matches only unibyte characters.
If the lower bound of a range is greater than its upper bound, the range is empty and represents no characters. Thus, ‘[b-a]’ always fails to match, and ‘[^b-a]’ matches any character, including newline. However, the lower bound should be at most one greater than the upper bound; for example, ‘[c-a]’ should be avoided.
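
The range behavior described in this list can be observed as follows. This is a minimal sketch, assuming a recent Emacs; ‘case-fold-search’ is bound explicitly so the results do not depend on the user's settings, and "\u00e9" (the letter é) is an arbitrary non-ASCII example.

```elisp
;; With case folding on, `[a-z]' also matches upper-case letters.
(let ((case-fold-search t))
  (string-match-p "[a-z]" "ABC"))      ; => 0
(let ((case-fold-search nil))
  (string-match-p "[a-z]" "ABC"))      ; => nil

;; Ranges go by codepoint, so `[a-z]' does not match a non-ASCII letter.
(let ((case-fold-search nil))
  (string-match-p "[a-z]" "\u00e9"))   ; => nil

;; A reversed range is empty: it never matches ...
(string-match-p "[b-a]" "ab")          ; => nil

;; ... and its complement matches any character, including newline.
(string-match-p "[^b-a]" "\n")         ; => 0
```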
A character alternative can also specify named character classes. This is a POSIX feature. Using a character class is equivalent to mentioning each of the characters in that class; but the latter is not feasible in practice, since some classes include thousands of different characters. A character class should not appear as the lower or upper bound of a range.
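
Named character classes are written inside the brackets of a character alternative. The following is a small sketch, assuming a recent Emacs and the standard syntax table (which ‘[:space:]’ consults); the class names are standard, while the subject strings are illustrative only.

```elisp
;; A class on its own inside an alternative.
(string-match-p "[[:digit:]]" "abc123")    ; => 3

;; A class can be combined with ordinary characters.
(string-match-p "[[:alpha:]_]" "42_x")     ; => 2

;; A class can be complemented like any other set member.
(string-match-p "[^[:space:]]" "  x")      ; => 2

;; Unlike the range `[a-z]', `[:alpha:]' also covers non-ASCII letters.
(string-match-p "[[:alpha:]]" "\u00e9")    ; => 0
```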