| <html> |
| <head> |
| <title>pcre2syntax specification</title> |
| </head> |
| <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB"> |
| <h1>pcre2syntax man page</h1> |
| <p> |
| Return to the <a href="index.html">PCRE2 index page</a>. |
| </p> |
| <p> |
| This page is part of the PCRE2 HTML documentation. It was generated |
| automatically from the original man page. If there is any nonsense in it, |
| please consult the man page, in case the conversion went wrong. |
| <br> |
| <ul> |
| <li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a> |
| <li><a name="TOC2" href="#SEC2">QUOTING</a> |
| <li><a name="TOC3" href="#SEC3">BRACED ITEMS</a> |
| <li><a name="TOC4" href="#SEC4">ESCAPED CHARACTERS</a> |
| <li><a name="TOC5" href="#SEC5">CHARACTER TYPES</a> |
| <li><a name="TOC6" href="#SEC6">GENERAL CATEGORY PROPERTIES FOR \p and \P</a> |
| <li><a name="TOC7" href="#SEC7">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a> |
| <li><a name="TOC8" href="#SEC8">BINARY PROPERTIES FOR \p AND \P</a> |
| <li><a name="TOC9" href="#SEC9">SCRIPT MATCHING WITH \p AND \P</a> |
| <li><a name="TOC10" href="#SEC10">THE BIDI_CLASS PROPERTY FOR \p AND \P</a> |
| <li><a name="TOC11" href="#SEC11">CHARACTER CLASSES</a> |
| <li><a name="TOC12" href="#SEC12">PERL EXTENDED CHARACTER CLASSES</a> |
| <li><a name="TOC13" href="#SEC13">QUANTIFIERS</a> |
| <li><a name="TOC14" href="#SEC14">ANCHORS AND SIMPLE ASSERTIONS</a> |
| <li><a name="TOC15" href="#SEC15">REPORTED MATCH POINT SETTING</a> |
| <li><a name="TOC16" href="#SEC16">ALTERNATION</a> |
| <li><a name="TOC17" href="#SEC17">CAPTURING</a> |
| <li><a name="TOC18" href="#SEC18">ATOMIC GROUPS</a> |
| <li><a name="TOC19" href="#SEC19">COMMENT</a> |
| <li><a name="TOC20" href="#SEC20">OPTION SETTING</a> |
| <li><a name="TOC21" href="#SEC21">NEWLINE CONVENTION</a> |
| <li><a name="TOC22" href="#SEC22">WHAT \R MATCHES</a> |
| <li><a name="TOC23" href="#SEC23">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a> |
| <li><a name="TOC24" href="#SEC24">NON-ATOMIC LOOKAROUND ASSERTIONS</a> |
| <li><a name="TOC25" href="#SEC25">SUBSTRING SCAN ASSERTION</a> |
| <li><a name="TOC26" href="#SEC26">SCRIPT RUNS</a> |
| <li><a name="TOC27" href="#SEC27">BACKREFERENCES</a> |
| <li><a name="TOC28" href="#SEC28">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a> |
| <li><a name="TOC29" href="#SEC29">CONDITIONAL PATTERNS</a> |
| <li><a name="TOC30" href="#SEC30">BACKTRACKING CONTROL</a> |
| <li><a name="TOC31" href="#SEC31">CALLOUTS</a> |
| <li><a name="TOC32" href="#SEC32">REPLACEMENT STRINGS</a> |
| <li><a name="TOC33" href="#SEC33">SEE ALSO</a> |
| <li><a name="TOC34" href="#SEC34">AUTHOR</a> |
| <li><a name="TOC35" href="#SEC35">REVISION</a> |
| </ul> |
| <h2><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a></h2> |
| <p> |
| The full syntax and semantics of the regular expression patterns that are |
| supported by PCRE2 are described in the |
| <a href="pcre2pattern.html"><b>pcre2pattern</b></a> |
| documentation. This document contains a quick-reference summary of the pattern |
| syntax, followed by the syntax of replacement strings used by the substitution function. |
| The full description of the latter is in the |
| <a href="pcre2api.html"><b>pcre2api</b></a> |
| documentation. |
| </p> |
| <h2><a name="SEC2" href="#TOC1">QUOTING</a></h2> |
| <p> |
| <pre> |
| \x where x is non-alphanumeric is a literal x |
| \Q...\E treat enclosed characters as literal |
| </pre> |
| Note that white space inside \Q...\E is always treated as literal, even if |
| PCRE2_EXTENDED is set, causing most other white space to be ignored. Note also |
| that PCRE2's handling of \Q...\E has some differences from Perl's. See the |
| <a href="pcre2pattern.html"><b>pcre2pattern</b></a> |
| documentation for details. |
| </p> |
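| <p> |
| As a rough illustration of quoting, the sketch below uses Python's <b>re</b> module as |
| a stand-in. This is an assumption made for the example: Python's engine is not PCRE2 and |
| has no \Q...\E, so <b>re.escape()</b> is used to get the same effect of treating every |
| character as a literal. |
| </p> |

```python
import re

# Assumption: Python's `re` stands in for PCRE2 here. It has no \Q...\E,
# so re.escape() is used to quote a whole literal string instead.
literal = "1+1=2?"
quoted = re.escape(literal)  # puts a backslash before each regex metacharacter

# Unquoted, '+' and '?' act as quantifiers, so the text does not match itself.
unquoted_match = re.fullmatch(literal, literal)

# Quoted, every character is taken literally.
quoted_match = re.fullmatch(quoted, literal)
```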
| <h2><a name="SEC3" href="#TOC1">BRACED ITEMS</a></h2> |
| <p> |
| With one exception, wherever brace characters { and } are required to enclose |
| data for constructions such as \g{2} or \k{name}, space and/or horizontal tab |
| characters that follow { or precede } are allowed and are ignored. In the case |
| of quantifiers, they may also appear before or after the comma. The exception |
| is \u{...} which is not Perl-compatible and is recognized only when |
| PCRE2_EXTRA_ALT_BSUX is set. This is an ECMAScript compatibility feature, and |
| follows ECMAScript's behaviour. |
| </p> |
| <h2><a name="SEC4" href="#TOC1">ESCAPED CHARACTERS</a></h2> |
| <p> |
| This table applies to ASCII and Unicode environments. An unrecognized escape |
| sequence causes an error. |
| <pre> |
| \a alarm, that is, the BEL character (hex 07) |
| \cx "control-x", where x is a non-control ASCII character |
| \e escape (hex 1B) |
| \f form feed (hex 0C) |
| \n newline (hex 0A) |
| \r carriage return (hex 0D) |
| \t tab (hex 09) |
| \0dd character with octal code 0dd |
| \ddd character with octal code ddd, or backreference |
| \o{ddd..} character with octal code ddd.. |
| \N{U+hh..} character with Unicode code point hh.. (Unicode mode only) |
| \xhh character with hex code hh |
| \x{hh..} character with hex code hh.. |
| </pre> |
| \N{U+hh..} is synonymous with \x{hh..} but is not supported in environments |
| that use EBCDIC code (mainly IBM mainframes). Note that \N not followed by an |
| opening curly bracket has a different meaning (see below). |
| </p> |
| <p> |
| If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the |
| following are also recognized: |
| <pre> |
| \U the character "U" |
| \uhhhh character with hex code hhhh |
| \u{hh..} character with hex code hh.. but only for EXTRA_ALT_BSUX |
| </pre> |
| When \x is not followed by {, one or two hexadecimal digits are read, |
| but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be |
| recognized as a hexadecimal escape; otherwise it matches a literal "x". |
| Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits |
| or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in curly brackets, it |
| matches a literal "u". |
| </p> |
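| <p> |
| The rule for \x without a following brace can be sketched as a small decision function. |
| The function name and its return convention are hypothetical, chosen only for this |
| illustration. The case of \x followed by no hexadecimal digit outside ALT_BSUX mode is |
| not described in this summary, so the sketch deliberately does not handle it. |
| </p> |

```python
import string

HEX_DIGITS = set(string.hexdigits)

def read_backslash_x(rest, alt_bsux=False):
    """Interpret the text that follows a backslash-x escape (no brace form).

    Hypothetical helper: returns (character, number_of_digits_consumed).
    """
    # Collect at most two leading hexadecimal digits.
    digits = ""
    for ch in rest[:2]:
        if ch not in HEX_DIGITS:
            break
        digits += ch

    if alt_bsux:
        # ALT_BSUX mode: exactly two digits are required, else a literal "x".
        if len(digits) == 2:
            return chr(int(digits, 16)), 2
        return "x", 0

    # Default mode: one or two digits are read.
    if not digits:
        raise NotImplementedError("case not covered by this summary")
    return chr(int(digits, 16)), len(digits)
```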
| <p> |
| Note that \0dd is always an octal code. The treatment of backslash followed by |
| a non-zero digit is complicated; for details see the section |
| <a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a> |
| in the |
| <a href="pcre2pattern.html"><b>pcre2pattern</b></a> |
| documentation, where details of escape processing in EBCDIC environments are |
| also given. |
| </p> |
| <h2><a name="SEC5" href="#TOC1">CHARACTER TYPES</a></h2> |
| <p> |
| <pre> |
| . any character except newline; |
| in dotall mode, any character whatsoever |
| \C one code unit, even in UTF mode (best avoided) |
| \d a decimal digit |
| \D a character that is not a decimal digit |
| \h a horizontal white space character |
| \H a character that is not a horizontal white space character |
| \N a character that is not a newline |
| \p{<i>xx</i>} a character with the <i>xx</i> property |
| \P{<i>xx</i>} a character without the <i>xx</i> property |
| \R a newline sequence |
| \s a white space character |
| \S a character that is not a white space character |
| \v a vertical white space character |
| \V a character that is not a vertical white space character |
| \w a "word" character |
| \W a "non-word" character |
| \X a Unicode extended grapheme cluster |
| </pre> |
| \C is dangerous because it may leave the current matching point in the middle |
| of a UTF-8 or UTF-16 character. The application can lock out the use of \C by |
| setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2 |
| with the use of \C permanently disabled. |
| </p> |
| <p> |
| By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode |
| or in the 16-bit and 32-bit libraries. However, if locale-specific matching is |
| happening, \s and \w may also match characters with code points in the range |
| 128-255. If the PCRE2_UCP option is set, the behaviour of these escape |
| sequences is changed to use Unicode properties and they match many more |
| characters, but there are some option settings that can restrict individual |
| sequences to matching only ASCII characters. |
| </p> |
| <p> |
| Property descriptions in \p and \P are matched caselessly; hyphens, |
| underscores, and ASCII white space characters are ignored, in accordance with |
| Unicode's "loose matching" rules. For example, \p{Bidi_Class=al} is the same |
| as \p{ bidi class = AL }. |
| </p> |
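| <p> |
| The loose matching rule amounts to folding each property description before comparing |
| it. The following is a minimal sketch of that folding, not PCRE2's actual code; the |
| function name is hypothetical. |
| </p> |

```python
import string

def loose_key(description):
    """Fold a property description: ignore case, hyphens, underscores
    and ASCII white space (hypothetical helper for illustration)."""
    ignored = set("-_" + string.whitespace)
    return "".join(ch for ch in description.lower() if ch not in ignored)

# The two spellings from the text above fold to the same key.
first = loose_key("Bidi_Class=al")
second = loose_key(" bidi class = AL ")
```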
| <h2><a name="SEC6" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a></h2> |
| <p> |
| <pre> |
| C Other |
| Cc Control |
| Cf Format |
| Cn Unassigned |
| Co Private use |
| Cs Surrogate |
| |
| L Letter |
| Lc Cased letter, the union of Ll, Lu, and Lt |
| L& Synonym of Lc |
| Ll Lower case letter |
| Lm Modifier letter |
| Lo Other letter |
| Lt Title case letter |
| Lu Upper case letter |
| |
| M Mark |
| Mc Spacing mark |
| Me Enclosing mark |
| Mn Non-spacing mark |
| |
| N Number |
| Nd Decimal number |
| Nl Letter number |
| No Other number |
| |
| P Punctuation |
| Pc Connector punctuation |
| Pd Dash punctuation |
| Pe Close punctuation |
| Pf Final punctuation |
| Pi Initial punctuation |
| Po Other punctuation |
| Ps Open punctuation |
| |
| S Symbol |
| Sc Currency symbol |
| Sk Modifier symbol |
| Sm Mathematical symbol |
| So Other symbol |
| |
| Z Separator |
| Zl Line separator |
| Zp Paragraph separator |
| Zs Space separator |
| </pre> |
| From release 10.45, when caseless matching is set, Ll, Lu, and Lt are all |
| equivalent to Lc. |
| </p> |
| <h2><a name="SEC7" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a></h2> |
| <p> |
| <pre> |
| Xan Alphanumeric: union of properties L and N |
| Xps POSIX space: property Z or tab, NL, VT, FF, CR |
| Xsp Perl space: property Z or tab, NL, VT, FF, CR |
| Xuc Universally-named character: one that can be |
| represented by a Universal Character Name |
| Xwd Perl word: property Xan or underscore |
| </pre> |
| Perl and POSIX space are now the same. Perl added VT to its space character set |
| at release 5.18. |
| </p> |
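| <p> |
| The definitions of Xan, Xsp, and Xwd above can be expressed in terms of general |
| categories. The sketch below uses Python's <b>unicodedata</b> module, whose Unicode |
| version may differ from the one PCRE2 was built with; the function names are |
| hypothetical. |
| </p> |

```python
import unicodedata

def is_xan(ch):
    # Alphanumeric: general category L (letter) or N (number).
    return unicodedata.category(ch)[0] in ("L", "N")

def is_xsp(ch):
    # Perl/POSIX space: general category Z, or tab, NL, VT, FF, CR.
    return unicodedata.category(ch)[0] == "Z" or ch in ("\t", "\n", "\v", "\f", "\r")

def is_xwd(ch):
    # Perl word: Xan or underscore.
    return is_xan(ch) or ch == "_"
```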
| <h2><a name="SEC8" href="#TOC1">BINARY PROPERTIES FOR \p AND \P</a></h2> |
| <p> |
| Unicode defines a number of binary properties, that is, properties whose only |
| values are true or false. You can obtain a list of those that are recognized by |
| \p and \P, along with their abbreviations, by running this command: |
| <pre> |
| pcre2test -LP |
| </pre> |
| </p> |
| <h2><a name="SEC9" href="#TOC1">SCRIPT MATCHING WITH \p AND \P</a></h2> |
| <p> |
| Many script names and their 4-letter abbreviations are recognized in |
| \p{sc:...} or \p{scx:...} items, or on their own with \p (and also \P of |
| course). You can obtain a list of these scripts by running this command: |
| <pre> |
| pcre2test -LS |
| </pre> |
| </p> |
| <h2><a name="SEC10" href="#TOC1">THE BIDI_CLASS PROPERTY FOR \p AND \P</a></h2> |
| <p> |
| <pre> |
| \p{Bidi_Class:&lt;class&gt;} matches a character with the given class |
| \p{BC:&lt;class&gt;} matches a character with the given class |
| </pre> |
| The recognized classes are: |
| <pre> |
| AL Arabic letter |
| AN Arabic number |
| B paragraph separator |
| BN boundary neutral |
| CS common separator |
| EN European number |
| ES European separator |
| ET European terminator |
| FSI first strong isolate |
| L left-to-right |
| LRE left-to-right embedding |
| LRI left-to-right isolate |
| LRO left-to-right override |
| NSM non-spacing mark |
| ON other neutral |
| PDF pop directional format |
| PDI pop directional isolate |
| R right-to-left |
| RLE right-to-left embedding |
| RLI right-to-left isolate |
| RLO right-to-left override |
| S segment separator |
| WS white space |
| </pre> |
| </p> |
| <h2><a name="SEC11" href="#TOC1">CHARACTER CLASSES</a></h2> |
| <p> |
| <pre> |
| [...] positive character class |
| [^...] negative character class |
| [x-y] range (can be used for hex characters) |
| [[:xxx:]] positive POSIX named set |
| [[:^xxx:]] negative POSIX named set |
| |
| alnum alphanumeric |
| alpha alphabetic |
| ascii 0-127 |
| blank space or tab |
| cntrl control character |
| digit decimal digit |
| graph printing, excluding space |
| lower lower case letter |
| print printing, including space |
| punct printing, excluding alphanumeric |
| space white space |
| upper upper case letter |
| word same as \w |
| xdigit hexadecimal digit |
| </pre> |
| In PCRE2, POSIX character set names recognize only ASCII characters by default, |
| but some of them use Unicode properties if PCRE2_UCP is set. You can use |
| \Q...\E inside a character class. |
| </p> |
| <p> |
| When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes may be |
| used, allowing nested character classes, combined using set operators. |
| <pre> |
| [x&&[^y]] UTS#18 extended character class |
| |
| x||y set union (OR) |
| x&&y set intersection (AND) |
| x--y set difference (AND NOT) |
| x~~y set symmetric difference (XOR) |
| |
| </pre> |
| </p> |
| <h2><a name="SEC12" href="#TOC1">PERL EXTENDED CHARACTER CLASSES</a></h2> |
| <p> |
| <pre> |
| (?[...]) Perl extended character class |
| (?[\p{Thai} & \p{Nd}]) operators; white space ignored |
| (?[(x - y) & z]) parentheses for grouping |
| |
| (?[ [^3] & \p{Nd} ]) [...] is a nested ordinary class |
| (?[ [:alpha:] - [z] ]) POSIX set is allowed outside [...] |
| (?[ \d - [3] ]) backslash-escaped set is allowed outside [...] |
| (?[ !\n & [:ascii:] ]) backslash-escaped character is allowed outside [...] |
| all other characters or ranges must be enclosed in [...] |
| |
| x|y, x+y set union (OR) |
| x&y set intersection (AND) |
| x-y set difference (AND NOT) |
| x^y set symmetric difference (XOR) |
| !x set complement (NOT) |
| </pre> |
| Inside a Perl extended character class, [...] switches mode to be interpreted |
| as an ordinary character class. Outside of a nested [...], the only items |
| permitted are backslash-escapes, POSIX sets, operators, and parentheses. Inside |
| a nested ordinary class, ^ has its usual meaning (inverts the class when used |
| as the first character); outside of a nested class, ^ is the XOR operator. |
| </p> |
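| <p> |
| The operators in a Perl extended character class behave like ordinary set algebra. |
| The sketch below models a few of the examples above with Python sets. It assumes an |
| ASCII-only universe for the complement operator, which is a simplification made for |
| this example only. |
| </p> |

```python
import string

# Assumption: the "universe" is ASCII only, to keep the sets small.
ASCII = {chr(code) for code in range(128)}
alpha = set(string.ascii_letters)   # [:alpha:] restricted to ASCII
digits = set(string.digits)         # \d restricted to ASCII

alpha_not_z = alpha - {"z"}                    # (?[ [:alpha:] - [z] ])
digits_not_3 = digits - {"3"}                  # (?[ \d - [3] ])
not_newline_ascii = (ASCII - {"\n"}) & ASCII   # (?[ !\n & [:ascii:] ])
alnum = alpha | digits                         # x|y or x+y (union)
either_not_both = set("abc") ^ set("bcd")      # x^y (symmetric difference)
```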
| <h2><a name="SEC13" href="#TOC1">QUANTIFIERS</a></h2> |
| <p> |
| <pre> |
| ? 0 or 1, greedy |
| ?+ 0 or 1, possessive |
| ?? 0 or 1, lazy |
| * 0 or more, greedy |
| *+ 0 or more, possessive |
| *? 0 or more, lazy |
| + 1 or more, greedy |
| ++ 1 or more, possessive |
| +? 1 or more, lazy |
| {n} exactly n |
| {n,m} at least n, no more than m, greedy |
| {n,m}+ at least n, no more than m, possessive |
| {n,m}? at least n, no more than m, lazy |
| {n,} n or more, greedy |
| {n,}+ n or more, possessive |
| {n,}? n or more, lazy |
| {,m} zero up to m, greedy |
| {,m}+ zero up to m, possessive |
| {,m}? zero up to m, lazy |
| </pre> |
| </p> |
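| <p> |
| The difference between greedy and lazy quantifiers is easiest to see on a small |
| subject. The sketch below uses Python's <b>re</b> module as a stand-in (an assumption; |
| it is not PCRE2), restricted to forms that both engines share. Possessive quantifiers |
| are left out because older Python versions do not accept them. |
| </p> |

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as possible, so the match runs to the last '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# {n,m} bounds the repeat count: four letters exceed the maximum of three.
too_many = re.fullmatch(r"a{2,3}", "aaaa")

# {,m} means zero up to m.
up_to_two = re.fullmatch(r"a{,2}", "aa")
```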
| <h2><a name="SEC14" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a></h2> |
| <p> |
| <pre> |
| \b word boundary |
| \B not a word boundary |
| ^ start of subject |
| also after an internal newline in multiline mode |
| (after any newline if PCRE2_ALT_CIRCUMFLEX is set) |
| \A start of subject |
| $ end of subject |
| also before newline at end of subject |
| also before internal newline in multiline mode |
| \Z end of subject |
| also before newline at end of subject |
| \z end of subject |
| \G first matching position in subject |
| </pre> |
| </p> |
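| <p> |
| A few of these anchors are illustrated below with Python's <b>re</b> module, used here |
| only as a stand-in for the shared syntax (an assumption; it is not PCRE2). Note that |
| Python's own \Z means "end of subject only", which corresponds to \z in the table above, |
| so it is not used in this sketch. |
| </p> |

```python
import re

subject = "one\ntwo\n"

# ^ matches only at the start of the subject by default ...
default_starts = re.findall(r"^\w+", subject)

# ... and also after internal newlines in multiline mode.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ also matches before a newline at the very end of the subject.
before_final_newline = re.search(r"two$", subject)

# \A matches only at the start of the subject, even in multiline mode.
start_only = re.search(r"\Atwo", subject, re.MULTILINE)

# \b matches at word boundaries, so "cat" inside "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")
```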
| <h2><a name="SEC15" href="#TOC1">REPORTED MATCH POINT SETTING</a></h2> |
| <p> |
| <pre> |
| \K set reported start of match |
| </pre> |
| From release 10.38 \K is not permitted by default in lookaround assertions, |
| for compatibility with Perl. However, if the PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK |
| option is set, the previous behaviour is re-enabled. When this option is set, |
| \K is honoured in positive assertions, but ignored in negative ones. |
| </p> |
| <h2><a name="SEC16" href="#TOC1">ALTERNATION</a></h2> |
| <p> |
| <pre> |
| expr|expr|expr... |
| </pre> |
| </p> |
| <h2><a name="SEC17" href="#TOC1">CAPTURING</a></h2> |
| <p> |
| <pre> |
| (...) capture group |
| (?&lt;name&gt;...) named capture group (Perl) |
| (?'name'...) named capture group (Perl) |
| (?P&lt;name&gt;...) named capture group (Python) |
| (?:...) non-capture group |
| (?|...) non-capture group; reset group numbers for |
| capture groups in each alternative |
| </pre> |
| In non-UTF modes, names may contain underscores and ASCII letters and digits; |
| in UTF modes, any Unicode letters and Unicode decimal digits are permitted. In |
| both cases, a name must not start with a digit. |
| </p> |
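| <p> |
| The Python form of named capture groups, together with a non-capture group, can be |
| tried directly in Python's <b>re</b> module. This is a sketch using Python as a stand-in |
| for the shared syntax; the group names are arbitrary. |
| </p> |

```python
import re

# Two named capture groups followed by an optional non-capture group.
date = re.fullmatch(
    r"(?P<year>\d{4})-(?P<month>\d{2})(?:-\d{2})?",
    "2025-09-02",
)

year = date.group("year")   # look up a group by name
numbered = date.groups()    # the (?:...) group does not capture
by_name = date.groupdict()
```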
| <h2><a name="SEC18" href="#TOC1">ATOMIC GROUPS</a></h2> |
| <p> |
| <pre> |
| (?>...) atomic non-capture group |
| (*atomic:...) atomic non-capture group |
| </pre> |
| </p> |
| <h2><a name="SEC19" href="#TOC1">COMMENT</a></h2> |
| <p> |
| <pre> |
| (?#....) comment (not nestable) |
| </pre> |
| </p> |
| <h2><a name="SEC20" href="#TOC1">OPTION SETTING</a></h2> |
| <p> |
| Changes of these options within a group are automatically cancelled at the end |
| of the group. |
| <pre> |
| (?a) all ASCII options |
| (?aD) restrict \d to ASCII in UCP mode |
| (?aS) restrict \s to ASCII in UCP mode |
| (?aW) restrict \w to ASCII in UCP mode |
| (?aP) restrict all POSIX classes to ASCII in UCP mode |
| (?aT) restrict POSIX digit classes to ASCII in UCP mode |
| (?i) caseless |
| (?J) allow duplicate named groups |
| (?m) multiline |
| (?n) no auto capture |
| (?r) restrict caseless to either ASCII or non-ASCII |
| (?s) single line (dotall) |
| (?U) default ungreedy (lazy) |
| (?x) ignore white space except in classes or \Q...\E |
| (?xx) as (?x) but also ignore space and tab in classes |
| (?-...) unset the given option(s) |
| (?^) unset imnrsx options |
| </pre> |
| (?aP) implies (?aT) as well, though this has no additional effect. However, it |
| means that (?-aP) also implies (?-aT) and disables all ASCII restrictions for |
| POSIX classes. |
| </p> |
| <p> |
| Unsetting x or xx unsets both. Several options may be set at once, and a |
| mixture of setting and unsetting such as (?i-x) is allowed, but there may be |
| only one hyphen. Setting (but not unsetting) is allowed after (?^, for example |
| (?^in). An option setting may appear at the start of a non-capture group, for |
| example (?i:...). |
| </p> |
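| <p> |
| The scoping of an option set at the start of a non-capture group can be sketched with |
| Python's <b>re</b> module, which shares the (?i) and (?i:...) forms. Python is only a |
| stand-in here (an assumption), and it is stricter than PCRE2 about where an unscoped |
| option setting may appear. |
| </p> |

```python
import re

# (?i:...) applies caseless matching only inside the group.
scoped = r"(?i:abc)def"
inside_group = re.fullmatch(scoped, "ABCdef")    # "ABC" is matched caselessly
outside_group = re.fullmatch(scoped, "ABCDEF")   # "DEF" is still case-sensitive

# (?i) at the start applies to the whole pattern.
whole_pattern = re.fullmatch(r"(?i)abcdef", "ABCDEF")
```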
| <p> |
| The following are recognized only at the very start of a pattern or after one |
| of the newline or \R sequences or options with similar syntax. More than one |
| of them may appear. For the first three, d is a decimal number. |
| <pre> |
| (*LIMIT_DEPTH=d) set the backtracking limit to d |
| (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes |
| (*LIMIT_MATCH=d) set the match limit to d |
| (*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching |
| (*NOTEMPTY) set PCRE2_NOTEMPTY when matching |
| (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching |
| (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS) |
| (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) |
| (*NO_JIT) disable JIT optimization |
| (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE) |
| (*TURKISH_CASING) set PCRE2_EXTRA_TURKISH_CASING when matching |
| (*UTF) set appropriate UTF mode for the library in use |
| (*UCP) set PCRE2_UCP (use Unicode properties for \d etc) |
| </pre> |
| Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of |
| the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>, |
| not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The |
| application can lock out the use of (*UTF) and (*UCP) by setting the |
| PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time. |
| </p> |
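| <p> |
| The rule that the LIMIT items can only reduce the caller's limits amounts to taking a |
| minimum. The following is a simplified sketch of that rule, not PCRE2's implementation. |
| The function name and the dictionary of caller limits are hypothetical, values are |
| assumed to be in the same units as d, and only LIMIT items at the very start of the |
| pattern are scanned. |
| </p> |

```python
import re

# Matches one leading (*LIMIT_...=d) item.
LIMIT_ITEM = re.compile(r"\(\*(LIMIT_DEPTH|LIMIT_HEAP|LIMIT_MATCH)=(\d+)\)")

def effective_limits(pattern, caller_limits):
    """Return the limits in force after applying leading LIMIT items.

    Hypothetical helper: `caller_limits` maps item names to the values
    set by the caller, in the same units as d.
    """
    limits = dict(caller_limits)
    position = 0
    while True:
        item = LIMIT_ITEM.match(pattern, position)
        if item is None:
            break
        name, value = item.group(1), int(item.group(2))
        # A pattern can lower a limit but never raise it.
        limits[name] = min(limits[name], value)
        position = item.end()
    return limits

caller = {"LIMIT_DEPTH": 1000, "LIMIT_HEAP": 20000, "LIMIT_MATCH": 10000}
result = effective_limits("(*LIMIT_MATCH=500)(*LIMIT_DEPTH=9999)abc", caller)
```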
| <h2><a name="SEC21" href="#TOC1">NEWLINE CONVENTION</a></h2> |
| <p> |
| These are recognized only at the very start of the pattern or after option |
| settings with a similar syntax. |
| <pre> |
| (*CR) carriage return only |
| (*LF) linefeed only |
| (*CRLF) carriage return followed by linefeed |
| (*ANYCRLF) all three of the above |
| (*ANY) any Unicode newline sequence |
| (*NUL) the NUL character (binary zero) |
| </pre> |
| </p> |
| <h2><a name="SEC22" href="#TOC1">WHAT \R MATCHES</a></h2> |
| <p> |
| These are recognized only at the very start of the pattern or after option |
| settings with a similar syntax. |
| <pre> |
| (*BSR_ANYCRLF) CR, LF, or CRLF |
| (*BSR_UNICODE) any Unicode newline sequence |
| </pre> |
| </p> |
| <h2><a name="SEC23" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a></h2> |
| <p> |
| <pre> |
| (?=...) ) |
| (*pla:...) ) positive lookahead |
| (*positive_lookahead:...) ) |
| |
| (?!...) ) |
| (*nla:...) ) negative lookahead |
| (*negative_lookahead:...) ) |
| |
| (?&lt;=...) ) |
| (*plb:...) ) positive lookbehind |
| (*positive_lookbehind:...) ) |
| |
| (?&lt;!...) ) |
| (*nlb:...) ) negative lookbehind |
| (*negative_lookbehind:...) ) |
| </pre> |
| Each top-level branch of a lookbehind must have a limit for the number of |
| characters it matches. If any branch can match a variable number of characters, |
| the maximum for each branch is limited to a value set by the caller of |
| <b>pcre2_compile()</b> or defaulted. The default is set when PCRE2 is built |
| (ultimate default 255). If every branch matches a fixed number of characters, |
| the limit for each branch is 65535 characters. |
| </p> |
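| <p> |
| The short forms of the four assertions can be tried in Python's <b>re</b> module, which |
| is used here only as a stand-in (an assumption; it is not PCRE2, and it accepts only |
| fixed-length lookbehinds). The subject text is made up for the example. |
| </p> |

```python
import re

text = "cost $45 or 30%"

# Positive lookbehind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative lookahead: whole numbers not followed by a percent sign.
not_percent = re.findall(r"\b\d+\b(?!%)", text)

# Negative lookbehind: numbers that do not directly follow a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)

# Positive lookahead: digits that are followed by a percent sign.
before_percent = re.findall(r"\d+(?=%)", text)
```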
| <h2><a name="SEC24" href="#TOC1">NON-ATOMIC LOOKAROUND ASSERTIONS</a></h2> |
| <p> |
| These assertions are specific to PCRE2 and are not Perl-compatible. |
| <pre> |
| (?*...) ) |
| (*napla:...) ) synonyms |
| (*non_atomic_positive_lookahead:...) ) |
| |
| (?&lt;*...) ) |
| (*naplb:...) ) synonyms |
| (*non_atomic_positive_lookbehind:...) ) |
| </pre> |
| </p> |
| <h2><a name="SEC25" href="#TOC1">SUBSTRING SCAN ASSERTION</a></h2> |
| <p> |
| This feature is not Perl-compatible. |
| <pre> |
| (*scan_substring:(grouplist)...) scan captured substring |
| (*scs:(grouplist)...) scan captured substring |
| </pre> |
| The comma-separated list "grouplist" may identify groups in any of the |
| following ways: |
| <pre> |
| n absolute reference |
| +n relative reference |
| -n relative reference |
| &lt;name&gt; name |
| 'name' name |
| </pre> |
| </p> |
| <h2><a name="SEC26" href="#TOC1">SCRIPT RUNS</a></h2> |
| <p> |
| <pre> |
| (*script_run:...) ) script run, can be backtracked into |
| (*sr:...) ) |
| |
| (*atomic_script_run:...) ) atomic script run |
| (*asr:...) ) |
| </pre> |
| </p> |
| <h2><a name="SEC27" href="#TOC1">BACKREFERENCES</a></h2> |
| <p> |
| <pre> |
| \n reference by number (can be ambiguous) |
| \gn reference by number |
| \g{n} reference by number |
| \g+n relative reference by number (PCRE2 extension) |
| \g-n relative reference by number |
| \g{+n} relative reference by number (PCRE2 extension) |
| \g{-n} relative reference by number |
| \k&lt;name&gt; reference by name (Perl) |
| \k'name' reference by name (Perl) |
| \g{name} reference by name (Perl) |
| \k{name} reference by name (.NET) |
| (?P=name) reference by name (Python) |
| </pre> |
| </p> |
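| <p> |
| Two of these forms, the plain numbered reference and the Python named reference, are |
| also accepted by Python's <b>re</b> module, so they can be sketched there. Python is a |
| stand-in only (an assumption); the \g and \k forms are not exercised. |
| </p> |

```python
import re

subject = "it was was fine"

# \1 refers back to whatever group 1 captured: a repeated word.
numbered = re.search(r"\b(\w+) \1\b", subject)

# (?P=word) refers back to the named group "word".
named = re.search(r"\b(?P<word>\w+) (?P=word)\b", subject)
```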
| <h2><a name="SEC28" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a></h2> |
| <p> |
| <pre> |
| (?R) recurse whole pattern |
| (?n) call subroutine by absolute number |
| (?+n) call subroutine by relative number |
| (?-n) call subroutine by relative number |
| (?&name) call subroutine by name (Perl) |
| (?P>name) call subroutine by name (Python) |
| \g&lt;name&gt; call subroutine by name (Oniguruma) |
| \g'name' call subroutine by name (Oniguruma) |
| \g&lt;n&gt; call subroutine by absolute number (Oniguruma) |
| \g'n' call subroutine by absolute number (Oniguruma) |
| \g&lt;+n&gt; call subroutine by relative number (PCRE2 extension) |
| \g'+n' call subroutine by relative number (PCRE2 extension) |
| \g&lt;-n&gt; call subroutine by relative number (PCRE2 extension) |
| \g'-n' call subroutine by relative number (PCRE2 extension) |
| </pre> |
| The variants using parentheses (?...) may also specify a list of capture groups |
| to return, which shall be retained in the calling subexpression if set during |
| the recursion (this feature is not supported by Perl). |
| <pre> |
| (?R(grouplist)) recurse whole pattern, returning capture groups |
| (PCRE2 extension) |
| (?n(grouplist)) ) |
| (?+n(grouplist)) ) call subroutine, returning capture groups |
| (?-n(grouplist)) ) (PCRE2 extension) |
| (?&name(grouplist)) ) |
| (?P>name(grouplist)) ) |
| </pre> |
| The comma-separated list "grouplist" uses the same syntax as |
| (*scan_substring:(grouplist)...), and may identify groups in any of the |
| following ways: |
| <pre> |
| n absolute reference |
| +n relative reference |
| -n relative reference |
| &lt;name&gt; name |
| 'name' name |
| </pre> |
| </p> |
| <h2><a name="SEC29" href="#TOC1">CONDITIONAL PATTERNS</a></h2> |
| <p> |
| <pre> |
| (?(condition)yes-pattern) |
| (?(condition)yes-pattern|no-pattern) |
| |
| (?(n) absolute reference condition |
| (?(+n) relative reference condition (PCRE2 extension) |
| (?(-n) relative reference condition (PCRE2 extension) |
| (?(&lt;name&gt;) named reference condition (Perl) |
| (?('name') named reference condition (Perl) |
| (?(name) named reference condition (PCRE2, deprecated) |
| (?(R) overall recursion condition |
| (?(Rn) specific numbered group recursion condition |
| (?(R&name) specific named group recursion condition |
| (?(DEFINE) define groups for reference |
| (?(VERSION[>]=n[.m]) test PCRE2 version |
| (?(assert) assertion condition |
| </pre> |
| Note the ambiguity of (?(R) and (?(Rn) which might be named reference |
| conditions or recursion tests. Such a condition is interpreted as a reference |
| condition if the relevant named group exists. |
| <br> |
| <br> |
| The parts within square brackets in the VERSION conditional syntax may be omitted. |
| If the fractional part of the version number is omitted, it defaults to 0. |
| </p> |
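| <p> |
| An absolute reference condition can be sketched with Python's <b>re</b> module, which |
| accepts the (?(n)yes-pattern) form. Python is only a stand-in here (an assumption), and |
| the pattern is made up for the example: a closing angle bracket is required only if |
| an opening one was captured. |
| </p> |

```python
import re

# Group 1 optionally captures '<'; (?(1)>) demands '>' only if group 1 is set.
bracketed = re.compile(r"(<)?\w+(?(1)>)")

both_brackets = bracketed.fullmatch("<abc>")
no_brackets = bracketed.fullmatch("abc")
missing_close = bracketed.fullmatch("<abc")
stray_close = bracketed.fullmatch("abc>")
```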
| <h2><a name="SEC30" href="#TOC1">BACKTRACKING CONTROL</a></h2> |
| <p> |
| All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the |
| name is mandatory, for the others it is optional. (*SKIP) changes its behaviour |
| if :NAME is present. The others just set a name for passing back to the caller, |
| but this is not a name that (*SKIP) can see. The following act as soon as they |
| are reached: |
| <pre> |
| (*ACCEPT) force successful match |
| (*FAIL) force backtrack; synonym (*F) |
| (*MARK:NAME) set name to be passed back; synonym (*:NAME) |
| </pre> |
| The following act only when a subsequent match failure causes a backtrack to |
| reach them. They all force a match failure, but they differ in what happens |
| afterwards. Those that advance the start-of-match point do so only if the |
| pattern is not anchored. |
| <pre> |
| (*COMMIT) overall failure, no advance of starting point |
| (*PRUNE) advance to next starting character |
| (*SKIP) advance to current matching position |
| (*SKIP:NAME) advance to position corresponding to an earlier |
| (*MARK:NAME); if not found, the (*SKIP) is ignored |
| (*THEN) local failure, backtrack to next alternation |
| </pre> |
| The effect of one of these verbs in a group called as a subroutine is confined |
| to the subroutine call. |
| </p> |
| <h2><a name="SEC31" href="#TOC1">CALLOUTS</a></h2> |
| <p> |
| <pre> |
| (?C) callout (assumed number 0) |
| (?Cn) callout with numerical data n |
| (?C"text") callout with string data |
| </pre> |
| The allowed string delimiters are ` ' " ^ % # $ (which are the same for the |
| start and the end), and the starting delimiter { matched with the ending |
| delimiter }. To encode the ending delimiter within the string, double it. |
| </p> |
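| <p> |
| The delimiter rule for callout strings can be sketched as a small builder that doubles |
| the ending delimiter inside the text. The function name is hypothetical and the code |
| only assembles the pattern text; it does not call PCRE2. |
| </p> |

```python
# Delimiters that are the same at the start and the end of the string.
SAME_DELIMITERS = "`'\"^%#$"

def callout_with_string(text, delimiter='"'):
    """Build a (?C...) item with string data (hypothetical helper)."""
    if delimiter == "{":
        closing = "}"          # '{' is matched with '}'
    elif len(delimiter) == 1 and delimiter in SAME_DELIMITERS:
        closing = delimiter
    else:
        raise ValueError("not an allowed callout string delimiter")
    # The ending delimiter is encoded inside the string by doubling it.
    return "(?C" + delimiter + text.replace(closing, closing * 2) + closing + ")"

quoted = callout_with_string('say "hi"')
braced = callout_with_string("a}b", "{")
```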
| <h2><a name="SEC32" href="#TOC1">REPLACEMENT STRINGS</a></h2> |
| <p> |
| If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for |
| <b>pcre2_substitute()</b> is not interpreted. Otherwise, by default, the only |
| special character is the dollar character in one of the following forms: |
| <pre> |
| $$ insert a dollar character |
| $n or ${n} insert the contents of group <i>n</i> |
| $&lt;name&gt; insert the contents of named group |
| $0 or $& insert the entire matched substring |
| $` insert the substring that precedes the match |
| $' insert the substring that follows the match |
| $_ insert the entire input string |
| $*MARK or ${*MARK} insert a control verb name |
| </pre> |
| For ${n}, n can be a name or a number. If PCRE2_SUBSTITUTE_EXTENDED is set, |
| there is additional interpretation: |
| </p> |
| <p> |
| 1. Backslash is an escape character, and the forms described in "ESCAPED |
| CHARACTERS" above are recognized. Also: |
| <pre> |
| \Q...\E can be used to suppress interpretation |
| \l force the next character to lower case |
| \u force the next character to upper case |
| \L force subsequent characters to lower case |
| \U force subsequent characters to upper case |
| \u\L force next character to upper case, then all lower |
| \l\U force next character to lower case, then all upper |
| \E end \L or \U case forcing |
| \b backspace character (note: as in character class in pattern) |
| \v vertical tab character (note: not the same as in a pattern) |
| </pre> |
| 2. The Python form \g&lt;n&gt;, where the angle brackets are part of the syntax and |
| <i>n</i> is either a group name or a number, is recognized as an alternative way |
| of inserting the contents of a group, for example \g&lt;3&gt;. |
| </p> |
| <p> |
| 3. Capture substitution supports the following additional forms: |
| <pre> |
| ${n:-string} default for unset group |
| ${n:+string1:string2} values for set/unset group |
| </pre> |
| The substitution strings themselves are expanded. Backslash can be used to |
| escape colons and closing curly brackets. |
| </p> |
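| <p> |
| Two of the replacement forms can be approximated in Python's <b>re</b> module, used |
| here as a stand-in (an assumption; this is not <b>pcre2_substitute()</b>). The Python |
| group-reference form works directly in <b>re.sub()</b>, and the effect of a default for |
| an unset group is emulated with a callback. The helper name and the sample data are |
| hypothetical. |
| </p> |

```python
import re

# The Python form with angle brackets inserts the contents of a group.
swapped = re.sub(r"(\w+)@(\w+)", r"\g<2> at \g<1>", "user@example")

def substitute_with_default(pattern, subject, group, default):
    """Replace each match with the given group, or `default` if the
    group is unset; this emulates a replacement of ${group:-default}."""
    def replacement(match):
        value = match.group(group)
        # An unset group is reported as None; an empty string counts as set.
        return default if value is None else value
    return re.sub(pattern, replacement, subject)

# Group 1 is unset in the second match, so the default is inserted there.
filled = substitute_with_default(r"key=(\w+)?;", "key=abc; key=;", 1, "none")
```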
| <h2><a name="SEC33" href="#TOC1">SEE ALSO</a></h2> |
| <p> |
| <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3), |
| <b>pcre2matching</b>(3), <b>pcre2</b>(3). |
| </p> |
| <h2><a name="SEC34" href="#TOC1">AUTHOR</a></h2> |
| <p> |
| Philip Hazel |
| <br> |
| Retired from University Computing Service |
| <br> |
| Cambridge, England. |
| <br> |
| </p> |
| <h2><a name="SEC35" href="#TOC1">REVISION</a></h2> |
| <p> |
| Last updated: 2 September 2025 |
| <br> |
| Copyright © 1997-2024 University of Cambridge. |
| <br> |
| <p> |
| Return to the <a href="index.html">PCRE2 index page</a>. |
| </p> |