| ----------------------------------------------------------------------------- | 
 | This file contains a concatenation of the PCRE2 man pages, converted to plain | 
 | text format for ease of searching with a text editor, or for use on systems | 
 | that do not have a man page processor. The small individual files that give | 
 | synopses of each function in the library have not been included. Neither has | 
 | the pcre2demo program. There are separate text files for the pcre2grep and | 
 | pcre2test commands. | 
 | ----------------------------------------------------------------------------- | 
 |  | 
 |  | 
 | PCRE2(3)                   Library Functions Manual                   PCRE2(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | INTRODUCTION | 
 |  | 
 |        PCRE2 is the name used for a revised API for the PCRE library, which is | 
 |        a  set  of  functions,  written in C, that implement regular expression | 
 |        pattern matching using the same syntax and semantics as Perl, with just | 
 |        a few differences. After nearly two decades,  the  limitations  of  the | 
 |        original  API  were  making development increasingly difficult. The new | 
 |        API is more extensible, and it was simplified by abolishing  the  sepa- | 
 |        rate  "study" optimizing function; in PCRE2, patterns are automatically | 
 |        optimized where possible. Since forking from PCRE1, the code  has  been | 
 |        extensively  refactored and new features introduced. The old library is | 
 |        now obsolete and is no longer maintained. | 
 |  | 
 |        As well as Perl-style regular expression patterns, some  features  that | 
 |        appeared  in  Python and the original PCRE before they appeared in Perl | 
 |        are available using the Python syntax. There is also support  for  some | 
 |        .NET  and  Oniguruma syntax items, and there are options for requesting | 
 |        minor changes that give better ECMAScript (JavaScript) compatibility. | 
 |  | 
 |        The source code for PCRE2 can be compiled to support strings of  8-bit, | 
 |        16-bit, or 32-bit code units, which means that up to three separate li- | 
 |        braries  may  be  installed, one for each code unit size. The size of a | 
 |        code unit is not related to the bit size of the underlying hardware. In | 
 |        a 64-bit environment that also supports 32-bit  applications,  versions | 
 |        of  PCRE2  that  are  compiled  in  both 64-bit and 32-bit modes may be | 
 |        needed. | 
 |  | 
 |        The original work to extend PCRE to 16-bit and 32-bit  code  units  was | 
 |        done by Zoltan Herczeg and Christian Persch, respectively. In all three | 
 |        cases,  strings  can  be  interpreted  either as one character per code | 
 |        unit, or as UTF-encoded Unicode, with support for Unicode general cate- | 
 |        gory properties. Unicode support is optional at build time (but is  the | 
 |        default). However, processing strings as UTF code units must be enabled | 
 |        explicitly at run time. The version of Unicode in use can be discovered | 
 |        by running | 
 |  | 
 |          pcre2test -C | 
 |  | 
 |        The  three  libraries  contain  identical sets of functions, with names | 
 |        ending in _8,  _16,  or  _32,  respectively  (for  example,  pcre2_com- | 
 |        pile_8()).  However,  by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or | 
 |        32, a program that uses just one code unit width can be  written  using | 
 |        generic names such as pcre2_compile(), and the documentation is written | 
 |        assuming that this is the case. | 
 |  | 
 |        In addition to the Perl-compatible matching function, PCRE2 contains an | 
 |        alternative  function that matches the same compiled patterns in a dif- | 
 |        ferent way. In certain circumstances, the alternative function has some | 
 |        advantages.  For a discussion of the two matching algorithms,  see  the | 
 |        pcre2matching page. | 
 |  | 
 |        Details  of  exactly which Perl regular expression features are and are | 
 |        not supported by  PCRE2  are  given  in  separate  documents.  See  the | 
 |        pcre2pattern  and  pcre2compat  pages. There is a syntax summary in the | 
 |        pcre2syntax page. | 
 |  | 
 |        Some features of PCRE2 can be included, excluded, or changed  when  the | 
 |        library  is  built. The pcre2_config() function makes it possible for a | 
 |        client to discover which features are  available.  The  features  them- | 
 |        selves are described in the pcre2build page. Documentation about build- | 
 |        ing  PCRE2 for various operating systems can be found in the README and | 
 |        NON-AUTOTOOLS-BUILD files in the source distribution. | 
 |  | 
 |        The libraries contain  a number of undocumented internal functions  and | 
 |        data  tables  that  are  used by more than one of the exported external | 
 |        functions, but which are not intended  for  use  by  external  callers. | 
 |        Their  names  all begin with "_pcre2", which hopefully will not provoke | 
 |        any name clashes. In some environments, it is possible to control which | 
 |        external symbols are exported when a shared library is  built,  and  in | 
 |        these cases the undocumented symbols are not exported. | 
 |  | 
 |  | 
 | SECURITY CONSIDERATIONS | 
 |  | 
 |        If  you  are using PCRE2 in a non-UTF application that permits users to | 
 |        supply arbitrary patterns for compilation, you should  be  aware  of  a | 
 |        feature that allows users to turn on UTF support from within a pattern. | 
 |        For  example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8 | 
 |        mode, which interprets patterns and subjects as strings of  UTF-8  code | 
 |        units instead of individual 8-bit characters. This causes both the pat- | 
 |        tern  and  any data against which it is matched to be checked for UTF-8 | 
 |        validity. If the data string is very long, such a check might  consume | 
 |        enough resources to degrade your application's performance. | 
 |  | 
 |        One way of guarding against this possibility is to use  the  pcre2_pat- | 
 |        tern_info()  function  to  check  the  compiled  pattern's  options for | 
 |        PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF  option  when | 
 |        calling  pcre2_compile().  This causes a compile time error if the pat- | 
 |        tern contains a UTF-setting sequence. | 
 |  | 
 |        The use of Unicode properties for character types such as \d  can  also | 
 |        be  enabled  from within the pattern, by specifying "(*UCP)". This fea- | 
 |        ture can be disallowed by setting the PCRE2_NEVER_UCP option. | 
 |  | 
 |        If your application is one that supports UTF, be  aware  that  validity | 
 |        checking  can  take time. If the same data string is to be matched many | 
 |        times, you can use the PCRE2_NO_UTF_CHECK option  for  the  second  and | 
 |        subsequent matches to avoid running redundant checks. | 
 |  | 
 |        The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead | 
 |        to  problems,  because  it  may leave the current matching point in the | 
 |        middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C  op- | 
 |        tion can be used by an application to lock out the use of \C, causing a | 
 |        compile-time  error  if it is encountered. It is also possible to build | 
 |        PCRE2 with the use of \C permanently disabled. | 
 |  | 
 |        Performance can also suffer when a pattern that has a very large search | 
 |        tree  is  run  against  a  string  that  will  never match. | 
 |        Nested unlimited repeats in a pattern are a common example. PCRE2  pro- | 
 |        vides  some  protection  against  this: see the pcre2_set_match_limit() | 
 |        function in the pcre2api page.  There  is  a  similar  function  called | 
 |        pcre2_set_depth_limit() that can be used to restrict the amount of mem- | 
 |        ory that is used. | 
 |  | 
 |  | 
 | USER DOCUMENTATION | 
 |  | 
 |        The  user  documentation for PCRE2 comprises a number of different sec- | 
 |        tions. In the "man" format, each of these is a separate "man page".  In | 
 |        the  HTML  format, each is a separate page, linked from the index page. | 
 |        In the plain  text  format,  the  descriptions  of  the  pcre2grep  and | 
 |        pcre2test programs are in files called pcre2grep.txt and pcre2test.txt, | 
 |        respectively.  The remaining sections, except for the pcre2demo section | 
 |        (which is a program listing), and the short pages for individual  func- | 
 |        tions,  are  concatenated in pcre2.txt, for ease of searching. The sec- | 
 |        tions are as follows: | 
 |  | 
 |          pcre2              this document | 
 |          pcre2-config       show PCRE2 installation configuration information | 
 |          pcre2api           details of PCRE2's native C API | 
 |          pcre2build         building PCRE2 | 
 |          pcre2callout       details of the pattern callout feature | 
 |          pcre2compat        discussion of Perl compatibility | 
 |          pcre2convert       details of pattern conversion functions | 
 |          pcre2demo          a demonstration C program that uses PCRE2 | 
 |          pcre2grep          description of the pcre2grep command (8-bit only) | 
 |          pcre2jit           discussion of just-in-time optimization support | 
 |          pcre2limits        details of size and other limits | 
 |          pcre2matching      discussion of the two matching algorithms | 
 |          pcre2partial       details of the partial matching facility | 
 |          pcre2pattern       syntax and semantics of supported regular | 
 |                               expression patterns | 
 |          pcre2perform       discussion of performance issues | 
 |          pcre2posix         the POSIX-compatible C API for the 8-bit library | 
 |          pcre2sample        discussion of the pcre2demo program | 
 |          pcre2serialize     details of pattern serialization | 
 |          pcre2syntax        quick syntax reference | 
 |          pcre2test          description of the pcre2test command | 
 |          pcre2unicode       discussion of Unicode and UTF support | 
 |  | 
 |        In the "man" and HTML formats, there is also a short page  for  each  C | 
 |        library function, listing its arguments and results. | 
 |  | 
 |  | 
 | AUTHORS | 
 |  | 
 |        The  current  maintainers  of PCRE2 are Nicholas Wilson and Zoltan Her- | 
 |        czeg. | 
 |  | 
 |        PCRE2 was written by Philip Hazel, of the University Computing Service, | 
 |        Cambridge, England. Many others have also contributed. | 
 |  | 
 |        To contact the maintainers, please use the  GitHub  issues  tracker  or | 
 |        PCRE2    mailing    list,   as   described   at   the   project   page: | 
 |        https://github.com/PCRE2Project/pcre2 | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 22 February 2025 | 
 |        Copyright (c) 1997-2021 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                22 February 2025                       PCRE2(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2API(3)                Library Functions Manual                PCRE2API(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |        #include <pcre2.h> | 
 |  | 
 |        PCRE2  is  a  new API for PCRE, starting at release 10.0. This document | 
 |        contains a description of all its native functions. See the pcre2 docu- | 
 |        ment for an overview of all the PCRE2 documentation. | 
 |  | 
 |  | 
 | PCRE2 NATIVE API BASIC FUNCTIONS | 
 |  | 
 |        pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, | 
 |          uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, | 
 |          pcre2_compile_context *ccontext); | 
 |  | 
 |        void pcre2_code_free(pcre2_code *code); | 
 |  | 
 |        pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_match_data *pcre2_match_data_create_from_pattern( | 
 |          const pcre2_code *code, pcre2_general_context *gcontext); | 
 |  | 
 |        int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext, | 
 |          int *workspace, PCRE2_SIZE wscount); | 
 |  | 
 |        void pcre2_match_data_free(pcre2_match_data *match_data); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS | 
 |  | 
 |        PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE pcre2_get_match_data_heapframes_size( | 
 |          pcre2_match_data *match_data); | 
 |  | 
 |        uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS | 
 |  | 
 |        pcre2_general_context *pcre2_general_context_create( | 
 |          void *(*private_malloc)(PCRE2_SIZE, void *), | 
 |          void (*private_free)(void *, void *), void *memory_data); | 
 |  | 
 |        pcre2_general_context *pcre2_general_context_copy( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_general_context_free(pcre2_general_context *gcontext); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS | 
 |  | 
 |        pcre2_compile_context *pcre2_compile_context_create( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_compile_context *pcre2_compile_context_copy( | 
 |          pcre2_compile_context *ccontext); | 
 |  | 
 |        void pcre2_compile_context_free(pcre2_compile_context *ccontext); | 
 |  | 
 |        int pcre2_set_bsr(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_character_tables(pcre2_compile_context *ccontext, | 
 |          const uint8_t *tables); | 
 |  | 
 |        int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext, | 
 |          uint32_t extra_options); | 
 |  | 
 |        int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, | 
 |          PCRE2_SIZE value); | 
 |  | 
 |        int pcre2_set_max_pattern_compiled_length( | 
 |          pcre2_compile_context *ccontext, PCRE2_SIZE value); | 
 |  | 
 |        int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_newline(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, | 
 |          int (*guard_function)(uint32_t, void *), void *user_data); | 
 |  | 
 |        int pcre2_set_optimize(pcre2_compile_context *ccontext, | 
 |          uint32_t directive); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS | 
 |  | 
 |        pcre2_match_context *pcre2_match_context_create( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_match_context *pcre2_match_context_copy( | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        void pcre2_match_context_free(pcre2_match_context *mcontext); | 
 |  | 
 |        int pcre2_set_callout(pcre2_match_context *mcontext, | 
 |          int (*callout_function)(pcre2_callout_block *, void *), | 
 |          void *callout_data); | 
 |  | 
 |        int pcre2_set_substitute_callout(pcre2_match_context *mcontext, | 
 |          int (*callout_function)(pcre2_substitute_callout_block *, void *), | 
 |          void *callout_data); | 
 |  | 
 |        int pcre2_set_substitute_case_callout(pcre2_match_context *mcontext, | 
 |          PCRE2_SIZE (*callout_function)(PCRE2_SPTR, PCRE2_SIZE, | 
 |                                         PCRE2_UCHAR *, PCRE2_SIZE, | 
 |                                         int, void *), | 
 |          void *callout_data); | 
 |  | 
 |        int pcre2_set_offset_limit(pcre2_match_context *mcontext, | 
 |          PCRE2_SIZE value); | 
 |  | 
 |        int pcre2_set_heap_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_match_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_depth_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS | 
 |  | 
 |        int pcre2_substring_copy_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); | 
 |  | 
 |        int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_UCHAR *buffer, | 
 |          PCRE2_SIZE *bufflen); | 
 |  | 
 |        void pcre2_substring_free(PCRE2_UCHAR *buffer); | 
 |  | 
 |        int pcre2_substring_get_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); | 
 |  | 
 |        int pcre2_substring_get_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_UCHAR **bufferptr, | 
 |          PCRE2_SIZE *bufflen); | 
 |  | 
 |        int pcre2_substring_length_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_SIZE *length); | 
 |  | 
 |        int pcre2_substring_length_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_SIZE *length); | 
 |  | 
 |        int pcre2_substring_nametable_scan(const pcre2_code *code, | 
 |          PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); | 
 |  | 
 |        int pcre2_substring_number_from_name(const pcre2_code *code, | 
 |          PCRE2_SPTR name); | 
 |  | 
 |        void pcre2_substring_list_free(PCRE2_UCHAR **list); | 
 |  | 
 |        int pcre2_substring_list_get(pcre2_match_data *match_data, | 
 |          PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION | 
 |  | 
 |        int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext, PCRE2_SPTR replacement, | 
 |          PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, | 
 |          PCRE2_SIZE *outlengthptr); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API JIT FUNCTIONS | 
 |  | 
 |        int pcre2_jit_compile(pcre2_code *code, uint32_t options); | 
 |  | 
 |        int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize, | 
 |          size_t maxsize, pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_jit_stack_assign(pcre2_match_context *mcontext, | 
 |          pcre2_jit_callback callback_function, void *callback_data); | 
 |  | 
 |        void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API SERIALIZATION FUNCTIONS | 
 |  | 
 |        int32_t pcre2_serialize_decode(pcre2_code **codes, | 
 |          int32_t number_of_codes, const uint8_t *bytes, | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        int32_t pcre2_serialize_encode(const pcre2_code **codes, | 
 |          int32_t number_of_codes, uint8_t **serialized_bytes, | 
 |          PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_serialize_free(uint8_t *bytes); | 
 |  | 
 |        int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API AUXILIARY FUNCTIONS | 
 |  | 
 |        pcre2_code *pcre2_code_copy(const pcre2_code *code); | 
 |  | 
 |        pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); | 
 |  | 
 |        int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, | 
 |          PCRE2_SIZE bufflen); | 
 |  | 
 |        const uint8_t *pcre2_maketables(pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_maketables_free(pcre2_general_context *gcontext, | 
 |          const uint8_t *tables); | 
 |  | 
 |        int pcre2_pattern_info(const pcre2_code *code, uint32_t what, | 
 |          void *where); | 
 |  | 
 |        int pcre2_callout_enumerate(const pcre2_code *code, | 
 |          int (*callback)(pcre2_callout_enumerate_block *, void *), | 
 |          void *user_data); | 
 |  | 
 |        int pcre2_config(uint32_t what, void *where); | 
 |  | 
 |  | 
 | PCRE2 NATIVE API OBSOLETE FUNCTIONS | 
 |  | 
 |        int pcre2_set_recursion_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        int pcre2_set_recursion_memory_management( | 
 |          pcre2_match_context *mcontext, | 
 |          void *(*private_malloc)(size_t, void *), | 
 |          void (*private_free)(void *, void *), void *memory_data); | 
 |  | 
 |        These functions became obsolete at release 10.30 and are retained  only | 
 |        for  backward  compatibility.  They should not be used in new code. The | 
 |        first is replaced by pcre2_set_depth_limit(); the second is  no  longer | 
 |        needed and has no effect (it always returns zero). | 
 |  | 
 |  | 
 | PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS | 
 |  | 
 |        pcre2_convert_context *pcre2_convert_context_create( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_convert_context *pcre2_convert_context_copy( | 
 |          pcre2_convert_context *cvcontext); | 
 |  | 
 |        void pcre2_convert_context_free(pcre2_convert_context *cvcontext); | 
 |  | 
 |        int pcre2_set_glob_escape(pcre2_convert_context *cvcontext, | 
 |          uint32_t escape_char); | 
 |  | 
 |        int pcre2_set_glob_separator(pcre2_convert_context *cvcontext, | 
 |          uint32_t separator_char); | 
 |  | 
 |        int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length, | 
 |          uint32_t options, PCRE2_UCHAR **buffer, | 
 |          PCRE2_SIZE *blength, pcre2_convert_context *cvcontext); | 
 |  | 
 |        void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern); | 
 |  | 
 |        These  functions  provide  a  way of converting non-PCRE2 patterns into | 
 |        patterns that can be processed by pcre2_compile(). This facility is ex- | 
 |        perimental and may be changed in future releases. At  present,  "globs" | 
 |        and  POSIX  basic  and  extended patterns can be converted. Details are | 
 |        given in the pcre2convert documentation. | 
 |  | 
 |  | 
 | PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES | 
 |  | 
 |        There are three PCRE2 libraries, supporting 8-bit, 16-bit,  and  32-bit | 
 |        code  units,  respectively.  However,  there  is  just one header file, | 
 |        pcre2.h.  This contains the function prototypes and  other  definitions | 
 |        for all three libraries. One, two, or all three can be installed simul- | 
 |        taneously.  On  Unix-like  systems the libraries are called libpcre2-8, | 
 |        libpcre2-16, and libpcre2-32, and they can also co-exist with the orig- | 
 |        inal PCRE libraries.  Every PCRE2 function  comes  in  three  different | 
 |        forms, one for each library, for example: | 
 |  | 
 |          pcre2_compile_8() | 
 |          pcre2_compile_16() | 
 |          pcre2_compile_32() | 
 |  | 
 |        There are also three different sets of data types: | 
 |  | 
 |          PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32 | 
 |          PCRE2_SPTR8,  PCRE2_SPTR16,  PCRE2_SPTR32 | 
 |  | 
 |        The  UCHAR  types define unsigned code units of the appropriate widths. | 
 |        For example, PCRE2_UCHAR16 is usually defined as `uint16_t'.  The  SPTR | 
 |        types are pointers to constants of the equivalent UCHAR types, that is, | 
 |        they are pointers to vectors of unsigned code units. | 
 |  | 
 |        Character  strings  are  passed  to a PCRE2 library as sequences of un- | 
 |        signed integers in code units of the appropriate width. The length of a | 
 |        string may be given as a number of code units, or  the  string  may  be | 
 |        specified as zero-terminated. | 
 |  | 
 |        Many  applications use only one code unit width. For their convenience, | 
 |        macros are defined whose names are the generic forms such as pcre2_com- | 
 |        pile() and  PCRE2_SPTR.  These  macros  use  the  value  of  the  macro | 
 |        PCRE2_CODE_UNIT_WIDTH  to generate the appropriate width-specific func- | 
 |        tion and macro names.  PCRE2_CODE_UNIT_WIDTH is not defined by default. | 
 |        An application must define it to be  8,  16,  or  32  before  including | 
 |        pcre2.h in order to make use of the generic names. | 
 |  | 
 |        Applications  that use more than one code unit width can be linked with | 
 |        more than one PCRE2 library, but must define  PCRE2_CODE_UNIT_WIDTH  to | 
 |        be  0  before  including pcre2.h, and then use the real function names. | 
 |        Any code that is to be included in an environment where  the  value  of | 
 |        PCRE2_CODE_UNIT_WIDTH  is  unknown  should  also  use the real function | 
 |        names. (Unfortunately, it is not possible in C code to save and restore | 
 |        the value of a macro.) | 
 |  | 
 |        If PCRE2_CODE_UNIT_WIDTH is not defined  before  including  pcre2.h,  a | 
 |        compiler error occurs. | 
 |  | 
 |        When  using  multiple  libraries  in an application, you must take care | 
 |        when processing any particular pattern to use  only  functions  from  a | 
 |        single  library.   For example, if you want to run a match using a pat- | 
 |        tern that was compiled with pcre2_compile_16(), you  must  do  so  with | 
 |        pcre2_match_16(), not pcre2_match_8() or pcre2_match_32(). | 
 |  | 
 |        In  the  function summaries above, and in the rest of this document and | 
 |        other PCRE2 documents, functions and data  types  are  described  using | 
 |        their generic names, without the _8, _16, or _32 suffix. | 
 |  | 
 |  | 
 | PCRE2 API OVERVIEW | 
 |  | 
 |        PCRE2  has  its  own  native  API, which is described in this document. | 
 |        There are also some wrapper functions for the 8-bit library that corre- | 
 |        spond to the POSIX regular expression API, but they do not give  access | 
 |        to  all  the  functionality of PCRE2 and they are not thread-safe. They | 
 |        are described in the pcre2posix documentation. Both these APIs define a | 
 |        set of C function calls. | 
 |  | 
 |        The native API C data types, function prototypes,  option  values,  and | 
 |        error codes are defined in the header file pcre2.h, which also contains | 
 |        definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release | 
 |        numbers  for the library. Applications can use these to include support | 
 |        for different releases of PCRE2. | 
 |  | 
 |        In a Windows environment, if you want to statically link an application | 
 |        program against a non-dll PCRE2 library, you must  define  PCRE2_STATIC | 
 |        before including pcre2.h. | 
 |  | 
 |        The  functions pcre2_compile() and pcre2_match() are used for compiling | 
 |        and matching regular expressions in a Perl-compatible manner. A  sample | 
 |        program that demonstrates the simplest way of using them is provided in | 
 |        the file called pcre2demo.c in the PCRE2 source distribution. A listing | 
 |        of  this  program  is  given  in  the  pcre2demo documentation, and the | 
 |        pcre2sample documentation describes how to compile and run it. | 
 |  | 
 |        The compiling and matching functions recognize various options that are | 
 |        passed as bits in an options argument. There are also some more compli- | 
 |        cated parameters such as custom memory  management  functions  and  re- | 
 |        source  limits  that  are  passed  in "contexts" (which are just memory | 
 |        blocks, described below). Simple applications do not need to  make  use | 
 |        of contexts. | 
 |  | 
 |        Just-in-time  (JIT)  compiler  support  is an optional feature of PCRE2 | 
 |        that can be built in  appropriate  hardware  environments.  It  greatly | 
 |        speeds  up  the matching performance of many patterns. Programs can re- | 
 |        quest that it be used if available by calling pcre2_jit_compile() after | 
 |        a pattern has been successfully compiled by pcre2_compile(). This  does | 
 |        nothing if JIT support is not available. | 
 |  | 
 |        More  complicated  programs  might  need  to make use of the specialist | 
 |        functions   pcre2_jit_stack_create(),    pcre2_jit_stack_free(),    and | 
 |        pcre2_jit_stack_assign()  in order to control the JIT code's memory us- | 
 |        age. | 
 |  | 
 |        JIT matching is automatically used by pcre2_match() if it is available, | 
 |        unless the PCRE2_NO_JIT option is set. There is also a direct interface | 
 |        for JIT matching, which gives improved performance at  the  expense  of | 
 |        less  sanity  checking. The JIT-specific functions are discussed in the | 
 |        pcre2jit documentation. | 
 |  | 
 |        A second matching function, pcre2_dfa_match(), which is  not  Perl-com- | 
 |        patible,  is  also  provided.  This  uses a different algorithm for the | 
 |        matching. The alternative algorithm finds all possible  matches  (at  a | 
 |        given  point  in  the subject), and scans the subject just once (unless | 
 |        there are lookaround assertions). However, this algorithm does not  re- | 
 |        turn  captured substrings. A description of the two matching algorithms | 
 |        and their advantages and disadvantages is given  in  the  pcre2matching | 
 |        documentation. There is no JIT support for pcre2_dfa_match(). | 
 |  | 
 |        In  addition  to  the  main compiling and matching functions, there are | 
 |        convenience functions for extracting captured substrings from a subject | 
 |        string that has been matched by pcre2_match(). They are: | 
 |  | 
 |          pcre2_substring_copy_byname() | 
 |          pcre2_substring_copy_bynumber() | 
 |          pcre2_substring_get_byname() | 
 |          pcre2_substring_get_bynumber() | 
 |          pcre2_substring_list_get() | 
 |          pcre2_substring_length_byname() | 
 |          pcre2_substring_length_bynumber() | 
 |          pcre2_substring_nametable_scan() | 
 |          pcre2_substring_number_from_name() | 
 |  | 
 |        pcre2_substring_free() and pcre2_substring_list_free()  are  also  pro- | 
 |        vided,  to  free  memory used for extracted strings. If either of these | 
 |        functions is called with a NULL argument, the function returns  immedi- | 
 |        ately without doing anything. | 
 |  | 
 |        The  function  pcre2_substitute()  can be called to match a pattern and | 
 |        return a copy of the subject string with substitutions for  parts  that | 
 |        were matched. | 
 |  | 
 |        Functions  whose  names begin with pcre2_serialize_ are used for saving | 
 |        compiled patterns on disc or elsewhere, and reloading them later. | 
 |  | 
 |        Finally, there are functions for finding out information about a | 
 |        compiled pattern (pcre2_pattern_info()) and about the configuration | 
 |        with which PCRE2 was built (pcre2_config()). | 
 |  | 
 |        Functions with names ending with _free() are used  for  freeing  memory | 
 |        blocks  of  various  sorts.  In all cases, if one of these functions is | 
 |        called with a NULL argument, it does nothing. | 
 |  | 
 |  | 
 | STRING LENGTHS AND OFFSETS | 
 |  | 
 |        The PCRE2 API uses string lengths and  offsets  into  strings  of  code | 
 |        units  in  several  places. These values are always of type PCRE2_SIZE, | 
 |        which is an unsigned integer type, currently always defined as  size_t. | 
 |        The  largest  value  that  can  be  stored  in  such  a  type  (that is | 
 |        ~(PCRE2_SIZE)0) is reserved as a special indicator for  zero-terminated | 
 |        strings  and  unset offsets.  Therefore, the longest string that can be | 
 |        handled is one less than this maximum. Note that string lengths are al- | 
 |        ways given in code units. Only in the 8-bit library is  such  a  length | 
 |        the same as the number of bytes in the string. | 
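 |  | 
 |        The two points above (the reserved maximum value, and lengths being | 
 |        counted in code units rather than bytes) can be sketched in plain C. | 
 |        This is an illustration only: it uses size_t as a stand-in for | 
 |        PCRE2_SIZE, and the example_ names are hypothetical, not part of the | 
 |        PCRE2 API. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PCRE2_SIZE, which the text says is currently size_t. */
typedef size_t example_size;

/* The reserved all-ones value, written ~(PCRE2_SIZE)0 in the text. */
#define EXAMPLE_RESERVED (~(example_size)0)

/* Longest string length that can be given explicitly: one less than
   the reserved value. */
example_size example_max_length(void)
{
  return EXAMPLE_RESERVED - 1;
}

/* Convert a length in code units to a length in bytes for a code unit
   width of 8, 16, or 32 bits. Only for width 8 are the two equal. */
example_size example_length_in_bytes(example_size code_units,
                                     unsigned code_unit_width)
{
  return code_units * (code_unit_width / 8);
}
```
 |  | 
 |        For instance, a subject of five code units occupies 5 bytes in the | 
 |        8-bit library but 10 bytes in the 16-bit library. | 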
 |  | 
 |  | 
 | NEWLINES | 
 |  | 
 |        PCRE2 supports five different conventions for indicating line breaks in | 
 |        strings:  a  single  CR (carriage return) character, a single LF (line- | 
 |        feed) character, the two-character sequence CRLF, any of the three pre- | 
 |        ceding, or any Unicode newline sequence. The Unicode newline  sequences | 
 |        are  the  three just mentioned, plus the single characters VT (vertical | 
 |        tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line | 
 |        separator, U+2028), and PS (paragraph separator, U+2029). | 
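 |  | 
 |        As an illustration of the five conventions just listed, the following | 
 |        sketch reports how many code points at a given position form a newline | 
 |        under each convention. It is a simplified model working on an array of | 
 |        code points, not PCRE2's own code; the enum and function names are | 
 |        hypothetical. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical labels for the five conventions described above. */
enum example_newline { EX_CR, EX_LF, EX_CRLF, EX_ANYCRLF, EX_ANY };

/* Return the number of code points (0, 1, or 2) that form a newline
   at position pos in s[0..len-1] under the given convention. */
size_t example_newline_length(const uint32_t *s, size_t len, size_t pos,
                              enum example_newline conv)
{
  uint32_t c;
  int crlf;

  if (pos >= len) return 0;
  c = s[pos];
  crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

  switch (conv)
    {
    case EX_CR:   return c == 0x0D ? 1 : 0;
    case EX_LF:   return c == 0x0A ? 1 : 0;
    case EX_CRLF: return crlf ? 2 : 0;
    case EX_ANYCRLF:
      if (crlf) return 2;
      return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case EX_ANY:
      if (crlf) return 2;
      switch (c)
        {
        case 0x0A: case 0x0B: case 0x0C: case 0x0D:  /* LF, VT, FF, CR */
        case 0x85: case 0x2028: case 0x2029:         /* NEL, LS, PS */
          return 1;
        default:
          return 0;
        }
    }
  return 0;
}
```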
 |  | 
 |        Each of the first three conventions is used by at least  one  operating | 
 |        system as its standard newline sequence. When PCRE2 is built, a default | 
 |        can be specified.  If it is not, the default is set to LF, which is the | 
 |        Unix standard. However, the newline convention can be changed by an ap- | 
 |        plication  when calling pcre2_compile(), or it can be specified by spe- | 
 |        cial text at the start of the pattern itself; this overrides any  other | 
 |        settings.  See the pcre2pattern page for details of the special charac- | 
 |        ter sequences. | 
 |  | 
 |        In the PCRE2 documentation the word "newline"  is  used  to  mean  "the | 
 |        character or pair of characters that indicate a line break". The choice | 
 |        of  newline convention affects the handling of the dot, circumflex, and | 
 |        dollar metacharacters, the handling of #-comments in /x mode, and, when | 
 |        CRLF is a recognized line ending sequence, the match position  advance- | 
 |        ment for a non-anchored pattern. There is more detail about this in the | 
 |        section on pcre2_match() options below. | 
 |  | 
 |        The  choice of newline convention does not affect the interpretation of | 
 |        the \n or \r escape sequences, nor does it affect what \R matches; this | 
 |        has its own separate convention. | 
 |  | 
 |  | 
 | MULTITHREADING | 
 |  | 
 |        In a multithreaded application it is important to keep  thread-specific | 
 |        data  separate  from data that can be shared between threads. The PCRE2 | 
 |        library code itself is thread-safe: it contains  no  static  or  global | 
 |        variables. The API is designed to be fairly simple for non-threaded ap- | 
 |        plications  while at the same time ensuring that multithreaded applica- | 
 |        tions can use it. | 
 |  | 
 |        There are several different blocks of data that are used to pass infor- | 
 |        mation between the application and the PCRE2 libraries. | 
 |  | 
 |    The compiled pattern | 
 |  | 
 |        A pointer to the compiled form of a pattern is  returned  to  the  user | 
 |        when pcre2_compile() is successful. The data in the compiled pattern is | 
 |        fixed,  and  does not change when the pattern is matched. Therefore, it | 
 |        is thread-safe, that is, the same compiled pattern can be used by  more | 
 |        than one thread simultaneously. For example, an application can compile | 
 |        all its patterns at the start, before forking off multiple threads that | 
 |        use  them.  However,  if the just-in-time (JIT) optimization feature is | 
 |        being used, it needs separate memory stack areas for each  thread.  See | 
 |        the pcre2jit documentation for more details. | 
 |  | 
 |        In  a more complicated situation, where patterns are compiled only when | 
 |        they are first needed, but are still shared between  threads,  pointers | 
 |        to  compiled  patterns  must  be protected from simultaneous writing by | 
 |        multiple threads. This is somewhat tricky to do correctly. If you  know | 
 |        that  writing  to  a pointer is atomic in your environment, you can use | 
 |        logic like this: | 
 |  | 
 |          Get a read-only (shared) lock (mutex) for pointer | 
 |          if (pointer == NULL) | 
 |            { | 
 |            Get a write (unique) lock for pointer | 
 |            if (pointer == NULL) pointer = pcre2_compile(... | 
 |            } | 
 |          Release the lock | 
 |          Use pointer in pcre2_match() | 
 |  | 
 |        Of course, testing for compilation errors should also  be  included  in | 
 |        the code. | 
 |  | 
 |        The  reason  for checking the pointer a second time is as follows: Sev- | 
 |        eral threads may have acquired the shared lock and tested  the  pointer | 
 |        for being NULL, but only one of them will be given the write lock, with | 
 |        the  rest kept waiting. The winning thread will compile the pattern and | 
 |        store the result.  After this thread releases the write  lock,  another | 
 |        thread  will  get it, and if it does not retest pointer for being NULL, | 
 |        will recompile the pattern and overwrite the pointer, creating a memory | 
 |        leak and possibly causing other issues. | 
 |  | 
 |        In an environment where writing to a pointer may  not  be  atomic,  the | 
 |        above  logic  is not sufficient. The thread that is doing the compiling | 
 |        may be descheduled after writing only part of the pointer, which  could | 
 |        cause  other  threads  to use an invalid value. Instead of checking the | 
 |        pointer itself, a separate "pointer is valid" flag (that can be updated | 
 |        atomically) must be used: | 
 |  | 
 |          Get a read-only (shared) lock (mutex) for pointer | 
 |          if (!pointer_is_valid) | 
 |            { | 
 |            Get a write (unique) lock for pointer | 
 |            if (!pointer_is_valid) | 
 |              { | 
 |              pointer = pcre2_compile(... | 
 |              pointer_is_valid = TRUE | 
 |              } | 
 |            } | 
 |          Release the lock | 
 |          Use pointer in pcre2_match() | 
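 |  | 
 |        The "pointer is valid" logic above might be written in C11 roughly as | 
 |        follows. This is a sketch under stated assumptions, not a recommended | 
 |        implementation: the shared read lock is replaced by an atomic acquire | 
 |        load of the flag, the write lock is a simple spin lock built from an | 
 |        atomic_flag, and example_compile() is a hypothetical stand-in for a | 
 |        call to pcre2_compile() (with its error checking). A real program | 
 |        would more likely use its platform's mutex or read-write lock. | 
 |  | 
```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a compiled pattern (pcre2_code). */
typedef struct example_code { int dummy; } example_code;

static example_code example_storage;
int example_compile_calls = 0;

/* Hypothetical stand-in for pcre2_compile(); counts its calls. */
example_code *example_compile(void)
{
  example_compile_calls++;
  return &example_storage;
}

static example_code *pointer = NULL;
static atomic_bool pointer_is_valid;          /* zero-initialized: false */
static atomic_flag write_lock = ATOMIC_FLAG_INIT;

example_code *example_get_pattern(void)
{
  /* First test, without holding the write lock. */
  if (!atomic_load_explicit(&pointer_is_valid, memory_order_acquire))
    {
    /* Get the unique lock (spin until it is free). */
    while (atomic_flag_test_and_set_explicit(&write_lock,
                                             memory_order_acquire))
      ;
    /* Second test: another thread may have compiled meanwhile. */
    if (!atomic_load_explicit(&pointer_is_valid, memory_order_relaxed))
      {
      pointer = example_compile();
      /* Publish the pointer only after it has been fully written. */
      atomic_store_explicit(&pointer_is_valid, true,
                            memory_order_release);
      }
    atomic_flag_clear_explicit(&write_lock, memory_order_release);
    }
  return pointer;   /* Use this in pcre2_match() */
}
```
 |  | 
 |        However many times example_get_pattern() is called, the stand-in | 
 |        compile function runs only once. | 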
 |  | 
 |        If JIT is being used, but the JIT compilation is not being done immedi- | 
 |        ately (perhaps waiting to see if the pattern  is  used  often  enough), | 
 |        similar  logic  is required. JIT compilation updates a value within the | 
 |        compiled code block, so a thread must gain unique write access  to  the | 
 |        pointer     before    calling    pcre2_jit_compile().    Alternatively, | 
 |        pcre2_code_copy() or pcre2_code_copy_with_tables() can be used  to  ob- | 
 |        tain  a  private  copy of the compiled code before calling the JIT com- | 
 |        piler. | 
 |  | 
 |    Context blocks | 
 |  | 
 |        The next main section below introduces the idea of "contexts" in  which | 
 |        PCRE2 functions are called. A context is nothing more than a collection | 
 |        of parameters that control the way PCRE2 operates. Grouping a number of | 
 |        parameters together in a context is a convenient way of passing them to | 
 |        a  PCRE2  function without using lots of arguments. The parameters that | 
 |        are stored in contexts are in some sense  "advanced  features"  of  the | 
 |        API. Many straightforward applications will not need to use contexts. | 
 |  | 
 |        In a multithreaded application, if the parameters in a context are val- | 
 |        ues  that  are  never  changed, the same context can be used by all the | 
 |        threads. However, if any thread needs to change any value in a context, | 
 |        it must make its own thread-specific copy. | 
 |  | 
 |    Match blocks | 
 |  | 
 |        The matching functions need a block of memory for storing  the  results | 
 |        of a match. This includes details of what was matched, as well as addi- | 
 |        tional  information  such as the name of a (*MARK) setting. Each thread | 
 |        must provide its own copy of this memory. | 
 |  | 
 |  | 
 | PCRE2 CONTEXTS | 
 |  | 
 |        Some PCRE2 functions have a lot of parameters, many of which  are  used | 
 |        only  by  specialist  applications,  for example, those that use custom | 
 |        memory management or non-standard character tables.  To  keep  function | 
 |        argument  lists  at a reasonable size, and at the same time to keep the | 
 |        API extensible, "uncommon" parameters are passed to  certain  functions | 
 |        in  a  context instead of directly. A context is just a block of memory | 
 |        that holds the parameter values.  Applications that do not need to  ad- | 
 |        just any of the context parameters can pass NULL when a context pointer | 
 |        is required. | 
 |  | 
 |        There  are  three different types of context: a general context that is | 
 |        relevant for several PCRE2 operations, a compile-time  context,  and  a | 
 |        match-time context. | 
 |  | 
 |    The general context | 
 |  | 
 |        At  present,  this context just contains pointers to (and data for) ex- | 
 |        ternal memory management functions that are called from several  places | 
 |        in  the  PCRE2  library.  The  context  is  named `general' rather than | 
 |        specifically `memory' because in future other fields may be  added.  If | 
 |        you  do not want to supply your own custom memory management functions, | 
 |        you do not need to bother with a general context. A general context  is | 
 |        created by: | 
 |  | 
 |        pcre2_general_context *pcre2_general_context_create( | 
 |          void *(*private_malloc)(PCRE2_SIZE, void *), | 
 |          void (*private_free)(void *, void *), void *memory_data); | 
 |  | 
 |        The  two  function pointers specify custom memory management functions, | 
 |        whose prototypes are: | 
 |  | 
 |          void *private_malloc(PCRE2_SIZE, void *); | 
 |          void  private_free(void *, void *); | 
 |  | 
 |        Whenever code in PCRE2 calls these functions, the final argument is the | 
 |        value of memory_data. Either of the first two arguments of the creation | 
 |        function may be NULL, in which case the system memory management  func- | 
 |        tions  malloc()  and free() are used. (This is not currently useful, as | 
 |        there are no other fields in a general context,  but  in  future  there | 
 |        might  be.)  The private_malloc() function is used (if supplied) to ob- | 
 |        tain memory for storing the context, and all three values are saved  as | 
 |        part of the context. | 
 |  | 
 |        Whenever  PCRE2  creates a data block of any kind, the block contains a | 
 |        pointer to the free() function that matches the malloc() function  that | 
 |        was  used.  When  the  time  comes  to free the block, this function is | 
 |        called. | 
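 |  | 
 |        A pair of custom memory functions with the prototypes shown above might | 
 |        look like the following sketch, which wraps malloc() and free() and | 
 |        counts calls in a structure reached through memory_data. The structure | 
 |        and the example_ names are hypothetical, and size_t is written in place | 
 |        of PCRE2_SIZE on the assumption (stated earlier) that the two are the | 
 |        same type. | 
 |  | 
```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping block, passed as memory_data. */
typedef struct example_mem_stats {
  size_t allocations;
  size_t frees;
} example_mem_stats;

/* Matches: void *private_malloc(PCRE2_SIZE, void *); */
void *example_malloc(size_t size, void *memory_data)
{
  example_mem_stats *stats = memory_data;
  void *block = malloc(size);
  if (block != NULL && stats != NULL) stats->allocations++;
  return block;
}

/* Matches: void private_free(void *, void *); */
void example_free(void *block, void *memory_data)
{
  example_mem_stats *stats = memory_data;
  if (block == NULL) return;           /* nothing to release */
  if (stats != NULL) stats->frees++;
  free(block);
}
```
 |  | 
 |        These two functions, together with a pointer to an example_mem_stats | 
 |        structure, would be the three arguments of | 
 |        pcre2_general_context_create(). | 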
 |  | 
 |        A general context can be copied by calling: | 
 |  | 
 |        pcre2_general_context *pcre2_general_context_copy( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        The memory used for a general context should be freed by calling: | 
 |  | 
 |        void pcre2_general_context_free(pcre2_general_context *gcontext); | 
 |  | 
 |        If this function is passed a  NULL  argument,  it  returns  immediately | 
 |        without doing anything. | 
 |  | 
 |    The compile context | 
 |  | 
 |        A  compile context is required if you want to provide an external func- | 
 |        tion for stack checking during compilation or  to  change  the  default | 
 |        values of any of the following compile-time parameters: | 
 |  | 
 |          What \R matches (Unicode newlines or CR, LF, CRLF only) | 
 |          PCRE2's character tables | 
 |          The newline character sequence | 
 |          The compile time nested parentheses limit | 
 |          The maximum length of the pattern string | 
 |          The extra options bits (none set by default) | 
 |          Which performance optimizations the compiler should apply | 
 |  | 
 |        A  compile context is also required if you are using custom memory man- | 
 |        agement.  If none of these apply, just pass NULL as the  context  argu- | 
 |        ment of pcre2_compile(). | 
 |  | 
 |        A  compile context is created, copied, and freed by the following func- | 
 |        tions: | 
 |  | 
 |        pcre2_compile_context *pcre2_compile_context_create( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_compile_context *pcre2_compile_context_copy( | 
 |          pcre2_compile_context *ccontext); | 
 |  | 
 |        void pcre2_compile_context_free(pcre2_compile_context *ccontext); | 
 |  | 
 |        A compile context is created with default values  for  its  parameters. | 
 |        These can be changed by calling the following functions, which return 0 | 
 |        on success, or PCRE2_ERROR_BADDATA if invalid data is detected. | 
 |  | 
 |        int pcre2_set_bsr(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        The  value  must  be PCRE2_BSR_ANYCRLF, to specify that \R matches only | 
 |        CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R  matches  any | 
 |        Unicode line ending sequence. The value is used by the JIT compiler and | 
 |        by   the   two   interpreted   matching  functions,  pcre2_match()  and | 
 |        pcre2_dfa_match(). | 
 |  | 
 |        int pcre2_set_character_tables(pcre2_compile_context *ccontext, | 
 |          const uint8_t *tables); | 
 |  | 
 |        The value must be the result of a  call  to  pcre2_maketables(),  whose | 
 |        only argument is a general context. This function builds a set of char- | 
 |        acter tables in the current locale. | 
 |  | 
 |        int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext, | 
 |          uint32_t extra_options); | 
 |  | 
 |        As PCRE2 has developed, almost all the 32 option bits that are | 
 |        available in the options argument of pcre2_compile() have been used | 
 |        up. To avoid running out, the compile context contains a set of extra | 
 |        option bits which are used for some newer, assumed rarer, options. | 
 |        This function sets those bits. It always sets all the bits (either on | 
 |        or off): the value passed replaces the existing setting as a whole | 
 |        rather than being combined with it. The available options are defined | 
 |        in the section entitled "Extra compile options" below. | 
 |  | 
 |        int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext, | 
 |          PCRE2_SIZE value); | 
 |  | 
 |        This  sets a maximum length, in code units, for any pattern string that | 
 |        is compiled with this context. If the pattern is longer,  an  error  is | 
 |        generated.   This facility is provided so that applications that accept | 
 |        patterns from external sources can limit their size. The default is the | 
 |        largest number that a PCRE2_SIZE variable can  hold,  which  is  effec- | 
 |        tively unlimited. | 
 |  | 
 |        int pcre2_set_max_pattern_compiled_length( | 
 |          pcre2_compile_context *ccontext, PCRE2_SIZE value); | 
 |  | 
 |        This  sets  a maximum size, in bytes, for the memory needed to hold the | 
 |        compiled version of a pattern that is compiled with  this  context.  If | 
 |        the  pattern needs more memory, an error is generated. This facility is | 
 |        provided so  that  applications  that  accept  patterns  from  external | 
 |        sources  can  limit  the  amount of memory they use. The default is the | 
 |        largest number that a PCRE2_SIZE variable can  hold,  which  is  effec- | 
 |        tively unlimited. | 
 |  | 
 |        int pcre2_set_max_varlookbehind(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        This sets the maximum number of characters that a variable-length | 
 |        lookbehind assertion can match. The default is set when PCRE2 is | 
 |        built, with the ultimate default being 255, the same as Perl. | 
 |        Lookbehind assertions without a bounding length are not supported. | 
 |  | 
 |        int pcre2_set_newline(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        This specifies which characters or character sequences are to be recog- | 
 |        nized as newlines. The value must be one of PCRE2_NEWLINE_CR  (carriage | 
 |        return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the | 
 |        two-character  sequence  CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any | 
 |        of the above), PCRE2_NEWLINE_ANY (any  Unicode  newline  sequence),  or | 
 |        PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero). | 
 |  | 
 |        A pattern can override the value set in the compile context by starting | 
 |        with a sequence such as (*CRLF). See the pcre2pattern page for details. | 
 |  | 
 |        When  a  pattern  is  compiled  with  the  PCRE2_EXTENDED  or PCRE2_EX- | 
 |        TENDED_MORE option, the newline convention affects the  recognition  of | 
 |        the  end  of internal comments starting with #. The value is saved with | 
 |        the compiled pattern for subsequent use by the JIT compiler and by  the | 
 |        two     interpreted     matching     functions,    pcre2_match()    and | 
 |        pcre2_dfa_match(). | 
 |  | 
 |        int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext, | 
 |          uint32_t value); | 
 |  | 
 |        This parameter adjusts the limit, set  when  PCRE2  is  built  (default | 
 |        250),  on  the  depth  of  parenthesis nesting in a pattern. This limit | 
 |        stops rogue patterns using up too much system  stack  when  being  com- | 
 |        piled.  The limit applies to parentheses of all kinds, not just captur- | 
 |        ing parentheses. | 
 |  | 
 |        int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, | 
 |          int (*guard_function)(uint32_t, void *), void *user_data); | 
 |  | 
 |        There is at least one application that runs PCRE2 in threads with  very | 
 |        limited  system  stack,  where running out of stack is to be avoided at | 
 |        all costs. The parenthesis limit above cannot take account of how  much | 
 |        stack  is  actually  available during compilation. For a finer control, | 
 |        you can supply a  function  that  is  called  whenever  pcre2_compile() | 
 |        starts  to compile a parenthesized part of a pattern. This function can | 
 |        check the actual stack size (or anything else  that  it  wants  to,  of | 
 |        course). | 
 |  | 
 |        The first argument to the guard function gives the current depth of | 
 |        nesting, and the second is user data that is set up by the last | 
 |        argument of pcre2_set_compile_recursion_guard(). The guard function | 
 |        should return zero if all is well, or non-zero to force an error. | 
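 |  | 
 |        A minimal guard function with the prototype shown above might simply | 
 |        compare the nesting depth with a limit held in the user data, as in | 
 |        this sketch. The structure and names are hypothetical; a guard in a | 
 |        real application could instead measure the stack that is actually | 
 |        available. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user data: the deepest nesting the caller will allow. */
typedef struct example_guard_data {
  uint32_t max_depth;
} example_guard_data;

/* Matches: int (*guard_function)(uint32_t, void *).
   Returns zero if all is well, non-zero to force an error. */
int example_guard(uint32_t depth, void *user_data)
{
  const example_guard_data *data = user_data;
  if (data == NULL) return 0;            /* no limit supplied */
  return depth > data->max_depth ? 1 : 0;
}
```
 |  | 
 |        Such a function and a pointer to its data would be the second and | 
 |        third arguments of pcre2_set_compile_recursion_guard(). | 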
 |  | 
 |        int pcre2_set_optimize(pcre2_compile_context *ccontext, | 
 |          uint32_t directive); | 
 |  | 
 |        PCRE2 can apply various performance optimizations  during  compilation, | 
 |        in  order to make matching faster. For example, the compiler might con- | 
 |        vert  some  regex  constructs  into  an  equivalent   construct   which | 
 |        pcre2_match()  can  execute faster. By default, all available optimiza- | 
 |        tions are enabled. However, in rare cases, one might  wish  to  disable | 
 |        specific optimizations. For example, if it is known that some optimiza- | 
 |        tions  cannot benefit a certain regex, it might be desirable to disable | 
 |        them, in order to speed up compilation. | 
 |  | 
 |        The permitted values of directive are as follows: | 
 |  | 
 |          PCRE2_OPTIMIZATION_FULL | 
 |  | 
 |        Enable all optional performance  optimizations.  This  is  the  default | 
 |        value. | 
 |  | 
 |          PCRE2_OPTIMIZATION_NONE | 
 |  | 
 |        Disable all optional performance optimizations. | 
 |  | 
 |          PCRE2_AUTO_POSSESS | 
 |          PCRE2_AUTO_POSSESS_OFF | 
 |  | 
 |        Enable/disable  "auto-possessification" of variable quantifiers such as | 
 |        * and +.  This optimization, for example, turns a+b into a++b in  order | 
 |        to  avoid  backtracks into a+ that can never be successful. However, if | 
 |        callouts are in use, auto-possessification means that some callouts are | 
 |        never taken. You can disable this optimization if you want the matching | 
 |        functions to do a full, unoptimized search and run all the callouts. | 
 |  | 
 |          PCRE2_DOTSTAR_ANCHOR | 
 |          PCRE2_DOTSTAR_ANCHOR_OFF | 
 |  | 
 |        Enable/disable an optimization that is applied when  .*  is  the  first | 
 |        significant  item in a top-level branch of a pattern, and all the other | 
 |        branches also start with .* or with \A or \G or ^. Such  a  pattern  is | 
 |        automatically  anchored if PCRE2_DOTALL is set for all the .* items and | 
 |        PCRE2_MULTILINE is not set for any ^ items. Otherwise,  the  fact  that | 
 |        any  match must start either at the start of the subject or following a | 
 |        newline is remembered. Like other optimizations, this can  cause  call- | 
 |        outs to be skipped. | 
 |  | 
 |        Dotstar  anchor  optimization is automatically disabled for .* if it is | 
 |        inside an atomic group or a capture group that  is  the  subject  of  a | 
 |        backreference, or if the pattern contains (*PRUNE) or (*SKIP). | 
 |  | 
 |          PCRE2_START_OPTIMIZE | 
 |          PCRE2_START_OPTIMIZE_OFF | 
 |  | 
 |        Enable/disable optimizations which cause matching functions to scan the | 
 |        subject string for specific code unit values before attempting a match. | 
 |        For  example, if it is known that an unanchored match must start with a | 
 |        specific value, the matching code searches the subject for that  value, | 
 |        and  fails  immediately  if it cannot find it, without actually running | 
 |        the main matching function. This means that  a  special  item  such  as | 
 |        (*COMMIT)  at  the  start  of a pattern is not considered until after a | 
 |        suitable starting point for the match has been found.  Also, when call- | 
 |        outs or (*MARK) items are in use, these  "start-up"  optimizations  can | 
 |        cause  them  to  be  skipped if the pattern is never actually used. The | 
 |        start-up optimizations are in effect a pre-scan  of  the  subject  that | 
 |        takes place before the pattern is run. | 
 |  | 
 |        Disabling start-up optimizations ensures that in cases where the result | 
 |        is  "no match", the callouts do occur, and that items such as (*COMMIT) | 
 |        and (*MARK) are considered at every possible starting position  in  the | 
 |        subject string. | 
 |  | 
 |        Disabling  start-up  optimizations may change the outcome of a matching | 
 |        operation.  Consider the pattern | 
 |  | 
 |          (*COMMIT)ABC | 
 |  | 
 |        When this is compiled, PCRE2 records the fact that a match  must  start | 
 |        with  the  character  "A".  Suppose the subject string is "DEFABC". The | 
 |        start-up optimization scans along the subject, finds "A" and  runs  the | 
 |        first match attempt from there. The (*COMMIT) item means that the | 
 |        pattern must match at the current starting position, which in this | 
 |        case it does. However, if the same match is run without start-up | 
 |        optimizations, the initial scan along the subject string does not | 
 |        happen. The first match attempt is run starting from "D" and when this | 
 |        fails, (*COMMIT) prevents any further matches being tried, so the | 
 |        overall result is "no match". | 
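 |  | 
 |        The difference can be modelled in a few lines of C. The following is | 
 |        a toy model of a pattern consisting of (*COMMIT) followed by a literal | 
 |        string, not PCRE2's matching code, and the function name is | 
 |        hypothetical. Because of (*COMMIT), only one match attempt is ever | 
 |        made; the start-up optimization decides where that attempt begins. | 
 |  | 
```c
#include <string.h>

/* Toy model of matching (*COMMIT) followed by a literal. Returns the
   offset of the match, or -1 for "no match". */
long example_commit_match(const char *subject, const char *literal,
                          int start_optimize)
{
  size_t llen = strlen(literal);
  const char *start = subject;

  if (llen == 0) return 0;      /* an empty literal matches at once */

  if (start_optimize)
    {
    /* Pre-scan: skip to the first possible starting character. */
    start = strchr(subject, literal[0]);
    if (start == NULL) return -1;
    }

  /* The single attempt; (*COMMIT) rules out any later start point. */
  if (strncmp(start, literal, llen) == 0) return (long)(start - subject);
  return -1;
}
```
 |  | 
 |        With the subject "DEFABC" and the literal "ABC", the model finds a | 
 |        match at offset 3 when the pre-scan is enabled and reports no match | 
 |        when it is disabled. | 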
 |  | 
 |        Another start-up optimization makes use  of  a  minimum  length  for  a | 
 |        matching subject, which is recorded when possible. Consider the pattern | 
 |  | 
 |          (*MARK:1)B(*MARK:2)(X|Y) | 
 |  | 
 |        The  minimum  length  for  a match is two characters. If the subject is | 
 |        "XXBB", the "starting character" optimization skips "XX", then tries to | 
 |        match "BB", which is long enough. In the process, (*MARK:2) is  encoun- | 
 |        tered  and  remembered.  When  the match attempt fails, the next "B" is | 
 |        found, but there is only one character left, so there are no  more  at- | 
 |        tempts,  and  "no  match"  is returned with the "last mark seen" set to | 
 |        "2". Without start-up optimizations,  however,  matches  are  tried  at | 
 |        every  possible starting position, including at the end of the subject, | 
 |        where (*MARK:1) is encountered, but there is no "B", so the "last  mark | 
 |        seen"  that  is returned is "1". In this case, the optimizations do not | 
 |        affect the overall match result, which is still "no match", but they do | 
 |        affect the auxiliary information that is returned. | 
 |  | 
 |    The match context | 
 |  | 
 |        A match context is required if you want to: | 
 |  | 
 |          Set up a callout function | 
 |          Set an offset limit for matching an unanchored pattern | 
 |          Change the limit on the amount of heap used when matching | 
 |          Change the backtracking match limit | 
 |          Change the backtracking depth limit | 
 |          Set custom memory management specifically for the match | 
 |  | 
 |        If none of these apply, just pass  NULL  as  the  context  argument  of | 
 |        pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match(). | 
 |  | 
 |        A  match  context  is created, copied, and freed by the following func- | 
 |        tions: | 
 |  | 
 |        pcre2_match_context *pcre2_match_context_create( | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_match_context *pcre2_match_context_copy( | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        void pcre2_match_context_free(pcre2_match_context *mcontext); | 
 |  | 
 |        A match context is created with  default  values  for  its  parameters. | 
 |        These can be changed by calling the following functions, which return 0 | 
 |        on success, or PCRE2_ERROR_BADDATA if invalid data is detected. | 
 |  | 
 |        int pcre2_set_callout(pcre2_match_context *mcontext, | 
 |          int (*callout_function)(pcre2_callout_block *, void *), | 
 |          void *callout_data); | 
 |  | 
 |        This  sets  up a callout function for PCRE2 to call at specified points | 
 |        during a matching operation. Details are given in the pcre2callout doc- | 
 |        umentation. | 
 |  | 
 |        int pcre2_set_substitute_callout(pcre2_match_context *mcontext, | 
 |          int (*callout_function)(pcre2_substitute_callout_block *, void *), | 
 |          void *callout_data); | 
 |  | 
 |        This sets up a callout function for PCRE2 to call after each  substitu- | 
 |        tion made by pcre2_substitute(). Details are given in the section enti- | 
 |        tled "Creating a new string with substitutions" below. | 
 |  | 
 |        int pcre2_set_substitute_case_callout(pcre2_match_context *mcontext, | 
 |          PCRE2_SIZE (*callout_function)(PCRE2_SPTR, PCRE2_SIZE, | 
 |                                         PCRE2_UCHAR *, PCRE2_SIZE, | 
 |                                         int, void *), | 
 |          void *callout_data); | 
 |  | 
 |        This  sets up a callout function for PCRE2 to call when performing case | 
 |        transformations inside pcre2_substitute(). Details  are  given  in  the | 
 |        section entitled "Creating a new string with substitutions" below. | 
 |  | 
 |        int pcre2_set_offset_limit(pcre2_match_context *mcontext, | 
 |          PCRE2_SIZE value); | 
 |  | 
 |        The  offset_limit parameter limits how far an unanchored search can ad- | 
 |        vance in the subject string. The  default  value  is  PCRE2_UNSET.  The | 
 |        pcre2_match()  and  pcre2_dfa_match()  functions return PCRE2_ERROR_NO- | 
 |        MATCH if a match with a starting point before or at the given offset is | 
 |        not found. The pcre2_substitute() function makes no more substitutions. | 
 |  | 
 |        For example, if the pattern /abc/ is matched against "123abc"  with  an | 
 |        offset  limit  less  than 3, the result is PCRE2_ERROR_NOMATCH. A match | 
 |        can never be  found  if  the  startoffset  argument  of  pcre2_match(), | 
 |        pcre2_dfa_match(),  or  pcre2_substitute()  is  greater than the offset | 
 |        limit set in the match context. | 
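 |  | 
 |        The effect of the offset limit can be illustrated with a toy unanchored | 
 |        search for a literal string. This is a model of the behaviour described | 
 |        above, not PCRE2 code: the function name is hypothetical, | 
 |        EXAMPLE_UNSET stands in for PCRE2_UNSET, and -1 stands in for | 
 |        PCRE2_ERROR_NOMATCH. | 
 |  | 
```c
#include <stddef.h>
#include <string.h>

#define EXAMPLE_UNSET (~(size_t)0)   /* stand-in for PCRE2_UNSET */

/* Search for literal in subject, trying start points from startoffset
   onwards but never beyond offset_limit. Returns the offset of the
   match, or -1 if there is none. */
long example_find(const char *subject, const char *literal,
                  size_t startoffset, size_t offset_limit)
{
  size_t slen = strlen(subject);
  size_t llen = strlen(literal);
  size_t start;

  for (start = startoffset; start + llen <= slen; start++)
    {
    /* A match may start at the limit, but not after it. */
    if (offset_limit != EXAMPLE_UNSET && start > offset_limit) break;
    if (memcmp(subject + start, literal, llen) == 0) return (long)start;
    }
  return -1;
}
```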
 |  | 
 |        When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT | 
 |        option when calling pcre2_compile() so that when JIT is in use, | 
 |        different code can be compiled. If a match is started with a | 
 |        non-default offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an | 
 |        error is generated. | 
 |  | 
 |        The offset limit facility can be used to track progress when  searching | 
 |        large  subject  strings or to limit the extent of global substitutions. | 
 |        See also the PCRE2_FIRSTLINE option, which requires a  match  to  start | 
 |        before  or  at  the first newline that follows the start of matching in | 
 |        the subject. If this is set with an offset limit, a match must occur in | 
 |        the first line and also  within  the  offset  limit.  In  other  words, | 
 |        whichever limit comes first is used. | 
 |  | 
 |        int pcre2_set_heap_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        The heap_limit parameter specifies, in units of kibibytes (1024 bytes), | 
 |        the  maximum  amount  of heap memory that pcre2_match() may use to hold | 
 |        backtracking information when running an interpretive match. This limit | 
 |        also applies to pcre2_dfa_match(), which may use the heap when process- | 
 |        ing patterns with a lot of nested pattern recursion or  lookarounds  or | 
 |        atomic groups. This limit does not apply to matching with the JIT opti- | 
 |        mization,  which  has  its  own  memory  control  arrangements (see the | 
 |        pcre2jit documentation for more details). If the limit is reached,  the | 
 |        negative  error  code  PCRE2_ERROR_HEAPLIMIT  is  returned. The default | 
 |        limit can be set when PCRE2 is built; if it is not, the default is  set | 
 |        very large and is essentially unlimited. | 
 |  | 
 |        A value for the heap limit may also be supplied by an item at the start | 
 |        of a pattern of the form | 
 |  | 
 |          (*LIMIT_HEAP=ddd) | 
 |  | 
 |        where  ddd  is a decimal number. However, such a setting is ignored un- | 
 |        less ddd is less than the limit set by the caller of pcre2_match()  or, | 
 |        if no such limit is set, less than the default. | 
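 |  | 
 |        The interaction between the two settings can be sketched as follows. | 
 |        Both functions are hypothetical helpers written for illustration and | 
 |        are not part of the PCRE2 API; limit_in_force stands for the caller's | 
 |        limit or, if none was set, the default. | 
 |  | 
```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: models how a (*LIMIT_HEAP=ddd)
   item interacts with the limit that is otherwise in force (the caller's
   limit or, if none was set, the build-time default). */
uint32_t effective_heap_limit(uint32_t limit_in_force, uint32_t pattern_limit)
{
    /* The pattern may lower the limit but can never raise it. */
    return (pattern_limit < limit_in_force) ? pattern_limit : limit_in_force;
}

/* Converts a heap limit in kibibytes to bytes; 64-bit to avoid overflow. */
uint64_t heap_limit_in_bytes(uint32_t kibibytes)
{
    return (uint64_t)kibibytes * 1024u;
}
```
 |  | 
 |        The design point is that a pattern, which may come from an untrusted | 
 |        source, can tighten the limit but cannot loosen it. | 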
 |  | 
 |        The  pcre2_match() function always needs some heap memory, so setting a | 
 |        value of zero guarantees a "heap limit exceeded" error. Details of  how | 
 |        pcre2_match()  uses  the  heap are given in the pcre2perform documenta- | 
 |        tion. | 
 |  | 
 |        For pcre2_dfa_match(), a vector on the system stack is used  when  pro- | 
 |        cessing  pattern recursions, lookarounds, or atomic groups, and only if | 
 |        this is not big enough is heap memory used. In  this  case,  setting  a | 
 |        value of zero disables the use of the heap. | 
 |  | 
 |        int pcre2_set_match_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        The match_limit parameter provides a means of preventing PCRE2 from us- | 
 |        ing  up  too many computing resources when processing patterns that are | 
 |        not going to match, but which have a very large number of possibilities | 
 |        in their search trees. The classic  example  is  a  pattern  that  uses | 
 |        nested unlimited repeats. | 
 |  | 
 |        There  is an internal counter in pcre2_match() that is incremented each | 
 |        time round its main matching loop. If  this  value  reaches  the  match | 
 |        limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT. | 
 |        This  has  the  effect  of limiting the amount of backtracking that can | 
 |        take place. For patterns that are not anchored, the count restarts from | 
 |        zero for each position in the subject string. This limit  also  applies | 
 |        to pcre2_dfa_match(), though the counting is done in a different way. | 
 |  | 
 |        When  pcre2_match()  is  called  with  a  pattern that was successfully | 
 |        processed by pcre2_jit_compile(), the way in which matching is executed | 
 |        is entirely different. However, there is still the possibility of  run- | 
 |        away matching that goes on for a very long time, and so the match_limit | 
 |        value  is  also used in this case (but in a different way) to limit how | 
 |        long the matching can continue. | 
 |  | 
 |        The default value for the limit can be set when PCRE2 is built; the de- | 
 |        fault is 10 million, which handles all but the most  extreme  cases.  A | 
 |        value  for the match limit may also be supplied by an item at the start | 
 |        of a pattern of the form | 
 |  | 
 |          (*LIMIT_MATCH=ddd) | 
 |  | 
 |        where ddd is a decimal number. However, such a setting is  ignored  un- | 
 |        less  ddd  is less than the limit set by the caller of pcre2_match() or | 
 |        pcre2_dfa_match() or, if no such limit is set, less than the default. | 
 |  | 
 |        int pcre2_set_depth_limit(pcre2_match_context *mcontext, | 
 |          uint32_t value); | 
 |  | 
 |        This  parameter  limits   the   depth   of   nested   backtracking   in | 
 |        pcre2_match().   Each time a nested backtracking point is passed, a new | 
 |        memory frame is used to remember the state of matching at  that  point. | 
 |        Thus,  this  parameter  indirectly  limits the amount of memory that is | 
 |        used in a match. However, because the size of each memory frame depends | 
 |        on the number of capturing parentheses, the actual memory limit  varies | 
 |        from  pattern to pattern. This limit was more useful in versions before | 
 |        10.30, where function recursion was used for backtracking. | 
 |  | 
 |        The depth limit is not relevant, and is ignored, when matching is  done | 
 |        using JIT compiled code. However, it is supported by pcre2_dfa_match(), | 
 |        which  uses it to limit the depth of nested internal recursive function | 
 |        calls that implement atomic groups, lookaround assertions, and  pattern | 
 |        recursions. This limits, indirectly, the amount of system stack that is | 
 |        used.  It  was  more useful in versions before 10.32, when stack memory | 
 |        was used for local workspace vectors for recursive function calls. From | 
 |        version 10.32, only local variables are allocated on the stack  and  as | 
 |        each call uses only a few hundred bytes, even a small stack can support | 
 |        quite a lot of recursion. | 
 |  | 
 |        If  the depth of internal recursive function calls is great enough, lo- | 
 |        cal workspace vectors are allocated on the heap from version 10.32  on- | 
 |        wards,  so  the  depth  limit also indirectly limits the amount of heap | 
 |        memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when | 
 |        matched to a very long string using pcre2_dfa_match(), can use a  great | 
 |        deal  of memory. However, it is probably better to limit heap usage di- | 
 |        rectly by calling pcre2_set_heap_limit(). | 
 |  | 
 |        The default value for the depth limit can be set when PCRE2  is  built; | 
 |        if  it  is not, the default is set to the same value as the default for | 
 |        the  match  limit.   If  the  limit  is  exceeded,   pcre2_match()   or | 
 |        pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth | 
 |        limit  may also be supplied by an item at the start of a pattern of the | 
 |        form | 
 |  | 
 |          (*LIMIT_DEPTH=ddd) | 
 |  | 
 |        where ddd is a decimal number. However, such a setting is  ignored  un- | 
 |        less  ddd  is less than the limit set by the caller of pcre2_match() or | 
 |        pcre2_dfa_match() or, if no such limit is set, less than the default. | 
 |  | 
 |  | 
 | CHECKING BUILD-TIME OPTIONS | 
 |  | 
 |        int pcre2_config(uint32_t what, void *where); | 
 |  | 
 |        The function pcre2_config() makes it possible for  a  PCRE2  client  to | 
 |        find  the  value  of  certain  configuration parameters and to discover | 
 |        which optional features have been compiled into the PCRE2 library.  The | 
 |        pcre2build documentation has more details about these features. | 
 |  | 
 |        The  first  argument  for pcre2_config() specifies which information is | 
 |        required. The second argument is a pointer to memory into which the in- | 
 |        formation is placed. If NULL is passed, the function returns the amount | 
 |        of memory that is needed for the requested information. For calls  that | 
 |        return numerical values, the amount is in bytes; when requesting  these | 
 |        values, where should point to appropriately aligned memory.  For  calls | 
 |        that  return  strings,  the required length is given in code units, not | 
 |        counting the terminating zero. | 
 |  | 
 |        When requesting information, the returned value from pcre2_config()  is | 
 |        non-negative  on success, or the negative error code PCRE2_ERROR_BADOP- | 
 |        TION if the value in the first argument is not recognized. The  follow- | 
 |        ing information is available: | 
 |  | 
 |          PCRE2_CONFIG_BSR | 
 |  | 
 |        The  output  is a uint32_t integer whose value indicates what character | 
 |        sequences the \R  escape  sequence  matches  by  default.  A  value  of | 
 |        PCRE2_BSR_UNICODE  means  that  \R  matches any Unicode line ending se- | 
 |        quence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, | 
 |        or CRLF. The default can be overridden when a pattern is compiled. | 
 |  | 
 |          PCRE2_CONFIG_COMPILED_WIDTHS | 
 |  | 
 |        The output is a uint32_t integer whose lower bits indicate  which  code | 
 |        unit  widths  were  selected  when PCRE2 was built. The 1-bit indicates | 
 |        8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit  sup- | 
 |        port, respectively. | 
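 |  | 
 |        A minimal sketch of decoding that value is shown below. The function | 
 |        width_is_supported() is a hypothetical helper, not a PCRE2 function; | 
 |        it takes the mask as an argument so that it does not depend on the | 
 |        library. | 
 |  | 
```c
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: tests the mask that
   PCRE2_CONFIG_COMPILED_WIDTHS reports. Returns 1 if the given code unit
   width (8, 16, or 32) was built, 0 otherwise or for any other width. */
int width_is_supported(uint32_t widths_mask, int width)
{
    switch (width) {
    case 8:  return (widths_mask & 1u) != 0;   /* the 1-bit */
    case 16: return (widths_mask & 2u) != 0;   /* the 2-bit */
    case 32: return (widths_mask & 4u) != 0;   /* the 4-bit */
    default: return 0;
    }
}
```
 |  | 
 |        A caller would typically fill a uint32_t variable by passing its | 
 |        address to pcre2_config() with PCRE2_CONFIG_COMPILED_WIDTHS and then | 
 |        hand that variable to the helper. | 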
 |  | 
 |          PCRE2_CONFIG_DEPTHLIMIT | 
 |  | 
 |        The  output  is a uint32_t integer that gives the default limit for the | 
 |        depth of nested backtracking in pcre2_match() or the  depth  of  nested | 
 |        recursions,  lookarounds,  and atomic groups in pcre2_dfa_match(). Fur- | 
 |        ther details are given with pcre2_set_depth_limit() above. | 
 |  | 
 |          PCRE2_CONFIG_EFFECTIVE_LINKSIZE | 
 |  | 
 |        The output is a uint32_t integer that contains the number of bytes  the | 
 |        library  uses for internal linkage in compiled regular expressions. Its | 
 |        value is derived from the value that was provided  at  build  time  and | 
 |        that is described below by PCRE2_CONFIG_LINKSIZE. | 
 |  | 
 |          PCRE2_CONFIG_HEAPLIMIT | 
 |  | 
 |        The  output is a uint32_t integer that gives, in kibibytes, the default | 
 |        limit  for  the  amount  of  heap  memory  used  by  pcre2_match()   or | 
 |        pcre2_dfa_match().      Further      details     are     given     with | 
 |        pcre2_set_heap_limit() above. | 
 |  | 
 |          PCRE2_CONFIG_JIT | 
 |  | 
 |        The output is a uint32_t integer that is set  to  one  if  support  for | 
 |        just-in-time  compiling is included in the library; otherwise it is set | 
 |        to zero. Note that having the support in the library does not guarantee | 
 |        that JIT will be used for any given match, and neither does it  guaran- | 
 |        tee  that  JIT will actually be able to function, because it may not be | 
 |        able to allocate executable memory in some  environments.  There  is  a | 
 |        special call to pcre2_jit_compile() that can be used to check this. See | 
 |        the pcre2jit documentation for more details. | 
 |  | 
 |          PCRE2_CONFIG_JITTARGET | 
 |  | 
 |        The  where  argument  should point to a buffer that is at least 64 code | 
 |        units long.  (The  exact  length  required  can  be  found  by  calling | 
 |        pcre2_config()  with  where  set  to NULL.) The buffer is filled with a | 
 |        string that contains the name of the architecture  for  which  the  JIT | 
 |        compiler  is  configured,  for  example "x86 32bit (little endian + un- | 
 |        aligned)". If JIT support is not  available,  PCRE2_ERROR_BADOPTION  is | 
 |        returned,  otherwise the number of code units used is returned. This is | 
 |        the length of the string, plus one unit for the terminating zero. | 
 |  | 
 |          PCRE2_CONFIG_LINKSIZE | 
 |  | 
 |        The output is a uint32_t integer that contains the number of bytes  the | 
 |        library  was instructed to use for internal linkage in compiled regular | 
 |        expressions.  When PCRE2 is configured, the value can be set to  2,  3, | 
 |        or 4, with the default being 2 for most libraries. | 
 |  | 
 |        The  actual  number of bytes used depends on the size of the code units | 
 |        that the library supports and can be  higher.  See  PCRE2_CONFIG_EFFEC- | 
 |        TIVE_LINKSIZE above for details. | 
 |  | 
 |        The default value of 2 for the 8-bit and 16-bit libraries is sufficient | 
 |        for  all but the most massive patterns, since it allows the size of the | 
 |        compiled pattern to be up to 65535  code  units.  Larger  values  allow | 
 |        larger  regular  expressions to be compiled by those two libraries, but | 
 |        at the expense of slower matching. | 
 |  | 
 |          PCRE2_CONFIG_MATCHLIMIT | 
 |  | 
 |        The output is a uint32_t integer that gives the default match limit for | 
 |        pcre2_match(). Further details are given  with  pcre2_set_match_limit() | 
 |        above. | 
 |  | 
 |          PCRE2_CONFIG_NEWLINE | 
 |  | 
 |        The  output  is  a  uint32_t  integer whose value specifies the default | 
 |        character sequence that is recognized as meaning "newline". The  values | 
 |        are: | 
 |  | 
 |          PCRE2_NEWLINE_CR       Carriage return (CR) | 
 |          PCRE2_NEWLINE_LF       Linefeed (LF) | 
 |          PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF) | 
 |          PCRE2_NEWLINE_ANY      Any Unicode line ending | 
 |          PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF | 
 |          PCRE2_NEWLINE_NUL      The NUL character (binary zero) | 
 |  | 
 |        The  default  should  normally  correspond to the standard sequence for | 
 |        your operating system. | 
 |  | 
 |          PCRE2_CONFIG_NEVER_BACKSLASH_C | 
 |  | 
 |        The output is a uint32_t integer that is set to one if the  use  of  \C | 
 |        was  permanently  disabled when PCRE2 was built; otherwise it is set to | 
 |        zero. | 
 |  | 
 |          PCRE2_CONFIG_PARENSLIMIT | 
 |  | 
 |        The output is a uint32_t integer that gives the maximum depth of  nest- | 
 |        ing of parentheses (of any kind) in a pattern. This limit is imposed to | 
 |        cap  the  amount of system stack used when a pattern is compiled. It is | 
 |        specified when PCRE2 is built; the default is 250. This limit does  not | 
 |        take into account the stack that may already be used by the calling ap- | 
 |        plication.   For  finer  control  over  compilation  stack  usage,  see | 
 |        pcre2_set_compile_recursion_guard(). | 
 |  | 
 |          PCRE2_CONFIG_STACKRECURSE | 
 |  | 
 |        This parameter is obsolete and should not be used in new code. The out- | 
 |        put is a uint32_t integer that is always set to zero. | 
 |  | 
 |          PCRE2_CONFIG_TABLES_LENGTH | 
 |  | 
 |        The output is a uint32_t integer that gives the length of PCRE2's char- | 
 |        acter processing tables in bytes. For details of these tables  see  the | 
 |        section on locale support below. | 
 |  | 
 |          PCRE2_CONFIG_UNICODE_VERSION | 
 |  | 
 |        The  where  argument  should point to a buffer that is at least 24 code | 
 |        units long.  (The  exact  length  required  can  be  found  by  calling | 
 |        pcre2_config()  with  where  set  to  NULL.) If PCRE2 has been compiled | 
 |        without Unicode support, the buffer is filled with  the  text  "Unicode | 
 |        not  supported".  Otherwise,  the  Unicode version string (for example, | 
 |        "8.0.0") is inserted. The number of code units used is  returned.  This | 
 |        is the length of the string plus one unit for the terminating zero. | 
 |  | 
 |          PCRE2_CONFIG_UNICODE | 
 |  | 
 |        The  output is a uint32_t integer that is set to one if Unicode support | 
 |        is available; otherwise it is set to zero. Unicode support implies  UTF | 
 |        support. | 
 |  | 
 |          PCRE2_CONFIG_VERSION | 
 |  | 
 |        The  where  argument  should point to a buffer that is at least 24 code | 
 |        units long.  (The  exact  length  required  can  be  found  by  calling | 
 |        pcre2_config()  with  where set to NULL.) The buffer is filled with the | 
 |        PCRE2 version string, zero-terminated. The number of code units used is | 
 |        returned. This is the length of the string plus one unit for the termi- | 
 |        nating zero. | 
 |  | 
 |  | 
 | COMPILING A PATTERN | 
 |  | 
 |        pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length, | 
 |          uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset, | 
 |          pcre2_compile_context *ccontext); | 
 |  | 
 |        void pcre2_code_free(pcre2_code *code); | 
 |  | 
 |        pcre2_code *pcre2_code_copy(const pcre2_code *code); | 
 |  | 
 |        pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code); | 
 |  | 
 |        The pcre2_compile() function compiles a pattern into an internal  form. | 
 |        The  pattern  is  defined  by a pointer to a string of code units and a | 
 |        length in code units. If the pattern is zero-terminated, the length can | 
 |        be specified as PCRE2_ZERO_TERMINATED. A NULL pattern  pointer  with  a | 
 |        length  of  zero  is  treated  as an empty string (NULL with a non-zero | 
 |        length causes an error return). The function returns  a  pointer  to  a | 
 |        block of memory that contains the compiled pattern and related data, or | 
 |        NULL if an error occurred. | 
 |  | 
 |        If  the  compile context argument ccontext is NULL, memory for the com- | 
 |        piled pattern is obtained by calling malloc().  Otherwise,  it  is  ob- | 
 |        tained from the same memory function that was used for the compile con- | 
 |        text. The caller must free the memory by calling pcre2_code_free() when | 
 |        it is no longer needed.  If pcre2_code_free() is called with a NULL ar- | 
 |        gument, it returns immediately, without doing anything. | 
 |  | 
 |        The function pcre2_code_copy() makes a copy of the compiled code in new | 
 |        memory,  using  the same memory allocator as was used for the original. | 
 |        However, if the code has been processed by the JIT  compiler  (see  be- | 
 |        low),  the JIT information cannot be copied (because it is position-de- | 
 |        pendent).  The new copy can initially be used only for  non-JIT  match- | 
 |        ing,  though  it  can  be passed to pcre2_jit_compile() if required. If | 
 |        pcre2_code_copy() is called with a NULL argument, it returns NULL. | 
 |  | 
 |        The pcre2_code_copy() function provides a way for individual threads in | 
 |        a multithreaded application to acquire a private copy  of  shared  com- | 
 |        piled  code.   However, it does not make a copy of the character tables | 
 |        used by the compiled pattern; the new pattern code points to  the  same | 
 |        tables  as  the original code.  (See "Locale Support" below for details | 
 |        of these character tables.) In many applications the  same  tables  are | 
 |        used  throughout, so this behaviour is appropriate. Nevertheless, there | 
 |        are occasions when a copy of a compiled pattern and the relevant tables | 
 |        are needed. The pcre2_code_copy_with_tables() function  provides  this. | 
 |        Copies  of  both  the  code  and the tables are made, with the new code | 
 |        pointing to the new tables. The memory for the new tables is  automati- | 
 |        cally  freed  when  pcre2_code_free() is called for the new copy of the | 
 |        compiled code. If pcre2_code_copy_with_tables() is called with  a  NULL | 
 |        argument, it returns NULL. | 
 |  | 
 |        NOTE:  When  one  of  the matching functions is called, pointers to the | 
 |        compiled pattern and the subject string are set in the match data block | 
 |        so that they can be referenced by the  substring  extraction  functions | 
 |        after  a  successful match.  After running a match, you must not free a | 
 |        compiled pattern or a subject string until after all operations on  the | 
 |        match  data  block have taken place, unless, in the case of the subject | 
 |        string, you have used the PCRE2_COPY_MATCHED_SUBJECT option,  which  is | 
 |        described  in  the section entitled "Option bits for pcre2_match()" be- | 
 |        low. | 
 |  | 
 |        The options argument for pcre2_compile() contains various bit  settings | 
 |        that  affect the compilation. It should be zero if none of them are re- | 
 |        quired. The available options are described below.  Some  of  them  (in | 
 |        particular,  those  that  are  compatible with Perl, but some others as | 
 |        well) can also be set and unset from within the pattern  (see  the  de- | 
 |        tailed description in the pcre2pattern documentation). | 
 |  | 
 |        For those options that can be different  in  different  parts  of  the | 
 |        pattern, the contents of the options argument specify their settings at | 
 |        the  start  of  compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and | 
 |        PCRE2_NO_UTF_CHECK options can be set at the time of matching  as  well | 
 |        as at compile time. | 
 |  | 
 |        Some additional options and less frequently required compile-time para- | 
 |        meters  (for example, the newline setting) can be provided in a compile | 
 |        context (as described above). | 
 |  | 
 |        If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme- | 
 |        diately. Otherwise, the variables to which these point are  set  to  an | 
 |        error code and an offset (number of code units) within the pattern, re- | 
 |        spectively, when pcre2_compile() returns NULL because a compilation er- | 
 |        ror has occurred. | 
 |  | 
 |        There are over 100 positive error codes that pcre2_compile() may return | 
 |        if it finds an error in the pattern. There are also some negative error | 
 |        codes  that  are used for invalid UTF strings when validity checking is | 
 |        in  force.  These  are  the  same  as  given   by   pcre2_match()   and | 
 |        pcre2_dfa_match(), and are described in the pcre2unicode documentation. | 
 |        There  is  no  separate documentation for the positive error codes, be- | 
 |        cause the textual error messages  that  are  obtained  by  calling  the | 
 |        pcre2_get_error_message() function (see "Obtaining a textual error mes- | 
 |        sage"  below)  should  be  self-explanatory.  Macro names starting with | 
 |        PCRE2_ERROR_ are defined for both positive and negative error codes  in | 
 |        pcre2.h.  When  compilation  is  successful errorcode is set to a value | 
 |        that returns the message "no error" if passed  to  pcre2_get_error_mes- | 
 |        sage(). | 
 |  | 
 |        The value returned in erroroffset is an indication of where in the pat- | 
 |        tern  an  error  occurred.  When there is no error, zero is returned. A | 
 |        non-zero value is not necessarily the furthest  point  in  the  pattern | 
 |        that  was  read.  For example, after the error "lookbehind assertion is | 
 |        not fixed length", the error offset points to the start of the  failing | 
 |        assertion. For an invalid UTF-8 or UTF-16 string, the offset is that of | 
 |        the first code unit of the failing character. | 
 |  | 
 |        Some  errors are not detected until the whole pattern has been scanned; | 
 |        in these cases, the offset passed back is the length  of  the  pattern. | 
 |        Note  that  the  offset is in code units, not characters, even in a UTF | 
 |        mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char- | 
 |        acter. | 
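 |  | 
 |        When an error message is shown to a user, it can be helpful to report | 
 |        the position in characters instead. The sketch below does this for a | 
 |        UTF-8 pattern. The function utf8_chars_before() is a hypothetical | 
 |        helper, not part of PCRE2, and it assumes that the pattern is valid | 
 |        UTF-8. | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: converts an error offset given in
   UTF-8 code units into the number of characters that begin before it.
   The pattern is assumed to be valid UTF-8. */
size_t utf8_chars_before(const unsigned char *pattern, size_t unit_offset)
{
    size_t count = 0;
    for (size_t i = 0; i < unit_offset; i++) {
        /* Continuation bytes have the form 10xxxxxx; every other byte
           starts a new character. */
        if ((pattern[i] & 0xC0u) != 0x80u) count++;
    }
    return count;
}
```
 |  | 
 |        An offset that points into the middle of a character counts that | 
 |        character as already begun, so the result is never negative and never | 
 |        exceeds the number of characters in the pattern. | 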
 |  | 
 |        This code fragment shows a typical straightforward call  to  pcre2_com- | 
 |        pile(): | 
 |  | 
 |          pcre2_code *re; | 
 |          PCRE2_SIZE erroffset; | 
 |          int errorcode; | 
 |          re = pcre2_compile( | 
 |            (PCRE2_SPTR)"^A.*Z",    /* the pattern */ | 
 |            PCRE2_ZERO_TERMINATED,  /* the pattern is zero-terminated */ | 
 |            0,                      /* default options */ | 
 |            &errorcode,             /* for error code */ | 
 |            &erroffset,             /* for error offset */ | 
 |            NULL);                  /* no compile context */ | 
 |  | 
 |  | 
 |    Main compile options | 
 |  | 
 |        The  following  names for option bits are defined in the pcre2.h header | 
 |        file: | 
 |  | 
 |          PCRE2_ANCHORED | 
 |  | 
 |        If this bit is set, the pattern is forced to be "anchored", that is, it | 
 |        is constrained to match only at the first matching point in the  string | 
 |        that  is being searched (the "subject string"). This effect can also be | 
 |        achieved by appropriate constructs in the pattern itself, which is  the | 
 |        only way to do it in Perl. | 
 |  | 
 |          PCRE2_ALLOW_EMPTY_CLASS | 
 |  | 
 |        By  default, for compatibility with Perl, a closing square bracket that | 
 |        immediately follows an opening one is treated as a data  character  for | 
 |        the  class.  When  PCRE2_ALLOW_EMPTY_CLASS  is  set,  it terminates the | 
 |        class, which therefore contains no characters and so can never match. | 
 |  | 
 |          PCRE2_ALT_BSUX | 
 |  | 
 |        This option requests alternative handling of  three  escape  sequences, | 
 |        which  makes  PCRE2's  behaviour more like ECMAscript (aka JavaScript). | 
 |        When it is set: | 
 |  | 
 |        (1) \U matches an upper case "U" character; by default \U causes a com- | 
 |        pile time error (Perl uses \U to upper case subsequent characters). | 
 |  | 
 |        (2) \u matches a lower case "u" character unless it is followed by four | 
 |        hexadecimal digits, in which case the hexadecimal  number  defines  the | 
 |        code  point  to match. By default, \u causes a compile time error (Perl | 
 |        uses it to upper case the following character). | 
 |  | 
 |        (3) \x matches a lower case "x" character unless it is followed by  two | 
 |        hexadecimal  digits,  in  which case the hexadecimal number defines the | 
 |        code point to match. By default, as in Perl, a  hexadecimal  number  is | 
 |        always expected after \x, but it may have one or two digits. | 
 |  | 
 |        ECMAscript 6 added additional functionality to \u. This can be accessed | 
 |        using  the  PCRE2_EXTRA_ALT_BSUX  extra  option (see "Extra compile op- | 
 |        tions" below).  Note that this alternative escape handling applies only | 
 |        to patterns. Neither of these options affects  the  processing  of  re- | 
 |        placement strings passed to pcre2_substitute(). | 
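 |  | 
 |        Rule (2) above can be sketched as a small classifier. This is only a | 
 |        model of the documented behaviour, not the code that PCRE2 uses, and | 
 |        the function name alt_bsux_u_escape() is hypothetical. | 
 |  | 
```c
#include <ctype.h>
#include <stdint.h>

/* Hypothetical helper, not part of PCRE2: models rule (2) for \u under
   PCRE2_ALT_BSUX. "rest" points just past the "\u" in a zero-terminated
   pattern. Returns 1 and stores the code point if four hexadecimal digits
   follow; returns 0 if \u stands for a literal lower case "u". */
int alt_bsux_u_escape(const char *rest, uint32_t *code_point)
{
    uint32_t value = 0;
    for (int i = 0; i < 4; i++) {
        unsigned char c = (unsigned char)rest[i];
        if (!isxdigit(c)) return 0;      /* also stops at the terminating zero */
        value = value * 16 + (uint32_t)(isdigit(c) ? c - '0'
                                                   : tolower(c) - 'a' + 10);
    }
    *code_point = value;
    return 1;
}
```
 |  | 
 |        Rule (3) for \x has the same shape, with two digits in place of four. | 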
 |  | 
 |          PCRE2_ALT_CIRCUMFLEX | 
 |  | 
 |        In  multiline  mode  (when  PCRE2_MULTILINE  is  set),  the  circumflex | 
 |        metacharacter matches at the start of the subject (unless  PCRE2_NOTBOL | 
 |        is  set),  and  also  after  any internal newline. However, it does not | 
 |        match after a newline at the end of the subject, for compatibility with | 
 |        Perl. If you want a multiline circumflex also to match after  a  termi- | 
 |        nating newline, you must set PCRE2_ALT_CIRCUMFLEX. | 
 |  | 
 |          PCRE2_ALT_EXTENDED_CLASS | 
 |  | 
 |        Alters  the  parsing of character classes to follow the extended syntax | 
 |        described by Unicode UTS#18. The PCRE2_ALT_EXTENDED_CLASS option has no | 
 |        impact on the behaviour of the Perl-specific "(?[...])" syntax for  ex- | 
 |        tended  classes, but instead enables the alternative syntax of extended | 
 |        class behaviour inside ordinary  "[...]"  character  classes.  See  the | 
 |        pcre2pattern  documentation  for  details of the character classes sup- | 
 |        ported. | 
 |  | 
 |          PCRE2_ALT_VERBNAMES | 
 |  | 
 |        By default, for compatibility with Perl, the name in any verb  sequence | 
 |        such  as  (*MARK:NAME)  is any sequence of characters that does not in- | 
 |        clude a closing parenthesis. The name is not processed in any way,  and | 
 |        it  is  not possible to include a closing parenthesis in the name. How- | 
 |        ever, if the PCRE2_ALT_VERBNAMES option is set, normal  backslash  pro- | 
 |        cessing  is  applied to verb names and only an unescaped closing paren- | 
 |        thesis terminates the name. A closing parenthesis can be included in  a | 
 |        name  either  as  \)  or  between  \Q  and \E. If the PCRE2_EXTENDED or | 
 |        PCRE2_EXTENDED_MORE option is set with  PCRE2_ALT_VERBNAMES,  unescaped | 
 |        white space in verb names is skipped and #-comments are recognized, ex- | 
 |        actly as in the rest of the pattern. | 
 |  | 
 |          PCRE2_AUTO_CALLOUT | 
 |  | 
 |        If  this  bit  is  set,  pcre2_compile()  automatically inserts callout | 
 |        items, all with number 255, before each pattern  item,  except  immedi- | 
 |        ately  before  or after an explicit callout in the pattern. For discus- | 
 |        sion of the callout facility, see the pcre2callout documentation. | 
 |  | 
 |          PCRE2_CASELESS | 
 |  | 
 |        If this bit is set, letters in the pattern match both upper  and  lower | 
 |        case  letters in the subject. It is equivalent to Perl's /i option, and | 
 |        it can be changed within a pattern by a (?i) option setting. If  either | 
 |        PCRE2_UTF  or  PCRE2_UCP  is  set,  Unicode properties are used for all | 
 |        characters with more than one other case, and for all characters  whose | 
 |        code points are greater than U+007F. | 
 |  | 
 |        Note that there are two ASCII characters, K and S, that, in addition to | 
 |        their  lower  case  ASCII  equivalents, are case-equivalent with U+212A | 
 |        (Kelvin sign) and U+017F (long S) respectively. If you do not want this | 
 |        case equivalence, you can  suppress  it  by  setting  PCRE2_EXTRA_CASE- | 
 |        LESS_RESTRICT. | 
 |  | 
 |        One  language family, Turkish and Azeri, has its own case-insensitivity | 
 |        rules, which can be  selected  by  setting  PCRE2_EXTRA_TURKISH_CASING. | 
 |        This  alters  the behaviour of the 'i', 'I', U+0130 (capital I with dot | 
 |        above), and U+0131 (small dotless i) characters. | 
 |  | 
 |        For lower valued characters with only one other case, a lookup table is | 
 |        used for speed. When neither PCRE2_UTF nor PCRE2_UCP is set,  a  lookup | 
 |        table is used for all code points less than 256, and higher code points | 
 |        (available only in 16-bit or 32-bit mode) are treated as not having an- | 
 |        other case. | 
 |  | 
 |        From release 10.45 PCRE2_CASELESS also affects what some of the letter- | 
 |        related  Unicode  property escapes (\p and \P) match. The properties Lu | 
 |        (upper case letter), Ll (lower case letter), and Lt (title case letter) | 
 |        are all treated as LC (cased letter) when PCRE2_CASELESS is set. | 
 |  | 
 |          PCRE2_DOLLAR_ENDONLY | 
 |  | 
 |        If this bit is set, a dollar metacharacter in the pattern matches  only | 
 |        at  the  end  of the subject string. Without this option, a dollar also | 
 |        matches immediately before a newline at the end of the string (but  not | 
 |        before  any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored | 
 |        if PCRE2_MULTILINE is set. There is no equivalent  to  this  option  in | 
 |        Perl, and no way to set it within a pattern. | 
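 |  | 
 |        The difference can be modelled with a small predicate. The sketch | 
 |        below is illustrative only: dollar_matches_at() is a hypothetical | 
 |        function, not part of PCRE2, and it assumes that PCRE2_MULTILINE is | 
 |        not set and that a newline is a single LF. | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2: models where a dollar can match
   when PCRE2_MULTILINE is not set. A newline is assumed to be a single LF.
   Returns 1 if "$" matches at offset pos of the subject, 0 otherwise. */
int dollar_matches_at(const char *subject, size_t length, size_t pos,
                      int dollar_endonly)
{
    if (pos == length) return 1;         /* the very end always matches */
    if (dollar_endonly) return 0;        /* nothing else is permitted */
    /* Default: also immediately before a newline that ends the subject. */
    return pos + 1 == length && subject[pos] == '\n';
}
```
 |  | 
 |        For the subject "abc\n", the default behaviour allows a dollar to | 
 |        match at offsets 3 and 4, whereas PCRE2_DOLLAR_ENDONLY allows only | 
 |        offset 4. | 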
 |  | 
 |          PCRE2_DOTALL | 
 |  | 
 |        If  this  bit  is  set,  a dot metacharacter in the pattern matches any | 
 |        character, including one that indicates a  newline.  However,  it  only | 
 |        ever matches one character, even if newlines are coded as CRLF. Without | 
 |        this option, a dot does not match when the current position in the sub- | 
 |        ject  is  at  a newline. This option is equivalent to Perl's /s option, | 
 |        and it can be changed within a pattern by a (?s) option setting. A neg- | 
 |        ative class such as [^a] always matches newline characters, and the  \N | 
 |        escape  sequence always matches a non-newline character, independent of | 
 |        the setting of PCRE2_DOTALL. | 
 |  | 
 |          PCRE2_DUPNAMES | 
 |  | 
 |        If this bit is set, names used to identify capture groups need  not  be | 
 |        unique.   This  can  be helpful for certain types of pattern when it is | 
 |        known that only one instance of the named group can  ever  be  matched. | 
 |        There  are  more  details  of  named capture groups below; see also the | 
 |        pcre2pattern documentation. | 
 |  | 
 |          PCRE2_ENDANCHORED | 
 |  | 
 |        If this bit is set, the end of any pattern match must be right  at  the | 
 |        end of the string being searched (the "subject string"). If the pattern | 
 |        match succeeds by reaching (*ACCEPT), but does not reach the end of the | 
 |        subject,  the match fails at the current starting point. For unanchored | 
 |        patterns, a new match is then tried at the next  starting  point.  How- | 
 |        ever, if the match succeeds by reaching the end of the pattern, but not | 
 |        the  end  of  the subject, backtracking occurs and an alternative match | 
 |        may be found. Consider these two patterns: | 
 |  | 
 |          .(*ACCEPT)|.. | 
 |          .|.. | 
 |  | 
 |        If matched against "abc" with PCRE2_ENDANCHORED set, the first  matches | 
 |        "c"  whereas  the  second matches "bc". The effect of PCRE2_ENDANCHORED | 
 |        can also be achieved by appropriate constructs in the  pattern  itself, | 
 |        which is the only way to do it in Perl. | 
 |  | 
 |        For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only | 
 |        to  the  first  (that  is,  the longest) matched string. Other parallel | 
 |        matches, which are necessarily substrings of the first one, must  obvi- | 
 |        ously end before the end of the subject. | 
 |  | 
 |          PCRE2_EXTENDED | 
 |  | 
 |        If  this bit is set, most white space characters in the pattern are to- | 
 |        tally ignored except when escaped, inside a character class, or  inside | 
 |        a  \Q...\E  sequence.  However,  white  space is not allowed within se- | 
 |        quences such as (?> that introduce various  parenthesized  groups,  nor | 
 |        within  numerical  quantifiers  such as {1,3}. Ignorable white space is | 
 |        permitted between an item and a  following  quantifier  and  between  a | 
 |        quantifier  and  a following + that indicates possessiveness. PCRE2_EX- | 
 |        TENDED is equivalent to Perl's /x option, and it can be changed  within | 
 |        a pattern by a (?x) option setting. | 
 |  | 
 |        When  PCRE2  is compiled without Unicode support, PCRE2_EXTENDED recog- | 
 |        nizes as white space only those characters with code points  less  than | 
 |        256 that are flagged as white space in its low-character table. The ta- | 
 |        ble is normally created by pcre2_maketables(), which uses the isspace() | 
 |        function  to identify space characters. In most ASCII environments, the | 
 |        relevant characters are those with code  points  0x0009  (tab),  0x000A | 
 |        (linefeed),  0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage | 
 |        return), and 0x0020 (space). | 
 |  | 
 |        When PCRE2 is compiled with Unicode support, in addition to these char- | 
 |        acters, five more Unicode "Pattern White Space" characters  are  recog- | 
 |        nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to- | 
 |        right  mark), U+200F (right-to-left mark), U+2028 (line separator), and | 
 |        U+2029 (paragraph separator). This set of characters  is  the  same  as | 
 |        recognized  by  Perl's /x option. Note that the horizontal and vertical | 
 |        space characters that are matched by the \h and \v escapes in  patterns | 
 |        are a much bigger set. | 
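 |  | 
 |        The two lists above can be written as a simple C predicate. This is a | 
 |        sketch under stated assumptions, not the library's implementation: | 
 |        the name is_extended_white_space() is hypothetical, and the first | 
 |        test hard-codes the usual ASCII result of isspace() instead of | 
 |        consulting a locale-dependent table. | 
 |  | 
```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only (not PCRE2 code): is code point cp ignored as
   white space under PCRE2_EXTENDED? The ASCII set is an assumption that
   holds in most ASCII environments; the real table depends on isspace(). */
int is_extended_white_space(uint32_t cp, int unicode_support)
{
    assert(cp <= 0x10FFFF);
    /* tab, linefeed, vertical tab, formfeed, carriage return, space */
    if ((cp >= 0x0009 && cp <= 0x000D) || cp == 0x0020)
        return 1;
    if (!unicode_support)
        return 0;
    /* The five additional "Pattern White Space" characters */
    return cp == 0x0085 || cp == 0x200E || cp == 0x200F ||
           cp == 0x2028 || cp == 0x2029;
}
```
 |  | 
 |        Note that U+00A0 (no-break space) is rejected by this predicate even | 
 |        with Unicode support, although it is matched by \h in a pattern. | 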
 |  | 
 |        As  well as ignoring most white space, PCRE2_EXTENDED also causes char- | 
 |        acters between an unescaped # outside a character class  and  the  next | 
 |        newline,  inclusive,  to be ignored, which makes it possible to include | 
 |        comments inside complicated patterns. Note that the end of this type of | 
 |        comment is a literal newline sequence in the pattern; escape  sequences | 
 |        that happen to represent a newline do not count. | 
 |  | 
 |        Which characters are interpreted as newlines can be specified by a set- | 
 |        ting  in  the compile context that is passed to pcre2_compile() or by a | 
 |        special sequence at the start of the pattern, as described in the  sec- | 
 |        tion  entitled "Newline conventions" in the pcre2pattern documentation. | 
 |        A default is defined when PCRE2 is built. | 
 |  | 
 |          PCRE2_EXTENDED_MORE | 
 |  | 
 |        This option has the effect of PCRE2_EXTENDED,  but,  in  addition,  un- | 
 |        escaped  space and horizontal tab characters are ignored inside a char- | 
 |        acter class. Note: only these two characters are ignored, not the  full | 
 |        set  of pattern white space characters that are ignored outside a char- | 
 |        acter class. PCRE2_EXTENDED_MORE is equivalent to  Perl's  /xx  option, | 
 |        and it can be changed within a pattern by a (?xx) option setting. | 
 |  | 
 |          PCRE2_FIRSTLINE | 
 |  | 
 |        If this option is set, the start of an unanchored pattern match must be | 
 |        before  or  at  the  first  newline in the subject string following the | 
 |        start of matching, though the matched text may continue over  the  new- | 
 |        line. If startoffset is non-zero, the limiting newline is not necessar- | 
 |        ily  the  first  newline  in  the  subject. For example, if the subject | 
 |        string is "abc\nxyz" (where \n represents a single-character newline) a | 
 |        pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset  is | 
 |        greater  than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more | 
 |        general limiting facility. If PCRE2_FIRSTLINE is  set  with  an  offset | 
 |        limit,  a match must occur in the first line and also within the offset | 
 |        limit. In other words, whichever limit comes first is used. This option | 
 |        has no effect for anchored patterns. | 
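 |  | 
 |        The effect of startoffset on the limiting newline can be modelled as | 
 |        follows. This is only an illustrative sketch, not PCRE2 code: both | 
 |        function names are hypothetical, a newline is assumed to be a single | 
 |        linefeed, and no offset limit is taken into account. | 
 |  | 
```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only (not PCRE2 code). Returns the offset of the
   first newline at or after startoffset, or len if there is none.
   Assumption: a newline is the single character LF. */
size_t firstline_limit(const char *subject, size_t len, size_t startoffset)
{
    assert(startoffset <= len);
    for (size_t i = startoffset; i < len; i++) {
        if (subject[i] == '\n')
            return i;
    }
    return len;
}

/* Under PCRE2_FIRSTLINE, may an unanchored match start at match_start? */
int firstline_allows_start(const char *subject, size_t len,
                           size_t startoffset, size_t match_start)
{
    return match_start >= startoffset &&
           match_start <= firstline_limit(subject, len, startoffset);
}
```
 |  | 
 |        For the subject "abc\nxyz", "yz" begins at offset 5. The model rejects | 
 |        that start when startoffset is 0 or 3, and accepts it when startoffset | 
 |        is 4, in line with the example above. | 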
 |  | 
 |          PCRE2_LITERAL | 
 |  | 
 |        If this option is set, all meta-characters in the pattern are disabled, | 
 |        and it is treated as a literal string. Matching literal strings with  a | 
 |        regular expression engine is not the most efficient way of doing it. If | 
 |        you  are  doing  a  lot of literal matching and are worried about effi- | 
 |        ciency, you should consider using other approaches. The only other main | 
 |        options  that  are  allowed  with  PCRE2_LITERAL  are:  PCRE2_ANCHORED, | 
 |        PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE, | 
 |        PCRE2_MATCH_INVALID_UTF,  PCRE2_NO_START_OPTIMIZE,  PCRE2_NO_UTF_CHECK, | 
 |        PCRE2_UTF, and  PCRE2_USE_OFFSET_LIMIT.  The  extra  options  PCRE2_EX- | 
 |        TRA_MATCH_LINE and PCRE2_EXTRA_MATCH_WORD are also supported. Any other | 
 |        options cause an error. | 
 |  | 
 |          PCRE2_MATCH_INVALID_UTF | 
 |  | 
 |        This  option  forces PCRE2_UTF (see below) and also enables support for | 
 |        matching by pcre2_match() in subject strings that contain  invalid  UTF | 
 |        sequences.   Note,  however, that the 16-bit and 32-bit PCRE2 libraries | 
 |        process strings as sequences of uint16_t or uint32_t code points.  They | 
 |        cannot find valid UTF sequences within an arbitrary string of bytes un- | 
 |        less  such  sequences  are  suitably aligned. This facility is not sup- | 
 |        ported for DFA matching. For details, see the  pcre2unicode  documenta- | 
 |        tion. | 
 |  | 
 |          PCRE2_MATCH_UNSET_BACKREF | 
 |  | 
 |        If  this  option  is  set,  a  backreference  to an unset capture group | 
 |        matches an empty string (by default this causes  the  current  matching | 
 |        alternative to fail).  A pattern such as (\1)(a) succeeds when this op- | 
 |        tion  is  set  (assuming it can find an "a" in the subject), whereas it | 
 |        fails by default, for Perl compatibility.  Setting  this  option  makes | 
 |        PCRE2 behave more like ECMAscript (aka JavaScript). | 
 |  | 
 |          PCRE2_MULTILINE | 
 |  | 
 |        By  default,  for  the purposes of matching "start of line" and "end of | 
 |        line", PCRE2 treats the subject string as consisting of a  single  line | 
 |        of  characters,  even  if  it actually contains newlines. The "start of | 
 |        line" metacharacter (^) matches only at the start of  the  string,  and | 
 |        the  "end  of  line"  metacharacter  ($) matches only at the end of the | 
 |        string, or before a terminating newline (except  when  PCRE2_DOLLAR_EN- | 
 |        DONLY is set). Note, however, that unless PCRE2_DOTALL is set, the "any | 
 |        character"  metacharacter  (.) does not match at a newline. This behav- | 
 |        iour (for ^, $, and dot) is the same as Perl. | 
 |  | 
 |        When  PCRE2_MULTILINE  is  set, the "start of line" and "end  of  line" | 
 |        constructs  match  immediately following or immediately before internal | 
 |        newlines in the subject string, respectively, as well as  at  the  very | 
 |        start  and  end.  This is equivalent to Perl's /m option, and it can be | 
 |        changed within a pattern by a (?m) option setting. Note that the "start | 
 |        of line" metacharacter does not match after a newline at the end of the | 
 |        subject, for compatibility with Perl.  However, you can change this  by | 
 |        setting  the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a | 
 |        subject string, or no occurrences of ^  or  $  in  a  pattern,  setting | 
 |        PCRE2_MULTILINE has no effect. | 
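 |  | 
 |        The positions at which a circumflex can match may be modelled in C as | 
 |        shown here. This is a sketch for illustration, not code from PCRE2: | 
 |        the name circumflex_can_match() is hypothetical, and a newline is | 
 |        assumed to be a single linefeed character. | 
 |  | 
```c
#include <assert.h>
#include <stddef.h>

/* Illustrative model only (not PCRE2 code): reports whether the "start
   of line" metacharacter could match at offset pos.
   Assumption: a newline is the single character LF. */
int circumflex_can_match(const char *subject, size_t len, size_t pos,
                         int multiline, int alt_circumflex)
{
    assert(pos <= len);
    if (pos == 0)
        return 1;               /* the very start of the subject */
    if (!multiline || subject[pos - 1] != '\n')
        return 0;               /* not immediately after a newline */
    /* After an internal newline a match is always possible; after a
       newline that ends the subject, only with PCRE2_ALT_CIRCUMFLEX. */
    return pos < len || alt_circumflex;
}
```
 |  | 
 |        For the subject "a\nb\n" in multiline mode, the model accepts offsets | 
 |        0 and 2, and accepts offset 4 only when alt_circumflex is non-zero. | 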
 |  | 
 |          PCRE2_NEVER_BACKSLASH_C | 
 |  | 
 |        This  option  locks out the use of \C in the pattern that is being com- | 
 |        piled.  This escape can  cause  unpredictable  behaviour  in  UTF-8  or | 
 |        UTF-16  modes,  because  it may leave the current matching point in the | 
 |        middle of a multi-code-unit character. This option may be useful in ap- | 
 |        plications that process patterns from external sources. Note that there | 
 |        is also a build-time option that permanently locks out the use of \C. | 
 |  | 
 |          PCRE2_NEVER_UCP | 
 |  | 
 |        This option locks out the use of Unicode properties  for  handling  \B, | 
 |        \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as | 
 |        described  for  the  PCRE2_UCP option below. In particular, it prevents | 
 |        the creator of the pattern from enabling this facility by starting  the | 
 |        pattern  with  (*UCP).  This  option may be useful in applications that | 
 |        process  patterns  from  external  sources.  The   option   combination | 
 |        PCRE2_UCP and PCRE2_NEVER_UCP causes an error. | 
 |  | 
 |          PCRE2_NEVER_UTF | 
 |  | 
 |        This  option  locks out interpretation of the pattern as UTF-8, UTF-16, | 
 |        or UTF-32, depending on which library is in use. In particular, it pre- | 
 |        vents the creator of the pattern from switching to  UTF  interpretation | 
 |        by  starting  the pattern with (*UTF). This option may be useful in ap- | 
 |        plications that process patterns from external sources. The combination | 
 |        of PCRE2_UTF and PCRE2_NEVER_UTF causes an error. | 
 |  | 
 |          PCRE2_NO_AUTO_CAPTURE | 
 |  | 
 |        If this option is set, it disables the use of numbered capturing paren- | 
 |        theses in the pattern. Any opening parenthesis that is not followed  by | 
 |        ?  behaves as if it were followed by ?: but named parentheses can still | 
 |        be used for capturing (and they acquire numbers in the usual way). This | 
 |        is the same as Perl's /n option.  Note that, when this option  is  set, | 
 |        references  to  capture  groups (backreferences or recursion/subroutine | 
 |        calls) may only refer to named groups, though the reference can  be  by | 
 |        name or by number. | 
 |  | 
 |          PCRE2_NO_AUTO_POSSESS | 
 |  | 
 |        If  this  (deprecated)  option  is set, it disables "auto-possessifica- | 
 |        tion", which is an optimization that, for example, turns a+b into  a++b | 
 |        in order to avoid backtracks into a+ that can never be successful. How- | 
 |        ever,  if  callouts  are  in use, auto-possessification means that some | 
 |        callouts are never taken. You can set  this  option  if  you  want  the | 
 |        matching  functions  to  do  a  full unoptimized search and run all the | 
 |        callouts, but it is mainly provided for testing purposes. | 
 |  | 
 |        If  a  compile  context  is  available,  it  is  recommended   to   use | 
 |        pcre2_set_optimize()  with  the directive PCRE2_AUTO_POSSESS_OFF rather | 
 |        than   the   compile   option    PCRE2_NO_AUTO_POSSESS.    Note    that | 
 |        PCRE2_NO_AUTO_POSSESS  takes  precedence  over the pcre2_set_optimize() | 
 |        optimization directives PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF. | 
 |  | 
 |          PCRE2_NO_DOTSTAR_ANCHOR | 
 |  | 
 |        If this (deprecated) option is set, it disables an optimization that is | 
 |        applied when .* is the first significant item in a top-level branch  of | 
 |        a  pattern, and all the other branches also start with .* or with \A or | 
 |        \G or ^. The optimization is automatically disabled for .* if it is in- | 
 |        side an atomic group or a capture group that is the subject of a  back- | 
 |        reference, or if the pattern contains (*PRUNE) or (*SKIP). When the op- | 
 |        timization is not disabled, such a pattern is automatically anchored if | 
 |        PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set | 
 |        for  any  ^ items. Otherwise, the fact that any match must start either | 
 |        at the start of the subject or following a newline is remembered.  Like | 
 |        other optimizations, this can cause callouts to be skipped.  (If a com- | 
 |        pile  context  is  available,  it is recommended to use pcre2_set_opti- | 
 |        mize() with the directive PCRE2_DOTSTAR_ANCHOR_OFF instead.) | 
 |  | 
 |          PCRE2_NO_START_OPTIMIZE | 
 |  | 
 |        This is an option whose main effect is at matching time.  It  does  not | 
 |        change what pcre2_compile() generates, but it does affect the output of | 
 |        the  JIT  compiler.  Setting  this  option  is  equivalent  to  calling | 
 |        pcre2_set_optimize()   with   the   directive    parameter    set    to | 
 |        PCRE2_START_OPTIMIZE_OFF. | 
 |  | 
 |        There  are  a  number of optimizations that may occur at the start of a | 
 |        match, in order to speed up the process. For example, if  it  is  known | 
 |        that  an  unanchored  match must start with a specific code unit value, | 
 |        the matching code searches the subject for that value, and fails  imme- | 
 |        diately  if it cannot find it, without actually running the main match- | 
 |        ing function. The start-up optimizations are in effect  a  pre-scan  of | 
 |        the subject that takes place before the pattern is run. | 
 |  | 
 |        Disabling  the  start-up optimizations may cause performance to suffer. | 
 |        However, this may be desirable for patterns which contain  callouts  or | 
 |        items  such  as  (*COMMIT)  and  (*MARK).  See the above description of | 
 |        PCRE2_START_OPTIMIZE_OFF for further details. | 
 |  | 
 |          PCRE2_NO_UTF_CHECK | 
 |  | 
 |        When PCRE2_UTF is set, the validity of the pattern as a UTF  string  is | 
 |        automatically  checked.  There  are  discussions  about the validity of | 
 |        UTF-8 strings, UTF-16 strings, and UTF-32 strings in  the  pcre2unicode | 
 |        document.  If an invalid UTF sequence is found, pcre2_compile() returns | 
 |        a negative error code. | 
 |  | 
 |        If you know that your pattern is a valid UTF string, and  you  want  to | 
 |        skip   this   check   for   performance   reasons,   you  can  set  the | 
 |        PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an in- | 
 |        valid UTF string as a pattern is undefined. It may cause  your  program | 
 |        to crash or loop. | 
 |  | 
 |        Note  that  this  option  can  also  be  passed  to  pcre2_match()  and | 
 |        pcre2_dfa_match(), to suppress UTF validity  checking  of  the  subject | 
 |        string. | 
 |  | 
 |        Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis- | 
 |        able  the error that is given if an escape sequence for an invalid Uni- | 
 |        code code point is encountered in the pattern. In particular,  the  so- | 
 |        called  "surrogate"  code points (0xd800 to 0xdfff) are invalid. If you | 
 |        want to allow escape  sequences  such  as  \x{d800}  you  can  set  the | 
 |        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  extra  option, as described in the | 
 |        section entitled "Extra compile options" below.  However, this is  pos- | 
 |        sible only in UTF-8 and UTF-32 modes, because these values are not rep- | 
 |        resentable in UTF-16. | 
 |  | 
 |          PCRE2_UCP | 
 |  | 
 |        This option has two effects. Firstly, it changes the way PCRE2 processes | 
 |        \B,  \b,  \D,  \d,  \S,  \s,  \W,  \w,  and some of the POSIX character | 
 |        classes. By default, only  ASCII  characters  are  recognized,  but  if | 
 |        PCRE2_UCP  is  set, Unicode properties are used to classify characters. | 
 |        There are some PCRE2_EXTRA options (see below) that add  finer  control | 
 |        to  this  behaviour.  More  details are given in the section on generic | 
 |        character types in the pcre2pattern page. | 
 |  | 
 |        The second effect of PCRE2_UCP is to force the use of  Unicode  proper- | 
 |        ties for upper/lower casing operations, even when PCRE2_UTF is not set. | 
 |        This  makes  it  possible  to process strings in the 16-bit UCS-2 code. | 
 |        This option is available only if PCRE2 has been compiled  with  Unicode | 
 |        support (which is the default). | 
 |  | 
 |        The PCRE2_EXTRA_CASELESS_RESTRICT option (see above) restricts caseless | 
 |        matching  such  that  ASCII  characters match only ASCII characters and | 
 |        non-ASCII characters match only  non-ASCII  characters.  The  PCRE2_EX- | 
 |        TRA_TURKISH_CASING  option  (see  above) alters the matching of the 'i' | 
 |        characters to follow their behaviour in the Turkish and Azeri languages. | 
 |        For  further  details  on  PCRE2_EXTRA_CASELESS_RESTRICT  and PCRE2_EX- | 
 |        TRA_TURKISH_CASING, see the pcre2unicode page. | 
 |  | 
 |          PCRE2_UNGREEDY | 
 |  | 
 |        This option inverts the "greediness" of the quantifiers  so  that  they | 
 |        are  not greedy by default, but become greedy if followed by "?". It is | 
 |        not compatible with Perl. It can also be set by a (?U)  option  setting | 
 |        within the pattern. | 
 |  | 
 |          PCRE2_USE_OFFSET_LIMIT | 
 |  | 
 |        This option must be set for pcre2_compile() if pcre2_set_offset_limit() | 
 |        is  going  to be used to set a non-default offset limit in a match con- | 
 |        text for matches that use this pattern. An error  is  generated  if  an | 
 |        offset  limit is set without this option. For more details, see the de- | 
 |        scription of pcre2_set_offset_limit() in  the  section  that  describes | 
 |        match contexts. See also the PCRE2_FIRSTLINE option above. | 
 |  | 
 |          PCRE2_UTF | 
 |  | 
 |        This  option  causes  PCRE2  to regard both the pattern and the subject | 
 |        strings that are subsequently processed as strings  of  UTF  characters | 
 |        instead  of  single-code-unit  strings.  It  is available when PCRE2 is | 
 |        built to include Unicode support (which is  the  default).  If  Unicode | 
 |        support is not available, the use of this option provokes an error. De- | 
 |        tails  of how PCRE2_UTF changes the behaviour of PCRE2 are given in the | 
 |        pcre2unicode  page.  In  particular,  note  that  it  changes  the  way | 
 |        PCRE2_CASELESS works. | 
 |  | 
 |    Extra compile options | 
 |  | 
 |        The  option  bits  that  can be set in a compile context by calling the | 
 |        pcre2_set_compile_extra_options() function are as follows: | 
 |  | 
 |          PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK | 
 |  | 
 |        Since release 10.38 PCRE2 has forbidden the use of \K within lookaround | 
 |        assertions, following Perl's lead. This option is provided to re-enable | 
 |        the previous behaviour (act in positive lookarounds, ignore in negative | 
 |        ones) in case anybody is relying on it. | 
 |  | 
 |          PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES | 
 |  | 
 |        This option applies when compiling a pattern in UTF-8 or  UTF-32  mode. | 
 |        It  is  forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode | 
 |        "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs | 
 |        in UTF-16 to encode code points with values in  the  range  0x10000  to | 
 |        0x10ffff.  The  surrogates  cannot  therefore be represented in UTF-16. | 
 |        They can be represented in UTF-8 and UTF-32, but are defined as invalid | 
 |        code points, and cause errors if  encountered  in  a  UTF-8  or  UTF-32 | 
 |        string that is being checked for validity by PCRE2. | 
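 |  | 
 |        The relationship between surrogates and the code points they encode | 
 |        can be illustrated with the standard UTF-16 arithmetic. The following | 
 |        C sketch is not part of PCRE2; the function names is_surrogate() and | 
 |        utf16_surrogate_pair() are hypothetical. | 
 |  | 
```c
#include <assert.h>
#include <stdint.h>

/* Illustrative helpers only (not PCRE2 code). */

/* True for the code points that are invalid in UTF-8 and UTF-32. */
int is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Encode a code point in the range 0x10000 to 0x10ffff as a UTF-16
   surrogate pair (high surrogate first, then low surrogate). */
void utf16_surrogate_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                /* a 20-bit value */
    *high = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF)); /* bottom 10 bits */
}
```
 |  | 
 |        For example, U+1F600 is encoded as the pair 0xd83d, 0xde00. Because | 
 |        every value from 0xd800 to 0xdfff is reserved for this purpose, none | 
 |        of them can stand for a character in its own right in UTF-16. | 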
 |  | 
 |        These  values also cause errors if encountered in escape sequences such | 
 |        as \x{d912} within a pattern. However, it seems that some applications, | 
 |        when using PCRE2 to check for unwanted characters in UTF-8 strings, ex- | 
 |        plicitly  test  for  the  surrogates  using   escape   sequences.   The | 
 |        PCRE2_NO_UTF_CHECK  option  does not disable the error that occurs, be- | 
 |        cause it applies only to the testing of input strings for UTF validity. | 
 |  | 
 |        If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set,  surro- | 
 |        gate  code  point values in UTF-8 and UTF-32 patterns no longer provoke | 
 |        errors and are incorporated in the compiled pattern. However, they  can | 
 |        only  match  subject characters if the matching function is called with | 
 |        PCRE2_NO_UTF_CHECK set. | 
 |  | 
 |          PCRE2_EXTRA_ALT_BSUX | 
 |  | 
 |        The original option PCRE2_ALT_BSUX causes PCRE2 to process \U, \u,  and | 
 |        \x  in  the way that ECMAscript (aka JavaScript) does. Additional func- | 
 |        tionality was defined by ECMAscript 6; setting PCRE2_EXTRA_ALT_BSUX has | 
 |        the effect of PCRE2_ALT_BSUX, but in addition it  recognizes  \u{hhh..} | 
 |        as a hexadecimal character code, where hhh.. is any number of hexadeci- | 
 |        mal digits. | 
 |  | 
 |          PCRE2_EXTRA_ASCII_BSD | 
 |  | 
 |        This  option  forces \d to match only ASCII digits, even when PCRE2_UCP | 
 |        is set.  It can be changed within a pattern by means of the  (?aD)  op- | 
 |        tion setting. | 
 |  | 
 |          PCRE2_EXTRA_ASCII_BSS | 
 |  | 
 |        This  option  forces \s to match only ASCII space characters, even when | 
 |        PCRE2_UCP is set. It can be changed within a pattern by  means  of  the | 
 |        (?aS) option setting. | 
 |  | 
 |          PCRE2_EXTRA_ASCII_BSW | 
 |  | 
 |        This  option  forces  \w to match only ASCII word characters, even when | 
 |        PCRE2_UCP is set. It can be changed within a pattern by  means  of  the | 
 |        (?aW) option setting. | 
 |  | 
 |          PCRE2_EXTRA_ASCII_DIGIT | 
 |  | 
 |        This option forces the POSIX character classes [:digit:] and [:xdigit:] | 
 |        to  match  only  ASCII  digits,  even  when PCRE2_UCP is set. It can be | 
 |        changed within a pattern by means of the (?aT) option setting. | 
 |  | 
 |          PCRE2_EXTRA_ASCII_POSIX | 
 |  | 
 |        This option forces all the POSIX character classes, including [:digit:] | 
 |        and [:xdigit:], to match only ASCII characters, even when PCRE2_UCP  is | 
 |        set.  It  can  be changed within a pattern by means of the (?aP) option | 
 |        setting, but note that this also sets PCRE2_EXTRA_ASCII_DIGIT in  order | 
 |        to ensure that (?-aP) unsets all ASCII restrictions for POSIX classes. | 
 |  | 
 |          PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL | 
 |  | 
 |        This  is a dangerous option. Use with care. By default, an unrecognized | 
 |        escape such as \j or a malformed one such as \x{2z} causes  a  compile- | 
 |        time error when detected by pcre2_compile(). Perl is somewhat inconsis- | 
 |        tent  in  handling  such items: for example, \j is treated as a literal | 
 |        "j", and non-hexadecimal digits in \x{} are just ignored, though  warn- | 
 |        ings  are given in both cases if Perl's warning switch is enabled. How- | 
 |        ever, a malformed octal number after \o{  always  causes  an  error  in | 
 |        Perl. | 
 |  | 
 |        If  the  PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL  extra  option  is passed to | 
 |        pcre2_compile(), all unrecognized or  malformed  escape  sequences  are | 
 |        treated  as  single-character escapes. For example, \j is a literal "j" | 
 |        and \x{2z} is treated as the literal string "x{2z}". Setting  this  op- | 
 |        tion means that typos in patterns may go undetected and have unexpected | 
 |        results.  Also  note  that a sequence such as [\N{] is interpreted as a | 
 |        malformed attempt at [\N{...}] and so is treated as [N{]  whereas  [\N] | 
 |        gives an error because an unqualified \N is a valid escape sequence but | 
 |        is  not supported in a character class. To reiterate: this is a danger- | 
 |        ous option. Use with great care. | 
 |  | 
 |          PCRE2_EXTRA_CASELESS_RESTRICT | 
 |  | 
 |        When either PCRE2_UCP or PCRE2_UTF is set,  caseless  matching  follows | 
 |        Unicode rules, which allow for more than two cases per character. There | 
 |        are two case-equivalent character sets that contain both ASCII and non- | 
 |        ASCII characters. The ASCII letter S is case-equivalent to U+017f (long | 
 |        S)  and  the ASCII letter K is case-equivalent to U+212a (Kelvin sign). | 
 |        This option disables recognition of case-equivalences  that  cross  the | 
 |        ASCII/non-ASCII boundary. In a caseless match, both characters must ei- | 
 |        ther  be ASCII or non-ASCII. The option can be changed within a pattern | 
 |        by the (*CASELESS_RESTRICT) or (?r) option settings. | 
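 |  | 
 |        The effect on the two mixed sets can be modelled in C as shown here. | 
 |        This sketch is not PCRE2 code and covers only the K and S sets named | 
 |        above; the function names are hypothetical. | 
 |  | 
```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only (not PCRE2 code). Identifies the two
   case-equivalent sets that mix ASCII and non-ASCII characters. */
static int mixed_set_id(uint32_t cp)
{
    switch (cp) {
    case 'K': case 'k': case 0x212A: return 1;   /* Kelvin sign set */
    case 'S': case 's': case 0x017F: return 2;   /* long S set */
    default: return 0;                           /* not in either set */
    }
}

/* Are a and b case-equivalent members of one of the two mixed sets? */
int mixed_set_equivalent(uint32_t a, uint32_t b, int caseless_restrict)
{
    assert(a <= 0x10FFFF && b <= 0x10FFFF);
    int id = mixed_set_id(a);
    if (id == 0 || id != mixed_set_id(b))
        return 0;
    /* With the restriction, both must be ASCII or both non-ASCII. */
    if (caseless_restrict && ((a < 0x80) != (b < 0x80)))
        return 0;
    return 1;
}
```
 |  | 
 |        In this model, "k" and U+212a are equivalent only when the restriction | 
 |        is off, whereas "K" and "k" are equivalent in both cases. | 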
 |  | 
 |          PCRE2_EXTRA_ESCAPED_CR_IS_LF | 
 |  | 
 |        There are some legacy applications where the escape sequence  \r  in  a | 
 |        pattern  is expected to match a newline. If this option is set, \r in a | 
 |        pattern is converted to \n so that it matches a LF  (linefeed)  instead | 
 |        of  a CR (carriage return) character. The option does not affect a lit- | 
 |        eral CR in the pattern, nor does it affect CR specified as an  explicit | 
 |        code point such as \x{0D}. | 
 |  | 
 |          PCRE2_EXTRA_MATCH_LINE | 
 |  | 
 |        This  option  is  provided  for  use  by the -x option of pcre2grep. It | 
 |        causes the pattern only to match complete lines. This  is  achieved  by | 
 |        automatically  inserting  the  code for "^(?:" at the start of the com- | 
 |        piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE  is  set, | 
 |        the  matched  line may be in the middle of the subject string. This op- | 
 |        tion can be used with PCRE2_LITERAL. | 
 |  | 
 |          PCRE2_EXTRA_MATCH_WORD | 
 |  | 
 |        This option is provided for use by  the  -w  option  of  pcre2grep.  It | 
 |        causes  the  pattern only to match strings that have a word boundary at | 
 |        the start and the end. This is achieved by automatically inserting  the | 
 |        code  for "\b(?:" at the start of the compiled pattern and ")\b" at the | 
 |        end. The option may be used with PCRE2_LITERAL. However, it is  ignored | 
 |        if PCRE2_EXTRA_MATCH_LINE is also set. | 
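 |  | 
 |        For an ordinary (non-literal) pattern, the effect of these two options | 
 |        is roughly that of wrapping the pattern text by hand, as in the C | 
 |        sketch below. This is an approximation for illustration only: PCRE2 | 
 |        inserts compiled code rather than text, so the two are not identical | 
 |        (for example, with PCRE2_LITERAL, or when the pattern ends in a | 
 |        PCRE2_EXTENDED comment). The name wrap_pattern() is hypothetical. | 
 |  | 
```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative approximation only (not PCRE2 code): builds the textual
   equivalent of PCRE2_EXTRA_MATCH_LINE or PCRE2_EXTRA_MATCH_WORD.
   Returns 0 on success, or -1 if the output buffer is too small. */
int wrap_pattern(const char *pattern, int match_line, int match_word,
                 char *out, size_t outsize)
{
    assert(pattern != NULL && out != NULL);
    const char *prefix = "";
    const char *suffix = "";
    if (match_line) {            /* MATCH_WORD is ignored in this case */
        prefix = "^(?:";
        suffix = ")$";
    } else if (match_word) {
        prefix = "\\b(?:";
        suffix = ")\\b";
    }
    size_t need = strlen(prefix) + strlen(pattern) + strlen(suffix) + 1;
    if (need > outsize)
        return -1;
    snprintf(out, outsize, "%s%s%s", prefix, pattern, suffix);
    return 0;
}
```
 |  | 
 |        With the pattern "cat", the sketch produces "^(?:cat)$" when | 
 |        match_line is non-zero, and "\b(?:cat)\b" when only match_word is. | 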
 |  | 
 |          PCRE2_EXTRA_NO_BS0 | 
 |  | 
 |        If this option is set (note that its final character is the digit 0) it | 
 |        locks  out  the  use  of the sequence \0 unless at least one more octal | 
 |        digit follows. | 
 |  | 
 |          PCRE2_EXTRA_PYTHON_OCTAL | 
 |  | 
 |        If this option is set, PCRE2 follows Python's  rules  for  interpreting | 
 |        octal escape sequences. In particular, sequences such as \14, which | 
 |        could be either an octal number or a backreference, are handled | 
 |        differently from the default. Details are given in the pcre2pattern | 
 |        documentation. | 
 |  | 
 |          PCRE2_EXTRA_NEVER_CALLOUT | 
 |  | 
 |        If this option is set, PCRE2 treats callouts in the pattern as a syntax | 
 |        error, returning PCRE2_ERROR_CALLOUT_CALLER_DISABLED. This is useful if | 
 |        the   application  knows  that  a  callout  will  not  be  provided  to | 
 |        pcre2_match(), so that callouts in the pattern  are  not  silently  ig- | 
 |        nored. | 
 |  | 
 |          PCRE2_EXTRA_TURKISH_CASING | 
 |  | 
 |        This  option  alters  case-equivalence of the 'i' letters to follow the | 
 |        alphabet used by Turkish and Azeri languages. The option can be changed | 
 |        within a pattern by the (*TURKISH_CASING) start-of-pattern setting. Ei- | 
 |        ther the UTF or UCP options must be set. In the 8-bit library, UTF must | 
 |        be set. This option cannot be  combined  with  PCRE2_EXTRA_CASELESS_RE- | 
 |        STRICT. | 
 |  | 
 |  | 
 | JUST-IN-TIME (JIT) COMPILATION | 
 |  | 
 |        int pcre2_jit_compile(pcre2_code *code, uint32_t options); | 
 |  | 
 |        int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_jit_stack *pcre2_jit_stack_create(size_t startsize, | 
 |          size_t maxsize, pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_jit_stack_assign(pcre2_match_context *mcontext, | 
 |          pcre2_jit_callback callback_function, void *callback_data); | 
 |  | 
 |        void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack); | 
 |  | 
 |        These  functions  provide  support  for  JIT compilation, which, if the | 
 |        just-in-time compiler is available, further processes a  compiled  pat- | 
 |        tern into machine code that executes much faster than the pcre2_match() | 
 |        interpretive  matching function. Full details are given in the pcre2jit | 
 |        documentation. | 
 |  | 
 |        JIT compilation is a heavyweight optimization. It can  take  some  time | 
 |        for  patterns  to  be analyzed, and for one-off matches and simple pat- | 
 |        terns the benefit of faster execution might be offset by a much  slower | 
 |        compilation  time.  Most (but not all) patterns can be optimized by the | 
 |        JIT compiler. | 
 |  | 
 |  | 
 | LOCALE SUPPORT | 
 |  | 
 |        const uint8_t *pcre2_maketables(pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_maketables_free(pcre2_general_context *gcontext, | 
 |          const uint8_t *tables); | 
 |  | 
 |        PCRE2 handles caseless matching, and determines whether characters  are | 
 |        letters,  digits, or whatever, by reference to a set of tables, indexed | 
 |        by character code point. However, this applies only to characters whose | 
 |        code points are less than 256. By default,  higher-valued  code  points | 
 |        never match escapes such as \w or \d. | 
 |  | 
 |        When PCRE2 is built with Unicode support (the default), certain Unicode | 
 |        character  properties  can be tested with \p and \P, or, alternatively, | 
 |        the PCRE2_UCP option can be set when a pattern is compiled; this causes | 
 |        \w and friends to use Unicode property support instead of the  built-in | 
 |        tables.  PCRE2_UCP also causes upper/lower casing operations on charac- | 
 |        ters with code points greater than 127 to use Unicode properties. These | 
 |        effects  apply even when PCRE2_UTF is not set. There are, however, some | 
 |        PCRE2_EXTRA options (see above) that can be used to modify or  suppress | 
 |        them. | 
 |  | 
 |        The  use  of  locales  with Unicode is discouraged. If you are handling | 
 |        characters with code points greater than 127,  you  should  either  use | 
 |        Unicode support, or use locales, but not try to mix the two. | 
 |  | 
 |        PCRE2  contains a built-in set of character tables that are used by de- | 
 |        fault.  These are sufficient for many applications. Normally,  the  in- | 
 |        ternal  tables  recognize only ASCII characters. However, when PCRE2 is | 
 |        built, it is possible to cause the internal tables to be rebuilt in the | 
 |        default "C" locale of the local system, which may cause them to be dif- | 
 |        ferent. | 
 |  | 
 |        The built-in tables can be overridden by tables supplied by the  appli- | 
 |        cation  that  calls  PCRE2.  These may be created in a different locale | 
 |        from the default.  As more and more applications change to  using  Uni- | 
 |        code, the need for this locale support is expected to die away. | 
 |  | 
 |        External  tables  are built by calling the pcre2_maketables() function, | 
 |        in the relevant locale. The only argument to this function is a general | 
 |        context, which can be used to pass a custom memory  allocator.  If  the | 
 |        argument is NULL, the system malloc() is used. The result can be passed | 
 |        to pcre2_compile() as often as necessary, by creating a compile context | 
 |        and  calling  pcre2_set_character_tables()  to  set  the tables pointer | 
 |        therein. | 
 |  | 
 |        For example, to build and use  tables  that  are  appropriate  for  the | 
 |        French  locale  (where accented characters with values greater than 127 | 
 |        are treated as letters), the following code could be used: | 
 |  | 
 |          setlocale(LC_CTYPE, "fr_FR"); | 
 |          tables = pcre2_maketables(NULL); | 
 |          ccontext = pcre2_compile_context_create(NULL); | 
 |          pcre2_set_character_tables(ccontext, tables); | 
 |          re = pcre2_compile(..., ccontext); | 
 |  | 
 |        The locale name "fr_FR" is used on Linux and other  Unix-like  systems; | 
 |        if you are using Windows, the name for the French locale is "french". | 
 |  | 
 |        The pointer that is passed (via the compile context) to pcre2_compile() | 
 |        is saved with the compiled pattern, and the same tables are used by the | 
 |        matching  functions.  Thus,  for  any  single  pattern, compilation and | 
 |        matching both happen in the same locale, but different patterns can  be | 
 |        processed in different locales. | 
 |  | 
 |        It  is the caller's responsibility to ensure that the memory containing | 
 |        the tables remains available while they are still in use. When they are | 
 |        no longer needed, you can discard them  using  pcre2_maketables_free(), | 
 |        passing as its first argument the same general context  that  was  used | 
 |        to create the tables. | 
 |  | 
 |    Saving locale tables | 
 |  | 
 |        The tables described above are just a sequence of binary  bytes,  which | 
 |        makes  them  independent of hardware characteristics such as endianness | 
 |        or whether the processor is 32-bit or 64-bit. A copy of the  result  of | 
 |        pcre2_maketables()  can  therefore  be saved in a file or elsewhere and | 
 |        re-used later, even in a different program or on another computer.  The | 
 |        size  of  the  tables  (number  of  bytes)  must be obtained by calling | 
 |        pcre2_config()  with  the  PCRE2_CONFIG_TABLES_LENGTH  option   because | 
 |        pcre2_maketables()   does   not   return  this  value.  Note  that  the | 
 |        pcre2_dftables program, which is part of the PCRE2 build system, can be | 
 |        used stand-alone to create a file that contains a set of binary tables. | 
 |        See the pcre2build documentation for details. | 
 |  | 
 |  | 
 | INFORMATION ABOUT A COMPILED PATTERN | 
 |  | 
 |        int pcre2_pattern_info(const pcre2_code *code, uint32_t what, | 
 |          void *where); | 
 |  | 
 |        The pcre2_pattern_info() function returns general information  about  a | 
 |        compiled pattern. For information about callouts, see the next section. | 
 |        The  first  argument  for pcre2_pattern_info() is a pointer to the com- | 
 |        piled pattern. The second argument specifies which piece of information | 
 |        is required, and the third argument is a pointer to a variable  to  re- | 
 |        ceive  the  data.  If the third argument is NULL, the first argument is | 
 |        ignored, and the function returns the size in  bytes  of  the  variable | 
 |        that is required for the information requested. Otherwise, the yield of | 
 |        the function is zero for success, or one of the following negative num- | 
 |        bers: | 
 |  | 
 |          PCRE2_ERROR_NULL           the argument code was NULL | 
 |          PCRE2_ERROR_BADMAGIC       the "magic number" was not found | 
 |          PCRE2_ERROR_BADOPTION      the value of what was invalid | 
 |          PCRE2_ERROR_UNSET          the requested field is not set | 
 |  | 
 |        The "magic number" is placed at the start of each compiled pattern as a | 
 |        simple  check  against  passing  an arbitrary memory pointer. Here is a | 
 |        typical call of pcre2_pattern_info(), to obtain the length of the  com- | 
 |        piled pattern: | 
 |  | 
 |          int rc; | 
 |          size_t length; | 
 |          rc = pcre2_pattern_info( | 
 |            re,               /* result of pcre2_compile() */ | 
 |            PCRE2_INFO_SIZE,  /* what is required */ | 
 |            &length);         /* where to put the data */ | 
 |  | 
 |        The possible values for the second argument are defined in pcre2.h, and | 
 |        are as follows: | 
 |  | 
 |          PCRE2_INFO_ALLOPTIONS | 
 |          PCRE2_INFO_ARGOPTIONS | 
 |          PCRE2_INFO_EXTRAOPTIONS | 
 |  | 
 |        Return copies of the pattern's options. The third argument should point | 
 |        to  a  uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the op- | 
 |        tions that were passed to  pcre2_compile(),  whereas  PCRE2_INFO_ALLOP- | 
 |        TIONS  returns  the compile options as modified by any top-level (*XXX) | 
 |        option settings such as (*UTF) at the  start  of  the  pattern  itself. | 
 |        PCRE2_INFO_EXTRAOPTIONS  returns the extra options that were set in the | 
 |        compile context by calling the pcre2_set_compile_extra_options()  func- | 
 |        tion. | 
 |  | 
 |        For  example, if the pattern /(*UTF)abc/ is compiled with the PCRE2_EX- | 
 |        TENDED option, the result for PCRE2_INFO_ALLOPTIONS  is  PCRE2_EXTENDED | 
 |        and  PCRE2_UTF.   Option settings such as (?i) that can change within a | 
 |        pattern do not affect the result of PCRE2_INFO_ALLOPTIONS, even if they | 
 |        appear right at the start of the pattern. (This was different  in  some | 
 |        earlier releases.) | 
 |  | 
 |        A  pattern compiled without PCRE2_ANCHORED is automatically anchored by | 
 |        PCRE2 if the first significant item in every top-level branch is one of | 
 |        the following: | 
 |  | 
 |          ^     unless PCRE2_MULTILINE is set | 
 |          \A    always | 
 |          \G    always | 
 |          .*    sometimes - see below | 
 |  | 
 |        When .* is the first significant item, anchoring is possible only  when | 
 |        all the following are true: | 
 |  | 
 |          .* is not in an atomic group | 
 |          .* is not in a capture group that is the subject | 
 |               of a backreference | 
 |          PCRE2_DOTALL is in force for .* | 
 |          Neither (*PRUNE) nor (*SKIP) appears in the pattern | 
 |          PCRE2_NO_DOTSTAR_ANCHOR is not set | 
 |          Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF | 
 |  | 
 |        For  patterns  that are auto-anchored, the PCRE2_ANCHORED bit is set in | 
 |        the options returned for PCRE2_INFO_ALLOPTIONS. | 
 |  | 
 |          PCRE2_INFO_BACKREFMAX | 
 |  | 
 |        Return the number of the highest  backreference  in  the  pattern.  The | 
 |        third  argument  should  point  to  a  uint32_t variable. Named capture | 
 |        groups acquire numbers as well as names, and these  count  towards  the | 
 |        highest  backreference.  Backreferences  such as \4 or \g{12} match the | 
 |        captured characters of the given group, but in addition, the check that | 
 |        a capture group is set in a conditional group such as (?(3)a|b) is also | 
 |        a backreference.  Zero is returned if there are no backreferences. | 
 |  | 
 |          PCRE2_INFO_BSR | 
 |  | 
 |        The output is a uint32_t integer whose value indicates  what  character | 
 |        sequences  the \R escape sequence matches. A value of PCRE2_BSR_UNICODE | 
 |        means that \R matches any Unicode line  ending  sequence;  a  value  of | 
 |        PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF. | 
 |  | 
 |          PCRE2_INFO_CAPTURECOUNT | 
 |  | 
 |        Return  the  highest  capture  group number in the pattern. In patterns | 
 |        where (?| is not used, this is also the total number of capture groups. | 
 |        The third argument should point to a uint32_t variable. | 
 |  | 
 |          PCRE2_INFO_DEPTHLIMIT | 
 |  | 
 |        If the pattern set a backtracking depth limit by including an  item  of | 
 |        the  form  (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The | 
 |        third argument should point to a uint32_t integer. If no such value has | 
 |        been set, the call to pcre2_pattern_info() returns the error  PCRE2_ER- | 
 |        ROR_UNSET. Note that this limit will only be used during matching if it | 
 |        is  less  than  the  limit  set or defaulted by the caller of the match | 
 |        function. | 
 |  | 
 |          PCRE2_INFO_FIRSTBITMAP | 
 |  | 
 |        In the absence of a single first code unit for a non-anchored  pattern, | 
 |        pcre2_compile()  may construct a 256-bit table that defines a fixed set | 
 |        of values for the first code unit in any match. For example, a  pattern | 
 |        that  starts  with  [abc]  results in a table with three bits set. When | 
 |        code unit values greater than 255 are supported, the flag bit  for  255 | 
 |        means  "any  code unit of value 255 or above". If such a table was con- | 
 |        structed, a pointer to it is returned. Otherwise NULL is returned.  The | 
 |        third argument should point to a const uint8_t * variable. | 
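 |  | 
 |        A minimal sketch of how such a table might be consulted follows. The | 
 |        helper name first_unit_possible() is invented, and the bit layout | 
 |        (byte index c/8, bit c&7 within the byte) is an assumption of this | 
 |        example, since it is not specified above. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return 1 if a match could start with code_unit
   according to a 256-bit first-code-unit table, 0 if it could not.
   ASSUMPTION: bit c lives in byte c/8 at bit position c&7. */
static int first_unit_possible(const uint8_t *bitmap, uint32_t code_unit)
{
    if (bitmap == NULL) return 1;            /* no table: nothing ruled out */
    if (code_unit > 255) code_unit = 255;    /* bit 255 covers 255 and above */
    return (bitmap[code_unit / 8] & (1u << (code_unit & 7))) != 0;
}
```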
 |  | 
 |          PCRE2_INFO_FIRSTCODETYPE | 
 |  | 
 |        Return information about the first code unit of any matched string, for | 
 |        a  non-anchored  pattern. The third argument should point to a uint32_t | 
 |        variable. If there is a fixed first value, for example, the letter  "c" | 
 |        from  a  pattern such as (cat|cow|coyote), 1 is returned, and the value | 
 |        can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is  no  fixed | 
 |        first  value,  but it is known that a match can occur only at the start | 
 |        of the subject or following a newline in the subject,  2  is  returned. | 
 |        Otherwise, and for anchored patterns, 0 is returned. | 
 |  | 
 |          PCRE2_INFO_FIRSTCODEUNIT | 
 |  | 
 |        Return  the  value  of  the first code unit of any matched string for a | 
 |        pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise  return  0. | 
 |        The  third  argument  should point to a uint32_t variable. In the 8-bit | 
 |        library, the value is always less than 256. In the 16-bit  library  the | 
 |        value  can  be  up  to 0xffff. In the 32-bit library in UTF-32 mode the | 
 |        value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32 | 
 |        mode. | 
 |  | 
 |          PCRE2_INFO_FRAMESIZE | 
 |  | 
 |        Return the size (in bytes) of the data frames that are used to remember | 
 |        backtracking positions when the pattern is processed  by  pcre2_match() | 
 |        without  the  use  of  JIT. The third argument should point to a size_t | 
 |        variable. The frame size depends on the number of capturing parentheses | 
 |        in the pattern. Each additional capture group adds two PCRE2_SIZE vari- | 
 |        ables. | 
 |  | 
 |          PCRE2_INFO_HASBACKSLASHC | 
 |  | 
 |        Return 1 if the pattern contains any instances of \C, otherwise 0.  The | 
 |        third argument should point to a uint32_t variable. | 
 |  | 
 |          PCRE2_INFO_HASCRORLF | 
 |  | 
 |        Return  1  if  the  pattern  contains any explicit matches for CR or LF | 
 |        characters, otherwise 0. The third argument should point to a  uint32_t | 
 |        variable.  An explicit match is either a literal CR or LF character, or | 
 |        \r or \n or one of the  equivalent  hexadecimal  or  octal  escape  se- | 
 |        quences. | 
 |  | 
 |          PCRE2_INFO_HEAPLIMIT | 
 |  | 
 |        If the pattern set a heap memory limit by including an item of the form | 
 |        (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu- | 
 |        ment should point to a uint32_t integer. If no such value has been set, | 
 |        the  call  to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET. | 
 |        Note that this limit will only be used during matching if  it  is  less | 
 |        than the limit set or defaulted by the caller of the match function. | 
 |  | 
 |          PCRE2_INFO_JCHANGED | 
 |  | 
 |        Return  1  if  the (?J) or (?-J) option setting is used in the pattern, | 
 |        otherwise 0. The third argument should point to  a  uint32_t  variable. | 
 |        (?J)  and  (?-J) set and unset the local PCRE2_DUPNAMES option, respec- | 
 |        tively. | 
 |  | 
 |          PCRE2_INFO_JITSIZE | 
 |  | 
 |        If the compiled pattern was successfully  processed  by  pcre2_jit_com- | 
 |        pile(),  return  the  size  of  the JIT compiled code, otherwise return | 
 |        zero. The third argument should point to a size_t variable. | 
 |  | 
 |          PCRE2_INFO_LASTCODETYPE | 
 |  | 
 |        Return 1 if there is a rightmost literal code unit that must  exist  in | 
 |        any  matched string, other than at its start. The third argument should | 
 |        point to a uint32_t variable. If there is no such value, 0 is returned. | 
 |        When 1 is returned, the code unit value itself can be  retrieved  using | 
 |        PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last literal value is | 
 |        recorded  only if it follows something of variable length. For example, | 
 |        for the pattern /^a\d+z\d+/ the returned value is 1 (with "z"  returned | 
 |        from  PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/ the returned value is | 
 |        0. | 
 |  | 
 |          PCRE2_INFO_LASTCODEUNIT | 
 |  | 
 |        Return the value of the rightmost literal code unit that must exist  in | 
 |        any  matched  string,  other  than  at  its  start, for a pattern where | 
 |        PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu- | 
 |        ment should point to a uint32_t variable. | 
 |  | 
 |          PCRE2_INFO_MATCHEMPTY | 
 |  | 
 |        Return 1 if the pattern might match an empty string, otherwise  0.  The | 
 |        third argument should point to a uint32_t variable. When a pattern con- | 
 |        tains recursive subroutine calls it is not always possible to determine | 
 |        whether or not it can match an empty string. PCRE2 takes a cautious ap- | 
 |        proach and returns 1 in such cases. | 
 |  | 
 |          PCRE2_INFO_MATCHLIMIT | 
 |  | 
 |        If  the  pattern  set  a  match  limit by including an item of the form | 
 |        (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third  ar- | 
 |        gument  should  point  to a uint32_t integer. If no such value has been | 
 |        set, the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UN- | 
 |        SET. Note that this limit will only be used during matching  if  it  is | 
 |        less  than  the limit set or defaulted by the caller of the match func- | 
 |        tion. | 
 |  | 
 |          PCRE2_INFO_MAXLOOKBEHIND | 
 |  | 
 |        A lookbehind assertion moves back a certain number of  characters  (not | 
 |        code  units)  when  it starts to process each of its branches. This re- | 
 |        quest returns the largest of these backward moves. The  third  argument | 
 |        should point to a uint32_t integer. The simple assertions \b and \B re- | 
 |        quire  a one-character lookbehind and cause PCRE2_INFO_MAXLOOKBEHIND to | 
 |        return 1 in the absence of anything longer. \A also  registers  a  one- | 
 |        character  lookbehind, though it does not actually inspect the previous | 
 |        character. | 
 |  | 
 |        Note that this information is useful for multi-segment matching only if | 
 |        the pattern contains no nested lookbehinds. For  example,  the  pattern | 
 |        (?<=a(?<=ba)c)  returns  a  maximum  lookbehind  of  2,  but when it is | 
 |        processed, the first lookbehind moves back by two  characters,  matches | 
 |        one  character, then the nested lookbehind also moves back by two char- | 
 |        acters. This puts the matching point three characters earlier  than  it | 
 |        was  at the start.  PCRE2_INFO_MAXLOOKBEHIND is really only useful as a | 
 |        debugging tool. See the pcre2partial documentation for a discussion  of | 
 |        multi-segment matching. | 
 |  | 
 |          PCRE2_INFO_MINLENGTH | 
 |  | 
 |        If  a  minimum  length  for  matching subject strings was computed, its | 
 |        value is returned. Otherwise the returned value is 0. This value is not | 
 |        computed when PCRE2_NO_START_OPTIMIZE is set. The value is a number  of | 
 |        characters,  which in UTF mode may be different from the number of code | 
 |        units. The third argument should point  to  a  uint32_t  variable.  The | 
 |        value  is a lower bound to the length of any matching string. There may | 
 |        not be any strings of that length that do  actually  match,  but  every | 
 |        string that does match is at least that long. | 
 |  | 
 |          PCRE2_INFO_NAMECOUNT | 
 |          PCRE2_INFO_NAMEENTRYSIZE | 
 |          PCRE2_INFO_NAMETABLE | 
 |  | 
 |        PCRE2 supports the use of named as well as numbered capturing parenthe- | 
 |        ses.  The names are just an additional way of identifying the parenthe- | 
 |        ses, which still acquire numbers. Several convenience functions such as | 
 |        pcre2_substring_get_byname() are provided for extracting captured  sub- | 
 |        strings  by  name. It is also possible to extract the data directly, by | 
 |        first converting the name to a number in order to  access  the  correct | 
 |        pointers  in the output vector (described with pcre2_match() below). To | 
 |        do the conversion, you need to use the name-to-number map, which is de- | 
 |        scribed by these three values. | 
 |  | 
 |        The map consists of a number of  fixed-size  entries.  PCRE2_INFO_NAME- | 
 |        COUNT  gives  the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives | 
 |        the size of each entry in code units; both of these return  a  uint32_t | 
 |        value. The entry size depends on the length of the longest name. | 
 |  | 
 |        PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table. | 
 |        This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit li- | 
 |        brary,  the first two bytes of each entry are the number of the captur- | 
 |        ing parenthesis, most significant byte first. In  the  16-bit  library, | 
 |        the  pointer  points  to 16-bit code units, the first of which contains | 
 |        the parenthesis number. In the 32-bit library, the  pointer  points  to | 
 |        32-bit  code units, the first of which contains the parenthesis number. | 
 |        The rest of the entry is the corresponding name, zero terminated. | 
 |  | 
 |        The names are in alphabetical order. If (?| is used to create  multiple | 
 |        capture groups with the same number, as described in the section on du- | 
 |        plicate group numbers in the pcre2pattern page, the groups may be given | 
 |        the  same  name,  but  there  is only one entry in the table. Different | 
 |        names for groups of the same number are not permitted. | 
 |  | 
 |        Duplicate names for capture groups with different numbers  are  permit- | 
 |        ted, but only if PCRE2_DUPNAMES is set. They appear in the table in the | 
 |        order  in  which  they were found in the pattern. In the absence of (?| | 
 |        this is the order of increasing number; when (?| is used  this  is  not | 
 |        necessarily  the  case because later capture groups may have lower num- | 
 |        bers. | 
 |  | 
 |        As a simple example of the name/number table,  consider  the  following | 
 |        pattern  after  compilation by the 8-bit library (assume PCRE2_EXTENDED | 
 |        is set, so white space - including newlines - is ignored): | 
 |  | 
 |          (?<date> (?<year>(\d\d)?\d\d) - | 
 |          (?<month>\d\d) - (?<day>\d\d) ) | 
 |  | 
 |        There are four named capture groups, so the table has four entries, and | 
 |        each entry in the table is eight bytes long. The table is  as  follows, | 
 |        with non-printing bytes shown in hexadecimal, and undefined bytes shown | 
 |        as ??: | 
 |  | 
 |          00 01 d  a  t  e  00 ?? | 
 |          00 05 d  a  y  00 ?? ?? | 
 |          00 04 m  o  n  t  h  00 | 
 |          00 02 y  e  a  r  00 ?? | 
 |  | 
 |        When  writing  code to extract data from named capture groups using the | 
 |        name-to-number map, remember that the length of the entries  is  likely | 
 |        to be different for each compiled pattern. | 
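 |  | 
 |        To make the table layout concrete, here is a minimal sketch of a | 
 |        lookup over the 8-bit form of the name-to-number map. The function | 
 |        group_number_for_name() is invented for this example; real code | 
 |        would obtain the table, the entry count, and the entry size from | 
 |        pcre2_pattern_info(). When duplicate names are present, this sketch | 
 |        simply returns the first entry that it finds. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for the 8-bit library layout: each entry is
   entry_size bytes, holding a two-byte group number (most significant
   byte first) followed by a zero-terminated name.
   Returns the group number, or 0 if the name is not in the table. */
static uint32_t group_number_for_name(const uint8_t *table,
                                      uint32_t name_count,
                                      uint32_t entry_size,
                                      const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        /* The stride is the per-pattern entry size, never a constant. */
        const uint8_t *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return ((uint32_t)entry[0] << 8) | entry[1];
    }
    return 0;
}
```
 |  | 
 |        Applied to the four-entry table shown above, with an entry size of | 
 |        8, a lookup of "month" yields 4 and a lookup of "day" yields 5. | 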
 |  | 
 |          PCRE2_INFO_NEWLINE | 
 |  | 
 |        The output is one of the following uint32_t values: | 
 |  | 
 |          PCRE2_NEWLINE_CR       Carriage return (CR) | 
 |          PCRE2_NEWLINE_LF       Linefeed (LF) | 
 |          PCRE2_NEWLINE_CRLF     Carriage return, linefeed (CRLF) | 
 |          PCRE2_NEWLINE_ANY      Any Unicode line ending | 
 |          PCRE2_NEWLINE_ANYCRLF  Any of CR, LF, or CRLF | 
 |          PCRE2_NEWLINE_NUL      The NUL character (binary zero) | 
 |  | 
 |        This identifies the character sequence that will be recognized as mean- | 
 |        ing "newline" while matching. | 
 |  | 
 |          PCRE2_INFO_SIZE | 
 |  | 
 |        Return  the  size  of  the compiled pattern in bytes (for all three li- | 
 |        braries). The third argument should point to a  size_t  variable.  This | 
 |        value  includes  the  size  of the general data block that precedes the | 
 |        code units of the compiled pattern itself. The value that is used  when | 
 |        pcre2_compile()  is  getting memory in which to place the compiled pat- | 
 |        tern may be slightly larger than the value returned by this option, be- | 
 |        cause there are cases where the code that calculates the  size  has  to | 
 |        over-estimate.  Processing a pattern with the JIT compiler does not al- | 
 |        ter the value returned by this option. | 
 |  | 
 |  | 
 | INFORMATION ABOUT A PATTERN'S CALLOUTS | 
 |  | 
 |        int pcre2_callout_enumerate(const pcre2_code *code, | 
 |          int (*callback)(pcre2_callout_enumerate_block *, void *), | 
 |          void *user_data); | 
 |  | 
 |        A script language that supports the use of string arguments in callouts | 
 |        might like to scan all the callouts in a  pattern  before  running  the | 
 |        match. This can be done by calling pcre2_callout_enumerate(). The first | 
 |        argument  is  a  pointer  to a compiled pattern, the second points to a | 
 |        callback function, and the third is arbitrary user data.  The  callback | 
 |        function  is  called  for  every callout in the pattern in the order in | 
 |        which they appear. Its first argument is a pointer to a callout enumer- | 
 |        ation block, and its second argument is the user_data  value  that  was | 
 |        passed  to  pcre2_callout_enumerate(). The contents of the callout enu- | 
 |        meration block are described in the pcre2callout  documentation,  which | 
 |        also gives further details about callouts. | 
 |  | 
 |  | 
 | SERIALIZATION AND PRECOMPILING | 
 |  | 
 |        It  is possible to save compiled patterns on disc or elsewhere, and re- | 
 |        load them later, subject to a number of restrictions. The host on which | 
 |        the patterns are reloaded must be running the same  version  of  PCRE2, | 
 |        with  the same code unit width, and must also have the same endianness, | 
 |        pointer width, and PCRE2_SIZE type. Before  compiled  patterns  can  be | 
 |        saved, they must be converted to a "serialized" form, which in the case | 
 |        of PCRE2 is really just a bytecode dump.  The functions whose names be- | 
 |        gin with pcre2_serialize_ are used for converting to and from the seri- | 
 |        alized  form.  They  are described in the pcre2serialize documentation. | 
 |        Note that PCRE2 serialization does not convert compiled patterns to  an | 
 |        abstract format like Java or .NET serialization. | 
 |  | 
 |  | 
 | THE MATCH DATA BLOCK | 
 |  | 
 |        pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize, | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        pcre2_match_data *pcre2_match_data_create_from_pattern( | 
 |          const pcre2_code *code, pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_match_data_free(pcre2_match_data *match_data); | 
 |  | 
 |        Information  about  a  successful  or unsuccessful match is placed in a | 
 |        match data block, which is an opaque  structure  that  is  accessed  by | 
 |        function  calls.  In particular, the match data block contains a vector | 
 |        of offsets into the subject string that define the matched parts of the | 
 |        subject. This is known as the ovector. | 
 |  | 
 |        Before calling pcre2_match(), pcre2_dfa_match(),  or  pcre2_jit_match() | 
 |        you must create a match data block by calling one of the creation func- | 
 |        tions  above.  For pcre2_match_data_create(), the first argument is the | 
 |        number of pairs of offsets in the ovector. | 
 |  | 
 |        When using pcre2_match(), one pair of offsets is required  to  identify | 
 |        the  string that matched the whole pattern, with an additional pair for | 
 |        each captured substring. For example, a value of 4 creates enough space | 
 |        to record the matched portion of the subject plus three  captured  sub- | 
 |        strings. | 
 |  | 
 |        When  using  pcre2_dfa_match() there may be multiple matched substrings | 
 |        of different lengths at the same point  in  the  subject.  The  ovector | 
 |        should be made large enough to hold as many as are expected. | 
 |  | 
 |        A minimum of 1 pair is  imposed  by  pcre2_match_data_create(), | 
 |        so it is always possible to return the overall matched  string  in  the | 
 |        case   of   pcre2_match()   or   the  longest  match  in  the  case  of | 
 |        pcre2_dfa_match(). The maximum number of pairs is 65535; if  the  first | 
 |        argument  of  pcre2_match_data_create()  is greater than this, 65535 is | 
 |        used. | 
 |  | 
 |        The second argument of pcre2_match_data_create() is a pointer to a gen- | 
 |        eral context, which can specify custom memory management for  obtaining | 
 |        the memory for the match data block. If you are not using custom memory | 
 |        management, pass NULL, which causes malloc() to be used. | 
 |  | 
 |        For  pcre2_match_data_create_from_pattern(),  the  first  argument is a | 
 |        pointer to a compiled pattern. The ovector is created to be exactly the | 
 |        right size to hold all the substrings  a  pattern  might  capture  when | 
 |        matched using pcre2_match(). You should not use this call when matching | 
 |        with  pcre2_dfa_match().  The  second  argument is again a pointer to a | 
 |        general context, but in this case if NULL is passed, the memory is  ob- | 
 |        tained  using the same allocator that was used for the compiled pattern | 
 |        (custom or default). | 
 |  | 
 |        A match data block can be used many times, with the same  or  different | 
 |        compiled  patterns. You can extract information from a match data block | 
 |        after a match operation has finished,  using  functions  that  are  de- | 
 |        scribed in the sections on matched strings and other match data below. | 
 |  | 
 |        When  a  call  of  pcre2_match()  fails, valid data is available in the | 
 |        match block only  when  the  error  is  PCRE2_ERROR_NOMATCH,  PCRE2_ER- | 
 |        ROR_PARTIAL,  or  one of the error codes for an invalid UTF string. Ex- | 
 |        actly what is available depends on the error, and is detailed below. | 
 |  | 
 |        When one of the matching functions is called, pointers to the  compiled | 
 |        pattern  and the subject string are set in the match data block so that | 
 |        they can be referenced by the extraction functions after  a  successful | 
 |        match. After running a match, you must not free a compiled pattern or a | 
 |        subject  string until after all operations on the match data block (for | 
 |        that match) have taken place,  unless,  in  the  case  of  the  subject | 
 |        string,  you  have used the PCRE2_COPY_MATCHED_SUBJECT option, which is | 
 |        described in the section entitled "Option bits for  pcre2_match()"  be- | 
 |        low. | 
 |  | 
 |        When  a match data block itself is no longer needed, it should be freed | 
 |        by calling pcre2_match_data_free(). If this function is called  with  a | 
 |        NULL argument, it returns immediately, without doing anything. | 
 |  | 
 |  | 
 | MEMORY USE FOR MATCH DATA BLOCKS | 
 |  | 
 |        PCRE2_SIZE pcre2_get_match_data_size(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE pcre2_get_match_data_heapframes_size( | 
 |          pcre2_match_data *match_data); | 
 |  | 
 |        The  size of a match data block depends on the size of the ovector that | 
 |        it contains. The function pcre2_get_match_data_size() returns the size, | 
 |        in bytes, of the block that is its argument. | 
 |  | 
 |        When pcre2_match() runs interpretively (that is, without using JIT), it | 
 |        makes use of a vector of data frames for remembering backtracking posi- | 
 |        tions.  The size of each individual frame depends on the number of cap- | 
 |        turing parentheses in the  pattern  and  can  be  obtained  by  calling | 
 |        pcre2_pattern_info() with the PCRE2_INFO_FRAMESIZE option (see the sec- | 
 |        tion entitled "Information about a compiled pattern" above). | 
 |  | 
 |        Heap  memory is used for the frames vector; if the initial memory block | 
 |        turns out to be too small during  matching,  it  is  automatically  ex- | 
 |        panded.  When  pcre2_match()  returns, the memory is not freed, but re- | 
 |        mains attached to the match data  block,  for  use  by  any  subsequent | 
 |        matches  that  use  the  same block. It is automatically freed when the | 
 |        match data block itself is freed. | 
 |  | 
 |        You can find the current size of the frames vector that  a  match  data | 
 |        block  owns  by  calling  pcre2_get_match_data_heapframes_size(). For a | 
 |        newly created match data block the size will be  zero.  Some  types  of | 
 |        match may require a lot of frames and thus a large vector; applications | 
 |        that run in environments where memory is constrained can check this and | 
 |        free the match data block if the heap frames vector has become too big. | 
 |  | 
 |  | 
 | MATCHING A PATTERN: THE TRADITIONAL FUNCTION | 
 |  | 
 |        int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext); | 
 |  | 
 |        The  function pcre2_match() is called to match a subject string against | 
 |        a compiled pattern, which is passed in the code argument. You can  call | 
 |        pcre2_match() with the same code argument as many times as you like, in | 
 |        order  to  find multiple matches in the subject string or to match dif- | 
 |        ferent subject strings with the same pattern. | 
 |  | 
 |        This function is the main matching facility of the library, and it  op- | 
 |        erates  in  a Perl-like manner. For specialist use there is also an al- | 
 |        ternative matching function, which is described below  in  the  section | 
 |        about the pcre2_dfa_match() function. | 
 |  | 
 |        Here is an example of a simple call to pcre2_match(): | 
 |  | 
 |          pcre2_match_data *md = pcre2_match_data_create(4, NULL); | 
 |          int rc = pcre2_match( | 
 |            re,                         /* result of pcre2_compile() */ | 
 |            (PCRE2_SPTR)"some string",  /* the subject string */ | 
 |            11,                         /* the length of the subject string */ | 
 |            0,                          /* start at offset 0 in the subject */ | 
 |            0,                          /* default options */ | 
 |            md,                         /* the match data block */ | 
 |            NULL);                      /* match context; NULL means defaults */ | 
 |  | 
 |        If  the  subject  string is zero-terminated, the length can be given as | 
 |        PCRE2_ZERO_TERMINATED. A match context must be provided if certain less | 
 |        common matching parameters are to be changed. For details, see the sec- | 
 |        tion on the match context above. | 
 |  | 
 |    The string to be matched by pcre2_match() | 
 |  | 
 |        The subject string is passed to pcre2_match() as a pointer in  subject, | 
 |        a  length  in  length, and a starting offset in startoffset. The length | 
 |        and offset are in code units, not characters.  That  is,  they  are  in | 
 |        bytes  for the 8-bit library, 16-bit code units for the 16-bit library, | 
 |        and 32-bit code units for the 32-bit library, whether or not  UTF  pro- | 
 |        cessing is enabled. As a special case, if subject is NULL and length is | 
 |        zero,  the  subject is assumed to be an empty string. If length is non- | 
 |        zero, an error occurs if subject is NULL. | 
 |  | 
 |        If startoffset is greater than the length of the subject, pcre2_match() | 
 |        returns PCRE2_ERROR_BADOFFSET. When the starting offset  is  zero,  the | 
 |        search  for a match starts at the beginning of the subject, and this is | 
 |        by far the most common case. In UTF-8 or UTF-16 mode, the starting off- | 
 |        set must point to the start of a character, or to the end of  the  sub- | 
 |        ject  (in  UTF-32 mode, one code unit equals one character, so all off- | 
 |        sets are valid). Like the pattern string, the subject may  contain  bi- | 
 |        nary zeros. | 
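 |  | 
 |        As an illustration of the character-boundary rule for startoffset, | 
 |        the following is a minimal sketch in C of a check that an application | 
 |        could make before calling pcre2_match() with the 8-bit library in UTF | 
 |        mode. It assumes that the subject is already known to be valid UTF-8. | 
 |        The function name utf8_offset_is_valid is hypothetical and is not part | 
 |        of PCRE2. | 
 |  | 
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE2): returns 1 if offset could be
   passed as startoffset for a valid UTF-8 subject whose length is given
   in code units (bytes), and 0 otherwise. */
static int utf8_offset_is_valid(const uint8_t *subject, size_t length,
                                size_t offset)
{
  assert(subject != NULL || length == 0);
  if (offset > length) return 0;    /* beyond the end of the subject */
  if (offset == length) return 1;   /* the end of the subject is allowed */
  /* UTF-8 continuation bytes have the bit pattern 10xxxxxx, so any other
     byte is the first code unit of a character. */
  return (subject[offset] & 0xC0) != 0x80;
}
```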
 |  | 
 |        A  non-zero  starting offset is useful when searching for another match | 
 |        in the same subject by calling pcre2_match()  again  after  a  previous | 
 |        success.   Setting  startoffset  differs  from passing over a shortened | 
 |        string and setting PCRE2_NOTBOL in the case of a  pattern  that  begins | 
 |        with any kind of lookbehind. For example, consider the pattern | 
 |  | 
 |          \Biss\B | 
 |  | 
 |        which  finds  occurrences  of "iss" in the middle of words. (\B matches | 
 |        only if the current position in the subject is not  a  word  boundary.) | 
 |        When   applied   to   the   string  "Mississippi"  the  first  call  to | 
 |        pcre2_match() finds the first occurrence. If  pcre2_match()  is  called | 
 |        again with just the remainder of the subject, namely "issippi", it does | 
 |        not  match,  because  \B  is  always false at the start of the subject, | 
 |        which is deemed to be a word boundary.  However,  if  pcre2_match()  is | 
 |        passed the entire string again, but with startoffset set to 4, it finds | 
 |        the  second  occurrence  of "iss" because it is able to look behind the | 
 |        starting point to discover that it is preceded by a letter. | 
 |  | 
 |        Finding all the matches in a subject is tricky  when  the  pattern  can | 
 |        match an empty string. PCRE2 includes a helper API to assist with this; | 
 |        see  the  section  entitled  "Iterating over all matches" below for de- | 
 |        tails. | 
 |  | 
 |        If a non-zero starting offset is passed when the pattern is anchored, a | 
 |        single attempt to match at the given offset is made. This can only | 
 |        succeed if the pattern does not require the match to be at the start of | 
 |        the subject. In other words, the anchoring must be the result of | 
 |        setting the PCRE2_ANCHORED option or of using .* with PCRE2_DOTALL, not | 
 |        of starting the pattern with ^ or \A. | 
 |  | 
 |    Option bits for pcre2_match() | 
 |  | 
 |        The unused bits of the options argument for pcre2_match() must be zero. | 
 |        The    only    bits    that    may    be    set   are   PCRE2_ANCHORED, | 
 |        PCRE2_COPY_MATCHED_SUBJECT, PCRE2_DISABLE_RECURSELOOP_CHECK,  PCRE2_EN- | 
 |        DANCHORED,       PCRE2_NOTBOL,       PCRE2_NOTEOL,      PCRE2_NOTEMPTY, | 
 |        PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_JIT,  PCRE2_NO_UTF_CHECK,  PCRE2_PAR- | 
 |        TIAL_HARD, and PCRE2_PARTIAL_SOFT.  Their action is described below. | 
 |  | 
 |        Setting  PCRE2_ANCHORED  or PCRE2_ENDANCHORED at match time is not sup- | 
 |        ported by the just-in-time (JIT) compiler. If it is set,  JIT  matching | 
 |        is  disabled  and  the  interpretive  code  in  pcre2_match()  is  run. | 
 |        PCRE2_DISABLE_RECURSELOOP_CHECK is  ignored  by  JIT,  but  apart  from | 
 |        PCRE2_NO_JIT  (obviously),  the remaining options are supported for JIT | 
 |        matching. | 
 |  | 
 |          PCRE2_ANCHORED | 
 |  | 
 |        The PCRE2_ANCHORED option limits pcre2_match() to matching at the first | 
 |        matching position. If a pattern was compiled  with  PCRE2_ANCHORED,  or | 
 |        turned  out to be anchored by virtue of its contents, it cannot be made | 
 |        unanchored at matching time. Note that setting the option at match time | 
 |        disables JIT matching. | 
 |  | 
 |          PCRE2_COPY_MATCHED_SUBJECT | 
 |  | 
 |        By default, a pointer to the subject is remembered in  the  match  data | 
 |        block  so  that,  after a successful match, it can be referenced by the | 
 |        substring extraction functions. This means that  the  subject's  memory | 
 |        must  not be freed until all such operations are complete. For some ap- | 
 |        plications where the lifetime of the subject string is not  guaranteed, | 
 |        it  may  be  necessary  to make a copy of the subject string, but it is | 
 |        wasteful to do this unless the match is successful. After a  successful | 
 |        match,  if PCRE2_COPY_MATCHED_SUBJECT is set, the subject is copied and | 
 |        the new pointer is remembered in the match data block  instead  of  the | 
 |        original  subject  pointer.  The memory allocator that was used for the | 
 |        match block itself is  used.  The  copy  is  automatically  freed  when | 
 |        pcre2_match_data_free()  is  called to free the match data block. It is | 
 |        also automatically freed if the match data block is re-used for another | 
 |        match operation. | 
 |  | 
 |          PCRE2_DISABLE_RECURSELOOP_CHECK | 
 |  | 
 |        This option is relevant only to pcre2_match() for  interpretive  match- | 
 |        ing.    It   is  ignored  when  JIT  is  used,  and  is  forbidden  for | 
 |        pcre2_dfa_match(). | 
 |  | 
 |        The use of recursion in patterns can lead to infinite loops. In the in- | 
 |        terpretive matcher these would be eventually caught  by  the  match  or | 
 |        heap limits, but this could take a long time and/or use a lot of memory | 
 |        if  the  limits  are  large. There is therefore a check at the start of | 
 |        each recursion.  If the same group is  still  active  from  a  previous | 
 |        call,  and  the  current  subject  pointer is the same as it was at the | 
 |        start of that group, and the furthest inspected character of  the  sub- | 
 |        ject has not changed, an error is generated. | 
 |  | 
 |        There  are  rare cases of matches that would complete, but nevertheless | 
 |        trigger this error. This option disables  the  check.  It  is  provided | 
 |        mainly for testing when comparing JIT and interpretive behaviour. | 
 |  | 
 |          PCRE2_ENDANCHORED | 
 |  | 
 |        If  the  PCRE2_ENDANCHORED option is set, any string that pcre2_match() | 
 |        matches must be right at the end of the subject string. Note that  set- | 
 |        ting the option at match time disables JIT matching. | 
 |  | 
 |          PCRE2_NOTBOL | 
 |  | 
 |        This option specifies that the first character of the subject string is | 
 |        not the beginning of a line, so the circumflex metacharacter should not | 
 |        match before it. Setting this without having set PCRE2_MULTILINE at | 
 |        compile time causes circumflex never to match. This option affects only | 
 |        the behaviour of the circumflex metacharacter. It does not affect \A. | 
 |  | 
 |          PCRE2_NOTEOL | 
 |  | 
 |        This option specifies that the end of the subject string is not the end | 
 |        of  a line, so the dollar metacharacter should not match it nor (except | 
 |        in multiline mode) a newline immediately before it. Setting this  with- | 
 |        out  having  set PCRE2_MULTILINE at compile time causes dollar never to | 
 |        match. This option affects only the behaviour of the dollar metacharac- | 
 |        ter. It does not affect \Z or \z. | 
 |  | 
 |          PCRE2_NOTEMPTY | 
 |  | 
 |        An empty string is not considered to be a valid match if this option is | 
 |        set. If there are alternatives in the pattern, they are tried.  If  all | 
 |        the  alternatives  match  the empty string, the entire match fails. For | 
 |        example, if the pattern | 
 |  | 
 |          a?b? | 
 |  | 
 |        is applied to a string not beginning with "a" or  "b",  it  matches  an | 
 |        empty string at the start of the subject. With PCRE2_NOTEMPTY set, this | 
 |        match  is  not valid, so pcre2_match() searches further into the string | 
 |        for occurrences of "a" or "b". | 
 |  | 
 |          PCRE2_NOTEMPTY_ATSTART | 
 |  | 
 |        This is like PCRE2_NOTEMPTY, except that it locks out an  empty  string | 
 |        match only at the first matching position, that is, at the start of the | 
 |        subject  plus  the  starting offset. An empty string match later in the | 
 |        subject is permitted.  If the pattern is anchored, such a match can oc- | 
 |        cur only if the pattern contains \K. | 
 |  | 
 |          PCRE2_NO_JIT | 
 |  | 
 |        By  default,  if  a  pattern  has  been   successfully   processed   by | 
 |        pcre2_jit_compile(),  JIT  is  automatically used when pcre2_match() is | 
 |        called with options that JIT supports.  Setting  PCRE2_NO_JIT  disables | 
 |        the use of JIT; it forces matching to be done by the interpreter. | 
 |  | 
 |          PCRE2_NO_UTF_CHECK | 
 |  | 
 |        When PCRE2_UTF is set at compile time, the validity of the subject as a | 
 |        UTF   string   is   checked  unless  PCRE2_NO_UTF_CHECK  is  passed  to | 
 |        pcre2_match() or PCRE2_MATCH_INVALID_UTF was passed to pcre2_compile(). | 
 |        The latter special case is discussed in detail in the pcre2unicode doc- | 
 |        umentation. | 
 |  | 
 |        In the default case, if a non-zero starting offset is given, the  check | 
 |        is  applied  only  to  that part of the subject that could be inspected | 
 |        during matching, and there is a check that the starting  offset  points | 
 |        to the first code unit of a character or to the end of the subject. If | 
 |        there are no lookbehind assertions in the pattern, the check starts at | 
 |        the starting offset. Otherwise, it starts as many characters before the | 
 |        starting offset as the length of the longest lookbehind, or at the | 
 |        start of the subject if there are not that many characters before the | 
 |        starting offset. Note that the sequences \b and \B are one-character | 
 |        lookbehinds. | 
 |  | 
 |        The check is carried out before any other processing takes place, and a | 
 |        negative  error  code is returned if the check fails. There are several | 
 |        UTF error codes for each code unit width,  corresponding  to  different | 
 |        problems  with  the code unit sequence. There are discussions about the | 
 |        validity of UTF-8 strings, UTF-16 strings, and UTF-32  strings  in  the | 
 |        pcre2unicode documentation. | 
 |  | 
 |        If you know that your subject is valid, and you want to skip this check | 
 |        for performance reasons, you can set the PCRE2_NO_UTF_CHECK option when | 
 |        calling  pcre2_match().  You  might  want to do this for the second and | 
 |        subsequent calls to pcre2_match() if you are making repeated  calls  to | 
 |        find multiple matches in the same subject string. | 
 |  | 
 |        Warning:  Unless  PCRE2_MATCH_INVALID_UTF was set at compile time, when | 
 |        PCRE2_NO_UTF_CHECK is set at match time the effect of  passing  an  in- | 
 |        valid string as a subject, or an invalid value of startoffset, is unde- | 
 |        fined.   Your  program may crash or loop indefinitely or give wrong re- | 
 |        sults. | 
 |  | 
 |          PCRE2_PARTIAL_HARD | 
 |          PCRE2_PARTIAL_SOFT | 
 |  | 
 |        These options turn on the partial matching feature. A partial match oc- | 
 |        curs if the end of the subject  string  is  reached  successfully,  but | 
 |        there are not enough subject characters to complete the match. In addi- | 
 |        tion,  either  at  least  one character must have been inspected or the | 
 |        pattern must contain a lookbehind, or the  pattern  must  be  one  that | 
 |        could match an empty string. | 
 |  | 
 |        If  this  situation  arises when PCRE2_PARTIAL_SOFT (but not PCRE2_PAR- | 
 |        TIAL_HARD) is set, matching continues by testing any remaining alterna- | 
 |        tives. Only if no complete match can be  found  is  PCRE2_ERROR_PARTIAL | 
 |        returned  instead  of  PCRE2_ERROR_NOMATCH.  In other words, PCRE2_PAR- | 
 |        TIAL_SOFT specifies that the caller is prepared  to  handle  a  partial | 
 |        match, but only if no complete match can be found. | 
 |  | 
 |        If  PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this | 
 |        case, if a partial match is found,  pcre2_match()  immediately  returns | 
 |        PCRE2_ERROR_PARTIAL,  without  considering  any  other alternatives. In | 
 |        other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid- | 
 |        ered to be more important than an alternative complete match. | 
 |  | 
 |        There is a more detailed discussion of partial and multi-segment match- | 
 |        ing, with examples, in the pcre2partial documentation. | 
 |  | 
 |  | 
 | NEWLINE HANDLING WHEN MATCHING | 
 |  | 
 |        When PCRE2 is built, a default newline convention is set; this is  usu- | 
 |        ally  the standard convention for the operating system. The default can | 
 |        be overridden in a compile context by calling  pcre2_set_newline().  It | 
 |        can  also be overridden by starting a pattern string with, for example, | 
 |        (*CRLF), as described in the section  on  newline  conventions  in  the | 
 |        pcre2pattern  page. During matching, the newline choice affects the be- | 
 |        haviour of the dot, circumflex, and dollar metacharacters. It may  also | 
 |        alter  the  way  the  match starting position is advanced after a match | 
 |        failure for an unanchored pattern. | 
 |  | 
 |        When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is | 
 |        set as the newline convention, and a match attempt  for  an  unanchored | 
 |        pattern fails when the current starting position is at a CRLF sequence, | 
 |        and  the  pattern contains no explicit matches for CR or LF characters, | 
 |        the match position is advanced by two characters  instead  of  one,  in | 
 |        other words, to after the CRLF. | 
 |  | 
 |        The above rule is a compromise that makes the most common cases work as | 
 |        expected.  For example, if the pattern is .+A (and the PCRE2_DOTALL op- | 
 |        tion is not set), it does not match the string "\r\nA"  because,  after | 
 |        failing  at the start, it skips both the CR and the LF before retrying. | 
 |        However, the pattern [\r\n]A does match that string,  because  it  con- | 
 |        tains an explicit CR or LF reference, and so advances only by one char- | 
 |        acter after the first failure. | 
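 |  | 
 |        The advance rule can be summarized in code. The following is a | 
 |        simplified model, not the library's actual implementation; it assumes | 
 |        non-UTF 8-bit subjects, and the function name next_start and its two | 
 |        flag arguments are hypothetical. | 
 |  | 
```c
#include <assert.h>
#include <stddef.h>

/* Simplified model (not PCRE2's own code) of how the starting position
   moves after a failed attempt at pos for an unanchored pattern.
   crlf_is_newline:   non-zero if the convention is CRLF, ANYCRLF, or ANY.
   explicit_cr_or_lf: non-zero if the pattern explicitly matches CR or LF. */
static size_t next_start(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline, int explicit_cr_or_lf)
{
  assert(subject != NULL && pos < length);
  if (crlf_is_newline && !explicit_cr_or_lf &&
      pos + 1 < length &&
      subject[pos] == '\r' && subject[pos + 1] == '\n')
    return pos + 2;   /* skip the whole CRLF sequence */
  return pos + 1;     /* normal advance by one character */
}
```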
 |  | 
 |        An explicit match for CR or LF is either a literal appearance of one of | 
 |        those  characters  in the pattern, or one of the \r or \n or equivalent | 
 |        octal or hexadecimal escape sequences. Implicit matches such as [^X] do | 
 |        not count, nor does \s, even though it includes CR and LF in the  char- | 
 |        acters that it matches. | 
 |  | 
 |        Notwithstanding  the above, anomalous effects may still occur when CRLF | 
 |        is a valid newline sequence and explicit \r or \n escapes appear in the | 
 |        pattern. | 
 |  | 
 |  | 
 | HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS | 
 |  | 
 |        uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data); | 
 |  | 
 |        In general, a pattern matches a certain portion of the subject, and  in | 
 |        addition,  further  substrings  from  the  subject may be picked out by | 
 |        parenthesized parts of the pattern.  Following  the  usage  in  Jeffrey | 
 |        Friedl's  book,  this  is  called  "capturing" in what follows, and the | 
 |        phrase "capture group" (Perl terminology) is used for a fragment  of  a | 
 |        pattern  that picks out a substring. PCRE2 supports several other kinds | 
 |        of parenthesized group that do not cause substrings to be captured. The | 
 |        pcre2_pattern_info() function can be used to find out how many  capture | 
 |        groups there are in a compiled pattern. | 
 |  | 
 |        You  can  use  auxiliary functions for accessing captured substrings by | 
 |        number or by name, as described in sections below. | 
 |  | 
 |        Alternatively, you can make direct use of the vector of PCRE2_SIZE val- | 
 |        ues, called  the  ovector,  which  contains  the  offsets  of  captured | 
 |        strings.   It   is   part  of  the  match  data  block.   The  function | 
 |        pcre2_get_ovector_pointer() returns the address  of  the  ovector,  and | 
 |        pcre2_get_ovector_count() returns the number of pairs of values it con- | 
 |        tains. | 
 |  | 
 |        Within the ovector, the first in each pair of values is set to the off- | 
 |        set of the first code unit of a substring, and the second is set to the | 
 |        offset  of the first code unit after the end of a substring. These val- | 
 |        ues are always code unit offsets, not character offsets. That is,  they | 
 |        are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit li- | 
 |        brary, and 32-bit offsets in the 32-bit library. | 
 |  | 
 |        After  a  partial  match  (error  return PCRE2_ERROR_PARTIAL), only the | 
 |        first pair of offsets (that is, ovector[0]  and  ovector[1])  are  set. | 
 |        They  identify  the part of the subject that was partially matched. See | 
 |        the pcre2partial documentation for details of partial matching. | 
 |  | 
 |        After a fully successful match, the first pair  of  offsets  identifies | 
 |        the  portion  of the subject string that was matched by the entire pat- | 
 |        tern. The next pair is used for the first captured  substring,  and  so | 
 |        on.  The  value  returned by pcre2_match() is one more than the highest | 
 |        numbered pair that has been set. For example, if  two  substrings  have | 
 |        been  captured,  the returned value is 3. If there are no captured sub- | 
 |        strings, the return value from a successful match is 1, indicating that | 
 |        just the first pair of offsets has been set. | 
 |  | 
 |        If a pattern uses the \K escape sequence within  a  positive  lookahead | 
 |        assertion, the reported start of a successful match can be greater than | 
 |        the  end  of the match. For example, if the pattern (?=ab\K) is matched | 
 |        against "ab", the start and end offset values for the match are  2  and | 
 |        0. | 
 |  | 
 |        If  a  capture group is matched repeatedly within a single match opera- | 
 |        tion, it is the last portion of the subject that it matched that is re- | 
 |        turned. | 
 |  | 
 |        If the ovector is too small to hold all the captured substring offsets, | 
 |        as much as possible is filled in, and the function returns a  value  of | 
 |        zero.  If captured substrings are not of interest, pcre2_match() may be | 
 |        called with a match data block whose ovector is of minimum length (that | 
 |        is, one pair). | 
 |  | 
 |        It is possible for capture group number n+1 to match some part  of  the | 
 |        subject  when  group  n  has  not been used at all. For example, if the | 
 |        string "abc" is matched against the pattern (a|(z))(bc) the return from | 
 |        the function is 4, and groups 1 and 3 are matched, but 2 is  not.  When | 
 |        this  happens,  both values in the offset pairs corresponding to unused | 
 |        groups are set to PCRE2_UNSET. | 
 |  | 
 |        Offset values that correspond to unused groups at the end  of  the  ex- | 
 |        pression  are also set to PCRE2_UNSET. For example, if the string "abc" | 
 |        is matched against the pattern (abc)(x(yz)?)? groups 2 and  3  are  not | 
 |        matched.  The  return  from the function is 2, because the highest used | 
 |        capture group number is 1. The offsets for the second and third capture | 
 |        groups (assuming the vector is large enough,  of  course)  are  set  to | 
 |        PCRE2_UNSET. | 
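 |  | 
 |        When reading the ovector directly, unset groups have to be checked for | 
 |        explicitly. The following sketch shows one way to do this. It is | 
 |        written against a plain array of offsets so that it stands alone: the | 
 |        macro UNSET_OFFSET is a stand-in for PCRE2_UNSET (assumed here to be | 
 |        the all-ones value of the offset type), and the helper name copy_group | 
 |        is hypothetical. Real code should use PCRE2_UNSET and PCRE2_SIZE from | 
 |        pcre2.h. | 
 |  | 
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET; an assumption made so the sketch compiles
   without pcre2.h. */
#define UNSET_OFFSET (~(size_t)0)

/* Hypothetical helper: copy capture group n into buf as a C string.
   rc is the positive value returned by pcre2_match(). Returns the number
   of code units copied, or -1 if the group is unset, lies beyond the
   highest numbered pair that was set, or does not fit in buf. */
static long copy_group(const char *subject, const size_t *ovector, int rc,
                       int n, char *buf, size_t bufsize)
{
  assert(subject != NULL && ovector != NULL && buf != NULL);
  if (n < 0 || n >= rc) return -1;

  size_t start = ovector[2 * n];
  size_t end = ovector[2 * n + 1];
  if (start == UNSET_OFFSET || end == UNSET_OFFSET) return -1;
  if (start > end) return -1;   /* possible when \K is used in a lookahead */

  size_t len = end - start;
  if (len + 1 > bufsize) return -1;
  memcpy(buf, subject + start, len);
  buf[len] = '\0';
  return (long)len;
}
```
 |  | 
 |        For the (a|(z))(bc) example above, the ovector holds the pairs 0,3 and | 
 |        0,1, then an unset pair, then 1,3, so this helper would return -1 for | 
 |        group 2 and copy "bc" for group 3. | 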
 |  | 
 |        Elements in the ovector that do not correspond to capturing parentheses | 
 |        in the pattern are never changed. That is, if a pattern contains n cap- | 
 |        turing parentheses, no more than ovector[0] to ovector[2n+1] are set by | 
 |        pcre2_match().  The  other  elements retain whatever values they previ- | 
 |        ously had. After a failed match attempt, the contents  of  the  ovector | 
 |        are unchanged. | 
 |  | 
 |  | 
 | OTHER INFORMATION ABOUT A MATCH | 
 |  | 
 |        PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data); | 
 |  | 
 |        PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data); | 
 |  | 
 |        As  well as the offsets in the ovector, other information about a match | 
 |        is retained in the match data block and can be retrieved by  the  above | 
 |        functions  in  appropriate  circumstances.  If they are called at other | 
 |        times, the result is undefined. | 
 |  | 
 |        After a successful match, a partial match (PCRE2_ERROR_PARTIAL),  or  a | 
 |        failure  to  match (PCRE2_ERROR_NOMATCH), a mark name may be available. | 
 |        The function pcre2_get_mark() can be called to access this name,  which | 
 |        can  be  specified  in  the  pattern by any of the backtracking control | 
 |        verbs, not just (*MARK). The same function applies to all the verbs. It | 
 |        returns a pointer to the zero-terminated name, which is within the com- | 
 |        piled pattern. If no name is available, NULL is returned. The length of | 
 |        the name (excluding the terminating zero) is stored in  the  code  unit | 
 |        that  precedes  the name. You should use this length instead of relying | 
 |        on the terminating zero if the name might contain a binary zero. | 
 |  | 
 |        After a successful match, the name that is returned is  the  last  mark | 
 |        name encountered on the matching path through the pattern. Instances of | 
 |        backtracking  verbs  without  names do not count. Thus, for example, if | 
 |        the matching path contains (*MARK:A)(*PRUNE), the name "A" is returned. | 
 |        After a "no match" or a partial match, the last encountered name is re- | 
 |        turned. For example, consider this pattern: | 
 |  | 
 |          ^(*MARK:A)((*MARK:B)a|b)c | 
 |  | 
 |        When it matches "bc", the returned name is A. The B mark is  "seen"  in | 
 |        the  first  branch of the group, but it is not on the matching path. On | 
 |        the other hand, when this pattern fails to  match  "bx",  the  returned | 
 |        name is B. | 
 |  | 
 |        Warning:  By  default, certain start-of-match optimizations are used to | 
 |        give a fast "no match" result in some situations. For example,  if  the | 
 |        anchoring  is removed from the pattern above, there is an initial check | 
 |        for the presence of "c" in the subject before running the matching  en- | 
 |        gine. This check fails for "bx", causing a match failure without seeing | 
 |        any  marks. You can disable the start-of-match optimizations by setting | 
 |        the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or  by  starting | 
 |        the pattern with (*NO_START_OPT). | 
 |  | 
 |        After  a  successful  match, a partial match, or one of the invalid UTF | 
 |        errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar()  can | 
 |        be called. After a successful or partial match it returns the code unit | 
 |        offset  of  the character at which the match started. For a non-partial | 
 |        match, this can be different to the value of ovector[0] if the  pattern | 
 |        contains  the  \K escape sequence. After a partial match, however, this | 
 |        value is always the same as ovector[0] because \K does not  affect  the | 
 |        result of a partial match. | 
 |  | 
 |        After  a UTF check failure, pcre2_get_startchar() can be used to obtain | 
 |        the code unit offset of the invalid UTF character. Details are given in | 
 |        the pcre2unicode page. | 
 |  | 
 |  | 
 | ERROR RETURNS FROM pcre2_match() | 
 |  | 
 |        If pcre2_match() fails, it returns a negative number. This can be  con- | 
 |        verted  to a text string by calling the pcre2_get_error_message() func- | 
 |        tion (see "Obtaining a textual error message" below).   Negative  error | 
 |        codes  are  also  returned  by other functions, and are documented with | 
 |        them. The codes are given names in the header file. If UTF checking  is | 
 |        in force and an invalid UTF subject string is detected, one of a number | 
 |        of  UTF-specific negative error codes is returned. Details are given in | 
 |        the pcre2unicode page. The following are the other errors that  may  be | 
 |        returned by pcre2_match(): | 
 |  | 
 |          PCRE2_ERROR_NOMATCH | 
 |  | 
 |        The subject string did not match the pattern. | 
 |  | 
 |          PCRE2_ERROR_PARTIAL | 
 |  | 
 |        The  subject  string did not match, but it did match partially. See the | 
 |        pcre2partial documentation for details of partial matching. | 
 |  | 
 |          PCRE2_ERROR_BADMAGIC | 
 |  | 
 |        PCRE2 stores a 4-byte "magic number" at the start of the compiled code, | 
 |        to catch the case when it is passed a junk pointer. This is  the  error | 
 |        that is returned when the magic number is not present. | 
 |  | 
 |          PCRE2_ERROR_BADMODE | 
 |  | 
 |        This  error is given when a compiled pattern is passed to a function in | 
 |        a library of a different code unit width, for example, a  pattern  com- | 
 |        piled  by  the  8-bit  library  is passed to a 16-bit or 32-bit library | 
 |        function. | 
 |  | 
 |          PCRE2_ERROR_BADOFFSET | 
 |  | 
 |        The value of startoffset was greater than the length of the subject. | 
 |  | 
 |          PCRE2_ERROR_BADOPTION | 
 |  | 
 |        An unrecognized bit was set in the options argument. | 
 |  | 
 |          PCRE2_ERROR_BADUTFOFFSET | 
 |  | 
 |        The UTF code unit sequence that was passed as a subject was checked and | 
 |        found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but  the | 
 |        value  of startoffset did not point to the beginning of a UTF character | 
 |        or the end of the subject. | 
 |  | 
 |          PCRE2_ERROR_CALLOUT | 
 |  | 
 |        This error is never generated by pcre2_match() itself. It  is  provided | 
 |        for  use  by  callout  functions  that  want  to cause pcre2_match() or | 
 |        pcre2_callout_enumerate() to return a distinctive error code.  See  the | 
 |        pcre2callout documentation for details. | 
 |  | 
 |          PCRE2_ERROR_DEPTHLIMIT | 
 |  | 
 |        The nested backtracking depth limit was reached. | 
 |  | 
 |          PCRE2_ERROR_HEAPLIMIT | 
 |  | 
 |        The heap limit was reached. | 
 |  | 
 |          PCRE2_ERROR_INTERNAL | 
 |  | 
 |        An  unexpected  internal error has occurred. This error could be caused | 
 |        by a bug in PCRE2 or by overwriting of the compiled pattern. | 
 |  | 
 |          PCRE2_ERROR_JIT_STACKLIMIT | 
 |  | 
 |        This error is returned when a pattern that was successfully compiled | 
 |        using JIT is being matched, but the memory available for the | 
 |        just-in-time processing stack is not large enough. See the pcre2jit | 
 |        documentation for more details. | 
 |  | 
 |          PCRE2_ERROR_MATCHLIMIT | 
 |  | 
 |        The backtracking match limit was reached. | 
 |  | 
 |          PCRE2_ERROR_NOMEMORY | 
 |  | 
 |        Heap  memory  is  used  to  remember backtracking points. This error is | 
 |        given when the memory allocation function (default  or  custom)  fails. | 
 |        Note  that  a  different  error, PCRE2_ERROR_HEAPLIMIT, is given if the | 
 |        amount of memory needed exceeds the heap limit. PCRE2_ERROR_NOMEMORY is | 
 |        also returned if PCRE2_COPY_MATCHED_SUBJECT is set and  memory  alloca- | 
 |        tion fails. | 
 |  | 
 |          PCRE2_ERROR_NULL | 
 |  | 
 |        Either the code, subject, or match_data argument was passed as NULL. | 
 |  | 
 |          PCRE2_ERROR_RECURSELOOP | 
 |  | 
 |        This  error  is  returned  when  pcre2_match() detects a recursion loop | 
 |        within the pattern. Specifically, it means that either the  whole  pat- | 
 |        tern or a capture group has been called recursively for the second time | 
 |        at  the  same position in the subject string. Some simple patterns that | 
 |        might do this are detected and faulted at compile time, but  more  com- | 
 |        plicated  cases,  in particular mutual recursions between two different | 
 |        groups, cannot be detected until matching is attempted. | 
 |  | 
 |  | 
 | OBTAINING A TEXTUAL ERROR MESSAGE | 
 |  | 
 |        int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer, | 
 |          PCRE2_SIZE bufflen); | 
 |  | 
 |        A text message for an error code  from  any  PCRE2  function  (compile, | 
 |        match,  or  auxiliary)  can be obtained by calling pcre2_get_error_mes- | 
 |        sage(). The code is passed as the first argument,  with  the  remaining | 
 |        two  arguments  specifying  a  code  unit buffer and its length in code | 
 |        units, into which the text message is placed. The message  is  returned | 
 |        in  code  units  of the appropriate width for the library that is being | 
 |        used. | 
 |  | 
 |        The returned message is terminated with a trailing zero, and the  func- | 
 |        tion  returns  the  number  of  code units used, excluding the trailing | 
 |        zero. If the error number is unknown, the negative error code PCRE2_ER- | 
 |        ROR_BADDATA is returned. If the buffer is too  small,  the  message  is | 
 |        truncated (but still with a trailing zero), and the negative error code | 
 |        PCRE2_ERROR_NOMEMORY is returned.  None of the messages is very long; a | 
 |        buffer size of 120 code units is ample. | 
 |  | 
 |  | 
 | ITERATING OVER ALL MATCHES | 
 |  | 
 |        int pcre2_next_match(pcre2_match_data *match_data, | 
 |          PCRE2_SIZE *pstart_offset, uint32_t *poptions); | 
 |  | 
 |        A common task for applications is to implement "global" matching behav- | 
 |        iour,  for example, replacing all matches in the subject; splitting the | 
 |        subject on all matches; or simply counting the number of  matches.  The | 
 |        pcre2_next_match()  function  helps with this task by providing the ap- | 
 |        propriate parameters for the next match attempt (available since  PCRE2 | 
 |        10.46). | 
 |  | 
 |        First,  a  match attempt should be made using one of the matching func- | 
 |        tions (pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()).   Then, | 
 |        pcre2_next_match() can be called, providing the same match_data parame- | 
 |        ter. | 
 |  | 
 |        It  returns 0 ("false") if there is no need to make a further match at- | 
 |        tempt, or 1 ("true") if another match should be attempted. Returning  1 | 
 |        does  not  imply  that  there is another match, only that another match | 
 |        should be attempted (which may return PCRE2_ERROR_NOMATCH). | 
 |  | 
 |        The *pstart_offset and *poptions are set if  the  function  returns  1. | 
 |        The *pstart_offset should be passed to the next match attempt directly, | 
 |        and the *poptions should be passed to the next match attempt by combin- | 
 |        ing with the application's match options using OR. | 
 |  | 
 |        There  is  some  code that demonstrates how to do this in the pcre2demo | 
 |        sample program. The general pattern is: | 
 |  | 
 |          uint32_t app_options = ...; | 
 |          uint32_t global_options = 0; | 
 |          PCRE2_SIZE start_offset = 0; | 
 |          while (1) | 
 |            { | 
 |            int rc = pcre2_match(re, subject, subject_len, start_offset, | 
 |                                 app_options | global_options, match_data, | 
 |                                 match_context); | 
 |  | 
 |            if (rc == PCRE2_ERROR_NOMATCH) break; /* no match, and no more attempts */ | 
 |            if (rc < 0) { ... exit } | 
 |  | 
 |            ...handle the match | 
 |  | 
 |            if (!pcre2_next_match(match_data, &start_offset, &global_options)) | 
 |              break; /* no more attempts */ | 
 |            } | 
 |  | 
 |        The guarantees provided by pcre2_next_match() are that the start_offset | 
 |        will advance, so the loop will  definitely  terminate.  The  conditions | 
 |        which  ensure  this  are  that either: (a) pcre2_next_match() returns 0 | 
 |        (false); or (b) the returned *pstart_offset is  strictly  greater  than | 
 |        the  previous start_offset; or (c) if the previous match was a success- | 
 |        ful match of the empty string then the returned *pstart_offset is equal | 
 |        to  the  previous  ovector[1],   and   *poptions   will   be   set   to | 
 |        PCRE2_NOTEMPTY_ATSTART  to  prevent  another empty match from being re- | 
 |        turned. | 
 |  | 
 |        A loop implemented as shown above will always terminate,  unless  there | 
 |        is  a  bug  in PCRE2. As a measure of "defensive programming", applica- | 
 |        tions are encouraged to add an assertion or check to break  their  loop | 
 |        if it does not make progress (and report the issue as a bug). | 
 |  | 
 |        If   an   application   does   not   use   the   flag   PCRE2_EXTRA_AL- | 
 |        LOW_LOOKAROUND_BSK, then each match is "well-behaved" and satisfies: | 
 |  | 
 |          start_offset <= ovector[0] <= ovector[1]. | 
 |  | 
 |        In   this   case,   the   matches   found   by    pcre2_match()    with | 
 |        pcre2_next_match() will be sorted, non-overlapping (possibly touching), | 
 |        and with no duplicates. | 
 |  | 
 |        Otherwise,  if PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK is used, then the guar- | 
 |        antees are considerably weaker. We do not guarantee  that  the  matches | 
 |        will always advance: only that the start_offset will. The matches found | 
 |        by  pcre2_match() with pcre2_next_match() will be a finite sequence (as | 
 |        pcre2_next_match() ensures that start_offset advances,  so  the  search | 
 |        will  terminate).  The  matches can however be overlapping, can contain | 
 |        duplicates, and (in truly pathological examples) may not even be sorted | 
 |        by ovector[0]. Additionally, each match itself can end before it starts | 
 |        (ovector[1] < ovector[0]). We recommend that applications  do  not  set | 
 |        PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK. | 
 |  | 
 |  | 
 | EXTRACTING CAPTURED SUBSTRINGS BY NUMBER | 
 |  | 
 |        int pcre2_substring_length_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_SIZE *length); | 
 |  | 
 |        int pcre2_substring_copy_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_UCHAR *buffer, | 
 |          PCRE2_SIZE *bufflen); | 
 |  | 
 |        int pcre2_substring_get_bynumber(pcre2_match_data *match_data, | 
 |          uint32_t number, PCRE2_UCHAR **bufferptr, | 
 |          PCRE2_SIZE *bufflen); | 
 |  | 
 |        void pcre2_substring_free(PCRE2_UCHAR *buffer); | 
 |  | 
 |        Captured  substrings  can  be accessed directly by using the ovector as | 
 |        described above.  For convenience, auxiliary functions are provided for | 
 |        extracting  captured  substrings  as  new,  separate,   zero-terminated | 
 |        strings. A substring that contains a binary zero is correctly extracted | 
 |        and  has  a  further  zero  added on the end, but the result is not, of | 
 |        course, a C string. | 
 |  | 
 |        The functions in this section identify substrings by number. The number | 
 |        zero refers to the entire matched substring, with higher numbers refer- | 
 |        ring to substrings captured by parenthesized groups.  After  a  partial | 
 |        match,  only  substring  zero  is  available. An attempt to extract any | 
 |        other substring gives the error PCRE2_ERROR_PARTIAL. The  next  section | 
 |        describes similar functions for extracting captured substrings by name. | 
 |  | 
 |        If  a  pattern  uses the \K escape sequence within a positive lookahead | 
 |        assertion, the reported start of a successful match can be greater than | 
 |        the end of the match. For example, if the pattern (?=ab\K)  is  matched | 
 |        against  "ab",  the start and end offset values for the match are 2 and | 
 |        0. In this situation, calling these functions  with  a  zero  substring | 
 |        number extracts a zero-length empty string. | 
 |  | 
 |        You  can  find the length in code units of a captured substring without | 
 |        extracting it by calling pcre2_substring_length_bynumber().  The  first | 
 |        argument  is a pointer to the match data block, the second is the group | 
 |        number, and the third is a pointer to a variable into which the  length | 
 |        is  placed.  If  you just want to know whether or not the substring has | 
 |        been captured, you can pass the third argument as NULL. | 
 |  | 
 |        The pcre2_substring_copy_bynumber() function  copies  a  captured  sub- | 
 |        string  into  a supplied buffer, whereas pcre2_substring_get_bynumber() | 
 |        copies it into new memory, obtained using the  same  memory  allocation | 
 |        function  that  was  used for the match data block. The first two argu- | 
 |        ments of these functions are a pointer to the match data  block  and  a | 
 |        capture group number. | 
 |  | 
 |        The final arguments of pcre2_substring_copy_bynumber() are a pointer to | 
 |        the buffer and a pointer to a variable that contains its length in code | 
 |        units.  This is updated to contain the actual number of code units used | 
 |        for the extracted substring, excluding the terminating zero. | 
 |  | 
 |        For pcre2_substring_get_bynumber() the third and fourth arguments point | 
 |        to  variables that are updated with a pointer to the new memory and the | 
 |        number of code units that comprise the substring, again  excluding  the | 
 |        terminating  zero.  When  the substring is no longer needed, the memory | 
 |        should be freed by calling pcre2_substring_free(). | 
 |  | 
 |        The return value from all these functions is zero  for  success,  or  a | 
 |        negative  error  code.  If  the pattern match failed, the match failure | 
 |        code is returned.  If a substring number greater than zero is used  af- | 
 |        ter  a  partial  match, PCRE2_ERROR_PARTIAL is returned. Other possible | 
 |        error codes are: | 
 |  | 
 |          PCRE2_ERROR_NOMEMORY | 
 |  | 
 |        The buffer was too small for  pcre2_substring_copy_bynumber(),  or  the | 
 |        attempt to get memory failed for pcre2_substring_get_bynumber(). | 
 |  | 
 |          PCRE2_ERROR_NOSUBSTRING | 
 |  | 
 |        There  is  no  substring  with that number in the pattern, that is, the | 
 |        number is greater than the number of capturing parentheses. | 
 |  | 
 |          PCRE2_ERROR_UNAVAILABLE | 
 |  | 
 |        The substring number, though not greater than the number of captures in | 
 |        the pattern, is greater than the number of slots in the ovector, so the | 
 |        substring could not be captured. | 
 |  | 
 |          PCRE2_ERROR_UNSET | 
 |  | 
 |        The substring did not participate in the match.  For  example,  if  the | 
 |        pattern  is  (abc)|(def) and the subject is "def", and the ovector con- | 
 |        tains at least two capturing slots, substring number 1 is unset. | 
 |  | 
 |  | 
 | EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS | 
 |  | 
 |        int pcre2_substring_list_get(pcre2_match_data *match_data, | 
 |          PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr); | 
 |  | 
 |        void pcre2_substring_list_free(PCRE2_UCHAR **list); | 
 |  | 
 |        The pcre2_substring_list_get() function  extracts  all  available  sub- | 
 |        strings  and  builds  a  list of pointers to them. It also (optionally) | 
 |        builds a second list that contains their lengths (in code  units),  ex- | 
 |        cluding  a  terminating zero that is added to each of them. All this is | 
 |        done in a single block of memory that is obtained using the same memory | 
 |        allocation function that was used to get the match data block. | 
 |  | 
 |        This function must be called only after a successful match.  If  called | 
 |        after a partial match, the error code PCRE2_ERROR_PARTIAL is returned. | 
 |  | 
 |        The  address of the memory block is returned via listptr, which is also | 
 |        the start of the list of string pointers. The end of the list is marked | 
 |        by a NULL pointer. The address of the list of lengths is  returned  via | 
 |        lengthsptr.  If your strings do not contain binary zeros and you do not | 
 |        therefore need the lengths, you may supply NULL as the lengthsptr argu- | 
 |        ment to disable the creation of a list of lengths.  The  yield  of  the | 
 |        function  is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem- | 
 |        ory block could not be obtained. When the list is no longer needed,  it | 
 |        should be freed by calling pcre2_substring_list_free(). | 
 |  | 
 |        If this function encounters a substring that is unset, which can happen | 
 |        when  capture  group  number  n+1 matches some part of the subject, but | 
 |        group n has not been used at all, it returns an empty string. This  can | 
 |        be distinguished from a genuine zero-length substring by inspecting the | 
 |        appropriate offset in the ovector, which contains PCRE2_UNSET for unset | 
 |        substrings, or by calling pcre2_substring_length_bynumber(). | 
 |  | 
 |  | 
 | EXTRACTING CAPTURED SUBSTRINGS BY NAME | 
 |  | 
 |        int pcre2_substring_number_from_name(const pcre2_code *code, | 
 |          PCRE2_SPTR name); | 
 |  | 
 |        int pcre2_substring_length_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_SIZE *length); | 
 |  | 
 |        int pcre2_substring_copy_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen); | 
 |  | 
 |        int pcre2_substring_get_byname(pcre2_match_data *match_data, | 
 |          PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen); | 
 |  | 
 |        void pcre2_substring_free(PCRE2_UCHAR *buffer); | 
 |  | 
 |        To extract a substring by name, you first have to find the associated | 
 |        number. For example, for this pattern: | 
 |  | 
 |          (a+)b(?<xxx>\d+)... | 
 |  | 
 |        the number of the capture group called "xxx" is 2. If the name is known | 
 |        to be unique (PCRE2_DUPNAMES was not set), you can find the number from | 
 |        the name by calling pcre2_substring_number_from_name(). The first argu- | 
 |        ment  is the compiled pattern, and the second is the name. The yield of | 
 |        the function is the group number, PCRE2_ERROR_NOSUBSTRING if  there  is | 
 |        no  group  with that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if there is | 
 |        more than one group with that name.  Given the number, you can  extract | 
 |        the  substring  directly from the ovector, or use one of the "bynumber" | 
 |        functions described above. | 
 |  | 
 |        For convenience, there are also "byname" functions that  correspond  to | 
 |        the "bynumber" functions, the only difference being that the second ar- | 
 |        gument  is  a  name  instead  of a number. If PCRE2_DUPNAMES is set and | 
 |        there are duplicate names, these functions scan all the groups with the | 
 |        given name, and return the captured  substring  from  the  first  named | 
 |        group that is set. | 
 |  | 
 |        If  there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is | 
 |        returned. If all groups with the name have  numbers  that  are  greater | 
 |        than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is re- | 
 |        turned.  If there is at least one group with a slot in the ovector, but | 
 |        no group is found to be set, PCRE2_ERROR_UNSET is returned. | 
 |  | 
 |        Warning: If the pattern uses the (?| feature to set up multiple capture | 
 |        groups with the same number, as described in the section  on  duplicate | 
 |        group numbers in the pcre2pattern page, you cannot use names to distin- | 
 |        guish  the  different capture groups, because names are not included in | 
 |        the compiled code. The matching process uses  only  numbers.  For  this | 
 |        reason,  the  use  of  different  names for groups with the same number | 
 |        causes an error at compile time. | 
 |  | 
 |  | 
 | CREATING A NEW STRING WITH SUBSTITUTIONS | 
 |  | 
 |        int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext, PCRE2_SPTR replacement, | 
 |          PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer, | 
 |          PCRE2_SIZE *outlengthptr); | 
 |  | 
 |        This function optionally calls pcre2_match() and then makes a  copy  of | 
 |        the  subject  string in outputbuffer, replacing parts that were matched | 
 |        with the replacement string, whose length is supplied in rlength, which | 
 |        can be given as PCRE2_ZERO_TERMINATED for a zero-terminated string.  As | 
 |        a  special  case,  if  replacement is NULL and rlength is zero, the re- | 
 |        placement is assumed to be an empty string. If rlength is non-zero,  an | 
 |        error occurs if replacement is NULL. | 
 |  | 
 |        There is an option (see PCRE2_SUBSTITUTE_REPLACEMENT_ONLY below) to re- | 
 |        turn  just  the replacement string(s). The default action is to perform | 
 |        just one replacement if the pattern matches, but  there  is  an  option | 
 |        that  requests  multiple  replacements (see PCRE2_SUBSTITUTE_GLOBAL be- | 
 |        low). | 
 |  | 
 |        If successful, pcre2_substitute() returns the number  of  substitutions | 
 |        that  were  carried out. This may be zero if no match was found, and is | 
 |        never greater than one unless PCRE2_SUBSTITUTE_GLOBAL is set.  A  nega- | 
 |        tive value is returned if an error is detected. | 
 |  | 
 |        Matches  in  which  a  \K item in a lookahead in the pattern causes the | 
 |        match to end before it starts are not supported, and give  rise  to  an | 
 |        error return. For global replacements, matches in which \K in a lookbe- | 
 |        hind  causes the match to start earlier than the point that was reached | 
 |        in the previous iteration are also not supported. (These cases are only | 
 |        possible if the pattern was compiled with  the  backwards-compatibility | 
 |        option PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK.) | 
 |  | 
 |        The  first  seven  arguments  of pcre2_substitute() are the same as for | 
 |        pcre2_match(), except that the partial matching options are not permit- | 
 |        ted, and match_data may be passed as NULL, in which case a  match  data | 
 |        block  is obtained and freed within this function, using memory manage- | 
 |        ment functions from the match context, if provided, or else those  that | 
 |        were used to allocate memory for the compiled code. | 
 |  | 
 |        If  match_data is not NULL and PCRE2_SUBSTITUTE_MATCHED is not set, the | 
 |        provided block is used for all calls to pcre2_match(), and its contents | 
 |        afterwards are  the  result  of  the  final  call  made  internally  by | 
 |        pcre2_substitute()  to  the matching function. For global changes, this | 
 |        will always be a no-match error. The contents of the ovector within the | 
 |        match data block may or may not have been changed. | 
 |  | 
 |        As well as the usual options for pcre2_match(), a number of  additional | 
 |        options  can be set in the options argument of pcre2_substitute().  One | 
 |        such option is PCRE2_SUBSTITUTE_MATCHED. When this is set, an  external | 
 |        match_data  block  must be provided, and it must have already been used | 
 |        for an external call to pcre2_match() (or pcre2_jit_match())  with  the | 
 |        same  pattern, subject pointer, effective subject length, start offset, | 
 |        and match option arguments (substitute-specific options can be added to | 
 |        the  options  argument).  If  any  of  these  parameters  is   changed, | 
 |        pcre2_substitute()  returns  an error. The data in the match_data block | 
 |        (return code, offset vector) is used for the first substitution instead | 
 |        of calling pcre2_match() from within pcre2_substitute(). This allows an | 
 |        application to check for a match before choosing to substitute, without | 
 |        having to repeat the match. | 
 |  | 
 |        If  the  contents  of  the  subject  buffer  are  mutated  in   between | 
 |        pcre2_match()  and  a  call  to  pcre2_substitute()  with PCRE2_SUBSTI- | 
 |        TUTE_MATCHED, the behaviour is unsafe; in  particular,  in  this  case, | 
 |        PCRE2  is unable to ensure that the offsets in the ovector point to the | 
 |        start of characters (with UTF-encoded input). | 
 |  | 
 |        The contents of the  externally  supplied  match  data  block  are  not | 
 |        changed when PCRE2_SUBSTITUTE_MATCHED is set, and so the match block is | 
 |        permitted  for  use  in another call using PCRE2_SUBSTITUTE_MATCHED. If | 
 |        PCRE2_SUBSTITUTE_GLOBAL is also set, pcre2_match() is called after  the | 
 |        first substitution to check for further matches, but this is done using | 
 |        an internally obtained match data block, thus always leaving the exter- | 
 |        nal block unchanged. | 
 |  | 
 |        The code argument is not used for matching before the  first  substitu- | 
 |        tion  when  PCRE2_SUBSTITUTE_MATCHED  is  set, but it must be provided, | 
 |        even when PCRE2_SUBSTITUTE_GLOBAL is not set, because it  contains  in- | 
 |        formation such as the UTF setting and the number of capturing parenthe- | 
 |        ses in the pattern. | 
 |  | 
 |        When  using PCRE2_SUBSTITUTE_MATCHED, you should not modify the subject | 
 |        string in between the prior call  to  pcre2_match()  and  pcre2_substi- | 
 |        tute(),  as the substitution assumes that the passed-in ovector is com- | 
 |        patible with the subject string. Although PCRE2 does  verify  that  the | 
 |        subject  is  a  pointer to the same buffer, it cannot in general verify | 
 |        whether the contents of the buffer have changed. For  example,  if  the | 
 |        subject  buffer is mutated from one valid UTF-8 string to another valid | 
 |        string, of the same length in code units, the ovector  offsets  are  no | 
 |        longer  guaranteed  to  point  to the start of a character. Beware that | 
 |        with PCRE2_SUBSTITUTE_MATCHED in UTF mode, the subject  string  is  not | 
 |        re-scanned for UTF validity when pcre2_substitute() first uses it. | 
 |  | 
 |        The  default  action  of  pcre2_substitute() is to return a copy of the | 
 |        subject string with matched substrings replaced. However, if PCRE2_SUB- | 
 |        STITUTE_REPLACEMENT_ONLY is set, only the  replacement  substrings  are | 
 |        returned. In the global case, multiple replacements are concatenated in | 
 |        the  output  buffer.  Substitution  callouts (see below) can be used to | 
 |        separate them if necessary. | 
 |  | 
 |        The outlengthptr argument of pcre2_substitute() must point to  a  vari- | 
 |        able  that contains the length, in code units, of the output buffer. If | 
 |        the function is successful, the value is updated to contain the  length | 
 |        in  code  units  of the new string, excluding the trailing zero that is | 
 |        automatically added. | 
 |  | 
 |        If the function is not successful, the value set via  outlengthptr  de- | 
 |        pends  on  the  type  of  error.  For  syntax errors in the replacement | 
 |        string, the value is the offset in the replacement string where the er- | 
 |        ror was detected. For other errors, the value  is  PCRE2_UNSET  by  de- | 
 |        fault. This includes the case of the output buffer being too small, un- | 
 |        less PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set. | 
 |  | 
 |        PCRE2_SUBSTITUTE_OVERFLOW_LENGTH  changes  what happens when the output | 
 |        buffer is too small. The default action is to return PCRE2_ERROR_NOMEM- | 
 |        ORY immediately. If this option  is  set,  however,  pcre2_substitute() | 
 |        continues to go through the motions of matching and substituting (with- | 
 |        out,  of  course,  writing  anything)  in  order to compute the size of | 
 |        buffer that is needed, which will include the extra space for the  ter- | 
 |        minating  NUL. This value is passed back via the outlengthptr variable, | 
 |        with the result of the function still being PCRE2_ERROR_NOMEMORY. | 
 |  | 
 |        Passing a buffer size of zero is a permitted way  of  finding  out  how | 
 |        much memory is needed for a given substitution. However, this does mean | 
 |        that the entire operation is carried out twice. Depending on the appli- | 
 |        cation, it may be more efficient to allocate a large  buffer  and  free | 
 |        the   excess   afterwards,   instead  of  using  PCRE2_SUBSTITUTE_OVER- | 
 |        FLOW_LENGTH. | 
 |  | 
 |        The replacement string, which is interpreted as a  UTF  string  in  UTF | 
 |        mode,  is checked for UTF validity unless PCRE2_NO_UTF_CHECK is set. An | 
 |        invalid UTF replacement string causes an immediate return with the rel- | 
 |        evant UTF error code. | 
 |  | 
 |        If PCRE2_SUBSTITUTE_LITERAL is set, the replacement string is  not  in- | 
 |        terpreted in any way. By default, however, a dollar character is an es- | 
 |        cape  character  that can specify the insertion of characters from cap- | 
 |        ture groups and names from (*MARK) or other control verbs in  the  pat- | 
 |        tern. Dollar is the only escape character (backslash is treated as lit- | 
 |        eral). The following forms are recognized: | 
 |  | 
 |          $$                  insert a dollar character | 
 |          $n or ${n}          insert the contents of group n | 
 |          $0 or $&            insert the entire matched substring | 
 |          $`                  insert the substring that precedes the match | 
 |          $'                  insert the substring that follows the match | 
 |          $_                  insert the entire input string | 
 |          $+                  insert the highest-numbered capture group | 
 |                              which matched | 
 |          $*MARK or ${*MARK}  insert a control verb name | 
 |  | 
 |        Either a group number or a group name can be given for n,  for  example | 
 |        $2  or $NAME. Curly brackets are required only if the following charac- | 
 |        ter would be interpreted as part of the number or name. The number  may | 
 |        be  zero to include the entire matched string. For example, if the pat- | 
 |        tern  a(b)c  is  matched  with  "=abc="  and  the  replacement   string | 
 |        "+$1$0$1+", the result is "=+babcb+=". | 
 |  | 
 |        The  JavaScript  form $<name>, where the angle brackets are part of the | 
 |        syntax, is also recognized for group names, but not for  group  numbers | 
 |        or *MARK. | 
 |  | 
 |        $*MARK  inserts the name from the last encountered backtracking control | 
 |        verb on the matching path that has a name. (*MARK) must always  include | 
 |        a  name,  but  the  other  verbs  need not. For example, in the case of | 
 |        (*MARK:A)(*PRUNE) the name inserted is "A", but for (*MARK:A)(*PRUNE:B) | 
 |        the relevant name is "B". This facility can be used to  perform  simple | 
 |        simultaneous substitutions, as this pcre2test example shows: | 
 |  | 
 |          /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK} | 
 |              apple lemon | 
 |           2: pear orange | 
 |  | 
 |        PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject | 
 |        string,  replacing every matching substring. If this option is not set, | 
 |        only the first matching substring is replaced. The search  for  matches | 
 |        takes  place in the original subject string (that is, previous replace- | 
 |        ments do not affect it).  Iteration is  implemented  by  advancing  the | 
 |        startoffset  value  for  each search, which is always passed the entire | 
 |        subject string. If an offset limit is set in the match context, search- | 
 |        ing stops when that limit is reached. | 
 |  | 
 |        Because a global substitution applies the pattern repeatedly to the | 
 |        subject string and always iterates over non-overlapping matches, | 
 |        pcre2_substitute() never matches or substitutes text inside the | 
 |        replacement strings themselves (there is no recursive or iterative | 
 |        substitution). However, applications can easily implement other | 
 |        replacement strategies, such as replacing iteratively and then | 
 |        matching and replacing on the result. The replacement loop inside | 
 |        pcre2_substitute() is simple and can be emulated in client code by | 
 |        allocating a buffer, searching for matches in a loop, and calling | 
 |        pcre2_substitute() with PCRE2_SUBSTITUTE_REPLACEMENT_ONLY and | 
 |        PCRE2_SUBSTITUTE_MATCHED, and without PCRE2_SUBSTITUTE_GLOBAL. | 
 |  | 
 |        You can restrict the effect of a global substitution to  a  portion  of | 
 |        the subject string by setting either or both of startoffset and an off- | 
 |        set limit. Here is a pcre2test example: | 
 |  | 
 |          /B/g,replace=!,use_offset_limit | 
 |          ABC ABC ABC ABC\=offset=3,offset_limit=12 | 
 |           2: ABC A!C A!C ABC | 
 |  | 
 |        When  continuing  with  global substitutions after matching a substring | 
 |        with zero length, an attempt to find a non-empty match at the same off- | 
 |        set is performed.  If this is not successful, the offset is advanced by | 
 |        one character except when CRLF is a valid newline sequence and the next | 
 |        two characters are CR, LF. In this case, the offset is advanced by  two | 
 |        characters. | 
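 |  | 
 |        The advance rule can be sketched as follows. This is a model under | 
 |        stated assumptions, not library code: next_search_offset() is a | 
 |        hypothetical name, the subject is 8-bit and non-UTF (so one character | 
 |        is one code unit), and the caller states whether CRLF is a valid | 
 |        newline sequence. | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Returns the offset at which to
   resume searching after an empty match at offset, assuming that the
   retry for a non-empty match at the same offset has already failed.
   Assumes an 8-bit, non-UTF subject. A result greater than length means
   that the search is over. */
static size_t next_search_offset(const char *subject, size_t length,
                                 size_t offset, int crlf_is_newline)
{
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;           /* step over CR and LF together */
  return offset + 1;             /* step over one character */
}
```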
 |  | 
 |        PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capture groups that | 
 |        do not appear in the pattern to be treated as unset groups. This option | 
 |        should  be used with care, because it means that a typo in a group name | 
 |        or number no longer causes the PCRE2_ERROR_NOSUBSTRING error. | 
 |  | 
 |        PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capture groups (including un- | 
 |        known groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be  treated | 
 |        as  empty  strings  when inserted as described above. If this option is | 
 |        not set, an attempt to insert an unset group causes the PCRE2_ERROR_UN- | 
 |        SET error. This option does not  influence  the  extended  substitution | 
 |        syntax described below. | 
 |  | 
 |        PCRE2_SUBSTITUTE_EXTENDED  causes extra processing to be applied to the | 
 |        replacement string. Without this option, only the dollar  character  is | 
 |        special,  and  only  the  group insertion forms listed above are valid. | 
 |        When PCRE2_SUBSTITUTE_EXTENDED is set, several things change: | 
 |  | 
 |        Firstly, backslash in a replacement string is interpreted as an  escape | 
 |        character.  The usual forms such as \x{ddd} can be used to specify par- | 
 |        ticular character codes, and backslash followed by any non-alphanumeric | 
 |        character quotes that character. Extended quoting can  be  coded  using | 
 |        \Q...\E,  exactly  as in pattern strings. The escapes \b and \v are in- | 
 |        terpreted as the characters backspace and vertical tab, respectively. | 
 |  | 
 |        The interpretation of backslash followed by one or more digits  is  the | 
 |        same  as  in a pattern, which in Perl has some ambiguities. Details are | 
 |        given in the pcre2pattern page. | 
 |  | 
 |        The Python form \g<n>, where the angle brackets are part of the  syntax | 
 |        and n is either a group name or number, is recognized as an alternative | 
 |        way of inserting the contents of a group, for example \g<3>. | 
 |  | 
 |        There  are  also four escape sequences for forcing the case of inserted | 
 |        letters.  Case forcing applies to all  inserted  characters,  including | 
 |        those  from capture groups and letters within \Q...\E quoted sequences. | 
 |        The insertion mechanism has three states: no case forcing, force  upper | 
 |        case,  and  force  lower  case. The escape sequences change the current | 
 |        state: \U and \L change to upper or lower case  forcing,  respectively, | 
 |        and  \E  (when not terminating a \Q quoted sequence) reverts to no case | 
 |        forcing. The sequences \u and \l force the next character (if it  is  a | 
 |        letter)  to upper or lower case, respectively, and then the state auto- | 
 |        matically reverts to no case forcing. | 
 |  | 
 |        However, if \u is immediately followed by \L or \l is immediately  fol- | 
 |        lowed  by  \U,  the next character's case is forced by the first escape | 
 |        sequence, and subsequent characters by the second. This provides a "ti- | 
 |        tle casing" facility that can be applied to group captures.  For  exam- | 
 |        ple,  if  group 1 has captured "heLLo", the replacement string "\u\L$1" | 
 |        becomes "Hello". | 
 |  | 
 |        If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, | 
 |        Unicode properties are used to force the case of characters whose code | 
 |        points are greater than 127. However, only simple case folding, as | 
 |        determined by the Unicode file CaseFolding.txt, is supported. PCRE2 | 
 |        does not support language-specific special casing rules, such as using | 
 |        different lower case Greek sigmas in the middle and at the ends of | 
 |        words (as defined in the Unicode file SpecialCasing.txt). | 
 |  | 
 |        Note that case forcing sequences such as \U...\E do not nest. For exam- | 
 |        ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final | 
 |        \E has no effect. Note  also  that  the  PCRE2_ALT_BSUX  and  PCRE2_EX- | 
 |        TRA_ALT_BSUX options do not apply to replacement strings. | 
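 |  | 
 |        The case forcing states described above behave like a small state | 
 |        machine. The following is a minimal sketch of that machine, not | 
 |        PCRE2's implementation: force_case() is a hypothetical name, only | 
 |        ASCII letters are handled, and group insertion and \Q...\E are left | 
 |        out. It also assumes that a one-shot \u or \l is used up by the next | 
 |        character whether or not that character is a letter. | 
 |  | 
```c
#include <ctype.h>
#include <stddef.h>

enum force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Hypothetical model, not PCRE2 code. Applies \U \L \E \u \l to the ASCII
   text in in, writing the result to out, which must have room for
   strlen(in) + 1 bytes. Anything else is copied unchanged. */
static void force_case(const char *in, char *out)
{
  enum force state = FORCE_NONE;  /* persistent state set by \U, \L, \E */
  enum force once = FORCE_NONE;   /* one-shot request from \u or \l */

  for (; *in != '\0'; in++) {
    if (in[0] == '\\' && in[1] != '\0') {
      switch (in[1]) {
        case 'U': state = FORCE_UPPER; in++; continue;
        case 'L': state = FORCE_LOWER; in++; continue;
        case 'E': state = FORCE_NONE;  in++; continue;
        /* \u and \l revert the state; a directly following \L or \U
           then sets it again, which gives the "title casing" effect. */
        case 'u': once = FORCE_UPPER; state = FORCE_NONE; in++; continue;
        case 'l': once = FORCE_LOWER; state = FORCE_NONE; in++; continue;
        default: break;           /* not a case escape: copy as text */
      }
    }
    enum force f = (once != FORCE_NONE) ? once : state;
    once = FORCE_NONE;            /* assumption: used up by any character */
    unsigned char c = (unsigned char)*in;
    if (f == FORCE_UPPER) c = (unsigned char)toupper(c);
    else if (f == FORCE_LOWER) c = (unsigned char)tolower(c);
    *out++ = (char)c;
  }
  *out = '\0';
}
```
 |  | 
 |        Under these assumptions, the input "\Uaa\LBB\Ecc\E" yields "AAbbcc", | 
 |        and "\u\LheLLo" yields "Hello", in line with the examples in the text. | 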
 |  | 
 |        The  final  effect  of setting PCRE2_SUBSTITUTE_EXTENDED is to add more | 
 |        flexibility to capture group substitution. The  syntax  is  similar  to | 
 |        that used by Bash: | 
 |  | 
 |          ${n:-string} | 
 |          ${n:+string1:string2} | 
 |  | 
 |        As  in  the  simple  case, n may be a group number or a name. The first | 
 |        form specifies a default value. If group n is set,  its  value  is  in- | 
 |        serted;  if  not,  the  string is expanded and the result inserted. The | 
 |        second form specifies strings that are expanded and inserted when group | 
 |        n is set or unset, respectively. The first form is  just  a  convenient | 
 |        shorthand for | 
 |  | 
 |          ${n:+${n}:string} | 
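 |  | 
 |        The equivalence of the two forms can be modelled directly. In this | 
 |        sketch the function names are hypothetical (this is not PCRE2 code), a | 
 |        NULL pointer stands for an unset group, and the expansion of the | 
 |        strings themselves is ignored. | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical model of ${n:+string1:string2}; value == NULL stands for
   an unset group. */
static const char *conditional_form(const char *value, const char *if_set,
                                    const char *if_unset)
{
  return value != NULL ? if_set : if_unset;
}

/* Hypothetical model of ${n:-string}, written as ${n:+${n}:string}. */
static const char *default_form(const char *value, const char *fallback)
{
  return conditional_form(value, value, fallback);
}
```
 |  | 
 |        Note that a group that matched an empty string is set, so in this | 
 |        model default_form("", "x") yields the empty string rather than "x". | 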
 |  | 
 |        Backslash  can  be  used to escape colons and closing curly brackets in | 
 |        the replacement strings. A change of the case forcing  state  within  a | 
 |        replacement  string  remains  in  force  afterwards,  as  shown in this | 
 |        pcre2test example: | 
 |  | 
 |          /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo | 
 |              body | 
 |           1: hello | 
 |              somebody | 
 |           1: HELLO | 
 |  | 
 |        The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these  extended | 
 |        substitutions.  However,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause un- | 
 |        known groups in the extended syntax forms to be treated as unset. | 
 |  | 
 |        If  PCRE2_SUBSTITUTE_LITERAL  is  set,  PCRE2_SUBSTITUTE_UNKNOWN_UNSET, | 
 |        PCRE2_SUBSTITUTE_UNSET_EMPTY, and PCRE2_SUBSTITUTE_EXTENDED are irrele- | 
 |        vant and are ignored. | 
 |  | 
 |    Substitution errors | 
 |  | 
 |        In  the  event of an error, pcre2_substitute() returns a negative error | 
 |        code. Except for PCRE2_ERROR_NOMATCH (which is never returned),  errors | 
 |        from pcre2_match() are passed straight back. | 
 |  | 
 |        PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser- | 
 |        tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set. | 
 |  | 
 |        PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ- | 
 |        ing  an  unknown  substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) | 
 |        when the simple (non-extended) syntax is used and  PCRE2_SUBSTITUTE_UN- | 
 |        SET_EMPTY is not set. | 
 |  | 
 |        PCRE2_ERROR_NOMEMORY  is  returned  if  the  output  buffer  is not big | 
 |        enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size | 
 |        of buffer that is needed is returned via outlengthptr. Note  that  this | 
 |        does not happen by default. | 
 |  | 
 |        PCRE2_ERROR_NULL is returned if PCRE2_SUBSTITUTE_MATCHED is set but the | 
 |        match_data  argument is NULL or if the subject or replacement arguments | 
 |        are NULL. For backward compatibility reasons an exception is  made  for | 
 |        the replacement argument if the rlength argument is also 0. | 
 |  | 
 |        PCRE2_ERROR_BADREPLACEMENT  is  used for miscellaneous syntax errors in | 
 |        the replacement string, with more  particular  errors  being  PCRE2_ER- | 
 |        ROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REPMISSINGBRACE | 
 |        (closing  curly bracket not found), PCRE2_ERROR_BADSUBSTITUTION (syntax | 
 |        error in extended group substitution),  and  PCRE2_ERROR_BADSUBSPATTERN | 
 |        (the pattern match ended before it started or the match started earlier | 
 |        than  the  current  position  in the subject, which can happen if \K is | 
 |        used in a lookaround assertion). | 
 |  | 
 |        As for all PCRE2 errors, a text message that describes the error can be | 
 |        obtained by calling the pcre2_get_error_message()  function  (see  "Ob- | 
 |        taining a textual error message" above). | 
 |  | 
 |    Substitution callouts | 
 |  | 
 |        int pcre2_set_substitute_callout(pcre2_match_context *mcontext, | 
 |          int (*callout_function)(pcre2_substitute_callout_block *, void *), | 
 |          void *callout_data); | 
 |  | 
 |        The  pcre2_set_substitute_callout()  function  can be used to specify a | 
 |        callout function for pcre2_substitute(). This information is passed  in | 
 |        a match context. The callout function is called after each substitution | 
 |        has been processed, but it can cause the replacement not to happen. | 
 |  | 
 |        The callout function is not called for simulated substitutions that | 
 |        happen as a result of the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option. In | 
 |        this mode, when substitution processing exceeds the buffer space | 
 |        provided by the caller, processing continues by counting code units. | 
 |        The simulation cannot populate the callout block, so it is pessimistic | 
 |        about the required buffer size: it counts the larger of the two | 
 |        possible outcomes, an accepted or a rejected substitution, towards the | 
 |        required size. The returned buffer length may therefore be an | 
 |        overestimate (without a substitution callout, it is normally an exact | 
 |        measurement). | 
 |  | 
 |        The first argument of the callout function is a pointer to a substitute | 
 |        callout  block structure, which contains the following fields, not nec- | 
 |        essarily in this order: | 
 |  | 
 |          uint32_t    version; | 
 |          uint32_t    subscount; | 
 |          PCRE2_SPTR  input; | 
 |          PCRE2_SPTR  output; | 
 |          PCRE2_SIZE *ovector; | 
 |          uint32_t    oveccount; | 
 |          PCRE2_SIZE  output_offsets[2]; | 
 |  | 
 |        The version field contains the version number of the block format.  The | 
 |        current  version  is  0.  The version number will increase in future if | 
 |        more fields are added, but the intention is never to remove any of  the | 
 |        existing fields. | 
 |  | 
 |        The subscount field is the number of the current match. It is 1 for the | 
 |        first callout, 2 for the second, and so on. The input and output point- | 
 |        ers are copies of the values passed to pcre2_substitute(). | 
 |  | 
 |        The  ovector  field points to the ovector, which contains the result of | 
 |        the most recent match. The oveccount field contains the number of pairs | 
 |        that are set in the ovector, and is always greater than zero. | 
 |  | 
 |        The output_offsets vector contains the offsets of  the  replacement  in | 
 |        the  output  string. This has already been processed for dollar and (if | 
 |        requested) backslash substitutions as described above. | 
 |  | 
 |        The second argument of the callout function  is  the  value  passed  as | 
 |        callout_data  when  the  function was registered. The value returned by | 
 |        the callout function is interpreted as follows: | 
 |  | 
 |        If the value is zero, the replacement is accepted, and,  if  PCRE2_SUB- | 
 |        STITUTE_GLOBAL  is set, processing continues with a search for the next | 
 |        match. If the value is not zero, the current  replacement  is  not  ac- | 
 |        cepted.  If  the  value is greater than zero, processing continues when | 
 |        PCRE2_SUBSTITUTE_GLOBAL is set. Otherwise (the value is less than  zero | 
 |        or PCRE2_SUBSTITUTE_GLOBAL is not set), the rest of the input is copied | 
 |        to  the  output and the call to pcre2_substitute() exits, returning the | 
 |        number of matches so far. | 
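 |  | 
 |        The rules for the return value can be summarized as a small decision | 
 |        function. This is a hedged model of the description above, not PCRE2 | 
 |        code; the struct and function names are hypothetical. | 
 |  | 
```c
/* Hypothetical model of how a substitute callout's return value is
   interpreted; not PCRE2 code. */
struct callout_outcome {
  int replacement_accepted;  /* the replacement goes into the output */
  int search_continues;      /* another match is looked for afterwards */
};

/* value is what the callout returned; global is non-zero when
   PCRE2_SUBSTITUTE_GLOBAL is set. */
static struct callout_outcome interpret_callout(int value, int global)
{
  struct callout_outcome o;
  o.replacement_accepted = (value == 0);
  /* A negative value, or the absence of the global option, ends the
     call: the rest of the input is copied to the output. */
  o.search_continues = (global && value >= 0);
  return o;
}
```
 |  | 
 |        For example, a callout that returns 1 for every second match rejects | 
 |        those replacements while a global substitution carries on, whereas | 
 |        returning -1 stops the whole call at that point. | 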
 |  | 
 |    Substitution case callouts | 
 |  | 
 |        int pcre2_set_substitute_case_callout(pcre2_match_context *mcontext, | 
 |          PCRE2_SIZE (*callout_function)(PCRE2_SPTR, PCRE2_SIZE, | 
 |                                         PCRE2_UCHAR *, PCRE2_SIZE, | 
 |                                         int, void *), | 
 |          void *callout_data); | 
 |  | 
 |        The pcre2_set_substitute_case_callout() function can be used to specify | 
 |        a callout function for pcre2_substitute() to use when  performing  case | 
 |        transformations.  This does not affect any case insensitivity behaviour | 
 |        when performing a match, but only the user-visible transformations per- | 
 |        formed when processing a substitution such as: | 
 |  | 
 |            pcre2_substitute(..., "\\U$1", ...) | 
 |  | 
 |        The default case transformations applied by PCRE2 are  reasonably  com- | 
 |        plete,  and,  in  UTF  or UCP mode, perform the simple locale-invariant | 
 |        case transformations as specified by Unicode. This is suitable for  the | 
 |        internal  (invisible)  case-equivalence  procedures used during pattern | 
 |        matching, but an application may wish to use more sophisticated locale- | 
 |        aware processing for the user-visible substitution transformations. | 
 |  | 
 |        One example implementation of the callout_function using  the  ICU  li- | 
 |        brary would be: | 
 |  | 
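 |            PCRE2_SIZE | 
 |            icu_case_callout( | 
 |              PCRE2_SPTR input, PCRE2_SIZE input_len, | 
 |              PCRE2_UCHAR *output, PCRE2_SIZE output_cap, | 
 |              int to_case, void *data_ptr) | 
 |            { | 
 |              /* Assumes the 16-bit library, so that PCRE2 code units match | 
 |                 ICU's UChar. first_char_only stands for a break iterator | 
 |                 that the application must supply; it is not defined here. */ | 
 |              UErrorCode err = U_ZERO_ERROR; | 
 |              int32_t r = to_case == PCRE2_SUBSTITUTE_CASE_LOWER | 
 |                ? u_strToLower(output, output_cap, input, input_len, NULL, &err) | 
 |                : to_case == PCRE2_SUBSTITUTE_CASE_UPPER | 
 |                ? u_strToUpper(output, output_cap, input, input_len, NULL, &err) | 
 |                : u_strToTitle(output, output_cap, input, input_len, &first_char_only, | 
 |                               NULL, &err); | 
 |              /* A buffer that is too small is not a failure: ICU has set r to | 
 |                 the required length, which is what PCRE2 expects back. */ | 
 |              if (err == U_BUFFER_OVERFLOW_ERROR) return (PCRE2_SIZE)r; | 
 |              if (U_FAILURE(err)) return (~(PCRE2_SIZE)0); | 
 |              return (PCRE2_SIZE)r; | 
 |            } | 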
 |  | 
 |        The first and second arguments of the case callout function are a | 
 |        pointer to the Unicode string to transform and its length. | 
 |  | 
 |        The third and fourth arguments are the output buffer and its capacity. | 
 |  | 
 |        The  fifth  is  one  of  the   constants   PCRE2_SUBSTITUTE_CASE_LOWER, | 
 |        PCRE2_SUBSTITUTE_CASE_UPPER,    or   PCRE2_SUBSTITUTE_CASE_TITLE_FIRST. | 
 |        PCRE2_SUBSTITUTE_CASE_LOWER and PCRE2_SUBSTITUTE_CASE_UPPER are  passed | 
 |        to  the  callout  to indicate that the case of the entire callout input | 
 |        should be case-transformed. PCRE2_SUBSTITUTE_CASE_TITLE_FIRST is passed | 
 |        to indicate that only the first character or  glyph  should  be  trans- | 
 |        formed  to  Unicode  titlecase  and the rest to Unicode lowercase (note | 
 |        that titlecasing sometimes uses Unicode properties  to  titlecase  each | 
 |        word  in a string; but PCRE2 is requesting that only the single leading | 
 |        character is to be titlecased). | 
 |  | 
 |        The sixth argument is the callout_data  supplied  to  pcre2_set_substi- | 
 |        tute_case_callout(). | 
 |  | 
 |        The resulting string in the destination buffer may be larger or smaller | 
 |        than  the input, if the casing rules merge or split characters. The re- | 
 |        turn value is the length required for the output string. If a buffer of | 
 |        sufficient size was provided to the callout, then the  result  must  be | 
 |        written to the buffer and the number of code units returned. If the re- | 
 |        sult  does  not  fit in the provided buffer, then the required capacity | 
 |        must be returned and PCRE2 will not make  use  of  the  output  buffer. | 
 |        PCRE2  provides  input and output buffers which overlap, so the callout | 
 |        must support this by suitable internal buffering. | 
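 |  | 
 |        The buffer contract described above can be illustrated without any | 
 |        Unicode library. The following is a minimal ASCII-only sketch, not a | 
 |        replacement for a real callout: ascii_case_callout() and the DEMO_ | 
 |        constants are hypothetical names, and unsigned char and size_t stand | 
 |        in for PCRE2_UCHAR and PCRE2_SIZE as they would in the 8-bit library. | 
 |  | 
```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the PCRE2_SUBSTITUTE_CASE_* constants; the names and
   values here are hypothetical. */
enum demo_case { DEMO_CASE_LOWER, DEMO_CASE_UPPER, DEMO_CASE_TITLE_FIRST };

/* ASCII-only model of a case callout, with unsigned char and size_t
   standing in for PCRE2_UCHAR and PCRE2_SIZE. */
static size_t ascii_case_callout(const unsigned char *input, size_t input_len,
                                 unsigned char *output, size_t output_cap,
                                 int to_case, void *data)
{
  (void)data;                    /* unused in this sketch */

  /* ASCII case changes never alter the length. If the result does not
     fit, report the required length and leave the buffer alone. */
  if (input_len > output_cap) return input_len;

  /* The buffers may overlap, so copy with memmove() and then transform
     the copy in place. */
  memmove(output, input, input_len);
  for (size_t i = 0; i < input_len; i++) {
    int upper = to_case == DEMO_CASE_UPPER ||
                (to_case == DEMO_CASE_TITLE_FIRST && i == 0);
    output[i] = (unsigned char)(upper ? toupper(output[i])
                                      : tolower(output[i]));
  }
  return input_len;              /* number of code units written */
}
```
 |  | 
 |        A callout for real Unicode text differs mainly in that the required | 
 |        length can be larger or smaller than the input length, so it has to be | 
 |        computed before deciding whether the result fits. | 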
 |  | 
 |        Alternatively, if the callout wishes to indicate an error, then it  may | 
 |        return  (~(PCRE2_SIZE)0).  In this case pcre2_substitute() will immedi- | 
 |        ately fail with error PCRE2_ERROR_REPLACECASE. | 
 |  | 
 |        When a case callout is combined with the | 
 |        PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option, there are situations when | 
 |        pcre2_substitute() will return an underestimate of the required buffer | 
 |        size. If you call pcre2_substitute() once with | 
 |        PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, and the output buffer that you | 
 |        supplied is too small for the replacement string to be constructed, | 
 |        then instead of calling the case callout, pcre2_substitute() will make | 
 |        an estimate of the required buffer size. The second call should also | 
 |        pass PCRE2_SUBSTITUTE_OVERFLOW_LENGTH, because that second call is not | 
 |        guaranteed to succeed either, if the case callout requires more buffer | 
 |        space than expected. The caller must make repeated attempts in a loop. | 
 |  | 
 |  | 
 | DUPLICATE CAPTURE GROUP NAMES | 
 |  | 
 |        int pcre2_substring_nametable_scan(const pcre2_code *code, | 
 |          PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last); | 
 |  | 
 |        When a pattern is compiled with the PCRE2_DUPNAMES  option,  names  for | 
 |        capture  groups  are not required to be unique. Duplicate names are al- | 
 |        ways allowed for groups with the same number, created by using the  (?| | 
 |        feature. Indeed, if such groups are named, they are required to use the | 
 |        same names. | 
 |  | 
 |        Normally,  patterns  that  use duplicate names are such that in any one | 
 |        match, only one of each set of identically-named  groups  participates. | 
 |        An example is shown in the pcre2pattern documentation. | 
 |  | 
 |        When duplicates are present, pcre2_substring_copy_byname() and | 
 |        pcre2_substring_get_byname() return the first substring corresponding | 
 |        to the given name that is set. Only if none are set is | 
 |        PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name() | 
 |        function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there | 
 |        are duplicate names. | 
 |  | 
 |        If you want to get full details of all captured substrings for a  given | 
 |        name,  you  must use the pcre2_substring_nametable_scan() function. The | 
 |        first argument is the compiled pattern, and the second is the name.  If | 
 |        the  third  and fourth arguments are NULL, the function returns a group | 
 |        number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise. | 
 |  | 
 |        When the third and fourth arguments are not NULL, they must be pointers | 
 |        to variables that are updated by the function. After it has  run,  they | 
 |        point to the first and last entries in the name-to-number table for the | 
 |        given  name,  and the function returns the length of each entry in code | 
 |        units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there  are | 
 |        no entries for the given name. | 
 |  | 
 |        The format of the name table is described above in the section entitled | 
 |        "Information about a pattern". Given all the relevant entries for the | 
 |        name, you can extract each of their numbers,  and  hence  the  captured | 
 |        data. | 
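 |  | 
 |        As a hedged illustration, the entries between first and last can be | 
 |        walked as shown below. The sketch assumes the 8-bit library, where | 
 |        each name table entry starts with the group number in two bytes, most | 
 |        significant byte first, followed by the zero-terminated name. The | 
 |        function name collect_group_numbers() is hypothetical; entry_size is | 
 |        the value returned by pcre2_substring_nametable_scan(). | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Collects the group numbers
   from the name table entries first..last (inclusive), each entry_size
   bytes long, assuming the 8-bit library's entry layout. Returns the
   number of entries read, at most max_numbers. */
static size_t collect_group_numbers(const unsigned char *first,
                                    const unsigned char *last,
                                    size_t entry_size,
                                    unsigned *numbers, size_t max_numbers)
{
  size_t count = 0;
  if (entry_size == 0) return 0;           /* avoid an endless loop */
  for (const unsigned char *entry = first; entry <= last;
       entry += entry_size) {
    if (count == max_numbers) break;
    /* Two-byte group number, most significant byte first. */
    numbers[count++] = ((unsigned)entry[0] << 8) | entry[1];
  }
  return count;
}
```
 |  | 
 |        For each number g obtained this way, the captured substring (if that | 
 |        group is set) lies between ovector[2*g] and ovector[2*g+1]. | 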
 |  | 
 |  | 
 | FINDING ALL POSSIBLE MATCHES AT ONE POSITION | 
 |  | 
 |        The  traditional  matching  function  uses a similar algorithm to Perl, | 
 |        which stops when it finds the first match at a given point in the  sub- | 
 |        ject. If you want to find all possible matches, or the longest possible | 
 |        match  at  a  given  position,  consider using the alternative matching | 
 |        function (see below) instead. If you cannot use the  alternative  func- | 
 |        tion, you can kludge it up by making use of the callout facility, which | 
 |        is described in the pcre2callout documentation. | 
 |  | 
 |        What you have to do is to insert a callout right at the end of the pat- | 
 |        tern.   When your callout function is called, extract and save the cur- | 
 |        rent matched substring. Then return 1, which  forces  pcre2_match()  to | 
 |        backtrack  and  try other alternatives. Ultimately, when it runs out of | 
 |        matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH. | 
 |  | 
 |  | 
 | MATCHING A PATTERN: THE ALTERNATIVE FUNCTION | 
 |  | 
 |        int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject, | 
 |          PCRE2_SIZE length, PCRE2_SIZE startoffset, | 
 |          uint32_t options, pcre2_match_data *match_data, | 
 |          pcre2_match_context *mcontext, | 
 |          int *workspace, PCRE2_SIZE wscount); | 
 |  | 
 |        The function pcre2_dfa_match() is called  to  match  a  subject  string | 
 |        against  a  compiled pattern, using a matching algorithm that scans the | 
 |        subject string just once (not counting lookaround assertions), and does | 
 |        not backtrack (except when processing lookaround assertions). This  has | 
 |        different  characteristics to the normal algorithm, and is not compati- | 
 |        ble with Perl. Some of the features of  PCRE2  patterns  are  not  sup- | 
 |        ported. Nevertheless, there are times when this kind of matching can be | 
 |        useful.  For a discussion of the two matching algorithms, and a list of | 
 |        features that pcre2_dfa_match() does not support, see the pcre2matching | 
 |        documentation. | 
 |  | 
 |        The arguments for the pcre2_dfa_match() function are the  same  as  for | 
 |        pcre2_match(), plus two extras. The ovector within the match data block | 
 |        is used in a different way, and this is described below. The other com- | 
 |        mon  arguments  are used in the same way as for pcre2_match(), so their | 
 |        description is not repeated here. | 
 |  | 
 |        The two additional arguments provide workspace for  the  function.  The | 
 |        workspace  vector  should  contain at least 20 elements. It is used for | 
 |        keeping track of multiple paths through the pattern  tree.  More  work- | 
 |        space  is needed for patterns and subjects where there are a lot of po- | 
 |        tential matches. | 
 |  | 
 |        Here is an example of a simple call to pcre2_dfa_match(): | 
 |  | 
 |          int wspace[20]; | 
 |          pcre2_match_data *md = pcre2_match_data_create(4, NULL); | 
 |          int rc = pcre2_dfa_match( | 
 |            re,             /* result of pcre2_compile() */ | 
 |            "some string",  /* the subject string */ | 
 |            11,             /* the length of the subject string */ | 
 |            0,              /* start at offset 0 in the subject */ | 
 |            0,              /* default options */ | 
 |            md,             /* the match data block */ | 
 |            NULL,           /* a match context; NULL means use defaults */ | 
 |            wspace,         /* working space vector */ | 
 |            20);            /* number of elements (NOT size in bytes) */ | 
 |  | 
 |    Option bits for pcre2_dfa_match() | 
 |  | 
 |        The unused bits of the options argument for pcre2_dfa_match()  must  be | 
 |        zero.   The   only   bits   that   may   be   set  are  PCRE2_ANCHORED, | 
 |        PCRE2_COPY_MATCHED_SUBJECT, PCRE2_ENDANCHORED, PCRE2_NOTBOL,  PCRE2_NO- | 
 |        TEOL,   PCRE2_NOTEMPTY,   PCRE2_NOTEMPTY_ATSTART,   PCRE2_NO_UTF_CHECK, | 
 |        PCRE2_PARTIAL_HARD,   PCRE2_PARTIAL_SOFT,    PCRE2_DFA_SHORTEST,    and | 
 |        PCRE2_DFA_RESTART.  All but the last four of these are exactly the same | 
 |        as for pcre2_match(), so their description is not repeated here. | 
 |  | 
 |          PCRE2_PARTIAL_HARD | 
 |          PCRE2_PARTIAL_SOFT | 
 |  | 
 |        These have the same general effect as they do  for  pcre2_match(),  but | 
 |        the  details are slightly different. When PCRE2_PARTIAL_HARD is set for | 
 |        pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if  the  end  of  the | 
 |        subject is reached and there is still at least one matching possibility | 
 |        that requires additional characters. This happens even if some complete | 
 |        matches  have  already  been found. When PCRE2_PARTIAL_SOFT is set, the | 
 |        return code PCRE2_ERROR_NOMATCH is converted  into  PCRE2_ERROR_PARTIAL | 
 |        if  the  end  of  the  subject  is reached, there have been no complete | 
 |        matches, but there is still at least one matching possibility. The por- | 
 |        tion of the string that was inspected when the  longest  partial  match | 
 |        was found is set as the first matching string in both cases. There is a | 
 |        more  detailed  discussion  of partial and multi-segment matching, with | 
 |        examples, in the pcre2partial documentation. | 
 |  | 
 |          PCRE2_DFA_SHORTEST | 
 |  | 
 |        Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm  to | 
 |        stop as soon as it has found one match. Because of the way the alterna- | 
 |        tive  algorithm  works, this is necessarily the shortest possible match | 
 |        at the first possible matching point in the subject string. | 
 |  | 
 |          PCRE2_DFA_RESTART | 
 |  | 
 |        When pcre2_dfa_match() returns a partial match, it is possible to call | 
 |        it again, with additional subject characters, and have it continue with | 
 |        the same match. The PCRE2_DFA_RESTART option requests this action; when | 
 |        it is set, the workspace and wscount arguments must reference the same | 
 |        vector as before, because data about the match so far is left in it | 
 |        after a partial match. There is more discussion of this facility in the | 
 |        pcre2partial documentation. | 
 |  | 
 |    Successful returns from pcre2_dfa_match() | 
 |  | 
 |        When pcre2_dfa_match() succeeds, it may have matched more than one sub- | 
 |        string in the subject. Note, however, that all the matches from one run | 
 |        of  the  function  start  at the same point in the subject. The shorter | 
 |        matches are all initial substrings of the longer matches. For  example, | 
 |        if the pattern | 
 |  | 
 |          <.*> | 
 |  | 
 |        is matched against the string | 
 |  | 
 |          This is <something> <something else> <something further> no more | 
 |  | 
 |        the three matched strings are | 
 |  | 
 |          <something> <something else> <something further> | 
 |          <something> <something else> | 
 |          <something> | 
 |  | 
 |        On  success,  the  yield of the function is a number greater than zero, | 
 |        which is the number of matched substrings.  The  offsets  of  the  sub- | 
 |        strings  are returned in the ovector, and can be extracted by number in | 
 |        the same way as for pcre2_match(), but the numbers bear no relation  to | 
 |        any  capture groups that may exist in the pattern, because DFA matching | 
 |        does not support capturing. | 
 |  | 
 |        Calls to the convenience functions that extract substrings by name  re- | 
 |        turn the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used af- | 
 |        ter  a  DFA match. The convenience functions that extract substrings by | 
 |        number never return PCRE2_ERROR_NOSUBSTRING. | 
 |  | 
 |        The matched strings are stored in the ovector in order of decreasing | 
 |        length; that is, the longest matching string is first. If there were | 
 |        too many matches to fit into the ovector, the yield of the function is | 
 |        zero, and the vector is filled with the longest matches. | 
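 |  | 
 |        Reading the results back can be sketched as follows. The helper names | 
 |        are hypothetical and the code is a model rather than library code; in | 
 |        a real program the offsets would come from | 
 |        pcre2_get_ovector_pointer() and the capacity in pairs from | 
 |        pcre2_get_ovector_count(). An 8-bit subject is assumed. | 
 |  | 
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: how many ovector pairs hold matches after a call
   that returned rc, given an ovector with room for ovector_pairs. */
static size_t dfa_pairs_filled(int rc, size_t ovector_pairs)
{
  if (rc > 0) return (size_t)rc;      /* rc matches were found */
  if (rc == 0) return ovector_pairs;  /* too many: the vector is full */
  return 0;                           /* negative: an error or no match */
}

/* Hypothetical helper: copies match k (0 is the longest) into out as a
   zero-terminated string. Returns its length, or (size_t)-1 if out is
   too small. */
static size_t copy_dfa_match(const char *subject, const size_t *ovector,
                             size_t k, char *out, size_t cap)
{
  size_t start = ovector[2 * k];
  size_t len = ovector[2 * k + 1] - start;
  if (len + 1 > cap) return (size_t)-1;
  memcpy(out, subject + start, len);
  out[len] = '\0';
  return len;
}
```
 |  | 
 |        For the <.*> example above, the three matches all start at offset 8 | 
 |        and end at offsets 56, 36, and 19, so the pairs appear in the ovector | 
 |        in that order. | 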
 |  | 
 |        NOTE:  PCRE2's  "auto-possessification" optimization usually applies to | 
 |        character repeats at the end of a pattern (as well as internally).  For | 
 |        example,  the pattern "a\d+" is compiled as if it were "a\d++". For DFA | 
 |        matching, this means that only one possible match is found. If you  re- | 
 |        ally do want multiple matches in such cases, either use an ungreedy re- | 
 |        peat  such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when com- | 
 |        piling. | 
 |  | 
 |    Error returns from pcre2_dfa_match() | 
 |  | 
 |        The pcre2_dfa_match() function returns a negative number when it fails. | 
 |        Many of the errors are the same  as  for  pcre2_match(),  as  described | 
 |        above.  There are in addition the following errors that are specific to | 
 |        pcre2_dfa_match(): | 
 |  | 
 |          PCRE2_ERROR_DFA_UITEM | 
 |  | 
 |        This return is given if pcre2_dfa_match() encounters an item in the | 
 |        pattern that it does not support, for instance, the use of \C in UTF | 
 |        mode, or a backreference. | 
 |  | 
 |          PCRE2_ERROR_DFA_UCOND | 
 |  | 
 |        This  return  is given if pcre2_dfa_match() encounters a condition item | 
 |        that uses a backreference for the condition, or a test for recursion in | 
 |        a specific capture group. These are not supported. | 
 |  | 
 |          PCRE2_ERROR_DFA_UINVALID_UTF | 
 |  | 
 |        This return is given if pcre2_dfa_match() is called for a pattern  that | 
 |        was  compiled  with  PCRE2_MATCH_INVALID_UTF. This is not supported for | 
 |        DFA matching. | 
 |  | 
 |          PCRE2_ERROR_DFA_WSSIZE | 
 |  | 
 |        This return is given if pcre2_dfa_match() runs  out  of  space  in  the | 
 |        workspace vector. | 
 |  | 
 |          PCRE2_ERROR_DFA_RECURSE | 
 |  | 
 |        When a recursion or subroutine call is processed, the matching function | 
 |        calls  itself  recursively,  using  private  memory for the ovector and | 
 |        workspace.  This error is given if the internal ovector  is  not  large | 
 |        enough.  This  should  be  extremely  rare, as a vector of size 1000 is | 
 |        used. | 
 |  | 
 |          PCRE2_ERROR_DFA_BADRESTART | 
 |  | 
 |        When pcre2_dfa_match() is called  with  the  PCRE2_DFA_RESTART  option, | 
 |        some  plausibility  checks  are  made on the contents of the workspace, | 
 |        which should contain data about the previous partial match. If  any  of | 
 |        these checks fail, this error is given. | 
 |  | 
 |  | 
 | SEE ALSO | 
 |  | 
 |        pcre2build(3),    pcre2callout(3),    pcre2demo(3),   pcre2matching(3), | 
 |        pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3). | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 19 October 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                 19 October 2025                    PCRE2API(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2BUILD(3)              Library Functions Manual              PCRE2BUILD(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | BUILDING PCRE2 | 
 |  | 
 |        PCRE2  is distributed with a configure script that can be used to build | 
 |        the library in Unix-like environments using the Autotools applications. | 
 |        Also in the distribution are files to support building using CMake  in- | 
 |        stead  of  configure. The text file README contains general information | 
 |        about building with Autotools (some of which is  repeated  below),  and | 
 |        also has some comments about building on various operating systems. The | 
 |        files  in the vms directory support building under OpenVMS.  There is a | 
 |        lot more information about building PCRE2 without using Autotools  (in- | 
 |        cluding  information  about  using CMake and building "by hand") in the | 
 |        text file called NON-AUTOTOOLS-BUILD.  You should consult this file  as | 
 |        well as the README file if you are building in a non-Unix-like environ- | 
 |        ment. | 
 |  | 
 |  | 
 | PCRE2 BUILD-TIME OPTIONS | 
 |  | 
 |        The rest of this document describes the optional features of PCRE2 that | 
 |        can  be  selected  when  the library is compiled. It assumes use of the | 
 |        configure script, where the optional features  are  selected  or  dese- | 
 |        lected  by  providing options to configure before running the make com- | 
 |        mand. However, the same options can be selected in both  Unix-like  and | 
 |        non-Unix-like  environments if you are using CMake instead of configure | 
 |        to build PCRE2. | 
 |  | 
 |        If you are not using Autotools or CMake, option selection can  be  done | 
 |        by  editing  the config.h file, or by passing parameter settings to the | 
 |        compiler, as described in NON-AUTOTOOLS-BUILD. | 
 |  | 
 |        The complete list of options for configure (which includes the standard | 
 |        ones such as the selection of the installation directory)  can  be  ob- | 
 |        tained by running | 
 |  | 
 |          ./configure --help | 
 |  | 
 |        The  following  sections include descriptions of "on/off" options whose | 
 |        names begin with --enable or --disable. Because of the way that config- | 
 |        ure works, --enable and --disable always come in pairs, so the  comple- | 
 |        mentary  option always exists as well, but as it specifies the default, | 
 |        it is not described.  Options that specify values have names that start | 
 |        with --with. At the end of a configure run, a summary of the configura- | 
 |        tion is output. | 
 |  | 
 |  | 
 | BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES | 
 |  | 
 |        By default, a library called libpcre2-8 is built, containing  functions | 
 |        that  take  string  arguments contained in arrays of bytes, interpreted | 
 |        either as single-byte characters, or UTF-8 strings. You can also  build | 
 |        two  other libraries, called libpcre2-16 and libpcre2-32, which process | 
 |        strings that are contained in arrays of 16-bit and 32-bit  code  units, | 
 |        respectively. These can be interpreted either as single-unit characters | 
 |        or  UTF-16/UTF-32 strings. To build these additional libraries, add one | 
 |        or both of the following to the configure command: | 
 |  | 
 |          --enable-pcre2-16 | 
 |          --enable-pcre2-32 | 
 |  | 
 |        If you do not want the 8-bit library, add | 
 |  | 
 |          --disable-pcre2-8 | 
 |  | 
 |        as well. At least one of the three libraries must be built.  Note  that | 
 |        the  POSIX wrapper is for the 8-bit library only, and that pcre2grep is | 
 |        an 8-bit program. Neither of these is built if  you  select  only  the | 
 |        16-bit or 32-bit libraries. | 
 |  | 
 |  | 
 | BUILDING SHARED AND STATIC LIBRARIES | 
 |  | 
 |        The  Autotools PCRE2 building process uses libtool to build both shared | 
 |        and static libraries by default. You can suppress an  unwanted  library | 
 |        by adding one of | 
 |  | 
 |          --disable-shared | 
 |          --disable-static | 
 |  | 
 |        to  the  configure command. Setting --disable-shared ensures that PCRE2 | 
 |        libraries are built as static libraries. The  binaries  that  are  then | 
 |        created  as  part  of  the  build  process  (for example, pcre2test and | 
 |        pcre2grep) are linked statically with one or more PCRE2 libraries,  but | 
 |        may  also  be  dynamically linked with other libraries such as libc. If | 
 |        you want these binaries to be fully statically linked, you can set  LD- | 
 |        FLAGS like this: | 
 |  | 
 |          LDFLAGS=--static ./configure --disable-shared | 
 |  | 
 |        Note  the two hyphens in --static. Of course, this works only if static | 
 |        versions of all the relevant libraries are available for linking. | 
 |  | 
 |  | 
 | UNICODE AND UTF SUPPORT | 
 |  | 
 |        By default, PCRE2 is built with support for Unicode and  UTF  character | 
 |        strings.  To build it without Unicode support, add | 
 |  | 
 |          --disable-unicode | 
 |  | 
 |        to  the configure command. This setting applies to all three libraries. | 
 |        It is not possible to build one library with Unicode  support  and  an- | 
 |        other without in the same configuration. | 
 |  | 
 |        Of  itself, Unicode support does not make PCRE2 treat strings as UTF-8, | 
 |        UTF-16 or UTF-32. To do that, applications that use the library can set | 
 |        the PCRE2_UTF option when they call pcre2_compile() to compile  a  pat- | 
 |        tern.   Alternatively,  patterns  may be started with (*UTF) unless the | 
 |        application has locked this out by setting PCRE2_NEVER_UTF. | 
 |  | 
 |        UTF support allows the libraries to process character code points up to | 
 |        0x10ffff in the strings that they handle. Unicode  support  also  gives | 
 |        access  to  the Unicode properties of characters, using pattern escapes | 
 |        such as \P, \p, and \X. Only the general category properties such as Lu | 
 |        and Nd, script names, and some bi-directional and binary properties are | 
 |        supported.  Details are given in the pcre2pattern documentation. | 
 |  | 
 |        Pattern escapes such as \d and \w do not by default make use of Unicode | 
 |        properties. The application can request that they  do  by  setting  the | 
 |        PCRE2_UCP  option.  Unless  the  application has set PCRE2_NEVER_UCP, a | 
 |        pattern may also request this by starting with (*UCP). | 
 |  | 
 |  | 
 | DISABLING THE USE OF \C | 
 |  | 
 |        The \C escape sequence, which matches a single code unit, even in a UTF | 
 |        mode, can cause unpredictable behaviour because it may leave  the  cur- | 
 |        rent  matching  point in the middle of a multi-code-unit character. The | 
 |        application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C  op- | 
 |        tion when calling pcre2_compile(). There is also a build-time option | 
 |  | 
 |          --enable-never-backslash-C | 
 |  | 
 |        (note the upper case C) which locks out the use of \C entirely. | 
 |  | 
 |  | 
 | JUST-IN-TIME COMPILER SUPPORT | 
 |  | 
 |        Just-in-time  (JIT) compiler support is included in the build by speci- | 
 |        fying | 
 |  | 
 |          --enable-jit | 
 |  | 
 |        This support is available only for certain hardware  architectures.  If | 
 |        this  option  is  set for an unsupported architecture, a building error | 
 |        occurs.  If in doubt, use | 
 |  | 
 |          --enable-jit=auto | 
 |  | 
 |        which enables JIT only if the current hardware is  supported.  You  can | 
 |        check  if JIT is enabled in the configuration summary that is output at | 
 |        the end of a configure run. If you are enabling JIT under  SELinux  you | 
 |        may also want to add | 
 |  | 
 |          --enable-jit-sealloc | 
 |  | 
 |        which enables the use of an execmem allocator in JIT that is compatible | 
 |        with  SELinux.  This  has  no  effect  if  JIT  is not enabled. See the | 
 |        pcre2jit documentation for a discussion of JIT usage. When JIT  support | 
 |        is enabled, pcre2grep automatically makes use of it, unless you add | 
 |  | 
 |          --disable-pcre2grep-jit | 
 |  | 
 |        to the configure command. | 
 |  | 
 |  | 
 | NEWLINE RECOGNITION | 
 |  | 
 |        By  default, PCRE2 interprets the linefeed (LF) character as indicating | 
 |        the end of a line. This is the normal newline  character  on  Unix-like | 
 |        systems.  You can compile PCRE2 to use carriage return (CR) instead, by | 
 |        adding | 
 |  | 
 |          --enable-newline-is-cr | 
 |  | 
 |        to the configure command. There is also an  --enable-newline-is-lf  op- | 
 |        tion, which explicitly specifies linefeed as the newline character. | 
 |  | 
 |        Alternatively, you can specify that line endings are to be indicated by | 
 |        the two-character sequence CRLF (CR immediately followed by LF). If you | 
 |        want this, add | 
 |  | 
 |          --enable-newline-is-crlf | 
 |  | 
 |        to the configure command. There is a fourth option, specified by | 
 |  | 
 |          --enable-newline-is-anycrlf | 
 |  | 
 |        which  causes  PCRE2 to recognize any of the three sequences CR, LF, or | 
 |        CRLF as indicating a line ending. A fifth option, specified by | 
 |  | 
 |          --enable-newline-is-any | 
 |  | 
 |        causes PCRE2 to recognize any Unicode  newline  sequence.  The  Unicode | 
 |        newline sequences are the three just mentioned, plus the single charac- | 
 |        ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line, | 
 |        U+0085),  LS  (line  separator,  U+2028),  and PS (paragraph separator, | 
 |        U+2029). The final option is | 
 |  | 
 |          --enable-newline-is-nul | 
 |  | 
 |        which causes NUL (binary zero) to be set  as  the  default  line-ending | 
 |        character. | 
 |  | 
 |        Whatever default line ending convention is selected when PCRE2 is built | 
 |        can  be  overridden by applications that use the library. At build time | 
 |        it is recommended to use the standard for your operating system. | 
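 |  | 
 |        The set of sequences recognized by the "any" convention can be | 
 |        sketched in plain C as follows. This is an illustration of the list | 
 |        given above, not PCRE2's own code; the function name | 
 |        any_newline_length() is hypothetical, and the input is assumed to be | 
 |        an array of Unicode code points. | 
 |  | 

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the length, in code points, of a Unicode
   newline sequence at the start of s (2 for CRLF, 1 for a single newline
   character), or 0 if s does not start with a newline sequence. */
size_t any_newline_length(const uint32_t *s, size_t len)
{
    if (len == 0) return 0;

    switch (s[0]) {
    case 0x000D:                               /* CR, possibly CRLF */
        return (len > 1 && s[1] == 0x000A) ? 2 : 1;
    case 0x000A:                               /* LF  */
    case 0x000B:                               /* VT  */
    case 0x000C:                               /* FF  */
    case 0x0085:                               /* NEL */
    case 0x2028:                               /* LS  */
    case 0x2029:                               /* PS  */
        return 1;
    default:
        return 0;
    }
}
```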
 |  | 
 |  | 
 | WHAT \R MATCHES | 
 |  | 
 |        By default, the sequence \R in a pattern matches  any  Unicode  newline | 
 |        sequence,  independently  of  what has been selected as the line ending | 
 |        sequence. If you specify | 
 |  | 
 |          --enable-bsr-anycrlf | 
 |  | 
 |        the default is changed so that \R matches only CR, LF, or  CRLF.  What- | 
 |        ever  is selected when PCRE2 is built can be overridden by applications | 
 |        that use the library. | 
 |  | 
 |  | 
 | HANDLING VERY LARGE PATTERNS | 
 |  | 
 |        Within a compiled pattern, offset values are used  to  point  from  one | 
 |        part  to another (for example, from an opening parenthesis to an alter- | 
 |        nation metacharacter). By default, in the 8-bit and  16-bit  libraries, | 
 |        two-byte  values  are used for these offsets, leading to a maximum size | 
 |        for a compiled pattern of around 64 thousand code units. This is suffi- | 
 |        cient to handle all but the most gigantic patterns. Nevertheless,  some | 
 |        people do want to process truly enormous patterns, so it is possible to | 
 |        compile  PCRE2  to use three-byte or four-byte offsets by adding a set- | 
 |        ting such as | 
 |  | 
 |          --with-link-size=3 | 
 |  | 
 |        to the configure command. The value given must be 2, 3, or 4.  For  the | 
 |        16-bit  library,  a  value of 3 is rounded up to 4. In these libraries, | 
 |        using longer offsets slows down the operation of PCRE2 because  it  has | 
 |        to  load additional data when handling them. For the 32-bit library the | 
 |        value is always 4 and cannot be overridden; the value  of  --with-link- | 
 |        size is ignored. | 
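 |  | 
 |        The rules above for how the requested link size is applied to each | 
 |        library can be summarized in a short sketch. The function name | 
 |        effective_link_size() is hypothetical and is not part of PCRE2. | 
 |  | 

```c
/* Hypothetical helper: given the value passed to --with-link-size (2, 3 or
   4) and the code unit width of a library (8, 16 or 32), return the link
   size that library uses. Returns 0 if either argument is out of range. */
int effective_link_size(int requested, int code_unit_width)
{
    if (requested < 2 || requested > 4) return 0;

    switch (code_unit_width) {
    case 8:
        return requested;                       /* used as given */
    case 16:
        return (requested == 3) ? 4 : requested; /* 3 is rounded up to 4 */
    case 32:
        return 4;                               /* always 4 */
    default:
        return 0;
    }
}
```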
 |  | 
 |  | 
 | LIMITING PCRE2 RESOURCE USAGE | 
 |  | 
 |        The pcre2_match() function increments a counter each time it goes round | 
 |        its  main  loop. Putting a limit on this counter controls the amount of | 
 |        computing resource used by a single call to  pcre2_match().  The  limit | 
 |        can be changed at run time, as described in the pcre2api documentation. | 
 |        The  default is 10 million, but this can be changed by adding a setting | 
 |        such as | 
 |  | 
 |          --with-match-limit=500000 | 
 |  | 
 |        to  the  configure  command.  This  setting   also   applies   to   the | 
 |        pcre2_dfa_match()  matching  function,  and to JIT matching (though the | 
 |        counting is done differently). | 
 |  | 
 |        The pcre2_match() function uses  heap  memory  to  record  backtracking | 
 |        points.  The  more  nested  backtracking points there are (that is, the | 
 |        deeper the search tree), the more memory is needed. There is  an  upper | 
 |        limit,  specified in kibibytes (units of 1024 bytes). This limit can be | 
 |        changed at run time, as described in the  pcre2api  documentation.  The | 
 |        default limit (in effect unlimited) is 20 million KiB. You can change | 
 |        this by a setting such as | 
 |  | 
 |          --with-heap-limit=500 | 
 |  | 
 |        which limits the amount of heap to 500 KiB. This limit applies only  to | 
 |        interpretive matching in pcre2_match() and pcre2_dfa_match(), which may | 
 |        also  use  the  heap for internal workspace when processing complicated | 
 |        patterns. This limit does not apply when JIT (which has its own  memory | 
 |        arrangements) is used. | 
 |  | 
 |        You  can  also explicitly limit the depth of nested backtracking in the | 
 |        pcre2_match() interpreter. This limit defaults to the value that is set | 
 |        for --with-match-limit. You can set a lower default  limit  by  adding, | 
 |        for example, | 
 |  | 
 |          --with-match-limit-depth=10000 | 
 |  | 
 |        to  the  configure  command.  This value can be overridden at run time. | 
 |        This depth limit indirectly limits the amount of heap  memory  that  is | 
 |        used,  but because the size of each backtracking "frame" depends on the | 
 |        number of capturing parentheses in a pattern, the amount of  heap  that | 
 |        is  used  before  the  limit is reached varies from pattern to pattern. | 
 |        This limit was more useful in versions before 10.30, where function re- | 
 |        cursion was used for backtracking. | 
 |  | 
 |        As well as applying to pcre2_match(), the depth limit also controls the | 
 |        depth of recursive function calls in pcre2_dfa_match(). These are  used | 
 |        for  lookaround  assertions,  atomic  groups, and recursion within pat- | 
 |        terns.  The limit does not apply to JIT matching. | 
 |  | 
 |  | 
 | LIMITING VARIABLE-LENGTH LOOKBEHIND ASSERTIONS | 
 |  | 
 |        Lookbehind assertions in which one or more branches can match  a  vari- | 
 |        able  number  of  characters  are  supported only if there is a maximum | 
 |        matching length for each top-level branch. There is  a  limit  to  this | 
 |        maximum  that defaults to 255 characters. You can alter this default by | 
 |        a setting such as | 
 |  | 
 |          --with-max-varlookbehind=100 | 
 |  | 
 |        The limit can be changed at runtime by calling pcre2_set_max_varlookbe- | 
 |        hind(). Lookbehind assertions in which every  branch  matches  a  fixed | 
 |        number of characters (not necessarily all the same) are not constrained | 
 |        by this limit. | 
 |  | 
 |  | 
 | CREATING CHARACTER TABLES AT BUILD TIME | 
 |  | 
 |        PCRE2 uses fixed tables for processing characters whose code points are | 
 |        less than 256. By default, PCRE2 is built with a set of tables that are | 
 |        distributed  in  the file src/pcre2_chartables.c.dist. These tables are | 
 |        for ASCII codes only. If you add | 
 |  | 
 |          --enable-rebuild-chartables | 
 |  | 
 |        to the configure command, the distributed tables are  no  longer  used. | 
 |        Instead, a program called pcre2_dftables is compiled and run. This out- | 
 |        puts the source for a new set of tables, created in the default locale | 
 |        of your C run-time system. This method of replacing the tables does not | 
 |        work if you are cross compiling, because pcre2_dftables needs to be run | 
 |        on the local host and therefore not compiled with the cross compiler. | 
 |  | 
 |        If you need to create alternative tables when cross compiling, you will | 
 |        have to do so "by hand". There may also be other reasons  for  creating | 
 |        tables  manually.   To  cause  pcre2_dftables  to be built on the local | 
 |        host, run a normal compiling command, and then run the program with the | 
 |        output file as its argument, for example: | 
 |  | 
 |          cc src/pcre2_dftables.c -o pcre2_dftables | 
 |          ./pcre2_dftables src/pcre2_chartables.c | 
 |  | 
 |        This builds the tables in the default locale of the local host. If  you | 
 |        want to specify a locale, you must use the -L option: | 
 |  | 
 |          LC_ALL=fr_FR ./pcre2_dftables -L src/pcre2_chartables.c | 
 |  | 
 |        You can also specify -b (with or without -L). This causes the tables to | 
 |        be  written in binary instead of as source code. A set of binary tables | 
 |        can be loaded into memory by an application and  passed  to  pcre2_com- | 
 |        pile() in the same way as tables created by calling pcre2_maketables(). | 
 |        The  tables are just a string of bytes, independent of hardware charac- | 
 |        teristics such as endianness. This means they can be  bundled  with  an | 
 |        application  that  runs in different environments, to ensure consistent | 
 |        behaviour. | 
 |  | 
 |  | 
 | USING EBCDIC CODE | 
 |  | 
 |        PCRE2 assumes by default that it will run in an environment  where  the | 
 |        character  code is ASCII or Unicode, which is a superset of ASCII. This | 
 |        is the case for most computer operating systems. PCRE2 can, however, be | 
 |        compiled to run in an 8-bit EBCDIC environment by adding | 
 |  | 
 |          --enable-ebcdic --disable-unicode | 
 |  | 
 |        to the configure command. You should only use it if you know  that  you | 
 |        are  in  an EBCDIC environment (for example, an IBM mainframe operating | 
 |        system). | 
 |  | 
 |        This setting implies --enable-rebuild-chartables, in  order  to  ensure | 
 |        that  you  have  the correct default character tables for your system's | 
 |        codepage. There is an exception when you set  --enable-ebcdic-ignoring- | 
 |        compiler  (see  below), which allows using a default set of EBCDIC 1047 | 
 |        character tables rather than forcing  use  of  --enable-rebuild-charta- | 
 |        bles. | 
 |  | 
 |        Enabling both EBCDIC input and either ASCII or UTF-8/16/32 in the  same | 
 |        build of the library is not supported. When PCRE2 is built with  EBCDIC | 
 |        support, it always operates in EBCDIC, and consequently | 
 |        --enable-unicode and --enable-ebcdic are mutually exclusive. | 
 |  | 
 |        The EBCDIC character that corresponds to an ASCII LF is assumed to have | 
 |        the value 0x15 by default. However, in some EBCDIC  environments,  0x25 | 
 |        is used. In such an environment you should use | 
 |  | 
 |          --enable-ebcdic-nl25 | 
 |  | 
 |        (which  implies  --enable-ebcdic).  The EBCDIC character for CR has the | 
 |        same value as in ASCII, namely, 0x0d. Whichever of 0x15 and 0x25 is not | 
 |        chosen as LF is made to correspond to the Unicode NEL character (which, | 
 |        in Unicode, is 0x85). | 
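 |  | 
 |        The assignment of EBCDIC values described above can be written out as | 
 |        a small sketch. The structure and function names are hypothetical and | 
 |        serve only to restate the mapping; they are not part of PCRE2. | 
 |  | 

```c
/* Hypothetical illustration of the EBCDIC newline-related values. */
struct ebcdic_newlines {
    unsigned char cr;    /* carriage return */
    unsigned char lf;    /* character treated as ASCII LF */
    unsigned char nel;   /* character treated as Unicode NEL */
};

/* nl25 is non-zero when --enable-ebcdic-nl25 was selected. */
struct ebcdic_newlines ebcdic_newline_values(int nl25)
{
    struct ebcdic_newlines v;
    v.cr  = 0x0d;                 /* same value as in ASCII */
    v.lf  = nl25 ? 0x25 : 0x15;   /* LF is 0x15 unless nl25 is selected */
    v.nel = nl25 ? 0x15 : 0x25;   /* NEL takes whichever value LF does not */
    return v;
}
```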
 |  | 
 |        The options that select newline behaviour, such as --enable-newline-is- | 
 |        cr, and equivalent run-time options, refer to these character values in | 
 |        an EBCDIC environment. | 
 |  | 
 |        On systems requiring an EBCDIC build of PCRE2, the compiler  should  be | 
 |        set  to  use the correct codepage, so that C character literals such as | 
 |        'z' use the correct numeric value for whichever EBCDIC codepage  is  in | 
 |        use. (PCRE2 cannot support multiple EBCDIC codepages dynamically.) How- | 
 |        ever, if this is not possible, then you can use | 
 |  | 
 |          --enable-ebcdic-ignoring-compiler | 
 |  | 
 |        in order to disregard the compiler's codepage and instead force PCRE2 | 
 |        to use numeric constants corresponding to the EBCDIC 1047 codepage. | 
 |        This can be used to build (or test) EBCDIC support on an ASCII/UTF-8 | 
 |        system such as Linux. | 
 |  | 
 |  | 
 | PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS | 
 |  | 
 |        By default pcre2grep supports the use of callouts with string arguments | 
 |        within the patterns it is matching. There are two kinds: one that  gen- | 
 |        erates output using local code, and another that calls an external pro- | 
 |        gram  or  script.   If --disable-pcre2grep-callout-fork is added to the | 
 |        configure command, only the first kind  of  callout  is  supported;  if | 
 |        --disable-pcre2grep-callout  is  used,  all callouts are completely ig- | 
 |        nored. For more details of pcre2grep callouts, see the pcre2grep  docu- | 
 |        mentation. | 
 |  | 
 |  | 
 | PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT | 
 |  | 
 |        By  default,  pcre2grep reads all files as plain text. You can build it | 
 |        so that it recognizes files whose names end in .gz or .bz2,  and  reads | 
 |        them with libz or libbz2, respectively, by adding one or both of | 
 |  | 
 |          --enable-pcre2grep-libz | 
 |          --enable-pcre2grep-libbz2 | 
 |  | 
 |        to the configure command. These options naturally require that the rel- | 
 |        evant  libraries  are installed on your system. Configuration will fail | 
 |        if they are not. | 
 |  | 
 |  | 
 | PCRE2GREP BUFFER SIZE | 
 |  | 
 |        pcre2grep uses an internal buffer to hold a "window" on the file it  is | 
 |        scanning, in order to be able to output "before" and "after" lines when | 
 |        it finds a match. The default starting size of the buffer is 20KiB. The | 
 |        buffer  itself  is  three times this size, but because of the way it is | 
 |        used for holding "before" lines, the longest line that is guaranteed to | 
 |        be processable is the notional buffer size. If a longer line is encoun- | 
 |        tered, pcre2grep automatically expands the buffer, up  to  a  specified | 
 |        maximum  size, whose default is 1MiB or the starting size, whichever is | 
 |        the larger. You can change the default parameter values by adding,  for | 
 |        example, | 
 |  | 
 |          --with-pcre2grep-bufsize=51200 | 
 |          --with-pcre2grep-max-bufsize=2097152 | 
 |  | 
 |        to  the  configure  command. The caller of pcre2grep can override these | 
 |        values by using --buffer-size  and  --max-buffer-size  on  the  command | 
 |        line. | 
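 |  | 
 |        The buffer arithmetic described above can be restated as a short | 
 |        sketch. The function names are hypothetical; this is not pcre2grep's | 
 |        own code, only the two rules from the text expressed in C. | 
 |  | 

```c
#include <stddef.h>

/* The buffer that is actually allocated is three times the notional size. */
size_t actual_buffer_bytes(size_t notional_size)
{
    return 3 * notional_size;
}

/* The default maximum is 1 MiB or the starting size, whichever is larger. */
size_t default_max_bufsize(size_t starting_size)
{
    const size_t one_mib = (size_t)1024 * 1024;
    return (starting_size > one_mib) ? starting_size : one_mib;
}
```

 |        With the default starting size of 20 KiB (20480 bytes), this gives an | 
 |        allocated buffer of 61440 bytes and a default maximum of 1048576 | 
 |        bytes. | 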
 |  | 
 |  | 
 | PCRE2TEST OPTION FOR LIBREADLINE SUPPORT | 
 |  | 
 |        If you add one of | 
 |  | 
 |          --enable-pcre2test-libreadline | 
 |          --enable-pcre2test-libedit | 
 |  | 
 |        to the configure command, pcre2test is linked with the  libreadline  or | 
 |        libedit library, respectively, and when its input is from  a  terminal, | 
 |        it  reads  it using the readline() function. This provides line-editing | 
 |        and history facilities. Note that libreadline is  GPL-licensed,  so  if | 
 |        you  distribute  a binary of pcre2test linked in this way, there may be | 
 |        licensing issues. These can be avoided by linking instead with libedit, | 
 |        which has a BSD licence. | 
 |  | 
 |        Setting --enable-pcre2test-libreadline causes the -lreadline option  to | 
 |        be  added to the pcre2test build. In many operating environments with a | 
 |        system-installed readline library this is sufficient. However, in  some | 
 |        environments (e.g. if an unmodified distribution version of readline is | 
 |        in  use),  some  extra configuration may be necessary. The INSTALL file | 
 |        for libreadline says this: | 
 |  | 
 |          "Readline uses the termcap functions, but does not link with | 
 |          the termcap or curses library itself, allowing applications | 
 |          which link with readline the to choose an appropriate library." | 
 |  | 
 |        If your environment has not been set up so that an appropriate  library | 
 |        is automatically included, you may need to add something like | 
 |  | 
 |          LIBS="-lncurses" | 
 |  | 
 |        immediately before the configure command. | 
 |  | 
 |  | 
 | INCLUDING DEBUGGING CODE | 
 |  | 
 |        If you add | 
 |  | 
 |          --enable-debug | 
 |  | 
 |        to  the configure command, additional debugging code is included in the | 
 |        build. This feature is intended for use by the PCRE2 maintainers. | 
 |  | 
 |  | 
 | DEBUGGING WITH VALGRIND SUPPORT | 
 |  | 
 |        If you add | 
 |  | 
 |          --enable-valgrind | 
 |  | 
 |        to the configure command, PCRE2 will use valgrind annotations  to  mark | 
 |        certain memory regions as unaddressable. This allows valgrind to detect | 
 |        invalid memory accesses, and is mostly useful for debugging PCRE2 | 
 |        itself. | 
 |  | 
 |  | 
 | CODE COVERAGE REPORTING | 
 |  | 
 |        If your C compiler is gcc, you can build a version of  PCRE2  that  can | 
 |        generate a code coverage report for its test suite. To enable this, you | 
 |        must install lcov version 1.6 or above. Then specify | 
 |  | 
 |          --enable-coverage | 
 |  | 
 |        to the configure command and build PCRE2 in the usual way. | 
 |  | 
 |        Note that using ccache (a caching C compiler) is incompatible with code | 
 |        coverage  reporting. If you have configured ccache to run automatically | 
 |        on your system, you must set the environment variable | 
 |  | 
 |          CCACHE_DISABLE=1 | 
 |  | 
 |        before running make to build PCRE2, so that ccache is not used. | 
 |  | 
 |        When --enable-coverage is used, the following  additional  targets  are | 
 |        added to the Makefile: | 
 |  | 
 |          make coverage | 
 |  | 
 |        This  creates  a  fresh coverage report for the PCRE2 test suite. It is | 
 |        equivalent to running "make coverage-reset", "make  coverage-baseline", | 
 |        "make check", and then "make coverage-report". | 
 |  | 
 |          make coverage-reset | 
 |  | 
 |        This zeroes the coverage counters, but does nothing else. | 
 |  | 
 |          make coverage-baseline | 
 |  | 
 |        This captures baseline coverage information. | 
 |  | 
 |          make coverage-report | 
 |  | 
 |        This creates the coverage report. | 
 |  | 
 |          make coverage-clean-report | 
 |  | 
 |        This  removes the generated coverage report without cleaning the cover- | 
 |        age data itself. | 
 |  | 
 |          make coverage-clean-data | 
 |  | 
 |        This removes the captured coverage data without removing  the  coverage | 
 |        files created at compile time (*.gcno). | 
 |  | 
 |          make coverage-clean | 
 |  | 
 |        This  cleans all coverage data including the generated coverage report. | 
 |        For more information about code coverage, see the gcov and  lcov  docu- | 
 |        mentation. | 
 |  | 
 |  | 
 | DISABLING THE Z AND T FORMATTING MODIFIERS | 
 |  | 
 |        The  C99  standard  defines formatting modifiers z and t for size_t and | 
 |        ptrdiff_t values, respectively. By default, PCRE2 uses these  modifiers | 
 |        in environments other than old versions of Microsoft Visual Studio when | 
 |        __STDC_VERSION__  is  defined  and has a value greater than or equal to | 
 |        199901L (indicating support for C99).  However, there is at  least  one | 
 |        environment that claims to be C99 but does not support these modifiers. | 
 |        If | 
 |  | 
 |          --disable-percent-zt | 
 |  | 
 |        is specified, no use is made of the z or t modifiers. Instead of %td or | 
 |        %zu,  a  suitable  format is used depending on the size of long for the | 
 |        platform. | 
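 |  | 
 |        One common way of printing a size_t without the z modifier is to cast | 
 |        the value to unsigned long. The sketch below shows the idea; the macro | 
 |        USE_PERCENT_ZT and the function name format_size_portable() are | 
 |        invented for this example and are not PCRE2's own names. The fallback | 
 |        assumes that the value fits in an unsigned long. | 
 |  | 

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write `value` in decimal into buf. With the
   hypothetical macro USE_PERCENT_ZT defined, the C99 z modifier is used;
   otherwise the value is cast to unsigned long and printed with %lu.
   Returns the snprintf() result. */
int format_size_portable(char *buf, size_t buflen, size_t value)
{
#if defined(USE_PERCENT_ZT)
    return snprintf(buf, buflen, "%zu", value);
#else
    return snprintf(buf, buflen, "%lu", (unsigned long)value);
#endif
}
```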
 |  | 
 |  | 
 | SUPPORT FOR FUZZERS | 
 |  | 
 |        There is a special option for use by people who  want  to  run  fuzzing | 
 |        tests on PCRE2: | 
 |  | 
 |          --enable-fuzz-support | 
 |  | 
 |        At present this applies only to the 8-bit library. If set, it causes an | 
 |        extra  library  called  libpcre2-fuzzsupport.a to be built, but not in- | 
 |        stalled. This contains a single  function  called  LLVMFuzzerTestOneIn- | 
 |        put()  whose  arguments are a pointer to a string and the length of the | 
 |        string. When called, this function tries to compile  the  string  as  a | 
 |        pattern,  and if that succeeds, to match it.  This is done both with no | 
 |        options and with some random option bits that  are generated  from  the | 
 |        string. | 
 |  | 
 |        Setting  --enable-fuzz-support  also  causes  a binary called pcre2fuz- | 
 |        zcheck to be created. This is normally run under valgrind or used  when | 
 |        PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing | 
 |        function  and  outputs  information  about  what it is doing. The input | 
 |        strings are specified by arguments: if an argument starts with "="  the | 
 |        rest  of it is a literal input string. Otherwise, it is assumed to be a | 
 |        file name, and the contents of the file are the test string. | 
 |  | 
 |  | 
 | OBSOLETE OPTION | 
 |  | 
 |        In versions of PCRE2 prior to 10.30, there were two  ways  of  handling | 
 |        backtracking  in the pcre2_match() function. The default was to use the | 
 |        system stack, but if | 
 |  | 
 |          --disable-stack-for-recursion | 
 |  | 
 |        was set, memory on the heap was used. From release 10.30  onwards  this | 
 |        has  changed  (the  stack  is  no longer used) and this option now does | 
 |        nothing except give a warning. | 
 |  | 
 |  | 
 | SEE ALSO | 
 |  | 
 |        pcre2api(3), pcre2-config(3). | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 17 October 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                 17 October 2025                  PCRE2BUILD(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2CALLOUT(3)            Library Functions Manual            PCRE2CALLOUT(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | SYNOPSIS | 
 |  | 
 |        #include <pcre2.h> | 
 |  | 
 |        int (*pcre2_callout)(pcre2_callout_block *, void *); | 
 |  | 
 |        int pcre2_callout_enumerate(const pcre2_code *code, | 
 |          int (*callback)(pcre2_callout_enumerate_block *, void *), | 
 |          void *user_data); | 
 |  | 
 |  | 
 | DESCRIPTION | 
 |  | 
 |        PCRE2  provides  a  feature  called "callout", which is a means of tem- | 
 |        porarily passing control to the caller of PCRE2 in the middle  of  pat- | 
 |        tern  matching.  The  caller  of PCRE2 provides an external function by | 
 |        putting its entry point in a match context (see pcre2_set_callout()  in | 
 |        the pcre2api documentation). | 
 |  | 
 |        When  using the pcre2_substitute() function, an additional callout fea- | 
 |        ture is available. This does a callout after each change to the subject | 
 |        string and is described in the pcre2api documentation; the rest of this | 
 |        document is concerned with callouts during pattern matching. | 
 |  | 
 |        Within a regular expression, (?C<arg>) indicates a point at  which  the | 
 |        external  function  is  to  be  called. Different callout points can be | 
 |        identified by putting a number less than 256 after the  letter  C.  The | 
 |        default  value is zero.  Alternatively, the argument may be a delimited | 
 |        string. The starting delimiter must be one of ` ' " ^ % # $ {  and  the | 
 |        ending delimiter is the same as the start, except for {, where the end- | 
 |        ing  delimiter  is  }.  If  the  ending  delimiter is needed within the | 
 |        string, it must be doubled. For example, this pattern has  two  callout | 
 |        points: | 
 |  | 
 |          (?C1)abc(?C"some ""arbitrary"" text")def | 
 |  | 
 |        If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled, | 
 |        PCRE2  automatically inserts callouts, all with number 255, before each | 
 |        item in the pattern except for immediately before or after an  explicit | 
 |        callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern | 
 |  | 
 |          A(?C3)B | 
 |  | 
 |        it is processed as if it were | 
 |  | 
 |          (?C255)A(?C3)B(?C255) | 
 |  | 
 |        Here is a more complicated example: | 
 |  | 
 |          A(\d{2}|--) | 
 |  | 
 |        With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were | 
 |  | 
 |          (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255) | 
 |  | 
 |        Notice  that  there  is a callout before and after each parenthesis and | 
 |        alternation bar. If the pattern contains a conditional group whose con- | 
 |        dition is an assertion, an automatic callout  is  inserted  immediately | 
 |        before  the  condition. Such a callout may also be inserted explicitly, | 
 |        for example: | 
 |  | 
 |          (?(?C9)(?=a)ab|de)  (?(?C%text%)(?!=d)ab|de) | 
 |  | 
 |        This applies only to assertion conditions (because they are  themselves | 
 |        independent groups). | 
 |  | 
 |        Callouts  can  be useful for tracking the progress of pattern matching. | 
 |        The pcre2test program has a pattern qualifier (/auto_callout) that sets | 
 |        automatic callouts.  When any callouts are  present,  the  output  from | 
 |        pcre2test  indicates  how  the pattern is being matched. This is useful | 
 |        information when you are trying to optimize the performance of  a  par- | 
 |        ticular pattern. | 
 |  | 
 |  | 
 | MISSING CALLOUTS | 
 |  | 
 |        You  should  be  aware  that, because of optimizations in the way PCRE2 | 
 |        compiles and matches patterns, callouts sometimes do not happen exactly | 
 |        as you might expect. | 
 |  | 
 |    Auto-possessification | 
 |  | 
 |        At compile time, PCRE2 "auto-possessifies" repeated items when it knows | 
 |        that what follows cannot be part of the repeat. For example, a+[bc]  is | 
 |        compiled  as if it were a++[bc]. The pcre2test output when this pattern | 
 |        is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied | 
 |        to the string "aaaa" is: | 
 |  | 
 |          --->aaaa | 
 |           +0 ^        a+ | 
 |           +2 ^   ^    [bc] | 
 |          No match | 
 |  | 
 |        This indicates that when matching [bc] fails, there is no  backtracking | 
 |        into a+ (because it is being treated as a++) and therefore the callouts | 
 |        that  would  be  taken for the backtracks do not occur. You can disable | 
 |        the  auto-possessify  feature  by  passing   PCRE2_NO_AUTO_POSSESS   to | 
 |        pcre2_compile(),  or  starting  the pattern with (*NO_AUTO_POSSESS). In | 
 |        this case, the output changes to this: | 
 |  | 
 |          --->aaaa | 
 |           +0 ^        a+ | 
 |           +2 ^   ^    [bc] | 
 |           +2 ^  ^     [bc] | 
 |           +2 ^ ^      [bc] | 
 |           +2 ^^       [bc] | 
 |          No match | 
 |  | 
 |        This time, when matching [bc] fails, the matcher backtracks into a+ and | 
 |        tries again, repeatedly, until a+ itself fails. | 
 |  | 
 |    Automatic .* anchoring | 
 |  | 
 |        By default, an optimization is applied when .* is the first significant | 
 |        item in a pattern. If PCRE2_DOTALL is set, so that the  dot  can  match | 
 |        any  character,  the pattern is automatically anchored. If PCRE2_DOTALL | 
 |        is not set, a match can start only after an internal newline or at  the | 
 |        beginning of the subject, and pcre2_compile() remembers this. If a pat- | 
 |        tern  has more than one top-level branch, automatic anchoring occurs if | 
 |        all branches are anchorable. | 
 |  | 
 |        This optimization is disabled, however, if .* is in an atomic group  or | 
 |        if  there  is a backreference to the capture group in which it appears. | 
 |        It is also disabled if the pattern contains (*PRUNE) or  (*SKIP).  How- | 
 |        ever, the presence of callouts does not affect it. | 
 |  | 
 |        For  example,  if  the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT | 
 |        and applied to the string "aa", the pcre2test output is: | 
 |  | 
 |          --->aa | 
 |           +0 ^      .* | 
 |           +2 ^ ^    \d | 
 |           +2 ^^     \d | 
 |           +2 ^      \d | 
 |          No match | 
 |  | 
 |        This shows that all match attempts start at the beginning of  the  sub- | 
 |        ject. In other words, the pattern is anchored. You can disable this op- | 
 |        timization  by  passing  PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or | 
 |        starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the  out- | 
 |        put changes to: | 
 |  | 
 |          --->aa | 
 |           +0 ^      .* | 
 |           +2 ^ ^    \d | 
 |           +2 ^^     \d | 
 |           +2 ^      \d | 
 |           +0  ^     .* | 
 |           +2  ^^    \d | 
 |           +2  ^     \d | 
 |          No match | 
 |  | 
 |        This  shows more match attempts, starting at the second subject charac- | 
 |        ter.  Another optimization, described in the next section,  means  that | 
 |        there is no subsequent attempt to match with an empty subject. | 
 |  | 
 |    Other optimizations | 
 |  | 
 |        Other  optimizations  that  provide fast "no match" results also affect | 
 |        callouts.  For example, if the pattern is | 
 |  | 
 |          ab(?C4)cd | 
 |  | 
 |        PCRE2 knows that any matching string must contain the  letter  "d".  If | 
 |        the  subject  string  is  "abyz",  the  lack of "d" means that matching | 
 |        doesn't ever start, and the callout is  never  reached.  However,  with | 
 |        "abyd", though the result is still no match, the callout is obeyed. | 
 |  | 
 |        For  most  patterns  PCRE2  also knows the minimum length of a matching | 
 |        string, and will immediately give a "no match" return without  actually | 
 |        running  a  match if the subject is not long enough, or, for unanchored | 
 |        patterns, if it has been scanned far enough. | 
 |  | 
 |        You can disable these optimizations by passing the PCRE2_NO_START_OPTI- | 
 |        MIZE option  to  pcre2_compile(),  or  by  starting  the  pattern  with | 
 |        (*NO_START_OPT).  This slows down the matching process, but does ensure | 
 |        that callouts such as the example above are obeyed. | 
 |  | 
 |  | 
 | THE CALLOUT INTERFACE | 
 |  | 
 |        During matching, when PCRE2 reaches a callout point, if an external | 
 |        function is provided in the match context, it is called. This applies | 
 |        to normal, DFA, and JIT matching alike. The first argument to the | 
 |        callout function is a pointer to a pcre2_callout_block structure. The | 
 |        second argument is the void * callout data that was supplied when the | 
 |        callout was set up by calling pcre2_set_callout() (see the pcre2api | 
 |        documentation). The callout block structure contains the following | 
 |        fields, not necessarily in this order: | 
 |  | 
 |          uint32_t      version; | 
 |          uint32_t      callout_number; | 
 |          uint32_t      capture_top; | 
 |          uint32_t      capture_last; | 
 |          uint32_t      callout_flags; | 
 |          PCRE2_SIZE   *offset_vector; | 
 |          PCRE2_SPTR    mark; | 
 |          PCRE2_SPTR    subject; | 
 |          PCRE2_SIZE    subject_length; | 
 |          PCRE2_SIZE    start_match; | 
 |          PCRE2_SIZE    current_position; | 
 |          PCRE2_SIZE    pattern_position; | 
 |          PCRE2_SIZE    next_item_length; | 
 |          PCRE2_SIZE    callout_string_offset; | 
 |          PCRE2_SIZE    callout_string_length; | 
 |          PCRE2_SPTR    callout_string; | 
 |  | 
 |        The  version field contains the version number of the block format. The | 
 |        current version is 2; the three callout string fields  were  added  for | 
 |        version  1, and the callout_flags field for version 2. If you are writ- | 
 |        ing an application that might use an  earlier  release  of  PCRE2,  you | 
 |        should  check  the version number before accessing any of these fields. | 
 |        The version number will increase in future if more  fields  are  added, | 
 |        but the intention is never to remove any of the existing fields. | 
 |  | 
 |    Fields for numerical callouts | 
 |  | 
 |        For  a  numerical  callout,  callout_string is NULL, and callout_number | 
 |        contains the number of the callout, in the range  0-255.  This  is  the | 
 |        number  that  follows  (?C for callouts that are part of the pattern; it is | 
 |        255 for automatically generated callouts. | 
 |  | 
 |    Fields for string callouts | 
 |  | 
 |        For callouts with string arguments, callout_number is always zero,  and | 
 |        callout_string  points  to the string that is contained within the com- | 
 |        piled pattern. Its length is given by callout_string_length. Duplicated | 
 |        ending delimiters that were present in the original pattern string have | 
 |        been turned into single characters, but there is no other processing of | 
 |        the callout string argument. An additional code unit containing  binary | 
 |        zero  is  present  after the string, but is not included in the length. | 
 |        The delimiter that was used to start the string is also  stored  within | 
 |        the  pattern, immediately before the string itself. You can access this | 
 |        delimiter as callout_string[-1] if you need it. | 
 |  | 
 |        The callout_string_offset field is the code unit offset to the start of | 
 |        the callout argument string within the original pattern string. This is | 
 |        provided for the benefit of applications such as script languages  that | 
 |        might need to report errors in the callout string within the pattern. | 
 |  | 
 |    Fields for all callouts | 
 |  | 
 |        The  remaining  fields in the callout block are the same for both kinds | 
 |        of callout. | 
 |  | 
 |        The offset_vector field is a pointer to a vector of  capturing  offsets | 
 |        (the "ovector"). You may read the elements in this vector, but you must | 
 |        not change any of them. | 
 |  | 
 |        For  calls  to pcre2_match(), the offset_vector field is not (since re- | 
 |        lease 10.30) a pointer to the actual ovector that  was  passed  to  the | 
 |        matching  function in the match data block. Instead it points to an in- | 
 |        ternal ovector of a size large enough to  hold  all  possible  captured | 
 |        substrings in the pattern. Note that whenever a recursion or subroutine | 
 |        call  within  a pattern completes, the capturing state is reset to what | 
 |        it was before. | 
 |  | 
 |        The capture_last field contains the number of the  most  recently  cap- | 
 |        tured  substring,  and the capture_top field contains one more than the | 
 |        number of the highest numbered captured substring so far.  If  no  sub- | 
 |        strings  have yet been captured, the value of capture_last is 0 and the | 
 |        value of capture_top is 1. The values of these  fields  do  not  always | 
 |        differ   by   one;  for  example,  when  the  callout  in  the  pattern | 
 |        ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4. | 
 |  | 
 |        The contents of ovector[2] to  ovector[<capture_top>*2-1]  can  be  in- | 
 |        spected  in  order to extract substrings that have been matched so far, | 
 |        in the same way as extracting substrings after a match  has  completed. | 
 |        The  values in ovector[0] and ovector[1] are always PCRE2_UNSET because | 
 |        the match is by definition not complete. Substrings that have not  been | 
 |        captured  but whose numbers are less than capture_top also have both of | 
 |        their ovector slots set to PCRE2_UNSET. | 
 |  | 
 |        For DFA matching, the offset_vector field points to  the  ovector  that | 
 |        was  passed  to the matching function in the match data block for call- | 
 |        outs at the top level, but to an internal ovector during the processing | 
 |        of pattern recursions, lookarounds, and atomic groups.  However,  these | 
 |        ovectors  hold no useful information because pcre2_dfa_match() does not | 
 |        support substring capturing. The value of capture_top is always  1  and | 
 |        the value of capture_last is always 0 for DFA matching. | 
 |  | 
 |        The subject and subject_length fields contain copies of the values that | 
 |        were passed to the matching function. | 
 |  | 
 |        The  start_match  field normally contains the offset within the subject | 
 |        at which the current match attempt started. However, if the escape  se- | 
 |        quence  \K  has  been encountered, this value is changed to reflect the | 
 |        modified starting point. If the pattern is not  anchored,  the  callout | 
 |        function may be called several times from the same point in the pattern | 
 |        for different starting points in the subject. | 
 |  | 
 |        The  current_position  field  contains the offset within the subject of | 
 |        the current match pointer. | 
 |  | 
 |        The pattern_position field contains the offset in the pattern string to | 
 |        the next item to be matched. | 
 |  | 
 |        The next_item_length field contains the length of the next item  to  be | 
 |        processed  in the pattern string. When the callout is at the end of the | 
 |        pattern, the length is zero.  When  the  callout  precedes  an  opening | 
 |        parenthesis, the length includes meta characters that follow the paren- | 
 |        thesis.  For  example,  in a callout before an assertion such as (?=ab) | 
 |        the length is 3. For an alternation bar or a closing  parenthesis,  the | 
 |        length  is  one,  unless a closing parenthesis is followed by a quanti- | 
 |        fier, in which case its length is included. (This  changed  in  release | 
 |        10.23.  In  earlier  releases, before an opening parenthesis the length | 
 |        was that of the entire group, and before an alternation bar or a  clos- | 
 |        ing parenthesis the length was zero.) | 
 |  | 
 |        The  pattern_position  and next_item_length fields are intended to help | 
 |        in distinguishing between different automatic callouts, which all  have | 
 |        the  same  callout  number. However, they are set for all callouts, and | 
 |        are used by pcre2test to show the next item to be matched when display- | 
 |        ing callout information. | 
 |  | 
 |        In callouts from pcre2_match() the mark field contains a pointer to the | 
 |        zero-terminated name of the most recently passed (*MARK), (*PRUNE),  or | 
 |        (*THEN)  item  in the match, or NULL if no such items have been passed. | 
 |        Instances of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a | 
 |        previous (*MARK). In callouts from the DFA matching function this field | 
 |        always contains NULL. | 
 |  | 
 |        The   callout_flags   field   is   always   zero   in   callouts   from | 
 |        pcre2_dfa_match() or when JIT is being used. When pcre2_match() without | 
 |        JIT is used, the following bits may be set: | 
 |  | 
 |          PCRE2_CALLOUT_STARTMATCH | 
 |  | 
 |        This is set for the first callout after the start of matching for  each | 
 |        new starting position in the subject. | 
 |  | 
 |          PCRE2_CALLOUT_BACKTRACK | 
 |  | 
 |        This  is  set if there has been a matching backtrack since the previous | 
 |        callout, or since the start of matching if this is  the  first  callout | 
 |        from a pcre2_match() run. | 
 |  | 
 |        Both  bits  are  set when a backtrack has caused a "bumpalong" to a new | 
 |        starting position in the subject. Output from pcre2test does not  indi- | 
 |        cate  the  presence  of these bits unless the callout_extra modifier is | 
 |        set. | 
 |  | 
 |        The information in the callout_flags field is provided so that applica- | 
 |        tions can track and tell their users how matching with backtracking  is | 
 |        done.  This  can be useful when trying to optimize patterns, or just to | 
 |        understand how PCRE2 works. There is no  support  in  pcre2_dfa_match() | 
 |        because there is no backtracking in DFA matching, and there is no | 
 |        support in JIT because JIT is all about maximizing matching performance. | 
 |        In both these cases the callout_flags field is always zero. | 
 |  | 
 |  | 
 | RETURN VALUES FROM CALLOUTS | 
 |  | 
 |        The external callout function returns an integer to PCRE2. If the value | 
 |        is zero, matching proceeds as normal. If  the  value  is  greater  than | 
 |        zero,  matching  fails  at  the current point, but the testing of other | 
 |        matching possibilities goes ahead, just as if a lookahead assertion had | 
 |        failed. If the value is less than zero, the match is abandoned, and the | 
 |        matching function returns the negative value. | 
 |  | 
 |        Negative values should normally be chosen from  the  set  of  PCRE2_ER- | 
 |        ROR_xxx  values.  In  particular, PCRE2_ERROR_NOMATCH forces a standard | 
 |        "no match" failure. The error number  PCRE2_ERROR_CALLOUT  is  reserved | 
 |        for use by callout functions; it will never be used by PCRE2 itself. | 
 |  | 
 |  | 
 | CALLOUT ENUMERATION | 
 |  | 
 |        int pcre2_callout_enumerate(const pcre2_code *code, | 
 |          int (*callback)(pcre2_callout_enumerate_block *, void *), | 
 |          void *user_data); | 
 |  | 
 |        A script language that supports the use of string arguments in callouts | 
 |        might  like  to  scan  all the callouts in a pattern before running the | 
 |        match. This can be done by calling pcre2_callout_enumerate(). The first | 
 |        argument is a pointer to a compiled pattern, the  second  points  to  a | 
 |        callback  function,  and the third is arbitrary user data. The callback | 
 |        function is called for every callout in the pattern  in  the  order  in | 
 |        which they appear. Its first argument is a pointer to a callout enumer- | 
 |        ation  block,  and  its second argument is the user_data value that was | 
 |        passed to pcre2_callout_enumerate(). The data block contains  the  fol- | 
 |        lowing fields: | 
 |  | 
 |          version                Block version number | 
 |          pattern_position       Offset to next item in pattern | 
 |          next_item_length       Length of next item in pattern | 
 |          callout_number         Number for numbered callouts | 
 |          callout_string_offset  Offset to string within pattern | 
 |          callout_string_length  Length of callout string | 
 |          callout_string         Points to callout string or is NULL | 
 |  | 
 |        The version number is currently 0. It will increase if new fields are | 
 |        ever added to the block. The remaining fields are the same as their | 
 |        namesakes in the pcre2_callout_block structure that is used for | 
 |        callouts during matching, as described above. | 
 |  | 
 |        Note that the value of pattern_position is  unique  for  each  callout. | 
 |        However,  if  a callout occurs inside a group that is quantified with a | 
 |        non-zero minimum or a fixed maximum, the group is replicated inside the | 
 |        compiled pattern. For example, a pattern such as /(a){2}/  is  compiled | 
 |        as  if it were /(a)(a)/. This means that the callout will be enumerated | 
 |        more than once, but with the same value for  pattern_position  in  each | 
 |        case. | 
 |  | 
 |        The callback function should normally return zero. If it returns a non- | 
 |        zero value, scanning the pattern stops, and that value is returned from | 
 |        pcre2_callout_enumerate(). | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 26 February 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                26 February 2025                PCRE2CALLOUT(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2COMPAT(3)             Library Functions Manual             PCRE2COMPAT(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | DIFFERENCES BETWEEN PCRE2 AND PERL | 
 |  | 
 |        This  document describes some of the known differences in the ways that | 
 |        PCRE2 and Perl handle regular expressions.  The  differences  described | 
 |        here  are  with  respect  to  Perl version 5.38.0, but as both Perl and | 
 |        PCRE2 are continually changing, the information may at times be out  of | 
 |        date. | 
 |  | 
 |        1.  When  PCRE2_DOTALL  (equivalent to Perl's /s qualifier) is not set, | 
 |        the behaviour of the '.' metacharacter differs from Perl. In PCRE2, '.' | 
 |        matches the next character unless it is the  start  of  a  newline  se- | 
 |        quence.  This  means  that, if the newline setting is CR, CRLF, or NUL, | 
 |        '.' will match the code point LF (0x0A) in ASCII/Unicode  environments, | 
 |        and  NL  (either  0x15 or 0x25) when using EBCDIC. In Perl, '.' appears | 
 |        never to match LF, even when 0x0A is not a newline indicator. | 
 |  | 
 |        2. PCRE2 has only a subset of Perl's Unicode support. Details  of  what | 
 |        it does have are given in the pcre2unicode page. | 
 |  | 
 |        3.  Like  Perl, PCRE2 allows repeat quantifiers on parenthesized asser- | 
 |        tions, but they do not mean what you might think. For example, (?!a){3} | 
 |        does not assert that the next three characters are not "a". It just as- | 
 |        serts that the next character is not "a"  three  times  (in  principle; | 
 |        PCRE2  optimizes this to run the assertion just once). Perl allows some | 
 |        repeat quantifiers on other assertions, for example, \b* , but these do | 
 |        not seem to have any use. PCRE2 does not allow any kind  of  quantifier | 
 |        on non-lookaround assertions. | 
 |  | 
 |        4.  If a braced quantifier such as {1,2} appears where there is nothing | 
 |        to repeat (for example, at the start of a branch), PCRE2 raises an  er- | 
 |        ror  whereas  Perl  treats the quantifier characters as literal. When a | 
 |        braced quantifier (...){min,max} has min > max, Perl treats  it  as  an | 
 |        item  which  fails to match any portion of the subject (as no number of | 
 |        repetitions can meet the condition), and additionally issues a  warning | 
 |        when in warning mode. PCRE2 has no warning features, so it gives an er- | 
 |        ror in this case. | 
 |  | 
 |        5.  Capture groups that occur inside negative lookaround assertions are | 
 |        counted, but their entries in the offsets vector are set  only  when  a | 
 |        negative  assertion is a condition that has a matching branch (that is, | 
 |        the condition is false).  Perl may set such  capture  groups  in  other | 
 |        circumstances. | 
 |  | 
 |        6.  The  following Perl escape sequences are not supported: \F, \l, \L, | 
 |        \u, \U, and \N when followed by a character name. \N on its own, match- | 
 |        ing a non-newline character, and \N{U+dd..}, matching  a  Unicode  code | 
 |        point,  are  supported.  The  escapes that modify the case of following | 
 |        letters are implemented by Perl's general string-handling and  are  not | 
 |        part of its pattern matching engine. If any of these are encountered by | 
 |        PCRE2,  an  error  is  generated  by default. However, if either of the | 
 |        PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX options is set, \U  and  \u  are | 
 |        interpreted as ECMAScript interprets them. | 
 |  | 
 |        7. The Perl escape sequences \p, \P, and \X are supported only if PCRE2 | 
 |        is built with Unicode support (the default). The properties that can be | 
 |        tested  with  \p  and \P are limited to the general category properties | 
 |        such as Lu and Nd, the derived properties  Any  and  Lc  (synonym  L&), | 
 |        script  names such as Greek or Han, Bidi_Class, Bidi_Control, and a few | 
 |        binary properties. Both PCRE2 and Perl support the Cs (surrogate) prop- | 
 |        erty, but in PCRE2 its use is limited. See the pcre2pattern  documenta- | 
 |        tion  for  details. The long synonyms for property names that Perl sup- | 
 |        ports (such as \p{Letter}) are not supported by PCRE2, nor is  it  per- | 
 |        mitted to prefix any of these properties with "Is". | 
 |  | 
 |        8. PCRE2 supports the \Q...\E escape for quoting substrings. Characters | 
 |        in between are treated as literals. However, this is slightly different | 
 |        from  Perl  in  that  $  and  @ are also handled as literals inside the | 
 |        quotes. In Perl, they cause variable interpolation (PCRE2 does not have | 
 |        variables). Also, Perl does "double-quotish backslash interpolation" on | 
 |        any backslashes between \Q and \E which, its documentation  says,  "may | 
 |        lead  to confusing results". PCRE2 treats a backslash between \Q and \E | 
 |        just like any other character. Note the following examples: | 
 |  | 
 |            Pattern            PCRE2 matches     Perl matches | 
 |  | 
 |            \Qabc$xyz\E        abc$xyz           abc followed by the | 
 |                                                   contents of $xyz | 
 |            \Qabc\$xyz\E       abc\$xyz          abc\$xyz | 
 |            \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz | 
 |            \QA\B\E            A\B               A\B | 
 |            \Q\\E              \                 \\E | 
 |  | 
 |        The \Q...\E sequence is recognized both inside  and  outside  character | 
 |        classes  by  both  PCRE2 and Perl. Another difference from Perl is that | 
 |        any appearance of \Q or \E inside what might otherwise be a  quantifier | 
 |        causes PCRE2 not to recognize the sequence as a quantifier. Perl recog- | 
 |        nizes  a  quantifier  if  (redundantly) either of the numbers is inside | 
 |        \Q...\E, but not if the separating comma is. When not recognized  as  a | 
 |        quantifier  a  sequence  such  as  {\Q1\E,2}  is treated as the literal | 
 |        string "{1,2}". | 
 |  | 
 |        9.  Fairly  obviously,  PCRE2  does  not  support  the  (?{code})   and | 
 |        (??{code}) constructions. However, PCRE2 does have a "callout" feature, | 
 |        which allows an external function to be called during pattern matching. | 
 |        See the pcre2callout documentation for details. | 
 |  | 
 |        10.  Subroutine calls (whether recursive or not) were treated as atomic | 
 |        groups up to PCRE2 release 10.23, but from release 10.30 this  changed, | 
 |        and backtracking into subroutine calls is now supported, as in Perl. | 
 |  | 
 |        11.  In  PCRE2,  if any of the backtracking control verbs are used in a | 
 |        group that is called as a  subroutine  (whether  or  not  recursively), | 
 |        their  effect is confined to that group; it does not extend to the sur- | 
 |        rounding pattern. This is not always the case in Perl.  In  particular, | 
 |        if  (*THEN)  is  present in a group that is called as a subroutine, its | 
 |        action is limited to that group, even if the group does not contain any | 
 |        | characters. Note that such groups are processed as  anchored  at  the | 
 |        point  where  they  are  tested.  PCRE2 also confines all control verbs | 
 |        within atomic assertions, again including (*THEN)  in  assertions  with | 
 |        only one branch. | 
 |  | 
 |        12.  If a pattern contains more than one backtracking control verb, the | 
 |        first one that is backtracked onto acts. For example,  in  the  pattern | 
 |        A(*COMMIT)B(*PRUNE)C  a  failure in B triggers (*COMMIT), but a failure | 
 |        in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases | 
 |        it is the same as PCRE2, but there are cases where it differs. | 
 |  | 
 |        13. There are some differences that are concerned with the settings  of | 
 |        captured  strings  when  part  of  a  pattern is repeated. For example, | 
 |        matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves  $2  un- | 
 |        set, but in PCRE2 it is set to "b". | 
 |  | 
 |        14.  PCRE2's  handling  of duplicate capture group numbers and names is | 
 |        not as general as Perl's. This is a consequence of the fact that PCRE2 | 
 |        works  internally  just with numbers, using an external table to trans- | 
 |        late between numbers and  names.  In  particular,  a  pattern  such  as | 
 |        (?|(?<a>A)|(?<b>B)),  where the two capture groups have the same number | 
 |        but different names, is not supported, and causes an error  at  compile | 
 |        time. If it were allowed, it would not be possible to distinguish which | 
 |        group  matched,  because  both  names map to capture group number 1. To | 
 |        avoid this confusing situation, an error is given at compile time. | 
 |  | 
 |        15. Perl used to recognize comments in some places that PCRE2 does not, | 
 |        for example, between the ( and ? at the start of a  group.  If  the  /x | 
 |        modifier  is  set,  Perl allowed white space between ( and ? though the | 
 |        latest Perls give an error (for a while it was just deprecated).  There | 
 |        may still be some cases where Perl behaves differently. | 
 |  | 
 |        16.  Perl,  when  in warning mode, gives warnings for character classes | 
 |        such as [A-\d] or [a-[:digit:]]. It then treats the hyphens  as  liter- | 
 |        als. PCRE2 has no warning features, so it gives an error in these cases | 
 |        because they are almost certainly user mistakes. | 
 |  | 
 |        17. In PCRE2, until release 10.45, the upper/lower case character prop- | 
 |        erties  Lu  and Ll were not affected when case-independent matching was | 
 |        specified. Perl has changed in this respect, and PCRE2 has now  changed | 
 |        to  match.  When  caseless  matching is in force, Lu, Ll, and Lt (title | 
 |        case) are all treated as Lc (cased letter). | 
 |  | 
 |        18. From release 5.32.0, Perl locks out the use of \K in lookaround as- | 
 |        sertions. From release 10.38 PCRE2 does the same by  default.  However, | 
 |        there  is  an  option for re-enabling the previous behaviour. When this | 
 |        option is set, \K is acted on when it occurs  in  positive  assertions, | 
 |        but is ignored in negative assertions. | 
 |  | 
 |        19.  PCRE2  provides some extensions to the Perl regular expression fa- | 
 |        cilities.  Perl 5.10 included new features that  were  not  in  earlier | 
 |        versions  of  Perl,  some  of which (such as named parentheses) were in | 
 |        PCRE2 for some time before. This list is with respect to Perl 5.38: | 
 |  | 
 |        (a) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set,  the | 
 |        $ meta-character matches only at the very end of the string. | 
 |  | 
 |        (b)  A  backslash  followed  by  a  letter  with  no special meaning is | 
 |        faulted. (Perl can be made to issue a warning.) | 
 |  | 
 |        (c) If PCRE2_UNGREEDY is set, the greediness of the repetition  quanti- | 
 |        fiers is inverted, that is, by default they are not greedy, but if fol- | 
 |        lowed by a question mark they are. | 
 |  | 
 |        (d)  PCRE2_ANCHORED  can be used at matching time to force a pattern to | 
 |        be tried only at the first matching position in the subject string. | 
 |  | 
 |        (e)    The    PCRE2_NOTBOL,    PCRE2_NOTEOL,     PCRE2_NOTEMPTY     and | 
 |        PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents. | 
 |  | 
 |        (f)  The  \R escape sequence can be restricted to match only CR, LF, or | 
 |        CRLF by the PCRE2_BSR_ANYCRLF option. | 
 |  | 
 |        (g) The callout facility is PCRE2-specific.  Perl  supports  codeblocks | 
 |        and variable interpolation, but not general hooks on every match. | 
 |  | 
 |        (h) The partial matching facility is PCRE2-specific. | 
 |  | 
 |        (i)  The alternative matching function (pcre2_dfa_match()) matches in a | 
 |        different way and is not Perl-compatible. | 
 |  | 
 |        (j) PCRE2 recognizes some special sequences such as (*CR) or  (*NO_JIT) | 
 |        at  the  start  of  a pattern. These set overall options that cannot be | 
 |        changed within the pattern. | 
 |  | 
 |        (k) PCRE2 supports non-atomic positive lookaround assertions.  This  is | 
 |        an extension to the lookaround facilities. The default, Perl-compatible | 
 |        lookarounds are atomic. | 
 |  | 
 |        (l)  There  are three syntactical items in patterns that can refer to a | 
 |        capturing group by number: back references such  as  \g{2},  subroutine | 
 |        calls  such  as (?3), and condition references such as (?(4)...). PCRE2 | 
 |        supports relative group numbers such as +2 and -4 in all  three  cases. | 
 |        Perl  supports both plus and minus for subroutine calls, but only minus | 
 |        for back references, and no relative numbering at all for conditions. | 
 |  | 
 |        (m) The scan substring assertion (syntax (*scs:(n)...)) is a PCRE2  ex- | 
 |        tension that is not available in Perl. | 
 |  | 
 |        20. Perl has different limits than PCRE2. See the pcre2limits | 
 |        documentation for details. With release 5.10, Perl switched from | 
 |        recursion to iteration, keeping the intermediate matches on the heap, | 
 |        which is ~10% slower but avoids any stack-overflow limit. PCRE2 made a | 
 |        similar change at release 10.30, and also has many build-time and | 
 |        run-time customizable limits. | 
 |  | 
 |        21. Unlike Perl, PCRE2 does not have character set modifiers and, in | 
 |        particular, has no way to select character set rules by context as | 
 |        Perl's "/d" does. A regular expression using PCRE2_UTF and PCRE2_UCP | 
 |        will use similar rules to Perl's "/u"; something closer to "/a" could | 
 |        be selected by adding other PCRE2_EXTRA_ASCII* options on top. | 
 |  | 
 |        22. Some recursive patterns that Perl diagnoses as infinite  recursions | 
 |        can be handled by PCRE2, either by the interpreter or the JIT. An exam- | 
 |        ple is /(?:|(?0)abcd)(?(R)|\z)/, which matches a sequence of any number | 
 |        of repeated "abcd" substrings at the end of the subject. | 
 |  | 
 |        23.  Both  PCRE2  and Perl error when \x{ escapes are invalid, but Perl | 
 |        tries to recover and prints a warning if the problem was  that  an  in- | 
 |        valid hexadecimal digit was found. Since PCRE2 doesn't have warnings it | 
 |        returns  an  error instead.  Additionally, Perl accepts \x{} and gener- | 
 |        ates NUL unlike PCRE2. | 
 |  | 
 |        24. From release 10.45, PCRE2 gives an error if \x is not followed by a | 
 |        hexadecimal digit or a curly bracket. It used to interpret this as  the | 
 |        NUL character. Perl still generates NUL, but warns when in warning mode | 
 |        in most cases. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 02 June 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                  02 June 2025                   PCRE2COMPAT(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2JIT(3)                Library Functions Manual                PCRE2JIT(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 JUST-IN-TIME COMPILER SUPPORT | 
 |  | 
 |        Just-in-time  compiling  is a heavyweight optimization that can greatly | 
 |        speed up pattern matching. However, it comes at the cost of extra  pro- | 
 |        cessing  before  the  match is performed, so it is of most benefit when | 
 |        the same pattern is going to be matched many times. This does not  nec- | 
 |        essarily  mean many calls of a matching function; if the pattern is not | 
 |        anchored, matching attempts may take place many times at various  posi- | 
 |        tions in the subject, even for a single call. Therefore, if the subject | 
 |        string  is  very  long,  it  may  still pay to use JIT even for one-off | 
 |        matches. JIT support is available for all  of  the  8-bit,  16-bit  and | 
 |        32-bit PCRE2 libraries. | 
 |  | 
 |        JIT  support  applies  only to the traditional Perl-compatible matching | 
 |        function.  It does not apply when the DFA matching  function  is  being | 
 |        used. The code for JIT support was written by Zoltan Herczeg. | 
 |  | 
 |  | 
 | AVAILABILITY OF JIT SUPPORT | 
 |  | 
 |        JIT  support  is  an  optional feature of PCRE2. The "configure" option | 
 |        --enable-jit (or equivalent CMake option) must be  set  when  PCRE2  is | 
 |        built  if  you want to use JIT. The support is limited to the following | 
 |        hardware platforms: | 
 |  | 
 |          ARM 32-bit (v7, and Thumb2) | 
 |          ARM 64-bit | 
 |          IBM s390x 64-bit | 
 |          Intel x86 32-bit and 64-bit | 
 |          LoongArch 64-bit | 
 |          MIPS 32-bit and 64-bit | 
 |          Power PC 32-bit and 64-bit | 
 |          RISC-V 32-bit and 64-bit | 
 |  | 
 |        If --enable-jit is set on an unsupported platform, compilation fails. | 
 |  | 
 |        A client program can tell if JIT support has been compiled  by  calling | 
 |        pcre2_config()  with  the PCRE2_CONFIG_JIT option. The result is one if | 
 |        PCRE2 was built with JIT support, and zero otherwise.  However,  having | 
 |        the  JIT code available does not guarantee that it will be used for any | 
 |        particular match. One reason for this is that there are a number of op- | 
 |        tions and pattern items that are not supported by JIT (see below).  An- | 
 |        other  reason  is  that  in some environments JIT is unable to get exe- | 
 |        cutable memory in which to build its compiled code. The only  guarantee | 
 |        from pcre2_config() is that if it returns zero, JIT will definitely not | 
 |        be used. | 
 |  | 
 |        As  of  release  10.45  there is a more informative way to test for JIT | 
 |        support. If  pcre2_jit_compile()  is  called  with  the  single  option | 
 |        PCRE2_JIT_TEST_ALLOC  it  returns  zero  if  JIT is available and has a | 
 |        working allocator. Otherwise it returns PCRE2_ERROR_NOMEMORY if JIT  is | 
 |        available but cannot allocate executable memory, or PCRE2_ERROR_JIT_UN- | 
 |        SUPPORTED if JIT support is not compiled. The code argument is ignored, | 
 |        so it can be a NULL value. | 
 |  | 
 |        A  simple  program  does not need to check availability in order to use | 
 |        JIT when possible. The API is implemented in a way that falls  back  to | 
 |        the  interpretive  code if JIT is not available or cannot be used for a | 
 |        given match. For programs that  need  the  best  possible  performance, | 
 |        there is a "fast path" API that is JIT-specific. | 
 |  | 
 |  | 
 | SIMPLE USE OF JIT | 
 |  | 
 |        To  make use of the JIT support in the simplest way, all you have to do | 
 |        is to call pcre2_jit_compile() after successfully compiling  a  pattern | 
 |        with pcre2_compile(). This function has two arguments: the first is the | 
 |        compiled  pattern pointer that was returned by pcre2_compile(), and the | 
 |        second is zero or more of the  following  option  bits:  PCRE2_JIT_COM- | 
 |        PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT. | 
 |  | 
 |        If  JIT  support  is  not available, a call to pcre2_jit_compile() does | 
 |        nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the  compiled | 
 |        pattern is passed to the JIT compiler, which turns it into machine code | 
 |        that executes much faster than the normal interpretive code, but yields | 
 |        exactly  the  same results. The returned value from pcre2_jit_compile() | 
 |        is zero on success, or a negative error code. | 
 |  | 
 |        There is a limit to the size of pattern that JIT supports,  imposed  by | 
 |        the  size  of machine stack that it uses. The exact rules are not docu- | 
 |        mented because they may change at any time, in particular, when new op- | 
 |        timizations are introduced.  If  a  pattern  is  too  big,  a  call  to | 
 |        pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY. | 
 |  | 
 |        PCRE2_JIT_COMPLETE  requests the JIT compiler to generate code for com- | 
 |        plete matches. If you want to run partial matches using the  PCRE2_PAR- | 
 |        TIAL_HARD  or  PCRE2_PARTIAL_SOFT  options of pcre2_match(), you should | 
 |        set one or both of  the  other  options  as  well  as,  or  instead  of | 
 |        PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code | 
 |        for  each of the three modes (normal, soft partial, hard partial). When | 
 |        pcre2_match() is called, the appropriate code is run if  it  is  avail- | 
 |        able. Otherwise, the pattern is matched using interpretive code. | 
 |  | 
 |        You  can  call pcre2_jit_compile() multiple times for the same compiled | 
 |        pattern. It does nothing if it has previously compiled code for any  of | 
 |        the  option bits. For example, you can call it once with PCRE2_JIT_COM- | 
 |        PLETE and (perhaps later, when you  find  you  need  partial  matching) | 
 |        again  with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it | 
 |        will ignore PCRE2_JIT_COMPLETE and just compile code for partial match- | 
 |        ing. If pcre2_jit_compile() is called with no option bits set, it imme- | 
 |        diately returns zero. This is an alternative way of testing whether JIT | 
 |        support has been compiled. | 
 |  | 
 |        At present, it is not possible to free JIT compiled  code  except  when | 
 |        the entire compiled pattern is freed by calling pcre2_code_free(). | 
 |  | 
 |        In  some circumstances you may need to call additional functions. These | 
 |        are described in the section entitled "Controlling the JIT  stack"  be- | 
 |        low. | 
 |  | 
 |        There are some pcre2_match() options that are not supported by JIT, and | 
 |        there  are  also some pattern items that JIT cannot handle. Details are | 
 |        given below.  In both cases, matching automatically falls back  to  the | 
 |        interpretive  code.  If  you want to know whether JIT was actually used | 
 |        for a particular match, you should arrange for a JIT callback  function | 
 |        to  be set up as described in the section entitled "Controlling the JIT | 
 |        stack" below, even if you do not  need  to  supply  a  non-default  JIT | 
 |        stack. Such a callback function is called whenever JIT code is about to | 
 |        be  obeyed.  If the match-time options are not right for JIT execution, | 
 |        the callback function is not obeyed. | 
 |  | 
 |        If the JIT compiler finds an unsupported item, no JIT  data  is  gener- | 
 |        ated. You can find out if JIT compilation was successful for a compiled | 
 |        pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE op- | 
 |        tion.  A  non-zero  result means that JIT compilation was successful. A | 
 |        result of 0 means that JIT support is not available, or the pattern was | 
 |        not processed by pcre2_jit_compile(), or the JIT compiler was not  able | 
 |        to  handle  the  pattern. Successful JIT compilation does not, however, | 
 |        guarantee the use of JIT at match time because  there  are  some  match | 
 |        time options that are not supported by JIT. | 
 |  | 
 |  | 
 | MATCHING SUBJECTS CONTAINING INVALID UTF | 
 |  | 
 |        When  a  pattern is compiled with the PCRE2_UTF option, subject strings | 
 |        are normally expected to be a valid sequence of UTF code units. By  de- | 
 |        fault,  this is checked at the start of matching and an error is gener- | 
 |        ated if invalid UTF is detected. The PCRE2_NO_UTF_CHECK option  can  be | 
 |        passed to pcre2_match() to skip the check (for improved performance) if | 
 |        you  are  sure  that  a subject string is valid. If this option is used | 
 |        with an invalid string, the result is undefined.  The  calling  program | 
 |        may crash or loop or otherwise misbehave. | 
 |  | 
 |        However, a way of running matches on strings that may contain invalid | 
 |        UTF sequences is available. Calling pcre2_compile() with the | 
 |        PCRE2_MATCH_INVALID_UTF option has two effects: it tells the | 
 |        interpreter in pcre2_match() to support invalid UTF, and, if | 
 |        pcre2_jit_compile() is subsequently called, the compiled JIT code also | 
 |        supports invalid UTF. Details of how this support works, in both the | 
 |        JIT and the interpretive cases, are given in the pcre2unicode | 
 |        documentation. | 
 |  | 
 |        There  is  also  an  obsolete  option  for  pcre2_jit_compile()  called | 
 |        PCRE2_JIT_INVALID_UTF, which currently exists only for backward compat- | 
 |        ibility.    It   is   superseded   by   the   pcre2_compile()    option | 
 |        PCRE2_MATCH_INVALID_UTF and should no longer be used. It may be removed | 
 |        in future. | 
 |  | 
 |  | 
 | UNSUPPORTED OPTIONS AND PATTERN ITEMS | 
 |  | 
 |        The  pcre2_match()  options  that  are  supported  for JIT matching are | 
 |        PCRE2_COPY_MATCHED_SUBJECT, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, | 
 |        PCRE2_NOTEMPTY_ATSTART,  PCRE2_NO_UTF_CHECK,  PCRE2_PARTIAL_HARD,   and | 
 |        PCRE2_PARTIAL_SOFT.  The  PCRE2_ANCHORED  and PCRE2_ENDANCHORED options | 
 |        are not supported at match time. | 
 |  | 
 |        If the PCRE2_NO_JIT option is passed to pcre2_match() it  disables  the | 
 |        use of JIT, forcing matching by the interpreter code. | 
 |  | 
 |        The  only  unsupported  pattern items are \C (match a single data unit) | 
 |        when running in a UTF mode, and a callout immediately before an  asser- | 
 |        tion condition in a conditional group. | 
 |  | 
 |  | 
 | RETURN VALUES FROM JIT MATCHING | 
 |  | 
 |        When  a pattern is matched using JIT, the return values are the same as | 
 |        those given by the interpretive pcre2_match() code, with  the  addition | 
 |        of  one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means that the | 
 |        memory used for the JIT stack was insufficient.  See  "Controlling  the | 
 |        JIT stack" below for a discussion of JIT stack usage. | 
 |  | 
 |        The  error  code  PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if | 
 |        searching a very large pattern tree goes on for too long, as it  is  in | 
 |        the  same circumstance when JIT is not used, but the details of exactly | 
 |        what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code | 
 |        is never returned when JIT matching is used. | 
 |  | 
 |  | 
 | CONTROLLING THE JIT STACK | 
 |  | 
 |        When the compiled JIT code runs, it needs a block of memory to use as a | 
 |        stack.  By default, it uses 32KiB on the machine stack.  However,  some | 
 |        large  or complicated patterns need more than this. The error PCRE2_ER- | 
 |        ROR_JIT_STACKLIMIT is given when there is not enough stack. Three func- | 
 |        tions are provided for managing blocks of memory for use as JIT stacks. | 
 |        There is further discussion about the use of JIT stacks in the  section | 
 |        entitled "JIT stack FAQ" below. | 
 |  | 
 |        The  pcre2_jit_stack_create()  function  creates a JIT stack. Its argu- | 
 |        ments are a starting size, a maximum size, and a general  context  (for | 
 |        memory  allocation  functions, or NULL for standard memory allocation). | 
 |        It returns a pointer to an opaque structure of type pcre2_jit_stack, or | 
 |        NULL if there is an error. The pcre2_jit_stack_free() function is  used | 
 |        to free a stack that is no longer needed. If its argument is NULL, this | 
 |        function  returns immediately, without doing anything. (For the techni- | 
 |        cally minded: the address space is allocated by mmap or  VirtualAlloc.) | 
 |        A  maximum  stack size of 512KiB to 1MiB should be more than enough for | 
 |        any pattern. | 
 |  | 
 |        The pcre2_jit_stack_assign() function specifies which  stack  JIT  code | 
 |        should use. Its arguments are as follows: | 
 |  | 
 |          pcre2_match_context  *mcontext | 
 |          pcre2_jit_callback    callback | 
 |          void                 *data | 
 |  | 
 |        The first argument is a pointer to a match context. When this is subse- | 
 |        quently passed to a matching function, its information determines which | 
 |        JIT stack is used. If this argument is NULL, the function returns | 
 |        immediately, without doing anything. There are three cases for the | 
 |        values of the other two arguments: | 
 |  | 
 |          (1) If callback is NULL and data is NULL, an internal 32KiB block | 
 |              on the machine stack is used. This is the default when a match | 
 |              context is created. | 
 |  | 
 |          (2) If callback is NULL and data is not NULL, data must be | 
 |              a pointer to a valid JIT stack, the result of calling | 
 |              pcre2_jit_stack_create(). | 
 |  | 
 |          (3) If callback is not NULL, it must point to a function that is | 
 |              called with data as an argument at the start of matching, in | 
 |              order to set up a JIT stack. If the return from the callback | 
 |              function is NULL, the internal 32KiB stack is used; otherwise the | 
 |              return value must be a valid JIT stack, the result of calling | 
 |              pcre2_jit_stack_create(). | 
 |  | 
 |        A callback function is obeyed whenever JIT code is about to be run;  it | 
 |        is not obeyed when pcre2_match() is called with options that are incom- | 
 |        patible  for JIT matching. A callback function can therefore be used to | 
 |        determine whether a match operation was executed by JIT or by  the  in- | 
 |        terpreter. | 
 |  | 
 |        You may safely use the same JIT stack for more than one pattern (either | 
 |        by  assigning  directly  or  by  callback), as long as the patterns are | 
 |        matched sequentially in the same thread. Currently, the only way to set | 
 |        up non-sequential matches in one thread is to use callouts: if a  call- | 
 |        out  function starts another match, that match must use a different JIT | 
 |        stack to the one used for currently suspended match(es). | 
 |  | 
 |        In a multithread application, if you do not specify a JIT stack, or  if | 
 |        you  assign or pass back NULL from a callback, that is thread-safe, be- | 
 |        cause each thread has its own machine stack. However, if you assign  or | 
 |        pass back a non-NULL JIT stack, this must be a different stack for each | 
 |        thread so that the application is thread-safe. | 
 |  | 
 |        Strictly  speaking,  even more is allowed. You can assign the same non- | 
 |        NULL stack to a match context that is used by any number  of  patterns, | 
 |        as  long  as  they are not used for matching by multiple threads at the | 
 |        same time. For example, you could use the same stack  in  all  compiled | 
 |        patterns,  with  a global mutex in the callback to wait until the stack | 
 |        is available for use. However, this is an inefficient solution, and not | 
 |        recommended. | 
 |  | 
 |        This is a suggestion for how a multithreaded program that needs to  set | 
 |        up non-default JIT stacks might operate: | 
 |  | 
 |          During thread initialization | 
 |            thread_local_var = pcre2_jit_stack_create(...) | 
 |  | 
 |          During thread exit | 
 |            pcre2_jit_stack_free(thread_local_var) | 
 |  | 
 |          Use a one-line callback function | 
 |            return thread_local_var | 
 |  | 
 |        All  the  functions  described in this section do nothing if JIT is not | 
 |        available. | 
 |  | 
 |  | 
 | JIT STACK FAQ | 
 |  | 
 |        (1) Why do we need JIT stacks? | 
 |  | 
 |        PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack | 
 |        where the local data of the current node is pushed before checking  its | 
 |        child nodes. Allocating real machine stack is difficult on some | 
 |        platforms. For example, on PowerPC the stack chain needs to be updated | 
 |        every time the stack is extended. Although this is possible, the time | 
 |        spent updating it decreases performance. So we do the recursion in | 
 |        memory. | 
 |  | 
 |        (2) Why don't we simply allocate blocks of memory with malloc()? | 
 |  | 
 |        Modern operating systems have a nice feature: they can reserve  an  ad- | 
 |        dress space instead of allocating memory. We can safely allocate memory | 
 |        pages inside this address space, so the stack could grow without moving | 
 |        memory  data (this is important because of pointers). Thus we can allo- | 
 |        cate 1MiB address space, and use only a  single  memory  page  (usually | 
 |        4KiB)  if that is enough. However, we can still grow up to 1MiB anytime | 
 |        if needed. | 
 |  | 
 |        (3) Who "owns" a JIT stack? | 
 |  | 
 |        The owner of the stack is the user program, not the JIT-compiled | 
 |        pattern or anything else. The user program must ensure that while a | 
 |        stack is being used by pcre2_match() (that is, it is assigned to a | 
 |        match context that is passed to a match currently running), it is not | 
 |        used by any other thread (to avoid overwriting the same memory area). | 
 |        The best practice for multithreaded programs is to allocate a stack for | 
 |        each thread, and return this stack through the JIT callback function. | 
 |  | 
 |        (4) When should a JIT stack be freed? | 
 |  | 
 |        You can free a JIT stack at any time, as long as it will not be used by | 
 |        pcre2_match() again. When you assign the stack to a match context, only | 
 |        a pointer is set. There is no reference counting or  any  other  magic. | 
 |        You can free compiled patterns, contexts, and stacks in any order, any- | 
 |        time.   Just do not call pcre2_match() with a match context pointing to | 
 |        an already freed stack, as that will cause a segmentation fault. | 
 |        (Also, do not free a stack currently used by pcre2_match() in another | 
 |        thread.) You can | 
 |        also  replace the stack in a context at any time when it is not in use. | 
 |        You should free the previous stack before assigning a replacement. | 
 |  | 
 |        (5) Should I allocate/free a  stack  every  time  before/after  calling | 
 |        pcre2_match()? | 
 |  | 
 |        No, because this is too costly in terms of resources. However, you | 
 |        could implement a scheme that releases the stack if it has not been | 
 |        used for, say, two minutes. The JIT callback can help to achieve this | 
 |        without keeping a list of patterns. | 
 |  | 
 |        (6) OK, the stack is for long term memory allocation. But what  happens | 
 |        if  a  pattern causes stack overflow with a stack of 1MiB? Is that 1MiB | 
 |        kept until the stack is freed? | 
 |  | 
 |        Especially on embedded systems, it might be a good idea to release mem- | 
 |        ory sometimes without freeing the stack. There is no API  for  this  at | 
 |        the  moment.  Probably a function call which returns with the currently | 
 |        allocated memory for any stack and another which allows releasing  mem- | 
 |        ory (shrinking the stack) would be a good idea if someone needs this. | 
 |  | 
 |        (7) This is too much of a headache. Isn't there any better solution for | 
 |        JIT stack handling? | 
 |  | 
 |        No,  thanks to Windows. If POSIX threads were used everywhere, we could | 
 |        throw out this complicated API. | 
 |  | 
 |  | 
 | FREEING JIT SPECULATIVE MEMORY | 
 |  | 
 |        void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext); | 
 |  | 
 |        The JIT executable allocator does not always free memory as soon as it | 
 |        is possible to do so. It expects new allocations, and keeps some free | 
 |        memory around to | 
 |        improve  allocation  speed. However, in low memory conditions, it might | 
 |        be better to free all possible memory. You can cause this to happen  by | 
 |        calling  pcre2_jit_free_unused_memory(). Its argument is a general con- | 
 |        text, for custom memory management, or NULL for standard memory manage- | 
 |        ment. | 
 |  | 
 |  | 
 | EXAMPLE CODE | 
 |  | 
 |        This is a single-threaded example that specifies a  JIT  stack  without | 
 |        using  a  callback.  A real program should include error checking after | 
 |        all the function calls. | 
 |  | 
 |          int rc; | 
 |          int errornumber;         /* error code from pcre2_compile() */ | 
 |          PCRE2_SIZE erroffset;    /* pattern offset of a compile error */ | 
 |          pcre2_code *re; | 
 |          pcre2_match_data *match_data; | 
 |          pcre2_match_context *mcontext; | 
 |          pcre2_jit_stack *jit_stack; | 
 |  | 
 |          /* pattern, subject, and length are supplied by the caller */ | 
 |          re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0, | 
 |            &errornumber, &erroffset, NULL); | 
 |          rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE); | 
 |          mcontext = pcre2_match_context_create(NULL); | 
 |          /* 32KiB starting size, allowed to grow to 512KiB */ | 
 |          jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL); | 
 |          pcre2_jit_stack_assign(mcontext, NULL, jit_stack); | 
 |          /* first argument is the number of ovector pairs */ | 
 |          match_data = pcre2_match_data_create(10, NULL); | 
 |          rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext); | 
 |          /* Process result */ | 
 |  | 
 |          pcre2_code_free(re); | 
 |          pcre2_match_data_free(match_data); | 
 |          pcre2_match_context_free(mcontext); | 
 |          pcre2_jit_stack_free(jit_stack); | 
 |  | 
 |  | 
 | JIT FAST PATH API | 
 |  | 
 |        Because the API described above falls back to interpreted matching when | 
 |        JIT is not available, it is convenient for programs  that  are  written | 
 |        for  general  use  in  many  environments.  However,  calling  JIT  via | 
 |        pcre2_match() does have a performance impact. Programs that are written | 
 |        for use where JIT is known to be available, and  which  need  the  best | 
 |        possible  performance,  can  instead  use a "fast path" API to call JIT | 
 |        matching directly instead of calling pcre2_match() (obviously only  for | 
 |        patterns that have been successfully processed by pcre2_jit_compile()). | 
 |  | 
 |        The  fast  path  function is called pcre2_jit_match(), and it takes ex- | 
 |        actly the same arguments as pcre2_match(). However, the subject  string | 
 |        must  be  specified  with  a  length; PCRE2_ZERO_TERMINATED is not sup- | 
 |        ported.  Unsupported  option  bits  (for  example,  PCRE2_ANCHORED  and | 
 |        PCRE2_ENDANCHORED)  are ignored, as is the PCRE2_NO_JIT option. The re- | 
 |        turn values are also the same  as  for  pcre2_match(),  plus  PCRE2_ER- | 
 |        ROR_JIT_BADOPTION if a matching mode (partial or complete) is requested | 
 |        that was not compiled. | 
 |  | 
 |        When  you call pcre2_match(), as well as testing for invalid options, a | 
 |        number of other sanity checks are performed on the arguments. For exam- | 
 |        ple, if the subject pointer is NULL but the length is non-zero, an  im- | 
 |        mediate  error  is given. Also, unless PCRE2_NO_UTF_CHECK is set, a UTF | 
 |        subject string is tested for validity. In the interests of speed, these | 
 |        checks do not happen on the JIT fast  path.  If  invalid  UTF  data  is | 
 |        passed  when  PCRE2_MATCH_INVALID_UTF  was not set for pcre2_compile(), | 
 |        the result is undefined. The program may crash or loop  or  give  wrong | 
 |        results.  In  the  absence  of  PCRE2_MATCH_INVALID_UTF you should call | 
 |        pcre2_jit_match() in UTF mode only if  you  are  sure  the  subject  is | 
 |        valid. | 
 |  | 
 |        Bypassing  the  sanity  checks  and the pcre2_match() wrapping can give | 
 |        speedups of more than 10%. | 
 |  | 
 |  | 
 | SEE ALSO | 
 |  | 
 |        pcre2api(3), pcre2unicode(3) | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel (FAQ by Zoltan Herczeg) | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 22 August 2024 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                 22 August 2024                     PCRE2JIT(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2LIMITS(3)             Library Functions Manual             PCRE2LIMITS(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | SIZE AND OTHER LIMITATIONS | 
 |  | 
 |        There are some size limitations in PCRE2 but it is hoped that they will | 
 |        never in practice be relevant. | 
 |  | 
 |        The  maximum  size  of  a compiled pattern is approximately 64 thousand | 
 |        code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with | 
 |        the default internal linkage size, which  is  2  bytes  for  these  li- | 
 |        braries.  If  you  want  to  process regular expressions that are truly | 
 |        enormous, you can compile PCRE2 with an internal linkage size of 3 or 4 | 
 |        (when building the 16-bit library, 3 is  rounded  up  to  4).  See  the | 
 |        README file in the source distribution and the pcre2build documentation | 
 |        for  details.  In  these cases the limit is substantially larger.  How- | 
 |        ever, the speed of execution is slower. In the 32-bit library, the  in- | 
 |        ternal linkage size is always 4. | 
 |  | 
 |        The maximum length of a source pattern string is essentially unlimited; | 
 |        it  is  the largest number a PCRE2_SIZE variable can hold. However, the | 
 |        program that calls pcre2_compile() can specify a smaller limit. | 
 |  | 
 |        The maximum length (in code units) of a subject string is one less than | 
 |        the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an un- | 
 |        signed integer type, usually defined as size_t. Its maximum value (that | 
 |        is ~(PCRE2_SIZE)0) is reserved as a special indicator  for  zero-termi- | 
 |        nated strings and unset offsets. | 
 |  | 
 |        All values in repeating quantifiers must be less than 65536. | 
 |  | 
 |        There are two different limits that apply to branches of lookbehind as- | 
 |        sertions.   If every branch in such an assertion matches a fixed number | 
 |        of characters, the maximum length of any branch is 65535 characters. If | 
 |        any branch matches a variable number of characters,  then  the  maximum | 
 |        matching  length  for every branch is limited. The default limit is set | 
 |        when PCRE2 is built, defaulting to 255, but can be changed by the call- | 
 |        ing program. | 
 |  | 
 |        There  is no limit to the number of parenthesized groups, but there can | 
 |        be no more than 65535 capture groups, and there is a limit to the depth | 
 |        of nesting of parenthesized subpatterns of all kinds. This  is  imposed | 
 |        in  order to limit the amount of system stack used at compile time. The | 
 |        default limit can be specified when PCRE2 is built; if not, the default | 
 |        is set to  250.  An  application  can  change  this  limit  by  calling | 
 |        pcre2_set_parens_nest_limit() to set the limit in a compile context. | 
 |  | 
 |        The maximum length of the name for a named capture group as well as the | 
 |        number of such groups is configurable at build time. The maximum length | 
 |        for the name defaults to 128 code units, and the maximum number of such | 
 |        groups to 10000. | 
 |  | 
 |        The  maximum  length  of  a  name  in  a (*MARK), (*PRUNE), (*SKIP), or | 
 |        (*THEN) verb is 255 code units for the 8-bit  library  and  65535  code | 
 |        units for the 16-bit and 32-bit libraries. | 
 |  | 
 |        The  maximum  length  of  a string argument to a callout is the largest | 
 |        number a 32-bit unsigned integer can hold. | 
 |  | 
 |        The maximum amount of heap memory used for matching  is  controlled  by | 
 |        the  heap  limit,  which can be set in a pattern or in a match context. | 
 |        The default is a very large number, effectively unlimited. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 03 September 2025 | 
 |        Copyright (c) 1997-2023 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                03 September 2025                PCRE2LIMITS(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2MATCHING(3)           Library Functions Manual           PCRE2MATCHING(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 MATCHING ALGORITHMS | 
 |  | 
 |        This document describes the two different algorithms that are available | 
 |        in  PCRE2  for  matching  a compiled regular expression against a given | 
 |        subject string. The "standard" algorithm is the  one  provided  by  the | 
 |        pcre2_match()  function.  This works in the same way as Perl's matching | 
 |        function, and provides a Perl-compatible matching operation. The  just- | 
 |        in-time (JIT) optimization that is described in the pcre2jit documenta- | 
 |        tion is compatible with this function. | 
 |  | 
 |        An alternative algorithm is provided by the pcre2_dfa_match() function; | 
 |        it operates in a different way, and is not Perl-compatible. This alter- | 
 |        native  has advantages and disadvantages compared with the standard al- | 
 |        gorithm, and these are described below. | 
 |  | 
 |        When there is only one possible way in which a given subject string can | 
 |        match a pattern, the two algorithms give the same answer. A  difference | 
 |        arises, however, when there are multiple possibilities. For example, if | 
 |        the anchored pattern | 
 |  | 
 |          ^<.*> | 
 |  | 
 |        is matched against the string | 
 |  | 
 |          <something> <something else> <something further> | 
 |  | 
 |        there are three possible answers. The standard algorithm finds only one | 
 |        of them, whereas the alternative algorithm finds all three. | 
 |  | 
 |  | 
 | REGULAR EXPRESSIONS AS TREES | 
 |  | 
 |        The set of strings that are matched by a regular expression can be rep- | 
 |        resented  as  a  tree structure. An unlimited repetition in the pattern | 
 |        makes the tree of infinite size, but it is still a tree.  Matching  the | 
 |        pattern  to a given subject string (from a given starting point) can be | 
 |        thought of as a search of the tree.  There are two  ways  to  search  a | 
 |        tree:  depth-first  and  breadth-first, and these correspond to the two | 
 |        matching algorithms provided by PCRE2. | 
 |  | 
 |  | 
 | THE STANDARD MATCHING ALGORITHM | 
 |  | 
 |        In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres- | 
 |        sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a | 
 |        depth-first search of the pattern tree. That is, it  proceeds  along  a | 
 |        single path through the tree, checking that the subject matches what is | 
 |        required.  When  there  is a mismatch, the algorithm tries any alterna- | 
 |        tives at the current point, and if they all fail, it backs  up  to  the | 
 |        previous  branch  point  in  the  tree,  and tries the next alternative | 
 |        branch at that level. This often involves backing  up  (moving  to  the | 
 |        left)  in  the  subject  string  as well. The order in which repetition | 
 |        branches are tried is controlled by the greedy or  ungreedy  nature  of | 
 |        the quantifier. | 
 |  | 
 |        If  a  leaf  node  is reached, a matching string has been found, and at | 
 |        that point the algorithm stops. Thus, if there is more than one  possi- | 
 |        ble  match, this algorithm returns the first one that it finds. Whether | 
 |        this is the shortest, the longest, or some intermediate length  depends | 
 |        on the way the alternations and the greedy or ungreedy repetition quan- | 
 |        tifiers are specified in the pattern. | 
 |  | 
 |        Because  it  ends  up  with a single path through the tree, it is rela- | 
 |        tively straightforward for this algorithm to keep  track  of  the  sub- | 
 |        strings  that  are  matched  by portions of the pattern in parentheses. | 
 |        This provides support for capturing parentheses and backreferences. | 
 |  | 
 |  | 
 | THE ALTERNATIVE MATCHING ALGORITHM | 
 |  | 
 |        This algorithm conducts a breadth-first search of  the  tree.  Starting | 
 |        from  the  first  matching  point  in the subject, it scans the subject | 
 |        string from left to right, once, character by character, and as it does | 
 |        this, it remembers all the paths through the tree that represent  valid | 
 |        matches.  In  Friedl's  terminology, this is a kind of "DFA algorithm", | 
 |        though it is not implemented as a traditional finite state machine  (it | 
 |        keeps multiple states active simultaneously). | 
 |  | 
 |        Although  the  general  principle of this matching algorithm is that it | 
 |        scans the subject string only once, without backtracking, there is  one | 
 |        exception:  when  a lookaround assertion is encountered, the characters | 
 |        following or preceding the current point have to be  independently  in- | 
 |        spected. | 
 |  | 
 |        The  scan  continues until either the end of the subject is reached, or | 
 |        there are no more unterminated paths. At this point,  terminated  paths | 
 |        represent  the different matching possibilities (if there are none, the | 
 |        match has failed).  Thus, if there is more  than  one  possible  match, | 
 |        this  algorithm  finds  all  of  them,  and in particular, it finds the | 
 |        longest. The matches are returned in the output  vector  in  decreasing | 
 |        order  of  length.  There  is an option to stop the algorithm after the | 
 |        first match (which is necessarily the shortest) is found. | 
 |  | 
 |        Note that the size of vector needed to contain all the results  depends | 
 |        on  the  number of simultaneous matches, not on the number of capturing | 
 |        parentheses in  the  pattern.  Using  pcre2_match_data_create_from_pat- | 
 |        tern()  to  create the match data block is therefore not advisable when | 
 |        doing DFA matching. | 
 |  | 
 |        Note also that all the matches that are found start at the  same  point | 
 |        in the subject. If the pattern | 
 |  | 
 |          cat(er(pillar)?)? | 
 |  | 
 |        is  matched  against the string "the caterpillar catchment", the result | 
 |        is the three strings "caterpillar", "cater", and "cat"  that  start  at | 
 |        the  fifth  character  of the subject. The algorithm does not automati- | 
 |        cally move on to find matches that start at later positions. | 
 |  | 
 |        PCRE2's "auto-possessification" optimization usually applies to charac- | 
 |        ter repeats at the end of a pattern (as well as internally). For  exam- | 
 |        ple, the pattern "a\d+" is compiled as if it were "a\d++" because there | 
 |        is  no  point even considering the possibility of backtracking into the | 
 |        repeated digits. For DFA matching, this means that  only  one  possible | 
 |        match  is  found. If you really do want multiple matches in such cases, | 
 |        either use an ungreedy repeat ("a\d+?") or set  the  PCRE2_NO_AUTO_POS- | 
 |        SESS option when compiling. | 
 |  | 
 |        There  are  a  number of features of PCRE2 regular expressions that are | 
 |        not supported or behave differently in the alternative  matching  func- | 
 |        tion. Those that are not supported cause an error if encountered. | 
 |  | 
 |        1.  Because the algorithm finds all possible matches, the greedy or un- | 
 |        greedy nature of repetition quantifiers is not relevant (though it  may | 
 |        affect  auto-possessification,  as  just  described).  During matching, | 
 |        greedy and ungreedy quantifiers are treated in exactly  the  same  way. | 
 |        However, possessive quantifiers can make a difference when what follows | 
 |        could  also  match  what  is  quantified, for example in a pattern like | 
 |        this: | 
 |  | 
 |          ^a++\w! | 
 |  | 
 |        This pattern matches "aaab!" but not "aaa!", which would be matched  by | 
 |        a  non-possessive quantifier. Similarly, if an atomic group is present, | 
 |        it is matched as if it were a standalone pattern at the current  point, | 
 |        and  the  longest match is then "locked in" for the rest of the overall | 
 |        pattern. | 
 |  | 
 |        2. When dealing with multiple paths through the tree simultaneously, it | 
 |        is not straightforward to keep track of  captured  substrings  for  the | 
 |        different  matching  possibilities,  and PCRE2's implementation of this | 
 |        algorithm does not attempt to do this. This means that no captured sub- | 
 |        strings are available. | 
 |  | 
 |        3. Because no substrings are captured, a number of related features are | 
 |        not available: | 
 |  | 
 |        (a) Backreferences; | 
 |  | 
 |        (b) Conditional expressions that use a backreference as  the  condition | 
 |        or test for a specific group recursion; | 
 |  | 
 |        (c) Script runs; | 
 |  | 
 |        (d) Scan substring assertions. | 
 |  | 
 |        4. Because many paths through the tree may be active, the \K escape se- | 
 |        quence,  which  resets the start of the match when encountered (but may | 
 |        be on some paths and not on others), is not supported. | 
 |  | 
 |        5. Callouts are supported, but the value of the  capture_top  field  is | 
 |        always 1, and the value of the capture_last field is always 0. | 
 |  | 
 |        6.  The  \C  escape  sequence, which (in the standard algorithm) always | 
 |        matches a single code unit, even in a UTF mode, is not supported in UTF | 
 |        modes because the  alternative  algorithm  moves  through  the  subject | 
 |        string  one  character  (not code unit) at a time, for all active paths | 
 |        through the tree. | 
 |  | 
 |        7. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE) | 
 |        are  not  supported.  (*FAIL)  is supported, and behaves like a failing | 
 |        negative assertion. | 
 |  | 
 |        8. The PCRE2_MATCH_INVALID_UTF option for pcre2_compile() is  not  sup- | 
 |        ported by pcre2_dfa_match(). | 
 |  | 
 |  | 
 | ADVANTAGES OF THE ALTERNATIVE ALGORITHM | 
 |  | 
 |        The  main  advantage  of the alternative algorithm is that all possible | 
 |        matches (at a single point in the subject) are automatically found, and | 
 |        in particular, the longest match is found. To find more than one  match | 
 |        at  the  same point using the standard algorithm, you have to do kludgy | 
 |        things with callouts. | 
 |  | 
 |        Partial matching is possible with this algorithm, though  it  has  some | 
 |        limitations.  The  pcre2partial  documentation gives details of partial | 
 |        matching and discusses multi-segment matching. | 
 |  | 
 |  | 
 | DISADVANTAGES OF THE ALTERNATIVE ALGORITHM | 
 |  | 
 |        The alternative algorithm suffers from a number of disadvantages: | 
 |  | 
 |        1. It is substantially slower than  the  standard  algorithm.  This  is | 
 |        partly  because  it has to search for all possible matches, but is also | 
 |        because it is less susceptible to optimization. | 
 |  | 
 |        2. Capturing parentheses and other features such as backreferences that | 
 |        rely on them are not supported. | 
 |  | 
 |        3. Matching within invalid UTF strings is not supported. | 
 |  | 
 |        4. Although atomic groups are supported, their use does not provide the | 
 |        performance advantage that it does for the standard algorithm. | 
 |  | 
 |        5. JIT optimization is not supported. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 22 February 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                22 February 2025               PCRE2MATCHING(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2PARTIAL(3)            Library Functions Manual            PCRE2PARTIAL(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PARTIAL MATCHING IN PCRE2 | 
 |  | 
 |        In  normal use of PCRE2, if there is a match up to the end of a subject | 
 |        string, but more characters are needed to  match  the  entire  pattern, | 
 |        PCRE2_ERROR_NOMATCH  is  returned,  just  like any other failing match. | 
 |        There are circumstances where it might be helpful to  distinguish  this | 
 |        "partial match" case. | 
 |  | 
 |        One  example  is  an application where the subject string is very long, | 
 |        and not all available at once. The requirement here is to be able to do | 
 |        the matching segment by segment, but special action is  needed  when  a | 
 |        matched substring spans the boundary between two segments. | 
 |  | 
 |        Another  example is checking a user input string as it is typed, to en- | 
 |        sure that it conforms to a required format. Invalid characters  can  be | 
 |        immediately diagnosed and rejected, giving instant feedback. | 
 |  | 
 |        Partial  matching  is a PCRE2-specific feature; it is not Perl-compati- | 
 |        ble. It is requested  by  setting  one  of  the  PCRE2_PARTIAL_HARD  or | 
 |        PCRE2_PARTIAL_SOFT  options  when calling a matching function. The dif- | 
 |        ference between the two options is whether or not a  partial  match  is | 
 |        preferred  to  an alternative complete match, though the details differ | 
 |        between the two types of matching function. If both  options  are  set, | 
 |        PCRE2_PARTIAL_HARD takes precedence. | 
 |  | 
 |        If  you  want to use partial matching with just-in-time optimized code, | 
 |        as well as setting a partial match option for  the  matching  function, | 
 |        you  must  also  call pcre2_jit_compile() with one or both of these op- | 
 |        tions: | 
 |  | 
 |          PCRE2_JIT_PARTIAL_HARD | 
 |          PCRE2_JIT_PARTIAL_SOFT | 
 |  | 
 |        PCRE2_JIT_COMPLETE should also be set if you are going to run  non-par- | 
 |        tial  matches  on  the same pattern. Separate code is compiled for each | 
 |        mode. If the appropriate JIT mode has not been  compiled,  interpretive | 
 |        matching code is used. | 
 |  | 
 |        Setting  a partial matching option disables two of PCRE2's standard op- | 
 |        timization hints. PCRE2 remembers the last literal code unit in a  pat- | 
 |        tern,  and  abandons  matching  immediately if it is not present in the | 
 |        subject string.  This optimization cannot be used for a subject  string | 
 |        that  might match only partially. PCRE2 also remembers a minimum length | 
 |        of a matching string, and does not bother to run the matching  function | 
 |        on  shorter  strings.  This  optimization  is also disabled for partial | 
 |        matching. | 
 |  | 
 |  | 
 | REQUIREMENTS FOR A PARTIAL MATCH | 
 |  | 
 |        A possible partial match occurs during matching when  the  end  of  the | 
 |        subject  string is reached successfully, but either more characters are | 
 |        needed to complete the match, or the addition of more characters  might | 
 |        change what is matched. | 
 |  | 
 |        Example  1: if the pattern is /abc/ and the subject is "ab", more char- | 
 |        acters are definitely needed to complete a match.  In  this  case  both | 
 |        hard and soft matching options yield a partial match. | 
 |  | 
 |        Example  2: if the pattern is /ab+/ and the subject is "ab", a complete | 
 |        match can be found, but the addition of more  characters  might  change | 
 |        what  is  matched. In this case, only PCRE2_PARTIAL_HARD returns a par- | 
 |        tial match; PCRE2_PARTIAL_SOFT returns the complete match. | 
 |  | 
 |        On reaching the end of the subject, when PCRE2_PARTIAL_HARD is set,  if | 
 |        the next pattern item is \z, \Z, \b, \B, or $ there is always a partial | 
 |        match.   Otherwise, for both options, the next pattern item must be one | 
 |        that inspects a character, and at least one of the  following  must  be | 
 |        true: | 
 |  | 
 |        (1)  At  least  one  character has already been inspected. An inspected | 
 |        character need not form part of the final  matched  string;  lookbehind | 
 |        assertions  and the \K escape sequence provide ways of inspecting char- | 
 |        acters before the start of a matched string. | 
 |  | 
 |        (2) The pattern contains one or more lookbehind assertions. This condi- | 
 |        tion exists in case there is a lookbehind that inspects characters  be- | 
 |        fore the start of the match. | 
 |  | 
 |        (3)  There  is a special case when the whole pattern can match an empty | 
 |        string.  When the starting point is at the  end  of  the  subject,  the | 
 |        empty  string  match is a possibility, and if PCRE2_PARTIAL_SOFT is set | 
 |        and neither of the above conditions is true, it is  returned.  However, | 
 |        because  adding  more  characters  might  result  in a non-empty match, | 
 |        PCRE2_PARTIAL_HARD returns a partial match, which in  this  case  means | 
 |        "there  is going to be a match at this point, but until some more char- | 
 |        acters are added, we do not know if it will be an empty string or some- | 
 |        thing longer". | 
 |  | 
 |  | 
 | PARTIAL MATCHING USING pcre2_match() | 
 |  | 
 |        When  a  partial  matching  option  is  set,  the  result  of   calling | 
 |        pcre2_match() can be one of the following: | 
 |  | 
 |        A successful match | 
 |          A complete match has been found, starting and ending within this sub- | 
 |          ject. | 
 |  | 
 |        PCRE2_ERROR_NOMATCH | 
 |          No match can start anywhere in this subject. | 
 |  | 
 |        PCRE2_ERROR_PARTIAL | 
 |          Adding  more  characters may result in a complete match that uses one | 
 |          or more characters from the end of this subject. | 
 |  | 
 |        When a partial match is returned, the first two elements in the ovector | 
 |        point to the portion of the subject that was matched, but the values in | 
 |        the rest of the ovector are undefined. The appearance of \K in the pat- | 
 |        tern has no effect for a partial match. Consider this pattern: | 
 |  | 
 |          /abc\K123/ | 
 |  | 
 |        If it is matched against "456abc123xyz" the result is a complete match, | 
 |        and the ovector defines the matched string as "123", because \K  resets | 
 |        the  "start  of  match" point. However, if a partial match is requested | 
 |        and the subject string is "456abc12", a partial match is found for  the | 
 |        string  "abc12",  because  all these characters are needed for a subse- | 
 |        quent re-match with additional characters. | 
 |  | 
 |        If there is more than one partial match, the first one that  was  found | 
 |        provides the data that is returned. Consider this pattern: | 
 |  | 
 |          /123\w+X|dogY/ | 
 |  | 
 |        If  this is matched against the subject string "abc123dog", both alter- | 
 |        natives fail to match, but the end of the  subject  is  reached  during | 
 |        matching,  so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3 | 
 |        and 9, identifying "123dog" as the first partial match. (In this  exam- | 
 |        ple,  there are two partial matches, because "dog" on its own partially | 
 |        matches the second alternative.) | 
 |  | 
 |    How a partial match is processed by pcre2_match() | 
 |  | 
 |        What happens when a partial match is identified depends on which of the | 
 |        two partial matching options is set. | 
 |  | 
 |        If PCRE2_PARTIAL_HARD is set, PCRE2_ERROR_PARTIAL is returned  as  soon | 
 |        as  a partial match is found, without continuing to search for possible | 
 |        complete matches. This option is "hard" because it prefers  an  earlier | 
 |        partial match over a later complete match. For this reason, the assump- | 
 |        tion  is  made  that  the end of the supplied subject string is not the | 
 |        true end of the available data, which is why \z, \Z, \b, \B, and $  al- | 
 |        ways give a partial match. | 
 |  | 
 |        If  PCRE2_PARTIAL_SOFT  is  set,  the  partial match is remembered, but | 
 |        matching continues as normal, and other alternatives in the pattern are | 
 |        tried. If no complete match can be found,  PCRE2_ERROR_PARTIAL  is  re- | 
 |        turned instead of PCRE2_ERROR_NOMATCH. This option is "soft" because it | 
 |        prefers a complete match over a partial match. All the various matching | 
 |        items  in a pattern behave as if the subject string is potentially com- | 
 |        plete; \z, \Z, and $ match at the end of the subject,  as  normal,  and | 
 |        for \b and \B the end of the subject is treated as a non-alphanumeric. | 
 |  | 
 |        The  difference  between the two partial matching options can be illus- | 
 |        trated by a pattern such as: | 
 |  | 
 |          /dog(sbody)?/ | 
 |  | 
 |        This matches either "dog" or "dogsbody", greedily (that is, it  prefers | 
 |        the  longer  string  if  possible). If it is matched against the string | 
 |        "dog" with PCRE2_PARTIAL_SOFT, it yields a complete  match  for  "dog". | 
 |        However,  if  PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR- | 
 |        TIAL. On the other hand, if the pattern is made ungreedy the result  is | 
 |        different: | 
 |  | 
 |          /dog(sbody)??/ | 
 |  | 
 |        In  this  case  the  result  is always a complete match because that is | 
 |        found first, and matching never  continues  after  finding  a  complete | 
 |        match. It might be easier to follow this explanation by thinking of the | 
 |        two patterns like this: | 
 |  | 
 |          /dog(sbody)?/    is the same as  /dogsbody|dog/ | 
 |          /dog(sbody)??/   is the same as  /dog|dogsbody/ | 
 |  | 
 |        The  second pattern will never match "dogsbody", because it will always | 
 |        find the shorter match first. | 
 |  | 
 |    Example of partial matching using pcre2test | 
 |  | 
 |        The pcre2test data modifiers partial_hard (or ph) and partial_soft  (or | 
 |        ps)  set  PCRE2_PARTIAL_HARD and PCRE2_PARTIAL_SOFT, respectively, when | 
 |        calling pcre2_match(). Here is a run of pcre2test using a pattern  that | 
 |        matches the whole subject in the form of a date: | 
 |  | 
 |            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ | 
 |          data> 25dec3\=ph | 
 |          Partial match: 25dec3 | 
 |          data> 3ju\=ph | 
 |          Partial match: 3ju | 
 |          data> 3juj\=ph | 
 |          No match | 
 |  | 
 |        This  example  gives  the  same  results for both hard and soft partial | 
 |        matching options. Here is an example where there is a difference: | 
 |  | 
 |            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ | 
 |          data> 25jun04\=ps | 
 |           0: 25jun04 | 
 |           1: jun | 
 |          data> 25jun04\=ph | 
 |          Partial match: 25jun04 | 
 |  | 
 |        With  PCRE2_PARTIAL_SOFT,  the  subject  is  matched  completely.   For | 
 |        PCRE2_PARTIAL_HARD, however, the subject is assumed not to be complete, | 
 |        so there is only a partial match. | 
 |  | 
 |  | 
 | MULTI-SEGMENT MATCHING WITH pcre2_match() | 
 |  | 
 |        PCRE  was  not originally designed with multi-segment matching in mind. | 
 |        However, over time, features (including  partial  matching)  that  make | 
 |        multi-segment matching possible have been added. A very long string can | 
 |        be  searched  segment  by  segment by calling pcre2_match() repeatedly, | 
 |        with the aim of achieving the same results that would happen if the en- | 
 |        tire string was available for searching all  the  time.  Normally,  the | 
 |        strings  that  are  being  sought are much shorter than each individual | 
 |        segment, and are in the middle of very long strings, so the pattern  is | 
 |        normally not anchored. | 
 |  | 
 |        Special  logic  must  be implemented to handle a matched substring that | 
 |        spans a segment boundary. PCRE2_PARTIAL_HARD should be used, because it | 
 |        returns a partial match at the end of a segment whenever there  is  the | 
 |        possibility  of  changing  the  match  by  adding  more characters. The | 
 |        PCRE2_NOTBOL option should also be set for all but the first segment. | 
 |  | 
 |        When a partial match occurs, the next segment must be added to the cur- | 
 |        rent subject and the match re-run, using the  startoffset  argument  of | 
 |        pcre2_match()  to  begin  at the point where the partial match started. | 
 |        For example: | 
 |  | 
 |            re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ | 
 |          data> ...the date is 23ja\=ph | 
 |          Partial match: 23ja | 
 |          data> ...the date is 23jan19 and on that day...\=offset=15 | 
 |           0: 23jan19 | 
 |           1: jan | 
 |  | 
 |        Note the use of the offset modifier to start the new  match  where  the | 
 |        partial match was found. In this example, the next segment was added to | 
 |        the  one  in  which  the  partial  match  was  found.  This is the most | 
 |        straightforward approach, typically using a memory buffer that is twice | 
 |        the size of each segment. After a partial match, the first half of  the | 
 |        buffer  is  discarded,  the  second  half  is moved to the start of the | 
 |        buffer, and a new segment is added before repeating the match as in the | 
 |        example above. After a no match, the entire buffer can be discarded. | 
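 |  | 
 |        The buffering scheme just described can be sketched roughly as follows. | 
 |        This is an illustrative Python model, not PCRE2 code: partial_find() is | 
 |        a hypothetical stand-in that looks for a literal string and reports a | 
 |        partial match at the end of the buffer, much as pcre2_match() does with | 
 |        PCRE2_PARTIAL_HARD. The names search_segments, base, and start are also | 
 |        assumptions made for the example. | 
 |  | 
```python
def partial_find(subject, needle, start=0):
    """Hypothetical stand-in for a partial-hard match of a literal string.

    Returns ("match", s, e), ("partial", s, e) or None.
    """
    for i in range(start, len(subject)):
        tail = subject[i:i + len(needle)]
        if tail == needle:
            return ("match", i, i + len(needle))
        # The subject ended before the needle did: possible partial match.
        if i + len(needle) > len(subject) and needle.startswith(tail):
            return ("partial", i, len(subject))
    return None


def search_segments(segments, needle):
    """Search a stream segment by segment; return absolute (start, end) pairs."""
    matches = []
    buf = ""    # normally holds at most the previous and the current segment
    base = 0    # absolute offset of buf[0] within the whole stream
    start = 0   # offset in buf at which the next attempt begins
    for seg in segments:
        buf += seg
        while True:
            result = partial_find(buf, needle, start)
            if result is None:
                # No match: the entire buffer can be discarded.
                base += len(buf)
                buf, start = "", 0
                break
            kind, s, e = result
            if kind == "match":
                matches.append((base + s, base + e))
                start = e           # carry on after this match
                continue
            # Partial match: drop everything before the latest segment (but
            # never the partial match itself), then wait for more data.
            keep_from = min(s, len(buf) - len(seg))
            base += keep_from
            buf = buf[keep_from:]
            start = s - keep_from   # re-run from where the partial match began
            break
    return matches
```
 |  | 
 |        For the two segments "...the date is 23ja" and "n19 and on that day...", | 
 |        searching for "23jan19" reports a single match at offsets 15 to 22. | 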
 |  | 
 |        If there are memory constraints, you may want to discard text that pre- | 
 |        cedes a partial match before adding the  next  segment.  Unfortunately, | 
 |        this  is  not  at  present straightforward. In cases such as the above, | 
 |        where the pattern does not contain any lookbehinds, it is sufficient to | 
 |        retain only the partially matched substring. However,  if  the  pattern | 
 |        contains  a  lookbehind assertion, characters that precede the start of | 
 |        the partial match may have been inspected during the matching  process. | 
 |        When  pcre2test displays a partial match, it indicates these characters | 
 |        with '<' if the allusedtext modifier is set: | 
 |  | 
 |            re> "(?<=123)abc" | 
 |          data> xx123ab\=ph,allusedtext | 
 |          Partial match: 123ab | 
 |                         <<< | 
 |  | 
 |        However, the allusedtext modifier is not available  for  JIT  matching, | 
 |        because  JIT  matching  does  not  record the first (or last) consulted | 
 |        characters.  For this reason, this information is not available via the | 
 |        API. It is therefore not possible in general to obtain the exact number | 
 |        of characters that must be retained in order to get the right match re- | 
 |        sult. If you cannot retain the  entire  segment,  you  must  find  some | 
 |        heuristic way of choosing how much to keep. | 
 |  | 
 |        If  you know the approximate length of the matching substrings, you can | 
 |        use that to decide how much text to retain. The only lookbehind  infor- | 
 |        mation  that  is  currently  available via the API is the length of the | 
 |        longest individual lookbehind in a pattern, but this can be  misleading | 
 |        if  there  are  nested  lookbehinds.  The  value  returned  by  calling | 
 |        pcre2_pattern_info() with the PCRE2_INFO_MAXLOOKBEHIND  option  is  the | 
 |        maximum number of characters (not code units) that any individual look- | 
 |        behind   moves   back   when   it  is  processed.  A  pattern  such  as | 
 |        "(?<=(?<!b)a)" has a maximum lookbehind value of one, but inspects  two | 
 |        characters before its starting point. | 
 |  | 
 |        In  a  non-UTF or a 32-bit case, moving back is just a subtraction, but | 
 |        in UTF-8 or UTF-16 you have  to  count  characters  while  moving  back | 
 |        through the code units. | 
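 |  | 
 |        As a minimal sketch of the UTF-8 case, the Python function below steps | 
 |        back a given number of characters from a code unit offset by skipping | 
 |        continuation bytes (those of the form 10xxxxxx). The name | 
 |        step_back_utf8 is an assumption made for the example, and the subject | 
 |        is assumed to be valid UTF-8. | 
 |  | 
```python
def step_back_utf8(subject: bytes, offset: int, nchars: int) -> int:
    """Return the code unit offset that lies nchars characters before offset.

    Stops at the start of the subject if there are fewer characters available.
    """
    while nchars > 0 and offset > 0:
        offset -= 1
        # Skip continuation bytes until the first byte of the character.
        while offset > 0 and (subject[offset] & 0xC0) == 0x80:
            offset -= 1
        nchars -= 1
    return offset
```
 |  | 
 |        The character count passed in might be the PCRE2_INFO_MAXLOOKBEHIND | 
 |        value, bearing in mind the caveat about nested lookbehinds. | 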
 |  | 
 |  | 
 | PARTIAL MATCHING USING pcre2_dfa_match() | 
 |  | 
 |        The DFA function moves along the subject string character by character, | 
 |        without  backtracking,  searching  for  all possible matches simultane- | 
 |        ously. If the end of the subject is reached before the end of the  pat- | 
 |        tern, there is the possibility of a partial match. | 
 |  | 
 |        When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if | 
 |        there  have  been  no complete matches. Otherwise, the complete matches | 
 |        are returned.  If PCRE2_PARTIAL_HARD is  set,  a  partial  match  takes | 
 |        precedence  over  any  complete matches. The portion of the string that | 
 |        was matched when the longest partial match was  found  is  set  as  the | 
 |        first matching string. | 
 |  | 
 |        Because  the DFA function always searches for all possible matches, and | 
 |        there is no difference between greedy and ungreedy repetition, its  be- | 
 |        haviour  differs  from  that of pcre2_match(). Consider the string "dog" | 
 |        matched against this ungreedy pattern: | 
 |  | 
 |          /dog(sbody)??/ | 
 |  | 
 |        Whereas the standard function stops as soon as it  finds  the  complete | 
 |        match  for  "dog",  the  DFA  function also finds the partial match for | 
 |        "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set. | 
 |  | 
 |  | 
 | MULTI-SEGMENT MATCHING WITH pcre2_dfa_match() | 
 |  | 
 |        When a partial match has been found using the DFA matching function, it | 
 |        is possible to continue the match by providing additional subject  data | 
 |        and  calling  the function again with the same compiled regular expres- | 
 |        sion, this time setting the PCRE2_DFA_RESTART option. You must pass the | 
 |        same working space as before, because this is where details of the pre- | 
 |        vious partial match are stored. You can set the  PCRE2_PARTIAL_SOFT  or | 
 |        PCRE2_PARTIAL_HARD  options  with PCRE2_DFA_RESTART to continue partial | 
 |        matching over multiple segments. Here is an example using pcre2test: | 
 |  | 
 |            re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ | 
 |          data> 23ja\=dfa,ps | 
 |          Partial match: 23ja | 
 |          data> n05\=dfa,dfa_restart | 
 |           0: n05 | 
 |  | 
 |        The first call has "23ja" as the subject, and requests  partial  match- | 
 |        ing;  the  second  call  has  "n05"  as  the  subject for the continued | 
 |        (restarted) match.  Notice that when the match is  complete,  only  the | 
 |        last  part  is  shown;  PCRE2 does not retain the previously partially- | 
 |        matched string. It is up to the calling program to do that if it  needs | 
 |        to.  This  means  that, for an unanchored pattern, if a continued match | 
 |        fails, it is not possible to try again at a  new  starting  point.  All | 
 |        this facility is capable of doing is continuing with the previous match | 
 |        attempt. For example, consider this pattern: | 
 |  | 
 |          1234|3789 | 
 |  | 
 |        If  the  first  part of the subject is "ABC123", a partial match of the | 
 |        first alternative is found at offset 3. There is no partial  match  for | 
 |        the second alternative, because such a match does not start at the same | 
 |        point  in  the  subject  string. Attempting to continue with the string | 
 |        "7890" does not yield a match  because  only  those  alternatives  that | 
 |        match  at one point in the subject are remembered. Depending on the ap- | 
 |        plication, this may or may not be what you want. | 
 |  | 
 |        If you do want to allow for starting again at the next  character,  one | 
 |        way  of  doing it is to retain some or all of the segment and try a new | 
 |        complete match, as described for pcre2_match() above. Another possibil- | 
 |        ity is to work with two buffers. If a partial match at offset n in  the | 
 |        first  buffer  is followed by "no match" when PCRE2_DFA_RESTART is used | 
 |        on the second buffer, you can then try a new match starting  at  offset | 
 |        n+1 in the first buffer. | 
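 |  | 
 |        The two-buffer idea can be illustrated with a toy Python model. It is | 
 |        not PCRE2 code and handles only patterns that are a list of literal | 
 |        alternatives; first_attempt(), restart(), and | 
 |        match_across_two_buffers() are hypothetical names. The saved state | 
 |        plays the role of the working space: it remembers only the | 
 |        alternatives that were still alive at one starting point. | 
 |  | 
```python
def first_attempt(alternatives, segment, start=0):
    """Scan from start; return (kind, offset, state) or None."""
    for offset in range(start, len(segment)):
        rest = segment[offset:]
        if any(rest.startswith(alt) for alt in alternatives):
            return ("match", offset, None)
        # Alternatives that could still complete if more data arrived.
        live = [(alt, len(rest)) for alt in alternatives if alt.startswith(rest)]
        if live:
            return ("partial", offset, live)
    return None


def restart(state, segment):
    """Continue only the remembered alternatives, in the manner of a restart."""
    live = []
    for alt, done in state:
        needed = alt[done:]
        if segment.startswith(needed):
            return ("match", None)
        if needed.startswith(segment):
            live.append((alt, done + len(segment)))
    return ("partial", live) if live else None


def match_across_two_buffers(alternatives, first, second):
    """Return the offset in first where a match (or a still-partial match)
    that continues into second begins, or None if there is none."""
    start = 0
    while start < len(first):
        result = first_attempt(alternatives, first, start)
        if result is None:
            return None
        kind, offset, state = result
        if kind == "match" or restart(state, second) is not None:
            return offset
        # The continued match failed: try again at the next character.
        start = offset + 1
    return None
```
 |  | 
 |        With the alternatives "1234" and "3789", the first buffer "ABC123" | 
 |        gives a partial match at offset 3 that cannot be continued with | 
 |        "7890"; retrying from offset 4 finds the partial match at offset 5, | 
 |        which the second buffer completes. | 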
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 27 November 2024 | 
 |        Copyright (c) 1997-2019 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                27 November 2024                PCRE2PARTIAL(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2PATTERN(3)            Library Functions Manual            PCRE2PATTERN(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 REGULAR EXPRESSION DETAILS | 
 |  | 
 |        The  syntax and semantics of the regular expressions that are supported | 
 |        by PCRE2 are described in detail below. There is a quick-reference syn- | 
 |        tax summary in the pcre2syntax page. PCRE2 tries to match  Perl  syntax | 
 |        and  semantics as closely as it can.  PCRE2 also supports some alterna- | 
 |        tive regular expression syntax that does not  conflict  with  the  Perl | 
 |        syntax  in order to provide some compatibility with regular expressions | 
 |        in Python, .NET, and Oniguruma. There are in addition some options that | 
 |        enable alternative syntax and semantics that are not  the  same  as  in | 
 |        Perl. | 
 |  | 
 |        Perl's  regular expressions are described in its own documentation, and | 
 |        regular expressions in general are covered in a number of  books,  some | 
 |        of which have copious examples. Jeffrey Friedl's "Mastering Regular Ex- | 
 |        pressions",  published by O'Reilly, covers regular expressions in great | 
 |        detail. This description of PCRE2's regular expressions is intended  as | 
 |        reference material. | 
 |  | 
 |        This  document  discusses the regular expression patterns that are sup- | 
 |        ported by PCRE2 when its  main  matching  function,  pcre2_match(),  is | 
 |        used.    PCRE2    also    has   an   alternative   matching   function, | 
 |        pcre2_dfa_match(), which matches using a different  algorithm  that  is | 
 |        not  Perl-compatible.  Some  of  the  features  discussed below are not | 
 |        available when DFA matching is used. The advantages  and  disadvantages | 
 |        of  the  alternative function, and how it differs from the normal func- | 
 |        tion, are discussed in the pcre2matching page. | 
 |  | 
 |  | 
 | EBCDIC CHARACTER CODES | 
 |  | 
 |        Most computers use ASCII or Unicode for encoding characters, and  PCRE2 | 
 |        assumes this by default. However, it can be compiled to run in an envi- | 
 |        ronment that uses the EBCDIC code, which is the case for some IBM main- | 
 |        frame  operating  systems. In the sections below, character code values | 
 |        are ASCII or Unicode; in an EBCDIC  environment  these  characters  may | 
 |        have  different  code values, and there are no code points greater than | 
 |        255. Differences in behaviour when PCRE2 is running in an EBCDIC  envi- | 
 |        ronment are described in the section "EBCDIC environments" below, which | 
 |        you can ignore unless you really are in an EBCDIC environment. | 
 |  | 
 |  | 
 | SPECIAL START-OF-PATTERN ITEMS | 
 |  | 
 |        A  number  of options that can be passed to pcre2_compile() can also be | 
 |        set by special items at the start of a pattern. These are not Perl-com- | 
 |        patible, but are provided to make these options accessible  to  pattern | 
 |        writers  who are not able to change the program that processes the pat- | 
 |        tern. Any number of these items may appear, but they must  all  be  to- | 
 |        gether  right  at the start of the pattern string, and the letters must | 
 |        be in upper case. | 
 |  | 
 |    UTF support | 
 |  | 
 |        In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either | 
 |        as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 | 
 |        can be specified for the 32-bit library, in which  case  it  constrains | 
 |        the  character  values  to  valid  Unicode  code points. To process UTF | 
 |        strings, PCRE2 must be built to include Unicode support (which  is  the | 
 |        default).  When  using  UTF  strings you must either call the compiling | 
 |        function with one or both of the PCRE2_UTF  or  PCRE2_MATCH_INVALID_UTF | 
 |        options,  or  the  pattern must start with the special sequence (*UTF), | 
 |        which  is  equivalent to setting the PCRE2_UTF option.  How  setting  a | 
 |        UTF mode affects pattern matching is mentioned in several places below. | 
 |        There is also a summary of features in the pcre2unicode page. | 
 |  | 
 |        Some applications that allow their users to supply patterns may wish to | 
 |        restrict   them   to   non-UTF   data  for  security  reasons.  If  the | 
 |        PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not  al- | 
 |        lowed, and its appearance in a pattern causes an error. | 
 |  | 
 |    Unicode property support | 
 |  | 
 |        Another  special  sequence that may appear at the start of a pattern is | 
 |        (*UCP).  This has the same effect as setting the PCRE2_UCP  option:  it | 
 |        causes  sequences such as \d and \w to use Unicode properties to deter- | 
 |        mine character types, instead of recognizing only characters with codes | 
 |        less than 256 via a lookup table. It also causes upper/lower casing op- | 
 |        erations to use Unicode properties  for  characters  with  code  points | 
 |        greater  than  127,  even when UTF is not set.  These behaviours can be | 
 |        changed within the pattern; see the section entitled  "Internal  Option | 
 |        Setting" below. | 
 |  | 
 |        Some applications that allow their users to supply patterns may wish to | 
 |        restrict  them  for  security reasons. If the PCRE2_NEVER_UCP option is | 
 |        passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in | 
 |        a pattern causes an error. | 
 |  | 
 |    Locking out empty string matching | 
 |  | 
 |        Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same | 
 |        effect as passing the PCRE2_NOTEMPTY or  PCRE2_NOTEMPTY_ATSTART  option | 
 |        to whichever matching function is subsequently called to match the pat- | 
 |        tern.  These options lock out the matching of empty strings, either en- | 
 |        tirely, or only at the start of the subject. | 
 |  | 
 |    Disabling auto-possessification | 
 |  | 
 |        If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect  as | 
 |        setting  the  PCRE2_NO_AUTO_POSSESS  option, or calling pcre2_set_opti- | 
 |        mize() with a PCRE2_AUTO_POSSESS_OFF directive. This stops  PCRE2  from | 
 |        making  quantifiers  possessive  when what follows cannot match the re- | 
 |        peated item. For example, by default a+b is treated as a++b.  For  more | 
 |        details, see the pcre2api documentation. | 
 |  | 
 |    Disabling start-up optimizations | 
 |  | 
 |        If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as | 
 |        setting the PCRE2_NO_START_OPTIMIZE option, or calling  pcre2_set_opti- | 
 |        mize() with a PCRE2_START_OPTIMIZE_OFF directive. This disables several | 
 |        optimizations  for  quickly  reaching  "no match" results. For more de- | 
 |        tails, see the pcre2api documentation. | 
 |  | 
 |    Disabling automatic anchoring | 
 |  | 
 |        If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the  same  effect | 
 |        as setting the PCRE2_NO_DOTSTAR_ANCHOR option, or calling pcre2_set_op- | 
 |        timize()  with a PCRE2_DOTSTAR_ANCHOR_OFF directive.  This disables op- | 
 |        timizations that apply to patterns whose top-level branches  all  start | 
 |        with  .*  (match any number of arbitrary characters). For more details, | 
 |        see the pcre2api documentation. | 
 |  | 
 |    Disabling JIT compilation | 
 |  | 
 |        If a pattern that starts with (*NO_JIT) is  successfully  compiled,  an | 
 |        attempt  by  the  application  to apply the JIT optimization by calling | 
 |        pcre2_jit_compile() is ignored. | 
 |  | 
 |    Setting match resource limits | 
 |  | 
 |        The pcre2_match() function contains a counter that is incremented every | 
 |        time it goes round its main loop. The caller of pcre2_match() can set a | 
 |        limit on this counter, which therefore limits the amount  of  computing | 
 |        resource used for a match. The maximum depth of nested backtracking can | 
 |        also  be  limited;  this indirectly restricts the amount of heap memory | 
 |        that is used, but there is also an explicit memory limit  that  can  be | 
 |        set. | 
 |  | 
 |        These  facilities  are  provided to catch runaway matches that are pro- | 
 |        voked by patterns with huge matching trees. A common example is a  pat- | 
 |        tern  with  nested unlimited repeats applied to a long string that does | 
 |        not match. When one of these limits is reached, pcre2_match() gives  an | 
 |        error  return.  The limits can also be set by items at the start of the | 
 |        pattern of the form | 
 |  | 
 |          (*LIMIT_HEAP=d) | 
 |          (*LIMIT_MATCH=d) | 
 |          (*LIMIT_DEPTH=d) | 
 |  | 
 |        where d is any number of decimal digits. However, the value of the set- | 
 |        ting must be less than the value set (or defaulted) by  the  caller  of | 
 |        pcre2_match()  for  it  to have any effect. In other words, the pattern | 
 |        writer can lower the limits set by the programmer, but not raise  them. | 
 |        If  there  is more than one setting of one of these limits, the lowest | 
 |        value is used. The heap limit is specified in kibibytes (units of  1024 | 
 |        bytes). | 
 |  | 
 |        Prior  to  release  10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This | 
 |        name is still recognized for backwards compatibility. | 
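 |  | 
 |        The rule that a pattern can lower a limit but not raise it can be | 
 |        modelled in a few lines of Python. This is only a sketch of the rule, | 
 |        not of how PCRE2 parses patterns: effective_limits() is a hypothetical | 
 |        helper, the limits are passed as a plain dictionary, and any leading | 
 |        item of the form (*NAME) or (*NAME=d) is simply skipped over. | 
 |  | 
```python
import re

# A start-of-pattern item such as (*UTF) or (*LIMIT_MATCH=1000).
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=(\d+))?\)")


def effective_limits(pattern, caller_limits):
    """Combine the caller's limits with (*LIMIT_...=d) items at the pattern start."""
    limits = dict(caller_limits)
    pos = 0
    while (m := _ITEM.match(pattern, pos)):
        name, value = m.group(1), m.group(2)
        if name == "LIMIT_RECURSION":   # old name for LIMIT_DEPTH
            name = "LIMIT_DEPTH"
        if name in limits and value is not None:
            # A setting takes effect only if it lowers the current value.
            limits[name] = min(limits[name], int(value))
        pos = m.end()
    return limits
```
 |  | 
 |        For a caller limit of 1000, a pattern that starts with | 
 |        (*LIMIT_MATCH=500)(*LIMIT_MATCH=2000) ends up with a match limit of | 
 |        500, whereas (*LIMIT_MATCH=2000) on its own leaves it at 1000. | 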
 |  | 
 |        The heap limit applies only when the pcre2_match() or pcre2_dfa_match() | 
 |        interpreters are used for matching. It does not apply to JIT. The match | 
 |        limit is used (but in a different way) when JIT is being used, or  when | 
 |        pcre2_dfa_match() is called, to limit computing resource usage by those | 
 |        matching  functions.  The depth limit is ignored by JIT but is relevant | 
 |        for DFA matching, which uses function recursion for  recursions  within | 
 |        the  pattern  and  for lookaround assertions and atomic groups. In this | 
 |        case, the depth limit controls the depth of such recursion. | 
 |  | 
 |    Newline conventions | 
 |  | 
 |        PCRE2 supports six different conventions for indicating line breaks  in | 
 |        strings:  a  single  CR (carriage return) character, a single LF (line- | 
 |        feed) character, the two-character sequence CRLF, any of the three pre- | 
 |        ceding, any Unicode newline sequence,  or  the  NUL  character  (binary | 
 |        zero).  The  pcre2api  page  has further discussion about newlines, and | 
 |        shows how to set the newline convention when calling pcre2_compile(). | 
 |  | 
 |        It is also possible to specify a newline convention by starting a  pat- | 
 |        tern string with one of the following sequences: | 
 |  | 
 |          (*CR)        carriage return | 
 |          (*LF)        linefeed | 
 |          (*CRLF)      carriage return, followed by linefeed | 
 |          (*ANYCRLF)   any of the three above | 
 |          (*ANY)       all Unicode newline sequences | 
 |          (*NUL)       the NUL character (binary zero) | 
 |  | 
 |        These override the default and the options given to the compiling func- | 
 |        tion. For example, on a Unix system where LF is the default newline se- | 
 |        quence, the pattern | 
 |  | 
 |          (*CR)a.b | 
 |  | 
 |        changes the convention to CR. That pattern matches "a\nb" because LF is | 
 |        no longer a newline. If more than one of these settings is present, the | 
 |        last one is used. | 
 |  | 
 |        The  newline  convention affects where the circumflex and dollar asser- | 
 |        tions are true. It also affects the interpretation of the dot metachar- | 
 |        acter when PCRE2_DOTALL is not set, and the behaviour of  \N  when  not | 
 |        followed  by  an opening brace. However, it does not affect what the \R | 
 |        escape sequence matches. By default, this is any  Unicode  newline  se- | 
 |        quence,  for  Perl compatibility. However, this can be changed; see the | 
 |        next section and the description of \R in the section entitled "Newline | 
 |        sequences" below. A change of \R setting can be combined with a  change | 
 |        of newline convention. | 
 |  | 
 |    Specifying what \R matches | 
 |  | 
 |        It is possible to restrict \R to match only CR, LF, or CRLF (instead of | 
 |        the  complete  set  of  Unicode  line  endings)  by  setting the option | 
 |        PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved  by | 
 |        starting  a  pattern  with (*BSR_ANYCRLF). For completeness, (*BSR_UNI- | 
 |        CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE. | 
 |  | 
 |  | 
 | CHARACTERS AND METACHARACTERS | 
 |  | 
 |        A regular expression is a pattern that is  matched  against  a  subject | 
 |        string  from  left  to right. Most characters stand for themselves in a | 
 |        pattern, and match the corresponding characters in the  subject.  As  a | 
 |        trivial example, the pattern | 
 |  | 
 |          The quick brown fox | 
 |  | 
 |        matches a portion of a subject string that is identical to itself. When | 
 |        caseless  matching  is  specified  (the  PCRE2_CASELESS  option or (?i) | 
 |        within the pattern), letters are matched independently  of  case.  Note | 
 |        that  there  are  two  ASCII  characters, K and S, that, in addition to | 
 |        their lower case ASCII equivalents, are  case-equivalent  with  Unicode | 
 |        U+212A  (Kelvin  sign)  and  U+017F  (long  S) respectively when either | 
 |        PCRE2_UTF or PCRE2_UCP is set, unless the PCRE2_EXTRA_CASELESS_RESTRICT | 
 |        option is in force (either passed to pcre2_compile() or set by  (*CASE- | 
 |        LESS_RESTRICT)  or  (?r)  within the pattern). If the PCRE2_EXTRA_TURK- | 
 |        ISH_CASING option is in force (either passed to pcre2_compile() or  set | 
 |        by (*TURKISH_CASING) within the pattern), then the 'i' letters are | 
 |        matched according to the casing rules of the Turkish and Azeri | 
 |        languages. | 
 |  | 
 |        The power of regular expressions comes from the ability to include wild | 
 |        cards, character classes, alternatives, and repetitions in the pattern. | 
 |        These are encoded in the pattern by the use of metacharacters, which do | 
 |        not stand for themselves but instead are interpreted  in  some  special | 
 |        way. | 
 |  | 
 |        There  are  two different sets of metacharacters: those that are recog- | 
 |        nized anywhere in the pattern except within square brackets, and  those | 
 |        that  are  recognized  within square brackets. Outside square brackets, | 
 |        the metacharacters are as follows: | 
 |  | 
 |          \      general escape character with several uses | 
 |          ^      assert start of string (or line, in multiline mode) | 
 |          $      assert end of string (or line, in multiline mode) | 
 |          .      match any character except newline (by default) | 
 |          [      start character class definition | 
 |          |      start of alternative branch | 
 |          (      start group or control verb | 
 |          )      end group or control verb | 
 |          *      0 or more quantifier | 
 |          +      1 or more quantifier; also "possessive quantifier" | 
 |          ?      0 or 1 quantifier; also quantifier minimizer | 
 |          {      potential start of min/max quantifier | 
 |  | 
 |        Brace characters { and } are also used to enclose  data  for  construc- | 
 |        tions  such  as  \g{2} or \k{name}. In almost all uses of braces, space | 
 |        and/or horizontal tab characters that follow { or precede } are allowed | 
 |        and are ignored. In the case of quantifiers, they may also  appear  be- | 
 |        fore  or  after the comma. The exception to this is \u{...} which is an | 
 |        ECMAScript compatibility feature  that  is  recognized  only  when  the | 
 |        PCRE2_EXTRA_ALT_BSUX  option  is  set.  ECMAScript does not ignore such | 
 |        white space; it causes the item to be interpreted as literal. | 
 |  | 
 |        Part of a pattern that is in square brackets  is  called  a  "character | 
 |        class". In a character class the only metacharacters are: | 
 |  | 
 |          \      general escape character | 
 |          ^      negate the class, but only if the first character | 
 |          -      indicates character range | 
 |          [      POSIX character class (if followed by POSIX syntax) | 
 |          ]      terminates the character class | 
 |  | 
 |        If  a  pattern  is  compiled with the PCRE2_EXTENDED option, most white | 
 |        space in the pattern, other than in a character class, within a \Q...\E | 
 |        sequence, or between a # outside a character class and  the  next  new- | 
 |        line,  inclusive,  is ignored. An escaping backslash can be used to in- | 
 |        clude a white space or a # character as part of  the  pattern.  If  the | 
 |        PCRE2_EXTENDED_MORE  option  is  set, the same applies, but in addition | 
 |        unescaped space and horizontal tab  characters  are  ignored  inside  a | 
 |        character  class.  Note: only these two characters are ignored, not the | 
 |        full set of pattern white space characters that are ignored  outside  a | 
 |        character  class.  Option settings can be changed within a pattern; see | 
 |        the section entitled "Internal Option Setting" below. | 
 |  | 
 |        The following sections describe the use of each of the metacharacters. | 
 |  | 
 |  | 
 | BACKSLASH | 
 |  | 
 |        The backslash character has several uses. Firstly, if it is followed by | 
 |        a character that is not a digit or a letter, it takes away any  special | 
 |        meaning  that  character  may  have. This use of backslash as an escape | 
 |        character applies both inside and outside character classes. | 
 |  | 
 |        For example, if you want to match a * character, you must write  \*  in | 
 |        the  pattern. This escaping action applies whether or not the following | 
 |        character would otherwise be interpreted as a metacharacter, so  it  is | 
 |        always  safe  to  precede  a non-alphanumeric with backslash to specify | 
 |        that it stands for itself.  In particular, if you want to match a back- | 
 |        slash, you write \\. | 
 |  | 
 |        Only ASCII digits and letters have any special meaning  after  a  back- | 
 |        slash. All other characters (in particular, those whose code points are | 
 |        greater than 127) are treated as literals. | 
 |  | 
 |        If  you want to treat all characters in a sequence as literals, you can | 
 |        do so by putting them between \Q and \E. Note that this includes  white | 
 |        space  even  when  the  PCRE2_EXTENDED option is set so that most other | 
 |        white space is ignored. The behaviour is different from Perl in that  $ | 
 |        and @ are handled as literals in \Q...\E sequences in PCRE2, whereas in | 
 |        Perl,  $  and  @ cause variable interpolation. Also, Perl does "double- | 
 |        quotish backslash interpolation" on any backslashes between \Q  and  \E | 
 |        which,  its  documentation says, "may lead to confusing results". PCRE2 | 
 |        treats a backslash between \Q and \E just  like  any  other  character. | 
 |        Note the following examples: | 
 |  | 
 |          Pattern            PCRE2 matches   Perl matches | 
 |  | 
 |          \Qabc$xyz\E        abc$xyz        abc followed by the | 
 |                                              contents of $xyz | 
 |          \Qabc\$xyz\E       abc\$xyz       abc\$xyz | 
 |          \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz | 
 |          \QA\B\E            A\B            A\B | 
 |          \Q\\E              \              \\E | 
 |  | 
 |        The  \Q...\E  sequence  is recognized both inside and outside character | 
 |        classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q | 
 |        is  not followed by \E later in the pattern, the literal interpretation | 
 |        continues to the end of the pattern (that is,  \E  is  assumed  at  the | 
 |        end).  If  the  isolated \Q is inside a character class, this causes an | 
 |        error, because the character class is then not terminated by a  closing | 
 |        square bracket. | 
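 |  | 
 |        The PCRE2 column of the table, together with the rules for an isolated | 
 |        \E and an unterminated \Q, can be reproduced by a small Python model. | 
 |        It is a sketch only: quoted_text() is a hypothetical helper, and it | 
 |        handles just patterns made of \Q...\E sections, ordinary characters, | 
 |        and backslash-escaped non-alphanumerics, returning the text that such | 
 |        a pattern matches. | 
 |  | 
```python
def quoted_text(pattern):
    """Return the literal text matched by a pattern built from \\Q...\\E
    sections, plain characters, and escaped non-alphanumerics."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:
                end = len(pattern)      # \E is assumed at the end
            out.append(pattern[i + 2:end])  # everything inside is literal
            i = end + 2
        elif pattern.startswith("\\E", i):
            i += 2                      # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i + 1])  # escaped character stands for itself
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```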
 |  | 
 |        Another  difference from Perl is that any appearance of \Q or \E inside | 
 |        what might otherwise be a quantifier causes PCRE2 not to recognize  the | 
 |        sequence as a quantifier. Perl recognizes a quantifier if (redundantly) | 
 |        either  of  the  numbers  is  inside \Q...\E, but not if the separating | 
 |        comma is. When not recognized  as  a  quantifier  a  sequence  such  as | 
 |        {\Q1\E,2} is treated as the literal string "{1,2}". | 
 |  | 
 |    Non-printing characters | 
 |  | 
 |        A second use of backslash provides a way of encoding non-printing char- | 
 |        acters  in patterns in a visible manner. There is no restriction on the | 
 |        appearance of non-printing characters in a pattern, but when a  pattern | 
 |        is being prepared by text editing, it is often easier to use one of the | 
 |        following  escape  sequences  instead of the binary character it repre- | 
 |        sents. In an ASCII or Unicode environment, these escapes  are  as  fol- | 
 |        lows: | 
 |  | 
 |          \a          alarm, that is, the BEL character (hex 07) | 
 |          \cx         "control-x", where x is a non-control ASCII character | 
 |          \e          escape (hex 1B) | 
 |          \f          form feed (hex 0C) | 
 |          \n          linefeed (hex 0A) | 
 |          \r          carriage return (hex 0D) (but see below) | 
 |          \t          tab (hex 09) | 
 |          \0dd        character with octal code 0dd | 
 |          \ddd        character with octal code ddd, or back reference | 
 |          \o{ddd..}   character with octal code ddd.. | 
 |          \xhh        character with hex code hh | 
 |          \x{hhh..}   character with hex code hhh.. | 
 |          \N{U+hhh..} character with Unicode hex code point hhh.. | 
 |  | 
 |        A description of how back references work is given later, following the | 
 |        discussion of parenthesized groups. | 
 |  | 
 |        By  default, after \x that is not followed by {, one or two hexadecimal | 
 |        digits are read (letters can be in upper or lower case). If the charac- | 
 |        ter that follows \x is neither { nor a hexadecimal digit, an error  oc- | 
 |        curs.  This is different from Perl's default behaviour, which generates | 
 |        a NUL character, but is in line with the behaviour of  Perl's  'strict' | 
 |        mode in re. | 
 |  | 
 |        Any  number  of  hexadecimal  digits may appear between \x{ and }. If a | 
 |        character other than a hexadecimal digit appears between \x{ and },  or | 
 |        if there is no terminating }, an error occurs. | 
 |  | 
 |        Characters whose code points are less than 256 can be defined by either | 
 |        of the two syntaxes for \x or by an octal sequence. There is no differ- | 
 |        ence in the way they are handled. For example, \xdc is exactly the same | 
 |        as  \x{dc}  or \334.  However, using the braced versions does make such | 
 |        sequences easier to read. | 
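 |  | 
 |        The \x rules above can be illustrated with a small C sketch. It is | 
 |        not PCRE2 source code: parse_bsx and hexval are hypothetical names, | 
 |        the input is assumed to point just past "\x", and rejecting an empty | 
 |        \x{} is an assumption of the sketch. Code point range limits for | 
 |        the different library widths are not checked here. | 
 |  | 
```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit (either case), or -1 if c is not one. */
int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: s points just past "\x".
   On success returns 0, stores the code point in *cp and the number of
   characters consumed in *used. Returns -1 where the text says an error
   occurs. */
int parse_bsx(const char *s, unsigned long *cp, size_t *used)
{
    unsigned long v = 0;
    size_t i = 0;
    int d;

    if (s[0] == '{') {
        /* Braced form: any number of hex digits, then a closing brace. */
        size_t ndigits = 0;
        for (i = 1; (d = hexval((unsigned char)s[i])) >= 0; i++, ndigits++) {
            if (v > (0xFFFFFFFFul >> 4)) return -1;  /* exceeds 32 bits */
            v = (v << 4) | (unsigned long)d;
        }
        /* Non-hex character, missing '}', or (assumption) no digits. */
        if (s[i] != '}' || ndigits == 0) return -1;
        i++;                                         /* consume '}' */
    } else {
        /* Unbraced form: one or two hex digits. */
        for (i = 0; i < 2 && (d = hexval((unsigned char)s[i])) >= 0; i++)
            v = (v << 4) | (unsigned long)d;
        if (i == 0) return -1;                       /* no hex digit at all */
    }
    *cp = v;
    *used = i;
    return 0;
}
```
 |  | 
 |        With this sketch, the inputs "dc" and "{dc}" both yield the value | 
 |        hex DC, mirroring the equivalence of \xdc and \x{dc}. | 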
 |  | 
 |        Support is available for some ECMAScript (aka JavaScript) escape | 
 |        sequences via two compile-time options. If PCRE2_ALT_BSUX is set, the | 
 |        sequence \x followed by { is not recognized. Only if \x is followed by | 
 |        two hexadecimal digits is it recognized as a character escape. | 
 |        Otherwise it is interpreted as a literal "x" character. In this mode, | 
 |        support for code points greater than 255 is provided by \u, which must | 
 |        be followed by four hexadecimal digits; otherwise it is interpreted as | 
 |        a literal "u" character. | 
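 |  | 
 |        The \u rule for PCRE2_ALT_BSUX can be sketched as follows. The | 
 |        function names are hypothetical and the code only models the | 
 |        "exactly four hexadecimal digits, otherwise a literal u" decision | 
 |        described above; it does not call the PCRE2 library. | 
 |  | 
```c
#include <stddef.h>

/* Value of one hexadecimal digit (either case), or -1 if c is not one. */
int hexval(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Hypothetical helper: s points just past "\u".
   Returns 4 and stores the code point in *cp when four hex digits follow.
   Otherwise returns 0 and stores 'u', meaning the "u" is a literal and
   nothing after it has been consumed. */
size_t parse_bsu_alt(const char *s, unsigned long *cp)
{
    unsigned long v = 0;

    for (size_t i = 0; i < 4; i++) {
        int d = hexval((unsigned char)s[i]);
        if (d < 0) {            /* also stops at the terminating NUL */
            *cp = 'u';
            return 0;
        }
        v = (v << 4) | (unsigned long)d;
    }
    *cp = v;
    return 4;
}
```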
 |  | 
 |        PCRE2_EXTRA_ALT_BSUX has the same effect as PCRE2_ALT_BSUX and, in  ad- | 
 |        dition, \u{hhh..} is recognized as the character specified by hexadeci- | 
 |        mal code point.  There may be any number of hexadecimal digits, but un- | 
 |        like  other places that also use curly brackets, spaces are not allowed | 
 |        and would result in the string being interpreted  as  a  literal.  This | 
 |        syntax is from ECMAScript 6. | 
 |  | 
 |        The  \N{U+hhh..} escape sequence is recognized only when PCRE2 is oper- | 
 |        ating in UTF mode. Perl also uses \N{name}  to  specify  characters  by | 
 |        Unicode  name;  PCRE2  does  not support this. Note that when \N is not | 
 |        followed by an opening brace (curly bracket) it has an entirely differ- | 
 |        ent meaning, matching any character that is not a newline. | 
 |  | 
 |        There are some legacy applications where the escape sequence \r is  ex- | 
 |        pected  to  match a newline. If the PCRE2_EXTRA_ESCAPED_CR_IS_LF option | 
 |        is set, \r in a pattern is converted to \n so  that  it  matches  a  LF | 
 |        (linefeed) instead of a CR (carriage return) character. | 
 |  | 
 |        The precise effect of \cx is as follows: if x is a lower case letter, | 
 |        it is converted to upper case. Then bit 6 of the character (hex 40) is | 
 |        inverted. Thus \cA to \cZ become hex 01 to hex 1A (A is 41, Z is 5A), | 
 |        but \c{ becomes hex 3B ({ is 7B), and \c; becomes hex 7B (; is 3B). If | 
 |        the code unit following \c has a code point less than 32 or greater | 
 |        than 126 (that is, it is not a printable ASCII character), a | 
 |        compile-time error occurs. | 
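 |  | 
 |        The \cx transformation amounts to a case fold followed by one bit | 
 |        flip. A minimal C sketch, assuming an ASCII environment and using | 
 |        the hypothetical function name control_escape: | 
 |  | 
```c
/* Returns the character encoded by \cx, or -1 where a compile-time
   error would occur (x outside the range 32 to 126). ASCII is assumed. */
int control_escape(int x)
{
    if (x < 32 || x > 126)
        return -1;
    if (x >= 'a' && x <= 'z')
        x -= 32;            /* convert lower case to upper case */
    return x ^ 0x40;        /* invert bit 6 (hex 40) */
}
```
 |  | 
 |        For example, control_escape('A') gives hex 01, control_escape('{') | 
 |        gives hex 3B, and control_escape(';') gives hex 7B. | 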
 |  | 
 |        For differences in the way some escapes behave in EBCDIC  environments, | 
 |        see section "EBCDIC environments" below. | 
 |  | 
 |    Octal escapes and back references | 
 |  | 
 |        The  escape \o must be followed by a sequence of octal digits, enclosed | 
 |        in braces. An error occurs if this is not the case.  This  escape  pro- | 
 |        vides  a  way  of  specifying  character  code  points as octal numbers | 
 |        greater than 0777, and it also allows octal numbers and  backreferences | 
 |        to be unambiguously distinguished. | 
 |  | 
 |        If  braces  are  not  used, after \0 up to two further octal digits are | 
 |        read.  However, if the PCRE2_EXTRA_NO_BS0 option is set, at  least  one | 
 |        more  octal digit must follow \0 (use \00 to generate a NUL character). | 
 |        Make sure you supply two digits after the initial zero if  the  pattern | 
 |        character that follows is itself an octal digit. | 
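 |  | 
 |        The unbraced \0 rule can be sketched in C as follows. The function | 
 |        parse_bs0 and its no_bs0 flag (standing in for PCRE2_EXTRA_NO_BS0) | 
 |        are hypothetical; the code only models the digit counting described | 
 |        above. | 
 |  | 
```c
#include <stddef.h>

/* Hypothetical helper: s points just past "\0".
   Reads up to two further octal digits. If no_bs0 is nonzero, at least
   one further octal digit is required. Returns 0 on success, storing the
   code point in *cp and the number of digits consumed in *used; returns
   -1 on error. */
int parse_bs0(const char *s, int no_bs0, unsigned *cp, size_t *used)
{
    unsigned v = 0;
    size_t i = 0;

    while (i < 2 && s[i] >= '0' && s[i] <= '7') {
        v = v * 8 + (unsigned)(s[i] - '0');
        i++;
    }
    if (no_bs0 && i == 0)
        return -1;
    *cp = v;
    *used = i;
    return 0;
}
```
 |  | 
 |        Given the text "113" after \0, the sketch consumes "11" (a tab, hex | 
 |        09) and leaves "3" as an ordinary character, which is why two digits | 
 |        should follow the zero when the next pattern character is itself an | 
 |        octal digit. | 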
 |  | 
 |        Inside  a  character  class,  when a backslash is followed by any octal | 
 |        digit, up to three octal digits are read to generate a code point.  Any | 
 |        subsequent  digits  stand  for  themselves. The sequences \8 and \9 are | 
 |        treated as the literal characters "8" and "9". | 
 |  | 
 |        Outside a character class, Perl's handling of a backslash followed by a | 
 |        digit other than 0 is complicated by ambiguity, and  Perl  has  changed | 
 |        over time, causing PCRE2 also to change. From PCRE2 release 10.45 there | 
 |        is  an  option called PCRE2_EXTRA_PYTHON_OCTAL that causes PCRE2 to use | 
 |        Python's unambiguous rules. The next two subsections describe  the  two | 
 |        sets of rules. | 
 |  | 
 |        For greater clarity and unambiguity, it is best to avoid following \ by | 
 |        a  digit  greater than zero. Instead, use \o{...} or \x{...} to specify | 
 |        numerical character code points, and \g{...} to specify backreferences. | 
 |  | 
 |    Perl rules for non-class backslash 1-9 | 
 |  | 
 |        All the digits that follow the backslash are read as a decimal  number. | 
 |        If  the  number  is  less  than 10, begins with the digit 8 or 9, or if | 
 |        there are at least that many previous capture groups in the expression, | 
 |        the entire sequence is taken as a  back  reference.  Otherwise,  up  to | 
 |        three octal digits are read to form a character code. For example: | 
 |  | 
 |          \040   is another way of writing an ASCII space | 
 |          \40    is the same, provided there are fewer than 40 | 
 |                    previous capture groups | 
 |          \7     is always a backreference | 
 |          \11    might be a backreference, or another way of | 
 |                    writing a tab | 
 |          \011   is always a tab | 
 |          \0113  is a tab followed by the character "3" | 
 |          \113   might be a backreference, otherwise the | 
 |                    character with octal code 113 | 
 |          \377   might be a backreference, otherwise | 
 |                    the value 255 (decimal) | 
 |          \81    is always a backreference | 
 |  | 
 |        Note  that octal values of 100 or greater that are specified using this | 
 |        syntax must not be introduced by a leading zero, because no  more  than | 
 |        three octal digits are ever read. | 
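 |  | 
 |        The decision procedure above can be sketched in C. This is an | 
 |        illustration, not PCRE2 source code: perl_bs_digits and bs_kind are | 
 |        hypothetical names, the caller is assumed to supply the number of | 
 |        previous capture groups, and limits on character values for the | 
 |        different library widths are not checked. | 
 |  | 
```c
#include <stddef.h>

typedef enum { BS_BACKREF, BS_OCTAL } bs_kind;

/* Hypothetical helper: s points just past the backslash and s[0] is a
   digit 1-9. groups is the number of capture groups that precede this
   point in the pattern. Stores the group number or character code in
   *value and the number of digits consumed in *used. */
bs_kind perl_bs_digits(const char *s, unsigned long groups,
                       unsigned long *value, size_t *used)
{
    unsigned long n = 0, v = 0;
    size_t nd = 0, i = 0;

    /* Read all following digits as a decimal number. Accumulation stops
       growing once the value is very large, purely to avoid overflow
       (a choice made for this sketch). */
    while (s[nd] >= '0' && s[nd] <= '9') {
        if (n <= 65535)
            n = n * 10 + (unsigned long)(s[nd] - '0');
        nd++;
    }

    if (n < 10 || s[0] == '8' || s[0] == '9' || n <= groups) {
        *value = n;
        *used = nd;
        return BS_BACKREF;
    }

    /* Otherwise read up to three octal digits as a character code. */
    while (i < 3 && s[i] >= '0' && s[i] <= '7') {
        v = v * 8 + (unsigned long)(s[i] - '0');
        i++;
    }
    *value = v;
    *used = i;
    return BS_OCTAL;
}
```
 |  | 
 |        With no previous capture groups, "40" yields the character code 32 | 
 |        (a space) and "11" yields 9 (a tab); with 11 previous groups, "11" | 
 |        is a backreference to group 11. | 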
 |  | 
 |    Python rules for non-class backslash 1-9 | 
 |  | 
 |        If  there  are at least three octal digits after the backslash, exactly | 
 |        three are read as an octal code point number, but the value must be  no | 
 |        greater  than  \377,  even  in modes where higher code point values are | 
 |        supported. Any subsequent digits stand for  themselves.  If  there  are | 
 |        fewer  than three octal digits, the sequence is taken as a decimal back | 
 |        reference. Thus, for example, \12 is always a back reference,  indepen- | 
 |        dent  of how many captures there are in the pattern. An error is gener- | 
 |        ated for a reference to a non-existent capturing group. | 
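 |  | 
 |        A C sketch of the Python-style rule follows. The names | 
 |        python_bs_digits and py_kind are hypothetical. The text above does | 
 |        not say how many decimal digits a back reference consumes; reading | 
 |        all consecutive decimal digits is an assumption of this sketch. The | 
 |        check that the referenced group exists needs the whole pattern and | 
 |        is not modelled. | 
 |  | 
```c
#include <stddef.h>

typedef enum { PY_BACKREF, PY_OCTAL, PY_ERROR } py_kind;

/* Hypothetical helper: s points just past the backslash and s[0] is a
   digit 1-9. Stores the group number or character code in *value and
   the number of digits consumed in *used (both untouched on PY_ERROR). */
py_kind python_bs_digits(const char *s, unsigned long *value, size_t *used)
{
    size_t noct = 0, nd = 0;
    unsigned long n = 0;

    /* Count leading octal digits, up to three. */
    while (noct < 3 && s[noct] >= '0' && s[noct] <= '7')
        noct++;

    if (noct == 3) {
        /* Exactly three digits form an octal code point, at most \377. */
        unsigned long v = (unsigned long)(s[0] - '0') * 64
                        + (unsigned long)(s[1] - '0') * 8
                        + (unsigned long)(s[2] - '0');
        if (v > 0377)
            return PY_ERROR;
        *value = v;
        *used = 3;
        return PY_OCTAL;
    }

    /* Fewer than three octal digits: a decimal back reference.
       Assumption: all consecutive decimal digits belong to the number.
       Accumulation is capped only to avoid overflow. */
    while (s[nd] >= '0' && s[nd] <= '9') {
        if (n <= 65535)
            n = n * 10 + (unsigned long)(s[nd] - '0');
        nd++;
    }
    *value = n;
    *used = nd;
    return PY_BACKREF;
}
```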
 |  | 
 |    Constraints on character values | 
 |  | 
 |        Characters that are specified using octal or  hexadecimal  numbers  are | 
 |        limited to certain values, as follows: | 
 |  | 
 |          8-bit non-UTF mode    no greater than 0xff | 
 |          16-bit non-UTF mode   no greater than 0xffff | 
 |          32-bit non-UTF mode   no greater than 0xffffffff | 
 |          All UTF modes         no greater than 0x10ffff and a valid code point | 
 |  | 
 |        Invalid Unicode code points are all those in the range 0xd800 to 0xdfff | 
 |        (the  so-called  "surrogate"  code  points). The check for these can be | 
 |        disabled by  the  caller  of  pcre2_compile()  by  setting  the  option | 
 |        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.  However, this is possible only in | 
 |        UTF-8 and UTF-32 modes, because these values are not  representable  in | 
 |        UTF-16. | 
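 |  | 
 |        The limits above can be expressed as a single predicate. In this C | 
 |        sketch the function name escape_value_ok is hypothetical, and the | 
 |        allow_surrogates argument stands in for the effect of | 
 |        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES; nothing here calls PCRE2. | 
 |  | 
```c
#include <stdint.h>

/* Hypothetical helper. width is the code unit width (8, 16, or 32);
   utf and allow_surrogates are treated as booleans. Returns nonzero if
   a character value given in octal or hexadecimal is acceptable. */
int escape_value_ok(uint32_t cp, int width, int utf, int allow_surrogates)
{
    if (utf) {
        if (cp > 0x10ffff)
            return 0;
        if (cp >= 0xd800 && cp <= 0xdfff)
            /* Surrogates are invalid unless the check is disabled, which
               is possible only in UTF-8 and UTF-32 modes. */
            return allow_surrogates && width != 16;
        return 1;
    }
    if (width == 8)
        return cp <= 0xff;
    if (width == 16)
        return cp <= 0xffff;
    return 1;               /* 32-bit non-UTF: any 32-bit value */
}
```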
 |  | 
 |    Escape sequences in character classes | 
 |  | 
 |        All the sequences that define a single character value can be used both | 
 |        inside  and  outside character classes. In addition, inside a character | 
 |        class, \b is interpreted as the backspace character (hex 08). | 
 |  | 
 |        When not followed by an opening brace, \N is not allowed in a character | 
 |        class.  \B, \R, and \X are not special inside a character  class.  Like | 
 |        other  unrecognized  alphabetic  escape sequences, they cause an error. | 
 |        Outside a character class, these sequences have different meanings. | 
 |  | 
 |    Unsupported escape sequences | 
 |  | 
 |        In Perl, the sequences \F, \l, \L, \u, and \U  are  recognized  by  its | 
 |        string  handler and used to modify the case of following characters. By | 
 |        default, PCRE2 does not support these  escape  sequences  in  patterns. | 
 |        However,  if  either  of the PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX op- | 
 |        tions is set, \U matches a "U" character, and \u can be used to  define | 
 |        a character by code point, as described above. | 
 |  | 
 |    Absolute and relative backreferences | 
 |  | 
 |        The sequence \g followed by a signed or unsigned number, optionally en- | 
 |        closed  in  braces,  is  an absolute or relative backreference. A named | 
 |        backreference can be coded as \g{name}.  Backreferences  are  discussed | 
 |        later, following the discussion of parenthesized groups. | 
 |  | 
 |    Absolute and relative subroutine calls | 
 |  | 
 |        For  compatibility with Oniguruma, the non-Perl syntax \g followed by a | 
 |        name or a number enclosed either in angle brackets or single quotes, is | 
 |        an alternative syntax for referencing a capture group as a  subroutine. | 
 |        Details  are  discussed  later.   Note  that  \g{...} (Perl syntax) and | 
 |        \g<...> (Oniguruma syntax) are not synonymous. The former is a backref- | 
 |        erence; the latter is a subroutine call. | 
 |  | 
 |    Generic character types | 
 |  | 
 |        Another use of backslash is for specifying generic character types: | 
 |  | 
 |          \d     any decimal digit | 
 |          \D     any character that is not a decimal digit | 
 |          \h     any horizontal white space character | 
 |          \H     any character that is not a horizontal white space character | 
 |          \N     any character that is not a newline | 
 |          \s     any white space character | 
 |          \S     any character that is not a white space character | 
 |          \v     any vertical white space character | 
 |          \V     any character that is not a vertical white space character | 
 |          \w     any "word" character | 
 |          \W     any "non-word" character | 
 |  | 
 |        The \N escape sequence has the same meaning as  the  "."  metacharacter | 
 |        when  PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change | 
 |        the meaning of \N. Note that when \N is followed by an opening brace it | 
 |        has a different meaning. See the section entitled "Non-printing charac- | 
 |        ters" above for details. Perl also uses \N{name} to specify  characters | 
 |        by Unicode name; PCRE2 does not support this. | 
 |  | 
 |        Each  pair of lower and upper case escape sequences partitions the com- | 
 |        plete set of characters into two disjoint  sets.  Any  given  character | 
 |        matches  one, and only one, of each pair. The sequences can appear both | 
 |        inside and outside character classes. They each match one character  of | 
 |        the  appropriate  type.  If the current matching point is at the end of | 
 |        the subject string, all of them fail, because there is no character  to | 
 |        match. | 
 |  | 
 |        The  default  \s  characters  are HT (9), LF (10), VT (11), FF (12), CR | 
 |        (13), and space (32), which are defined as white space in the  "C"  lo- | 
 |        cale.  This  list may vary if locale-specific matching is taking place. | 
 |        For example, in some locales the "non-breaking space" character  (\xA0) | 
 |        is recognized as white space, and in others the VT character is not. | 
 |  | 
 |        A  "word"  character is an underscore or any character that is a letter | 
 |        or digit.  By default, the definition of letters  and  digits  is  con- | 
 |        trolled by PCRE2's low-valued character tables, and may vary if locale- | 
 |        specific matching is taking place (see "Locale support" in the pcre2api | 
 |        page).  For  example,  in  a French locale such as "fr_FR" in Unix-like | 
 |        systems, or "french" in Windows, some character codes greater than  127 | 
 |        are  used  for  accented letters, and these are then matched by \w. The | 
 |        use of locales with Unicode is discouraged. | 
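 |  | 
 |        As a rough model of the default sets for \d, \s, and \w, the | 
 |        following C sketch assumes an ASCII environment, the "C" locale, | 
 |        and that PCRE2_UCP is not set. The function names are hypothetical; | 
 |        PCRE2 itself uses character tables rather than code like this. | 
 |  | 
```c
/* \s in the "C" locale: HT (9), LF (10), VT (11), FF (12), CR (13),
   and space (32). */
int default_space(unsigned c)
{
    return (c >= 9 && c <= 13) || c == 32;
}

/* \d: an ASCII decimal digit. */
int default_digit(unsigned c)
{
    return c >= '0' && c <= '9';
}

/* \w: an underscore or any character that is a letter or digit. */
int default_word(unsigned c)
{
    return c == '_' || default_digit(c) ||
           (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
```
 |  | 
 |        The upper case escapes \D, \S, and \W are simply the negations of | 
 |        these predicates. | 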
 |  | 
 |        By default, characters whose code points are  greater  than  127  never | 
 |        match \d, \s, or \w, and always match \D, \S, and \W, although this may | 
 |        be  different  for characters in the range 128-255 when locale-specific | 
 |        matching is happening.  These escape sequences  retain  their  original | 
 |        meanings  from  before  Unicode support was available, mainly for effi- | 
 |        ciency reasons. If the  PCRE2_UCP  option  is  set,  the  behaviour  is | 
 |        changed  so  that  Unicode  properties  are used to determine character | 
 |        types, as follows: | 
 |  | 
 |          \d  any character that matches \p{Nd} (decimal digit) | 
 |          \s  any character that matches \p{Z} or \h or \v | 
 |          \w  any character that matches \p{L}, \p{N}, \p{Mn}, or \p{Pc} | 
 |  | 
 |        The addition of \p{Mn} (non-spacing mark) and the replacement of an ex- | 
 |        plicit test for underscore with a test for \p{Pc}  (connector  punctua- | 
 |        tion) happened in PCRE2 release 10.43. This brings PCRE2 into line with | 
 |        Perl. | 
 |  | 
 |        The  upper case escapes match the inverse sets of characters. Note that | 
 |        \d matches only decimal digits, whereas \w matches any  Unicode  digit, | 
 |        as well as other character categories. Note also that PCRE2_UCP affects | 
 |        \b and \B because they are defined in terms of \w and \W. Matching | 
 |        these sequences is noticeably slower when PCRE2_UCP is set. | 
 |  | 
 |        The effect of PCRE2_UCP on any one of these  escape  sequences  can  be | 
 |        negated  by  the  options PCRE2_EXTRA_ASCII_BSD, PCRE2_EXTRA_ASCII_BSS, | 
 |        and PCRE2_EXTRA_ASCII_BSW, respectively. These options can be  set  and | 
 |        reset  within a pattern by means of an internal option setting (see be- | 
 |        low). | 
 |  | 
 |        The sequences \h, \H, \v, and \V, in contrast to the  other  sequences, | 
 |        which  match  only ASCII characters by default, always match a specific | 
 |        list of code points, whether or not PCRE2_UCP is  set.  The  horizontal | 
 |        space characters are: | 
 |  | 
 |          U+0009     Horizontal tab (HT) | 
 |          U+0020     Space | 
 |          U+00A0     Non-break space | 
 |          U+1680     Ogham space mark | 
 |          U+180E     Mongolian vowel separator | 
 |          U+2000     En quad | 
 |          U+2001     Em quad | 
 |          U+2002     En space | 
 |          U+2003     Em space | 
 |          U+2004     Three-per-em space | 
 |          U+2005     Four-per-em space | 
 |          U+2006     Six-per-em space | 
 |          U+2007     Figure space | 
 |          U+2008     Punctuation space | 
 |          U+2009     Thin space | 
 |          U+200A     Hair space | 
 |          U+202F     Narrow no-break space | 
 |          U+205F     Medium mathematical space | 
 |          U+3000     Ideographic space | 
 |  | 
 |        The vertical space characters are: | 
 |  | 
 |          U+000A     Linefeed (LF) | 
 |          U+000B     Vertical tab (VT) | 
 |          U+000C     Form feed (FF) | 
 |          U+000D     Carriage return (CR) | 
 |          U+0085     Next line (NEL) | 
 |          U+2028     Line separator | 
 |          U+2029     Paragraph separator | 
 |  | 
 |        In  8-bit,  non-UTF-8  mode,  only the characters with code points less | 
 |        than 256 are relevant. | 
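 |  | 
 |        The two fixed lists translate directly into membership tests. The | 
 |        function names in this C sketch are hypothetical; the code points | 
 |        are exactly those listed above. | 
 |  | 
```c
#include <stdint.h>

/* Nonzero if c is one of the horizontal space characters matched by \h. */
int is_hspace(uint32_t c)
{
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        /* U+2000 (en quad) to U+200A (hair space) are all in the list. */
        return c >= 0x2000 && c <= 0x200A;
    }
}

/* Nonzero if c is one of the vertical space characters matched by \v. */
int is_vspace(uint32_t c)
{
    return (c >= 0x000A && c <= 0x000D) || c == 0x0085 ||
           c == 0x2028 || c == 0x2029;
}
```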
 |  | 
 |    Newline sequences | 
 |  | 
 |        Outside a character class, by default, the escape sequence  \R  matches | 
 |        any  Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent | 
 |        to the following: | 
 |  | 
 |          (?>\r\n|\n|\x0b|\f|\r|\x85) | 
 |  | 
 |        This is an example of an "atomic group", details of which are given be- | 
 |        low.  This particular group matches either the  two-character  sequence | 
 |        CR  followed  by  LF,  or  one  of  the single characters LF (linefeed, | 
 |        U+000A), VT (vertical tab, U+000B), FF (form feed,  U+000C),  CR  (car- | 
 |        riage  return,  U+000D), or NEL (next line, U+0085). Because this is an | 
 |        atomic group, the two-character sequence is treated as  a  single  unit | 
 |        that cannot be split. | 
 |  | 
 |        In other modes, two additional characters whose code points are greater | 
 |        than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- | 
 |        rator,  U+2029).  Unicode support is not needed for these characters to | 
 |        be recognized. | 
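 |  | 
 |        The default behaviour of \R can be modelled by a function that | 
 |        reports how many characters it consumes at a given position. This C | 
 |        sketch uses the hypothetical name match_bsr and represents the | 
 |        subject as an array of code points; in 8-bit non-UTF mode the two | 
 |        values above 255 simply never occur. | 
 |  | 
```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns the number of characters matched by \R
   at subject[pos], or 0 if there is no match. CR followed by LF is
   always taken as one two-character unit, as in the atomic group. */
size_t match_bsr(const uint32_t *subject, size_t len, size_t pos)
{
    uint32_t c;

    if (pos >= len)
        return 0;
    c = subject[pos];
    if (c == 0x0D)
        return (pos + 1 < len && subject[pos + 1] == 0x0A) ? 2 : 1;
    if (c == 0x0A || c == 0x0B || c == 0x0C || c == 0x85 ||
        c == 0x2028 || c == 0x2029)
        return 1;
    return 0;
}
```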
 |  | 
 |        It is possible to restrict \R to match only CR, LF, or CRLF (instead of | 
 |        the complete set  of  Unicode  line  endings)  by  setting  the  option | 
 |        PCRE2_BSR_ANYCRLF  at  compile time. (BSR is an abbreviation for "back- | 
 |        slash R".) This can be made the default when PCRE2 is built; if this is | 
 |        the case, the other behaviour can be requested via  the  PCRE2_BSR_UNI- | 
 |        CODE  option. It is also possible to specify these settings by starting | 
 |        a pattern string with one of the following sequences: | 
 |  | 
 |          (*BSR_ANYCRLF)   CR, LF, or CRLF only | 
 |          (*BSR_UNICODE)   any Unicode newline sequence | 
 |  | 
 |        These override the default and the options given to the compiling func- | 
 |        tion.  Note that these special settings, which are not Perl-compatible, | 
 |        are recognized only at the very start of a pattern, and that they  must | 
 |        be  in upper case. If more than one of them is present, the last one is | 
 |        used. They can be combined with a change of newline convention; for ex- | 
 |        ample, a pattern can start with: | 
 |  | 
 |          (*ANY)(*BSR_ANYCRLF) | 
 |  | 
 |        They can also be combined with the (*UTF) or (*UCP) special  sequences. | 
 |        Inside  a  character class, \R is treated as an unrecognized escape se- | 
 |        quence, and causes an error. | 
 |  | 
 |    Unicode character properties | 
 |  | 
 |        When PCRE2 is built with Unicode support  (the  default),  three  addi- | 
 |        tional  escape sequences that match characters with specific properties | 
 |        are available. They can be used in any mode, though in 8-bit and 16-bit | 
 |        non-UTF modes these sequences are of course limited to testing  charac- | 
 |        ters  whose  code points are less than U+0100 or U+10000, respectively. | 
 |        In 32-bit non-UTF mode, code points greater than 0x10ffff (the  Unicode | 
 |        limit)  may  be  encountered. These are all treated as being in the Un- | 
 |        known script and with an unassigned type. | 
 |  | 
 |        Matching characters by Unicode property is not fast, because PCRE2  has | 
 |        to  do  a  multistage table lookup in order to find a character's prop- | 
 |        erty. That is why the traditional escape sequences such as \d and \w do | 
 |        not use Unicode properties in PCRE2 by default,  though  you  can  make | 
 |        them  do  so by setting the PCRE2_UCP option or by starting the pattern | 
 |        with (*UCP). | 
 |  | 
 |        The extra escape sequences that provide property support are: | 
 |  | 
 |          \p{xx}   a character with the xx property | 
 |          \P{xx}   a character without the xx property | 
 |          \X       a Unicode extended grapheme cluster | 
 |  | 
 |        For compatibility with Perl, negation can be specified by  including  a | 
 |        circumflex  between  the  opening  brace and the property. For example, | 
 |        \p{^Lu} is the same as \P{Lu}. | 
 |  | 
 |        In accordance with Unicode's "loose matching" rules, ASCII white  space | 
 |        characters, hyphens, and underscores are ignored in the properties rep- | 
 |        resented by xx above. As well as the space character, ASCII white space | 
 |        can be tab, linefeed, vertical tab, formfeed, or carriage return. | 
 |  | 
 |        Some  properties  are  specified as a name only; others as a name and a | 
 |        value, separated by a colon or an equals sign.  The  names  and  values | 
 |        consist  of ASCII letters and digits (with one Perl-specific exception, | 
 |        see below). They are not case sensitive. Note, however,  that  the  es- | 
 |        capes  themselves,  \p  and \P, are case sensitive. There are abbrevia- | 
 |        tions for many names. The following examples are all equivalent: | 
 |  | 
 |          \p{bidiclass=al} | 
 |          \p{BC=al} | 
 |          \p{ Bidi_Class : AL } | 
 |          \p{ Bi-di class = Al } | 
 |          \P{ ^ Bi-di class = Al } | 
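 |  | 
 |        Loose matching of the text between the braces can be sketched in C | 
 |        as a normalization step. The function names are hypothetical. The | 
 |        sketch drops ASCII white space, hyphens, and underscores, folds | 
 |        letters to lower case, and treats "=" as ":"; it does not expand | 
 |        abbreviations such as BC or handle a leading circumflex. | 
 |  | 
```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes a normalized copy of name into out, which
   has room for cap bytes including the terminating NUL. Returns 0 on
   success or -1 if out is too small. */
int normalize_property(const char *name, char *out, size_t cap)
{
    size_t n = 0;

    for (; *name != '\0'; name++) {
        unsigned char c = (unsigned char)*name;

        /* Ignore space, tab, LF, VT, FF, CR, hyphen, and underscore. */
        if (c == ' ' || (c >= 9 && c <= 13) || c == '-' || c == '_')
            continue;
        if (c >= 'A' && c <= 'Z')
            c = (unsigned char)(c + 32);    /* names are not case sensitive */
        if (c == '=')
            c = ':';                        /* both separators are allowed */
        if (n + 1 >= cap)
            return -1;
        out[n++] = (char)c;
    }
    if (cap == 0)
        return -1;
    out[n] = '\0';
    return 0;
}

/* Nonzero if two property specifications are the same after loose
   matching. Specifications longer than 63 characters once normalized are
   treated as not equal (a limit chosen for this sketch). */
int same_property(const char *a, const char *b)
{
    char na[64], nb[64];

    if (normalize_property(a, na, sizeof na) != 0 ||
        normalize_property(b, nb, sizeof nb) != 0)
        return 0;
    return strcmp(na, nb) == 0;
}
```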
 |  | 
 |        There is support for Unicode script  names,  Unicode  general  category | 
 |        properties,  "Any",  which  matches  any character (including newline), | 
 |        Bidi_Class, a number of binary (yes/no) properties,  and  some  special | 
 |        PCRE2 properties (described below).  Certain other Perl properties such | 
 |        as  "InMusicalSymbols"  are  not  supported by PCRE2. Note that \P{Any} | 
 |        does not match any characters, so always causes a match failure. | 
 |  | 
 |    Script properties for \p and \P | 
 |  | 
 |        There are three different syntax forms for matching a script. Each Uni- | 
 |        code character has a basic script and,  optionally,  a  list  of  other | 
 |        scripts ("Script Extensions") with which it is commonly used. Using the | 
 |        Adlam script as an example, \p{sc:Adlam} matches characters whose basic | 
 |        script is Adlam, whereas \p{scx:Adlam} matches, in addition, characters | 
 |        that  have  Adlam in their extensions list. The full names "script" and | 
 |        "script extensions" for the property types are recognized and,  as  for | 
 |        all  property  specifications,  an equals sign is an alternative to the | 
 |        colon. If a script name is given without a property type, for  example, | 
 |        \p{Adlam},  it is treated as \p{scx:Adlam}. Perl changed to this inter- | 
 |        pretation at release 5.26 and PCRE2 changed at release 10.40. | 
 |  | 
 |        Unassigned characters (and in non-UTF 32-bit mode, characters with code | 
 |        points greater than 0x10FFFF) are assigned the "Unknown" script. Others | 
 |        that are not part of an identified script are lumped together as  "Com- | 
 |        mon". The current list of recognized script names and their 4-character | 
 |        abbreviations can be obtained by running this command: | 
 |  | 
 |          pcre2test -LS | 
 |  | 
 |  | 
 |    The general category property for \p and \P | 
 |  | 
 |        Each character has exactly one Unicode general category property, spec- | 
 |        ified  by  a  two-letter  abbreviation. If only one letter is specified | 
 |        with \p or \P, it includes all the  general  category  properties  that | 
 |        start  with  that letter. In this case, in the absence of negation, the | 
 |        curly brackets in the escape sequence are optional; these two  examples | 
 |        have the same effect: | 
 |  | 
 |          \p{L} | 
 |          \pL | 
 |  | 
 |        The following general category property codes are supported: | 
 |  | 
 |          C     Other | 
 |          Cc    Control | 
 |          Cf    Format | 
 |          Cn    Unassigned | 
 |          Co    Private use | 
 |          Cs    Surrogate | 
 |  | 
 |          L     Letter | 
 |          Lc    Cased letter | 
 |          Ll    Lower case letter | 
 |          Lm    Modifier letter | 
 |          Lo    Other letter | 
 |          Lt    Title case letter | 
 |          Lu    Upper case letter | 
 |  | 
 |          M     Mark | 
 |          Mc    Spacing mark | 
 |          Me    Enclosing mark | 
 |          Mn    Non-spacing mark | 
 |  | 
 |          N     Number | 
 |          Nd    Decimal number | 
 |          Nl    Letter number | 
 |          No    Other number | 
 |  | 
 |          P     Punctuation | 
 |          Pc    Connector punctuation | 
 |          Pd    Dash punctuation | 
 |          Pe    Close punctuation | 
 |          Pf    Final punctuation | 
 |          Pi    Initial punctuation | 
 |          Po    Other punctuation | 
 |          Ps    Open punctuation | 
 |  | 
 |          S     Symbol | 
 |          Sc    Currency symbol | 
 |          Sk    Modifier symbol | 
 |          Sm    Mathematical symbol | 
 |          So    Other symbol | 
 |  | 
 |          Z     Separator | 
 |          Zl    Line separator | 
 |          Zp    Paragraph separator | 
 |          Zs    Space separator | 
 |  | 
 |        Perl  originally  used  the  name L& for the Lc property. This is still | 
 |        supported by Perl, but discouraged. PCRE2 also still supports it.  This | 
 |        property  matches any character that has the Lu, Ll, or Lt property, in | 
 |        other words, any letter  that  is  not  classified  as  a  modifier  or | 
 |        "other".  From release 10.45 of PCRE2 the properties Lu, Ll, and Lt are | 
 |        all treated  as  Lc  when  case-independent  matching  is  set  by  the | 
 |        PCRE2_CASELESS  option or (?i) within the pattern. The other properties | 
 |        are not affected by caseless matching. | 
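 |  | 
 |        How a requested property relates to a character's two-letter general | 
 |        category can be sketched in C. The function gc_matches is | 
 |        hypothetical, expects the codes spelled exactly as in the table | 
 |        above, and takes the character's category from the caller because | 
 |        the sketch contains no Unicode tables. The caseless argument models | 
 |        the behaviour described for release 10.45 onwards. | 
 |  | 
```c
#include <string.h>

/* Nonzero if s is one of the cased-letter categories Lu, Ll, or Lt. */
int is_cased_code(const char *s)
{
    return strcmp(s, "Lu") == 0 || strcmp(s, "Ll") == 0 ||
           strcmp(s, "Lt") == 0;
}

/* Hypothetical helper. wanted is the property named in \p{...}, for
   example "L", "Lu", "Lc", or "L&"; actual is the two-letter general
   category of the character being tested. */
int gc_matches(const char *wanted, const char *actual, int caseless)
{
    /* Lc (also written L&) matches any of Lu, Ll, or Lt. */
    if (strcmp(wanted, "Lc") == 0 || strcmp(wanted, "L&") == 0)
        return is_cased_code(actual);

    /* Under caseless matching, Lu, Ll, and Lt are all treated as Lc. */
    if (caseless && is_cased_code(wanted))
        return is_cased_code(actual);

    /* A single letter covers every category that starts with it. */
    if (strlen(wanted) == 1)
        return wanted[0] == actual[0];

    return strcmp(wanted, actual) == 0;
}
```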
 |  | 
 |        The Cs (Surrogate) property  applies  only  to  characters  whose  code | 
 |        points  are in the range U+D800 to U+DFFF. These characters are no dif- | 
 |        ferent to any other character when PCRE2 is not in UTF mode (using  the | 
 |        16-bit  or  32-bit  library).   However,  they are not valid in Unicode | 
 |        strings and so cannot be tested by PCRE2 in UTF mode, unless UTF valid- | 
 |        ity  checking  has   been   turned   off   (see   the   discussion   of | 
 |        PCRE2_NO_UTF_CHECK in the pcre2api page). | 
 |  | 
 |        The  long  synonyms  for  property  names  that  Perl supports (such as | 
 |        \p{Letter}) are not supported by PCRE2, nor is it permitted  to  prefix | 
 |        any of these properties with "Is". | 
 |  | 
 |        No character that is in the Unicode table has the Cn (unassigned) prop- | 
 |        erty.  Instead, this property is assumed for any code point that is not | 
 |        in the Unicode table. | 
 |  | 
 |    Binary (yes/no) properties for \p and \P | 
 |  | 
 |        Unicode  defines  a  number  of  binary properties, that is, properties | 
 |        whose only values are true or false. You can obtain  a  list  of  those | 
 |        that  are  recognized  by \p and \P, along with their abbreviations, by | 
 |        running this command: | 
 |  | 
 |          pcre2test -LP | 
 |  | 
 |  | 
 |    The Bidi_Class property for \p and \P | 
 |  | 
 |          \p{Bidi_Class:<class>}   matches a character with the given class | 
 |          \p{BC:<class>}           matches a character with the given class | 
 |  | 
 |        The recognized classes are: | 
 |  | 
 |          AL          Arabic letter | 
 |          AN          Arabic number | 
 |          B           paragraph separator | 
 |          BN          boundary neutral | 
 |          CS          common separator | 
 |          EN          European number | 
 |          ES          European separator | 
 |          ET          European terminator | 
 |          FSI         first strong isolate | 
 |          L           left-to-right | 
 |          LRE         left-to-right embedding | 
 |          LRI         left-to-right isolate | 
 |          LRO         left-to-right override | 
 |          NSM         non-spacing mark | 
 |          ON          other neutral | 
 |          PDF         pop directional format | 
 |          PDI         pop directional isolate | 
 |          R           right-to-left | 
 |          RLE         right-to-left embedding | 
 |          RLI         right-to-left isolate | 
 |          RLO         right-to-left override | 
 |          S           segment separator | 
 |          WS          white space | 
 |  | 
 |        As in all property specifications, an equals sign may be  used  instead | 
 |        of  a  colon  and  the class names are case-insensitive. Only the short | 
 |        names listed above are recognized; PCRE2 does not  at  present  support | 
 |        any long alternatives. | 
 |  | 
 |    Extended grapheme clusters | 
 |  | 
 |        The  \X  escape  matches  any number of Unicode characters that form an | 
 |        "extended grapheme cluster", and treats the sequence as an atomic group | 
 |        (see below).  Unicode supports various kinds of composite character  by | 
 |        giving  each  character  a grapheme breaking property, and having rules | 
 |        that use these properties to define the boundaries of extended grapheme | 
 |        clusters. The rules are defined in Unicode Standard Annex 29,  "Unicode | 
 |        Text  Segmentation".  Unicode 11.0.0 abandoned the use of some previous | 
 |        properties that had been used for emojis.  Instead it introduced  vari- | 
 |        ous  emoji-specific  properties.  PCRE2  uses  only the Extended Picto- | 
 |        graphic property. | 
 |  | 
 |        \X always matches at least one character. Then it  decides  whether  to | 
 |        add additional characters according to the following rules for ending a | 
 |        cluster: | 
 |  | 
 |        1. End at the end of the subject string. | 
 |  | 
 |        2.  Do not end between CR and LF; otherwise end after any control char- | 
 |        acter. | 
 |  | 
 |        3. Do not break Hangul (a Korean  script)  syllable  sequences.  Hangul | 
 |        characters  are of five types: L, V, T, LV, and LVT. An L character may | 
 |        be followed by an L, V, LV, or LVT character; an LV or V character  may | 
 |        be  followed  by  a V or T character; an LVT or T character may be fol- | 
 |        lowed only by a T character. | 
 |  | 
 |        4. Do not end before extending characters or spacing marks or the zero- | 
 |        width joiner (ZWJ) character. Characters with the "mark"  property  al- | 
 |        ways have the "extend" grapheme breaking property. | 
 |  | 
 |        5. Do not end after prepend characters. | 
 |  | 
 |        6.  Do not end within emoji modifier sequences or emoji ZWJ (zero-width | 
 |        joiner) sequences. An emoji ZWJ sequence consists of a  character  with | 
 |        the  Extended_Pictographic property, optionally followed by one or more | 
 |        characters with the Extend property, followed  by  the  ZWJ  character, | 
 |        followed by another Extended_Pictographic character. | 
 |  | 
 |        7.  Do not break within emoji flag sequences. That is, do not break be- | 
 |        tween regional indicator (RI) characters if there are an odd number  of | 
 |        RI characters before the break point. | 
 |  | 
 |        8. Otherwise, end the cluster. | 
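 |  | 
 |        Rules 3 and 7 can be modelled with small standalone functions. The | 
 |        following C sketch is illustrative only and is not how PCRE2 itself | 
 |        implements \X. The names hangul_type, hangul_may_follow(), is_ri(), | 
 |        and ri_break_allowed() are invented for this example, and the range | 
 |        U+1F1E6 to U+1F1FF for regional indicators is an assumption taken | 
 |        from the Unicode character database rather than from this document. | 

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: the five Hangul syllable types listed in rule 3. */
typedef enum { HG_L, HG_V, HG_T, HG_LV, HG_LVT } hangul_type;

/* Rule 3: true if "next" may follow "prev" without ending the cluster. */
static bool hangul_may_follow(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HG_L:
        return next == HG_L || next == HG_V ||
               next == HG_LV || next == HG_LVT;
    case HG_LV:
    case HG_V:
        return next == HG_V || next == HG_T;
    case HG_LVT:
    case HG_T:
        return next == HG_T;
    }
    return false;
}

/* Assumption: regional indicator symbols occupy U+1F1E6..U+1F1FF. */
static bool is_ri(uint32_t cp)
{
    return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

/* Rule 7: may a cluster end between cp[pos-1] and cp[pos]? Only the RI
   rule is modelled; every other pair is reported as breakable. */
static bool ri_break_allowed(const uint32_t *cp, size_t len, size_t pos)
{
    size_t count = 0;
    if (pos == 0 || pos >= len || !is_ri(cp[pos]))
        return true;
    /* Count the run of RI characters immediately before the break point. */
    while (count < pos && is_ri(cp[pos - 1 - count]))
        count++;
    return count % 2 == 0;
}
```

 |        For a subject of four RI characters (two flags), this model refuses | 
 |        a break at positions 1 and 3 and allows one at position 2, so each | 
 |        pair of RI characters stays together as one cluster. | 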
 |  | 
 |    PCRE2's additional properties | 
 |  | 
 |        As  well as the standard Unicode properties described above, PCRE2 sup- | 
 |        ports four more that make it possible to convert traditional escape se- | 
 |        quences such as \w and \s to use Unicode properties. PCRE2  uses  these | 
 |        non-standard,  non-Perl  properties  internally  when PCRE2_UCP is set. | 
 |        However, they may also be used explicitly. These properties are: | 
 |  | 
 |          Xan   Any alphanumeric character | 
 |          Xps   Any POSIX space character | 
 |          Xsp   Any Perl space character | 
 |          Xwd   Any Perl "word" character | 
 |  | 
 |        Xan matches characters that have either the L (letter) or the  N  (num- | 
 |        ber)  property. Xps matches the characters tab, linefeed, vertical tab, | 
 |        form feed, or carriage return, and any other character that has  the  Z | 
 |        (separator)  property  (this  includes the space character). Xsp is the | 
 |        same as Xps; in PCRE1 it used to exclude vertical tab, for Perl compat- | 
 |        ibility, but Perl changed. Xwd matches the same characters as Xan, plus | 
 |        those that match Mn (non-spacing mark) or  Pc  (connector  punctuation, | 
 |        which includes underscore). | 
 |  | 
 |        There  is another non-standard property, Xuc, which matches any charac- | 
 |        ter that can be represented by a Universal Character Name  in  C++  and | 
 |        other  programming  languages.  These are the characters $, @, ` (grave | 
 |        accent), and all characters with Unicode code points  greater  than  or | 
 |        equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that | 
 |        most base (ASCII) characters are excluded. (Universal  Character  Names | 
 |        are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit. | 
 |        Note that the Xuc property does not match these sequences but the char- | 
 |        acters that they represent.) | 
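 |  | 
 |        The description of Xuc translates directly into a membership test. | 
 |        The C function below is a sketch of that description, not PCRE2's | 
 |        own code; the name is_xuc() is hypothetical, and the upper limit of | 
 |        U+10FFFF is an assumption (the highest Unicode code point). | 

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper modelling the Xuc description above. */
static bool is_xuc(uint32_t cp)
{
    /* The three ASCII exceptions: $ (0x24), @ (0x40), and ` (0x60). */
    if (cp == 0x24 || cp == 0x40 || cp == 0x60)
        return true;
    /* Surrogates are excluded. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    /* Assumption: 0x10FFFF is the highest valid code point. */
    return cp >= 0x00A0 && cp <= 0x10FFFF;
}
```

 |        With this model, is_xuc(0x41) is false for the letter "A", whereas | 
 |        is_xuc(0xA0) and is_xuc(0x24) are both true. | 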
 |  | 
 |    Resetting the match start | 
 |  | 
 |        In normal use, the escape sequence \K  causes  any  previously  matched | 
 |        characters not to be included in the final matched sequence that is re- | 
 |        turned. For example, the pattern: | 
 |  | 
 |          foo\Kbar | 
 |  | 
 |        matches  "foobar",  but  reports that it has matched "bar". \K does not | 
 |        interact with anchoring in any way. The pattern: | 
 |  | 
 |          ^foo\Kbar | 
 |  | 
 |        matches only when the subject begins  with  "foobar"  (in  single  line | 
 |        mode),  though  it again reports the matched string as "bar". This fea- | 
 |        ture is similar to a lookbehind assertion (described  below),  but  the | 
 |        part of the pattern that precedes \K is not constrained to match a lim- | 
 |        ited  number  of characters, as is required for a lookbehind assertion. | 
 |        The use of \K does not interfere with  the  setting  of  captured  sub- | 
 |        strings.  For example, when the pattern | 
 |  | 
 |          (foo)\Kbar | 
 |  | 
 |        matches "foobar", the first substring is still set to "foo". | 
 |  | 
 |        From  version  5.32.0  Perl  forbids the use of \K in lookaround asser- | 
 |        tions. From release 10.38 PCRE2 also forbids this by default.  However, | 
 |        the  PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK  option  can be used when calling | 
 |        pcre2_compile() to re-enable the previous behaviour. When  this  option | 
 |        is set, \K is acted upon when it occurs inside positive assertions, but | 
 |        is  ignored  in  negative  assertions. Note that when a pattern such as | 
 |        (?=ab\K) matches, the reported start of the match can be  greater  than | 
 |        the  end  of the match. Using \K in a lookbehind assertion at the start | 
 |        of a pattern can also lead to odd effects. For example,  consider  this | 
 |        pattern: | 
 |  | 
 |          (?<=\Kfoo)bar | 
 |  | 
 |        If  the  subject  is  "foobar", a call to pcre2_match() with a starting | 
 |        offset of 3 succeeds and reports the matching string as "foobar",  that | 
 |        is,  the  start  of  the reported match is earlier than where the match | 
 |        started. | 
 |  | 
 |    Simple assertions | 
 |  | 
 |        The final use of backslash is for certain simple assertions. An  asser- | 
 |        tion  specifies a condition that has to be met at a particular point in | 
 |        a match, without consuming any characters from the subject string.  The | 
 |        use  of groups for more complicated assertions is described below.  The | 
 |        backslashed assertions are: | 
 |  | 
 |          \b     matches at a word boundary | 
 |          \B     matches when not at a word boundary | 
 |          \A     matches at the start of the subject | 
 |          \Z     matches at the end of the subject | 
 |                  also matches before a newline at the end of the subject | 
 |          \z     matches only at the end of the subject | 
 |          \G     matches at the first matching position in the subject | 
 |  | 
 |        Inside a character class, \b has a different meaning;  it  matches  the | 
 |        backspace  character.  If  any  other  of these assertions appears in a | 
 |        character class, an "invalid escape sequence" error is generated. | 
 |  | 
 |        A word boundary is a position in the subject string where  the  current | 
 |        character  and  the previous character do not both match \w or \W (i.e. | 
 |        one matches \w and the other matches \W), or the start or  end  of  the | 
 |        string  if  the  first or last character matches \w, respectively. When | 
 |        PCRE2 is built with Unicode support, the meanings of \w and \W  can  be | 
 |        changed by setting the PCRE2_UCP option. When this is done, it also af- | 
 |        fects  \b and \B. Neither PCRE2 nor Perl has a separate "start of word" | 
 |        or "end of word" metasequence. However, whatever  follows  \b  normally | 
 |        determines  which  it  is. For example, the fragment \ba matches "a" at | 
 |        the start of a word. | 
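 |  | 
 |        The definition of a word boundary can be written as a short test. | 
 |        This C sketch assumes that PCRE2_UCP is not set, so that \w is the | 
 |        ASCII set of letters, digits, and underscore, and that the "C" | 
 |        locale is in use. The function names are invented for the example. | 

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* ASCII-only stand-in for \w: letters, digits, and underscore. */
static bool is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if position pos (0..len) in s is a word boundary. A position
   outside the string counts as "not a word character", which covers the
   start and end of the subject. */
static bool is_word_boundary(const char *s, size_t len, size_t pos)
{
    bool prev = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    bool curr = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != curr;
}
```

 |        In the subject "ab cd", positions 0, 2, 3, and 5 are boundaries in | 
 |        this model; position 1, between "a" and "b", is not. | 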
 |  | 
 |        The \A, \Z, and \z assertions differ from  the  traditional  circumflex | 
 |        and dollar (described in the next section) in that they only ever match | 
 |        at  the  very start and end of the subject string, whatever options are | 
 |        set. Thus, they are independent of multiline mode. These  three  asser- | 
 |        tions  are  not  affected  by the PCRE2_NOTBOL or PCRE2_NOTEOL options, | 
 |        which affect only the behaviour of the circumflex and dollar  metachar- | 
 |        acters.  However,  if the startoffset argument of pcre2_match() is non- | 
 |        zero, indicating that matching is to start at a point  other  than  the | 
 |        beginning  of  the subject, \A can never match.  The difference between | 
 |        \Z and \z is that \Z matches before a newline at the end of the  string | 
 |        as well as at the very end, whereas \z matches only at the end. | 
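 |  | 
 |        The difference between \Z and \z can be expressed as two position | 
 |        tests. The sketch below assumes that the newline convention is a | 
 |        single LF; the function names are hypothetical. | 

```c
#include <stdbool.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
static bool at_small_z(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: true at the very end, or just before a newline that ends the
   subject. Assumption: a newline is a single LF. */
static bool at_capital_z(const char *s, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

 |        For the subject "abc\n" (length 4), both tests succeed at position | 
 |        4, but only at_capital_z() succeeds at position 3. | 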
 |  | 
 |        The  \G assertion is true only when the current matching position is at | 
 |        the start point of the matching process, as specified by the  startoff- | 
 |        set  argument  of  pcre2_match().  It differs from \A when the value of | 
 |        startoffset is non-zero. By calling pcre2_match() multiple  times  with | 
 |        appropriate  arguments,  you  can  mimic Perl's /g option, and it is in | 
 |        this kind of implementation that \G can be useful. | 
 |  | 
 |        Note, however, that PCRE2's implementation of \G,  being  true  at  the | 
 |        starting  character  of  the matching process, is subtly different from | 
 |        Perl's, which defines it as true at the end of the previous  match.  In | 
 |        Perl,  these  can  be  different when the previously matched string was | 
 |        empty. Because PCRE2 does just one match at a time, it cannot reproduce | 
 |        this behaviour. | 
 |  | 
 |        If all the alternatives of a pattern begin with \G, the  expression  is | 
 |        anchored to the starting match position, and the "anchored" flag is set | 
 |        in the compiled regular expression. | 
 |  | 
 |  | 
 | CIRCUMFLEX AND DOLLAR | 
 |  | 
 |        The  circumflex  and  dollar  metacharacters are zero-width assertions. | 
 |        That is, they test for a particular condition being true  without  con- | 
 |        suming any characters from the subject string. These two metacharacters | 
 |        are  concerned  with matching the starts and ends of lines. If the new- | 
 |        line convention is set so that only the two-character sequence CRLF  is | 
 |        recognized  as  a newline, isolated CR and LF characters are treated as | 
 |        ordinary data characters, and are not recognized as newlines. | 
 |  | 
 |        Outside a character class, in the default matching mode, the circumflex | 
 |        character is an assertion that is true only  if  the  current  matching | 
 |        point  is  at the start of the subject string. If the startoffset argu- | 
 |        ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is  set,  circum- | 
 |        flex  can  never match if the PCRE2_MULTILINE option is unset. Inside a | 
 |        character class, circumflex has an entirely different meaning (see  be- | 
 |        low). | 
 |  | 
 |        Circumflex  need  not be the first character of the pattern if a number | 
 |        of alternatives are involved, but it should be the first thing in  each | 
 |        alternative  in  which  it appears if the pattern is ever to match that | 
 |        branch. If all possible alternatives start with a circumflex, that  is, | 
 |        if  the  pattern  is constrained to match only at the start of the sub- | 
 |        ject, it is said to be an "anchored" pattern.  (There  are  also  other | 
 |        constructs that can cause a pattern to be anchored.) | 
 |  | 
 |        The  dollar  character is an assertion that is true only if the current | 
 |        matching point is at the end of the subject string, or immediately  be- | 
 |        fore  a newline at the end of the string (by default), unless PCRE2_NO- | 
 |        TEOL is set. Note, however, that it does not actually  match  the  new- | 
 |        line.  Dollar need not be the last character of the pattern if a number | 
 |        of alternatives are involved, but it should be the  last  item  in  any | 
 |        branch  in which it appears. Dollar has no special meaning in a charac- | 
 |        ter class. | 
 |  | 
 |        The meaning of dollar can be changed so that it  matches  only  at  the | 
 |        very  end  of the string, by setting the PCRE2_DOLLAR_ENDONLY option at | 
 |        compile time. This does not affect the \Z assertion. | 
 |  | 
 |        The meanings of the circumflex and dollar metacharacters are changed if | 
 |        the PCRE2_MULTILINE option is set. When this  is  the  case,  a  dollar | 
 |        character  matches before any newlines in the string, as well as at the | 
 |        very end, and a circumflex matches immediately after internal  newlines | 
 |        as  well as at the start of the subject string. It does not match after | 
 |        a newline that ends the string, for compatibility with  Perl.  However, | 
 |        this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option. | 
 |  | 
 |        For  example, the pattern /^abc$/ matches the subject string "def\nabc" | 
 |        (where \n represents a newline) in multiline mode, but  not  otherwise. | 
 |        Consequently,  patterns  that  are anchored in single line mode because | 
 |        all branches start with ^ are not anchored in  multiline  mode,  and  a | 
 |        match  for  circumflex  is  possible  when  the startoffset argument of | 
 |        pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option  is  ignored | 
 |        if PCRE2_MULTILINE is set. | 
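 |  | 
 |        The rules for circumflex and dollar in both modes can be summarized | 
 |        as position tests. This C sketch is a model of the text above, not | 
 |        PCRE2's code. It assumes LF newlines and that PCRE2_NOTBOL, | 
 |        PCRE2_NOTEOL, and PCRE2_ALT_CIRCUMFLEX are all unset; the function | 
 |        and parameter names are invented for the example. | 

```c
#include <stdbool.h>
#include <stddef.h>

/* Circumflex at position pos (0..len). */
static bool circumflex_matches(const char *s, size_t len, size_t pos,
                               bool multiline)
{
    if (pos == 0)
        return true;
    /* In multiline mode, also after an internal newline, but not after a
       newline that ends the subject. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Dollar at position pos (0..len). */
static bool dollar_matches(const char *s, size_t len, size_t pos,
                           bool multiline, bool dollar_endonly)
{
    if (pos == len)
        return true;
    if (multiline)              /* PCRE2_DOLLAR_ENDONLY is ignored here */
        return s[pos] == '\n';
    /* Default: also just before a newline that ends the subject. */
    return !dollar_endonly && pos + 1 == len && s[pos] == '\n';
}
```

 |        For the subject "def\nabc", circumflex_matches() succeeds at | 
 |        position 4 only when multiline is true, which is why /^abc$/ | 
 |        matches that subject in multiline mode but not otherwise. | 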
 |  | 
 |        When  the  newline  convention (see "Newline conventions" above) recog- | 
 |        nizes the two-character sequence CRLF as a newline, this is  preferred, | 
 |        even  if  the  single  characters CR and LF are also recognized as new- | 
 |        lines. For example, if the newline convention  is  "any",  a  multiline | 
 |        mode  circumflex matches before "xyz" in the string "abc\r\nxyz" rather | 
 |        than after CR, even though CR on its own is a valid newline.  (It  also | 
 |        matches at the very start of the string, of course.) | 
 |  | 
 |        Note  that  the sequences \A, \Z, and \z can be used to match the start | 
 |        and end of the subject in both modes, and if all branches of a  pattern | 
 |        start  with \A it is always anchored, whether or not PCRE2_MULTILINE is | 
 |        set. | 
 |  | 
 |  | 
 | FULL STOP (PERIOD, DOT) AND \N | 
 |  | 
 |        Outside a character class, a dot in the pattern matches any one charac- | 
 |        ter in the subject string except (by default) a character  that  signi- | 
 |        fies the end of a line. One or more characters may be specified as line | 
 |        terminators (see "Newline conventions" above). | 
 |  | 
 |        Dot  never matches a single line-ending character. When the two-charac- | 
 |        ter sequence CRLF is the only line ending, dot does not match CR if  it | 
 |        is  immediately followed by LF, but otherwise it matches all characters | 
 |        (including isolated CRs and LFs). When ANYCRLF  is  selected  for  line | 
 |        endings,  no  occurrences  of CR or LF match dot. When all Unicode line | 
 |        endings are being recognized, dot does not match CR or LF or any of the | 
 |        other line ending characters. | 
 |  | 
 |        The behaviour of dot with regard to newlines can  be  changed.  If  the | 
 |        PCRE2_DOTALL  option  is  set, a dot matches any one character, without | 
 |        exception.  If the two-character sequence CRLF is present in  the  sub- | 
 |        ject string, it takes two dots to match it. | 
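 |  | 
 |        The interaction between dot and the newline convention can be | 
 |        modelled as follows. This C sketch covers only three conventions | 
 |        (LF, CRLF, and ANYCRLF); the enumeration and function names are | 
 |        hypothetical and are not part of the PCRE2 API. | 

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names for three of the newline conventions. */
typedef enum { NL_LF, NL_CRLF, NL_ANYCRLF } newline_mode;

/* True if a dot can match the character at s[pos], where pos < len. */
static bool dot_matches(const char *s, size_t len, size_t pos,
                        newline_mode mode, bool dotall)
{
    char c = s[pos];
    if (dotall)
        return true;            /* PCRE2_DOTALL: no exceptions */
    switch (mode) {
    case NL_LF:
        return c != '\n';
    case NL_CRLF:
        /* Only a CR that is immediately followed by LF is refused. */
        return !(c == '\r' && pos + 1 < len && s[pos + 1] == '\n');
    case NL_ANYCRLF:
        return c != '\r' && c != '\n';
    }
    return false;
}
```

 |        In the subject "a\r\nb\rc", the CR at offset 1 is refused under | 
 |        the CRLF convention, but the isolated CR at offset 4 is accepted. | 
 |        Under ANYCRLF neither CR is accepted unless dotall is true. | 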
 |  | 
 |        The  handling of dot is entirely independent of the handling of circum- | 
 |        flex and dollar, the only relationship being  that  they  both  involve | 
 |        newlines. Dot has no special meaning in a character class. | 
 |  | 
 |        The  escape  sequence  \N when not followed by an opening brace behaves | 
 |        like a dot, except that it is not affected by the PCRE2_DOTALL  option. | 
 |        In  other words, it matches any character except one that signifies the | 
 |        end of a line. | 
 |  | 
 |        When \N is followed by an opening brace it has a different meaning. See | 
 |        the section entitled "Non-printing characters" above for details.  Perl | 
 |        also  uses  \N{name}  to specify characters by Unicode name; PCRE2 does | 
 |        not support this. | 
 |  | 
 |  | 
 | MATCHING A SINGLE CODE UNIT | 
 |  | 
 |        Outside a character class, the escape sequence \C matches any one  code | 
 |        unit,  whether or not a UTF mode is set. In the 8-bit library, one code | 
 |        unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the | 
 |        32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches | 
 |        line-ending characters. The feature is provided in  Perl  in  order  to | 
 |        match individual bytes in UTF-8 mode, but it is unclear how it can use- | 
 |        fully be used. | 
 |  | 
 |        Because  \C  breaks  up characters into individual code units, matching | 
 |        one unit with \C in UTF-8 or UTF-16 mode means that  the  rest  of  the | 
 |        string may start with a malformed UTF character. This has undefined re- | 
 |        sults, because PCRE2 assumes that it is matching character by character | 
 |        in a valid UTF string (by default it checks the subject string's valid- | 
 |        ity  at  the  start  of  processing  unless  the  PCRE2_NO_UTF_CHECK or | 
 |        PCRE2_MATCH_INVALID_UTF option is used). | 
 |  | 
 |        An  application  can  lock  out  the  use  of   \C   by   setting   the | 
 |        PCRE2_NEVER_BACKSLASH_C  option  when  compiling  a pattern. It is also | 
 |        possible to build PCRE2 with the use of \C permanently disabled. | 
 |  | 
 |        PCRE2 does not allow \C to appear in lookbehind  assertions  (described | 
 |        below)  in UTF-8 or UTF-16 modes, because this would make it impossible | 
 |        to calculate the length of  the  lookbehind.  Neither  the  alternative | 
 |        matching function pcre2_dfa_match() nor the JIT optimizer support \C in | 
 |        these UTF modes.  The former gives a match-time error; the latter fails | 
 |        to optimize and so the match is always run using the interpreter. | 
 |  | 
 |        In  the  32-bit  library, however, \C is always supported (when not ex- | 
 |        plicitly locked out) because it always  matches  a  single  code  unit, | 
 |        whether or not UTF-32 is specified. | 
 |  | 
 |        In general, the \C escape sequence is best avoided. However, one way of | 
 |        using  it  that avoids the problem of malformed UTF-8 or UTF-16 charac- | 
 |        ters is to use a lookahead to check the length of the  next  character, | 
 |        as  in  this  pattern,  which could be used with a UTF-8 string (ignore | 
 |        white space and line breaks): | 
 |  | 
 |          (?| (?=[\x00-\x7f])(\C) | | 
 |              (?=[\x80-\x{7ff}])(\C)(\C) | | 
 |              (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) | | 
 |              (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C)) | 
 |  | 
 |        In this example, a group that starts  with  (?|  resets  the  capturing | 
 |        parentheses  numbers in each alternative (see "Duplicate Group Numbers" | 
 |        below). The assertions at the start of each branch check the next UTF-8 | 
 |        character for values whose encoding uses 1, 2, 3, or 4  bytes,  respec- | 
 |        tively.  The  character's individual bytes are then captured by the ap- | 
 |        propriate number of \C groups. | 
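 |  | 
 |        The four lookahead ranges correspond to the four UTF-8 encoded | 
 |        lengths. The C sketch below computes, for a given code point, how | 
 |        many \C groups the pattern would use and which byte values they | 
 |        would capture. The function names are invented for the example. | 
 |        Like the pattern, the sketch accepts values up to 0x1fffff and does | 
 |        not reject surrogates. | 

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bytes (\C groups) selected by the four lookaheads above;
   0 if cp lies outside every range. */
static size_t utf8_units(uint32_t cp)
{
    if (cp <= 0x7f)     return 1;
    if (cp <= 0x7ff)    return 2;
    if (cp <= 0xffff)   return 3;
    if (cp <= 0x1fffff) return 4;
    return 0;
}

/* Encode cp into out (room for at least 4 bytes) and return the byte
   count, so out[0..n-1] are the values the \C groups would capture. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    size_t n = utf8_units(cp);
    switch (n) {
    case 1:
        out[0] = (unsigned char)cp;
        break;
    case 2:
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        break;
    case 3:
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        break;
    case 4:
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        break;
    default:
        break;
    }
    return n;
}
```

 |        For U+20AC the third branch applies, and the three captured bytes | 
 |        are 0xe2, 0x82, and 0xac. | 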
 |  | 
 |  | 
 | SQUARE BRACKETS AND CHARACTER CLASSES | 
 |  | 
 |        An opening square bracket introduces a character class, terminated by a | 
 |        closing square bracket. A closing square bracket on its own is not spe- | 
 |        cial by default.  If a closing square bracket is required as  a  member | 
 |        of the class, it should be the first data character in the class (after | 
 |        an  initial  circumflex,  if present) or escaped with a backslash. This | 
 |        means that, by default, an empty class cannot be defined.  However,  if | 
 |        the  PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at | 
 |        the start does end the (empty) class. | 
 |  | 
 |        A character class matches a single character in the subject. A  matched | 
 |        character must be in the set of characters defined by the class, unless | 
 |        the  first  character in the class definition is a circumflex, in which | 
 |        case the subject character must not be in the set defined by the class. | 
 |        If a circumflex is actually required as a member of the  class,  ensure | 
 |        it is not the first character, or escape it with a backslash. | 
 |  | 
 |        For example, the character class [aeiou] matches any lower case English | 
 |        vowel,  whereas [^aeiou] matches all other characters. Note that a cir- | 
 |        cumflex is just a convenient notation  for  specifying  the  characters | 
 |        that  are  in the class by enumerating those that are not. A class that | 
 |        starts with a circumflex is not an assertion; it still consumes a char- | 
 |        acter from the subject string, and therefore it fails to match  if  the | 
 |        current pointer is at the end of the string. | 
 |  | 
 |        Characters  in  a class may be specified by their code points using \o, | 
 |        \x, or \N{U+hh..} in the usual way. When caseless matching is set,  any | 
 |        letters  in a class represent both their upper case and lower case ver- | 
 |        sions, so for example, a caseless [aeiou] matches "A" as well  as  "a", | 
 |        and  a  caseless [^aeiou] does not match "A", whereas a caseful version | 
 |        would. Note that there are two ASCII characters, K and S, that, in  ad- | 
 |        dition  to their lower case ASCII equivalents, are case-equivalent with | 
 |        Unicode U+212A (Kelvin sign) and U+017F (long S) respectively when  ei- | 
 |        ther PCRE2_UTF or PCRE2_UCP is set. If you do not want these ASCII/non- | 
 |        ASCII  case  equivalences,  you  can suppress them by setting PCRE2_EX- | 
 |        TRA_CASELESS_RESTRICT, either as an option in a compile context, or  by | 
 |        including (*CASELESS_RESTRICT) or (?r) within a pattern. | 
 |  | 
 |        Characters  that  might  indicate  line breaks are never treated in any | 
 |        special way when matching character classes, whatever  line-ending  se- | 
 |        quence  is  in  use,  and  whatever  setting  of  the  PCRE2_DOTALL and | 
 |        PCRE2_MULTILINE options is used. A class such as  [^a]  always  matches | 
 |        one of these characters. | 
 |  | 
 |        The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s, | 
 |        \S,  \v,  \V,  \w,  and \W may appear in a character class, and add the | 
 |        characters that they  match  to  the  class.  For  example,  [\dABCDEF] | 
 |        matches  any  hexadecimal digit. In UTF modes, the PCRE2_UCP option af- | 
 |        fects the meanings of \d, \s, \w and their upper case partners, just as | 
 |        it does when they appear outside a character class, as described in the | 
 |        section entitled "Generic character types" above. The  escape  sequence | 
 |        \b  has  a  different  meaning inside a character class; it matches the | 
 |        backspace character. The sequences \B, \R, and \X are not  special  in- | 
 |        side  a  character class. Like any other unrecognized escape sequences, | 
 |        they cause an error. The same is true for \N when not  followed  by  an | 
 |        opening brace. | 
 |  | 
 |        The  minus (hyphen) character can be used to specify a range of charac- | 
 |        ters in a character class. For example, [d-m] matches  any  letter  be- | 
 |        tween  d and m, inclusive. If a minus character is required in a class, | 
 |        it must be escaped with a backslash or appear in a  position  where  it | 
 |        cannot  be interpreted as indicating a range, typically as the first or | 
 |        last character in the class, or immediately after a range. For example, | 
 |        [b-d-z] matches letters in the range b to d, a hyphen character, or z. | 
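 |  | 
 |        The treatment of hyphens can be illustrated with a very small class | 
 |        matcher. This C sketch is not PCRE2's parser. It handles only | 
 |        literal characters and ranges, and assumes that there are no | 
 |        escapes, no negation, no POSIX classes, and that every range is in | 
 |        ascending order. The function name is hypothetical. | 

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Test c against the body of a simple class, that is, the text between
   the square brackets. */
static bool simple_class_matches(const char *body, char c)
{
    size_t n = strlen(body);
    size_t i = 0;
    while (i < n) {
        if (i + 2 < n && body[i + 1] == '-') {
            /* "x-y" with a character after the hyphen: a range. */
            if (c >= body[i] && c <= body[i + 2])
                return true;
            i += 3;
        } else {
            /* Anything else is a literal, including a hyphen that is
               first, last, or directly after a range. */
            if (c == body[i])
                return true;
            i++;
        }
    }
    return false;
}
```

 |        With the body "b-d-z" this model accepts "c", "-", and "z", but | 
 |        rejects "e" and "y", because the second hyphen follows a range and | 
 |        is therefore a literal. | 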
 |  | 
 |        There is some special treatment for alphabetic ranges in  EBCDIC  envi- | 
 |        ronments; see the section "EBCDIC environments" below. | 
 |  | 
 |        Perl treats a hyphen as a literal if it appears before or after a POSIX | 
 |        class (see below) or before or after a character type escape such as \d | 
 |        or  \H.  However, unless the hyphen is the last character in the class, | 
 |        Perl outputs a warning in its warning mode, as this is  most  likely  a | 
 |        user  error. As PCRE2 has no facility for warning, an error is given in | 
 |        these cases. | 
 |  | 
 |        It is not possible to have the literal character "]" as the end charac- | 
 |        ter of a range. A pattern such as [W-]46] is interpreted as a class  of | 
 |        two  characters ("W" and "-") followed by a literal string "46]", so it | 
 |        would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a | 
 |        backslash  it  is interpreted as the end of a range, so [W-\]46] is in- | 
 |        terpreted as a class containing a range and two other  characters.  The | 
 |        octal  or  hexadecimal  representation of "]" can also be used to end a | 
 |        range. | 
 |  | 
 |        Ranges normally include all code points between the start and end char- | 
 |        acters, inclusive. They can also be used for code points specified  nu- | 
 |        merically,  for  example [\000-\037]. Ranges can include any characters | 
 |        that are valid for the current mode. In any  UTF  mode,  the  so-called | 
 |        "surrogate"  characters (those whose code points lie between 0xd800 and | 
 |        0xdfff inclusive) may not  be  specified  explicitly  by  default  (the | 
 |        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES  option  disables this check). How- | 
 |        ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates, | 
 |        are always permitted. | 
 |  | 
 |        If a range that includes letters is used when caseless matching is set, | 
 |        it matches the letters in either case. For example, [W-c] is equivalent | 
 |        to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if | 
 |        character  tables  for  a French locale are in use, [\xc8-\xcb] matches | 
 |        accented E characters in both cases. | 
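 |  | 
 |        The stated equivalence for [W-c] can be checked mechanically for | 
 |        the ASCII range. The C sketch below assumes the "C" locale and a | 
 |        non-UTF mode; its function names are invented for the example. | 

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* Caseless test of c against the range lo..hi (ASCII, "C" locale). */
static bool caseless_in_range(int c, int lo, int hi)
{
    int l = tolower(c), u = toupper(c);
    return (c >= lo && c <= hi) || (l >= lo && l <= hi) ||
           (u >= lo && u <= hi);
}

/* Caseless test of c against an explicit list of characters. */
static bool caseless_in_set(int c, const char *set)
{
    if (c == 0)
        return false;   /* strchr() would find the terminating NUL */
    return strchr(set, c) != NULL || strchr(set, tolower(c)) != NULL ||
           strchr(set, toupper(c)) != NULL;
}

/* True if [W-c] and [][\\^_`wxyzabc], both matched caselessly, agree for
   every ASCII character. */
static bool classes_agree(void)
{
    for (int c = 0; c < 128; c++) {
        if (caseless_in_range(c, 'W', 'c') !=
            caseless_in_set(c, "][\\^_`wxyzabc"))
            return false;
    }
    return true;
}
```

 |        The range W to c covers the upper case letters W to Z, six | 
 |        punctuation characters, and the lower case letters a to c, so | 
 |        caseless matching adds "w" to "z" and "A" to "C". | 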
 |  | 
 |        A circumflex can conveniently be used with  the  upper  case  character | 
 |        types  to specify a more restricted set of characters than the matching | 
 |        lower case type.  For example, the class [^\W_] matches any  letter  or | 
 |        digit, but not underscore, whereas [\w] includes underscore. A positive | 
 |        character class should be read as "something OR something OR ..." and a | 
 |        negative class as "NOT something AND NOT something AND NOT ...". | 
 |  | 
 |        The  metacharacters  that are recognized in character classes are back- | 
 |        slash, hyphen (when it can be interpreted as specifying a range),  cir- | 
 |        cumflex  (only  at  the  start),  and  the  terminating  closing square | 
 |        bracket. An opening square bracket is also special when it can  be  in- | 
 |        terpreted  as  introducing a POSIX class (see "Posix character classes" | 
 |        below), or a special compatibility feature (see "Compatibility  feature | 
 |        for word boundaries" below). Escaping any non-alphanumeric character in | 
 |        a class turns it into a literal, whether or not it would otherwise be a | 
 |        metacharacter. | 
 |  | 
 |  | 
 | PERL EXTENDED CHARACTER CLASSES | 
 |  | 
 |        From release 10.45 PCRE2 supports Perl's  (?[...])  extended  character | 
 |        class syntax. This can be used to perform set operations such as inter- | 
 |        section on character classes. | 
 |  | 
 |        The  syntax  permitted  within  (?[...]) is quite different to ordinary | 
 |        character classes. Inside the extended class, there  is  an  expression | 
 |        syntax  consisting of "atoms", operators, and ordinary parentheses "()" | 
 |        used for grouping. Such classes  always  have  the  Perl  /xx  modifier | 
 |        (PCRE2  option  PCRE2_EXTENDED_MORE)  turned on within them. This means | 
 |        that literal space and tab characters are  ignored  everywhere  in  the | 
 |        class. | 
 |  | 
 |        The  allowed  atoms  are  individual characters specified by escape se- | 
 |        quences such as \n or  \x{123},  character  types  such  as  \d,  POSIX | 
 |        classes such as [:alpha:], and nested ordinary (non-extended) character | 
 |        classes. For example, in (?[\d & [...]]) the nested class [...] follows | 
 |        the  usual  rules  for ordinary character classes, in which parentheses | 
 |        are not metacharacters, and character literals and ranges  are  permit- | 
 |        ted. | 
 |  | 
 |        Character  literals and ranges may not appear outside a nested ordinary | 
 |        character class because they are not atoms in the extended syntax.  The | 
 |        extended  syntax does not introduce any additional escape sequences, so | 
 |        (?[\y]) is an unknown escape, as it would be in [\y]. | 
 |  | 
 |        In the extended syntax, ^ does not negate a class (except within an or- | 
 |        dinary class nested inside an extended class); it is instead  a  binary | 
 |        operator. | 
 |  | 
 |        The  binary  operators  are "&" (intersection), "|" or "+" (union), "-" | 
 |        (subtraction) and "^" (symmetric difference). These  are  left-associa- | 
 |        tive  and  "&"  has  higher (tighter) precedence, while the others have | 
 |        equal lower precedence. The one prefix unary operator is  "!"  (comple- | 
 |        ment), with highest precedence. | 
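 |  | 
 |        The effect of operator precedence can be seen by modelling classes | 
 |        as sets. The C sketch below represents a set of ASCII characters as | 
 |        a 128-bit mask and evaluates (?[ [a-z] - [aeiou] & [a-m] ]) by | 
 |        hand. The cset type and its functions are hypothetical and are not | 
 |        part of PCRE2. | 

```c
#include <stdbool.h>
#include <stdint.h>

/* A set of ASCII characters as a 128-bit mask (hypothetical type). */
typedef struct { uint64_t lo, hi; } cset;

static cset cset_range(unsigned first, unsigned last)
{
    cset s = {0, 0};
    for (unsigned c = first; c <= last && c < 128; c++) {
        if (c < 64) s.lo |= UINT64_C(1) << c;
        else        s.hi |= UINT64_C(1) << (c - 64);
    }
    return s;
}

static bool cset_has(cset s, unsigned c)
{
    if (c >= 128)
        return false;
    return c < 64 ? ((s.lo >> c) & 1) != 0 : ((s.hi >> (c - 64)) & 1) != 0;
}

/* "&", "|" or "+", "-", "^", and the unary "!" operator. */
static cset cset_and(cset a, cset b)
{ cset r = {a.lo & b.lo, a.hi & b.hi}; return r; }
static cset cset_or(cset a, cset b)
{ cset r = {a.lo | b.lo, a.hi | b.hi}; return r; }
static cset cset_sub(cset a, cset b)
{ cset r = {a.lo & ~b.lo, a.hi & ~b.hi}; return r; }
static cset cset_xor(cset a, cset b)
{ cset r = {a.lo ^ b.lo, a.hi ^ b.hi}; return r; }
static cset cset_not(cset a)
{ cset r = {~a.lo, ~a.hi}; return r; }

/* (?[ [a-z] - [aeiou] & [a-m] ]): "&" binds more tightly than "-". */
static cset example_class(void)
{
    cset vowels = cset_or(
        cset_or(cset_range('a', 'a'), cset_range('e', 'e')),
        cset_or(cset_or(cset_range('i', 'i'), cset_range('o', 'o')),
                cset_range('u', 'u')));
    cset tight = cset_and(vowels, cset_range('a', 'm'));  /* {a, e, i} */
    return cset_sub(cset_range('a', 'z'), tight);
}
```

 |        Because "&" is applied first, only "a", "e", and "i" are removed | 
 |        from the lower case letters. Strict left-to-right evaluation would | 
 |        instead give the consonants from "b" to "m". | 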
 |  | 
 |  | 
 | UTS#18 EXTENDED CHARACTER CLASSES | 
 |  | 
 |        The  PCRE2_ALT_EXTENDED_CLASS  option  enables an alternative to Perl's | 
 |        (?[...])  syntax, allowing instead extended class behaviour inside  or- | 
 |        dinary  [...]  character classes. This altered syntax for [...] classes | 
 |        is loosely described by the Unicode standard UTS#18. The  PCRE2_ALT_EX- | 
 |        TENDED_CLASS  option  does not prevent use of (?[...]) classes; it just | 
 |        changes the meaning of all [...] classes that are not nested  inside  a | 
 |        Perl (?[...]) class. | 
 |  | 
 |        Firstly, in ordinary Perl [...] syntax, an expression such as "[a[]" is | 
 |        a  character  class  with  two  literal  characters "a" and "[", but in | 
 |        UTS#18  extended  classes  the  "["  character  becomes  an  additional | 
 |        metacharacter  within classes, denoting the start of a nested class, so | 
 |        a literal "[" must be escaped as "\[". | 
 |  | 
 |        Secondly, within the UTS#18 extended syntax, there are operators  "||", | 
 |        "&&",  "--"  and "~~" which denote character class union, intersection, | 
 |        subtraction, and symmetric difference respectively.  In  standard  Perl | 
 |        syntax,  these would simply be needlessly-repeated literals (except for | 
 |        "--" which could be the start or end of a range).  In  UTS#18  extended | 
 |        classes these operators can be used in constructs such as [\p{L}--[QW]] | 
 |        for  "Unicode letters, other than Q and W".  A literal "-" at the start | 
 |        or end of a range must be escaped, so while "[--1]" in Perl  syntax  is | 
 |        the  range from hyphen to "1", it must be escaped as "[\--1]" in UTS#18 | 
 |        extended classes. | 
 |  | 
 |        Unlike Perl's (?[...]) extended classes, the PCRE2_EXTENDED_MORE option | 
 |        to ignore space and tab characters is  not  automatically  enabled  for | 
 |        UTS#18 extended classes, but it is honoured if set. | 
 |  | 
 |        Extended  UTS#18  classes  can  be nested, and nested classes are them- | 
 |        selves extended classes (unlike Perl, where nested classes must be sim- | 
 |        ple classes).  For example, [\p{L}&&[\p{Thai}||\p{Greek}]] matches  any | 
 |        letter  that is in the Thai or Greek scripts. Note that this means that | 
 |        no special grouping characters (such as the parentheses used in  Perl's | 
 |        (?[...]) class syntax) are needed. | 
 |  | 
 |        Individual  class items (literal characters, literal ranges, properties | 
 |        such as \d or \p{...}, and nested classes) can be combined by  juxtapo- | 
 |        sition or by an operator. Juxtaposition is the implicit union operator, | 
 |        and  binds  more tightly than any explicit operator. Thus a sequence of | 
 |        literals and/or ranges behaves as if it is enclosed in square brackets. | 
 |        For example, [A-Z0-9&&[^E8]] is the same  as  [[A-Z0-9]&&[^E8]],  which | 
 |        matches any upper case alphanumeric character except "E" or "8". | 
 |  | 
 |        Precedence between the explicit operators is not defined, so mixing op- | 
 |        erators  is  a  syntax  error.  For example, [A&&B--C] is an error, but | 
 |        [A&&[B--C]] is valid. | 
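 |  | 
 |        As an informal model only (this is not PCRE2 code), the four operators | 
 |        can be pictured as set operations. The sketch below uses Python sets, | 
 |        restricted to ASCII as a simplifying assumption, to reproduce the | 
 |        [A-Z0-9&&[^E8]] example; all variable names are illustrative. | 
 |  | 

```python
import string

# Model a character class as a Python set of characters (ASCII only here).
upper_alnum = set(string.ascii_uppercase) | set(string.digits)   # [A-Z0-9]
not_e8 = set(map(chr, range(128))) - {"E", "8"}                   # [^E8]

# [A-Z0-9&&[^E8]]: juxtaposed items form a union first, then "&&" intersects.
result = upper_alnum & not_e8

# The four UTS#18 operators correspond to Python's set operators:
#   ||  union                 a | b
#   &&  intersection          a & b
#   --  subtraction           a - b
#   ~~  symmetric difference  a ^ b
letters_minus_qw = set(string.ascii_uppercase) - set("QW")        # like [A-Z--[QW]]
```

 |  | 
 |        Because juxtaposition binds more tightly than any operator, the union | 
 |        on the left is formed before the intersection is applied. | 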
 |  | 
 |        This is an emerging syntax which is being adopted gradually across  the | 
 |        regex  ecosystem:  for  example JavaScript adopted the "/v" flag in EC- | 
 |        MAScript 2024; Python's "re" module reserves the syntax for future  use | 
 |        with a FutureWarning for unescaped use of "[" as a literal within char- | 
 |        acter  classes.  Due to UTS#18 providing insufficient guidance, engines | 
 |        interpret the syntax differently.  Rust's "regex"  crate  and  Python's | 
 |        "regex"  PyPI  module  both implement UTS#18 extended classes, but with | 
 |        slight  incompatibilities  ([A||B&&C]  is  parsed  as  [A||[B&&C]]   in | 
 |        Python's "regex" but as [[A||B]&&C] in Rust's "regex"). | 
 |  | 
 |        PCRE2's  syntax  adds  syntax  restrictions  similar to ECMAScript's /v | 
 |        flag, so that all the UTS#18 extended  classes  accepted  as  valid  by | 
 |        PCRE2  have the property that they are interpreted either with the same | 
 |        behaviour, or as invalid, by all other major engines.  Please  file  an | 
 |        issue if you are aware of cross-engine differences in behaviour between | 
 |        PCRE2 and another major engine. | 
 |  | 
 |  | 
 | POSIX CHARACTER CLASSES | 
 |  | 
 |        Perl supports the POSIX notation for character classes. This uses names | 
 |        enclosed  by [: and :] within the enclosing square brackets. PCRE2 also | 
 |        supports this notation, in both ordinary and extended classes. For  ex- | 
 |        ample, | 
 |  | 
 |          [01[:alpha:]%] | 
 |  | 
 |        matches "0", "1", any alphabetic character, or "%". The supported class | 
 |        names are: | 
 |  | 
 |          alnum    letters and digits | 
 |          alpha    letters | 
 |          ascii    character codes 0 - 127 | 
 |          blank    space or tab only | 
 |          cntrl    control characters | 
 |          digit    decimal digits (same as \d) | 
 |          graph    printing characters, excluding space | 
 |          lower    lower case letters | 
 |          print    printing characters, including space | 
 |          punct    printing characters, excluding letters and digits and space | 
 |          space    white space (the same as \s from PCRE 8.34) | 
 |          upper    upper case letters | 
 |          word     "word" characters (same as \w) | 
 |          xdigit   hexadecimal digits | 
 |  | 
 |        The  default  "space" characters are HT (9), LF (10), VT (11), FF (12), | 
 |        CR (13), and space (32). If locale-specific matching is  taking  place, | 
 |        the  list  of  space characters may be different; there may be fewer or | 
 |        more of them. "Space" and \s match the same set of  characters,  as  do | 
 |        "word" and \w. | 
 |  | 
 |        The  name  "word"  is  a Perl extension, and "blank" is a GNU extension | 
 |        from Perl 5.8. Another Perl extension is negation, which  is  indicated | 
 |        by a ^ character after the colon. For example, | 
 |  | 
 |          [12[:^digit:]] | 
 |  | 
 |        matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the | 
 |        POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but | 
 |        these are not supported, and an error is given if they are encountered. | 
 |  | 
 |        By default, characters with values greater than 127 do not match any of | 
 |        the POSIX character classes, although this may be different for charac- | 
 |        ters  in  the range 128-255 when locale-specific matching is happening. | 
 |        However, in UCP mode, unless certain options are set (see below),  some | 
 |        of  the  classes  are  changed so that Unicode character properties are | 
 |        used. This is achieved by replacing POSIX classes with other sequences, | 
 |        as follows: | 
 |  | 
 |          [:alnum:]  becomes  \p{Xan} | 
 |          [:alpha:]  becomes  \p{L} | 
 |          [:blank:]  becomes  \h | 
 |          [:cntrl:]  becomes  \p{Cc} | 
 |          [:digit:]  becomes  \p{Nd} | 
 |          [:lower:]  becomes  \p{Ll} | 
 |          [:space:]  becomes  \p{Xps} | 
 |          [:upper:]  becomes  \p{Lu} | 
 |          [:word:]   becomes  \p{Xwd} | 
 |  | 
 |        Negated versions, such as [:^alpha:], use \P instead of \p. Four  other | 
 |        POSIX classes are handled specially in UCP mode: | 
 |  | 
 |        [:graph:] This  matches  characters that have glyphs that mark the page | 
 |                  when printed. In Unicode property terms, it matches all char- | 
 |                  acters with the L, M, N, P, S, or Cf properties, except for: | 
 |  | 
 |                    U+061C           Arabic Letter Mark | 
 |                    U+180E           Mongolian Vowel Separator | 
 |                    U+2066 - U+2069  Various "isolate"s | 
 |  | 
 |  | 
 |        [:print:] This matches the same  characters  as  [:graph:]  plus  space | 
 |                  characters  that  are  not controls, that is, characters with | 
 |                  the Zs property. | 
 |  | 
 |        [:punct:] This matches all characters that have the Unicode P (punctua- | 
 |                  tion) property, plus those characters with code  points  less | 
 |                  than 256 that have the S (Symbol) property. | 
 |  | 
 |        [:xdigit:] | 
 |                  In  addition  to  the  ASCII  hexadecimal  digits,  this also | 
 |                  matches the "fullwidth" versions of those  characters,  whose | 
 |                  Unicode  code  points  start at U+FF10. This is a change that | 
 |                  was made in PCRE2 release 10.43 for Perl compatibility. | 
 |  | 
 |        The other POSIX classes are unchanged  by  PCRE2_UCP,  and  match  only | 
 |        characters with code points less than 256. | 
 |  | 
 |        There are two options that can be used to restrict the POSIX classes to | 
 |        ASCII   characters   when   PCRE2_UCP  is  set.  The  option  PCRE2_EX- | 
 |        TRA_ASCII_DIGIT affects just [:digit:] and [:xdigit:].  Within  a  pat- | 
 |        tern,  this  can  be  set  and unset by (?aT) and (?-aT). The PCRE2_EX- | 
 |        TRA_ASCII_POSIX option disables UCP processing for all  POSIX  classes, | 
 |        including  [:digit:] and [:xdigit:]. Within a pattern, (?aP) and (?-aP) | 
 |        set and unset both these options for consistency. | 
 |  | 
 |  | 
 | COMPATIBILITY FEATURE FOR WORD BOUNDARIES | 
 |  | 
 |        In the POSIX.2 compliant library that was included in 4.4BSD Unix,  the | 
 |        ugly  syntax  [[:<:]]  and [[:>:]] is used for matching "start of word" | 
 |        and "end of word". PCRE2 treats these items as follows: | 
 |  | 
 |          [[:<:]]  is converted to  \b(?=\w) | 
 |          [[:>:]]  is converted to  \b(?<=\w) | 
 |  | 
 |        Only these exact character sequences are recognized. A sequence such as | 
 |        [a[:<:]b] provokes an error for an unrecognized POSIX class name.  This | 
 |        support  is not compatible with Perl. It is provided to help migrations | 
 |        from other environments, and is best not used in any new patterns. Note | 
 |        that \b matches at the start and the end of a word (see "Simple  asser- | 
 |        tions"  above),  and in a Perl-style pattern the preceding or following | 
 |        character normally shows which is wanted, without the need for the  as- | 
 |        sertions  that are used above in order to give exactly the POSIX behav- | 
 |        iour. Note also that the PCRE2_UCP option changes  the  meaning  of  \w | 
 |        (and  therefore  \b)  by  default,  so  it also affects these POSIX se- | 
 |        quences. | 
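 |  | 
 |        The two replacement sequences can be tried in any engine that supports | 
 |        \b and lookaround. The sketch below uses Python's standard "re" module | 
 |        as a stand-in (an assumption; it is not PCRE2, and it does not accept | 
 |        the [[:<:]] spelling itself) to list where each assertion succeeds. | 
 |  | 

```python
import re

subject = "hi there"

# \b(?=\w): a word boundary followed by a word character (start of word).
word_starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary preceded by a word character (end of word).
word_ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```

 |  | 
 |        For this subject the start-of-word offsets are 0 and 3, and the | 
 |        end-of-word offsets are 2 and 8. | 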
 |  | 
 |  | 
 | VERTICAL BAR | 
 |  | 
 |        Vertical bar characters are used to separate alternative patterns.  For | 
 |        example, the pattern | 
 |  | 
 |          gilbert|sullivan | 
 |  | 
 |        matches  either "gilbert" or "sullivan". Any number of alternatives may | 
 |        appear, and an empty  alternative  is  permitted  (matching  the  empty | 
 |        string). The matching process tries each alternative in turn, from left | 
 |        to  right, and the first one that succeeds is used. If the alternatives | 
 |        are within a group (defined below), "succeeds" means matching the  rest | 
 |        of the main pattern as well as the alternative in the group. | 
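 |  | 
 |        The left-to-right rule is easy to check experimentally. The following | 
 |        sketch uses Python's standard "re" module as a stand-in (an assumption; | 
 |        it is not PCRE2, but it is also a backtracking matcher and behaves the | 
 |        same way for these patterns). | 
 |  | 

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
first = re.match(r"gilbert|sullivan", "sullivan").group()      # 'sullivan'
leftmost = re.match(r"a|ab", "ab").group()                     # 'a', not 'ab'

# Inside a group, "succeeds" includes matching the rest of the pattern,
# so the engine falls back to the second alternative here.
in_group = re.match(r"(a|ab)c", "abc").group(1)                # 'ab'

# An empty alternative matches the empty string.
empty = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)  # ''
```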
 |  | 
 |  | 
 | INTERNAL OPTION SETTING | 
 |  | 
 |        The  settings  of  several options can be changed within a pattern by a | 
 |        sequence of letters enclosed between "(?" and ")".  The  following  are | 
 |        Perl-compatible, and are described in detail in the pcre2api documenta- | 
 |        tion. The option letters are: | 
 |  | 
 |          i  for PCRE2_CASELESS | 
 |          m  for PCRE2_MULTILINE | 
 |          n  for PCRE2_NO_AUTO_CAPTURE | 
 |          s  for PCRE2_DOTALL | 
 |          x  for PCRE2_EXTENDED | 
 |          xx for PCRE2_EXTENDED_MORE | 
 |  | 
 |        For example, (?im) sets caseless, multiline matching. It is also possi- | 
 |        ble to unset these options by preceding the relevant letters with a hy- | 
 |        phen,  for  example (?-im). The two "extended" options are not indepen- | 
 |        dent; unsetting either one cancels the effects of both of them. | 
 |  | 
 |        A  combined  setting  and  unsetting  such  as  (?im-sx),  which   sets | 
 |        PCRE2_CASELESS  and  PCRE2_MULTILINE  while  unsetting PCRE2_DOTALL and | 
 |        PCRE2_EXTENDED, is also permitted. Only one hyphen may  appear  in  the | 
 |        options  string.  If a letter appears both before and after the hyphen, | 
 |        the option is unset. An empty options setting "(?)" is  allowed.  Need- | 
 |        less to say, it has no effect. | 
 |  | 
 |        If  the  first character following (? is a circumflex, it causes all of | 
 |        the above options to be unset. Letters may  follow  the  circumflex  to | 
 |        cause some options to be re-instated, but a hyphen may not appear. | 
 |  | 
 |        Some  PCRE2-specific options can be changed by the same mechanism using | 
 |        these pairs or individual letters: | 
 |  | 
 |          aD for PCRE2_EXTRA_ASCII_BSD | 
 |          aS for PCRE2_EXTRA_ASCII_BSS | 
 |          aW for PCRE2_EXTRA_ASCII_BSW | 
 |          aP for PCRE2_EXTRA_ASCII_POSIX and PCRE2_EXTRA_ASCII_DIGIT | 
 |          aT for PCRE2_EXTRA_ASCII_DIGIT | 
 |          r  for PCRE2_EXTRA_CASELESS_RESTRICT | 
 |          J  for PCRE2_DUPNAMES | 
 |          U  for PCRE2_UNGREEDY | 
 |  | 
 |        However, except for 'r', these are not unset by (?^), which is  equiva- | 
 |        lent  to  (?-imnrsx).  If  'a' is not followed by any of the upper case | 
 |        letters shown above, it sets (or unsets) all the ASCII options. | 
 |  | 
 |        PCRE2_EXTRA_ASCII_DIGIT  has  no  additional  effect   when   PCRE2_EX- | 
 |        TRA_ASCII_POSIX  is  set,  but  including it in (?aP) means that (?-aP) | 
 |        suppresses all ASCII restrictions for POSIX classes. | 
 |  | 
 |        When one of these option changes occurs at top level (that is, not  in- | 
 |        side  group parentheses), the change applies until a subsequent change, | 
 |        or the end of the pattern. An option change within a group  (see  below | 
 |        for  a  description of groups) affects only that part of the group that | 
 |        follows it. At the end of the group these  options  are  reset  to  the | 
 |        state they were before the group. For example, | 
 |  | 
 |          (a(?i)b)c | 
 |  | 
 |        matches  abc  and  aBc and no other strings (assuming PCRE2_CASELESS is | 
 |        not set externally). Any changes made in one alternative  do  carry  on | 
 |        into subsequent branches within the same group. For example, | 
 |  | 
 |          (a(?i)b|c) | 
 |  | 
 |        matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the | 
 |        first branch is abandoned before the option setting.  This  is  because | 
 |        the  effects  of option settings happen at compile time. There would be | 
 |        some very weird behaviour otherwise. | 
 |  | 
 |        As a convenient shorthand, if any option settings are required  at  the | 
 |        start  of a non-capturing group (see the next section), the option let- | 
 |        ters may appear between the "?" and the ":". Thus the two patterns | 
 |  | 
 |          (?i:saturday|sunday) | 
 |          (?:(?i)saturday|sunday) | 
 |  | 
 |        match exactly the same set of strings. | 
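 |  | 
 |        A short experiment shows the scoping. The sketch below uses Python's | 
 |        standard "re" module as a stand-in (an assumption; it is not PCRE2). | 
 |        Recent Python versions reject a bare (?i) in the middle of a pattern, | 
 |        so the first example is written in the scoped form (a(?i:b))c, which | 
 |        is intended to be equivalent to (a(?i)b)c above. | 
 |  | 

```python
import re

# Equivalent of (a(?i)b)c written with a scoped option group:
# only "b" is matched caselessly.
scoped = re.compile(r"(a(?i:b))c")
matches = [s for s in ["abc", "aBc", "Abc", "abC"] if scoped.fullmatch(s)]

# Option letters between "?" and ":" apply to every alternative in the group.
days = re.compile(r"(?i:saturday|sunday)")
sunday_upper = days.fullmatch("SUNDAY") is not None
```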
 |  | 
 |        Note: There are other PCRE2-specific options,  applying  to  the  whole | 
 |        pattern,  which  can be set by the application when the compiling func- | 
 |        tion is called. In addition, the pattern can  contain  special  leading | 
 |        sequences  such  as (*CRLF) to override what the application has set or | 
 |        what has been defaulted.  Details are given  in  the  section  entitled | 
 |        "Newline sequences" above. There are also the (*UTF) and (*UCP) leading | 
 |        sequences  that can be used to set UTF and Unicode property modes; they | 
 |        are equivalent to setting the PCRE2_UTF and PCRE2_UCP options,  respec- | 
 |        tively.  However,  the  application  can  set  the  PCRE2_NEVER_UTF  or | 
 |        PCRE2_NEVER_UCP options, which lock out  the  use  of  the  (*UTF)  and | 
 |        (*UCP) sequences. | 
 |  | 
 |  | 
 | GROUPS | 
 |  | 
 |        Groups  are  delimited  by  parentheses  (round brackets), which can be | 
 |        nested.  Turning part of a pattern into a group does two things: | 
 |  | 
 |        1. It localizes a set of alternatives. For example, the pattern | 
 |  | 
 |          cat(aract|erpillar|) | 
 |  | 
 |        matches "cataract", "caterpillar", or "cat". Without  the  parentheses, | 
 |        it would match "cataract", "erpillar" or an empty string. | 
 |  | 
 |        2.  It  creates a "capture group". This means that, when the whole pat- | 
 |        tern matches, the portion of the subject string that matched the  group | 
 |        is  passed back to the caller, separately from the portion that matched | 
 |        the whole pattern.  (This applies  only  to  the  traditional  matching | 
 |        function; the DFA matching function does not support capturing.) | 
 |  | 
 |        Opening parentheses are counted from left to right (starting from 1) to | 
 |        obtain  numbers for capture groups. For example, if the string "the red | 
 |        king" is matched against the pattern | 
 |  | 
 |          the ((red|white) (king|queen)) | 
 |  | 
 |        the captured substrings are "red king", "red", and "king", and are num- | 
 |        bered 1, 2, and 3, respectively. | 
 |  | 
 |        The fact that plain parentheses fulfil  two  functions  is  not  always | 
 |        helpful.   There are often times when grouping is required without cap- | 
 |        turing. If an opening parenthesis is followed by a question mark and  a | 
 |        colon,  the  group  does  not do any capturing, and is not counted when | 
 |        computing the number of any subsequent capture groups. For example,  if | 
 |        the string "the white queen" is matched against the pattern | 
 |  | 
 |          the ((?:red|white) (king|queen)) | 
 |  | 
 |        the captured substrings are "white queen" and "queen", and are numbered | 
 |        1 and 2. The maximum number of capture groups is 65535. | 
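 |  | 
 |        The two numbering examples can be reproduced as follows. The sketch | 
 |        uses Python's standard "re" module as a stand-in (an assumption; it is | 
 |        not PCRE2, but it numbers capture groups by opening parenthesis in the | 
 |        same way). | 
 |  | 

```python
import re

# Three capturing groups, numbered by the position of the opening parenthesis.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
capturing = m.groups()        # groups 1, 2, 3

# (?:...) groups without capturing, so later groups move down by one.
m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing = m.groups()    # groups 1, 2
```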
 |  | 
 |        As  a  convenient shorthand, if any option settings are required at the | 
 |        start of a non-capturing group, the option letters may  appear  between | 
 |        the "?" and the ":". Thus the two patterns | 
 |  | 
 |          (?i:saturday|sunday) | 
 |          (?:(?i)saturday|sunday) | 
 |  | 
 |        match exactly the same set of strings. Because alternative branches are | 
 |        tried  from  left  to right, and options are not reset until the end of | 
 |        the group is reached, an option setting in one branch does affect  sub- | 
 |        sequent branches, so the above patterns match "SUNDAY" as well as "Sat- | 
 |        urday". | 
 |  | 
 |  | 
 | DUPLICATE GROUP NUMBERS | 
 |  | 
 |        Perl 5.10 introduced a feature whereby each alternative in a group uses | 
 |        the  same  numbers  for  its capturing parentheses. Such a group starts | 
 |        with (?| and is itself a non-capturing  group.  For  example,  consider | 
 |        this pattern: | 
 |  | 
 |          (?|(Sat)ur|(Sun))day | 
 |  | 
 |        Because  the two alternatives are inside a (?| group, both sets of cap- | 
 |        turing parentheses are numbered one. Thus, when  the  pattern  matches, | 
 |        you  can  look  at captured substring number one, whichever alternative | 
 |        matched. This construct is useful when you want to  capture  part,  but | 
 |        not all, of one of a number of alternatives. Inside a (?| group, paren- | 
 |        theses  are  numbered as usual, but the number is reset at the start of | 
 |        each branch. The numbers of any capturing parentheses that  follow  the | 
 |        whole group start after the highest number used in any branch. The fol- | 
 |        lowing example is taken from the Perl documentation. The numbers under- | 
 |        neath show in which buffer the captured content will be stored. | 
 |  | 
 |          # before  ---------------branch-reset----------- after | 
 |          / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x | 
 |          # 1            2         2  3        2     3     4 | 
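 |  | 
 |        The numbering rule can be sketched as a small routine. The function | 
 |        below is a hypothetical illustration written in Python, not part of | 
 |        PCRE2. It assumes a simplified pattern syntax: every "(" not followed | 
 |        by "?" captures, every other "(?" group is treated as non-capturing, | 
 |        and character classes are not handled. | 
 |  | 

```python
def branch_reset_numbers(pattern):
    """Return the capture number given to each plain '(' in order.

    Simplified sketch: "(?|" starts a branch-reset group, any other "(?"
    is treated as non-capturing, and character classes are ignored.
    """
    numbers = []      # numbers handed out, in left-to-right order
    counter = 0       # highest number used on the current path
    stack = []        # one frame per open group
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":                      # skip an escaped character
            i += 2
            continue
        if ch == "(":
            if pattern.startswith("(?|", i):
                # Remember where numbering restarts for each branch.
                stack.append({"reset": True, "start": counter, "high": counter})
                i += 3
                continue
            if pattern.startswith("(?", i):
                stack.append({"reset": False})
                i += 2
                continue
            counter += 1
            numbers.append(counter)
            stack.append({"reset": False})
        elif ch == "|" and stack and stack[-1]["reset"]:
            frame = stack[-1]
            frame["high"] = max(frame["high"], counter)
            counter = frame["start"]        # reset at the start of each branch
        elif ch == ")" and stack:
            frame = stack.pop()
            if frame["reset"]:
                # Groups after the reset group continue from the highest
                # number used in any branch.
                counter = max(frame["high"], counter)
        i += 1
    return numbers


example = branch_reset_numbers("( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z )")
```

 |  | 
 |        For the pattern shown above, the routine yields 1, 2, 2, 3, 2, 3, 4, | 
 |        which agrees with the numbers in the annotated example. | 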
 |  | 
 |        A  backreference  to a capture group uses the most recent value that is | 
 |        set for the group. The following pattern matches "abcabc" or "defdef": | 
 |  | 
 |          /(?|(abc)|(def))\1/ | 
 |  | 
 |        In contrast, a subroutine call to a capture group always refers to  the | 
 |        first  one  in the pattern with the given number. The following pattern | 
 |        matches "abcabc" or "defabc": | 
 |  | 
 |          /(?|(abc)|(def))(?1)/ | 
 |  | 
 |        A relative reference such as (?-1) is no different: it is just a conve- | 
 |        nient way of computing an absolute group number. | 
 |  | 
 |        If a condition test for a group's having matched refers to a non-unique | 
 |        number, the test is true if any group with that number has matched. | 
 |  | 
 |        An alternative approach to using this "branch reset" feature is to  use | 
 |        duplicate named groups, as described in the next section. | 
 |  | 
 |  | 
 | NAMED CAPTURE GROUPS | 
 |  | 
 |        Identifying capture groups by number is simple, but it can be very hard | 
 |        to  keep  track of the numbers in complicated patterns. Furthermore, if | 
 |        an expression is modified, the numbers may change. To  help  with  this | 
 |        difficulty,  PCRE2  supports the naming of capture groups. This feature | 
 |        was not added to Perl until release 5.10. Python had the  feature  ear- | 
 |        lier,  and PCRE1 introduced it at release 4.0, using the Python syntax. | 
 |        PCRE2 supports both the Perl and the Python syntax. | 
 |  | 
 |        In PCRE2,  a  capture  group  can  be  named  in  one  of  three  ways: | 
 |        (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. | 
 |        Names may be up to 128 code units long. When PCRE2_UTF is not set, they | 
 |        may  contain  only  ASCII  alphanumeric characters and underscores, but | 
 |        must start with a non-digit. When PCRE2_UTF is set, the syntax of group | 
 |        names is extended to allow any Unicode letter or Unicode decimal digit. | 
 |        In other words, group names must match one of these patterns: | 
 |  | 
 |          ^[_A-Za-z][_A-Za-z0-9]*\z   when PCRE2_UTF is not set | 
 |          ^[_\p{L}][_\p{L}\p{Nd}]*\z  when PCRE2_UTF is set | 
 |  | 
 |        References to capture groups from other parts of the pattern,  such  as | 
 |        backreferences,  recursion,  and conditions, can all be made by name as | 
 |        well as by number. | 
 |  | 
 |        Named capture groups are allocated numbers as well as names, exactly as | 
 |        if the names were not present. In both PCRE2 and Perl,  capture  groups | 
 |        are  primarily  identified  by  numbers; any names are just aliases for | 
 |        these numbers. The PCRE2 API provides function calls for extracting the | 
 |        complete name-to-number translation table from a compiled  pattern,  as | 
 |        well  as  convenience  functions  for extracting captured substrings by | 
 |        name. | 
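 |  | 
 |        The relationship between names and numbers can be seen with the Python | 
 |        syntax, which PCRE2 also accepts. The sketch below runs in Python's | 
 |        standard "re" module as a stand-in (an assumption; it is not the PCRE2 | 
 |        API, whose own functions are described in the pcre2api documentation). | 
 |  | 

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2024-05")

by_name = m.group("year")            # the group addressed by name
by_number = m.group(1)               # the same group, addressed by number
name_table = dict(date.groupindex)   # name-to-number translation table

# A named backreference, using the Python syntax (?P=name).
doubled = re.fullmatch(r"(?P<w>\w+) (?P=w)", "bye bye") is not None
```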
 |  | 
 |        Warning: When more than one capture group has the same number,  as  de- | 
 |        scribed in the previous section, a name given to one of them applies to | 
 |        all  of them. Perl allows identically numbered groups to have different | 
 |        names.  Consider this pattern, where there are two capture groups, both | 
 |        numbered 1: | 
 |  | 
 |          (?|(?<AA>aa)|(?<BB>bb)) | 
 |  | 
 |        Perl allows this, with both names AA and BB  as  aliases  of  group  1. | 
 |        Thus, after a successful match, both names yield the same value (either | 
 |        "aa" or "bb"). | 
 |  | 
 |        In  an attempt to reduce confusion, PCRE2 does not allow the same group | 
 |        number to be associated with more than one name. The example above pro- | 
 |        vokes a compile-time error. However, there is still  scope  for  confu- | 
 |        sion. Consider this pattern: | 
 |  | 
 |          (?|(?<AA>aa)|(bb)) | 
 |  | 
 |        Although the second group number 1 is not explicitly named, the name AA | 
 |        is  still an alias for any group 1. Whether the pattern matches "aa" or | 
 |        "bb", a reference by name to group AA yields the matched string. | 
 |  | 
 |        By default, a name must be unique within a pattern, except that  dupli- | 
 |        cate names are permitted for groups with the same number, for example: | 
 |  | 
 |          (?|(?<AA>aa)|(?<AA>bb)) | 
 |  | 
 |        The duplicate name constraint can be disabled by setting the PCRE2_DUP- | 
 |        NAMES option at compile time, or by the use of (?J) within the pattern, | 
 |        as described in the section entitled "Internal Option Setting" above. | 
 |  | 
 |        Duplicate  names  can be useful for patterns where only one instance of | 
 |        the named capture group can match. Suppose you want to match  the  name | 
 |        of  a  weekday,  either as a 3-letter abbreviation or as the full name, | 
 |        and in both cases you want to extract the  abbreviation.  This  pattern | 
 |        (ignoring the line breaks) does the job: | 
 |  | 
 |          (?J) | 
 |          (?<DN>Mon|Fri|Sun)(?:day)?| | 
 |          (?<DN>Tue)(?:sday)?| | 
 |          (?<DN>Wed)(?:nesday)?| | 
 |          (?<DN>Thu)(?:rsday)?| | 
 |          (?<DN>Sat)(?:urday)? | 
 |  | 
 |        There  are five capture groups, but only one is ever set after a match. | 
 |        The convenience functions for extracting the data by  name  return  the | 
 |        substring  for  the first (and in this example, the only) group of that | 
 |        name that matched. This saves searching to find which numbered group it | 
 |        was. (An alternative way of solving this problem is to  use  a  "branch | 
 |        reset" group, as described in the previous section.) | 
 |  | 
 |        If  you make a backreference to a non-unique named group from elsewhere | 
 |        in the pattern, the groups to which the name refers are checked in  the | 
 |        order  in  which they appear in the overall pattern. The first one that | 
 |        is set is used for the reference. For  example,  this  pattern  matches | 
 |        both "foofoo" and "barbar" but not "foobar" or "barfoo": | 
 |  | 
 |          (?J)(?:(?<n>foo)|(?<n>bar))\k<n> | 
 |  | 
 |  | 
 |        If you make a subroutine call to a non-unique named group, the one that | 
 |        corresponds to the first occurrence of the name is used. In the absence | 
 |        of duplicate numbers this is the one with the lowest number. | 
 |  | 
 |        If you use a named reference in a condition test (see the section about | 
 |        conditions below), either to check whether a capture group has matched, | 
 |        or to check for recursion, all groups with the same name are tested. If | 
 |        the  condition  is  true  for any one of them, the overall condition is | 
 |        true. This is the same behaviour as testing by number. For further  de- | 
 |        tails  of  the  interfaces  for  handling named capture groups, see the | 
 |        pcre2api documentation. | 
 |  | 
 |  | 
 | REPETITION | 
 |  | 
 |        Repetition is specified by quantifiers, which may  follow  any  one  of | 
 |        these items: | 
 |  | 
 |          a literal data character | 
 |          the dot metacharacter | 
 |          the \C escape sequence | 
 |          the \R escape sequence | 
 |          the \X escape sequence | 
 |          any escape sequence that matches a single character | 
 |          a character class | 
 |          a backreference | 
 |          a parenthesized group (including lookaround assertions) | 
 |          a subroutine call (recursive or otherwise) | 
 |  | 
 |        If a quantifier does not follow a repeatable item, an error occurs. The | 
 |        general repetition quantifier specifies a minimum and maximum number of | 
 |        permitted  matches  by  giving  two numbers in curly brackets (braces), | 
 |        separated by a comma. The numbers must be  less  than  65536,  and  the | 
 |        first must be less than or equal to the second. For example, | 
 |  | 
 |          z{2,4} | 
 |  | 
 |        matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a | 
 |        special character. If the second number is omitted, but  the  comma  is | 
 |        present,  there  is  no upper limit; if the second number and the comma | 
 |        are both omitted, the quantifier specifies an exact number of  required | 
 |        matches. Thus | 
 |  | 
 |          [aeiou]{3,} | 
 |  | 
 |        matches at least 3 successive vowels, but may match many more, whereas | 
 |  | 
 |          \d{8} | 
 |  | 
 |        matches  exactly  8  digits.  If the first number is omitted, the lower | 
 |        limit is taken as zero; in this case the upper limit must be present. | 
 |  | 
 |          X{,4} is interpreted as X{0,4} | 
 |  | 
 |        This is a change in behaviour that happened in Perl  5.34.0  and  PCRE2 | 
 |        10.43.  In  earlier  versions  such a sequence was not interpreted as a | 
 |        quantifier. Other regular expression engines may behave either way. | 
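 |  | 
 |        The quantifier forms described so far can be checked with a few | 
 |        matches. The sketch below uses Python's standard "re" module as a | 
 |        stand-in (an assumption; it is not PCRE2, though it also reads X{,4} | 
 |        as X{0,4}). The helper name "full" is illustrative. | 
 |  | 

```python
import re

def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# z{2,4}: between two and four repetitions.
bounded = [n for n in range(7) if full(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: three or more, with no upper limit.
open_ended = full(r"[aeiou]{3,}", "aeiouaeiou")

# \d{8}: exactly eight.
exact = (full(r"\d{8}", "12345678"), full(r"\d{8}", "1234567"))

# X{,4}: the omitted lower limit is taken as zero.
no_lower = [n for n in range(7) if full(r"X{,4}", "X" * n)]
```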
 |  | 
 |        If the characters that follow an opening brace do not match the  syntax | 
 |        of a quantifier, the brace is taken as a literal character. In particu- | 
 |        lar, this means that {,} is a literal string of three characters. | 
 |  | 
 |        Note that not every opening brace is potentially the start of a quanti- | 
 |        fier  because  braces  are  used  in  other  items such as \N{U+345} or | 
 |        \k{name}. | 
 |  | 
 |        In UTF modes, quantifiers apply to characters rather than to individual | 
 |        code units. Thus, for example, \x{100}{2} matches two characters,  each | 
 |        of which is represented by a two-byte sequence in a UTF-8 string. Simi- | 
 |        larly,  \X{3} matches three Unicode extended grapheme clusters, each of | 
 |        which may be several code units long (and  they  may  be  of  different | 
 |        lengths). | 
 |  | 
 |        The quantifier {0} is permitted, causing the expression to behave as if | 
 |        the previous item and the quantifier were not present. This may be use- | 
 |        ful  for  capture  groups that are referenced as subroutines from else- | 
 |        where in the pattern (but see also the section entitled "Defining  cap- | 
 |        ture groups for use by reference only" below). Except for parenthesized | 
 |        groups,  items that have a {0} quantifier are omitted from the compiled | 
 |        pattern. | 
 |  | 
 |        For convenience, the three most common quantifiers have  single-charac- | 
 |        ter abbreviations: | 
 |  | 
 |          *    is equivalent to {0,} | 
 |          +    is equivalent to {1,} | 
 |          ?    is equivalent to {0,1} | 
 |  | 
 |        It  is  possible  to construct infinite loops by following a group that | 
 |        can match no characters with a quantifier that has no upper limit,  for | 
 |        example: | 
 |  | 
 |          (a?)* | 
 |  | 
 |        Earlier  versions  of  Perl  and PCRE1 used to give an error at compile | 
 |        time for such patterns. However, because there are cases where this can | 
 |        be useful, such patterns are now accepted, but whenever an iteration of | 
 |        such a group matches no characters, matching moves on to the next  item | 
 |        in  the  pattern  instead  of repeatedly matching an empty string. This | 
 |        does not prevent backtracking into any of the iterations  if  a  subse- | 
 |        quent item fails to match. | 
 |  | 
 |        By  default,  quantifiers  are "greedy", that is, they match as much as | 
 |        possible (up to the maximum number of permitted  repetitions),  without | 
 |        causing  the  rest of the pattern to fail. The classic example of where | 
 |        this gives problems is in trying to match comments in C programs. These | 
 |        appear between /* and */ and within the comment,  individual  *  and  / | 
 |        characters  may  appear. An attempt to match C comments by applying the | 
 |        pattern | 
 |  | 
 |          /\*.*\*/ | 
 |  | 
 |        to the string | 
 |  | 
 |          /* first comment */  not comment  /* second comment */ | 
 |  | 
 |        fails, because it matches the entire string owing to the greediness  of | 
 |        the  .*  item. However, if a quantifier is followed by a question mark, | 
 |        it ceases to be greedy, and instead matches the minimum number of times | 
 |        possible, so the pattern | 
 |  | 
 |          /\*.*?\*/ | 
 |  | 
 |        does the right thing with C comments. The meaning of the various  quan- | 
 |        tifiers is not otherwise changed, just the preferred number of matches. | 
 |        Do  not  confuse this use of question mark with its use as a quantifier | 
 |        in its own right.  Because it has two uses,  it  can  sometimes  appear | 
 |        doubled, as in | 
 |  | 
 |          \d??\d | 
 |  | 
 |        which matches one digit by preference, but can match two if that is the | 
 |        only way the rest of the pattern matches. | 
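 |  | 
 |        The difference between greedy and non-greedy matching can be observed | 
 |        directly. The sketch below uses Python's standard "re" module as a | 
 |        stand-in (an assumption; it is not PCRE2, but it treats these | 
 |        quantifiers in the same way). | 
 |  | 

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Non-greedy .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two if the rest of the match needs it.
prefer_one = re.match(r"\d??\d", "12").group()
forced_two = re.fullmatch(r"\d??\d", "12").group()
```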
 |  | 
 |        If the PCRE2_UNGREEDY option is set (an option that is not available in | 
 |        Perl),  the  quantifiers are not greedy by default, but individual ones | 
 |        can be made greedy by following them with a  question  mark.  In  other | 
 |        words, it inverts the default behaviour. | 
 |  | 
 |        When  a  parenthesized  group is quantified with a minimum repeat count | 
 |        that is greater than 1 or with a limited maximum, more  memory  is  re- | 
 |        quired for the compiled pattern, in proportion to the size of the mini- | 
 |        mum or maximum. | 
 |  | 
 |        If  a  pattern  starts  with  .*  or  .{0,} and the PCRE2_DOTALL option | 
 |        (equivalent to Perl's /s) is set, thus allowing the dot to  match  new- | 
 |        lines,  the  pattern  is  implicitly anchored, because whatever follows | 
 |        will be tried against every character position in the  subject  string, | 
 |        so  there is no point in retrying the overall match at any position af- | 
 |        ter the first. PCRE2 normally treats such a pattern as though  it  were | 
 |        preceded by \A. | 
 |  | 
 |        In  cases  where  it  is known that the subject string contains no new- | 
 |        lines, it is worth setting PCRE2_DOTALL in order to obtain  this  opti- | 
 |        mization, or alternatively, using ^ to indicate anchoring explicitly. | 
 |  | 
 |        However,  there  are  some cases where the optimization cannot be used. | 
 |        When .*  is inside capturing parentheses that  are  the  subject  of  a | 
 |        backreference  elsewhere  in the pattern, a match at the start may fail | 
 |        where a later one succeeds. Consider, for example: | 
 |  | 
 |          (.*)abc\1 | 
 |  | 
 |        If the subject is "xyz123abc123" the match point is the fourth  charac- | 
 |        ter. For this reason, such a pattern is not implicitly anchored. | 
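 |  | 
 |        The match point can be checked with a short sketch. It uses Python's | 
 |        re module as a stand-in for PCRE2 (an assumption: Python performs no | 
 |        implicit anchoring, so it simply reports where the match is found). | 
 |        Offsets are zero-based, so offset 3 is the fourth character. | 
 |  | 
```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text after "abc" is only
# "123", which cannot repeat "xyz123", "yz123" or "z123".
start = m.start()
whole = m.group(0)
captured = m.group(1)

print(start, whole, captured)  # 3 123abc123 123
```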
 |  | 
 |        Another  case where implicit anchoring is not applied is when the lead- | 
 |        ing .* is inside an atomic group. Once again, a match at the start  may | 
 |        fail where a later one succeeds. Consider this pattern: | 
 |  | 
 |          (?>.*?a)b | 
 |  | 
 |        It  matches "ab" in the subject "aab". The use of the backtracking con- | 
 |        trol verbs (*PRUNE) and (*SKIP) also disables this optimization. To | 
 |        turn it off explicitly, either pass the compile option | 
 |        PCRE2_NO_DOTSTAR_ANCHOR, or call pcre2_set_optimize() with a | 
 |        PCRE2_DOTSTAR_ANCHOR_OFF directive. | 
 |  | 
 |        When a capture group is repeated, the value captured is  the  substring | 
 |        that matched the final iteration. For example, after | 
 |  | 
 |          (tweedle[dume]{3}\s*)+ | 
 |  | 
 |        has matched "tweedledum tweedledee" the value of the captured substring | 
 |        is  "tweedledee". However, if there are nested capture groups, the cor- | 
 |        responding captured values may have been set  in  previous  iterations. | 
 |        For example, after | 
 |  | 
 |          (a|(b))+ | 
 |  | 
 |        matches "aba" the value of the second captured substring is "b". | 
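 |  | 
 |        A minimal sketch of both cases, again assuming Python's re module as | 
 |        a stand-in for PCRE2. For these two patterns it reports the captured | 
 |        values described above. | 
 |  | 
```python
import re

# The group runs twice; only the text of the final iteration is kept.
m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# Three iterations: "a", "b", "a". The nested group took part only in the
# second one, and its value survives the third.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration)  # tweedledee
print(outer, inner)    # a b
```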
 |  | 
 |  | 
 | ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS | 
 |  | 
 |        With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy") | 
 |        repetition, failure of what follows normally causes the  repeated  item | 
 |        to  be  re-evaluated to see if a different number of repeats allows the | 
 |        rest of the pattern to match. Sometimes it is useful to  prevent  this, | 
 |        either to change the nature of the match, or to cause it to fail earlier | 
 |        than it otherwise might, when the author of the pattern knows there  is | 
 |        no point in carrying on. | 
 |  | 
 |        Consider,  for  example, the pattern \d+foo when applied to the subject | 
 |        line | 
 |  | 
 |          123456bar | 
 |  | 
 |        After matching all 6 digits and then failing to match "foo", the normal | 
 |        action of the matcher is to try again with only 5 digits  matching  the | 
 |        \d+  item,  and  then  with  4,  and  so on, before ultimately failing. | 
 |        "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides | 
 |        the means for specifying that once a group has matched, it is not to be | 
 |        re-evaluated in this way. | 
 |  | 
 |        If  we  use atomic grouping for the previous example, the matcher gives | 
 |        up immediately on failing to match "foo" the first time.  The  notation | 
 |        is a kind of special parenthesis, starting with (?> as in this example: | 
 |  | 
 |          (?>\d+)foo | 
 |  | 
 |        Perl  5.28  introduced an experimental alphabetic form starting with (* | 
 |        which may be easier to remember: | 
 |  | 
 |          (*atomic:\d+)foo | 
 |  | 
 |        This kind of parenthesized group "locks up" the part of the pattern  it | 
 |        contains once it has matched, and a failure further into the pattern is | 
 |        prevented  from  backtracking into it. Backtracking past it to previous | 
 |        items, however, works as normal. | 
 |  | 
 |        An alternative description is that a group of this type matches exactly | 
 |        the string of characters that an  identical  standalone  pattern  would | 
 |        match, if anchored at the current point in the subject string. | 
 |  | 
 |        Atomic  groups  are  not capture groups. Simple cases such as the above | 
 |        example can be thought of as a  maximizing  repeat  that  must  swallow | 
 |        everything  it can.  So, while both \d+ and \d+? are prepared to adjust | 
 |        the number of digits they match in order to make the rest of  the  pat- | 
 |        tern match, (?>\d+) can only match an entire sequence of digits. | 
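 |  | 
 |        The difference can be seen in a small sketch. It assumes Python's re | 
 |        module rather than PCRE2, and because Python accepts (?>...) only | 
 |        from version 3.11, it emulates the atomic group with a lookahead | 
 |        followed by a backreference, (?=(\d+))\1, an idiom that relies on | 
 |        lookaheads being atomic. The subject "1234" and the trailing \d are | 
 |        illustrative choices. | 
 |  | 
```python
import re

# Ordinary greedy repeat: \d+ first takes all four digits, then gives one
# back so that the final \d can match.
backtracking = re.search(r"\d+\d", "1234")

# Emulated atomic group: the lookahead captures the whole digit run and is
# never re-entered, so \1 always consumes every digit and \d has nothing left.
atomic_like = re.search(r"(?=(\d+))\1\d", "1234")

print(backtracking.group())  # 1234
print(atomic_like)           # None
```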
 |  | 
 |        Atomic  groups in general can of course contain arbitrarily complicated | 
 |        expressions, and can be nested. However, when the contents of an atomic | 
 |        group are just a single repeated item, as in the example above, a sim- | 
 |        pler notation, called a "possessive quantifier", can be used. This con- | 
 |        sists of an additional + character following a quantifier. Using this | 
 |        notation, the previous example can be rewritten as | 
 |  | 
 |          \d++foo | 
 |  | 
 |        Note that a possessive quantifier can be used with an entire group, for | 
 |        example: | 
 |  | 
 |          (abc|xyz){2,3}+ | 
 |  | 
 |        Possessive  quantifiers are always greedy; the setting of the PCRE2_UN- | 
 |        GREEDY option is ignored. They are a convenient notation for  the  sim- | 
 |        pler  forms  of  atomic  group.  However, there is no difference in the | 
 |        meaning of a possessive quantifier and  the  equivalent  atomic  group, | 
 |        though  there  may  be a performance difference; possessive quantifiers | 
 |        should be slightly faster. | 
 |  | 
 |        The possessive quantifier syntax is an extension to the Perl  5.8  syn- | 
 |        tax.   Jeffrey  Friedl  originated the idea (and the name) in the first | 
 |        edition of his book. Mike McCloskey liked it, so implemented it when he | 
 |        built Sun's Java package, and PCRE1 copied it from there. It found  its | 
 |        way into Perl at release 5.10. | 
 |  | 
 |        PCRE2  has  an  optimization  that automatically "possessifies" certain | 
 |        simple pattern constructs. For example, the sequence A+B is treated  as | 
 |        A++B  because  there is no point in backtracking into a sequence of A's | 
 |        when  B  must  follow.   This  feature   can   be   disabled   by   the | 
 |        PCRE2_NO_AUTO_POSSESS  option,  by  calling pcre2_set_optimize() with a | 
 |        PCRE2_AUTO_POSSESS_OFF directive,  or  by  starting  the  pattern  with | 
 |        (*NO_AUTO_POSSESS). | 
 |  | 
 |        When a pattern contains an unlimited repeat inside a group that can it- | 
 |        self  be  repeated  an  unlimited number of times, the use of an atomic | 
 |        group is the only way to avoid some failing matches taking a very  long | 
 |        time indeed. The pattern | 
 |  | 
 |          (\D+|<\d+>)*[!?] | 
 |  | 
 |        matches  an  unlimited number of substrings that either consist of non- | 
 |        digits, or digits enclosed in <>, followed by either ! or  ?.  When  it | 
 |        matches, it runs quickly. However, if it is applied to | 
 |  | 
 |          aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa | 
 |  | 
 |        it  takes  a  long  time  before reporting failure. This is because the | 
 |        string can be divided between the internal \D+ repeat and the  external | 
 |        *  repeat in a large number of ways, and all have to be tried. (The ex- | 
 |        ample uses [!?] rather than a single character at the end, because both | 
 |        PCRE2 and Perl have an optimization that allows for fast failure when a | 
 |        single character is used. They remember the last single character  that | 
 |        is  required  for  a  match, and fail early if it is not present in the | 
 |        string.) If the pattern is changed so that it  uses  an  atomic  group, | 
 |        like this: | 
 |  | 
 |          ((?>\D+)|<\d+>)*[!?] | 
 |  | 
 |        sequences of non-digits cannot be broken, and failure happens quickly. | 
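 |  | 
 |        To give a feel for "a large number of ways", the Python sketch below | 
 |        counts how many ways a run of n non-digits can be cut into non-empty | 
 |        pieces, each piece being one \D+ iteration of the outer group. It is | 
 |        only a rough model, not a measurement of PCRE2: count_splits is a hy- | 
 |        pothetical helper, and the work actually done by a matcher depends on | 
 |        its optimizations. | 
 |  | 
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Number of ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1  # nothing left to cut: exactly one (empty) arrangement
    # Choose the length of the first piece, then split the remainder.
    return sum(count_splits(n - first) for first in range(1, n + 1))


for n in (1, 2, 5, 20):
    print(n, count_splits(n))
```

 |        The count doubles with each extra character (it equals 2 to the power | 
 |        n-1), which is why the failing match slows down so sharply as the | 
 |        subject grows. | 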
 |  | 
 |  | 
 | BACKREFERENCES | 
 |  | 
 |        Outside a character class, a backslash followed by a digit greater than | 
 |        0  (and  possibly further digits) is a backreference to a capture group | 
 |        earlier (that is, to its left) in the pattern, provided there have been | 
 |        that many previous capture groups. | 
 |  | 
 |        However, if the decimal number following the backslash is less than  8, | 
 |        it  is  always  taken  as  a backreference, and causes an error only if | 
 |        there are not that many capture groups in the entire pattern. In  other | 
 |        words, the group that is referenced need not be to the left of the ref- | 
 |        erence  for numbers less than 8. A "forward backreference" of this type | 
 |        can make sense when a repetition is involved and the group to the right | 
 |        has participated in an earlier iteration. | 
 |  | 
 |        It is not possible to have a numerical  "forward  backreference"  to  a | 
 |        group  whose  number  is 8 or more using this syntax because a sequence | 
 |        such as \50 is interpreted as a character defined  in  octal.  See  the | 
 |        subsection entitled "Non-printing characters" above for further details | 
 |        of  the  handling of digits following a backslash. Other forms of back- | 
 |        referencing do not suffer from this restriction. In  particular,  there | 
 |        is no problem when named capture groups are used (see below). | 
 |  | 
 |        Another  way  of  avoiding  the ambiguity inherent in the use of digits | 
 |        following a backslash is to use the \g  escape  sequence.  This  escape | 
 |        must be followed by a signed or unsigned number, optionally enclosed in | 
 |        braces. These examples are all identical: | 
 |  | 
 |          (ring), \1 | 
 |          (ring), \g1 | 
 |          (ring), \g{1} | 
 |  | 
 |        An  unsigned number specifies an absolute reference without the ambigu- | 
 |        ity that is present in the older syntax. It is also useful when literal | 
 |        digits follow the reference. A signed number is a  relative  reference. | 
 |        Consider this example: | 
 |  | 
 |          (abc(def)ghi)\g{-1} | 
 |  | 
 |        The sequence \g{-1} is a reference to the capture group whose number is | 
 |        one  less  than  the number of the next group to be started, so in this | 
 |        example (where the next group would be numbered 3) it is equivalent to | 
 |        \2,  and  \g{-2} would be equivalent to \1. Note that if this construct | 
 |        is inside a capture group, that group is included in the count,  so  in | 
 |        this example \g{-2} also refers to group 1: | 
 |  | 
 |          (A)(\g{-2}B) | 
 |  | 
 |        The  use  of  relative  references can be helpful in long patterns, and | 
 |        also in patterns that are created by joining  together  fragments  that | 
 |        contain references within themselves. | 
 |  | 
 |        The  sequence  \g{+1}  is a reference to the next capture group that is | 
 |        started after this item, and \g{+2} refers to the one after  that,  and | 
 |        so  on.  This  kind of forward reference can be useful in patterns that | 
 |        repeat. Perl does not support the use of + in this way. | 
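 |  | 
 |        The numbering rule for relative references can be restated as a small | 
 |        calculation. The Python sketch below is not part of PCRE2: the name | 
 |        resolve_relative is hypothetical, and its way of counting groups is | 
 |        deliberately simplified (it ignores character classes, does not count | 
 |        named groups, and is fooled by an escaped backslash before a paren- | 
 |        thesis). | 
 |  | 
```python
import re


def resolve_relative(pattern: str, ref_pos: int, offset: int) -> int:
    """Absolute group number for a relative reference located at ref_pos."""
    if offset == 0:
        raise ValueError("a relative reference needs a non-zero offset")
    # Capture groups opened to the left of the reference: an unescaped "("
    # that is not followed by "?" or "*".
    opened = len(re.findall(r"(?<!\\)\((?![?*])", pattern[:ref_pos]))
    if offset < 0:
        # -1 is one less than the number of the next group to be started.
        return opened + 1 + offset
    # +1 is the next group to be started after the reference.
    return opened + offset


p1 = r"(abc(def)ghi)\g{-1}"
p2 = r"(A)(\g{-2}B)"
p3 = r"(?:\g{+1}x|(y))+"

print(resolve_relative(p1, p1.index(r"\g"), -1))  # 2
print(resolve_relative(p1, p1.index(r"\g"), -2))  # 1
print(resolve_relative(p2, p2.index(r"\g"), -2))  # 1
print(resolve_relative(p3, p3.index(r"\g"), +1))  # 1
```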
 |  | 
 |        A backreference matches whatever actually  most  recently  matched  the | 
 |        capture  group  in  the current subject string, rather than anything at | 
 |        all that matches the group (see "Groups as subroutines" below for a way | 
 |        of doing that). So the pattern | 
 |  | 
 |          (sens|respons)e and \1ibility | 
 |  | 
 |        matches "sense and sensibility" and "response and responsibility",  but | 
 |        not  "sense and responsibility". If caseful matching is in force at the | 
 |        time of the backreference, the case of letters is relevant.  For  exam- | 
 |        ple, | 
 |  | 
 |          ((?i)rah)\s+\1 | 
 |  | 
 |        matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the | 
 |        original capture group is matched caselessly. | 
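 |  | 
 |        Both examples can be tried in a short sketch. It assumes Python's re | 
 |        module as a stand-in for PCRE2. Python does not accept (?i) in the | 
 |        middle of a pattern, so the sketch writes the scoped form (?i:rah) | 
 |        instead; the backreference lies outside that scope and is therefore | 
 |        compared casefully. | 
 |  | 
```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
same_word = [
    bool(pattern.fullmatch(s))
    for s in (
        "sense and sensibility",
        "response and responsibility",
        "sense and responsibility",
    )
]

# The group matches caselessly, but \1 must repeat the captured text exactly.
rah = re.compile(r"((?i:rah))\s+\1")
case_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

print(same_word)     # [True, True, False]
print(case_results)  # [True, True, False]
```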
 |  | 
 |        There are several different ways of  writing  backreferences  to  named | 
 |        capture  groups.  The  .NET  syntax  is  \k{name}, the Python syntax is | 
 |        (?P=name), and the original Perl syntax is \k<name> or \k'name'. All of | 
 |        these  are  now  supported  by both Perl and PCRE2. Perl 5.10's unified | 
 |        backreference syntax, in which \g can be  used  for  both  numeric  and | 
 |        named  references,  is  also  supported by PCRE2.  We could rewrite the | 
 |        above example in any of the following ways: | 
 |  | 
 |          (?<p1>(?i)rah)\s+\k<p1> | 
 |          (?'p1'(?i)rah)\s+\k{p1} | 
 |          (?P<p1>(?i)rah)\s+(?P=p1) | 
 |          (?<p1>(?i)rah)\s+\g{p1} | 
 |  | 
 |        A capture group that is referenced by name may appear  in  the  pattern | 
 |        before or after the reference. | 
 |  | 
 |        There  may be more than one backreference to the same group. If a group | 
 |        has not actually been used in a particular match, backreferences to  it | 
 |        always fail by default. For example, the pattern | 
 |  | 
 |          (a|(bc))\2 | 
 |  | 
 |        always  fails  if  it starts to match "a" rather than "bc". However, if | 
 |        the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref- | 
 |        erence to an unset value matches an empty string. | 
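 |  | 
 |        The default behaviour can be sketched as follows, assuming Python's | 
 |        re module as a stand-in for PCRE2. Python has no counterpart of | 
 |        PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown. | 
 |  | 
```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 never takes part when the first alternative matches "a", so the
# backreference to it fails and no match is found.
unset_case = pattern.search("aa")

# When "bc" is matched by group 2, \2 can repeat it.
set_case = pattern.search("bcbc")

print(unset_case)        # None
print(set_case.group())  # bcbc
```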
 |  | 
 |        Because there may be many capture groups in a pattern, all digits  fol- | 
 |        lowing  a backslash are taken as part of a potential backreference num- | 
 |        ber. If the pattern continues with a digit  character,  some  delimiter | 
 |        must  be  used to terminate the backreference. If the PCRE2_EXTENDED or | 
 |        PCRE2_EXTENDED_MORE option is set, this can be white space.  Otherwise, | 
 |        the \g{} syntax or an empty comment (see "Comments" below) can be used. | 
 |  | 
 |    Recursive backreferences | 
 |  | 
 |        A  backreference  that occurs inside the group to which it refers fails | 
 |        when the group is first used, so, for  example,  (a\1)  never  matches. | 
 |        However,  such references can be useful inside repeated groups. For ex- | 
 |        ample, the pattern | 
 |  | 
 |          (a|b\1)+ | 
 |  | 
 |        matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- | 
 |        ation of the group, the backreference matches the character string cor- | 
 |        responding to the previous iteration. In order for this  to  work,  the | 
 |        pattern  must  be  such that the first iteration does not need to match | 
 |        the backreference. This can be done using alternation, as in the  exam- | 
 |        ple above, or by a quantifier with a minimum of zero. | 
 |  | 
 |        For versions of PCRE2 less than 10.25, backreferences of this type used | 
 |        to  cause  the  group  that  they  reference to be treated as an atomic | 
 |        group.  This restriction no longer applies, and backtracking into  such | 
 |        groups can occur as normal. | 
 |  | 
 |  | 
 | ASSERTIONS | 
 |  | 
 |        An  assertion  is a test that does not consume any characters. The test | 
 |        must succeed for the match to continue. The simple assertions coded  as | 
 |        \b, \B, \A, \G, \Z, \z, ^ and $ are described above. | 
 |  | 
 |        More  complicated  assertions  are  coded  as  parenthesized groups. If | 
 |        matching such a group succeeds, matching continues after it,  but  with | 
 |        the matching position in the subject string reset to what it was before | 
 |        the assertion was processed. | 
 |  | 
 |        A  special  kind  of  assertion,  called  a "scan substring" assertion, | 
 |        matches a subpattern against a previously captured substring.  This  is | 
 |        described in the section entitled "Scan substring assertions" below. It | 
 |        is a PCRE2 extension, not compatible with Perl. | 
 |  | 
 |        The other group-based assertions are of two kinds: those that look ahead | 
 |        of  the current position in the subject string, and those that look be- | 
 |        hind it, and in each case an assertion may be positive (must match  for | 
 |        the assertion to be true) or negative (must not match for the assertion | 
 |        to be true). | 
 |  | 
 |        The  Perl-compatible  lookaround assertions are atomic. If an assertion | 
 |        is true, but there is a subsequent matching failure, there is no  back- | 
 |        tracking  into  the assertion. However, there are some cases where non- | 
 |        atomic assertions can be useful. PCRE2 has some support for these,  de- | 
 |        scribed in the section entitled "Non-atomic assertions" below, but they | 
 |        are not Perl-compatible. | 
 |  | 
 |        A  lookaround  assertion  may  appear as the condition in a conditional | 
 |        group (see below). In this case, the result of matching  the  assertion | 
 |        determines which branch of the condition is followed. | 
 |  | 
 |        Assertion  groups are not capture groups. If an assertion contains cap- | 
 |        ture groups within it, these are counted for the purposes of  numbering | 
 |        the  capture  groups in the whole pattern. Within each branch of an as- | 
 |        sertion, locally captured substrings may be  referenced  in  the  usual | 
 |        way.  For  example,  a  sequence such as (.)\g{-1} can be used to check | 
 |        that two adjacent characters are the same. | 
 |  | 
 |        When a branch within an assertion fails to match, any  substrings  that | 
 |        were  captured  are  discarded (as happens with any pattern branch that | 
 |        fails to match). A  negative  assertion  is  true  only  when  all  its | 
 |        branches fail to match; this means that no captured substrings are ever | 
 |        retained  after a successful negative assertion. When an assertion con- | 
 |        tains a matching branch, what happens depends on the type of assertion. | 
 |  | 
 |        For a positive assertion, internally captured substrings  in  the  suc- | 
 |        cessful  branch are retained, and matching continues with the next pat- | 
 |        tern item after the assertion. For a  negative  assertion,  a  matching | 
 |        branch  means  that  the assertion is not true. If such an assertion is | 
 |        being used as a condition in a conditional group (see below),  captured | 
 |        substrings  are  retained,  because  matching  continues  with the "no" | 
 |        branch of the condition. For other failing negative assertions, control | 
 |        passes to the previous backtracking point, thus discarding any captured | 
 |        strings within the assertion. | 
 |  | 
 |        Most assertion groups may be repeated; though it makes no sense to  as- | 
 |        sert the same thing several times, the side effect of capturing in pos- | 
 |        itive assertions may occasionally be useful. However, an assertion that | 
 |        forms  the  condition  for  a  conditional group may not be quantified. | 
 |        PCRE2 used to restrict the repetition of assertions, but  from  release | 
 |        10.35  the  only restriction is that an unlimited maximum repetition is | 
 |        changed to be one more than the minimum. For example, {3,}  is  treated | 
 |        as {3,4}. | 
 |  | 
 |    Alphabetic assertion names | 
 |  | 
 |        Traditionally,  symbolic  sequences such as (?= and (?<= have been used | 
 |        to specify lookaround assertions. Perl 5.28 introduced some  experimen- | 
 |        tal alphabetic alternatives which might be easier to remember. They all | 
 |        start  with  (* instead of (? and must be written using lower case let- | 
 |        ters. PCRE2 supports the following synonyms: | 
 |  | 
 |          (*positive_lookahead:  or (*pla: is the same as (?= | 
 |          (*negative_lookahead:  or (*nla: is the same as (?! | 
 |          (*positive_lookbehind: or (*plb: is the same as (?<= | 
 |          (*negative_lookbehind: or (*nlb: is the same as (?<! | 
 |  | 
 |        For example, (*pla:foo) is the same assertion as (?=foo). In  the  fol- | 
 |        lowing  sections, the various assertions are described using the origi- | 
 |        nal symbolic forms. | 
 |  | 
 |    Lookahead assertions | 
 |  | 
 |        Lookahead assertions start with (?= for positive assertions and (?! for | 
 |        negative assertions. For example, | 
 |  | 
 |          \w+(?=;) | 
 |  | 
 |        matches a word followed by a semicolon, but does not include the  semi- | 
 |        colon in the match, and | 
 |  | 
 |          foo(?!bar) | 
 |  | 
 |        matches  any  occurrence  of  "foo" that is not followed by "bar". Note | 
 |        that the apparently similar pattern | 
 |  | 
 |          (?!foo)bar | 
 |  | 
 |        does not find an occurrence of "bar"  that  is  preceded  by  something | 
 |        other  than "foo"; it finds any occurrence of "bar" whatsoever, because | 
 |        the assertion (?!foo) is always true when the next three characters are | 
 |        "bar". A lookbehind assertion is needed to achieve the other effect. | 
 |  | 
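 |        The three lookahead patterns above, together with the lookbehind form | 
 |        that achieves "the other effect", can be compared in a short sketch. | 
 |        It assumes Python's re module as a stand-in for PCRE2; the subject | 
 |        strings are illustrative. | 
 |  | 
```python
import re

# A word followed by a semicolon, without the semicolon itself.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# "foo" not followed by "bar": only the second "foo" (offset 7) qualifies.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it can never reject a "bar".
bar_any = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]

# A negative lookbehind is what actually excludes "bar" preceded by "foo".
bar_not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

print(words)              # ['alpha', 'gamma']
print(foo_not_bar)        # [7]
print(bar_any)            # [3, 10]
print(bar_not_after_foo)  # [10]
```
 |  | 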
 |        If you want to force a matching failure at some point in a pattern, the | 
 |        most convenient way to do it is with (?!) because an empty  string  al- | 
 |        ways  matches,  so  an assertion that requires there not to be an empty | 
 |        string must always fail.  The backtracking control verb (*FAIL) or (*F) | 
 |        is a synonym for (?!). | 
 |  | 
 |    Lookbehind assertions | 
 |  | 
 |        Lookbehind assertions start with (?<= for positive assertions and  (?<! | 
 |        for negative assertions. For example, | 
 |  | 
 |          (?<!foo)bar | 
 |  | 
 |        does  find  an  occurrence  of "bar" that is not preceded by "foo". The | 
 |        contents of a lookbehind assertion are restricted such that there  must | 
 |        be  a known maximum to the lengths of all the strings it matches. There | 
 |        are two cases: | 
 |  | 
 |        If every top-level alternative matches a fixed length, for example | 
 |  | 
 |          (?<=colour|color) | 
 |  | 
 |        there is a limit of 65535 characters to the lengths, which do not  have | 
 |        to  be the same, as this example demonstrates. This is the only kind of | 
 |        lookbehind supported by PCRE2 versions earlier than 10.43  and  by  the | 
 |        alternative matching function pcre2_dfa_match(). | 
 |  | 
 |        In  PCRE2 10.43 and later, pcre2_match() supports lookbehind assertions | 
 |        in which one or more top-level alternatives can  match  more  than  one | 
 |        string length, for example | 
 |  | 
 |          (?<=colou?r) | 
 |  | 
 |        The maximum matching length for any branch of the lookbehind is limited | 
 |        to  a value set by the calling program (default 255 characters). Unlim- | 
 |        ited repetition (for example \d*) is not supported. In some cases,  the | 
 |        escape  sequence \K (see above) can be used instead of a lookbehind as- | 
 |        sertion at the start of a pattern to get round  the  length  limit  re- | 
 |        striction. | 
 |  | 
 |        In  UTF-8  and  UTF-16 modes, PCRE2 does not allow the \C escape (which | 
 |        matches a single code unit even in a UTF mode) to appear in  lookbehind | 
 |        assertions,  because  it makes it impossible to calculate the length of | 
 |        the lookbehind. The \X and \R escapes, which can match  different  num- | 
 |        bers of code units, are never permitted in lookbehinds. | 
 |  | 
 |        "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in | 
 |        lookbehinds, as long as the called capture  group  matches  a  limited- | 
 |        length  string. However, recursion, that is, a "subroutine" call into a | 
 |        group that is already active, is not supported. | 
 |  | 
 |        PCRE2 supports backreferences in lookbehinds, but only if certain  con- | 
 |        ditions  are met. The PCRE2_MATCH_UNSET_BACKREF option must not be set, | 
 |        there must be no use of (?| in the pattern (it creates duplicate  group | 
 |        numbers), and if the backreference is by name, the name must be unique. | 
 |        Of course, the referenced group must itself match a limited length sub- | 
 |        string.  The  following  pattern  matches words containing at least two | 
 |        characters that begin and end with the same character: | 
 |  | 
 |           \b(\w)\w++(?<=\1) | 
 |  | 
 |        Possessive quantifiers can be used in conjunction with  lookbehind  as- | 
 |        sertions  to  specify efficient matching at the end of subject strings. | 
 |        Consider a simple pattern such as | 
 |  | 
 |          abcd$ | 
 |  | 
 |        when applied to a long string that does  not  match.  Because  matching | 
 |        proceeds  from  left to right, PCRE2 will look for each "a" in the sub- | 
 |        ject and then see if what follows matches the rest of the  pattern.  If | 
 |        the pattern is specified as | 
 |  | 
 |          ^.*abcd$ | 
 |  | 
 |        the  initial .* matches the entire string at first, but when this fails | 
 |        (because there is no following "a"), it backtracks to match all but the | 
 |        last character, then all but the last two characters, and so  on.  Once | 
 |        again  the search for "a" covers the entire string, from right to left, | 
 |        so we are no better off. However, if the pattern is written as | 
 |  | 
 |          ^.*+(?<=abcd) | 
 |  | 
 |        there can be no backtracking for the .*+ item because of the possessive | 
 |        quantifier; it can match only the entire string. The subsequent lookbe- | 
 |        hind assertion does a single test on the last four  characters.  If  it | 
 |        fails,  the  match  fails  immediately. For long strings, this approach | 
 |        makes a significant difference to the processing time. | 
 |  | 
 |    Using multiple assertions | 
 |  | 
 |        Several assertions (of any sort) may occur in succession. For example, | 
 |  | 
 |          (?<=\d{3})(?<!999)foo | 
 |  | 
 |        matches "foo" preceded by three digits that are not "999". Notice  that | 
 |        each  of  the  assertions is applied independently at the same point in | 
 |        the subject string. First there is a  check  that  the  previous  three | 
 |        characters  are  all  digits,  and  then there is a check that the same | 
 |        three characters are not "999". This pattern does not match "foo" pre- | 
 |        ceded by six characters, the first three of which are digits and the | 
 |        last three of which are not "999". For example, it does not match | 
 |        "123abcfoo". A pattern to do that is | 
 |  | 
 |          (?<=\d{3}...)(?<!999)foo | 
 |  | 
 |        This time the first assertion looks at the  preceding  six  characters, | 
 |        checking that the first three are digits, and then the second assertion | 
 |        checks that the preceding three characters are not "999". | 
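 |  | 
 |        The two patterns can be compared on a few subjects. The sketch below | 
 |        assumes Python's re module as a stand-in for PCRE2 (both lookbehinds | 
 |        have a fixed length, which is all that Python supports); the subject | 
 |        strings are illustrative. | 
 |  | 
```python
import re

three_digits = re.compile(r"(?<=\d{3})(?<!999)foo")
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions are applied at the same point, immediately before "foo".
first = [bool(three_digits.search(s)) for s in subjects]
second = [bool(six_chars.search(s)) for s in subjects]

print(first)   # [True, False, False, False]
print(second)  # [False, False, True, False]
```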
 |  | 
 |        Assertions can be nested in any combination. For example, | 
 |  | 
 |          (?<=(?<!foo)bar)baz | 
 |  | 
 |        matches  an occurrence of "baz" that is preceded by "bar" which in turn | 
 |        is not preceded by "foo", while | 
 |  | 
 |          (?<=\d{3}(?!999)...)foo | 
 |  | 
 |        is another pattern that matches "foo" preceded by three digits and  any | 
 |        three characters that are not "999". | 
 |  | 
 |  | 
 | NON-ATOMIC ASSERTIONS | 
 |  | 
 |        Traditional  lookaround assertions are atomic. That is, if an assertion | 
 |        is true, but there is a subsequent matching failure, there is no  back- | 
 |        tracking  into  the assertion. However, there are some cases where non- | 
 |        atomic positive assertions can be useful. PCRE2  provides  these  using | 
 |        the following syntax: | 
 |  | 
 |          (*non_atomic_positive_lookahead:  or (*napla: or (?* | 
 |          (*non_atomic_positive_lookbehind: or (*naplb: or (?<* | 
 |  | 
 |        Consider  the  problem  of finding the right-most word in a string that | 
 |        also appears earlier in the string, that is, it must  appear  at  least | 
 |        twice  in  total.  This pattern returns the required result as captured | 
 |        substring 1: | 
 |  | 
 |          ^(?x)(*napla: .* \b(\w++)) (?> .*? \b\1\b ){2} | 
 |  | 
 |        For a subject such as "word1 word2 word3 word2 word3 word4" the  result | 
 |        is  "word3".  How does it work? At the start, ^(?x) anchors the pattern | 
 |        and sets the "x" option, which causes white space (introduced for read- | 
 |        ability) to be ignored. Inside the assertion, the greedy  .*  at  first | 
 |        consumes the entire string, but then has to backtrack until the rest of | 
 |        the  assertion can match a word, which is captured by group 1. In other | 
 |        words, when the assertion first succeeds, it  captures  the  right-most | 
 |        word in the string. | 
 |  | 
 |        The  current  matching point is then reset to the start of the subject, | 
 |        and the rest of the pattern match checks for  two  occurrences  of  the | 
 |        captured  word,  using  an  ungreedy .*? to scan from the left. If this | 
 |        succeeds, we are done, but if the last word in the string does not  oc- | 
 |        cur  twice,  this  part  of  the pattern fails. If a traditional atomic | 
 |        lookahead (?= or (*pla: had been used, the assertion could not  be  re- | 
 |        entered, and the whole match would fail. The pattern would succeed only | 
 |        if the very last word in the subject was found twice. | 
 |  | 
 |        Using  a  non-atomic  lookahead, however, means that when the last word | 
 |        does not occur twice in the string, the  lookahead  can  backtrack  and | 
 |        find  the second-last word, and so on, until either the match succeeds, | 
 |        or all words have been tested. | 
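 |  | 
 |        The search that the pattern performs can be restated procedurally. | 
 |        The Python sketch below is not how PCRE2 implements the assertion, | 
 |        and the name rightmost_repeated_word is hypothetical; it only mirrors | 
 |        the order in which candidate words are tried, from the right-most | 
 |        word backwards. | 
 |  | 
```python
import re
from typing import Optional


def rightmost_repeated_word(subject: str) -> Optional[str]:
    """Return the right-most word that occurs at least twice, if any."""
    words = re.findall(r"\w+", subject)
    # Each step of this loop corresponds to one backtrack into the lookahead,
    # which makes group 1 capture the next word to the left.
    for candidate in reversed(words):
        # The rest of the pattern then checks for two whole-word occurrences.
        if words.count(candidate) >= 2:
            return candidate
    return None


print(rightmost_repeated_word("word1 word2 word3 word2 word3 word4"))  # word3
```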
 |  | 
 |        Two conditions must be met for a non-atomic assertion to be useful: the | 
 |        contents of one or more capturing groups must change after a  backtrack | 
 |        into  the  assertion,  and  there  must be a backreference to a changed | 
 |        group later in the pattern. If this is not the case, the  rest  of  the | 
 |        pattern  match  fails exactly as before because nothing has changed, so | 
 |        using a non-atomic assertion just wastes resources. | 
 |  | 
 |        There is one exception to backtracking into a non-atomic assertion.  If | 
 |        an  (*ACCEPT)  control verb is triggered, the assertion succeeds atomi- | 
 |        cally. That is, a subsequent match failure cannot  backtrack  into  the | 
 |        assertion. | 
 |  | 
 |        Non-atomic  assertions  are  not  supported by the alternative matching | 
 |        function pcre2_dfa_match(). They are supported by JIT, but only if they | 
 |        do not contain any control verbs such as (*ACCEPT). (This may change in | 
 |        the future.) Note that assertions that appear as conditions for | 
 |        conditional groups (see below) must be atomic. | 
 |  | 
 |  | 
 | SCAN SUBSTRING ASSERTIONS | 
 |  | 
 |        A special kind of assertion, not compatible with Perl, makes it  possi- | 
 |        ble to check the contents of a captured substring by matching it with a | 
 |        subpattern.   Because this involves capturing, this feature is not sup- | 
 |        ported by pcre2_dfa_match(). | 
 |  | 
 |        A scan substring assertion starts with the  sequence  (*scan_substring: | 
 |        or (*scs: which is followed by a list of substring numbers (absolute or | 
 |        relative)  and/or  substring  names  enclosed in single quotes or angle | 
 |        brackets, all within parentheses. The rest of the item is  the  subpat- | 
 |        tern that is applied to the substring, as shown in these examples: | 
 |  | 
 |          (*scan_substring:(1)...) | 
 |          (*scs:(-2)...) | 
 |          (*scs:('AB')...) | 
 |          (*scs:(1,'AB',-2)...) | 
 |  | 
 |        The  list  of  groups is checked in the order they are given, and it is | 
 |        the contents of the first one that is found to be set that are scanned. | 
 |        When PCRE2_DUPNAMES is set and there are  ambiguous  group  names,  all | 
 |        groups  with  the same name are checked in numerical order. A scan sub- | 
 |        string assertion fails if none of the groups it  references  have  been | 
 |        set. | 
 |  | 
 |        The pattern match on the substring is always anchored, that is, it must | 
 |        match  from  the  start of the substring. There is no "bumpalong" if it | 
 |        does not match at the start. The end of the subject is temporarily  re- | 
 |        set  to be the end of the substring, so \Z, \z, and $ will match there. | 
 |        However, the start of the subject is  not  reset.  This  means  that  ^ | 
 |        matches only if the substring is actually at the start of the main sub- | 
 |        ject,  but  it also means that lookbehind assertions into what precedes | 
 |        the substring are possible. | 
 |  | 
 |        Here is a very simple example: find a word that contains the  rare  (in | 
 |        English) sequence of letters "rh" not at the start: | 
 |  | 
 |          \b(\w++)(*scs:(1).+rh) | 
 |  | 
 |        The  first  group  captures  a word which is then scanned by the second | 
 |        group.  This example does not actually need this  heavyweight  feature; | 
 |        the same match can be achieved with: | 
 |  | 
 |          \b\w+?rh\w*\b | 
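 |  | 
 |        The effect of this example can be approximated  outside  PCRE2  in  two | 
 |        steps.  The  following  Python sketch uses the standard re module, which | 
 |        has no scan substring feature; the function name is hypothetical.  Each | 
 |        word  is  found  first,  and re.match() then applies the subpattern to | 
 |        it, anchored at the start of the word with no bumpalong: | 
 |  | 
```python
import re


def words_with_inner_rh(text):
    hits = []
    for found in re.finditer(r"\b\w+", text):
        word = found.group()
        # re.match() only tries position 0 of the word, like the
        # anchored match that a scan substring assertion performs.
        if re.match(r".+rh", word):
            hits.append(word)
    return hits
```
 |  | 
 |        Unlike  a  real scan substring assertion, this sketch treats each word | 
 |        as a separate string, so ^ and lookbehinds in the subpattern would not | 
 |        see the surrounding subject. | 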
 |  | 
 |        When  things  are  more  complicated, however, scanning a captured sub- | 
 |        string can be a useful way to describe the required match. For example, | 
 |        there  is  a  rather  complicated  pattern  in the PCRE2 test data that | 
 |        checks an entire subject string for a palindrome, that is, the sequence | 
 |        of letters is the same in both directions. Suppose you want  to  search | 
 |        for individual words of two or more characters such as "level" that are | 
 |        palindromes: | 
 |  | 
 |          (\b\w{2,}+\b)(*scs:(1)...palindrome-matching-pattern...) | 
 |  | 
 |        Within a substring scanning subpattern, references to other groups work | 
 |        as  normal.  Capturing  groups may appear, and will retain their values | 
 |        during ongoing matching if the assertion succeeds. | 
 |  | 
 |  | 
 | SCRIPT RUNS | 
 |  | 
 |        In concept, a script run is a sequence of characters that are all  from | 
 |        the  same  Unicode script such as Latin or Greek. However, because some | 
 |        scripts are commonly used together, and because  some  diacritical  and | 
 |        other  marks  are  used  with  multiple scripts, it is not that simple. | 
 |        There is a full description of the rules that PCRE2 uses in the section | 
 |        entitled "Script Runs" in the pcre2unicode documentation. | 
 |  | 
 |        If part of a pattern is enclosed between (*script_run: or (*sr:  and  a | 
 |        closing  parenthesis,  it  fails  if the sequence of characters that it | 
 |        matches is not a script run. After a failure, normal  backtracking  oc- | 
 |        curs.  Script runs can be used to detect spoofing attacks using charac- | 
 |        ters that look the same, but are from  different  scripts.  The  string | 
 |        "paypal.com"  is an infamous example, where the letters could be a mix- | 
 |        ture of Latin and Cyrillic. This pattern ensures that the matched char- | 
 |        acters in a sequence of non-spaces that follow white space are a script | 
 |        run: | 
 |  | 
 |          \s+(*sr:\S+) | 
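 |  | 
 |        As a rough illustration of the idea only, the following  Python  sketch | 
 |        flags  text  whose  letters come from more than one script. It is a crude | 
 |        approximation: it guesses the script from the first word of each  char- | 
 |        acter's  Unicode  name,  whereas  PCRE2  applies the rules described in | 
 |        pcre2unicode. The function names are hypothetical. | 
 |  | 
```python
import unicodedata


def crude_script(ch):
    # Assumption: the first word of the character name names the script,
    # e.g. "LATIN SMALL LETTER A" or "CYRILLIC SMALL LETTER A".
    return unicodedata.name(ch).split()[0]


def looks_like_script_run(text):
    # Only letters are compared; digits and punctuation are skipped,
    # loosely mirroring characters that are allowed with any script.
    scripts = {crude_script(ch) for ch in text if ch.isalpha()}
    return len(scripts) <= 1


# "\u0430" is CYRILLIC SMALL LETTER A, which looks like a Latin "a".
spoofed = "p\u0430ypal.com"
```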
 |  | 
 |        To be sure that they are all from the Latin  script  (for  example),  a | 
 |        lookahead can be used: | 
 |  | 
 |          \s+(?=\p{Latin})(*sr:\S+) | 
 |  | 
 |        This works as long as the first character is expected to be a character | 
 |        in  that  script,  and  not (for example) punctuation, which is allowed | 
 |        with any script. If this is not the case, a more creative lookahead  is | 
 |        needed.  For  example, if digits, underscore, and dots are permitted at | 
 |        the start: | 
 |  | 
 |          \s+(?=[0-9_.]*\p{Latin})(*sr:\S+) | 
 |  | 
 |  | 
 |        In many cases, backtracking into a script run pattern fragment  is  not | 
 |        desirable.  The  script run can employ an atomic group to prevent this. | 
 |        Because this is a common requirement, a shorthand notation is  provided | 
 |        by (*atomic_script_run: or (*asr: | 
 |  | 
 |          (*asr:...) is the same as (*sr:(?>...)) | 
 |  | 
 |        Note that the atomic group is inside the script run. Putting it outside | 
 |        would not prevent backtracking into the script run pattern. | 
 |  | 
 |        Support  for  script runs is not available if PCRE2 is compiled without | 
 |        Unicode support. A compile-time error is given if any of the above con- | 
 |        structs is encountered. Script runs are not supported by the  alternate | 
 |        matching  function,  pcre2_dfa_match() because they use the same mecha- | 
 |        nism as capturing parentheses. | 
 |  | 
 |        Warning: The (*ACCEPT) control verb (see  below)  should  not  be  used | 
 |        within a script run group, because it causes an immediate exit from the | 
 |        group, bypassing the script run checking. | 
 |  | 
 |  | 
 | CONDITIONAL GROUPS | 
 |  | 
 |        It is possible to cause the matching process to obey a pattern fragment | 
 |        conditionally or to choose between two alternative fragments, depending | 
 |        on  the result of an assertion, or whether a specific capture group has | 
 |        already been matched. The two possible forms of conditional group are: | 
 |  | 
 |          (?(condition)yes-pattern) | 
 |          (?(condition)yes-pattern|no-pattern) | 
 |  | 
 |        If the condition is satisfied, the yes-pattern is used;  otherwise  the | 
 |        no-pattern  (if present) is used. An absent no-pattern is equivalent to | 
 |        an empty string (it always matches). If there are more than two  alter- | 
 |        natives  in the group, a compile-time error occurs. Each of the two al- | 
 |        ternatives may itself contain nested groups of any form, including con- | 
 |        ditional groups; the restriction to two alternatives  applies  only  at | 
 |        the  level of the condition itself. This pattern fragment is an example | 
 |        where the alternatives are complex: | 
 |  | 
 |          (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) | 
 |  | 
 |  | 
 |        There are five kinds of condition: references to capture groups, refer- | 
 |        ences to recursion, two pseudo-conditions called  DEFINE  and  VERSION, | 
 |        and assertions. | 
 |  | 
 |    Checking for a used capture group by number | 
 |  | 
 |        If  the  text between the parentheses consists of a sequence of digits, | 
 |        the condition is true if a capture group of that number has  previously | 
 |        matched.  If  there is more than one capture group with the same number | 
 |        (see the earlier section about duplicate group numbers), the  condition | 
 |        is  true if any of them have matched. An alternative notation, which is | 
 |        a PCRE2 extension, not supported by Perl, is to precede the digits with | 
 |        a plus or minus sign. In this case, the group number is relative rather | 
 |        than absolute. The most recently opened capture group (which  could  be | 
 |        enclosing  this  condition)  can be referenced by (?(-1), the next most | 
 |        recent by (?(-2), and so on. Inside loops it can also make sense to re- | 
 |        fer to subsequent groups.  The next capture group to be opened  can  be | 
 |        referenced  as  (?(+1), and so on. The value zero in any of these forms | 
 |        is not used; it provokes a compile-time error. | 
 |  | 
 |        Consider the following pattern, which  contains  non-significant  white | 
 |        space  to  make it more readable (assume the PCRE2_EXTENDED option) and | 
 |        to divide it into three parts for ease of discussion: | 
 |  | 
 |          ( \( )?    [^()]+    (?(1) \) ) | 
 |  | 
 |        The first part matches an optional opening  parenthesis,  and  if  that | 
 |        character is present, sets it as the first captured substring. The sec- | 
 |        ond  part  matches one or more characters that are not parentheses. The | 
 |        third part is a conditional group that tests whether or not  the  first | 
 |        capture group matched. If it did, that is, if the subject started with an | 
 |        opening parenthesis, the condition is true, and so the  yes-pattern  is | 
 |        executed  and  a  closing parenthesis is required. Otherwise, since no- | 
 |        pattern is not present, the conditional group matches nothing. In other | 
 |        words, this pattern matches a sequence of  non-parentheses,  optionally | 
 |        enclosed in parentheses. | 
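 |  | 
 |        Conditions that test a numbered group are also  available  in  Python's | 
 |        re  module,  so  the pattern above can be tried there as a sketch. This | 
 |        shows the construct with a different regex engine,  not  PCRE2  itself; | 
 |        re.VERBOSE  stands  in  for  PCRE2_EXTENDED,  and  the function name is | 
 |        hypothetical. | 
 |  | 
```python
import re

# The same three-part pattern; white space is ignored under re.VERBOSE.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_well_formed(subject):
    # fullmatch() requires the whole subject to be consumed, so a stray
    # parenthesis at either end makes the match fail.
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```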
 |  | 
 |        If  you  were  embedding  this pattern in a larger one, you could use a | 
 |        relative reference: | 
 |  | 
 |          ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ... | 
 |  | 
 |        This makes the fragment independent of the parentheses  in  the  larger | 
 |        pattern. | 
 |  | 
 |    Checking for a used capture group by name | 
 |  | 
 |        Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a | 
 |        used capture group by name. For compatibility with earlier versions  of | 
 |        PCRE1,  which had this facility before Perl, the syntax (?(name)...) is | 
 |        also recognized.  Note, however, that undelimited names  consisting  of | 
 |        the  letter  R followed by digits are ambiguous (see the following sec- | 
 |        tion). Rewriting the above example to use a named group gives this: | 
 |  | 
 |          (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) ) | 
 |  | 
 |        If the name used in a condition of this kind is a duplicate,  the  test | 
 |        is  applied  to  all groups of the same name, and is true if any one of | 
 |        them has matched. | 
 |  | 
 |    Checking for pattern recursion | 
 |  | 
 |        "Recursion" in this sense refers to any subroutine-like call  from  one | 
 |        part  of  the  pattern to another, whether or not it is actually recur- | 
 |        sive. See the sections entitled "Recursive  patterns"  and  "Groups  as | 
 |        subroutines" below for details of recursion and subroutine calls. | 
 |  | 
 |        If  a  condition  is the string (R), and there is no capture group with | 
 |        the name R, the condition is true if matching is currently in a  recur- | 
 |        sion  or  subroutine call to the whole pattern or any capture group. If | 
 |        digits follow the letter R, and there is no group with that  name,  the | 
 |        condition  is  true  if  the  most recent call is into a group with the | 
 |        given number, which must exist somewhere in the overall  pattern.  This | 
 |        is a contrived example that is equivalent to a+b: | 
 |  | 
 |          ((?(R1)a+|(?1)b)) | 
 |  | 
 |        However,  in  both  cases,  if there is a capture group with a matching | 
 |        name, the condition tests for its being set, as described in  the  sec- | 
 |        tion  above,  instead of testing for recursion. For example, creating a | 
 |        group with the name R1 by adding (?<R1>)  to  the  above  pattern  com- | 
 |        pletely changes its meaning. | 
 |  | 
 |        If a name preceded by ampersand follows the letter R, for example: | 
 |  | 
 |          (?(R&name)...) | 
 |  | 
 |        the  condition  is true if the most recent recursion is into a group of | 
 |        that name (which must exist within the pattern). | 
 |  | 
 |        This condition does not check the entire recursion stack. It tests only | 
 |        the current level. If the name used in a condition of this  kind  is  a | 
 |        duplicate,  the  test is applied to all groups of the same name, and is | 
 |        true if any one of them is the most recent recursion. | 
 |  | 
 |        At "top level", all these recursion test conditions are false. | 
 |  | 
 |    Defining capture groups for use by reference only | 
 |  | 
 |        If the condition is the string (DEFINE), the condition is always false, | 
 |        even if there is a group with the name DEFINE. In this case, there  may | 
 |        be only one alternative in the rest of the conditional group. It is al- | 
 |        ways  skipped if control reaches this point in the pattern; the idea of | 
 |        DEFINE is that it can be used to define subroutines that can be  refer- | 
 |        enced  from elsewhere. (The use of subroutines is described below.) For | 
 |        example, a pattern to match an IPv4 address  such  as  "192.168.23.245" | 
 |        could be written like this (ignore white space and line breaks): | 
 |  | 
 |          (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) | 
 |          \b (?&byte) (\.(?&byte)){3} \b | 
 |  | 
 |        The  first  part  of the pattern is a DEFINE group inside which another | 
 |        group named "byte" is defined. This matches an individual component  of | 
 |        an  IPv4  address  (a number less than 256). When matching takes place, | 
 |        this part of the pattern is skipped because DEFINE acts  like  a  false | 
 |        condition.  The  rest of the pattern uses references to the named group | 
 |        to match the four dot-separated components of an IPv4 address,  insist- | 
 |        ing on a word boundary at each end. | 
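 |  | 
 |        In an engine without DEFINE, a similar effect can  be  approximated  by | 
 |        building  the  pattern from a shared fragment. The following Python sketch | 
 |        does this with the standard re module; it is  a  textual  substitution, | 
 |        not a subroutine call, and the names BYTE and IPV4 are hypothetical. | 
 |  | 
```python
import re

# One component of an IPv4 address (a number less than 256), written as
# a non-capturing group so that it can be pasted in more than once.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Four dot-separated components with a word boundary at each end.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")
```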
 |  | 
 |    Checking the PCRE2 version | 
 |  | 
 |        Programs  that link with a PCRE2 library can check the version by call- | 
 |        ing pcre2_config() with appropriate arguments.  Users  of  applications | 
 |        that  do  not have access to the underlying code cannot do this. A spe- | 
 |        cial "condition" called VERSION exists to allow such users to  discover | 
 |        which version of PCRE2 they are dealing with by using this condition to | 
 |        match  a string such as "yesno". VERSION must be followed either by "=" | 
 |        or ">=" and a version number.  For example: | 
 |  | 
 |          (?(VERSION>=10.4)yes|no) | 
 |  | 
 |        This pattern matches "yes" if the PCRE2 version is greater than or equal | 
 |        to 10.4, or "no" otherwise. The fractional part of the  version  number | 
 |        may be omitted. | 
 |  | 
 |    Assertion conditions | 
 |  | 
 |        If the condition is not in any of the  above  formats,  it  must  be  a | 
 |        parenthesized  assertion.  This may be a positive or negative lookahead | 
 |        or lookbehind assertion. However, it must be a traditional  atomic  as- | 
 |        sertion, not one of the non-atomic assertions. | 
 |  | 
 |        Consider  this  pattern,  again containing non-significant white space, | 
 |        and with the two alternatives on the second line: | 
 |  | 
 |          (?(?=[^a-z]*[a-z]) | 
 |          \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} ) | 
 |  | 
 |        The condition is a positive lookahead assertion  that  matches  an  op- | 
 |        tional sequence of non-letters followed by a letter. In other words, it | 
 |        tests for the presence of at least one letter in the subject. If a let- | 
 |        ter  is  found,  the  subject is matched against the first alternative; | 
 |        otherwise it is  matched  against  the  second.  This  pattern  matches | 
 |        strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are | 
 |        letters and dd are digits. | 
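 |  | 
 |        The way the condition selects exactly one alternative can  be  sketched | 
 |        in  Python,  whose  re  module does not accept assertions as conditions. | 
 |        The hypothetical function below evaluates the lookahead  separately  at | 
 |        the  start  of  the  subject  and  then tries only the chosen alternative | 
 |        there: | 
 |  | 
```python
import re


def match_date(subject):
    # The condition: is there at least one letter in the subject?
    has_letter = re.match(r"[^a-z]*[a-z]", subject) is not None
    # Exactly one alternative is tried, chosen by the condition.
    if has_letter:
        chosen = r"\d{2}-[a-z]{3}-\d{2}"
    else:
        chosen = r"\d{2}-\d{2}-\d{2}"
    return re.match(chosen, subject)
```
 |  | 
 |        For a subject such as "12-08-99x", the condition is true because of the | 
 |        trailing letter, so only the first alternative is tried and the  func- | 
 |        tion returns None. | 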
 |  | 
 |        When an assertion that is a condition contains capture groups, any cap- | 
 |        turing that occurs in a matching branch  is  retained  afterwards,  for | 
 |        both  positive and negative assertions, because matching always contin- | 
 |        ues after the assertion, whether it succeeds or  fails.  (Compare  non- | 
 |        conditional  assertions, for which captures are retained only for posi- | 
 |        tive assertions that succeed.) | 
 |  | 
 |  | 
 | COMMENTS | 
 |  | 
 |        There are two ways of including comments in patterns that are processed | 
 |        by PCRE2. In both cases, the start of the comment  must  not  be  in  a | 
 |        character  class,  nor  in  the middle of any other sequence of related | 
 |        characters such as (?: or a group name or number or a Unicode  property | 
 |        name. The characters that make up a comment play no part in the pattern | 
 |        matching. | 
 |  | 
 |        The  sequence (?# marks the start of a comment that continues up to the | 
 |        next closing parenthesis. Nested parentheses are not permitted. If  the | 
 |        PCRE2_EXTENDED  or  PCRE2_EXTENDED_MORE  option  is set, an unescaped # | 
 |        character also introduces a comment, which in this  case  continues  to | 
 |        immediately  after  the next newline character or character sequence in | 
 |        the pattern. Which characters are interpreted as newlines is controlled | 
 |        by an option passed to the compiling function or by a special  sequence | 
 |        at the start of the pattern, as described in the section entitled "New- | 
 |        line conventions" above. Note that the end of this type of comment is a | 
 |        literal  newline  sequence in the pattern; escape sequences that happen | 
 |        to represent a newline do not count. For example, consider this pattern | 
 |        when PCRE2_EXTENDED is set, and the default newline convention (a  sin- | 
 |        gle linefeed character) is in force: | 
 |  | 
 |          abc #comment \n still comment | 
 |  | 
 |        On  encountering  the # character, pcre2_compile() skips along, looking | 
 |        for a newline in the pattern. The sequence \n is still literal at  this | 
 |        stage,  so  it does not terminate the comment. Only an actual character | 
 |        with the code value 0x0a (the default newline) does so. | 
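 |  | 
 |        Python's re module treats # comments in its verbose mode in  a  similar | 
 |        way,  so  the  distinction can be sketched there. This is an analogy using | 
 |        a different engine (re.VERBOSE rather than  PCRE2_EXTENDED),  and  the | 
 |        variable names are hypothetical. | 
 |  | 
```python
import re

# Raw string: the pattern holds a backslash followed by "n", not a
# newline, so the comment swallows the rest and only "abc" remains.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed character here, so the
# comment ends there and "still" is part of the pattern again.
literal = re.compile("abc #comment \n still", re.VERBOSE)

# The (?#...) form needs no option and ends at the next ")".
inline = re.compile(r"ab(?#a note)c")
```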
 |  | 
 |  | 
 | RECURSIVE PATTERNS | 
 |  | 
 |        Consider the problem of matching a string in parentheses, allowing  for | 
 |        unlimited  nested  parentheses.  Without the use of recursion, the best | 
 |        that can be done is to use a pattern that  matches  up  to  some  fixed | 
 |        depth  of  nesting.  It  is not possible to handle an arbitrary nesting | 
 |        depth. | 
 |  | 
 |        For some time, Perl has provided a facility that allows regular expres- | 
 |        sions to recurse (amongst other things). It does this by  interpolating | 
 |        Perl  code in the expression at run time, and the code can refer to the | 
 |        expression itself. A Perl pattern using code interpolation to solve the | 
 |        parentheses problem can be created like this: | 
 |  | 
 |          $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; | 
 |  | 
 |        The (?p{...}) item interpolates Perl code at run time, and in this case | 
 |        refers recursively to the pattern in which it appears. | 
 |  | 
 |        Obviously, PCRE2 cannot support the interpolation  of  Perl  code.  In- | 
 |        stead,  it supports special syntax for recursion of the entire pattern, | 
 |        and also for individual capture group recursion. After its introduction | 
 |        in PCRE1 and Python, this kind of recursion was subsequently introduced | 
 |        into Perl at release 5.10. | 
 |  | 
 |        A special item that consists of (? followed by a  number  greater  than | 
 |        zero  and  a  closing parenthesis is a recursive subroutine call of the | 
 |        capture group of the given number, provided that it occurs inside  that | 
 |        group.  (If  not,  it  is a non-recursive subroutine call, which is de- | 
 |        scribed in the next section.) The special item (?R) or (?0) is a recur- | 
 |        sive call of the entire regular expression. | 
 |  | 
 |        This PCRE2 pattern solves the nested parentheses  problem  (assume  the | 
 |        PCRE2_EXTENDED option is set so that white space is ignored): | 
 |  | 
 |          \( ( [^()]++ | (?R) )* \) | 
 |  | 
 |        First  it matches an opening parenthesis. Then it matches any number of | 
 |        substrings which can either be a sequence of non-parentheses, or a  re- | 
 |        cursive match of the pattern itself (that is, a correctly parenthesized | 
 |        substring).   Finally there is a closing parenthesis. Note the use of a | 
 |        possessive quantifier to avoid  backtracking  into  sequences  of  non- | 
 |        parentheses. | 
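 |  | 
 |        What the recursive pattern does can be sketched as a small hand-written | 
 |        matcher.  The  following  Python function is an illustration only, with a | 
 |        hypothetical name; it mirrors the three parts of the pattern and  calls | 
 |        itself  where the pattern has (?R). Like the possessive quantifier, it | 
 |        never gives back non-parentheses that it has consumed. | 
 |  | 
```python
def match_parens(subject, pos=0):
    """Return the end index of a balanced item starting at pos, or None."""
    # First an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        if subject[pos] == ")":
            # Finally a closing parenthesis.
            return pos + 1
        if subject[pos] == "(":
            # Counterpart of (?R): a nested, correctly parenthesized item.
            end = match_parens(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # Counterpart of [^()]++ : consume a run of non-parentheses.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None
```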
 |  | 
 |        If  this  were  part of a larger pattern, you would not want to recurse | 
 |        the entire pattern, so instead you could use this: | 
 |  | 
 |          ( \( ( [^()]++ | (?1) )* \) ) | 
 |  | 
 |        We have put the pattern into parentheses, and caused the  recursion  to | 
 |        refer to them instead of the whole pattern. | 
 |  | 
 |        In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be | 
 |        tricky. This is made easier by the use of relative references.  Instead | 
 |        of (?1) in the pattern above you can write (?-2) to refer to the second | 
 |        most  recently  opened  parentheses  preceding  the recursion. In other | 
 |        words, a negative number counts capturing  parentheses  leftwards  from | 
 |        the point at which it is encountered. | 
 |  | 
 |        Be aware, however, that if duplicate capture group numbers are in  use, | 
 |        relative references refer to the earliest group  with  the  appropriate | 
 |        number. Consider, for example: | 
 |  | 
 |          (?|(a)|(b)) (c) (?-2) | 
 |  | 
 |        The first two capture groups (a) and (b) are both numbered 1, and group | 
 |        (c)  is  number  2. When the reference (?-2) is encountered, the second | 
 |        most  recently  opened  group  has the number 1, but it is the  first | 
 |        such group (the (a) group) to which the recursion refers. This would be | 
 |        the  same if an absolute reference (?1) was used. In other words, rela- | 
 |        tive references are just a shorthand for computing a group number. | 
 |  | 
 |        It is also possible to refer to subsequent capture groups,  by  writing | 
 |        references  such  as  (?+2). However, these cannot be recursive because | 
 |        the reference is not inside the parentheses that are  referenced.  They | 
 |        are  always  non-recursive  subroutine  calls, as described in the next | 
 |        section. | 
 |  | 
 |        An alternative approach is to use named parentheses.  The  Perl  syntax | 
 |        for  this  is  (?&name);  PCRE1's earlier syntax (?P>name) is also sup- | 
 |        ported. We could rewrite the above example as follows: | 
 |  | 
 |          (?<pn> \( ( [^()]++ | (?&pn) )* \) ) | 
 |  | 
 |        If there is more than one group with the same name, the earliest one is | 
 |        used. | 
 |  | 
 |        The example pattern that we have been looking at contains nested unlim- | 
 |        ited repeats, and so the use of a possessive  quantifier  for  matching | 
 |        strings  of  non-parentheses  is important when applying the pattern to | 
 |        strings that do not match. For example, when this pattern is applied to | 
 |  | 
 |          (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() | 
 |  | 
 |        it yields "no match" quickly. However, if a  possessive  quantifier  is | 
 |        not  used, the match runs for a very long time indeed because there are | 
 |        so many different ways the + and * repeats can carve  up  the  subject, | 
 |        and all have to be tested before failure can be reported. | 
 |  | 
 |        At  the  end  of a match, the values of capturing parentheses are those | 
 |        from the outermost level. If you want to obtain intermediate values,  a | 
 |        callout function can be used (see below and the pcre2callout documenta- | 
 |        tion). If the pattern above is matched against | 
 |  | 
 |          (ab(cd)ef) | 
 |  | 
 |        the  value  for  the  inner capturing parentheses (numbered 2) is "ef", | 
 |        which is the last value taken on at the top level. If a  capture  group | 
 |        is  not  matched  at  the top level, its final captured value is unset, | 
 |        even if it was (temporarily) set at a deeper level during the  matching | 
 |        process. | 
 |  | 
 |        Do  not  confuse  the (?R) item with the condition (R), which tests for | 
 |        recursion.  Consider this pattern, which matches text in  angle  brack- | 
 |        ets,  allowing for arbitrary nesting. Only digits are allowed in nested | 
 |        brackets (that is, when recursing), whereas any characters are  permit- | 
 |        ted at the outer level. | 
 |  | 
 |          < (?: (?(R) \d++  | [^<>]*+) | (?R)) * > | 
 |  | 
 |        In  this  pattern,  (?(R) is the start of a conditional group, with two | 
 |        different alternatives for the recursive and non-recursive  cases.  The | 
 |        (?R) item is the actual recursive call. | 
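 |  | 
 |        The two roles can be separated in a hand-written sketch. In the follow- | 
 |        ing Python function (hypothetical, and matching only at the given posi- | 
 |        tion), the recursing flag plays the part of the (R) condition, and  the | 
 |        call of the function from inside itself plays the part of (?R): | 
 |  | 
```python
import re

DIGITS = re.compile(r"\d+")
NOT_BRACKETS = re.compile(r"[^<>]+")


def match_angle(subject, pos=0, recursing=False):
    """Return the end index of a bracketed item starting at pos, or None."""
    if not subject.startswith("<", pos):
        return None
    pos += 1
    while not subject.startswith(">", pos):
        if subject.startswith("<", pos):
            # Counterpart of the (?R) item: the actual recursive call.
            end = match_angle(subject, pos, recursing=True)
        else:
            # Counterpart of the (?(R)...) condition: choose a branch.
            found = (DIGITS if recursing else NOT_BRACKETS).match(subject, pos)
            end = found.end() if found else None
        if end is None:
            return None
        pos = end
    return pos + 1
```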
 |  | 
 |    Differences in recursion processing between PCRE2 and Perl | 
 |  | 
 |        Some former differences between PCRE2 and Perl no longer exist. | 
 |  | 
 |        Before  release 10.30, recursion processing in PCRE2 differed from Perl | 
 |        in that a recursive subroutine call was always  treated  as  an  atomic | 
 |        group.  That is, once it had matched some of the subject string, it was | 
 |        never re-entered, even if it contained untried alternatives  and  there | 
 |        was  a  subsequent matching failure. (Historical note: PCRE implemented | 
 |        recursion before Perl did.) | 
 |  | 
 |        Starting with release 10.30, recursive subroutine calls are  no  longer | 
 |        treated as atomic. That is, they can be re-entered to try unused alter- | 
 |        natives  if  there  is a matching failure later in the pattern. This is | 
 |        now compatible with the way Perl works. If you want a  subroutine  call | 
 |        to be atomic, you must explicitly enclose it in an atomic group. | 
 |  | 
 |        Supporting backtracking into recursions simplifies certain types of re- | 
 |        cursive pattern. For example, this pattern matches palindromic strings: | 
 |  | 
 |          ^((.)(?1)\2|.?)$ | 
 |  | 
 |        The  second  branch  in the group matches a single central character in | 
 |        the palindrome when there are an odd number of characters,  or  nothing | 
 |        when  there  are  an even number of characters, but in order to work it | 
 |        has to be able to try the second case when  the  rest  of  the  pattern | 
 |        match fails. If you want to match typical palindromic phrases, the pat- | 
 |        tern  has  to  ignore  all  non-word characters, which can be done like | 
 |        this: | 
 |  | 
 |          ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$ | 
 |  | 
 |        If run with the PCRE2_CASELESS option,  this  pattern  matches  phrases | 
 |        such  as "A man, a plan, a canal: Panama!". Note the use of the posses- | 
 |        sive quantifier *+ to avoid backtracking  into  sequences  of  non-word | 
 |        characters. Without this, PCRE2 takes a great deal longer (ten times or | 
 |        more)  to  match typical phrases, and Perl takes so long that you think | 
 |        it has gone into a loop. | 
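 |  | 
 |        The  need  to re-enter the recursive call can be modelled with Python | 
 |        generators. This is a sketch under stated assumptions, not  how  PCRE2 | 
 |        is  implemented:  the hypothetical group1() yields every end position | 
 |        that group 1 of the first palindrome pattern above could reach,  so  a | 
 |        caller  whose  \2  check  fails simply asks for the next one. Unlike the | 
 |        dot, it does not treat newlines specially. | 
 |  | 
```python
def group1(subject, pos):
    """Yield each end position at which group 1 can match from pos."""
    if pos < len(subject):
        # First alternative: (.)(?1)\2
        first = subject[pos]
        for middle_end in group1(subject, pos + 1):
            # Each value from the inner call is one way the recursion can
            # match; asking for the next one is "backtracking into" it.
            if subject.startswith(first, middle_end):
                yield middle_end + 1
        # Second alternative, .? matching one character.
        yield pos + 1
    # Second alternative, .? matching nothing.
    yield pos


def is_palindrome(subject):
    # The ^ and $ anchors: start at 0 and insist on reaching the end.
    return any(end == len(subject) for end in group1(subject, 0))
```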
 |  | 
 |        Another way in which PCRE2 and Perl used to differ in  their  recursion | 
 |        processing  is  in  the  handling of captured values. Formerly in Perl, | 
 |        when a group was called recursively or as a subroutine  (see  the  next | 
 |        section), it had no access to any values that were captured outside the | 
 |        recursion,  whereas  in  PCRE2 these values can be referenced. Consider | 
 |        this pattern: | 
 |  | 
 |          ^(.)(\1|a(?2)) | 
 |  | 
 |        This pattern matches "bab". The first capturing parentheses match  "b", | 
 |        then in the second group, when the backreference \1 fails to match "b", | 
 |        the second alternative matches "a" and then recurses. In the recursion, | 
 |        \1  does now match "b" and so the whole match succeeds. This match used | 
 |        to fail in Perl, but it works in later versions (tested with 5.024). | 
 |  | 
 |    Groups as subroutines | 
 |  | 
 |        If the syntax for a recursive group call (either by number or by  name) | 
 |        is  used  outside the parentheses to which it refers, it operates a bit | 
 |        like a subroutine in a programming  language.  More  accurately,  PCRE2 | 
 |        treats the referenced group as an independent subpattern which it tries | 
 |        to  match at the current matching position. The called group may be de- | 
 |        fined before or after the reference. A numbered reference  can  be  ab- | 
 |        solute or relative, as in these examples: | 
 |  | 
 |          (...(absolute)...)...(?2)... | 
 |          (...(relative)...)...(?-1)... | 
 |          (...(?+1)...(relative)... | 
 |  | 
 |        An earlier example pointed out that the pattern | 
 |  | 
 |          (sens|respons)e and \1ibility | 
 |  | 
 |        matches  "sense and sensibility" and "response and responsibility", but | 
 |        not "sense and responsibility". If instead the pattern | 
 |  | 
 |          (sens|respons)e and (?1)ibility | 
 |  | 
 |        is used, it does match "sense and responsibility" as well as the  other | 
 |        two  strings.  Another  example  is  given  in the discussion of DEFINE | 
 |        above. | 
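 |  | 
 |        The contrast between a backreference and a subroutine call can  be  ap- | 
 |        proximated  in  Python,  whose  re module has backreferences but no sub- | 
 |        routine calls. In this sketch the call is imitated  by  repeating  the | 
 |        text of the group, which is an assumption that holds here only because | 
 |        the group contains no option changes or captures; the names are  hypo- | 
 |        thetical. | 
 |  | 
```python
import re

# The text of group 1, reused below to imitate a (?1) call.
STEM = r"(?:sens|respons)"

# Backreference: \1 must match the same text that group 1 captured.
BACKREF = re.compile(r"(sens|respons)e and \1ibility")

# Imitated subroutine call: the subpattern is matched afresh, so the
# second word may use the other alternative.
CALL_LIKE = re.compile(rf"{STEM}e and {STEM}ibility")
```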
 |  | 
 |        Like recursions, subroutine calls used to be  treated  as  atomic,  but | 
 |        this  changed  at  PCRE2 release 10.30, so backtracking into subroutine | 
 |        calls can now occur. However, any capturing parentheses  that  are  set | 
 |        during the subroutine call revert to their previous values afterwards. | 
 |  | 
 |        Processing  options such as case-independence are fixed when a group is | 
 |        defined, so if it is used as  a  subroutine,  such  options  cannot  be | 
 |        changed for different calls. For example, consider this pattern: | 
 |  | 
 |          (abc)(?i:(?-1)) | 
 |  | 
 |        It  matches  "abcabc". It does not match "abcABC" because the change of | 
 |        processing option does not affect the called group. | 
 |  | 
 |        The behaviour of backtracking control verbs in groups  when  called  as | 
 |        subroutines is described in the section entitled "Backtracking verbs in | 
 |        subroutines" below. | 
 |  | 
 |    Recursion and subroutines with returned capture groups | 
 |  | 
 |        Since  PCRE2  10.46,  recursion and subroutine calls may also specify a | 
 |        list of capture groups to return. This is a PCRE2 syntax extension  not | 
 |        supported  by  Perl.  The pattern matching recurses into the referenced | 
 |        expression as described above. However, when the recursion  returns  to | 
 |        the  calling expression the subgroups captured during the recursion can | 
 |        be retained when the calling expression's context is restored. | 
 |  | 
 |        When used as a subroutine, this allows the subroutine's capture  groups | 
 |        to be used as return values. | 
 |  | 
 |        Only the specific capture groups listed by the caller will be retained, | 
 |        using the following syntax: | 
 |  | 
 |          (?R(grouplist))       recurse whole pattern, returning capture groups | 
 |          (?n(grouplist))       ) | 
 |          (?+n(grouplist))      ) | 
 |          (?-n(grouplist))      ) call subroutine, returning capture groups | 
 |          (?&name(grouplist))   ) | 
 |          (?P>name(grouplist))  ) | 
 |  | 
 |        The  list  of  capture  groups "grouplist" is a comma-separated list of | 
 |        (absolute or relative) group numbers, and group names enclosed in  sin- | 
 |        gle quotes or angle brackets. | 
 |  | 
 |        Here  is  an  example which first uses the DEFINE condition to create a | 
 |        re-usable routine for matching a weekend day, then calls that subroutine | 
 |        and retains the groups it captures for use later: | 
 |  | 
 |          (?x: # ignore whitespace for clarity | 
 |            # Define the routine "weekendday" which matches Saturday or | 
 |            # Sunday, and returns the Sat/Sun prefix as \k<short>. | 
 |            (?(DEFINE) (?<weekendday> | 
 |                (?|(?<short>Sat)urday|(?<short>Sun)day) ) ) | 
 |            # Call the routine. Matches "Saturday,Sat" or "Sunday,Sun". | 
 |            (?&weekendday(<short>)),\k<short> ) | 
 |  | 
 |        This feature is not available using the Oniguruma syntax \g<...> or | 
 |        \g'...', which is described below. | 
 |  | 
 |    Oniguruma subroutine syntax | 
 |  | 
 |        For compatibility with Oniguruma, the non-Perl syntax \g followed by  a | 
 |        name or a number enclosed either in angle brackets or single quotes, is | 
 |        an alternative syntax for calling a group as a subroutine, possibly re- | 
 |        cursively.  Here  are  two  of the examples used above, rewritten using | 
 |        this syntax: | 
 |  | 
 |          (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) | 
 |          (sens|respons)e and \g'1'ibility | 
 |  | 
 |        PCRE2 supports an extension to Oniguruma: if a number is preceded by  a | 
 |        plus or a minus sign it is taken as a relative reference. For example: | 
 |  | 
 |          (abc)(?i:\g<-1>) | 
 |  | 
 |        Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not | 
 |        synonymous. The former is a backreference; the latter is  a  subroutine | 
 |        call. | 
 |  | 
 |  | 
 | CALLOUTS | 
 |  | 
 |        Perl has a feature whereby using the sequence (?{...}) causes arbitrary | 
 |        Perl  code to be obeyed in the middle of matching a regular expression. | 
 |        This makes it possible, amongst other things, to extract different sub- | 
 |        strings that match the same pair of parentheses when there is a repeti- | 
 |        tion. | 
 |  | 
 |        PCRE2 provides a similar feature, but of course it  cannot  obey  arbi- | 
 |        trary  Perl  code. The feature is called "callout". The caller of PCRE2 | 
 |        provides an external function by putting its entry  point  in  a  match | 
 |        context  using  the function pcre2_set_callout(), and then passing that | 
 |        context to pcre2_match() or pcre2_dfa_match(). If no match  context  is | 
 |        passed,  or  if  the callout entry point is set to NULL, callout points | 
 |        will be passed over silently during matching. To disallow  callouts  in | 
 |        the pattern syntax, you may use the PCRE2_EXTRA_NEVER_CALLOUT option. | 
 |  | 
 |        Within  a  regular expression, (?C<arg>) indicates a point at which the | 
 |        external function is to be called. There  are  two  kinds  of  callout: | 
 |        those  with a numerical argument and those with a string argument. (?C) | 
 |        on its own with no argument is treated as (?C0). A  numerical  argument | 
 |        allows  the  application  to  distinguish  between  different callouts. | 
 |        String arguments were added for release 10.20 to make it  possible  for | 
 |        script  languages that use PCRE2 to embed short scripts within patterns | 
 |        in a similar way to Perl. | 
 |  | 
 |        During matching, when PCRE2 reaches a callout point, the external func- | 
 |        tion is called. It is provided with the number or  string  argument  of | 
 |        the  callout, the position in the pattern, and one item of data that is | 
 |        also set in the match block. The callout function may cause matching to | 
 |        proceed, to backtrack, or to fail. | 
 |  | 
 |        By default, PCRE2 implements a  number  of  optimizations  at  matching | 
 |        time,  and  one  side-effect is that sometimes callouts are skipped. If | 
 |        you need all possible callouts to happen, you need to set options  that | 
 |        disable  the relevant optimizations. More details, including a complete | 
 |        description of the programming interface to the callout  function,  are | 
 |        given in the pcre2callout documentation. | 
 |  | 
 |    Callouts with numerical arguments | 
 |  | 
 |        If  you  just  want  to  have  a means of identifying different callout | 
 |        points, put a number less than 256 after the  letter  C.  For  example, | 
 |        this pattern has two callout points: | 
 |  | 
 |          (?C1)abc(?C2)def | 
 |  | 
 |        If  the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical | 
 |        callouts are automatically installed before each item in  the  pattern. | 
 |        They  are all numbered 255. If there is a conditional group in the pat- | 
 |        tern whose condition is an assertion, an additional callout is inserted | 
 |        just before the condition. An explicit callout may also be set at  this | 
 |        position, as in this example: | 
 |  | 
 |          (?(?C9)(?=a)abc|def) | 
 |  | 
 |        Note that this applies only to assertion conditions, not to other types | 
 |        of condition. | 
 |  | 
 |    Callouts with string arguments | 
 |  | 
 |        A  delimited  string may be used instead of a number as a callout argu- | 
 |        ment. The starting delimiter must be one of ` ' " ^ % #  $  {  and  the | 
 |        ending delimiter is the same as the start, except for {, where the end- | 
 |        ing  delimiter  is  }.  If  the  ending  delimiter is needed within the | 
 |        string, it must be doubled. For example: | 
 |  | 
 |          (?C'ab ''c'' d')xyz(?C{any text})pqr | 
 |  | 
 |        The doubling is removed before the string  is  passed  to  the  callout | 
 |        function. | 
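
The delimiter rule can be sketched as a small Python function. It is a model
of the rule as stated above, not PCRE2's own parser, and the names OPENERS and
parse_callout_string are hypothetical.

```python
# Opening delimiters listed in the text. "{" closes with "}"; every other
# delimiter closes with itself.
OPENERS = "`'\"^%#${"


def parse_callout_string(text):
    """Parse a delimited callout argument at the start of `text`.

    Returns (argument, rest): `argument` has doubled closing delimiters
    collapsed to one, and `rest` is whatever follows the closing delimiter.
    """
    if not text or text[0] not in OPENERS:
        raise ValueError("not a string callout argument")
    close = "}" if text[0] == "{" else text[0]
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == close:
            if text[i + 1:i + 2] == close:
                # A doubled closing delimiter stands for one literal copy.
                out.append(close)
                i += 2
                continue
            return "".join(out), text[i + 1:]
        out.append(ch)
        i += 1
    raise ValueError("missing closing delimiter")
```

Applied to the text that follows "(?C" in the example pattern, the function
yields the string "ab 'c' d" for the first callout and "any text" for the
second.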
 |  | 
 |  | 
 | BACKTRACKING CONTROL | 
 |  | 
 |        There  are  a  number  of  special "Backtracking Control Verbs" (to use | 
 |        Perl's terminology) that modify the behaviour  of  backtracking  during | 
 |        matching.  They are generally of the form (*VERB) or (*VERB:NAME). Some | 
 |        verbs take either form, and may behave differently depending on whether | 
 |        or not a name argument is present. The names are  not  required  to  be | 
 |        unique within the pattern. | 
 |  | 
 |        By  default,  for  compatibility  with  Perl, a name is any sequence of | 
 |        characters that does not include a closing parenthesis. The name is not | 
 |        processed in any way, and it is  not  possible  to  include  a  closing | 
 |        parenthesis   in  the  name.   This  can  be  changed  by  setting  the | 
 |        PCRE2_ALT_VERBNAMES option, but the result is no  longer  Perl-compati- | 
 |        ble. | 
 |  | 
 |        When  PCRE2_ALT_VERBNAMES  is  set,  backslash processing is applied to | 
 |        verb names and only an unescaped  closing  parenthesis  terminates  the | 
 |        name.  However, the only backslash items that are permitted are \Q, \E, | 
 |        and sequences such as \x{100} that define character code points.  Char- | 
 |        acter type escapes such as \d are faulted. | 
 |  | 
 |        A closing parenthesis can be included in a name either as \) or between | 
 |        \Q  and  \E. In addition to backslash processing, if the PCRE2_EXTENDED | 
 |        or PCRE2_EXTENDED_MORE option is also set,  unescaped  white  space  in | 
 |        verb names is skipped, and #-comments are recognized, exactly as in the | 
 |        rest of the pattern.  PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not af- | 
 |        fect verb names unless PCRE2_ALT_VERBNAMES is also set. | 
 |  | 
 |        The  maximum  length of a name is 255 in the 8-bit library and 65535 in | 
 |        the 16-bit and 32-bit libraries. If the name is empty, that is, if  the | 
 |        closing  parenthesis immediately follows the colon, the effect is as if | 
 |        the colon were not there. Any number of these verbs may occur in a pat- | 
 |        tern. Except for (*ACCEPT), they may not be quantified. | 
 |  | 
 |        Since these verbs are specifically related  to  backtracking,  most  of | 
 |        them  can be used only when the pattern is to be matched using the tra- | 
 |        ditional matching function or JIT, because they use backtracking  algo- | 
 |        rithms.  With  the  exception  of (*FAIL), which behaves like a failing | 
 |        negative assertion, the backtracking control verbs cause  an  error  if | 
 |        encountered by the DFA matching function. | 
 |  | 
 |        The  behaviour  of  these  verbs in repeated groups, assertions, and in | 
 |        capture groups called as subroutines (whether or  not  recursively)  is | 
 |        documented below. | 
 |  | 
 |    Optimizations that affect backtracking verbs | 
 |  | 
 |        PCRE2 contains some optimizations that are used to speed up matching by | 
 |        running some checks at the start of each match attempt. For example, it | 
 |        may  know  the minimum length of matching subject, or that a particular | 
 |        character must be present. When one of these optimizations bypasses the | 
 |        running of a match,  any  included  backtracking  verbs  will  not,  of | 
 |        course, be processed. You can suppress the start-of-match optimizations | 
 |        by  setting  the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com- | 
 |        pile(), by calling pcre2_set_optimize() with a PCRE2_START_OPTIMIZE_OFF | 
 |        directive, or by starting the pattern with  (*NO_START_OPT).  There  is | 
 |        more  discussion  of  this  option in the section entitled "Compiling a | 
 |        pattern" in the pcre2api documentation. | 
 |  | 
 |        Experiments with Perl suggest that it too  has  similar  optimizations, | 
 |        and that, as with PCRE2, turning them off can change the result of a | 
 |        match. | 
 |  | 
 |    Verbs that act immediately | 
 |  | 
 |        The following verbs act as soon as they are encountered. | 
 |  | 
 |           (*ACCEPT) or (*ACCEPT:NAME) | 
 |  | 
 |        This  verb causes the match to end successfully, skipping the remainder | 
 |        of the pattern. However, when it is inside  a  capture  group  that  is | 
 |        called as a subroutine, only that group is ended successfully. Matching | 
 |        then continues at the outer level. If (*ACCEPT) is triggered in a posi- | 
 |        tive  assertion,  the  assertion succeeds; in a negative assertion, the | 
 |        assertion fails. | 
 |  | 
 |        If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap- | 
 |        tured. For example: | 
 |  | 
 |          A((?:A|B(*ACCEPT)|C)D) | 
 |  | 
 |        This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- | 
 |        tured by the outer parentheses. | 
 |  | 
 |        (*ACCEPT) is the only backtracking verb that is allowed to  be  quanti- | 
 |        fied  because  an  ungreedy  quantification with a minimum of zero acts | 
 |        only when a backtrack happens. Consider, for example, | 
 |  | 
 |          (A(*ACCEPT)??B)C | 
 |  | 
 |        where A, B, and C may be complex expressions. After matching  "A",  the | 
 |        matcher  processes  "BC"; if that fails, causing a backtrack, (*ACCEPT) | 
 |        is triggered and the match succeeds. In both cases, all but C  is  cap- | 
 |        tured.  Whereas  (*COMMIT) (see below) means "fail on backtrack", a re- | 
 |        peated (*ACCEPT) of this type means "succeed on backtrack". | 
 |  | 
 |        Warning: (*ACCEPT) should not be used within a script  run  group,  be- | 
 |        cause  it causes an immediate exit from the group, bypassing the script | 
 |        run checking. | 
 |  | 
 |          (*FAIL) or (*FAIL:NAME) | 
 |  | 
 |        This verb causes a matching failure, forcing backtracking to occur.  It | 
 |        may  be  abbreviated  to  (*F).  It is equivalent to (?!) but easier to | 
 |        read. The Perl documentation notes that it is probably useful only when | 
 |        combined with (?{}) or (??{}). Those are, of course, Perl features that | 
 |        are not present in PCRE2. The nearest equivalent is  the  callout  fea- | 
 |        ture, as for example in this pattern: | 
 |  | 
 |          a+(?C)(*FAIL) | 
 |  | 
 |        A  match  with the string "aaaa" always fails, but the callout is taken | 
 |        before each backtrack happens (in this example, 10 times). | 
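
The count of 10 can be reproduced with a toy model in Python. The model
assumes a greedy a+ that gives back one character per backtrack, an unanchored
pattern that is retried at each starting position, and no optimization that
skips callouts. It does not call PCRE2, and count_callouts is a name made up
for the sketch.

```python
def count_callouts(subject):
    """Count how often (?C) is reached for a+(?C)(*FAIL) in a toy model."""
    calls = 0
    for start in range(len(subject)):
        # Longest run of "a" beginning at this starting position.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one character
        # per backtrack. The callout is reached once for each length tried.
        for _length in range(run, 0, -1):
            calls += 1
    return calls
```

For "aaaa" the four starting positions contribute 4, 3, 2, and 1 callouts.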
 |  | 
 |        (*ACCEPT:NAME) and (*FAIL:NAME) behave the  same  as  (*MARK:NAME)(*AC- | 
 |        CEPT)  and  (*MARK:NAME)(*FAIL),  respectively,  that  is, a (*MARK) is | 
 |        recorded just before the verb acts. | 
 |  | 
 |    Recording which path was taken | 
 |  | 
 |        There is one verb whose main purpose is to track how a  match  was  ar- | 
 |        rived  at,  though  it also has a secondary use in conjunction with ad- | 
 |        vancing the match starting point (see (*SKIP) below). | 
 |  | 
 |          (*MARK:NAME) or (*:NAME) | 
 |  | 
 |        A name is always required with this verb. For all the other  backtrack- | 
 |        ing control verbs, a NAME argument is optional. | 
 |  | 
 |        When a match succeeds, the last-encountered mark name on the matching | 
 |        path is passed back to the caller, as described in the section entitled | 
 |        "Other information about the match" in the pcre2api documentation. This | 
 |        applies to all instances of (*MARK) and other verbs, | 
 |        including those inside assertions and atomic groups. However, there are | 
 |        differences  in  those  cases  when (*MARK) is used in conjunction with | 
 |        (*SKIP) as described below. | 
 |  | 
 |        The mark name that was last encountered on the matching path is  passed | 
 |        back.  A verb without a NAME argument is ignored for this purpose. Here | 
 |        is an example of pcre2test output, where the "mark"  modifier  requests | 
 |        the retrieval and outputting of (*MARK) data: | 
 |  | 
 |            re> /X(*MARK:A)Y|X(*MARK:B)Z/mark | 
 |          data> XY | 
 |           0: XY | 
 |          MK: A | 
 |          data> XZ | 
 |           0: XZ | 
 |          MK: B | 
 |  | 
 |        The (*MARK) name is tagged with "MK:" in this output, and in this exam- | 
 |        ple  it indicates which of the two alternatives matched. This is a more | 
 |        efficient way of obtaining this information than putting each  alterna- | 
 |        tive in its own capturing parentheses. | 
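
For comparison, the capturing-parentheses approach mentioned above looks like
this in Python, whose standard re module has no (*MARK). The named groups A
and B stand in for the mark names, and which_alternative is a hypothetical
helper.

```python
import re

# Each alternative is wrapped in its own named capture group.
PATTERN = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = PATTERN.search(subject)
    return m.lastgroup if m else None
```

This approach spends one capture group per alternative and reports nothing
when the match fails.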
 |  | 
 |        If  a  verb  with a name is encountered in a positive assertion that is | 
 |        true, the name is recorded and passed back if it  is  the  last-encoun- | 
 |        tered. This does not happen for negative assertions or failing positive | 
 |        assertions. | 
 |  | 
 |        After  a  partial match or a failed match, the last encountered name in | 
 |        the entire match process is returned. For example: | 
 |  | 
 |            re> /X(*MARK:A)Y|X(*MARK:B)Z/mark | 
 |          data> XP | 
 |          No match, mark = B | 
 |  | 
 |        Note that in this unanchored example the  mark  is  retained  from  the | 
 |        match attempt that started at the letter "X" in the subject. Subsequent | 
 |        match attempts starting at "P" and then with an empty string do not get | 
 |        as far as the (*MARK) item, but nevertheless do not reset it. | 
 |  | 
 |        If  you  are  interested  in  (*MARK)  values after failed matches, you | 
 |        should probably either set the PCRE2_NO_START_OPTIMIZE option  or  call | 
 |        pcre2_set_optimize()  with  a  PCRE2_START_OPTIMIZE_OFF  directive (see | 
 |        above) to ensure that the match is always attempted. | 
 |  | 
 |    Verbs that act after backtracking | 
 |  | 
 |        The following verbs do nothing when they are encountered. Matching con- | 
 |        tinues with what follows, but if there is a subsequent  match  failure, | 
 |        causing  a  backtrack  to the verb, a failure is forced. That is, back- | 
 |        tracking cannot pass to the left of the  verb.  However,  when  one  of | 
 |        these  verbs  appears inside an atomic group or in an atomic lookaround | 
 |        assertion that is true, its effect is confined to that  group,  because | 
 |        once  the  group has been matched, there is never any backtracking into | 
 |        it. Backtracking from beyond an atomic assertion or group  ignores  the | 
 |        entire group, and seeks a preceding backtracking point. | 
 |  | 
 |        These  verbs  differ  in exactly what kind of failure occurs when back- | 
 |        tracking reaches them. The behaviour described below  is  what  happens | 
 |        when  the  verb is not in a subroutine or an assertion. Subsequent sec- | 
 |        tions cover these special cases. | 
 |  | 
 |          (*COMMIT) or (*COMMIT:NAME) | 
 |  | 
 |        This verb causes the whole match to fail outright if there is  a  later | 
 |        matching failure that causes backtracking to reach it. Even if the pat- | 
 |        tern  is  unanchored,  no further attempts to find a match by advancing | 
 |        the starting point take place. If (*COMMIT) is  the  only  backtracking | 
 |        verb that is encountered, once it has been passed pcre2_match() is com- | 
 |        mitted to finding a match at the current starting point, or not at all. | 
 |        For example: | 
 |  | 
 |          a+(*COMMIT)b | 
 |  | 
 |        This  matches  "xxaab" but not "aacaab". It can be thought of as a kind | 
 |        of dynamic anchor, or "I've started, so I must finish." | 
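
The two outcomes can be traced with a toy model in Python. It hard-codes the
pattern a+(*COMMIT)b, assumes that no start-of-match optimization intervenes,
and does not call PCRE2. The function name is invented for the sketch.

```python
def match_a_plus_commit_b(subject):
    """Toy model of a+(*COMMIT)b. Returns the matched text or None."""
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end == start:
            # a+ failed before (*COMMIT) was passed: normal bumpalong.
            continue
        # (*COMMIT) has now been passed.
        if end < len(subject) and subject[end] == "b":
            return subject[start:end + 1]
        # The failure of "b" backtracks onto (*COMMIT): the whole match
        # fails, with no retry of a+ and no further starting points.
        return None
    return None
```

In "xxaab" the attempts at the two "x" characters fail before (*COMMIT) is
reached, so the starting point still advances. In "aacaab" the first attempt
passes (*COMMIT) and then fails at "c", which ends the match.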
 |  | 
 |        The behaviour of (*COMMIT:NAME) is not the same  as  (*MARK:NAME)(*COM- | 
 |        MIT).  It is like (*MARK:NAME) in that the name is remembered for pass- | 
 |        ing back to the caller. However, (*SKIP:NAME) searches only  for  names | 
 |        that are set with (*MARK), ignoring those set by any of the other back- | 
 |        tracking verbs. | 
 |  | 
 |        If  there  is more than one backtracking verb in a pattern, a different | 
 |        one that follows (*COMMIT) may be triggered first,  so  merely  passing | 
 |        (*COMMIT) during a match does not always guarantee that a match must be | 
 |        at this starting point. | 
 |  | 
 |        Note that (*COMMIT) at the start of a pattern is not the same as an an- | 
 |        chor,  unless  PCRE2's  start-of-match optimizations are turned off, as | 
 |        shown in this output from pcre2test: | 
 |  | 
 |            re> /(*COMMIT)abc/ | 
 |          data> xyzabc | 
 |           0: abc | 
 |          data> | 
 |            re> /(*COMMIT)abc/no_start_optimize | 
 |          data> xyzabc | 
 |          No match | 
 |  | 
 |        For the first pattern, PCRE2 knows that any match must start with  "a", | 
 |        so  the optimization skips along the subject to "a" before applying the | 
 |        pattern to the first set of data. The match attempt then succeeds.  The | 
 |        second  pattern disables the optimization that skips along to the first | 
 |        character. The pattern is now applied  starting  at  "x",  and  so  the | 
 |        (*COMMIT)  causes  the  match to fail without trying any other starting | 
 |        points. | 
 |  | 
 |          (*PRUNE) or (*PRUNE:NAME) | 
 |  | 
 |        This verb causes the match to fail at the current starting position  in | 
 |        the subject if there is a later matching failure that causes backtrack- | 
 |        ing  to  reach it. If the pattern is unanchored, the normal "bumpalong" | 
 |        advance to the next starting character then happens.  Backtracking  can | 
 |        occur  as  usual to the left of (*PRUNE), before it is reached, or when | 
 |        matching to the right of (*PRUNE), but if there  is  no  match  to  the | 
 |        right,  backtracking cannot cross (*PRUNE). In simple cases, the use of | 
 |        (*PRUNE) is just an alternative to an atomic group or possessive  quan- | 
 |        tifier, but there are some uses of (*PRUNE) that cannot be expressed in | 
 |        any  other  way. In an anchored pattern (*PRUNE) has the same effect as | 
 |        (*COMMIT). | 
 |  | 
 |        The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). | 
 |        It is like (*MARK:NAME) in that the name is remembered for passing back | 
 |        to the caller. However, (*SKIP:NAME) searches only for names  set  with | 
 |        (*MARK), ignoring those set by other backtracking verbs. | 
 |  | 
 |          (*SKIP) | 
 |  | 
 |        This  verb, when given without a name, is like (*PRUNE), except that if | 
 |        the pattern is unanchored, the "bumpalong" advance is not to  the  next | 
 |        character, but to the position in the subject where (*SKIP) was encoun- | 
 |        tered.  (*SKIP)  signifies that whatever text was matched leading up to | 
 |        it cannot be part of a successful match if there is a  later  mismatch. | 
 |        Consider: | 
 |  | 
 |          a+(*SKIP)b | 
 |  | 
 |        If  the  subject  is  "aaaac...",  after  the first match attempt fails | 
 |        (starting at the first character in the  string),  the  starting  point | 
 |        skips on to start the next attempt at "c". Note that a possessive quan- | 
 |        tifier does not have the same effect as this example; although it would | 
 |        suppress  backtracking  during  the first match attempt, the second at- | 
 |        tempt would start at the second character instead  of  skipping  on  to | 
 |        "c". | 
 |  | 
 |        If  (*SKIP) is used to specify a new starting position that is the same | 
 |        as the starting position of the current match, or (by  being  inside  a | 
 |        lookbehind)  earlier, the position specified by (*SKIP) is ignored, and | 
 |        instead the normal "bumpalong" occurs. | 
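
The contrast between a+(*SKIP)b and a possessive quantifier can be sketched by
listing the starting positions that each one tries. This Python model assumes
no start-of-match optimizations and does not call PCRE2. The name
start_positions and its use_skip parameter are hypothetical.

```python
def start_positions(subject, use_skip):
    """List the starting offsets tried until a match or the end of subject.

    use_skip=True models a+(*SKIP)b; use_skip=False models the possessive
    form a++b.
    """
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and end < len(subject) and subject[end] == "b":
            return tried          # matched at this starting position
        if use_skip and end > start:
            start = end           # resume where (*SKIP) was encountered
        else:
            start += 1            # normal bumpalong
    return tried
```

For the subject "aaaac" the model tries offsets 0 and 4 with (*SKIP), and
offsets 0 to 4 with the possessive quantifier.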
 |  | 
 |          (*SKIP:NAME) | 
 |  | 
 |        When (*SKIP) has an associated name, its behaviour  is  modified.  When | 
 |        such  a  (*SKIP) is triggered, the previous path through the pattern is | 
 |        searched for the most recent (*MARK) that has the same name. If one  is | 
 |        found,  the  "bumpalong" advance is to the subject position that corre- | 
 |        sponds to that (*MARK) instead of to where (*SKIP) was encountered.  If | 
 |        no (*MARK) with a matching name is found, the (*SKIP) is ignored. | 
 |  | 
 |        The  search  for a (*MARK) name uses the normal backtracking mechanism, | 
 |        which means that it does not  see  (*MARK)  settings  that  are  inside | 
 |        atomic groups or assertions, because they are never re-entered by back- | 
 |        tracking. Compare the following pcre2test examples: | 
 |  | 
 |            re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/ | 
 |          data> abc | 
 |           0: a | 
 |           1: a | 
 |          data> | 
 |            re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/ | 
 |          data> abc | 
 |           0: b | 
 |           1: b | 
 |  | 
 |        In  the first example, the (*MARK) setting is in an atomic group, so it | 
 |        is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. | 
 |        This allows the second branch of the pattern to be tried at  the  first | 
 |        character  position.  In the second example, the (*MARK) setting is not | 
 |        in an atomic group. This allows (*SKIP:X) to find the (*MARK)  when  it | 
 |        backtracks, and this causes a new matching attempt to start at the sec- | 
 |        ond  character.  This  time, the (*MARK) is never seen because "a" does | 
 |        not match "b", so the matcher immediately jumps to the second branch of | 
 |        the pattern. | 
 |  | 
 |        Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME).  It | 
 |        ignores names that are set by other backtracking verbs. | 
 |  | 
 |          (*THEN) or (*THEN:NAME) | 
 |  | 
 |        This  verb  causes  a skip to the next innermost alternative when back- | 
 |        tracking reaches it. That  is,  it  cancels  any  further  backtracking | 
 |        within  the  current  alternative.  Its name comes from the observation | 
 |        that it can be used for a pattern-based if-then-else block: | 
 |  | 
 |          ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... | 
 |  | 
 |        If the COND1 pattern matches, FOO is tried (and possibly further  items | 
 |        after  the  end  of the group if FOO succeeds); on failure, the matcher | 
 |        skips to the second alternative and tries COND2,  without  backtracking | 
 |        into  COND1.  If that succeeds and BAR fails, COND3 is tried. If subse- | 
 |        quently BAZ fails, there are no more alternatives, so there is a  back- | 
 |        track  to  whatever came before the entire group. If (*THEN) is not in- | 
 |        side an alternation, it acts like (*PRUNE). | 
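
The if-then-else reading can be modelled in Python. The sketch below is a
simplification: it treats each COND and each body as an ordinary Python
regular expression, tries the group at one position only, and ignores
backtracking from items that follow the group. The name then_group is
hypothetical.

```python
import re


def then_group(branches, subject, pos=0):
    """Toy model of ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | ... ).

    `branches` is a list of (cond, body) pattern pairs. Returns the end
    offset of the first branch whose cond and body both match, or None.
    """
    for cond, body in branches:
        m = re.compile(cond).match(subject, pos)
        if m is None:
            continue      # COND failed: ordinary move to the next alternative
        b = re.compile(body).match(subject, m.end())
        if b is not None:
            return b.end()
        # The body failed, so backtracking reaches (*THEN): skip to the next
        # alternative without retrying COND in any other way.
    return None
```

For example, with the single branch (r"\d+", r"2px") and the subject "12px",
the model returns None, because \d+ is not allowed to give back the "2". The
plain pattern \d+2px does match "12px" by backtracking.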
 |  | 
 |        The behaviour of (*THEN:NAME) is not the same  as  (*MARK:NAME)(*THEN). | 
 |        It is like (*MARK:NAME) in that the name is remembered for passing back | 
 |        to  the  caller. However, (*SKIP:NAME) searches only for names set with | 
 |        (*MARK), ignoring those set by other backtracking verbs. | 
 |  | 
 |        A group that does not contain a | character is just a part of  the  en- | 
 |        closing  alternative;  it is not a nested alternation with only one al- | 
 |        ternative. The effect of (*THEN) extends beyond such a group to the en- | 
 |        closing alternative.  Consider this pattern, where A, B, etc. are  com- | 
 |        plex  pattern  fragments  that  do not contain any | characters at this | 
 |        level: | 
 |  | 
 |          A (B(*THEN)C) | D | 
 |  | 
 |        If A and B are matched, but there is a failure in C, matching does  not | 
 |        backtrack into A; instead it moves to the next alternative, that is, D. | 
 |        However,  if  the  group containing (*THEN) is given an alternative, it | 
 |        behaves differently: | 
 |  | 
 |          A (B(*THEN)C | (*FAIL)) | D | 
 |  | 
 |        The effect of (*THEN) is now confined to the inner group. After a fail- | 
 |        ure in C, matching moves to (*FAIL), which causes the  whole  group  to | 
 |        fail  because  there  are  no  more  alternatives to try. In this case, | 
 |        matching does backtrack into A. | 
 |  | 
 |        Note that a conditional group is not considered as having two  alterna- | 
 |        tives,  because  only one is ever used. In other words, the | character | 
 |        in a conditional group has a different meaning. Ignoring  white  space, | 
 |        consider: | 
 |  | 
 |          ^.*? (?(?=a) a | b(*THEN)c ) | 
 |  | 
 |        If the subject is "ba", this pattern does not match. Because .*? is un- | 
 |        greedy,  it initially matches zero characters. The condition (?=a) then | 
 |        fails, the character "b" is matched, but "c" is  not.  At  this  point, | 
 |        matching  does  not  backtrack to .*? as might perhaps be expected from | 
 |        the presence of the | character. The conditional group is part  of  the | 
 |        single  alternative  that comprises the whole pattern, and so the match | 
 |        fails. (If there were a backtrack into .*?, allowing it to  match  "b", | 
 |        the match would succeed.) | 
 |  | 
 |        The  verbs just described provide four different "strengths" of control | 
 |        when subsequent matching fails. (*THEN) is the weakest, carrying on the | 
 |        match at the next alternative. (*PRUNE) comes next, failing  the  match | 
 |        at  the  current starting position, but allowing an advance to the next | 
 |        character (for an unanchored pattern). (*SKIP) is similar, except  that | 
 |        the advance may be more than one character. (*COMMIT) is the strongest, | 
 |        causing the entire match to fail. | 
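
The four strengths can be summarized as a small decision function. This Python
sketch assumes an unanchored pattern, a verb that is not inside a subroutine
or assertion, a (*SKIP) without a name, and a (*THEN) that has a following
alternative. The name resume_point and its return convention are invented for
the illustration.

```python
def resume_point(verb, start, verb_pos):
    """Where matching resumes after backtracking onto `verb`.

    `start` is the starting offset of the current attempt and `verb_pos` is
    the subject offset at which the verb was passed. Returns (kind, offset):
    "alternative" continues the current attempt at the next alternative,
    "restart" begins a new attempt, and "fail" ends the whole match.
    """
    if verb == "THEN":
        return ("alternative", start)
    if verb == "PRUNE":
        return ("restart", start + 1)
    if verb == "SKIP":
        # A skip target at or before the current start is ignored, and the
        # normal bumpalong happens instead.
        return ("restart", verb_pos if verb_pos > start else start + 1)
    if verb == "COMMIT":
        return ("fail", None)
    raise ValueError(f"unknown verb: {verb}")
```

For an attempt that started at offset 0 and passed the verb at offset 4,
(*PRUNE) restarts at 1, (*SKIP) restarts at 4, and (*COMMIT) does not restart
at all.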
 |  | 
 |    More than one backtracking verb | 
 |  | 
 |        If  more  than  one  backtracking verb is present in a pattern, the one | 
 |        that is backtracked onto first acts. For example,  consider  this  pat- | 
 |        tern, where A, B, etc. are complex pattern fragments: | 
 |  | 
 |          (A(*COMMIT)B(*THEN)C|ABD) | 
 |  | 
 |        If  A matches but B fails, the backtrack to (*COMMIT) causes the entire | 
 |        match to fail. However, if A and B match, but C fails, the backtrack to | 
 |        (*THEN) causes the next alternative (ABD) to be tried.  This  behaviour | 
 |        is  consistent,  but is not always the same as Perl's. It means that if | 
 |        two or more backtracking verbs appear in succession, all but  the  last | 
 |        of them have no effect. Consider this example: | 
 |  | 
 |          ...(*COMMIT)(*PRUNE)... | 
 |  | 
 |        If there is a matching failure to the right, backtracking onto (*PRUNE) | 
 |        causes  it to be triggered, and its action is taken. There can never be | 
 |        a backtrack onto (*COMMIT). | 
 |  | 
 |    Backtracking verbs in repeated groups | 
 |  | 
 |        PCRE2 sometimes differs from Perl in its handling of backtracking verbs | 
 |        in repeated groups. For example, consider: | 
 |  | 
 |          /(a(*COMMIT)b)+ac/ | 
 |  | 
 |        If the subject is "abac", Perl matches  unless  its  optimizations  are | 
 |        disabled,  but  PCRE2  always fails because the (*COMMIT) in the second | 
 |        repeat of the group acts. | 
 |  | 
 |    Backtracking verbs in assertions | 
 |  | 
 |        (*FAIL) in any assertion has its normal effect: it forces an  immediate | 
 |        backtrack.  The  behaviour  of  the other backtracking verbs depends on | 
 |        whether the assertion is standalone or is acting as the condition | 
 |        in a conditional group. | 
 |  | 
 |        (*ACCEPT)  in  a  standalone positive assertion causes the assertion to | 
 |        succeed without any further processing; captured  strings  and  a  mark | 
 |        name  (if  set) are retained. In a standalone negative assertion, (*AC- | 
 |        CEPT) causes the assertion to fail without any further processing; cap- | 
 |        tured substrings and any mark name are discarded. | 
 |  | 
 |        If the assertion is a condition, (*ACCEPT) causes the condition  to  be | 
 |        true  for  a  positive assertion and false for a negative one; captured | 
 |        substrings are retained in both cases. | 
 |  | 
 |        The remaining verbs act only when a later failure causes a backtrack to | 
 |        reach them. This means that, for the Perl-compatible assertions,  their | 
 |        effect is confined to the assertion, because Perl lookaround assertions | 
 |        are atomic. A backtrack that occurs after such an assertion is complete | 
 |        does  not  jump  back  into  the  assertion.  Note in particular that a | 
 |        (*MARK) name that is set in an assertion is not "seen" by  an  instance | 
 |        of (*SKIP:NAME) later in the pattern. | 
 |  | 
 |        PCRE2  now  supports non-atomic positive assertions and also "scan sub- | 
 |        string" assertions, as described in the sections  entitled  "Non-atomic | 
 |        assertions"  and  "Scan  substring  assertions" above. These assertions | 
 |        must be standalone (not used as conditions). They are not Perl-compati- | 
 |        ble. For these assertions, a later backtrack does jump  back  into  the | 
 |        assertion,  and  therefore  verbs such as (*COMMIT) can be triggered by | 
 |        backtracks from later in the pattern. | 
 |  | 
 |        The effect of (*THEN) is not allowed to escape beyond an assertion.  If | 
 |        there  are no more branches to try, (*THEN) causes a positive assertion | 
 |        to be false, and a negative assertion to be true. This  behaviour  dif- | 
 |        fers from Perl when the assertion has only one branch. | 
 |  | 
 |        The  other  backtracking verbs are not treated specially if they appear | 
 |        in a standalone positive assertion. In a  conditional  positive  asser- | 
 |        tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP), | 
 |        or  (*PRUNE) causes the condition to be false. However, for both stand- | 
 |        alone and conditional negative assertions, backtracking into (*COMMIT), | 
 |        (*SKIP), or (*PRUNE) causes the assertion to be true, without consider- | 
 |        ing any further alternative branches. | 
 |  | 
 |    Backtracking verbs in subroutines | 
 |  | 
 |        These behaviours occur whether or not the group is called recursively. | 
 |  | 
 |        (*ACCEPT) in a group called as a subroutine causes the subroutine match | 
 |        to succeed without any further processing. Matching then continues  af- | 
 |        ter  the  subroutine call. Perl documents this behaviour. Perl's treat- | 
 |        ment of the other verbs in subroutines is different in some cases. | 
 |  | 
 |        (*FAIL) in a group called as a subroutine has  its  normal  effect:  it | 
 |        forces an immediate backtrack. | 
 |  | 
 |        (*COMMIT),  (*SKIP),  and  (*PRUNE)  cause the subroutine match to fail | 
 |        when triggered by being backtracked to in a group called as  a  subrou- | 
 |        tine. There is then a backtrack at the outer level. | 
 |  | 
 |        (*THEN), when triggered, skips to the next alternative in the innermost | 
 |        enclosing  group that has alternatives (its normal behaviour). However, | 
 |        if there is no such group within the subroutine's group, the subroutine | 
 |        match fails and there is a backtrack at the outer level. | 
 |  | 
 |  | 
 | EBCDIC ENVIRONMENTS | 
 |  | 
 |        Differences in the way PCRE2 behaves when it is running  in  an  EBCDIC | 
 |        environment are covered in this section. | 
 |  | 
 |    Escape sequences | 
 |  | 
 |        When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..}  is  not  supported. | 
 |        \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values. | 
 |        The \c escape is processed as specified for Perl in the perlebcdic doc- | 
 |        ument.  The  only characters that are allowed after \c are A-Z, a-z, or | 
 |        one of @, [, \, ], ^, _, or ?. Any other character provokes a  compile- | 
 |        time  error.  The  sequence  \c@ encodes character code 0; after \c the | 
 |        letters (in either case) encode characters 1-26 (hex 01 to hex 1A);  [, | 
 |        \,  ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c? be- | 
 |        comes either 255 (hex FF) or 95 (hex 5F). | 
 |  | 
 |        Thus, apart from \c?, these escapes generate the  same  character  code | 
 |        values  as they do in an ASCII or Unicode environment, though the mean- | 
 |        ings of the values mostly differ. For  example,  \cG  always  generates | 
 |        code value 7, which is BEL in ASCII but DEL in EBCDIC. | 
 |  | 
 |        The  sequence  \c? generates DEL (127, hex 7F) in an ASCII environment, | 
 |        but because 127 is not a control character in  EBCDIC,  Perl  makes  it | 
 |        generate  the  APC character. Unfortunately, there are several variants | 
 |        of EBCDIC. In most of them the APC character has  the  value  255  (hex | 
 |        FF),  but  in  the one Perl calls POSIX-BC its value is 95 (hex 5F). If | 
 |        certain other characters have POSIX-BC values, PCRE2 makes \c? generate | 
 |        95; otherwise it generates 255. | 
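 |  | 
 |        The mapping described above can be sketched in C as follows. This is an | 
 |        illustration  only,  not PCRE2's implementation: the function name | 
 |        ebcdic_ctrl_value() and its apc_is_95 parameter are hypothetical, and | 
 |        the caller is assumed to know which EBCDIC variant is in use. | 
 |  | 
```c
#include <string.h>

/* Code value produced by \c followed by 'ch' in EBCDIC mode, following the
 * rules in the text above. 'apc_is_95' is a hypothetical flag that selects
 * the POSIX-BC value for \c? (95) instead of the more usual 255.
 * Returns -1 for a character that would provoke a compile-time error. */
static int ebcdic_ctrl_value(char ch, int apc_is_95)
{
    /* Lookup strings avoid arithmetic on letters, whose EBCDIC codes are
     * not contiguous. */
    static const char upper[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    static const char lower[] = "abcdefghijklmnopqrstuvwxyz";
    static const char punct[] = "[\\]^_";
    const char *p;

    if (ch == '\0') return -1;      /* strchr would match the terminator */
    if (ch == '@') return 0;
    if (ch == '?') return apc_is_95 ? 95 : 255;
    if ((p = strchr(upper, ch)) != NULL) return (int)(p - upper) + 1;
    if ((p = strchr(lower, ch)) != NULL) return (int)(p - lower) + 1;
    if ((p = strchr(punct, ch)) != NULL) return (int)(p - punct) + 27;
    return -1;
}
```
 |  | 
 |        With  this sketch, ebcdic_ctrl_value('G', 0) yields 7, matching the \cG | 
 |        example above. | 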
 |  | 
 |    Character classes | 
 |  | 
 |        In character classes there is a special case in EBCDIC environments for | 
 |        ranges whose end points are both specified as literal  letters  in  the | 
 |        same  case.  For compatibility with Perl, EBCDIC code points within the | 
 |        range that are not letters are omitted. For example, [h-k] matches only | 
 |        four characters, even though the EBCDIC codes for h and k are 0x88  and | 
 |        0x92, a range of 11 code points. However, if the range is specified nu- | 
 |        merically,  for  example,  [\x88-\x92] or [h-\x92], all code points are | 
 |        included. | 
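 |  | 
 |        The arithmetic behind the [h-k] example can be checked with a  short  C | 
 |        sketch.  The helper names are hypothetical, and the letter blocks are | 
 |        assumed to be those of the common EBCDIC code pages (a-i at  0x81-0x89, | 
 |        j-r at 0x91-0x99, s-z at 0xA2-0xA9), which agree with the codes for h | 
 |        and k quoted above. | 
 |  | 
```c
/* Assumed EBCDIC code points of the lowercase letters: a-i, j-r, and s-z
 * occupy three separate blocks. */
static int is_ebcdic_lower(unsigned c)
{
    return (c >= 0x81 && c <= 0x89) ||
           (c >= 0x91 && c <= 0x99) ||
           (c >= 0xA2 && c <= 0xA9);
}

/* Number of code points in the inclusive range lo..hi. */
static unsigned range_size(unsigned lo, unsigned hi)
{
    return hi - lo + 1;
}

/* Number of lowercase letters in the inclusive range lo..hi. */
static unsigned letters_in_range(unsigned lo, unsigned hi)
{
    unsigned n = 0;
    for (unsigned c = lo; c <= hi; c++)
        if (is_ebcdic_lower(c)) n++;
    return n;
}
```
 |  | 
 |        For the range 0x88 to 0x92, range_size() gives 11 and letters_in_range() | 
 |        gives 4 (h, i, j, and k). | 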
 |  | 
 |  | 
 | SEE ALSO | 
 |  | 
 |        pcre2api(3),   pcre2callout(3),    pcre2matching(3),    pcre2syntax(3), | 
 |        pcre2(3). | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 03 September 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                03 September 2025               PCRE2PATTERN(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2PERFORM(3)            Library Functions Manual            PCRE2PERFORM(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 PERFORMANCE | 
 |  | 
 |        Two  aspects  of performance are discussed below: memory usage and pro- | 
 |        cessing time. The way you express your pattern as a regular  expression | 
 |        can affect both of them. | 
 |  | 
 |  | 
 | COMPILED PATTERN MEMORY USAGE | 
 |  | 
 |        Patterns are compiled by PCRE2 into a reasonably efficient interpretive | 
 |        code,  so  that most simple patterns do not use much memory for storing | 
 |        the compiled version. However, there is one case where the memory usage | 
 |        of a compiled pattern can be unexpectedly  large.  If  a  parenthesized | 
 |        group  has  a quantifier with a minimum greater than 1 and/or a limited | 
 |        maximum, the whole group is repeated in the compiled code. For example, | 
 |        the pattern | 
 |  | 
 |          (abc|def){2,4} | 
 |  | 
 |        is compiled as if it were | 
 |  | 
 |          (abc|def)(abc|def)((abc|def)(abc|def)?)? | 
 |  | 
 |        (Technical aside: It is done this way so that backtrack  points  within | 
 |        each of the repetitions can be independently maintained.) | 
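 |  | 
 |        The replication can be illustrated by building the "as if" form  as  a | 
 |        string.  The  following  C sketch is not how PCRE2 generates its com- | 
 |        piled code; expand_repeat() and its helpers are hypothetical names, and | 
 |        the sketch assumes that the maximum is finite. | 
 |  | 
```c
#include <string.h>

/* Append 's' to 'buf' (capacity 'cap'); returns 0 on success, -1 if full. */
static int append(char *buf, size_t cap, const char *s)
{
    size_t used = strlen(buf), add = strlen(s);
    if (used + add + 1 > cap) return -1;
    memcpy(buf + used, s, add + 1);
    return 0;
}

/* Append the nested optional copies for 'extra' further repetitions:
 * extra == 1 gives "G?", extra == 2 gives "(GG?)?", and so on. */
static int append_optional(char *buf, size_t cap, const char *group,
                           unsigned extra)
{
    if (extra == 0) return 0;
    if (extra == 1)
        return (append(buf, cap, group) || append(buf, cap, "?")) ? -1 : 0;
    if (append(buf, cap, "(") || append(buf, cap, group)) return -1;
    if (append_optional(buf, cap, group, extra - 1)) return -1;
    return append(buf, cap, ")?");
}

/* Write the "as if" expansion of group{min,max} into 'buf'.
 * Returns 0 on success, -1 if min > max or the buffer is too small. */
static int expand_repeat(char *buf, size_t cap, const char *group,
                         unsigned min, unsigned max)
{
    if (cap == 0 || min > max) return -1;
    buf[0] = '\0';
    for (unsigned i = 0; i < min; i++)
        if (append(buf, cap, group)) return -1;
    return append_optional(buf, cap, group, max - min);
}
```
 |  | 
 |        Calling expand_repeat() with the group "(abc|def)", a minimum of 2, and | 
 |        a maximum of 4 produces the expanded form shown above. The length of the | 
 |        result grows with the maximum, which is the point of the example. | 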
 |  | 
 |        For  regular expressions whose quantifiers use only small numbers, this | 
 |        is not usually a problem. However, if the numbers are large,  and  par- | 
 |        ticularly  if  such repetitions are nested, the memory usage can become | 
 |        an embarrassment. For example, the very simple pattern | 
 |  | 
 |          ((ab){1,1000}c){1,3} | 
 |  | 
 |        uses over 50KiB when compiled using the 8-bit library.  When  PCRE2  is | 
 |        compiled  with its default internal pointer size of two bytes, the size | 
 |        limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit | 
 |        libraries, and this is reached with the above pattern if the outer rep- | 
 |        etition is increased from 3 to 4. PCRE2 can be compiled to  use  larger | 
 |        internal  pointers  and thus handle larger compiled patterns, but it is | 
 |        better to try to rewrite your pattern to use less memory if you can. | 
 |  | 
 |        One way of reducing the memory usage for such patterns is to  make  use | 
 |        of PCRE2's "subroutine" facility. Re-writing the above pattern as | 
 |  | 
 |          ((ab)(?2){0,999}c)(?1){0,2} | 
 |  | 
 |        reduces  the memory requirements to around 16KiB, and indeed it remains | 
 |        under 20KiB even with the outer repetition increased to  100.  However, | 
 |        this kind of pattern is not always exactly equivalent, because any cap- | 
 |        tures  within  subroutine calls are lost when the subroutine completes. | 
 |        If this is not a problem, this kind of  rewriting  will  allow  you  to | 
 |        process  patterns that PCRE2 cannot otherwise handle. The matching per- | 
 |        formance of the two different versions of the pattern  is  roughly  the | 
 |        same.  (This applies from release 10.30 - things were different in ear- | 
 |        lier releases.) | 
 |  | 
 |  | 
 | STACK AND HEAP USAGE AT RUN TIME | 
 |  | 
 |        From release 10.30, the interpretive (non-JIT) version of pcre2_match() | 
 |        uses very little system stack at run time. In earlier  releases  recur- | 
 |        sive  function  calls  could  use a great deal of stack, and this could | 
 |        cause problems, but this usage has been eliminated. Backtracking  posi- | 
 |        tions  are now explicitly remembered in memory frames controlled by the | 
 |        code. | 
 |  | 
 |        The size of each frame depends on the size of pointer variables and the | 
 |        number of capturing parenthesized groups in the pattern being  matched. | 
 |        On a 64-bit system the frame size for a pattern with no captures is 128 | 
 |        bytes. For each capturing group the size increases by 16 bytes. | 
 |  | 
 |        Until  release  10.41,  an initial 20KiB frames vector was allocated on | 
 |        the system stack, but this still caused some  issues  for  multi-thread | 
 |        applications  where  each  thread  has a very small stack. From release | 
 |        10.41 backtracking memory frames are always held  in  heap  memory.  An | 
 |        initial heap allocation is obtained the first time any match data block | 
 |        is  passed  to  pcre2_match().  This  is remembered with the match data | 
 |        block and re-used if that block is used for another match. It is  freed | 
 |        when the match data block itself is freed. | 
 |  | 
 |        The  size  of the initial block is the larger of 20KiB or ten times the | 
 |        pattern's frame size, unless the heap limit is less than this, in which | 
 |        case the heap limit is used. If the initial  block  proves  to  be  too | 
 |        small during matching, it is replaced by a larger block, subject to the | 
 |        heap  limit.  The  heap limit is checked only when a new block is to be | 
 |        allocated. Reducing the heap limit between calls to pcre2_match()  with | 
 |        the same match data block does not affect the saved block. | 
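 |  | 
 |        As  a rough illustration of these figures, the following C sketch com- | 
 |        putes a frame size and the initial block size from  them.  The  function | 
 |        names are hypothetical, the 64-bit frame sizes are the ones quoted | 
 |        above, and the heap limit is taken in bytes purely to keep the arith- | 
 |        metic simple. | 
 |  | 
```c
#include <stddef.h>

/* Frame size in bytes on a 64-bit system, per the figures quoted above:
 * 128 bytes with no captures, plus 16 bytes per capturing group. */
static size_t frame_size_64(size_t capture_groups)
{
    return 128 + 16 * capture_groups;
}

/* Size in bytes of the initial heap block: the larger of 20KiB and ten
 * frames, reduced to the heap limit if the limit is smaller.
 * 'heap_limit_bytes' is an assumption of this sketch (a limit in bytes). */
static size_t initial_block_size(size_t frame_size, size_t heap_limit_bytes)
{
    size_t size = 20 * 1024;
    if (10 * frame_size > size) size = 10 * frame_size;
    if (heap_limit_bytes < size) size = heap_limit_bytes;
    return size;
}
```
 |  | 
 |        For  a  pattern  with three capturing groups the frame size is 176 bytes, | 
 |        so the initial block is 20KiB unless the heap limit is smaller. | 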
 |  | 
 |        In  contrast  to  pcre2_match(),  pcre2_dfa_match()  does use recursive | 
 |        function calls, but only for processing atomic groups,  lookaround  as- | 
 |        sertions, and recursion within the pattern. The original version of the | 
 |        code  used  to  allocate  quite large internal workspace vectors on the | 
 |        stack, which caused some problems for  some  patterns  in  environments | 
 |        with  small  stacks.  From release 10.32 the code for pcre2_dfa_match() | 
 |        has been re-factored to use heap memory  when  necessary  for  internal | 
 |        workspace  when  recursing,  though  recursive function calls are still | 
 |        used. | 
 |  | 
 |        The "match depth" parameter can be used to limit the depth of  function | 
 |        recursion,  and  the  "match  heap"  parameter  to limit heap memory in | 
 |        pcre2_dfa_match(). | 
 |  | 
 |  | 
 | PROCESSING TIME | 
 |  | 
 |        Certain items in regular expression patterns are processed  more  effi- | 
 |        ciently than others. It is more efficient to use a character class like | 
 |        [aeiou]   than   a   set   of  single-character  alternatives  such  as | 
 |        (a|e|i|o|u). In general, the simplest construction  that  provides  the | 
 |        required behaviour is usually the most efficient. Jeffrey Friedl's book | 
 |        contains  a  lot  of useful general discussion about optimizing regular | 
 |        expressions for efficient performance. This document contains a few ob- | 
 |        servations about PCRE2. | 
 |  | 
 |        Using Unicode character properties (the \p,  \P,  and  \X  escapes)  is | 
 |        slow,  because  PCRE2 has to use a multi-stage table lookup whenever it | 
 |        needs a character's property. If you can find  an  alternative  pattern | 
 |        that does not use character properties, it will probably be faster. | 
 |  | 
 |        By  default,  the  escape  sequences  \b, \d, \s, and \w, and the POSIX | 
 |        character classes such as [:alpha:]  do  not  use  Unicode  properties, | 
 |        partly for backwards compatibility, and partly for performance reasons. | 
 |        However,  you  can  set  the PCRE2_UCP option or start the pattern with | 
 |        (*UCP) if you want Unicode character properties to be  used.  This  can | 
 |        double  the  matching  time  for  items  such  as \d, when matched with | 
 |        pcre2_match(); the performance loss is less with a DFA  matching  func- | 
 |        tion, and in both cases there is not much difference for \b. | 
 |  | 
 |        When  a pattern begins with .* not in atomic parentheses, nor in paren- | 
 |        theses that are the subject of a backreference,  and  the  PCRE2_DOTALL | 
 |        option  is  set,  the pattern is implicitly anchored by PCRE2, since it | 
 |        can match only at the start of a subject string.  If  the  pattern  has | 
 |        multiple top-level branches, they must all be anchorable. The optimiza- | 
 |        tion  can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is au- | 
 |        tomatically disabled if the pattern contains (*PRUNE) or (*SKIP). | 
 |  | 
 |        If PCRE2_DOTALL is not set, PCRE2 cannot make  this  optimization,  be- | 
 |        cause  the  dot metacharacter does not then match a newline, and if the | 
 |        subject string contains newlines, the pattern may match from the  char- | 
 |        acter immediately following one of them instead of from the very start. | 
 |        For example, the pattern | 
 |  | 
 |          .*second | 
 |  | 
 |        matches  the subject "first\nand second" (where \n stands for a newline | 
 |        character), with the match starting at the seventh character. In  order | 
 |        to  do  this, PCRE2 has to retry the match starting after every newline | 
 |        in the subject. | 
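 |  | 
 |        The effect can be modelled without a regex engine. The following C | 
 |        sketch, whose function name is hypothetical, assumes that newline is the | 
 |        single character \n. It returns the offset at which a pattern of the | 
 |        form .*literal first matches when PCRE2_DOTALL is not set. | 
 |  | 
```c
#include <string.h>

/* Offset at which ".*literal" (without PCRE2_DOTALL) first matches
 * 'subject', or -1 if it does not match. Because the dot does not match a
 * newline, the earliest match begins at the start of the first line that
 * contains the literal. */
static long dotstar_literal_start(const char *subject, const char *literal)
{
    const char *line = subject;
    size_t lit_len = strlen(literal);

    for (;;) {
        const char *eol = strchr(line, '\n');
        size_t line_len = eol ? (size_t)(eol - line) : strlen(line);

        /* Search for the literal within this line only. */
        for (size_t i = 0; i + lit_len <= line_len; i++)
            if (memcmp(line + i, literal, lit_len) == 0)
                return (long)(line - subject);

        if (eol == NULL) return -1;   /* no more lines to try */
        line = eol + 1;               /* retry just after the newline */
    }
}
```
 |  | 
 |        For the subject "first\nand second" the result is offset 6, that is, the | 
 |        seventh character. | 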
 |  | 
 |        If you are using such a pattern with subject strings that do  not  con- | 
 |        tain   newlines,   the   best   performance   is  obtained  by  setting | 
 |        PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate  ex- | 
 |        plicit  anchoring.  That saves PCRE2 from having to scan along the sub- | 
 |        ject looking for a newline to restart at. | 
 |  | 
 |        Beware of patterns that contain nested indefinite  repeats.  These  can | 
 |        take  a  long time to run when applied to a string that does not match. | 
 |        Consider the pattern fragment | 
 |  | 
 |          ^(a+)* | 
 |  | 
 |        This can match "aaaa" in 16 different ways, and this  number  increases | 
 |        very  rapidly  as the string gets longer. (The * repeat can match 0, 1, | 
 |        2, 3, or 4 times, and for each of those cases other than 0 or 4, the  + | 
 |        repeats  can  match  different numbers of times.) When the remainder of | 
 |        the pattern is such that the entire match is going to fail,  PCRE2  has | 
 |        in  principle to try every possible variation, and this can take an ex- | 
 |        tremely long time, even for relatively short strings. | 
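 |  | 
 |        The  figure of 16 can be reproduced by counting. The following C sketch | 
 |        uses hypothetical function names and simply enumerates the  possibili- | 
 |        ties described above; it does not model PCRE2's matching code. | 
 |  | 
```c
/* Number of ways to split exactly 'k' characters into an ordered sequence
 * of non-empty runs (one run per iteration of the group). Zero characters
 * can be matched in exactly one way, by zero iterations. */
static unsigned long splits(unsigned k)
{
    if (k == 0) return 1;
    unsigned long total = 0;
    for (unsigned first = 1; first <= k; first++)
        total += splits(k - first);   /* first run takes 'first' chars */
    return total;
}

/* Number of distinct ways ^(a+)* can match a prefix of a string of 'n'
 * "a" characters: sum the splits for every prefix length 0..n. */
static unsigned long ways_to_match(unsigned n)
{
    unsigned long total = 0;
    for (unsigned k = 0; k <= n; k++)
        total += splits(k);
    return total;
}
```
 |  | 
 |        ways_to_match(4)  gives  16,  and the count doubles with each additional | 
 |        character, which is why a failing match can take so long. | 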
 |  | 
 |        An optimization catches some of the more simple cases such as | 
 |  | 
 |          (a+)*b | 
 |  | 
 |        where a literal character follows. Before  embarking  on  the  standard | 
 |        matching  procedure, PCRE2 checks that there is a "b" later in the sub- | 
 |        ject string, and if there is not, it fails the match immediately.  How- | 
 |        ever,  when  there  is no following literal this optimization cannot be | 
 |        used. You can see the difference by comparing the behaviour of | 
 |  | 
 |          (a+)*\d | 
 |  | 
 |        with the pattern above. The pattern that ends with "b" fails almost | 
 |        instantly when applied to a whole line of "a" characters, whereas the | 
 |        one that ends with \d takes an appreciable time with strings longer than | 
 |        about 20 characters. | 
 |  | 
 |        In many cases, the solution to this kind of performance issue is to use | 
 |        an atomic group or a possessive quantifier. This can often reduce  mem- | 
 |        ory requirements as well. As another example, consider this pattern: | 
 |  | 
 |          ([^<]|<(?!inet))+ | 
 |  | 
 |        It  matches  from wherever it starts until it encounters "<inet" or the | 
 |        end of the data, and is the kind of pattern that  might  be  used  when | 
 |        processing an XML file. Each iteration of the outer parentheses matches | 
 |        either  one  character that is not "<" or a "<" that is not followed by | 
 |        "inet". However, each time a parenthesis is processed,  a  backtracking | 
 |        position  is  passed,  so this formulation uses a memory frame for each | 
 |        matched character. For a long string, a lot of memory is required. Con- | 
 |        sider now this  rewritten  pattern,  which  matches  exactly  the  same | 
 |        strings: | 
 |  | 
 |          ([^<]++|<(?!inet))+ | 
 |  | 
 |        This runs much faster, because sequences of characters that do not con- | 
 |        tain "<" are "swallowed" in one item inside the parentheses, and a pos- | 
 |        sessive  quantifier  is  used to stop any backtracking into the runs of | 
 |        non-"<" characters. This version also uses a lot  less  memory  because | 
 |        entry  to  a  new  set of parentheses happens only when a "<" character | 
 |        that is not followed by "inet" is encountered (and we  assume  this  is | 
 |        relatively rare). | 
 |  | 
 |        This example shows that one way of optimizing performance when matching | 
 |        long  subject strings is to write repeated parenthesized subpatterns to | 
 |        match more than one character whenever possible. | 
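 |  | 
 |        The difference between the two formulations can be estimated by  count- | 
 |        ing  iterations  of  the outer group. The C sketch below is a simplified | 
 |        model, not PCRE2's matcher: the function names are hypothetical, and the | 
 |        match is assumed to start at the beginning of the subject. | 
 |  | 
```c
#include <string.h>

/* Length of the text consumed before "<inet" or the end of the subject. */
static size_t span_before_inet(const char *s)
{
    const char *stop = strstr(s, "<inet");
    return stop ? (size_t)(stop - s) : strlen(s);
}

/* Iterations of the outer group for ([^<]|<(?!inet))+ : one per character. */
static size_t iterations_per_char(const char *s)
{
    return span_before_inet(s);
}

/* Iterations of the outer group for ([^<]++|<(?!inet))+ : one per run of
 * non-"<" characters, plus one per "<" that is not followed by "inet". */
static size_t iterations_per_run(const char *s)
{
    size_t end = span_before_inet(s), i = 0, count = 0;
    while (i < end) {
        if (s[i] == '<') {
            i++;                                  /* a lone "<" */
        } else {
            while (i < end && s[i] != '<') i++;   /* whole run at once */
        }
        count++;
    }
    return count;
}
```
 |  | 
 |        For the subject "ab<b>cd<inet>x" the first function returns 7  and  the | 
 |        second returns 3; the gap widens as the runs of non-"<" characters get | 
 |        longer. | 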
 |  | 
 |    SETTING RESOURCE LIMITS | 
 |  | 
 |        You can set limits on the amount of processing that  takes  place  when | 
 |        matching,  and  on  the amount of heap memory that is used. The default | 
 |        values of the limits are very large, and unlikely ever to operate. They | 
 |        can be changed when PCRE2 is built, and  they  can  also  be  set  when | 
 |        pcre2_match()  or pcre2_dfa_match() is called. For details of these in- | 
 |        terfaces, see the pcre2build documentation  and  the  section  entitled | 
 |        "The match context" in the pcre2api documentation. | 
 |  | 
 |        The  pcre2test  test program has a modifier called "find_limits" which, | 
 |        if applied to a subject line, causes it to  find  the  smallest  limits | 
 |        that allow a pattern to match. This is done by repeatedly matching with | 
 |        different limits. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 06 December 2022 | 
 |        Copyright (c) 1997-2022 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                06 December 2022                PCRE2PERFORM(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2POSIX(3)              Library Functions Manual              PCRE2POSIX(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | SYNOPSIS | 
 |  | 
 |        #include <pcre2posix.h> | 
 |  | 
 |        int pcre2_regcomp(regex_t *preg, const char *pattern, | 
 |             int cflags); | 
 |  | 
 |        int pcre2_regexec(const regex_t *preg, const char *string, | 
 |             size_t nmatch, regmatch_t pmatch[], int eflags); | 
 |  | 
 |        size_t pcre2_regerror(int errcode, const regex_t *preg, | 
 |             char *errbuf, size_t errbuf_size); | 
 |  | 
 |        void pcre2_regfree(regex_t *preg); | 
 |  | 
 |  | 
 | DESCRIPTION | 
 |  | 
 |        This  set of functions provides a POSIX-style API for the PCRE2 regular | 
 |        expression 8-bit library. There are no POSIX-style wrappers for PCRE2's | 
 |        16-bit and 32-bit libraries. See the pcre2api documentation for  a  de- | 
 |        scription  of  PCRE2's native API, which contains much additional func- | 
 |        tionality. | 
 |  | 
 |        IMPORTANT NOTE: The functions described here are NOT  thread-safe,  and | 
 |        should  not  be used in multi-threaded applications. They are also lim- | 
 |        ited to processing subjects that are not bigger than 2GB. Use  the  na- | 
 |        tive API instead. | 
 |  | 
 |        These  functions  are  wrapper functions that ultimately call the PCRE2 | 
 |        native API. Their prototypes are defined  in  the  pcre2posix.h  header | 
 |        file, and they all have unique names starting with pcre2_. However, the | 
 |        pcre2posix.h  header  also  contains macro definitions that convert the | 
 |        standard POSIX names such as regcomp() into  pcre2_regcomp()  etc.  This | 
 |        means  that a program can use the usual POSIX names without running the | 
 |        risk of accidentally linking with POSIX functions from a different  li- | 
 |        brary. | 
 |  | 
 |        On  Unix-like systems the PCRE2 POSIX library is called libpcre2-posix, | 
 |        so can be accessed by adding -lpcre2-posix to the command  for  linking | 
 |        an application. Because the POSIX functions call the native ones, it is | 
 |        also necessary to add -lpcre2-8. | 
 |  | 
 |        On Windows systems, if you are linking to a DLL version of the library, | 
 |        it  is  recommended  that PCRE2POSIX_SHARED is defined before including | 
 |        the pcre2posix.h header, as it will allow for a more efficient  way  to | 
 |        invoke the functions by adding the __declspec(dllimport) decorator. | 
 |  | 
 |        Although  they were not defined as prototypes in pcre2posix.h, releases | 
 |        10.33 to 10.36 of the library contained functions with the POSIX  names | 
 |        regcomp()  etc.  These simply passed their arguments to the PCRE2 func- | 
 |        tions. These functions were provided for backwards  compatibility  with | 
 |        earlier  versions  of  PCRE2, which had only POSIX names. However, this | 
 |        has proved troublesome in situations where a program links with several | 
 |        libraries, some of which use PCRE2's POSIX interface while  others  use | 
 |        the  real  POSIX functions.  For this reason, the POSIX names have been | 
 |        removed since release 10.37. | 
 |  | 
 |        Calling the header file pcre2posix.h avoids  any  conflict  with  other | 
 |        POSIX  libraries.  It can, of course, be renamed or aliased as regex.h, | 
 |        which is the "correct" name, if there is  no  clash.  It  provides  two | 
 |        structure  types,  regex_t  for compiled internal forms, and regmatch_t | 
 |        for returning captured substrings. It also defines some constants whose | 
 |        names start with "REG_"; these are used for setting options and identi- | 
 |        fying error codes. | 
 |  | 
 |  | 
 | USING THE POSIX FUNCTIONS | 
 |  | 
 |        Note that these functions are just POSIX-style wrappers for PCRE2's na- | 
 |        tive API.  They do not give POSIX  regular  expression  behaviour,  and | 
 |        they are not thread-safe or even POSIX compatible. | 
 |  | 
 |        Those  POSIX  option bits that can reasonably be mapped to PCRE2 native | 
 |        options have been implemented. In addition, the option REG_EXTENDED  is | 
 |        defined  with  the  value  zero. This has no effect, but since programs | 
 |        that are written to the POSIX interface often use  it,  this  makes  it | 
 |        easier  to  slot in PCRE2 as a replacement library. Other POSIX options | 
 |        are not even defined. | 
 |  | 
 |        There are also some options that are not defined by POSIX.  These  have | 
 |        been  added  at  the  request  of users who want to make use of certain | 
 |        PCRE2-specific features via the POSIX calling interface or to  add  BSD | 
 |        or GNU functionality. | 
 |  | 
 |        When  PCRE2  is  called via these functions, it is only the API that is | 
 |        POSIX-like in style. The syntax and semantics of  the  regular  expres- | 
 |        sions  themselves  are  still  those of Perl, subject to the setting of | 
 |        various PCRE2 options, as described below. "POSIX-like in style"  means | 
 |        that  the  API  approximates  to  the POSIX definition; it is not fully | 
 |        POSIX-compatible, and in multi-unit encoding  domains  it  is  probably | 
 |        even less compatible. | 
 |  | 
 |        The  descriptions  below use the actual names of the functions, but, as | 
 |        described above, the standard POSIX names (without the  pcre2_  prefix) | 
 |        may also be used. | 
 |  | 
 |  | 
 | COMPILING A PATTERN | 
 |  | 
 |        The function pcre2_regcomp() is called to compile a pattern into an in- | 
 |        ternal  form. By default, the pattern is a C string terminated by a bi- | 
 |        nary zero (but see REG_PEND below). The preg argument is a pointer to a | 
 |        regex_t structure that is used as a base for storing information  about | 
 |        the  compiled  regular  expression.  It  is  also  used  for input when | 
 |        REG_PEND is set. The regex_t structure used by pcre2_regcomp()  is  de- | 
 |        fined  in  pcre2posix.h  and  is  not the same as the structure used by | 
 |        other libraries that provide POSIX-style matching. | 
 |  | 
 |        The argument cflags is either zero, or contains one or more of the bits | 
 |        defined by the following macros: | 
 |  | 
 |          REG_DOTALL | 
 |  | 
 |        The PCRE2_DOTALL option is set when the regular  expression  is  passed | 
 |        for  compilation  to  the  native function. Note that REG_DOTALL is not | 
 |        part of the POSIX standard. | 
 |  | 
 |          REG_ICASE | 
 |  | 
 |        The PCRE2_CASELESS option is set when the regular expression is  passed | 
 |        for compilation to the native function. | 
 |  | 
 |          REG_NEWLINE | 
 |  | 
 |        The PCRE2_MULTILINE option is set when the regular expression is passed | 
 |        for  compilation  to the native function. Note that this does not mimic | 
 |        the defined POSIX behaviour for REG_NEWLINE  (see  the  following  sec- | 
 |        tion). | 
 |  | 
 |          REG_NOSPEC | 
 |  | 
 |        The  PCRE2_LITERAL  option is set when the regular expression is passed | 
 |        for compilation to the native function. This disables all meta  charac- | 
 |        ters  in the pattern, causing it to be treated as a literal string. The | 
 |        only other options that are  allowed  with  REG_NOSPEC  are  REG_ICASE, | 
 |        REG_NOSUB,  REG_PEND,  and REG_UTF. Note that REG_NOSPEC is not part of | 
 |        the POSIX standard. | 
 |  | 
 |          REG_NOSUB | 
 |  | 
 |        When  a  pattern  that  is  compiled  with  this  flag  is  passed   to | 
 |        pcre2_regexec()  for  matching, the nmatch and pmatch arguments are ig- | 
 |        nored, and no captured strings are returned. Versions of the PCRE2  li- | 
 |        brary  prior to 10.22 used to set the PCRE2_NO_AUTO_CAPTURE compile op- | 
 |        tion, but this no longer happens because it disables the use  of  back- | 
 |        references. | 
 |  | 
 |          REG_PEND | 
 |  | 
 |        If  this  option is set, the re_endp field in the preg structure (which | 
 |        has the type const char *) must be set to point to the character beyond | 
 |        the end of the pattern before calling pcre2_regcomp(). The pattern  it- | 
 |        self  may  now  contain binary zeros, which are treated as data charac- | 
 |        ters. Without REG_PEND, a binary zero terminates the  pattern  and  the | 
 |        re_endp field is ignored. This is a GNU extension to the POSIX standard | 
 |        and  should be used with caution in software intended to be portable to | 
 |        other systems. | 
 |  | 
 |          REG_UCP | 
 |  | 
 |        The PCRE2_UCP option is set when the regular expression is  passed  for | 
 |        compilation  to  the  native function. This causes PCRE2 to use Unicode | 
 |        properties when matching \d, \w,  etc.,  instead  of  just  recognizing | 
 |        ASCII values. Note that REG_UCP is not part of the POSIX standard. | 
 |  | 
 |          REG_UNGREEDY | 
 |  | 
 |        The  PCRE2_UNGREEDY option is set when the regular expression is passed | 
 |        for compilation to the native function. Note that REG_UNGREEDY  is  not | 
 |        part of the POSIX standard. | 
 |  | 
 |          REG_UTF | 
 |  | 
 |        The  PCRE2_UTF  option is set when the regular expression is passed for | 
 |        compilation to the native function. This causes the pattern itself  and | 
 |        all  data  strings used for matching it to be treated as UTF-8 strings. | 
 |        Note that REG_UTF is not part of the POSIX standard. | 
 |  | 
 |        In the absence of these flags, no options  are  passed  to  the  native | 
 |        function.  This means that the regex is compiled with PCRE2 default se- | 
 |        mantics.  In  particular,  the way it handles newline characters in the | 
 |        subject string is the Perl way, not the POSIX way.  Note  that  setting | 
 |        PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE. | 
 |        It  does not affect the way newlines are matched by the dot metacharac- | 
 |        ter (they are not) or by a negative class such as [^a] (they are). | 
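 |  | 
 |        The correspondence between the cflags bits and the native options listed | 
 |        above  can  be  summarized as a lookup table. The C sketch below uses | 
 |        hypothetical names (flag_map, cflag_map, native_option_for)  and  holds | 
 |        the option names as strings, so it is a summary of the text rather than | 
 |        part of the PCRE2 API. | 
 |  | 
```c
#include <string.h>

/* POSIX-style cflags bit and the native compile option it sets, as listed
 * above. Names are held as strings so the sketch needs no PCRE2 headers. */
struct flag_map { const char *posix_name; const char *native_name; };

static const struct flag_map cflag_map[] = {
    { "REG_DOTALL",   "PCRE2_DOTALL"    },
    { "REG_ICASE",    "PCRE2_CASELESS"  },
    { "REG_NEWLINE",  "PCRE2_MULTILINE" },
    { "REG_NOSPEC",   "PCRE2_LITERAL"   },
    { "REG_UCP",      "PCRE2_UCP"       },
    { "REG_UNGREEDY", "PCRE2_UNGREEDY"  },
    { "REG_UTF",      "PCRE2_UTF"       },
};

/* Returns the native option name, or NULL for flags such as REG_NOSUB and
 * REG_PEND, for which the text above names no native compile option. */
static const char *native_option_for(const char *posix_name)
{
    for (size_t i = 0; i < sizeof cflag_map / sizeof cflag_map[0]; i++)
        if (strcmp(cflag_map[i].posix_name, posix_name) == 0)
            return cflag_map[i].native_name;
    return NULL;
}
```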
 |  | 
 |        The yield of pcre2_regcomp() is zero on success,  and  non-zero  other- | 
 |        wise.  The preg structure is filled in on success, and one other member | 
 |        of  the  structure (as well as re_endp) is public: re_nsub contains the | 
 |        number of capturing subpatterns in the regular expression. Various  er- | 
 |        ror codes are defined in the header file. | 
 |  | 
 |        NOTE: If the yield of pcre2_regcomp() is non-zero, you must not attempt | 
 |        to use the contents of the preg structure. If, for example, you pass it | 
 |        to  pcre2_regexec(), the result is undefined and your program is likely | 
 |        to crash. | 
 |  | 
 |  | 
 | MATCHING NEWLINE CHARACTERS | 
 |  | 
 |        This area is not simple, because POSIX and Perl take different views of | 
 |        things.  It is not possible to get PCRE2 to obey POSIX  semantics,  but | 
 |        then PCRE2 was never intended to be a POSIX engine. The following table | 
 |        lists  the  different  possibilities for matching newline characters in | 
 |        Perl and PCRE2: | 
 |  | 
 |                                  Default   Change with | 
 |  | 
 |          . matches newline          no     PCRE2_DOTALL | 
 |          newline matches [^a]       yes    not changeable | 
 |          $ matches \n at end        yes    PCRE2_DOLLAR_ENDONLY | 
 |          $ matches \n in middle     no     PCRE2_MULTILINE | 
 |          ^ matches \n in middle     no     PCRE2_MULTILINE | 
 |  | 
 |        This is the equivalent table for a POSIX-compatible pattern matcher: | 
 |  | 
 |                                  Default   Change with | 
 |  | 
 |          . matches newline          yes    REG_NEWLINE | 
 |          newline matches [^a]       yes    REG_NEWLINE | 
 |          $ matches \n at end        no     REG_NEWLINE | 
 |          $ matches \n in middle     no     REG_NEWLINE | 
 |          ^ matches \n in middle     no     REG_NEWLINE | 
 |  | 
 |        This behaviour is not what happens when PCRE2 is called via  its  POSIX | 
 |        API.  By  default, PCRE2's behaviour is the same as Perl's, except that | 
 |        there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both  PCRE2 | 
 |        and Perl, there is no way to stop newline from matching [^a]. | 
 |  | 
 |        Default  POSIX newline handling can be obtained by setting PCRE2_DOTALL | 
 |        and PCRE2_DOLLAR_ENDONLY when  calling  pcre2_compile()  directly,  but | 
 |        there is no way to make PCRE2 behave exactly as for the REG_NEWLINE ac- | 
 |        tion.  When  using  the  POSIX  API,  passing  REG_NEWLINE  to  PCRE2's | 
 |        pcre2_regcomp()  function  causes  PCRE2_MULTILINE  to  be  passed   to | 
 |        pcre2_compile(), and REG_DOTALL passes PCRE2_DOTALL. There is no way to | 
 |        pass PCRE2_DOLLAR_ENDONLY. | 
 |  | 
 |  | 
 | MATCHING A PATTERN | 
 |  | 
 |        The function pcre2_regexec() is called to match a compiled pattern preg | 
 |        against  a  given string, which is by default terminated by a zero byte | 
 |        (but see REG_STARTEND below), subject to the options in eflags.   These | 
 |        can be: | 
 |  | 
 |          REG_NOTBOL | 
 |  | 
 |        The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match- | 
 |        ing function. | 
 |  | 
 |          REG_NOTEMPTY | 
 |  | 
 |        The  PCRE2_NOTEMPTY  option  is  set  when calling the underlying PCRE2 | 
 |        matching function. Note that REG_NOTEMPTY is  not  part  of  the  POSIX | 
 |        standard.  However, setting this option can give more POSIX-like behav- | 
 |        iour in some situations. | 
 |  | 
 |          REG_NOTEOL | 
 |  | 
 |        The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match- | 
 |        ing function. | 
 |  | 
 |          REG_STARTEND | 
 |  | 
 |        When this option  is  set,  the  subject  string  starts  at  string  + | 
 |        pmatch[0].rm_so  and  ends  at  string  + pmatch[0].rm_eo, which should | 
 |        point to the first character beyond the string. There may be binary ze- | 
 |        ros within the subject string, and indeed, using  REG_STARTEND  is  the | 
 |        only way to pass a subject string that contains a binary zero. | 
 |  | 
 |        Whatever  the  value  of  pmatch[0].rm_so,  the  offsets of the matched | 
 |        string and any captured substrings are  still  given  relative  to  the | 
 |        start  of  string  itself. (Before PCRE2 release 10.30 these were given | 
 |        relative to string + pmatch[0].rm_so, but this differs from  other  im- | 
 |        plementations.) | 
 |  | 
 |        This  is  a  BSD  extension,  compatible with but not specified by IEEE | 
 |        Standard 1003.2 (POSIX.2), and should be used with caution in  software | 
 |        intended  to  be  portable to other systems. Note that a non-zero rm_so | 
 |        does not imply REG_NOTBOL; REG_STARTEND affects only the  location  and | 
 |        length  of  the string, not how it is matched. Setting REG_STARTEND and | 
 |        passing pmatch as NULL are mutually exclusive; the error REG_INVARG  is | 
 |        returned. | 
 |  | 
 |        If  the pattern was compiled with the REG_NOSUB flag, no data about any | 
 |        matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of | 
 |        pcre2_regexec()  are  ignored  (except  possibly as input for REG_STAR- | 
 |        TEND). | 
 |  | 
 |        The value of nmatch may be zero, and the value of pmatch may be NULL | 
 |        (unless REG_STARTEND is set); in both of these cases no data about any | 
 |        matched strings is returned. | 
 |  | 
 |        Otherwise, the portion of the string that was  matched,  and  also  any | 
 |        captured substrings, are returned via the pmatch argument, which points | 
 |        to  an  array  of  nmatch structures of type regmatch_t, containing the | 
 |        members rm_so and rm_eo. These contain the byte  offset  to  the  first | 
 |        character of each substring and the offset to the first character after | 
 |        the  end of each substring, respectively. The 0th element of the vector | 
 |        relates to the entire portion of string that  was  matched;  subsequent | 
 |        elements relate to the capturing subpatterns of the regular expression. | 
 |        Unused entries in the array have both structure members set to -1. | 
 |  | 
 |        regmatch_t  as  well  as  the  regoff_t  typedef it uses are defined in | 
 |        pcre2posix.h and are not warranted to have the same size or  layout  as | 
 |        other  similarly  named  types from other libraries that provide POSIX- | 
 |        style matching. | 
 |  | 
 |        A successful match yields a zero return; various error  codes  are  de- | 
 |        fined  in the header file, of which REG_NOMATCH is the "expected" fail- | 
 |        ure code. | 
 |  | 
 |  | 
 | ERROR MESSAGES | 
 |  | 
 |        The pcre2_regerror() function maps a  non-zero  errorcode  from  either | 
 |        pcre2_regcomp()  or  pcre2_regexec() to a printable message. If preg is | 
 |        not NULL, the error should have arisen from the use of that  structure. | 
 |        A message terminated by a binary zero is placed in errbuf. If the | 
 |        buffer is too short, only the first errbuf_size - 1 characters of the | 
 |        error message are used. The yield of the function is the size of the | 
 |        buffer needed to hold the whole message, including the terminating | 
 |        zero. This value is greater than errbuf_size if the message was | 
 |        truncated. | 
 |  | 
 |  | 
 | MEMORY USAGE | 
 |  | 
 |        Compiling a regular expression causes memory to be allocated and  asso- | 
 |        ciated  with the preg structure. The function pcre2_regfree() frees all | 
 |        such memory, after which preg may no longer be used as a  compiled  ex- | 
 |        pression. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 27 November 2024 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                27 November 2024                  PCRE2POSIX(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2SAMPLE(3)             Library Functions Manual             PCRE2SAMPLE(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 SAMPLE PROGRAM | 
 |  | 
 |        A  simple, complete demonstration program to get you started with using | 
 |        PCRE2 is supplied in the file pcre2demo.c in the src directory  in  the | 
 |        PCRE2 distribution. A listing of this program is given in the pcre2demo | 
 |        documentation. If you do not have a copy of the PCRE2 distribution, you | 
 |        can save this listing to recreate the contents of pcre2demo.c. | 
 |  | 
 |        The  demonstration  program compiles the regular expression that is its | 
 |        first argument, and matches it against the subject string in its second | 
 |        argument. No PCRE2 options are set, and default  character  tables  are | 
 |        used. If matching succeeds, the program outputs the portion of the sub- | 
 |        ject  that  matched,  together  with  the contents of any captured sub- | 
 |        strings. | 
 |  | 
 |        If the -g option is given on the command line, the program then goes on | 
 |        to check for further matches of the same regular expression in the same | 
 |        subject string. The logic is a little bit tricky because of the  possi- | 
 |        bility  of  matching an empty string. Comments in the code explain what | 
 |        is going on. | 
 |  | 
 |        The code in pcre2demo.c is an 8-bit program that uses the  PCRE2  8-bit | 
 |        library.  It  handles  strings  and characters that are stored in 8-bit | 
 |        code units.  By default, one character corresponds to  one  code  unit, | 
 |        but  if  the  pattern starts with "(*UTF)", both it and the subject are | 
 |        treated as UTF-8 strings, where characters  may  occupy  multiple  code | 
 |        units. | 
 |  | 
 |        If  PCRE2  is installed in the standard include and library directories | 
 |        for your operating system, you should be able to compile the demonstra- | 
 |        tion program using a command like this: | 
 |  | 
 |          cc -o pcre2demo pcre2demo.c -lpcre2-8 | 
 |  | 
 |        If PCRE2 is installed elsewhere, you may need to add additional options | 
 |        to the command line. For example, on a Unix-like system that has  PCRE2 | 
 |        installed  in /usr/local, you can compile the demonstration program us- | 
 |        ing a command like this: | 
 |  | 
 |          cc -o pcre2demo -I/usr/local/include pcre2demo.c \ | 
 |             -L/usr/local/lib -lpcre2-8 | 
 |  | 
 |        Once you have built the demonstration program, you can run simple tests | 
 |        like this: | 
 |  | 
 |          ./pcre2demo 'cat|dog' 'the cat sat on the mat' | 
 |          ./pcre2demo -g 'cat|dog' 'the dog sat on the cat' | 
 |          ./pcre2demo -i 'cat' 'the dog sat on the CAT' | 
 |  | 
 |        Note that there is a  much  more  comprehensive  test  program,  called | 
 |        pcre2test,  which supports many more facilities for testing regular ex- | 
 |        pressions using all three PCRE2 libraries (8-bit, 16-bit,  and  32-bit, | 
 |        though  not all three need be installed). The pcre2demo program is pro- | 
 |        vided as a relatively simple coding example. | 
 |  | 
 |        If you try to run pcre2demo when PCRE2 is not installed in the standard | 
 |        library directory, you may get an error like  this  on  some  operating | 
 |        systems (e.g. Solaris): | 
 |  | 
 |          ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: | 
 |            No such file or directory | 
 |  | 
 |        This  is  caused  by the way shared library support works on those sys- | 
 |        tems. You need to add | 
 |  | 
 |          -R/usr/local/lib | 
 |  | 
 |        (for example) to the compile command to get round this problem. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 28 February 2025 | 
 |        Copyright (c) 1997-2016 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                28 February 2025                 PCRE2SAMPLE(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2SERIALIZE(3)          Library Functions Manual          PCRE2SERIALIZE(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS | 
 |  | 
 |        int32_t pcre2_serialize_decode(pcre2_code **codes, | 
 |          int32_t number_of_codes, const uint8_t *bytes, | 
 |          pcre2_general_context *gcontext); | 
 |  | 
 |        int32_t pcre2_serialize_encode(const pcre2_code **codes, | 
 |          int32_t number_of_codes, uint8_t **serialized_bytes, | 
 |          PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext); | 
 |  | 
 |        void pcre2_serialize_free(uint8_t *bytes); | 
 |  | 
 |        int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes); | 
 |  | 
 |        If  you  are running an application that uses a large number of regular | 
 |        expression patterns, it may be useful to store them  in  a  precompiled | 
 |        form  instead  of  having to compile them every time the application is | 
 |        run. However, if you are using the just-in-time  optimization  feature, | 
 |        it is not possible to save and reload the JIT data, because it is posi- | 
 |        tion-dependent.  The  host  on  which the patterns are reloaded must be | 
 |        running the same version of PCRE2, with the same code unit  width,  and | 
 |        must  also have the same endianness, pointer width and PCRE2_SIZE type. | 
 |        For example, patterns compiled on a 32-bit system using PCRE2's  16-bit | 
 |        library cannot be reloaded on a 64-bit system, nor can they be reloaded | 
 |        using the 8-bit library. | 
 |  | 
 |        Note  that  "serialization" in PCRE2 does not convert compiled patterns | 
 |        to an abstract format like Java or .NET serialization.  The  serialized | 
 |        output  is really just a bytecode dump, which is why it can only be re- | 
 |        loaded in the same environment as the one that created  it.  Hence  the | 
 |        restrictions  mentioned  above.   Applications  that are not statically | 
 |        linked with a fixed version of PCRE2 must be prepared to recompile pat- | 
 |        terns from their sources, in order to be immune to PCRE2 upgrades. | 
 |  | 
 |  | 
 | SECURITY CONCERNS | 
 |  | 
 |        The facility for saving and restoring compiled patterns is intended for | 
 |        use within individual applications.  As  such,  the  data  supplied  to | 
 |        pcre2_serialize_decode()  is expected to be trusted data, not data from | 
 |        arbitrary external sources.  There  is  only  some  simple  consistency | 
 |        checking, not complete validation of what is being re-loaded. Corrupted | 
 |        data may cause undefined results. For example, if the length field of a | 
 |        pattern in the serialized data is corrupted, the deserializing code may | 
 |        read beyond the end of the byte stream that is passed to it. | 
 |  | 
 |  | 
 | SAVING COMPILED PATTERNS | 
 |  | 
 |        Before compiled patterns can be saved they must be serialized, which in | 
 |        PCRE2  means converting the pattern to a stream of bytes. A single byte | 
 |        stream may contain any number of compiled patterns, but they  must  all | 
 |        use  the same character tables. A single copy of the tables is included | 
 |        in the byte stream (its size is 1088 bytes). For more details of  char- | 
 |        acter  tables,  see the section on locale support in the pcre2api docu- | 
 |        mentation. | 
 |  | 
 |        The function pcre2_serialize_encode() creates a serialized byte  stream | 
 |        from  a  list of compiled patterns. Its first two arguments specify the | 
 |        list, being a pointer to a vector of pointers to compiled patterns, and | 
 |        the length of the vector. The third and fourth arguments point to vari- | 
 |        ables which are set to point to the created byte stream and its length, | 
 |        respectively. The final argument is a pointer  to  a  general  context, | 
 |        which  can  be  used  to specify custom memory management functions. If | 
 |        this argument is NULL, malloc() is used to obtain memory for  the  byte | 
 |        stream. The yield of the function is the number of serialized patterns, | 
 |        or one of the following negative error codes: | 
 |  | 
 |          PCRE2_ERROR_BADDATA      the number of patterns is zero or less | 
 |          PCRE2_ERROR_BADMAGIC     mismatch of id bytes in one of the patterns | 
 |          PCRE2_ERROR_NOMEMORY     memory allocation failed | 
 |          PCRE2_ERROR_MIXEDTABLES  the patterns do not all use the same tables | 
 |          PCRE2_ERROR_NULL         the 1st, 3rd, or 4th argument is NULL | 
 |  | 
 |        PCRE2_ERROR_BADMAGIC  means  either that a pattern's code has been cor- | 
 |        rupted, or that a slot in the vector does not point to a compiled  pat- | 
 |        tern. | 
 |  | 
 |        Once a set of patterns has been serialized you can save the data in any | 
 |        appropriate  manner. Here is sample code that compiles two patterns and | 
 |        writes them to a file. It assumes that the variable fd refers to a file | 
 |        that is open for output. The error checking that should be present in a | 
 |        real application has been omitted for simplicity. | 
 |  | 
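 |          int errorcode; | 
 |          uint8_t *bytes; | 
 |          PCRE2_SIZE erroroffset; | 
 |          PCRE2_SIZE bytescount; | 
 |          pcre2_code *list_of_codes[2]; | 
 |  | 
 |          /* Compile the two patterns that are to be saved. */ | 
 |          list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern", | 
 |            PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); | 
 |          list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern", | 
 |            PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL); | 
 |  | 
 |          /* Serialize both patterns into one newly allocated byte stream. */ | 
 |          errorcode = pcre2_serialize_encode((const pcre2_code **)list_of_codes, | 
 |            2, &bytes, &bytescount, NULL); | 
 |  | 
 |          /* Write the stream; fd must be a FILE * that is open for output. */ | 
 |          errorcode = fwrite(bytes, 1, bytescount, fd); | 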
 |  | 
 |        Note that the serialized data is binary data that may  contain  any  of | 
 |        the  256  possible  byte values. On systems that make a distinction be- | 
 |        tween binary and non-binary data, be sure that the file is  opened  for | 
 |        binary output. | 
 |  | 
 |        Serializing  a  set  of patterns leaves the original data untouched, so | 
 |        they can still be used for matching. Their memory  must  eventually  be | 
 |        freed in the usual way by calling pcre2_code_free(). When you have fin- | 
 |        ished with the byte stream, it too must be freed by calling pcre2_seri- | 
 |        alize_free().  If  this function is called with a NULL argument, it re- | 
 |        turns immediately without doing anything. | 
 |  | 
 |  | 
 | RE-USING PRECOMPILED PATTERNS | 
 |  | 
 |        In order to re-use a set of saved patterns you must first make the  se- | 
 |        rialized  byte stream available in main memory (for example, by reading | 
 |        from a file). The management of this memory block is up to the applica- | 
 |        tion. You can use the pcre2_serialize_get_number_of_codes() function to | 
 |        find out how many compiled patterns are in the serialized data  without | 
 |        actually decoding the patterns: | 
 |  | 
 |          uint8_t *bytes = <serialized data>; | 
 |          int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes); | 
 |  | 
 |        The pcre2_serialize_decode() function reads a byte stream and recreates | 
 |        the compiled patterns in new memory blocks, setting pointers to them in | 
 |        a  vector.  The  first two arguments are a pointer to a suitable vector | 
 |        and its length, and the third argument points to a byte stream. The fi- | 
 |        nal argument is a pointer to a general context, which can  be  used  to | 
 |        specify custom memory management functions for the decoded patterns. If | 
 |        this argument is NULL, malloc() and free() are used. After deserializa- | 
 |        tion, the byte stream is no longer needed and can be discarded. | 
 |  | 
 |          pcre2_code *list_of_codes[2]; | 
 |          uint8_t *bytes = <serialized data>; | 
 |          int32_t number_of_codes = | 
 |            pcre2_serialize_decode(list_of_codes, 2, bytes, NULL); | 
 |  | 
 |        If  the  vector  is  not  large enough for all the patterns in the byte | 
 |        stream, it is filled with those that fit, and  the  remainder  are  ig- | 
 |        nored.  The yield of the function is the number of decoded patterns, or | 
 |        one of the following negative error codes: | 
 |  | 
 |          PCRE2_ERROR_BADDATA    second argument is zero or less | 
 |          PCRE2_ERROR_BADMAGIC   mismatch of id bytes in the data | 
 |          PCRE2_ERROR_BADMODE    mismatch of code unit size or PCRE2 version | 
 |          PCRE2_ERROR_BADSERIALIZEDDATA  other sanity check failure | 
 |          PCRE2_ERROR_NOMEMORY   memory allocation failed | 
 |          PCRE2_ERROR_NULL       first or third argument is NULL | 
 |  | 
 |        PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it  was | 
 |        compiled on a system with different endianness. | 
 |  | 
 |        Decoded patterns can be used for matching in the usual way, and must be | 
 |        freed  by  calling pcre2_code_free(). However, be aware that there is a | 
 |        potential race issue if you are using multiple patterns that  were  de- | 
 |        coded  from a single byte stream in a multithreaded application. A sin- | 
 |        gle copy of the character tables is used by all  the  decoded  patterns | 
 |        and a reference count is used to arrange for its memory to be automati- | 
 |        cally  freed when the last pattern is freed, but there is no locking on | 
 |        this reference count. Therefore, if you want to call  pcre2_code_free() | 
 |        for  these  patterns  in  different  threads, you must arrange your own | 
 |        locking, and ensure that pcre2_code_free()  cannot  be  called  by  two | 
 |        threads at the same time. | 
 |  | 
 |        If  a pattern was processed by pcre2_jit_compile() before being serial- | 
 |        ized, the JIT data is discarded and so is no longer available  after  a | 
 |        save/restore  cycle.  You can, however, process a restored pattern with | 
 |        pcre2_jit_compile() if you wish. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 19 January 2024 | 
 |        Copyright (c) 1997-2018 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                 19 January 2024              PCRE2SERIALIZE(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2SYNTAX(3)             Library Functions Manual             PCRE2SYNTAX(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY | 
 |  | 
 |        The  full  syntax and semantics of the regular expression patterns that | 
 |        are supported by PCRE2 are described in the pcre2pattern documentation. | 
 |        This document contains a quick-reference summary of the pattern syntax, | 
 |        followed by the syntax of replacement strings used by the substitution | 
 |        function, pcre2_substitute(). The full description of the latter is in | 
 |        the pcre2api documentation. | 
 |  | 
 |  | 
 | QUOTING | 
 |  | 
 |          \x         where x is non-alphanumeric is a literal x | 
 |          \Q...\E    treat enclosed characters as literal | 
 |  | 
 |        Note that white space inside \Q...\E is always treated as literal, even | 
 |        if PCRE2_EXTENDED is set, causing most other white space to be ignored. | 
 |        Note  also  that  PCRE2's handling of \Q...\E has some differences from | 
 |        Perl's. See the pcre2pattern documentation for details. | 
 |  | 
 |  | 
 | BRACED ITEMS | 
 |  | 
 |        With one exception, wherever brace characters { and } are  required  to | 
 |        enclose  data for constructions such as \g{2} or \k{name}, space and/or | 
 |        horizontal tab characters that follow { or precede }  are  allowed  and | 
 |        are ignored. In the case of quantifiers, they may also appear before or | 
 |        after  the comma. The exception is \u{...} which is not Perl-compatible | 
 |        and is recognized only when PCRE2_EXTRA_ALT_BSUX is set. This is an EC- | 
 |        MAScript compatibility feature, and follows ECMAScript's behaviour. | 
 |  | 
 |  | 
 | ESCAPED CHARACTERS | 
 |  | 
 |        This table applies to ASCII and Unicode environments.  An  unrecognized | 
 |        escape sequence causes an error. | 
 |  | 
 |          \a         alarm, that is, the BEL character (hex 07) | 
 |          \cx        "control-x", where x is a non-control ASCII character | 
 |          \e         escape (hex 1B) | 
 |          \f         form feed (hex 0C) | 
 |          \n         newline (hex 0A) | 
 |          \r         carriage return (hex 0D) | 
 |          \t         tab (hex 09) | 
 |          \0dd       character with octal code 0dd | 
 |          \ddd       character with octal code ddd, or backreference | 
 |          \o{ddd..}  character with octal code ddd.. | 
 |          \N{U+hh..} character with Unicode code point hh.. (Unicode mode only) | 
 |          \xhh       character with hex code hh | 
 |          \x{hh..}   character with hex code hh.. | 
 |  | 
 |        \N{U+hh..} is synonymous with \x{hh..} but is not supported in environ- | 
 |        ments  that  use  EBCDIC code (mainly IBM mainframes). Note that \N not | 
 |        followed by an opening curly bracket has a different meaning  (see  be- | 
 |        low). | 
 |  | 
 |        If PCRE2_ALT_BSUX or PCRE2_EXTRA_ALT_BSUX is set ("ALT_BSUX mode"), the | 
 |        following are also recognized: | 
 |  | 
 |          \U         the character "U" | 
 |          \uhhhh     character with hex code hhhh | 
 |          \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX | 
 |  | 
 |        When  \x  is not followed by {, one or two hexadecimal digits are read, | 
 |        but in ALT_BSUX mode \x must be followed by two hexadecimal  digits  to | 
 |        be  recognized  as a hexadecimal escape; otherwise it matches a literal | 
 |        "x".  Likewise, if \u (in ALT_BSUX mode) is not followed by four  hexa- | 
 |        decimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in | 
 |        curly brackets, it matches a literal "u". | 
 |  | 
 |        Note that \0dd is always an octal code. The treatment of backslash fol- | 
 |        lowed by a non-zero digit is complicated; for details see  the  section | 
 |        "Non-printing  characters" in the pcre2pattern documentation, where de- | 
 |        tails of escape processing in EBCDIC environments are also given. | 
 |  | 
 |  | 
 | CHARACTER TYPES | 
 |  | 
 |          .          any character except newline; | 
 |                       in dotall mode, any character whatsoever | 
 |          \C         one code unit, even in UTF mode (best avoided) | 
 |          \d         a decimal digit | 
 |          \D         a character that is not a decimal digit | 
 |          \h         a horizontal white space character | 
 |          \H         a character that is not a horizontal white space character | 
 |          \N         a character that is not a newline | 
 |          \p{xx}     a character with the xx property | 
 |          \P{xx}     a character without the xx property | 
 |          \R         a newline sequence | 
 |          \s         a white space character | 
 |          \S         a character that is not a white space character | 
 |          \v         a vertical white space character | 
 |          \V         a character that is not a vertical white space character | 
 |          \w         a "word" character | 
 |          \W         a "non-word" character | 
 |          \X         a Unicode extended grapheme cluster | 
 |  | 
 |        \C is dangerous because it may leave the current matching point in  the | 
 |        middle of a UTF-8 or UTF-16 character. The application can lock out the | 
 |        use  of  \C  by  setting the PCRE2_NEVER_BACKSLASH_C option. It is also | 
 |        possible to build PCRE2 with the use of \C permanently disabled. | 
 |  | 
 |        By default, \d, \s, and \w match only ASCII characters, even  in  UTF-8 | 
 |        mode or in the 16-bit and 32-bit libraries. However, if locale-specific | 
 |        matching  is  happening,  \s and \w may also match characters with code | 
 |        points in the range 128-255. If the PCRE2_UCP option is set, the behav- | 
 |        iour of these escape sequences is changed to use Unicode properties and | 
 |        they match many more characters, but there  are  some  option  settings | 
 |        that  can  restrict individual sequences to matching only ASCII charac- | 
 |        ters. | 
 |  | 
 |        Property descriptions in \p and \P are matched caselessly; hyphens, un- | 
 |        derscores, and ASCII white space characters are ignored, in  accordance | 
 |        with  Unicode's  "loose matching" rules. For example, \p{Bidi_Class=al} | 
 |        is the same as \p{ bidi class = AL }. | 
 |  | 
 |  | 
 | GENERAL CATEGORY PROPERTIES FOR \p and \P | 
 |  | 
 |          C          Other | 
 |          Cc         Control | 
 |          Cf         Format | 
 |          Cn         Unassigned | 
 |          Co         Private use | 
 |          Cs         Surrogate | 
 |  | 
 |          L          Letter | 
 |          Lc         Cased letter, the union of Ll, Lu, and Lt | 
 |          L&         Synonym of Lc | 
 |          Ll         Lower case letter | 
 |          Lm         Modifier letter | 
 |          Lo         Other letter | 
 |          Lt         Title case letter | 
 |          Lu         Upper case letter | 
 |  | 
 |          M          Mark | 
 |          Mc         Spacing mark | 
 |          Me         Enclosing mark | 
 |          Mn         Non-spacing mark | 
 |  | 
 |          N          Number | 
 |          Nd         Decimal number | 
 |          Nl         Letter number | 
 |          No         Other number | 
 |  | 
 |          P          Punctuation | 
 |          Pc         Connector punctuation | 
 |          Pd         Dash punctuation | 
 |          Pe         Close punctuation | 
 |          Pf         Final punctuation | 
 |          Pi         Initial punctuation | 
 |          Po         Other punctuation | 
 |          Ps         Open punctuation | 
 |  | 
 |          S          Symbol | 
 |          Sc         Currency symbol | 
 |          Sk         Modifier symbol | 
 |          Sm         Mathematical symbol | 
 |          So         Other symbol | 
 |  | 
 |          Z          Separator | 
 |          Zl         Line separator | 
 |          Zp         Paragraph separator | 
 |          Zs         Space separator | 
 |  | 
 |        From release 10.45, when caseless matching is set, Ll, Lu, and  Lt  are | 
 |        all equivalent to Lc. | 
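The two-letter codes in the table are the standard Unicode general category values, so they can be inspected with any Unicode database. The sketch below uses Python's unicodedata module as a stand-in; its Unicode version may differ from the one PCRE2 was built with, and the helper is_cased_letter is a hypothetical name for the Lc union.

```python
import unicodedata

# Sample characters and the general category each one carries.
samples = ["A", "a", "\u01c5", "5", "_", "$", "+", " "]
categories = {ch: unicodedata.category(ch) for ch in samples}
for ch, cat in categories.items():
    print(f"U+{ord(ch):04X} {cat}")


def is_cased_letter(ch: str) -> bool:
    """Lc (synonym L&): the union of Ll, Lu, and Lt."""
    return unicodedata.category(ch) in {"Ll", "Lu", "Lt"}
```

A one-letter property such as L or N corresponds to every category whose code starts with that letter.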
 |  | 
 |  | 
 | PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P | 
 |  | 
 |          Xan        Alphanumeric: union of properties L and N | 
 |          Xps        POSIX space: property Z or tab, NL, VT, FF, CR | 
 |          Xsp        Perl space: property Z or tab, NL, VT, FF, CR | 
 |          Xuc        Universally-named character: one that can be | 
 |                       represented by a Universal Character Name | 
 |          Xwd        Perl word: property Xan or underscore | 
 |  | 
 |        Perl and POSIX space are now the same. Perl added VT to its space char- | 
 |        acter set at release 5.18. | 
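The definitions of Xan, Xsp, and Xwd in the table can be restated as predicates over the general category. This is a sketch in Python under the assumption that its Unicode tables agree with PCRE2's; the function names are hypothetical.

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Xan: union of properties L and N."""
    return unicodedata.category(ch)[0] in "LN"


def is_xsp(ch: str) -> bool:
    """Xsp (same as Xps): property Z or tab, NL, VT, FF, CR."""
    return unicodedata.category(ch)[0] == "Z" or ch in "\t\n\v\f\r"


def is_xwd(ch: str) -> bool:
    """Xwd: property Xan or underscore."""
    return is_xan(ch) or ch == "_"
```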
 |  | 
 |  | 
 | BINARY PROPERTIES FOR \p AND \P | 
 |  | 
 |        Unicode  defines  a  number  of  binary properties, that is, properties | 
 |        whose only values are true or false. You can obtain  a  list  of  those | 
 |        that  are  recognized  by \p and \P, along with their abbreviations, by | 
 |        running this command: | 
 |  | 
 |          pcre2test -LP | 
 |  | 
 |  | 
 | SCRIPT MATCHING WITH \p AND \P | 
 |  | 
 |        Many script names and their 4-letter abbreviations  are  recognized  in | 
 |        \p{sc:...}  or  \p{scx:...} items, or on their own with \p (and also \P | 
 |        of course). You can obtain a list of these scripts by running this com- | 
 |        mand: | 
 |  | 
 |          pcre2test -LS | 
 |  | 
 |  | 
 | THE BIDI_CLASS PROPERTY FOR \p AND \P | 
 |  | 
 |          \p{Bidi_Class:<class>}   matches a character with the given class | 
 |          \p{BC:<class>}           matches a character with the given class | 
 |  | 
 |        The recognized classes are: | 
 |  | 
 |          AL          Arabic letter | 
 |          AN          Arabic number | 
 |          B           paragraph separator | 
 |          BN          boundary neutral | 
 |          CS          common separator | 
 |          EN          European number | 
 |          ES          European separator | 
 |          ET          European terminator | 
 |          FSI         first strong isolate | 
 |          L           left-to-right | 
 |          LRE         left-to-right embedding | 
 |          LRI         left-to-right isolate | 
 |          LRO         left-to-right override | 
 |          NSM         non-spacing mark | 
 |          ON          other neutral | 
 |          PDF         pop directional format | 
 |          PDI         pop directional isolate | 
 |          R           right-to-left | 
 |          RLE         right-to-left embedding | 
 |          RLI         right-to-left isolate | 
 |          RLO         right-to-left override | 
 |          S           segment separator | 
 |          WS          white space | 
 |  | 
 |  | 
 | CHARACTER CLASSES | 
 |  | 
 |          [...]       positive character class | 
 |          [^...]      negative character class | 
 |          [x-y]       range (can be used for hex characters) | 
 |          [[:xxx:]]   positive POSIX named set | 
 |          [[:^xxx:]]  negative POSIX named set | 
 |  | 
 |          alnum       alphanumeric | 
 |          alpha       alphabetic | 
 |          ascii       0-127 | 
 |          blank       space or tab | 
 |          cntrl       control character | 
 |          digit       decimal digit | 
 |          graph       printing, excluding space | 
 |          lower       lower case letter | 
 |          print       printing, including space | 
 |          punct       printing, excluding alphanumeric | 
 |          space       white space | 
 |          upper       upper case letter | 
 |          word        same as \w | 
 |          xdigit      hexadecimal digit | 
 |  | 
 |        In PCRE2, POSIX character set names recognize only ASCII characters  by | 
 |        default,  but  some of them use Unicode properties if PCRE2_UCP is set. | 
 |        You can use \Q...\E inside a character class. | 
 |  | 
 |        When PCRE2_ALT_EXTENDED_CLASS is set, UTS#18 extended character classes | 
 |        may be used, allowing nested character classes, combined using set  op- | 
 |        erators. | 
 |  | 
 |          [x&&[^y]]   UTS#18 extended character class | 
 |  | 
 |          x||y        set union (OR) | 
 |          x&&y        set intersection (AND) | 
 |          x--y        set difference (AND NOT) | 
 |          x~~y        set symmetric difference (XOR) | 
 |  | 
 |  | 
 | PERL EXTENDED CHARACTER CLASSES | 
 |  | 
 |          (?[...])                Perl extended character class | 
 |          (?[\p{Thai} & \p{Nd}])  operators; white space ignored | 
 |          (?[(x - y) & z])        parentheses for grouping | 
 |  | 
 |          (?[ [^3] & \p{Nd} ])    [...] is a nested ordinary class | 
 |          (?[ [:alpha:] - [z] ])  POSIX set is allowed outside [...] | 
 |          (?[ \d - [3] ])         backslash-escaped set is allowed outside | 
 |                                    [...] | 
 |          (?[ !\n & [:ascii:] ])  backslash-escaped character is allowed | 
 |                                    outside [...] | 
 |                                  all other characters or ranges must be | 
 |                                    enclosed in [...] | 
 |  | 
 |          x|y, x+y                set union (OR) | 
 |          x&y                     set intersection (AND) | 
 |          x-y                     set difference (AND NOT) | 
 |          x^y                     set symmetric difference (XOR) | 
 |          !x                      set complement (NOT) | 
 |  | 
 |        Inside a Perl extended character class, [...] switches mode to  be  in- | 
 |        terpreted  as  an  ordinary character class. Outside of a nested [...], | 
 |        the only items permitted are backslash-escapes, POSIX sets,  operators, | 
 |        and  parentheses. Inside a nested ordinary class, ^ has its usual mean- | 
 |        ing (inverts the class when used as the first character); outside of  a | 
 |        nested class, ^ is the XOR operator. | 
 |  | 
 |  | 
 | QUANTIFIERS | 
 |  | 
 |          ?           0 or 1, greedy | 
 |          ?+          0 or 1, possessive | 
 |          ??          0 or 1, lazy | 
 |          *           0 or more, greedy | 
 |          *+          0 or more, possessive | 
 |          *?          0 or more, lazy | 
 |          +           1 or more, greedy | 
 |          ++          1 or more, possessive | 
 |          +?          1 or more, lazy | 
 |          {n}         exactly n | 
 |          {n,m}       at least n, no more than m, greedy | 
 |          {n,m}+      at least n, no more than m, possessive | 
 |          {n,m}?      at least n, no more than m, lazy | 
 |          {n,}        n or more, greedy | 
 |          {n,}+       n or more, possessive | 
 |          {n,}?       n or more, lazy | 
 |          {,m}        zero up to m, greedy | 
 |          {,m}+       zero up to m, possessive | 
 |          {,m}?       zero up to m, lazy | 
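The difference between greedy and lazy quantifiers is easiest to see on a short subject. The sketch below uses Python's re module, which shares this part of the syntax, as a stand-in for PCRE2. Possessive forms are left out because older Python versions do not accept them.

```python
import re

subject = "<a><b>"
greedy = re.match(r"<.+>", subject).group()    # takes as much as possible
lazy = re.match(r"<.+?>", subject).group()     # takes as little as possible

bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()
up_to_two = re.match(r"a{,2}", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy, up_to_two)
# prints: <a><b> <a> aaa aa aa
```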
 |  | 
 |  | 
 | ANCHORS AND SIMPLE ASSERTIONS | 
 |  | 
 |          \b          word boundary | 
 |          \B          not a word boundary | 
 |          ^           start of subject | 
 |                        also after an internal newline in multiline mode | 
 |                        (after any newline if PCRE2_ALT_CIRCUMFLEX is set) | 
 |          \A          start of subject | 
 |          $           end of subject | 
 |                        also before newline at end of subject | 
 |                        also before internal newline in multiline mode | 
 |          \Z          end of subject | 
 |                        also before newline at end of subject | 
 |          \z          end of subject | 
 |          \G          first matching position in subject | 
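The behaviour of ^ and $ with and without multiline mode can be sketched as follows. Python's re module is used as a stand-in; it agrees with the table for ^, $, and \A, but its \Z behaves like \z above, so \Z is not shown.

```python
import re

subject = "one\ntwo\n"

start_default = re.findall(r"^\w+", subject)           # start of subject only
start_multi = re.findall(r"^\w+", subject, re.M)       # also after internal newline
end_default = re.findall(r"\w+$", subject)             # before newline at end
end_multi = re.findall(r"\w+$", subject, re.M)         # also before internal newline
start_abs = re.findall(r"\A\w+", subject, re.M)        # \A ignores multiline mode

print(start_default, start_multi, end_default, end_multi, start_abs)
```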
 |  | 
 |  | 
 | REPORTED MATCH POINT SETTING | 
 |  | 
 |          \K          set reported start of match | 
 |  | 
 |        From  release 10.38 \K is not permitted by default in lookaround asser- | 
 |        tions, for compatibility with Perl.  However,  if  the  PCRE2_EXTRA_AL- | 
 |        LOW_LOOKAROUND_BSK option is set, the previous behaviour is re-enabled. | 
 |        When this option is set, \K is honoured in positive assertions, but ig- | 
 |        nored in negative ones. | 
 |  | 
 |  | 
 | ALTERNATION | 
 |  | 
 |          expr|expr|expr... | 
 |  | 
 |  | 
 | CAPTURING | 
 |  | 
 |          (...)           capture group | 
 |          (?<name>...)    named capture group (Perl) | 
 |          (?'name'...)    named capture group (Perl) | 
 |          (?P<name>...)   named capture group (Python) | 
 |          (?:...)         non-capture group | 
 |          (?|...)         non-capture group; reset group numbers for | 
 |                           capture groups in each alternative | 
 |  | 
 |        In  non-UTF  modes, names may contain underscores and ASCII letters and | 
 |        digits; in UTF modes, any Unicode letters and  Unicode  decimal  digits | 
 |        are permitted. In both cases, a name must not start with a digit. | 
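As a small illustration of the Python-style named group and the non-capture group, the following sketch uses Python's re module as a stand-in for PCRE2. The group names year and day are chosen for the example only.

```python
import re

# Two named capture groups with a non-capture group between them.
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(?P<day>\d{2})")
m = pattern.match("2025-10-14")

print(m.group("year"), m.group("day"))  # prints: 2025 14
print(m.groups())                       # the (?:...) group is not numbered
```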
 |  | 
 |  | 
 | ATOMIC GROUPS | 
 |  | 
 |          (?>...)         atomic non-capture group | 
 |          (*atomic:...)   atomic non-capture group | 
 |  | 
 |  | 
 | COMMENT | 
 |  | 
 |          (?#....)        comment (not nestable) | 
 |  | 
 |  | 
 | OPTION SETTING | 
 |        Changes  of these options within a group are automatically cancelled at | 
 |        the end of the group. | 
 |  | 
 |          (?a)            all ASCII options | 
 |          (?aD)           restrict \d to ASCII in UCP mode | 
 |          (?aS)           restrict \s to ASCII in UCP mode | 
 |          (?aW)           restrict \w to ASCII in UCP mode | 
 |          (?aP)           restrict all POSIX classes to ASCII in UCP mode | 
 |          (?aT)           restrict POSIX digit classes to ASCII in UCP mode | 
 |          (?i)            caseless | 
 |          (?J)            allow duplicate named groups | 
 |          (?m)            multiline | 
 |          (?n)            no auto capture | 
 |          (?r)            restrict caseless to either ASCII or non-ASCII | 
 |          (?s)            single line (dotall) | 
 |          (?U)            default ungreedy (lazy) | 
 |          (?x)            ignore white space except in classes or \Q...\E | 
 |          (?xx)           as (?x) but also ignore space and tab in classes | 
 |          (?-...)         unset the given option(s) | 
 |          (?^)            unset imnrsx options | 
 |  | 
 |        (?aP) implies (?aT) as well, though this has no additional effect. How- | 
 |        ever, it means that (?-aP) also implies (?-aT) and disables  all  ASCII | 
 |        restrictions for POSIX classes. | 
 |  | 
 |        Unsetting x or xx unsets both. Several options may be set at once, and | 
 |        a mixture of setting and unsetting such as (?i-x) is allowed, but there | 
 |        may be only one hyphen. Setting (but not unsetting) is allowed after | 
 |        (?^, for example (?^in). An option setting may appear at the start of a | 
 |        non-capture group, for example (?i:...). | 
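The scoped form (?i:...) and a few of the single-letter options can be tried out with Python's re module, used here only as a stand-in. Python accepts just a subset of the options listed above, and unscoped options must appear at the start of the pattern.

```python
import re

scoped = re.compile(r"(?i:ab)c")            # caseless applies to "ab" only
scoped_hit = scoped.fullmatch("ABc") is not None
scoped_miss = scoped.fullmatch("ABC") is None

dotall_hit = re.fullmatch(r"(?s)a.b", "a\nb") is not None   # dot matches newline
plain_miss = re.fullmatch(r"a.b", "a\nb") is None

# (?x) ignores white space and allows a trailing comment in the pattern.
extended_hit = re.fullmatch(r"(?x) a b c  # comment", "abc") is not None

print(scoped_hit, scoped_miss, dotall_hit, plain_miss, extended_hit)
```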
 |  | 
 |        The following are recognized only at the very start of a pattern or af- | 
 |        ter one of the newline or \R sequences or options with similar  syntax. | 
 |        More  than  one of them may appear. For the first three, d is a decimal | 
 |        number. | 
 |  | 
 |          (*LIMIT_DEPTH=d)     set the backtracking limit to d | 
 |          (*LIMIT_HEAP=d)      set the heap size limit to d * 1024 bytes | 
 |          (*LIMIT_MATCH=d)     set the match limit to d | 
 |          (*CASELESS_RESTRICT) set PCRE2_EXTRA_CASELESS_RESTRICT when matching | 
 |          (*NOTEMPTY)          set PCRE2_NOTEMPTY when matching | 
 |          (*NOTEMPTY_ATSTART)  set PCRE2_NOTEMPTY_ATSTART when matching | 
 |          (*NO_AUTO_POSSESS)   no auto-possessification (PCRE2_NO_AUTO_POSSESS) | 
 |          (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR) | 
 |          (*NO_JIT)            disable JIT optimization | 
 |          (*NO_START_OPT)      no start-match optimization | 
 |                                 (PCRE2_NO_START_OPTIMIZE) | 
 |          (*TURKISH_CASING)    set PCRE2_EXTRA_TURKISH_CASING when matching | 
 |          (*UTF)               set appropriate UTF mode for the library in use | 
 |          (*UCP)               set PCRE2_UCP (use Unicode properties for | 
 |                                 \d etc) | 
 |  | 
 |        Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce  the | 
 |        value   of   the   limits   set  by  the  caller  of  pcre2_match()  or | 
 |        pcre2_dfa_match(), not increase them. LIMIT_RECURSION  is  an  obsolete | 
 |        synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF) | 
 |        and  (*UCP)  by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, | 
 |        respectively, at compile time. | 
 |  | 
 |  | 
 | NEWLINE CONVENTION | 
 |  | 
 |        These are recognized only at the very start of the pattern or after op- | 
 |        tion settings with a similar syntax. | 
 |  | 
 |          (*CR)           carriage return only | 
 |          (*LF)           linefeed only | 
 |          (*CRLF)         carriage return followed by linefeed | 
 |          (*ANYCRLF)      all three of the above | 
 |          (*ANY)          any Unicode newline sequence | 
 |          (*NUL)          the NUL character (binary zero) | 
 |  | 
 |  | 
 | WHAT \R MATCHES | 
 |  | 
 |        These are recognized only at the very start of the pattern or after | 
 |        option settings with a similar syntax. | 
 |  | 
 |          (*BSR_ANYCRLF)  CR, LF, or CRLF | 
 |          (*BSR_UNICODE)  any Unicode newline sequence | 
 |  | 
 |  | 
 | LOOKAHEAD AND LOOKBEHIND ASSERTIONS | 
 |  | 
 |          (?=...)                     ) | 
 |          (*pla:...)                  ) positive lookahead | 
 |          (*positive_lookahead:...)   ) | 
 |  | 
 |          (?!...)                     ) | 
 |          (*nla:...)                  ) negative lookahead | 
 |          (*negative_lookahead:...)   ) | 
 |  | 
 |          (?<=...)                    ) | 
 |          (*plb:...)                  ) positive lookbehind | 
 |          (*positive_lookbehind:...)  ) | 
 |  | 
 |          (?<!...)                    ) | 
 |          (*nlb:...)                  ) negative lookbehind | 
 |          (*negative_lookbehind:...)  ) | 
 |  | 
 |        Each top-level branch of a lookbehind must have a limit for the  number | 
 |        of  characters it matches. If any branch can match a variable number of | 
 |        characters, the maximum for each branch is limited to a  value  set  by | 
 |        the  caller  of  pcre2_compile()  or defaulted. The default is set when | 
 |        PCRE2 is built (ultimate default 255). If every branch matches a  fixed | 
 |        number of characters, the limit for each branch is 65535 characters. | 
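The four basic lookaround forms can be compared on one subject string. The sketch uses Python's re module as a stand-in; unlike PCRE2, it only accepts fixed-length lookbehinds, so the examples keep to that case. The subject text is made up for the illustration.

```python
import re

text = "10 USD, 20 EUR, $5"

followed_by_usd = re.findall(r"\d+(?= USD)", text)        # positive lookahead
not_followed_by_usd = re.findall(r"\d+(?!\d| USD)", text)  # negative lookahead
after_dollar = re.findall(r"(?<=\$)\d+", text)            # positive lookbehind
not_after_dollar = re.findall(r"(?<!\$)\b\d+", text)      # negative lookbehind

print(followed_by_usd, not_followed_by_usd, after_dollar, not_after_dollar)
```

The \d inside the negative lookahead stops the engine from backtracking to a shorter run of digits and reporting a partial number.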
 |  | 
 |  | 
 | NON-ATOMIC LOOKAROUND ASSERTIONS | 
 |  | 
 |        These assertions are specific to PCRE2 and are not Perl-compatible. | 
 |  | 
 |          (?*...)                                ) | 
 |          (*napla:...)                           ) synonyms | 
 |          (*non_atomic_positive_lookahead:...)   ) | 
 |  | 
 |          (?<*...)                               ) | 
 |          (*naplb:...)                           ) synonyms | 
 |          (*non_atomic_positive_lookbehind:...)  ) | 
 |  | 
 |  | 
 | SUBSTRING SCAN ASSERTION | 
 |        This feature is not Perl-compatible. | 
 |  | 
 |          (*scan_substring:(grouplist)...)  scan captured substring | 
 |          (*scs:(grouplist)...)             scan captured substring | 
 |  | 
 |        The  comma-separated list "grouplist" may identify groups in any of the | 
 |        following ways: | 
 |  | 
 |          n       absolute reference | 
 |          +n      relative reference | 
 |          -n      relative reference | 
 |          <name>  name | 
 |          'name'  name | 
 |  | 
 |  | 
 | SCRIPT RUNS | 
 |  | 
 |          (*script_run:...)           ) script run, can be backtracked into | 
 |          (*sr:...)                   ) | 
 |  | 
 |          (*atomic_script_run:...)    ) atomic script run | 
 |          (*asr:...)                  ) | 
 |  | 
 |  | 
 | BACKREFERENCES | 
 |  | 
 |          \n              reference by number (can be ambiguous) | 
 |          \gn             reference by number | 
 |          \g{n}           reference by number | 
 |          \g+n            relative reference by number (PCRE2 extension) | 
 |          \g-n            relative reference by number | 
 |          \g{+n}          relative reference by number (PCRE2 extension) | 
 |          \g{-n}          relative reference by number | 
 |          \k<name>        reference by name (Perl) | 
 |          \k'name'        reference by name (Perl) | 
 |          \g{name}        reference by name (Perl) | 
 |          \k{name}        reference by name (.NET) | 
 |          (?P=name)       reference by name (Python) | 
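A typical use of a backreference is finding a repeated word. The sketch below shows the numbered form \n and the Python named form (?P=name), using Python's re module as a stand-in for PCRE2. The sample sentences are invented.

```python
import re

# Numbered backreference: \1 must repeat whatever group 1 captured.
by_number = re.search(r"\b(\w+) \1\b", "this is is a test")

# Named backreference in the Python syntax.
by_name = re.search(r"\b(?P<word>\w+) (?P=word)\b", "it was was fine")

print(by_number.group(1), by_name.group("word"))  # prints: is was
```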
 |  | 
 |  | 
 | SUBROUTINE REFERENCES (POSSIBLY RECURSIVE) | 
 |  | 
 |          (?R)            recurse whole pattern | 
 |          (?n)            call subroutine by absolute number | 
 |          (?+n)           call subroutine by relative number | 
 |          (?-n)           call subroutine by relative number | 
 |          (?&name)        call subroutine by name (Perl) | 
 |          (?P>name)       call subroutine by name (Python) | 
 |          \g<name>        call subroutine by name (Oniguruma) | 
 |          \g'name'        call subroutine by name (Oniguruma) | 
 |          \g<n>           call subroutine by absolute number (Oniguruma) | 
 |          \g'n'           call subroutine by absolute number (Oniguruma) | 
 |          \g<+n>          call subroutine by relative number (PCRE2 extension) | 
 |          \g'+n'          call subroutine by relative number (PCRE2 extension) | 
 |          \g<-n>          call subroutine by relative number (PCRE2 extension) | 
 |          \g'-n'          call subroutine by relative number (PCRE2 extension) | 
 |  | 
 |        The variants using parentheses (?...) may also specify a list  of  cap- | 
 |        ture  groups  to  return, which shall be retained in the calling subex- | 
 |        pression if set during the recursion (this feature is not supported  by | 
 |        Perl). | 
 |  | 
 |          (?R(grouplist))       recurse whole pattern, returning capture groups | 
 |                                  (PCRE2 extension) | 
 |          (?n(grouplist))       ) | 
 |          (?+n(grouplist))      ) call subroutine, returning capture groups | 
 |          (?-n(grouplist))      )   (PCRE2 extension) | 
 |          (?&name(grouplist))   ) | 
 |          (?P>name(grouplist))  ) | 
 |  | 
 |        The   comma-separated   list   "grouplist"  uses  the  same  syntax  as | 
 |        (*scan_substring:(grouplist)...), and may identify groups in any of the | 
 |        following ways: | 
 |  | 
 |          n       absolute reference | 
 |          +n      relative reference | 
 |          -n      relative reference | 
 |          <name>  name | 
 |          'name'  name | 
 |  | 
 |  | 
 | CONDITIONAL PATTERNS | 
 |  | 
 |          (?(condition)yes-pattern) | 
 |          (?(condition)yes-pattern|no-pattern) | 
 |  | 
 |          (?(n)                absolute reference condition | 
 |          (?(+n)               relative reference condition (PCRE2 extension) | 
 |          (?(-n)               relative reference condition (PCRE2 extension) | 
 |          (?(<name>)           named reference condition (Perl) | 
 |          (?('name')           named reference condition (Perl) | 
 |          (?(name)             named reference condition (PCRE2, deprecated) | 
 |          (?(R)                overall recursion condition | 
 |          (?(Rn)               specific numbered group recursion condition | 
 |          (?(R&name)           specific named group recursion condition | 
 |          (?(DEFINE)           define groups for reference | 
 |          (?(VERSION[>]=n[.m]) test PCRE2 version | 
 |          (?(assert)           assertion condition | 
 |  | 
 |        Note the ambiguity of (?(R) and (?(Rn) which might be  named  reference | 
 |        conditions  or  recursion  tests.  Such a condition is interpreted as a | 
 |        reference condition if the relevant named group exists. | 
 |  | 
 |        The parts shown in square brackets in the VERSION conditional syntax | 
 |        may be omitted. When the fractional part of the version number is | 
 |        omitted, it defaults to 0. | 
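The absolute reference condition (?(n)yes-pattern|no-pattern) can be illustrated with a pattern that requires a closing angle bracket only when an opening one was matched. Python's re module, which supports this one form, is used as a stand-in; the other conditions in the table are not available there.

```python
import re

# If group 1 (the "<") took part in the match, require ">"; otherwise
# require the end of the subject.
pattern = re.compile(r"(<)?\w+(?(1)>|$)")

subjects = ["<user>", "user", "<user", "user>"]
results = {s: pattern.match(s) is not None for s in subjects}
print(results)
```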
 |  | 
 |  | 
 | BACKTRACKING CONTROL | 
 |  | 
 |        All backtracking control verbs may be in  the  form  (*VERB:NAME).  For | 
 |        (*MARK)  the  name is mandatory, for the others it is optional. (*SKIP) | 
 |        changes its behaviour if :NAME is present. The others just set  a  name | 
 |        for passing back to the caller, but this is not a name that (*SKIP) can | 
 |        see. The following act as soon as they are reached: | 
 |  | 
 |          (*ACCEPT)       force successful match | 
 |          (*FAIL)         force backtrack; synonym (*F) | 
 |          (*MARK:NAME)    set name to be passed back; synonym (*:NAME) | 
 |  | 
 |        The  following  act only when a subsequent match failure causes a back- | 
 |        track to reach them. They all force a match failure, but they differ in | 
 |        what happens afterwards. Those that advance the start-of-match point do | 
 |        so only if the pattern is not anchored. | 
 |  | 
 |          (*COMMIT)       overall failure, no advance of starting point | 
 |          (*PRUNE)        advance to next starting character | 
 |          (*SKIP)         advance to current matching position | 
 |          (*SKIP:NAME)    advance to position corresponding to an earlier | 
 |                          (*MARK:NAME); if not found, the (*SKIP) is ignored | 
 |          (*THEN)         local failure, backtrack to next alternation | 
 |  | 
 |        The effect of one of these verbs in a group called as a  subroutine  is | 
 |        confined to the subroutine call. | 
 |  | 
 |  | 
 | CALLOUTS | 
 |  | 
 |          (?C)            callout (assumed number 0) | 
 |          (?Cn)           callout with numerical data n | 
 |          (?C"text")      callout with string data | 
 |  | 
 |        The allowed string delimiters are ` ' " ^ % # $ (which are the same for | 
 |        the  start  and the end), and the starting delimiter { matched with the | 
 |        ending delimiter }. To encode the ending delimiter within  the  string, | 
 |        double it. | 
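The delimiter rule for string callouts can be sketched as a small formatting helper. This is an illustration only; the function name callout_with_string is hypothetical and is not part of any PCRE2 API.

```python
def callout_with_string(text: str, open_delim: str = '"') -> str:
    """Hypothetical helper: build a (?C...) item with string data,
    doubling any ending delimiter that occurs inside the string."""
    allowed = "`'\"^%#${"
    if len(open_delim) != 1 or open_delim not in allowed:
        raise ValueError("unsupported callout delimiter")
    # "{" is closed by "}"; every other delimiter closes itself.
    close_delim = "}" if open_delim == "{" else open_delim
    escaped = text.replace(close_delim, close_delim * 2)
    return "(?C" + open_delim + escaped + close_delim + ")"


print(callout_with_string('say "hi"'))
print(callout_with_string("a}b", "{"))
```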
 |  | 
 |  | 
 | REPLACEMENT STRINGS | 
 |  | 
 |        If the PCRE2_SUBSTITUTE_LITERAL option is set, a replacement string for | 
 |        pcre2_substitute()  is not interpreted. Otherwise, by default, the only | 
 |        special character is the dollar  character  in  one  of  the  following | 
 |        forms: | 
 |  | 
 |          $$                  insert a dollar character | 
 |          $n or ${n}          insert the contents of group n | 
 |          $<name>             insert the contents of named group | 
 |          $0 or $&            insert the entire matched substring | 
 |          $`                  insert the substring that precedes the match | 
 |          $'                  insert the substring that follows the match | 
 |          $_                  insert the entire input string | 
 |          $+                  insert the highest-numbered capture group | 
 |                                which matched | 
 |          $*MARK or ${*MARK}  insert a control verb name | 
 |  | 
 |        For ${n}, n can be a name or a number. If PCRE2_SUBSTITUTE_EXTENDED  is | 
 |        set, there is additional interpretation: | 
 |  | 
 |        1.  Backslash  is  an escape character, and the forms described in "ES- | 
 |        CAPED CHARACTERS" above are recognized. Also: | 
 |  | 
 |          \Q...\E can be used to suppress interpretation | 
 |          \l      force the next character to lower case | 
 |          \u      force the next character to upper case | 
 |          \L      force subsequent characters to lower case | 
 |          \U      force subsequent characters to upper case | 
 |          \u\L    force next character to upper case, then all lower | 
 |          \l\U    force next character to lower case, then all upper | 
 |          \E      end \L or \U case forcing | 
 |          \b      backspace character (note: as in character class in pattern) | 
 |          \v      vertical tab character (note: not the same as in a pattern) | 
 |  | 
 |        2. The Python form \g<n>, where the angle brackets are part of the syn- | 
 |        tax and n is either a group name or a number, is recognized as  an  al- | 
 |        ternative way of inserting the contents of a group, for example \g<3>. | 
 |  | 
 |        3. Capture substitution supports the following additional forms: | 
 |  | 
 |          ${n:-string}             default for unset group | 
 |          ${n:+string1:string2}    values for set/unset group | 
 |  | 
 |        The substitution strings themselves are expanded. Backslash can be used | 
 |        to escape colons and closing curly brackets. | 
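The \g<n> form mentioned in item 2 comes from Python, so it can be shown directly with Python's re.sub(). The sketch below does not call pcre2_substitute(). The last example only emulates the effect of ${n:-string} with a callback, and the helper name default_if_unset is hypothetical.

```python
import re

# \g<n> with a group number, and \g<name> with a group name.
swapped = re.sub(r"(\w+)@(\w+)", r"\g<2> at \g<1>", "user@host")
renamed = re.sub(r"(?P<key>\w+)=(?P<val>\w+)", r"\g<val>:\g<key>", "a=1")


def default_if_unset(m: re.Match) -> str:
    """Emulates ${1:-default}: use group 1 if set, else a default."""
    return m.group(1) if m.group(1) is not None else "default"


filled = re.sub(r"(\d+)?x", default_if_unset, "x 7x")

print(swapped, renamed, filled)
```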
 |  | 
 |  | 
 | SEE ALSO | 
 |  | 
 |        pcre2pattern(3),    pcre2api(3),   pcre2callout(3),   pcre2matching(3), | 
 |        pcre2(3). | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 14 October 2025 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                 14 October 2025                 PCRE2SYNTAX(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  | 
 | PCRE2UNICODE(3)            Library Functions Manual            PCRE2UNICODE(3) | 
 |  | 
 |  | 
 | NAME | 
 |        PCRE2 - Perl-compatible regular expressions (revised API) | 
 |  | 
 |  | 
 | UNICODE AND UTF SUPPORT | 
 |  | 
 |        PCRE2 is normally built with Unicode support, though if you do not need | 
 |        it,  you  can  build  it  without,  in  which  case the library will be | 
 |        smaller. With Unicode support, PCRE2 has knowledge of Unicode character | 
 |        properties and can process strings of text in UTF-8, UTF-16, and UTF-32 | 
 |        format (depending on the code unit width), but this is not the default. | 
 |        Unless specifically requested, PCRE2 treats each code unit in a  string | 
 |        as one character. | 
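The distinction between code units and characters can be made concrete by counting the code units a string occupies at each width. The sketch uses Python's codecs for the encoding; the helper name code_units is hypothetical.

```python
def code_units(text: str, width: int) -> int:
    """Number of code units needed for text at the given code unit
    width (8, 16, or 32 bits)."""
    codec = {8: "utf-8", 16: "utf-16-le", 32: "utf-32-le"}[width]
    return len(text.encode(codec)) // (width // 8)


for ch in ("A", "\u00e9", "\U0001f600"):
    print(f"U+{ord(ch):04X}", [code_units(ch, w) for w in (8, 16, 32)])
```

Outside UTF mode, each of those code units would be treated as a separate character.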
 |  | 
 |        There  are two ways of telling PCRE2 to switch to UTF mode, where char- | 
 |        acters may consist of more than one code unit and the range  of  values | 
 |        is constrained. The program can call pcre2_compile() with the PCRE2_UTF | 
 |        option,  or  the  pattern may start with the sequence (*UTF).  However, | 
 |        the latter facility can be locked out by  the  PCRE2_NEVER_UTF  option. | 
 |        That  is,  the  programmer can prevent the supplier of the pattern from | 
 |        switching to UTF mode. | 
 |  | 
 |        Note  that  the  PCRE2_MATCH_INVALID_UTF  option  (see  below)   forces | 
 |        PCRE2_UTF to be set. | 
 |  | 
 |        In  UTF mode, both the pattern and any subject strings that are matched | 
 |        against it are treated as UTF strings instead of strings of  individual | 
 |        one-code-unit  characters. There are also some other changes to the way | 
 |        characters are handled, as documented below. | 
 |  | 
 |  | 
 | UNICODE PROPERTY SUPPORT | 
 |  | 
 |        When PCRE2 is built with Unicode support, the escape sequences  \p{..}, | 
 |        \P{..}, and \X can be used. This is not dependent on the PCRE2_UTF set- | 
 |        ting.   The Unicode properties that can be tested are a subset of those | 
 |        that Perl supports. Currently they are limited to the general  category | 
 |        properties such as Lu for an upper case letter or Nd for a decimal num- | 
 |        ber, the derived properties Any and Lc (synonym L&), the Unicode script | 
 |        names such as Arabic or Han, Bidi_Class, Bidi_Control, and a few binary | 
 |        properties. | 
 |  | 
 |        The full lists are given in the pcre2pattern and pcre2syntax documenta- | 
 |        tion.  In  general,  only the short names for properties are supported. | 
 |        For example, \p{L} matches a letter. Its longer synonym, \p{Letter}, is | 
 |        not supported. Furthermore, in Perl, many properties may optionally  be | 
 |        prefixed  by "Is", for compatibility with Perl 5.6. PCRE2 does not sup- | 
 |        port this. | 
 |  | 
 |  | 
 | WIDE CHARACTERS AND UTF MODES | 
 |  | 
 |        Code points less than 256 can be specified in patterns by either braced | 
 |        or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). | 
 |        Larger values have to use braced sequences. Unbraced octal code  points | 
 |        up to \777 are also recognized; larger ones can be coded using \o{...}. | 
 |  | 
 |        The  escape sequence \N{U+<hex digits>} is recognized as another way of | 
 |        specifying a Unicode character by code point in a UTF mode. It  is  not | 
 |        allowed in non-UTF mode. | 
 |  | 
 |        In  UTF  mode, repeat quantifiers apply to complete UTF characters, not | 
 |        to individual code units. | 
 |  | 
 |        In UTF mode, the dot metacharacter matches one UTF character instead of | 
 |        a single code unit. | 
 |  | 
 |        In UTF mode, capture group names are not restricted to ASCII,  and  may | 
 |        contain any Unicode letters and decimal digits, as well as underscore. | 
 |  | 
 |        The  escape  sequence \C can be used to match a single code unit in UTF | 
 |        mode, but its use can lead to some strange effects because it breaks up | 
 |        multi-unit characters (see the description of \C  in  the  pcre2pattern | 
 |        documentation). For this reason, there is a build-time option that dis- | 
 |        ables  support  for  \C completely. There is also a less draconian com- | 
 |        pile-time option for locking out the use of \C when a pattern  is  com- | 
 |        piled. | 
 |  | 
 |        The  use  of  \C  is not supported by the alternative matching function | 
 |        pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac- | 
 |        ter may consist of more than one code unit. The  use  of  \C  in  these | 
 |        modes  provokes a match-time error. Also, the JIT optimization does not | 
 |        support \C in these modes. If JIT optimization is requested for a UTF-8 | 
 |        or UTF-16 pattern that contains \C, it will not succeed,  and  so  when | 
 |        pcre2_match() is called, the matching will be carried out by the inter- | 
 |        pretive function. | 
 |  | 
 |        The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test | 
 |        characters  of  any  code  value,  but, by default, the characters that | 
 |        PCRE2 recognizes as digits, spaces, or word characters remain the  same | 
 |        set  as  in  non-UTF mode, all with code points less than 256. This re- | 
 |        mains true even when PCRE2 is built to include Unicode support, because | 
 |        to do otherwise would slow down matching in  many  common  cases.  Note | 
 |        that  this also applies to \b and \B, because they are defined in terms | 
 |        of \w and \W. If you want to test for a wider sense of,  say,  "digit", | 
 |        you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna- | 
 |        tively, if you set the PCRE2_UCP option, the way that the character es- | 
 |        capes work is changed so that Unicode properties are used to  determine | 
 |        which  characters  match,  though  there are some options that suppress | 
 |        this for individual escapes. For details see  the  section  on  generic | 
 |        character types in the pcre2pattern documentation. | 
 |  | 
 |        Like  the  escapes,  characters  that  match  the POSIX named character | 
 |        classes are all low-valued characters unless the  PCRE2_UCP  option  is | 
 |        set, but there is an option to override this. | 
 |  | 
 |        In contrast to the character escapes and character classes, the special | 
 |        horizontal  and  vertical  white  space escapes (\h, \H, \v, and \V) do | 
 |        match all the appropriate Unicode characters, whether or not  PCRE2_UCP | 
 |        is set. | 
 |  | 
 |  | 
 | UNICODE CASE-EQUIVALENCE | 
 |  | 
 |        If  either  PCRE2_UTF  or PCRE2_UCP is set, upper/lower case processing | 
 |        makes use of Unicode properties except for characters whose code points | 
 |        are less than 128 and that have at most two case-equivalent values. For | 
 |        these, a direct table lookup is used for speed. A few  Unicode  charac- | 
 |        ters  such as Greek sigma have more than two code points that are case- | 
 |        equivalent, and these are treated specially. Setting PCRE2_UCP  without | 
 |        PCRE2_UTF  allows  Unicode-style  case processing for non-UTF character | 
 |        encodings such as UCS-2. | 
 |  | 
 |        There are two ASCII characters (S and K) that,  in  addition  to  their | 
 |        ASCII  lower case equivalents, have a non-ASCII one as well (long S and | 
 |        Kelvin sign).  Recognition of these non-ASCII characters as case-equiv- | 
 |        alent to their ASCII  counterparts  can  be  disabled  by  setting  the | 
 |        PCRE2_EXTRA_CASELESS_RESTRICT  option. When this is set, all characters | 
 |        in a case equivalence must either be ASCII or non-ASCII; there  can  be | 
 |        no mixing. | 
 |  | 
 |            Without PCRE2_EXTRA_CASELESS_RESTRICT: | 
 |              'k' = 'K' = U+212A (Kelvin sign) | 
 |              's' = 'S' = U+017F (long S) | 
 |            With PCRE2_EXTRA_CASELESS_RESTRICT: | 
 |              'k' = 'K' | 
 |              U+212A (Kelvin sign)  only case-equivalent to itself | 
 |              's' = 'S' | 
 |              U+017F (long S)       only case-equivalent to itself | 
 |  | 
 |        One  language family, Turkish and Azeri, has its own case-insensitivity | 
 |        rules, which can be  selected  by  setting  PCRE2_EXTRA_TURKISH_CASING. | 
 |        This  alters  the behaviour of the 'i', 'I', U+0130 (capital I with dot | 
 |        above), and U+0131 (small dotless i) characters. | 
 |  | 
 |            Without PCRE2_EXTRA_TURKISH_CASING: | 
 |              'i' = 'I' | 
 |              U+0130 (capital I with dot above)  only case-equivalent to itself | 
 |              U+0131 (small dotless i)           only case-equivalent to itself | 
 |            With PCRE2_EXTRA_TURKISH_CASING: | 
 |              'i' = U+0130 (capital I with dot above) | 
 |              U+0131 (small dotless i) = 'I' | 
 |  | 
 |        The PCRE2_EXTRA_CASELESS_RESTRICT and PCRE2_EXTRA_TURKISH_CASING options | 
 |        cannot be set together. | 
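 |  | 
 |        As an illustration only, the equivalences tabulated above can be | 
 |        modelled by the following C sketch. It does not use the PCRE2 library; | 
 |        the SKETCH_ flags and both function names are invented for the example | 
 |        and are not PCRE2 identifiers. Only the ASCII letters and the four | 
 |        special characters named above are handled. | 
 |  | 

```c
#include <stdint.h>

/* Flags for this sketch only; they are NOT the PCRE2 option bits. */
enum { SKETCH_CASELESS_RESTRICT = 1, SKETCH_TURKISH_CASING = 2 };

/* Map a code point to one representative of its case-equivalence class.
   Only ASCII letters and the special characters tabulated above are
   handled; every other code point is equivalent only to itself. */
static uint32_t sketch_fold(uint32_t c, unsigned flags)
{
    if (flags & SKETCH_TURKISH_CASING) {
        if (c == 'i' || c == 0x0130) return 0x0130;  /* dotted pair */
        if (c == 'I' || c == 0x0131) return 0x0131;  /* dotless pair */
    }
    if (!(flags & SKETCH_CASELESS_RESTRICT)) {
        if (c == 0x212A) return 'k';  /* Kelvin sign joins k/K */
        if (c == 0x017F) return 's';  /* long S joins s/S */
    }
    if (c >= 'A' && c <= 'Z') return c + ('a' - 'A');
    return c;
}

/* Returns 1 if a and b are case-equivalent, 0 if they are not, and -1 if
   both flags are given, mirroring the rule that they cannot be combined. */
int sketch_case_equivalent(uint32_t a, uint32_t b, unsigned flags)
{
    if ((flags & SKETCH_CASELESS_RESTRICT) && (flags & SKETCH_TURKISH_CASING))
        return -1;
    return sketch_fold(a, flags) == sketch_fold(b, flags);
}
```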
 |  | 
 |        From  release  10.45  the Unicode letter properties Lu (upper case), Ll | 
 |        (lower case), and Lt (title case) are all treated as Lc (cased  letter) | 
 |        when  caseless  matching  is  set  by the PCRE2_CASELESS option or (?i) | 
 |        within the pattern. | 
 |  | 
 |  | 
 | SCRIPT RUNS | 
 |  | 
 |        The pattern constructs (*script_run:...) and  (*atomic_script_run:...), | 
 |        with  synonyms (*sr:...) and (*asr:...), verify that the string matched | 
 |        within the parentheses is a script run. In concept, a script run  is  a | 
 |        sequence  of characters that are all from the same Unicode script. How- | 
 |        ever, because some scripts are commonly used together, and because some | 
 |        diacritical and other marks are used with multiple scripts, it  is  not | 
 |        that simple. | 
 |  | 
 |        Every Unicode character has a Script property, mostly with a value cor- | 
 |        responding  to the name of a script, such as Latin, Greek, or Cyrillic. | 
 |        There are also three special values: | 
 |  | 
 |        "Unknown" is used for code points that have not been assigned, and also | 
 |        for the surrogate code points. In the PCRE2 32-bit library,  characters | 
 |        whose  code  points  are  greater  than the Unicode maximum (U+10FFFF), | 
 |        which are accessible only in non-UTF mode,  are  assigned  the  Unknown | 
 |        script. | 
 |  | 
 |        "Common"  is used for characters that are used with many scripts. These | 
 |        include punctuation, emoji, mathematical, musical,  and  currency  sym- | 
 |        bols, and the ASCII digits 0 to 9. | 
 |  | 
 |        "Inherited"  is used for characters such as diacritical marks that mod- | 
 |        ify a previous character. These are considered to take on the script of | 
 |        the character that they modify. | 
 |  | 
 |        Some Inherited characters are used with many scripts, but many of  them | 
 |        are  only  normally  used  with a small number of scripts. For example, | 
 |        U+102E0 (Coptic Epact thousands mark) is used only with Arabic and Cop- | 
 |        tic. In order to make it possible to check  this,  a  Unicode  property | 
 |        called Script Extension exists. Its value is a list of scripts that ap- | 
 |        ply to the character. For the majority of characters, the list contains | 
 |        just  one  script,  the  same  one as the Script property. However, for | 
 |        characters such as U+102E0 more than one Script is  listed.  There  are | 
 |        also  some  Common  characters that have a single, non-Common script in | 
 |        their Script Extension list. | 
 |  | 
 |        The next section describes the basic rules for deciding whether a given | 
 |        string of characters is a script run. Note,  however,  that  there  are | 
 |        some  special cases involving the Chinese Han script, and an additional | 
 |        constraint for decimal digits. These are  covered  in  subsequent  sec- | 
 |        tions. | 
 |  | 
 |    Basic script run rules | 
 |  | 
 |        A string that is less than two characters long is a script run. This is | 
 |        the  only  case  in  which an Unknown character can be part of a script | 
 |        run. Longer strings are checked using only the Script Extensions  prop- | 
 |        erty, not the basic Script property. | 
 |  | 
 |        If  a character's Script Extension property is the single value "Inher- | 
 |        ited", it is always accepted as part of a script run. This is also true | 
 |        for the property "Common", subject to the checking  of  decimal  digits | 
 |        described below. All the remaining characters in a script run must have | 
 |        at  least one script in common in their Script Extension lists. In set- | 
 |        theoretic terminology, the intersection of all the sets of scripts must | 
 |        not be empty. | 
 |  | 
 |        A simple example is an Internet name such as "google.com". The  letters | 
 |        are all in the Latin script, and the dot is Common, so this string is a | 
 |        script run.  However, the Cyrillic letter "o" looks exactly the same as | 
 |        the  Latin "o"; a string that looks the same, but with Cyrillic "o"s is | 
 |        not a script run. | 
 |  | 
 |        More interesting examples involve characters with more than one  script | 
 |        in their Script Extension. Consider the following characters: | 
 |  | 
 |          U+060C  Arabic comma | 
 |          U+06D4  Arabic full stop | 
 |  | 
 |        The  first  has the Script Extension list Arabic, Hanifi Rohingya, Syr- | 
 |        iac, and Thaana; the second has just Arabic and Hanifi  Rohingya.  Both | 
 |        of  them  could  appear  in  script runs of either Arabic or Hanifi Ro- | 
 |        hingya. The first could also appear in Syriac or  Thaana  script  runs, | 
 |        but the second could not. | 
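 |  | 
 |        The basic rules can be sketched in C as an intersection of bit sets. | 
 |        This is a simplified model, not the PCRE2 implementation: the SK_ | 
 |        constants and the function name are invented for the example, the | 
 |        caller is assumed to supply each character's Script Extension list as | 
 |        a bit mask, and the Han and decimal digit rules are left out. | 
 |  | 

```c
#include <stdbool.h>
#include <stddef.h>

/* One bit per script; the three special values get bits of their own.
   All of these names are invented for the sketch. */
enum {
    SK_LATIN = 1 << 0, SK_CYRILLIC = 1 << 1, SK_ARABIC = 1 << 2,
    SK_ROHINGYA = 1 << 3, SK_SYRIAC = 1 << 4, SK_THAANA = 1 << 5,
    SK_COMMON = 1 << 6, SK_INHERITED = 1 << 7, SK_UNKNOWN = 1 << 8
};

/* scx[i] is the Script Extension list of character i, as a bit mask.
   Returns true if the n characters satisfy the basic script run rules. */
bool sketch_is_script_run(const unsigned *scx, size_t n)
{
    unsigned shared = ~0u;   /* scripts common to all characters so far */
    size_t i;

    if (n < 2) return true;  /* the only case that may contain Unknown */
    for (i = 0; i < n; i++) {
        if (scx[i] == SK_UNKNOWN) return false;
        if (scx[i] == SK_COMMON || scx[i] == SK_INHERITED) continue;
        shared &= scx[i];
        if (shared == 0) return false;  /* empty intersection */
    }
    return true;
}
```

 |        With these masks, U+060C would be passed as SK_ARABIC | SK_ROHINGYA | | 
 |        SK_SYRIAC | SK_THAANA and U+06D4 as SK_ARABIC | SK_ROHINGYA, so a | 
 |        Syriac letter followed by U+060C is accepted and a Syriac letter | 
 |        followed by U+06D4 is rejected. | 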
 |  | 
 |    The Chinese Han script | 
 |  | 
 |        The  Chinese  Han  script  is  commonly  used in conjunction with other | 
 |        scripts for writing certain languages. Japanese uses the  Hiragana  and | 
 |        Katakana  scripts  together  with Han; Korean uses Hangul and Han; Tai- | 
 |        wanese Mandarin uses Bopomofo and Han.  These  three  combinations  are | 
 |        treated  as special cases when checking script runs and are, in effect, | 
 |        "virtual scripts". Thus, a script run may contain a  mixture  of  Hira- | 
 |        gana,  Katakana,  and Han, or a mixture of Hangul and Han, or a mixture | 
 |        of Bopomofo and Han, but not, for example,  a  mixture  of  Hangul  and | 
 |        Bopomofo and Han. PCRE2 (like Perl) follows Unicode's Technical Standard | 
 |        39 ("Unicode Security Mechanisms", http://unicode.org/reports/tr39/) in | 
 |        allowing such mixtures. | 
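 |  | 
 |        One way to picture the three "virtual scripts" is as a subset test. | 
 |        The C sketch below is an illustration under stated assumptions, not | 
 |        PCRE2's code: the HS_ constants and the function name are invented, | 
 |        and the caller is assumed to have ORed together one flag for each of | 
 |        these five scripts that occurs in the candidate string. | 
 |  | 

```c
#include <stdbool.h>
#include <stddef.h>

/* Invented flags for the five scripts discussed above. */
enum {
    HS_HAN = 1 << 0, HS_HIRAGANA = 1 << 1, HS_KATAKANA = 1 << 2,
    HS_HANGUL = 1 << 3, HS_BOPOMOFO = 1 << 4
};

/* present: OR of the flags for every script seen in the string.
   Returns true if all of them fit inside one permitted combination. */
bool sketch_han_mixture_ok(unsigned present)
{
    static const unsigned virtual_scripts[] = {
        HS_HAN | HS_HIRAGANA | HS_KATAKANA,  /* Japanese */
        HS_HAN | HS_HANGUL,                  /* Korean */
        HS_HAN | HS_BOPOMOFO                 /* Taiwanese Mandarin */
    };
    size_t i;

    for (i = 0; i < sizeof virtual_scripts / sizeof virtual_scripts[0]; i++)
        if ((present & ~virtual_scripts[i]) == 0) return true;
    return false;
}
```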
 |  | 
 |    Decimal digits | 
 |  | 
 |        Unicode  contains  many sets of 10 decimal digits in different scripts, | 
 |        and some scripts (including the Common script) contain  more  than  one | 
 |        set. Some of these decimal digits are  visually  indistinguishable | 
 |        from the common ASCII digits. In addition to the  script  checking  de- | 
 |        scribed  above,  if a script run contains any decimal digits, they must | 
 |        all come from the same set of 10 adjacent characters. | 
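 |  | 
 |        The digit constraint can be sketched in C as follows. This is an | 
 |        illustration, not the PCRE2 implementation: the table lists only five | 
 |        of the many decimal digit sets in Unicode, and the table and function | 
 |        names are invented for the example. | 
 |  | 

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Code point of digit zero for a few decimal digit sets. A complete
   implementation would cover every set that Unicode defines. */
static const uint32_t sketch_digit_zero[] = {
    0x0030,  /* ASCII digits */
    0x0660,  /* Arabic-Indic digits */
    0x06F0,  /* Extended Arabic-Indic digits */
    0x0966,  /* Devanagari digits */
    0xFF10   /* Fullwidth digits */
};

/* Returns the zero of the set that cp belongs to, or 0 if cp is not a
   decimal digit known to this sketch. */
static uint32_t sketch_digit_set(uint32_t cp)
{
    size_t i;

    for (i = 0; i < sizeof sketch_digit_zero / sizeof sketch_digit_zero[0]; i++)
        if (cp >= sketch_digit_zero[i] && cp <= sketch_digit_zero[i] + 9)
            return sketch_digit_zero[i];
    return 0;
}

/* True if every decimal digit among the n code points comes from the
   same set of 10 adjacent characters. Non-digits are ignored. */
bool sketch_digits_from_one_set(const uint32_t *cps, size_t n)
{
    uint32_t seen = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        uint32_t set = sketch_digit_set(cps[i]);
        if (set == 0) continue;            /* not a decimal digit */
        if (seen == 0) seen = set;         /* first digit fixes the set */
        else if (set != seen) return false;
    }
    return true;
}
```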
 |  | 
 |  | 
 | VALIDITY OF UTF STRINGS | 
 |  | 
 |        When the PCRE2_UTF option is set, the strings passed  as  patterns  and | 
 |        subjects are (by default) checked for validity on entry to the relevant | 
 |        functions. If an invalid UTF string is passed, a negative error code is | 
 |        returned.  The  code  unit offset to the offending character can be ex- | 
 |        tracted from the match data  block  by  calling  pcre2_get_startchar(), | 
 |        which is used for this purpose after a UTF error. | 
 |  | 
 |        In  some  situations, you may already know that your strings are valid, | 
 |        and therefore want to skip these checks in  order  to  improve  perfor- | 
 |        mance,  for  example in the case of a long subject string that is being | 
 |        scanned repeatedly.  If you set the PCRE2_NO_UTF_CHECK option  at  com- | 
 |        pile  time  or at match time, PCRE2 assumes that the pattern or subject | 
 |        it is given (respectively) contains only valid UTF code unit sequences. | 
 |  | 
 |        If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is  set,  the | 
 |        result  is undefined and your program may crash or loop indefinitely or | 
 |        give incorrect results. There is, however, one mode  of  matching  that | 
 |        can  handle  invalid  UTF  subject  strings. This is enabled by passing | 
 |        PCRE2_MATCH_INVALID_UTF to pcre2_compile() and is discussed in the | 
 |        section MATCHING IN INVALID UTF STRINGS below. The rest of this section | 
 |        covers the case when PCRE2_MATCH_INVALID_UTF is not set. | 
 |  | 
 |        Passing PCRE2_NO_UTF_CHECK to pcre2_compile()  just  disables  the  UTF | 
 |        check  for  the  pattern; it does not also apply to subject strings. If | 
 |        you want to disable the check for a subject string you must  pass  this | 
 |        same option to pcre2_match() or pcre2_dfa_match(). | 
 |  | 
 |        UTF-16 and UTF-32 strings can indicate their endianness by a special | 
 |        code known as a byte-order mark (BOM). The PCRE2 functions do not handle | 
 |        this, expecting strings to be in host byte order. | 
 |  | 
 |        Unless PCRE2_NO_UTF_CHECK is set, a UTF string is  checked  before  any | 
 |        other  processing  takes  place.  In  the  case  of  pcre2_match()  and | 
 |        pcre2_dfa_match() calls with a non-zero starting offset, the  check  is | 
 |        applied only to that part of the subject that could be inspected during | 
 |        matching,  and  there is a check that the starting offset points to the | 
 |        first code unit of a character or to the end of the subject.  If  there | 
 |        are  no  lookbehind  assertions in the pattern, the check starts at the | 
 |        starting offset.  Otherwise, it starts at the  length  of  the  longest | 
 |        lookbehind  before  the starting offset, or at the start of the subject | 
 |        if there are not that many characters before the starting offset.  Note | 
 |        that the sequences \b and \B are one-character lookbehinds. | 
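 |  | 
 |        For a UTF-8 subject, the two calculations just described might look | 
 |        like the C sketch below. It is an illustration only: the function | 
 |        names are invented, and max_lookbehind stands for the length in | 
 |        characters of the longest lookbehind, which the caller is assumed to | 
 |        know. | 
 |  | 

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True if offset is at the first code unit of a character or at the end
   of the subject. UTF-8 continuation bytes have the form 10xxxxxx. */
bool sketch_offset_on_boundary(const uint8_t *subject, size_t length,
                               size_t offset)
{
    return offset == length ||
           (offset < length && (subject[offset] & 0xc0) != 0x80);
}

/* Offset at which the validity check begins: max_lookbehind characters
   before start_offset, or 0 if there are not that many characters. */
size_t sketch_check_start(const uint8_t *subject, size_t start_offset,
                          size_t max_lookbehind)
{
    size_t pos = start_offset;

    for (; max_lookbehind > 0 && pos > 0; max_lookbehind--) {
        pos--;
        /* Step back over continuation bytes to the character's start. */
        while (pos > 0 && (subject[pos] & 0xc0) == 0x80) pos--;
    }
    return pos;
}
```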
 |  | 
 |        In  addition  to checking the format of the string, there is a check to | 
 |        ensure that all code points lie in the range U+0 to U+10FFFF, excluding | 
 |        the surrogate area. The so-called "non-character" code points  are  not | 
 |        excluded because Unicode corrigendum #9 makes it clear that they should | 
 |        not be. | 
 |  | 
 |        Characters  in  the "Surrogate Area" of Unicode are reserved for use by | 
 |        UTF-16, where they are used in pairs to encode code points with  values | 
 |        greater  than  0xFFFF. The code points that are encoded by UTF-16 pairs | 
 |        are available independently in the  UTF-8  and  UTF-32  encodings.  (In | 
 |        other  words, the whole surrogate thing is a fudge for UTF-16 which un- | 
 |        fortunately messes up UTF-8 and UTF-32.) | 
 |  | 
 |        Setting PCRE2_NO_UTF_CHECK at compile time does not disable  the  error | 
 |        that  is  given if an escape sequence for an invalid Unicode code point | 
 |        is encountered in the pattern. If you want to  allow  escape  sequences | 
 |        such as \x{d800} (a surrogate code point) you can set the | 
 |        PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is | 
 |        possible only in UTF-8 and UTF-32 modes, because these values are not | 
 |        representable in UTF-16. | 
 |  | 
 |    Errors in UTF-8 strings | 
 |  | 
 |        The following negative error codes are given for invalid UTF-8 strings: | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR1 | 
 |          PCRE2_ERROR_UTF8_ERR2 | 
 |          PCRE2_ERROR_UTF8_ERR3 | 
 |          PCRE2_ERROR_UTF8_ERR4 | 
 |          PCRE2_ERROR_UTF8_ERR5 | 
 |  | 
 |        The string ends with a truncated UTF-8 character;  the  code  specifies | 
 |        how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8 | 
 |        characters to be no longer than 4 bytes, the  encoding  scheme  (origi- | 
 |        nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is | 
 |        checked first; hence the possibility of 4 or 5 missing bytes. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR6 | 
 |          PCRE2_ERROR_UTF8_ERR7 | 
 |          PCRE2_ERROR_UTF8_ERR8 | 
 |          PCRE2_ERROR_UTF8_ERR9 | 
 |          PCRE2_ERROR_UTF8_ERR10 | 
 |  | 
 |        The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of | 
 |        the character do not have the binary value 0b10 (that  is,  either  the | 
 |        most significant bit is 0, or the next bit is 1). | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR11 | 
 |          PCRE2_ERROR_UTF8_ERR12 | 
 |  | 
 |        A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes | 
 |        long; these code points are excluded by RFC 3629. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR13 | 
 |  | 
 |        A 4-byte character has a value greater than 0x10ffff; these code points | 
 |        are excluded by RFC 3629. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR14 | 
 |  | 
 |        A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this | 
 |        range of code points is reserved by RFC 3629 for use with  UTF-16,  and | 
 |        so are excluded from UTF-8. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR15 | 
 |          PCRE2_ERROR_UTF8_ERR16 | 
 |          PCRE2_ERROR_UTF8_ERR17 | 
 |          PCRE2_ERROR_UTF8_ERR18 | 
 |          PCRE2_ERROR_UTF8_ERR19 | 
 |  | 
 |        A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes | 
 |        for  a  value that can be represented by fewer bytes, which is invalid. | 
 |        For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor- | 
 |        rect coding uses just one byte. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR20 | 
 |  | 
 |        The two most significant bits of the first byte of a character have the | 
 |        binary  value 0b10 (that is, the most significant bit is 1 and the sec- | 
 |        ond is 0). Such a byte can only validly occur as the second  or  subse- | 
 |        quent byte of a multi-byte character. | 
 |  | 
 |          PCRE2_ERROR_UTF8_ERR21 | 
 |  | 
 |        The  first byte of a character has the value 0xfe or 0xff. These values | 
 |        can never occur in a valid UTF-8 string. | 
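 |  | 
 |        The classification above can be sketched as a standalone C function. | 
 |        This is an illustration, not the library's validator: the function | 
 |        names are invented, the order in which the conditions are tested is | 
 |        an assumption, and the return value is the number N of the matching | 
 |        PCRE2_ERROR_UTF8_ERRN description (0 for a valid string), not the | 
 |        negative value of the real constant. | 
 |  | 

```c
#include <stddef.h>
#include <stdint.h>

/* Check the character starting at p, with avail >= 1 bytes remaining.
   Returns 0 and sets *char_len if it is valid, else an error number. */
static int sketch_utf8_char_error(const uint8_t *p, size_t avail,
                                  size_t *char_len)
{
    uint8_t c = p[0], d;
    size_t extra, k;

    *char_len = 1;
    if (c < 0x80) return 0;                 /* ASCII */
    if (c < 0xc0) return 20;                /* isolated 10xxxxxx byte */
    if (c >= 0xfe) return 21;               /* 0xfe or 0xff */

    /* Number of additional bytes implied by the first byte (1 to 5). */
    extra = (c < 0xe0) ? 1 : (c < 0xf0) ? 2 : (c < 0xf8) ? 3 :
            (c < 0xfc) ? 4 : 5;
    if (avail - 1 < extra)                  /* truncated: ERR1 to ERR5 */
        return (int)(extra - (avail - 1));

    for (k = 1; k <= extra; k++)            /* 2nd..6th byte: ERR6..ERR10 */
        if ((p[k] & 0xc0) != 0x80) return 5 + (int)k;

    d = p[1];
    switch (extra) {
    case 1:
        if ((c & 0x3e) == 0) return 15;                 /* overlong */
        break;
    case 2:
        if (c == 0xe0 && (d & 0x20) == 0) return 16;    /* overlong */
        if (c == 0xed && d >= 0xa0) return 14;          /* surrogate */
        break;
    case 3:
        if (c == 0xf0 && (d & 0x30) == 0) return 17;    /* overlong */
        if (c > 0xf4 || (c == 0xf4 && d > 0x8f)) return 13;
        break;
    case 4:
        if (c == 0xf8 && (d & 0x38) == 0) return 18;    /* overlong */
        return 11;                          /* 5 bytes: not RFC 3629 */
    default:
        if (c == 0xfc && (d & 0x3c) == 0) return 19;    /* overlong */
        return 12;                          /* 6 bytes: not RFC 3629 */
    }
    *char_len = extra + 1;
    return 0;
}

/* Returns 0 if s[0..len) is valid UTF-8. Otherwise returns the error
   number and stores the offset of the offending character. */
int sketch_utf8_error(const uint8_t *s, size_t len, size_t *err_offset)
{
    size_t i = 0, n;

    while (i < len) {
        int err = sketch_utf8_char_error(s + i, len - i, &n);
        if (err != 0) { *err_offset = i; return err; }
        i += n;
    }
    return 0;
}
```

 |        For the overlong example given above, the two bytes 0xc0, 0xae, this | 
 |        sketch returns 15 with an error offset of 0. | 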
 |  | 
 |    Errors in UTF-16 strings | 
 |  | 
 |        The following  negative  error  codes  are  given  for  invalid  UTF-16 | 
 |        strings: | 
 |  | 
 |          PCRE2_ERROR_UTF16_ERR1  Missing low surrogate at end of string | 
 |          PCRE2_ERROR_UTF16_ERR2  Invalid low surrogate follows high surrogate | 
 |          PCRE2_ERROR_UTF16_ERR3  Isolated low surrogate | 
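 |  | 
 |        These three conditions can be sketched in C as shown below. The | 
 |        function name is invented, and the return value is the number N of | 
 |        the matching PCRE2_ERROR_UTF16_ERRN entry (0 for a valid string), not | 
 |        the value of the real constant. Reporting the offset of the first | 
 |        code unit of the offending character is an assumption of the sketch. | 
 |  | 

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if s[0..len) is valid UTF-16 in host byte order, otherwise
   1, 2, or 3 as listed above, with the offending offset stored. */
int sketch_utf16_error(const uint16_t *s, size_t len, size_t *err_offset)
{
    size_t i;

    for (i = 0; i < len; i++) {
        int err = 0;

        if ((s[i] & 0xf800) != 0xd800) continue;   /* not a surrogate */
        if (s[i] >= 0xdc00) err = 3;               /* isolated low */
        else if (i + 1 == len) err = 1;            /* high at end */
        else if ((s[i + 1] & 0xfc00) != 0xdc00) err = 2;
        if (err != 0) { *err_offset = i; return err; }
        i++;                                       /* skip low surrogate */
    }
    return 0;
}
```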
 |  | 
 |  | 
 |    Errors in UTF-32 strings | 
 |  | 
 |        The  following  negative  error  codes  are  given  for  invalid UTF-32 | 
 |        strings: | 
 |  | 
 |          PCRE2_ERROR_UTF32_ERR1  Surrogate character (0xd800 to 0xdfff) | 
 |          PCRE2_ERROR_UTF32_ERR2  Code point is greater than 0x10ffff | 
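 |  | 
 |        The UTF-32 check is a range test on each code unit. In the C sketch | 
 |        below the function name is invented, and the return value is the | 
 |        number N of the matching PCRE2_ERROR_UTF32_ERRN entry (0 for a valid | 
 |        string), not the value of the real constant. | 
 |  | 

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 0 if every code unit in s[0..len) is a valid code point,
   otherwise 1 or 2 as listed above, with the offending offset stored. */
int sketch_utf32_error(const uint32_t *s, size_t len, size_t *err_offset)
{
    size_t i;

    for (i = 0; i < len; i++) {
        int err = 0;

        if (s[i] >= 0xd800 && s[i] <= 0xdfff) err = 1;  /* surrogate */
        else if (s[i] > 0x10ffff) err = 2;              /* out of range */
        if (err != 0) { *err_offset = i; return err; }
    }
    return 0;
}
```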
 |  | 
 |  | 
 | MATCHING IN INVALID UTF STRINGS | 
 |  | 
 |        You can run pattern matches on subject strings that may contain invalid | 
 |        UTF sequences if you call pcre2_compile() with the | 
 |        PCRE2_MATCH_INVALID_UTF option. This is supported by pcre2_match(), | 
 |        including JIT matching, but not by pcre2_dfa_match(). When | 
 |        PCRE2_MATCH_INVALID_UTF is set, it forces PCRE2_UTF to be set as well. | 
 |        Note, however, that the pattern itself must be a valid UTF string. | 
 |  | 
 |        If you do not set PCRE2_MATCH_INVALID_UTF when calling pcre2_compile(), | 
 |        and you are not certain that your subject strings  are  valid  UTF  se- | 
 |        quences,  you  should  not  make  use  of  the JIT "fast path" function | 
 |        pcre2_jit_match() because it bypasses sanity checks, including the  one | 
 |        for  UTF validity. An invalid string may cause undefined behaviour, in- | 
 |        cluding looping, crashing, or giving the wrong answer. | 
 |  | 
 |        Setting PCRE2_MATCH_INVALID_UTF does not  affect  what  pcre2_compile() | 
 |        generates,  but  if pcre2_jit_compile() is subsequently called, it does | 
 |        generate different code. If JIT is not used, the option affects the | 
 |        behaviour of the interpretive code in pcre2_match(). When | 
 |        PCRE2_MATCH_INVALID_UTF is set at compile time, PCRE2_NO_UTF_CHECK is | 
 |        ignored at match time. | 
 |  | 
 |        In  this  mode,  an  invalid  code  unit  sequence in the subject never | 
 |        matches any pattern item. It does not match  dot,  it  does  not  match | 
 |        \p{Any},  it does not even match negative items such as [^X]. A lookbe- | 
 |        hind assertion fails if it encounters an invalid sequence while  moving | 
 |        the  current  point backwards. In other words, an invalid UTF code unit | 
 |        sequence acts as a barrier which no match can cross. | 
 |  | 
 |        You can also think of this as the subject being split up into fragments | 
 |        of valid UTF, delimited internally by invalid code unit sequences.  The | 
 |        pattern  is  matched  fragment  by fragment. The result of a successful | 
 |        match, however, is given as code unit offsets  in  the  entire  subject | 
 |        string in the usual way. There are a few points to consider: | 
 |  | 
 |        The  internal  boundaries are not interpreted as the beginnings or ends | 
 |        of lines and so do not match circumflex or  dollar  characters  in  the | 
 |        pattern. | 
 |  | 
 |        If  pcre2_match()  is  called  with an offset that points to an invalid | 
 |        UTF-sequence, that sequence is skipped, and the  match  starts  at  the | 
 |        next valid UTF character, or the end of the subject. | 
 |  | 
 |        At internal fragment boundaries, \b and \B behave in the same way as at | 
 |        the  beginning  and end of the subject. For example, a sequence such as | 
 |        \bWORD\b would match an instance of WORD that is surrounded by  invalid | 
 |        UTF code units. | 
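 |  | 
 |        The "fragments" view can be made concrete with a C sketch that splits | 
 |        a UTF-8 subject into maximal runs of valid characters. This models the | 
 |        description only and is not how pcre2_match() is implemented. The | 
 |        type and function names are invented, and treating each invalid byte | 
 |        as its own one-byte barrier is an assumption of the sketch. | 
 |  | 

```c
#include <stddef.h>
#include <stdint.h>

/* Length (1 to 4) of the valid RFC 3629 character at p, or 0 if the
   bytes at p do not start a valid character. avail must be >= 1. */
static size_t sketch_valid_char_len(const uint8_t *p, size_t avail)
{
    uint8_t c = p[0];
    size_t n, k;

    if (c < 0x80) return 1;
    if (c >= 0xc2 && c <= 0xdf) n = 2;
    else if (c >= 0xe0 && c <= 0xef) n = 3;
    else if (c >= 0xf0 && c <= 0xf4) n = 4;
    else return 0;                             /* never a valid first byte */
    if (avail < n) return 0;                   /* truncated */
    for (k = 1; k < n; k++)
        if ((p[k] & 0xc0) != 0x80) return 0;   /* bad continuation byte */
    if (c == 0xe0 && p[1] < 0xa0) return 0;    /* overlong 3-byte form */
    if (c == 0xed && p[1] >= 0xa0) return 0;   /* surrogate */
    if (c == 0xf0 && p[1] < 0x90) return 0;    /* overlong 4-byte form */
    if (c == 0xf4 && p[1] > 0x8f) return 0;    /* above U+10FFFF */
    return n;
}

/* A fragment is the half-open code unit range [start, end). */
typedef struct { size_t start, end; } sketch_fragment;

/* Stores up to max_out fragments of valid UTF-8 found in s[0..len) and
   returns the number stored. Offsets are relative to the whole subject. */
size_t sketch_valid_fragments(const uint8_t *s, size_t len,
                              sketch_fragment *out, size_t max_out)
{
    size_t i = 0, count = 0;

    while (i < len) {
        size_t n = sketch_valid_char_len(s + i, len - i);
        size_t start = i;

        if (n == 0) { i++; continue; }         /* barrier: skip one byte */
        if (count == max_out) break;
        do {
            i += n;
        } while (i < len && (n = sketch_valid_char_len(s + i, len - i)) != 0);
        out[count].start = start;
        out[count].end = i;
        count++;
    }
    return count;
}
```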
 |  | 
 |        Using  PCRE2_MATCH_INVALID_UTF, an application can run matches on arbi- | 
 |        trary data, knowing that any matched  strings  that  are  returned  are | 
 |        valid UTF. This can be useful when searching for UTF text in executable | 
 |        or other binary files. | 
 |  | 
 |        Note,  however,  that  the  16-bit  and  32-bit PCRE2 libraries process | 
 |        strings as sequences of uint16_t or uint32_t code units.  They  cannot | 
 |        find  valid  UTF  sequences  within an arbitrary string of bytes unless | 
 |        such sequences are suitably aligned. | 
 |  | 
 |  | 
 | AUTHOR | 
 |  | 
 |        Philip Hazel | 
 |        Retired from University Computing Service | 
 |        Cambridge, England. | 
 |  | 
 |  | 
 | REVISION | 
 |  | 
 |        Last updated: 27 November 2024 | 
 |        Copyright (c) 1997-2024 University of Cambridge. | 
 |  | 
 |  | 
 | PCRE2 10.48-DEV                27 November 2024                PCRE2UNICODE(3) | 
 | ------------------------------------------------------------------------------ | 
 |  | 
 |  |