From 582b500cd996c96054615870fd13d6ab0ea77428 Mon Sep 17 00:00:00 2001
From: Jay Berkenbilt
+The PCRE library is a set of functions that implement regular expression
+pattern matching using the same syntax and semantics as Perl, with just a few
+differences. The current implementation of PCRE (release 4.x) corresponds
+approximately with Perl 5.8, including support for UTF-8 encoded strings.
+However, this support has to be explicitly enabled; it is not the default.
+
+PCRE is written in C and released as a C library. However, a number of people
+have written wrappers and interfaces of various kinds. A C++ class is included
+in these contributions, which can be found in the Contrib directory at
+the primary FTP site, which is:
+
+Details of exactly which Perl regular expression features are and are not
+supported by PCRE are given in separate documents. See the
+pcrepattern
+and
+pcrecompat
+pages.
+
+Some features of PCRE can be included, excluded, or changed when the library is
+built. The
+pcre_config()
+function makes it possible for a client to discover which features are
+available. Documentation about building PCRE for various operating systems can
+be found in the README file in the source distribution.
+
+The user documentation for PCRE has been split up into a number of different
+sections. In the "man" format, each of these is a separate "man page". In the
+HTML format, each is a separate page, linked from the index page. In the plain
+text format, all the sections are concatenated, for ease of searching. The
+sections are as follows:
+
+
+
+
DESCRIPTION
+
USER DOCUMENTATION
+
+ pcre this document
+ pcreapi details of PCRE's native API
+ pcrebuild options for building PCRE
+ pcrecallout details of the callout feature
+ pcrecompat discussion of Perl compatibility
+ pcregrep description of the pcregrep command
+ pcrepattern syntax and semantics of supported
+ regular expressions
+ pcreperform discussion of performance issues
+ pcreposix the POSIX-compatible API
+ pcresample discussion of the sample program
+ pcretest the pcretest testing command
+
+
+In addition, in the "man" and HTML formats, there is a short page for each +library function, listing its arguments and results. +
++There are some size limitations in PCRE but it is hoped that they will never in +practice be relevant. +
++The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE is +compiled with the default internal linkage size of 2. If you want to process +regular expressions that are truly enormous, you can compile PCRE with an +internal linkage size of 3 or 4 (see the README file in the source +distribution and the +pcrebuild +documentation for details). If these cases the limit is substantially larger. +However, the speed of execution will be slower. +
++All values in repeating quantifiers must be less than 65536. +The maximum number of capturing subpatterns is 65535. +
++There is no limit to the number of non-capturing subpatterns, but the maximum +depth of nesting of all kinds of parenthesized subpattern, including capturing +subpatterns, assertions, and other types of subpattern, is 200. +
++The maximum length of a subject string is the largest positive number that an +integer variable can hold. However, PCRE uses recursion to handle subpatterns +and indefinite repetition. This means that the available stack space may limit +the size of a subject string that can be processed by certain patterns. +
++Starting at release 3.3, PCRE has had some support for character strings +encoded in the UTF-8 format. For release 4.0 this has been greatly extended to +cover most common requirements. +
++In order process UTF-8 strings, you must build PCRE to include UTF-8 support in +the code, and, in addition, you must call +pcre_compile() +with the PCRE_UTF8 option flag. When you do this, both the pattern and any +subject strings that are matched against it are treated as UTF-8 strings +instead of just strings of bytes. +
++If you compile PCRE with UTF-8 support, but do not use it at run time, the +library will be a bit bigger, but the additional run time overhead is limited +to testing the PCRE_UTF8 flag in several places, so should not be very large. +
++The following comments apply when PCRE is running in UTF-8 mode: +
++1. When you set the PCRE_UTF8 flag, the strings passed as patterns and subjects +are checked for validity on entry to the relevant functions. If an invalid +UTF-8 string is passed, an error return is given. In some situations, you may +already know that your strings are valid, and therefore want to skip these +checks in order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag +at compile time or at run time, PCRE assumes that the pattern or subject it +is given (respectively) contains only valid UTF-8 codes. In this case, it does +not diagnose an invalid UTF-8 string. If you pass an invalid UTF-8 string to +PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program +may crash. +
++2. In a pattern, the escape sequence \x{...}, where the contents of the braces +is a string of hexadecimal digits, is interpreted as a UTF-8 character whose +code number is the given hexadecimal number, for example: \x{1234}. If a +non-hexadecimal digit appears between the braces, the item is not recognized. +This escape sequence can be used either as a literal, or within a character +class. +
++3. The original hexadecimal escape sequence, \xhh, matches a two-byte UTF-8 +character if the value is greater than 127. +
++4. Repeat quantifiers apply to complete UTF-8 characters, not to individual +bytes, for example: \x{100}{3}. +
++5. The dot metacharacter matches one UTF-8 character instead of a single byte. +
++6. The escape sequence \C can be used to match a single byte in UTF-8 mode, +but its use can lead to some strange effects. +
++7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly +test characters of any code value, but the characters that PCRE recognizes as +digits, spaces, or word characters remain the same set as before, all with +values less than 256. +
++8. Case-insensitive matching applies only to characters whose values are less +than 256. PCRE does not support the notion of "case" for higher-valued +characters. +
++9. PCRE does not support the use of Unicode tables and properties or the Perl +escapes \p, \P, and \X. +
+
+Philip Hazel <ph10@cam.ac.uk>
+
+University Computing Service,
+
+Cambridge CB2 3QG, England.
+
+Phone: +44 1223 334714
+
+Last updated: 20 August 2003
+
+Copyright © 1997-2003 University of Cambridge.
--
cgit v1.2.3-70-g09d2