Received: by mail.netbsd.org (Postfix, from userid 605) id 52A7C855B0; Wed, 16 Aug 2017 23:38:37 +0000 (UTC)
Received: from localhost (localhost [127.0.0.1]) by mail.netbsd.org (Postfix) with ESMTP id C77D4855AF for ; Wed, 16 Aug 2017 23:38:36 +0000 (UTC)
X-Virus-Scanned: amavisd-new at netbsd.org
Received: from mail.netbsd.org ([127.0.0.1]) by localhost (mail.netbsd.org [127.0.0.1]) (amavisd-new, port 10025) with ESMTP id bsiga1QOrLtf for ; Wed, 16 Aug 2017 23:38:36 +0000 (UTC)
Received: from cvs.NetBSD.org (ivanova.netbsd.org [199.233.217.197]) by mail.netbsd.org (Postfix) with ESMTP id 4898F84D8D for ; Wed, 16 Aug 2017 23:38:36 +0000 (UTC)
Received: by cvs.NetBSD.org (Postfix, from userid 500) id 585B3FAD0; Wed, 16 Aug 2017 23:38:35 +0000 (UTC)
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset="US-ASCII"
MIME-Version: 1.0
Date: Wed, 16 Aug 2017 23:38:35 +0000
From: "Alistair G. Crooks"
Subject: CVS commit: othersrc/external/bsd/agcre/dist
To: source-changes@NetBSD.org
X-Mailer: log_accum
Message-Id: <20170816233835.585B3FAD0@cvs.NetBSD.org>
Sender: source-changes-owner@NetBSD.org
List-Id: source-changes.NetBSD.org
Precedence: bulk
Reply-To: source-changes-d@NetBSD.org
Mail-Reply-To: "Alistair G. Crooks"
Mail-Followup-To: source-changes-d@NetBSD.org
List-Unsubscribe:

Module Name:	othersrc
Committed By:	agc
Date:		Wed Aug 16 23:38:35 UTC 2017

Added Files:
	othersrc/external/bsd/agcre/dist: internal.h

Log Message:
Just what this world needs - another regexp library.

However, for something I was doing, I needed a regexp library in C,
BSD-licensed, and able to be exposed to a wide range of expressions,
some better controlled than others. The resulting library is libagcre,
which implements regular expression compilation and execution.

It uses the Pike Virtual Machine approach, and features:

+ standard POSIX features where sane
+ some/most Perl escapes
+ lazy matching via '?'
+ non-capture parentheses (?:...)
+ in-expression case-insensitive directives are supported (?i)...(?-i)
+ all case-insensitivity is actioned at expression exec time.
  Case-insensitivity can be specified at expression compile-time, and,
  if so, it will be remembered. But the expression itself, once
  compiled, can be used to match in both a case-sensitive and
  insensitive manner
+ utf8 is supported both for expressions and for input text when
  matching
+ unicode escapes (in the Java format of \uABCD) are supported
+ exact multiple repetition specifiers {N}, and {N,M} are supported
+ backreferences are supported
+ utf16 (LE and BE) and utf32 (LE and BE) are supported, both for the
  expression and for the input being searched
+ at the most basic level, individual 32bit unicode characters are
  matched
+ an egrep/grep implementation for matching unicode regexps is included

A simple implementation of sets is used to provide inclusion and
exclusion information for unicode characters, which is taken directly
from unicode.org. No bitmasks are used - ranges are specified by using
an upper and a lower bound for the codepoints. Callbacks can also be
added to these sets, to provide functionality similar to the ctype
macros across the whole unicode character set.

The standard regular expression basic3 torture test is passed with 4
known (and, I'd argue, incorrect) results flagged.

As expected, the expression '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa'
matches in linear time, as does the expression
'((((((((((((((((((((((((((((((x))))))))))))))))))))))))))))))'

	% time agcre '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa' dist/tests/2.in
	aaaaaaaaaaaaaaaaaaaaaaaaaaaaa
	0.063u 0.000s 0:00.06 100.0% 0+0k 0+0io 0pf+0w
	% time egrep '(a?){9999}aaaaaaaaaaaaaaaaaaaaaaaaaaaaa' dist/tests/2.in
	^C88.462u 0.730s 1:29.21 99.9% 0+0k 0+0io 0pf+0w
	%

The library and agcre utility have been run through valgrind to confirm
no memory leaks.
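The range-based sets described in the log message (an upper and a lower bound per range, no bitmasks, plus optional callbacks) might look roughly like the sketch below. None of the names (cp_range_t, cp_set_t, cp_set_match, word_set, nonword_set) come from libagcre; they are assumptions made for illustration, and the tiny ASCII table stands in for data that the library takes from unicode.org.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is not libagcre's actual layout. */
typedef struct cp_range {
	uint32_t lo;	/* lowest codepoint in the range, inclusive */
	uint32_t hi;	/* highest codepoint in the range, inclusive */
} cp_range_t;

/* ctype-like predicate that can extend a set beyond its ranges */
typedef int (*cp_callback_t)(uint32_t cp);

typedef struct cp_set {
	const cp_range_t *ranges;	/* sorted by lo, non-overlapping */
	size_t            nranges;
	cp_callback_t     callback;	/* optional, may be NULL */
	int               negated;	/* non-zero for an exclusion set */
} cp_set_t;

/* binary search over the sorted ranges */
static int
in_ranges(const cp_set_t *set, uint32_t cp)
{
	size_t lo = 0, hi = set->nranges;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (cp < set->ranges[mid].lo)
			hi = mid;
		else if (cp > set->ranges[mid].hi)
			lo = mid + 1;
		else
			return 1;
	}
	return 0;
}

/* return 1 if cp belongs to the set, taking callback and negation into account */
int
cp_set_match(const cp_set_t *set, uint32_t cp)
{
	int found;

	assert(set != NULL);
	found = in_ranges(set, cp);
	if (!found && set->callback != NULL)
		found = (set->callback(cp) != 0);
	return set->negated ? !found : found;
}

/* example callback: accept U+005F LOW LINE */
static int
is_underscore(uint32_t cp)
{
	return cp == 0x5f;
}

static const cp_range_t word_ranges[] = {
	{ 0x30, 0x39 },	/* 0-9 */
	{ 0x41, 0x5a },	/* A-Z */
	{ 0x61, 0x7a },	/* a-z */
};

/* ASCII-only approximation of \w, and its complement */
static const cp_set_t word_set = { word_ranges, 3, is_underscore, 0 };
static const cp_set_t nonword_set = { word_ranges, 3, is_underscore, 1 };
```

With these definitions, cp_set_match(&word_set, 'a') reports membership and cp_set_match(&nonword_set, 'a') reports the opposite. Storing bounds rather than bitmasks keeps the table size proportional to the number of ranges, not to the size of the codepoint space.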
In general, the emphasis is on a modern, predictable, VM-style,
well-featured regexp library, in C, with a BSD license. In particular,
sljit has not been used to speed up on certain platforms, most Perl
regexp features are supported, as are back references, and UTF-8,
UTF-16 and UTF-32.

Once again, I wouldn't expect anyone to use this as the main engine in
egrep. But I am always amazed at the uses for some of the things that
I write.

For more information about the Pike VM, and comparison to other regexp
implementations, please see:

	https://swtch.com/~rsc/regexp/regexp2.html

Alistair Crooks
Tue Aug 15 07:43:34 PDT 2017

To generate a diff of this commit:
cvs rdiff -u -r0 -r1.1 othersrc/external/bsd/agcre/dist/internal.h

Please note that diffs are not public domain; they are subject to the
copyright notices on the relevant files.
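As a rough illustration of why a VM-style matcher stays linear on expressions like '(a?){9999}aaa...', here is a minimal lockstep thread-list simulation in the style of the Pike VM. It is a simplified sketch, not libagcre code: capture tracking (the part that distinguishes a full Pike VM) is omitted, matching is byte-wise and anchored at both ends, and every name (inst_t, vm_fullmatch, prog_opt3, MAXPROG) is hypothetical. The program prog_opt3 is a hand-assembled form of 'a?a?a?aaa'.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { OP_CHAR, OP_SPLIT, OP_JMP, OP_MATCH };

/* hypothetical instruction layout for this sketch only */
typedef struct inst {
	int op;
	int c;		/* OP_CHAR: byte to match */
	int x, y;	/* OP_JMP: target x; OP_SPLIT: targets x and y */
} inst_t;

#define MAXPROG 64

typedef struct threadlist {
	int pc[MAXPROG];
	int n;
} threadlist_t;

/*
 * Add a thread at pc, following JMP and SPLIT immediately.  mark[]
 * ensures each instruction is added at most once per input position,
 * which bounds the work per input byte by the program size.
 */
static void
addthread(const inst_t *prog, threadlist_t *l, unsigned char *mark, int pc)
{
	if (mark[pc])
		return;
	mark[pc] = 1;
	switch (prog[pc].op) {
	case OP_JMP:
		addthread(prog, l, mark, prog[pc].x);
		break;
	case OP_SPLIT:
		addthread(prog, l, mark, prog[pc].x);
		addthread(prog, l, mark, prog[pc].y);
		break;
	default:
		l->pc[l->n++] = pc;
		break;
	}
}

/* return 1 if prog matches the whole of s, 0 otherwise */
int
vm_fullmatch(const inst_t *prog, int nprog, const char *s)
{
	threadlist_t a, b, *clist = &a, *nlist = &b, *tmp;
	unsigned char mark[MAXPROG];
	size_t i, len = strlen(s);
	int t;

	assert(nprog > 0 && nprog <= MAXPROG);
	memset(mark, 0, sizeof(mark));
	clist->n = 0;
	addthread(prog, clist, mark, 0);
	for (i = 0; i < len; i++) {
		/* advance every live thread over one input byte, in lockstep */
		memset(mark, 0, sizeof(mark));
		nlist->n = 0;
		for (t = 0; t < clist->n; t++) {
			const inst_t *in = &prog[clist->pc[t]];

			if (in->op == OP_CHAR && in->c == (unsigned char)s[i])
				addthread(prog, nlist, mark, clist->pc[t] + 1);
		}
		tmp = clist;
		clist = nlist;
		nlist = tmp;
	}
	/* accept only if some thread has reached MATCH at end of input */
	for (t = 0; t < clist->n; t++)
		if (prog[clist->pc[t]].op == OP_MATCH)
			return 1;
	return 0;
}

/* hand-assembled program for a?a?a?aaa */
static const inst_t prog_opt3[] = {
	{ OP_SPLIT, 0, 1, 2 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_SPLIT, 0, 3, 4 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_SPLIT, 0, 5, 6 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_CHAR, 'a', 0, 0 },
	{ OP_MATCH, 0, 0, 0 },
};
```

Because no instruction can appear twice in a thread list, each input byte costs at most one pass over the program, so the total work is bounded by (program size) x (input length). A backtracking matcher has no such bound on this kind of expression.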