1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2017, VU University Amsterdam 7 All rights reserved. 8 9 Redistribution and use in source and binary forms, with or without 10 modification, are permitted provided that the following conditions 11 are met: 12 13 1. Redistributions of source code must retain the above copyright 14 notice, this list of conditions and the following disclaimer. 15 16 2. Redistributions in binary form must reproduce the above copyright 17 notice, this list of conditions and the following disclaimer in 18 the documentation and/or other materials provided with the 19 distribution. 20 21 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 22 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 23 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 24 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 25 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 26 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 27 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 28 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 29 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 30 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 31 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 32 POSSIBILITY OF SUCH DAMAGE. 33*/ 34 35:- module(pcre, 36 [ re_match/2, % +Regex, +String 37 re_match/3, % +Regex, +String, +Options 38 re_matchsub/4, % +Regex, +String, -Subs, +Options 39 re_foldl/6, % :Goal, +Regex, +String, ?V0, ?V, +Options 40 re_split/3, % +Pattern, +String, -Split:list 41 re_split/4, % +Pattern, +String, -Split:list, +Options 42 re_replace/4, % +Pattern, +With, +String, -NewString 43 44 re_compile/3, % +Pattern, -Regex, +Options 45 re_flush/0, 46 re_config/1 % ?Config 47 ]). 48:- autoload(library(apply),[maplist/3]). 49:- autoload(library(error),[must_be/2,existence_error/2]). 50:- autoload(library(dcg/basics),[string/3,eos/2,digit/3,digits/3]). 51 52:- use_foreign_library(foreign(pcre4pl)). 53 54:- meta_predicate 55 re_foldl( , , , , , ). 56 57/** <module> Perl compatible regular expression matching for SWI-Prolog 58 59This module provides an interface to the [PCRE](http://www.pcre.org/) 60(Perl Compatible Regular Expression) library. This Prolog interface 61provides an almost comprehensive wrapper around PCRE. 62 63Regular expressions are created from a pattern and options and 64represented as a SWI-Prolog _blob_. This implies they are subject to 65(atom) garbage collection. Compiled regular expressions can safely be 66used in multiple threads. Most predicates accept both an explicitly 67compiled regular expression, a pattern or a term Pattern/Flags. In the 68latter two cases a regular expression _blob_ is created and stored in a 69cache. The cache can be cleared using re_flush/0. 70 71@see `man pcre` for details. 72*/ 73 74:- predicate_options(re_match/3, 3, 75 [ anchored(boolean), 76 bol(boolean), 77 bsr(oneof([anycrlf,unicode])), 78 empty(boolean), 79 empty_atstart(boolean), 80 eol(boolean), 81 newline(oneof([any,anycrlf,cr,lf,crlf])), 82 start(integer) 83 ]). 84:- predicate_options(re_compile/3, 3, 85 [ anchored(boolean), 86 bsr(oneof([anycrlf,unicode])), 87 caseless(boolean), 88 dollar_endonly(boolean), 89 dotall(boolean), 90 dupnames(boolean), 91 extended(boolean), 92 extra(boolean), 93 firstline(boolean), 94 compat(oneof([javascript])), 95 multiline(boolean), 96 newline(oneof([any,anycrlf,cr,lf,crlf])), 97 ucp(boolean), 98 ungreedy(boolean) 99 ]). 100 101 102%! re_match(+Regex, +String) is semidet. 103%! re_match(+Regex, +String, +Options) is semidet. 104% 105% Succeeds if String matches Regex. For example: 106% 107% ``` 108% ?- re_match("^needle"/i, "Needle in a haystack"). 109% true. 110% ``` 111% 112% Options: 113% 114% * anchored(Bool) 115% If =true=, match only at the first position 116% * bol(Bool) 117% Subject string is the beginning of a line (default =false=) 118% * bsr(Mode) 119% If =anycrlf=, \R only matches CR, LF or CRLF. If =unicode=, 120% \R matches all Unicode line endings. 121% Subject string is the end of a line (default =false=) 122% * empty(Bool) 123% An empty string is a valid match (default =true=) 124% * empty_atstart(Bool) 125% An empty string at the start of the subject is a valid match 126% (default =true=) 127% * eol(Bool) 128% Subject string is the end of a line (default =false=) 129% * newline(Mode) 130% If =any=, recognize any Unicode newline sequence, 131% if =anycrlf=, recognize CR, LF, and CRLF as newline 132% sequences, if =cr=, recognize CR, if =lf=, recognize 133% LF and finally if =crlf= recognize CRLF as newline. 134% * start(+From) 135% Start at the given character index 136% 137% @arg Regex is the output of re_compile/3, a pattern or a term 138% Pattern/Flags, where Pattern is an atom or string. The defined flags 139% and there related option for re_compile/3 are below. 140% 141% - *x*: extended(true) 142% - *i*: caseless(true) 143% - *m*: multiline(true) 144% - *s*: dotall(true) 145% - *a*: capture_type(atom) 146% - *r*: capture_type(range) 147% - *t*: capture_type(term) 148 149re_match(Regex, String) :- 150 re_match(Regex, String, []). 151re_match(Regex, String, Options) :- 152 re_compiled(Regex, Compiled), 153 re_match_(Compiled, String, Options). 154 155%! re_matchsub(+Regex, +String, -Sub:dict, +Options) is semidet. 156% 157% Match String against Regex. On success, Sub is a dict containing 158% integer keys for the numbered capture group and atom keys for the 159% named capture groups. The associated value is determined by the 160% capture_type(Type) option passed to re_compile/3, may be specified 161% using flags if Regex is of the form Pattern/Flags and may be 162% specified at the level of individual captures using a naming 163% convention for the caption name. See re_compile/3 for details. 164% 165% The example below exploits the typed groups to parse a date 166% specification: 167% 168% ``` 169% ?- re_matchsub("(?<date> (?<year_I>(?:\\d\\d)?\\d\\d) - 170% (?<month_I>\\d\\d) - (?<day_I>\\d\\d) )"/e, 171% "2017-04-20", Sub, []). 172% Sub = re_match{0:"2017-04-20", date:"2017-04-20", 173% day:20, month:4, year:2017}. 174% 175% ``` 176% 177% @arg Options Only _execution_ options are processed. See re_match/3 178% for the set of options. _Compilation_ options must be passed as 179% `/flags` to Regex. 180% @arg Regex See re_match/2 for a description of this argument. 181 182re_matchsub(Regex, String, Subs, Options) :- 183 re_compiled(Regex, Compiled), 184 re_matchsub_(Compiled, String, Pairs, Options), 185 dict_pairs(Subs, re_match, Pairs). 186 187%! re_foldl(:Goal, +Regex, +String, ?V0, ?V, +Options) is semidet. 188% 189% _Fold_ all matches of Regex on String. Each match is represented by 190% a dict as specified for re_matchsub/4. V0 and V are related using a 191% sequence of invocations of Goal as illustrated below. 192% 193% ``` 194% call(Goal, Dict1, V0, V1), 195% call(Goal, Dict2, V1, V2), 196% ... 197% call(Goal, Dictn, Vn, V). 198% ``` 199% 200% This predicate is used to implement re_split/4 and re_replace/4. For 201% example, we can count all matches of a Regex on String using this 202% code: 203% 204% ``` 205% re_match_count(Regex, String, Count) :- 206% re_foldl(increment, Regex, String, 0, Count, []). 207% 208% increment(_Match, V0, V1) :- 209% V1 is V0+1. 210% ``` 211% 212% After which we can query 213% 214% ``` 215% ?- re_match_count("a", "aap", X). 216% X = 2. 217% ``` 218 219re_foldl(Goal, Regex, String, V0, V, Options) :- 220 re_compiled(Regex, Compiled), 221 re_foldl_(Compiled, String, Goal, V0, V, Options). 222 223:- public re_call_folder/4. 224 225re_call_folder(Goal, Pairs, V0, V1) :- 226 dict_pairs(Dict, re_match, Pairs), 227 call(Goal, Dict, V0, V1). 228 229 230%! re_split(+Pattern, +String, -Split:list) is det. 231%! re_split(+Pattern, +String, -Split:list, +Options) is det. 232% 233% Split String using the regular expression Pattern. Split is a list 234% of strings holding alternating matches of Pattern and skipped parts 235% of the String, starting with a skipped part. The Split lists ends 236% with a string of the content of String after the last match. If 237% Pattern does not appear in String, Split is a list holding a copy of 238% String. This implies the number of elements in Split is _always_ 239% odd. For example: 240% 241% ``` 242% ?- re_split("a+", "abaac", Split, []). 243% Split = ["","a","b","aa","c"]. 244% ?- re_split(":\\s*"/n, "Age: 33", Split, []). 245% Split = ['Age', ': ', 33]. 246% ``` 247% 248% @arg Pattern is the pattern text, optionally follows by /Flags. 249% Similar to re_matchsub/4, the final output type can be controlled by 250% a flag =a= (atom), =s= (string, default) or =n= (number if possible, 251% atom otherwise). 252 253re_split(Pattern, String, Split) :- 254 re_split(Pattern, String, Split, []). 255re_split(Pattern, String, Split, Options) :- 256 range_regex(Pattern, Compiled, Type), 257 State = state(String, 0, Type), 258 re_foldl(split(State), Compiled, String, Split, [Last], Options), 259 arg(2, State, LastSkipStart), 260 typed_sub(Type, String, LastSkipStart, _, 0, Last). 261 262range_regex(Pattern/Flags, Compiled, Type) :- !, 263 atom_chars(Flags, Chars), 264 replace_flags(Chars, Chars1, Type), 265 atom_chars(RFlags, [r|Chars1]), 266 re_compiled(Pattern/RFlags, Compiled). 267range_regex(Pattern, Compiled, string) :- 268 re_compiled(Pattern/r, Compiled). 269 270replace_flags([], [], Type) :- 271 default(Type, string). 272replace_flags([H|T0], T, Type) :- 273 split_type(H, Type), 274 !, 275 replace_flags(T0, T, Type). 276replace_flags([H|T0], [H|T], Type) :- 277 replace_flags(T0, T, Type). 278 279split_type(a, atom). 280split_type(s, string). 281split_type(n, name). 282 283split(State, Dict, [Skipped,Sep|T], T) :- 284 matched(State, Dict.0, Sep), 285 skipped(State, Dict.0, Skipped). 286 287matched(state(String, _, Type), Start-Len, Matched) :- 288 typed_sub(Type, String, Start, Len, _, Matched). 289 290skipped(State, Start-Len, Skipped) :- 291 State = state(String, Here, Type), 292 SkipLen is Start-Here, 293 typed_sub(Type, String, Here, SkipLen, _, Skipped), 294 NextSkipStart is Start+Len, 295 nb_setarg(2, State, NextSkipStart). 296 297typed_sub(string, Haystack, B, L, A, String) :- 298 sub_string(Haystack, B, L, A, String). 299typed_sub(atom, Haystack, B, L, A, String) :- 300 sub_atom(Haystack, B, L, A, String). 301typed_sub(name, Haystack, B, L, A, Value) :- 302 sub_string(Haystack, B, L, A, String), 303 ( number_string(Number, String) 304 -> Value = Number 305 ; atom_string(Value, String) 306 ). 307 308%! re_replace(+Pattern, +With, +String, -NewString) 309% 310% Replace matches of the regular expression Pattern in String with 311% With. With may reference captured substrings using \N or $Name. Both 312% N and Name may be written as {N} and {Name} to avoid ambiguities. 313% 314% @arg Pattern is the pattern text, optionally follows by /Flags. 315% Flags may include `g`, replacing all occurences of Pattern. In 316% addition, similar to re_matchsub/4, the final output type can be 317% controlled by a flag =a= (atom) or =s= (string, default). 318 319re_replace(Pattern, With, String, NewString) :- 320 range_regex(Pattern, Compiled, All, Type), 321 compile_replacement(With, RCompiled), 322 State = state(String, 0, Type), 323 ( All == all 324 -> re_foldl(replace(State, RCompiled), Compiled, String, Parts, [Last], []) 325 ; ( re_matchsub(Compiled, String, Match, []) 326 -> replace(State, RCompiled, Match, Parts, [Last]) 327 ; Repl = false 328 ) 329 ), 330 ( Repl == false 331 -> parts_to_output(Type, [String], NewString) 332 ; arg(2, State, LastSkipStart), 333 sub_string(String, LastSkipStart, _, 0, Last), 334 parts_to_output(Type, Parts, NewString) 335 ). 336 337range_regex(Pattern/Flags, Compiled, All, Type) :- !, 338 atom_chars(Flags, Chars), 339 replace_flags(Chars, Chars1, All, Type), 340 atom_chars(RFlags, [r|Chars1]), 341 re_compiled(Pattern/RFlags, Compiled). 342range_regex(Pattern, Compiled, first, string) :- 343 re_compiled(Pattern/r, Compiled). 344 345replace_flags([], [], All, Type) :- 346 default(All, first), 347 default(Type, string). 348replace_flags([H|T0], T, All, Type) :- 349 ( all(H, All) 350 -> true 351 ; type(H, Type) 352 ), 353 !, 354 replace_flags(T0, T, All, Type). 355replace_flags([H|T0], [H|T], All, Type) :- 356 replace_flags(T0, T, All, Type). 357 358all(g, all). 359type(a, atom). 360type(s, string). 361 362default(Val, Val) :- !. 363default(_, _). 364 365replace(State, With, Dict, [Skipped|Parts], T) :- 366 State = state(String, _, _Type), 367 copy_term(With, r(PartsR, Skel)), 368 Skel :< Dict, 369 range_strings(PartsR, String, Parts, T), 370 skipped(State, Dict.0, Skipped). 371 372range_strings([], _, T, T). 373range_strings([Start-Len|T0], String, [S|T1], T) :- 374 !, 375 sub_string(String, Start, Len, _, S), 376 range_strings(T0, String, T1, T). 377range_strings([S|T0], String, [S|T1], T) :- 378 range_strings(T0, String, T1, T). 379 380parts_to_output(string, Parts, String) :- 381 atomics_to_string(Parts, String). 382parts_to_output(atom, Parts, String) :- 383 atomic_list_concat(Parts, String). 384 385%! compile_replacement(+With, -Compiled) 386% 387% Compile the replacement specification into a specification that can 388% be processed quickly. The compiled expressions are cached and may be 389% reclaimed using re_flush/0. 390 391:- dynamic replacement_cache/2. 392:- volatile replacement_cache/2. 393 394compile_replacement(With, Compiled) :- 395 replacement_cache(With, Compiled), 396 !. 397compile_replacement(With, Compiled) :- 398 compile_replacement_nocache(With, Compiled), 399 assertz(replacement_cache(With, Compiled)). 400 401compile_replacement_nocache(With, r(Parts, Extract)) :- 402 string_codes(With, Codes), 403 phrase(replacement_parts(Parts, Pairs), Codes), 404 dict_pairs(Extract, _, Pairs). 405 406replacement_parts(Parts, Extract) --> 407 string(HCodes), 408 ( ("\\" ; "$"), 409 capture_name(Name) 410 -> !, 411 { add_part(HCodes, Parts, T0), 412 T0 = [Repl|T1], 413 Extract = [Name-Repl|Extract1] 414 }, 415 replacement_parts(T1, Extract1) 416 ; eos 417 -> !, 418 { add_part(HCodes, Parts, []), 419 Extract = [] 420 } 421 ). 422 423add_part([], Parts, Parts) :- 424 !. 425add_part(Codes, [H|T], T) :- 426 string_codes(H, Codes). 427 428capture_name(Name) --> 429 "{", 430 ( digit(D0) 431 -> digits(DL), 432 "}", 433 { number_codes(Name, [D0|DL]) } 434 ; letter(A0), 435 alnums(AL), 436 "}", 437 { atom_codes(Name, [A0|AL]) } 438 ). 439capture_name(Name) --> 440 digit(D0), 441 !, 442 digits(DL), 443 { number_codes(Name, [D0|DL]) }. 444capture_name(Name) --> 445 letter(A0), 446 !, 447 alnums(AL), 448 { atom_codes(Name, [A0|AL]) }. 449 450letter(L) --> 451 [L], 452 { between(0'a,0'z,L) 453 ; between(0'A,0'Z,L) 454 ; L == 0'_ 455 }, !. 456 457alnums([H|T]) --> 458 alnum(H), 459 !, 460 alnums(T). 461alnums([]) --> 462 "". 463 464alnum(L) --> 465 [L], 466 { between(0'a,0'z,L) 467 ; between(0'A,0'Z,L) 468 ; between(0'0,0'9,L) 469 ; L == 0'_ 470 }, !. 471 472%! re_compile(+Pattern, -Regex, +Options) is det. 473% 474% Compiles Pattern to a Regex _blob_ of type =regex= (see blob/2). 475% Defined Options are defined below. Please consult the PCRE 476% documentation for details. 477% 478% * anchored(Bool) 479% Force pattern anchoring 480% * bsr(Mode) 481% If =anycrlf=, \R only matches CR, LF or CRLF. If =unicode=, 482% \R matches all Unicode line endings. 483% * caseless(Bool) 484% If =true=, do caseless matching. 485% * dollar_endonly(Bool) 486% If =true=, $ not to match newline at end 487% * dotall(Bool) 488% If =true=, . matches anything including NL 489% * dupnames(Bool) 490% If =true=, allow duplicate names for subpatterns 491% * extended(Bool) 492% If =true=, ignore white space and # comments 493% * extra(Bool) 494% If =true=, PCRE extra features (not much use currently) 495% * firstline(Bool) 496% If =true=, force matching to be before newline 497% * compat(With) 498% If =javascript=, JavaScript compatibility 499% * multiline(Bool) 500% If =true=, ^ and $ match newlines within data 501% * newline(Mode) 502% If =any=, recognize any Unicode newline sequence, 503% if =anycrlf= (default), recognize CR, LF, and CRLF as newline 504% sequences, if =cr=, recognize CR, if =lf=, recognize 505% LF and finally if =crlf= recognize CRLF as newline. 506% * ucp(Bool) 507% If =true=, use Unicode properties for \d, \w, etc. 508% * ungreedy(Bool) 509% If =true=, invert greediness of quantifiers 510% 511% In addition to the options above that directly map to pcre flags the 512% following options are processed: 513% 514% * optimize(Bool) 515% If `true`, _study_ the regular expression. 516% * capture_type(+Type) 517% How to return the matched part of the input and possibly captured 518% groups in there. Possible values are: 519% - string 520% Return the captured string as a string (default). 521% - atom 522% Return the captured string as an atom. 523% - range 524% Return the captured string as a pair `Start-Length`. Note 525% the we use ``Start-Length` rather than the more conventional 526% `Start-End` to allow for immediate use with sub_atom/5 and 527% sub_string/5. 528% - term 529% Parse the captured string as a Prolog term. This is notably 530% practical if you capture a number. 531% 532% The `capture_type` specifies the default for this pattern. The 533% interface supports a different type for each _named_ group using 534% the syntax =|(?<name_T>...)|=, where =T= is one of =S= (string), 535% =A= (atom), =I= (integer), =F= (float), =N= (number), =T= (term) 536% and =R= (range). In the current implementation =I=, =F= and =N= are 537% synonyms for =T=. Future versions may act different if the parsed 538% value is not of the requested numeric type. 539 540%! re_compiled(+Spec, --Regex) is det. 541% 542% Create a compiled regex from a specification. Cached compiled 543% regular expressions can be reclaimed using re_flush/0. 544 545:- dynamic re_pool/3. 546:- volatile re_pool/3. 547 548re_compiled(Regex, Regex) :- 549 blob(Regex, regex), 550 !. 551re_compiled(Text/Flags, Regex) :- 552 must_be(text, Text), 553 must_be(atom, Flags), 554 re_pool(Text, Flags, Regex), 555 !. 556re_compiled(Text/Flags, Regex) :- 557 !, 558 re_flags_options(Flags, Options), 559 re_compile(Text, Regex, Options), 560 assertz(re_pool(Text, Flags, Regex)). 561re_compiled(Text, Regex) :- 562 must_be(text, Text), 563 re_pool(Text, '', Regex), 564 !. 565re_compiled(Text, Regex) :- 566 re_compiled(Text/'', Regex). 567 568re_flags_options(Flags, Options) :- 569 atom_chars(Flags, Chars), 570 maplist(re_flag_option, Chars, Options). 571 572re_flag_option(Flag, Option) :- 573 re_flag_option_(Flag, Option), 574 !. 575re_flag_option(Flag, _) :- 576 existence_error(re_flag, Flag). 577 578re_flag_option_(i, caseless(true)). 579re_flag_option_(m, multiline(true)). 580re_flag_option_(x, extended(true)). 581re_flag_option_(s, dotall(true)). 582re_flag_option_(a, capture_type(atom)). 583re_flag_option_(r, capture_type(range)). 584re_flag_option_(t, capture_type(term)). 585 586%! re_flush 587% 588% Clean pattern and replacement caches. 589% 590% @tbd Flush automatically if the cache becomes too large. 591 592re_flush :- 593 retractall(replacement_cache(_,_)), 594 retractall(re_pool(_,_,_)). 595 596%! re_config(+Term) 597% 598% Extract configuration information from the pcre library. Term is of 599% the form Name(Value). Name is derived from the =|PCRE_CONFIG_*|= 600% constant after removing =PCRE_CONFIG_= and mapping the name to lower 601% case, e.g. `utf8`, `unicode_properties`, etc. Value is either a 602% Prolog boolean, integer or atom. 603% 604% Finally, the functionality of pcre_version() is available using the 605% configuration name `version`. 606% 607% @see `man pcreapi` for details