/*  Part of SWI-Prolog

    Author:        Jan Wielemaker
    E-mail:        J.Wielemaker@vu.nl
    WWW:           http://www.swi-prolog.org
    Copyright (c)  2010-2020, University of Amsterdam
                              CWI, Amsterdam
    All rights reserved.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:

    1. Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.

    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in
       the documentation and/or other materials provided with the
       distribution.

    THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
    "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
    LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS
    FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
    COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
    INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
    BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
    LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
    CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
    ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
    POSSIBILITY OF SUCH DAMAGE.
*/

:- module(unicode,
          [ unicode_property/2,         % ?Code, ?Property
            unicode_map/3,              % +In, -Out, +Options
            unicode_nfd/2,              % +In, -Out
            unicode_nfc/2,              % +In, -Out
            unicode_nfkd/2,             % +In, -Out
            unicode_nfkc/2              % +In, -Out
          ]).
:- use_foreign_library(foreign(unicode4pl)).

/** <module> Unicode string handling

This library is a wrapper around the
[[utf8proc][http://www.public-software-group.org/utf8proc]] library,
providing information about Unicode code-points and performing
operations (mappings) on Unicode atoms.  The central predicate is
unicode_map/3, mapping a Unicode atom to another Unicode atom using a
sequence of operations.  The predicates unicode_nfd/2, unicode_nfc/2,
unicode_nfkd/2 and unicode_nfkc/2 implement the four standard Unicode
normalization forms.

Lump handling:

==
U+0020      <-- all space characters (general category Zs)
U+0027  '   <-- left/right single quotation mark U+2018..2019,
                modifier letter apostrophe U+02BC,
                modifier letter vertical line U+02C8
U+002D  -   <-- all dash characters (general category Pd),
                minus U+2212
U+002F  /   <-- fraction slash U+2044,
                division slash U+2215
U+003A  :   <-- ratio U+2236
U+003C  <   <-- single left-pointing angle quotation mark U+2039,
                left-pointing angle bracket U+2329,
                left angle bracket U+3008
U+003E  >   <-- single right-pointing angle quotation mark U+203A,
                right-pointing angle bracket U+232A,
                right angle bracket U+3009
U+005C  \   <-- set minus U+2216
U+005E  ^   <-- modifier letter up arrowhead U+02C4,
                modifier letter circumflex accent U+02C6,
                caret U+2038,
                up arrowhead U+2303
U+005F  _   <-- all connector characters (general category Pc),
                modifier letter low macron U+02CD
U+0060  `   <-- modifier letter grave accent U+02CB
U+007C  |   <-- divides U+2223
U+007E  ~   <-- tilde operator U+223C
==

@see http://www.public-software-group.org/utf8proc
*/

system:goal_expansion(unicode_map(In, Out, Options),
                      unicode_map(In, Out, Mask)) :-
    is_list(Options),
    unicode_option_mask(Options, Mask).

%!  unicode_map(+In, -Out, +Options) is det.
%
%   Perform Unicode normalization operations.  Options is a list
%   of operations.  Defined operations are:
%
%     * stable
%     Unicode Versioning Stability has to be respected.
%     * compat
%     Compatibility decomposition (i.e. formatting information is lost)
%     * compose
%     Return a result with composed characters.
%     * decompose
%     Return a result with decomposed characters.
%     * ignore
%     Strip "default ignorable characters"
%     * rejectna
%     Return an error if the input contains unassigned code
%     points.
%     * nlf2ls
%     Indicates that NLF-sequences (LF, CRLF, CR, NEL) represent
%     a line break and should be converted to the Unicode
%     character for line separation (LS).
%     * nlf2ps
%     Indicates that NLF-sequences represent a paragraph break
%     and should be converted to the Unicode character for
%     paragraph separation (PS).
%     * nlf2lf
%     Indicates that the meaning of NLF-sequences is unknown.
%     * stripcc
%     Strips and/or converts control characters.
%     NLF-sequences are transformed into space, except if one of
%     the NLF2LS/PS/LF options is given.
%     HorizontalTab (HT) and FormFeed (FF) are treated as a
%     NLF-sequence in this case.
%     All other control characters are simply removed.
%     * casefold
%     Performs Unicode case folding, to be able to do a
%     case-insensitive string comparison.
%     * charbound
%     Inserts 0xFF bytes at the beginning of each sequence that
%     represents a single grapheme cluster (see UAX#29).
%     * lump
%     Lumps certain characters together
%     (e.g. HYPHEN U+2010 and MINUS U+2212 to ASCII "-").
%     (See module header for details.)
%     If NLF2LF is set, this includes a transformation of
%     paragraph and line separators to ASCII line-feed (LF).
%     * stripmark
%     Strips all character markings
%     (non-spacing, spacing and enclosing) (i.e. accents)
%     NOTE: this option works only with =compose= or =decompose=.

%!  unicode_nfd(+In, -Out) is det.
%
%   Characters are decomposed by canonical equivalence.

unicode_nfd(In, Out) :-
    unicode_map(In, Out, [stable,decompose]).

%!  unicode_nfc(+In, -Out) is det.
%
%   Characters are decomposed and then recomposed by canonical
%   equivalence.  It is possible for the result to be a different
%   sequence of characters than the original.
%
%   @see http://en.wikipedia.org/wiki/Unicode_equivalence#Normal_forms

unicode_nfc(In, Out) :-
    unicode_map(In, Out, [stable,compose]).

%!  unicode_nfkd(+In, -Out) is det.
%
%   Characters are decomposed by compatibility equivalence.

unicode_nfkd(In, Out) :-
    unicode_map(In, Out, [stable,decompose,compat]).

%!  unicode_nfkc(+In, -Out) is det.
%
%   Characters are decomposed by compatibility equivalence, then
%   recomposed by canonical equivalence.

unicode_nfkc(In, Out) :-
    unicode_map(In, Out, [stable,compose,compat]).


%!  unicode_property(?Char, ?Property) is nondet.
%
%   True if Property is defined for Char.  Property is a term
%   Name(Value).  Defined property-names are:
%
%     * category(atom)
%     Unicode code category of Char.  This is one of Cc, Cf, Cn,
%     Co, Cs, Ll, Lm, Lo, Lt, Lu, Mc, Me, Mn, Nd, Nl, No, Pc, Pd,
%     Pe, Pf, Pi, Po, Ps, Sc, Sk, Sm, So, Zl, Zp, Zs.  When
%     testing, a single letter stands for all its subcategories.
%     E.g. to test for a letter, you can use
%
%       ==
%       unicode_property(C, category('L'))
%       ==
%
%     * combining_class(integer)
%     * bidi_class(atom)
%     * decomp_type(atom)
%     * decomp_mapping(list(code))
%     * bidi_mirrored(bool)
%     * uppercase_mapping(code)
%     * lowercase_mapping(code)
%     * titlecase_mapping(code)
%     * comb1st_index(code)
%     * comb2nd_index(code)
%     * comp_exclusion(bool)
%     * ignorable(bool)
%     * control_boundary(bool)
%     * extend(bool)
%     * casefold_mapping(list(code))
%
%   @tbd Complete documentation

unicode_property(Code, Property) :-
    nonvar(Code), nonvar(Property),
    !,
    '$unicode_property'(Code, Property).
unicode_property(Code, Property) :-
    nonvar(Code),
    !,
    property(Property),
    '$unicode_property'(Code, Property).
unicode_property(Code, Property) :-
    var(Code),
    !,
    between(0, 0x10ffff, Code),
    property(Property),
    '$unicode_property'(Code, Property).

property(category(_)).
property(combining_class(_)).
property(bidi_class(_)).
property(decomp_type(_)).
property(decomp_mapping(_)).
property(bidi_mirrored(_)).
property(uppercase_mapping(_)).
property(lowercase_mapping(_)).
property(titlecase_mapping(_)).
property(comb1st_index(_)).
property(comb2nd_index(_)).
property(comp_exclusion(_)).
property(ignorable(_)).
property(control_boundary(_)).
property(extend(_)).
property(casefold_mapping(_)).
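
/* Usage sketch (illustrative only; not part of the library proper).

The queries below are a minimal sketch of how the exported predicates
might be called from the toplevel.  They assume the library and its
foreign module are installed and loadable as library(unicode); the
sample inputs (U+00E9, U+FB01, the atom 'Ab') are arbitrary choices
made for illustration.  No answers are shown because they depend on
the utf8proc version the library was built against.

==
?- use_module(library(unicode)).

% Canonical decomposition of a precomposed character (U+00E9),
% then inspection of the resulting code points.
?- unicode_nfd('\u00E9', D), atom_codes(D, Codes).

% Compatibility composition of a ligature (U+FB01).
?- unicode_nfkc('\uFB01', A).

% Combining several operations in one unicode_map/3 call.
?- unicode_map('Ab', Out, [stable, compose, casefold]).

% Query one property of a code point, then enumerate all of them.
?- unicode_property(0'a, category(C)).
?- forall(unicode_property(0'a, P), (print(P), nl)).
==

Calling unicode_property/2 with an unbound first argument enumerates
code points from 0 up to 0x10ffff, so such a query is best restricted
to a single property.
*/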