1/* Part of SWI-Prolog 2 3 Author: Jan Wielemaker 4 E-mail: J.Wielemaker@vu.nl 5 WWW: http://www.swi-prolog.org 6 Copyright (c) 2009-2019, University of Amsterdam 7 VU University Amsterdam 8 CWI, Amsterdam 9 All rights reserved. 10 11 Redistribution and use in source and binary forms, with or without 12 modification, are permitted provided that the following conditions 13 are met: 14 15 1. Redistributions of source code must retain the above copyright 16 notice, this list of conditions and the following disclaimer. 17 18 2. Redistributions in binary form must reproduce the above copyright 19 notice, this list of conditions and the following disclaimer in 20 the documentation and/or other materials provided with the 21 distribution. 22 23 THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS 24 "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 25 LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS 26 FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE 27 COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, 28 INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, 29 BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; 30 LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER 31 CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT 32 LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN 33 ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE 34 POSSIBILITY OF SUCH DAMAGE. 35*/ 36 37:- module(xpath, 38 [ xpath/3, % +DOM, +Spec, -Value 39 xpath_chk/3, % +DOM, +Spec, -Value 40 41 op(400, fx, //), 42 op(400, fx, /), 43 op(200, fy, @) 44 ]). 45:- use_module(library(record),[record/1, op(_,_,record)]). 46 47:- autoload(library(debug),[assertion/1]). 48:- autoload(library(error),[instantiation_error/1,must_be/2]). 49:- autoload(library(lists),[member/2]). 50:- autoload(library(sgml),[xsd_number_string/2]).
67:- record
68 element(name, attributes, content).
74xpath_chk(DOM, Spec, Content) :-
75 xpath(DOM, Spec, Content),
76 !.
//
Term/
Term
The Terms above are of type callable. The functor specifies
the element name. The element name '*' refers to any element.
The name self
refers to the top-element itself and is often
used for processing matches of an earlier xpath/3 query. A term
NS:Term refers to an XML name in the namespace NS. Optional
arguments specify additional constraints and functions. The
arguments are processed from left to right. Defined conditional
argument values are:
last
last
- IntExprlast-1
is the element directly preceding the last one.index(Integer)
.last
index(last)
.last
- IntExprindex(last-IntExpr)
.Defined function argument values are:
self
content
text
text(As)
atom
or string
.normalize_space
text
, but uses normalize_space/2 to normalise
white-space in the outputnumber
@
Attributenumber
, but subsequently transform the value
into an integer using the round/1 function.number
, but subsequently transform the value
into a float using the float/1 function.In addition, the argument-list can be conditions:
content = content
defines that the content
of the element is the atom content
.
The functions lower_case
and upper_case
can be applied
to Right (see example below).contains(Haystack, Needle)
h3
element inside a div
element, where the div
element itself contains an h2
child with a strong
child.
//div(h2/strong)/h3
This is equivalent to the conjunction of XPath goals below.
..., xpath(DOM, //(div), Div), xpath(Div, h2/strong, _), xpath(Div, h3, Result)
Examples:
Match each table-row in DOM:
xpath(DOM, //tr, TR)
Match the last cell of each tablerow in DOM. This example illustrates that a result can be the input of subsequent xpath/3 queries. Using multiple queries on the intermediate TR term guarantee that all results come from the same table-row:
xpath(DOM, //tr, TR), xpath(TR, /td(last), TD)
Match each href
attribute in an <a> element
xpath(DOM, //a(@href), HREF)
Suppose we have a table containing rows where each first column is the name of a product with a link to details and the second is the price (a number). The following predicate matches the name, URL and price:
product(DOM, Name, URL, Price) :- xpath(DOM, //tr, TR), xpath(TR, td(1), C1), xpath(C1, /self(normalize_space), Name), xpath(C1, a(@href), URL), xpath(TR, td(2, number), Price).
Suppose we want to select books with genre="thriller" from a
tree containing elements <book genre=...>
thriller(DOM, Book) :- xpath(DOM, //book(@genre=thiller), Book).
Match the elements <table align="center">
and <table
align="CENTER">
:
//table(@align(lower) = center)
Get the width
and height
of a div
element as a number,
and the div
node itself:
xpath(DOM, //div(@width(number)=W, @height(number)=H), Div)
Note that div
is an infix operator, so parentheses must be
used in cases like the following:
xpath(DOM, //(div), Div)
262xpath(DOM, Spec, Content) :- 263 in_dom(Spec, DOM, Content). 264 265in_dom(//Spec, DOM, Value) :- 266 !, 267 element_spec(Spec, Name, Modifiers), 268 sub_dom(I, Len, Name, E, DOM), 269 modifiers(Modifiers, I, Len, E, Value). 270in_dom(/Spec, E, Value) :- 271 !, 272 element_spec(Spec, Name, Modifiers), 273 ( Name == self 274 -> true 275 ; element_name(E, Name) 276 ), 277 modifiers(Modifiers, 1, 1, E, Value). 278in_dom(A/B, DOM, Value) :- 279 !, 280 in_dom(A, DOM, Value0), 281 in_dom(B, Value0, Value). 282in_dom(A//B, DOM, Value) :- 283 !, 284 in_dom(A, DOM, Value0), 285 in_dom(//B, Value0, Value). 286in_dom(Spec, element(_, _, Content), Value) :- 287 element_spec(Spec, Name, Modifiers), 288 count_named_elements(Content, Name, CLen), 289 CLen > 0, 290 nth_element(N, Name, E, Content), 291 modifiers(Modifiers, N, CLen, E, Value). 292 293element_spec(Var, _, _) :- 294 var(Var), 295 !, 296 instantiation_error(Var). 297element_spec(NS:Term, NS:Name, Modifiers) :- 298 !, 299 callable_name_arguments(Term, Name0, Modifiers), 300 star(Name0, Name). 301element_spec(Term, Name, Modifiers) :- 302 !, 303 callable_name_arguments(Term, Name0, Modifiers), 304 star(Name0, Name). 305 306callable_name_arguments(Atom, Name, Arguments) :- 307 atom(Atom), 308 !, 309 Name = Atom, Arguments = []. 310callable_name_arguments(Compound, Name, Arguments) :- 311 compound_name_arguments(Compound, Name, Arguments). 312 313 314star(*, _) :- !. 315star(Name, Name).
327sub_dom(1, 1, Name, DOM, DOM) :- 328 element_name(DOM, Name0), 329 \+ Name \= Name0. 330sub_dom(N, Len, Name, E, element(_,_,Content)) :- 331 !, 332 sub_dom_2(N, Len, Name, E, Content). 333sub_dom(N, Len, Name, E, Content) :- 334 is_list(Content), 335 sub_dom_2(N, Len, Name, E, Content). 336 337sub_dom_2(N, Len, Name, Element, Content) :- 338 ( count_named_elements(Content, Name, Len), 339 nth_element(N, Name, Element, Content) 340 ; member(element(_,_,C2), Content), 341 sub_dom_2(N, Len, Name, Element, C2) 342 ).
349count_named_elements(Content, Name, Count) :- 350 count_named_elements(Content, Name, 0, Count). 351 352count_named_elements([], _, Count, Count). 353count_named_elements([element(Name,_,_)|T], Name0, C0, C) :- 354 \+ Name \= Name0, 355 !, 356 C1 is C0+1, 357 count_named_elements(T, Name0, C1, C). 358count_named_elements([_|T], Name, C0, C) :- 359 count_named_elements(T, Name, C0, C).
366nth_element(N, Name, Element, Content) :- 367 nth_element_(1, N, Name, Element, Content). 368 369nth_element_(I, N, Name, E, [H|T]) :- 370 element_name(H, Name0), 371 \+ Name \= Name0, 372 !, 373 ( N = I, 374 E = H 375 ; I2 is I + 1, 376 ( nonvar(N), I2 > N 377 -> !, fail 378 ; true 379 ), 380 nth_element_(I2, N, Name, E, T) 381 ). 382nth_element_(I, N, Name, E, [_|T]) :- 383 nth_element_(I, N, Name, E, T).
390modifiers([], _, _, Value, Value). 391modifiers([H|T], I, L, Value0, Value) :- 392 modifier(H, I, L, Value0, Value1), 393 modifiers(T, I, L, Value1, Value). 394 395modifier(M, _, _, _, _) :- 396 var(M), 397 !, 398 instantiation_error(M). 399modifier(Index, I, L, Value0, Value) :- 400 implicit_index_modifier(Index), 401 !, 402 Value = Value0, 403 index_modifier(Index, I, L). 404modifier(index(Index), I, L, Value, Value) :- 405 !, 406 index_modifier(Index, I, L). 407modifier(Function, _, _, In, Out) :- 408 xpath_function(Function), 409 !, 410 xpath_function(Function, In, Out). 411modifier(Function, _, _, In, Out) :- 412 xpath_condition(Function, In), 413 Out = In. 414 415implicit_index_modifier(I) :- 416 integer(I), 417 !. 418implicit_index_modifier(last). 419implicit_index_modifier(last-_Expr). 420 421index_modifier(Var, I, _L) :- 422 var(Var), 423 !, 424 Var = I. 425index_modifier(last, I, L) :- 426 !, 427 I =:= L. 428index_modifier(last-Expr, I, L) :- 429 !, 430 I =:= L-Expr. 431index_modifier(N, I, _) :- 432 N =:= I. 433 434xpath_function(self, DOM, Value) :- % self 435 !, 436 Value = DOM. 437xpath_function(content, Element, Value) :- % content 438 !, 439 element_content(Element, Value). 440xpath_function(text, DOM, Text) :- % text 441 !, 442 text_of_dom(DOM, atom, Text). 443xpath_function(text(As), DOM, Text) :- % text(As) 444 !, 445 text_of_dom(DOM, As, Text). 446xpath_function(normalize_space, DOM, Text) :- % normalize_space 447 !, 448 text_of_dom(DOM, string, Text0), 449 normalize_space(atom(Text), Text0). 450xpath_function(number, DOM, Number) :- % number 451 !, 452 text_of_dom(DOM, string, Text0), 453 normalize_space(string(Text), Text0), 454 catch(xsd_number_string(Number, Text), _, fail). 455xpath_function(@Name, element(_, Attrs, _), Value) :- % @Name 456 !, 457 ( atom(Name) 458 -> memberchk(Name=Value, Attrs) 459 ; compound(Name) 460 -> compound_name_arguments(Name, AName, AOps), 461 memberchk(AName=Value0, Attrs), 462 translate_attribute(AOps, Value0, Value) 463 ; member(Name=Value, Attrs) 464 ). 465xpath_function(quote(Value), _, Value). % quote(Value) 466 467xpath_function(self). 468xpath_function(content). 469xpath_function(text). 470xpath_function(text(_)). 471xpath_function(normalize_space). 472xpath_function(number). 473xpath_function(@_). 474xpath_function(quote(_)). 475 476translate_attribute([], Value, Value). 477translate_attribute([H|T], Value0, Value) :- 478 translate_attr(H, Value0, Value1), 479 translate_attribute(T, Value1, Value). 480 481translate_attr(number, Value0, Value) :- 482 xsd_number_string(Value, Value0). 483translate_attr(integer, Value0, Value) :- 484 xsd_number_string(Value1, Value0), 485 Value is round(Value1). 486translate_attr(float, Value0, Value) :- 487 xsd_number_string(Value1, Value0), 488 Value is float(Value1). 489translate_attr(string, Value0, Value) :- 490 atom_string(Value0, Value). 491translate_attr(lower, Value0, Value) :- 492 ( atom(Value0) 493 -> downcase_atom(Value0, Value) 494 ; string_lower(Value0, Value) 495 ). 496translate_attr(upper, Value0, Value) :- 497 ( atom(Value0) 498 -> upcase_atom(Value0, Value) 499 ; string_upper(Value0, Value) 500 ). 501 502xpath_condition(Left = Right, Value) :- % = 503 !, 504 var_or_function(Left, Value, LeftValue), 505 process_equality(LeftValue, Right). 506xpath_condition(contains(Haystack, Needle), Value) :- % contains(Haystack, Needle) 507 !, 508 val_or_function(Haystack, Value, HaystackValue), 509 val_or_function(Needle, Value, NeedleValue), 510 atom(HaystackValue), atom(NeedleValue), 511 ( sub_atom(HaystackValue, _, _, _, NeedleValue) 512 -> true 513 ). 514xpath_condition(Spec, Dom) :- 515 in_dom(Spec, Dom, _).
For example the XPath expression in [1], and the equivalent Prolog expression in [2], would both match the HTML element in [3].
[1] //table[align=lower-case(center)] [2] //table(@align=lower_case(center)) [3] <table align="CENTER">
533process_equality(Left, Right) :- 534 var(Right), 535 !, 536 Left = Right. 537process_equality(Left, lower_case(Right)) :- 538 !, 539 downcase_atom(Left, Right). 540process_equality(Left, upper_case(Right)) :- 541 !, 542 upcase_atom(Left, Right). 543process_equality(Left, Right) :- 544 Left = Right, 545 !. 546process_equality(Left, Right) :- 547 atom(Left), 548 atomic(Right), 549 \+ atom(Left), 550 atom_string(Left, Right). 551 552var_or_function(Arg, _, Arg) :- 553 var(Arg), 554 !. 555var_or_function(Func, Value0, Value) :- 556 xpath_function(Func), 557 !, 558 xpath_function(Func, Value0, Value). 559var_or_function(Value, _, Value). 560 561val_or_function(Arg, _, Arg) :- 562 var(Arg), 563 !, 564 instantiation_error(Arg). 565val_or_function(Func, Value0, Value) :- % TBD 566 xpath_function(Func, Value0, Value), 567 !. 568val_or_function(Value, _, Value).
575text_of_dom(DOM, As, Text) :- 576 phrase(text_of(DOM), Tokens), 577 ( As == atom 578 -> atomic_list_concat(Tokens, Text) 579 ; As == string 580 -> atomics_to_string(Tokens, Text) 581 ; must_be(oneof([atom,string]), As) 582 ). 583 584text_of(element(_,_,Content)) --> 585 text_of_list(Content). 586text_of([]) --> 587 []. 588text_of([H|T]) --> 589 text_of(H), 590 text_of(T). 591 592 593text_of_list([]) --> 594 []. 595text_of_list([H|T]) --> 596 text_of_1(H), 597 text_of_list(T). 598 599 600text_of_1(element(_,_,Content)) --> 601 !, 602 text_of_list(Content). 603text_of_1(Data) --> 604 { assertion(atom_or_string(Data)) }, 605 [Data]. 606 607atom_or_string(Data) :- 608 ( atom(Data) 609 -> true 610 ; string(Data) 611 )
Select nodes in an XML DOM
The library
xpath.pl
provides predicates to select nodes from an XML DOM tree as produced by library(sgml) based on descriptions inspired by the XPath language.The predicate xpath/3 selects a sub-structure of the DOM non-deterministically based on an XPath-like specification. Not all selectors of XPath are implemented, but the ability to mix xpath/3 calls with arbitrary Prolog code provides a powerful tool for extracting information from XML parse-trees.