sgml.pl -- SGML, XML and HTML parser
This library allows you to parse SGML, XML and HTML data into a Prolog data structure. The library defines several families of predicates:
- High-level predicates
-
Most users will only use load_html/3, load_xml/3 or load_sgml/3 to
parse arbitrary input into a DOM structure. These predicates all
call load_structure/3, which provides more options and may be
used for processing non-standard documents.
The DOM structure can be used by library(xpath) to extract information from the document.
- The low-level parser
- The actual parser is written in C and consists of two parts: one for processing DTD (Document Type Definitions) and one for parsing data. The data can either be parsed to a Prolog (DOM) term or the parser can perform callbacks for the DOM events.
- Utility predicates
- Finally, this library provides prmitives for classifying characters and strings according to the XML specification such as xml_name/1 to verify whether an atom is a valid XML name (identifier). It also provides primitives to quote attributes and CDATA elements.
- dtd(+Type, -DTD) is det
- DTD is a DTD object created from the file
dtd(Type)
. Loaded DTD objects are cached. Note that DTD objects may not be shared between threads. Therefore, dtd/2 maintains the pool of DTD objects using a thread_local predicate. DTD objects are destroyed if a thread terminates. - load_dtd(+DTD, +DtdFile, +Options)
- Load DtdFile into a DTD. Defined options are:
- dialect(+Dialect)
- Dialect to use (xml, xmlns, sgml)
- encoding(+Encoding)
- Encoding of DTD file
- load_structure(+Source, -ListOfContent, :Options) is det
- Parse Source and return the resulting structure in
ListOfContent. Source is handed to open_any/5, which allows for
processing an extensible set of input sources.
A proper XML document contains only a single toplevel element whose name matches the document type. Nevertheless, a list is returned for consistency with the representation of element content.
The
encoding(+Encoding)
option is treated special for compatibility reasons:- If Encoding is one of
iso-8859-1
,us-ascii
orutf-8
, the stream is opened in binary mode and the option is passed to the SGML parser. - If Encoding is present, but not one of the above, the stream is opened in text mode using the given encoding.
- Otherwise (no Encoding), the stream is opened in binary mode and doing the correct decoding is left to the parser.
- If Encoding is one of
- load_sgml_file(+File, -DOM) is det
- Load SGML from File and unify the resulting DOM structure with DOM.
- load_xml_file(+File, -DOM) is det
- Load XML from File and unify the resulting DOM structure with DOM.
- load_html_file(+File, -DOM) is det
- Load HTML from File and unify the resulting DOM structure with DOM.
- load_html(+Input, -DOM, +Options) is det
- Load HTML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
- dtd(DTD)
- Pass the DTD for HTML as obtained using
dtd(html, DTD)
. - dialect(Dialect)
- Current dialect from the Prolog flag
html_dialect
- max_errors(-1)
- syntax_errors(quiet)
- Most HTML encountered in the wild contains errors. Even in the context of errors, the resulting DOM term is often a reasonable guess at the intent of the author.
You may also want to use the library(http/http_open) to support loading from HTTP and HTTPS URLs. For example:
:- use_module(library(http/http_open)). :- use_module(library(sgml)). load_html_url(URL, DOM) :- load_html(URL, DOM, []).
- load_xml(+Input, -DOM, +Options) is det
- Load XML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
dialect(xml)
- load_sgml(+Input, -DOM, +Options) is det
- Load SGML text from Input and unify the resulting DOM structure
with DOM. Options are passed to load_structure/3, after adding
the following default options:
dialect(sgml)
- xml_quote_attribute(+In, -Quoted) is det
- xml_quote_cdata(+In, -Quoted) is det
- Backward compatibility for versions that allow to specify encoding. All characters that cannot fit the encoding are mapped to XML character entities (&#dd;). Using ASCII is the safest value.
- xml_name(+Atom) is semidet
- True if Atom is a valid XML name.
- xml_basechar(+CodeOrChar) is semidet
- xml_ideographic(+CodeOrChar) is semidet
- xml_combining_char(+CodeOrChar) is semidet
- xml_digit(+CodeOrChar) is semidet
- xml_extender(+CodeOrChar) is semidet
- XML character classification predicates. Each of these predicates accept both a character (one-character atom) and a code (integer).
- xml_is_dom(@Term) is semidet
- True if term statisfies the structure as returned by load_structure/3 and friends.
Undocumented predicates
The following predicates are exported, but not or incorrectly documented.
- xml_quote_attribute(Arg1, Arg2, Arg3)
- set_sgml_parser(Arg1, Arg2)
- free_dtd(Arg1)
- xml_extender(Arg1)
- sgml_register_catalog_file(Arg1, Arg2)
- free_sgml_parser(Arg1)
- new_dtd(Arg1, Arg2)
- load_dtd(Arg1, Arg2)
- xml_digit(Arg1)
- xsd_time_string(Arg1, Arg2, Arg3)
- xml_quote_cdata(Arg1, Arg2)
- sgml_parse(Arg1, Arg2)
- new_sgml_parser(Arg1, Arg2)
- dtd_property(Arg1, Arg2)
- iri_xml_namespace(Arg1, Arg2, Arg3)
- xml_combining_char(Arg1)
- xsd_number_string(Arg1, Arg2)
- get_sgml_parser(Arg1, Arg2)
- open_dtd(Arg1, Arg2, Arg3)
- xml_ideographic(Arg1)
- iri_xml_namespace(Arg1, Arg2)
- xml_name(Arg1, Arg2)
- xml_quote_cdata(Arg1, Arg2, Arg3)