XML Path Language (XPath)

Abstract

XPath is a language for addressing parts of an XML document.

Status of this document

This document is based on the W3C Recommendation of 16 November 1999 (http://www.w3.org/TR/1999/REC-xpath-19991116). It describes the XPath language as implemented in the CLaRK System. The implementation covers almost the whole language. Because XPath is a general-purpose language, a few minor features that the system does not need have been left out. On the other hand, there are some extensions, which are described in this document. The implementation also covers the abbreviated syntax.


1 Introduction

XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer [XPointer]. The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.

In addition to its use for addressing, XPath is also designed so that it has a natural subset that can be used for matching (testing whether or not a node matches a pattern); this use of XPath is described in XSLT.

XPath models an XML document as a tree of nodes. There are different types of nodes, including element nodes, attribute nodes and text nodes. XPath defines a way to compute a string-value for each type of node. Some types of nodes also have names.

The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an object, which has one of the following four basic types:

  • node-set (an unordered collection of nodes without duplicates)
  • boolean (true or false)
  • number (a floating-point number)
  • string (a sequence of UCS characters)

Expression evaluation occurs with respect to a context. The context consists of:

  • a node (the context node)
  • a pair of non-zero positive integers (the context position and the context size)
  • a function library

The context position is always less than or equal to the context size.

The function library consists of a mapping from function names to functions. Each function takes zero or more arguments and returns a single result. This document defines a core function library that the XPath implementation supports. For a function in the core function library, arguments and result are of the four basic types.

The context node, context position, and context size used to evaluate a subexpression are sometimes different from those used to evaluate the containing expression. Several kinds of expressions change the context node; only predicates change the context position and context size. When the evaluation of a kind of expression is described, it will always be explicitly stated if the context node, context position, and context size change for the evaluation of subexpressions; if nothing is said about the context node, context position, and context size, they remain unchanged for the evaluation of subexpressions of that kind of expression.

The grammar specified in this section applies to the attribute value after XML 1.0 normalization. So, for example, if the grammar uses the character <, this must not appear in the XML source as < but must be quoted according to XML 1.0 rules by, for example, entering it as &lt;. Within expressions, literal strings are delimited by single or double quotation marks, which are also used to delimit XML attributes. To avoid a quotation mark in an expression being interpreted by the XML processor as terminating the attribute value the quotation mark can be entered as a character reference (&quot; or &apos;). Alternatively, the expression can use single quotation marks if the XML attribute is delimited with double quotation marks or vice-versa.
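
The quoting rules above can be illustrated with a minimal Python sketch that uses only the standard library. The element name rule and the attribute name select are hypothetical and serve only as a container for the expression; they are not part of the CLaRK System.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

# An XPath expression containing both "<" and a double-quoted literal.
expression = 'child::para[position() < 3][attribute::type="warning"]'

# quoteattr() escapes "<" as "&lt;" and picks an attribute delimiter that
# does not clash with the quotation marks used inside the expression.
quoted = quoteattr(expression)

# "rule" and "select" are hypothetical names used only for this example.
xml_source = "<rule select=%s/>" % quoted

# After XML parsing the original expression is recovered unchanged.
recovered = ET.fromstring(xml_source).get("select")
```

Here the expression uses double quotation marks for its literal, so the attribute value ends up delimited by single quotation marks, which is the second alternative described above.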

One important kind of expression is a location path. A location path selects a set of nodes relative to the context node. The result of evaluating an expression that is a location path is the node-set containing the nodes selected by the location path. Location paths can recursively contain expressions that are used to filter sets of nodes.

2 Location Paths

Although location paths are not the most general grammatical construct in the language, they are the most important construct and will therefore be described first.

Every location path can be expressed using a straightforward but rather verbose syntax. There are also a number of syntactic abbreviations that allow common cases to be expressed concisely. This section will explain the semantics of location paths using the unabbreviated syntax. The abbreviated syntax will then be explained by showing how it expands into the unabbreviated syntax.

Here are some examples of location paths using the unabbreviated syntax:

  • child::para selects the para element children of the context node
  • child::text() selects all text node children of the context node
  • child::node() or child::* selects all the children of the context node, whatever their node type
  • attribute::name selects the name attribute of the context node
  • attribute::* selects all the attributes of the context node
  • descendant::para selects the para element descendants of the context node
  • ancestor::div selects all div ancestors of the context node
  • ancestor-or-self::div selects the div ancestors of the context node and, if the context node is a div element, the context node as well
  • descendant-or-self::para selects the para element descendants of the context node and, if the context node is a para element, the context node as well
  • self::para selects the context node if it is a para element, and otherwise selects nothing
  • child::chapter/descendant::para selects the para element descendants of the chapter element children of the context node
  • child::*/child::para selects all para grandchildren of the context node
  • / selects the document root (which is always the parent of the document element)
  • /descendant::para selects all the para elements in the same document as the context node
  • /descendant::olist/child::item selects all the item elements that have an olist parent and that are in the same document as the context node
  • child::para[position()=1] selects the first para child of the context node
  • child::para[position()=last()] selects the last para child of the context node
  • child::para[position()=last()-1] selects the last but one para child of the context node
  • child::para[position()>1] selects all the para children of the context node other than the first para child of the context node
  • following-sibling::chapter[position()=1] selects the next chapter sibling of the context node
  • preceding-sibling::chapter[position()=1] selects the previous chapter sibling of the context node
  • /descendant::figure[position()=42] selects the forty-second figure element in the document
  • /child::doc/child::chapter[position()=5]/child::section[position()=2] selects the second section of the fifth chapter of the doc document element
  • child::para[attribute::type="warning"] selects all para children of the context node that have a type attribute with value warning
  • child::para[attribute::type='warning'][position()=5] selects the fifth para child of the context node that has a type attribute with value warning
  • child::para[position()=5][attribute::type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
  • child::chapter[child::title='Introduction'] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
  • child::chapter[child::title] selects the chapter children of the context node that have one or more title children
  • child::*[self::chapter or self::appendix] selects the chapter and appendix children of the context node
  • child::*[self::chapter or self::appendix][position()=last()] selects the last chapter or appendix child of the context node

There are two kinds of location path: relative location paths and absolute location paths.

A relative location path consists of a sequence of one or more location steps separated by /. The steps in a relative location path are composed together from left to right. Each step in turn selects a set of nodes relative to a context node. An initial sequence of steps is composed together with a following step as follows. The initial sequence of steps selects a set of nodes relative to a context node. Each node in that set is used as a context node for the following step. The sets of nodes identified by that step are united together. The set of nodes identified by the composition of the steps is this union. For example, child::div/child::para selects the para element children of the div element children of the context node, or, in other words, the para element grandchildren that have div parents.
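
The composition of steps described above can be sketched in Python as follows. This is only an illustration built on the standard xml.etree.ElementTree module, not the CLaRK implementation; the helper names child_step and compose are assumptions introduced for this example, and only the child axis with a name test is modelled.

```python
import xml.etree.ElementTree as ET

def child_step(name):
    """Return a step function selecting the child elements called `name`."""
    def step(context_node):
        return [child for child in context_node if child.tag == name]
    return step

def compose(steps, context_node):
    """Evaluate the steps left to right, uniting the per-node results."""
    current = [context_node]
    for step in steps:
        united = []
        for node in current:                 # each node becomes a context node
            for selected in step(node):
                if not any(selected is seen for seen in united):
                    united.append(selected)  # union: no duplicates
        current = united
    return current

doc = ET.fromstring(
    "<doc><div><para>a</para><para>b</para></div>"
    "<para>c</para><div><para>d</para></div></doc>"
)

# child::div/child::para with the doc element as the context node
result = [p.text for p in compose([child_step("div"), child_step("para")], doc)]
```

The para element with content c is a child of doc rather than of a div, so it is not part of the result.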

An absolute location path consists of / followed by a relative location path. The / selects the root node of the document as the context node for the relative location path that follows. A / on its own does not select the root node; this can be done with /self::*.

2.1 Location Steps

A location step has three parts:

  • an axis, which specifies the tree relationship between the nodes selected by the location step and the context node,
  • a node test, which specifies the node type or the name of the nodes selected by the location step, and
  • zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.

The syntax for a location step is the axis name and node test separated by a double colon, followed by zero or more expressions each in square brackets. For example, in child::para[position()=1], child is the name of the axis, para is the node test and [position()=1] is a predicate.

The node-set selected by the location step is the node-set that results from generating an initial node-set from the axis and node-test, and then filtering that node-set by each of the predicates in turn.

The initial node-set consists of the nodes having the relationship to the context node specified by the axis, and having the node type or name specified by the node test. For example, a location step descendant::para selects the para element descendants of the context node: descendant specifies that each node in the initial node-set must be a descendant of the context; para specifies that each node in the initial node-set must be an element named para. The available axes are described in Axes. The available node tests are described in Node Tests.

The initial node-set is filtered by the first predicate to generate a new node-set; this new node-set is then filtered using the second predicate, and so on. The final node-set is the node-set selected by the location step. The axis affects how the expression in each predicate is evaluated and so the semantics of a predicate is defined with respect to an axis.
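
A minimal Python sketch of this successive filtering is given below. It assumes a forward axis, models the initial node-set as a list in document order, and represents each predicate as a hypothetical Python function of the node, its position and the size of the node list being filtered.

```python
import xml.etree.ElementTree as ET

def apply_predicates(initial, predicates):
    """Filter a node list by each predicate in turn.

    Position and size are recomputed for every predicate from the node
    list that this predicate is filtering.
    """
    nodes = list(initial)
    for predicate in predicates:
        size = len(nodes)
        nodes = [node for position, node in enumerate(nodes, start=1)
                 if predicate(node, position, size)]
    return nodes

def is_warning(node, position, size):
    return node.get("type") == "warning"

def is_second(node, position, size):
    return position == 2

context = ET.fromstring(
    "<sec><para type='note'>1</para><para type='warning'>2</para>"
    "<para type='warning'>3</para></sec>"
)
# Initial node-set for the step: child axis plus the name test para.
paras = [child for child in context if child.tag == "para"]

# child::para[attribute::type="warning"][position()=2]
second_warning = [p.text for p in apply_predicates(paras, [is_warning, is_second])]
# child::para[position()=2][attribute::type="warning"]
second_if_warning = [p.text for p in apply_predicates(paras, [is_second, is_warning])]
```

The two results differ (the third para in the first case, the second para in the second case), which is why the order of predicates matters.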

2.2 Axes

The following axes are available:

  • the child axis contains the children of the context node
  • the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute nodes
  • the parent axis contains the parent of the context node, if there is one
  • the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of context node and the parent's parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node
  • the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node, the following-sibling axis is empty
  • the preceding-sibling axis contains all the preceding siblings of the context node; if the context node is an attribute node, the preceding-sibling axis is empty
  • the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes.
  • the preceding axis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes.
  • the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element
  • the self axis contains just the context node itself
  • the descendant-or-self axis contains the context node and the descendants of the context node
  • the ancestor-or-self axis contains the context node and the ancestors of the context node; thus, the ancestor-or-self axis will always include the root node

NOTE: The ancestor, descendant, following, preceding and self axes partition a document (ignoring attributes): they do not overlap and together they contain all the nodes in the document.
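
The partition stated in the NOTE can be checked with a small Python sketch. It is an approximation under stated assumptions: xml.etree.ElementTree models only element nodes (text nodes and a separate root node are not represented), so the document element stands in for the top of the tree, and the helper functions are written for this example only.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/><d/></b><e><f/></e><g/></a>")

order = list(doc.iter())                      # elements in document order
parent = {child: p for p in order for child in p}

def ancestors(node):
    """Ancestors of node, nearest first."""
    result = []
    while node in parent:
        node = parent[node]
        result.append(node)
    return result

def descendants(node):
    return list(node.iter())[1:]              # iter() yields the node itself first

def following(node):
    excluded = set(descendants(node)) | {node}
    index = order.index(node)
    return [n for n in order[index + 1:] if n not in excluded]

def preceding(node):
    excluded = set(ancestors(node))
    index = order.index(node)
    return [n for n in order[:index] if n not in excluded]

context = doc.find("e")
axes = {
    "ancestor": ancestors(context),
    "descendant": descendants(context),
    "following": following(context),
    "preceding": preceding(context),
    "self": [context],
}
names = {axis: [n.tag for n in nodes] for axis, nodes in axes.items()}
```

With e as the context node, every element of the sample document lands in exactly one of the five axes.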

2.3 Node Tests

The node tests are divided into two categories: node type tests and node name tests.

There is a fixed set of node type tests. They are:

  • text()

Preserves only the text nodes from the initial node-set.

  • text(<text>)

Preserves only those text nodes from the initial node-set which contain <text> as a substring of their text content. For example, child::text("play") will return all text nodes which are children of the context node and contain the token "play".

  • text( <mode>, <text>)

Preserves only those text nodes from the initial node-set which contain <text> at a certain position, depending on <mode>. The modes are denoted by four integers as follows:

  • 1 - the tested node's value must have the <text> as a prefix (i.e. start with it);
  • 2 - the tested node's value must contain the <text> as a substring;
  • 3 - the tested node's value must have the <text> as a suffix (i.e. end with it);
  • 4 - the tested node's value must match <text> exactly.
The mode identifier may or may not be quoted. Example: child::text(1, "aba") will preserve only text nodes whose content starts with "aba".
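
The four matching modes can be sketched in Python as below. The function name text_matches is an assumption made for this example; case-sensitive matching and the error raised for an unknown mode are also assumptions, since the text above does not specify them.

```python
def text_matches(mode, text, value):
    """Test a text node's value against `text` according to `mode` (1-4)."""
    mode = int(mode)                 # the mode may be given quoted, e.g. "1"
    if mode == 1:
        return value.startswith(text)    # prefix
    if mode == 2:
        return text in value             # substring
    if mode == 3:
        return value.endswith(text)      # suffix
    if mode == 4:
        return value == text             # exact match
    raise ValueError("mode must be 1, 2, 3 or 4")

values = ["abacus", "cabana", "rumba aba", "aba"]
prefix = [v for v in values if text_matches(1, "aba", v)]
substring = [v for v in values if text_matches(2, "aba", v)]
suffix = [v for v in values if text_matches("3", "aba", v)]
exact = [v for v in values if text_matches(4, "aba", v)]
```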

  • text( <mode>, (y|n), '('<reg. expr.>')')

This node test is an extension of the previous node test. The main difference is that the search pattern is not a fixed string, but a regular expression. The syntax of the regular expression is the same as the one in the Grammar tool. The regular expression must be enclosed in parentheses, ( and ). The tokenizer used in the evaluation process is the one specified for the current document/element in section Definitions/Element Features. The second parameter enables or disables the use of token normalization, as defined in the corresponding tokenizer.

  • node()
This node test does not filter the initial node-set. It is used when all the nodes selected from the axis are needed for further evaluation. As a shorthand, * can be used instead, i.e. child::* is the same as child::node().
  • element()
From the initial node-set only the element nodes remain.
  • attribute()
From the initial node-set only the attribute nodes remain. Note that the initial node-set may contain nodes other than attributes.
  • attribute(<attributeName>)

Keeps only the element nodes which have an attribute named <attributeName>. Example: child::attribute(id) will take only the element nodes, children of the context node, which have an attribute id.

  • attribute(<attributeName> = "<attributeValue>")
This node test is almost the same as the attribute(<attributeName>) node test, but it additionally restricts the value of the attribute. Example: child::attribute(id="243") will take only those element nodes which have an attribute id set to the value 243.
  • comment()
From the initial node-set only comment nodes remain.
  • processing-instruction()
From the initial node-set only processing-instruction nodes remain.
  • content(<mode>, '('<reg. expr.>')') and content(<mode>, <query_name>)
The behaviour of this node test is similar to that of text(<mode>, (y|n), '('<reg. expr.>')'). The advantage is that it is not restricted to text nodes: it applies to nodes of any type and their content. The matching modes are the same as described above. The second parameter specifies either a regular expression (in parentheses) or a Grammar Tool query name (quoted). If the query is not found, the system raises an error message.

The name node tests are used to filter the initial node-set for element nodes with a given name. All non-element nodes and all element nodes with other names are removed from the set. Example: child::para takes only the para element nodes that are children of the context node.

 

2.4 Predicates

An axis is either a forward axis or a reverse axis. An axis that only ever contains the context node or nodes that are after the context node in document order is a forward axis. An axis that only ever contains the context node or nodes that are before the context node in document order is a reverse axis. Thus, the ancestor, ancestor-or-self, preceding, and preceding-sibling axes are reverse axes; all other axes are forward axes. Since the self axis always contains at most one node, it makes no difference whether it is a forward or reverse axis. The proximity position of a member of a node-set with respect to an axis is defined to be the position of the node in the node-set ordered in document order if the axis is a forward axis and ordered in reverse document order if the axis is a reverse axis. The first position is 1.

A predicate filters a node-set with respect to an axis to produce a new node-set. For each node in the node-set to be filtered, the predicate is evaluated with that node as the context node, with the number of nodes in the node-set as the context size, and with the proximity position of the node in the node-set with respect to the axis as the context position; if the predicate evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.

A predicate is evaluated as an expression and the result is converted to a boolean. If the result is a number, the result will be converted to true if the number is equal to the context position and will be converted to false otherwise; if the result is not a number, then the result will be converted as if by a call to the boolean() function. Thus a location path para[3] is equivalent to para[position()=3]. In other words, if the context node has four para children, the predicate [3] (or [position()=3]) is true only for the third para child, so the new node-set contains only that child.
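
The conversion rule above can be sketched in Python as follows. The function names are assumptions introduced for this example, a node-set is modelled as a plain list, and to_boolean follows the usual XPath 1.0 rules for boolean() (a number is true unless it is zero or NaN; a string or node-set is true unless it is empty).

```python
import math

def to_boolean(value):
    """boolean() for the basic types; a node-set is modelled as a list."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0 and not math.isnan(value)
    return len(value) > 0            # non-empty string or non-empty node-set

def predicate_holds(result, context_position):
    """Convert the result of a predicate expression to true or false."""
    is_number = isinstance(result, (int, float)) and not isinstance(result, bool)
    if is_number:
        return result == context_position
    return to_boolean(result)

# para[3] applied to four para children: only position 3 is kept.
kept = [position for position in range(1, 5) if predicate_holds(3, position)]
```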

2.5 Abbreviated Syntax

Here are some examples of location paths using abbreviated syntax:

  • para selects the para element children of the context node
  • * selects all children of the context node
  • text() selects all text node children of the context node
  • @name selects the name attribute of the context node
  • @* selects all the attributes of the context node
  • para[1] selects the first para child of the context node
  • para[last()] selects the last para child of the context node
  • */para selects all para grandchildren of the context node
  • /doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
  • chapter//para selects the para element descendants of the chapter element children of the context node
  • //para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
  • //olist/item selects all the item elements in the same document as the context node that have an olist parent
  • . selects the context node
  • .//para selects the para element descendants of the context node
  • .. selects the parent of the context node
  • ../@lang selects the lang attribute of the parent of the context node
  • para[@type="warning"] selects all para children of the context node that have a type attribute with value warning
  • para[@type="warning"][5] selects the fifth para child of the context node that has a type attribute with value warning
  • para[5][@type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
  • chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
  • chapter[title] selects the chapter children of the context node that have one or more title children
  • employee[@secretary and @assistant] selects all the employee children of the context node that have both a secretary attribute and an assistant attribute

The most important abbreviation is that child:: can be omitted from a location step. In effect, child is the default axis. For example, a location path div/para is short for child::div/child::para.

There is also an abbreviation for attributes: attribute:: can be abbreviated to @. For example, a location path para[@type="warning"] is short for child::para[attribute::type="warning"] and so selects para children with a type attribute with value equal to warning.

// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document; div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.

NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.
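
The difference described in this NOTE can be observed with Python's xml.etree.ElementTree, which supports a limited XPath subset. This is only an illustration with a different XPath engine, not the CLaRK System; since ElementTree has no descendant:: axis syntax, the first descendant is obtained with iter() instead.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc><div><para id='a'/><para id='b'/></div>"
    "<div><para id='c'/></div></doc>"
)

# .//para[1]: every para that is the first para child of its parent.
first_of_each_parent = [p.get("id") for p in doc.findall(".//para[1]")]

# Counterpart of /descendant::para[1]: the first para in document order.
first_descendant = next(doc.iter("para")).get("id")
```

The first query returns one para per div, while the second returns a single element.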

A location step of . is short for self::node(). This is particularly useful in conjunction with //. For example, the location path .//para is short for self::node()/descendant-or-self::node()/child::para and so will select all para descendant elements of the context node.

Similarly, a location step of .. is short for parent::node(). For example, ../title is short for parent::node()/child::title and so will select the title children of the parent of the context node.

3 Expressions

3.1 Function Calls

A function call is evaluated by evaluating each of the arguments, converting each argument to the type required by the function, and finally calling the function, passing it the converted arguments. It is an error if the number of arguments is wrong or if an argument cannot be converted to the required type.

An argument is converted to type string as if by calling the string() function. An argument is converted to type number as if by calling the number() function. An argument is converted to type boolean as if by calling the boolean() function. An argument that is not of type node-set cannot be converted to a node-set.

Examples:

child::para[count(child::*) > 3] returns all para child elements of the context node which have more than 3 children. The evaluation proceeds in the following sequence: all child nodes of the context node are collected; the name node test keeps only the para elements; then the predicate is evaluated with each para element in turn as the context node. To evaluate the predicate, the location path child::* is evaluated first. Each time it returns a node-set (in general a different one), which is passed as the argument to the count() function. The function returns a number, which is then used in the comparison that decides the predicate.
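
The same evaluation sequence can be traced step by step in Python. This is a rough sketch with xml.etree.ElementTree, which represents only element children, so text nodes that child::* may also return in this implementation are not counted here.

```python
import xml.etree.ElementTree as ET

context = ET.fromstring(
    "<sec><para><b/><i/><b/><i/></para><para><b/></para>"
    "<note><b/><b/><b/><b/></note></sec>"
)

selected = []
for child in context:                 # child axis
    if child.tag != "para":           # name test para
        continue
    argument = list(child)            # child::* with this para as context node
    if len(argument) > 3:             # count(child::*) > 3
        selected.append(child)
```

Only the first para is selected: the second para has a single child, and the note element fails the name test even though it has four children.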

 

3.2 Node-sets

A location path can be used as an expression. The expression returns the set of nodes selected by the path.

The | operator computes the union of its operands, which must be node-sets.

Example: /descendant-or-self::para | /descendant-or-self::head will return the union of all para and head element nodes in a document. The two location paths are evaluated independently and the two results are then united.

3.3 Booleans

An object of type boolean can have one of two values, true and false.

An or expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if either value is true and false otherwise.

An and expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if both values are true and false otherwise.

An equality (A = B) or a relational (A < B) expression is evaluated by comparing the objects that result from evaluating the two operands. Comparison of the resulting objects is defined in the following three paragraphs. First, comparisons that involve node-sets are defined in terms of comparisons that do not involve node-sets; this is defined uniformly for =, !=, <=, <, >= and >. Second, comparisons that do not involve node-sets are defined for = and !=. Third, comparisons that do not involve node-sets are defined for <=, <, >= and >.

If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true. If one object to be compared is a node-set and the other is a number, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the number to be compared and on the result of converting the string-value of that node to a number using the number() function is true. If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true. If one object to be compared is a node-set and the other is a boolean, then the comparison will be true if and only if the result of performing the comparison on the boolean and on the result of converting the node-set to a boolean using the boolean() function is true.

When neither object to be compared is a node-set and the operator is = or !=, then the objects are compared by converting them to a common type as follows and then comparing them. If at least one object to be compared is a boolean, then each object to be compared is converted to a boolean as if by applying the boolean() function. Otherwise, if at least one object to be compared is a number, then each object to be compared is converted to a number as if by applying the number() function. Otherwise, both objects to be compared are converted to strings as if by applying the string() function. The = comparison will be true if and only if the objects are equal; the != comparison will be true if and only if the objects are not equal. Two booleans are equal if either both are true or both are false. Two strings are equal if and only if they consist of the same sequence of UCS characters.

When neither object to be compared is a node-set and the operator is <=, <, >= or >, then the objects are compared by converting both objects to numbers and comparing the numbers. The < comparison will be true if and only if the first number is less than the second number. The <= comparison will be true if and only if the first number is less than or equal to the second number. The > comparison will be true if and only if the first number is greater than the second number. The >= comparison will be true if and only if the first number is greater than or equal to the second number.
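
The rules for = can be summarised in a short Python sketch. It is a simplification under stated assumptions: a node-set is modelled as a list of the string-values of its nodes, the helper names are introduced only for this example, and to_number relies on Python's float(), which accepts a few spellings (such as exponent notation) that XPath's number() does not.

```python
import math

def to_number(value):
    """Approximation of number(); unparsable strings become NaN."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    try:
        return float(value.strip())
    except ValueError:
        return math.nan

def to_boolean(value):
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return value != 0 and not math.isnan(value)
    return len(value) > 0

def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def equals(a, b):
    """= for two objects that are not node-sets."""
    if isinstance(a, bool) or isinstance(b, bool):
        return to_boolean(a) == to_boolean(b)
    if is_number(a) or is_number(b):
        return to_number(a) == to_number(b)
    return a == b                    # both operands are strings

def nodeset_equals(string_values, other):
    """= for a node-set (list of string-values) and a non-node-set object."""
    if isinstance(other, bool):
        return to_boolean(string_values) == other
    if is_number(other):
        return any(to_number(s) == other for s in string_values)
    return any(s == other for s in string_values)
```

For instance, equals("1.0", 1) is true because both sides are compared as numbers, while equals("1.0", "1") is false because both sides are compared as strings.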

3.4 Numbers

The numeric operators convert their operands to numbers as if by calling the number() function.

The + operator performs addition.

The - operator performs subtraction.

NOTE: Since XML allows - in names, the - operator typically needs to be preceded by whitespace. For example, foo-bar evaluates to a node-set containing the child elements named foo-bar; foo - bar evaluates to the difference of the result of converting the string-value of the first foo child element to a number and the result of converting the string-value of the first bar child to a number.

The div operator performs floating-point division.

The mod operator returns the remainder from a truncating division. For example,

  • 5 mod 2 returns 1
  • 5 mod -2 returns 1
  • -5 mod 2 returns -1
  • -5 mod -2 returns -1
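
For readers who want to reproduce these results, the following Python sketch contrasts a truncating remainder with Python's own % operator, which uses floored division and therefore gives different signs. The function name xpath_mod is an assumption for this example; note also that math.fmod raises an error for a zero divisor, which is not necessarily what an XPath implementation does.

```python
import math

def xpath_mod(a, b):
    """Remainder of a truncating division; the result takes the sign of a."""
    return math.fmod(a, b)

examples = [(5, 2), (5, -2), (-5, 2), (-5, -2)]
truncating = [xpath_mod(a, b) for a, b in examples]
floored = [a % b for a, b in examples]      # Python's floored remainder
```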

4 Variables

One specific extension of this XPath implementation is the support of variables within expressions. The variables can be used for storing temporary results and using them later. Each variable has a name by which it is identified. The support can be split into two parts: variable definitions and variable usage, which are described below.

4.1 Variable Definitions

Variable definitions are part of an XPath expression; they cannot exist alone. They are always attached in front of another expression, within which the corresponding variable values can be used. Once a variable is defined, it can be used ONLY in the expression that follows and/or in the directly following variable definitions (if any). The overall structure of an XPath expression can therefore be described as: variable_definition*, expression. Each variable definition consists of two parts: a variable name and an XPath expression which represents the variable value.

The syntax of one definition is: '{', variable_name, ':=', variable_value, '}'.

The variable name must be a non-empty string that starts with a Latin letter, followed by a sequence of Latin letters and digits (for example a1, aValue, val).

The variable value must be a valid XPath expression. It may use values of variables defined either in preceding variable definitions or in definitions on a higher level of expression, i.e. definitions in an expression which contains the current expression as a component. During evaluation, the context for a variable value expression is the same as the context for the expression which follows the definitions. The value of a variable can be of any standard XPath data type (node-set, number, string or boolean).

4.2 Variable Usage

Variable usage is the same as described in the original XPath specification. A variable value is used within an expression by citing the variable name preceded by the dollar sign ($). Examples: $a1, $aValue, $val. A variable can be used only if it has already been defined; using an undefined variable causes an error.
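
To make the definition syntax concrete, here is a minimal Python sketch that splits leading variable definitions from the expression that follows and checks that every variable is defined before it is used. It is not the CLaRK parser: the sample expression is constructed from the syntax rule above and has not been checked against the system, optional whitespace around := is an assumption, and values containing braces or definitions nested on a higher level are not handled.

```python
import re

# '{', variable_name, ':=', variable_value, '}' with a letter-first name.
DEFINITION = re.compile(r"\s*\{\s*([A-Za-z][A-Za-z0-9]*)\s*:=\s*([^{}]*)\}")
REFERENCE = re.compile(r"\$([A-Za-z][A-Za-z0-9]*)")

def split_definitions(expression):
    """Return ([(name, value), ...], remaining_expression)."""
    definitions = []
    position = 0
    while True:
        match = DEFINITION.match(expression, position)
        if match is None:
            break
        definitions.append((match.group(1), match.group(2).strip()))
        position = match.end()
    return definitions, expression[position:].strip()

def check_usage(definitions, body):
    """Raise NameError if a variable is used before it is defined."""
    defined = set()
    for name, value in definitions:
        for used in REFERENCE.findall(value):
            if used not in defined:
                raise NameError("undefined variable: $" + used)
        defined.add(name)
    for used in REFERENCE.findall(body):
        if used not in defined:
            raise NameError("undefined variable: $" + used)

sample = "{a1 := count(child::para)}{val := $a1 - 1} child::para[position() = $val]"
definitions, body = split_definitions(sample)
check_usage(definitions, body)
```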

5 XPath Macros

XPath macros are used to simplify XPath expressions. They are a means of naming XPath expressions and embedding them within other expressions just by referring to their names. During evaluation, each embedded named expression (macro) is expanded into the corresponding original XPath expression. An XPath macro is embedded in another expression by citing its name, preceded by a percent sign ('%'). If a macro is used but not defined, the system raises an error. A macro can be inserted anywhere in an expression where the corresponding macro expression could be used. Roughly speaking, XPath macros stand for independent sub-expressions within other, more complex expressions.
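
The expansion step can be pictured as a textual substitution, sketched below in Python. Several things here are assumptions rather than documented behaviour: macros are held in a plain dictionary, macro names follow the same letter-and-digit pattern as variable names, expansion is a single pass (macros inside macros are not expanded), and a percent sign inside a string literal is not treated specially.

```python
import re

MACRO_REFERENCE = re.compile(r"%([A-Za-z][A-Za-z0-9]*)")

def expand_macros(expression, macros):
    """Replace each %name with the expression stored under that name."""
    def replace(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError("undefined macro: " + name)
        return macros[name]
    return MACRO_REFERENCE.sub(replace, expression)

# Hypothetical macro table used only for this example.
macros = {"warn": 'attribute::type="warning"'}
expanded = expand_macros("child::para[%warn][position()=1]", macros)
```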

6 Core Function Library

This section describes the functions that the XPath implementation includes in the function library used to evaluate expressions.

Each function in the function library is specified using a function prototype, which gives the return type, the name of the function, and the type of the arguments. If an argument type is followed by a question mark, then the argument is optional; otherwise, the argument is required.

6.1 Node Set Functions

Function: number last()

The last function returns a number equal to the context size from the expression evaluation context.

Function: number position()

The position function returns a number equal to the context position from the expression evaluation context.

Function: number count(node-set)

The count function returns the number of nodes in the argument node-set.

Function: string name(node-set?)

The name function returns a string containing the name of the node in the argument node-set that is first in document order. If the argument node-set is empty or the first node has no name, an empty string is returned. If the argument is omitted, it defaults to a node-set with the context node as its only member.

Function: node-set id(object)

The id function selects elements by their unique IDs. When the argument to id is of type node-set, then the result is the union of the result of applying id to the string-value of each of the nodes in the argument node-set. When the argument to id is of any other type, the argument is converted to a string as if by a call to the string function; the string is split into a whitespace-separated list of tokens; the result is a node-set containing the elements in the same document as the context node that have a unique ID equal to any of the tokens in the list.

  • id("foo") selects the element with unique ID foo
  • id("foo")/child::para[position()=5] selects the fifth para child of the element with unique ID foo

Extended Node Set Functions

Function: node-set set:difference(node-set, node-set)

The set:difference function returns the difference between two node sets - those nodes that are in the node set passed as the first argument that are not in the node set passed as the second argument.

Function: node-set set:intersection(node-set, node-set)

The set:intersection function returns a node set comprising the nodes that are within both the node sets passed as arguments to it.

Function: node-set set:distinct(node-set)

The set:distinct function returns a subset of the argument node-set in which no two nodes are equal, i.e. comparing any two of its nodes with the set:eq function yields false. In other words, it returns a subset of the nodes in which each node has a unique substructure.

Function: boolean set:has-same-node(node-set, node-set)

The set:has-same-node function returns true if the node set passed as the first argument shares any nodes with the node set passed as the second argument. If there are no nodes that are in both node sets, then it returns false.

Function: node-set set:leading(node-set, node-set)

The set:leading function returns the nodes in the node set passed as the first argument that precede, in document order, the first node in the node set passed as the second argument. If the first node in the second node set is not contained in the first node set, then an empty node set is returned. If the second node set is empty, then the first node set is returned.

Function: node-set set:trailing(node-set, node-set)

The set:trailing function returns the nodes in the node set passed as the first argument that follow, in document order, the first node in the node set passed as the second argument. If the first node in the second node set is not contained in the first node set, then an empty node set is returned. If the second node set is empty, then the first node set is returned.
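The set operations described above can be modelled with a small Python sketch. It assumes, purely for illustration, that each node is represented by an integer giving its position in document order and that node sets are lists sorted by that position; the function names are hypothetical and this is not the CLaRK implementation.

```python
def difference(a, b):
    """Nodes of a that are not in b."""
    return [n for n in a if n not in b]


def intersection(a, b):
    """Nodes that are in both a and b."""
    return [n for n in a if n in b]


def has_same_node(a, b):
    """True if a and b share at least one node."""
    return any(n in b for n in a)


def leading(a, b):
    """Nodes of a that precede the first node of b in document order."""
    if not b:
        return list(a)          # empty second set: the first set is returned
    first = min(b)              # first node of b in document order
    if first not in a:
        return []               # that node must itself belong to a
    return [n for n in a if n < first]


def trailing(a, b):
    """Nodes of a that follow the first node of b in document order."""
    if not b:
        return list(a)
    first = min(b)
    if first not in a:
        return []
    return [n for n in a if n > first]


nodes = [1, 2, 3, 4, 5]
print(leading(nodes, [3, 9]), trailing(nodes, [3, 9]))
```

Note that only the first node of the second set matters for leading and trailing; the remaining nodes of that set are ignored.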

Function: string set:generate-id(node-set)

The set:generate-id function returns a unique string identifier for the first member node of the argument. The identifier is based on the position of the node in the tree. The result is an empty string if the argument set is empty or the first node is an attribute node.

Function: boolean set:eq(object, object)

The set:eq function compares two objects which are expected to be ordered node-sets. If an argument is not of type node-set, this function compares the two arguments in the way they are compared by the "=" operator. The nodes in the two sets are compared positionally (first with first, second with second, and so on). Two nodes are equal if:

  • they are of the same type (element, text, attribute);
  • they are text nodes and their content character sequences are the same;
  • they are attribute nodes which have same names and same values;
  • they are element nodes which:
    1. have the same names;
    2. each attribute from the first node has a corresponding equal attribute in the second node and vice versa;
    3. have the same number of children and each two children which have equal positions in their parent nodes are recursively equal;

In all other cases the nodes are different.
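The recursive element comparison listed above can be sketched in Python with the standard xml.etree.ElementTree module. This is an approximation under stated assumptions: ElementTree stores text in the text and tail fields rather than as separate text nodes, its default parser drops comments and processing instructions, and the helper name nodes_equal is hypothetical.

```python
import xml.etree.ElementTree as ET


def nodes_equal(a, b):
    """Structural equality of two elements, following the rules listed above."""
    if a.tag != b.tag:
        return False                      # element names must match
    if a.attrib != b.attrib:
        return False                      # same attributes, in any order
    if (a.text or "") != (b.text or ""):
        return False                      # text before the first child element
    if len(a) != len(b):
        return False                      # same number of child elements
    for child_a, child_b in zip(a, b):
        if (child_a.tail or "") != (child_b.tail or ""):
            return False                  # text following each child element
        if not nodes_equal(child_a, child_b):
            return False                  # children compared position by position
    return True


x = ET.fromstring('<p a="1" b="2">hi<b>there</b>!</p>')
y = ET.fromstring('<p b="2" a="1">hi<b>there</b>!</p>')
z = ET.fromstring('<p a="1" b="2">hi<b>There</b>!</p>')
print(nodes_equal(x, y), nodes_equal(x, z))
```

Attribute order does not affect the result, while any difference in names, text content or child positions does.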

Function: node-set set:order(node-set)

The set:order function receives an arbitrary set of nodes and produces a new set with the same nodes in document order. If the input set contains attributes, their position in document order is the same as that of the element nodes they belong to. Because attributes within an element node have no defined ordering, the relative order in which they appear in the result set is undetermined; they will, however, always appear at the positions where their corresponding element nodes would appear.

Function: node-set set:reverse(node-set)

The set:reverse function receives an arbitrary set of nodes and produces a new set with the same nodes in reverse document order. If the input set contains attributes, their position in document order is the same as that of the element nodes they belong to. Because attributes within an element node have no defined ordering, the relative order in which they appear in the result set is undetermined; they will, however, always appear at the positions where their corresponding element nodes would appear.

Function: node-set set:eval(string)

The set:eval function receives a string as an argument and interprets it as an XPath expression evaluated relative to the current context node. The result of the evaluation is expected to be a node-set; if it is not, an empty node-set is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.

Function: boolean error()

The error function is related to the system DTD validator. It tests whether the context node is contained in the list of errors produced by the validator. This function works properly only for documents opened in the system editor.

Function: string doc-name(node-set?)

The doc-name function returns the name of a document containing a certain node. The returned name is the name of the document within the CLaRK system. If the function is used without an argument, the result will be the name of the document containing the context node. Otherwise, it will be the name of the document of the first node in the argument. If no document name is found, the function returns an empty string.

Function: node-set search(string, string?)

The search function searches the document using a content index for quick access. For this function to work properly, the target document must be indexed in advance; otherwise the function returns an empty node-set. This function can be used in the construction of location paths as a separate location step.

The first argument of the function is the search query. It can be a whole word/token or a partial description containing wildcard symbols. The second argument is optional and names a repository (if it exists) in which the search is to be performed. For more details about the function usage see: Definitions / Document Index.

6.2 String Functions

Function: string string(object?)

The string function converts an object to a string as follows:

  • A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
  • A number is converted to a string as follows
    • NaN is converted to the string NaN
    • positive zero is converted to the string 0
    • negative zero is converted to the string 0
    • positive infinity is converted to the string Infinity
    • negative infinity is converted to the string -Infinity
    • if the number is an integer, the number is represented in decimal form as a number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
    • otherwise, the number is represented in decimal form as a number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative;
  • The boolean false value is converted to the string false. The boolean true value is converted to the string true.
  • An object of a type other than the four basic types will cause an exception (error).

If the argument is omitted, it defaults to a node-set with the context node as its only member.

Function: string concat(string, string, string*)

The concat function returns the concatenation of its arguments.

Function: boolean starts-with(string, string)

The starts-with function returns true if the first argument string starts with the second argument string, and otherwise returns false.

Function: boolean contains(string, string)

The contains function returns true if the first argument string contains the second argument string, and otherwise returns false.

Function: string substring-before(string, string)

The substring-before function returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-before("1999/04/01","/") returns 1999.

Function: string substring-after(string, string)

The substring-after function returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-after("1999/04/01","/") returns 04/01, and substring-after("1999/04/01","19") returns 99/04/01.
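The behaviour of substring-before and substring-after can be mirrored in a few lines of Python. This is an illustrative sketch with hypothetical function names, not the code used by the system.

```python
def substring_before(text, marker):
    """Part of text that precedes the first occurrence of marker, else ''."""
    index = text.find(marker)
    return text[:index] if index >= 0 else ""


def substring_after(text, marker):
    """Part of text that follows the first occurrence of marker, else ''."""
    index = text.find(marker)
    return text[index + len(marker):] if index >= 0 else ""


print(substring_before("1999/04/01", "/"))
print(substring_after("1999/04/01", "/"))
print(substring_after("1999/04/01", "19"))
```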

Function: string substring(string, number, number?)

The substring function returns the substring of the first argument starting at the position specified in the second argument with length specified in the third argument. For example, substring("12345",2,3) returns "234". If the third argument is not specified, it returns the substring starting at the position specified in the second argument and continuing to the end of the string. For example, substring("12345",2) returns "2345".

More precisely, each character in the string is considered to have a numeric position: the position of the first character is 1, the position of the second character is 2 and so on.

The returned substring contains those characters for which the position of the character is greater than or equal to the rounded value of the second argument and, if the third argument is specified, less than the sum of the rounded value of the second argument and the rounded value of the third argument; rounding is done as if by a call to the round function. The following examples illustrate various unusual cases:

  • substring("12345", 1.5, 2.6) returns "234"
  • substring("12345", 0, 3) returns "12"
  • substring("12345", 0 div 0, 3) returns ""
  • substring("12345", 1, 0 div 0) returns ""
  • substring("12345", -42, 1 div 0) returns "12345"
  • substring("12345", -1 div 0, 1 div 0) returns ""
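The position rule above can be checked with a short Python sketch. It is an illustration only: the names xpath_round and xpath_substring are hypothetical, and the floating-point values NAN and INF stand in for the XPath expressions 0 div 0 and 1 div 0.

```python
import math

NAN = float("nan")   # stands in for the XPath expression 0 div 0
INF = float("inf")   # stands in for the XPath expression 1 div 0


def xpath_round(x):
    """Nearest integer, ties toward positive infinity; NaN and infinities pass through."""
    if math.isnan(x) or math.isinf(x):
        return x
    lower = math.floor(x)
    return lower + 1 if x - lower >= 0.5 else lower


def xpath_substring(text, start, length=None):
    """Keep the characters whose 1-based position satisfies the rule above."""
    first = xpath_round(float(start))
    if length is None:
        return "".join(ch for pos, ch in enumerate(text, start=1) if pos >= first)
    end = first + xpath_round(float(length))
    # Any comparison with NaN is false, so NaN bounds select nothing.
    return "".join(ch for pos, ch in enumerate(text, start=1) if first <= pos < end)


print(xpath_substring("12345", 1.5, 2.6))
print(xpath_substring("12345", 0, 3))
print(xpath_substring("12345", -42, INF))
```

Because every comparison involving NaN is false, the cases with 0 div 0, and the case where negative infinity plus positive infinity yields NaN, all produce the empty string without special handling.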

Function: number string-length(string?)

The string-length function returns the number of characters in the string. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.

Function: string normalize-space(string?)

The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are: SPACEs, TABs, new lines. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
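A minimal Python sketch of this normalization, assuming the whitespace characters are exactly space, tab, carriage return and line feed (the function name is hypothetical):

```python
import re


def normalize_space(text):
    """Collapse runs of whitespace to one space and trim both ends."""
    collapsed = re.sub(r"[ \t\r\n]+", " ", text)
    return collapsed.strip(" ")


print(repr(normalize_space("  one \t\n two  ")))
```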

Function: string translate(string, string, string)

The translate function returns the first argument string with occurrences of characters in the second argument string replaced by the character at the corresponding position in the third argument string. For example, translate("bar","abc","ABC") returns the string BAr. If there is a character in the second argument string with no character at a corresponding position in the third argument string (because the second argument string is longer than the third argument string), then occurrences of that character in the first argument string are removed. For example, translate("--aaa--","abc-","ABC") returns "AAA". If a character occurs more than once in the second argument string, then the first occurrence determines the replacement character. If the third argument string is longer than the second argument string, then excess characters are ignored.
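The replacement and removal rules of translate can be traced with the following Python sketch (an illustration with a hypothetical function name, not the system's code). A character mapped to None is removed, and only the first occurrence of a character in the second argument is used.

```python
def xpath_translate(text, source, target):
    """Replace or remove characters of text according to source and target."""
    mapping = {}
    for index, ch in enumerate(source):
        if ch in mapping:
            continue                       # the first occurrence determines the mapping
        # No counterpart in target means the character is removed.
        mapping[ch] = target[index] if index < len(target) else None
    result = []
    for ch in text:
        if ch not in mapping:
            result.append(ch)              # characters not in source are kept
        elif mapping[ch] is not None:
            result.append(mapping[ch])
    return "".join(result)


print(xpath_translate("bar", "abc", "ABC"))
print(xpath_translate("--aaa--", "abc-", "ABC"))
```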

Extended String Functions

Function: string str:align(string, string, string?)

The str:align function aligns a string within another string. The first argument gives the target string to be aligned. The second argument gives the padding string within which it is to be aligned. If the target string is shorter than the padding string then a range of characters in the padding string are replaced with those in the target string. Which characters are replaced depends on the value of the third argument, which gives the type of alignment. It can be one of 'left', 'right' or 'center'. If no third argument is given or if it is not one of these values, then it defaults to left alignment. With left alignment, the range of characters replaced by the target string begins with the first character in the padding string. With right alignment, the range of characters replaced by the target string ends with the last character in the padding string. With center alignment, the range of characters replaced by the target string is in the middle of the padding string, such that either the number of unreplaced characters on either side of the range is the same or there is one less on the left than there is on the right. If the target string is longer than the padding string, then it is truncated to be the same length as the padding string and returned.
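The alignment rules can be sketched in Python as follows. The function name str_align is hypothetical, and the sketch simply follows the description above: an unknown or missing alignment falls back to left, and a target that is too long is truncated.

```python
def str_align(target, padding, alignment="left"):
    """Overlay target on padding at the left, right or center."""
    if len(target) >= len(padding):
        return target[:len(padding)]       # truncated to the padding length
    free = len(padding) - len(target)      # number of unreplaced characters
    if alignment == "right":
        start = free
    elif alignment == "center":
        start = free // 2                  # odd remainder: one less on the left
    else:
        start = 0                          # 'left' and any unrecognised value
    return padding[:start] + target + padding[start + len(target):]


print(str_align("ab", "-----", "left"))
print(str_align("ab", "-----", "right"))
print(str_align("ab", "-----", "center"))
```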

Function: string str:concat(node-set, string?)

The str:concat function takes a node set and returns the concatenation of the string values of the nodes in that node set. If the node set is empty, it returns an empty string. If a second argument (string) is supplied, it serves as a separator between each node's string value from the first argument.

Function: string str:padding(number, string?)

The str:padding function creates a padding string of a certain length. The first argument gives the length of the padding string to be created. The second argument gives a string to be used to create the padding. This string is repeated as many times as is necessary to create a string of the length specified by the first argument; if the string is more than a character long, it may have to be truncated to produce the required length. If no second argument is specified, it defaults to a space (' '). If the second argument is an empty string, str:padding returns an empty string.
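A short Python sketch of the repetition and truncation described above. The function name is hypothetical; returning an empty string for a length of zero or less, and truncating a fractional length to an integer, are assumptions not stated in this manual.

```python
def str_padding(length, chars=" "):
    """Repeat chars until the result is exactly length characters long."""
    length = int(length)
    if not chars or length <= 0:           # non-positive length: assumed to give ''
        return ""
    repeats = -(-length // len(chars))     # ceiling division
    return (chars * repeats)[:length]      # cut the last repetition if needed


print(repr(str_padding(5, "ab")))
print(repr(str_padding(3)))
```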

Function: number str:index-of(string, string, number?)

The str:index-of function returns the starting position of the first occurrence of the second string argument within the first argument. If a third argument is supplied, the search in the first argument starts from the specified position. If the first argument does not contain the second one (after the given position, if specified), the result is -1.

Function: number str:last-index-of(string, string, number?)

The str:last-index-of function returns the starting position of the last occurrence of the second string argument within the first argument. If a third argument is supplied, the function returns the last occurrence before the specified position. If the first argument does not contain the second one, the result is -1.

Function: boolean str:region-matches(string, number, string, number, number, boolean)

Tests whether two string regions are equal. A substring of the first argument, starting at the position given by the second argument, is compared to a substring of the third argument, starting at the position given by the fourth argument; the length of both substrings is given by the fifth argument. The result is true if these substrings represent the same character sequence, ignoring case if and only if the sixth argument is true. The result is false if any of the following is true:

  • the second argument is negative;
  • the fourth argument is negative;
  • the sum of the second argument and the fifth argument is greater than the length of the first argument;
  • the sum of the fourth argument and the fifth argument is greater than the length of the third argument.

Example: str:region-matches("abcdef", 2, "ECDF", 1, 2, true()) --> true
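The checks of str:region-matches can be sketched in Python as below. The function name is hypothetical. The sketch assumes zero-based offsets, which is the reading under which the example above yields true ("cd" compared with "CD"); the manual does not state the offset base explicitly. The bounds checks add the region length to each offset.

```python
def region_matches(first, first_offset, other, other_offset, length, ignore_case):
    """Compare a region of first with a region of other (zero-based offsets assumed)."""
    if first_offset < 0 or other_offset < 0:
        return False
    if first_offset + length > len(first):
        return False                       # region runs past the end of first
    if other_offset + length > len(other):
        return False                       # region runs past the end of other
    region_a = first[first_offset:first_offset + length]
    region_b = other[other_offset:other_offset + length]
    if ignore_case:
        region_a, region_b = region_a.lower(), region_b.lower()
    return region_a == region_b


print(region_matches("abcdef", 2, "ECDF", 1, 2, True))
print(region_matches("abcdef", 2, "ECDF", 1, 2, False))
```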

Function: string str:to-lower-case(string)

Converts all of the characters in the string argument to lower case using the rules of the current environment.

Function: string str:to-upper-case(string)

Converts all of the characters in the string argument to upper case using the rules of the current environment.

Function: boolean ends-with(string, string)

Tests if the first argument ends with the specified suffix as second argument. The result is true if the character sequence represented by the second argument is a suffix of the character sequence represented by the first argument. Otherwise it is false.

Function: number str:count-tokens(string, string)

Returns the number of the tokens in the first string argument. The token boundaries are stated in the second argument. Example: str:count-tokens("one two three", " ") --> 3

Function: string str:token-at(string, string, number)

Returns a token from the first string argument at position specified in the third argument. The second argument represents the separator between the tokens.

Function: string str:normalize(string, string)

Returns the normalized representation of the first string argument. The normalization is performed on the basis of a tokenizer, whose name is supplied as the second argument. If the tokenizer is not primitive, the system refers to its primitive parent tokenizer. If the supplied tokenizer is not present in the system, the original text value is returned. The tokenizer name should be quoted.

Function: number clark:count-tokens(string, string, string?)

Returns the number of the tokens in the first string argument. This function is CLaRK System specific: the second argument is a tokenizer name and the third one is a filter name (optional).

Function: string clark:token-at(string, number, string, string?)

Returns a token from the first string argument at the position specified in the second argument. This function is CLaRK System specific: the third argument is a tokenizer name and the fourth one is a filter name (optional).

Function: string str:eval(string)

The str:eval function receives a string as an argument and interprets it as an XPath expression evaluated relative to the current context node. The result of the evaluation is expected to be a string; if it is not, an empty string is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.

6.3 Boolean Functions

Function: boolean boolean(object)

The boolean function converts its argument to a boolean as follows:

  • a number is true if and only if it is neither positive or negative zero nor NaN
  • a node-set is true if and only if it is non-empty
  • a string is true if and only if its length is non-zero
  • An object of a type other than the four basic types will cause an exception (error).

Function: boolean not(boolean)

The not function returns true if its argument is false, and false otherwise.

Function: boolean true()

The true function returns true.

Function: boolean false()

The false function returns false.

Extended Boolean Functions

Function: boolean test(string)

The test function receives a string as an argument and interprets it as an XPath expression evaluated relative to the current context node. The result of the evaluation is expected to be a boolean value; if it is not, false is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.

Function: boolean error()

The error function tests whether the context node is invalid according to the system validator of the editor. This function is CLaRK specific and, for it to work properly, the document to which it is applied has to be opened (and activated) in the editor. Otherwise, every node is assumed to be valid.

6.4 Number Functions

Function: number number(object?)

The number function converts its argument to a number as follows:

  • a string that consists of optional whitespace followed by an optional minus sign followed by a number followed by whitespace is converted to the number that is nearest to the mathematical value represented by the string; any other string is converted to NaN
  • boolean true is converted to 1; boolean false is converted to 0
  • a node-set is first converted to a string as if by a call to the string function and then converted in the same way as a string argument
  • an object of a type other than the four basic types will cause an exception (error).

If the argument is omitted, it defaults to a node-set with the context node as its only member.
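The string and boolean cases of this conversion can be sketched in Python as follows. The function name is hypothetical and node-sets are not modelled. The pattern assumes the XPath number syntax of digits with an optional fractional part, so forms such as "1e3" or "+1" yield NaN.

```python
import re

# Optional whitespace, optional minus sign, digits with optional fraction, optional whitespace.
NUMBER_PATTERN = re.compile(r"[ \t\r\n]*-?(\d+(\.\d*)?|\.\d+)[ \t\r\n]*")


def xpath_number(value):
    """Convert a boolean, number or string the way the number function is described."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        if NUMBER_PATTERN.fullmatch(value):
            return float(value.strip(" \t\r\n"))
        return float("nan")                # any other string becomes NaN
    raise TypeError("unsupported object type")   # not one of the basic types


print(xpath_number("  -12.5 "), xpath_number(True), xpath_number("1e3"))
```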

Function: number sum(node-set)

The sum function returns the sum, for each node in the argument node-set, of the result of converting the string-values of the node to a number.

Function: number floor(number)

The floor function returns the largest (closest to positive infinity) number that is not greater than the argument and that is an integer.

Function: number ceiling(number)

The ceiling function returns the smallest (closest to negative infinity) number that is not less than the argument and that is an integer.

Function: number round(number)

The round function returns the number that is closest to the argument and that is an integer. If there are two such numbers, then the one that is closest to positive infinity is returned. If the argument is NaN, then NaN is returned. If the argument is positive infinity, then positive infinity is returned. If the argument is negative infinity, then negative infinity is returned. If the argument is positive zero, then positive zero is returned. If the argument is negative zero, then negative zero is returned. If the argument is less than zero, but greater than or equal to -0.5, then negative zero is returned.
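The tie-breaking and zero-handling rules of round differ from those of many programming languages; for example, Python's built-in round(2.5) gives 2. The sketch below, with the hypothetical name xpath_round, follows the description above instead.

```python
import math


def xpath_round(x):
    """Round to the nearest integer; ties go toward positive infinity."""
    if math.isnan(x) or math.isinf(x) or x == 0:
        return x                           # NaN, infinities and both zeros are unchanged
    if -0.5 <= x < 0:
        return -0.0                        # small negative values give negative zero
    lower = math.floor(x)
    return float(lower + 1 if x - lower >= 0.5 else lower)


print(xpath_round(2.5), xpath_round(-2.5), xpath_round(-0.3))
```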

Extended Number Functions

Function: number math:min(node-set)

The math:min function returns the minimum value of the nodes passed as the argument. The minimum value is defined as follows. The node set passed as an argument is sorted in ascending order as it would be by xsl:sort with a data type of number. The minimum is the result of converting the string value of the first node in this sorted list to a number using the number function. If the node set is empty, or if the result of converting the string values of any of the nodes to a number is NaN, then NaN is returned.

Function: number math:max(node-set)

The math:max function returns the maximum value of the nodes passed as the argument. The maximum value is defined as follows. The node set passed as an argument is sorted in descending order as it would be by xsl:sort with a data type of number. The maximum is the result of converting the string value of the first node in this sorted list to a number using the number function. If the node set is empty, or if the result of converting the string values of any of the nodes to a number is NaN, then NaN is returned.

Function: node-set math:highest(node-set)

The math:highest function returns the nodes in the node set whose value is the maximum value for the node set. The maximum value for the node set is the same as the value calculated by math:max. A node has this maximum value if the result of converting its string value to a number as if by the number function is equal to the maximum value, where the equality comparison is defined as a numerical comparison using the = operator. If any of the nodes in the node set has a non-numeric value, the math:max function will return NaN. The definition of numeric comparisons entails that NaN != NaN. Therefore, if any of the nodes in the node set has a non-numeric value, math:highest will return an empty node set.

Function: node-set math:lowest(node-set)

The math:lowest function returns the nodes in the node set whose value is the minimum value for the node set. The minimum value for the node set is the same as the value calculated by math:min. A node has this minimum value if the result of converting its string value to a number as if by the number function is equal to the minimum value, where the equality comparison is defined as a numerical comparison using the = operator. If any of the nodes in the node set has a non-numeric value, the math:min function will return NaN. The definition of numeric comparisons entails that NaN != NaN. Therefore, if any of the nodes in the node set has a non-numeric value, math:lowest will return an empty node set.
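The interplay between math:min, math:max, math:highest and math:lowest, including the NaN cases, can be sketched in Python. For illustration the node set is modelled as a list of string values; the function names and the simplified number parser are assumptions, not the system's code.

```python
import math
import re

NUMBER_PATTERN = re.compile(r"\s*-?(\d+(\.\d*)?|\.\d+)\s*")


def to_number(text):
    """Simplified stand-in for the number function applied to a string value."""
    return float(text) if NUMBER_PATTERN.fullmatch(text) else float("nan")


def math_max(values):
    numbers = [to_number(v) for v in values]
    if not numbers or any(math.isnan(n) for n in numbers):
        return float("nan")                # empty set or any non-numeric value
    return max(numbers)


def math_min(values):
    numbers = [to_number(v) for v in values]
    if not numbers or any(math.isnan(n) for n in numbers):
        return float("nan")
    return min(numbers)


def math_highest(values):
    top = math_max(values)
    # NaN never compares equal, so a NaN maximum selects nothing.
    return [v for v in values if to_number(v) == top]


def math_lowest(values):
    bottom = math_min(values)
    return [v for v in values if to_number(v) == bottom]


print(math_highest(["3", "10", "10.0"]), math_lowest(["3", "x"]))
```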

Function: number math:abs(number)

The math:abs function returns a number containing the absolute value of the number passed as an argument.

Function: number math:sqrt(number)

The math:sqrt function returns the square root of a number. If the argument is a negative number, the return value is zero.

Function: number math:power(number, number)

The math:power function returns the value of a base expression taken to a specified power.

Function: number math:log(number)

The math:log function returns the natural logarithm (base e) of a number.

Function: number math:random()

The math:random function returns a random number from 0 to 1.

Function: number math:sin(number)

The math:sin function returns the sine of the number in radians.

Function: number math:cos(number)

The math:cos function returns the cosine of the number passed as an argument in radians.

Function: number math:tan(number)

The math:tan function returns the tangent of the number passed as an argument in radians.

Function: number math:asin(number)

The math:asin function returns the arcsine value of a number in radians.

Function: number math:acos(number)

The math:acos function returns the arccosine value of a number in radians.

Function: number math:atan(number)

The math:atan function returns the arctangent value of a number in radians.

Function: number math:atan2(number, number)

The math:atan2 function returns the angle (in radians) from the X axis to a point (y,x). The first argument is a number corresponding to y of the point (y,x). The second argument is a number corresponding to x of the point (y,x).

Function: number math:exp(number)

The math:exp function returns e (the base of natural logarithms) raised to a power.

Function: number math:eval(string)

The math:eval function receives a string as an argument and interprets it as an XPath expression evaluated relative to the current context node. The result of the evaluation is expected to be a number; if it is not, NaN (Not-a-Number) is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.

 

7 Data Model

XPath operates on an XML document as a tree. This section describes how XPath models an XML document as a tree. This model is conceptual only and does not mandate any particular implementation.

The tree contains nodes. There are several supported types of node:

  • element nodes
  • text nodes
  • attribute nodes
  • processing instruction nodes
  • comment nodes

For every type of node, there is a way of determining a string-value for a node of that type. For some types of node, the string-value is part of the node; for other types of node, the string-value is computed from the string-value of descendant nodes.

For element nodes, the string-value is a concatenation of the string-values of their child nodes in document order. If an element node has no child nodes then an empty string is returned.

For attribute, comment, text and processing-instruction nodes the string-value is the text content of each of the nodes.
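The string-value of an element node can be illustrated with Python's standard xml.etree.ElementTree module. This is a sketch under assumptions: the helper name string_value is hypothetical, the sample document contains only elements and text, and ElementTree's default parser discards comments and processing instructions, so their treatment is not shown.

```python
import xml.etree.ElementTree as ET


def string_value(element):
    """Concatenate all text inside the element, in document order."""
    return "".join(element.itertext())


doc = ET.fromstring("<p>Hello <b>big</b> world<empty/></p>")
print(repr(string_value(doc)))
print(repr(string_value(doc.find("empty"))))
```

An element without children, such as the empty element in the sample, yields an empty string.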

There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities). Reverse document order is the reverse of document order.


Element nodes have an ordered list of child nodes. Nodes never share children: if one node is not the same node as another node, then none of the children of the one node will be the same node as any of the children of another node. Every node other than the root node has exactly one parent, which is an element node. The descendants of a node are the children of the node and the descendants of the children of the node.

7.1 Element Nodes

There is an element node for every element in the document. An element node has a name.

The children of an element node are the element nodes, comment nodes, processing instruction nodes and text nodes for its content.

7.2 Attribute Nodes

Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.

Elements never share attribute nodes: if one element node is not the same node as another element node, then none of the attribute nodes of the one element node will be the same node as the attribute nodes of another element node.

7.3 Processing Instruction Nodes

There is a processing instruction node for every processing instruction, except for any processing instruction that occurs within the document type declaration.

7.4 Comment Nodes

There is a comment node for every comment, except for any comment that occurs within the document type declaration.

7.5 Text Nodes

Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. A text node always has at least one character of data.

Each character within a CDATA section is treated as character data. Thus, <![CDATA[<]]> in the source document will be treated the same as &lt;. Both will result in a single < character in a text node in the tree. Thus, a CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of < and & were replaced by &lt; and &amp; respectively.

A text node does not have a name.


A References

A.1 Normative References

XML
World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210
XML Names
World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names

A.2 Other References

Character Model
World Wide Web Consortium. Character Model for the World Wide Web. W3C Working Draft. See http://www.w3.org/TR/WD-charmod
DOM
World Wide Web Consortium. Document Object Model (DOM) Level 1 Specification. W3C Recommendation. See http://www.w3.org/TR/REC-DOM-Level-1
TEI
C.M. Sperberg-McQueen, L. Burnard. Guidelines for Electronic Text Encoding and Interchange. See http://etext.virginia.edu/TEI.html.
Unicode
Unicode Consortium. The Unicode Standard. See http://www.unicode.org/unicode/standard/standard.html.