Menu File
This menu contains items for managing documents in the CLaRK System. The system works with well-formed XML documents, which fall into two kinds. External documents reside outside the system and have to be imported before the system can work with them. They can have one of the following character encodings: ASCII, UNICODE (UTF-16) or UNICODE (Microsoft version of UTF-16). Internal documents have already been imported and are stored in an internal representation. A document processed inside the system has to be connected with a document type definition (DTD), and this DTD has to be compiled into an internal representation before it is assigned to the document. The document loaded in the active view of the system is called the current document.
- Document Manager
This component is responsible for managing the internal documents in the CLaRK System. It keeps a record of all documents saved in the system. In addition, documents can be organized into document groups, which are arranged in a tree similar to a directory structure. Each document group contains zero or more internal documents and zero or more sub-groups. The main group is called "Root" and cannot be renamed or deleted by the user. It includes one special document group, named "SYSTEM", which is expected to contain documents for special purposes (help documents for the constraints, documents containing XSL Transformations and others). The "SYSTEM" group cannot be modified either. The possible operations on a group are: creating a new group; removing a group with its content; adding document(s) which are saved in the system but not yet present in the group. One document may be included in more than one group. In this way different views of the document database can be created. Removing a document from a group does not remove it from the database; it is only excluded from the group. The Document Manager also provides a list of all internal documents regardless of the groups they belong to. Here is an example picture of the Document Manager window:
The manager window contains two main parts:
- The panel on the left. It contains the tree representation of the group hierarchy. As it can be seen above, it consists of 5 groups (2 system and 3 user defined). When the user selects a node in this tree, content of the corresponding group is loaded in the component on the right side.
- Current group monitor. This is the panel on the right side of the window. It lists the content of the group currently selected in the tree. The entries shown in blue are sub-groups. The other entries are the documents included in this group. They are shown in black or red, depending on whether or not they are valid according to their DTDs. There are three additional buttons which can be used for modifying the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to supply a name for it;
- Remove - removes the selected entries from the list. The removal is preceded by a confirm message. If the selection includes sub-groups, they are removed with their entire contents.
- Add Document - gives a list of all internal documents which are not present in the current group. The user is expected to choose one or more documents to be included in the current group.
- Hard Remove!* - this button is visible only when the Document Manager is started from the menu File/Document Manager. It removes documents from the internal document database. It can be applied only to documents, not to whole groups; any groups included in the selection are ignored. The removal of documents is preceded by a confirmation message. The removed documents are also excluded from all groups they belong to.
When the document management is done, the user can update or cancel the changes made over the document database structure by using the two buttons on the bottom of the window.
Navigation in the group structure is also possible in the panel on the right. To see the content of a sub-group of the current group, the user just has to double-click on the desired sub-group. This changes the current group to the selected one, that is, it moves from a group to a sub-group. Movement in the other direction is also possible. Each document group (except the Root) includes a special sub-group named ". .". Double-clicking on it changes the current group to its parent group, as in most file systems.
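The difference between removing a document from a group and removing it from the database can be pictured with a small sketch. This is not the CLaRK implementation: the class `DocumentStore` and all of its method names are hypothetical, and group nesting is left out to keep the example short.

```python
class DocumentStore:
    """Hypothetical model of internal documents and document groups."""

    def __init__(self):
        self.documents = set()          # all internal documents
        self.groups = {"Root": set()}   # group name -> member document names

    def add_document(self, name):
        self.documents.add(name)

    def new_group(self, group):
        self.groups.setdefault(group, set())

    def add_to_group(self, group, name):
        if name not in self.documents:
            raise KeyError(f"unknown document: {name}")
        self.groups[group].add(name)

    def remove_from_group(self, group, name):
        # The document stays in the database; only the membership is dropped.
        self.groups[group].discard(name)

    def hard_remove(self, name):
        # The document leaves the database and every group it belonged to.
        self.documents.discard(name)
        for members in self.groups.values():
            members.discard(name)


store = DocumentStore()
store.add_document("report")
store.new_group("Drafts")
store.new_group("Reviewed")
store.add_to_group("Drafts", "report")
store.add_to_group("Reviewed", "report")    # one document, two groups

store.remove_from_group("Drafts", "report")
still_saved = "report" in store.documents   # removal from a group keeps the document

store.hard_remove("report")                 # now it is gone everywhere
```

Keeping group membership separate from the document set is what allows one document to appear in several groups at once.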
- New
A shortcut Ctrl+N
An icon on the toolbar
Choosing this item, the user creates a new empty document. First, a menu for choosing a DTD appears. The user can create a document with respect to one of the DTDs compiled into the system. If the appropriate DTD is not compiled, the user has to close the dialog and compile the required DTD. After the DTD is chosen, a new document containing only the root element with empty content is created. In most cases this document will not be valid.
- Open
A shortcut Ctrl+O
An icon on the toolbar
Choosing this item, the user opens document(s) saved in the internal document database of the system. A Document Manager window is opened on the screen.
The user has to select a document and click the Open button. Then the corresponding document is loaded into the system. The internal documents are stored together with information about their DTD. If their DTD is compiled in the system, they are opened with respect to it. If the DTD has been deleted (for some reason) from the DTD database of the system, the user is asked to choose a DTD from the available ones.
The user can choose document(s) to open in two modes:
- selecting one or more documents from a list of all internal documents. It is an option provided by the Document Manager;
- the internal documents are arranged in groups in a hierarchy. The user can select document(s) from the current group.
If the Cancel button is chosen the manager window is closed and no document is loaded into the system.
The user can also open a document in the manager by performing a double click on a certain document name either in the list of all documents or in the list of the current group.
Note that the Open menu item does not open text files. To load an XML document from a text file, choose the Import XML item.
- Save
A shortcut Ctrl+S
An icon on the toolbar
Choosing this item, the user saves the current document in the internal document database of the system. The user has to give a name for the document in a dialog box. Optionally the user can specify a document group in which the document should be included. Then the system tries to save the document. If a document with the specified name already exists, the user is asked to confirm the overwriting of the document or to choose a new name.
- Delete
Choosing this item the user deletes document(s), saved in the internal document database of the system. A Document Manager window is opened on the screen.
The user has to select a document (or a set of documents) and click the Remove button. Then a confirmation dialog appears. If the user clicks the Yes button, the chosen documents are permanently deleted from the database. If the user clicks the No button, the documents are not deleted.
As with the "Open" function, the user can choose documents in two modes: from all documents or from a certain group. When documents are removed from the list of all documents, the groups they belong to are also updated.
The user can also delete a document in the manager by performing a double click on a certain document name either in the list of all documents or in the list of the current group.
- Close
A shortcut Ctrl+F4
Choosing this item the user closes the current view of the current document and removes it from the memory.
If this is the only view of the current document and the document has been edited, then the user is asked whether he/she wants to save the changes.
When the view is closed and at least one other view is loaded in the system, the previously active view becomes the current view. If there are no more documents, the system blank view appears.
- Import XML
An icon on the toolbar
Choosing this item, the user loads a text file containing an XML document into the system. A standard dialog for choosing a file appears. The user has to specify the name of the file to load and its character encoding in the Files of type: choice box. The possible encodings for files are: ASCII, UNICODE (UTF-16) or UNICODE (Microsoft version of UTF-16). Then the user is asked to choose a DTD from the DTD database of the system. If the document contains a DTD, this DTD is ignored during the parsing.
Note: If a document is opened in ASCII format (and the file is known to be an ASCII file) but the system displays symbols other than the expected ones, there are two possible causes:
- inappropriate system font (for details see Options/Fonts);
- inappropriate character encoding converter from ASCII to Unicode (for details see Options/Encoding Correction).
The system tries to parse the document. If the document is well-formed, it is parsed into the internal representation. The system creates a view of the document and validates it. If the document is not valid with respect to the chosen DTD, a list of errors is printed in the error window. A description of all validation errors can be found in the Validation error messages section.
If the document is not well-formed, the system reports an error. A detailed description of all parsing errors can be found in the Parse error messages section. Errors of this kind have to be repaired with an editor outside the system. The error message reported by the CLaRK System can help the user to find the error in the document.
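The distinction between a well-formed and a malformed document can be tried outside the system before importing. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` parser; the function name `check_well_formed` is hypothetical, and the error text it returns comes from the Python parser, not from CLaRK. It checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, parser message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Every opening tag has a matching closing tag.
ok, _ = check_well_formed("<books><book>Title</book></books>")

# The closing </book> tag is missing, so the parser rejects the document.
bad, message = check_well_formed("<books><book>Title</books>")
```

The message returned for the second document includes a line and column position, which plays the same role as the CLaRK parse error: it points at the place to repair in an external editor.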
- ReImport
A shortcut Ctrl+R
When the user tries to load a document which is not well-formed, the CLaRK System reports an error and the user has to edit the document outside the system. Afterwards the user will usually want to load the same document again. To avoid repeating the whole file selection process, the user can use the ReImport item to load the last chosen document.
The system tries to parse the document. If the document is well-formed, it is parsed into the internal representation. The system creates a view of the document and validates it. If the document is not valid with respect to the chosen DTD, a list of errors is printed in the error window. A description of all validation errors can be found in the Validation error messages section.
If the document is not well-formed, the system reports an error. A detailed description of all parsing errors can be found in the Parse error messages section. Errors of this kind have to be repaired with an editor outside the system. The error message reported by the CLaRK System can help the user to find the error in the document.
- Export XML
An icon on the toolbar
Choosing this item, the user can save the current document into a text file outside the system. A standard dialog for choosing a file appears. The user has to specify the name of the file which will contain the document and the character encoding of the exported text in the Files of type: choice box. The possible encodings for files are: ASCII, UNICODE (UTF-16) or UNICODE (Microsoft version of UTF-16). If the file already exists, the user is asked whether he/she wants to overwrite it.
Note: If a document is exported in ASCII format but the content of the output file is not readable for other programs or even for the CLaRK System, then the problem can be in an inappropriate character encoding converter from Unicode to ASCII (for details see Options/Encoding Correction).
The text in the file is formatted according to the layout specified for the DTD of the document.
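Why the choice of encoding matters on export can be illustrated with a short sketch. It uses Python's built-in codecs, not the CLaRK converters: treating the "Microsoft version of UTF-16" as little-endian UTF-16 is an assumption, and the CLaRK ASCII option relies on a configurable converter (Options/Encoding Correction) rather than on the strict ASCII codec shown here.

```python
# A document fragment with one non-ASCII character (e with acute accent).
text = "<title>Caf\u00e9</title>"

# Both UTF-16 byte orders can represent every character of the text.
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

# Strict ASCII cannot represent the accented character.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Decoding with the same encoding that was used for writing restores the text.
round_trip = utf16_le.decode("utf-16-le")
```

Reading a file with a different encoding or byte order than the one used for writing it is a typical reason for unreadable output.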
- Multi-Import
Choosing this item, the user can load several text files containing XML documents into the system. A standard dialog for choosing a file appears. The user has to choose one or more files from a directory. The files have to contain XML documents in the same character encoding (ASCII, UNICODE (UTF-16) or UNICODE (Microsoft version of UTF-16)). The character encoding can be stated in the Files of type: choice box.
Note: If a document is opened in ASCII format (and the file is known to be an ASCII file) but the system displays symbols other than the expected ones, there are two possible causes:
- inappropriate system font (for details see Options/Fonts);
- inappropriate character encoding converter from ASCII to Unicode (for details see Options/Encoding Correction).
After choosing the files to be imported, a Multi-Import Manager dialog appears. Here is a screen shot of one example dialog window configuration:
The manager window contains the following components:
- Default DTD Chooser - this component selects a DTD for the documents selected for importing in the system. This chooser is default, because the user can specify a DTD for each document separately and if for a document the DTD is not specified, the system takes the DTD from here. The initial selected value in this chooser is the last DTD used in the system.
- Group Chooser - This is an optional chooser. It lets the user place the imported documents in an existing document group of the internal document database. By default, the selection is set to the last group used in the system. If nothing is selected in this chooser, the documents are saved but not included in any group. This item is disabled if the item Save Documents is not selected.
- Use directory "valid" Checkbox - This item, if selected, moves all files which are valid according to the corresponding DTD to a sub-directory "valid" of the current directory (the directory from where the imported documents are taken). If such a directory does not exist, the user is prompted to confirm its creation. If the user cancels, this option is disabled.
- Use directory "well-formed" Checkbox - This item, if selected, moves all files which are well-formed but not valid according to the corresponding DTD to a sub-directory "well-formed" of the current directory (the directory from where the imported documents are taken). If such a directory does not exist, the user is prompted to confirm its creation. If the user cancels, this option is disabled.
- Documents Preview Table - this component shows all documents which will be imported in the system. It is a table with three columns:
- File Name
This is a list of all files to be imported with their full directory path. This column is not editable.
- DTD Name
This column assigns a DTD to each document to be imported. By clicking on each cell, a list of all DTDs in the system appears. If for a document there is no DTD selected, the system takes the default selected DTD.
- Document Name
This column determines the internal name for each document, under which it will be saved in the system. This column is editable and the user can select an arbitrary name for each document. If a name for a document ends with a number in brackets, this means that the selected name is already used for a document.
- Start Button - Starts to import the selected documents one by one. While importing, the status bar indicates which document is being currently processed. When the operation is complete a result message window appears with detailed information about each imported document: whether it is valid, well-formed and so on.
- Cancel Button - Cancels the operation of multi-importing documents.
- Save Documents CheckBox - This checkbox determines whether or not the imported documents will be saved in the system. Not saving the documents is useful for checking the validity and the well-formedness of a set of documents. By default this item is selected.
If a document is not well-formed, the system reports an error. A detailed description of all parsing errors can be found in the Parse error messages section. Errors of this kind have to be repaired with an editor outside the system. The error message reported by the CLaRK System can help the user to find the error in the document.
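The Document Name column marks names that are already taken with a number in brackets. A possible way to generate such names is sketched below. The function `unique_name` is hypothetical, and the exact format of the suffix (round brackets, no space, counting from 1) is an assumption; the system may format it differently.

```python
def unique_name(name, used):
    """Return name unchanged, or with a bracketed counter if it is already used."""
    if name not in used:
        return name
    counter = 1
    while f"{name}({counter})" in used:
        counter += 1
    return f"{name}({counter})"


# Names already present in the internal document database (example data).
used = {"chapter", "chapter(1)"}

first = unique_name("intro", used)      # free name, kept as it is
second = unique_name("chapter", used)   # taken twice, next free counter is used
```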
- Multi-Export
This item allows the user to export a set of internal documents: whole group or/and single documents. The user selects an output directory and encoding for all selected documents. Here is a screen-shot of the Multi-Export Manager window:
The manager window consists of:
- Output Directory section
This section contains two parts: a Directory chooser and an Encoding chooser. The Directory chooser shows the output directory where the documents will be stored as XML files. The file names are generated automatically from the internal document names. If an internal name contains symbols which are not allowed in a file name, they are replaced by an underscore symbol ('_'). The output directory can be changed with the "Change Directory" button, which opens a standard file chooser for selecting a new output directory. The Encoding chooser sets the output character encoding. The options are: ASCII Text File, UNICODE Text File and UNICODE Text File (MS Word).
- Internal Documents chooser
This component is used for selecting the documents to be exported. It is the standard CLaRK System Document Selector, which is described in detail below. This selector is used not only in this part of the system, but in all places where a multiple selection of documents is needed. Having selected the documents for exporting, the user must use one of the following buttons:
- Export button
Starts exporting the selected documents one by one into the selected directory with the selected character encoding. The status bar of the system shows which file is currently being processed. After the operation completes, the user is shown an information message about the result of exporting the documents.
Note: If a document is exported in ASCII format but the content of the output file is not readable for other programs or even for the CLaRK System, then the problem can be in an inappropriate character encoding converter from Unicode to ASCII (for details see Options/Encoding Correction).
- Cancel button
Cancels the whole operation.
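The automatic generation of file names from internal document names can be sketched as follows. The set of characters treated as not allowed and the ".xml" extension are assumptions made for this illustration; the manual states only that disallowed symbols are replaced by an underscore. The function name `to_file_name` is hypothetical.

```python
import re

# Assumed set of characters that are not allowed in file names.
FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def to_file_name(document_name, extension=".xml"):
    """Replace each disallowed character with '_' and append the extension."""
    return FORBIDDEN.sub("_", document_name) + extension


name = to_file_name("corpus/part 1: draft?")
```

With this character set, the slash, the colon and the question mark are each replaced, giving `corpus_part 1_ draft_.xml`. Two different internal names can map to the same file name, so an exporter would also need a rule for such collisions.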
- Document Selector
The Document Selector is a universal component used in almost all system tools which can work with more than one document. The selector lets the user specify the documents from the internal document database to which a particular tool will be applied. It consists of two sub-components: a selection list and a selection dialog window. The selection list shows the currently selected documents. Initially this list is empty. Here follows a picture of the selection list. Usually it is embedded in another dialog window.
Three operations can be applied to this list: adding new list entries, removing selected list entries, clearing the whole list. These operations can be applied by using the buttons on the right:
- Add Documents - Opens a dialog window for selecting documents, whose usage is described below. When new documents are selected, they are added to the list on the left. If a selected document is already present in the list, it is not added again.
- Remove Documents - Removes the selected items in the list on the left. If no selection is made, nothing is done.
- Clear All - Removes all the entries from the list on the left. If the list is empty, nothing is done.
The second sub-component of the Document Selector is a dialog window for adding entries to the Selected Documents list. This dialog window is visualized when the button Add Documents is pressed. Here is a preview of the dialog:
It supports three modes of document selection:
- Selecting a whole document group
In order to select a whole group, the user must do two things. The first is to select a group from the choice list "Document Group". It contains all groups in the system. The initially selected group is the last group used in the system. Changing the selection in this list updates the list on the left with the content of the newly selected group. The second is to select the radio button Add Whole Group, situated in the bottom right corner (visible in the picture above). This selects all entries of the list on the left, and the selection cannot be changed.
- Selecting single documents from a document group
Here again the user must select a group from the "Document Group" list and then switch the radio button Single Documents on. Changing the selected group updates the list on the left with the content of the new group. The next step is to manually select the documents to be added from the list.
- Selecting single documents from a list of all documents
To select items from the list of all documents, the user must select the option All Documents at the top of the window. The default selection here is always In Groups. Having selected "All Documents", the user is shown a list of all documents, from which single documents can be selected regardless of the groups they belong to.
After the selection of documents is completed, the user must confirm it by using button OK or reject with button Cancel. In case of confirmation, the new selected documents are added to the Selected Documents list.
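The behaviour of the selection list (no duplicate entries, removal of selected entries, clearing) can be summarized in a small sketch. The class `SelectionList` and its method names are hypothetical and only mirror the three buttons described above; this is not the CLaRK code.

```python
class SelectionList:
    """Hypothetical model of the Selected Documents list."""

    def __init__(self):
        self.entries = []   # initially the list is empty

    def add_documents(self, names):
        # A document already present in the list is not added again.
        for name in names:
            if name not in self.entries:
                self.entries.append(name)

    def remove_documents(self, names):
        # With an empty selection nothing is removed.
        self.entries = [entry for entry in self.entries if entry not in names]

    def clear_all(self):
        self.entries.clear()


selection = SelectionList()
selection.add_documents(["a", "b"])
selection.add_documents(["b", "c"])       # "b" is skipped the second time
after_add = list(selection.entries)

selection.remove_documents([])            # no selection, nothing happens
selection.remove_documents(["a"])
after_remove = list(selection.entries)

selection.clear_all()
```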
- Exit
Choosing this item the user closes all opened files and exits the system. If there are unsaved documents, the user is asked whether to save the changes or not.
Menu Edit
- Undo
A shortcut Ctrl+Z
By choosing this item, the user restores the current document to its status before the application of the last operation on it. The system supports up to 3 steps of undo. Note that this operation does not hold for the plain text, but only for the structural representation, i.e. it restores the changes in the tree structure of the document. Sometimes the Undo operation may need a little time, especially on large documents.
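A bounded undo history of this kind can be sketched with a fixed-length queue. The class `UndoHistory` is hypothetical, and storing whole document states as snapshots is an assumption made for brevity; the manual does not say how the system records the changes.

```python
from collections import deque


class UndoHistory:
    """Keeps at most `limit` earlier states; older ones are dropped."""

    def __init__(self, limit=3):
        self._snapshots = deque(maxlen=limit)

    def record(self, state):
        # Called with the state as it was before an operation is applied.
        self._snapshots.append(state)

    def undo(self, current):
        # With no stored snapshot, the current state is returned unchanged.
        return self._snapshots.pop() if self._snapshots else current


history = UndoHistory()
state = "v0"
for new_state in ["v1", "v2", "v3", "v4"]:
    history.record(state)
    state = new_state

# Four operations were applied, but only three steps can be undone.
restored = []
for _ in range(4):
    state = history.undo(state)
    restored.append(state)
```

After the loop `restored` holds `v3`, `v2`, `v1` and again `v1`: the fourth undo has nothing left to restore, because the snapshot of `v0` was dropped when the limit was reached.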
- Search
A shortcut Ctrl+F
The edit field and the icon on the toolbar provide the same functionality.
With this tool the user can search for nodes in the current active document using the implemented XPath engine. The XPath expression is evaluated with the node currently selected in the tree view as the context node. After evaluating the XPath expression, the system shows the result. The first node that matches the query is marked in both areas, the tree and the text. The other nodes in the result list are saved so that they can be selected when the user chooses Next and Previous. Help information about the XPath axes is available by clicking on the "Axes Info" menu and then selecting the name of an axis.
- Next
A shortcut F3
An icon on the toolbar
When choosing this item in the current view, the user moves the focus to the next element from the list of the elements, found by the last XPath search operation. In cases when no search operation was performed or the previous search result was unsuccessful, an error notification message is shown.
- Previous
A shortcut Shift+F3
An icon on the toolbar
When choosing this item in the current view, the user moves the focus to the previous element from the list of the elements, found by the last XPath search operation. In cases when no search operation was performed or the previous search result was unsuccessful, an error notification message is shown.
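The interplay of Search, Next and Previous can be sketched with the limited XPath support of Python's standard `xml.etree.ElementTree`. The class `SearchResults` is hypothetical. Staying on the first or last result when stepping past the ends is an assumption, since the manual does not describe that case, and the raised `LookupError` merely stands in for the error notification message.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<books>"
    "<book><title>A</title><price>10</price></book>"
    "<book><title>B</title><price>12</price></book>"
    "</books>"
)


class SearchResults:
    """Result list of one search, evaluated relative to a context node."""

    def __init__(self, context, xpath):
        self.nodes = context.findall(xpath)
        self.position = 0

    def current(self):
        if not self.nodes:
            raise LookupError("no search results")
        return self.nodes[self.position]

    def next(self):
        self.current()   # raises when there is nothing to step through
        self.position = min(self.position + 1, len(self.nodes) - 1)
        return self.nodes[self.position]

    def previous(self):
        self.current()
        self.position = max(self.position - 1, 0)
        return self.nodes[self.position]


# Search from the root: both titles are found.
results = SearchResults(doc, "book/title")
first = results.current().text
second = results.next().text
back = results.previous().text

# The same kind of query with another context node gives a local result.
first_book = doc.find("book")
local = SearchResults(first_book, "price")
```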
- Text Replace
A shortcut Ctrl+H
An icon on the text area frame toolbar
This tool uses regular expressions to search for and replace text in the document. It contains four fields.
The first field is the search field. Here the user defines a regular expression to match parts of the data in the text nodes. All text nodes are tokenized using the tokenizer specified in the DTD. The user can turn on the normalize option by clicking the Normalize button. The tool then executes the user-defined regular expression on top of the resulting tokens. In the regular expression the user can use token values and token categories in the same way as in grammars. All regular expressions are saved in the system memory and can be selected from the table at the bottom of the dialog.
The second field is the replace field. Here the user enters plain text which replaces the matched data. The field can be left empty; in this case the matched data is erased from the text.
The third field is the restriction field. Here the user specifies an XPath expression that restricts the text nodes to which the regular expression is applied. It is evaluated as a predicate. The tool uses only the text nodes that have a non-zero result after evaluation.
Buttons:
- Next - finds the next data that matches the regular expression.
- Previous - finds the previous data in the document that matches the regular expression.
- Replace - replaces the selected data with the text in the replace field.
- Replace All - replaces all matches in the document with the text in the replace field.
- Normalize - the tool uses the normalize option when it tokenizes the text nodes.
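The combination of a search expression, a replacement text and a restriction can be approximated outside the system as follows. This sketch simplifies in several ways that should be kept in mind: it applies an ordinary character-level regular expression instead of the CLaRK token-based one, it uses the restriction to select elements whose text is processed rather than evaluating it as a predicate on text nodes, and it ignores text that follows child elements. The function name `replace_in_text_nodes` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def replace_in_text_nodes(root, restriction, pattern, replacement):
    """Replace pattern in the text of elements selected by restriction.

    Returns the number of replacements made.
    """
    regex = re.compile(pattern)
    count = 0
    for element in root.findall(restriction):
        if element.text:
            element.text, made = regex.subn(replacement, element.text)
            count += made
    return count


doc = ET.fromstring(
    "<books><book><title>Old  title</title><isbn>Old 1</isbn></book></books>"
)

# Only title elements are touched; the isbn text is left alone.
changed = replace_in_text_nodes(doc, ".//title", r"Old\s+", "New ")
title = doc.find(".//title").text
isbn = doc.find(".//isbn").text
```

Passing an empty replacement string erases the matched data, as described for the replace field.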
Menu DTD
- Compile DTD
A shortcut Ctrl+L
By choosing this item, the user compiles a DTD (Document Type Definition). First a standard file chooser appears. The user is expected to select the file where the DTD is stored. The system supports three character encodings: ASCII, Unicode UTF-16BE and Unicode UTF-16LE (Microsoft version of UTF-16).
Once an input file has been chosen, the parsing begins. If the DOCTYPE is not declared, the user has to choose the root element.
If an error occurs while compiling (parsing) the DTD, an error message appears.
If everything is correct, a message reporting successful compilation appears. The name of the DTD is the name of the root element (the DOCTYPE). The DTD is added to the list of all DTDs known to the system. If a compiled DTD with the same name already exists in the system, an additional index is appended to the end of the new DTD name.
- Remove DTD
By choosing this item, the user can remove a DTD from the list of all DTDs known to the system.
If no saved documents in the system refer to this DTD, a message reporting successful removal appears. Otherwise, an error message appears, warning the user that the removal cannot be done.
By pressing the Details button the user can see the documents which rely on the selected DTD. These documents prevent the DTD removal. There are two possible solutions to this problem: either the DTDs of all these documents are changed, or the documents themselves are removed.
In this way no documents will refer to the DTD in question and hence, the removal will be successful.
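The removal rule can be expressed as a simple check. The function `remove_dtd` and the two data structures are hypothetical and serve only to illustrate the logic; the returned list corresponds to what the Details button shows.

```python
def remove_dtd(dtd_name, compiled_dtds, document_dtds):
    """Remove dtd_name unless saved documents still refer to it.

    Returns the names of the blocking documents (empty on success).
    """
    blocking = sorted(
        doc for doc, dtd in document_dtds.items() if dtd == dtd_name
    )
    if not blocking:
        compiled_dtds.discard(dtd_name)
    return blocking


dtds = {"books", "articles"}
docs = {"catalogue": "books", "draft": "books"}   # document -> its DTD

# Two documents rely on the DTD, so the removal is refused.
blocked_by = remove_dtd("books", dtds, docs)
still_there = "books" in dtds

# Solution: change the DTD of one document and remove the other one.
docs["catalogue"] = "articles"
del docs["draft"]
blocked_after = remove_dtd("books", dtds, docs)
```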
- View DTD
This is an information dialog showing the content of a DTD (Document Type Definition) already compiled in the system. (For more information about DTDs, see http://www.w3.org/TR/1998/REC-xml-19980210#dt-valid).
The information is divided into 4 sections representing different parts of the DTD (structure data, attributes data, entities data and processing-instructions data). These four parts are contained in a tabbed pane, and the user can switch between them by clicking on the tabs.
The viewer is demonstrated by the following simple example of a DTD:
<!DOCTYPE books [
<!ELEMENT books (book)*>
<!ELEMENT authors (author)+>
<!ELEMENT book (#PCDATA, title, authors, publisher, (pages)?, isbn, (price)+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT pages (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price
currency CDATA #REQUIRED
prev CDATA #IMPLIED
id CDATA #IMPLIED
n CDATA #IMPLIED
lang CDATA #IMPLIED>
<?CLaRK member(Gram,AllAn,[;]) ?>
]>
Structure data section:
The structure, defined by the DTD, is represented as a table (see above). Each table row contains structural data for one element. The name of the element is in the first cell. The second cell contains the definition of the element as a regular expression. The rows are sorted by lexicographical order of the element names.
Note: #PCDATA means a plain text string, excluding symbols like '<','>'.
In the picture above, the definition of the author element says that it must contain only text data (no other elements). The structure of the book element must be: text data, title element, authors element, publisher element, pages element (optional), isbn element, price element (one or more). The ordering is important.
When the declaration of an element is too long or rather complicated, sometimes it is helpful for the user to see the declaration separately from the other element declarations. It becomes possible by clicking with the right mouse button on the row of the desired element and then pressing the View button, when it appears.
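The structure table (element name, content definition, rows sorted by element name) can be approximated with a short script. This is a rough sketch that reads the element declarations with a regular expression; it is not a DTD parser and not the way CLaRK compiles DTDs. The names `ELEMENT_DECL` and `structure_table` are hypothetical.

```python
import re

# A few element declarations taken from the example DTD.
DTD_TEXT = """
<!ELEMENT books (book)*>
<!ELEMENT authors (author)+>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
"""

# Captures the element name and its content definition.
ELEMENT_DECL = re.compile(r"<!ELEMENT\s+(\S+)\s+(.*?)\s*>", re.DOTALL)


def structure_table(dtd_text):
    """Return (element name, definition) rows in lexicographical order."""
    return sorted(ELEMENT_DECL.findall(dtd_text))


rows = structure_table(DTD_TEXT)
for name, definition in rows:
    print(f"{name:10} {definition}")
```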
Element attributes data section:
In order to see the attributes of a given element, the user has to choose the element from the drop-down menu at the top of the window. This menu contains all elements declared in the DTD. If after choosing an element nothing appears in the table, it means that there are no attributes declared for this element. Each time an element is chosen, the content of the table is updated. In the picture above, the element price has been chosen and the table contains its attributes according to the DTD.
Each row represents one element's attribute only. The name of the attribute is in the first column. The type of the attribute's value is in the second one. The third one contains meta-information about the attribute (required, implied, fixed, ...).
Entities data section:
This section gives information about the entities defined in the DTD. Entities can be used as escape alternatives for symbols which are not allowed in the text. In the picture above, there is an entity lt which substitutes the symbol '<' in the text. Without it, the XML parser assumes that a new tag is starting when it meets the symbol '<' in the text, and if that is not the case, further processing fails. Therefore, in this case the symbol '<' has to be substituted by &lt;, which is interpreted as the symbol '<' and not as the start of a tag. The format of an entity is: &xxx;, where xxx is the name of the entity. All entity names appear in the first column of the table above. Opposite each entity name stands its corresponding text string (the text which the entity substitutes).
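The substitution of special symbols by entities can be tried with Python's standard `xml.sax.saxutils` module. The sketch covers only the entities handled by that module by default (for the symbols '&', '<' and '>'), not entities declared in a particular DTD.

```python
from xml.sax.saxutils import escape, unescape

raw = "price < 10 & in stock"

# '<' and '&' are replaced by entities so the text is safe inside XML.
escaped = escape(raw)

# The reverse step gives the original text back.
restored = unescape(escaped)
```

Here `escaped` is `price &lt; 10 &amp; in stock`, which a parser reads as plain text and not as the start of a tag.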
Processing-instructions section:
Processing instructions carry information for the text processor rather than about the text content itself, because the processor has to know how to interpret them.
- Save DTD Database
This item saves the system information about DTDs, DTD layouts, constraints, features (Element and Attribute), tokenizers and filters. Thus, when the user starts the system again, everything that has already been added to the system is available. It is recommended to execute this function every time the DTDs or the related data are changed.
- Edit DTD layout
By editing the DTD layout, the user can change the way, in which a document, loaded in the system, will appear on the screen. This facility includes the following: moving to a new line before/after opening/closing tag, hiding some tags and/or their con
After choosing a DTD, the following table appears:
The first column contains all tag names in the DTD. Each row represents the layout information for one tag. Here follows a description of the meaning of each column in the table:
- Tags - All the tag names in the selected DTD;
- Open tag start - whether or not the opening tag of the corresponding element appears on a new line;
- Open tag end - whether or not the first child node appears on a new line (a new line after the opening tag);
- Close tag start - whether or not the closing tag of the corresponding element appears on a new line;
- Close tag end - whether or not a new line is inserted after the closing tag;
- Is tag visible - whether the tag is visible on the screen or hidden;
- Are children visible - whether the content of the tag is visible or hidden;
The check box at the bottom of the window: "Use line offsets" supports more comprehensive visualization. It suggests an additional white space to be inserted in front of the tags. If chosen, this white space is assigned to each tag, wh
When a document is loaded, it obeys the layout, defined for its DTD. Note that later on this layout can be changed only for the current view.
Menu Constraints
- Regular Expression Constraints
Each Regular Expression Constraint consists of 2 obligatory parts and 3 additional (optional) parts:
- Constraint name (obligatory) - a unique identifier for the constraint in the system;
- Regular expression (obligatory) - a valid regular expression which represents the constraint over the nodes' content;
- Default XPath expression - an XPath expression defining the selection of the nodes to be processed by the constraint. This expression appears as default text in the corresponding input field when the constraint is applied;
- Tokenizer - it is used when the constraint tests text nodes' content. If the tokenizer remains unspecified, then the processor takes the tokenizer which is specified in the DTD of the current document;
- Filter - it is used to filter the tokenizer categories when the text nodes' content is tested.
- Edit Regular Expression Constraints (REC)
This section is responsible for the regular expression constraint management. Here the REC can be created, modified, removed, saved to a file and loaded from a file. Here is a picture of the dialog window:
The left side of the window is a table with all REC in the system. The first column contains the names of the constraints. The second one contains the regular expression of each constraint. Having selected a row in this table, the user can manipulate the corresponding constraint by using the buttons on the right.
Description of the buttons on the right:
- New - creates a new Regular Expression Constraint. Having pressed the "New" button, a new constraint editor window appears on the screen (for more details, see below);
- Edit - the currently selected constraint is opened for editing in a new editor window;
- Remove - removes the currently selected constraint in the table. The removal is preceded by a confirmation message;
- OK - updates the current changes in the constraints and closes the manager window;
- Cancel - closes the manager window without saving the changes (if any) in the constraints;
- Save To File - serializes all the REC into an external file in XML format. This function can be used for two main purposes: back-ups and interaction with external applications. The description of the output XML file (the DTD) can be found in the file: regConstraint.dtd;
- Load From File - loads the REC from an external file. The external file must be an XML document, valid with respect to the DTD in the file: regConstraint.dtd;
Regular Expression Constraint Editor
Here is the interface view of the editor window for the REC:
The last three fields are optional. The tokenizer and filter lists contain all tokenizers and filters defined in the system. The regular expression may consist of: tags, token categories, token values and token value templates.
- Apply Regular Expression Constraints
The actual applying of the Regular Expression Constraints can be performed in two ways:
- by selecting a node from the tree panel and choosing a constraint;
- by selecting a set of nodes with the help of an XPath expression and then applying a certain constraint on each of them;
Here we describe the latter case. The user chooses 'Apply Regular Expression Constraints' from the menu Constraints/Regular Expression Constraints. Then the following dialog window appears:
The first input field Select nodes contains the XPath which is evaluated in order to select nodes for the constraints operation. If the default XPath expression is specified for the constraint, then it appears in this field as a default text.
The second field selects by name a constraint to be applied.
The last two fields are activated when the constraint tokenizer and filter are ignored and new ones have to be defined explicitly.
Having pressed the Apply button, the XPath is evaluated and a set of nodes is selected. Then for each of them the constraint is applied. If the node's content satisfies the constraint, then the node is marked as Valid. Otherwise it is marked as Non Valid. In this way two groups of nodes are formed and each of them can be observed separately. Here is a picture of the navigation panel window:
Here the user can change the group under observation by using the two radio buttons. Pushing Next and Previous buttons the user changes the current selection in the editor. On the top of the window there is some information about the constraint and the nodes which satisfy or do not satisfy it.
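The procedure described above (select nodes by an XPath expression, test each node's content against the constraint, split the nodes into Valid and Non Valid) can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not the CLaRK implementation: a node's content is reduced to the sequence of its child tag names, the constraint is an ordinary Python regular expression over that sequence, and the function name apply_constraint and the sample document are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

def apply_constraint(root, path, pattern):
    """Split the nodes selected by `path` into (valid, non_valid) lists."""
    regex = re.compile(pattern)
    valid, non_valid = [], []
    for node in root.findall(path):
        # Simplification: the content is the sequence of child tag names.
        content = " ".join(child.tag for child in node)
        if regex.fullmatch(content):
            valid.append(node)
        else:
            non_valid.append(node)
    return valid, non_valid

doc = ET.fromstring(
    "<text>"
    "<s id='1'><np/><vp/></s>"
    "<s id='2'><vp/></s>"
    "<s id='3'><np/><vp/><pp/></s>"
    "</text>"
)
valid, non_valid = apply_constraint(doc, ".//s", r"np vp( pp)*")

print([n.get("id") for n in valid])      # ['1', '3']
print([n.get("id") for n in non_valid])  # ['2']
```

The two resulting lists correspond to the two groups that the navigation panel lets the user browse separately.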
- Value Constraints
The constraint engine is a means for setting restrictions on the content of nodes in an XML Document, which can not be expressed by the DTD. Each value constraint in the system is attached to a DTD.
A value constraint in general consists of two parts. A target section and a source section.
Target Section
In this part one can find a description of the nodes to which the constraint will be applied. Initially, the nodes are selected by their tag name and then a further restriction is made by an XPath expression. In this way, using the expressive power of the XPath language, a context dependency can be expressed. The evaluation of the target nodes is performed in the following way. First, from all the elements in the current document only the nodes with the specified tag name are retrieved. Then for each of them the XPath expression is evaluated with the current node as a context one. If the XPath expression is evaluated as a non-empty list, then the context node is included in the set of nodes to which the constraint is to be applied. Otherwise, it is excluded.
Source Section
Here the possible values for the target nodes (selected by the previous section) are defined. The possible values are tag names and tokens depending on the type of the constraint. The source list can be selected by an XPath expression or by typing the choices explicitly as an XML markup. If the selection is made by a relative XPath expression, then the current target node is taken as a context node for the constraint. If a text node is selected as a source, then its text value is tokenized and the tokens are added to the source list, excluding the node itself. It is possible that the source for the constraint is an external document. The only requirements in such cases are the following: the external document has to be in the internal database of the system and the XPath expression cannot be relative.
There are four types of value constraints currently supported by the system. They are distinguished by their target and by the way they are used. Here is a description of each value constraint separately:
1. Parent Constraint
This type of a value constraint sets limits on the possible parents of a node. There are two ways of applying this constraint type: by changing the parent of a node (local) or by explicitly running the constraint engine (global).
The first possibility is changing the parent of a node (or a set of nodes at one level). The list of all the relevant parent nodes can be restricted further by applying other constraints. The final list contains the intersection between the source of the constraints and its former content. If the operation - changing of the parent of a set of nodes - is performed, then all compatible (parent) constraints are applied.
The second possibility is running the Constraint Engine. It works in the following way. First, the targets are selected (by their tag names and the XPath restriction). Then the source is compiled. If there is more than one choice, the user is asked to select one option from a list. If the choice happens to be exactly one element, it is automatically inserted as a parent of the target.
The source list of each constraint must contain only tag names. All tokens in the list are ignored.
2. All Children Constraint
This type of a value constraint sets limits on the names of a node's children and the content of its text children. All children that are tags must have names coinciding with the name of some node from the source list. Then all the data in text children is tokenized and a list A of tokens is formed. After that all the data in text nodes in the source list is tokenized and a list B of tokens is formed. For every token in A there must exist a token in B such that the values (not categories) of the two tokens are equal. This type of a value constraint is applied automatically during the validation of a document.
3. Some Children Constraint
This is a special type of a value constraint, because its main task is not to set limits on the node's content. Instead, it is used for a value restriction when the operation inserting a child in a node is performed. This constraint type is not applied each time a new node is inserted. These constraints are used separately. Here the target node is the node where the insertion takes place. The constraint is blocked when:
- there is a child of the target node that is a tag and there is a node in source list, such that both nodes have the same names.
- there is a text node in the target node that has a token, whose value equals the value of a token in the source list.
To sum up, when there is a non-empty intersection between the source list and the target node's content, the constraint is satisfied and there is nothing more to be done. In cases when the source list is empty and there is content for the target node, an error message is shown. If the target content is also empty, then the constraint is satisfied.
When the source list is not empty and there is no intersection with the target's content, the user is offered a list with the possible values from the source list for the target node. The user can choose one item to add. If the item is a token and the target node has already some text content, the new value is appended to it with a comma as a separator.
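The cases above can be summarised in a short Python sketch. It is a hedged illustration rather than the CLaRK implementation: the function names check_some_children and append_token, the status strings, and the representation of node content as plain lists of tag names and token values are all assumptions, and the exact form of the comma separator (with or without a following space) is not specified in the text.

```python
def check_some_children(target_values, source_values):
    """Return (status, choices) for one target node.

    target_values: tag names and token values found in the target node.
    source_values: tag names and token values in the source list.
    """
    target, source = set(target_values), set(source_values)
    if target & source:
        # Non-empty intersection: nothing more to be done.
        return "satisfied", []
    if not source:
        # Empty source: satisfied only if the target has no content either.
        return ("error", []) if target else ("satisfied", [])
    # No intersection: the user would be offered the source values.
    return "ask_user", sorted(source)

def append_token(text, token):
    """Append a chosen token to existing text content, comma-separated."""
    return token if not text else text + "," + token

print(check_some_children(["np", "dog"], ["dog", "cat"]))  # ('satisfied', [])
print(check_some_children(["np"], []))                     # ('error', [])
print(check_some_children(["np"], ["dog", "cat"]))         # ('ask_user', ['cat', 'dog'])
print(append_token("dog", "cat"))                          # dog,cat
```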
4. Some Attributes Constraint
This constraint is very similar to the previous one. The only difference is that the target here is an attribute of a node. Also the target selection includes a selection of an attribute defined in the DTD for the selected tag name.
The following screenshot is the dialog window of the value constraints editor:
The dialog is separated into 5 sections:
- Constraint Info - here the user gives a short name of the constraint (free text) which is obligatory. Optionally some additional descriptions can be written in the second text box.
- Constraint Type - this is a list of four elements where the user specifies the type of the constraint. The options are : Parent, All Children, Some Children, Some Attributes (described before). The two checkboxes on the right can activate some runtime information as follows:
- Show status before - indicates the number of the target nodes the constraint is to be applied to, i.e. before the real application;
- Show status after - indicates the number of the target nodes the constraint has already been applied to, i.e. after the real application.
The check box "Prompt for save on each ... times:" and the text field next to it are used for making backups of the current state of the document while applying the constraints. In order to use this option, the check box must be marked and in the text field a number must be entered. It indicates the number of the successful applications, after which the system reminds the user to save the document.
- Target Node - here the description of the target nodes for the constraint is given. The first field defines the name of the target node(s). The field itself represents a sorted list of all tag names defined in the DTD. The second field is the place where the XPath restriction is specified. It is evaluated sequentially on every initially selected node as a context node. The function of these two fields can be represented by an XPath query: /descendant-or-self::ta/self::*[not(child::*)] (for the picture above). The bolded parts represent the text which comes from the two text fields. The third field (disabled in the picture) is used when the target of the constraint is an attribute (Some Attributes). It gives the list of all attributes defined, according to the DTD, for the element chosen in the first field.
- Constraint Source - this section defines the source list for the constraint. The text field content is either an XPath expression, or an XML markup. It depends on the radio button, which has been currently selected for the source type. If the source type is 'XML Mark-up', then the content of the text field is XML. Otherwise it must be an XPath expression. If the selected type is 'Local Document', then the XPath expression evaluates each target node as a context one. If the type is 'External Document', then the choice box gets enabled and the user is expected to choose a document. The XPath expression is evaluated on this document and the root node is the context. In the latter case it is expected for the XPath expression to be absolute.
- Tokenization & Help section - here a tokenizer can be activated for the constraint, or it can be blocked so that text nodes are treated as a whole rather than as a set of tokens. A filter can also be set in order to exclude some "garbage" categories, such as separators, from the source list. Another restriction can be set here by defining the token value and the category templates. The templates are defined in the same way as those in the CLaRK System grammars (using @ and # symbols for wildcards). Another facility which can be relied upon here is the Help Document. This option ensures the following possibility: while listing the different choices, the user can get brief information about the meaning of each choice. This information must be stored in an internal document. Its structure is described in a DTD in the file: helpFile.dtd. The information about a given choice appears in the status bar of the editor when the mouse pointer is over the choice.
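The two-step target selection performed by the Target Node fields (first by tag name, then by an XPath restriction evaluated with each candidate as the context node) can be sketched as below. This is an illustration only: it uses Python's xml.etree.ElementTree, which supports just a subset of XPath, and the function name select_targets, the sample document and the restriction ./ta are hypothetical.

```python
import xml.etree.ElementTree as ET

def select_targets(root, tag_name, restriction):
    """Keep elements named `tag_name` for which `restriction` selects something."""
    targets = []
    for node in root.iter(tag_name):
        # The restriction is evaluated with the candidate node as context;
        # a non-empty result keeps the node, an empty one excludes it.
        if node.findall(restriction):
            targets.append(node)
    return targets

doc = ET.fromstring(
    "<text>"
    "<w id='1'><ta/></w>"
    "<w id='2'/>"
    "<p><w id='3'><ta/><tb/></w></p>"
    "</text>"
)
targets = select_targets(doc, "w", "./ta")

print([n.get("id") for n in targets])  # ['1', '3']
```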
In this section a short description of the Constraint Editor was presented. It is invoked whenever a change to a Value Constraint is needed or a new constraint is to be defined. The Value Constraint management is handled by the following Constraint Manager dialog window:
Within the CLaRK System this module can be invoked from the menu: Constraints/Value constraints. The user is asked to choose the DTD according to which the Value Constraints are to be applied. Then the dialog window from the picture above appears.
The Constraint Manager represents a table of all constraints defined for the current DTD (if any). Each constraint is represented as one row in the table. The ordering in the table is important only if the constraints depend on each other. The constraint in the first row is applied first, then the second constraint is applied and so on. The ordering can be changed by the two buttons on the right side: "Move Up" and "Move Down". They swap the position of the selected row with its neighbour above or below.
Sometimes it is useful to deactivate some of the constraints temporarily. This can be done simply by clearing the check box on the constraint row. For example, in the picture above the second constraint is deactivated.
The other buttons:
- New Constraint - creates a new Value constraint by calling the Constraint Editor.
- Edit Constraint - edits the selected Value constraint.
- Remove Constraint - removes the selected Value constraint.
- Load From File - loads Value constraints which have been saved before. This is needed when restoring from backups.
- Save To File - saves the Value constraints in the current manager to a file in order to make backups.
- Options - these options enable or disable the usage of certain types of constraints. This can be used as a filter.
- Apply Constraints - applies all Value constraints which are activated at the moment for the current DTD. Here the settings from the options hold. This button is disabled in case there is no document opened or the current one has a different DTD from the constraints' DTD.
- Done - closes the dialog window by saving the changes on the constraints (if any).
- Cancel - closes the dialog window without saving the changes on the constraints (if any).
Value Constraints Group
The constraints described so far work in the following way: the first constraint is applied to all targets, then the second one is applied, and so on. The constraint groups, however, work in a slightly different way. First, a context node is selected and then all constraints from the group are applied within this context. Each group contains three parts:
- name - unique identifier of the group;
- context - an XPath expression, selecting the context for the group;
- list - a list of all value constraints included in the group;
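The difference in application order between plain value constraints and a constraint group can be made concrete with a small Python sketch. It only records the order of (constraint, node) pairs. The names are hypothetical, and for simplicity both modes use the same node list, whereas in the system each constraint selects its own targets.

```python
def apply_sequentially(constraints, nodes):
    """Plain value constraints: each constraint visits all targets in turn."""
    return [(c, n) for c in constraints for n in nodes]

def apply_as_group(constraints, contexts):
    """Constraint group: for each context node, apply every constraint."""
    return [(c, n) for n in contexts for c in constraints]

order_plain = apply_sequentially(["c1", "c2"], ["n1", "n2"])
order_group = apply_as_group(["c1", "c2"], ["n1", "n2"])

print(order_plain)  # [('c1', 'n1'), ('c1', 'n2'), ('c2', 'n1'), ('c2', 'n2')]
print(order_group)  # [('c1', 'n1'), ('c2', 'n1'), ('c1', 'n2'), ('c2', 'n2')]
```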
The Value Constraint Group Manager:
The Group management includes: the creation of a new group, the modification and removal of an existing group, the application of a group of constraints. Each operation (except New Group) is preceded by selecting a group from the list.
- Number Constraints
This constraint type restricts the occurrences of some specific elements within the content of a document. The node specification is given by an XPath expression. This XPath expression is evaluated with the root of the document as the context node. The evaluation of the expression produces a list of nodes. The number of entries in this list must lie between the MIN and MAX values in order to satisfy the constraint. Neither MIN nor MAX may be a negative number. Instead of specifying a MAX value, the user can write the character '*', which means positive infinity, i.e. no upper limit.
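A number constraint check can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the system's code: the function names parse_bound and satisfies are hypothetical, the bounds are taken as inclusive, and xml.etree.ElementTree paths stand in for full XPath.

```python
import math
import xml.etree.ElementTree as ET

def parse_bound(value):
    """Turn a MIN/MAX field into a number; '*' means no upper limit."""
    if value == "*":
        return math.inf
    number = int(value)
    if number < 0:
        raise ValueError("MIN and MAX must not be negative")
    return number

def satisfies(root, path, min_value, max_value):
    """True if the number of nodes selected by `path` lies in [MIN, MAX]."""
    count = len(root.findall(path))
    return parse_bound(min_value) <= count <= parse_bound(max_value)

doc = ET.fromstring("<text><s/><s/><s/><title/></text>")

print(satisfies(doc, "./s", "1", "*"))      # True: three s elements, no upper limit
print(satisfies(doc, "./s", "0", "2"))      # False: three exceeds MAX
print(satisfies(doc, "./title", "1", "1"))  # True: exactly one title
```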
The Number Constraint Manager dialog:
In the example above, the first constraint has no upper limit. The fourth column is responsible for the activation/deactivation of the constraints. It becomes necessary when the user would like to apply only a certain subset of the constraints. Checking the (active) constraints can be done by pressing the Apply button. This button is disabled when there is no document in the editor. After applying the constraints, the user receives information about the number of the satisfied constraints and the number and type of the unsatisfied ones.
Menu Tools
- Entity Converters
This tool handles documents which contain symbols not supported by the local hardware architecture. It substitutes the symbols with entities according to the standard ISO 8879 and vice versa. Currently, this tool supports 19 sub-sets of entity-char conversions. Each of them can be activated or deactivated. The more sub-sets are activated, the more time is needed for processing (conversion). One reason for excluding some of the sub-sets is that sometimes not all the symbols have to be converted, for example: commas, dots, colons, semicolons.
Example: ("дума" in Bulgarian is the equivalent of "word")
"дума" <-- entity conversion --> "&dcy;&ucy;&mcy;&acy;"
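The two conversion directions can be illustrated with a short Python sketch. It is not the tool's implementation: the four-entry table below is a hypothetical stand-in for one activated conversion sub-set (the entity names are the ISO 8879 Cyrillic names for these letters), and the function names are made up. Characters outside the active sub-set, such as the comma, are left untouched.

```python
# Hypothetical four-entry sub-set; names follow the ISO 8879 Cyrillic set.
CYRILLIC_SUBSET = {"д": "dcy", "у": "ucy", "м": "mcy", "а": "acy"}

def char_to_entity(text, subset):
    """Replace every character covered by the sub-set with &name;."""
    return "".join(
        "&" + subset[ch] + ";" if ch in subset else ch for ch in text
    )

def entity_to_char(text, subset):
    """Replace every &name; covered by the sub-set with its character."""
    for ch, name in subset.items():
        text = text.replace("&" + name + ";", ch)
    return text

encoded = char_to_entity("дума, word", CYRILLIC_SUBSET)
decoded = entity_to_char(encoded, CYRILLIC_SUBSET)

print(encoded)  # &dcy;&ucy;&mcy;&acy;, word
print(decoded)  # дума, word
```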
The tool operates on the document which is currently opened in the system. It can be started from the main menu:
- Tools/Entity Converters/Char --> Entity - Converts all the symbols included in the currently activated conversion sub-sets into entities. The current conversion sub-sets can be seen in Tools/Entity Converters/Entity Management.
- Tools/Entity Converters/Entity --> Char - It enables the opposite conversion i.e., from entities to symbols (characters).
- Tools/Entity Converters/Entity Management - This item visualizes the manager window that is responsible for the activation/deactivation of the different sub-sets of entity-char converters.
The dialog window:
The window shows the current active converters (filters). Each of them can be deactivated by removing it from this list (button 'Remove filter').
In order to see the symbols (and the corresponding entities), the user can press the View filter button. Having pressed it, a table appears on the screen containing detailed information about the filter. Each row represents one entity-symbol pair. The table has 3 columns: the first for the entities, the third for the symbols and the second for the Unicode codes of the symbols represented as entities.
In order to activate one or more filters which are not already in the list, the user can press the Add filter button. A new dialog window appears which contains a list of all the available filters which are not active. By selecting the checkbox opposite each of them the user can activate filters. Here he/she can see the content of the currently selected (non-active) filter before adding it (button Preview). Optionally, all filters can be added with the button Add All.
When the entity filter management has been completed, the new settings can be updated by the Done button and then the window is closed.
There are two more buttons (Apply "Entity --> Char" and Apply "Char --> Entity") which apply the corresponding conversion to the current document in the system. They perform the same action as the conversion tool from the main menu of the CLaRK System. In addition, the changes in the active filters (if any) are updated.
- XPath Transformations
This dialog is used for various "mass" commands for document restructuring. Generally, the scenario is the following: (1) a list of nodes (subtrees, text elements) is chosen by the Source field. In this way it is defined what will be copied or moved in the document; (2) a list of nodes is chosen by the Target field. In this way the place(s) where the source elements will be copied or moved is defined; (3) the elements from the source list are attached to the elements of the target list. There are several options that define how the above action is performed. They concern whether the elements of the source are copied or cut from the document before being attached to the target, and the mapping between the source and the target elements: either all elements of the source are attached to each element of the target, or each element of the source is attached to the corresponding element of the target.
A more detailed description of each field and action follows.
Source
This is a description of the data which is to be copied or moved. The description can be an XPath expression, XML markup data or some text. If the description is an XPath expression, then it is evaluated to a list of nodes in the document selected in the combo box under the source field. The XPath expression is evaluated before the actual change of the document takes place. If the source is XML markup data, then it need not have a root element. The XML markup data is parsed to a list of XML nodes. If only text is given, this text is considered a one-element list.
Copy versus Cut
When the Source is defined by an XPath expression, the elements in the resulting list can be either left in the document or deleted from it before the rest of the processing. If one chooses to copy elements, first a copy of the source list is created and then the operation proceeds further with that copy. If one uses cut, then the elements are first removed from the tree and then the operation continues. When one cuts the elements, the destination XPath cannot be relative.
Note: If you cut the elements, then they will not be present in the tree when the destination XPath is evaluated.
Include subtree check box
If this check box is on, then the source list contains for each chosen node the entire subtree under the chosen node. If it is off, then only the local information for each node is put in the source list. The local information includes the tag name and the set of attributes as well as their values. When only a node with the local information is chosen and it has to be cut then its children are inserted as immediate children of its parent. The insertion is made in the position of the deleted node.
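The behaviour of cutting a node without its subtree can be sketched as follows. This is an illustrative Python fragment, not the CLaRK code: the function name cut_local_node and the sample document are hypothetical, and text content between elements is ignored for brevity.

```python
import xml.etree.ElementTree as ET

def cut_local_node(parent, node):
    """Remove `node` from `parent`, splicing its children into its position."""
    position = list(parent).index(node)
    children = list(node)
    parent.remove(node)
    for offset, child in enumerate(children):
        parent.insert(position + offset, child)
    # The cut node keeps only its local information: tag name and attributes.
    return ET.Element(node.tag, dict(node.attrib))

root = ET.fromstring("<s><a/><np case='nom'><b/><c/></np><d/></s>")
local = cut_local_node(root, root.find("np"))

print([child.tag for child in root])  # ['a', 'b', 'c', 'd']
print(local.tag, local.attrib)        # np {'case': 'nom'}
```

The children b and c take the place of the removed np element, while the returned node carries only the tag name and the attribute values into the source list.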
Treat source as XML markup check box
This check box controls the treatment of the Source field. If it is on, then the source is treated as XML markup data. If the XML markup data does not contain tags, then it is treated as text.
If the check box is off, then the content of the Source field is treated as an XPath expression.
Target
An XPath expression defining the target list of nodes, i.e. the nodes where the source will be included.
Absolute or Relative?
When Absolute, the XPath is calculated from the beginning of the document selected in the document combo box under the destination field, e.g. the XPath "self::*" will return the document element.
When Relative, the XPath is calculated for every node from source with this node as a context. As a result you get a list of nodes for each node in the source.
Insert node(s)
Defines the position where the source is to be included, relative to a node in the target list.
As a parent
The nodes from the source become parents (ancestors) of destination.
As a child
The nodes from source become children of the destination in an appropriate position. If the number in the box is less than 0 or it is not a number, then they become last children of the destination nodes.
As a sibling
The nodes from the source become siblings of destination in the appropriate position. 1 means next sibling, 2 means the sibling after the next sibling and so on. -1 means previous sibling, -2 means the sibling before the previous sibling and so on.
Options
Copy all nodes in the source list to every node in the target list
This is allowed only when the target XPath is Absolute. It works according to the selected position.
- as parent - The elements in the source list are treated as a path in an XML document where the first element is a parent of the second, the second of the third and so on. If the source list contains whole subtrees, then each element in the source list is considered as a last child of the previous element. The constructed path is inserted in the document so that the first element is inserted in the place of the node from the target list and the target node is inserted as a last child of the last element of the source list. For each node in the target list a new copy of the source list is taken.
- as child - The elements in the source list are inserted as children of each node in the target list in the order in which they appear in the source list. The insertion is done at the indicated position. For each node in the target list a new copy of the source list is taken.
- as sibling - The elements in the source list are inserted as siblings of each node in the target list in the order in which they appear in the source list. The insertion is done at the indicated position. For each node in the target list a new copy of the source list is taken.
Although the program always makes a copy, if cut is selected the nodes in the source list will be deleted.
Copy each node from the source list into the corresponding node in the target list
This option allows a pairwise inclusion of the source list into the target list. When the target is Absolute, the first node from the source list is attached to the first node in the target list, the second node from the source list to the second in the target and so on until there are no nodes left in the source or in the target list. When the target is Relative, each node in the source list is attached to the first node in its relative target list. There is an option to check whether the two lists are of equal sizes (when Absolute) or whether each relative list contains a single node (when Relative).
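For the Absolute case, the pairwise attachment can be sketched in Python as below. The sketch rests on assumptions that are not taken from the system: the function name attach_pairwise is hypothetical, the nodes are copied (not cut), and each copy is attached as the last child of its target.

```python
import copy
import xml.etree.ElementTree as ET

def attach_pairwise(sources, targets, require_equal_sizes=False):
    """Attach a copy of the i-th source node as last child of the i-th target."""
    if require_equal_sizes and len(sources) != len(targets):
        raise ValueError("source and target lists differ in size")
    # zip() stops as soon as either list runs out of nodes.
    for source, target in zip(sources, targets):
        target.append(copy.deepcopy(source))

doc = ET.fromstring("<text><w>a</w><w>b</w><w>c</w><s/><s/></text>")
attach_pairwise(doc.findall("./w"), doc.findall("./s"))

print([[child.text for child in s] for s in doc.findall("./s")])  # [['a'], ['b']]
```

The third w element has no corresponding target and is therefore not attached anywhere. With require_equal_sizes=True the same call would raise an error instead, which mirrors the optional size check.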
Copy all nodes from the source list as parents of all nodes in the target list
The elements in the source list are treated as a path in an XML document where the first element is a parent of the second, the second of the third and so on. If the source list contains whole subtrees, then each element in the source list is considered to be the last child of the previous element. This path is inserted in the document as a common path of parents for all nodes in the target list. This is done by searching for the nearest common parent of all nodes in the target list (note that this parent exists). Then all nodes which are children of that parent and are placed between the first and the last node in the target list are removed, and the first node in the source is put in their place. Then the removed nodes are added as the last children of the last node in the source list.
When the target XPath is Relative the above procedure is repeated for each node in the target list.
Buttons:
- OK - applies the chosen replacement (or addition). First the Source description is evaluated and the result list is saved in a clipboard-like memory. If Cut is chosen, then the corresponding nodes are deleted from the document. After that the Target description is evaluated and then the appropriate attachment of the source is made in the target.
- Cancel - exits the dialog without any processing.
- XSLT transformations
The current document can be transformed via an XSL Transformation. The user is asked to choose a valid XML document which contains the XSLT (it must be an internal document). The result of the transformation is a new document, which is loaded in the system.
- Edit Grammar
The tool represents a regular expression grammar editor. The user can edit an existing grammar or create a new one. The available grammars can be selected from the combo box at the top of the dialog. The user can create a new grammar, or remove, rename or update an existing one. In order to start editing, the user must first create at least one grammar with the help of the New button. Grammars without a name can not be edited. After each change, the grammar has to be updated with the Update button. Grammars that are not updated are removed from the memory when pressing the Exit button.
Each table line represents one grammar rule. The expressions in the column Regular Expression (the bodies of the rules) have to match the tokens and mark-up in the document. This column should always be non-empty. The expressions in the Left Regular Expression (left context) and Right Regular Expression (right context) columns determine the context in which the matched tokens and mark-up should appear in order for this rule to work. If there is no left or right context specified, the grammar presumes that all contexts are valid. The XML markup in the Return Markup column is used for marking the matched data. A comment field is added for the user's own commentary.
The Tokenizer combo box determines the current grammar tokenizer. If a tokenizer is selected, it will be used when the system creates the grammar input. If no tokenizer is selected, then the tokenizer from the DTD (Element Features) will be used. The Filter combo box determines the filter that will be used when applying the current grammar. The Matches combo boxes represent the match option for Left Regular Expression (left context), Regular Expression (body) and Right Regular Expression (right context). The Any Up and Any Down matches in the body combo box can be used for backtracking. If the Any Up option is selected, the grammar finds the shortest sequence of tokens and mark-up that is recognized by the body of a grammar rule and is correct according to the left and right context of this rule. The difference from the shortest match is that in the shortest match the grammar engine will choose the shortest possible sequence, and if the left or right context fails, the whole sequence will fail. Example:
In this example we are applying this grammar on the following sequence of symbols: b,a,b,c and the grammar is on the symbol "a". If we use Shortest match then the grammar will use the second rule because it is the shortest possible match and will fail on the left context of this rule and the grammar engine will go to the next symbol "b". If Any Up match is used then the grammar will choose the first rule although it matches a longer sequence.
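The difference between Shortest and Any Up can be sketched in a few lines of Python. This is only an illustration, not the CLaRK engine: the two rules are hypothetical (the rule table of the example is not reproduced in this text), bodies and contexts are plain symbol tuples instead of regular expressions, and only the left context is checked.

```python
# Hypothetical rules: (name, left context, body). Not taken from the manual.
RULES = [
    ("rule 1", ("b",), ("a", "b")),  # longer body, left context "b"
    ("rule 2", ("c",), ("a",)),      # shorter body, left context "c"
]


def body_matches(seq, pos, body):
    return tuple(seq[pos:pos + len(body)]) == body


def left_ok(seq, pos, left):
    return pos >= len(left) and tuple(seq[pos - len(left):pos]) == left


def shortest_match(seq, pos):
    """Commit to the shortest body match; fail if its left context fails."""
    found = [r for r in RULES if body_matches(seq, pos, r[2])]
    if not found:
        return None
    name, left, _body = min(found, key=lambda r: len(r[2]))
    return name if left_ok(seq, pos, left) else None


def any_up_match(seq, pos):
    """Try body matches from shortest to longest; keep the first valid one."""
    found = sorted((r for r in RULES if body_matches(seq, pos, r[2])),
                   key=lambda r: len(r[2]))
    for name, left, _body in found:
        if left_ok(seq, pos, left):
            return name
    return None


symbols = ["b", "a", "b", "c"]
print(shortest_match(symbols, 1))  # None: rule 2 is chosen, its context fails
print(any_up_match(symbols, 1))    # rule 1
```

Sorting the candidates from longest to shortest instead would give the corresponding sketch for Any Down.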
If the Any Down option is selected, the grammar finds the longest sequence of tokens and mark-up that is recognized by the body of a grammar rule and is correct according to the left and right context of this rule. The CLaRK grammar engine provides four modes for checking the left and right context:
- Left Right - checks the left context first and then the right one.
- Right Left - checks the right context first and then the left one.
- Backtracking Left - checks the left context first and then the right one. If the right context fails, the grammar engine tries to find a longer or shorter sequence of words (depending on the type of match selected for the left context) in order to use the right context of another rule instead. Example:
In this example the grammar is applied to the sequence of symbols c,c,a,b,a and the grammar engine is positioned on the symbol "a". In Left Right mode the grammar engine uses the first rule, because it matches the longest left context, but the rule then fails on its right context. In Backtracking Left mode the grammar engine prefers the second rule, because it is correct even though it has a shorter left context.
- Backtracking Right - checks the right context first and then the left one. If the left context fails the grammar engine will try to find a longer or shorter sequence of words (depending on the type of match selected for the right context) in order to use the left context of another rule instead.
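The effect of backtracking over contexts can be sketched as follows. The rules are hypothetical stand-ins for the missing rule table, the contexts are plain symbol tuples, and the sketch assumes that the longest left context is preferred; it is not the actual CLaRK implementation.

```python
# Hypothetical rules: (name, left context, body, right context).
RULES = [
    ("rule 1", ("c", "c"), ("a",), ("c",)),
    ("rule 2", ("c",), ("a",), ("b",)),
]


def left_ok(seq, pos, left):
    return pos >= len(left) and tuple(seq[pos - len(left):pos]) == left


def right_ok(seq, end, right):
    return tuple(seq[end:end + len(right)]) == right


def candidates_by_left(seq, pos):
    """Rules whose body and left context match, longest left context first."""
    found = [r for r in RULES
             if tuple(seq[pos:pos + len(r[2])]) == r[2]
             and left_ok(seq, pos, r[1])]
    return sorted(found, key=lambda r: len(r[1]), reverse=True)


def left_right(seq, pos):
    """Commit to the best left context; fail if its right context fails."""
    found = candidates_by_left(seq, pos)
    if not found:
        return None
    name, _left, body, right = found[0]
    return name if right_ok(seq, pos + len(body), right) else None


def backtracking_left(seq, pos):
    """Fall back to the next left-context match when the right context fails."""
    for name, _left, body, right in candidates_by_left(seq, pos):
        if right_ok(seq, pos + len(body), right):
            return name
    return None


symbols = ["c", "c", "a", "b", "a"]
print(left_right(symbols, 2))         # None: rule 1 fails on the right context
print(backtracking_left(symbols, 2))  # rule 2
```

Backtracking Right would be the mirror image: candidates ordered by their right context, with the left context checked second.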
The XPath expression (Apply to text field) selects the nodes to which the grammar will be applied.
Buttons:
- New - creates a new grammar.
- Update - updates the grammar within the system memory.
- Remove - removes grammar from the system memory.
- Rename - renames a grammar.
- Exit - closes the grammar editor, prompts for unsaved grammars.
- Save Grammar - saves grammar(s) to file.
- Load Grammar - loads grammar(s) from a file.
- Apply Grammar - applies the grammar to the current document (if there is one) using the grammar's XPath expression. The user should save the grammar before applying it in order to use the new grammar settings.
- The Feature menu gives access to the DTD Element Features and Attribute Features.
For each grammar the user can define element values. The element values are XPath expressions that are evaluated for the corresponding elements to determine their value when applying the grammar. If no element value is defined for some element then it is taken from the DTD. Here is an example of element values.
Each row of the table represents an element value for an element. Both columns should not be empty. If one element has two element values the first one is used. When the user presses the OK button the XPath expressions in the table are checked for correctness.
- Apply Grammar
This menu item applies a grammar to the current document. In the Choose grammar field the user selects a grammar to apply. In the Select nodes field the user has to specify an XPath expression that selects the nodes on which to apply the grammar. If the grammar has a defined XPath expression, it appears automatically when the grammar is selected in the Choose grammar combo box. If no XPath expression is entered, the tool produces an error message. The user can also select a tokenizer and a filter in the Choose tokenizer and Choose filter combo boxes.
- Apply Multiple
This menu item applies one or more grammars or grammar groups to the current document in a cascaded way. The grammars and groups are added to a list, which is executed in the order of its items.
Buttons:
- Insert Grammar - inserts a grammar;
- Insert Group - inserts a grammar group;
- Insert Save - when this point in the queue is reached, the system prompts the user to save the processed document;
- Remove - removes the chosen item;
- Apply - starts applying the constructed grammar queue to the current document;
- Exit - closes the dialog window.
- Grammar Select
This menu item executes a grammar on the current document. As a result the tokens and mark-up recognized by the grammar are selected in the Text Area. Additionally it allows the user to mark the recognized information with an XML mark-up.
Buttons:
- Search - executes the grammar on the current document;
- Next - finds the next group of tokens and mark-up that matches the grammar;
- Previous - finds the previous group of tokens and mark-up that matches the grammar;
- Mark - marks the selected data with an XML mark-up taken from the grammar or written by the user.
- Edit - opens the grammar editor tool;
- Exit - closes this dialog.
- Grammar Groups
This menu item represents a grammar group editor. The user can set grouping of grammars in order to apply them together. The groups can be created, modified, removed or renamed. Grammar groups are created in order to enable the user to apply several grammars in cascade style. Grammar groups are applied via the Apply Multiple system tool.
Buttons :
- Insert Grammar - inserts a new grammar in the grammar group;
- Remove - removes the selected grammar from the grammar group list.
- New - creates a new Grammar group;
- Save - saves the grammar group;
- Rename - renames the Grammar Group;
- Remove Group - removes the currently opened grammar group.
- Exit - closes the dialog window.
- Sort Tools
Sort
The sort tool is used for reordering nodes.
To sort, the user specifies the following:
- The nodes to be sorted.
- The keys for each node.
The first is done by defining an XPath expression in the Select Elements field. If the field is empty, the sort tool returns an error message. The XPath engine takes the node selected in the tree in the main panel as the context node. The sort tool compares only element nodes which have a common parent. It splits the result returned from the XPath evaluation into groups according to the parent node. Each group is sorted separately.
Keys are created for every node to be sorted. Each row in the table represents one key. The sort tool compares two nodes key by key. A key is the list of nodes returned by the XPath engine after evaluating the expression defined in the Key column of the table. The context node in this evaluation is the node for which the key is created. The other columns of the table represent settings used when comparing the lists. The lists are compared node by node.
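The grouping step can be pictured with a short Python sketch based on the standard xml.etree.ElementTree module. The function name is hypothetical, nodes are selected by tag name instead of a full XPath expression, and the sketch assumes that the sorted nodes are written back into the positions the selected nodes occupied; the manual does not state how the tool places them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def sort_within_parents(root, tag, key):
    """Sort the selected children of every parent as a separate group."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    groups = defaultdict(list)
    for node in root.iter(tag):
        if node in parent_of:  # the root has no parent and is skipped
            groups[parent_of[node]].append(node)
    for parent, nodes in groups.items():
        # Positions occupied by the selected nodes under this parent.
        slots = [i for i, child in enumerate(parent)
                 if any(child is n for n in nodes)]
        for i, node in zip(slots, sorted(nodes, key=key)):
            parent[i] = node


doc = ET.fromstring("<r><g><i>z</i><i>y</i></g><g><i>b</i><i>a</i></g></r>")
sort_within_parents(doc, "i", key=lambda e: e.text)
print([e.text for e in doc.iter("i")])  # ['y', 'z', 'a', 'b']
```

The two groups are never merged: "y" and "z" stay under the first parent even though they sort after "a" and "b".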
- If the nodes are both elements then the sort tool asks the DTD which one is defined to be smaller (Element Features).
- If the nodes are both text they are compared by their textual content.
- The attribute nodes are compared by the textual content of their values only if they have the same name and their parents are elements with same name.
- The textual content(text) of text and attribute nodes is compared in the following way:
- The text is compared symbol by symbol.
- If the user chooses a tokenizer, the symbols are compared based on the tokens created by the primitive tokenizer of the selected tokenizer (the primitive tokenizer that is an ancestor of the selected tokenizer, or the selected tokenizer itself if it is primitive). The symbols are compared by their token category (the order of the categories in the primitive tokenizer) and by their position in the definition of the token category value. If the normalization option is selected, the sort engine uses the normalization table of the primitive tokenizer to determine the token category and value of each symbol.
- If the user selects "No Tokenizer", the sort tool uses the Unicode table to compare symbols. In this case the normalization option converts capital letters to small letters for Cyrillic and Latin.
- If you select the reverse option for the key, the text will be reversed before the comparison ("erga" => "agre").
- If you select the trim option for the key, whitespace (TAB, SPACE, LF, CR) will be removed from both ends of the text before comparing.
- If you select the number option for the key, the text will be converted into a number and compared by its numeric value.
- If the current nodes are not of the same type, the following order applies: attribute < text < element.
- If the key for one element has more nodes than the key for another element, the key with fewer nodes is considered smaller. This rule applies only when all nodes of the shorter key are equal to the corresponding nodes of the longer key.
For each key the user can define a different order (Ascending | Descending). The order of the keys in the table is very important because this is the order in which they are used. If two keys have equal nodes but one of them has additional nodes, the one with the smaller number of nodes is considered smaller.
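The comparison rules above can be condensed into a small Python sketch. It is a simplified model with hypothetical names: a key is a list of (node type, text) pairs, the "No Tokenizer" case is approximated by Python's own string comparison, and neither the DTD-defined element order nor tokenizer categories are modelled. Applying the descending order to the length rule as well is an assumption.

```python
# Order used when two nodes are of different types.
TYPE_RANK = {"attribute": 0, "text": 1, "element": 2}


def prepare(text, trim=False, reverse=False, number=False):
    """Apply the per-key options to the text of a node."""
    if trim:
        text = text.strip(" \t\n\r")
    if reverse:
        text = text[::-1]
    return float(text) if number else text


def compare_nodes(a, b, **options):
    """Compare two (node_type, text) pairs; returns -1, 0 or 1."""
    if a[0] != b[0]:
        return -1 if TYPE_RANK[a[0]] < TYPE_RANK[b[0]] else 1
    x, y = prepare(a[1], **options), prepare(b[1], **options)
    return (x > y) - (x < y)


def compare_keys(key_a, key_b, descending=False, **options):
    """Compare two keys node by node; a key that is a prefix is smaller."""
    result = 0
    for a, b in zip(key_a, key_b):
        result = compare_nodes(a, b, **options)
        if result:
            break
    if result == 0:
        result = (len(key_a) > len(key_b)) - (len(key_a) < len(key_b))
    return -result if descending else result


print(compare_keys([("text", "10")], [("text", "9")]))                # -1
print(compare_keys([("text", "10")], [("text", "9")], number=True))   # 1
```

The two calls show why the number option matters: as text "10" sorts before "9", as a number it sorts after.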
The difference between the DTD sort and the Advanced one is that in the DTD sort the tool takes the tokenizer and the number option from the DTD (Element Features, Attribute Features). For attribute nodes the sort tool also takes the order of the enumeration values from the DTD.
Examples:
- Example 1: Sorting books by pages and title. The elements to sort are the book children of the context node. They are sorted by the content of their pages element and title element. Key 1 is the text in the pages element of the book. It is trimmed and converted to a number when sorting. This key does not need a tokenizer because the whole node is converted to a number. If two elements are equal according to the first key (two books have the same number of pages), they are compared by the second key. Key 2 is the text in the title element of the book. It is trimmed and normalized when sorting. For normalization the sort tool uses the normalization defined in the "Mixed Word" tokenizer. The order of this key is descending, which means that this key sorts books by title in reverse order.
- Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in the document and sorts them according to the text in their head element. If a division does not have a head element, it is considered smaller.
Example 1
Example 2
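For readers who prefer code, the two keys of Example 1 can be imitated in Python. The sample library is made up for illustration, and lower-casing stands in for the normalization of the "Mixed Word" tokenizer, which is not specified here.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
LIBRARY = ET.fromstring(
    "<library>"
    "<book><title>Alpha</title><pages> 300 </pages></book>"
    "<book><title>beta</title><pages>120</pages></book>"
    "<book><title>Gamma</title><pages> 120</pages></book>"
    "</library>"
)


def sorted_titles(parent):
    books = parent.findall("book")
    # Python's sort is stable, so sorting by Key 2 first and by Key 1
    # afterwards yields "pages ascending, then title descending".
    books.sort(key=lambda b: b.findtext("title").strip().lower(), reverse=True)
    books.sort(key=lambda b: int(b.findtext("pages").strip()))
    return [b.findtext("title").strip() for b in books]


print(sorted_titles(LIBRARY))  # ['Gamma', 'beta', 'Alpha']
```

The two books with 120 pages are ordered by title in reverse order, and the book with 300 pages comes last.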
- Tokenizers
This menu item opens the Tokenizer Editor dialog. The user can select different tokenizers using the "Tokenizer" combo box. Note that there is always at least one tokenizer in the system because the "Default" tokenizer is not editable and cannot be removed.
The user can create, remove or save a tokenizer. Each row in the table represents one tokenizer category. The user can add, remove and reorder rows with the menu shown when the user right-clicks a row in the table. The first column is the category name. The content of the second column depends on the type of the tokenizer: it contains the category value (all the symbols in the category) if the tokenizer is "primitive", or a regular expression if the tokenizer is "non-primitive". Here is an example:
This is a primitive tokenizer named Mixed. It has categories LAT, CYR, SYMBOL ... Category LAT represents the Latin symbols. Category NUMBER represents the digits from 0 to 9...
Buttons:
- Save - saves the current tokenizer.
- Remove - removes tokenizer.
- Sort Order - defines the order of categories of a primitive tokenizer.
- Change Parent - changes the parent of a non-primitive tokenizer.
In order to create a new tokenizer the user must press the New button and set the options in the new tokenizer dialog.
If the tokenizer is primitive, the user must select the Primitive check box. Otherwise the user must indicate the parent of the tokenizer in the Parent combo box. The user can take an existing tokenizer as the basis for the new one with the Use current button. This tokenizer has to be open in the dialog window.
When defining the category value for a primitive tokenizer the user should be aware of the following rules:
- The characters are quoted with single or double quotations or referenced by a number. Example "." or ";" or 'k' or 32 (Space)
- If the user wants to use more than one symbol for a category, he/she should separate the symbols by commas. Example "a","b","c",...
- If the user wants to define a range of symbols, the starting and ending symbols must be connected with a dash. Example : "A"-"Z".
- One character cannot belong to more than one category.
- A category can be defined on more than one row. The rows are interpreted as the union of their definitions. Here is an example:
The tokenizer tool will interpret these lines as LAT "'a'-'z','A'-'Z'".
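The category value syntax described above is simple enough to be parsed in a few lines. The following Python sketch is an illustration under stated assumptions, not the parser used by the system: the function names are hypothetical, numbers are read as decimal character codes, and quote characters inside quotes are not handled.

```python
import re

# One symbol: "x", 'x' or a character code such as 32.
ITEM = re.compile(r"""\s*(?:"(.)"|'(.)'|(\d+))\s*""")


def parse_symbol(text, pos):
    match = ITEM.match(text, pos)
    if not match:
        raise ValueError(f"symbol expected at position {pos}")
    double, single, code = match.groups()
    if code is not None:
        return chr(int(code)), match.end()
    return (double if double is not None else single), match.end()


def parse_category_value(text):
    """Return the set of characters described by a category value."""
    chars, pos = set(), 0
    while True:
        start, pos = parse_symbol(text, pos)
        end = start
        if text[pos:pos + 1] == "-":  # a range such as "A"-"Z"
            end, pos = parse_symbol(text, pos + 1)
        chars.update(chr(c) for c in range(ord(start), ord(end) + 1))
        if pos == len(text):
            return chars
        if text[pos] != ",":
            raise ValueError(f"comma expected at position {pos}")
        pos += 1


print(sorted(parse_category_value('"a"-"c", \'k\', 32')))
# [' ', 'a', 'b', 'c', 'k']
```

Under this reading, a category defined on two rows is the union of the two parsed sets, which is the same set as parsing the rows joined by a comma.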
Here is one example of a primitive tokenizer :
For each primitive the user can define the sort order of the categories by clicking the Sort Order button. Example:
This dialog is shown if the user presses the Sort Order button while the Mixed tokenizer (the tokenizer in the first example) is on the screen. The user can reorder the categories with the menu shown when the user right-clicks a row in the table.
A normalize option can also be defined for each category of the tokenizer by right-clicking the category line in the table. For instance, the following dialog appears if we select normalize for the "LAT" category:
For each symbol of the category the user must select a corresponding normalized symbol. The "New Category" combo box determines the new category that the symbols receive after the normalization. This category can coincide with the original category of the symbols or be completely different.
When defining a non-primitive tokenizer, the user should follow the following rules:
- Each category Regular Expression must be a valid regular expression.
- Two categories can not have a common token.
Each non-primitive tokenizer must have a parent tokenizer. The parent of a tokenizer is set when the tokenizer is created and can be changed by the user when pressing the Change Parent button. Here is an example of a non-primitive tokenizer:
The parent of this tokenizer is the "Mixed" primitive tokenizer shown on the first example. This non-primitive tokenizer uses the categories "LAT" and "CYR" from the parent tokenizer.
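The relation between a non-primitive tokenizer and its parent can be sketched in Python. Everything here is an assumption made for illustration: the character classes of the primitive level, the one-letter codes, and the two non-primitive categories LATw and CYRw defined as one or more LAT or CYR symbols. The sketch also assumes that symbols matched by no non-primitive category keep the category of the parent tokenizer.

```python
import re


def primitive_category(ch):
    """Hypothetical stand-in for a primitive tokenizer such as "Mixed"."""
    low = ch.lower()
    if "a" <= low <= "z":
        return "LAT"
    if "\u0430" <= low <= "\u044f":  # Cyrillic small letters
        return "CYR"
    if ch.isdigit():
        return "NUMBER"
    if ch == " ":
        return "SPACE"
    return "SYMBOL"


# One code letter per parent category, so that the non-primitive
# categories can be written as regular expressions over categories.
CODES = {"LAT": "L", "CYR": "C", "NUMBER": "N", "SPACE": "S", "SYMBOL": "Y"}
RULES = [("LATw", re.compile("L+")), ("CYRw", re.compile("C+"))]


def tokenize(text):
    codes = "".join(CODES[primitive_category(ch)] for ch in text)
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in RULES:
            match = pattern.match(codes, pos)
            if match:
                tokens.append((name, text[pos:match.end()]))
                pos = match.end()
                break
        else:
            # No non-primitive category applies: keep the parent category.
            tokens.append((primitive_category(text[pos]), text[pos]))
            pos += 1
    return tokens


print(tokenize("ab 7"))  # [('LATw', 'ab'), ('SPACE', ' '), ('NUMBER', '7')]
```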
The user can undo changes made to a tokenizer with the Undo button. Tokenizers can be loaded from or saved to an external file. The Exit button closes the tokenizer editor dialog. The system prompts the user to save all unsaved tokenizers on exit.
- Filters
This menu item starts the filter editor. In order to browse the filters in the system, the user can use the "Filter" combo box at the top of the dialog. The user can add token categories from different tokenizers or add XPath expressions to filter element nodes.
The "Token Types" list is the list of the filtered token categories. The user can take the categories from the tokenizers in the system (the "Choose From" list on the left side of the dialog) and add them to the list of filtered token categories with the arrow ("=>") button. In order to add an XPath expression for element filtering the user must press the "add XPath" button. The new XPath expression is added to the "Expression" list. To remove a token category or an XPath expression from one of the lists the user must press the corresponding "Remove" button. The user can remove and save filters or create new ones.
- Element Features
The Element Features is used to add information to the elements of a DTD.
The user can add the following additional information:
- Tokenizer for the elements of the DTD. Used for tokenization of the element text data by the Grammar and Sort engines.
- Default Tokenizer for a DTD. If no tokenizer is defined for an element, the grammar and sort engines look for the default tokenizer of the DTD. After compilation the DTD receives the "Default" primitive tokenizer as its default tokenizer.
- The user can state that the content of an element is a number.
- The user can define an XPath Value for each DTD element.
- The user can define an order over the DTD elements. This option is used for sorting purposes.
In order to select the default tokenizer for the DTD the user must select an item in the "Default Tokenizer" combo box. The user can select a tokenizer for each element of the DTD in the "Tokenizer" column of the table by clicking on a table cell. The check boxes in the "Number" column are used by the sort tool to determine whether the content of the current element can be treated as a number. For example, pages and price can be treated as numbers when comparing two books. The values in the "XPath Value" column are used by the Grammar Engine to define the value of the element nodes.
The order of the elements can be defined in the sort table shown when the user presses the "Sort Order" button. When two elements are compared, their position in the sort table defines their order. The user can change the position of elements by dragging their rows to the correct positions or by using the context menu opened when the user right-clicks a sort table row. Here is an example:
- Attribute Features
The Attribute Features is used to add information to the attributes of a DTD.
The "Element" column of the table lists all the elements in the DTD that have an attribute. One element can appear on several rows because it can have more than one attribute. The "Attribute" column lists all the attributes of the DTD. In the "Tokenizer" column the user can select a different tokenizer for each attribute. The sort tool uses the information from the "Number" column to decide how to compare the values of the attributes (as plain text or as numbers).
An additional feature is the order of enumerated attribute values. The attributes with enumerated values have the string "(e)" at the end of the name. In order to sort the enumerated values, click on an attribute with enumerated values and click the "Sort Values" button. Example:
The values are sorted in ascending order. The user can change the order by dragging the rows or by the right mouse button.
- Extract
The task of the extract tool is to extract nodes from one or more documents and to save them as a new document. The text field at the top of the dialog is used for defining an XPath expression which selects the elements in the document(s). The context node for this evaluation is the root node of the document(s). The user can extract from the currently active document in the system or from the internal documents. The result of the extraction is an XML document in which all extracted nodes are children of the root element (named "Extract" by the system).
The "include subtree" option allows the extraction not only of the selected nodes but of the entire subtree below them as well.
Auxiliary tag - each extracted node can be wrapped in a parent element (prompts for attribute name), which is used to separate the results. For example, if only text nodes are extracted without an auxiliary tag, all the text nodes are merged in the new document. If the auxiliary tag is selected, the Number and Source options are shown in the dialog.
If "Number" is selected then the extract tool adds an attribute with the extract result number to the auxiliary tag.
If "Source" is selected, then the extract tool adds an attribute with the source document name to the auxiliary tag.
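The behaviour of the extract options can be approximated with Python's xml.etree.ElementTree. This is a sketch, not the tool itself: the function signature, the attribute names "n" and "src", the auxiliary tag name in the usage example, and the decision to number the results continuously across documents are all assumptions.

```python
import copy
import xml.etree.ElementTree as ET


def extract(documents, path, include_subtree=True,
            aux_tag=None, number=False, source=False):
    """documents maps a document name to its root element."""
    result = ET.Element("Extract")
    count = 0
    for name, root in documents.items():
        for node in root.findall(path):  # evaluated from the document root
            count += 1
            if include_subtree:
                piece = copy.deepcopy(node)
            else:
                piece = ET.Element(node.tag, dict(node.attrib))
                piece.text = node.text
            piece.tail = None
            target = result
            if aux_tag:
                # The auxiliary tag keeps the results apart.
                target = ET.SubElement(result, aux_tag)
                if number:
                    target.set("n", str(count))
                if source:
                    target.set("src", name)
            target.append(piece)
    return result


docs = {
    "a.xml": ET.fromstring("<d><p>one</p><p>two</p></d>"),
    "b.xml": ET.fromstring("<d><p>three</p></d>"),
}
out = extract(docs, ".//p", aux_tag="hit", number=True, source=True)
print([(h.get("n"), h.get("src"), h[0].text) for h in out])
```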
- Statistics
The Statistic Dialog is used to show information about the number of the occurrences of some elements.
The user can select elements by an XPath expression ("Search" field). It is recommended always to check this expression in the system's search tool.
The elements are selected from the internal documents in the system. The XPath expression is applied to the root element of every document selected in the standard "internal document selector".
The selected elements are sorted by sort keys identical to the keys described in the sort tool manual. Because information is extracted from multiple documents and a single DTD cannot be determined, only the "Advanced" sort mode is available.
The user can choose a tokenizer from all tokenizers defined in the system ("Tokenizer" combo box) in order to process the text nodes. The "Tokenizer Category Filter" is used to select the categories derived as result after tokenization with the selected tokenizer. If no tokenizer is selected the text nodes will be processed as a whole node. Only the tokens with the chosen categories are shown in the result.
Buttons: "OK" button shows result. "Cancel" button closes window.
The Result is shown as a table. Here is an example:
The "Category" column contains categories from filter which exist in chosen text nodes or "<Element>" if the row represents element node and "#text" if the node is text.
The "Element" column contains tokenized text (The value of the filtered tokens), or node names.
The "#" column contains number of occurrences of the corresponding item.
The "%" column contains information for percentage of the corresponding item.
The "Key Value" column contains the value of sort keys created for the corresponding node or nothing if the line contains token.
Result can be saved as an XML document following the result table structure. The DTD of the document has the following structure:
<!DOCTYPE statistics [
<!ELEMENT statistics (documents, item*, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category, element, number, percent)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>
]>
The documents tag lists the documents selected for the statistics; each document name appears in a document tag. The item tag corresponds to a line of the result table as follows:
- category tag corresponds to the "Category" column
- element tag corresponds to the "Element" column
- number tag corresponds to the "#" column
- percent tag corresponds to the "%" column
- keyvalue tag corresponds to the "Key Value" column
The all tag corresponds to the last row of the result table. It contains the number of all occurrences of the selected elements and the corresponding percentage.
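A result document of this shape can be produced with a short Python sketch. The function name and the percentage format are assumptions, and the keyvalue element is left out because the content model of item in the DTD above does not list it.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def build_statistics(document_names, items):
    """items holds one (category, element) pair per counted occurrence."""
    counts = Counter(items)
    total = sum(counts.values())
    root = ET.Element("statistics")
    docs = ET.SubElement(root, "documents")
    for name in document_names:
        ET.SubElement(docs, "document").text = name
    for (category, element), n in sorted(counts.items()):
        item = ET.SubElement(root, "item")
        ET.SubElement(item, "category").text = category
        ET.SubElement(item, "element").text = element
        ET.SubElement(item, "number").text = str(n)
        ET.SubElement(item, "percent").text = f"{100 * n / total:.2f}"
    last = ET.SubElement(root, "all")
    ET.SubElement(last, "number").text = str(total)
    ET.SubElement(last, "percent").text = "100.00"
    return root


stats = build_statistics(
    ["d1"],
    [("LATw", "cat"), ("LATw", "cat"), ("LATw", "dog"), ("NUMBER", "7")],
)
print(ET.tostring(stats, encoding="unicode"))
```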
Pressing "Cancel" button closes window and returns to main window of the system.
- Concordance
Concordance is a system tool for information extraction. The field at the top of the dialog is used to define the context. If the XPath expression in it is invalid, no elements will be extracted. The root of the document is used as the context node. The user can switch between two types of search with the tabbed pane. The result of the concordance is an XML document in which the found data is separated into lines. A line is an XML fragment with the following structure:
<L>
<LC> "the left context" </LC>
<I> "the data we are searching for" </I>
<RC> "the right context" </RC>
<COM> "Element for user commentary" </COM>
</L>
1. The grammar search - The user must select a grammar or create one to search with. To create a grammar, select the "<custom grammar>" item in the combo box and press the Edit button. The concordance tool opens a standard grammar editor with reduced options. After the user has finished editing, the system asks whether the created grammar should be used. If the answer is no, the system presumes that there is still no grammar. The user should press the "Save" button in order to save the newly created grammar in the system memory. If the user selects a grammar from the list, this grammar is used for searching. If the Edit button is pressed while a grammar from the list is selected, the system opens the selected grammar in a standard grammar editor. On exit the user is asked whether this grammar should be used. If the answer is positive, the modified grammar appears in place of the custom grammar in the combo box. The save option is available only for custom and modified grammars. The user can select another grammar to restrict the context in which the search grammar is applied. If the "Text Only" check box is selected, the grammar ignores the mark-up inside the context while checking. Example:
2. The XPath search - If the user searches inside the context with an XPath expression, the search field (the top text field in the tabbed pane) must contain a valid XPath expression. All nodes selected by this expression become items in the concordance result. If the fields for the left or right context are not empty, the concordance tool uses the expressions in them to form the left and right context of the items. If they are empty, the contexts are generated automatically. Example:
If the "Add Source Attribute ?" is selected, then to every extracted line the concordance tool will add an attribute with the name of the source file. If the "Add Number Attribute ?" is selected, all extracted lines will receive a number attribute.
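The structure of a concordance line can be illustrated with a small Python sketch. The function name, the context width measured in tokens, and the attribute names "source" and "n" are assumptions; the manual does not state how the automatically generated contexts are sized or how the attributes are named.

```python
import xml.etree.ElementTree as ET


def concordance_line(tokens, start, end, width=3, source=None, number=None):
    """Build one <L> line for the item tokens[start:end]."""
    line = ET.Element("L")
    left = tokens[max(0, start - width):start]
    ET.SubElement(line, "LC").text = " ".join(left)
    ET.SubElement(line, "I").text = " ".join(tokens[start:end])
    ET.SubElement(line, "RC").text = " ".join(tokens[end:end + width])
    ET.SubElement(line, "COM")  # left empty for the user's commentary
    if source is not None:
        line.set("source", source)
    if number is not None:
        line.set("n", str(number))
    return line


words = "the quick brown fox jumps over the lazy dog".split()
line = concordance_line(words, 3, 4, width=2, number=1)
print(ET.tostring(line, encoding="unicode"))
```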
- Table View
The Table View tool presents the information extracted by the concordance tool in a more readable table form. Each line of the table represents one line of the concordance result. The data in the "Context" columns does not represent the whole context but only the amount of data that fits in the column length. Initially this is only 30 symbols. To increase the context the user should press the settings button and set the context length in symbols there. The user can also set the width of the comment column. To see the context without expanding the column data, the user can right-click the "Left Context" or "Right Context" column. To add commentary to a concordance line, the user can fill in a value in the "comment" column or right-click a row in the "item" column. To navigate quickly through the table the user can use the combo box at the top to access a row. The user can sort the lines of the table.
The user must select the column to which each sort key applies (which element of the concordance line will be the context: LC, I, RC or COM). If no column is selected, the key is evaluated with the line element as context.
A useful option of the Table View is "Edit Layout". The user can filter the tags that are shown in the table. For example, if the POS information is kept in a separate tag, the user can hide it in order to view only the text.
- Node Info
This item gives information about the number of occurrences of specified tags or tokens in a set of internal documents. When the user starts this tool, s/he is asked to provide several things:
- The type of information which is needed. The two possibilities are: counting tags and counting tokens.
- The documents for which information is needed.
- An XPath expression which selects nodes in each document for which the counting will be performed.
Here is a screen-shot of the initial dialog window:
The major components of the dialog window are:
- XPath Field - selects the nodes in each document for which the counting will be performed. If tags will be counted, then for each node from the selection of this XPath expression, its descending nodes will be counted. If tokens will be counted, then for each text node from the selection its text content will be tokenized and the result tokens will be counted.
- Tokenizer Selector - determines which tokenizer will be used when tokenizing the text nodes for token counting. This component is disabled in case of tag counting.
- Info Type Selector - determines the type of elements, which will be counted. The options are: "Word Info" - for token counting and "Tag Info" - for tag counting.
- Document Selector - this component is responsible for selecting the documents from the internal document database to which the counting will be applied. This is a universal component of the CLaRK system. For more information see Document Selector in menu File.
- Show Info button - starts calculating the information for the selected documents.
- Cancel button - closes the window and cancels further processing.
If the Show Info button is pressed, the system starts to process the selected documents one by one. While processing the documents, the status bar of the system shows the current process state. Having processed all selected documents the system shows the result in a new window. Here are two example results, one for Word Info and one for Tag Info:
- Word Info
The first column Document contains the names of the documents chosen from the first dialog.
The second column Category contains the categories from the tokenizer which the user has chosen before.
The third column # contains the number of occurrences of each category in the text.
The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is an XML file. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.
If Add information checkbox is selected then the relevant information will be added to each of the documents. The word information added to a document has the following form:
<extent>
<interpGrp>
<interp type ="LATw" value="20"></interp>
<interp type ="TAB" value="1"></interp>
<interp type ="CYRw" value="9350"></interp>
<interp type ="NUMBER" value="237"></interp>
<interp type ="SYMBOL" value="129"></interp>
<interp type ="PUNCT" value="1720"></interp>
<interp type ="SPACE" value="9376"></interp>
</interpGrp>
</extent>
If the DTD for a document is TEI, <extent> is added in the appropriate position. Otherwise, <extent> is added after the first node.
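The word information block can be reproduced from a list of tokens with a few lines of Python. The function name and the input format (pairs of category and token text) are assumptions; only the element and attribute names are taken from the example above.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def word_info(tokens):
    """tokens is an iterable of (category, text) pairs."""
    counts = Counter(category for category, _text in tokens)
    extent = ET.Element("extent")
    group = ET.SubElement(extent, "interpGrp")
    for category, n in counts.items():
        # One <interp> per token category with its number of occurrences.
        ET.SubElement(group, "interp", type=category, value=str(n))
    return extent


info = word_info([("LATw", "ab"), ("SPACE", " "), ("LATw", "cd")])
print(ET.tostring(info, encoding="unicode"))
```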
- Tag Info
The first column Document contains the names of the documents chosen from the first dialog.
The second column Tag contains the tag names of all nodes which the user has chosen with the XPath expression from the first dialog.
The third column # contains the number of occurrences of each tag in the documents.
The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is an XML file. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.
If Add information checkbox is selected then the relevant information is added to each of the documents. The information added to each document has the following form:
<encodingDesc>
<tagsDecl>
<tagUsage gi="hi" occurs="40"></tagUsage>
<tagUsage gi="p" occurs="153"></tagUsage>
</tagsDecl>
</encodingDesc>
If the DTD for a document is TEI, <encodingDesc> is added in the appropriate position. Otherwise, it is added after the first node.
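Tag counting can be sketched in the same way. The function name is hypothetical; following the description of the XPath field, the sketch counts the descendants of every selected node and not the selected node itself.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def tag_usage(selected_nodes):
    """Count the tags of all descendants of the selected nodes."""
    counts = Counter()
    for node in selected_nodes:
        counts.update(d.tag for d in node.iter() if d is not node)
    desc = ET.Element("encodingDesc")
    decl = ET.SubElement(desc, "tagsDecl")
    for tag, n in counts.items():
        ET.SubElement(decl, "tagUsage", gi=tag, occurs=str(n))
    return desc


doc = ET.fromstring("<text><p>a<hi>b</hi></p><p><hi>c</hi><hi>d</hi></p></text>")
print(ET.tostring(tag_usage([doc]), encoding="unicode"))
```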
Menu Document
- Change
This option becomes relevant when more than one document is open in the system. A window appears with the list of all the documents that are currently open.
By selecting an item from this list, the user can change the active document in the system. If exactly one document is open in the system, this operation is not applicable.
- Change DTD
This option changes the DTD of the current document. The user can choose among all DTDs that have already been compiled into the system. When a new DTD is selected, it is assigned to the current document. The document is then validated with respect to the new DTD and its layout is updated according to the layout of the new DTD. If the document contains default attributes whose default values are unchanged, i.e. still obey the old DTD, such attributes are removed.
- Validate
An icon on the toolbar
The DTD consists of
- element definitions
- name declaration
- regular expression that defines the content of the element
- attribute definitions
For more information see http://www.w3.org/TR/REC-xml
- New view
It opens a new view for the current document. This new view is presented in a new window with the DTD layout. The new view is synchronized with the other views of the same document. For example, when a node is selected in one view, it is automatically selected in the others. All changes made in one of the views are immediately updated in the others. The only thing which remains independent for each view is the layout. When a view is opened, it takes its initial layout from the DTD. This layout can be modified later. For more details about editing a view's layout see Edit current view layout.
- Edit current view layout
An icon on the toolbar
Edit current view layout item allows for editing of the layout for each element in a DTD. For each tag (opening or closing) additional new lines can be attached before and after the tag. In this way the text view gets improved. The tag and its children can be visible or invisible. It means that the user can hide the information he/she is not interested in. The layout is set only for the current view of the document. After closing the view, all the information about the layout is lost. If the user wants to save the layout, this must be done by the DTD Layout. For more information about the layout table, see Edit DTD layout in menu DTD.
- Add default attributes
This item adds default attributes (if defined) to every element in the current document. For each element in the document, the default attributes defined for it in the DTD are collected in a list. If the element lacks an attribute that is a member of this list, then this attribute is added to the element together with its default value. After applying this operation, the system shows how many new attributes have been added to the current document.
- Remove default attributes
This item removes all default attributes, which possess unchanged default values in the current document. The procedure is as follows: First, all default attributes, which have been defined for the element, are taken from the DTD. Then, it is checked for the element in the document whether it has each of the attributes or not. If the answer is positive, then the attribute's value is compared with the default value for this attribute in the DTD. And if they are the same, the attribute is removed from the element in the document.
After applying this operation, the system shows how many attributes have been removed from the current document.
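The two operations on default attributes can be sketched as follows. This is an illustration under stated assumptions, not the system's code: the DEFAULTS dictionary is a hypothetical stand-in for the default attribute declarations of a compiled DTD, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the defaults declared in a DTD:
# element name -> {attribute name: default value}.
DEFAULTS = {"book": {"lang": "en", "format": "paperback"}}


def add_default_attributes(root, defaults):
    """Add every missing default attribute; return how many were added."""
    added = 0
    for element in root.iter():
        for name, value in defaults.get(element.tag, {}).items():
            if name not in element.attrib:
                element.set(name, value)
                added += 1
    return added


def remove_default_attributes(root, defaults):
    """Remove attributes whose value equals the default; return the count."""
    removed = 0
    for element in root.iter():
        for name, value in defaults.get(element.tag, {}).items():
            if element.attrib.get(name) == value:
                del element.attrib[name]
                removed += 1
    return removed


root = ET.fromstring('<books><book lang="bg"/><book/></books>')
added = add_default_attributes(root, DEFAULTS)
removed = remove_default_attributes(root, DEFAULTS)
print(added, removed)
```

An attribute whose value was changed by the user (lang="bg" in the sample) is neither overwritten by the first operation nor removed by the second.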
Menu Options
- Keyboard
Because of the variety of graphical characters (letters) which the Unicode tables allow, it is necessary for the user to have a means for keyboard input. Unfortunately, in most cases either the keys on the keyboard are not enough or the already defined keyboards are not suitable.
In these cases the CLaRK System offers the following solution. The user can define his/her own keyboard maps, i.e. a different character can be attached to each key on the keyboard. There are 94 keys available for mapping. Each key is identified by its ASCII character (which coincides with the beginning of the Unicode (UTF-16) table). It is a default for the specific machine architecture. The keyboard maps themselves are represented as sets of pairs. Each pair is responsible for one key. It has two elements: the default character and the code of the newly attached character from the Unicode table. When a newly defined keyboard is activated and a key is pressed, its character is searched for in the set of char-code pairs. If such a pair is found, then the second element is taken and the corresponding character is retrieved from the Unicode table and shown on the screen. If there is no such pair, then the same character appears on the screen.
When the system is running there are always two active keyboards: one which is fixed and cannot be modified (the hardware system keyboard) and one auxiliary. The second one can be defined by the user. The user can switch between the two keyboards with the key combination <Ctrl>+<Left Shift>. There is also an indicator on the toolbar, which shows the currently used keyboard. If the indicator is red and the sign is Aux, the auxiliary keyboard is in use. Otherwise it is green with the sign Lat. A switch can also be performed by clicking on the indicator.
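The lookup described above amounts to a dictionary from default characters to Unicode codes. The sketch below is an assumption-laden illustration rather than the system's implementation: the map contents, the names and the reading of the 94 keys as the printable ASCII characters 33 to 126 are all hypothetical.

```python
# Assumption: the 94 mappable keys are the printable ASCII characters 33..126.
MAPPABLE_KEYS = [chr(code) for code in range(33, 127)]

# Hypothetical auxiliary keyboard map: default character -> Unicode code.
AUX_KEYMAP = {"F": 1060, "f": 1092}


def translate(key, keymap, auxiliary_active=True):
    """Return the character that appears on the screen for a pressed key."""
    if not auxiliary_active:
        return key  # the fixed hardware keyboard is in use
    code = keymap.get(key)
    # No pair for this key: the same character appears on the screen.
    return chr(code) if code is not None else key


print(translate("F", AUX_KEYMAP))
print(translate("Q", AUX_KEYMAP))
print(translate("F", AUX_KEYMAP, auxiliary_active=False))
```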
Defining a new user keyboard
When this item is selected the Keyboard Manager window appears. The initial view of the manager is:
The manager dialog window contains three subparts: Keyboard Preview, Unicode Table Preview and Control Panel.
Keyboard Preview
This is the table on the left side of the window (with the white background). It shows the current state of the auxiliary keyboard. Each row in it represents a pair from the keyboard map. The first column contains the characters of the hardware default keyboard. It is not editable. The second column contains the codes of the newly attached characters. In the picture above the selection is set to the row with the character F. The character code attached to it is 1060, which means that when the user presses <Shift>+<F>, the character corresponding to this code appears on the screen instead of a capital F.
Just under the table there is a Char preview which shows the new character for the key of the selected row. And by moving the selection the user can observe the whole keyboard character by character.
In the second column the user enters the codes of the desired characters. After entering a code, <Enter> is expected.
The question is how the user will know the code of the desired character. The answer comes from the second component - Unicode Table Preview. This is the table with the blue background in the picture. It contains the characters of the Unicode table available for the current font. The font is the same as that of the text area of the system. If the character the user needs is not in the table, he/she must change the font of the text area from Options/Fonts.
The first row and column contain numbers which are used for calculating the code of each character. The calculation is very simple. When we find the character in the table, we take the number from the cell, which is in the same row in the first column and add it to the number in the cell at the same column (with the character) in the first row. The result is the new character code.
Example: How do we get the number 1060 for the character in the Char preview? First, we find the location of the character in the Unicode Table Preview. In the picture above it is in the second column and in the 11th row. The number in the same row in the first column is 1060. The number in the same column in the first row is 0. So the final sum is 1060.
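The arithmetic of the example can be written out in a couple of lines. The function name is hypothetical; the two arguments are the numbers read from the first column and the first row of the table.

```python
def code_from_table(row_number, column_number):
    """Add the number from the first column to the number from the first row."""
    return row_number + column_number


code = code_from_table(1060, 0)
character = chr(code)  # the character that Python associates with this code
print(code, character)
```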
Navigation on the Unicode table can be done by using the two buttons: Page Up and Page Down situated on the right side next to the table. The small rectangle in some of the table cells means that for this code, there is no support by the current font.
As can be seen, constructing a new keyboard is not easy. The interface is not very convenient and will be developed further in the future. Clearly, constructing a keyboard each time the system is started is impractical. Therefore the system supports a mechanism for saving and loading keyboards, so that once the user has constructed a keyboard map, it can be used repeatedly later. This is done by using the Control Panel, which is situated just above the Unicode Table Preview. It contains 6 buttons:
- Apply - sets the new constructed keyboard as a current active auxiliary keyboard for the system and closes the manager window.
- Exit - closes the manager window without applying the changes to the auxiliary keyboard (if any).
- SetDefault - loads the default keyboard (embedded in the CLaRK System) in the manager. It is the standard phonetic Cyrillic keymap.
- New - resets the keyboard in the manager. This means that to each key will be attached its own character.
- Load - loads a new, previously saved keyboard. When the button is pressed, the user is offered a list of all saved keyboards to choose one.
- Save - saves the keyboard in the manager for future use. When the button is pressed, the user is asked to enter a name for the new keyboard. By this name, the keyboard can be identified and reloaded later.
- Fonts
This dialog window provides a tool for changing the fonts of several key components of the system. This tool concerns only the graphical interface. The reason is that the CLaRK System uses Unicode character encoding, which allows the usage of a great range of different characters from different alphabets. Unfortunately, not every font supports the whole character table. In general, fonts are defined for a specific use and support 2 or 3 different alphabets. This manager allows changing the fonts of the components independently. The components for which the font can be changed are:
- Text Window - this is the text area on the right side of the system main panel. This is the place where the text of the document appears.
- Tree Window - this is the component on the left side of the system main panel where the tree of the document structure appears.
- Attribute Table - a table, situated just below the Tree Window. It gives information about the attributes of the currently selected element.
- Error Messages - this is the component at the bottom of the main system panel, where the error messages appear.
- Tables - this sets the font of all tables in the system (Grammar editor, Tokenizer editor, ...).
- Fields - this sets the font of all text fields in the system.
The dialog window:
The dialog contains 5 sections as follows:
- Font Chooser - the panel on the left, showing all available fonts for the hardware system. The changing of the font for a given component can be done by choosing a new font entry from here.
- Component Chooser - it is situated in the upper right corner of the dialog window. From it the user chooses the component to change font to.
- Font Style Modificator - changes the style of the font (Regular, Bold, Italics and Underlined).
- Font Size Chooser - changes the size of the currently selected font. The font size can vary in the range from 5 to 50. If the user enters a number out of this range, the value is automatically corrected to 5 or 50. If the input is not a number, the old value is restored. When the user enters a new value for a font size, s/he must hit the Enter key in order to refresh the preview component.
- Font Previewer - makes a preview of the currently chosen font with a specified font style.
Note: if the text in the font preview does not change when a new style is chosen, it means that the font does not support this style.
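The correction rule of the Font Size Chooser can be sketched like this. It is a hedged illustration only: the function name is hypothetical, and the sketch assumes that "a number" means a whole number.

```python
MIN_FONT_SIZE = 5
MAX_FONT_SIZE = 50


def normalize_font_size(user_input, current_size):
    """Clamp a typed font size to 5..50; keep the old value for non-numbers."""
    try:
        size = int(user_input.strip())
    except ValueError:
        return current_size  # the input is not a number: restore the old value
    return max(MIN_FONT_SIZE, min(MAX_FONT_SIZE, size))


print(normalize_font_size("3", 12))
print(normalize_font_size("72", 12))
print(normalize_font_size("abc", 12))
```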
- Visuals
This option can be used for changing the colors of the different components (tags, text, attributes) in the text area(s). The available colors are all the colors supported by the specific hardware and software environment where the system is used. The color selection is done with a standard color chooser (computer architecture dependent).
Here is the dialog which appears after choosing the "Visuals" option:
The dialog window contains two sections:
- Colors Info
This section is responsible for the color selection for the different components. The colors of the buttons on the left side indicate the corresponding components' colors. Pressing a button opens a color chooser. If a new color is chosen, after closing the chooser, the background of the corresponding button is changed to the new selection. Otherwise it remains the same. The components which can change their colors are:
- Tags (Tag Color)
- Text (Text Color)
- Attribute Values (Attribute Color)
- Background (Background Color).
Here is a preview of the color settings above:
- Control Buttons:
- OK Button - Applies the new color settings.
- Reset Button - Resets the color settings as follows:
- tag color - pure blue;
- text color - pure black;
- attribute value color - pure green;
- background color - light gray.
- Cancel Button - Cancels the current color settings.
- Encoding Correction
This option can be used when the user works with files which use an 8-bit character encoding (like ASCII). It is used for correct mapping between ASCII and Unicode character encodings. Because of the size limitations of the ASCII format and the need to use different symbols, there are many character-sets which use one and the same code range. The problem here is how to determine which character-set should be used for a certain ASCII file. Unfortunately, very often such information is not available and the system can make a wrong decision when reading a file. For example, the user expects to read a file containing a Hebrew text, but the system decides that it is a Cyrillic text and interprets it wrongly in Unicode. So the user needs to specify which character-set the system must use. This is where the Char Encoding Corrector can be used. Here is a screen-shot of the dialog window:
The choice list at the top of the window contains all the character-sets supported by the CLaRK System. For the moment the system supports 31 standard character-sets:
- Arabic (Windows-1256)
- Baltic (Windows-1257)
- Cyrillic (Windows-1251)
- Greek (Windows-1253)
- Hebrew (Windows-1255)
- Latin 1 (Windows-1250)
- Latin 2 (Windows-1252)
- Latin 5 (Windows-1254)
- Thai (Windows-874)
- Viet Nam (Windows-1258)
- Arabic (ISO 8859-6)
- Baltic (ISO 8859-4)
- Cyrillic (ISO 8859-5)
- Greek (ISO 8859-7)
- Hebrew (ISO 8859-8)
- Latin 1 (ISO 8859-1)
- Latin 2 (ISO 8859-2)
- Latin 3 (ISO 8859-3)
- Latin 9 (ISO 8859-15)
- Turkish (ISO 8859-9)
- Arabic (OEM-720)
- Baltic (OEM-775)
- Cyrillic DOS (OEM-855)
- Greek (OEM-737)
- Hebrew (OEM-862)
- Latin 2 (OEM-852)
- Multilingual Latin 1 (OEM-850)
- Multilingual Latin 1 + euro (OEM-858)
- Russian (Cyrillic 2) (OEM-866)
- Turkish (OEM-857)
- US Codepage (OEM-437)
The table in the center shows a preview of the currently selected character-set. The table contains the symbols with codes in the range from 128 to 255. Changing the selected character-set refreshes the content of the table. If the user is not sure which character-set must be used, s/he can choose the first option from the list: (System Default). This will make the system use the default character-set of the specific computer architecture and operating system.
The newly selected character-set can be applied with the Apply button or rejected with the Cancel button. If a new character-set is applied, it will be taken into consideration each time an ASCII file is opened, for example when importing/exporting documents, compiling DTDs, etc.
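The Hebrew versus Cyrillic confusion mentioned above is easy to reproduce. The sketch below uses Python's built-in codecs cp1255 and cp1251 as stand-ins for the Windows-1255 and Windows-1251 entries of the list; the preview function is a hypothetical analogue of the preview table, not the system's code.

```python
def preview(codec):
    """Return {code: character} for the codes 128..255 of an 8-bit codec."""
    return {
        code: bytes([code]).decode(codec, errors="replace")
        for code in range(128, 256)
    }


hebrew_text = "שלום"
raw = hebrew_text.encode("cp1255")  # the bytes as stored in the 8-bit file

right = raw.decode("cp1255")  # read with the intended character-set
wrong = raw.decode("cp1251")  # the same bytes read as Cyrillic

print(right)
print(wrong)
print(preview("cp1251")[0xE0], preview("cp1255")[0xE0])
```

The bytes are identical in both cases; only the chosen character-set decides which Unicode characters they become.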
- Toolbar
Visualizes and hides the toolbar from the main system window.
- Add Default Attributes On Loading
For each element in an XML document, a set of default attributes can be defined in the DTD. These are attributes which are not present in the elements explicitly, but which are assumed to be there with a default value set in the DTD. If this option is selected, each time a document is opened, every default attribute that is absent from an element is explicitly added to it with its default value.
- Show Current Node Path
Enables/disables the showing of the node path of the currently selected node in the tree. This node path is displayed in the status bar at the bottom of the editor. The node path is a valid XPath expression in abbreviated syntax. It can be selected in the status bar, copied and pasted wherever it is needed.
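A node path of this kind can be computed by walking from the selected node up to the root. The following sketch is only an approximation: the exact form of the path that the system prints is not specified here, so the positional steps (book[2]) and the function name are assumptions.

```python
import xml.etree.ElementTree as ET


def node_path(root, target):
    """Build an abbreviated XPath from the root element to the target element."""
    # ElementTree keeps no parent links, so build a child -> parent map first.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not None:
        parent = parents.get(node)
        if parent is None:
            steps.append(node.tag)  # the document element
        else:
            same_name = [child for child in parent if child.tag == node.tag]
            steps.append(f"{node.tag}[{same_name.index(node) + 1}]")
        node = parent
    return "/" + "/".join(reversed(steps))


root = ET.fromstring(
    "<books><book><title>A</title></book><book><title>B</title></book></books>"
)
selected = root[1][0]  # the <title> of the second <book>
path = node_path(root, selected)
print(path)
```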
- Simple Tags
An icon on the toolbar
Shows and hides the tags in the text area. If the tags are hidden in the area, square brackets are shown in their place: [ for the opening tags and ] for the closing tags. If the showing of attributes in the area is activated and the tags are hidden, the attributes are not visible at all.
- Show Attributes In Area
An icon on the toolbar
Enables/disables the showing of the attributes in the text areas. If the attributes are shown in the area, they are not editable, that is, they can not be removed, added or modified. Attribute management is supported by using a right mouse click on the table below the tree panel of the editor.
Other
- Parse error messages
These are the messages given while importing an external document. They all mean that the document is not well-formed.
- "Doctype declaration not valid at line line_no, position position_no !"
This error is given when the file containing the document also contains a DTD and this DTD cannot be parsed successfully.
- "Invalid character at line line_no, position position_no !"
This is given when there are characters other than white space characters at the beginning of the document.
Example:
Asdfg <books>…..</books>
- "No document or wrong char encoding!"
This is given for a blank document file or when the character encoding is not recognized. The character encoding is set when the user is asked to point to the file containing the document.
- "CDATA section not closed at line line_no, position position_no !"
This is given when a CDATA section is missing its closing delimiter ']]>'.
Example:
<book><![CDATA[ Alice in wonderland</book>
- "Processing Instruction at line line_no, position position_no not closed!"
This is given when a Processing Instruction is opened but not closed. The missing end is '?>'. Processing instructions are parsed, but not processed further.
- "Comment at line line_no, position position_no not closed!"
This is given when a comment is opened but not closed. The missing end is '-->'. Comments are parsed, but not processed further.
- "Invalid element at line line_no, position position_no - <> !"
This is given when the parser finds an opening tag without a name, that is, the sequence '<>'.
- "Invalid element at line line_no, position position_no - </> !"
This is given when the parser finds a closing tag without a name, that is, the sequence '</>'.
- "Element not closed at line line_no, position position_no !"
This is given when an opening or closing tag is not properly closed.
Example:
<book author = "Lewis Carroll>… (missing closing ")
<book (end of document)
- "Invalid attribute at line line_no, position position_no !"
This is given when an attribute is not given a value.
Example:
<book author>
- "Attribute value not closed at line line_no, position position_no !"
This is given when an attribute value is not closed.
Example:
<book author = "Lewis Carroll>
- "Attribute declaration must be in the opening tag at line line_no, position position_no !"
This is given when the parser finds attribute declarations in the closing tag of an element.
Example:
<book>Alice in Wonderland</book author = "Lewis Carroll">
- "Invalid nesting of opening and closing tags at line line_no, position position_no. Expected <element> opened at line line_no, position position_no ."
This is given when the parser finds an element that is closed before all of its children are closed.
Example:
<books><book> Alice in Wonderland </books>
- "Attribute <attr_name> already declared at line line_no, position position_no !"
It is given when an attribute is declared more than once for the same element.
Example:
<book author = "Lewis" author = "Carroll">
- "Text not closed at line line_no, position position_no !"
It is given when the document ends with text.
Example:
<book>Alice in Wonderland (end of document)
- "Document not finished!"
It is given when the document element is not closed.
Example:
<books><book> Alice in Wonderland </book> (end of document)
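To see how a generic XML parser reacts to the same kinds of input, the examples above can be fed to Python's standard library parser. This is a sketch for comparison only: it is not the CLaRK parser, its messages and position conventions differ from the ones listed, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None for well-formed input, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as error:
        line, column = error.position
        return line, column, str(error)
    return None


samples = [
    "Asdfg <books></books>",                          # text before the root
    "<book><![CDATA[ Alice in wonderland</book>",     # CDATA section not closed
    "<book author></book>",                           # attribute without a value
    '<book author="Lewis" author="Carroll"></book>',  # attribute declared twice
    "<books><book> Alice in Wonderland </books>",     # invalid nesting
    "<books><book> Alice in Wonderland </book>",      # document not finished
]

for sample in samples:
    print(check_well_formed(sample))

print(check_well_formed("<books><book>Alice in Wonderland</book></books>"))
```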
- Validation error messages
These are various messages, which appear after applying the validation procedure to a document. All of them mean that the document is not valid and at the same time each of them gives a prompt about the error source.
- "Root must be "root_name" !"
This message is shown when the document element is other than the DOCTYPE of the DTD (or the DOCTYPE which was selected after the DTD compilation).
Example:
In the DTD:
<!DOCTYPE books [ ….
At the beginning of the document:
<library> ….
- "ID "attr_val" for attribute "attr_name" not found !"
There is an attribute of type IDREF (or IDREFS), but the id (ids) which it refers to is (are) not found in the document.
- "Duplicate ID for attribute "attr_name" !"
There are two or more elements which have attributes of type ID with the same value.
- "Entity "entity" not declared (in attribute "attr_name") !"
The attribute is of type ENTITY or ENTITIES, but it contains a value (values) that is (are) not declared in the DTD.
- "Element "element" not allowed as a child at that position for element "parent" "
This error message is given if some element cannot be placed in a certain position among the child nodes of another element.
Example:
In the DTD:
<!ELEMENT books (book+)>
In the document:
<books>
<book>…</book>
<author>…</author>
</books>
- "Element "element" not found!" or "Element "element" is not declared!"
The element is not declared in the DTD.
- "Element "element" must be EMPTY!"
The element is declared in the DTD as an element with empty content, but in the document it is used with non-empty content.
- "Content not finished checking type "element" !"
This message is given when the element requires more children to complete its content.
Example:
In the DTD:
<!ELEMENT book (title, author+, publisher)>
In the document:
<book>
<title>Alice in Wonderland</title>
<author>Lewis Carroll</author>
</book>
- "#REQUIRED attributes missing! (list_of_REQUIRED_attr)" or "Required attribute "attr_name" for element "element" is missing!"
The message is given when an element does not contain a #REQUIRED attribute.
- "Element "element" has no attribute named "attr_name" !"
The message is given when an element is assigned an attribute which was not declared for the element's type in the DTD.
- "Attribute "attr_name" must contain only one token - "attr_value" !"
The attribute is of type NMTOKEN, but contains more than one token.
- "Bad ID - "id" - for attribute "attr_name" !"
The attribute is of type ID, but contains a value that cannot be an ID.
Example:
…<book id="123 456">…
- "Bad ID reference - "id_ref" - for attribute "attr_name" !"
The attribute is of type IDREF, but contains a value that cannot be an ID.
- "Value "attr_value" of attribute "attr_name" must be among (list_of_values)!"
The attribute has a value which is not among the values allowed for it.
Example:
In the DTD:
<!ATTLIST author
title ( Mr. | Ms. | Miss. ) #IMPLIED >
In the document:
…<author title = "Dr.">…
- "Attribute "attr_name" has a FIXed default value - "def_value", not "wrong_value !"
The message is given when the document gives an attribute a value different from the FIXED value declared for it in the DTD.
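Since the content of an element is defined in the DTD by a regular expression, the two content-related checks can be illustrated by matching the sequence of child names against a pattern. The sketch below is a simplification under stated assumptions: the CONTENT_MODELS table is a hand-written translation of the two example declarations, the function name is hypothetical, and the sketch only reports which elements fail, without telling a misplaced child apart from unfinished content.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models (an assumption for this sketch):
#   <!ELEMENT books (book+)>
#   <!ELEMENT book (title, author+, publisher)>
# Each child name is followed by one space so that names cannot run together.
CONTENT_MODELS = {
    "books": r"(book )+",
    "book": r"title (author )+publisher ",
}


def invalid_elements(root):
    """Return the names of elements whose children violate their content model."""
    errors = []
    for element in root.iter():
        model = CONTENT_MODELS.get(element.tag)
        if model is None:
            continue  # no model known for this element in the sketch
        children = "".join(child.tag + " " for child in element)
        if not re.fullmatch(model, children):
            errors.append(element.tag)
    return errors


unfinished = ET.fromstring(
    "<books><book><title>Alice in Wonderland</title>"
    "<author>Lewis Carroll</author></book></books>"
)
misplaced = ET.fromstring(
    "<books><book><title>t</title><author>a</author>"
    "<publisher>p</publisher></book><author>x</author></books>"
)
print(invalid_elements(unfinished))
print(invalid_elements(misplaced))
```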
- Main Editor Components
CLaRK System Online User Manual
This is a description of the most important components of the editor of the CLaRK System. The picture above is one example configuration of the system with one opened (work) document. The red labels point to the corresponding components, which are described below:
- Tree Panel
This panel shows the tree structure of the current document. When a new document is opened, the tree structure is folded to its root node. The user can then expand it by clicking on the branches of the tree. The nodes in the tree are painted in two different styles depending on the type of the corresponding document nodes: folder-like icons (the yellow ones above) for the Element nodes and black dot icons for the Text nodes. The names of the Element nodes are drawn in two colors, depending on their validity: when a node contains an error which makes it invalid, it is colored in red. Otherwise it is blue. When pointing at a red node, its corresponding error message appears in the Error Message Panel window. The list of all validation errors can be seen under Validation error messages.
Basic tree restructuring operations can be applied here by right-clicking on the tree. A right-click over the selected node(s) makes a menu window appear on the screen. A description of all menu items can be seen under Tree Popup Menu.
When a node is selected in the tree panel, its corresponding tag (or text node) in the Text Area is selected too. A multiple selection of nodes in the tree can be made by pressing and holding the <Ctrl> key and clicking the nodes to select.
- Text Area
This area shows the current document in XML format. By default, the tags are drawn in blue and the text is black. These colors can be changed by using menu Options/Visuals. In this area some of the tags and/or their content can be hidden or shown, new lines and line offsets can be inserted. For details see Document Layout.
Here, the user can also use a right mouse click for editing the structure of the document. After right-clicking on a tag, a menu window appears with the same functions as the one in the Tree Panel. If the selection in the area is within a single text node, a list of tags appears and the user is asked to choose one of them to be put around the selected text.
When a tag is selected in the area, its corresponding node in the tree is selected too.
The system has the ability to show more than one Text Area for a document. Each area has its own layout which can be modified independently from the others. In this way, each area forms one view for the document. The different views are synchronized in a way that when a node is selected in one of them, the corresponding node in the others is also selected.
- Attribute Table
This table contains the attributes of the currently selected element node in the editor. The first column contains the attribute names. The second one contains the corresponding values. The second column is editable, that is, the user can modify the values of the attributes. By right-clicking, the user can add and remove attributes for the current element node.
- Error Message Panel
This is a list containing all the errors for the current document (if any). A full list of the error messages can be found under Validation error messages.
Double-clicking on an error message selects the node containing the corresponding error in the Tree Panel and in the Text Area.
- Status Bar
This component shows system messages. When an operation is being performed, the color of the text is red. Otherwise it is black. While the document is being edited, this bar shows the path from the root node to the current one using an abbreviated XPath expression.
- Keyboard Indicator
This button-indicator shows which keyboard-set is active in the system. It has two states: normal (in green) and auxiliary (in red). For more details about using keyboard-sets see Keyboard. Clicking on the indicator toggles its state (normal/auxiliary).
- Main Menu
A detailed description of the main menu can be found in this documentation.
- Toolbar
This toolbar holds shortcuts to the most frequently used functions from the Main Menu. Here is the list of the target menu items of all shortcuts, in the order of their icons:
- File / New
- File / Open
- File / Save
- File / Import
- File / Export
- Document / Validate
- Document / New View
- Options / Simple Tags
- Options / Show Attributes In Area
- Edit / Search
- Edit / Next
- Edit / Previous
- Scroll Buttons
These four buttons are used for scrolling the text area. The first two are used for scrolling line by line. The next two are for scrolling page by page.
The reason for creating these four buttons is that the Text Area does not contain the whole opened document, but only a fragment of it. The reason for this fragmentation is that showing a large document takes a huge amount of memory. Therefore the system tries to visualize as little of it as possible. So when the user navigates through the document, the Text Area dynamically changes its content. This text content generation cannot be controlled by a default scroller. That is the reason for using these special scroll buttons.
- Text Processing Buttons
The first three buttons are shortcuts to the Copy (Ctrl+C), Cut (Ctrl+X) and Paste (Ctrl+V) operations, applied ONLY to text data. If the selection contains not only text, but also tags, the operation is not executed.
The last button is a shortcut to menu Edit / Text Replace.
- Tree Popup Menu
The popup menu is shown when the user performs a right-click on a selected node or a set of selected nodes in the tree panel on the left side of the screen, or on an opening or closing tag in the text area. The menu commands allow the user to change the structure of the document, to apply grammars and constraints, and more.
Menu commands
- Delete Subtree
Allows the user to delete the selected node(s) in the tree with the entire subtree(s) below it (them). The system will warn the user if, after the deletion, the structure of the document is non-valid.
-
- Delete Node
Allows the user to delete only the selected nodes. The children of the selected nodes will be inserted as children of the parent of the corresponding selected node. The system will warn the user if, after the deletion, the structure of the document is non-valid.
-
- Insert Sibling
Allows the user to insert a following sibling to the selected nodes. The system will give the user a choice between valid tags (tags which can be inserted at the specified position according to the DTD) and all tags (all tags defined in the DTD).
-
- Insert Child
Allows the user to insert a first child to the selected nodes. The system will give the user a choice between valid tags (tags which can be inserted at the specified position according to the DTD) and all tags (all tags defined in the DTD).
-
- Change Parent
Allows the user to insert a parent of the selected node(s). When there is a multiple selection of nodes and the user chooses this item, the selection is changed so that only the ancestors of the selected nodes (or the nodes themselves) that have a common parent are selected.
After correcting the selection (if needed), a Change Parent dialog appears. This dialog allows the user to choose a tag, which will be the new parent of the (newly) selected nodes. The selected nodes are then replaced with the chosen tag that has as children the selected nodes (and their subtrees). The system allows the user to constrain the possible parents in 3 ways.
-
-
- To choose from tags, such that the parent node of the selected nodes will be valid after the operation is complete.
- To choose from tags that can have the selected nodes for children.
- To choose from tags that are produced from PARENT value constraints of the selected nodes.
-
Any combination of the above constraints is possible.
-
- Use Grammar
This item allows the user to apply a grammar over the content of the selected node(s). More details about grammar usage can be found in the description of the main menu items of CLaRK System (Tools/Grammars).
-
- Sort
This item sorts the nodes in the document starting from the selected node (using the selected node as a starting node). If the selection contains more than one node, the selected nodes themselves are sorted. The sorting is performed in groups, that is the selected nodes are grouped depending on a mutual parent and then the nodes are sorted within each group. More information about sorting nodes can be found in the description of the main menu items of CLaRK System (Tools/Sort Tools).
-
- RegExpr. Constraints
Allows the user to apply regular expression constraints over the content of the selected node(s) (or the node itself, if it is a text node). When the user selects this item, he/she is given a list of all regular expression constraints defined in the system. The manager, responsible for creating and modifying such constraints can be started from menu: Constraints/Regular Expression Constraints/Edit Regular Expression Constraints. More details about Regular Expression Constraints usage can be found in the description of the main menu items of CLaRK System (Constraints/Regular Expression Constraints).
-
- Invoke XSLT
This item applies an XSLT transformation over the selected node. This means that the system uses the selected node as the starting point for the XSLT, not the root of the document, which would be the case when running it from the main menu. If the XSLT returns more than one root, all the roots are inserted in the place of the selected node. This operation can not be applied to a multiple selection of nodes.
-
- Info
This item gives information about the selected node and its content. The given information includes: the path from the root node to the current one (abbr. XPath), the tokenizer attached to the node, the content of the node (tags & tokens). This operation can not be applied to a multiple selection of nodes.
-
- Trim Text Nodes
This item removes the leading and trailing white space characters (spaces, tabs and new lines) of the text nodes, descendants of the selected nodes and of the nodes themselves (if they are text nodes).
-
- Rename
This item renames the selected nodes (only if they are element nodes). The system will allow the user to choose from tags, such that the corresponding parent node(s) will be valid after the operation is complete or to choose from tags such that the selected nodes will be valid or both.
-
- Copy
This item copies the selected nodes, including their subtrees, to a copy buffer. After applying the Copy operation the buffer contains a set of trees - one for each selected node.
-
- Cut
This copies the selected nodes (including their subtrees) to a copy buffer and then deletes the nodes from the tree. The system will warn the user if, after the deletion, the structure of the document becomes non-valid. After applying the Cut operation, the copy buffer contains a set of trees - one for each selected node.
-
- Paste As Child
This pastes the copy buffer as a first child of the selected node(s). If the copy buffer contains more than one tree, the trees are inserted as neighbour children. The system will warn the user if, after the insertion, the structure of the document becomes non-valid. If it is applied to a multiple selection of nodes, the copy buffer is cloned once for each insertion performed.
-
- Paste As Sibling
This pastes the copy buffer as a following sibling of the selected node(s). If the copy buffer contains more than one tree, the trees are inserted as neighbour siblings of the selected node(s). The system will warn the user if, after the insertion, the structure of the document becomes non-valid. If it is applied to a multiple selection of nodes, the copy buffer is cloned once for each insertion performed.
-
- Expand Subtree
This item is used when the tree (or parts of it) on the left panel is folded and the user wants to see the whole substructure of it (them) unfolded. Instead of passing through all folded nodes and manually unfolding them, the user can expand everything, which is under a certain (selected) node by using this option. This option can be applied to more than one selected node.
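As an illustration of the Delete Node command described earlier in this list (the children of a deleted node become children of its parent), here is a minimal Python sketch. It uses the standard xml.etree.ElementTree module rather than the CLaRK internals, handles element nodes only, and the helper name delete_node is hypothetical.

```python
import xml.etree.ElementTree as ET

def delete_node(root, target):
    """Remove `target` but keep its children, splicing them into its parent."""
    # ElementTree has no parent pointers, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of[target]  # raises KeyError for the root, which has no parent
    index = list(parent).index(target)
    children = list(target)
    parent.remove(target)
    # Re-insert the children at the position the deleted node occupied.
    for offset, child in enumerate(children):
        parent.insert(index + offset, child)

doc = ET.fromstring("<doc><sec><p/><q/></sec><r/></doc>")
delete_node(doc, doc.find("sec"))
result = [child.tag for child in doc]
print(result)
```

The sketch performs no validation against a DTD; the warning about a non-valid structure is something the real system adds on top of this operation.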
-
- XML Path Language (XPath)
Abstract
XPath is a language for addressing parts of an XML document.
Status of this document
This document is based on the W3C Recommendation 16 November 1999 (http://www.w3.org/TR/1999/REC-xpath-19991116). It describes the XPath language according to the implementation used in the CLaRK System. The implementation covers almost the whole language. Because XPath is a general-purpose language, the implementation leaves out some minor features which are not needed for the system. On the other hand, it adds some extensions, which are described in this document. The implementation also covers the abbreviated syntax.
1 Introduction
XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer [XPointer]. The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.
In addition to its use for addressing, XPath is also designed so that it has a natural subset that can be used for matching (testing whether or not a node matches a pattern); this use of XPath is described in XSLT.
XPath models an XML document as a tree of nodes. There are different types of nodes, including element nodes, attribute nodes and text nodes. XPath defines a way to compute a string-value for each type of node. Some types of nodes also have names.
The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an object, which has one of the following four basic types:
- node-set (an unordered collection of nodes without duplicates)
- boolean (true or false)
- number (a floating-point number)
- string (a sequence of UCS characters)
Expression evaluation occurs with respect to a context. The context consists of:
- a node (the context node)
- a pair of non-zero positive integers (the context position and the context size)
- a function library
The context position is always less than or equal to the context size.
The function library consists of a mapping from function names to functions. Each function takes zero or more arguments and returns a single result. This document defines a core function library that the XPath implementation supports. For a function in the core function library, arguments and result are of the four basic types.
The context node, context position, and context size used to evaluate a subexpression are sometimes different from those used to evaluate the containing expression. Several kinds of expressions change the context node; only predicates change the context position and context size. When the evaluation of a kind of expression is described, it will always be explicitly stated if the context node, context position, and context size change for the evaluation of subexpressions; if nothing is said about the context node, context position, and context size, they remain unchanged for the evaluation of subexpressions of that kind of expression.
The grammar specified in this section applies to the attribute value after XML 1.0 normalization. So, for example, if the grammar uses the character <, this must not appear in the XML source as < but must be quoted according to XML 1.0 rules by, for example, entering it as &lt;. Within expressions, literal strings are delimited by single or double quotation marks, which are also used to delimit XML attributes. To avoid a quotation mark in an expression being interpreted by the XML processor as terminating the attribute value, the quotation mark can be entered as a character reference (&quot; or &apos;). Alternatively, the expression can use single quotation marks if the XML attribute is delimited with double quotation marks or vice-versa.
One important kind of expression is a location path. A location path selects a set of nodes relative to the context node. The result of evaluating an expression that is a location path is the node-set containing the nodes selected by the location path. Location paths can recursively contain expressions that are used to filter sets of nodes.
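Returning to the quoting rules for expressions stored in XML attributes, the following Python sketch shows both strategies using the standard library function xml.sax.saxutils.quoteattr. The element name rule and the attribute name select are made up for the illustration and are not part of the CLaRK System.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

expr_lt = "count(item) < 3"
expr_quotes = 'para[@type="warning"]'

# quoteattr escapes '<' as &lt; and wraps the value in double quotes.
attr_lt = quoteattr(expr_lt)
# The value contains double quotes, so quoteattr delimits it with single quotes.
attr_quotes = quoteattr(expr_quotes)

# After parsing, the XML processor hands back the original expression text.
round_trip = []
for attr in (attr_lt, attr_quotes):
    element = ET.fromstring("<rule select=%s/>" % attr)  # hypothetical element
    round_trip.append(element.get("select"))

print(attr_lt)
print(attr_quotes)
```

In both cases the expression seen by the XPath evaluator is the text after XML normalization, which is why the grammar is defined on that form.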
2 Location Paths
Although location paths are not the most general grammatical construct in the language, they are the most important construct and will therefore be described first.
Every location path can be expressed using a straightforward but rather verbose syntax. There are also a number of syntactic abbreviations that allow common cases to be expressed concisely. This section will explain the semantics of location paths using the unabbreviated syntax. The abbreviated syntax will then be explained by showing how it expands into the unabbreviated syntax.
Here are some examples of location paths using the unabbreviated syntax:
- child::para selects the para element children of the context node
- child::text() selects all text node children of the context node
- child::node() or child::* selects all the children of the context node, whatever their node type
- attribute::name selects the name attribute of the context node
- attribute::* selects all the attributes of the context node
- descendant::para selects the para element descendants of the context node
- ancestor::div selects all div ancestors of the context node
- ancestor-or-self::div selects the div ancestors of the context node and, if the context node is a div element, the context node as well
- descendant-or-self::para selects the para element descendants of the context node and, if the context node is a para element, the context node as well
- self::para selects the context node if it is a para element, and otherwise selects nothing
- child::chapter/descendant::para selects the para element descendants of the chapter element children of the context node
- child::*/child::para selects all para grandchildren of the context node
- / selects the document root (which is always the parent of the document element)
- /descendant::para selects all the para elements in the same document as the context node
- /descendant::olist/child::item selects all the item elements that have an olist parent and that are in the same document as the context node
- child::para[position()=1] selects the first para child of the context node
- child::para[position()=last()] selects the last para child of the context node
- child::para[position()=last()-1] selects the last but one para child of the context node
- child::para[position()>1] selects all the para children of the context node other than the first para child of the context node
- following-sibling::chapter[position()=1] selects the next chapter sibling of the context node
- preceding-sibling::chapter[position()=1] selects the previous chapter sibling of the context node
- /descendant::figure[position()=42] selects the forty-second figure element in the document
- /child::doc/child::chapter[position()=5]/child::section[position()=2] selects the second section of the fifth chapter of the doc document element
- child::para[attribute::type="warning"] selects all para children of the context node that have a type attribute with value warning
- child::para[attribute::type='warning'][position()=5] selects the fifth para child of the context node that has a type attribute with value warning
- child::para[position()=5][attribute::type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
- child::chapter[child::title='Introduction'] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
- child::chapter[child::title] selects the chapter children of the context node that have one or more title children
- child::*[self::chapter or self::appendix] selects the chapter and appendix children of the context node
- child::*[self::chapter or self::appendix][position()=last()] selects the last chapter or appendix child of the context node
There are two kinds of location path: relative location paths and absolute location paths.
A relative location path consists of a sequence of one or more location steps separated by /. The steps in a relative location path are composed together from left to right. Each step in turn selects a set of nodes relative to a context node. An initial sequence of steps is composed together with a following step as follows. The initial sequence of steps selects a set of nodes relative to a context node. Each node in that set is used as a context node for the following step. The sets of nodes identified by that step are unioned together. The set of nodes identified by the composition of the steps is this union. For example, child::div/child::para selects the para element children of the div element children of the context node, or, in other words, the para element grandchildren that have div parents.
An absolute location path consists of / followed by a relative location path. The / selects the root node of the document as a context node for the next relative location path. A / by itself does not select the root node. This can be done by /self::*.
2.1 Location Steps
A location step has three parts:
- an axis, which specifies the tree relationship between the nodes selected by the location step and the context node,
- a node test, which specifies the node type or the name of the nodes selected by the location step, and
- zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.
The syntax for a location step is the axis name and node test separated by a double colon, followed by zero or more expressions each in square brackets. For example, in child::para[position()=1], child is the name of the axis, para is the node test and [position()=1] is a predicate.
The node-set selected by the location step is the node-set that results from generating an initial node-set from the axis and node-test, and then filtering that node-set by each of the predicates in turn.
The initial node-set consists of the nodes having the relationship to the context node specified by the axis, and having the node type or name specified by the node test. For example, a location step descendant::para selects the para element descendants of the context node: descendant specifies that each node in the initial node-set must be a descendant of the context; para specifies that each node in the initial node-set must be an element named para. The available axes are described in Axes. The available node tests are described in Node Tests.
The initial node-set is filtered by the first predicate to generate a new node-set; this new node-set is then filtered using the second predicate, and so on. The final node-set is the node-set selected by the location step. The axis affects how the expression in each predicate is evaluated and so the semantics of a predicate is defined with respect to an axis.
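The three parts of a location step can be pictured as a small pipeline: the axis produces candidates, the node test filters them, and each predicate filters the remainder in turn. The Python sketch below is only an illustration of that description, not the CLaRK implementation. It works on xml.etree.ElementTree elements (so text and attribute nodes are not modelled), covers a forward axis only, and the names descendant_axis and evaluate_step are hypothetical.

```python
import xml.etree.ElementTree as ET

def descendant_axis(context):
    # iter() yields the context itself first, then its descendants in document order.
    return list(context.iter())[1:]

def evaluate_step(context, axis, node_test, predicates=()):
    # Initial node-set: nodes on the axis that pass the node test.
    nodes = [n for n in axis(context) if node_test(n)]
    # Each predicate filters the result of the previous one; positions and
    # size are recomputed for every predicate.
    for predicate in predicates:
        size = len(nodes)
        nodes = [n for pos, n in enumerate(nodes, start=1) if predicate(n, pos, size)]
    return nodes

doc = ET.fromstring(
    "<doc><para id='a'/><div><para id='b'/><note/></div><para id='c'/></doc>"
)
is_para = lambda n: n.tag == "para"

# descendant::para
all_paras = evaluate_step(doc, descendant_axis, is_para)
# descendant::para[position()=last()]
last_para = evaluate_step(doc, descendant_axis, is_para, [lambda n, pos, size: pos == size])

all_ids = [n.get("id") for n in all_paras]
last_ids = [n.get("id") for n in last_para]
print(all_ids, last_ids)
```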
2.2 Axes
The following axes are available:
- the child axis contains the children of the context node
- the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute nodes
- the parent axis contains the parent of the context node, if there is one
- the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of the context node and the parent's parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node
- the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node, the following-sibling axis is empty
- the preceding-sibling axis contains all the preceding siblings of the context node; if the context node is an attribute node, the preceding-sibling axis is empty
- the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes
- the preceding axis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes
- the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element
- the self axis contains just the context node itself
- the descendant-or-self axis contains the context node and the descendants of the context node
- the ancestor-or-self axis contains the context node and the ancestors of the context node; thus, the ancestor-or-self axis will always include the root node
NOTE: The ancestor, descendant, following, preceding and self axes partition a document (ignoring attributes): they do not overlap and together they contain all the nodes in the document.
2.3 Node Tests
The node tests are divided into two categories: node type tests and node name tests.
The set of node type tests is fixed. They are:
- text()
Keeps only the text nodes of the initial node-set.
- text(<text>)
Keeps only the text nodes of the initial node-set which contain <text> as a substring of their text content. For example, child::text("play") returns all text nodes which are children of the context node and contain the token "play".
- node()
Does not filter the initial node-set. It is used when all the nodes selected from the axis are needed for further evaluation. As a shorthand, "*" can be used instead, i.e. child::* is the same as child::node().
- element()
Keeps only the element nodes of the initial node-set.
- attribute()
Keeps only the attribute nodes of the initial node-set. The initial node-set may contain nodes other than attributes.
- attribute(<attributeName>)
Keeps only the element nodes which have an attribute named <attributeName>. For example, child::attribute(id) takes only the element nodes, children of the context node, which have an attribute id.
- attribute(<attributeName> = "<attributeValue>")
Almost the same as the attribute(<attributeName>) node test, but with an additional restriction on the value of the attribute. For example, child::attribute(id="243") takes only those element nodes which have an attribute id set to the value 243.
- comment()
Keeps only the comment nodes of the initial node-set.
- processing-instruction()
Keeps only the processing-instruction nodes of the initial node-set.
The name node tests are used to filter the initial node-set for element nodes with a given name. All non-element nodes and element nodes with other names are removed from the set. For example, child::para takes only the para element nodes, children of the context node.
2.4 Predicates
An axis is either a forward axis or a reverse axis. An axis that only ever contains the context node or nodes that are after the context node in document order is a forward axis. An axis that only ever contains the context node or nodes that are before the context node in document order is a reverse axis. Thus, the ancestor, ancestor-or-self, preceding, and preceding-sibling axes are reverse axes; all other axes are forward axes. Since the self axis always contains at most one node, it makes no difference whether it is a forward or reverse axis. The proximity position of a member of a node-set with respect to an axis is defined to be the position of the node in the node-set ordered in document order if the axis is a forward axis and ordered in reverse document order if the axis is a reverse axis. The first position is 1.
A predicate filters a node-set with respect to an axis to produce a new node-set. For each node in the node-set to be filtered, the predicate is evaluated with that node as the context node, with the number of nodes in the node-set as the context size, and with the proximity position of the node in the node-set with respect to the axis as the context position; if the predicate evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.
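The effect of the axis direction on the context position can be sketched as follows. This is a simplified Python illustration under stated assumptions: nodes are represented by plain strings given in document order, the predicate is a Python callable returning a boolean, and the function name filter_by_predicate is hypothetical.

```python
def filter_by_predicate(nodes, reverse_axis, predicate):
    """nodes: candidates in document order; predicate(node, position, size) -> bool."""
    size = len(nodes)
    kept = []
    for index, node in enumerate(nodes):
        # Proximity position counts from the context node outwards: document
        # order on a forward axis, reverse document order on a reverse axis.
        position = size - index if reverse_axis else index + 1
        if predicate(node, position, size):
            kept.append(node)
    return kept

# Chapters that precede the context node, listed in document order.
preceding_chapters = ["ch1", "ch2", "ch3"]
is_first = lambda node, position, size: position == 1

# preceding-sibling::chapter[position()=1] picks the nearest previous chapter.
nearest_previous = filter_by_predicate(preceding_chapters, True, is_first)
# On a forward axis the same predicate would pick the first node in document order.
first_forward = filter_by_predicate(preceding_chapters, False, is_first)
print(nearest_previous, first_forward)
```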
A predicate is evaluated as an expression and the result is converted to a boolean. If the result is a number, the result will be converted to true if the number is equal to the context position and will be converted to false otherwise; if the result is not a number, then the result will be converted as if by a call to the boolean() function. Thus a location path para[3] is equivalent to para[position()=3]. In other words, if the context node has 4 para child nodes, then the predicate [3] (or [position()=3]) will be true only for the third para child, so the new node-set will contain only the third para child.
2.5 Abbreviated Syntax
Here are some examples of location paths using abbreviated syntax:
- para selects the para element children of the context node
- * selects all children of the context node
- text() selects all text node children of the context node
- @name selects the name attribute of the context node
- @* selects all the attributes of the context node
- para[1] selects the first para child of the context node
- para[last()] selects the last para child of the context node
- */para selects all para grandchildren of the context node
- /doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
- chapter//para selects the para element descendants of the chapter element children of the context node
- //para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
- //olist/item selects all the item elements in the same document as the context node that have an olist parent
- . selects the context node
- .//para selects the para element descendants of the context node
- .. selects the parent of the context node
- ../@lang selects the lang attribute of the parent of the context node
- para[@type="warning"] selects all para children of the context node that have a type attribute with value warning
- para[@type="warning"][5] selects the fifth para child of the context node that has a type attribute with value warning
- para[5][@type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
- chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
- chapter[title] selects the chapter children of the context node that have one or more title children
- employee[@secretary and @assistant] selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
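Several of the abbreviated paths above can be tried outside the CLaRK System with Python's standard xml.etree.ElementTree module. This is only an approximation: ElementTree implements a limited subset of XPath, returns element nodes only (so * does not include text nodes there), and is not the implementation described in this document. The sample document is invented for the illustration.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<doc>"
    "<chapter><title>Introduction</title>"
    "<para type='warning'>w1</para><para>p2</para></chapter>"
    "<chapter><para>p3</para></chapter>"
    "</doc>"
)
first_chapter = doc.find("chapter")  # used as the context node below

def texts(nodes):
    return [n.text for n in nodes]

para_children = texts(first_chapter.findall("para"))            # para
first_para = texts(first_chapter.findall("para[1]"))            # para[1]
last_para = texts(first_chapter.findall("para[last()]"))        # para[last()]
grandchild_paras = texts(doc.findall("*/para"))                 # */para
descendant_paras = texts(doc.findall(".//para"))                # .//para
warning_paras = texts(first_chapter.findall("para[@type='warning']"))
intro_chapters = doc.findall("chapter[title='Introduction']")   # chapter[title="Introduction"]
titled_chapters = doc.findall("chapter[title]")                 # chapter[title]

print(para_children, first_para, last_para, warning_paras)
```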
The most important abbreviation is that child:: can be omitted from a location step. In effect, child is the default axis. For example, a location path div/para is short for child::div/child::para.
There is also an abbreviation for attributes: attribute:: can be abbreviated to @. For example, a location path para[@type="warning"] is short for child::para[attribute::type="warning"] and so selects para children with a type attribute with value equal to warning.
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document; div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.
NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.
A location step of . is short for self::node(). This is particularly useful in conjunction with //. For example, the location path .//para is short for self::node()/descendant-or-self::node()/child::para and so will select all para descendant elements of the context node.
Similarly, a location step of .. is short for parent::node(). For example, ../title is short for parent::node()/child::title and so will select the title children of the parent of the context node.
3 Expressions
3.1 Function Calls
A function call is evaluated by evaluating each of the arguments, converting each argument to the type required by the function, and finally calling the function, passing it the converted arguments. It is an error if the number of arguments is wrong or if an argument cannot be converted to the required type.
An argument is converted to type string as if by calling the string() function. An argument is converted to type number as if by calling the number() function. An argument is converted to type boolean as if by calling the boolean() function. An argument that is not of type node-set cannot be converted to a node-set.
Examples:
child::para[count(child::*) > 3] returns all para child elements of the context node which have more than 3 children. The evaluation proceeds in the following sequence: all child nodes of the context node are collected; the name node test is applied and only the para elements are kept; then, for each para element as context node, the count() function is evaluated. To do this, the location path child::* is first evaluated with each of the para elements as context. Each time, the location path returns a node-set (in general a different one), which is passed as an argument to the count() function. The function returns a number, which is then used for further evaluation of the predicate.
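The evaluation sequence just described can be mirrored step by step in Python. The sketch below is an assumption-laden illustration using xml.etree.ElementTree, where iterating an element yields only its element children; in the CLaRK implementation * is equivalent to node(), so text children would be counted as well. The sample document is invented.

```python
import xml.etree.ElementTree as ET

context = ET.fromstring(
    "<section>"
    "<para id='short'><b/><i/></para>"
    "<para id='long'><b/><i/><u/><s/></para>"
    "<note><b/><i/><u/><s/></note>"
    "</section>"
)

selected = []
for para in context.findall("para"):   # child axis plus the name test para
    children = list(para)              # child::* evaluated with para as context
    if len(children) > 3:              # count(...) > 3, the predicate
        selected.append(para)

selected_ids = [p.get("id") for p in selected]
print(selected_ids)
```

The note element has four children too, but it is removed by the name test before the predicate is ever evaluated.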
3.2 Node-sets
A location path can be used as an expression. The expression returns the set of nodes selected by the path.
The | operator computes the union of its operands, which must be node-sets.
Example: /descendant-or-self::para | /descendant-or-self::head returns the union of all para and head element nodes in a document. The two location paths are evaluated independently and the two results are merged.
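A union of two node-sets can be sketched in Python as follows. Nodes are compared by identity so that a node occurring in both operands appears only once. Returning the result in document order is a presentation choice of this sketch, since a node-set itself is unordered; the function name union is hypothetical and ElementTree stands in for the real document model.

```python
import xml.etree.ElementTree as ET

def union(root, first, second):
    # Collect the identities of all nodes in either operand (duplicates collapse).
    wanted = {id(n) for n in first} | {id(n) for n in second}
    # Walk the document once so the result comes out in document order.
    return [n for n in root.iter() if id(n) in wanted]

doc = ET.fromstring(
    "<doc><head>h1</head><para>p1</para><head>h2</head><para>p2</para></doc>"
)
paras = doc.findall(".//para")
heads = doc.findall(".//head")

# The second operand deliberately repeats one para to show duplicate removal.
merged = union(doc, paras, heads + paras[:1])
merged_texts = [n.text for n in merged]
print(merged_texts)
```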
3.3 Booleans
An object of type boolean can have one of two values, true and false.
An or expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if either value is true and false otherwise.
An and expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if both values are true and false otherwise.
An equality (A = B) or a relational (A < B) expression is evaluated by comparing the objects that result from evaluating the two operands. Comparison of the resulting objects is defined in the following three paragraphs. First, comparisons that involve node-sets are defined in terms of comparisons that do not involve node-sets; this is defined uniformly for =, !=, <=, <, >= and >. Second, comparisons that do not involve node-sets are defined for = and !=. Third, comparisons that do not involve node-sets are defined for <=, <, >= and >.
If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true. If one object to be compared is a node-set and the other is a number, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the number to be compared and on the result of converting the string-value of that node to a number using the number() function is true. If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true. If one object to be compared is a node-set and the other is a boolean, then the comparison will be true if and only if the result of performing the comparison on the boolean and on the result of converting the node-set to a boolean using the boolean() function is true.
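The "there is a node such that" wording has a consequence that is easy to miss: for a node-set, = and != can both be true at the same time. The following Python sketch illustrates the rules above for = and != only. It is a simplification under stated assumptions: a node-set is represented by the list of string-values of its nodes, the function names are hypothetical, and Python's float() accepts some spellings (such as exponents) that XPath's number() does not.

```python
import operator

def to_number(text):
    # Rough stand-in for number(): unparsable strings become NaN.
    try:
        return float(text.strip())
    except ValueError:
        return float("nan")

def compare_node_set(string_values, other, op):
    """string_values: string-values of the nodes; op: operator.eq or operator.ne."""
    if isinstance(other, bool):  # test bool first, since bool is an int subclass
        return op(len(string_values) > 0, other)
    if isinstance(other, (int, float)):
        return any(op(to_number(s), other) for s in string_values)
    if isinstance(other, str):
        return any(op(s, other) for s in string_values)
    # Otherwise `other` is a node-set too: compare every pair of string-values.
    return any(op(a, b) for a in string_values for b in other)

titles = ["Introduction", "Summary"]
equals_intro = compare_node_set(titles, "Introduction", operator.eq)
differs_from_intro = compare_node_set(titles, "Introduction", operator.ne)
empty_differs = compare_node_set([], "Introduction", operator.ne)
print(equals_intro, differs_from_intro, empty_differs)
```

An empty node-set never satisfies either comparison against a string or number, because there is no node to witness it.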
When neither object to be compared is a node-set and the operator is = or !=, then the objects are compared by converting them to a common type as follows and then comparing them. If at least one object to be compared is a boolean, then each object to be compared is converted to a boolean as if by applying the boolean() function. Otherwise, if at least one object to be compared is a number, then each object to be compared is converted to a number as if by applying the number() function. Otherwise, both objects to be compared are converted to strings as if by applying the string() function. The = comparison will be true if and only if the objects are equal; the != comparison will be true if and only if the objects are not equal. Two booleans are equal if either both are true or both are false. Two strings are equal if and only if they consist of the same sequence of UCS characters.
When neither object to be compared is a node-set and the operator is <=, <, >= or >, then the objects are compared by converting both objects to numbers and comparing the numbers. The < comparison will be true if and only if the first number is less than the second number. The <= comparison will be true if and only if the first number is less than or equal to the second number. The > comparison will be true if and only if the first number is greater than the second number. The >= comparison will be true if and only if the first number is greater than or equal to the second number.
3.4 Numbers
The numeric operators convert their operands to numbers as if by calling the number() function.
The + operator performs addition.
The - operator performs subtraction.
NOTE: Since XML allows - in names, the - operator typically needs to be preceded by whitespace. For example, foo-bar evaluates to a node-set containing the child elements named foo-bar; foo - bar evaluates to the difference of the result of converting the string-value of the first foo child element to a number and the result of converting the string-value of the first bar child to a number.
The div operator performs floating-point division.
The mod operator returns the remainder from a truncating division. For example,
- 5 mod 2 returns 1
- 5 mod -2 returns 1
- -5 mod 2 returns -1
- -5 mod -2 returns -1
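The truncating behaviour of mod differs from the remainder operator of some programming languages. As an illustration only, the Python sketch below reproduces the four results listed above with math.fmod, which also truncates, and contrasts them with Python's % operator, which uses floored division. The function name xpath_mod is hypothetical, and returning NaN for a zero divisor or an infinite dividend is an assumption based on IEEE 754 arithmetic rather than something stated in this section.

```python
import math

def xpath_mod(a, b):
    """Remainder of a truncating division; the result takes the sign of a."""
    if b == 0 or math.isinf(a):
        # Assumption: follow IEEE 754 and yield NaN (math.fmod would raise here).
        return float("nan")
    return math.fmod(a, b)

results = [xpath_mod(5, 2), xpath_mod(5, -2), xpath_mod(-5, 2), xpath_mod(-5, -2)]

# Python's own % operator floors instead of truncating, so the signs differ.
python_style = [5 % 2, 5 % -2, -5 % 2, -5 % -2]
```

The first list matches the mod examples above; the second shows why % is not a drop-in replacement when operands are negative.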
4 Core Function Library
This section describes the functions that the XPath implementation includes in the function library that is used to evaluate expressions.
Each function in the function library is specified using a function prototype, which gives the return type, the name of the function, and the type of the arguments. If an argument type is followed by a question mark, then the argument is optional; otherwise, the argument is required.
4.1 Node Set Functions
Function: number last()
The last function returns a number equal to the context size from the expression evaluation context.
Function: number position()
The position function returns a number equal to the context position from the expression evaluation context.
Function: number count(node-set)
The count function returns the number of nodes in the argument node-set.
Function: string name(node-set?)
The name function returns a string containing the name of the node in the argument node-set that is first in document order. If the argument node-set is empty or the first node has no name, an empty string is returned. If the argument is omitted, it defaults to a node-set with the context node as its only member.
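To make the node-set functions above more concrete, here is a small Python sketch using the standard xml.etree.ElementTree module. It is only an analogy: ElementTree supports a limited XPath subset and is not the engine used by the CLaRK System. The sample document and variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<list><item>a</item><item>b</item><item>c</item></list>")

items = root.findall("item")                  # the node-set selected by "item"
item_count = len(items)                       # analogue of count(item)
second_text = root.find("item[2]").text       # node whose context position is 2
last_text = root.find("item[last()]").text    # node whose position equals last()
first_name = items[0].tag if items else ""    # analogue of name(item)
```

Here the context size for the predicate is 3, so item[last()] selects the third item, and name applied to an empty node-set would give an empty string.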
4.2 String Functions
Function: string string(object?)
The string function converts an object to a string as follows:
- A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
- A number is converted to a string as follows:
  - NaN is converted to the string NaN
  - positive zero is converted to the string 0
  - negative zero is converted to the string 0
  - positive infinity is converted to the string Infinity
  - negative infinity is converted to the string -Infinity
  - if the number is an integer, the number is represented in decimal form as a number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
  - otherwise, the number is represented in decimal form as a number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative
- The boolean false value is converted to the string false. The boolean true value is converted to the string true.
- An object of a type other than the four basic types will cause an exception (error).
If the argument is omitted, it defaults to a node-set with the context node as its only member.
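The conversion rules above can be summarised in a short Python sketch. This is an illustration under stated assumptions, not the system's code: a node-set is modelled as a list of string-values in document order, the function name xpath_string is hypothetical, and the Decimal formatting is one way to avoid exponent notation for non-integer values.

```python
import math
from decimal import Decimal

def xpath_string(value):
    """Convert a boolean, number, string or node-set (list of strings) to a string."""
    if isinstance(value, bool):               # checked first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, str):
        return value
    if isinstance(value, list):               # node-set: string-value of the first node
        return value[0] if value else ""
    if isinstance(value, (int, float)):
        x = float(value)
        if math.isnan(x):
            return "NaN"
        if math.isinf(x):
            return "Infinity" if x > 0 else "-Infinity"
        if x == 0:                            # covers positive and negative zero
            return "0"
        if x.is_integer():
            return str(int(x))                # no decimal point, no leading zeros
        return format(Decimal(repr(x)), "f")  # plain decimal form, no exponent
    raise TypeError("not one of the four basic types")
```

Note that negative zero and positive zero both produce "0", so the sign of zero is lost in the conversion.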
Function: string concat(string, string, string*)
The concat function returns the concatenation of its arguments.
Function: boolean starts-with(string, string)
The starts-with function returns true if the first argument string starts with the second argument string, and otherwise returns false.
Function: boolean contains(string, string)
The contains function returns true if the first argument string contains the second argument string, and otherwise returns false.
Function: string substring-before(string, string)
The substring-before function returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-before("1999/04/01","/") returns 1999.
Function: string substring-after(string, string)
The substring-after function returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-after("1999/04/01","/") returns 04/01, and substring-after("1999/04/01","19") returns 99/04/01.
Function: string substring(string, number, number?)
The substring function returns the substring of the first argument starting at the position specified in the second argument with length specified in the third argument. For example, substring("12345",2,3) returns "234". If the third argument is not specified, it returns the substring starting at the position specified in the second argument and continuing to the end of the string. For example, substring("12345",2) returns "2345".
More precisely, each character in the string is considered to have a numeric position: the position of the first character is 1, the position of the second character is 2 and so on.
The returned substring contains those characters for which the position of the character is greater than or equal to the rounded value of the second argument and, if the third argument is specified, less than the sum of the rounded value of the second argument and the rounded value of the third argument; rounding is done as if by a call to the round function. The following examples illustrate various unusual cases:
- substring("12345", 1.5, 2.6) returns "234"
- substring("12345", 0, 3) returns "12"
- substring("12345", 0 div 0, 3) returns ""
- substring("12345", 1, 0 div 0) returns ""
- substring("12345", -42, 1 div 0) returns "12345"
- substring("12345", -1 div 0, 1 div 0) returns ""
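The unusual cases above follow directly from the position rule once NaN and infinity are treated as ordinary floating-point values. The Python sketch below is a hedged illustration of the substring family; the function names are hypothetical, and the handling of an empty second argument in substring_before and substring_after is an assumption that this section does not spell out.

```python
import math

def xpath_round(x):
    """Round to the nearest integer, ties toward positive infinity."""
    if math.isnan(x) or math.isinf(x):
        return x
    low = math.floor(x)
    return low + 1 if x - low >= 0.5 else low

def substring(s, start, length=None):
    first = xpath_round(start)
    if length is None:
        return "".join(ch for pos, ch in enumerate(s, 1) if pos >= first)
    limit = first + xpath_round(length)       # may be NaN, e.g. -inf + inf
    # Any comparison with NaN is false, so such positions are never selected.
    return "".join(ch for pos, ch in enumerate(s, 1) if pos >= first and pos < limit)

def substring_before(s, sep):
    if sep == "":                             # assumption: nothing precedes an empty match
        return ""
    head, found, _ = s.partition(sep)
    return head if found else ""

def substring_after(s, sep):
    if sep == "":                             # assumption: everything follows an empty match
        return s
    _, found, tail = s.partition(sep)
    return tail if found else ""

nan, inf = float("nan"), float("inf")
unusual = [substring("12345", 1.5, 2.6), substring("12345", 0, 3),
           substring("12345", nan, 3), substring("12345", 1, nan),
           substring("12345", -42, inf), substring("12345", -inf, inf)]
```

The list unusual reproduces the six cases above in order; 0 div 0 is written as nan and 1 div 0 as inf.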
Function: number string-length(string?)
The string-length function returns the number of characters in the string. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
Function: string normalize-space(string?)
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are: SPACEs, TABs, new lines. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
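As a rough illustration of string-length and normalize-space, the following Python sketch treats exactly the characters listed above as whitespace (space, tab, and the two line-break characters, carriage return and line feed). The function names are hypothetical. Python's str.split() is deliberately avoided because it also splits on other Unicode whitespace.

```python
import re

def normalize_space(s):
    """Collapse runs of whitespace to one space, then trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")

def string_length(s):
    """Number of characters in the string."""
    return len(s)

cleaned = normalize_space("  CLaRK \t\n System  ")
cleaned_length = string_length(cleaned)
```

Collapsing first and trimming afterwards works because any leading or trailing run has already been reduced to a single space.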
Function: string translate(string, string, string)
The translate function returns the first argument string with occurrences of characters in the second argument string replaced by the character at the corresponding position in the third argument string. For example, translate("bar","abc","ABC") returns the string BAr. If there is a character in the second argument string with no character at a corresponding position in the third argument string (because the second argument string is longer than the third argument string), then occurrences of that character in the first argument string are removed. For example, translate("--aaa--","abc-","ABC") returns "AAA". If a character occurs more than once in the second argument string, then the first occurrence determines the replacement character. If the third argument string is longer than the second argument string, then excess characters are ignored.
4.3 Boolean Functions
Function: boolean boolean(object)
The boolean function converts its argument to a boolean as follows:
- a number is true if and only if it is neither positive or negative zero nor NaN
- a node-set is true if and only if it is non-empty
- a string is true if and only if its length is non-zero
- An object of a type other than the four basic types will cause an exception (error).
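The boolean conversion rules can be sketched in Python as follows. This is an illustration only: the function name xpath_boolean is hypothetical and a node-set is modelled as a list. The most common surprise is that any non-empty string is true, including the string "false".

```python
import math

def xpath_boolean(value):
    """Convert a boolean, number, string or node-set (list) to a boolean."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return not (value == 0 or math.isnan(value))   # zero of either sign and NaN are false
    if isinstance(value, (str, list)):
        return len(value) > 0                          # non-empty string or node-set
    raise TypeError("not one of the four basic types")

string_false_is_true = xpath_boolean("false")
```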
Function: boolean not(boolean)
The not function returns true if its argument is false, and false otherwise.
Function: boolean true()
The true function returns true.
Function: boolean false()
The false function returns false.
4.4 Number Functions
Function: number number(object?)
The number function converts its argument to a number as follows:
- a string that consists of optional whitespace followed by an optional minus sign followed by a number followed by whitespace is converted to the number that is nearest to the mathematical value represented by the string; any other string is converted to NaN
- boolean true is converted to 1; boolean false is converted to 0
- a node-set is first converted to a string as if by a call to the string function and then converted in the same way as a string argument
- an object of a type other than the four basic types will cause an exception (error).
If the argument is omitted, it defaults to a node-set with the context node as its only member.
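The string rule above is stricter than the number parsers of most programming languages. The Python sketch below illustrates it under two assumptions that this section does not state explicitly: that "a number" means digits with an optional fractional part (no exponent and no leading plus sign), and that a node-set is modelled as a list of string-values. The names xpath_number and _NUMBER are hypothetical.

```python
import math
import re

# Optional whitespace, optional minus sign, digits with optional fraction, whitespace.
_NUMBER = re.compile(r"[ \t\r\n]*(-?(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+))[ \t\r\n]*")

def xpath_number(value):
    """Convert a boolean, number, string or node-set (list of strings) to a float."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, list):               # node-set: use the first string-value
        value = value[0] if value else ""
    if isinstance(value, str):
        match = _NUMBER.fullmatch(value)
        return float(match.group(1)) if match else float("nan")
    raise TypeError("not one of the four basic types")

padded = xpath_number(" 12.5 ")
exponent = xpath_number("1e3")                # not a number under the assumed syntax
```

Under these assumptions a string such as "1e3" or "+5" converts to NaN, although Python's own float() would accept both.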
Function: number sum(node-set)
The sum function returns the sum, for each node in the argument node-set, of the result of converting the string-value of the node to a number.
Function: number floor(number)
The floor function returns the largest (closest to positive infinity) number that is not greater than the argument and that is an integer.
Function: number ceiling(number)
The ceiling function returns the smallest (closest to negative infinity) number that is not less than the argument and that is an integer.
Function: number round(number)
The round function returns the number that is closest to the argument and that is an integer. If there are two such numbers, then the one that is closest to positive infinity is returned. If the argument is NaN, then NaN is returned. If the argument is positive infinity, then positive infinity is returned. If the argument is negative infinity, then negative infinity is returned. If the argument is positive zero, then positive zero is returned. If the argument is negative zero, then negative zero is returned. If the argument is less than zero, but greater than or equal to -0.5, then negative zero is returned.
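The tie-breaking rule of round differs from Python's built-in round(), which rounds ties to the nearest even integer. The sketch below is an illustration with a hypothetical function name, xpath_round; floor and ceiling correspond directly to math.floor and math.ceil for finite arguments.

```python
import math

def xpath_round(x):
    """Nearest integer, ties toward positive infinity, special values preserved."""
    if math.isnan(x) or math.isinf(x) or x == 0:
        return x                              # NaN, infinities and both zeros are unchanged
    if -0.5 <= x < 0:
        return -0.0                           # small negative values give negative zero
    low = math.floor(x)
    return float(low + 1) if x - low >= 0.5 else float(low)

ties = [xpath_round(2.5), xpath_round(-2.5), xpath_round(0.5)]
python_ties = [round(2.5), round(-2.5), round(0.5)]     # ties go to the even integer
negative_zero_sign = math.copysign(1.0, xpath_round(-0.2))
```

In the first list every tie moves toward positive infinity, so -2.5 becomes -2 while 2.5 becomes 3.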
5 Data Model
XPath operates on an XML document as a tree. This section describes how XPath models an XML document as a tree. This model is conceptual only and does not mandate any particular implementation.
The tree contains nodes. There are several supported types of node:
- element nodes
- text nodes
- attribute nodes
- processing instruction nodes
- comment nodes
For every type of node, there is a way of determining a string-value for a node of that type. For some types of node, the string-value is part of the node; for other types of node, the string-value is computed from the string-value of descendant nodes.
For element nodes, the string-value is a concatenation of the string-values of their child nodes in document order. If an element node has no child nodes then an empty string is returned.
For attribute, comment, text and processing-instruction nodes the string-value is the text content of each of the nodes.
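As a rough analogy for the string-value rules above, the Python sketch below uses the standard xml.etree.ElementTree module, whose itertext() method yields the text inside an element in document order. It is not the CLaRK data model, the sample document is invented, and it contains only elements, text and one attribute.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring('<p id="x1">Hello <b>big <i>wide</i></b> world</p>')

element_value = "".join(para.itertext())                  # text of all descendants, in order
empty_value = "".join(ET.fromstring("<p/>").itertext())   # element without children
attribute_value = para.get("id")                          # text content of the attribute
```

The element's string-value is assembled from its descendants, while the attribute carries its string-value directly.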
There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities). Reverse document order is the reverse of document order.
Element nodes have an ordered list of child nodes. Nodes never share children: if one node is not the same node as another node, then none of the children of the one node will be the same node as any of the children of another node. Every node other than the root node has exactly one parent, which is an element node. The descendants of a node are the children of the node and the descendants of the children of the node.
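Document order for element nodes corresponds to a depth-first traversal in which each element is visited before its children. The following Python sketch shows this with the standard xml.etree.ElementTree module, used here only as an analogy for the data model; the sample document and variable names are invented.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<a><b><c/></b><d/></a>")

# iter() visits the element itself first, then its descendants in document order.
document_order = [element.tag for element in doc.iter()]
reverse_document_order = list(reversed(document_order))

# Each element other than the top one appears as the child of exactly one parent.
parent_of = {child: parent for parent in doc.iter() for child in parent}
parent_tags = {child.tag: parent.tag for child, parent in parent_of.items()}
```

The start-tag of b occurs before that of c, and c before d, which is exactly the order in which the traversal reports them.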
5.1 Element Nodes
There is an element node for every element in the document. An element node has a name.
The children of an element node are the element nodes, comment nodes, processing instruction nodes and text nodes for its content.
5.2 Attribute Nodes
Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.
Elements never share attribute nodes: if one element node is not the same node as another element node, then none of the attribute nodes of the one element node will be the same node as the attribute nodes of another element node.
5.3 Processing Instruction Nodes
There is a processing instruction node for every processing instruction, except for any processing instruction that occurs within the document type declaration.
5.4 Comment Nodes
There is a comment node for every comment, except for any comment that occurs within the document type declaration.
5.5 Text Nodes
Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. A text node always has at least one character of data.
Each character within a CDATA section is treated as character data. Thus, <![CDATA[<]]> in the source document will be treated the same as &lt;. Both will result in a single < character in a text node in the tree. Thus, a CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of < and & were replaced by &lt; and &amp; respectively.
A text node does not have a name.
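The equivalence between a CDATA section and escaped character data can be checked with any XML parser. The Python sketch below uses the standard xml.etree.ElementTree module as an example; it is an illustration with invented sample documents, not a description of how the CLaRK System parses documents.

```python
import xml.etree.ElementTree as ET

from_cdata = ET.fromstring("<a><![CDATA[<]]></a>").text     # CDATA section
from_entity = ET.fromstring("<a>&lt;</a>").text             # entity reference

# Character data around and inside a CDATA section forms one run of text.
merged = ET.fromstring("<a>x<![CDATA[ & ]]>z</a>").text
```

Both spellings yield the same single character, and the CDATA delimiters leave no trace in the resulting text.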
A References
A.1 Normative References
- XML
- World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210
- XML Names
- World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names
A.2 Other References
- Character Model
- World Wide Web Consortium. Character Model for the World Wide Web. W3C Working Draft. See http://www.w3.org/TR/WD-charmod
- DOM
- World Wide Web Consortium. Document Object Model (DOM) Level 1 Specification. W3C Recommendation. See http://www.w3.org/TR/REC-DOM-Level-1
- TEI
- C.M. Sperberg-McQueen, L. Burnard Guidelines for Electronic Text Encoding and Interchange. See http://etext.virginia.edu/TEI.html.
- Unicode
- Unicode Consortium. The Unicode Standard. See http://www.unicode.org/unicode/standard/standard.html.