Menu File
This menu contains the items for managing documents in the CLaRK system. The system recognizes well-formed XML documents, which fall into two kinds. Documents of the first kind are external: they reside outside the system, and the system can work with them only after they are imported. External documents must use one of the following character encodings: ASCII, UNICODE (UTF-8), UNICODE (UTF-16BE) or UNICODE (UTF-16LE). Documents of the second kind are internal: they have been imported into the system and are stored in an internal representation. When a document is processed inside the system, it has to be associated with a document type definition (DTD). The DTD has to be compiled into an internal representation before it can be assigned to the document. The document loaded in the active view of the system is called the current document.
Sub-Menus:
- New
Default shortcut: Ctrl+N
Available as an icon on the toolbar.
By choosing this item, the user creates a new empty document. First, a dialog for choosing a DTD appears. The user can create a document with respect to any of the DTDs compiled into the system. If the appropriate DTD has not been compiled, the user has to close the dialog and compile the required DTD first. After the DTD is chosen, a new document is created that contains only the root element with empty content. In most cases this document will not be valid.
- Open
Default shortcut: Ctrl+O
Available as an icon on the toolbar.
By choosing this item, the user opens document(s) saved in the internal document database of the system. A Document Manager window opens.
The user selects a document and clicks the Open button, and the corresponding document is loaded into the system. Internal documents are stored together with information about their DTD. If that DTD is compiled in the system, the document is opened with respect to it. If the DTD has been deleted from the DTD database of the system, the user is asked to choose one of the available DTDs.
The user can choose document(s) to open in two modes:
- from the list of all internal documents, by selecting one or more documents. This option is provided by the Document Manager;
- from the current group. The internal documents are arranged in hierarchically organized groups, and the user can select document(s) from the current group.
If the Cancel button is chosen, the manager window closes and no document is loaded into the system.
The user can also open a document in the manager by double-clicking its name, either in the list of all documents or in the list of the current group.
Note that the Open item does not open text files. To load an XML document from a text file, choose the Import XML item.
- Save
Default shortcut: Ctrl+S
Available as an icon on the toolbar.
By choosing this item, the user saves the current document in the internal document database of the system. The user has to give the document a name in a dialog box. Optionally, the user can specify a document group in which the document should be included. The system then tries to save the document. If a document with the specified name already exists, the user is asked either to confirm overwriting it or to choose a new name.
- Document Manager
This component is responsible for the management of the internal documents in the CLaRK System.
The Internal Documents Database is an essential part of the CLaRK system. It is the place where all documents saved in the system are stored. The main reason for maintaining such a database is that the system can work only with well-formed XML documents, and storing them in an internal format prevents the system from malfunctioning because of incorrect data. The database also protects the documents from being corrupted. In addition, keeping the documents "invisible" to other applications makes it possible to save additional system-specific information with them. Documents are exchanged with external applications through import/export operations. A document that has been successfully imported into the system is guaranteed to be well-formed. A document exported from the system is always a well-formed XML document stored in a file, which can be used by other XML-oriented applications.
The Internal Documents Database is a set of repositories, called corpora, which contain the documents saved in the system. Corpora are identified by unique, user-supplied names. Each corpus contains a set of documents that are stored independently of the documents in other corpora. No document can be saved outside a corpus. Documents within a corpus are identified by unique names, i.e. no two documents in the same corpus can have the same name. If two or more documents need to have the same name, they must be stored in different corpora.
The possible operations on corpora are:
- Creating a new corpus
The user supplies a new corpus name that is not yet used by another corpus.
- Removing a corpus
The user selects a corpus to be removed. All documents contained in the selected corpus are removed as well. The removal is preceded by a confirmation warning message. There is one system-protected corpus ('Root') which cannot be removed. It is always present in the system and is used as the default corpus.
- Renaming a corpus
The user can rename an existing corpus by supplying a new unique name. The 'Root' corpus is protected from renaming.
Corpora are document containers only, so a corpus cannot contain another corpus. Each corpus has its own internal logical structure, which facilitates the maintenance of many documents at the same time. The documents within a corpus are organized in groups. Each group can contain a set of documents as well as a set of subgroups. Each group is identified by a name that is unique within the group containing it, but not necessarily unique within the containing corpus. The supported operations on groups are: creating a new group or a subgroup of a group, renaming a group, and removing a group. The user can also include documents in a group or exclude them from it.
Groups are logical unions of documents and do not affect the documents physically. This means that if a document is removed (excluded) from a group, the document itself is not deleted from the database, but only excluded from this union. Thus the same document can be included in more than one group, and a document may also belong to no group at all.
Note: technically, the groups contain not documents but references to documents. Thus, if a document is present in more than one group, the document itself is not duplicated for each containing group; instead, each group entry links to the single instance.
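The relation between corpora, groups and documents described above can be sketched as a small data model. This is only an illustration under stated assumptions, not the internal representation used by CLaRK: the class names Corpus and Group, the fields documents, groups and doc_refs, and the methods exclude and delete are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A group stores references (document names), never the documents."""
    name: str
    doc_refs: list[str] = field(default_factory=list)
    subgroups: dict[str, "Group"] = field(default_factory=dict)


@dataclass
class Corpus:
    """A corpus owns its documents and a hierarchy of groups."""
    name: str
    documents: dict[str, str] = field(default_factory=dict)  # name -> content
    groups: dict[str, Group] = field(default_factory=dict)   # top-level groups

    def all_groups(self):
        """Yield every group of the corpus, including nested subgroups."""
        stack = list(self.groups.values())
        while stack:
            group = stack.pop()
            yield group
            stack.extend(group.subgroups.values())

    def exclude(self, group: Group, doc_name: str) -> None:
        """Drop the reference from one group; the document stays stored."""
        group.doc_refs.remove(doc_name)

    def delete(self, doc_name: str) -> None:
        """Delete the document and every group reference to it."""
        del self.documents[doc_name]
        for group in self.all_groups():
            if doc_name in group.doc_refs:
                group.doc_refs.remove(doc_name)
```

In this sketch, excluding a document from one group leaves it stored in the corpus and referenced by any other group, while deleting it also clears every reference, which mirrors the difference between the Remove and Delete! functions described for the Document Manager.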
The documents in the Internal Documents Database can be observed in three views:
- by logical tree structure - the hierarchy is shown with the set of available corpora on the topmost level and the tree of groups under each of them. Selecting a node in the hierarchy shows its content in a separate window.
- by all documents in a corpus - all documents present in a given corpus are shown in a table with the document names and some other information. The user can switch to another corpus in order to see its content in the table. Group information cannot be observed in this view.
- by all documents in the whole database - all documents from all corpora are shown in one table. The information in the table is the same as in the previous view, plus the source corpus name for each document.
The main dialog is the Entry Manager with some additional function buttons. The manager keeps records of all documents saved in the system. The documents are contained in a system of document groups arranged in trees, similar to a directory structure. At the top of each hierarchy of document groups stands a node that represents the corpus container. Thus the topmost level of the hierarchy of the Internal Documents Database consists of corpus nodes only. The hierarchies of groups begin from these nodes.
Each corpus contains one system-reserved group, named SYSTEM. It contains documents for special purposes in the system. This group contains the following system-protected sub-groups:
- Results - This group contains a sub-group for each tool in the system. By default, each such sub-group stores the results of applying the corresponding tool. The internal structure of each tool group is user manageable.
- Queries - This group contains a sub-group for each tool in the system. These sub-groups store the query documents for each tool (a query document includes all the settings for the tool, so that it can be applied without additional data). The internal structure of each tool group is again user manageable. The documents included here must have certain system DTDs and must be valid according to them.
None of the above groups can be deleted or renamed by the user.
Here follows one example picture of the Document Manager window:
The By Groups view of the manager window consists of two main parts:
- The panel on the left. It contains a tree representation of the documents hierarchy: Root, SYSTEM, Queries, etc. When the user selects a node in this tree, the content of the corresponding group or corpus is loaded in the component on the right side.
- Current group monitor. This is the panel on the right side of the window. It lists the content of the group currently selected in the tree. List entries shown in blue are sub-groups. The others are the documents included in this group; they are shown in black or red, depending on whether or not they are valid according to their DTDs. The user can sort all the entries in a group by clicking on the Name column of the table header. The entries in a group can also be rearranged by drag-and-drop, i.e. by pressing on a document and moving it upwards or downwards until the desired position is reached.
There are seven buttons which can be used for manipulation of the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to give a name for it;
- Remove - removes the selected documents and/or groups from the list. The removal is preceded by a confirmation message. If the selection includes sub-groups, they are also removed with their entire contents.
- Rename - changes the name of the selected document or group from the list.
- Copy - makes a copy of the selected document from the list with a different name supplied by the user.
- Add Document - shows a list of all documents that are not present in the current group. The user chooses one or more documents to be included in the current group.
- Delete! - removes documents from the internal documents database. It can be applied only to single documents, not to whole groups; groups are excluded from any selection. The removal of documents is preceded by a confirmation message. The removed documents are also excluded from all the groups they belong to.
- Apply - applies saved tool queries. For details about the predefined tool queries see section XML Tool Queries. The user can choose one or more query documents to be applied by selecting entries in the right panel. The "Apply" button is enabled only if the selected documents are queries. When the button is pressed, the MultyQuery Tool manager opens and all the selected queries are inserted automatically.
The group structure can also be navigated in the panel on the right. To see the content of a sub-group of the current group, the user double-clicks on the sub-group name. This changes the current group to the selected sub-group. Movement in the other direction is also possible: each document group (excluding the Root) includes a special sub-group named "..". Double-clicking on it changes the current group to its parent, as in most file systems.
The Document Manager also provides two other views of the internal documents:
- All Documents in a Corpus
This view shows all documents within a single corpus in a table. The user selects the current corpus in the Choose a Corpus field.
- All Documents in the System
This view shows all documents from all corpora in one table.
From now on, these two views are described together, as they share many common features.
The tables in these views contain the following information about the documents:
- Corpus Name (only for the second view here) - the corpus in which the document is contained;
- Name - the name of the document;
- Time - the date and the time of the last document modification;
- Status - if the document represents a tool query, the corresponding tool name;
- DTD - the DTD assigned to the document;
- Validity status - whether the document is valid according to its DTD. The invalid status can be detected by the red color of the table entry.
The documents in the table can be manipulated with the right mouse button. After the user selects a document or a set of documents and right-clicks, a menu with the following functions appears:
- Info - This item shows the following information about the selected documents:
  - document name;
  - document size (in characters);
  - document's DTD name;
  - validity status of the document;
  - group(s) in which the document is included.
- Add In Group... - This item shows a dialog with the hierarchical structure of the groups in the system, in which the user can choose a group for the selected documents.
- Delete! - deletes the selected document(s) (see above);
- Rename - renames the selected document (see above);
- Copy - copies the selected document (see above);
- Change DTD - changes the DTD assignments of the selected documents.
Here is one example view of all internal documents from corpus Root in CLaRK:
- Delete
By choosing this item, the user deletes document(s) saved in the internal document database of the system. A Document Manager window opens.
The user selects a document (or a set of documents) and clicks the Remove button. A confirmation dialog then appears. If the user clicks the Yes button, the chosen documents are permanently deleted from the database. If the user clicks the No button, nothing is deleted.
Similarly to the "Open" function, the user can choose documents in two modes: from all documents or from a certain group. When documents are removed from the list of all documents, the groups they belong to are also updated.
The user can also delete a document in the manager by double-clicking its name, either in the list of all documents or in the list of the current group.
- Import XML
Available as an icon on the toolbar.
By choosing this item, the user loads a text file containing an XML document into the system. A standard dialog for choosing a file appears. The user has to specify the name of the file to load and its character encoding in the Files of type: choice box. The possible file encodings are: ASCII, UNICODE (UTF-8), UNICODE (UTF-16BE) or UNICODE (UTF-16LE). Then the user is asked to choose a DTD from the DTD database of the system. If the document itself contains a DTD, the user is asked which one should be used during parsing/validation.
Remark: If a document is opened in ASCII format (and it is certain that the file is ASCII) but it shows symbols different from the expected ones in the system, there are two possible causes:
- Inappropriate system font (for details see Options/Fonts);
- Inappropriate character encoding converter from ASCII to Unicode (for details see Options/Encoding Correction).
The system tries to parse the document. If the document is well-formed, it is parsed into the internal representation. The system creates a view of the document and validates it; if the document is not valid with respect to the chosen DTD, a list of errors is printed in the error window. A description of all validation errors can be found in the Validation error messages section.
If the document is not well-formed, the system reports an error. A detailed description of all parsing errors can be found in the Parse error messages section. Errors of this kind have to be repaired in an editor outside the system. The error message reported by the CLaRK system can help the user find the error in the document.
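Since errors in well-formedness have to be repaired outside the system, it can be convenient to check a file before importing it. The following is a minimal sketch using the Python standard library, not a part of CLaRK: the function name check_well_formed and the CODECS table are hypothetical, and mapping "ASCII" to the strict 7-bit Python codec is an assumption. The check covers decoding and well-formedness only; it does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Python codec names assumed for the four encodings listed above.
CODECS = {
    "ASCII": "ascii",
    "UTF-8": "utf-8",
    "UTF-16BE": "utf-16-be",
    "UTF-16LE": "utf-16-le",
}


def check_well_formed(data: bytes, encoding: str):
    """Return None if data is well-formed XML, otherwise an error message."""
    try:
        text = data.decode(CODECS[encoding])
    except UnicodeDecodeError as exc:
        return f"not valid {encoding}: {exc}"
    try:
        # Drop a byte order mark left over from decoding, if there is one.
        ET.fromstring(text.lstrip("\ufeff"))
    except ET.ParseError as exc:
        return f"not well-formed: {exc}"
    return None
```

The message of a ParseError includes the line and column at which the parser stopped, which helps to locate the problem in an external editor.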
- ReImport
Default shortcut: Ctrl+R
When the user tries to load a document that is not well-formed, the CLaRK system reports an error and the user has to edit the document outside the system. Afterwards the user will usually want to load the same document again. To avoid repeating the whole process of choosing the file, the user can use the ReImport item to load the last chosen document.
The system tries to parse the document. If the document is well-formed, it is parsed into the internal representation. The system creates a view of the document and validates it; if the document is not valid with respect to the chosen DTD, a list of errors is printed in the error window. A description of all validation errors can be found in the Validation error messages section.
If the document is not well-formed, the system reports an error. A detailed description of all parsing errors can be found in the Parse error messages section. Errors of this kind have to be repaired with an editor outside the system. The error message reported by the CLaRK system can help the user find the error in the document.
- Export XML
Available as an icon on the toolbar.
By choosing this item, the user can save the current document into a text file outside the system. A standard dialog for choosing a file appears. The user has to specify the name of the file which will contain the document and the character encoding of the exported text in the Files of type: choice box. The possible file encodings are: ASCII, UNICODE (UTF-8), UNICODE (UTF-16BE) or UNICODE (UTF-16LE). If the file already exists, the user is asked whether to overwrite it.
Remark: If a document is exported in ASCII format but the content of the output file is not readable by other programs or even by the CLaRK System, the problem may be an inappropriate character encoding converter from Unicode to ASCII (for details see Options/Encoding Correction).
The text in the file can be formatted according to the layout specified for the DTD of the document or according to the layout of the current view. These settings can be specified in the Export Options section.
- Export Options
By choosing this item, the user can specify the information to be added in the external file.
Here is a screenshot of the dialog:
The dialog contains the following sections:
Layout source options
- None - by selecting this radio button, no information about the document layout is added in the file and all tags are written without line breaks in-between.
- DTD Layout - by selecting this radio button, the DTD layout of the internal document is preserved in the external file.
- Current Document Layout - by selecting this radio button, the text layout of the current document view is preserved in the external file.
Export options
- Export Line Offsets - by selecting this checkbox, the leading offsets of the tags in the document are printed in the external file, depending on the chosen layout.
- Spaces Count per Generation - determines the number of space characters written before the tags for each generation. Here a generation is the set of nodes with equal depth in the document tree structure. The total count of leading spaces before a tag is: (generation depth) x (spaces count per generation).
- Print DTD in Document - determines whether the DTD of the document will be exported as well in the heading part of the exported file.
- Normalize whitespace in invisible text nodes - this option encodes whitespace characters as numerical entities when they appear in text nodes that contain no visible characters. This prevents XML parsers from later removing them on the assumption that they are garbage or layout data. This can be crucial for such text nodes situated between tags.
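The effect of the last two kinds of options can be illustrated with a short sketch. It is not the exporter used by CLaRK: the helper names leading_spaces and normalize_invisible are hypothetical, the indentation is assumed to be the node depth multiplied by the spaces count (with the root at depth 0), and "whitespace" is assumed to mean the four XML whitespace characters.

```python
XML_WHITESPACE = " \t\r\n"


def leading_spaces(depth: int, spaces_per_generation: int) -> str:
    """Indentation for a tag at the given depth (root assumed at depth 0)."""
    return " " * (depth * spaces_per_generation)


def normalize_invisible(text: str) -> str:
    """Encode whitespace-only text as numeric character references."""
    if text and all(ch in XML_WHITESPACE for ch in text):
        return "".join(f"&#{ord(ch)};" for ch in text)
    # Text with visible characters is left unchanged.
    return text


print(repr(leading_spaces(2, 4)))   # a string of 8 spaces
print(normalize_invisible(" \n"))   # &#32;&#10;
print(normalize_invisible("a b"))   # a b
```

A text node consisting of a single space between two tags would thus be written as &#32;, which a parser reads back as a space instead of discarding it as layout.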
- Recent ...
This option gives quick access to documents that were used recently in the system. The system keeps track of up to seven documents back. Choosing an item here immediately opens the document (if it is still available) in the editor. The Recent ... section is split into two sections:
- Internal Documents
It stores the names of the last seven documents which have been opened from the Internal Documents database in the system editor.
- External XML
It stores the names of the last seven documents which have been imported from external XML files in the system editor.
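A bounded history like this can be sketched with two fixed-length queues. This is an illustration only: the class name RecentDocuments and its record method are hypothetical, and moving a re-opened document to the front of its list is an assumption about the behavior, not something the text above states.

```python
from collections import deque


class RecentDocuments:
    """Two bounded lists of recently used document names."""

    def __init__(self, limit: int = 7):
        self.lists = {
            "internal": deque(maxlen=limit),
            "external": deque(maxlen=limit),
        }

    def record(self, kind: str, name: str) -> None:
        """Put name at the front of its list; the oldest entry drops off."""
        items = self.lists[kind]
        if name in items:
            items.remove(name)  # avoid listing the same document twice
        items.appendleft(name)
```

Because each deque is created with maxlen, adding an eighth name silently discards the oldest one, so no explicit trimming is needed.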
- Import RTF
By choosing this item, the user loads an RTF file into the system. A standard dialog for choosing a file appears, and the user is asked to point to the RTF file to load. The expected file encoding is ASCII, as specified in the RTF specification. The result of importing an RTF file is an XML document whose structure follows the TEI.2 DTD. During the import the user is asked to point to the DTD name under which the TEI.2 DTD is stored. If no such DTD is available, any other DTD can be selected; the resulting document will then not be valid (which is not a problem).
RTF to XML Conversion
Not all the data from an RTF document is detected and transferred into XML. From the structures which are not recognized in the RTF file, only the text content is taken. The types of data of which the system is aware are: heading information (title, author, keywords, creation/last modification date and time), structural information (headers, paragraphs, sections) and text layout information (bold, italics, underlined). Here follows the correspondence table from RTF to XML (TEI.2).
RTF Heading Data - XML (TEI.2) Corresponding Data
- title - TEI.2 > teiHeader > fileDesc > titleStmt > title
- author - TEI.2 > teiHeader > fileDesc > titleStmt > author
- operator - TEI.2 > teiHeader > fileDesc > titleStmt > respStmt > name
- company - TEI.2 > teiHeader > fileDesc > publicationStmt > publisher
- category - TEI.2 > teiHeader > encodingDesc > classDecl > taxonomy > category
- keywords - TEI.2 > teiHeader > profileDesc > textClass > keywords > term
- creation date/time - TEI.2 > teiHeader > profileDesc > creation > date / time
- words count - TEI.2 > teiHeader > fileDesc > sourceDesc > bibl > extent > XXX words
- characters count - TEI.2 > teiHeader > fileDesc > sourceDesc > bibl > extent > XXX characters
- characters count incl. whitespaces - TEI.2 > teiHeader > fileDesc > sourceDesc > bibl > extent > XXX characters with ws.
Structure & Layout Data - XML Tags
- paragraph - <p> ... </p>
- section - <div> ... </div>
- header - <head> ... </head>
- footer - <trailer> ... </trailer>
- bold - <hi rend="Bold"> ... </hi>
- italics - <hi rend="Italics"> ... </hi>
- underlined - <hi rend="Underlined"> ... </hi>
Remark: The notation in the table above: "xxx > yyy > zzz" stands for the following XML markup: "<xxx><yyy><zzz></zzz></yyy></xxx>"
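As a small illustration of the notation in the remark, the following Python sketch expands a "xxx > yyy > zzz" path into the nested markup it stands for. The function name is hypothetical and is not part of the system.

```python
def expand_path(notation):
    # Split "xxx > yyy > zzz" into element names.
    names = [name.strip() for name in notation.split(">")]
    # Open the tags from the outermost to the innermost element ...
    opening = "".join("<%s>" % name for name in names)
    # ... and close them in the reverse order.
    closing = "".join("</%s>" % name for name in reversed(names))
    return opening + closing

print(expand_path("TEI.2 > teiHeader > fileDesc > titleStmt > title"))
```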
The system supports Unicode encoded RTF documents as well as 23 different ASCII text encodings. If no ASCII encoding is detected, the system defaults to US Codepage (OEM-437).
- Import from URL
By choosing this item, the user can load an XML document from a URL address. The user should enter the URL address and the file encoding: ASCII, UNICODE (UTF-8), UNICODE (UTF-16BE) or UNICODE (UTF-16LE). Then the user is asked to choose a DTD from the DTD database of the system. If the document contains a DTD, this DTD is ignored during the parsing.
Remark: If a document that is known to be an ASCII file is opened in ASCII format and the symbols shown in the system differ from the expected ones, there are two possible causes:
- Inappropriate system font (for details see Options/Fonts);
- Inappropriate character encoding converter from ASCII to Unicode (for details see Options/Encoding Correction).
The system tries to parse the document. If the document is well-formed, then it is parsed into the internal representation. The system creates a view of the document, validates the document and if the document is not valid with respect to the chosen DTD, a list of errors is printed in the error window. A description of all validation errors can be found in Validation error messages section.
If the document is not well-formed, then the system reports an error. A detailed description of all parsing errors can be found in Parse error messages section. The errors of this kind have to be repaired with an editor outside the system. The error message reported by CLaRK system can help the user to find the error in the document.
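Before importing, a document can be checked for well-formedness outside the system. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which is unrelated to the CLaRK parser and reports different messages; it covers only well-formedness, not validation against a DTD. The function name is an assumption made for the example.

```python
import xml.etree.ElementTree as ET

def check_well_formed(xml_text):
    # Returns (True, None) for a well-formed document,
    # otherwise (False, message) with the position of the first error.
    try:
        ET.fromstring(xml_text)
        return True, None
    except ET.ParseError as err:
        line, column = err.position
        return False, "line %d, column %d: %s" % (line, column, err)

print(check_well_formed("<a><b/></a>"))
print(check_well_formed("<a><b></a>"))
```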
Here is a screen shot of the dialog :
- Import Text
This item allows the user to import pure text documents (without XML markup) in the system. The input documents can be encoded in ASCII or Unicode. The text from the input file is represented as one big text node. For compatibility, the CLaRK System puts an auxiliary tag (textdata) as root of the document and the text node is its only child. This auxiliary tag has an attribute source which keeps information about the location of the source file for this text. Having chosen this item, the user is asked to point to a DTD for the new document.
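The resulting structure can be pictured with a short Python sketch built on `xml.etree.ElementTree`. The element name textdata and the attribute source come from the description above; the function name and the sample file name are hypothetical.

```python
import xml.etree.ElementTree as ET

def import_text(text, source):
    # The auxiliary root element records where the text came from.
    root = ET.Element("textdata", {"source": source})
    # The whole input becomes one text node, the only child of the root.
    root.text = text
    return root

doc = import_text("a < b", "notes.txt")
print(ET.tostring(doc, encoding="unicode"))
```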
- Export Text
This item allows the user to export the text content of an XML document as pure text. The text is serialized to a file without any preprocessing, such as the substitution of special control symbols ('<', '>', '&', etc.) with entities. Only the text nodes are printed, in the order they appear in the document. All other data is ignored. The encoding of the text data can be either ASCII or Unicode.
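A minimal Python sketch of the same idea, assuming the standard `xml.etree.ElementTree` module rather than the CLaRK internals: the text nodes are concatenated in document order and the special symbols are written as they are, not as entities. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

def export_text(xml_text):
    root = ET.fromstring(xml_text)
    # itertext() yields every text node in document order;
    # tags and attributes are ignored.
    return "".join(root.itertext())

print(export_text("<p>a &lt; b <hi>and</hi> c &amp; d</p>"))
```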
- Multi-Import
This tool is used to perform multiple Import XML operations, i.e. to import XML documents from external files into the internal document database of the system. The documents are imported without being opened in the system editor. The imported documents are automatically saved in a corpus of the internal database and (optionally) included in a certain group.
Here follows a view of the Multi-Import Manager window:
The manager dialog window contains a list of all documents to be imported (initially it is empty) and some other options controlling the document processing. The components of the dialog are:
- Default DTD for documents - this component suggests a DTD for the documents selected for importing in the system. This chooser is default, because the user can specify a DTD for each document separately and if the DTD of the document is not specified, the system takes the DTD from here. The initial selected value in this chooser is the last DTD used in the system.
- Send documents to: - This option allows the user to specify a location in the internal documents database where the imported documents are to be stored. This item is disabled if the item Save Documents is not selected.
- Use directory "valid" - This item, if selected, moves all files which are valid according to the corresponding DTD to a sub-directory "valid" of the selected directory in field Valid/Well-formed Directory. If such a directory does not exist, the user is prompted to confirm its creation. If the user cancels, this option is disabled.
- Use directory "well-formed" - This item, if selected, moves all files which are well-formed but not valid according to the corresponding DTD to a sub-directory "well-formed" of the selected directory in field Valid/Well-formed Directory. If such a directory does not exist, the user is prompted to confirm its creation. If the user cancels, this option is disabled.
- Valid/Well-formed Directory - This field determines the base directory in which sub-directories "valid" and "well-formed" will be created/used. This information is strongly connected with the usage of the previous two options. If none of them is selected, this field is disabled. When it is enabled, the user can use button Change Directory to select a new base directory.
- Documents Preview Table - this component shows all documents which will be imported in the system. It is a table with four columns:
- File Name
This is a list of all files to be imported with their full directory paths. This column is not editable.
- DTD Name
This column assigns a DTD to each document to be imported. By clicking on each cell, a list of all DTDs in the system appears. If for a document there is no DTD selected, the system takes the default selected DTD (field Default DTD for documents).
- Document Name
This column determines the internal name, under which the document will be saved in the system. This column is editable and the user can select an arbitrary name for each document. If a name is painted in red, this means that the selected name is already used by another document. The behaviour of the processing component for such documents is determined by field Overwrite.
- Encoding
This column determines which supported character encoding is to be used when reading each of the selected documents (files). The initial values depend on the encoding selected in the file chooser dialogs used during the selection of files. These values can be changed later by clicking on the corresponding cells in this column, which opens an encoding chooser menu.
The management of the selected documents can be done by buttons: Add - allows the user to select documents for importing with the help of a standard file chooser; Remove - removes the selected documents in the table; Clear - clears the table content.
- Start - Starts to import the selected documents one by one. While importing, the status bar indicates which document is being currently processed. When the operation is completed a result message window appears with detailed information about each imported document: whether it is valid, well-formed and so on.
- Cancel - Cancels the operation of multi-importing documents.
- Save Documents - This checkbox determines whether or not the imported documents will be saved in the system. Not saving the documents is useful for checking the validity and the well-formedness of a set of documents. By default this item is selected.
- Overwrite - This item determines the treatment of documents which will be imported in the system with names which are already used by other documents (painted in red in the table). If this option is selected, all problematic documents are saved, overwriting the previous ones. Otherwise, having pressed the Start button the user is shown the following warning/confirmation message:
The choices given here are:
- Overwrite - overwrites the old documents with the new ones (the same behaviour as if the Overwrite option is selected).
- Add suffix - generates automatically new names for the problematic documents by adding an index in brackets to the end of the document names.
- Cancel - cancels the import operation for the selected documents
- Auto Detect DTD - if this option is selected, the system will try to detect the DTD of the input documents, unless a certain DTD is chosen in the corresponding DTD Name cell in the table. If the detection fails, the system will refer to the Default DTD for documents.
If a document is not well-formed, then the system reports an error. A detailed description of all parsing errors can be found in Parse error messages section. The errors of this kind have to be repaired with an editor outside the system. The error message reported by CLaRK system can help the user to find the error in the document.
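Two of the rules described for the Multi-Import Manager can be sketched in Python: the order in which a DTD is chosen for a document, and the Add suffix renaming. The function names are hypothetical, and the exact form of the generated suffix, here "name(1)", is an assumption; the text only says that an index in brackets is appended.

```python
def choose_dtd(cell_dtd, detected_dtd, default_dtd, auto_detect=True):
    # 1. A DTD chosen in the DTD Name cell always wins.
    if cell_dtd:
        return cell_dtd
    # 2. Otherwise use the detected DTD, if Auto Detect DTD is on and succeeded.
    if auto_detect and detected_dtd:
        return detected_dtd
    # 3. Otherwise fall back to the Default DTD for documents.
    return default_dtd

def add_suffix(name, used_names):
    # Keep the name if it is free, otherwise append the first free index.
    if name not in used_names:
        return name
    index = 1
    while "%s(%d)" % (name, index) in used_names:
        index += 1
    return "%s(%d)" % (name, index)

print(choose_dtd(None, None, "tei"))
print(add_suffix("doc", {"doc", "doc(1)"}))
```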
- Multi-Export
This item allows the user to export a set of internal documents: whole group or/and single documents. The user selects an output directory and encoding for all selected documents. Here is a screen-shot of the Multi-Export Manager window:
The manager window consists of:
- Output Directory section
This section contains two parts: Directory chooser and Encoding chooser. The Directory chooser shows the output directory, where the document will be stored in XML files. The file names are generated automatically on the basis of the internal document names. If an internal name contains symbols which are not allowed to be in a file name, they are substituted by an underscore symbol ('_'). The output directory can be changed by using button "Change Directory". Having pressed this button, the user is shown a standard file chooser to point to a new output directory. The second component is used for setting an output character encoding. The options are four:
- ASCII Text File
- UNICODE (UTF-8) Text File
- UNICODE (UTF-16BE) Text File
- UNICODE (UTF-16LE) Text File.
- Internal Documents chooser
This component is used for selecting documents to be exported. The component itself is the standard CLaRK system Document Selector, which is described in detail below. This selector is used not only in this part of the system, but in all places where a multiple selection of documents is needed. Having selected documents for exporting, the user must use one of the following buttons:
- Export button
Starts exporting the selected documents one by one into the selected directory with the selected character encoding. The status bar of the system shows which file is currently being processed. After completion of the operation, an information message about the result of the export is shown to the user.
Remark: If a document is exported in ASCII format but the content of the output file is not comprehensible for other programs or even for the CLaRK System, then the problem can be in an inappropriate character encoding converter from Unicode to ASCII (for details see Options/Encoding Correction).
- Cancel button
Cancels the whole operation.
- No Overwrite Warnings
If this item is selected, no confirmation messages for overwriting the existing files will be shown.
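The automatic file name generation described for the Output Directory section could look like the Python sketch below. The set of symbols treated as not allowed and the ".xml" extension are assumptions for the example; the text does not list them.

```python
import re

# Assumption: these symbols are the ones not allowed in a file name.
FORBIDDEN = re.compile(r'[\\/:*?"<>|]')

def file_name_for(document_name):
    # Every forbidden symbol is replaced by an underscore.
    return FORBIDDEN.sub("_", document_name) + ".xml"

print(file_name_for("corpus/doc:1"))
```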
- Document Selector
The Document Selector is a universal component used in almost all system tools that can work with more than one document. The selector allows the user to point to the internal documents to which a particular tool will be applied. It consists of two sub-components: a selection list and a selecting dialog window. The selection list shows the currently selected documents. Initially this list is empty. Here follows a picture of the selection list. Usually it is embedded in another dialog window.
Three operations can be applied to this list: adding new list entries, removing selected list entries, clearing the whole list. These operations can be applied by using the buttons on the right:
- Add Documents - This button opens the Document Manager dialog. When new documents are selected, they are added to the list on the left. If a selected document is already present in the list, it is not added again.
- Remove Documents - Removes the selected items in the list on the left. If no selection is made, nothing is done.
- Clear All - Removes all the entries from the list on the left. If the list is empty, nothing is done.
- Exit
By choosing this item, the user closes all opened files and exits the system. If there are unsaved documents, the user is asked whether to save the changes or not.
Menu Edit
- Undo
A default shortcut Ctrl+Z
By choosing this item, the user restores the current document to its state before the last operation was applied to it. The system supports up to 3 steps of structural undo (removing, renaming, insertion, etc. of nodes) and an unlimited number of text typing actions. The Undo operation may sometimes take a little time, especially on large documents.
- Search
A default shortcut Ctrl+F
The edit field and the icon on the toolbar provide the same functionality.
With this tool the user can search for nodes in the current active document using the implemented XPath engine. The XPath expression is evaluated with the currently selected node in the tree view as the context node. Having evaluated the XPath expression, the system shows the result of the evaluation. The first node that matches the query is marked in both areas - the tree and the text. The other nodes in the list are saved so that they can be selected when the user chooses Next and Previous.
Here is the layout of the Search Manager:
The dialog keeps a history of the recently used XPath expressions, and they can be used again by selecting an expression in the table. The user can get help information about the XPath axes by clicking on the Axes Info menu and then selecting the names of the axes.
- Next
A default shortcut F3
An icon on the toolbar
When choosing this item in the current view, the system moves the focus to the next element from the list of the elements, found by the last XPath search operation. In cases when no search operation was performed or the previous search result was unsuccessful, an error notification message is shown.
- Previous
A default shortcut Shift+F3
An icon on the toolbar
When choosing this item in the current view, the system moves the focus to the previous element from the list of the elements, found by the last XPath search operation. In cases when no search operation was performed or the previous search result was unsuccessful, an error notification message is shown.
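The interplay of Search, Next and Previous can be sketched in Python. The class below is hypothetical and uses the limited XPath subset of `xml.etree.ElementTree`, not the CLaRK XPath engine. It assumes that navigation stops at the first and last result; the text does not say what happens at the ends of the list.

```python
import xml.etree.ElementTree as ET

class SearchResults:
    def __init__(self, context_node, xpath):
        # The expression is evaluated relative to the context node.
        self.nodes = context_node.findall(xpath)
        self.index = 0

    def current(self):
        if not self.nodes:
            # Corresponds to the error notification for an empty result.
            raise LookupError("no search results")
        return self.nodes[self.index]

    def next(self):
        self.index = min(self.index + 1, max(len(self.nodes) - 1, 0))
        return self.current()

    def previous(self):
        self.index = max(self.index - 1, 0)
        return self.current()

root = ET.fromstring(
    "<books><book><title>A</title></book><book><title>B</title></book></books>")
results = SearchResults(root, "./book/title")
print(results.current().text, results.next().text, results.previous().text)
```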
- Synchronize Documents
This function allows two or more documents with identical structures to be navigated in parallel in the editor. There is no restriction on how many documents can be synchronized at the same time. The synchronized documents form a group in which a change of the selection in one document causes a selection change in the rest of the group members.
This feature is useful when documents have to be compared manually for some reason. Another application is when there is a set of documents representing parallel aligned data (for example, parallel corpora where each language is in a separate document).
The synchronization, i.e. the distribution of the selection from one document to the others, is done on the basis of the tree path location of the initially selected node. When a node is selected, its path to the tree root is calculated, taking into account the preceding siblings of each node in the path. The calculation of this tree path results in an XPath expression which deterministically points to the selected node. In the next step, this XPath expression is evaluated in all the other members of the synchronization group and, if it returns a node, the selection of the corresponding document is moved to it. Otherwise, the document's previous selection is cleared. Thus if two documents are identical, for each node in one of them there will be a corresponding node in the other one. If, having selected a node in a document, the selection disappears in the other document(s), it means that this node is in a way unique for this document.
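The path-based synchronization can be sketched in Python as follows. This is only an illustration under stated assumptions: it uses `xml.etree.ElementTree`, counts element nodes only, and renders the path as a positional expression of the form /*[1]/*[2]; the expressions actually generated by the system are not documented here. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def path_to(root, target):
    # Child-index path from root to target, or None if target is not in the tree.
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = path_to(child, target)
        if sub is not None:
            return [i] + sub
    return None

def to_xpath(path):
    # Positional XPath: the number of preceding siblings plus one at each level.
    return "/*[1]" + "".join("/*[%d]" % (i + 1) for i in path)

def resolve(root, path):
    # Follow the same path in another document; None means "clear the selection".
    node = root
    for i in path:
        if i >= len(node):
            return None
        node = node[i]
    return node

english = ET.fromstring("<r><s><w>one</w><w>two</w></s></r>")
german = ET.fromstring("<r><s><w>eins</w><w>zwei</w></s></r>")
path = path_to(english, english[0][1])
print(to_xpath(path), resolve(german, path).text)
```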
This function is a light-weight version of the synchronization with rules defined in Definitions/Sync Rules and applied in Document/Synchronize with .... Here the node correspondence is fixed (the tree paths from the roots) but it is more efficient and the synchronization links are two-directional (i.e. a set of corresponding nodes can be selected in any of the documents containing them). In the other type of linking, the user defines the correspondence (XPath) relations and there is a distinction between documents (current vs. referred document).
Menu View
This menu is used to visualize/hide Main Editor components. Whenever an item from the menu is checked, the component it refers to is visualized, otherwise it is not. The menu components are the following:
- Toolbar
Visualizes and hides the toolbar from the main system window.
- Navi Toolbar
Visualizes and hides the document navigation toolbar from the main system window.
It appears under the main toolbar, on the top of the tree area:
It functions similarly to an Internet browser-like history module (Back/Forward). Each document activation is recorded in the history and if the Back button is used, the system activates the document which was previously active. Similar behaviour can be expected from the Forward button. This functionality can be applied when the work requires the use of more than one document loaded in the editor.
- Scrollbar
Visualizes and hides the vertical text scroll bar from the main system window.
- Show Current Node Path
Enables or disables the current node path visualization in the status bar of the main system window.
- Show Memory Monitor
Visualizes and hides the Memory Monitor from the main system window. When this option is checked, the monitor window appears on the main menu bar indicating the amount of memory, currently used by the system. By pressing the monitor, a Garbage Collection is run. It reduces the amount of used memory and increases the efficiency of next operations.
- Text Area
Shows and hides the Text Area of the main system editor. Hiding the area improves the performance of the system when a huge amount of data is loaded in the editor.
- XPath Debugger
The XPath Debugger is a tool which helps the user to follow the process of evaluation of an XPath expression written in the Search field of the editor. For the XPath Debugger to start functioning, the user is expected to supply a VALID XPath expression. Otherwise, the system produces an error message and points to the incorrect atom or sub-expression. The task of the debugger is not to validate expressions, but to trace step-by-step the process of evaluation of expressions on certain XML structures. On each step, the debugger shows:
- the sub-expression or atom which is currently in focus,
- the current context node for which the evaluation step is performed and also
- the partial result from the evaluation step.
The XPath Debugger can be enabled by selecting a menu item View/XPath Debugger. The debugger panel appears on the place of the Error Message Panel in the bottom of the Main Editor window.
The appearance of the debugger panel is the following:
The panel consists of two parts: monitoring section and navigation section.
The monitoring section shows the query XPath expression and emphasizes (in red) which sub-expression is currently evaluated. The red marker is moving along the expression as the evaluation process goes forward or backwards. Some parts of the expression might be highlighted more than once if they are evaluated on different contexts.
The expressions which appear in the monitoring section might look slightly different from the initial expressions written in the search query field of CLaRK, but they are equivalent. The differences come from the fact that the expressions in the debugger are in a Normal Form, i.e. all elements which come from the abbreviated syntax of XPath are expanded to their full forms ('/A/B/C' becomes '/child::A/child::B/child::C'); complex expressions are decomposed into smaller ones by using braces (usually in binary operations); variable definition scopes are also marked by braces ('{a := expr1}{b := expr2}($a + $b)'); there are some axis composition optimizations ('//para' becomes '/descendant::para' instead of '/descendant-or-self::*/child::para'); and others.
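The expansion of abbreviated steps can be illustrated with a small Python sketch. It handles only absolute paths made of plain name tests (or '*') separated by '/' and '//', and it is not the normalization routine of the system. Steps with predicates are rejected on purpose, because rewriting '//' to the descendant axis is not safe in general when positional predicates are present. The function name is hypothetical.

```python
import re

# A step is supported only if it is '*' or a plain element name.
NAME = re.compile(r"^(\*|[A-Za-z_][\w.-]*)$")

def to_normal_form(path):
    if not path.startswith("/"):
        raise ValueError("only absolute paths are handled in this sketch")
    # '/A//B' -> ['/', 'A', '//', 'B']
    parts = re.split(r"(//|/)", path)[1:]
    steps = []
    for separator, name in zip(parts[0::2], parts[1::2]):
        if not NAME.match(name):
            raise ValueError("unsupported step: %r" % name)
        # '//' is folded into the descendant axis, '/' is the child axis.
        axis = "descendant" if separator == "//" else "child"
        steps.append("/%s::%s" % (axis, name))
    return "".join(steps)

print(to_normal_form("/A/B/C"))
print(to_normal_form("//para"))
```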
The second part of the debugger panel is the navigation section. It allows the user to go through all the steps of the evaluation, to jump over some of them or to go one or more steps back. To navigate in each direction the user has three buttons (three for forward and three for backwards). Each of the three buttons jumps over a different portion of processing steps: the smallest steps (the most detailed navigation), bigger steps, and the biggest steps (iterating over all contexts for the current expression). On each step of the navigation, the context node is marked in green in the tree area and the resulting nodes (if there are any) are marked in red. Thus the user can follow the generation of the partial results. Additionally, on each step an information message describing the current results is shown on the status bar, which is useful when the result is not of type node-set, but string, number or boolean.
The Reset button is used to stop the process of tracing the evaluation, clearing the monitoring section and restoring the normal system behaviour.
To disable the XPath debugger and restore the Error Message Panel the menu item View/XPath Debugger must be unselected.
Here follows an example how the debugger can be used:
Let us take the following XPath expression:
/book[3]/authors/*[2]/text()
which is expected to select the second author of the third book. We open a relevant document in the system editor and enable the XPath debugger from the menu. Having performed an XPath search on the document with the example expression, the status bar shows the following message:
and the monitoring section of the debugger is initialized with the normal form of the expression. At this point the 'debugging' process has not started yet. To start it, we use any of the Next buttons.
Now the debugger is working. We can use the Next and Back buttons to see in details the evaluation step-by-step. Here follows an illustration of three different processing steps:
[Screenshots of the XPath monitor and the status bar for Step A, Step B and Step C.]
Note: Step A precedes (not immediately) Step B which precedes Step C.
- Graphical Tree View
The Graphical Tree View is a tool for drawing tree structures encoded in XML. The resulting structure does not necessarily follow the XML logical structure; it is user defined. The structure definition is based on XPath expressions. The view of the resulting drawing follows a layout defined in advance. The layout definition is described in section Definitions / Graphical Tree Layout. Here we continue the description with an example.
Let the source document be a simple RDF encoded ontology. Here is a short fragment of it:
When a node in the document is selected and the menu item Graphical Tree View is chosen, the following view appears if the system uses the default layout:
Here each element node is represented by its name (the red ovals) and each text node by its text content (the yellow rounded rectangle). A useful modification of the layout would be to show, instead of the tag names of the elements, the values of the attributes rdf:ID and rdf:resource (where appropriate).
Adding two new rules to the layout (one to show the rdf:ID attribute value for element rdfs:Class and one to show the rdf:resource attribute value for element rdfs:subClassOf) will produce the following result:
In the current example (as in a typical RDF document), the definition structure is 'flat', i.e. each concept is defined independently with the help of relation references to other definitions. If we need to observe the hierarchy as a tree on the basis of the sub-class-of relations, this can be done by changing the structure in the layout. The change is that the child nodes of an rdfs:Class element will be all rdfs:Class elements which have an rdfs:subClassOf element referring to this element. The elements will be presented by their rdf:ID attribute values. The result is the following:
Tree View Window
The task of the window is to show the resulting tree image. The drawing canvas is in a way 'active', i.e. it is aware of mouse clicks on it, and if a certain shape is clicked, the corresponding source node is selected in the system editor. The synchronization with the editor is two-directional, i.e. if the selection in the editor changes, the system finds the closest node which can serve as a root according to the current layout and reconstructs the image.
Tree view options:
- Zoom In / Zoom Out - changes the image dimensions in the percentage range [25% - 300%]. The upper limit can be lowered if the original image size is too big. The biggest image which can be visualized without reducing its size can have dimensions: 5000 points width x 3000 points height.
- Layout - allows changing the layout for the current view. It shows a choice list of all available layouts in the system.
- Refresh - reconstructs the preview image, which is needed if the source document has been changed.
- Keep on top - this option keeps the view window in focus when the selection in the editor changes as a consequence of a selection in the drawing canvas. The default behaviour is to bring the system editor window to the front in order to show the new selection location.
- Save as ... - this option allows saving the current view image as an external image file. The supported file formats are: JPEG, PNG and SVG.
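The sub-class hierarchy view described for the RDF example above can be approximated outside the system with a short Python sketch. The sample document, the class names and the function names are invented for the illustration, and rdf:resource values are assumed to have the form "#Name"; the layout rules of the system itself are defined with XPath expressions and are not reproduced here.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Hypothetical 'flat' ontology fragment.
SAMPLE = """<rdf:RDF xmlns:rdf="%s" xmlns:rdfs="%s">
  <rdfs:Class rdf:ID="Animal"/>
  <rdfs:Class rdf:ID="Dog"><rdfs:subClassOf rdf:resource="#Animal"/></rdfs:Class>
  <rdfs:Class rdf:ID="Cat"><rdfs:subClassOf rdf:resource="#Animal"/></rdfs:Class>
  <rdfs:Class rdf:ID="Puppy"><rdfs:subClassOf rdf:resource="#Dog"/></rdfs:Class>
</rdf:RDF>""" % (RDF, RDFS)

def subclass_tree(rdf_text):
    # Map each class ID to the IDs of the classes that refer to it via rdfs:subClassOf.
    root = ET.fromstring(rdf_text)
    children = {}
    for cls in root.iter("{%s}Class" % RDFS):
        name = cls.get("{%s}ID" % RDF)
        children.setdefault(name, [])
        for sub in cls.findall("{%s}subClassOf" % RDFS):
            parent = sub.get("{%s}resource" % RDF, "").lstrip("#")
            children.setdefault(parent, []).append(name)
    return children

def render(children, name, depth=0):
    # One class per line, indented by its depth in the hierarchy.
    lines = ["  " * depth + name]
    for child in children[name]:
        lines.extend(render(children, child, depth + 1))
    return lines

print("\n".join(render(subclass_tree(SAMPLE), "Animal")))
```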
Menu DTD
- Compile DTD
A default shortcut Ctrl+L
By choosing this item, the user compiles a DTD (Document Type Definition). First a standard file chooser appears. The user is expected to point out to the file where the DTD is stored. The system supports four kinds of character encodings: ASCII, Unicode UTF-8, Unicode UTF-16BE and Unicode UTF-16LE.
When an input file has been chosen, the DTD compilation begins. If the DOCTYPE element is not declared, the user has to choose the root element from all elements defined in the DTD.
If an error occurs during the DTD compilation, a notifying error message appears.
If everything is correct, a message confirming the successful compilation appears. The DTD is added to the list of all DTDs known to the system. If a compiled DTD with the same name already exists in the system, an additional index is appended to the end of the newly added DTD name.
- Renew DTD
The Renew operation replaces the content definitions of a selected DTD in the system with a DTD stored in an external file. During the renew operation all other settings related to the DTD in the system are saved unchanged. Such settings are the text and tree layout, element features etc. Thus Renew DTD operation is useful when the user constructs a DTD outside the system and needs to update it in the system. For this operation the user has to choose a DTD from the system by using a standard DTD chooser.
None of the opened documents in the system editor should use the selected DTD during Renew operation. If there are such document(s), the following warning confirmation message appears:
If No is clicked, then the Renew DTD operation is canceled.
If Yes is clicked, then all documents in the system which are connected to this DTD may become invalid. When some of these documents are opened in the system editor they will be validated again.
Then a standard file chooser appears. The user is expected to point out to the file where the DTD is stored. The system supports four kinds of character encodings: ASCII, Unicode UTF-8, Unicode UTF-16BE and Unicode UTF-16LE.
When an input file has been chosen, the parsing begins. If an error occurs during the DTD compilation, a notifying error message appears.
If everything is correct the new compiled version of the DTD substitutes the old one.
- Remove DTD
By choosing this item, the user can remove a DTD from the list of all DTDs known to the system.
If there are no saved documents referring to this DTD in the system, a message confirming the successful removal appears. Otherwise, an error message appears. It warns the user that the removal operation cannot be done.
Having pressed the button Details the user can see the documents, which rely on the selected DTD. These documents do not allow the DTD removal. The two possible solutions to this problem are as follows: either the DTD for all these documents is changed, or the documents themselves are removed.
In this way no documents will refer to the DTD in question and hence, the removal will be successful.
- View DTD
This is an information dialog, showing the content of a DTD (Document Type Definition) already compiled in the system. (For more information about DTDs, see http://www.w3.org/TR/1998/REC-xml-19980210#dt-valid).
The information data is divided into 4 sections representing different parts of the DTDs (structure data, attributes data, entities data and processing-instructions data). These four parts are contained in a tabbed pane and, by clicking on each tab, the user can switch between them.
The viewer is demonstrated by the following simple example of a DTD:
<!DOCTYPE books [
<!ELEMENT books (book)*>
<!ELEMENT authors (author)+>
<!ELEMENT book (#PCDATA, title, authors, publisher, (pages)?, isbn, (price)+)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT publisher (#PCDATA)>
<!ELEMENT pages (#PCDATA)>
<!ELEMENT isbn (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ATTLIST price
currency CDATA #REQUIRED
prev CDATA #IMPLIED
id CDATA #IMPLIED
n CDATA #IMPLIED
lang CDATA #IMPLIED>
<?CLaRK member(Gram,AllAn,[;]) ?>
]>
Structure data section:
The structure, defined by the DTD, is represented as a table (see above). Each table row contains structural data for one element. The name of the element is in the first cell. The second cell contains the definition of the element as a regular expression. The rows are sorted by lexicographical order of the element names.
Note: #PCDATA means a plain text string, excluding symbols like '<','>'.
In the picture above, the definition of the author element says that it must contain only text data (no other elements). According to the example DTD, the structure of the book element must be: text data, title element, authors element, publisher element, pages element (optional), isbn element and price element (one or more). The ordering matters.
When the declaration of an element is long or rather complex, it is sometimes helpful to see it separately from the other element declarations. This is possible by clicking with the right mouse button on the row of the desired element and then pressing the View button when it appears.
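The structure table can be approximated outside the system with a few lines of Python. This is only a rough sketch, not the way the CLaRK System compiles a DTD: it reads the element declarations of the example DTD with a regular expression and lists them sorted by element name. The names DTD_TEXT and structure_table are made up for this illustration.

```python
import re

# A trimmed copy of the example DTD from this section.
DTD_TEXT = """
<!ELEMENT books (book)*>
<!ELEMENT authors (author)+>
<!ELEMENT book (#PCDATA, title, authors, publisher, (pages)?, isbn, (price)+)>
<!ELEMENT title (#PCDATA)>
"""

def structure_table(dtd_text):
    """Return (element name, definition) rows sorted by element name."""
    rows = re.findall(r"<!ELEMENT\s+(\S+)\s+(.*?)>", dtd_text)
    return sorted(rows)

for name, definition in structure_table(DTD_TEXT):
    print(f"{name:10} {definition}")
```

Each printed row corresponds to one row of the table: the element name in the first column and its definition in the second.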
Element attributes data section:
In order to see the attributes of a given element, the user has to choose the element from the drop-down menu at the top of the window. This menu contains all elements declared in the DTD. If after choosing an element nothing appears in the table, it means that there are no attributes declared for this element. Each time an element is chosen, the content of the table is updated. In the picture above, the element price has been chosen and the table contains its attributes according to the DTD.
Each row represents one element's attribute only. The name of the attribute is in the first column. The type of the attribute's value is in the second one. The third one contains meta-information about the attribute (required, implied, fixed, ...).
Entities data section:
This section gives information about the entities defined in the DTD. Entities can be used as escape alternatives for symbols that are not allowed in the text. In the picture above, there is an entity lt, which stands for the symbol '<' in the text. Without it, the XML parser assumes that a new tag starts whenever it meets the symbol '<' in the text, and if that is not the case, further processing fails. Therefore, the symbol '<' has to be written as &lt;, which is interpreted as the symbol '<' and not as the start of a tag. The format of an entity reference is &xxx;, where xxx is the name of the entity. All entity names appear in the first column of the table above. Opposite each entity name stands its corresponding text string (the text that the entity stands for).
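The effect of the lt entity can be tried with the Python standard library. The snippet below is an illustration only and is not part of the CLaRK System; the sample sentence is invented.

```python
from xml.sax.saxutils import escape, unescape

raw = "if a < b then stop"
escaped = escape(raw)            # '<' is written as the entity reference &lt;
print(escaped)                   # if a &lt; b then stop
print(unescape(escaped) == raw)  # True: the reference stands for '<' again
```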
Processing-instructions section:
Processing instructions carry information for the application that processes the document rather than information about the text content itself. It is up to that application to know how to interpret them. In the example DTD, <?CLaRK member(Gram,AllAn,[;]) ?> is such an instruction.
- Export DTD
An existing DTD can be exported from the system to a file. The user selects the DTD to be exported from the DTD chooser dialog and then chooses the directory, the file name and the encoding under which the selected DTD will be stored.
- Create New DTD
This dialog allows the user to create a new DTD document. The dialog is similar to the View DTD dialog, with additional Find, Save and Cancel buttons. A new DTD document is empty: the tables for elements and attributes contain no rows. The Entity table contains five default entities (amp, apos, gt, lt, quot), which cannot be edited or duplicated.
While a cell in the table is being edited, a popup menu can be opened by clicking the right mouse button. The popup menu has the following items: Cut, Copy, Paste, Delete, Select all, Edit and Insert symbol. Insert symbol allows the user to enter symbols from the Unicode table. Several default strings from the DTD standard can also be added while editing the cell content:
- ANY, EMPTY, #PCDATA , | ? * + ( ) for Element Description column.
- CDATA, ID, IDREF, IDREFS, NMTOKEN, NMTOKENS, ENTITY, ENTITIES, NOTATION for Attribute Type column.
- #REQUIRED, #IMPLIED, #FIXED for Attribute Default Value column.
When the right mouse button is clicked over a selected row or rows, a popup menu appears with the items Insert row, Delete row, Copy and Paste. They are used for inserting a new row, deleting the selected row(s), copying one or more rows, and pasting the row(s) from the buffer at the selected row position.
Thus there are two different menus: one for manipulating the rows in the tables and one for editing the content of a cell in the table.
To save the new DTD, the user clicks the Save button and enters a name for it. If there are no errors, the new DTD is compiled and ready to be used by the CLaRK System. Possible errors include an empty cell in one of the tables; an invalid element, attribute or entity name; a wrong regular expression in an element definition; a wrong attribute type, attribute default value or entity value; and others.
When the Find button is clicked, the Find dialog appears. It searches for information in the current DTD. First, the user enters the string to be looked for. By choosing elements, attributes or entities, the user determines the table in which the search will be performed. For each table there are items for the relevant columns. When the search query is ready, the user clicks the Search button. The results can be navigated with the Next and Previous buttons.
- Edit DTD
This dialog allows the user to edit the existing DTDs in the system. It is similar to the Create New DTD dialog, with additional Save as and Update buttons. The editing operations are the same as in the Create New DTD dialog: Cut, Copy, Paste, Delete, Select all, Edit and Insert symbol for cell editing, and Insert row, Delete row, Copy and Paste for row editing.
With the Save as button the user can save the edited DTD under a different name.
If a document related to the DTD being edited is open in the system, the following message dialog appears.
If Yes is clicked, the documents in the system that are connected to this DTD may no longer be valid. When these documents are opened in the system, they can be validated again with respect to the new DTD.
With the Update button the changes are saved to the edited DTD.
- Edit Text Layout
By editing the DTD layout, the user can change the way in which a document loaded in the system appears on the screen. This facility includes the following: moving to a new line before/after an opening/closing tag, and hiding some tags and/or their content.
After choosing a DTD, the following table appears:
The first column contains all tag names in the DTD. Each row represents the layout information for one tag. Here follows a description of the meaning of each column in the table:
- Tags - All the tag names in the selected DTD. A special name Unknown* is added at the end of table for the elements with a tag not defined in the DTD;
- Open tag start - the possibility for the opening tag of the corresponding element to appear on a new line or not;
- Open tag end - the possibility for the first child node to appear on a new line (a new line after the opening tag) or not;
- Close tag start - the possibility for the closing tag of the corresponding element to appear on a new line or not;
- Close tag end - the possibility for a new line to be inserted after the closing tag or not;
- Is tag visible - the possibility for the tag to be visible or hidden on the screen;
- Are children visible - the possibility for the content of the tag to be visible or hidden on the screen.
The check box at the bottom of the window: "Use line offsets" supports more comprehensive visualization. It suggests an additional white space to be inserted in front of the tags. If chosen, this white space is assigned to each tag, which appears on a new line and the length of the white space depends on the depth of the node in the DOM tree.
The field Color Scheme specifies which Color Scheme will be used for the documents using this layout. If the selection is the first item (<disabled>) then no Color Scheme is used. For details see menu option Color Schemes.
The two buttons Export Layout and Load Layout are used for saving/loading the current layout settings to/from an external file in XML format.
When a document is loaded, it obeys the layout, defined for its DTD. Note that later on this layout can be changed only for the current view.
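The columns of the layout table can be read as per-tag flags. The following Python sketch models them under stated assumptions: the class TagLayout, the function render and the sample document are invented for this illustration, the "Use line offsets" option is left out, and the real system may combine the flags differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

@dataclass
class TagLayout:
    open_tag_start: bool = False    # new line before the opening tag
    open_tag_end: bool = False      # new line after the opening tag
    close_tag_start: bool = False   # new line before the closing tag
    close_tag_end: bool = False     # new line after the closing tag
    tag_visible: bool = True        # "Is tag visible"
    children_visible: bool = True   # "Are children visible"

def render(elem, layouts, default=TagLayout()):
    """Render an element as text according to the per-tag layout flags."""
    lay = layouts.get(elem.tag, default)   # default plays the role of Unknown*
    out = []
    if lay.open_tag_start:
        out.append("\n")
    if lay.tag_visible:
        out.append(f"<{elem.tag}>")
    if lay.open_tag_end:
        out.append("\n")
    if lay.children_visible:
        out.append(elem.text or "")
        for child in elem:
            out.append(render(child, layouts, default))
            out.append(child.tail or "")
    if lay.close_tag_start:
        out.append("\n")
    if lay.tag_visible:
        out.append(f"</{elem.tag}>")
    if lay.close_tag_end:
        out.append("\n")
    return "".join(out)

doc = ET.fromstring(
    "<book><title>XML</title><isbn>123</isbn><price>10</price></book>")
layouts = {
    "book": TagLayout(open_tag_end=True, close_tag_end=True),
    "title": TagLayout(close_tag_end=True),
    "isbn": TagLayout(tag_visible=False, children_visible=False),  # fully hidden
    "price": TagLayout(tag_visible=False, close_tag_end=True),     # content only
}
print(render(doc, layouts))
```

With these settings the isbn element disappears completely, while price shows its content without its tags.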
- Edit Tree Layout
The Edit Tree Layout option controls the way the document tree is displayed on the screen. Without an active Tree Layout, element nodes are represented by their tag names and text nodes by their text content. The nodes in the tree are colored in blue if the corresponding (DOM) nodes are valid according to the DTD and in red otherwise. When a Tree Layout is activated, the user can define the way each element node is represented in the tree. The definition is a text pattern with three types of identifiers, which are not interpreted as plain text:
- A reference to an attribute value of the current element. The syntax is '{@attribute_name}', where attribute_name is the name of the attribute whose value is needed. If the attribute is not found, an empty string is returned.
- A reference to an XPath key which is evaluated with the current element as a context. The syntax is '{%key_name}', where key_name is a name of an XPath key defined in the system. It is an error if such a key does not exist.
- An XPath expression which is evaluated with the current element as a context. The syntax is '{xpath_expression}', where xpath_expression is a valid XPath expression. The result of the evaluation can be a node-set, string, number or boolean. If a node-set with more than one element is returned, the different values are separated by a single space character. One restriction is that the xpath_expression must not contain the character '}'. If this character is needed, the XPath expression has to be defined as an XPath key, and the user can then refer to that key.
The user has to be careful when using the symbol '{' in a text pattern where it must not be treated as the beginning of an identifier. In these cases the sequence '^{' can be used instead of '{'. Example: '^{@name}' will appear as '{@name}', and not as the value of an attribute called name.
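The pattern syntax described above can be sketched in Python. This is a simplified illustration and not the CLaRK implementation: the function expand, the KEYS table and the sample document are assumptions, and ElementTree supports only a small subset of XPath, so only simple paths are used.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical key table: key name -> path expression.
KEYS = {"all_authors": "authors/author"}

def eval_path(elem, path):
    """Evaluate a path on elem; several results are joined by single spaces."""
    return " ".join((node.text or "") for node in elem.findall(path))

def expand(pattern, elem):
    """Expand {@attr}, {%key} and {path} identifiers; '^{' is a literal '{'."""
    def repl(match):
        if match.group(0) == "^{":          # escaped brace
            return "{"
        body = match.group(1)
        if body.startswith("@"):            # attribute value, empty if missing
            return elem.get(body[1:], "")
        if body.startswith("%"):            # named key; unknown names raise KeyError
            return eval_path(elem, KEYS[body[1:]])
        return eval_path(elem, body)        # inline path expression
    return re.sub(r"\^\{|\{([^}]*)\}", repl, pattern)

book = ET.fromstring(
    '<book lang="en"><title>XML</title>'
    "<authors><author>Ann</author><author>Bob</author></authors></book>")
print(expand("{title} ({@lang}) by {%all_authors}", book))
print(expand("^{@name}", book))
```

The first call prints the title, the lang attribute and both authors; the second prints the text {@name} unchanged, because '^{' switches off the identifier.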
Also for each element node a different coloring can be defined. There are 12 colors available.
Here is an example preview of the Tree Layout editor:
Each layout can be enabled/disabled. When a layout is disabled the tree is shown as if there is no layout defined but it still remains in the memory and can be enabled later. The Instructions button gives a short description of the text pattern syntax. The View Macros button shows a list of all XPath macros currently available in the system.
The Preserve ToolTip checkbox makes the original tag names appear in tooltip balloons on the tree panel for convenience. If the checkbox is not selected, the tooltip repeats the content of the corresponding tree node.
The field Graphical Layout determines which graphical layout will be used for documents whose DTD is the owner of this layout. The Graphical Tree Layouts are defined in menu Definitions / Graphical Tree Layout and used in menu View / Graphical Tree View.
In the figures below you can see how the tree layout above has changed the tree appearance:
Tree Layout enabled
Tree Layout disabled
Menu Definitions
- Tokenizers
XML treats the content of each text element as one whole string, which is unsuitable for corpus processing. For this reason, the word-forms, punctuation and other tokens in the text have to be distinguished. To solve this problem, the CLaRK System supports a user-defined hierarchy of tokenizers. At the most basic level, users can define a tokenizer (Primitive) in terms of a set of token types. In this basic tokenizer each type is defined by a set of UNICODE symbols. Above the basic level there are tokenizers (Complex) for which the token types are defined as regular expressions over the tokens of some other tokenizer, the so-called Parent tokenizer.
Here is the Tokenizer Manager tool, which shows all the tokenizers saved in the system, as well as some of their characteristics:
- Name - each tokenizer has a unique name
- Type - shows whether the tokenizer is Primitive or Complex
- Parent - for Complex tokenizers - the name of the Parent tokenizer
- Compile - shows the Compilation status. Only the compiled tokenizers can be used in other tools.
The user can create a new tokenizer; edit, compile or remove an existing one; load tokenizers from an external file; or save tokenizer(s) to an external file.
- New
In order to create a new tokenizer (Complex or Primitive), the user must press the New button.
Each row in the table represents one tokenizer category. The first column presents the category name. The content of the second column depends on the type of the tokenizer. The column contains a category value (all the symbols in the category) if the tokenizer is "primitive", or regular expression for a "non-primitive" tokenizer.
- Primitive
If the tokenizer is primitive, the user must select the Primitive check box.
When defining the category value for a primitive tokenizer, the user should be aware of the following rules:
- The characters are quoted in single or double quotations, or referenced by a number. Example "." or ";" or 'k' or 32 (Space)
- If the user wants to write more than one symbol for a category, he/she should separate the symbols by a comma. Example: "a","b","c",...
- If the user wants to define a range of symbols, the starting and ending symbols must be connected with a dash. Example : "A"-"Z".
- A character cannot belong to more than one category.
- A category can be defined on more than one row. It is interpreted as the union of the expressions on those rows. For example:
The tokenizer tool will interpret these lines as LAT "'a'-'z','A'-'Z'".
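The rules for category values can be illustrated with a small Python sketch. It is an approximation written for this manual and not the parser used by the CLaRK System; the names parse_category and build_tokenizer are invented.

```python
import re

# One symbol: a quoted character ("a" or 'a') or a numeric character code (32).
SYMBOL = r"""(?:"(.)"|'(.)'|(\d+))"""
# One item: a symbol or a range symbol-symbol, followed by a comma or the end.
ITEM = re.compile(rf"\s*{SYMBOL}(?:\s*-\s*{SYMBOL})?\s*(?:,|$)")

def to_char(double_quoted, single_quoted, number):
    return double_quoted or single_quoted or chr(int(number))

def parse_category(value):
    """Return the set of characters described by one category value."""
    chars, pos = set(), 0
    while pos < len(value):
        match = ITEM.match(value, pos)
        if not match:
            raise ValueError(f"bad category value at position {pos}: {value!r}")
        groups = match.groups()
        start = to_char(*groups[0:3])
        end = to_char(*groups[3:6]) if any(groups[3:6]) else start
        chars.update(chr(code) for code in range(ord(start), ord(end) + 1))
        pos = match.end()
    return chars

def build_tokenizer(rows):
    """Map each character to its category; rows with the same name are merged."""
    table = {}
    for name, value in rows:
        for ch in parse_category(value):
            if table.setdefault(ch, name) != name:
                raise ValueError(f"{ch!r} is in both {table[ch]} and {name}")
    return table

table = build_tokenizer([("LAT", "'a'-'z'"), ("LAT", "'A'-'Z'"), ("SPACE", "32")])
print(table["q"], table["Q"], table[" "])
```

Two rows named LAT are merged into one category, and a character that is claimed by two different categories is reported as an error.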
- Complex
When defining a complex tokenizer, the user should follow the rules below:
- A parent tokenizer must be selected.
- Each category Expression must be a valid regular expression.
- Two categories cannot have a common token.
Here is an example of a complex tokenizer:
The parent of the tokenizer is the "Mixed" primitive tokenizer shown above. This complex tokenizer uses the categories "LAT" and "CYR" from the parent tokenizer in order to define the new categories "LATw" and "CYRw".
Each complex tokenizer must be compiled in order to be used.
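The relation between a parent tokenizer and a complex tokenizer can be sketched in Python. The sketch rests on assumptions that are not confirmed by this manual: the parent tokenizer is taken to label every character with a category, the expressions LAT+ and CYR+ are invented definitions for LATw and CYRw, and the names PARENT, CODES, COMPLEX and tokenize are hypothetical.

```python
import re

# Hypothetical parent (primitive) tokenizer: category -> characters.
PARENT = {"LAT": "abcdefghijklmnopqrstuvwxyz", "CYR": "абвгд", "SPACE": " "}
# One code letter per parent category, so a text becomes a string of codes.
CODES = {"LAT": "L", "CYR": "C", "SPACE": "S"}
# Hypothetical complex tokenizer: regular expressions over parent categories.
COMPLEX = {"LATw": "LAT+", "CYRw": "CYR+", "SP": "SPACE+"}

def categorize(text):
    """Replace every character by the code of its parent category."""
    codes = []
    for ch in text:
        name = next(n for n, chars in PARENT.items() if ch in chars)
        codes.append(CODES[name])
    return "".join(codes)

def compile_complex(rules):
    """Translate category names to codes; an unknown name raises KeyError."""
    parts = []
    for name, expr in rules.items():
        coded = re.sub(r"[A-Za-z_]+", lambda m: CODES[m.group()], expr)
        parts.append(f"(?P<{name}>{coded})")
    return re.compile("|".join(parts))

def tokenize(text):
    scanner = compile_complex(COMPLEX)
    codes = categorize(text)
    return [(m.lastgroup, text[m.start():m.end()]) for m in scanner.finditer(codes)]

print(tokenize("abc абв"))
```

The KeyError raised for an unknown category name plays the role of the "Category not defined" message.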
- Edit
The user can edit a tokenizer by pressing the "Edit" button. The tokenizer selected in the table will be opened for editing. The user can add, remove and reorder rows from the menu which shows up on right-clicking a row in the table. The parent of a tokenizer is set when the tokenizer is created and can be changed by pressing the Change Parent button.
For each primitive tokenizer the user can define the sort order of the categories by clicking the Sort Order button. This sort order is used by the other tools in the system when they compare tokens. For example:
The user can reorder the categories by selecting a row with a left mouse click and then either pressing the buttons that move the row or using the menu which shows up on right-clicking a row in the table.
Also for each category of the primitive tokenizer a normalization of the symbols can be defined. This normalization is applied when the tokens are compared. The usual normalization is the conversion of the capital letters into small ones, but in the system the user can define any correspondence of the symbols. This can be done by right clicking on the category line in the table. For instance, the following dialog will appear if we select normalize for the "LAT" category of Mixed Tokenizer:
For each symbol of the category the user can select a corresponding normalized symbol. The "New Category" combo box determines the new category that the symbols will receive after the normalization. This category can coincide with the original category of the symbols before the normalization or it can be completely different.
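The usual normalization, capital letters into small ones, can be pictured as a lookup table that is applied before two tokens are compared. The Python sketch below is an illustration only; the table NORMALIZE and the functions normalize and same_token are invented names.

```python
# Hypothetical normalization table for a LAT category:
# every capital letter corresponds to its small letter.
NORMALIZE = {chr(code): chr(code + 32) for code in range(ord("A"), ord("Z") + 1)}

def normalize(token):
    """Replace each symbol by its normalized counterpart, if one is defined."""
    return "".join(NORMALIZE.get(ch, ch) for ch in token)

def same_token(a, b):
    """Compare two tokens after normalization."""
    return normalize(a) == normalize(b)

print(normalize("XML-1"))          # xml-1
print(same_token("Book", "bOOK"))  # True
```

Any other correspondence between symbols can be expressed by changing the entries of the table.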
- Remove - removes the selected tokenizer from the table;
- Compile - compiles a non-primitive tokenizer;
The user can compile a complex tokenizer with the "Compile" button. Pressing this button both compiles and saves the tokenizer. Some possible error messages after compilation:
- "Ambiguous Categories" - two tokenizer categories recognize the same token from the input word. This error can occur even if the categories belong to different tokenizers in the hierarchy.
- "Category not defined" - a category name used in the value of one of the tokenizer categories is not defined in the tokenizers above this tokenizer in the hierarchy.
When a tokenizer is compiled, all the tokenizers above and below it in the hierarchy are compiled as well. This means that a change in one tokenizer can produce an error during the compilation of another tokenizer, so the error messages should be read carefully. It is also recommended to keep all tokenizers compiled.
- Load Tokenizer - loads a tokenizer from a file;
- Save Tokenizer - saves a tokenizer as a file;
- Exit - closes the dialog window.
- Filters
This menu item starts the filter editor. To browse the filters in the system, the user can use the "Filter" combo box at the top of the dialog. The user can add token categories from different tokenizers, or add XPath expressions that filter element nodes.
A filter defines how tokens and/or elements are removed from the content of a given element when some tool processes that content. Filters are usually used in connection with grammars. A grammar is applied to the content of an element, and the content is processed before the grammar is applied: the text in the content is tokenized and the elements are converted to tokens. The result is a list of tokens, which is the input for the grammar. Very often some of the tokens in this list are meaningless for the grammar, such as space tokens, some special symbols, or some special elements (an element for a parenthetical expression, for instance). To avoid these tokens, they can be filtered out of the grammar input in advance. This is the purpose of the filters in the system.
"Token Types" is a list of token categories, that will be filtered. The user can take categories from tokenizers in the system ("Choose From" list on the left side of the dialog) and add them to the list of filtered token categories with the arrow ("=>") button.
In order to filter an element the user has to specify an XPath expression. This XPath expression is evaluated on each element in the content which is filtered and if it is evaluated as true or returns a non-empty list of nodes, then the element is filtered out of the content. The addition of the XPath expression is done by pressing "Add XPath" button. The new XPath expression will be added to the "Expression" list.
To remove a token category or an XPath from one of the lists, the user must select the line and press the "Remove" button under the table.
After editing, the user has to save the filter with the Save button. A filter can be removed with the Remove button.
The Export Filters button saves all filters as a file.
The Import Filters button loads filters from a file. There are loading options which determine the behaviour of the import operation.
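The effect of a filter on the input of a grammar can be sketched in Python. The sketch makes several assumptions: tokens are modelled as (category, text) pairs, elements as ElementTree elements, and plain Python predicates stand in for the XPath expressions of the "Expression" list. The name apply_filter is invented.

```python
import xml.etree.ElementTree as ET

def apply_filter(items, token_types, element_tests):
    """Drop tokens of a filtered category and elements matched by any test."""
    kept = []
    for item in items:
        if isinstance(item, ET.Element):
            if any(test(item) for test in element_tests):
                continue                      # element is filtered out
        elif item[0] in token_types:
            continue                          # token category is filtered out
        kept.append(item)
    return kept

paren = ET.fromstring("<paren>(see above)</paren>")
content = [("LATw", "see"), ("SPACE", " "), paren, ("LATw", "below"), ("PUNCT", ".")]

# Filter out space tokens and parenthetical elements.
grammar_input = apply_filter(content, {"SPACE"}, [lambda e: e.tag == "paren"])
print(grammar_input)
```

Only the tokens that are meaningful for the grammar remain in the list.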
- Element Features
The Element Features dialog is used for assigning information to the elements of a DTD.
The user can add the following information:
- A tokenizer for the DTD. This tokenizer is used to tokenize (when needed) the text content of all elements for which no specific tokenizer is set, so it acts as the default and can be overridden. For instance, if no tokenizer is defined for an element, the grammar and sort engines will look for the DTD default. After a DTD is compiled, the "Default" primitive tokenizer is set as its default tokenizer.
- For each element in the DTD, a tokenizer can be set. This tokenizer overrides the default DTD tokenizer. It is used for tokenization of the element text data by the Grammar and Sort engines and some other tools.
- The user can state that the content of the element is treated as a number.
- The user can define an XPath Value (Key) for each DTD element.
- The user can define the order over the DTD elements. This option is used for sorting purposes.
- The user can define options for delete operations - Delete Subtree and Delete Node from the Tree Popup Menu and XPath Remove from the Tools menu.
To select the default tokenizer for the DTD, the user selects an item in the "Default Tokenizer" combo box. A tokenizer for an individual element of the DTD can be selected in the "Tokenizer" column of the table by clicking on a table cell. The check boxes in the "Number" column are used by the sort tool to determine whether the content of the element can be treated as a number. For example, pages and price can be treated as numbers in the comparison of two books. The values in the "Element Value" column are used by the Grammar Engine to define the value of the element nodes (for defining a value, see Edit Grammar). The check boxes in the "Before" and "After" columns determine whether a space symbol is inserted before and after a deleted element by the delete operations: Delete Subtree and Delete Node from the Tree Popup Menu, and XPath Remove from the Tools menu. After one of these operations is applied, if the Before checkbox is checked for an element, a space symbol is inserted before its tag if one is not already there; if the After checkbox is checked, a space symbol is inserted after its tag if one is not already there.
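The difference made by the "Number" check box can be seen in a short Python example. The page counts are invented, and the snippet only illustrates the two ways of comparing; it does not show how the sort tool is implemented.

```python
pages = ["95", "1200", "310"]

print(sorted(pages))             # compared as text:    ['1200', '310', '95']
print(sorted(pages, key=float))  # compared as numbers: ['95', '310', '1200']
```

Compared as text, "1200" comes before "95" because the character '1' precedes '9'.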
The addition (creation) of a new Element feature can be performed by pressing the Add button. The removal of elements can be done by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.
The order of the elements in DTD can be defined in the sort table shown when the user presses the "Sort Order" button. When comparing two elements, the position in the sort table defines their order. The user can change the position of elements by dragging their rows to correct positions or using the context menu opened when the user right clicks on sort table row. Here is an example:
The Export Element Features button saves Element Features for the current selected DTD as a file.
The Import Element Features button loads Element Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking the Cancel button.
If one of the Element Features uses a tokenizer which does not exist in the system, a warning message is shown and the user can correct the definition in a validating dialog.
- Attribute Features
The Attribute Features dialog is used to add information to the attributes of elements from a DTD or of user elements.
The addition (creation) of a new Attribute feature can be performed by pressing the Add button. The removal of attributes can be done by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.
The "Element" column of the table contains the elements which have attributes. The same element can appear several times because it can have more than one attribute. The "Attribute" column contains the name of the attribute for which the attribute features are defined. The "Is Enumeration" column shows whether the attribute is an enumeration or not. In the "Tokenizer" column the user can select a different tokenizer for each attribute. The sort tool uses the information from the "Number" column to select the way of comparing the values of the attribute (as plain text or as a number).
An additional feature is the order of enumerated attribute values. The attributes with enumerated values have the string "Yes" in the "Is Enumeration" column. To sort the enumerated values, the user must click on an attribute with enumerated values and then click the "Sort Values" button. Example:
The values are sorted in ascending order. The user can change the order by dragging the rows or by pressing the right mouse button.
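A user-defined order of enumerated values can be pictured as a ranking that replaces alphabetical comparison. In the Python sketch below the enumerated values and their order are invented for the illustration.

```python
# Hypothetical enumerated values in the order chosen by the user.
ORDER = ["USD", "EUR", "BGN"]
RANK = {value: position for position, value in enumerate(ORDER)}

values = ["BGN", "USD", "EUR", "USD"]
print(sorted(values))                        # alphabetical order
print(sorted(values, key=RANK.__getitem__))  # user-defined order
```

The second call places USD first and BGN last, following the position of each value in ORDER.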
The Export Attribute Features button saves Attribute Features for the current selected DTD as a file.
The Import Attribute Features button loads Attribute Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking Cancel button.
If one of the Attribute Features uses a tokenizer which does not exist in the system, a warning message is shown and the user can correct the definition in a validating dialog.
- XPath Macros
XPath Macros give the user the possibility to name XPath expressions. Each macro consists of a macro name and an XPath expression and is intended for general use. XPath Macros can be parts of XPath expressions, so they can be used in every place where XPath is used. Within an XPath expression a macro is denoted by '%' (percent sign) followed by the macro name, i.e. %macro_name. A macro citation by itself also represents an XPath expression (a CLaRK System extension).
The picture below shows what the XPath Macros editor looks like:
The Export XPath Macros button saves the selected XPath Macros as a file.
The Import XPath Macros button loads XPath Macros from a file. There are loading options which determine the behaviour of the import operation.
If one of the XPath Macros has no XPath expression, or its expression is invalid, a warning message is shown and the user can correct the definition in a validating dialog.
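The %macro_name notation can be illustrated by a simple textual expansion in Python. This is an assumption made for the sake of the example: the manual does not say how the CLaRK System resolves macros internally, and the macro table, the function expand_macros and the wrapping in parentheses are all invented.

```python
import re

# Hypothetical macro table: macro name -> XPath expression.
MACROS = {"priced": "book[price]"}

def expand_macros(expression, macros):
    """Replace every %name by its expression; unknown names raise KeyError."""
    return re.sub(r"%(\w+)",
                  lambda match: "(" + macros[match.group(1)] + ")",
                  expression)

print(expand_macros("%priced/title", MACROS))   # (book[price])/title
print(expand_macros("%priced", MACROS))         # (book[price])
```

The parentheses keep the expanded macro together as one expression when it is used as part of a larger one.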
[:en]
The XPath Macros gives the user possibility to name XPath Expressions. It contains a macro name and an XPath expression. It is designed for the general usage. The XPath Macros can be parts of XPath expressions and thus they can be used in each place where XPath is used. Within an XPath expression a macro is denoted by '%' (percent sign), followed by the macro name, i.e. %macro_name. A macro citation itself also represents an XPath expression (CLaRK System extension).
In the picture below you is shown what the XPath Macros editor looks like:The Export XPath Macros button saves selected XPath Macros as a file.
The Import XPath Macros button loads XPath Macros from a file. There are loading options which determine the behaviour of the import operation.
If one of the XPath Macros has no XPAth expression or it is invalid the relevant warning message is given and the user can validate this definition by a validating dialog.
[:]
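The %macro_name notation can be pictured as a textual substitution performed before the XPath expression is evaluated. The following Python sketch illustrates that idea only; the macro table, the function name expand_macros and the rule for what counts as a macro name are assumptions made for this example, not a description of how CLaRK implements macros.

```python
import re

# Hypothetical macro table: macro name -> XPath expression.
MACROS = {
    "nouns": "//w[@pos='N']",
}

# Assumption: a macro name is made of letters, digits and underscores.
MACRO_REF = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)")


def expand_macros(xpath, macros):
    """Replace every %name citation by the XPath text stored under that name."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        return macros[name]

    return MACRO_REF.sub(substitute, xpath)


# A macro used as a part of a larger expression ...
expanded = expand_macros("count(%nouns)", MACROS)
# ... and a macro citation used as a complete expression.
alone = expand_macros("%nouns", MACROS)
```

The sketch does not protect percent signs inside XPath string literals; a real implementation would have to tokenize the expression first.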
- Keys
The Keys are a means of naming XPath expressions together with some specific information needed by some of the tools in the system. A key name can be an arbitrary string, as long as it is unique. Here is what the XPath Key manager window looks like:
Each row in the table above represents one XPath Key. The content of the table cannot be modified directly from here. In order to modify a key entry, the user must select the corresponding row in the table and then press the Edit button. A new dialog window appears where the key specifications can be changed. A new XPath Key is created by pressing the Add button. Keys are removed by selecting the corresponding row and pressing the Remove button. The Exit button closes the dialog and saves the changes (if any).
There are several types of keys. Each type has different additional options and usage:
- Grammar Key - This key is designed for usage with the Grammar tool, especially in the Element Value option. It has two more additional fields: normalization on/off and a specification of a tokenizer. A description of the usage of this key can be found in the description of menu option Apply Grammar. Here is what the Grammar XPath Key editor window looks like:
- Sort Key - This key is designed for usage with the Sort tool. It contains settings specific for its usage: order descending/ascending; reverse sorting; removing leading/ending white spacing (trim); enabling number interpretation; enabling normalization; tokenizer specification. All these options and their usage are described in menu option Sort.
- Table Sort Key - This key is designed for usage in the Concordance tool. It is used for specifying the sort options for the different table columns' content. This key has one more option than the Sort Key options: Prefix. This option specifies for which table column this key is defined to be used. Here is what the editor window looks like:
The Export Keys button saves selected Keys as a file.
The Import Keys button loads Keys from a file. There are loading options which determine the behaviour of the import operation.
If one of the Keys uses a tokenizer which does not exist in the system, a warning message is given and the user can correct the definition in a validation dialog.
- Shortcuts
This tool allows the user to execute different actions in the system by pressing certain key combinations.
Each keyboard shortcut definition consists of two parts:
- key combination - the combination which will activate the shortcut;
- action definition - the description of the action which will be executed when this shortcut is activated.
A key combination includes modifier keys (at least one) and an ordinary key. The modifier keys may vary across computer architectures. Usually, such keys are: Alt, Ctrl, Shift and Meta. The rest of the keys (or at least most of them) can play the role of an 'ordinary' key. Such keys are: all keys which produce graphical symbols, the function keys (F1, F2, ..), numerical keys and others.
The action definition determines what type of action will be performed and the concrete action. There are three available action types: selecting a menu item, applying a tool query and an XPath search query. Each of them is described in detail later in this section.
Having selected this option, the following dialog window appears:
It lists all shortcuts defined in the system in a table. Each row in the table corresponds to one shortcut. The first two cells show the key combination and the third one briefly describes the action of the shortcut. Column Key shows the key which activates the shortcut. The second column, Modifier(s), shows the modifier keys for the shortcut; there can be more than one. Column Action shows information about the shortcut's action.
The shortcut management which this dialog window offers includes adding, editing and removing shortcuts.
Here follows a description of the dialog buttons and their functions:
- Add - creates a new shortcut. After this button is pressed, a blank Shortcut Editor window appears.
- Edit - modifies an existing shortcut. After this button is pressed, a Shortcut Editor window appears with all data from the selected shortcut loaded in it. This function can also be called by double-clicking a shortcut's row in the table.
- Remove - removes the selected shortcuts from the table WITHOUT confirmation warning.
- Done - updates the changes (if any) on the shortcuts and closes the manager window.
- Reset Shortcuts - removes all currently defined shortcuts and restores the initial system ones.
- Cancel - discards the changes (if any) on the shortcuts and closes the manager window.
Shortcut Editor
The Shortcut Editor window appears when the user presses button Edit or button Add.
The definition of a key combination can be done in sections Modifiers and Key Code. Different combinations of modifiers can be selected by clicking on the relevant checkboxes. In order to select a shortcut activation key or to change the current one, the user has to press button Change. The system will respond with a small dialog, titled Press a key and it will wait for a single key stroke (without modifiers) which will be recorded as a new activation key. If the new key recording has to be stopped, the user must press button Cancel. Having chosen a new activation key, the recording dialog will disappear and the program control will be returned to the editor. If after pressing a key, the dialog does not disappear, it means that the selected key is not suitable for an activation key (the same happens if a modifier key is pressed). If the recording is successful, the new selected key is shown in the Key Code section.
The action for the shortcut is defined in section Action. The section is separated into three sub-sections, each responsible for one type of action. The sub-sections are as follows:
- Action Items
This type of action covers most of the menu items which can be chosen in the system menus. The target menus are: the main system editor menu (Menu Item), the menu which appears after a right mouse click on the tree areas (Tree Item) and the menu which appears on the attribute table (Attribute Item). The type of an action (menu) item can be selected from field Action Type.
- Queries
An action of this type represents an application of one or more predefined tool queries. There is no restriction on the type of the queries or on the order in which they are selected (when there is more than one). The queries are applied on the current document, taking into account the current tree area selection. If there is no current document or the tree selection is not an appropriate one, no action is performed.
Query list management:
- Add - selects one or more tool queries and appends them to the current queries list;
- Remove - removes the selected tool query from the queries list;
- Options - sets some shortcut execution options which are taken into account when the shortcut is activated.
The options are:
Use tree selection as:
- context for selection - determines how the selected data in the tree will be handled. Almost all tool queries contain XPath expressions which select the data on which the given tool query will be applied. If this option is selected, the queries XPath expressions are evaluated on each selected node as a context and the union of all results forms the tool input data.
- processing data - ignores the queries XPaths (if there are such) and uses the current tree selection as an input for the given tool(s).
- Suppress warning / info messages - this option enables the suppression of messages showing intermediate results during application.
- XPath Search
This type of action performs an XPath search on the current document with a predefined query expression. It is handy when many XPath searches have to be done manually on different context nodes. Then, instead of going to the search field on the toolbar for each newly selected context, a key combination can be used. Additionally, if no result message dialogs are needed for the search operation, they can be suppressed by selecting Suppress info messages, and thus only 'discreet' messages will appear on the status bar at the bottom.
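The two settings of the Use tree selection as option described under Queries can be contrasted in a small Python sketch. It is an illustration under stated assumptions: nodes are plain strings, and evaluate is a hypothetical stand-in for evaluating a query's XPath expression with a given node as context.

```python
def tool_input(selection, query_xpath, evaluate, use_selection_as="context"):
    """Return the nodes a tool query would receive as input."""
    if use_selection_as == "processing data":
        # The query's own XPath is ignored; the selection itself is the input.
        return list(selection)
    # "context": evaluate the XPath on each selected node, then take the union.
    result = []
    for node in selection:
        for hit in evaluate(query_xpath, node):
            if hit not in result:
                result.append(hit)
    return result


# Hypothetical document: each selected node has some child nodes.
CHILDREN = {"s1": ["w1", "w2"], "s2": ["w2", "w3"]}


def evaluate(xpath, node):
    # Stand-in evaluator: pretends the expression selects the node's children.
    return CHILDREN[node]


as_context = tool_input(["s1", "s2"], "child::*", evaluate, "context")
as_data = tool_input(["s1", "s2"], "child::*", evaluate, "processing data")
```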
- Sync Rules
Synchronization Rules are a means of establishing connections between documents opened in the system editor. A connection maps a selection in one document to selections in other documents. The connections are based on processing and evaluating XPath expressions.
When a connection between the current document and another referent document is established each change of the selection in the current one is registered. If the new selection satisfies certain conditions, the connection is activated and a new XPath expression is generated on the basis of a pattern. The new XPath is evaluated in the referent document and the result is selected (in case it is not an empty node-set).
An example usage of this facility is when a certain document is explored and simultaneous look-ups in a dictionary document are needed. In this case a connection between the observed and the dictionary documents can be established. Whenever a certain word/expression to be looked up is selected, the system extracts the necessary information and performs a search in the dictionary document. If the searched entry is found, it is selected in the background. Thus moving from word to word in the observed document automatically shows the corresponding entries from the dictionary.
Another application of this tool is when different documents are connected in some way by references. Whenever a reference in one document is selected, the referent entity in the corresponding document is selected.
For establishing a connection of this type, exactly two documents are required: a current document in which the user works and a referent document in which the selection moves depending on the Sync rule parameters and the selection in the current document. There is no restriction on the number of connections which can take a certain document as a current one. The user can therefore connect one document with several referent documents, so that navigating in one document moves the selections in several documents at the same time.
In order for a connection to be established, a Sync Rule must be defined. Here is an example view of the Sync Rule Editor:
Each rule consists of three parts:
- Restriction (XPath) - this expression defines a restriction on the nodes in the current document for which a pattern expression will be generated. When a connection is established, whenever the selection in the current document is changed, for the new selected node the restriction expression is evaluated. If the result is approving (not empty node-set, not empty string, positive number or true boolean value), the node is suitable for proceeding with pattern generation.
- Pattern - the pattern here defines the way an XPath expression is constructed for evaluation in the referent document. The pattern expression is an XPath expression which may contain certain parameters whose values are generated on the basis of the selected node in the current document. The process of generating of an XPath includes: calculating the pattern parameters, converting the results to strings and inserting them in the pattern expression. The result is a complete XPath expression ready for evaluation in the referent document. The places in the pattern expression where parameter values are to be inserted are denoted by expressions in curly brackets ('{', '}'). The expressions between these brackets must be again XPath expressions which will be evaluated on the selection in the current document. The curly brackets in the pattern expression can be escaped by a leading '^' symbol.
- Ref Document - determines the referent document of the rule. When a connection is established with this rule, the selected document is used for evaluating the dynamically generated pattern (XPath) expressions.
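The restriction test and the pattern expansion described above can be sketched in Python as follows. This is only an illustration of the described behaviour: the function names, the evaluate callback (a stand-in for XPath evaluation on the node selected in the current document) and the sample values are assumptions made for this example.

```python
def is_approving(result):
    """Restriction test: non-empty node-set or string, positive number, or true."""
    if isinstance(result, bool):
        return result
    if isinstance(result, (int, float)):
        return result > 0
    return len(result) > 0  # string, or list standing in for a node-set


def expand_pattern(pattern, evaluate):
    """Build the XPath for the referent document from a Sync Rule pattern.

    `evaluate` is a hypothetical callback: it receives the XPath text found
    between curly brackets and returns its value for the selected node.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "^" and i + 1 < len(pattern) and pattern[i + 1] in "{}":
            out.append(pattern[i + 1])  # escaped bracket: copy it literally
            i += 2
        elif ch == "{":
            end = pattern.find("}", i + 1)
            if end == -1:
                raise ValueError("unclosed '{' in pattern")
            out.append(str(evaluate(pattern[i + 1:end])))  # parameter value
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Assumed values of two expressions on the selected node.
SELECTED = {"text()": "house", "@lang": "en"}

generated = expand_pattern("//entry[@lang='{@lang}'][form='{text()}']",
                           SELECTED.__getitem__)
escaped = expand_pattern("a^{b^}", SELECTED.__getitem__)
```

The sketch assumes that a parameter expression does not itself contain a closing curly bracket.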
Once a Sync Rule is defined, it can be applied in a connection by opening a document (which will serve as a current one) in the editor and then assigning rule(s) by choosing menu item Document/Synchronize with ....
- Document Index
The document indexing is a representation of the document's content in a way optimized for fast search. In CLaRK this representation can go down to the level of tokens. In order to use this optimized search, the user must preprocess (index) the input document(s). During indexing the system reads the data from the document and produces the index data, which is stored separately from the document. Whenever a fast search is needed, this data is loaded automatically. The ability to search in an indexed document is provided by an extension function of XPath: search(). With its help the index search can be used in many places and tools within the system.
In order to index a document, the user must first create a Document Index definition. These definitions determine the data on which the indexing will be performed. Each such definition uses an XPath expression to select the nodes to be indexed. With an additional XPath expression the user can further specify the representative value for each selected node. These values are converted to strings, tokenized (optional) and stored in the internal structure. For convenience, the user can index different parts of a document independently, thus forming different index repositories for one document. Each of them can then be used independently. Whenever an index search is needed, the user specifies the search query and the repository in which the search will be performed. If no repository is stated, the search is performed in all available repositories for the document.
To define a Document Index, the user has to use the following manager:
It contains a list of all document index definitions saved in the system. It is visualized in a table where the first column contains the names of the definitions and the second one contains lists of repositories of each index.
The possible operations here are: creating new definitions (New Index), modifying existing definitions (Edit Index) and removing existing definitions (Remove Index). Each index definition must have a name.
Once a definition is created, an indexing with it of document(s) can be applied with button Apply on document. The user is asked to select documents for indexing, after which the selected ones are indexed according to the selected definition and the data is saved. Having done that, fast searching can be performed in the processed documents. If an indexed document has to be modified later, a new re-indexing might be necessary.
The following section describes the creation of a document index definition. In order to create a definition the user has to supply a name for it. Having done that, a Document Index Editor window appears:
It contains information about the repository definitions of the current index definition. Initially the table is empty.
Each repository definition contains several parts:
- Name - the name for this repository, which later will be cited when a search is performed in it.
- Targets - an XPath expression for selecting nodes to be indexed in this repository. The target nodes are the nodes which will be found later during searching.
- Keys - an XPath expression which determines, for each selected target node, the value by which the node will be searched for later. In other words, a node will be found if it is searched for by its key value. Example: a dictionary in which each word entry has a certain XML structure. An appropriate indexing is: the targets are the root elements of each word structure and the keys point to the word(form) itself. Thus, the search query will be a word (or a part of a word) and the result will be the structure(s) which contain(s) the relevant information.
An additional option here is setting a tokenizer. It is needed when a document must be indexed not by whole text nodes, but by text tokens within the text. Here the user selects a Tokenizer for processing the key values and the token categories to be indexed (button Customize). With it, the user can filter the categories which are not interesting for indexing. Additionally a token normalization can be used.
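The roles of Targets, Keys and the optional tokenizer can be illustrated with a small Python sketch that builds one repository as a mapping from key tokens to target nodes. The sample dictionary document, the function name build_repository and the regular-expression tokenizer are assumptions made for this example; the internal index format of CLaRK is not described here.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical dictionary document.
DICTIONARY = """
<dict>
  <entry id="e1"><form>house</form><pos>noun</pos></entry>
  <entry id="e2"><form>run fast</form><pos>verb</pos></entry>
</dict>
"""


def build_repository(root, targets, key, tokenize=None):
    """Map each key value (or each of its tokens) to the target nodes."""
    index = defaultdict(list)
    for node in root.findall(targets):          # Targets: nodes to be found
        value = node.findtext(key, default="")  # Keys: value to search by
        tokens = tokenize(value) if tokenize else [value]
        for token in tokens:
            index[token].append(node)
    return index


root = ET.fromstring(DICTIONARY)

# Indexed by whole key values: "run fast" is a single index entry.
whole = build_repository(root, ".//entry", "form")

# Indexed by tokens: "run" and "fast" are separate index entries.
tokenized = build_repository(root, ".//entry", "form",
                             tokenize=lambda s: re.findall(r"\w+", s))

found = [n.get("id") for n in tokenized["fast"]]
```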
Searching in indexed documents
The searching in indexed documents is embedded in the extension of the XPath query language. Thus the indexing can be used wherever XPath search can be performed, i.e. all major system tools.
In order to use this functionality, the target documents have to be indexed in advance. When an index search is performed on a document for the first time, the relevant index information is loaded automatically. This may cause a short delay before proceeding with further tasks. If an error occurs during index data loading, nothing is loaded and subsequent searches will be unsuccessful (no result will be returned). Possible reasons for unsuccessful index data loading are that the document has been modified after it was indexed or that the index data files have been corrupted. Whenever something goes wrong with loading index data, the user can open the document in the editor and try to reload the data by using menu item Document/Load Index Data, which indicates the failure reason (if any).
The extension XPath function which allows index search is search(). Its usage is the following:
node-set search ( string, string? )
The function's result is a node-set (possibly empty) which contains the nodes from the current document which match the given search query. The search query is given in the first argument and represents a full or partial token value description, i.e. a certain word or a wildcard description. If no tokenization is used, the queries are matched against the whole node values stored in the index. A definition of the token value description language can be found in section Grammars.
The second function parameter is optional and refines the search results by considering only a certain repository within the whole index. In this way, if an index contains different repositories with different content types (for example, one containing words from text nodes and another containing ID values coming from attributes), the search will be more efficient and the results more precise when index repositories are cited.
Example index search queries:
- search("noun") - returns all nodes, whose value contains the token noun stored anywhere within all available repositories for the index.
- search("noun", "dictionary") - returns all nodes, whose value contains the token noun stored in repository dictionary for the index.
- search("12.3.#", "IDs") - returns all nodes, whose value contains tokens starting with 12.3 ("12.3.23", "12.3.4", "12.3.6", etc.) stored in repository IDs for the index.
- search("#aba#") - returns all nodes, whose value contains tokens containing the substring aba stored anywhere within all available repositories for the index.
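The behaviour of the example queries can be imitated with a small Python sketch. It assumes, based only on the examples above, that '#' stands for an arbitrary (possibly empty) sequence of characters; the full token value description language is defined in section Grammars and is not reproduced here. The index layout, the node names and the function names are hypothetical.

```python
import re


def query_to_regex(query):
    """Translate a query into a regular expression ('#' = any characters)."""
    parts = [re.escape(part) for part in query.split("#")]
    return re.compile(".*".join(parts))


def search(index, query, repository=None):
    """Return the nodes whose indexed tokens match the query.

    `index` maps repository name -> token -> list of nodes. When no
    repository is given, all repositories are searched.
    """
    pattern = query_to_regex(query)
    names = [repository] if repository is not None else list(index)
    hits = []
    for name in names:
        for token, nodes in index.get(name, {}).items():
            if pattern.fullmatch(token):
                hits.extend(n for n in nodes if n not in hits)
    return hits


# Hypothetical index with two repositories.
INDEX = {
    "dictionary": {"noun": ["n1"], "cabana": ["n2"]},
    "IDs": {"12.3.4": ["n3"], "12.3.23": ["n4"], "12.4.1": ["n5"]},
}

by_prefix = search(INDEX, "12.3.#", "IDs")
by_substring = search(INDEX, "#aba#")
```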
- Graphical Tree Layout
The Graphical Tree Layout is a means for drawing arbitrary tree structures represented in XML. The resulting graphical representation obeys different user adjustments, like colors and shapes rendering, text and structure definition and filtering and others.
The main graphical objects which can be used for nodes representation are: rectangles, rounded rectangles and ellipses. The user can specify their outline color and thickness, background color and the text label inside (font, color and content). The nodes in the drawing are connected with lines whose appearance is again user-defined: color and thickness. Additionally, there are cross-branches links available. They can connect any nodes in the drawing with arcs whose curvature can be adjusted (to avoid overlapping with other lines and arcs).
The layout itself is a set of rules, each of which corresponds to one shape definition. Each rule has a conditional part which defines to what kind of nodes the rule is applicable (element, text or comment nodes, or nodes appearing in a certain XPath-defined context). If the condition of a rule is satisfied for a certain node, a new graphical object appears in the drawing canvas. The appearance of the object is defined in the rule.
Another important part of each rule is the Children Definition section, which determines the nodes whose graphical representations will appear as children of the graphical representation of the current node. The children definition is based on an XPath expression, and this allows nodes which are not direct children of the current node (or even nodes which do not belong to the current structure at all) to be visualized as child nodes. The default value of the children definition for each rule is child::*, i.e. all direct child nodes.
If none of the rules in a layout is applicable to a certain node, there are three default rules, one of which always succeeds depending on the node type (element, text or comment).
An important section in each layout is the Structure roots definition section. It defines which nodes in the current document are suitable to be roots of a structure to be visualized. It contains an XPath expression which is used as a condition and if it is evaluated successfully for a certain node, the graphical representation building starts from it. Otherwise, the system searches for the closest ancestor which satisfies the condition.
In order to visualize a document (or a part of a document), it must be opened in the system editor. After a node in the document is selected, the menu item View/Graphical Tree View must be chosen, which shows the graphical representation in a separate window. The layout which will be used is defined in the DTD Tree Layout of the current document.
In order to define a graphical tree layout, the user must select menu item Definition / Graphical Tree Layout. The following dialog window appears:
The layouts manager contains several sections which are described below.
The Layouts section contains a list of all layouts which are currently available in the system. Initially the list contains only the entry Default Graphical Layout. The selection in this section determines the content of the other sections of the manager, i.e. it determines the current layout for editing or just exploring.
The Rules section contains information about the rules defined in the current layout. It is represented in a table where each row represents one rule. Having selected a rule from the table a preview image is generated according to the rule's settings and it is shown in section Preview. The preview image contains fixed text and a shape with fixed dimensions. All other settings are taken from the rule.
In this section the user can add a new rule (New Rule), Edit an existing rule or remove a rule (Remove Rule). The definition of a new rule or modification of an existing one is described in section Layout Rule Editor.
The first three rows of the table contain the default rules for the layout. They can be modified, but cannot be removed.
The Links section contains information about the defined cross-branches links in the layout. Each such link is directed (although this is not visible on the canvas). The starting point of each link is determined by an XPath expression which is evaluated on the whole document. For each selected node which is visible on the canvas a second XPath expression is evaluated, and its result determines the ending point of the corresponding link. If a starting or ending point is not visible, no link is drawn. Each link definition is processed independently. The creation of a link definition or modification of an existing one is described in section Layout Link Editor below.
The Structure roots definition section determines which nodes from the current document can represent a root for a structure to be visualized. The XPath expression is used as a condition for each selected node and in case of success the structure building starts from the corresponding node. If a condition fails for a node, it is checked again for its parent node and so on until a suitable root node is found. If no suitable root is found the system shows an error message.
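The search for a structure root described above (test the selected node, then its parent, and so on) can be sketched in Python. The sample document and the condition are hypothetical, and a plain Python predicate stands in for the XPath condition.

```python
import xml.etree.ElementTree as ET


def find_structure_root(doc_root, selected, is_root):
    """Return the selected node or its closest ancestor accepted by is_root."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in doc_root.iter() for child in parent}
    node = selected
    while node is not None:
        if is_root(node):
            return node
        node = parents.get(node)  # None once the document root is passed
    raise LookupError("no suitable structure root found")


# Hypothetical document: sentences (<s>) are the structures to be drawn.
doc = ET.fromstring("<text><s><np><w>dog</w></np></s></text>")
word = doc.find(".//w")

structure_root = find_structure_root(doc, word, lambda n: n.tag == "s")
```

The raised LookupError corresponds to the case in which the system shows an error message because no suitable root exists.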
Control buttons:
- New Layout - creates a new layout and prompts the user for a name for it;
- Remove Layout - removes the current layout after a confirmation message;
- Options - offers some global options concerning the graphical visualization.
- Margins(top, bottom, left, right) - determines the spacing between the tree image and the corresponding borders in pixels;
- Nodes H Gap - determines the minimal space (in pixels) between the nodes in horizontal alignment;
- Nodes V Gap - determines the minimal space (in pixels) between the nodes in vertical alignment;
- Background - determines the background color of the drawing canvas.
- OK - updates the changes on the layouts (if there are such) and closes the dialog;
- Cancel - discards the changes on the layouts (if there are such) and closes the dialog;
Layout Rule Editor
The Layout Rule Editor is used for creating new graphical tree layout rules or modifying existing ones. The layout of the editor is the following:
The dialog contains several sections:
- Shape - describes the shape characteristics for this rule:
- Nodes shape - determines the shape which will be drawn for a node: Rectangle, Rounded Rect(angle) or Ellipse;
- Outline width - determines the outline thickness of the shape in pixels;
- Outline color - determines the outline color of the shape;
- Background - determines the background color of the shape;
- Parent arc color - determines the color of the arc which connects the current node with the parent;
- Parent arc width - determines the thickness of the arc which connects the current node with the parent in pixels.
- Preview - shows an image preview of the current rule's shape. It changes automatically when a characteristic is changed. Except for the text content (which is context dependent) all other characteristics are applied on the preview and the shape appears in its realistic dimensions.
- Label - this section determines the label text and its appearance within the shape. The options are:
- Label pattern - contains a definition of the label text content. The syntax of the pattern definition is the same as the one in the DTD Tree Layout;
- Label Color - determines the color of the label text;
- Font Name - determines the font of the label text;
- Font Size - determines the size of the label text;
- Font Style - determines the style (Plain, Bold, Italic) of the label text;
- Children Definition - determines the child nodes which will be drawn on the canvas for the current node. The children are the result of evaluating an XPath expression with the current node as context. If the result is an empty node list or if this field is empty, no children are drawn for the current node;
- Context Restriction - this section contains the conditions which have to be satisfied by a node in order for this rule to be applied. There are two types of conditions:
- by tag name - this condition is satisfied if the node being tested is of type element and its tag name coincides with the value specified in this field;
- by xpath - the expression specified in this field is used as a predicate for the current node. If the evaluation is successful the rule is used.
Layout Link Editor
The Layout Link Editor is used for drawing cross-branches arcs in order to express relations other than parent-of or child-of. Each link of this kind is directed and it has a starting point (target) and an ending point (reference). The targets and the references are determined by XPath expressions evaluation.
Each link definition is processed independently in the following way. An XPath expression is evaluated on the whole source document. The resulting node set is reduced to those nodes which appear in the current view. This forms a set of candidates for link starting points. For each of them a second XPath expression is evaluated, and if the result is a non-empty list, a link is established between the current context and the first entry of the result which is also represented in the image.
Here is the layout of the Link Editor dialog:
The components of the editor are as follows:
- Preview - shows a dynamically updated preview image according to the current settings;
- Color - determines the color of the link;
- Width - determines the thickness of the link;
- Deviation - determines the curvature of the link in order to avoid overlapping with other components. The value here is a percentage coefficient of the length of the link (positive values - bulged curve; 0 - straight lines; negative values - concave lines);
- Targets - an XPath expression determining the starting points of the links;
- Reference - a relative XPath expression determining the ending points of the links.
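The link processing described in this section can be summarized in a short Python sketch. Nodes are represented by plain strings, and the results of the Targets and Reference expressions are supplied as ready-made Python values; these stand-ins, the function name and the sample data are assumptions made for this example. The sketch reads "the first entry of the result which is also represented in the image" as the first visible entry of the result.

```python
def resolve_links(targets, visible, reference):
    """Return (start, end) pairs for the links to be drawn.

    targets   - result of the Targets expression on the whole document
    visible   - nodes which appear in the current view
    reference - callable giving the Reference result for a starting node
    """
    links = []
    for start in targets:
        if start not in visible:
            continue  # starting point not on the canvas: no link
        for end in reference(start):
            if end in visible:
                links.append((start, end))  # first visible entry only
                break
    return links


# Hypothetical data: pronouns pointing to their antecedents.
REFERENCES = {"he": ["John"], "it": ["car", "house"], "she": ["Mary"]}
VISIBLE = {"he", "it", "John", "house"}

links = resolve_links(["he", "it", "she"], VISIBLE, REFERENCES.__getitem__)
```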
- Export Definitions
The user can save the selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from the system as a file.
If a tokenizer which is used in some of the definitions is not exported with them, a warning message is given.
- Import Definitions
The user can load selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from a file. The imported file must be generated by the export operation from the system.
The Loading Options (see below) determine the behaviour of the import operation.
If one of the new definitions uses a tokenizer which does not exist in the system, a warning message is given and the user can correct the definition in a validation dialog. The dialog depends on the type of the definition: Element Features Validation, Attribute Features Validation, XPath Macros Validation or Keys Validation.
Loading Options
The loading options concern the cases in which the system already contains a definition with the same name as a newly imported definition. There are four modes:
- Overwrite all - the data of the system definition is overwritten by the data from the new definition.
- Do not overwrite all (Skip) - the data of the new definition is skipped.
- Do not overwrite all (New Name) - the user can save the data of the new definition with another name.
- Ask for each - the user is asked whether to overwrite the data of the definition in the system with the data from the new definition or to save the new definition data with another name.
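The four modes can be compared in a small Python sketch that merges imported definitions into the existing ones. The mode names "overwrite", "skip", "rename" and "ask", the callbacks and the sample data are labels invented for this example; they are not identifiers used by the system.

```python
def import_definitions(system, incoming, mode, choose_name=None, ask=None):
    """Merge imported definitions into a copy of the system definitions.

    mode        - "overwrite", "skip", "rename" or "ask"
    choose_name - callback giving a new name for a conflicting definition
    ask         - callback returning "overwrite" or "rename" for one name
    """
    result = dict(system)
    for name, data in incoming.items():
        if name not in result:
            result[name] = data  # no conflict: simply add the definition
            continue
        action = ask(name) if mode == "ask" else mode
        if action == "overwrite":
            result[name] = data
        elif action == "rename":
            result[choose_name(name, result)] = data
        elif action != "skip":
            raise ValueError(f"unknown loading option: {action}")
    return result


SYSTEM = {"np": "old grammar"}
INCOMING = {"np": "new grammar", "vp": "verb grammar"}


def suffix(name, taken):
    # Hypothetical naming scheme for renamed definitions.
    return name + "_imported"


overwritten = import_definitions(SYSTEM, INCOMING, "overwrite")
skipped = import_definitions(SYSTEM, INCOMING, "skip")
renamed = import_definitions(SYSTEM, INCOMING, "rename", choose_name=suffix)
asked = import_definitions(SYSTEM, INCOMING, "ask", choose_name=suffix,
                           ask=lambda name: "overwrite")
```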
Menu Tools
- Entity Converters
This tool handles documents which contain symbols not supported by the local hardware architecture. It substitutes the symbols with entities according to the standard ISO 8879 and vice versa. Currently, this tool supports 19 sub-sets of entity-char conversions. Each of them can be activated or deactivated. One reason for excluding some of the sub-sets is that sometimes not all symbols have to be converted, for example commas, dots, colons and semicolons.
Example: ("дума" in Bulgarian is the equivalent of "word")
"&dcy;&ucy;&mcy;&acy;" <-- entity conversion --> "дума"
The tool operates on the document which is currently opened in the system or on a set of documents from the Internal documents database. It can be started from menu item: Tools/Entity Converters.
The dialog window looks in the following way:
The window represents a list of converters (filters) which will be used in the replacement procedure. The list content can be managed by using buttons:
- Add Filter - enables the addition of a new filter or a set of filters to the current list. When this button is pressed, the user is shown a list of all available filters which are not yet present in the working list. The list is placed in a new dialog window with the following layout:
Here the user selects filters to be added. The control buttons are as follows:
- Add - includes the selected item(s) in the working list;
- Preview - shows in details the currently selected filter in the list;
- Cancel - closes the dialog without any other action;
- Add All - includes all list items in the working list.
- Remove Filter - removes the selected item(s) from the working list.
- View Filter - shows detailed information about the selected item (filter) in the list. The information is shown in table form, where each row represents a single character-to-entity mapping. The table has three columns: Entity (the literal representations of the entities), Value (the character (uni)codes in hexadecimal format) and Preview (the characters themselves).
The direction of conversion is determined by the two radio buttons: Entity to Character and Character to Entity.
Additionally, the user can restrict the scope of the conversion, i.e. the conversion can be applied only to certain places in the documents, leaving the rest unchanged. If no restriction is used, the conversion is applied to all attributes, text nodes and comment nodes in the document(s). To set a restriction, the Enable Filtering checkbox must be selected. The user is then expected to supply an XPath expression which selects the nodes on which the conversion will be applied. Each application on a node also converts the whole content of the node, i.e. all descendant nodes suitable for this operation. Thus if, for example, only data included in paragraphs must be processed, the XPath expression must select only the paragraph nodes.
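The effect of such a restriction can be sketched with Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset and discards comments when parsing, so comment nodes are not covered here. The function `convert_scoped` and its parameters are hypothetical; `convert` stands for whichever conversion direction is selected.

```python
import xml.etree.ElementTree as ET


def convert_scoped(root, xpath, convert):
    """Apply convert() to the text and attribute values of every node
    selected by `xpath` and of all of its descendants."""
    done_nodes, done_tails = set(), set()
    for selected in root.findall(xpath):
        for node in selected.iter():       # the node itself and its descendants
            if id(node) not in done_nodes:
                done_nodes.add(id(node))
                if node.text:
                    node.text = convert(node.text)
                for key, value in list(node.attrib.items()):
                    node.set(key, convert(value))
            # Text following a descendant still lies inside the selected node.
            if node is not selected and node.tail and id(node) not in done_tails:
                done_tails.add(id(node))
                node.tail = convert(node.tail)
```

For example, with the expression `.//p` only content inside `p` elements is converted, while text in sibling elements is left as it was.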
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
- Grammars
A Grammar in the CLaRK System is defined as a set of rules. Each rule consists of three regular expressions and a category (represented as an XML fragment, called Return Markup). The three regular expressions are called Regular Expression, Left Regular Expression and Right Regular Expression. The Regular Expression determines the content which the rule can be applied to. The Left and the Right Regular Expressions determine the left and the right context of the content the rule recognises (if there are no constraints over one of the contexts, then the corresponding expression is empty). When the rule is applied and recognises some part of an XML document, that part is substituted by the return markup of the rule. If it is necessary to keep the recognised part, it can be cited by using the variable \w. If the user needs the literal string \w in the return markup, he/she can escape the variable by writing ^\w.
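The substitution of the recognised part into the return markup, including the ^\w escape, can be sketched as follows. This is a minimal illustration written for this manual, not the system's code, and the function name `fill_return_markup` is hypothetical.

```python
def fill_return_markup(markup, match):
    """Insert `match` for every \\w in `markup`; ^\\w yields a literal \\w."""
    out = []
    i = 0
    while i < len(markup):
        if markup.startswith("^\\w", i):   # escaped: keep the two characters \w
            out.append("\\w")
            i += 3
        elif markup.startswith("\\w", i):  # variable: cite the recognised part
            out.append(match)
            i += 2
        else:
            out.append(markup[i])
            i += 1
    return "".join(out)
```

With the return markup `<np>\w</np>` and the recognised part `<w>big</w><w>dog</w>`, the result is `<np><w>big</w><w>dog</w></np>`. Note that an unescaped `\word` in the markup would also have its leading `\w` replaced.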
The regular grammars in the CLaRK System work over token and/or element values generated from the content of an XML document and they incorporate their results back in the document as XML mark-up.
The tokens are determined by the corresponding tokenizer.
Before having been used in the grammar, each XML element is converted into a list of textual items. This list is called element value for the XML element. The element values are defined with the help of XPath keys, which determine the important information for each element.
In the grammars, the token and element values are described by token and element descriptions. These descriptions could contain wildcard symbols and variables. The variables are shared among the token descriptions within a regular expression and can be used for the treatment of phenomena like agreement.
Here is the list of the token and element descriptions:
- "token" -> describes the token itself. This description can be matched to the token itself and nothing else.
- $TokenCategory -> describes all tokens of the category TokenCategory. In the grammar input this description is matched against exactly one token of this category.
- Wildcard Symbols: #, @, % -> describe substrings of a given token. # describes a substring of arbitrary length (zero or more characters), @ describes a substring of length 0 or 1, and % describes a substring of length exactly one. Here are some examples:
  - "lov#": matches exactly one token which starts with "lov": "lov", "love", "loves" and many others.
  - "lo#ve": matches exactly one token which starts with "lo" and ends with "ve": "love", "locative", "locomotive" etc.
  - "%og": matches exactly one token which consists of one character followed by "og": "bog", "dog", "fog", "jog" etc.
  - "do@": matches exactly one token which consists of "do" followed by at most one character: "do", "doe", "dog", "don" and others.
- Variables: &V -> describes some substring in a token. When it is initialised for the first time, the variable takes that substring as its value; afterwards it matches only the same substring, in the same token or in some following tokens. The scope of a variable is one grammar rule. The variable can be used in the return mark-up, and in this case the value of the variable is copied into the return mark-up. Each variable consists of the symbol & followed by a single Latin letter. Each variable has positive and/or negative constraints over the possible values. Both the positive and the negative constraints over variables are given by lists of token descriptions. The value assigned to a variable during the application of the grammar has to be described by one of the positive constraints and must not be described by any of the negative constraints. These token descriptions can contain wildcard symbols, but no other variables. Here are some examples: "A&N&G", "Nc&N&G". These token descriptions can be used in a rule to ensure agreement in number and gender between an adjective and a noun.
- Complex token descriptions. The user can combine the above descriptions in one token description. Some examples: "lov%#", "Vp&N&G#&P"
- Element description is a regular expression in angle brackets: <Regular Expression>. Here the Regular Expression is over token descriptions and is matched against the element value. Examples:
  - <w>: matches exactly one w element.
  - <$TokenCategory>: matches exactly one element whose element value is a token description with category TokenCategory.
  - <"token">: matches exactly one element whose element value is the token itself and nothing else. A token can contain wildcard symbols and variables.
  - <<N>>: matches exactly one element whose element value is the XML element N.
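The wildcard symbols and the variable constraints described in the list above can be approximated with ordinary regular expressions. The sketch below translates a token description containing #, @ and % into a Python `re` pattern and checks a candidate variable value against positive and negative constraint lists. It ignores variables and token categories inside descriptions, and all function names are hypothetical.

```python
import re

# Wildcard symbol -> regular expression fragment.
_WILDCARDS = {
    "#": ".*",  # substring of any length, including empty
    "@": ".?",  # substring of length 0 or 1
    "%": ".",   # substring of length exactly 1
}


def description_to_regex(description):
    """Translate a token description with wildcards into a compiled pattern."""
    parts = [_WILDCARDS.get(ch, re.escape(ch)) for ch in description]
    return re.compile("".join(parts), re.DOTALL)


def matches(description, token):
    """True if the whole token is described by the description."""
    return description_to_regex(description).fullmatch(token) is not None


def value_allowed(value, positive, negative):
    """Check a variable value against its constraint lists.

    With no positive constraints any non-empty value is accepted;
    a value described by any negative constraint is always rejected.
    """
    if positive:
        accepted = any(matches(d, value) for d in positive)
    else:
        accepted = value != ""
    return accepted and not any(matches(d, value) for d in negative)
```

For instance, `matches("lo#ve", "locomotive")` holds while `matches("%og", "frog")` does not, because % stands for exactly one character.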
The application of a rule works in the following way. The element value is scanned from left to right. The rule's Regular Expression is evaluated from the current point. If it recognises a part of the element value (this part is called a match of the rule), then the regular expressions for the left and for the right contexts are evaluated (if they are not empty). If they are satisfied by the context of the match, then the match is substituted in the return markup for each occurrence of the variable \w. (The user must be careful: if the return markup contains, for example, text like \word, the beginning \w will be substituted by the match. In this case the variable must be escaped.) After these substitutions, the new markup replaces the match in the XML document.
When a regular expression is evaluated from a given point within the element value, several matches of the expression may be possible. For instance, the expression (A,B)+ over the element value L,A,B,A,B,A can recognise two matches from the second position: A,B and A,B,A,B. This allows for a non-deterministic choice at this place. One can choose either the shortest match, or the longest one, or some in between. Generally, there are no universal principles for making such a choice. This is why the CLaRK system allows the user to define a strategy for choosing among several matches. There are four strategies:
- shortest match - the system always selects the shortest possible match;
- longest match - the system always selects the longest possible match;
- any up - the possible matches are enumerated from the shortest to the longest until one is found whose left and/or right context satisfies the Left and/or the Right Regular Expression. If there are no Left and Right Regular Expressions, then any up is the same as shortest match;
- any down - similar to any up, except that the possible matches are enumerated from the longest to the shortest.
These strategies are specified within the grammar queries. This allows the same grammar to be applied with different strategies over different documents.
The definition and application of a grammar are separated within the CLaRK system. The grammar itself is defined in one place; the parameters for its application are defined in another place, in the form of grammar queries. This separation allows the same grammar to be used with different parameters, such as different tokenizers, different element values, different filters etc. Both grammars and grammar queries have an XML representation. These XML representations allow grammars and their queries to be exchanged among different users. They also allow grammars to be constructed outside the system and then imported into it.
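The four match strategies can be illustrated with the (A,B)+ example over the element value L,A,B,A,B,A. The sketch below is a simplification made for this manual: instead of a compiled automaton it only handles the repetition of a fixed unit, and it assumes that shortest and longest test the context on their single candidate only. The names `candidate_matches` and `choose_match` are hypothetical.

```python
def candidate_matches(items, start, unit):
    """End positions of all matches of unit+ beginning at `start`, shortest first."""
    ends = []
    pos = start
    size = len(unit)
    while items[pos:pos + size] == list(unit):
        pos += size
        ends.append(pos)
    return ends


def choose_match(items, start, ends, strategy, left_ok=None, right_ok=None):
    """Pick the end position of a match according to the strategy, or None.

    left_ok / right_ok stand for the Left and Right Regular Expressions;
    each receives the items before / after the candidate match.
    """
    def context_ok(end):
        return ((left_ok is None or left_ok(items[:start])) and
                (right_ok is None or right_ok(items[end:])))

    if strategy == "shortest":
        order = ends[:1]
    elif strategy == "longest":
        order = ends[-1:]
    elif strategy == "any up":
        order = ends
    elif strategy == "any down":
        order = ends[::-1]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    for end in order:
        if context_ok(end):
            return end
    return None
```

With `items = ["L", "A", "B", "A", "B", "A"]` the candidates from the second position end at positions 3 and 5, i.e. A,B and A,B,A,B. If the right context is required to be exactly the single item A, shortest finds no acceptable match, whereas any up moves on to the longer candidate and succeeds.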
The grammar definition consists of a set of rules, variable definitions and context evaluation parameter (Check Context Order). The rules have already been discussed. The variable definitions are given by the positive and negative constraints over the variable. The context evaluation parameter determines the regular expression for which the context will be checked first - the left and then the right, or vice versa.
The grammar application determines: the elements which the grammar will be applied to; the element values for those elements, including the tokenizer and the filter; whether the textual elements will be normalized; the application strategy (longest, shortest match, any up and any down match).
-
- Grammar Manager
The grammar manager is the user interface in the CLaRK System for management of grammar definitions. It supports the user in the creation, modification, deletion of grammars. The main dialog is Entry Manager with additional buttons. It contains all of the available grammars arranged in a tree hierarchy, some of their features (Editable, Compiled) and buttons for management of the grammars.
Each grammar has to be compiled in order to be used in the system. The compilation converts the regular expressions into a finite-state automaton. Because the compilation is a heavy process, it is sometimes better to postpone it (for example, when a large grammar is imported into the system). The column Compiled in the table shows whether the grammar is compiled or not. If the grammar is compiled, the corresponding box is checked.
The system also allows the user to export and import already compiled grammars. This option is very useful for large grammars, when the compilation takes a long time, or when the user wants to exchange just the compiled grammar but not its source. If such a grammar is used in the system, it cannot be edited. In such cases, the check box in the column Editable is not checked.
Here is the main window of the Grammar Manager:
There are several grammars. The grammar Slovnik One is editable and also compiled. Thus, if necessary, the user can open it in the Grammar Editor and modify it. The grammar Slovnik One[1] is editable, but not compiled. Thus it cannot be used immediately; first it has to be compiled. The grammar Slovnik One[3] is available only in compiled form. It cannot be modified in this form, but it can be used to process the relevant documents.
The manager window consists of two main parts:
- The panel on the left. It contains the tree representations of the group hierarchy. When the user selects a node in this tree, the content of the corresponding group is loaded in the component on the right side.
- Current group monitor. This is the panel situated on the right side of the window. It is a list with the content of the currently selected group in the tree. The list components shown in blue are sub-groups. The other ones are the grammars included in this group. They are colored in black or red, depending on whether they are valid or not according to their DTDs. The user can sort all the grammars in a group by clicking on the Name column of the table header. It is possible to rearrange the grammars in a group by drag-and-drop, i.e. pressing a grammar and moving it upwards or downwards until the desired position is reached.
There are six additional buttons which can be used for modifying the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to give a name for it;
- Remove - removes the selected grammars and/or groups from the list. The removal is preceded by a confirmation message. If the selection includes sub-groups, they are also removed with their entire contents.
- Rename - gives a new name to the selected grammar in the list.
- Copy - saves the data of the selected grammar under a different name given by the user.
- Add Grammars - gives a list of all grammars which are not present in the current group. The user is expected to choose one or more grammars to be included in the current group.
- Delete! - this function can be used for removing grammars from the internal grammar database. It can be applied only to single grammars, not to whole groups. Groups are excluded from any selections. The removal of grammars is preceded by a confirmation message. The grammars to be removed are excluded from all the groups they may belong to.
Navigation in the group structure can be made also in the panel on the right. When the user wants to see the content of a certain sub-group of the current group, s/he just has to perform a double click on the desired sub-group. This will change the current group to the new one. This represents the movement from a group to a sub-group. The movement in the other direction is also possible. For each grammar group (excluding the Grammars), a special sub-group is included, named: ". .". By performing a double click on it, similarly to most file systems, the current group is changed to its parent one.
The Grammar Manager also provides a list of all the grammars, no matter which group they are included in. The following information appears for each grammar: date - when it has been last modified; if it is a query, which tool it refers to; and which is its DTD. The user can sort all the grammars by clicking on the Name column of the table header. When selecting grammars, the right button of the mouse is used to visualize the Pop-up menu with the following operations on selected grammars:
- Info - This item shows the following information about the selected grammars:
- grammar name
- grammar size
- grammar's dtd name
- whether the grammar is valid
- group of the grammar
- Add In Group... - This item shows a dialog with the hierarchical structure for the groups in the system and the user can choose a group in which to place all the selected grammars.
- Delete! - It is described above.
- Rename - It is described above.
- Copy - It is described above.
Under the table with the available grammars there are 3 buttons (New, Edit, Compile) and 3 menus (Compiled Grammar, File I/O, XML Editor). They can be used to manage the grammars.
Buttons:
- New - creates a new empty grammar with a name specified by the user and opens the Grammar Editor;
- Edit - opens the selected grammar in the Grammar Editor;
- Compile - compiles the selected grammars. This operation is relatively slow. For large grammars (with thousands of rules) it might take several minutes;
- Apply - switches from the Grammar Manager dialog to the Apply Grammar dialog and allows the user to apply some of the grammars;
- Exit - exits the Grammar Manager dialog.
Menus:
The three menus allow the import and export of grammars from and to the system. As mentioned above, the grammars in CLaRK have an XML representation. Thus the user can load a grammar from an external file or from a document within the system. The user can also save a grammar created within the system to an external file or as a document in the system. In this way the user can exchange grammars with other users, make backup copies of them, or process them with other tools in the system, such as sorting, searching etc. Additionally, there is a format for saving and loading compiled grammars.
- Compiled Grammar
  This menu lets the user store and load grammars in compiled format to/from a file. It has the following items:
  - Load compiled grammar from file - the user is asked to choose a file which contains the compiled grammar. Then the system reads the file, interprets it as a CLaRK finite-state automaton and stores it in the grammar database of the system. Such a grammar cannot be modified, but it can be applied.
  - Save compiled grammar to file - the user can save a compiled grammar into a file. The file has a special format and it cannot be modified in any reasonable way, thus it can be used only for exchanging grammars in compiled form.
- File I/O
  This menu lets the user store and load grammars in XML format to/from a file. It has the following items:
  - Load grammars from file - the user is asked to choose a file which contains the grammars in XML format. Then the system reads the file, interprets it as CLaRK grammars and stores them in the grammar database of the system in a table format. Such grammars can be modified within the system.
  - Save grammars to file - the user can save editable grammars into a file. The file is an XML document and it can be modified outside the system.
  - Save grammars with groups to file - the user can save editable grammars into a file together with the group structure for the selected grammars. The file is an XML document and it can be modified outside the system.
- XML Editor
  This menu lets the user store and load editable grammars in XML format to/from XML documents internal to the system. It has the following items:
  - Load current document as grammar - the user has to open the document which contains the grammar as the current document in the system. When this item is chosen, the system reads the document, interprets it as a CLaRK grammar and stores it in the grammar database of the system in a table format. Such a grammar can be modified within the system. This option allows the user to create grammars from the documents produced by other tools in the system and load them in the Grammar Manager.
  - Edit grammar in editor - the selected grammar is converted from table format into an XML format and is loaded as a document in the system. Thus it becomes the current document of the system. The user can manipulate the document with the tools of the system. Useful processing can include sorting, searching, etc.
- Grammar Editor
This is the editor for grammars in the CLaRK system in table format. The editor contains the following elements: Rules, Option, Variables, and three buttons Save, Compile, Exit. Here is the main window of the Grammar Editor:
Rules
The table Rules contains the rules of the grammar. Each row of the table represents one rule. The columns follow the structure of the grammar rules in CLaRK. The column Regular Expression has to contain the regular expression which will be matched against the element content. The column Return Markup is the second obligatory element in a rule. It contains the XML fragment which will replace the matched part of the element content. There are two columns for the regular expressions which determine the left and the right context of the match - Left Regular Expression and Right Regular Expression. The last column is for comments on the rule.
The content of the table cells is not checked for consistency before the compilation.
Context Check Order
This option lets the user choose which context will be checked first - the right or the left. In this way a preference over rules can be defined. The two orders are: first the left context, then the right one (Left->Right); first the right context, then the left one (Right->Left).
Variables
This table contains the definitions of the variables for the grammar. Each row of the table represents the definition of one variable. Each definition consists of the following elements:
- Name - the name of the variable, which is a capital Latin letter;
- Positive Values - the positive constraints for the variable, given as a list of token descriptions. If the cell is empty, then the variable is not constrained and can have any non-empty value;
- Negative Values - the negative constraints for the variable, given as a list of token descriptions. All the values that can be described by any of the token descriptions in the cell are forbidden as values of the variable. If the cell is empty, then the variable is not negatively constrained and can have any value described by the positive constraints;
- Match - like every token description in the CLaRK system, a variable can match several strings starting from a given position. This cell defines which match is chosen. There are two options: Longest - the variable is assigned the longest possible value, and Shortest - it receives the shortest possible value.
The management of the variable table is done through a pop-up menu which appears when the user clicks with the right button of the mouse on a cell in the table. The possible choices are:
- Insert row - allows the user to define a new variable;
- Delete row - allows the user to delete the definition of a variable;
- Edit Cell - allows the user to modify the list of token descriptions for the positive or the negative constraints for the variable;
- Up and Down - allow the user to rearrange the list of the variable descriptions for her/his own convenience.
When the user edits the constraints for a variable (Edit Cell) the following dialog appears for the positive constraints:
or for the negative constraints:
In both cases the constraints are represented as a list of token descriptions (one per line). The user can enter a new description with the button Add Value - in this case a text edit field appears and the new token description has to be entered in it. The user can delete token descriptions by selecting them in the list and clicking on the button Remove Value(s). The buttons OK and Cancel are used for the acceptance or rejection of the changes.
Buttons
The buttons at the bottom of the dialog let the user save the grammar (Save), compile the grammar (Compile; in case of errors in the grammar a corresponding error message appears) and exit the dialog (Exit; the system prompts for unsaved changes).
- Apply Grammar
As it was described in the Grammars menu, the definition and application of a grammar are separated within the CLaRK system. In the Grammars menu the user can find a description of the grammars and how to construct their definitions in the CLaRK system. In this section the user can read how to apply a grammar over one or several XML documents.
Here is the main dialog of the Apply Grammar tool:
The application of a grammar requires the following types of information: the name of the grammar (text field Grammar), the target of application (text field Apply to), how the input is to be prepared (the second row of options: Tokenizer, Filter, Normalize, and Element Values), how the rules of the grammar are to be applied (the third row of options: the match options for the left context regular expression (Left), for the main regular expression (Body), and for the right context regular expression (Right)), and whether the context can be backtracked (Context Backtracking). A combination of all these options is called a grammar query.
Additionally, the dialog allows the user to consult the definitions which are connected with some DTD in the system. Generally, if some of the necessary information is not present in a particular grammar query, the system checks the corresponding information connected with the DTD of the document. This can be done through the menu Features.
The user can save the current settings of a query as an XML document by selecting the Queries check box. The user can then save the query with some comments in the Info text field. The user can also select a previous query from the list of queries. The grammar query XML documents are saved in the group SYSTEM:Queries:Grammar.
The user can also specify whether the grammar is to be applied to the currently opened document or to documents stored internally in the system. This is done by choosing Multiple Apply. In this case the user can select several documents to apply the grammar to.
At the bottom of the dialog there are three buttons: Apply, for applying the currently stated query (or one loaded from its XML representation); Close, for closing the dialog; and Select, for navigating over the current document and annotating it manually (see below).
Element Values
When a grammar is applied over content which contains XML elements, the system converts each such element into an element value. How exactly this conversion is performed is stated in Element Value definitions. These definitions can be connected with a particular DTD, but the grammar query allows the user to override the DTD settings and to define them locally in the query. Each element value is connected with an element tag and a sequence of XPath expressions (called keys) which define the sequence of tokens or elements for the element value. See below for examples. The dialog of the Element Values editor is as follows:
Each row of the table represents an element value for an element. The first column contains the names of the elements for which an element value is defined. In this case they are w and pt. The Keys column contains the keys for each definition. In this case the value of the element w is the text in its ph element and the text in its ta element. For the pt elements the definition says that the element value is their text content.
The user can edit the value of the right column by selecting the Edit Cell item from the pop-up menu which appears when the user presses the right mouse button over the cell (the menu is visible in the screen shot). There are two modes for the element value: Tool and User. In the following screen shots the w element is shown in both modes:
Each row of the table represents one key. All keys in the table define the value of one element. One key consists of a key name (option), a key value (XPath expression), a normalize option and a tokenizer name. If the element value is in Tool mode, then the normalize option and the tokenizer are taken from the grammar. The key in the table is called Grammar Key. It can be saved and loaded into the system memory. This is done by selecting the Load Key and Save Key menu items in the context menu shown when the user right-clicks a cell in the table. The user can also add or remove keys with items from this menu. The normalize option and the tokenizer name determine the input word which is created for the elements when the grammar is applied. An interesting option here is the "No Tokenizer" tokenizer. If it is selected, then the text nodes are treated as one token. When the OK button is clicked, all XPath expressions in the table are compiled.
The element values are calculated in the following way:
- If the XPath expression selects textual content or the value of an attribute, the corresponding text is tokenized by the relevant tokenizer. Then the value is a sequence of tokens.
- If the XPath expression selects one or more elements, then each element is represented as <tagname>, where tagname is the tag of the element.
- If there is no element value definition for the element, then it is represented as <tagname>, where tagname is the tag of the element. The difference from the previous case is that if the element value is defined by self::*, then the element value will be <<tagname>>.
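The three cases above can be sketched in Python. This is only an approximation: real keys are full XPath expressions, while this sketch recognises just four hand-picked key forms (`text()`, `child/text()`, `self::*` and a plain element path), does not cover attribute values, and uses whitespace splitting in place of a CLaRK tokenizer. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Concatenate the text nodes that are direct children of the element."""
    return (element.text or "") + "".join(child.tail or "" for child in element)


def element_value(element, keys, tokenize=str.split):
    """Compute a simplified element value as a list of textual items.

    keys is None when no element value is defined for the element.
    """
    if keys is None:
        return [f"<{element.tag}>"]
    value = []
    for key in keys:
        if key == "self::*":
            value.append(f"<<{element.tag}>>")
        elif key == "text()":
            value.extend(tokenize(own_text(element)))
        elif key.endswith("/text()"):
            for child in element.findall(key[:-len("/text()")]):
                value.extend(tokenize(own_text(child)))
        else:  # the key selects elements
            value.extend(f"<{child.tag}>" for child in element.findall(key))
    return value
```

For a `w` element with `ph` and `ta` children, the keys `ph/text()` and `ta/text()` produce the tokens of both children in order, which corresponds to the definition shown for `w` in the Element Values dialog.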
Select
This button allows the user to apply a grammar in an interactive mode. In this mode the grammar is executed on the current document and stops at each match, allowing the user to see the selected content and to perform one of the following actions: go to the next selection (Next button), go to the previous selection (Previous button), add the return mark-up to the content (Mark button), or exit the mode (Exit button).
- Grammar Groups
This tool is a means for applying a set of Grammar Queries in a row. The application itself is done in a cascaded grammar style, i.e. the output of each grammar is the input for the next one. The result of the last grammar is the result of the whole tool. The advantage of this tool is better efficiency, which results from the fact that the input for the grammars is prepared only once. Otherwise, the input would have to be prepared (preprocessed) each time a single grammar query is applied. This can be crucial for huge amounts of data.
Here is what the Grammar Groups dialog window looks like:
The tool dialog basically represents a list of Grammar queries. The user can Add Grammar Query to the end of the list and/or Remove Grammar Query from the list by using the buttons on the right side of the panel. The order of the different grammar queries can be changed by selecting a query and dragging it to the desired position.
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
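The cascaded application can be pictured as a simple pipeline in which the input is prepared once and then handed from one query to the next. The sketch below is a toy model, not the tool itself: `apply_grammar_group`, `tag_words` and `group_np` are hypothetical names, and the two "grammars" are plain Python functions over token lists.

```python
def apply_grammar_group(document, queries, prepare, finish):
    """Prepare the input once, then feed each query's output to the next one."""
    data = prepare(document)      # preprocessing happens a single time
    for query in queries:
        data = query(data)
    return finish(data)


def tag_words(tokens):
    """First toy grammar: attach a category to known words."""
    lexicon = {"the": "Det", "dog": "N"}
    return [f"{lexicon[t]}:{t}" if t in lexicon else t for t in tokens]


def group_np(items):
    """Second toy grammar: group a determiner followed by a noun."""
    out, i = [], 0
    while i < len(items):
        if (i + 1 < len(items) and items[i].startswith("Det:")
                and items[i + 1].startswith("N:")):
            out.append(f"NP[{items[i]} {items[i + 1]}]")
            i += 2
        else:
            out.append(items[i])
            i += 1
    return out
```

Calling `apply_grammar_group("the dog barks", [tag_words, group_np], str.split, " ".join)` tokenizes the text once; the second grammar then works on the categories produced by the first.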
- Regular Expression Constraints
The Regular Expression Constraints is a means for setting restrictions on the content of certain nodes in a document. The restrictions are set as regular expression patterns. The syntax of the patterns is the same as the one in the Grammar tool. The nodes whose content the constraint will be applied to are selected by an XPath expression. The selected nodes must only be of type Element (as no other types of nodes can have content). A node satisfies a certain constraint if its content matches the pattern given in the constraint. The application of a constraint gives the user the possibility to navigate either through the nodes which satisfy the constraint or through the nodes which do not satisfy it.
The constraint application is similar to a DTD validation of certain nodes of a document, but here we offer a more powerful instrument. On the one hand, the nodes which the constraints are applied to are selected not only by name (as it is in the DTD), but depending on the context in which they appear (XPath determined context). The context can be an absolute or relative document position, based on properties of the selected nodes or nodes relative to them, properties located in other documents, etc. The node selection qualification uses the full expressive power of the XPath engine, implemented in the system. On the other hand, the regular expressions of the Grammar allow writing patterns not only at the level of text nodes, but at the level of tokens within the text nodes as well. Even more, the user can specify a tokenizer which will be used for segmenting the text and a filter to discard the unmeaningful tokens during application. In the patterns the user can write wildcard symbols in token descriptions in the same way they are written in the Grammar tool.
With the help of these constraints the user can cover some of the features of XML Schema usage, by defining patterns for text nodes.
Regular Expression Constraints Structure
Each Regular Expression Constraint consists of 2 main parts and 3 additional (optional) parts:
- Constraint name (obligatory) - a unique identifier for the constraint in the system;
- Regular expression (obligatory) - a valid regular expression which represents the constraint over the nodes' content;
- Default XPath expression - it is an XPath expression defining the selection of the nodes to be processed by the constraint. This expression appears as a default text in the appropriate specification area;
- Tokenizer - it is used when the constraint tests text nodes' content. If the tokenizer remains unspecified, then the processor takes the tokenizer, which is specified in the DTD of the current document;
- Filter - it is used to filter the tokenizer categories when the text nodes' content is tested. In other words, all filtered tokens are discarded from the selection before passing it to the constraint engine.
- Edit Regular Expression Constraints (REC)
This section is responsible for the regular expression constraint management. Here the REC can be created, modified, removed, saved as a file and loaded from a file. Here is a picture of the dialog window:
The left side of the window is a table with all RECs in the system. The first column contains the names of the constraints. The second one contains the regular expressions for each of the constraints. Having selected a row in this table, the user can apply a manipulation over a constraint by using the buttons on the right.
Description of the buttons on the right:
- New - creates a new Regular Expression Constraint. Having pressed the "New" button, a new constraint editor window appears on the screen (for more details, see below);
- Edit - the currently selected constraint is opened for editing in a new editor window;
- Remove - removes the currently selected constraint in the table. The removal is preceded by a confirmation message;
- OK - updates the current changes in the constraints and closes the manager window;
- Cancel - closes the manager window without saving the changes (if any) in the constraints;
- Save To File - serializes all the RECs into an external file in an XML format. This function can be used for two main purposes: back-ups and interaction with external applications. The description of the output XML file (the DTD) can be found in the file: regConstraint.dtd;
- Load From File - loads the REC(s) from an external file. The external file must be an XML document, valid with respect to the DTD in the file: regConstraint.dtd;
Regular Expression Constraint Editor
Here is the interface view of the editor window for the REC:
The last three fields are optional. The tokenizer and filter lists contain all tokenizers and filters defined in the system. The regular expression may consist of: tags, token categories, token values and token value templates (wildcard descriptions).
- Apply Regular Expression Constraints
The actual applying of the Regular Expression Constraints can be performed in two ways:
- by selecting a node from the tree panel and choosing a constraint;
- by selecting a set of nodes with the help of an XPath expression and then applying a certain constraint on each of them;
Here we describe the latter case. The user chooses 'Apply Regular Expression Constraints' from the menu Tools/Constraints/Regular Expression Constraints/Apply Regular Expression Constraints. Then the following dialog window appears:
The first input field Select nodes contains the XPath which is evaluated in order to select nodes for the constraints operation. If the default XPath expression is specified for the constraint, then it appears in this field as a default text.
The second field selects by name a constraint to be applied.
The last two fields are activated when the current constraint tokenizer and filter are ignored and new ones have to be defined explicitly.
Having pressed the Apply button, the XPath is evaluated and a set of nodes is selected. Then for each of them the constraint is applied. If the node's content satisfies the constraint, then the node is marked as Valid. Otherwise it is marked as Non Valid. In this way two groups of nodes are formed and each of them can be observed separately. Here is a picture of the navigation panel window:
In this window the user can change the group under observation by using the two radio buttons. Pushing Next and Previous buttons the user changes the current selection in the editor. On the top of the window there is some information about the constraint and the nodes which satisfy or do not satisfy it. For the example above, the pattern '$NUMBER+,$SPACE*' concerns the content of text nodes. The items which satisfy the constraint with this pattern are all element nodes whose text content is a sequence of one or more tokens of category NUMBER, followed by zero or more tokens of category SPACE. Thus, strings which match the pattern are: "1234", "256 ", "666 ", etc.
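The matching described above can be sketched in Python. This is only an illustration, not the CLaRK implementation: the toy tokenizer, the one-letter category codes and the function names are all assumptions, and the Grammar tool pattern '$NUMBER+,$SPACE*' is rewritten by hand as the ordinary regular expression N+S* over category codes.

```python
import re

# Hypothetical toy tokenizer (an assumption, not the CLaRK tokenizer):
# digit runs are NUMBER, whitespace runs are SPACE, any other character is OTHER.
TOKEN_RE = re.compile(r"(?P<NUMBER>\d+)|(?P<SPACE>\s+)|(?P<OTHER>.)", re.DOTALL)

# One-letter codes so that a token sequence can be matched as a string.
CODES = {"NUMBER": "N", "SPACE": "S", "OTHER": "O"}


def tokenize(text):
    """Split text into (category, value) pairs."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


def satisfies(text, category_pattern=r"N+S*"):
    """Check the token categories of text against a pattern over category codes.

    The default pattern is a hand translation of '$NUMBER+,$SPACE*'.
    """
    codes = "".join(CODES[category] for category, _ in tokenize(text))
    return re.fullmatch(category_pattern, codes) is not None


def partition(contents):
    """Form the two groups (Valid, Non Valid) from a list of node contents."""
    valid = [c for c in contents if satisfies(c)]
    non_valid = [c for c in contents if not satisfies(c)]
    return valid, non_valid
```

With this sketch, partition(["1234", "256 ", "12 34"]) places the first two strings in the valid group and "12 34" in the non-valid group, because a NUMBER token after a SPACE token does not fit the pattern.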
- Value Constraints
The constraint engine is a means for setting restrictions on the content or other related information of nodes in XML documents, which cannot be expressed by the DTD. The nature of the restrictions is based on the existence of certain values (tokens and/or tags) at certain places. The constraints of this type specify the pieces of information which are restricted and define the set of admissible values for each of them (usually by pointing to a location they are stored in, or by encoding the values themselves explicitly).
Value Constraint Structure
In general a value constraint consists of two parts: a target section and a source section.
Target Section
In this part one can find a description of the nodes which the constraint will be applied to. The target nodes for a constraint are selected by an XPath expression evaluated on the document which the given constraint is to be applied to. The result of the evaluation is required to be a node set with nodes compatible with the specific constraint application. If the result set contains nodes of types other than the required ones, they are automatically excluded (example: the selection contains text and attribute nodes, but the constraint checks the child nodes of its targets). This way of target selection uses the full expressive power of the XPath language, so that context dependencies can be expressed.
Source Section
Here the possible values for the target nodes (selected by the previous section) are defined. The possible values are tag names and tokens depending on the type of the constraint. The source list can be selected by an XPath expression or by typing the choices explicitly as an XML markup. If the selection is made by a relative XPath expression, then the current target node is taken as a context node for the constraint. If a text node is selected as a source, then its text value is tokenized and the tokens are added to the source list, excluding the node itself. It is possible that the source for the constraint is an external document. The only requirements in such cases are the following: the external document has to be in the internal database of the system and the XPath expression cannot be relative.
There are four types of value constraints, currently supported by the system. They are distinguished by their target and the way of their usage. Here is a description of each value constraint separately:
- Parent Constraint
This type of value constraint sets limits on the possible parents of a node. There are two ways of applying this constraint type: by changing the parent of a node (local) or explicitly running the constraint engine (global).
The first possibility is changing the parent of a node (or a set of nodes at one level). The list of all the relevant parent nodes can be restricted further by applying other constraints. The final list contains the intersection between the source of the constraints and its former content. If the operation of changing the parent of a set of nodes is performed, then all compatible parent constraints are applied.
The second possibility is running the Constraint Engine. It works in the following way. First, the targets are selected (by their tag names and an XPath restriction). Then the source is compiled. If there is more than one choice, the user is asked to select one option from a list. If the choice happens to be exactly one element, it can be automatically inserted as a parent of the target. The action of a constraint depends on the Application Mode set for the constraint.
The source list of each constraint must contain only tag names. All tokens in the list are ignored.
- All Children Constraint
This type of value constraint sets limits on the names of a node's children and the content of its text children. All children that are tags must have names coinciding with the name of some node from the source list. Then all the data in text children is tokenized and a list A of tokens is formed. After that all the data in text nodes in the source list is tokenized and a list B of tokens is formed. For every token in A there must exist a token in B such that the values (not categories) of the two tokens are equal. This type of value constraint can be applied (checked) from the menu item Apply All Children or from the toolbar button. The list of all nodes that are invalid according to the constraints is given in the Error message area together with the rest of the errors (if any). The user can navigate through all nodes that are invalid with respect to the constraints.
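The All Children check can be sketched as follows. This is a minimal illustration under stated assumptions: tokens are taken to be whitespace-separated strings (the real system uses a configurable tokenizer and filter), the source list is passed in as a set of tag names plus a list of text values, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def tokenize(text):
    """Assumption: tokens are whitespace-separated strings."""
    return (text or "").split()


def all_children_ok(target, source_tags, source_texts):
    """Return True if the target element satisfies an All Children constraint.

    source_tags  - names of the element nodes in the source list
    source_texts - text values of the text nodes in the source list
    """
    # List B: all tokens found in the text nodes of the source list.
    list_b = {tok for text in source_texts for tok in tokenize(text)}

    # Every child that is a tag must carry a name from the source list.
    if any(child.tag not in source_tags for child in target):
        return False

    # List A: all tokens in the text children of the target
    # (the text before the first child plus the text after each child).
    list_a = tokenize(target.text)
    for child in target:
        list_a.extend(tokenize(child.tail))

    # Every token value in A must also occur in B.
    return all(tok in list_b for tok in list_a)
```

For example, for the element parsed from `<s>the <w/> cat</s>`, the check succeeds with source tags {"w"} and source text "the cat dog", and fails if "cat" is missing from the source text.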
- Some Children Constraint
This is a special type of a value constraint, because its main task is not only to set limits on the node's content. Instead, it can be used for a value restriction when the operation inserting a child in a node is performed. This constraint type is not applied each time a new node is inserted. These constraints are used separately. Here the target node is the node where the insertion takes place. The constraint is blocked when:
- there is a child of the target node that is a tag and there is a node in the source list, such that both nodes have identical names.
- there is a text node in the target node that has a token, whose value equals the value of a token in the source list.
To sum up, when there is a non-empty intersection between the source list and the target node's content, the constraint is satisfied and there is nothing more to be done. In cases when the source list is empty and the target content is also empty, then the constraint is satisfied.
When the source list is not empty and there is no intersection with the target's content, the user is offered a list with the possible values from the source list for the target node. The user can choose one item to insert. The action of a constraint depends on the Application Mode set for the constraint.
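The decision logic of the Some Children constraint can be summarized in a short sketch. The function name and the flat list representation of tags and tokens are assumptions made for illustration. The case of an empty source list with a non-empty target is not covered by the description above; the sketch reports it as unsatisfied with nothing to offer.

```python
def check_some_children(child_tags, content_tokens, source_tags, source_tokens):
    """Return (satisfied, choices) for one target node.

    child_tags     - names of the element children of the target
    content_tokens - token values found in the text children of the target
    source_tags    - tag names in the source list
    source_tokens  - token values in the source list
    """
    source = list(source_tags) + list(source_tokens)
    target_empty = not child_tags and not content_tokens

    # Empty source and empty target content: the constraint is satisfied.
    if not source and target_empty:
        return True, []

    # A non-empty intersection between source and target content
    # satisfies the constraint; nothing more has to be done.
    common_tags = set(child_tags) & set(source_tags)
    common_tokens = set(content_tokens) & set(source_tokens)
    if common_tags or common_tokens:
        return True, []

    # No intersection: the source values are offered to the user as choices.
    return False, source
```

A caller would present the returned choices to the user, or insert the single choice automatically, depending on the Application Mode.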
- Some Attributes Constraint
This constraint is very similar to the previous one. The only difference is that the target here is an attribute of an Element node. Also the target selection includes a selection of an attribute defined in the DTD for the selected tag name. The action of a constraint depends on the Application Mode set for the constraint.
Application Mode
The Value Constraints have two modes of application, concerning the treatment of the target nodes:
- Validation Mode - the constraint points to the target nodes which do not satisfy it, showing all the possibilities for the specific places. On demand the user can insert a value from the list of possibilities.
- Insertion Mode - The constraint points to the target nodes which do not satisfy it and expects the user to select one of the possible values to be inserted. If the list of possibilities for a certain place contains only one entry, it is automatically inserted. Then, if the constraint is of type Some Children, the user can specify the way of the new value insertion. If the new value is a token, the user can specify the position in the content where it must be inserted. The first position is denoted by 1 (not 0). If a position is not specified the new value is inserted as the last element. The elements in the content which are counted are either not filtered tokens or Element nodes. If the new value for insertion is an Element node the counting of the content entries is done in terms of DOM structures (Text nodes, Element nodes).
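The position handling in Insertion Mode can be illustrated with a small sketch. It assumes the content has already been reduced to a flat list of counted entries (non-filtered tokens or Element nodes); the function names are hypothetical.

```python
def pick_value(choices):
    """Return the value to insert automatically, or None if the user must choose."""
    return choices[0] if len(choices) == 1 else None


def insert_value(entries, value, position=None):
    """Insert value into a copy of entries.

    position is 1-based (1 is the first position); None means 'last position'.
    """
    result = list(entries)
    if position is None:
        result.append(value)
    else:
        if position < 1:
            raise ValueError("position must be a positive integer")
        # Convert the 1-based position to a 0-based list index.
        result.insert(position - 1, value)
    return result
```

For instance, insert_value(["a", "b"], "x", 1) yields ["x", "a", "b"], while omitting the position appends "x" at the end.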
- Edit Value Constraints
The screen shot on Fig. 1 is the dialog window of the value constraints editor.
Fig. 1
The editor is separated into 5 sections which are responsible for different parts of the constraint definition. On the top of the window there is a Summary information panel which shows the current constraint settings (Type, Mode, Target, etc.). The sections are:
- General (Fig. 1) - here the user supplies a unique Constraint name for the constraint (free text), which is obligatory. The constraint is identified by this name later in applications. Optionally, some additional Constraint descriptions can be written in the second text box. In this section one of the most important aspects of the constraint is defined - the Type of constraint. This determines the whole behaviour of the constraint. The options are: Parent, All Children, Some Children and Some Attributes (described above).
- Options (Fig. 2) - this section offers several options related to the application of the constraints. The options here do not concern constraints of type All Children. The first part of the section defines the Application Mode in which the constraint will be applied. For Insertion Mode the user can set an insertion position and a token Separator (when needed) (Note: Position and Separator options concern only Some Children constraints). The position must be a positive integer, where 1 denotes the first position. Leaving this field empty means 'last position'. The separator can be an arbitrary string. For details see Application Mode.
The remaining options in this part are as follows:
- Show status before - indicates the number of the target nodes the constraint will be applied to, i.e. nodes count before the real application;
- Show status after - indicates the number of the target nodes the constraint has already been applied to, i.e. after the real application;
- Disable Intersection Checking !!! - disables/enables the checking whether the constrained target nodes data has common parts with the source data of the constraint. Disabling this checking allows the user to insert more than one possible value in Insertion Mode during the application;
- Restrict to a single choice run - this option is relevant when a constraint is applied on the current document in an insertion mode. When selected, it restricts the execution to the cases when a single value is determined by the source evaluation, i.e. the constraint works only for the cases where no user decision is needed in application. If more than one entry is selected as a source, the corresponding target is skipped. The tool behaves as if it works in Multiple Apply mode, but on the current document.
- Prompt for save on each: ... applications of constraint - this item is used for making backups of the current state of the document while applying the constraints. In order to use this option, the check box must be marked and in the text field a number must be entered. It indicates the number of the successful applications, after which the system prompts the user to save the document.
Fig. 2
- Target (Fig. 3) - here the definition of the target nodes for the constraint is given. In field Target XPath the user is expected to supply an XPath expression which will determine the target nodes for the constraint. The XPath expression must return a node-set in which the nodes must be of a proper type (depending on the constraint type). In this XPath expression the user may (if needed) define context restrictions for the targets. If the current constraint is of type Some Attributes the user must supply a valid Target Attribute name. This field is disabled for all other types of constraints.
Fig. 3
- Source (Fig. 4) - this section defines the source list for the constraint. The text field content is either an XPath expression, or an XML markup. It depends on the radio button, which has been currently selected for the Source Type. If the source type is XML Mark-up, then the content of the text field is XML. Otherwise it must be an XPath expression. If the selected type is Local Document, then the XPath expression is evaluated for each target node as a context. If the type is External Document, then the choice box gets enabled and the user is expected to choose a document. The XPath expression is evaluated on this document and the root node is the context. In the latter case it is expected for the XPath expression to be absolute.
Fig. 4
- Advanced (Fig. 5) - here a tokenizer can be activated (Set a Tokenizer) for the constraint, or it can be blocked so that text nodes are treated as a whole and not as a set of tokens. Also a filter (Use Filter) can be set in order to exclude some "garbage" categories, such as separators, from the source list. Another restriction can be set here by defining token value and category Templates. The templates are defined in the same way as those in the grammar tools (using @ and # symbols for wildcards). Another facility available here is the Help Document. With this option, while listing the different choices, the user can get brief information about the meaning of each choice. This information must be stored in an internal document. Its structure is described in a DTD in the file: resources/dtds/helpFile.dtd. The information about a given choice appears in the status bar of the editor when the mouse pointer is over the choice.
Fig. 5
- Value Constraints Manager
In the preceding section a description of the Constraint Editor was presented. It is invoked whenever a change to a Value Constraint is needed or a new constraint is defined. The Value Constraint management is handled by the following manager dialog window:
Within the CLaRK System this module can be invoked from the menu: Tools/Constraints/Edit Value Constraints.
The Value Constraint Manager is an Entry Manager with additional buttons. It contains all of the available value constraints arranged in a tree hierarchy, some of their features (Description), and buttons for management of the constraints.
There is an additional context XPath text field located below the table, which determines the context for each constraint group and is used for applying Constraint groups. First, a context node is selected and then all the constraints from the group are applied within this context. This XPath value can be changed by pressing the Edit button and entering the new value.
The manager window consists of two main parts:
- The panel on the left. It contains the tree representations of the group hierarchy. When the user selects a node in this tree, the content of the corresponding group is loaded in the component on the right side.
- Current group monitor. This is the panel situated on the right side of the window. It is a list with the content of the currently selected group in the tree. The list components which are in blue color are sub-groups. The other ones are the constraints included in this group. They are colored in black or red, depending on whether they are valid or not, according to their DTDs. The user can sort all the constraints in a group by clicking on the Name column of the table header. It is possible to rearrange the constraints in a group by drag-and-drop, i.e. pressing a constraint and moving it upwards or downwards until the desired position is reached.
There are six additional buttons which can be used for modifying the content of the current group:
- New Group - creates a new sub-group of the current one. The user is asked to give a name for it;
- Remove - removes the selected constraints and/or groups from the list. The removal is preceded by a confirmation message. If the selection includes sub-groups, they are also removed with their entire contents.
- Rename - gives a new name to the selected constraint from the list.
- Copy - saves the data of the selected constraint from the list under a different name given by the user.
- Add Constraints - gives a list of all constraints which are not present in the current group. The user is expected to choose one or more constraints to be included in the current group.
- Delete! - this function can be used for removing constraints from the internal constraint database. It can be applied only to single constraints, not to whole groups. Groups are excluded from any selections. The removal of constraints is preceded by a confirmation message. The constraints to be removed are excluded from all the groups they may belong to.
Navigation in the group structure can be made also in the panel on the right. When the user wants to see the content of a certain sub-group of the current group, s/he just has to perform a double click on the desired sub-group. This will change the current group to the new one. This represents the movement from a group to a sub-group. The movement in the other direction is also possible. For each constraint group (excluding the Value Constraints), a special sub-group is included, named: ". .". By performing a double click on it, similarly to most file systems, the current group is changed to its parent one.
The Value Constraints Manager also provides a list of all the constraints, no matter which group they are included in. The following information appears for each constraint: date - when it has been last modified; if it is a query, which tool it refers to; and which is its DTD. The user can sort all the constraints by clicking on the Name column of the table header. When selecting constraints, the right button of the mouse is used to visualize the Pop-up menu with the following operations on selected constraints:
- Info - This item shows the following information about the selected constraints:
- constraint name
- constraint size
- constraint's dtd name
- whether the constraint is valid
- group of the constraint
- Add In Group... - This item shows a dialog with the hierarchical structure for the groups in the system and the user can choose a group in which to place all the selected constraints.
- Delete! - It is described above.
- Rename - It is described above.
- Copy - It is described above.
Under the table with the available constraints there are 5 buttons (New, Edit, Apply Constraints, Cancel, Done) and 1 menu (Load / Save Constraints). They can be used to manage the constraints.
Buttons:
- New - creates a new Value constraint by calling the Constraint Editor;
- Edit - edits the selected Value constraint by calling the Constraint Editor;
- Apply Constraints - first saves the changes on the constraints (if any). Then switches from the Value Constraints Manager dialog to the Apply Constraints dialog and allows the user to apply some of the value constraints and constraint groups;
- Cancel - closes the dialog window without saving the changes on the constraints (if any);
- Done - closes the dialog window by saving the changes on the constraints (if any).
Menu Load / Save Constraints:
This menu gives the user a possibility to store and load value constraints in XML format into/from a file. It has the following items:
- Load constraints from file - the user is asked to choose a file which contains the value constraints in XML format. Then the system reads the file, interprets it as CLaRK value constraints and stores them in the value constraint database of the system in a table format. Such constraints can be modified within the system.
- Save constraints to file - the user can save value constraints into a file. The file is an XML document and it can be modified outside the system.
- Save constraints with groups to file - the user can save value constraints into a file together with the group structure for the selected constraints. The file is an XML document and it can be modified outside the system.
- Apply Value Constraints
This is a tool specialized in applying Value Constraints on the current document or on a set of documents. The user is expected to create a list of single Value Constraints or whole Value Constraint Groups. The constraints and groups are applied in the order they appear in the list. They can be reordered by drag-and-drop, i.e. pressing a list entry and moving it upwards or downwards until the desired position is reached. Each entry in the list contains either a constraint name followed by the constraint description in brackets (for a constraint) or a group name followed by the '(group)' suffix (for a constraint group).
The user can modify the content of the Constraints List by pressing the following buttons:
- Add Constraint - appends one or more constraints to the end of the list. The user is shown a list of all Value Constraints in the system, including the ones which are already in the list. In this way one constraint can be included more than one time (if it is needed for certain processing);
- Add Group - appends one or more constraint groups to the end of the list. The user has to choose from a tree of all constraint groups in the system using Value Constraints Groups Hierarchy dialog;
- Remove - excludes the selected entries from the list (constraints or groups of constraints). The exclusion is NOT preceded by a warning message.
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. If the tool is run in Multiple Apply mode there is one significant difference in the application: if a constraint uses Insertion Mode, a real insertion is performed only if there is exactly one possible source value. If there are more choices, human intervention is needed, which Multiple Apply mode does not allow.
The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
- Number Constraints
This constraint type in general restricts numeric values related to nodes and their properties within XML documents. Such values can be: a number of occurrences of some specific elements within the content, or values returned by XPath functions or operators. The target constrained values are produced from the evaluation of an XPath expression. This XPath is evaluated relative to the nodes selected by the Context XPath expression, which determines the nodes which the constraint will be applied to. Independently, for each initially selected context node, one XPath result is produced. Depending on the result, a numeric value is formed as follows:
- node-set - the number of the nodes;
- string - if the string represents a valid number, the new value is this value. Otherwise the Not-A-Number identifier is produced;
- number - the number itself;
- boolean - if it is a true value - 1, otherwise 0.
Note: if the newly produced value is the Not-A-Number value then the corresponding context node does not satisfy the constraint.
A context node satisfies a constraint if the result numeric value of the XPath is within the range of Minimum Size and Maximum Size values. The latter two values can be either numbers or XPath expressions which are expected to return numeric values. If an XPath returns a non-numeric value, the system tries to transform it automatically to a number. In case the Minimum Size value for a constraint is not defined or the defined value produces Not-A-Number value, the system assumes that at this place there is no limitation and all target values which are under the corresponding maximum satisfy the constraint. By analogy, if a Maximum Size value is omitted or it is Not-A-Number, then no upper limitation is assumed. If an XPath expression is used for setting minimal or maximal limit, its context for evaluation is each initially selected context node. In this way the boundaries of one constraint can vary for different contexts.
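The conversion and range rules above can be sketched in Python. This is an illustration only: the XPath result is modelled as a Python list (node-set), str, number or bool, the string parsing uses Python's float() rather than the XPath number rules, the range is assumed to be inclusive, and the function names are hypothetical.

```python
import math


def to_number(result):
    """Turn an XPath-like result into a numeric value, or NaN if impossible."""
    if isinstance(result, bool):          # boolean: true is 1, false is 0
        return 1.0 if result else 0.0
    if isinstance(result, (int, float)):  # number: the number itself
        return float(result)
    if isinstance(result, str):           # string: its numeric value, else NaN
        try:
            return float(result.strip())
        except ValueError:
            return math.nan
    return float(len(result))             # node-set: the number of nodes


def satisfies(result, minimum=None, maximum=None):
    """Check one context node's result against optional Minimum/Maximum Size."""
    value = to_number(result)
    if math.isnan(value):
        return False  # a Not-A-Number target value never satisfies
    low = math.nan if minimum is None else to_number(minimum)
    high = math.nan if maximum is None else to_number(maximum)
    # A missing or Not-A-Number bound means no limitation on that side.
    if not math.isnan(low) and value < low:
        return False
    if not math.isnan(high) and value > high:
        return False
    return True
```

In the real tool the bounds may themselves be XPath expressions evaluated for each context node; here they are assumed to be already evaluated before being passed in.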
The Number Constraint Manager dialog:
In the example above, the fourth constraint has no upper limit. The fourth column (Use It?) is responsible for the activation/deactivation of the constraints. It is needed when the user would like to apply only a certain subset of all the constraints. Applying the (active) constraints can be done by pressing the Apply button. This button is disabled when there is no document in the editor. After applying the constraints, the user receives information about each applied constraint and the number of the satisfied nodes (contexts) as well as the non-satisfied ones. In the picture below there is an example result dialog:
Here the user can navigate through all nodes that satisfy, or do not satisfy, a certain constraint. This can be done by selecting a row in the result info table and using the Details button. The user is shown a small navigation dialog which allows successive traversal of the nodes in forward or reverse direction. Here follows a picture of the navigation dialog from the preceding example picture:
The dialog contains several sections:
- Nodes - in this section the navigation can be swapped between navigating satisfying (valid) and non-satisfying (non-valid) nodes. For each class of nodes the corresponding nodes count is given in brackets.
- Counts - in this panel the user receives dynamic information about the currently selected node (valid or non-valid). First the permitted Range is shown, followed by the Real Count for the given location.
- Info - here is the information about the current constraint: the XPath expression used for selection of the Context nodes and the target XPath for calculating the constrained data.
- Navi panel - here the user performs the movement from a context node to the Next or the Previous one. If during navigation the user has to modify the target nodes, s/he can use the Search & Edit button. When pressed, it closes the navigation window. The system will memorize all the nodes from the current class (valid or non-valid) and will give the possibility of resuming the navigation (only in this class) by using the Next and Previous buttons of the XPath Search on the toolbar of the main editor. In this way, if a modification is needed for all nodes from one class (usually when correcting specific errors), the user does not have to apply the same constraint many times to find all representatives of the class. During this navigation, to indicate that the target nodes are not the result of an XPath search query, the system shows a service message in the Search field on the toolbar. For example:
The node members of the class memorized in this way are lost after applying an XPath Search query or after applying another Number Constraint in this way.
- Apply XSLT
This function applies an XSL Transformation either on the current document in the system editor or on a set of selected internal documents. In the case a transformation is applied on the current document, the result XML document is loaded automatically in the system. If the transformation which has been applied does not produce any result, a warning message 'No result produced!' appears. Otherwise the user is asked to supply a DTD for the result document. In case a transformation is applied on a set of internal documents the user has to specify a DTD and result document names in advance.
Another, non-traditional application of XSL Transformations is the so-called Local Transformations. They are applied in the following way: a set of nodes is extracted from a document (current or internal). For each of them, independently, an XSLT is performed and the result (if any) is incorporated back at the original location from which the extract was taken. Thus, no new result document is created but the original document is modified. The extracted nodes to be transformed are selected either by an XPath expression or by direct selection in the tree. The result of each transformation application is a Document Fragment (DOM) which substitutes the context node for which it is produced. The context node is removed from the tree and all sub-elements of the fragment are inserted at the position of the context. The application of the transformation is followed by a result information message. It contains four pieces of information:
- The number of nodes selected by the XPath expression as contexts;
- The number of context nodes which have been replaced by a result fragment;
- The number of context nodes for which the transformation did not produce any result;
- The number of context nodes which have been lost during applying a preceding transformation. This can happen when a node and its descendant node are selected as contexts and the transformation has succeeded on the parent node.
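The replacement step and the four counters above can be illustrated with a minimal Python sketch. It is not the CLaRK implementation: the function `apply_local`, the callback `demo_transform` (a stand-in for an XSLT, since the Python standard library has no XSLT engine) and the sample document are all hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

def apply_local(root, contexts, transform):
    """Replace each context element by the elements returned by transform().

    transform is a hypothetical stand-in for an XSLT: it returns a list of
    new elements, or an empty list when it produces no result.
    """
    replaced = no_result = lost = 0
    for ctx in contexts:
        # Rebuild the child -> parent map, because earlier replacements
        # may have changed the tree.
        parents = {child: parent for parent in root.iter() for child in parent}
        if ctx not in parents:
            lost += 1  # no longer in the tree (e.g. an ancestor was replaced)
            continue
        fragment = transform(ctx)
        if not fragment:
            no_result += 1
            continue
        parent = parents[ctx]
        index = list(parent).index(ctx)
        fragment[-1].tail = ctx.tail  # keep the text that followed the context
        parent.remove(ctx)
        for offset, node in enumerate(fragment):
            parent.insert(index + offset, node)
        replaced += 1
    return {"selected": len(contexts), "replaced": replaced,
            "no_result": no_result, "lost": lost}

def demo_transform(ctx):
    """Hypothetical transformation: wrap the text of the context in a new tag."""
    text = "".join(ctx.itertext()).strip()
    if not text:
        return []
    new = ET.Element("name" if ctx.tag == "np" else "tok")
    new.text = text
    return [new]

root = ET.fromstring("<s><np><w>John</w></np><w>loves</w><w/></s>")
contexts = [e for e in root.iter() if e.tag in ("np", "w")]
print(apply_local(root, contexts, demo_transform))
```

In this sample the first `w` element is counted as lost, because its parent `np` was selected as well and was replaced first.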
If no transformations are available in the system, a warning message appears. The user can apply an XSL Transformation by means of the Multiple Apply module, or save the current settings for further use from the Queries module.
For details about the management of the transformations see module XSLT Manager.
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
- XSLT Manager
This component is responsible for the management of all XSL Transformations in the CLaRK System. Although the documents containing XSLT are regular well-formed XML documents, they are treated as a separate class of read-only documents. They are stored in a special separate list and only tools using XSLT can access it. The acceptable operations on the list are adding, removing and overwriting a list entry (XSL Transformation document).
Here is the dialog window of the manager:
Buttons:
- Add New - Opens an Internal Documents Manager window, allowing the user to select one or more documents containing XSL Transformations for the list of all transformations. Each document is tested for validity using the XSLT Validator module. If a document is not a valid XSLT, it is not included in the list and a message describing the error appears. Once included in the list, this XSL Transformation can be used in all system tools which deal with XSLT.
- Add Current - Adds the current document of the system editor to the list of all available XSL Transformations. The document is cloned and thus further editing will not affect the transformation. If the transformation needs modification, it has to be extracted from the manager by using the button Open In Editor and later added again.
- Remove - Removes the selected transformation(s) from the list of all XSL Transformations. The transformation data is lost. The removal is preceded by a confirm message. Multiple selection is allowed here.
- Open In Editor - Loads the selected transformation(s) from the list into the system editor as XML documents.
- Apply - Applies the selected XSL Transformation to the current document in the editor. Here only a single selection is allowed. If there is no current document this button is disabled.
- Close - Closes the XSLT Manager dialog window. All changes on the list of transformations (addition, removal) are updated.
- Validate XSLT
This option checks whether the current document in the editor can be used as an XSL Transformation. It can be used when a new transformation document is created (or imported) in the editor. The XSLT Validator checks the content and, if there are no errors, an information message is shown. Otherwise, an error message appears and the location of the error in the document is pointed out.
- Concordance
Concordance - a system tool for information extraction. It allows an extraction of certain units (words, phrases, etc.) within bigger units (sentences, paragraphs, etc.). The result extraction is shown in a table where on each row a result is shown. The searched items, the left and the right contexts are distinguished in separate columns. The tool is implemented on the basis of the XPath engine, regular grammar engine and a sorting module.
The field at the top of the dialog (Define Context) is used for defining the context nodes within which the extraction will be done. The user is expected to supply an XPath expression which after evaluation returns a node set. The context for evaluation is the root node of the document. The user can perform two types of concordance extraction: grammar based and XPath based. They are described in detail in the following paragraphs.
The result from the concordance is stored in an XML document and for convenience it is shown in a table. The structure of the XML document is a sequence of <L> elements standing for the found items (lines). Each concordance line has the following XML structure:
<L>
<LC> the left context </LC>
<I> the data we are searching for </I>
<RC> the right context </RC>
<!-- user commentary -->
</L>
and the corresponding table representation:
When the user sees a result from the concordance in a table, s/he must always keep in mind that there is an XML structure behind the dialog window, especially when the table rows have to be sorted.
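To make the structure of a concordance line more concrete, here is a small Python sketch which builds `<L>` elements with the `LC`, `I` and `RC` children described above. The function name `concordance_lines` and the whitespace-separated sample sentence are assumptions made only for illustration; the real tool works with tokenizers and grammars.

```python
import xml.etree.ElementTree as ET

def concordance_lines(tokens, target):
    """Return one <L> element for every occurrence of target in tokens."""
    lines = []
    for i, token in enumerate(tokens):
        if token != target:
            continue
        line = ET.Element("L")
        ET.SubElement(line, "LC").text = " ".join(tokens[:i])      # left context
        ET.SubElement(line, "I").text = token                      # found item
        ET.SubElement(line, "RC").text = " ".join(tokens[i + 1:])  # right context
        lines.append(line)
    return lines

sentence = "John loves Mary and Mary loves John".split()
for line in concordance_lines(sentence, "loves"):
    print(ET.tostring(line, encoding="unicode"))
```

Each printed element corresponds to one row of the table, with the three child elements shown in separate columns.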
Grammar Concordance Search
This type of concordance uses regular expression patterns for searching items in other items. The patterns are defined as grammars in the Grammar tool. The items which match the patterns (tokens, XML tags or mixed) are shown in the result table (document), accompanied by the context in which they appear. Initially the context is determined by the XPath expression mentioned above. Further restrictions on the context can be set by another grammar pattern, i.e. the target searched items will be extracted only from items matched by another grammar pattern as contexts, discarding everything else.
If the Text only (The mark-up will be ignored)? option is selected, the concordance engine will ignore the mark-up inside the initially selected contexts while checking. Here follows an example of how the mark-up can be ignored during the data extraction. Let us have the following simple XML document as a source of extraction:
If the target item of extraction is the word loves within the context of the TEXT element, the Grammar pattern must only describe the word itself without specifying that it appears in any mark-up (in this case in the content of the tag verb). Here is the result from the query 'loves' within the context of 'TEXT':
If the target of interest is the sequence John loves Mary within the context of the TEXT element, the query pattern will be: "John","loves","Mary". Although these three words appear in different XML structures, after filtering the mark-up the result will be:
Here is how the Concordance dialog window appears in a configuration set for a Grammar concordance search:
The Concordance dialog offers three sub-dialogs for setting a grammar search query corresponding to three levels of complexity, each supplying different sets of options. Each of the three dialogs is accessible by choosing the corresponding item from the Usage Mode panel. The possible items are:
- Simplified
This mode of usage offers a very basic set of options, which is convenient for relatively simple search queries. The user is expected to supply a Query String which must be a regular expression (the same syntax as in the Grammar tool). For text preprocessing the user can specify a tokenizer, a filter and a normalization.
- Normal
In this mode the user can use a previously defined grammar from the Grammar tool. Similarly to the grammar application the user here can define Element Values for performing a more flexible search. For text preprocessing the user can specify a tokenizer, a filter and normalization.
- Queries
Here, the user does not specify anything directly related to the search process. The thing which is needed in advance is (at least) one grammar query (see the Apply Grammar tool description) which in turn requires a compiled grammar. In this dialog the user just points to the Search Query to be applied. Additionally a restriction on the context can be set by selection of a Restriction Query which determines the context for the Search Query. In this case the context is formed by the output of the restriction query and, if there are initially selected nodes by the XPath expression, for which the restriction grammar does not match anything, they are excluded from further processing.
XPath Concordance Search
Searching items in this mode of concordance extraction is based on XPath queries within initially selected context nodes. The content which will be shown in the Item column will be a result from the evaluation of the XPath expression from the field Search Elements. If for a context the returned result contains more than one node, each node will appear in a separate row in the table. For each single result node the XPath expressions from Left Context and Right Context are evaluated to form the content of the corresponding table cells for the contexts. If any of these two fields does not contain an XPath expression, in the corresponding table cell the whole content before/after the found item will be shown.
Here is the Concordance dialog window in a configuration set for XPath concordance search:
Concordance Options
The options which appear in both modes of extraction are:
- Text only (The mark-up will be ignored)?
This option works only in mode Grammar Concordance Search. As described before, this option filters the XML tags from the input data for the concordance search engine and leaves plain text.
- Add number attribute?
This option enables adding an attribute number to each result tag <L>, with a value enumerating the result item.
- Add source attribute?
This option enables adding an attribute source to each result tag <L>, with a value showing the source document which the extraction was taken from.
- Add path attribute?
This option enables adding an attribute path to each result tag <L>, with a value which is an (abbreviated) XPath expression showing the absolute address of the corresponding located item in the original source document.
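The effect of the Text only option together with the number attribute can be sketched in Python as follows. This is only an approximation under stated assumptions: the sample document in `SOURCE` is hypothetical, the function `text_only_search` is not part of CLaRK, tokens are simply split on whitespace, and numbering the results from 1 is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document: every word sits in its own mark-up.
SOURCE = "<TEXT><noun>John</noun> <verb>loves</verb> <noun>Mary</noun></TEXT>"

def text_only_search(context, pattern, add_number=True):
    """Search a token sequence in the plain text of context, ignoring tags."""
    tokens = "".join(context.itertext()).split()  # the mark-up is filtered out
    width = len(pattern)
    results = []
    for start in range(len(tokens) - width + 1):
        if tokens[start:start + width] != pattern:
            continue
        line = ET.Element("L")
        if add_number:
            line.set("number", str(len(results) + 1))  # assumed to start at 1
        ET.SubElement(line, "LC").text = " ".join(tokens[:start])
        ET.SubElement(line, "I").text = " ".join(pattern)
        ET.SubElement(line, "RC").text = " ".join(tokens[start + width:])
        results.append(line)
    return results

found = text_only_search(ET.fromstring(SOURCE), ["John", "loves", "Mary"])
print(len(found), found[0].find("I").text)
```

Although the three words belong to three different elements of the sample, the sequence is found because only the text content is compared.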
- Table View
The Table View tool is created to represent the information extracted from the concordance tool in a more readable table form. Each line of the table represents one line of the concordance result.
If the user wants to use this feature, s/he has to open an XML document which is produced as a result from the Concordance tool. When the Table View menu item is selected, the system tries to detect the required structure in the document currently opened in the system editor. In case of failure, the system produces an appropriate error message. Otherwise, the document content is shown in a table (the picture below). The required document structure of the input for the Table View is described in the Concordance tool.
The data in the "Context" columns does not represent the whole context but only the amount of data that can fit in the column length. Initially it is only 30 symbols. To increase the context, the user should press the settings button and set the context length in symbols there. The user can also set the width of the comment column. If the user wants to see the context without expanding the column data, s/he can do so by right-clicking the "Left Context" or "Right Context" column. If the user wants to add some commentary to a concordance line, s/he can do so by filling in a value in the "Comment" column or by right-clicking a row in the "Item" column. To navigate faster through the table, the user can rely on the combo box at the top for accessing a row.
The user can also sort the lines of the table. To do so, s/he must select which column each sort key applies to (i.e. which element of the concordance line will be the context: LC, I, RC or the comments). If no column is selected, then the key will be executed with the line element as context.
A useful option of the Table View is the "Edit Layout". The user can filter the tags that are shown in the table. For example, if the POS information is separated in a tag, the user can hide it in order to view only the text.
- Extract
The extract tool task is to extract nodes from a document or from multiple documents and to save them as a new document. The document data extraction is based on XPath expressions. The text field at the top of the dialog is used for defining an XPath expression which selects the elements in the document(s). The context node for this evaluation is the root node of the document(s). The result from the extraction is an XML document in which all extracted nodes are children of the root element (This element is named "Extract" by the system).
The Include subtree option allows the extraction not only of the selected nodes but the entire subtrees below them as well.
The Create result tag option allows each extracted node to have a parent element which is used to separate the different results. For example, if we extract only text nodes, then in the new document all the text nodes will be concatenated. If the Create result tag option is selected, then a parent element will be added for each result node. The name of the parent element is taken from the corresponding text field.
If the Create source attribute option is selected, then the extract tool adds an attribute with the source document name to either the auxiliary element or the root element of each result. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.
If the Create path attribute option is selected, then the extract tool adds to each result structure an attribute with a value which is an XPath expression expressing the location of the result in the original source document. In other words, if this XPath expression is evaluated on the source document, the result will be exactly the extracted result node. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.
If the Create number attribute option is selected, then the extract tool adds an attribute with the extract result number to the auxiliary element or to the root element of each result. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.
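The interplay of these options can be sketched in Python. The sketch is not the tool itself: the function `extract` and its parameter names are hypothetical, only element results are handled, the path attribute is left out, and `findall` supports just a limited subset of XPath.

```python
import copy
import xml.etree.ElementTree as ET

def extract(documents, path, result_tag=None, source_attr=None, number_attr=None):
    """Collect the nodes selected by path from several documents.

    documents maps a document name to its root element. The selected
    elements are copied with their subtrees (as with Include subtree).
    """
    result = ET.Element("Extract")  # root element name used by the system
    count = 0
    for name, root in documents.items():
        for node in root.findall(path):
            count += 1
            item = copy.deepcopy(node)
            item.tail = None
            if result_tag:
                # Create result tag: wrap every result in an auxiliary element.
                wrapper = ET.SubElement(result, result_tag)
                wrapper.append(item)
                item = wrapper
            else:
                result.append(item)
            # The attributes go to the auxiliary element if there is one,
            # otherwise to the root element of the result.
            if source_attr:
                item.set(source_attr, name)
            if number_attr:
                item.set(number_attr, str(count))
    return result

docs = {
    "a.xml": ET.fromstring("<lib><book><title>A</title></book></lib>"),
    "b.xml": ET.fromstring("<lib><book><title>B</title></book></lib>"),
}
out = extract(docs, ".//title", result_tag="res", source_attr="src", number_attr="n")
print(ET.tostring(out, encoding="unicode"))
```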
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
- Sort
The sort tool is used for reordering nodes in the tree representation of a document. The sort operation changes the order only for nodes which are siblings, i.e. nodes with the same parent. If sort is applied on a set of nodes with different parent nodes, the nodes will be sorted only in the scope of their parents. The nodes selection for sorting and the sort criteria (Sort Keys) are written in XPath expressions.
For sorting the user has to specify the following two things:
- The target nodes for sorting.
- The keys for each node.
The first is done by defining an XPath expression in the Select Elements field. If the field is empty, the sort tool will show an error message. For context node the XPath engine assumes the node selected in the tree panel of the system or the root elements of the internal documents. The sort tool compares only element nodes which have a common parent. The sort tool splits the result returned from the XPath evaluation into groups according to the parent node. Each group is sorted separately.
Keys are calculated for every node the user wants to sort. Each row in the table represents one key. The sort tool compares two nodes key by key. The key is the list of nodes returned from the XPath engine after evaluating the expression defined in the column Key of the table. The context node in this evaluation coincides with the node for which the user wants to create the key. The other columns of the table represent settings used in the list comparison. The lists are compared node by node as follows.
- If the nodes are both elements then the sort tool asks the DTD which one is defined to be smaller (Element Features/Sort Values).
- If the nodes are both text nodes they are compared by their textual content.
- The attribute nodes are compared by the textual content of their values only if they have the same name and their parents are elements with the same name.
- The textual content (text) of text and attribute nodes is compared in the following way:
- The text is compared symbol by symbol.
- If the user chooses a tokenizer, then the symbols are compared with respect to the tokens created by the primitive tokenizer of the selected tokenizer (a tokenizer that is an ancestor of the selected tokenizer and is primitive; if the selected tokenizer is itself primitive, it is used for tokenization). The symbols are compared with respect to their token category (the order of the categories in primitive tokenizers) and by their position in the definition of the token category value. If the normalization option is selected, the sort engine will use the normalization table of the primitive tokenizer to define each symbol's token category and value.
- If the user selects [No Tokenizer], the sort tool will use the Unicode table to compare symbols. In this case the normalization option means converting capital letters into small letters for Cyrillic and Latin.
- If the user selects the Reverse option for the key, the text will be reversed before the comparison ("erga" => "agre").
- If the user selects the Trim option for the key, the text will be cleared from leading and trailing whitespace characters (TAB, SPACE, LF, CR, etc.) before comparing.
- If the user selects the Number option for the key, the text will be converted into numbers and compared by their numeric value.
- If the current nodes are not of the same type, then the following order is relevant: attribute < text < element.
- If a key value for an element contains fewer nodes than a key value for another element, then the first one is assumed to be smaller. This assumption is made only when all nodes of the shorter key value are equal to the corresponding nodes of the longer one.
For each key the user can define a different order (Ascending | Descending). The order of the keys in the table is very important because this is the order in which they will be used. If two keys have equal nodes but one of them has additional elements, then the one with the smaller number of nodes is considered smaller.
The difference between the DTD sort and the Advanced one is that the sort tool takes the tokenizer and the number option from the DTD (Element Features, Attribute Features). For attribute nodes the sort tool also takes from the DTD the order of enumeration values.
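The text options of a key (Trim, Reverse, Number and normalization with [No Tokenizer]) can be approximated by a short Python sketch. The function `text_key` is a hypothetical helper, the order in which the options are applied is an assumption, and `str.lower()` only approximates the conversion of capital letters for Cyrillic and Latin.

```python
def text_key(text, trim=False, reverse=False, number=False, normalize=False):
    """Prepare the text of a text or attribute node for comparison."""
    if trim:
        text = text.strip()   # drop leading and trailing whitespace
    if normalize:
        text = text.lower()   # capital letters become small letters
    if reverse:
        text = text[::-1]     # "erga" becomes "agre"
    if number:
        return float(text)    # compare by numeric value
    return text               # compared symbol by symbol (Unicode order)

print(sorted(["10", "9", "100"], key=lambda t: text_key(t, number=True)))
print(sorted(["  b", "A", "c "], key=lambda t: text_key(t, trim=True, normalize=True)))

# Python lists compare element by element, and a list which is a prefix of
# another one is smaller. This mirrors the statement that the key with the
# smaller number of nodes is considered smaller when the common nodes are equal.
print(["a"] < ["a", "b"])
```

Tokenizer-based comparison is not covered here, because it depends on the category order defined in the primitive tokenizer.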
Examples:
- Example 1: Sorting books by pages and title. The elements to sort are the book children of the context node. They will be sorted by the content in their pages element and title element. Key 1 is the text in the pages element of the book. It will be trimmed and converted to a number when sorted. In this key we do not need a tokenizer because the whole node will be converted to a number. If two elements are equal according to the first key (two books have the same number of pages), then they are compared with respect to the second key. Key 2 is the text in the title element of the book. It will be trimmed and normalized when sorted. For normalization the sort tool will use the normalization defined in the Mixed Word tokenizer. The order of this key is descending. It means that this key will sort books by the title in reverse order.
- Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in the document and sorts them according to the text in their head element. If a division does not have a head element, then it will be assumed to be smaller.
Example 1
Example 2
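Example 1 can be imitated in Python as a rough sketch. The function `sort_books` and the sample library are hypothetical, lower-casing stands in for the normalization of the Mixed Word tokenizer, and the context node is assumed to contain only book children.

```python
import xml.etree.ElementTree as ET

def sort_books(context):
    """Sort the book children of context by pages (number), then title (descending)."""
    books = context.findall("book")
    # Python's sort is stable, so the least significant key is applied first.
    books.sort(key=lambda b: b.findtext("title", "").strip().lower(), reverse=True)
    books.sort(key=lambda b: float(b.findtext("pages", "0").strip()))
    for book in books:
        context.remove(book)
    context.extend(books)  # only the order of the siblings changes

library = ET.fromstring(
    "<library>"
    "<book><title>beta</title><pages> 300 </pages></book>"
    "<book><title>Alpha</title><pages>120</pages></book>"
    "<book><title>Gamma</title><pages>300</pages></book>"
    "</library>"
)
sort_books(library)
print([book.findtext("title") for book in library])
```

The two books with 300 pages are equal according to Key 1, so their order is decided by Key 2 in descending order.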
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
- XPath Insert Attribute
This tool (Fig. 1) makes it possible to set a certain attribute on nodes selected by an XPath expression. The user specifies an Attribute Name and Value. Additionally, s/he can tune the tool to set the attribute only on nodes which do not have it yet. If the checkbox Skip Existing Attributes is unselected, the tool will set the given attribute with the given value on each element node returned by the XPath. Otherwise, it will skip all element nodes which already have this attribute and in this way it will preserve their original values. If the result from the evaluation of the XPath expression includes nodes other than Element nodes, they are ignored during processing.
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 1
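The behaviour of the Skip Existing Attributes checkbox can be sketched in Python as follows. The function `insert_attribute` is a hypothetical name, and `findall` covers only a limited subset of XPath and returns element nodes only.

```python
import xml.etree.ElementTree as ET

def insert_attribute(root, path, name, value, skip_existing=False):
    """Set attribute name=value on every element selected by path."""
    changed = 0
    for node in root.findall(path):
        if skip_existing and name in node.attrib:
            continue  # preserve the original value
        node.set(name, value)
        changed += 1
    return changed

root = ET.fromstring('<s><w pos="N">John</w><w>loves</w></s>')
print(insert_attribute(root, ".//w", "pos", "X", skip_existing=True))
print(ET.tostring(root, encoding="unicode"))
```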
- XPath Insert Child
This tool (Fig. 2) allows the user to insert certain child nodes in the content of other Element nodes. The target nodes which the insertion will be applied to are selected by an XPath expression. The result from the evaluation of the XPath expression must be a list of Element nodes. All the other types of nodes are discarded. This tool can insert two types of child nodes: Element and Text nodes. If the new children are of type Element, the tool expects from the user to supply a valid tag name. Otherwise, i.e. when the new children are Text nodes, the tool accepts any non-empty textual data. The user can also set on which position the new children will appear in their parents' content. Here the counting starts from 0, i.e. the first child is denoted by 0, the second - by 1, etc. If the position field remains empty, then the new nodes will be appended to the target nodes' content. Any non-numerical data in the position field will produce an error.
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 2
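The zero-based position and the append behaviour can be illustrated with a small DOM sketch in Python. The function `insert_child` is hypothetical, `getElementsByTagName` stands in for the XPath selection, and appending when the position lies beyond the existing content is an assumption of this sketch.

```python
from xml.dom import minidom

def insert_child(doc, targets, tag=None, text=None, position=None):
    """Insert a new Element (tag) or Text node (text) into every target."""
    for target in targets:
        new = doc.createElement(tag) if tag else doc.createTextNode(text)
        children = target.childNodes
        if position is None or position >= len(children):
            # Empty position field: append. Treating an out-of-range
            # position the same way is an assumption of this sketch.
            target.appendChild(new)
        else:
            target.insertBefore(new, children[position])

doc = minidom.parseString("<s><w>John</w><w>Mary</w></s>")
insert_child(doc, doc.getElementsByTagName("s"), tag="w", position=1)
print(doc.documentElement.toxml())  # <s><w>John</w><w/><w>Mary</w></s>
```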
- XPath Insert Parent
This tool (Fig. 3) enables the insertion of parent Element nodes for nodes selected by an XPath expression. The target nodes selected by the XPath expression can be either Element or Text nodes. Any other types of nodes are discarded from the selection. The tool expects the user to specify a valid tag name for the new parent nodes.
The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 3
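Inserting a parent amounts to wrapping each target node in a new element. A minimal DOM sketch in Python, assuming a hypothetical function `insert_parent` and using `getElementsByTagName` in place of an XPath selection:

```python
from xml.dom import minidom

def insert_parent(doc, targets, tag):
    """Wrap every target (Element or Text node) in a new element named tag."""
    for target in list(targets):
        wrapper = doc.createElement(tag)
        # Put the wrapper where the target was, then move the target inside.
        target.parentNode.replaceChild(wrapper, target)
        wrapper.appendChild(target)

doc = minidom.parseString("<s><w>John</w> loves</s>")
insert_parent(doc, doc.getElementsByTagName("w"), "np")
print(doc.documentElement.toxml())  # <s><np><w>John</w></np> loves</s>
```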
- XPath Insert Sibling
This tool (Fig. 4) allows inserting sibling nodes of nodes selected by an XPath expression. The selected target nodes can be of any kind except Attribute nodes. If the root node is selected, it is discarded during processing. The new nodes for insertion can be of type Element or Text. If the new nodes are of type Element, the tool expects the user to supply a valid tag name. Otherwise, i.e. when the new siblings are Text nodes, the tool accepts any non-empty textual data. The user can also set the position where the new sibling nodes will appear. The options are: previous (a sibling preceding the target node) and next (a sibling following the target node).
The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 4
- XPath Remove
This tool (Fig. 5) gives the possibility of removing parts from XML documents selected by an XPath expression. The target selection can list all types of nodes (including attributes). The only node which cannot be removed is the root of the document and if it is included in the selection it is discarded during the processing time. If a root node is detected in a selection, a warning message is shown. The removal can be done in two modes: either removing the selected nodes and their content (when Delete subtree is selected) or removing only the nodes without their content. In the latter case, the content of the deleted nodes is inserted in the content of their parent(s), in the places where the deletion was performed. The attribute nodes are not considered as content of the Element nodes they belong to, so they are removed in both cases.
The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 5
This operation uses "Before" and "After" columns from the Element Features to determine whether to insert space symbol before and after the deleted element.
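The two removal modes can be sketched with the DOM in Python. The function `remove_nodes` is hypothetical, attribute nodes and the space handling based on the "Before" and "After" columns are not modelled, and `getElementsByTagName` stands in for the XPath selection.

```python
from xml.dom import minidom

def remove_nodes(targets, delete_subtree=True):
    """Remove the target nodes, optionally keeping their content in place."""
    for node in list(targets):
        parent = node.parentNode
        if parent is None or parent.nodeType == node.DOCUMENT_NODE:
            continue  # the root of the document cannot be removed
        if not delete_subtree:
            # Move the content into the parent, at the place of the node.
            while node.firstChild is not None:
                parent.insertBefore(node.firstChild, node)
        parent.removeChild(node)

doc = minidom.parseString("<s><np><w>John</w></np><w>loves</w></s>")
remove_nodes(doc.getElementsByTagName("np"), delete_subtree=False)
print(doc.documentElement.toxml())  # <s><w>John</w><w>loves</w></s>
```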
- XPath Rename
This tool (Fig. 6) allows the user to rename Element nodes in a document selected by an XPath expression. The user is expected to supply a valid New name. The selected nodes are renamed without changing their attributes and content. If the selection contains nodes of type different than Element, a warning message is shown and these nodes are discarded from further processing.
The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Fig. 6
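Renaming keeps the attributes and the content of the selected elements, which the following Python sketch illustrates. The function `rename` is a hypothetical name and `findall` accepts only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET

def rename(root, path, new_name):
    """Rename every element selected by path; attributes and content stay."""
    nodes = root.findall(path)
    for node in nodes:
        node.tag = new_name
    return len(nodes)

root = ET.fromstring('<s><w pos="N">John</w></s>')
rename(root, ".//w", "tok")
print(ET.tostring(root, encoding="unicode"))  # <s><tok pos="N">John</tok></s>
```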
- XPath Transformations
This is a tool for applying various transformations over a document or documents. It is specified by two main sets of nodes - source and target, and other features, which are described below. The target nodes are nodes over which information will be added. The source nodes are nodes which give the information which will be added. Here is the main window of the XPath Transformation:
There are three modes which specify which documents are related with the source and target fields and where the result will be saved. The modes are: Local Source, External Source and Distributed Source.
-
Local Source
In this mode the Source and the Target nodes are from the same document.
If Multiple Apply check box is not selected the target and the source nodes are related with the current open document in the system.
If the check box is selected, the target and the source nodes are related with the documents in the Input column of the Internal Documents table. The result for each document is saved as a document with the name given in the relevant row in the Result column of the same table.
-
External Source
There is a table with one column, which specifies the source documents. The source nodes are related with it.
If Multiple Apply check box is not selected the target nodes are related with the current open document in the system.
If the check box is selected, the target nodes are related with the documents from the Input column of the Internal Documents table. The result for each document is saved as a document with the name given in the relevant row in the Result column of the same table.
-
Distributed Source
There is a table with two columns - Source and Target, which specifies the source and the target documents. The user can handle the Source column by the buttons on the right side of the table and the Target column by the buttons in the Internal Documents field. When the user adds a document in the Input column, this document is added in the Target column of the Source/Target table. The number of source and target documents has to be the same.
The rest of the features of the XPath Transformations dialog are:
-
Target
An XPath expression defining the target list of nodes, i.e. the nodes where the source will be included.
-
as a parent
The nodes from the source become parents (ancestors) of the target nodes. The system requires Element nodes for source and Element and Text nodes for target.
-
as a child
The nodes from the source become children of the target nodes in the position specified in the at position field. The system requires zero or a positive integer for the position, non Attribute nodes for source and Element nodes for target. If the returned value as a source is a number, a string or a boolean value, it is treated as a text node.
-
as a sibling
The nodes from the source become siblings of the target nodes in a position before or after a target node depending on the at offset field. The system requires non Attribute nodes for source and Element nodes for target. If the returned value as a source node is a number, a string or a boolean value, it is treated as a text node.
-
as attribute
The nodes from the source become attributes of the target nodes with name specified in the with name field. The system requires non Element node for source and Element nodes for target.
-
Relative to Source
This check box is used only when the source is treated as an XPath expression (XML check box is not selected).
When this check box is not selected, the target XPath is evaluated from the root of the target document.
When this check box is selected, the target XPath is evaluated for every node from the source as a context. As a result there is a list of nodes for each node in the source.
-
Source XPath/XML
This field specifies the source nodes. They could be nodes returned by XPath expression (evaluated on a specified document) or specified by an XML fragment. Whether the source is treated as XPath expression or XML fragment is specified by the XML checkbox.
-
All nodes from the source list will be processed for each target node.
-
Each node from the source list will be processed for each target node.
-
Equals
If this check is selected and if there is a difference between the number of source and target nodes the system reports an error.
-
Copy
If this button is selected, the source nodes are copied to the target nodes in a way specified by the tool fields.
-
Move
If this button is selected, after performing an operation for a node from the source list, the tool removes the node from the source location.
-
Include subtree
If this check box is selected, then the source list will contain for each selected node the entire subtree. If it is not selected, then only the local information for each node is put in the source list. The local information includes the tag name and the set of attributes as well as their values. When only a node with the local information is chosen and it has to be removed, then its children are inserted as immediate children of its parent. The insertion is made in the position of the deleted node.
-
XML
By this check box the treatment of the Source XPath/XML field is controlled.
If it is selected, then the source is treated as XML markup data. If the XML markup data does not contain tags, then it is treated as text.
If the check box is not selected, then the source is treated as an XPath expression.
- Statistics
The Statistics tool is used for counting the number of node and/or token occurrences in XML document(s). The items to be counted are initially selected by an XPath expression (field Select (XPath)). The selection returned by the XPath evaluation is a node set. At this point the Value Keys defined by the user are taken into consideration. Each key contains an XPath expression which is meant to point out the essential properties of the selected nodes. The value keys are similar to the ones in the Sort tool. For each node of the initial selection the values from the Value Keys are calculated independently. If for two nodes the corresponding calculated values are the same, they are assumed to belong to the same class. In this way each of the selected nodes is classified in one class. If the statistics has to be applied not only on XML nodes but also on tokens, the user must select a tokenizer from the Choose Tokenizer: field. In this way the text nodes will be segmented into meaningful tokens. In addition, the user can filter the tokens by category in order to receive information only for certain types of tokens (using the button Customize). Only tokens whose categories are in the list will be counted. All the rest will be discarded. If no tokenizer is selected, each text node will be processed as a whole.
The result from the statistics application is a list of all classes formed by the selected nodes. The information which is kept for each class is:
- Searched Item - the item found by the selection (tag name or token);
- Item Category - the category of the search item: if the item is a token - its token category, otherwise - <Element>;
- Number of occurrences - the number of items from the selection which belong to the class;
- Percentage - the percentage of the items belonging to the class, compared to the rest from the selection;
- Keys Value - a string representation of the value(s) for the class.
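The classification described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the CLaRK implementation: the function name classify, the sample document and the key "pos" are assumptions, and the standard xml.etree.ElementTree module supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def classify(root, select, keys):
    """Group the selected nodes into classes by tag name and key values."""
    nodes = root.findall(select)          # initial selection (a node set)
    classes = Counter()
    for node in nodes:
        # Each key is evaluated independently, relative to the selected node.
        values = tuple(node.findtext(key, default="") for key in keys)
        classes[(node.tag, values)] += 1
    total = sum(classes.values())
    # One row per class: item, category, occurrences, percentage, key values.
    return [
        (tag, "<Element>", count, 100.0 * count / total, " ".join(values))
        for (tag, values), count in classes.items()
    ]

# Hypothetical sample document: three <w> elements with a <pos> child each.
doc = ET.fromstring(
    "<text>"
    "<w><pos>N</pos>cat</w><w><pos>V</pos>runs</w><w><pos>N</pos>dog</w>"
    "</text>"
)
rows = classify(doc, ".//w", ["pos"])
for row in rows:
    print(row)
```

Two nodes fall into the same class only when all of their key values coincide, so here the two nouns form one class and the verb forms another.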
This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.
Statistics on Current Document: The result of this type of statistics is shown as a table below.
The "Category" column contains the categories from the filter which occur in the chosen text nodes, "<Element>" if the row represents an element node, or "#text" if the node is a text node.
The "Element" column contains the tokenized text (the values of the filtered tokens) or the node names.
The "#" column contains the number of occurrences of the corresponding item.
The "%" column contains the percentage of the corresponding item.
The "Key Value" column contains the value of the sort keys created for the corresponding node, or nothing if the row contains a token.
Closing the table, the user can choose from the following options:
- to save the result of the statistics into XML format, using the DTD definition below.
- to open the result in the system.
Statistics on Multiple Apply:
- the result is preserved in XML format and the document DTD has the following structure:
<!DOCTYPE statistics [
<!ELEMENT statistics (documents, item+, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category?, element, number?, percent?, keyvalue?)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>
]>
The documents tag lists the documents selected for the statistics; each document name appears in a document tag. Each item tag corresponds to a line of the result table as follows:
- category tag corresponds to the "Category" column
- element tag corresponds to the "Element" column
- number tag corresponds to the "#" column
- percent tag corresponds to the "%" column
- keyvalue tag corresponds to the "Key Value" column
The all tag corresponds to the last row of the result table. It contains the number of all the occurrences of the selected elements and information for percentage.
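As an illustration of the structure above, the following Python sketch builds a result document from a list of table rows. The function name build_statistics and the row layout are assumptions made for the example, and writing "100" as the percentage of the all element is a guess, since the exact value format is not specified here.

```python
import xml.etree.ElementTree as ET

def build_statistics(doc_names, rows):
    """rows: (category, element, number, percent, keyvalue) tuples."""
    stats = ET.Element("statistics")
    documents = ET.SubElement(stats, "documents")
    for name in doc_names:
        ET.SubElement(documents, "document").text = name
    total = 0
    for category, element, number, percent, keyvalue in rows:
        item = ET.SubElement(stats, "item")
        # Child order follows the content model of <item> in the DTD.
        for tag, value in (("category", category), ("element", element),
                           ("number", number), ("percent", percent),
                           ("keyvalue", keyvalue)):
            ET.SubElement(item, tag).text = str(value)
        total += number
    summary = ET.SubElement(stats, "all")
    ET.SubElement(summary, "number").text = str(total)
    ET.SubElement(summary, "percent").text = "100"   # assumed format
    return stats

result = build_statistics(
    ["a.xml"],
    [("<Element>", "w", 2, 66.7, "N"), ("<Element>", "w", 1, 33.3, "V")],
)
print(ET.tostring(result, encoding="unicode"))
```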
Filtering XML Result Data
In many cases not all the information from the Statistics tool needs to be saved. Sometimes the result data is too large and its further processing is difficult. In such cases the Output Info options can be used for specifying which information should be kept and which should be removed from the result document. The options are as follows:
Only the selected items will have a representation in the result XML document.
- Node Info
This item gives information about the number of occurrences of specified tags or tokens in a set of internal documents. When the user starts this tool, s/he is asked to provide several things:
- The type of information which is needed. The two possibilities are: counting tags and counting tokens.
- The documents for which the information is needed.
- An XPath expression which selects the nodes in each document for being counted.
Here is a screen-shot of the initial dialog window:
The main components of the dialog window are:
- XPath Field - selects the nodes in each document for which the counting will be performed. If some tags are counted, then for each node from the selection of this XPath expression, its descending nodes will be counted. If some tokens are counted, then for each text node from the selection its text content will be tokenized and the result tokens will be counted.
- Tokenizer Selector - determines which tokenizer will be used when tokenizing the text nodes for token counting. This component is disabled in case of tag counting.
- Info Type Selector - determines the type of elements, which will be counted. The options are: "Word Info" - for token counting and "Tag Info" - for tag counting.
- Document Selector - this component is responsible for selecting the documents from the internal document database to which the counting will be applied. This is a universal component of the CLaRK system. For more information see Document Selector in menu File.
- Show Info button - starts calculating the information for the selected documents.
- Cancel button - closes the window and cancels further processing.
If the Show Info button is pressed, the system starts to process the selected documents one by one. While processing the documents, the status bar of the system shows the current process state. Having processed all the selected documents, the system shows the result in a new window. Here are two example results, one for Word Info and one for Tag Info:
- Word Info
The first column Document contains the names of the documents chosen from the first dialog.
The second column Category contains the categories from the tokenizer which the user has already chosen.
The third column # contains the number of the occurrences of each category in the text.
The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is in XML format. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.
If Add information checkbox is selected then the relevant information will be added to each of the documents. The word information added to a document has the following form:
<extent>
<interpGrp>
<interp type ="LATw" value="41"></interp>
<interp type ="CYRw" value="5848"></interp>
<interp type ="NUMBER" value="181"></interp>
</interpGrp>
</extent>
If the DTD for a document is TEI, then <extent> is added in the appropriate position. Otherwise, <extent> is added after the first node.
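A minimal sketch of how such word information could be computed is given below. The regular expression is a stand-in for a real tokenizer: the category names LATw, CYRw and NUMBER are taken from the example above, but their definitions here are assumptions, as is the function name word_info.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical token categories; the real ones come from the chosen tokenizer.
TOKEN = re.compile(
    r"(?P<LATw>[A-Za-z]+)|(?P<CYRw>[\u0400-\u04FF]+)|(?P<NUMBER>\d+)"
)

def word_info(text):
    """Count tokens per category and wrap the counts in an <extent> element."""
    counts = Counter(match.lastgroup for match in TOKEN.finditer(text))
    extent = ET.Element("extent")
    group = ET.SubElement(extent, "interpGrp")
    for category, count in sorted(counts.items()):
        ET.SubElement(group, "interp", type=category, value=str(count))
    return extent

info = word_info("CLaRK 2 има 3 думи")
print(ET.tostring(info, encoding="unicode"))
```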
- Tag Info
The first column Document contains the names of the documents chosen from the first dialog.
The second column Tag contains the tag names of all the nodes which the user has chosen with the XPath expression from the first dialog.
The third column # contains the number of the occurrences of each tag in the documents.
The content of the table can be saved in a file if the Save in file checkbox is selected. The result file is in XML format. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.
If Add information checkbox is selected, then the relevant information is added to each of the documents. The information added to each document has the following form:
<encodingDesc>
<tagsDecl>
<tagUsage gi="aa" occurs="5355"></tagUsage>
<tagUsage gi="hi" occurs="19"></tagUsage>
<tagUsage gi="lb" occurs="5"></tagUsage>
<tagUsage gi="p" occurs="90"></tagUsage>
<tagUsage gi="ph" occurs="5355"></tagUsage>
<tagUsage gi="pt" occurs="1144"></tagUsage>
<tagUsage gi="s" occurs="352"></tagUsage>
<tagUsage gi="ta" occurs="5355"></tagUsage>
<tagUsage gi="tok" occurs="301"></tagUsage>
<tagUsage gi="w" occurs="5355"></tagUsage>
</tagsDecl>
</encodingDesc>
If the DTD for a document is TEI, then <encodingDesc> is added in the appropriate position. Otherwise, it is added after the first node.
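The tag counting can be sketched along the same lines. The example below counts the descendants of every node selected by an XPath expression and produces an element of the form shown above. The function name tag_info is an assumption, and the sketch does not try to avoid double counting when selected nodes are nested in each other.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_info(root, select):
    """Count descendant tags of the selected nodes; return <encodingDesc>."""
    counts = Counter()
    for node in root.findall(select):
        # Count the descendants of each selected node (the node itself excluded).
        for descendant in node.iter():
            if descendant is not node:
                counts[descendant.tag] += 1
    encoding = ET.Element("encodingDesc")
    decl = ET.SubElement(encoding, "tagsDecl")
    for name, count in sorted(counts.items()):
        ET.SubElement(decl, "tagUsage", gi=name, occurs=str(count))
    return encoding

sample = ET.fromstring("<text><p><s><w/><w/></s></p><p><s><w/></s></p></text>")
usage = tag_info(sample, ".//p")
print(ET.tostring(usage, encoding="unicode"))
```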
- Text Replace
A default shortcut Ctrl+T
An icon on the text area frame toolbar
This tool is used for searching for patterns in the text and replacing (marking) them in an appropriate way.
The dialog has three fields:
- Apply to (XPath) field is the restriction field. Here the user specifies an XPath expression that restricts the text nodes to which the search expression will be applied. It is evaluated as a predicate. The tool processes only the text nodes in the result of the evaluation.
- Search for (Expression) field is the search field. Here the user defines an expression to match parts of the data in the text nodes.
- Replace with (Mark-up) field is the replace field. Here the user enters normal text or XML mark-up which will replace the matched data. The field may remain empty. In this case the matched data will be removed from the text.
There are two modes of application:
- Advanced
If the search mode is advanced, all text nodes are tokenized by the tokenizer specified in the Settings panel. The regular expression defined by the user is then applied to the resulting tokens. In the regular expression the user can write token values and token categories in the same way as in the Grammar tool (see Grammar - Regular Expressions). The user can apply the normalize option and filters.
- Simple
If the search mode is simple, the expression is taken as a whole string and searched for in the text (note that symbols for new paragraph, tabulation, etc. are not supported). The search can be case sensitive (Match case).
- Multiple Apply - gives the user the possibility to replace the text in more than one document. For details see Tool Application Modes.
- Queries - the user can save the current query in the system for further use (see XML Tool Queries).
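The simple mode can be illustrated with a small Python sketch that works on the content of a single text node. It shows only the literal search with an optional Match case setting and an empty replace field. The function name simple_replace is an assumption, and turning the inserted mark-up into real XML nodes is left out.

```python
import re

def simple_replace(text, search, markup, match_case=False):
    """Replace every literal occurrence of `search` in `text` with `markup`."""
    if not search:
        return text                       # nothing to search for
    flags = 0 if match_case else re.IGNORECASE
    # re.escape makes the search string literal; the lambda keeps `markup`
    # from being interpreted as a regex replacement template.
    return re.sub(re.escape(search), lambda match: markup, text, flags=flags)

marked = simple_replace("The cat saw the Cat.", "cat", "<w>cat</w>")
removed = simple_replace("The cat saw the Cat.", "cat", "", match_case=True)
print(marked)
print(removed)
```

With an empty replace string the matched data is simply removed, which mirrors the behaviour described for an empty Replace with (Mark-up) field.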
Buttons:
- Keep Undo - this option enables keeping undo information about the changes when the tool is applied on the current document and when necessary the previous document content can be restored. Disabling this option can be useful when large amounts of data are processed and saving memory and processing time is important. This option has no effect when the tool works in Multiple Apply mode.
- Replace All - replaces all matches in the document with the text in the replace field.
- Close - closes the dialog
- Select - gives the user an opportunity to go through the text and mark or skip pattern matches:
- Next - finds the subsequent data that matches the expression.
- Previous - finds the previous data in the document that matches the expression.
- Mark - replaces the selected data with the text in the replace field - the user can change mark-up.
- Exit - returns to the Text Replace dialog.
- MultiQuery Tool
This tool is designed to call other tools. It does not work directly with XML data. The tool uses a list of XML Tool Queries which are executed one by one in the order of their appearance. The result from each single tool application is an input for the next single tool application. The result from the last single tool application is the result from the MultiQuery Tool. The tools query list is represented in the manager window by a table in which each row keeps information for one query. Each row contains the Name of the query in the second column, which is its document name. The third column shows the Type of the query, i.e. the tool it represents. In the fourth column the Info data from each query is shown. This is optional text which can be saved for each query when it is created or updated. The Info input field is part of the Queries panel in the different tools.
The first column (Label) of the table contains important information about the Conditional Control Operators (Controls, for short), specific for this tool. These Controls allow changing the order of application of the different queries and/or conditional application of certain operations. For more information see the Controls description below.
Along with Queries and Controls the user can use here conditional check points (Conditions) to verify that certain conditions are satisfied. These conditions determine whether the current processing procedure will proceed with the next step or (in case of failure) the decisions taken so far should be reconsidered in order to produce new intermediate results. The Conditions present a backtracking mechanism which can be applied on Grammars and Value Constraints.
Here is what the MultiQuery Tool dialog looks like:
The operations which the user can perform during the creation of a list of queries are:
- Add Query - adds one or more tool queries to the list. The user is shown a selection dialog where s/he can choose queries from the different system tool folders (groups).
- Remove - removes the currently selected query/control/condition from the list (table). The removal is NOT preceded by a warning message.
- Insert Control - inserts a new Control operator after the selected query or another Control in the table. For more details see the Controls definition below.
- Edit Control - allows editing the currently selected Control operator in the table. For more details see the Controls definition below.
- Insert Condition - inserts a new Condition after the selected row in the table. For more details see the Conditions definition below.
- Edit Condition - allows editing the currently selected Condition in the table. For more details see the Conditions definition below.
- Options - here several options concerning the whole application process are available:
- Prompt before application - if selected, the system will ask the user for confirmation before each single tool application. Available only for the current document application mode.
- Break on no result - if selected, the process of application will be stopped when a single tool application does not produce a result or does not change the document. Otherwise, the application will proceed to the end. This option is available only for the current document application mode.
- Use garbage collection before processing - if selected, a garbage collection will be performed before each single tool application. The usage of this option reduces the system resources which are needed for the processing, and improves the efficiency of the next operations.
- Reordering - this allows changing the order of the queries in the table. Reordering can be done by simply dragging the rows of the table up or down.
The tool also supports the two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes.
The user can save or load the current tool settings, i.e. XML Tool Queries are supported here. This feature allows the creation of tool queries. There is no limitation for the level of inclusions in the queries. If a cyclic inclusion is detected (in the multi-query or in any of its sub-queries) the system produces an error.
Controls
The Control operator allows changing the order of application of the queries in the MultiQuery tool. The usual order of application starts from the first query and proceeds one by one up to the last one. Using Control operators, some queries can be applied only if certain conditions are true. Such conditions are: the true or false value of a result from an XPath evaluation; whether the preceding single tool application has or has not modified the working document; or unconditional (always succeeding). When the condition of a Control is true, the next query (or another Control) to be applied is the one defined in the Control itself. Otherwise, the application proceeds with the next entry in the order (query or another Control). The Control operators address their targets (queries or Controls to be applied in case of success) by pointing to their labels. Each entry (row) in the table of the MultiQuery Tool can have a label (unique identifier) which can be referred to by Control operators. It is an error if a Control operator uses a target label which does not exist.
Each Control operator may consist of three parts:
- Type - determines the type of the Control, i.e. the conditions for checking. There are several types of control:
- IF (XPath) - the condition is an XPath expression. If the result from its evaluation on the current working document is: a non-empty list, a non-empty string, a non-negative number or a true boolean value the Control succeeds.
- IF NOT (XPath) - the condition is an XPath expression. In contrast to the previous type, here, if the result from the XPath evaluation on the current working document is: an empty list, an empty string, a negative number or a false boolean value the Control succeeds.
- IF CHANGED - the condition is the result from the previous single tool application. If the current working document has been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
- IF NOT CHANGED - the condition is the result from the previous single tool application. If the current working document has NOT been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
- GOTO - the condition is always satisfied. This is an unconditional movement to the target of the Control.
- XPath (depends on the type) - an XPath expression is evaluated, the result of which determines the success of the control. This part is used in controls of type IF (XPath) and IF NOT (XPath).
- Target - a label reference which points to the next location where the execution will continue in case of satisfied condition for the Control.
By using the Controls-labels technique, the user can model the familiar IF-THEN-ELSE and WHILE-CONDITION-DO structures in order to make the processing more flexible. The composition of different Controls allows the user to create varied 'programs' or 'scripts' capable of doing certain jobs. It is up to the user to create efficient and reliable processing procedures.
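To make the labels-and-targets mechanism concrete, here is a small Python sketch of an interpreter for such a table. It is a model of the behaviour described above, not the actual CLaRK code: the entry layout, the function name run and the max_steps guard are assumptions, the document is an ordinary string, and plain Python functions stand in for tool queries and XPath tests.

```python
def run(entries, document, max_steps=1000):
    """Run a table of labelled queries and controls over `document`.

    Each entry is a dict with an optional "label" and either a "query"
    (a function document -> document) or a "control" type with a "target"
    label and, for IF / IF NOT, a "test" (a function document -> bool).
    """
    labels = {e["label"]: i for i, e in enumerate(entries) if e.get("label")}
    position, changed = 0, False
    for _ in range(max_steps):            # guard against endless loops
        if position >= len(entries):
            break
        entry = entries[position]
        position += 1                     # default: continue with the next row
        if "query" in entry:
            result = entry["query"](document)
            changed = result != document  # remembered for IF (NOT) CHANGED
            document = result
            continue
        kind = entry["control"]
        if kind == "IF":
            succeeded = entry["test"](document)
        elif kind == "IF NOT":
            succeeded = not entry["test"](document)
        elif kind == "IF CHANGED":
            succeeded = changed
        elif kind == "IF NOT CHANGED":
            succeeded = not changed
        else:                             # GOTO
            succeeded = True
        if succeeded:
            if entry["target"] == "<break>":
                break
            position = labels[entry["target"]]   # KeyError: unknown label
    else:
        raise RuntimeError("step limit exceeded")
    return document

# A WHILE-like loop: collapse double spaces until nothing changes, then trim.
entries = [
    {"label": "loop", "query": lambda d: d.replace("  ", " ", 1)},
    {"control": "IF CHANGED", "target": "loop"},
    {"query": str.strip},
]
print(run(entries, " a    b "))
```

An IF-THEN-ELSE is obtained in the same way: an IF control jumps to the label of the THEN branch, and a GOTO at the end of the ELSE branch skips over it.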
Controls Editor
Here follows a description of the Control Operator Editor:
The editor design follows the Controls structure in three sections:
- Type - a list of all types of Controls: IF, IF NOT, IF CHANGED, IF NOT CHANGED and GOTO.
- XPath - an XPath expression field, active only for types IF and IF NOT.
- Target - a list of all labels currently defined in the MultiQuery Tool table. Here, for convenience, the user can enter target labels which have not been defined yet. At the end of the list there is one special label <break> which is used for suspending the processing procedure. In other words, if the condition for a control is satisfied, the system will not proceed with the application.
Conditions
The Conditions are operators which perform certain checks on the current working document. The conditions can cause reconsidering the decisions taken so far and producing new result documents. Different decisions for a certain document can be taken by the Grammar tool and the (Value) Constraints tool. The condition operators can be used ONLY in Multiple Apply mode of the tool. The usage on the current document is not allowed because of efficiency reasons (multiple backtracking events can cause the system to work very slowly).
When a condition check fails, the system reconsiders the latest decision taken at a choice point and restores the working document to the state it had when that decision was taken. If a new decision can be taken, the system proceeds with the application using it. Otherwise, the system continues searching backwards for another choice point where a new decision can be taken. If no solution for a condition is found, the system terminates the current application.
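The backtracking behaviour can be modelled with a short recursive sketch in Python. It is a simplified illustration under several assumptions: the step layout and the name apply_steps are invented for the example, a choice point is a function returning all alternative results at once, and the document is a plain list.

```python
def apply_steps(steps, document, index=0):
    """Depth-first application of steps with backtracking.

    Each step is ("choice", f), where f(document) returns a list of
    alternative result documents, or ("condition", p), where p(document)
    is True when the check is satisfied.
    Returns the first final document that passes all conditions, else None.
    """
    if index == len(steps):
        return document
    kind, func = steps[index]
    if kind == "condition":
        if not func(document):
            return None                   # fail: the caller tries its next alternative
        return apply_steps(steps, document, index + 1)
    for alternative in func(document):    # a choice point
        result = apply_steps(steps, alternative, index + 1)
        if result is not None:
            return result
    return None                           # no alternative left: backtrack further

# Two ambiguous decisions (N or V), followed by a check for exactly one V.
steps = [
    ("choice", lambda d: [d + ["N"], d + ["V"]]),
    ("choice", lambda d: [d + ["N"], d + ["V"]]),
    ("condition", lambda d: d.count("V") == 1),
]
solution = apply_steps(steps, [])
print(solution)
```

Because every alternative is built as a new list, returning from a failed branch automatically restores the document to its state at the choice point.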
There are two types of Conditions which can be used:
- XPath based - an XPath expression is evaluated on the working document and if the result is positive (a non-empty node-set, a non-empty string, a positive number or a true boolean value) the condition is satisfied;
- Value Constraint based - the condition specifies a (Value) Constraints query which contains constraints to be applied in validation mode on the working document. If one of the constraints is not satisfied, the whole condition fails.
Condition Editor
The Condition Editor gives the user the ability to create a new Condition or to modify an existing one. The editor dialog appears when the user selects Insert Condition or Edit Condition buttons. The editor window's layout is the following:
The first section of the dialog determines the condition type: XPath (the user specifies an XPath expression condition) or Constraints (the user selects a Constraint query).
The second section provides several common Options for the two types of conditions:
- Enable 'CUT' operation for condition - this option (when selected) means that if the condition succeeds once, no subsequent decision reconsideration will be performed before this point. This reduces the search space and the amount of memory required for the operation. The operation resembles the common 'cut' operation in many backtracking based environments.
- Save-on-success - this option allows saving the state of the working document if this condition succeeds. The document is saved under the name specified in the Name base field plus an additional suffix formed by an integer index in parentheses (name_base(1), name_base(2), ...). Each time the processing successfully passes through this condition a new unique name is generated and the document is saved independently. If the Overwrite option is selected, each time a new name is generated the index increases by one and the older document with this name (if such exists) is overwritten. Additionally, the user can specify a location (Result group) where the result documents should be stored.
- MultiQueryEx Tool
This tool is designed to call other tools. It does not work directly with XML data. The tool uses a list of XML Tool Queries which are executed one by one in the order of their appearance. The tools query list is represented in the manager window by a table in which each row keeps information for one query. The main difference with respect to the MultiQuery Tool is in the input/output management. The documents which will be processed and the result documents are contained in the queries. Each row contains the Name of the query in the second column, which is its document name. The third column shows the Type of the query, i.e. the tool it represents. In the fourth column the Info data from each query is shown. This is optional text which can be saved for each query when it is created or updated. The Info input field is part of the Queries panel in the different tools.
The first column (Label) of the table contains important information about the Conditional Control Operators (Controls, for short), specific for this tool. These Controls allow changing the order of application of the different queries and/or conditional application of certain operations. For more information see the Controls description below.
Here is what the MultiQueryEx Tool dialog looks like:
The operations which the user can perform during the creation of a list of queries are:
- Add Query - adds one or more tool queries to the list. The user is shown a selection dialog where s/he can choose queries from the different system tool folders (groups).
- Edit Query - opens the tool dialog and loads the currently selected query. The user can modify the query and update it (see XML Tool Queries). Editing the query in this way, the user has an additional possibility to change the query's input documents. A new list, Non Existing Documents, is added to the standard Internal Document Manager dialog. This list contains documents which will be created when MultiQueryEx is applied. All the queries which are added in the MultiQueryEx tool contain names for result documents, which may not exist. If some input documents of a query do not exist when the MultiQueryEx tool is applied, they are removed from the input of the query.
- Remove Query - removes the currently selected query from the list (table). The removal is NOT preceded by a warning message.
- Insert Control - inserts a new Control operator after the selected query or another Control in the table. For more details see the Controls definition below.
- Edit Control - allows editing the currently selected Control operator in the table. For more details see the Controls definition below.
- View result in Editor - if this check box is checked, the result from the MultiQueryEx tool (applying queries and the control status) will be shown in the Editor as an XML document.
- Reordering - this allows changing the order of the queries in the table. Reordering can be done by simply dragging the rows of the table up or down.
The user can save or load the current tool settings, i.e. XML Tool Queries are supported here. This feature allows the creation of tool queries. There is no limitation for the level of inclusions in the queries.
Controls
The Control operator allows changing the order of application of the queries in the MultiQuery tool. The usual order of application starts from the first query and proceeds one by one up to the last one. Using Control operators, some queries can be applied only if certain conditions are true. Such conditions are: the true or false value of a result from an XPath evaluation; whether the preceding single tool application has or has not modified the working document; or unconditional (always succeeding). When the condition of a Control is true, the next query (or another Control) to be applied is the one defined in the Control itself. Otherwise, the application proceeds with the next entry in the order (query or another Control). The Control operators address their targets (queries or Controls to be applied in case of success) by pointing to their labels. Each entry (row) in the table of the MultiQuery Tool can have a label (unique identifier) which can be referred to by Control operators. It is an error if a Control operator uses a target label which does not exist. The main difference with respect to the MultiQuery tool is that the Controls are evaluated over documents chosen by the user when each Control is created or edited.
Each Control operator may consist of four parts:
- Type - determines the type of the Control, i.e. the conditions for checking. There are several types of control:
- IF (XPath) - the condition is an XPath expression. If the result from its evaluation on at least one document in the Internal Documents list is a non-empty list, a non-empty string, a non-negative number or a true boolean value, the Control succeeds.
- IF NOT (XPath) - the condition is an XPath expression. In contrast to the previous type, here, if the result from the XPath evaluation on the current working document is: an empty list, an empty string, a negative number or a false boolean value the Control succeeds.
- IF CHANGED - the condition is the result from the previous single tool application. If the current working document has been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
- IF NOT CHANGED - the condition is the result from the previous single tool application. If the current working document has NOT been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
- GOTO - the condition is always satisfied. This is an unconditional movement to the target of the Control.
- DELETE - unconditional control which deletes all the documents in the Internal Documents list.
- EXIST - if at least one document in the Internal Documents list exists the Control succeeds.
- IF APPLIED - the condition is the result from the last query application. If all input documents in this query do not exist in the system, the Control does not succeed.
- XPath (depends on the type) - an XPath expression is evaluated, the result of which determines the success of the control. This part is used in controls of type IF (XPath) and IF NOT (XPath).
- Target - a label reference which points to the next location where the execution will continue in case of satisfied condition for the Control.
- Internal Documents list - list of documents from the system which are used by the Controls for evaluation.
By using the Controls-labels technique, the user can model the famous IF-THEN-ELSE and WHILE-CONDITION-DO structures in order to make the processing more flexible. The composition of different Controls allows the user to create varied 'programs' or 'scripts' capable of doing certain jobs. It is up to the user to create efficient and reliable processing procedures.
Controls Editor
Here follows a description of the Control Operator Editor:
The editor design follows the Controls structure in four sections:
- Type - a list of all types of Controls: IF, IF NOT, IF CHANGED, IF NOT CHANGED, GOTO, DELETE, EXIST and IF APPLIED.
- XPath - an XPath expression field, active only for types IF and IF NOT.
- Target - a list of all labels currently defined in the MultiQuery Tool table. Here, for convenience, the user can enter target labels which have not been defined yet. At the end of the list there is one special label <break> which is used for suspending the processing procedure. In other words, if the condition for a control is satisfied, the system will not proceed with the application.
- Internal Documents list - list of documents from the system which are used by the Controls for evaluation.
Menu Document
- Change DTD
It changes the DTD of the current document. One can choose among all DTDs that have already been compiled into the system. When a new DTD is selected, it is assigned to the current document. The document is then validated with respect to this new DTD and its layout is updated according to the DTD's layout. If the document contains default attributes whose default values are unchanged, i.e. still obey the old DTD, then these attributes are removed.
- Change DOCTYPE
It changes the DOCTYPE (root element) of the current document. One can choose between all elements from the current document's DTD. The document is validated after the change.
- Validate
An icon on the toolbar
An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.
The DTD consists of
- element definitions
- name declaration
- regular expression that defines the content of the element (constraints)
- attribute definitions
For more information see http://www.w3.org/TR/REC-xml
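Since the content of an element is defined by a regular expression over its children, the core of the content check can be sketched with ordinary regular expressions in Python. The table CONTENT_MODELS below is a hand-written, hypothetical translation of two sample content models, and check_content is an assumed name. A real validator derives such expressions from the compiled DTD and also reports which kind of error occurred.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hand-written translations of two content models into regular
# expressions over the sequence of child tag names (each followed by a space).
CONTENT_MODELS = {
    "books": re.compile(r"(book )+"),                    # (book+)
    "book": re.compile(r"title (author )+publisher "),   # (title, author+, publisher)
}

def check_content(element):
    """Return True if the children of `element` match its content model."""
    model = CONTENT_MODELS[element.tag]
    children = "".join(child.tag + " " for child in element)
    return model.fullmatch(children) is not None

misplaced = ET.fromstring("<books><book/><author/></books>")
unfinished = ET.fromstring("<book><title/><author/></book>")
complete = ET.fromstring("<book><title/><author/><author/><publisher/></book>")
print(check_content(misplaced), check_content(unfinished), check_content(complete))
```

The first sample contains a child that the content model does not allow at that position, and the second one stops before the content model is complete. Both situations are reported by the validation messages of this menu item.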
The following messages may appear after applying the validation procedure to a document. All of them mean that the document is not valid, and each of them gives a hint about the source of the error.
- Misplaced Element Error Messages
- Element "element" not allowed as a child at that position for element "parent"
This error message is given when some element cannot be placed in a certain position among the child nodes of another element.
Example:
In the DTD:
<!ELEMENT books (book+)>
In the document:
<books>
<book>…</book>
<author>…</author>
</books>
- Undefined Element Error Messages
- #REQUIRED attributes missing! (list_of_REQUIRED_attr) or Required attribute "attr_name" for element "element" is missing!
The message is given when an element does not contain a #REQUIRED attribute.
- Element "element" has no attribute named "attr_name" !
The message is given when an element is assigned an attribute, which was not declared for the element's type in the DTD.
- Element "element" not found! or Element "element" is not declared!
The element is not declared in the DTD.
- Other Element Error Messages
- Content not finished checking type "element" !
This message is given when the element requires more children to complete its content.
Example:
In the DTD:
<!ELEMENT book (title, author+, publisher)>
In the document:
<book>
<title>Alice in Wonderland</title>
<author>Lewis Carroll</author>
</book>
- Element "element" must be EMPTY!
The element is declared in the DTD as an element with empty content, but in the document it is used with non-empty content.
- Root must be "root_name" !
This message is shown when the document element is different from the DOCTYPE of the DTD (or the DOCTYPE, which was selected after the DTD compilation)
Example:
In the DTD:
<!DOCTYPE books [ ….
In the beginning of the document:
<library> ….
- ID & IDRef Attribute Error Messages
- Bad ID reference - "id_ref" - for attribute "attr_name" !
The attribute is of type IDREF, but contains a value that cannot be an ID.
- Bad ID - "id" - for attribute "attr_name" !
The attribute is of type ID, but contains a value that cannot be an ID.
Example: …<book id="123 456">…
- Duplicate ID for attribute "attr_name" !
There are two or more elements which have attributes of type ID with the same value.
- ID "attr_val" for attribute "attr_name" not found !
There is an attribute of type IDREF (or IDREFS), but the id (ids), which it refers to, is (are) not found in the document.
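The four ID-related checks above can be sketched in a few lines. This is only an illustration, not the validator used by CLaRK: the function name `check_ids`, its input format, the shortened message texts and the ASCII-only name pattern are all assumptions made for the example.

```python
import re

# Simplified, ASCII-only approximation of an XML Name (an assumption for
# this sketch; the XML specification allows many more characters).
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*\Z")

def check_ids(id_values, idref_values):
    """Return messages for bad, duplicate and unresolved IDs.

    id_values    -- values of all attributes of type ID, in document order
    idref_values -- values of all attributes of type IDREF
    """
    errors = []
    seen = set()
    for value in id_values:
        if not NAME.match(value):
            errors.append(f'Bad ID - "{value}"')
        elif value in seen:
            errors.append(f'Duplicate ID - "{value}"')
        seen.add(value)
    for ref in idref_values:
        if not NAME.match(ref):
            errors.append(f'Bad ID reference - "{ref}"')
        elif ref not in seen:
            errors.append(f'ID "{ref}" not found')
    return errors

# "123 456" contains a space and starts with a digit, so it cannot be an ID.
print(check_ids(["b1", "123 456", "b1"], ["b1", "b9"]))
```

An attribute of type IDREFS would first be split on white space and each token checked in the same way.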
- Other Attribute Error Messages
- Attribute "attr_name" has a FIXed default value - "def_value", not "wrong_value" !
The message is given when an attribute declared as #FIXED in the DTD appears in the document with a value different from the fixed one.
- Attribute "attr_name" must contain only one token - "attr_value" !
The attribute is of type NMTOKEN, but contains more than one token.
- Entity "entity" not declared (in attribute "attr_name") !
The attribute is of type ENTITY or ENTITIES, but it contains a value (or values) not declared as an entity in the DTD.
- Value "attr_value" of attribute "attr_name" must be among (list_of_values)!
The attribute has a value which is not among the values enumerated for it in the DTD.
Example:
In the DTD:
<!ATTLIST author
title ( Mr. | Ms. | Miss. ) #IMPLIED >
In the document:
…<author title="Dr.">…
- All Children Constraint Messages
- element definitions
- Apply All Children
This option applies all active Value constraints of type All Children on the current document in the system editor.
Having selected this option, the user is shown information about the number of nodes in the current document which do not satisfy the constraints. The user is given the possibility to navigate through all invalid nodes one by one, by using the Next/Previous buttons on the toolbar in the same way as if an XPath search is performed.
Because of the arbitrary complexity of the constraints, the information about the invalid nodes is not updated dynamically, which means that some of the nodes might have changed their validity status in the meantime. Therefore, the Apply operation must be performed again manually in order to update the node information.
- New View
It opens a new view for the current document. This new view is presented in a new window with the text layout. The new view is synchronized with the other views of the same document. For example, when a node is selected in one view, it is automatically selected in the others. All changes made in one of the views are immediately reflected in the others. The only thing which remains independent for each view is the layout. When a view is opened, it takes its initial text layout from the DTD and from the Color Scheme, if one is selected. This layout can be modified later.
- Edit Current Text Layout
An icon on the toolbar
This item allows the layout of each element in a DTD to be edited. For each tag (opening or closing), additional new lines can be attached before and after the tag, which makes the text view easier to read. The tag and its children can be made visible or invisible, so the user can hide information that is not of interest. The layout is set only for the current view of the document. After the view is closed, all the information about the layout is lost. If the user wants to save the layout, this can be done in two ways:
- Export Layout button - saves current layout out of the system
- Save Current Layout for DTD in menu Document.
For more information about the layout table, see Edit Text Layout in menu DTD.
- Save Current Layout for DTD
Sets the text layout of the current document view as a Text Layout for the current document's DTD.
- Reload DTD Text Layout
Restores the text layout of the current document view according to the Text Layout for the corresponding DTD.
- Add Default Attributes
This item adds the default attributes (if defined) to every element in the current document. The default attributes of each element, as defined in the DTD, are collected in a list. If the element lacks an attribute which is a member of this list, the attribute is added to the element together with its default value. After applying this operation, the system shows how many new attributes have been added to the current document.
- Remove Default Attributes
This item removes all default attributes which still have their unchanged default values in the current document. The procedure is as follows. First, all default attributes defined for the element are taken from the DTD. Then, for each of these attributes, it is checked whether the element in the document has it. If it does, the attribute's value is compared with the default value for this attribute in the DTD. If the two values are the same, the attribute is removed from the element in the document.
After applying this operation, the system shows how many attributes have been removed from the current document.
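The two operations, Add Default Attributes and Remove Default Attributes, can be illustrated with a short sketch. It uses Python's standard `xml.etree.ElementTree` module and a hand-written `DEFAULTS` dictionary as a hypothetical stand-in for the defaults declared in a compiled DTD; the function names are likewise invented for the example and say nothing about how CLaRK implements the feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the defaults declared in a DTD:
# element name -> {attribute name: default value}.
DEFAULTS = {"book": {"lang": "en", "format": "paper"}}

def add_default_attributes(root, defaults):
    """Add every missing default attribute; return how many were added."""
    added = 0
    for elem in root.iter():
        for name, value in defaults.get(elem.tag, {}).items():
            if name not in elem.attrib:
                elem.set(name, value)
                added += 1
    return added

def remove_default_attributes(root, defaults):
    """Remove attributes whose value equals the default; return the count."""
    removed = 0
    for elem in root.iter():
        for name, value in defaults.get(elem.tag, {}).items():
            if elem.attrib.get(name) == value:
                del elem.attrib[name]
                removed += 1
    return removed

doc = ET.fromstring('<books><book lang="bg"/><book/></books>')
print(add_default_attributes(doc, DEFAULTS))     # 3
print(remove_default_attributes(doc, DEFAULTS))  # 3
```

Note that the explicitly set `lang="bg"` survives both operations, because its value differs from the default.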
- Load Index Data
This item is related to the document indexing feature for fast data searching. For details see menu item Definitions/Document Index.
Using this item requires a document opened in the system editor. If the current document has been indexed, the relevant index data is loaded. If errors occur, the user is notified by an error message. The main purpose of this option is to check in advance whether the index data can be loaded successfully for certain document(s) before index search is used in any processing procedures. If a document is not indexed yet, or needs reindexing, the user has to use the Document Index manager - function Apply on document.
- Synchronize with ...
This option establishes connections between the current document and the referent documents selected in the corresponding Sync Rules. For details on how to define Sync Rules see menu item Definitions/Sync Rules.
Having selected this item, the following dialog window appears:
It contains a list of all Sync Rules defined in the system. Here the user selects the rules which will be used for establishing connections between the current document and the documents chosen in the corresponding rules. The rules are selected by marking rows in column Assign and then pressing button Assign Rules. All referent documents in the selected Sync Rules are required to be opened in the editor. If a required referent document is not loaded, the system shows a warning message. If a rule is assigned but its referent document is not loaded yet, the rule will start working immediately after the document is loaded.
- Switch to Document
This option becomes relevant when more than one document is open in the system. A window appears with the list of all documents that are currently open.
By selecting an item from this list, the user can change the active document in the system. If exactly one document is open in the system, this operation is not applicable.
Menu Options
- Keyboard
Because of the variety of graphical characters (letters) which the Unicode tables allow, it is necessary for the user to have a means for keyboard input. Unfortunately, in most cases either the keys on the keyboard are not enough or the already defined keyboards are not suitable.
In these cases the CLaRK System offers the following solution. The user can define his/her own keyboard maps, i.e. a different character can be attached to each key on the keyboard. There are 94 keys available for mapping. Each key is identified by its ASCII character (whose code coincides with the beginning of the Unicode table); this is the default character for the specific machine architecture. The keyboard maps are represented as sets of pairs, one pair per key. Each pair has two elements: the default character and the code of the newly attached character from the Unicode table. When a newly defined keyboard is active and a key is pressed, its character is searched for in the set of char-code pairs. If such a pair is found, its second element is taken, and the character with that code is retrieved from the Unicode table and shown on the screen. If there is no such pair, the default character appears on the screen unchanged.
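The lookup described above amounts to a dictionary from default characters to Unicode codes. The following is a minimal sketch under that reading; the name `AUX_MAP`, the function `translate` and the two sample pairs are assumptions made for illustration, not part of CLaRK.

```python
# Hypothetical keyboard map: default (hardware) character -> Unicode code
# of the character attached to that key. The two pairs are sample data.
AUX_MAP = {"d": 1076, "a": 1072}

def translate(key_char, keyboard_map):
    """Return the character shown on screen when key_char is typed."""
    code = keyboard_map.get(key_char)
    if code is None:
        return key_char   # no pair for this key: the same character appears
    return chr(code)      # retrieve the character with this code from Unicode

# Keys with a pair are replaced, the others pass through unchanged.
print("".join(translate(c, AUX_MAP) for c in "dad!"))
```

The count of 94 mappable keys matches the number of printable, non-space ASCII characters (codes 33 to 126), although the text does not state that this is how the keys are chosen.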
The system has two default keyboards - English (the hardware system keyboard) and Bulgarian Phonetic (auxiliary). Both are fixed and cannot be modified.
While the system is running there are always two active keyboards. The user can switch between them with the key combination <Ctrl>+<Left Shift>. There is also an indicator on the toolbar which shows the keyboard currently in use. If the indicator is red and its sign is Aux, the auxiliary keyboard is in use. Otherwise it is green with the sign Lat. The switch can also be performed by clicking on the indicator.
Keyboard Editor
When this item is selected, the Keyboard Manager window appears. The initial view of the manager is presented below:
The manager dialog window contains three subparts: Keyboard Preview, Unicode Table Preview and Control Panel. A keyboard for editing can be selected from Current Keyboard at the right top of the dialog.
Keyboard Preview
This is the table on the left side of the window (with the white background). It shows the current state of the auxiliary keyboard. Each row in it represents a pair from the keyboard map. The first column contains the characters of the hardware default keyboard. It is not editable. The second column contains the codes of the newly attached characters. The third column is a char preview which shows the new char for the selected key corresponding to the current code. In the picture above the selection is set to the row with the character d. The character code attached to it is 1076, which means that when the user presses d, the screen shows not d but the character corresponding to this code.
The user can define a key by entering the codes of the desired characters. After entering a code, <Enter> is expected.
Unicode Table Preview
Now the question is how the user will know the code of the expected character. The answer comes from the second component - Unicode Table Preview. This is the table with the blue background in the picture. It contains the characters of the Unicode table available for the current font. This font is identical with the font of the text area in the system. If the character, expected by the user, is not in the table, then the font of the text area must be changed from Options/Fonts.
The first row and the first column contain numbers which are used for calculating the code of each character. The calculation is very simple. When we find the character in the table, we take the number from the first column of its row and add it to the number from the first row of its column. The result is the character code. The easiest way to assign a certain character to a key is first to select a row in the Keyboard Preview table and then, after finding the right character, to double-click on it in the Unicode Table Preview. The character code is calculated and copied to the selected row. If there is no selected row, nothing is done.
Example: How do we get the number 1076 for the character in the Char preview? First, we find the location of the character in the Unicode Table Preview. In the picture above it is in the last row and in the eighth column. The number in the same row of the first column is 1070. The number in the same column of the first row is 6. So the final sum is 1076.
Navigation in the Unicode table can be done by using the two buttons Page Up and Page Down, situated on the right side of the table. Another possibility is to enter a number into the Go to field under the table and go to the corresponding row (Code button) or to the corresponding page of the Unicode table (Code Page button). A code page contains 256 chars.
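The arithmetic behind the table can be written down directly. The sketch below reproduces the example with the code 1076; the function names are invented for illustration, and the zero-based numbering of code pages is an assumption, since the text only states that a page holds 256 characters.

```python
def char_code(row_number, column_number):
    """Sum of the number in the first column of the character's row
    and the number in the first row of its column."""
    return row_number + column_number

def code_page(code, page_size=256):
    """Index of the 256-character page containing the code
    (zero-based numbering is an assumption of this sketch)."""
    return code // page_size

code = char_code(1070, 6)
print(code, chr(code), code_page(code))
```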
Note that the small rectangle in some of the table cells means that for this code the current font does not provide support.
Control Panel
It is situated at the bottom of the dialog and it contains 5 components:
- Hot Key switch - determines which key combination should be used to switch between Lat/Aux keyboard layouts. The suggested hot keys are Alt, Ctrl, Shift and combinations of them.
If these combinations are not convenient, the user can define a new shortcut from menu item Definitions/Shortcuts with the action Keyboard Switch, located as the last item in: Action / Action Items / Menu Item.
- New - when the button is pressed, the user is asked to enter a name for the new keyboard.
An already defined keyboard can be used as a basis for the new one, when Use keyboard is selected. Otherwise the default Latin keyboard is used. The new keyboard can be selected for use in the system as the auxiliary keyboard. When the keyboard is saved, it is added as the last one in the list of keyboards (the list can be seen by a right mouse click on the identifier of the keyboards). The keyboard currently used as Aux is selected in the list. In front of each keyboard there is a number. This number can be used for quick selection of a keyboard: <Ctrl> + number selects the keyboard corresponding to the number.
- Remove - removes the current keyboard - a confirm dialog is shown.
- OK - saves the changes in the current keyboard.
- Cancel - closes the manager window without applying the changes to the current keyboard (if any).
- Fonts
This dialog window provides a tool for changing the fonts in several key components of the system. The tool concerns only the graphical interface. It is needed because the CLaRK System uses Unicode character encoding, which allows the use of a great range of characters from different alphabets. Unfortunately, not every font supports the whole character table. In general, fonts are defined for a specific use and support 2 or 3 different alphabets. This manager allows the fonts of the components to be changed independently. The components for which the font can be changed are:
- Text Window - this is the text area on the right side of the system main panel. This is the place where the text of the document appears.
- Tree Window - this is the component on the left side of the system main panel where the tree of the document structure appears.
- Attribute Table - a table, situated just below the Tree Window. It gives information about the attributes of the currently selected element.
- Error Messages - this is the component at the bottom of the main system panel, where the error messages appear.
- Tables - this sets the font of all tables in the system (Grammar editor, Tokenizer editor, ...).
- Fields - this sets the font of all text fields in the system.
The dialog window:
The dialog contains 5 sections as follows:
- Font Chooser - the panel on the left, showing all available fonts for the hardware system. The change of the font for a given component can be done by choosing a new font entry from here.
- Component Chooser - it is situated in the upper right corner of the dialog window. Here the user chooses the component whose font is to be changed.
- Font Style Modificator - changes the style of the font (Regular, Bold, Italics and Underlined).
- Font Size Chooser - changes the size of the currently selected font. The font size can vary in the range from 5 to 50. If the user enters a number outside this range, the value is automatically corrected to 5 or 50. If the input is not a number, the old value is restored. After entering a new font size, the user must press the Preview button in order to refresh the preview component.
- Font Previewer - makes a preview of the currently chosen font with a specified font style.
Note: if the text in the font preview does not change when a new style is chosen, it means that the font does not support this style.
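The correction rule of the Font Size Chooser can be summarized in a few lines. This is a sketch of the described behaviour only; the function name `correct_font_size` is hypothetical, and treating the size as an integer is an assumption, since the text does not say whether fractional sizes are accepted.

```python
MIN_SIZE, MAX_SIZE = 5, 50

def correct_font_size(user_input, old_size):
    """Apply the range rule to the text typed in the size field."""
    try:
        size = int(user_input)
    except ValueError:
        return old_size                         # not a number: restore old value
    return max(MIN_SIZE, min(MAX_SIZE, size))   # correct to 5 or 50 if outside

for text in ("14", "3", "72", "abc"):
    print(text, "->", correct_font_size(text, 12))
```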
- Visuals
This option can be used for changing the colors of the different components (tags, text, attributes, comments and background) in the text area(s) and the background of the tree area(s). The available colors are all the colors supported by the specific hardware and software environment in which the system is used. The color selection is done with a standard color chooser (computer architecture dependent).
Here is the dialog which appears after choosing the "Visuals" option:
The dialog window contains two sections:
- Colors Info
This section is responsible for the color selection for the different components. The colors of the buttons on the right side indicate the corresponding components' colors. By pressing the buttons, a color chooser appears. If a new color is chosen, after closing the chooser, the background of the corresponding button is changed to the new selection. Otherwise it remains the same. The components which can change their colors are:
- Tags (Tag Color)
- Text (Text Color)
- Attribute Values (Attribute Color)
- Comments (Comment Color)
- Text Panels Background (Text Background).
- Tree Panels Background (Tree Background).
Here is a preview of the color settings above:
- Control Buttons:
- OK Button - Applies the new color settings.
- Reset Button - Resets the color settings as follows:
- tag color - pure blue;
- text color - pure black;
- attribute value color - pure green;
- comment color - dark gray;
- text background color - light gray.
- tree background color - white.
- Cancel Button - Cancels the current color settings.
- Color Schemes Button - Opens a Color Schemes editor dialog, described below.
- Color Scheme Editor
This tool makes it possible to define in what color specific elements in the text area (tags, comments, text) will appear. This is a more advanced function because it defines the colors of the elements separately: the color does not depend on the element type but on the results of evaluating arbitrary XPath expressions. This allows different elements to be shown in different colors depending on the context in which they appear. When an element is visualized on the screen, a set of XPath expressions is evaluated with the element as context, and if one of the results is a non-empty list, a positive non-zero number, a non-empty string or a true boolean value, then the element is painted in the specified color.
Here is what the Color Scheme Editor looks like:
The basic unit defining the color layout is called a Color Scheme. Each Color Scheme is responsible for the visualisation of one or more documents. A Color Scheme is identified by a name and it contains a set of pairs. Each pair specifies an XPath expression and a color. If the evaluation of the XPath gives a positive result, then the corresponding context node is painted in the color which is the second component of the pair. If more than one pair defines a color for a certain node, the first one is used.
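The first-match rule of a Color Scheme can be sketched as follows. The names `is_positive` and `pick_color` are hypothetical, and the XPath engine is passed in as a callback because the sketch makes no claim about the evaluator CLaRK uses. The demonstration relies on the limited XPath subset of Python's `xml.etree.ElementTree`, which only returns node lists.

```python
import xml.etree.ElementTree as ET

def is_positive(result):
    """True for a non-empty list, a positive number, a non-empty string
    or a true boolean, following the rule described above."""
    if isinstance(result, bool):          # test bool before int/float
        return result
    if isinstance(result, (int, float)):
        return result > 0
    if isinstance(result, (str, list)):
        return len(result) > 0
    return False

def pick_color(scheme, evaluate, node):
    """Return the color of the first pair whose XPath is positive for node.

    scheme   -- ordered list of (xpath, color) pairs
    evaluate -- callback (an assumption): evaluate(xpath, node) -> result
    """
    for xpath, color in scheme:
        if is_positive(evaluate(xpath, node)):
            return color
    return None   # no pair applies; the caller keeps its usual color

book = ET.fromstring("<book><title>T</title></book>")
scheme = [("./author", "red"), ("./title", "blue"), ("./*", "green")]
print(pick_color(scheme, lambda xp, n: n.findall(xp), book))  # blue
```

Both the second and the third pair match the node, but only the first matching pair decides the color, which is why the order of the entries matters.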
The structure of the editor window is the following:
- Color Scheme Selector - this component is situated on the top of the window and it contains a list of all Color Schemes defined in the system.
- Scheme Preview - contains a list of all entries (pairs) of the selected scheme in the Color Scheme Selector. Each entry of this list is an XPath expression which is painted in a specific color. The order of the different entries determines the sequence in which the XPath expressions will be evaluated. The first XPath, which returns a positive result, is taken into consideration. The operations which can be applied over the different XPath-color pairs are determined by the three buttons on the right side on the panel:
- Add Line - adds a new list entry to the end of the list. The user is asked to enter an XPath expression and to select a color. Each XPath expression is evaluated relatively to each node in the corresponding document.
- Edit Color - gives the possibility for modification of an existing XPath-color pair.
- Remove Line - removes an entry from the list.
The last two operations work on the selected entry in the list. If no entry is selected, nothing is performed.
- Control Panel - a set of buttons used for Color Scheme management:
- New button - creates a new Color Scheme. The user is asked to specify a scheme name.
- Remove button - removes the currently selected Color Scheme. This removal is preceded by a warning message.
- OK button - closes the editor window and updates all modified Color Schemes.
- Cancel button - closes the editor window and discards any modifications of the Color Schemes.
The Color Schemes can be used from the Edit DTD Layout or Edit Current Text Layout menu options - field Color Scheme.
- Look & Feel
This option allows the style (Look & Feel) of the graphical user interface of the system to be changed. This does not change the structure of the dialogs but only the way they are painted on the screen. Here is an example of what one dialog window looks like in different styles:
Metal
CDE/Motif
Windows
The number of supported styles may vary between computers depending on the computer architecture, the operating system and the Java Virtual Machine. The example above was taken on an Intel x86 machine running Windows with JDK 1.4.2. On other machines the picture might look slightly different: more or fewer available styles, different colors, different icons, etc. The main purpose of this option is to make the user environment friendlier and more convenient to use.
- Encoding Correction
This option is relevant when the user works with files which rely on an 8-bit character encoding (like ASCII). It is used for correct mapping between ASCII and Unicode character encoding. Because of the size limitations of the ASCII format and the need to use different symbols, there are many character-sets which use one and the same code range. The problem here is how to determine which character-set should be used for a certain ASCII file. Unfortunately, very often such information is not available and the system can make a wrong decision when reading a file. For example, the user expects to read a file containing Hebrew text but the system decides that it is Cyrillic text and interprets it wrongly in Unicode. So the user must specify which character-set the system should use. That is where the Char Encoding Corrector can be used. Here is a screen-shot of the dialog window:
The choice list at the top of the window contains all the character-sets supported by the CLaRK System. For the moment the system supports 34 standard character-sets:
- Arabic (Windows-1256)
- Baltic (Windows-1257)
- Cyrillic (Windows-1251)
- Greek (Windows-1253)
- Hebrew (Windows-1255)
- Latin 2 (Windows-1250)
- Latin 1 (Windows-1252)
- Latin 5 (Windows-1254)
- Thai (Windows-874)
- Viet Nam (Windows-1258)
- Arabic (ISO 8859-6)
- Baltic (ISO 8859-4)
- Cyrillic (ISO 8859-5)
- Greek (ISO 8859-7)
- Hebrew (ISO 8859-8)
- Latin 1 (ISO 8859-1)
- Latin 2 (ISO 8859-2)
- Latin 3 (ISO 8859-3)
- Latin 9 (ISO 8859-15)
- Turkish (ISO 8859-9)
- Arabic (OEM-720)
- Baltic (OEM-775)
- Cyrillic DOS (OEM-855)
- Greek (OEM-737)
- Hebrew (OEM-862)
- Latin 2 (OEM-852)
- Multilingual Latin 1 (OEM-850)
- Multilingual Latin 1 + euro (OEM-858)
- Russian (Cyrillic 2) (OEM-866)
- Turkish (OEM-857)
- US Codepage (OEM-437)
- Cyrillic Russian (KOI8-R)
- Cyrillic Ukrainian (KOI8-U)
- Cyrillic Ancient (KOI8-C)
The table in the center represents a preview of the currently selected character-set. The table contains symbols with codes in the range from 128 to 255. The change of the selected character-set refreshes the content of the table. If the user is not sure which character-set must be used, s/he can choose the first option from the list: (System Default). This will make the system use the default character-set of the specific computer architecture and operating system.
The newly selected character-set can be applied by using button Apply or rejected with button Cancel. If a new character-set is applied, it will be taken into consideration each time an ASCII file is opened, i.e. importing/exporting documents, compiling DTDs, etc.
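The ambiguity that the Char Encoding Corrector resolves is easy to reproduce: the same byte yields different characters under different character-sets. The sketch below uses Python's built-in codecs `cp1251` and `cp1255` for Cyrillic (Windows-1251) and Hebrew (Windows-1255); the function `preview` is a hypothetical analogue of the preview table and not the system's own code.

```python
def preview(charset):
    """Characters for the byte codes 128..255 under the given character-set."""
    raw = bytes(range(128, 256))
    # Codes which the character-set leaves undefined become U+FFFD.
    return raw.decode(charset, errors="replace")

sample = b"\xe0"                     # a single byte with code 224
print(sample.decode("cp1251"))       # read as Cyrillic (Windows-1251)
print(sample.decode("cp1255"))       # read as Hebrew (Windows-1255)
print(len(preview("cp1251")))        # one character per code: 128
```

Under Windows-1251 the byte is the Cyrillic letter а (U+0430), while under Windows-1255 it is the Hebrew letter א (U+05D0), so a wrong choice of character-set silently produces a different text.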
- Add Default Attributes On Loading
For each element in an XML document, a set of default attributes can be defined in the DTD. These are attributes which are not present in the elements explicitly, but are assumed to be there with the default value set in the DTD. When this option is selected, each time a document is opened, every default attribute that is absent from an element is explicitly added to it with its default value.
- Simple Tags
An icon on the toolbar
Shows and hides the tags in the text area. If the tags are hidden in the area, they are replaced by square brackets: [ for the opening tags and ] for the closing tags. If the Show Attributes In Area option is activated and the tags are hidden, the attributes are not visible either.
- Show Attributes In Area
An icon on the toolbar
Enables/disables the appearance of the attributes in the text areas. If the attributes are shown in the area, they cannot be removed or added, but they can be modified. Attribute management is available through a right mouse click on the table below the tree panel of the editor.
- Disable Validation
Enables/disables validation of the document according to the DTD and the active All Children Constraints (if any). If validation is enabled, all the errors for the current document (if any) are shown in the Error Message Area.
By performing a double click on a certain error message, the node containing the corresponding error is selected in the Tree Panel and in the Text Area.
- Check DTDs at Start-up
When this option is selected, the system checks the database of compiled DTDs each time it is started. The system tries to load each compiled DTD and, in case of failure, removes the record for this DTD, i.e. the DTD is no longer known to the system. For all documents which refer to a DTD removed as a result of a loading failure, the system asks the user to specify another DTD. In normal use of CLaRK this will never happen. Damage to the DTDs database may appear when the system data files are modified from outside, by the user or by another application. Another reason why the system may be unable to read the DTDs database is that it is used with data files which were produced not by it but by another (incompatible) version of the system. To prevent this, each new version of the system must be installed in a separate directory.
Unselecting this option may reduce the starting time of the system. This might be useful when the system is running on a slower machine and when its DTDs database contains many and large DTDs. In all other cases this option is recommended to be selected.
- Compile System DTDs at Start-up
In the system there is a set of system DTDs mainly concerning the application of tools supporting XML Tools Queries. All these DTDs define the valid XML structures which can serve as tool queries. Another group of system DTDs defines the structure of the XML representation of different tool definitions (the rules of a grammar, a tokenizer definition, constraints definitions, etc.). All these DTDs are placed in the system resources directory. When a tool needs a certain system DTD, it is automatically compiled (if this has not been done before) in the system database. Thus some DTDs are not compiled unless they are needed in the processing.
Here, for convenience, the system can find all system DTDs which are not compiled yet in the system database and compile them one by one. This check and pre-compilation (if needed) will be performed at start-up if this option is selected.
Menu Trees
- Intersect Structures
This tool provides a means for comparing tree structures. It calculates the maximal tree structure which is present in each of the input structures. The resulting structure (the intersection) is opened in the system editor. The differences and the places where they appear are also preserved in internal structures. At a later stage this information is used to help the user find the correct target structure by manually expanding the intersection. When an intersection document is opened, the system switches to a special mode in which some of the tree restructuring functions are disabled and some new tool-specific functions appear. The structure difference data is turned into constraints, and the user expands the intersection by choosing pop-up menu items instead of typing manually. Every time the user makes a decision by choosing an item, this information is propagated through the whole result document and the intersection is extended.
In order to compare (intersect) tree structures, the user has to point out the source structures. They can be located in one document or in more than one. The topmost node of each structure to be processed within a document is pointed to by an XPath expression. If each document contains a single structure, the XPath expression selects the document's root (i.e. /self::*).
The initial source selection dialog has the following appearance:
In case the source document(s) is/are not internal one(s), the dialog is switched to External Source mode:
The XPath expression in field Subtrees (XPath) which is presented in the two source modes is evaluated on each of the selected documents. Each single result node determines one tree structure for comparing.
Once the structures have been compared, the resulting intersection document is shown in the editor. The next step is to extend this structure manually (on the basis of constraints) until it covers a single structure, or a certain subset of the initial structures (depending on the task). The nodes in the intersection tree where the structures differ (i.e. the choice points) are marked with a star '*' at the end of the name/value. Performing a right mouse click on such a node activates the constraints for this node, and a choice dialog window appears (instead of the standard tree menu).
There are five possible types of differences among the input structures for a selected node from the intersection:
- attribute - attributes which either do not appear or they appear with different values in all corresponding nodes;
- content - nodes (elements, text or comment nodes) which do not appear in all corresponding nodes' content;
- parent - nodes (elements) with different names represented as parent nodes of the corresponding nodes;
- preceding sibling - nodes (elements, text or comment nodes) which do not appear as preceding siblings (direct or indirect) for all corresponding nodes within the same parents;
- following sibling - nodes (elements, text or comment nodes) which do not appear as following siblings (direct or indirect) for all corresponding nodes within the same parents.
Here follows the layout of two example dialog windows for selecting an item:
Basically, the dialog represents a list of items, possible for the specific selection in the tree. The list content is predetermined by the mode in which the dialog is set. The mode corresponds to the type of differences which are currently under consideration (attributes, content, parents, etc.). If for a certain node there is more than one type of differences, the user can change the dialog mode with the arrowed buttons panel in the bottom left corner.
The modes are as follows:
- attribute - the dialog shows a list of all attributes (and their values) which can be set on the current node. The attribute names are shown in a list at the top of the window. Depending on the item selected in this list, the main list is populated with all the possible values for it.
- content - the dialog's list shows all the possible nodes which can be inserted as children of the current node. They can be elements (wrapped in '<' and '>'), text or comment nodes. Each list entry ends with a number in brackets. This number indicates in how many of the input structures this item exists.
- parent - the list shows all the possible nodes which can be inserted as a parent of the current node. They can only be elements. Each list entry ends with a number in brackets. This number indicates in how many of the input structures this item exists.
- preceding sibling - the list shows all the possible nodes which can be inserted as children of the parent of the current node in any position before the current one. They can be elements, text or comment nodes.
- following sibling - the list shows all the possible nodes which can be inserted as children of the parent of the current node in any position after the current one. They can be elements, text or comment nodes.
Whenever a mode is not appropriate for the selected node, the corresponding button is disabled. The rest of the components in this dialog window are:
- Match: n of m - an indicator of the current selection in the list. It shows how many input structures will still cover the intersection if the selected item is included. Here n is the number of the structures which contain the selection and m is the number of all structures which cover the current intersection.
- Approve - confirms that the currently selected item is to be inserted in the intersection, which reduces the covering structures to those that contain the item. This extends the result structure.
- Exclude - rejects the currently selected item and excludes all structures which contain it. This reduces the number of covering structures and extends the intersection.
- Cancel - closes the dialog without any changes.
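The effect of Approve and Exclude on the Match: n of m indicator can be pictured as filtering a set of structures. The following is only a minimal sketch under stated assumptions, not the system's implementation: each input structure is reduced to a hypothetical set of item labels, and the function names are invented for illustration.

```python
def match(structures, item):
    """Return (n, m): n structures contain the item, out of m covering ones."""
    n = sum(1 for s in structures if item in s)
    return n, len(structures)


def approve(structures, item):
    """Keep only the structures that contain the approved item."""
    return [s for s in structures if item in s]


def exclude(structures, item):
    """Drop every structure that contains the rejected item."""
    return [s for s in structures if item not in s]


# Three hypothetical input structures, each reduced to a set of item labels.
covering = [{"<NP>", "<PP>"}, {"<NP>"}, {"<VP>"}]
n, m = match(covering, "<NP>")
after_approve = approve(covering, "<NP>")
after_exclude = exclude(covering, "<NP>")
```

In both cases the set of covering structures shrinks, which is why either decision moves the intersection closer to a single structure.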
Run-time information about the current state of the intersection can be obtained by performing a right mouse-click on a node which is not a choice point (not marked with a star '*'). The following information dialog then appears:
The dialog shows how many input structures still cover the intersection (field 'Active trees'). It also contains a table with information about all choice points (i.e. nodes at which the intersection can be expanded by user decisions) in the current intersection.
The choice points in the table are sorted in document order (the order in which the nodes appear in the intersection document). For each point the information shown is: the position in the list (Doc Par); the node name for elements and the value for the rest (Node / Value); the number of choices for attributes (Attr Diff); the number of choices for children (Child Diff) and the number of choices for parents (Parent Diff).
The functionality here is:
- Resolve - takes the selected choice point from the table and selects the corresponding node in the tree. Then, the dialog window for selecting an item is shown. The information dialog window is closed.
- Close - closes the information dialog window without any other actions.
- Mouse double-click on a certain table row - performs a Resolve action for the selected row.
If the number of covering structures becomes 1 while the intersection is being expanded, the system shows an information message and switches the intersection document to the normal mode of processing.
Tree Comparing Algorithm
The construction of an intersection is incremental. It starts from the root of each structure and tries to find as many common nodes as possible, moving down through all input structures. When a node common to all structures is found, it is inserted in the result structure. The algorithm combines a top-down with a bottom-up matching strategy: the construction of the intersection proceeds from the root towards the leaves, and when no more common nodes can be found that attach directly to the structure built so far, the system tries to find common nodes on lower levels, discarding some nodes on the paths to the roots. When such common nodes are detected, they are attached to their nearest common ancestor in the result structure (at worst, to the root).
There are different criteria for finding common nodes. The main factors in matching are the node names for elements, the string content for text and comment nodes, and the names and values for attributes. The contexts of the nodes in the structures (parents, children, siblings) are also very important. The node matching module is aware of the Element Features defined in the corresponding DTDs. This is crucial when a structure contains element nodes with equal names that are children of the same parent. A distinction is then needed to prevent incorrect matching of structures. With the help of well-defined Element Values, the processor is able to 'see ahead' and make the correct match.
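The top-down half of this strategy can be illustrated with a short Python sketch. It is a deliberately simplified illustration and not the algorithm used by the system: it matches element names only, ignores text, comments, attributes and Element Values, takes the first child with a given name when names repeat, and omits the bottom-up step entirely. The function name `intersect` and the sample trees are assumptions.

```python
import xml.etree.ElementTree as ET


def intersect(nodes):
    """Build the common part of several element nodes that share a tag.

    Top-down only: a child is kept when every input node has a child
    with the same tag; the first such child of each node is then compared.
    """
    result = ET.Element(nodes[0].tag)
    seen = set()
    for child in nodes[0]:
        if child.tag in seen:
            continue  # this sketch does not distinguish repeated names
        seen.add(child.tag)
        matches = [node.find(child.tag) for node in nodes]
        if all(m is not None for m in matches):
            result.append(intersect(matches))
    return result


# Two hypothetical input structures with the same root element.
a = ET.fromstring("<s><np><n/></np><vp><v/><np/></vp></s>")
b = ET.fromstring("<s><np><n/><pp/></np><vp><v/></vp></s>")
common = intersect([a, b])
```

The skipped case of repeated names under one parent is exactly where the text above says that Element Values are needed to choose the right partner.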
- Suspend Intersection
This item can be used when the current document in the system editor is in the process of intersection construction. It suspends this process and makes the intersection structure an ordinary XML document. All the data related to the intersection (constraints based on the structure differences) is removed. The document structure is preserved exactly as it was before the selection of this item. All editor functions which were disabled during the intersection become available again. This function is useful when the current intersection already contains the information the user is interested in and no further expansion is needed.
- Complete Structure
This item can be used when the current document in the system editor is in the process of intersection construction. This function ends the process by reducing the covering structures to one and makes the intersection structure an ordinary XML document: the intersection is expanded to the only structure which remains, which completes the process. The structure to which the intersection is expanded is selected at random from the structures which still cover the intersection. This function is useful when the current intersection already contains the information the user is interested in and the rest of the data is added just for definiteness.
- Save Current Structure Set
This item can be used when the current document in the system editor is in the process of intersection construction. This item does not interrupt the process. It saves all structures which cover the current intersection in a separate document. The new document has a special root node (named intersectionSet) to which all active structures are attached as children. This function can be used when no further expansion of the intersection can be done (for any reason) and the current work has to be saved. If the intersection search has to be resumed later, the saved document can be used as a source in which each child node of the root represents the root node of one structure. In this way the intersection which existed when the structures were saved is recovered automatically.
The usage of this function does not stop the intersection finding process and the user can proceed with expanding the result structure after this item is used. Thus, this can be used as a backing-up mechanism.
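The layout of such a saved document can be sketched in Python. Only the root name intersectionSet comes from the description above; the helper names and the sample structures are hypothetical, and the sketch says nothing about how the system itself stores the document.

```python
import xml.etree.ElementTree as ET


def build_structure_set(structures):
    """Attach each active structure as a child of an intersectionSet root."""
    root = ET.Element("intersectionSet")
    for structure in structures:
        root.append(structure)
    return root


def read_structure_set(root):
    """Each child of the root is the root node of one saved structure."""
    return list(root)


# Two hypothetical structures that still cover the intersection.
active = [ET.fromstring("<s><np/><vp/></s>"), ET.fromstring("<s><np/></s>")]
saved = build_structure_set(active)

# Serialize and parse again to imitate saving and later reloading.
restored = read_structure_set(ET.fromstring(ET.tostring(saved)))
```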
Menu Help
- About
This item shows brief information about the system: its version, its origin, the people involved in its development and several important URLs.
- Reference
This item gives a list of the most frequently used terms in the system, each accompanied by a short description. The terms are grouped in sections in alphabetical order. When this option is chosen, the Reference list is shown in a separate window, independently of the system.
- Memory Status
This item shows the status of the amount of the memory used by the system. The parameters which are shown are:
- Free Memory Size - the amount of memory which is allocated but still not used by the system. In case this value falls under a certain limit the system tries automatically to allocate more memory (either in the RAM or in the swap device);
- Used Memory Size - the amount of memory which the system currently uses;
- Total Memory Size - the total amount of memory the system has taken from the operating system. This is the sum of the previous two quantities.
Here the user can perform the operation Run Garbage Collector in order to reduce the Used memory. This is usually useful after performing heavy operations or processing large amounts of data. Reducing the Used memory size as a rule increases the efficiency of subsequent operations (to some extent) and helps to prevent the system from running out of memory (if possible).
Other
- Parse Error Messages
These are various messages, given during importing an external document. They all mean that the document is not well-formed.
- Attribute <attr_name> already declared at line line_no, position position_no !
This is given when an attribute is declared more than once for the same element.
Example:<book author = "Luis" author = "Carol">
- Attribute declaration must be in the opening tag at line line_no, position position_no !
This is given when the parser finds attribute declarations in the closing tag of an element.
Example:<book>Alice in Wonderland</book author = "Luis Carol">
- Attribute value not closed at line line_no, position position_no !
This is given when an attribute value is not closed.
Example:<book author = "Luis Carol>
- CDATA section not closed at line line_no, position position_no !
This is given when a CDATA element lacks its closing declaration - ‘]]>’.
Example:<book><![CDATA[ Alice in Wonderland</book>
- Comment at line line_no, position position_no not closed!
This is given when a comment is opened but not closed. The missing end is '-->'. Comments are parsed, but not processed further.
- Doctype declaration not valid at line line_no, position position_no !
This error is given when the file containing the document also contains a DTD and this DTD cannot be parsed successfully.
- Document not finished!
It is given when the document element is not closed.
Example:<books><book> Alice in Wonderland </book> (end of document)
- Element not closed at line line_no, position position_no !
This is given when the opening or the closing tag is not properly closed.
Example:<book author = "Luis Carol>… (missing closing ")
<book (end of document)
- Invalid attribute at line line_no, position position_no !
This is given when an attribute is not given a value.
Example:<book author>
- Invalid character at line line_no, position position_no !
This is given when there are characters other than white space characters at the beginning of the document.
Example:Asdfg <books>…..</books>
- Invalid element at line line_no, position position_no - <> !
This is given when the parser finds an opening tag without a name, i.e. the sequence ‘<>’.
- Invalid element at line line_no, position position_no - </> !
This is given when the parser finds a closing tag without a name, i.e. the sequence ‘</>’.
- Invalid nesting of opening and closing tags at line line_no, position position_no. Expected <element> opened at line line_no, position position_no .
This is given when the parser finds an element that is closed before all of its children are closed.
Example:<books><book> Alice in Wonderland </books>
- No document or wrong char encoding!
This is given for a blank document file or when the character encoding is not recognized. The character encoding is set when the user is asked to point to the file containing the document.
- Processing Instruction at line line_no, position position_no not closed!
This is given when a Processing Instruction is opened but not closed. The missing end is '?>'. Processing instructions are parsed, but not processed further.
- Text not closed at line line_no, position position_no !
This is given when the document ends inside a text node.
Example:<book>Alice in Wonderland (end of document)
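A similar well-formedness check can be tried outside the system. The sketch below uses the parser from the Python standard library, so its messages differ from the ones listed above; it only illustrates that each such error comes with a line and a position. The helper name `check_well_formed` and the sample documents are assumptions.

```python
import xml.etree.ElementTree as ET


def check_well_formed(text):
    """Return None if text parses, else (line, column, message)."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


samples = {
    "duplicate attribute": '<book author="Luis" author="Carol"/>',
    "invalid nesting": "<books><book>Alice in Wonderland</books>",
    "well-formed": "<books><book>Alice in Wonderland</book></books>",
}
report = {name: check_well_formed(text) for name, text in samples.items()}
```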
- Validation Error Messages
These are various messages which appear after the validation procedure has been applied to a document. All of them mean that the document is not valid, and each of them gives a hint about the source of the error.
- Misplaced Element Error Messages
- Element "element" not allowed as a child at that position for element "parent"
This error message is given when an element cannot be placed at a certain position among the child nodes of another element.
Example:In the DTD:
<!ELEMENT books (book+)>
In the document:
<books>
<book>…</book>
<author>…</author>
</books>
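One common way to picture this check is to treat a content model as a regular expression over the sequence of child element names. The sketch below does this for the books example; the regular expression is a hypothetical hand translation of the content model, and nothing here is claimed about how the system's validator works.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hand translation of the content model "one or more book":
# every child name is followed by a space, so (book )+ matches the sequence.
CONTENT_MODELS = {"books": re.compile(r"(book )+")}


def children_allowed(element):
    """Check the sequence of child element names against the model."""
    names = "".join(child.tag + " " for child in element)
    return CONTENT_MODELS[element.tag].fullmatch(names) is not None


ok = ET.fromstring("<books><book/><book/></books>")
bad = ET.fromstring("<books><book/><author/></books>")
ok_result = children_allowed(ok)
bad_result = children_allowed(bad)
```

The second document fails because the name sequence contains author, which the model for books does not allow at any position.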
- Undefined Element Error Messages
- #REQUIRED attributes missing! (list_of_REQUIRED_attr) or Required attribute "attr_name" for element "element" is missing!
The message is given when an element does not contain a #REQUIRED attribute.
- Element "element" has no attribute named "attr_name" !
The message is given when an element is assigned an attribute, which was not declared for the element's type in the DTD.
- Element "element" not found! or Element "element" is not declared!
The element is not declared in the DTD.
- Other Element Error Messages
- Content not finished checking type "element" !
This message is given when the element requires more children to complete its content.
Example:In the DTD:
<!ELEMENT book (title, author+, publisher)>
In the document:
<book>
<title>Alice in Wonderland</title>
<author>Luis Carol</author>
</book>
- Element "element" must be EMPTY!
The element is declared in the DTD as an element with empty content, but in the document it is used with non-empty content.
- Root must be "root_name" !
This message is shown when the document element is different from the DOCTYPE of the DTD (or the DOCTYPE which was selected after the DTD compilation).
Example :
In the DTD:
<!DOCTYPE books [ ….
In the beginning of the document:
<library> ….
- ID & IDRef Attribute Error Messages
- Bad ID reference - "id_ref" - for attribute "attr_name" !
The attribute is of type IDREF, but contains a value that cannot be an ID.
- Bad ID - "id" - for attribute "attr_name" !
The attribute is of type ID, but contains a value that cannot be an ID.
Example :…<book id="123 456">…
- Duplicate ID for attribute "attr_name" !
There are two or more elements which have attributes of type ID with the same value.
- ID "attr_val" for attribute "attr_name" not found !
There is an attribute of type IDREF (or IDREFS), but the id (ids), which it refers to, is (are) not found in the document.
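The three kinds of ID problems described above (bad ID, duplicate ID, reference to a missing ID) can be sketched in a few lines of Python. This is an illustration under explicit assumptions: the attribute names `id` and `ref` are hard-coded here, whereas a real validator takes the attribute types from the DTD, and the name pattern is an ASCII-only approximation of the XML rule.

```python
import re
import xml.etree.ElementTree as ET

# ASCII-only approximation of an XML Name; the real rule allows more characters.
NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_.:-]*")


def check_ids(root, id_attr="id", ref_attr="ref"):
    """Report bad IDs, duplicate IDs and references to missing IDs."""
    errors, seen = [], set()
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is None:
            continue
        if not NAME.fullmatch(value):
            errors.append(("bad id", value))
        elif value in seen:
            errors.append(("duplicate id", value))
        seen.add(value)
    for elem in root.iter():
        # An IDREFS-style value is a space-separated list of references.
        for ref in (elem.get(ref_attr) or "").split():
            if ref not in seen:
                errors.append(("id not found", ref))
    return errors


doc = ET.fromstring(
    '<books><book id="123 456"/><book id="b1"/><book id="b1"/>'
    '<cite ref="b1 b9"/></books>'
)
errors = check_ids(doc)
```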
- Other Attribute Error Messages
- Attribute "attr_name" has a FIXed default value - "def_value", not "wrong_value" !
The message is given when an attribute in the document has a value different from the FIXED value declared for it in the DTD.
- Attribute "attr_name" must contain only one token - "attr_value" !
The attribute is of type NMTOKEN, but contains more than one token.
- Entity "entity" not declared (in attribute "attr_name") !
The attribute is of type ENTITY or ENTITIES, but it contains value (values) that is (are) not declared in the DTD.
- Value "attr_value" of attribute "attr_name" must be among (list_of_values)!
The attribute has a value, which is not possible for it.
Example :In the DTD:
<!ATTLIST author
title ( Mr. | Ms. | Miss. ) #IMPLIED >
In the document:
…<author title = "Dr.">…
- All Children Constraint Messages
- Main Editor Components
This is a description of the most important components of the editor within the CLaRK System. The picture above is one example configuration of the system with one opened (work) document. The red labels point to the corresponding components, which are described below:
-
- Tree Panel
This panel shows the tree structure of the current document. When a new document is opened, the tree structure is folded to its root node. The user can then expand it by clicking on the branches of the tree. The nodes in the tree are drawn in three different styles depending on the type of the corresponding document nodes: folder-like icons for Element nodes, white page icons for Text nodes and blue rhombus icons for Comment nodes. The names of the Element nodes are drawn in two colors depending on their validity: a node containing an error which makes it non-valid is colored red; otherwise it is blue. When the user points to a red node, its corresponding error message appears in the Error Messages Panel.
Basic tree restructuring operations can be applied here by right-clicking on the tree. A right-click over the selected node(s) opens a popup menu.
The user can move or copy selected nodes without the Tree Popup Menu. To move the selected node(s), press the right mouse button, drag over the tree structure and drop at the chosen position; the selected nodes are inserted as first child. To copy the selected node(s), press and hold the <Shift> key, press the right mouse button, drag over the tree structure and drop at the chosen position; the selected nodes are copied as first child.
When a node is selected in the Tree Panel, its corresponding tag (or text node) in the Text Area is selected too. Several nodes in the tree can be selected at once by pressing and holding the <Ctrl> key and clicking on the nodes to be selected.
-
- Text Area
This area shows the current document in XML format. By default, the tags are drawn in blue and the text is black. These colors can be changed by using menu Options/Visuals. In this area some of the tags and/or their content can be hidden or shown, new lines and line offsets can be inserted. For details see Document Layout.
If the selection in the area is within a single text node, a list of tags appears and the user is asked to choose one of them to be put around the selected text.
When a tag is selected in the area, its corresponding node in the tree is selected too.
The system has the ability to show more than one Text Area for a document. Each area has its own layout which can be modified independently from the others. In this way, each area forms one view for the document. The different views are synchronized in the following way: when a node is selected in one of them, the corresponding node in the others is also selected.
-
- Attribute Table
This table contains the attributes of the currently selected element node in the editor. The first column contains the attribute names. In the second one, the corresponding values are written. The second column is editable, i.e. the user can modify the values of the attributes. By using right mouse click, the user can add and remove attributes for the current element node.
- Error Messages Panel
This is a list containing all the errors for the current document (if any). A full list of all error messages is given under Parse Error Messages and Validation Error Messages.
By performing a double click on a certain error message, the node containing the corresponding error is selected in the Tree Panel and in the Text Area.
The popup menu is shown when the user performs a right-click on the Error Messages Panel at the bottom of the main window. The menu command Options opens the dialog Error Filter Options, which allows the user to filter the groups of error messages shown in the Error Messages Panel and to change the color of the messages in each group. There are six error message groups for element, attribute and constraint errors:
-
-
- Misplaced Element Errors;
- Undefined Element Errors;
- Other Element Errors;
- ID & IDRef Attribute Errors;
- Other Attribute Errors;
- All Children Constraint Messages.
-
A group can be filtered out by unchecking it. After that, no messages belonging to this group are shown in the Error Messages Panel. The color of each message group can be changed by clicking the relevant Select button and choosing the desired color from the Choose a color dialog.
-
- Status Bar
This component shows system messages. When an operation is performed, the color of the text is red. Otherwise it is black. While editing the document, this bar shows the path from the root node to the current one using an abbreviated XPath expression.
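How a path from the root to the current node can be written as an abbreviated XPath expression is sketched below in Python. The exact form shown in the Status Bar is not specified here, so the output format (a position index only where sibling names repeat) and the function name `path_to` are assumptions.

```python
import xml.etree.ElementTree as ET


def path_to(root, target):
    """Return an abbreviated XPath from the root element to target."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        parent = parents[node]
        same_name = [c for c in parent if c.tag == node.tag]
        step = node.tag
        if len(same_name) > 1:
            # XPath positions are 1-based.
            step += "[%d]" % (same_name.index(node) + 1)
        steps.append(step)
        node = parent
    steps.append(root.tag)
    return "/" + "/".join(reversed(steps))


doc = ET.fromstring("<books><book><title/></book><book><title/></book></books>")
second_title = doc[1][0]
status = path_to(doc, second_title)  # "/books/book[2]/title"
```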
-
- Keyboard Indicator
This button-indicator shows which keyboard-set is active in the system. It has two states: normal (in green) and auxiliary (in red). For more details about using keyboard-sets see Keyboard. Clicking on the indicator toggles its state (normal/auxiliary).
-
- Main Menu
A detailed description of the main menu can be found in this documentation.
-
- Toolbar
This toolbar holds shortcuts to the most frequently used functions from the Main Menu. Here is the list of the target menu items of all shortcuts:
Target Menu Item
File / New
File / Open
File / Save
File / Import
File / Export
Document / Validate
Document / New View
Document / Apply All Children
Options / Show Attributes In Area
Edit / Search
Edit / Next
Edit / Previous
-
- Scroll Buttons
These four buttons are used for scrolling the text area. The first two are used for scrolling line by line. The next two are for scrolling page by page.
These four buttons exist because the Text Area does not contain the whole opened document, but only a fragment of it. The document is fragmented because showing a large document takes a huge amount of memory. Therefore, the system tries to visualize as little of it as possible. When the user navigates through the document, the Text Area changes its content dynamically. This generation of text content cannot be controlled by a default scroller, which is why these special scroll buttons are used.
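The idea of showing only a fragment and moving it line by line or page by page can be sketched as a sliding window over the document's lines. This is a simplified illustration, not the system's code: the class name `TextWindow`, the method names and the fixed page size are all assumptions.

```python
class TextWindow:
    """Shows only a fixed-size slice of a long list of lines."""

    def __init__(self, lines, page_size):
        self.lines = lines
        self.page_size = page_size
        self.top = 0  # index of the first visible line

    def visible(self):
        """Return the fragment that would currently be displayed."""
        return self.lines[self.top:self.top + self.page_size]

    def scroll(self, delta):
        """Move the window by delta lines, clamped to the document."""
        last_top = max(0, len(self.lines) - self.page_size)
        self.top = min(max(self.top + delta, 0), last_top)

    def line_down(self):
        self.scroll(1)

    def line_up(self):
        self.scroll(-1)

    def page_down(self):
        self.scroll(self.page_size)

    def page_up(self):
        self.scroll(-self.page_size)


window = TextWindow(["line %d" % i for i in range(100)], page_size=10)
window.page_down()
window.line_down()
fragment = window.visible()
```

Only the visible fragment has to be rendered at any time, which is the memory saving the paragraph above describes.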
-
- Text Processing Buttons
The first three buttons are shortcuts to the Copy (Ctrl+C), Cut (Ctrl+X) and Paste (Ctrl+V) operations, which are applied ONLY to text data. If the selection contains not only text but also tags, the operation is not executed.
The last button is a shortcut to menu Tools / Text Replace.
-
- Navi Toolbar
It functions similarly to the history module (Back/Forward) of an Internet browser. Each document activation is recorded in the history, and if the Back button is used, the system activates the document which was previously active. The Forward button behaves analogously. This functionality is useful when the work requires more than one document loaded in the editor.
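A browser-like history of this kind can be sketched as a list with a cursor. The sketch follows the usual browser convention that activating a new document after going back discards the forward entries; whether the system behaves this way is an assumption, as are the class and method names.

```python
class History:
    """Back/Forward history of activated documents."""

    def __init__(self):
        self.entries = []
        self.index = -1  # position of the currently active document

    def activate(self, name):
        """Record a new activation and drop any forward entries."""
        del self.entries[self.index + 1:]
        self.entries.append(name)
        self.index += 1

    def back(self):
        """Return the previously active document (or None if empty)."""
        if not self.entries:
            return None
        if self.index > 0:
            self.index -= 1
        return self.entries[self.index]

    def forward(self):
        """Return the next document in the history (or None if empty)."""
        if not self.entries:
            return None
        if self.index < len(self.entries) - 1:
            self.index += 1
        return self.entries[self.index]


history = History()
for name in ("a.xml", "b.xml", "c.xml"):  # hypothetical document names
    history.activate(name)
first_back = history.back()
second_back = history.back()
```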
-
- Document Tree Popup Menu
[:bg]
The popup menu is shown when the user performs a right-click on a selected node or a set of selected nodes in the tree panel on the left side of the screen. The menu commands allow the user to change the structure of the document, to apply XML Tool queries and others.
Here follows a list of descriptions of the menu items:
Menu Commands Description
-
- Delete Subtree
Allows the user to delete the selected node(s) in the tree with the entire subtree(s) below it (them). This operation uses the "Before" and "After" columns from the Element Features to determine whether to insert a space symbol before and after the deleted element. The system will warn the user if, after the deletion, the structure of the document is non-valid.
- Delete Node
Allows the user to delete only the selected nodes. The children of each selected node are inserted as children of the parent of that node. This operation uses the "Before" and "After" columns from the Element Features to determine whether to insert a space symbol before and after the deleted element. The system will warn the user if, after the deletion, the structure of the document is non-valid.
- Set Attribute
Allows the user to insert an attribute into the selected nodes. The system gives the user the choice between valid attributes (attributes which can be inserted at the specified position according to the DTD) and defining a new attribute. If more than one node is selected, the attributes offered are the intersection of the attributes specified for each node.
- Element Child
Allows the user to insert a first child element node to the selected nodes. The system will give the user a choice between valid tags (tags which can be inserted at the specified position according to the DTD) and all tags (all tags defined in the DTD).
- Element Sibling
Allows the user to insert a following sibling to the selected nodes. The system will give the user a choice between valid tags (tags which can be inserted at the specified position according to the DTD) and all tags (all tags defined in the DTD).
- Text Child
Inserts a first child text node to the selected nodes. It can be edited in the Tree Area by double click on the text node or in the Text Area.
- Text Sibling
Inserts a text node as the following sibling of each selected node.
- Comment Child
Allows the user to insert a first child comment node to the selected nodes.
- Comment Sibling
Allows the user to insert a following sibling comment node to the selected nodes.
- Parent
After correcting the selection (if needed), a Change Parent dialog appears.
This dialog allows the user to choose a tag, which will be the new parent of the (newly) selected nodes. The selected nodes are then replaced with the chosen tag that has as children the selected nodes (and their subtrees). The system allows the user to constrain the possible parents in 4 ways.
- To choose from all tags that are possible for selection, i.e. that can have the selected nodes as children.
- To choose from tags, such that the parent node of the selected nodes will be valid after the operation is completed.
- To choose from tags that can have the selected nodes for children.
- To choose from tags that are produced from PARENT value constraints of the selected nodes.
Any combination of the above constraints is possible.
- Rename
This item renames the selected nodes (only if they are element nodes). The system will allow the user to choose from tags, such that the corresponding parent node(s) will be valid after the operation is completed or to choose from tags, such that the selected nodes will be valid, or both.
- Apply query
Allows the user to apply a query (Concordance, Constraints, Extract, Grammar, InsertAttribute, InsertChild, InsertParent, InsertSibling, MultiQuery, Remove, Rename, Sort, Statistics, TextReplace, XSLT) over the selected nodes.
- Copy
This item copies the selected nodes, including their subtrees, to a copy buffer. After applying the Copy operation the buffer contains a set of trees - one for each selected node.
- Cut
This copies the selected nodes (including their subtrees) to a copy buffer and then deletes the nodes from the tree. The system will warn the user if, after the deletion, the structure of the document becomes non-valid. After applying the Cut operation, the copy buffer contains a set of trees - one for each selected node.
- Paste As Child
This pastes the copy buffer as a first child of the selected node(s). If the copy buffer contains more than one tree, the trees are inserted as neighbour children. The system will warn the user if, after the insertion, the structure of the document becomes non-valid. If it is applied to a multiple selection of nodes, the copy buffer will be cloned, as many times as an insertion will be performed.
- Paste As Sibling
This pastes the copy buffer as a following sibling of the selected node(s). If the copy buffer contains more than one tree, the trees are inserted as neighbour siblings of the selected node(s). The system will warn the user if, after the insertion, the structure of the document becomes non-valid. If it is applied to a multiple selection of nodes, the copy buffer will be cloned, as many times as an insertion will be performed.
- Delete Subtree
More ...
- Info
This item gives information about the selected node and its content. The given information includes: the path from the root node to the current one (abbr. XPath), the tokenizer attached to the node, the content of the node (tags & tokens). This operation cannot be applied to a multiple selection of nodes.
- Errors Info
This item gives all errors for the selected nodes from the tree in the document order. There are two modes - Selected Nodes and Subtree Nodes of Selected Nodes. In the Selected Nodes mode the dialog shows all the errors of the selected nodes. In the Subtree Nodes of Selected Nodes mode the dialog shows all the errors for all the subtrees of the selected nodes.
Each error is colored depending on the error group. For more details see Error Message Panel.
- Trim Text Nodes
This item removes the leading and trailing white space characters (spaces, tabs and new lines) of the text nodes, descendants of the selected nodes and of the nodes themselves (if they are text nodes).
- Expand Subtree
This item is used when the tree (or parts of it) on the left panel is folded and the user wants to see the whole substructure of it (them) unfolded. Instead of passing through all folded nodes and manually unfolding them, the user can expand everything, which is under a certain (selected) node by using this option. This option can be applied to more than one selected node.
- Collapse Subtree
This item is used to collapse one or more selected subtrees in the tree area. The function is an opposite to the Expand Subtree function.
- RegExpr. Constraints
Allows the user to apply regular expression constraints over the content of the selected node(s) (or the node itself, if it is a text node). When the user selects this item, he/she is given a list of all regular expression constraints defined in the system. The manager, responsible for creating and modifying such constraints can be started from the menu: Constraints/Regular Expression Constraints/Edit Regular Expression Constraints. More details about Regular Expression Constraints usage can be found in the description of the main menu items of the CLaRK System (Constraints/Regular Expression Constraints).
- Invoke XSLT
This item applies an XSLT transformation to the selected node. The system uses the selected node as the starting point for the XSLT, not the root of the document, which would be the case when running the transformation from the main menu. If the XSLT returns more than one root, all the roots are inserted in place of the selected node. This operation cannot be applied to a multiple selection of nodes.
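The difference between Delete Subtree and Delete Node described above can be illustrated with a small sketch. It uses Python's standard xml.etree.ElementTree module as a stand-in for the document tree; the function names delete_subtree and delete_node are hypothetical and are not part of the CLaRK System. The sketch handles element nodes only and ignores the "Before"/"After" space handling and the validity check.

```python
import xml.etree.ElementTree as ET

def delete_subtree(parent, node):
    """Delete Subtree: remove the node together with everything below it."""
    parent.remove(node)

def delete_node(parent, node):
    """Delete Node: remove only the node; its children move up to the parent."""
    index = list(parent).index(node)
    parent.remove(node)
    # Re-insert the orphaned children where the node used to be, in order.
    for offset, child in enumerate(list(node)):
        parent.insert(index + offset, child)

source = "<s><np><w>the</w><w>cat</w></np><w>sleeps</w></s>"

doc = ET.fromstring(source)
delete_node(doc, doc.find("np"))
after_delete_node = ET.tostring(doc, encoding="unicode")

doc = ET.fromstring(source)
delete_subtree(doc, doc.find("np"))
after_delete_subtree = ET.tostring(doc, encoding="unicode")

print(after_delete_node)     # <s><w>the</w><w>cat</w><w>sleeps</w></s>
print(after_delete_subtree)  # <s><w>sleeps</w></s>
```

After Delete Node the two w children of np become children of s, in the position np occupied; after Delete Subtree they are gone.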
- XML Path Language (XPath)
Abstract
XPath is a language for addressing parts of an XML document.
Status of this document
This document is based on the W3C Recommendation of 16 November 1999 (http://www.w3.org/TR/1999/REC-xpath-19991116). It describes the XPath language according to the implementation used in the CLaRK System. The implementation covers almost the whole language. Because XPath is a general-purpose language, the implementation omits a few minor features that the system does not need. On the other hand, it adds some extensions, which are described in this document. The implementation also covers the abbreviated syntax.
1 Introduction
XPath is the result of an effort to provide a common syntax and semantics for functionality shared between XSL Transformations [XSLT] and XPointer [XPointer]. The primary purpose of XPath is to address parts of an XML [XML] document. In support of this primary purpose, it also provides basic facilities for manipulation of strings, numbers and booleans. XPath operates on the abstract, logical structure of an XML document, rather than its surface syntax. XPath gets its name from its use of a path notation as in URLs for navigating through the hierarchical structure of an XML document.
In addition to its use for addressing, XPath is also designed so that it has a natural subset that can be used for matching (testing whether or not a node matches a pattern); this use of XPath is described in XSLT.
XPath models an XML document as a tree of nodes. There are different types of nodes, including element nodes, attribute nodes and text nodes. XPath defines a way to compute a string-value for each type of node. Some types of nodes also have names.
The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an object, which has one of the following four basic types:
- node-set (an unordered collection of nodes without duplicates)
- boolean (true or false)
- number (a floating-point number)
- string (a sequence of UCS characters)
Expression evaluation occurs with respect to a context. The context consists of:
- a node (the context node)
- a pair of non-zero positive integers (the context position and the context size)
- a function library
The context position is always less than or equal to the context size.
The function library consists of a mapping from function names to functions. Each function takes zero or more arguments and returns a single result. This document defines a core function library that the XPath implementation supports. For a function in the core function library, arguments and result are of the four basic types.
The context node, context position, and context size used to evaluate a subexpression are sometimes different from those used to evaluate the containing expression. Several kinds of expressions change the context node; only predicates change the context position and context size. When the evaluation of a kind of expression is described, it will always be explicitly stated if the context node, context position, and context size change for the evaluation of subexpressions; if nothing is said about the context node, context position, and context size, they remain unchanged for the evaluation of subexpressions of that kind of expression.
The grammar specified in this section applies to the attribute value after XML 1.0 normalization. So, for example, if the grammar uses the character <, this must not appear in the XML source as < but must be quoted according to XML 1.0 rules by, for example, entering it as &lt;. Within expressions, literal strings are delimited by single or double quotation marks, which are also used to delimit XML attributes. To avoid a quotation mark in an expression being interpreted by the XML processor as terminating the attribute value, the quotation mark can be entered as a character reference (&quot; or &apos;). Alternatively, the expression can use single quotation marks if the XML attribute is delimited with double quotation marks or vice-versa.
One important kind of expression is a location path. A location path selects a set of nodes relative to the context node. The result of evaluating an expression that is a location path is the node-set containing the nodes selected by the location path. Location paths can recursively contain expressions that are used to filter sets of nodes.
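The quoting rules for expressions stored in XML attributes, described above, can be tried out with a short sketch. It uses quoteattr from Python's standard xml.sax.saxutils module; the element name rule and the attribute name select are made up for the illustration and do not come from the CLaRK System.

```python
from xml.sax.saxutils import quoteattr
import xml.etree.ElementTree as ET

expressions = [
    # Contains "<" and double quotes only.
    'child::para[position()<3][attribute::type="warning"]',
    # Contains both kinds of quotation marks.
    """child::para[attribute::type="warning" or attribute::type='note']""",
]

sources = []
for expression in expressions:
    # quoteattr escapes "<" and picks a delimiter that does not clash with the
    # quotation marks in the value, falling back to &quot; when both occur.
    xml_source = "<rule select=%s/>" % quoteattr(expression)
    sources.append(xml_source)
    # After XML parsing the attribute value is the original expression again.
    assert ET.fromstring(xml_source).get("select") == expression

for xml_source in sources:
    print(xml_source)
```

The first expression is written with &lt; and delimited by single quotation marks; the second keeps its single quotation marks and has its double quotation marks entered as &quot;.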
2 Location Paths
Although location paths are not the most general grammatical construct in the language, they are the most important construct and will therefore be described first.
Every location path can be expressed using a straightforward but rather verbose syntax. There are also a number of syntactic abbreviations that allow common cases to be expressed concisely. This section will explain the semantics of location paths using the unabbreviated syntax. The abbreviated syntax will then be explained by showing how it expands into the unabbreviated syntax.
Here are some examples of location paths using the unabbreviated syntax:
- child::para selects the para element children of the context node
- child::text() selects all text node children of the context node
- child::node() or child::* selects all the children of the context node, whatever their node type
- attribute::name selects the name attribute of the context node
- attribute::* selects all the attributes of the context node
- descendant::para selects the para element descendants of the context node
- ancestor::div selects all div ancestors of the context node
- ancestor-or-self::div selects the div ancestors of the context node and, if the context node is a div element, the context node as well
- descendant-or-self::para selects the para element descendants of the context node and, if the context node is a para element, the context node as well
- self::para selects the context node if it is a para element, and otherwise selects nothing
- child::chapter/descendant::para selects the para element descendants of the chapter element children of the context node
- child::*/child::para selects all para grandchildren of the context node
- / selects the document root (which is always the parent of the document element)
- /descendant::para selects all the para elements in the same document as the context node
- /descendant::olist/child::item selects all the item elements that have an olist parent and that are in the same document as the context node
- child::para[position()=1] selects the first para child of the context node
- child::para[position()=last()] selects the last para child of the context node
- child::para[position()=last()-1] selects the last but one para child of the context node
- child::para[position()>1] selects all the para children of the context node other than the first para child of the context node
- following-sibling::chapter[position()=1] selects the next chapter sibling of the context node
- preceding-sibling::chapter[position()=1] selects the previous chapter sibling of the context node
- /descendant::figure[position()=42] selects the forty-second figure element in the document
- /child::doc/child::chapter[position()=5]/child::section[position()=2] selects the second section of the fifth chapter of the doc document element
- child::para[attribute::type="warning"] selects all para children of the context node that have a type attribute with value warning
- child::para[attribute::type='warning'][position()=5] selects the fifth para child of the context node that has a type attribute with value warning
- child::para[position()=5][attribute::type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
- child::chapter[child::title='Introduction'] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
- child::chapter[child::title] selects the chapter children of the context node that have one or more title children
- child::*[self::chapter or self::appendix] selects the chapter and appendix children of the context node
- child::*[self::chapter or self::appendix][position()=last()] selects the last chapter or appendix child of the context node
There are two kinds of location path: relative location paths and absolute location paths.
A relative location path consists of a sequence of one or more location steps separated by /. The steps in a relative location path are composed together from left to right. Each step in turn selects a set of nodes relative to a context node. An initial sequence of steps is composed together with a following step as follows. The initial sequence of steps selects a set of nodes relative to a context node. Each node in that set is used as a context node for the following step. The sets of nodes identified by that step are united together. The set of nodes identified by the composition of the steps is this union. For example, child::div/child::para selects the para element children of the div element children of the context node, or, in other words, the para element grandchildren that have div parents.
An absolute location path consists of / followed by a relative location path. The / selects the root node of the document as a context node for the next relative location path. A / itself does not select the root node. This can be done by /self::*.
2.1 Location Steps
A location step has three parts:
- an axis, which specifies the tree relationship between the nodes selected by the location step and the context node,
- a node test, which specifies the node type or the name of the nodes selected by the location step, and
- zero or more predicates, which use arbitrary expressions to further refine the set of nodes selected by the location step.
The syntax for a location step is the axis name and node test separated by a double colon, followed by zero or more expressions each in square brackets. For example, in child::para[position()=1], child is the name of the axis, para is the node test and [position()=1] is a predicate.
The node-set selected by the location step is the node-set that results from generating an initial node-set from the axis and node-test, and then filtering that node-set by each of the predicates in turn.
The initial node-set consists of the nodes having the relationship to the context node specified by the axis, and having the node type or name specified by the node test. For example, a location step descendant::para selects the para element descendants of the context node: descendant specifies that each node in the initial node-set must be a descendant of the context; para specifies that each node in the initial node-set must be an element named para. The available axes are described in Axes. The available node tests are described in Node Tests.
The initial node-set is filtered by the first predicate to generate a new node-set; this new node-set is then filtered using the second predicate, and so on. The final node-set is the node-set selected by the location step. The axis affects how the expression in each predicate is evaluated and so the semantics of a predicate is defined with respect to an axis.
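The evaluation of a location step (axis, then node test, then predicates in turn) and the composition of steps into a relative location path can be sketched as follows. This is a toy evaluator written for illustration only, not the CLaRK implementation: the names AXES, location_step and location_path are hypothetical, only three forward axes are modelled, only element nodes are considered, and predicates are given as Python functions of (node, position, size).

```python
import xml.etree.ElementTree as ET

# Only three forward axes are modelled, and only element nodes.
AXES = {
    "child": lambda node: list(node),
    "descendant": lambda node: list(node.iter())[1:],  # iter() yields the node itself first
    "self": lambda node: [node],
}

def location_step(context, axis, name, predicates=()):
    """Axis and node test build the initial node-set; predicates filter it in turn."""
    nodes = [n for n in AXES[axis](context) if name == "*" or n.tag == name]
    for predicate in predicates:
        size = len(nodes)
        nodes = [n for position, n in enumerate(nodes, start=1)
                 if predicate(n, position, size)]
    return nodes

def location_path(context, steps):
    """Each selected node becomes a context node for the next step; results are united."""
    current = [context]
    for axis, name, predicates in steps:
        united = []
        for node in current:
            for selected in location_step(node, axis, name, predicates):
                if not any(selected is seen for seen in united):
                    united.append(selected)
        current = united
    return current

doc = ET.fromstring(
    "<doc><chapter><para>a</para><para>b</para></chapter>"
    "<chapter><para>c</para></chapter></doc>"
)
first = lambda node, position, size: position == 1
last = lambda node, position, size: position == size

# child::chapter/child::para[position()=1]
firsts = location_path(doc, [("child", "chapter", ()), ("child", "para", (first,))])
# descendant::para[position()=last()]
lasts = location_path(doc, [("descendant", "para", (last,))])

print([p.text for p in firsts])  # ['a', 'c']
print([p.text for p in lasts])   # ['c']
```

In the first path the predicate is evaluated separately for each chapter, so the first para of every chapter is selected; in the second path there is a single context node, so only the last para of the whole document remains.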
2.2 Axes
The following axes are available:
- the child axis contains the children of the context node
- the descendant axis contains the descendants of the context node; a descendant is a child or a child of a child and so on; thus the descendant axis never contains attribute nodes
- the parent axis contains the parent of the context node, if there is one
- the ancestor axis contains the ancestors of the context node; the ancestors of the context node consist of the parent of the context node and the parent's parent and so on; thus, the ancestor axis will always include the root node, unless the context node is the root node
- the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node, the following-sibling axis is empty
- the preceding-sibling axis contains all the preceding siblings of the context node; if the context node is an attribute node, the preceding-sibling axis is empty
- the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes
- the preceding axis contains all nodes in the same document as the context node that are before the context node in document order, excluding any ancestors and excluding attribute nodes
- the attribute axis contains the attributes of the context node; the axis will be empty unless the context node is an element
- the self axis contains just the context node itself
- the descendant-or-self axis contains the context node and the descendants of the context node
- the ancestor-or-self axis contains the context node and the ancestors of the context node; thus, the ancestor axis will always include the root node
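The axes listed above can be computed on a small tree with the sketch below. It is an illustration under simplifying assumptions, not the CLaRK implementation: the tree is built with Python's standard xml.etree.ElementTree, only element nodes are modelled, the doc element stands in for the top of the tree (the separate XPath root node is not represented), and all function names are hypothetical. Apart from ancestors, which are returned nearest first, the results are listed in document order.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc><a><b/><c/></a><d><e/></d></doc>")
order = list(doc.iter())                      # all elements in document order
parent_of = {child: parent for parent in order for child in parent}

def ancestors(node):
    found = []
    while node in parent_of:
        node = parent_of[node]
        found.append(node)
    return found                              # nearest ancestor first

def descendants(node):
    return list(node.iter())[1:]              # iter() yields the node itself first

def siblings(node):
    return list(parent_of[node]) if node in parent_of else [node]

def following_sibling(node):
    nodes = siblings(node)
    return nodes[nodes.index(node) + 1:]

def preceding_sibling(node):
    nodes = siblings(node)
    return nodes[:nodes.index(node)]

def following(node):
    below = set(descendants(node))            # descendants are excluded
    return [n for n in order[order.index(node) + 1:] if n not in below]

def preceding(node):
    above = set(ancestors(node))              # ancestors are excluded
    return [n for n in order[:order.index(node)] if n not in above]

def tags(nodes):
    return [n.tag for n in nodes]

c = doc.find("a/c")
print(tags(ancestors(c)))          # ['a', 'doc']
print(tags(preceding_sibling(c)))  # ['b']
print(tags(preceding(c)))          # ['b']
print(tags(following(c)))          # ['d', 'e']
print(tags(descendants(c)))        # []
```

For the node c, the ancestor, descendant, following and preceding axes plus c itself account for all six elements of the tree exactly once.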
NOTE: The ancestor, descendant, following, preceding and self axes partition a document (ignoring attributes): they do not overlap and together they contain all the nodes in the document.
2.3 Node Tests
The node tests are divided into two categories: node type tests and node name tests.
The node type tests are finite in number. They are:
- text()
From the initial node-set preserves only the text nodes.
- text(<text>)
From the initial node-set preserves only the text nodes which contain <text> as a substring of their text content. For example, child::text("play") will return all text nodes which are children of the context node and contain the token "play".
- text(<mode>, <text>)
From the initial node-set preserves only the text nodes which contain <text> at a certain position depending on <mode>. The modes are denoted by four integers as follows:
  - 1 - the tested node's value must have <text> as a prefix (i.e. it must start with it);
  - 2 - the tested node's value must contain <text> as a substring;
  - 3 - the tested node's value must have <text> as a suffix (i.e. it must end with it);
  - 4 - the tested node's value must match <text> exactly.
The mode identifier may or may not be quoted. Example: child::text(1, "aba") will preserve only text nodes whose content starts with "aba".
- text(<mode>, (y|n), '('<reg. expr.>')')
This node test is an extension of the previous node test. The main difference is that the search pattern is not a fixed string, but a regular expression. The syntax of the regular expression is the same as the one in the Grammar tool. The regular expression must be wrapped in parentheses, ( and ). The tokenizer used in the evaluation process is the one specified for the current document/element in section Definitions/Element Features. The second parameter enables or disables the use of token normalization, as defined in the corresponding tokenizer.
- node()
This node test does not filter the initial node-set. It is used when all the nodes selected from the axis are needed for further evaluation. For short, "*" can be used instead, i.e. child::* is the same as child::node().
- element()
From the initial node-set only the element nodes remain.
- attribute()
From the initial node-set only the attribute nodes remain. The initial node-set may contain nodes other than attributes.
- attribute(<attributeName>)
Keeps only the element nodes which have an attribute named <attributeName>. Example: child::attribute(id) will take only the element nodes, children of the context node, which have an attribute id.
- attribute(<attributeName> = "<attributeValue>")
This node test is almost the same as the attribute(<attributeName>) node test, but in addition it restricts the value of the attribute. Example: child::attribute(id="243") will take only those element nodes which have an attribute id set to the value 243.
- comment()
From the initial node-set only comment nodes remain.
- processing-instruction()
From the initial node-set only processing-instruction nodes remain.
- content(<mode>, '('<reg. expr.>')') and content(<mode>, <query_name>)
The behaviour of this node test is similar to that of text(<mode>, (y|n), '('<reg. expr.>')'). The advantage here is that its use is not restricted to text nodes: it applies to nodes of any type and their content. The matching modes are the same as described before. The second parameter specifies either a regular expression (in parentheses) or a Grammar Tool Query name (quoted). If the query is not found, the system raises an error message.
The name node tests are used to filter the initial node-set for element nodes with a given name. All non-element nodes and all element nodes with other names are removed from the set. For example, child::para keeps only the para element nodes which are children of the context node.
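The four matching modes of the text(<mode>, <text>) node test can be summarised in a few lines of code. The function name text_node_test is hypothetical, and the sketch compares plain character strings; the actual system may match on tokens produced by the tokenizer, so this is only an approximation of the behaviour described above.

```python
def text_node_test(value, mode, text):
    """Return True if a text node with this value passes text(mode, text)."""
    mode = int(mode)  # the mode identifier may be given quoted or unquoted
    if mode == 1:
        return value.startswith(text)   # prefix
    if mode == 2:
        return text in value            # substring
    if mode == 3:
        return value.endswith(text)     # suffix
    if mode == 4:
        return value == text            # exact match
    raise ValueError("mode must be 1, 2, 3 or 4")

# String values of some text nodes selected by an axis such as child::
values = ["abacus", "cabal", "play", "aba"]

by_mode = {mode: [v for v in values if text_node_test(v, mode, "aba")]
           for mode in (1, 2, 3, 4)}

print(by_mode[1])  # ['abacus', 'aba']
print(by_mode[2])  # ['abacus', 'cabal', 'aba']
print(by_mode[3])  # ['aba']
print(by_mode[4])  # ['aba']
```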
2.4 Predicates
An axis is either a forward axis or a reverse axis. An axis that only ever contains the context node or nodes that are after the context node in document order is a forward axis. An axis that only ever contains the context node or nodes that are before the context node in document order is a reverse axis. Thus, the ancestor, ancestor-or-self, preceding, and preceding-sibling axes are reverse axes; all other axes are forward axes. Since the self axis always contains at most one node, it makes no difference whether it is a forward or reverse axis. The proximity position of a member of a node-set with respect to an axis is defined to be the position of the node in the node-set ordered in document order if the axis is a forward axis and ordered in reverse document order if the axis is a reverse axis. The first position is 1.
A predicate filters a node-set with respect to an axis to produce a new node-set. For each node in the node-set to be filtered, the predicate is evaluated with that node as the context node, with the number of nodes in the node-set as the context size, and with the proximity position of the node in the node-set with respect to the axis as the context position; if the predicate evaluates to true for that node, the node is included in the new node-set; otherwise, it is not included.
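The role of the proximity position can be made concrete with a short sketch. The function apply_predicate is hypothetical and not taken from the CLaRK System; nodes are represented by plain strings, predicates by Python functions of (node, position, size), and a numeric predicate result is compared with the context position, as in para[3].

```python
REVERSE_AXES = {"ancestor", "ancestor-or-self", "preceding", "preceding-sibling"}

def apply_predicate(nodes_in_document_order, axis, predicate):
    """Filter a node-set with one predicate, numbering nodes along the axis."""
    ordered = list(nodes_in_document_order)
    if axis in REVERSE_AXES:
        ordered.reverse()                  # position 1 is the node nearest the context
    size = len(ordered)
    kept = []
    for position, node in enumerate(ordered, start=1):
        result = predicate(node, position, size)
        if isinstance(result, bool):
            keep = result
        elif isinstance(result, (int, float)):
            keep = (result == position)    # [n] behaves like [position()=n]
        else:
            keep = bool(result)            # rough stand-in for boolean()
        if keep:
            kept.append(node)
    return kept

# The chapters preceding some context chapter, in document order.
chapters = ["ch1", "ch2", "ch3"]
nearest = lambda node, position, size: 1            # the predicate [1]
last = lambda node, position, size: position == size  # [position()=last()]

print(apply_predicate(chapters, "preceding-sibling", nearest))  # ['ch3']
print(apply_predicate(chapters, "child", nearest))              # ['ch1']
print(apply_predicate(chapters, "child", last))                 # ['ch3']
```

The same predicate [1] picks ch3 on the reverse axis preceding-sibling and ch1 on the forward axis child, because the nodes are numbered in opposite directions.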
A predicate is evaluated as an expression and the result is converted to a boolean. If the result is a number, the result will be converted to true if the number is equal to the context position and will be converted to false otherwise; if the result is not a number, then the result will be converted as if by a call to the boolean() function. Thus a location path para[3] is equivalent to para[position()=3]. In other words, if the context node has 4 para child nodes, the predicate [3] (or [position()=3]) is true only for the third para child, so the new node-set contains only the third para child.
2.5 Abbreviated Syntax
Here are some examples of location paths using abbreviated syntax:
- para selects the para element children of the context node
- * selects all children of the context node
- text() selects all text node children of the context node
- @name selects the name attribute of the context node
- @* selects all the attributes of the context node
- para[1] selects the first para child of the context node
- para[last()] selects the last para child of the context node
- */para selects all para grandchildren of the context node
- /doc/chapter[5]/section[2] selects the second section of the fifth chapter of the doc
- chapter//para selects the para element descendants of the chapter element children of the context node
- //para selects all the para descendants of the document root and thus selects all para elements in the same document as the context node
- //olist/item selects all the item elements in the same document as the context node that have an olist parent
- . selects the context node
- .//para selects the para element descendants of the context node
- .. selects the parent of the context node
- ../@lang selects the lang attribute of the parent of the context node
- para[@type="warning"] selects all para children of the context node that have a type attribute with value warning
- para[@type="warning"][5] selects the fifth para child of the context node that has a type attribute with value warning
- para[5][@type="warning"] selects the fifth para child of the context node if that child has a type attribute with value warning
- chapter[title="Introduction"] selects the chapter children of the context node that have one or more title children with string-value equal to Introduction
- chapter[title] selects the chapter children of the context node that have one or more title children
- employee[@secretary and @assistant] selects all the employee children of the context node that have both a secretary attribute and an assistant attribute
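Several of the abbreviated paths listed above can be tried out with Python's standard xml.etree.ElementTree module, which supports a limited subset of the abbreviated syntax. The sample document and the helper name texts are made up for this illustration, and ElementTree is used only as a convenient stand-in for the CLaRK XPath engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the doc element is used as the context node.
doc = ET.fromstring(
    "<doc>"
    "<chapter><title>Introduction</title>"
    "<para type='warning'>a</para><para>b</para></chapter>"
    "<chapter><para>c</para></chapter>"
    "</doc>"
)


def texts(path):
    """Return the text of every element selected by path, in document order."""
    return [element.text for element in doc.findall(path)]


all_paras = texts(".//para")                      # para descendants of the context node
first_paras = texts("chapter/para[1]")            # first para child of each chapter
last_paras = texts("chapter/para[last()]")        # last para child of each chapter
warnings = texts("*/para[@type='warning']")       # para grandchildren with type="warning"
intro_chapters = doc.findall("chapter[title='Introduction']")
titled_chapters = doc.findall("chapter[title]")

print(all_paras, first_paras, last_paras, warnings)
print(len(intro_chapters), len(titled_chapters))
```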
The most important abbreviation is that child:: can be omitted from a location step. In effect, child is the default axis. For example, a location path div/para is short for child::div/child::para.
There is also an abbreviation for attributes: attribute:: can be abbreviated to @. For example, a location path para[@type="warning"] is short for child::para[attribute::type="warning"] and so selects para children with a type attribute with value equal to warning.
// is short for /descendant-or-self::node()/. For example, //para is short for /descendant-or-self::node()/child::para and so will select any para element in the document; div//para is short for div/descendant-or-self::node()/child::para and so will select all para descendants of div children.
NOTE: The location path //para[1] does not mean the same as the location path /descendant::para[1]. The latter selects the first descendant para element; the former selects all descendant para elements that are the first para children of their parents.
A location step of . is short for self::node(). This is particularly useful in conjunction with //. For example, the location path .//para is short for self::node()/descendant-or-self::node()/child::para and so will select all para descendant elements of the context node.
Similarly, a location step of .. is short for parent::node(). For example, ../title is short for parent::node()/child::title and so will select the title children of the parent of the context node.
3 Expressions
3.1 Function Calls
A function call is evaluated by evaluating each of the arguments, converting each argument to the type required by the function, and finally calling the function, passing it the converted arguments. It is an error if the number of arguments is wrong or if an argument cannot be converted to the required type.
An argument is converted to type string as if by calling the string() function. An argument is converted to type number as if by calling the number() function. An argument is converted to type boolean as if by calling the boolean() function. An argument that is not of type node-set cannot be converted to a node-set.
Examples:
child::para[count(child::*) > 3] returns all para child elements of the context node which have more than 3 children. The evaluation proceeds in the following sequence: all child nodes of the context node are collected; the name node test is applied, keeping only the para elements; then the count() function is evaluated with each para element as the context. To do so, the location path child::* is first evaluated with each of the para elements as the context node. Each evaluation returns a node-set (different in general), which is passed as the argument to the count() function. The function returns a number, which is then used in the further evaluation of the predicate.
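The evaluation sequence described above can be traced step by step in Python. The sample document, the id attributes used as labels, and the helper count are assumptions made for this sketch; ElementTree elements stand in for XPath nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical context node with three children.
context = ET.fromstring(
    "<section>"
    "<para id='p1'><b/><i/><b/><i/></para>"
    "<para id='p2'><b/></para>"
    "<note><a/><a/><a/><a/></note>"
    "</section>"
)


def count(node_set):
    """Stand-in for the XPath count() function."""
    return len(node_set)


children = list(context)                                  # child::node(), elements only here
paras = [node for node in children if node.tag == "para"]  # name node test: para
# Predicate: for each para as context node, evaluate child::* and pass it to count().
selected = [para for para in paras if count(list(para)) > 3]
selected_ids = [para.get("id") for para in selected]
print(selected_ids)
```

The note element has four children as well, but it is removed by the name node test before the predicate is ever evaluated.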
3.2 Node-sets
A location path can be used as an expression. The expression returns the set of nodes selected by the path.
The | operator computes the union of its operands, which must be node-sets.
Example: /descendant-or-self::para | /descendant-or-self::head will return the union of all para and head element nodes in a document. The two location paths are evaluated independently and the two results are merged.
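A minimal sketch of the union, using ElementTree elements as nodes. The function name union and the sample document are hypothetical; the result is listed in document order purely for convenience, and each node appears only once even if both operands contain it.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<doc><head>h1</head><para>p1</para><head>h2</head><para>p2</para></doc>"
)


def union(doc_root, left, right):
    """Return the nodes that occur in either operand, without duplicates."""
    members = {id(node) for node in left} | {id(node) for node in right}
    # Walking the tree once yields the members in document order.
    return [node for node in doc_root.iter() if id(node) in members]


paras = root.findall(".//para")    # first operand, evaluated on its own
heads = root.findall(".//head")    # second operand, evaluated on its own
both_text = [node.text for node in union(root, paras, heads)]
print(both_text)
```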
3.3 Booleans
An object of type boolean can have one of two values, true and false.
An or expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if either value is true and false otherwise.
An and expression is evaluated by evaluating each operand and converting its value to a boolean as if by a call to the boolean() function. The result is true if both values are true and false otherwise.
An equality (A = B) or a relational (A < B) expression is evaluated by comparing the objects that result from evaluating the two operands. Comparison of the resulting objects is defined in the following three paragraphs. First, comparisons that involve node-sets are defined in terms of comparisons that do not involve node-sets; this is defined uniformly for =, !=, <=, <, >= and >. Second, comparisons that do not involve node-sets are defined for = and !=. Third, comparisons that do not involve node-sets are defined for <=, <, >= and >.
If both objects to be compared are node-sets, then the comparison will be true if and only if there is a node in the first node-set and a node in the second node-set such that the result of performing the comparison on the string-values of the two nodes is true. If one object to be compared is a node-set and the other is a number, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the number to be compared and on the result of converting the string-value of that node to a number using the number() function is true. If one object to be compared is a node-set and the other is a string, then the comparison will be true if and only if there is a node in the node-set such that the result of performing the comparison on the string-value of the node and the other string is true. If one object to be compared is a node-set and the other is a boolean, then the comparison will be true if and only if the result of performing the comparison on the boolean and on the result of converting the node-set to a boolean using the boolean() function is true.
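The node-set rules above are existential: one matching node is enough. The following Python sketch illustrates this for = and != only. The function names are hypothetical, a node-set is modeled as a list of the string-values of its nodes, and float() is a rough stand-in for the number() function.

```python
import operator


def to_number(text):
    """Rough stand-in for number(): unparsable strings become NaN."""
    try:
        return float(text)
    except ValueError:
        return float("nan")


def compare_node_set(op, string_values, other):
    """Compare a node-set with another object using operator.eq or operator.ne."""
    if isinstance(other, bool):
        # The node-set is converted with boolean(): true if it is non-empty.
        return op(len(string_values) > 0, other)
    if isinstance(other, (int, float)):
        return any(op(to_number(value), other) for value in string_values)
    if isinstance(other, str):
        return any(op(value, other) for value in string_values)
    # Otherwise other is a node-set too: some pair of nodes must satisfy op.
    return any(op(a, b) for a in string_values for b in other)


values = ["1", "2"]   # string-values of two hypothetical nodes
print(compare_node_set(operator.eq, values, "1"))   # some node equals "1"
print(compare_node_set(operator.ne, values, "1"))   # some node differs from "1"
```

Both calls give a true result for the same node-set, so A != B is not the negation of A = B when node-sets are involved.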
When neither object to be compared is a node-set and the operator is = or !=, then the objects are compared by converting them to a common type as follows and then comparing them. If at least one object to be compared is a boolean, then each object to be compared is converted to a boolean as if by applying the boolean() function. Otherwise, if at least one object to be compared is a number, then each object to be compared is converted to a number as if by applying the number() function. Otherwise, both objects to be compared are converted to strings as if by applying the string() function. The = comparison will be true if and only if the objects are equal; the != comparison will be true if and only if the objects are not equal. Two booleans are equal if either both are true or both are false. Two strings are equal if and only if they consist of the same sequence of UCS characters.
When neither object to be compared is a node-set and the operator is <=, <, >= or >, then the objects are compared by converting both objects to numbers and comparing the numbers. The < comparison will be true if and only if the first number is less than the second number. The <= comparison will be true if and only if the first number is less than or equal to the second number. The > comparison will be true if and only if the first number is greater than the second number. The >= comparison will be true if and only if the first number is greater than or equal to the second number.
3.4 Numbers
The numeric operators convert their operands to numbers as if by calling the number() function.
The + operator performs addition.
The - operator performs subtraction.
NOTE: Since XML allows - in names, the - operator typically needs to be preceded by whitespace. For example, foo-bar evaluates to a node-set containing the child elements named foo-bar; foo - bar evaluates to the difference of the result of converting the string-value of the first foo child element to a number and the result of converting the string-value of the first bar child to a number.
The div operator performs floating-point division.
The mod operator returns the remainder from a truncating division. For example,
- 5 mod 2 returns 1
- 5 mod -2 returns 1
- -5 mod 2 returns -1
- -5 mod -2 returns -1
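The four results above follow from the fact that the remainder of a truncating division takes the sign of the first operand. A small Python sketch (the name xpath_mod is hypothetical) reproduces them with math.fmod; Python's own % operator uses floored division and would give different signs.

```python
import math


def xpath_mod(a, b):
    """Remainder of a truncating division; the result takes the sign of a."""
    try:
        return math.fmod(a, b)
    except ValueError:
        # math.fmod rejects a zero divisor or an infinite dividend;
        # the corresponding XPath result is NaN.
        return float("nan")


for a, b in [(5, 2), (5, -2), (-5, 2), (-5, -2)]:
    print(a, "mod", b, "=", xpath_mod(a, b), "(Python %:", a % b, ")")
```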
4 Variables
One specific extension of this XPath implementation is the support of variables within expressions. The variables can be used for storing temporary results and using them later. Each variable has a name by which it is identified. The support can be split into two parts: variable definitions and variable usage, which are described below.
4.1 Variable Definitions
Variable definitions are part of an XPath expression. They cannot exist alone. They are always attached in front of another expression within which the corresponding variable values can be used. Once a variable is defined, it can be used ONLY in the following expression and/or in the directly following variable definitions (if any). Thus, the structure of an XPath expression can be described as: variable_definition*, expression. Each variable definition consists of two parts: a variable name and an XPath expression which represents the variable value.
The syntax of one definition is: '{', variable_name, ':=', variable_value, '}'.
The variable name must be a non-empty string, starting with a Latin letter and followed by a sequence of Latin letters and digits (a1, aValue, val, ...).
The variable value must be a valid XPath expression. It may use values of variables defined either in preceding variable definitions or in definitions on a higher level of the expression, i.e. definitions in expressions which contain the current expression as a component. During evaluation, the context for a variable value expression is the same as the context for the expression which follows the definitions. The value of a variable can be of any standard XPath data type (node-set, number, string or boolean).
4.2 Variable Usage
Variables are used as described in the original XPath specification. A variable value is used within an expression by citing the variable name preceded by the dollar sign ($). Examples: $a1, $aValue, $val. A variable value can be used only if the variable has already been defined. Using an undefined variable causes an error.
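The structure variable_definition*, expression can be illustrated with a small Python sketch that splits the leading definitions from the expression and reports variables used before they are defined. This is not the CLaRK parser: the function names are hypothetical, whitespace around := is assumed to be allowed, braces inside a value are not handled, string literals are not skipped, and definitions on a higher level of the expression are ignored.

```python
import re

# '{', variable_name, ':=', variable_value, '}' (value assumed to contain no braces)
DEFINITION = re.compile(r"\s*\{\s*([A-Za-z][A-Za-z0-9]*)\s*:=\s*([^{}]*)\}")
USAGE = re.compile(r"\$([A-Za-z][A-Za-z0-9]*)")


def split_definitions(expression):
    """Return ([(name, value), ...], remaining_expression)."""
    definitions = []
    pos = 0
    while True:
        match = DEFINITION.match(expression, pos)
        if match is None:
            break
        definitions.append((match.group(1), match.group(2).strip()))
        pos = match.end()
    return definitions, expression[pos:].strip()


def undefined_variables(expression):
    """List variables that are used before any definition introduces them."""
    definitions, body = split_definitions(expression)
    known, missing = set(), []
    for name, value in definitions:
        missing += [used for used in USAGE.findall(value) if used not in known]
        known.add(name)
    missing += [used for used in USAGE.findall(body) if used not in known]
    return missing


example = "{n := count(child::para)}{half := $n div 2} child::para[position() <= $half]"
definitions, body = split_definitions(example)
print(definitions)
print(body)
print(undefined_variables(example))
```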
5 XPath Macros
The XPath Macros are used basically for simplification of XPath expressions. It is a means for naming XPath expressions and embedding them within other expressions just by referring to their names. The embedded named expressions (macros) during evaluation are expanded with the corresponding original XPath expressions. An XPath macro is embedded in another expression by citing its name, preceded by a percent sign ('%'). If a macro is used but not defined, the system raises an error. A macro can be inserted everywhere in an expression where the corresponding macro expression can be used. Roughly, the XPath macros substitute independent sub-expressions within other more complex expressions.
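Macro expansion can be pictured as plain textual substitution. The sketch below is an assumption-laden illustration, not the CLaRK implementation: the function name expand_macros, the dictionary of macro definitions, and the rule that a macro name consists of Latin letters, digits and underscores are all hypothetical, and macros nested inside other macros are not expanded.

```python
import re

# Assumed macro reference syntax: '%' followed by an identifier.
MACRO = re.compile(r"%([A-Za-z][A-Za-z0-9_]*)")


def expand_macros(expression, macros):
    """Replace every %name in expression with the text of the named macro."""
    def substitute(match):
        name = match.group(1)
        if name not in macros:
            # Mirrors the rule that using an undefined macro is an error.
            raise KeyError("undefined macro: " + name)
        return macros[name]

    return MACRO.sub(substitute, expression)


macros = {"paras": "child::para", "long": "string-length() > 10"}
expanded = expand_macros("%paras[%long]", macros)
print(expanded)
```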
6 Core Function Library
This section describes the functions that the XPath implementation includes in the function library that is used to evaluate expressions.
Each function in the function library is specified using a function prototype, which gives the return type, the name of the function, and the type of the arguments. If an argument type is followed by a question mark, then the argument is optional; otherwise, the argument is required.
6.1 Node Set Functions
Function: number last()
The last function returns a number equal to the context size from the expression evaluation context.
Function: number position()
The position function returns a number equal to the context position from the expression evaluation context.
Function: number count(node-set)
The count function returns the number of nodes in the argument node-set.
Function: string name(node-set?)
The name function returns a string containing the name of the node in the argument node-set that is first in document order. If the argument node-set is empty or the first node has no name, an empty string is returned. If the argument is omitted, it defaults to a node-set with the context node as its only member.
The id function selects elements by their unique IDs. When the argument to id is of type node-set, then the result is the union of the result of applying id to the string-value of each of the nodes in the argument node-set. When the argument to id is of any other type, the argument is converted to a string as if by a call to the string function; the string is split into a whitespace-separated list of tokens; the result is a node-set containing the elements in the same document as the context node that have a unique ID equal to any of the tokens in the list.
- id("foo") selects the element with unique ID foo
- id("foo")/child::para[position()=5] selects the fifth para child of the element with unique ID foo
Extended Node Set Functions
Function: node-set set:difference(node-set, node-set)
The set:difference function returns the difference between two node sets - those nodes that are in the node set passed as the first argument that are not in the node set passed as the second argument.
Function: node-set set:intersection(node-set, node-set)
The set:intersection function returns a node set comprising the nodes that are within both the node sets passed as arguments to it.
Function: node-set set:distinct(node-set)
The set:distinct function returns a node-set containing a subset of the argument nodes such that comparing any two of them with the set:eq function gives false, i.e. it returns a subset in which every node has a unique substructure.
Function: boolean set:has-same-node(node-set, node-set)
The set:has-same-node function returns true if the node set passed as the first argument shares any nodes with the node set passed as the second argument. If there are no nodes that are in both node sets, then it returns false.
Function: node-set set:leading(node-set, node-set)
The set:leading function returns the nodes in the node set passed as the first argument that precede, in document order, the first node in the node set passed as the second argument. If the first node in the second node set is not contained in the first node set, then an empty node set is returned. If the second node set is empty, then the first node set is returned.
Function: node-set set:trailing(node-set, node-set)
The set:trailing function returns the nodes in the node set passed as the first argument that follow, in document order, the first node in the node set passed as the second argument. If the first node in the second node set is not contained in the first node set, then an empty node set is returned. If the second node set is empty, then the first node set is returned.
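The behaviour of the set functions described so far can be sketched on Python lists. In this illustration a node-set is a list of node labels kept in document order and node identity is modeled by equality of the labels; both choices, like the function names, are assumptions made for the example.

```python
def difference(first, second):
    """Nodes of first that are not in second."""
    return [node for node in first if node not in second]


def intersection(first, second):
    """Nodes that are in both node sets."""
    return [node for node in first if node in second]


def has_same_node(first, second):
    """True if the two node sets share at least one node."""
    return any(node in second for node in first)


def leading(first, second):
    """Nodes of first that precede the first node of second."""
    if not second:
        return list(first)
    if second[0] not in first:
        return []
    return first[:first.index(second[0])]


def trailing(first, second):
    """Nodes of first that follow the first node of second."""
    if not second:
        return list(first)
    if second[0] not in first:
        return []
    return first[first.index(second[0]) + 1:]


nodes = ["a", "b", "c", "d"]   # hypothetical nodes in document order
print(difference(nodes, ["b", "x"]), intersection(nodes, ["b", "x"]))
print(leading(nodes, ["c"]), trailing(nodes, ["c"]))
```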
Function: string set:generate-id(node-set)
The set:generate-id function returns a unique string identifier for the first member node of the argument. The identifier is based on the position of the node in the tree. The result is an empty string if the argument set is empty or the first node is an attribute node.
Function: boolean set:eq(object, object)
The set:eq function compares two objects which are expected to be ordered node-sets. If an argument is not of type node-set, this function compares the two arguments in the way they are compared by the "=" operator. The nodes in the two sets are compared positionally (first with first, second with second, and so on). Two nodes are equal if:
- they are of the same type (element, text, attribute);
- they are text nodes and their content character sequences are the same;
- they are attribute nodes which have the same names and the same values;
- they are element nodes which:
- have the same names;
- each attribute from the first node has a corresponding equal attribute in the second node and vice versa;
- have the same number of children and each two children which have equal positions in their parent nodes are recursively equal;
In all other cases the nodes are different.
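The recursive comparison can be sketched with ElementTree elements. ElementTree has no separate text nodes, so the sketch compares the text and tail strings of elements instead; that substitution, the function names, and the decision to treat node-sets of different sizes as unequal are assumptions, not documented CLaRK behaviour.

```python
import xml.etree.ElementTree as ET


def nodes_equal(a, b):
    """Structural equality of two element nodes."""
    # Same name, and every attribute has an equal counterpart in the other node.
    if a.tag != b.tag or a.attrib != b.attrib:
        return False
    # Same leading text and the same number of children.
    if (a.text or "") != (b.text or "") or len(a) != len(b):
        return False
    # Children at equal positions must be recursively equal,
    # including the text that follows each child.
    return all(
        nodes_equal(x, y) and (x.tail or "") == (y.tail or "")
        for x, y in zip(a, b)
    )


def eq(first, second):
    """Positional comparison of two ordered node-sets (lists of elements)."""
    return len(first) == len(second) and all(
        nodes_equal(x, y) for x, y in zip(first, second)
    )


one = ET.fromstring("<a x='1' y='2'><b>t</b></a>")
two = ET.fromstring("<a y='2' x='1'><b>t</b></a>")    # attribute order differs
three = ET.fromstring("<a x='1' y='2'><b>u</b></a>")  # text content differs
print(eq([one], [two]), eq([one], [three]))
```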
Function: node-set set:order(node-set)
The set:order function receives an arbitrary set of nodes and produces a new set with the same nodes ordered in document order. If the input set contains attributes, their position in the document order is the same as that of the element nodes they belong to. As attributes within an element node do not have an ordering defined, the order in which they appear in the result set is undetermined. However, they are guaranteed to appear in the positions where their corresponding element nodes would appear.
Function: node-set set:reverse(node-set)
The set:reverse function receives an arbitrary set of nodes and produces a new set with the same nodes ordered in reverse document order. If the input set contains attributes, their position in the document order is the same as that of the element nodes they belong to. As attributes within an element node do not have an ordering defined, the order in which they appear in the result set is undetermined. However, they are guaranteed to appear in the positions where their corresponding element nodes would appear.
Function: node-set set:eval(string)
The set:eval function receives a string as an argument and interprets it as an XPath expression according to the current context node. The result from the evaluation is expected to be a node-set. In case it is not, an empty node-set is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string using the string() function.
The error function is related to the system DTD validator. It tests whether the context node is contained in the list of errors produced by the validator. This function works properly only for documents opened in the system editor.
Function: string doc-name(node-set?)
The doc-name function returns the name of a document containing a certain node. The returned name is the name of the document within the CLaRK system. If the function is used without an argument, the result will be the name of the document containing the context node. Otherwise, it will be the name of the document of the first node in the argument. If no document name is found, the function returns an empty string.
Function: node-set search(string, string?)
The search function searches the document on the basis of a content index for quick access. For this function to work properly, the target document must be indexed in advance. Otherwise the function returns an empty node-set. This function can be used in the construction of location paths as a separate location step.
The first argument of the function represents the search query. It can be a whole word/token or a partial description containing wildcard symbols. The second argument is optional and it points to a repository (if it exists) in which the search is to be performed. For more details about the function usage see: Definitions / Document Index.
6.2 String Functions
Function: string string(object?)
The string function converts an object to a string as follows:
- A node-set is converted to a string by returning the string-value of the node in the node-set that is first in document order. If the node-set is empty, an empty string is returned.
- A number is converted to a string as follows:
  - NaN is converted to the string NaN
  - positive zero is converted to the string 0
  - negative zero is converted to the string 0
  - positive infinity is converted to the string Infinity
  - negative infinity is converted to the string -Infinity
  - if the number is an integer, the number is represented in decimal form as a number with no decimal point and no leading zeros, preceded by a minus sign (-) if the number is negative
  - otherwise, the number is represented in decimal form as a number including a decimal point with at least one digit before the decimal point and at least one digit after the decimal point, preceded by a minus sign (-) if the number is negative
- The boolean false value is converted to the string false. The boolean true value is converted to the string true.
- An object of a type other than the four basic types will cause an exception (error).
If the argument is omitted, it defaults to a node-set with the context node as its only member.
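The number-to-string rules above can be sketched in Python as follows. The function name is hypothetical, and going through decimal.Decimal to avoid exponent notation is a choice made for this illustration rather than the documented behaviour of the system.

```python
import math
from decimal import Decimal


def xpath_number_to_string(value):
    """Convert a number to a string following the rules listed above."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    if value == int(value):
        # Integers, including positive and negative zero: no decimal point.
        return str(int(value))
    # Non-integers: plain decimal notation, never an exponent.
    return format(Decimal(repr(value)), "f")


for sample in [float("nan"), -0.0, 2.0, -1.5, 1e-07, float("-inf")]:
    print(repr(sample), "->", xpath_number_to_string(sample))
```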
Function: string concat(string, string, string*)
The concat function returns the concatenation of its arguments.
Function: boolean starts-with(string, string)
The starts-with function returns true if the first argument string starts with the second argument string, and otherwise returns false.
Function: boolean contains(string, string)
The contains function returns true if the first argument string contains the second argument string, and otherwise returns false.
Function: string substring-before(string, string)
The substring-before function returns the substring of the first argument string that precedes the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-before("1999/04/01","/") returns 1999.
Function: string substring-after(string, string)
The substring-after function returns the substring of the first argument string that follows the first occurrence of the second argument string in the first argument string, or the empty string if the first argument string does not contain the second argument string. For example, substring-after("1999/04/01","/") returns 04/01, and substring-after("1999/04/01","19") returns 99/04/01.
Function: string substring(string, number, number?)
The substring function returns the substring of the first argument starting at the position specified in the second argument with length specified in the third argument. For example, substring("12345",2,3) returns "234". If the third argument is not specified, it returns the substring starting at the position specified in the second argument and continuing to the end of the string. For example, substring("12345",2) returns "2345".
More precisely, each character in the string is considered to have a numeric position: the position of the first character is 1, the position of the second character is 2 and so on.
The returned substring contains those characters for which the position of the character is greater than or equal to the rounded value of the second argument and, if the third argument is specified, less than the sum of the rounded value of the second argument and the rounded value of the third argument; rounding is done as if by a call to the round function. The following examples illustrate various unusual cases:
- substring("12345", 1.5, 2.6) returns "234"
- substring("12345", 0, 3) returns "12"
- substring("12345", 0 div 0, 3) returns ""
- substring("12345", 1, 0 div 0) returns ""
- substring("12345", -42, 1 div 0) returns "12345"
- substring("12345", -1 div 0, 1 div 0) returns ""
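The unusual cases above all follow from the position rule together with IEEE arithmetic on NaN and infinity. A Python sketch of that rule (function names are hypothetical; rounding of halves toward positive infinity follows the XPath round function):

```python
import math


def xpath_round(x):
    """Round to the nearest integer, halves toward positive infinity."""
    if math.isnan(x) or math.isinf(x):
        return x
    return math.floor(x + 0.5)


def xpath_substring(text, start, length=None):
    """Keep the characters whose 1-based position p satisfies
    round(start) <= p < round(start) + round(length)."""
    first = xpath_round(start)
    end = math.inf if length is None else first + xpath_round(length)
    # Any comparison involving NaN is false, so NaN bounds select nothing.
    return "".join(
        char for position, char in enumerate(text, start=1)
        if position >= first and position < end
    )


nan, inf = float("nan"), float("inf")
print(xpath_substring("12345", 1.5, 2.6))
print(xpath_substring("12345", 0, 3))
print(repr(xpath_substring("12345", -inf, inf)))
```

In the last call the upper bound is -Infinity + Infinity, which is NaN, so no position passes the test and the result is empty.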
Function: number string-length(string?)
The string-length function returns the number of characters in the string. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
Function: string normalize-space(string?)
The normalize-space function returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space. Whitespace characters are: SPACEs, TABs, new lines. If the argument is omitted, it defaults to the context node converted to a string, in other words the string-value of the context node.
Function: string translate(string, string, string)
The translate function returns the first argument string with occurrences of characters in the second argument string replaced by the character at the corresponding position in the third argument string. For example, translate("bar","abc","ABC") returns the string BAr. If there is a character in the second argument string with no character at a corresponding position in the third argument string (because the second argument string is longer than the third argument string), then occurrences of that character in the first argument string are removed. For example, translate("--aaa--","abc-","ABC") returns "AAA". If a character occurs more than once in the second argument string, then the first occurrence determines the replacement character. If the third argument string is longer than the second argument string, then excess characters are ignored.
Extended String Functions
Function: string str:align(string, string, string?)
The str:align function aligns a string within another string. The first argument gives the target string to be aligned. The second argument gives the padding string within which it is to be aligned. If the target string is shorter than the padding string then a range of characters in the padding string are replaced with those in the target string. Which characters are replaced depends on the value of the third argument, which gives the type of alignment. It can be one of 'left', 'right' or 'center'. If no third argument is given or if it is not one of these values, then it defaults to left alignment. With left alignment, the range of characters replaced by the target string begins with the first character in the padding string. With right alignment, the range of characters replaced by the target string ends with the last character in the padding string. With center alignment, the range of characters replaced by the target string is in the middle of the padding string, such that either the number of unreplaced characters on either side of the range is the same or there is one less on the left than there is on the right. If the target string is longer than the padding string, then it is truncated to be the same length as the padding string and returned.
Function: string str:concat(node-set, string?)
The str:concat function takes a node set and returns the concatenation of the string values of the nodes in that node set. If the node set is empty, it returns an empty string. If a second argument (string) is supplied, it serves as a separator between each node's string value from the first argument.
Function: string str:padding(number, string?)
The str:padding function creates a padding string of a certain length. The first argument gives the length of the padding string to be created. The second argument gives a string to be used to create the padding. This string is repeated as many times as is necessary to create a string of the length specified by the first argument; if the string is more than a character long, it may have to be truncated to produce the required length. If no second argument is specified, it defaults to a space (' '). If the second argument is an empty string, str:padding returns an empty string.
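The str:padding and str:align functions described above can be sketched together in Python. The function names and the assumption that the requested length is a non-negative integer are made for this illustration only.

```python
def padding(length, chars=" "):
    """Repeat chars until the result has exactly length characters."""
    if not chars or length <= 0:
        return ""
    repeats = length // len(chars) + 1
    return (chars * repeats)[:length]   # truncate the last repetition if needed


def align(target, pad, alignment="left"):
    """Place target inside pad according to the alignment type."""
    if len(target) >= len(pad):
        return target[:len(pad)]        # too long: truncate to the padding length
    free = len(pad) - len(target)
    if alignment == "right":
        start = free
    elif alignment == "center":
        start = free // 2               # an odd remainder leaves one less on the left
    else:
        start = 0                       # 'left' and any unknown value
    return pad[:start] + target + pad[start + len(target):]


print(repr(padding(5, "ab")))
print(repr(align("abc", padding(7, "-"), "center")))
print(repr(align("ab", "-----", "center")))
```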
Function: number str:index-of(string, string, number?)
The str:index-of function returns the starting position of the first occurrence of the second string argument within the content of the first argument. If a third argument is given, then the search in the first argument starts from the specified position. If the first argument does not contain the second one (after a certain position, if specified), then the result is -1.
Function: number str:last-index-of(string, string, number?)
The str:last-index-of function returns the starting position of the last occurrence of the second string argument within the content of the first argument. If a third argument is given, then the function returns the last occurrence before the specified position. If the first argument does not contain the second one, the result is -1.
Function: boolean str:region-matches(string, number, string, number, number, boolean)
Tests if two string regions are equal. A substring of the first string argument, starting at the position given by the second argument, is compared to a substring of the third argument, starting at the position given by the fourth argument; the length of the compared substrings is given by the fifth argument. The result is true if these substrings represent character sequences that are the same, ignoring case if and only if the sixth argument is true. The result is false if one of the following is true:
- the second argument is negative;
- the fourth argument is negative;
- the sum of the second argument and the fifth argument is greater than the length of the first argument;
- the sum of the fourth argument and the fifth argument is greater than the length of the third argument;
Example: str:region-matches("abcdef", 2, "ECDF", 1, 2, true) --> true
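The comparison can be sketched in Python as follows. The function name and parameter names are hypothetical. Positions are assumed to be counted from zero, which is what the example above implies ("cd" in the first string against "CD" in the second); this is an inference, not a documented fact.

```python
def region_matches(first, first_start, second, second_start, length, ignore_case):
    """Compare a region of first with a region of second (zero-based positions)."""
    if first_start < 0 or second_start < 0:
        return False
    if first_start + length > len(first) or second_start + length > len(second):
        return False
    region_a = first[first_start:first_start + length]
    region_b = second[second_start:second_start + length]
    if ignore_case:
        region_a, region_b = region_a.lower(), region_b.lower()
    return region_a == region_b


print(region_matches("abcdef", 2, "ECDF", 1, 2, True))
print(region_matches("abcdef", 2, "ECDF", 1, 2, False))
```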
Function: string str:to-lower-case(string)
Converts all of the characters in the string argument to lower case using the rules of the current environment.
Function: string str:to-upper-case(string)
Converts all of the characters in the string argument to upper case using the rules of the current environment.
Function: boolean ends-with(string, string)
Tests if the first argument ends with the specified suffix as second argument. The result is true if the character sequence represented by the second argument is a suffix of the character sequence represented by the first argument. Otherwise it is false.
Function: number str:count-tokens(string, string)
Returns the number of the tokens in the first string argument. The token boundaries are stated in the second argument. Example: str:count-tokens("one two three", " ") --> 3
Function: string str:token-at(string, string, number)
Returns a token from the first string argument at position specified in the third argument. The second argument represents the separator between the tokens.
Function: string str:normalize(string, string)
Returns the normalized representation of the first string argument. The normalization is performed on the basis of a tokenizer, supplied as the second argument. If the tokenizer is not primitive, the system refers to its primitive parent tokenizer. If the supplied tokenizer is not present in the system, the original text value is returned. The tokenizer name should be quoted.
Function: number clark:count-tokens(string, string, string?)
Returns the number of the tokens in the first string argument. This function is CLaRK System specific: the second argument is a tokenizer name and the third one is a filter name (optional).
Function: string clark:token-at(string, number, string, string?)
Returns a token from the first string argument at the position specified in the second argument. This function is CLaRK System specific: the third argument is a tokenizer name and the fourth one is a filter name (optional).
Function: string str:eval(string)
The str:eval function receives a string as an argument and interprets it as an XPath expression according to the current context node. The result from the evaluation is expected to be a string. In case it is not, an empty string is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string using the string() function.
6.3 Boolean Functions
Function: boolean boolean(object)
The boolean function converts its argument to a boolean as follows:
- a number is true if and only if it is neither positive or negative zero nor NaN
- a node-set is true if and only if it is non-empty
- a string is true if and only if its length is non-zero
- An object of a type other than the four basic types will cause an exception (error).
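The conversion rules listed above can be sketched as follows. This is an illustration only, not the system's implementation: a node-set is modelled as a plain Python list (an assumption), and the function name xpath_boolean is hypothetical.

```python
import math


def xpath_boolean(value) -> bool:
    """Sketch of boolean(object); a node-set is modelled as a Python list (assumption)."""
    if isinstance(value, bool):          # checked first: bool is a subclass of int in Python
        return value
    if isinstance(value, (int, float)):  # true unless positive/negative zero or NaN
        return not (value == 0 or math.isnan(value))
    if isinstance(value, str):           # true if the length is non-zero
        return len(value) > 0
    if isinstance(value, list):          # true if the node-set is non-empty
        return len(value) > 0
    raise TypeError("not one of the four basic types")


print(xpath_boolean(-0.0))     # False
print(xpath_boolean("false"))  # True: any non-empty string is true
print(xpath_boolean([]))       # False
```

Note that the string "false" converts to true, because only the length of the string is tested.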
Function: boolean not(boolean)
The not function returns true if its argument is false, and false otherwise.
Function: boolean true()
The true function returns true.
Function: boolean false()
The false function returns false.
Extended Boolean Function
Function: boolean test(string)
The test function receives a string as an argument and interprets it as an XPath expression according to the current context node. The result from the evaluation is expected to be a boolean value. If it is not, false is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.
The error function tests whether the context node is invalid according to the system validator of the editor. This function is CLaRK specific; for it to work properly, the document to which it is applied has to be opened (and activated) in the editor. Otherwise, every node is assumed to be valid.
6.4 Number Functions
Function: number number(object?)
The number function converts its argument to a number as follows:
- a string that consists of optional whitespace followed by an optional minus sign followed by a number followed by whitespace is converted to the number that is nearest to the mathematical value represented by the string; any other string is converted to NaN
- boolean true is converted to 1; boolean false is converted to 0
- a node-set is first converted to a string as if by a call to the string function and then converted in the same way as a string argument
- an object of a type other than the four basic types will cause an exception (error).
If the argument is omitted, it defaults to a node-set with the context node as its only member.
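A minimal sketch of these conversion rules is given below. It is an illustration under stated assumptions, not the system's code: a node-set is modelled as a list of the string-values of its nodes in document order, an empty node-set is converted as the empty string, and the name xpath_number is hypothetical.

```python
import math
import re

# optional whitespace, optional minus sign, a number, optional whitespace
_NUMBER = r"[ \t\r\n]*-?([0-9]+(\.[0-9]*)?|\.[0-9]+)[ \t\r\n]*"


def xpath_number(value) -> float:
    """Sketch of number(object); node-sets are lists of string-values (assumption)."""
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        return float(value) if re.fullmatch(_NUMBER, value) else math.nan
    if isinstance(value, list):
        # converted to a string first (first node's string-value), then as a string
        return xpath_number(value[0] if value else "")
    raise TypeError("not one of the four basic types")


print(xpath_number("  -12.5 "))      # -12.5
print(xpath_number("1e3"))           # nan: exponent notation is not a number here
print(xpath_number(["42", "7"]))     # 42.0
```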
Function: number sum(node-set)
The sum function returns the sum, for each node in the argument node-set, of the result of converting the string-values of the node to a number.
Function: number floor(number)
The floor function returns the largest (closest to positive infinity) number that is not greater than the argument and that is an integer.
Function: number ceiling(number)
The ceiling function returns the smallest (closest to negative infinity) number that is not less than the argument and that is an integer.
Function: number round(number)
The round function returns the number that is closest to the argument and that is an integer. If there are two such numbers, then the one that is closest to positive infinity is returned. If the argument is NaN, then NaN is returned. If the argument is positive infinity, then positive infinity is returned. If the argument is negative infinity, then negative infinity is returned. If the argument is positive zero, then positive zero is returned. If the argument is negative zero, then negative zero is returned. If the argument is less than zero, but greater than or equal to -0.5, then negative zero is returned.
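The special cases of round can be followed in a short sketch. It is only an illustration of the rules stated above, written with Python floats; the name xpath_round is hypothetical.

```python
import math


def xpath_round(x: float) -> float:
    """Sketch of round(number) following the rules described above."""
    if math.isnan(x) or math.isinf(x) or x == 0:
        return x                      # NaN, infinities and both zeros are returned unchanged
    if -0.5 <= x < 0:
        return -0.0                   # small negative values round to negative zero
    lower = math.floor(x)
    # on a tie (fraction exactly 0.5) the value closer to positive infinity wins
    return float(lower + 1) if x - lower >= 0.5 else float(lower)


print(xpath_round(2.5))    # 3.0
print(xpath_round(-2.5))   # -2.0
print(xpath_round(-0.2))   # -0.0
```

Ties are resolved towards positive infinity, so -2.5 becomes -2 rather than -3.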
Extended Number Functions
Function: number math:min(node-set)
The math:min function returns the minimum value of the nodes passed as the argument. The minimum value is defined as follows. The node set passed as an argument is sorted in ascending order as it would be by xsl:sort with a data type of number. The minimum is the result of converting the string value of the first node in this sorted list to a number using the number function. If the node set is empty, or if the result of converting the string values of any of the nodes to a number is NaN, then NaN is returned.
Function: number math:max(node-set)
The math:max function returns the maximum value of the nodes passed as the argument. The maximum value is defined as follows. The node set passed as an argument is sorted in descending order as it would be by xsl:sort with a data type of number. The maximum is the result of converting the string value of the first node in this sorted list to a number using the number function. If the node set is empty, or if the result of converting the string values of any of the nodes to a number is NaN, then NaN is returned.
Function: node-set math:highest(node-set)
The math:highest function returns the nodes in the node set whose value is the maximum value for the node set. The maximum value for the node set is the same as the value calculated by math:max. A node has this maximum value if the result of converting its string value to a number as if by the number function is equal to the maximum value, where the equality comparison is defined as a numerical comparison using the = operator. If any of the nodes in the node set has a non-numeric value, the math:max function will return NaN. The definition of numeric comparisons entails that NaN != NaN. Therefore, if any of the nodes in the node set has a non-numeric value, math:highest will return an empty node set.
Function: node-set math:lowest(node-set)
The math:lowest function returns the nodes in the node set whose value is the minimum value for the node set. The minimum value for the node set is the same as the value calculated by math:min. A node has this minimum value if the result of converting its string value to a number as if by the number function is equal to the minimum value, where the equality comparison is defined as a numerical comparison using the = operator. If any of the nodes in the node set has a non-numeric value, the math:min function will return NaN. The definition of numeric comparisons entails that NaN != NaN. Therefore, if any of the nodes in the node set has a non-numeric value, math:lowest will return an empty node set.
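The relation between math:min, math:max, math:highest and math:lowest, including the NaN case, can be sketched as follows. This is an illustration under assumptions: a node set is modelled as a list of string-values, Python's float() stands in for the number function (it accepts a few forms that number does not), and all function names are hypothetical.

```python
import math


def to_number(text: str) -> float:
    """Stand-in for number(): non-numeric strings become NaN (assumption: float() parsing)."""
    try:
        return float(text)
    except ValueError:
        return math.nan


def _extreme(nodes: list[str], pick) -> float:
    values = [to_number(n) for n in nodes]
    if not values or any(math.isnan(v) for v in values):
        return math.nan               # empty set or any non-numeric node gives NaN
    return pick(values)


def math_min(nodes: list[str]) -> float:
    return _extreme(nodes, min)


def math_max(nodes: list[str]) -> float:
    return _extreme(nodes, max)


def math_highest(nodes: list[str]) -> list[str]:
    top = math_max(nodes)
    return [n for n in nodes if to_number(n) == top]   # NaN == x is never true


def math_lowest(nodes: list[str]) -> list[str]:
    bottom = math_min(nodes)
    return [n for n in nodes if to_number(n) == bottom]


print(math_highest(["3", "7", "7.0", "1"]))  # ['7', '7.0']
print(math_highest(["3", "x"]))              # []
```

Because no value compares equal to NaN, a single non-numeric node empties the result of math_highest and math_lowest, as described above.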
Function: number math:abs(number)
The math:abs function returns a number containing the absolute value of the number passed as an argument.
Function: number math:sqrt(number)
The math:sqrt function returns the square root of a number. If the argument is a negative number, the return value is zero.
Function: number math:power(number, number)
The math:power function returns the value of a base expression taken to a specified power.
Function: number math:log(number)
The math:log function returns the natural logarithm (base e) of a number.
Function: number math:random()
The math:random function returns a random number from 0 to 1.
Function: number math:sin(number)
The math:sin function returns the sine of the argument, which is given in radians.
Function: number math:cos(number)
The math:cos function returns the cosine of the argument, which is given in radians.
Function: number math:tan(number)
The math:tan function returns the tangent of the argument, which is given in radians.
Function: number math:asin(number)
The math:asin function returns the arcsine value of a number in radians.
Function: number math:acos(number)
The math:acos function returns the arccosine value of a number in radians.
Function: number math:atan(number)
The math:atan function returns the arctangent value of a number in radians.
Function: number math:atan2(number, number)
The math:atan2 function returns the angle (in radians) from the X axis to a point with coordinates x and y. The first argument is a number corresponding to the y coordinate of the point. The second argument is a number corresponding to the x coordinate of the point.
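Since the y coordinate comes first, the argument order is easy to get wrong. Python's standard math.atan2 uses the same order (y first, then x) and can serve as an illustration; the wrapper name angle_from_x_axis is hypothetical and the sketch says nothing about how the system implements math:atan2.

```python
import math


def angle_from_x_axis(y: float, x: float) -> float:
    """Angle in radians from the X axis to the point with coordinates x and y."""
    return math.atan2(y, x)


print(angle_from_x_axis(1.0, 1.0))    # pi/4: the point x=1, y=1
print(angle_from_x_axis(1.0, 0.0))    # pi/2: the point lies on the positive Y axis
print(angle_from_x_axis(0.0, -1.0))   # pi: the point lies on the negative X axis
```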
Function: number math:exp(number)
The math:exp function returns e (the base of natural logarithms) raised to a power.
Function: number math:eval(string)
The math:eval function receives a string as an argument and interprets it as an XPath expression according to the current context node. The result from the evaluation is expected to be a number value. If it is not, a Not-A-Number value is returned. The argument string may itself be the result of evaluating another XPath expression, converted to a string with the string() function.
7 Data Model
XPath operates on an XML document as a tree. This section describes how XPath models an XML document as a tree. This model is conceptual only and does not mandate any particular implementation.
The tree contains nodes. There are several supported types of node:
- element nodes
- text nodes
- attribute nodes
- processing instruction nodes
- comment nodes
For every type of node, there is a way of determining a string-value for a node of that type. For some types of node, the string-value is part of the node; for other types of node, the string-value is computed from the string-value of descendant nodes.
For element nodes, the string-value is a concatenation of the string-values of their child nodes in document order. If an element node has no child nodes then an empty string is returned.
For attribute, comment, text and processing-instruction nodes the string-value is the text content of each of the nodes.
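The string-value of an element node can be illustrated outside the system with Python's standard xml.etree.ElementTree module. This is only a sketch: ElementTree has its own tree model, and the helper name string_value is hypothetical.

```python
import xml.etree.ElementTree as ET


def string_value(element: ET.Element) -> str:
    """Concatenate all text inside the element in document order."""
    return "".join(element.itertext())


doc = ET.fromstring("<s><w>Hello</w>, <w>world</w><empty/></s>")
print(string_value(doc))                       # Hello, world
print(repr(string_value(doc.find("empty"))))   # '' for an element without children
```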
There is an ordering, document order, defined on all the nodes in the document corresponding to the order in which the first character of the XML representation of each node occurs in the XML representation of the document after expansion of general entities. Thus, the root node will be the first node. Element nodes occur before their children. Thus, document order orders element nodes in order of the occurrence of their start-tag in the XML (after expansion of entities). Reverse document order is the reverse of document order.
Element nodes have an ordered list of child nodes. Nodes never share children: if one node is not the same node as another node, then none of the children of the one node will be the same node as any of the children of another node. Every node other than the root node has exactly one parent, which is an element node. The descendants of a node are the children of the node and the descendants of the children of the node.
7.1 Element Nodes
There is an element node for every element in the document. An element node has a name.
The children of an element node are the element nodes, comment nodes, processing instruction nodes and text nodes for its content.
7.2 Attribute Nodes
Each element node has an associated set of attribute nodes; the element is the parent of each of these attribute nodes; however, an attribute node is not a child of its parent element.
Elements never share attribute nodes: if one element node is not the same node as another element node, then none of the attribute nodes of the one element node will be the same node as the attribute nodes of another element node.
7.3 Processing Instruction Nodes
There is a processing instruction node for every processing instruction, except for any processing instruction that occurs within the document type declaration.
7.4 Comment Nodes
There is a comment node for every comment, except for any comment that occurs within the document type declaration.
7.5 Text Nodes
Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. A text node always has at least one character of data.
Each character within a CDATA section is treated as character data. Thus, <![CDATA[<]]> in the source document will be treated the same as &lt;. Both will result in a single < character in a text node in the tree. Thus, a CDATA section is treated as if the <![CDATA[ and ]]> were removed and every occurrence of < and & were replaced by &lt; and &amp; respectively.
A text node does not have a name.
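The treatment of CDATA sections as ordinary character data can be observed with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module purely as an illustration; it is not part of the CLaRK System.

```python
import xml.etree.ElementTree as ET

# a CDATA section and the corresponding entity reference yield the same text
with_cdata = ET.fromstring("<t><![CDATA[<]]></t>")
with_entity = ET.fromstring("<t>&lt;</t>")
print(with_cdata.text == with_entity.text)   # True

# character data around a CDATA section is grouped into one piece of text
merged = ET.fromstring("<t>a <![CDATA[< & ]]>b</t>")
print(merged.text)                           # a < & b
```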
A References
A.1 Normative References
- XML
- World Wide Web Consortium. Extensible Markup Language (XML) 1.0. W3C Recommendation. See http://www.w3.org/TR/1998/REC-xml-19980210
- XML Names
- World Wide Web Consortium. Namespaces in XML. W3C Recommendation. See http://www.w3.org/TR/REC-xml-names
A.2 Other References
- Character Model
- World Wide Web Consortium. Character Model for the World Wide Web. W3C Working Draft. See http://www.w3.org/TR/WD-charmod
- DOM
- World Wide Web Consortium. Document Object Model (DOM) Level 1 Specification. W3C Recommendation. See http://www.w3.org/TR/REC-DOM-Level-1
- TEI
- C.M. Sperberg-McQueen, L. Burnard. Guidelines for Electronic Text Encoding and Interchange. See http://etext.virginia.edu/TEI.html.
- Unicode
- Unicode Consortium. The Unicode Standard. See http://www.unicode.org/unicode/standard/standard.html.
- Tool Application Modes - Processing Current Document vs. Multiple Apply
The processing of XML data in the CLaRK System can be done in two ways. Each of them has its advantages and disadvantages.
Because the system is designed to work with corpora, most tasks involve large amounts of data. This can be critical for the processing time and the system resources which a task requires. Therefore, CLaRK supports two techniques for processing XML documents:
- processing Current Document. 'Current Document' here refers to the XML document which is opened in the main editor and which is currently active (when more than one document is open). While working, the system interacts with the user through graphical dialogs for processing options, confirmations, warnings and error messages. In case of an error, invalid settings or a similar problem, the user can cancel the current operation.
- processing Multiple Apply. The processing is applied to one or more documents which are already saved in the system. During processing, the documents are not opened in the editor; only the final results are reported. There is no interaction with the user during processing. The results from an operation (modified or new documents) are saved in the Internal Documents Database after a successful procedure.
The advantages of the first type of processing are that the user can see the data and can adjust the tool settings according to the specific task. The user can check which specific data the given tool will be applied to without an actual application. One disadvantage here is that the visualization of large documents requires system resources which can make the processing extremely slow. Another disadvantage is that the specific tool can be applied only to one (current) document.
To solve these problems, the CLaRK system offers the second approach (Multiple Apply). Here the user can select one or more documents to which the tool will be applied. The documents are processed in the order in which they were selected. During processing, the selected input documents are not opened in the editor, which takes considerably fewer system resources and makes the procedure faster. After the application has been started, no user input is expected. At runtime, status messages on the screen show the current state of the process: the document currently being processed, the result message after application to a single document, the result document, etc.
All tools which support these two modes of application have similar graphical dialogs. The mode is controlled by a checkbox "Multiple Apply" (fig. 1) situated on the main tool dialog. If the checkbox is unselected, the tool will be applied to the current document. Otherwise, an auxiliary panel is shown under the checkbox (fig. 2).
Fig. 1 Tool application to the current document
Fig. 2 Tool application to multiple Internal Documents (Multiple Apply)
Multiple Apply Auxiliary Panel
The panel is essentially a table listing the selected documents to which the tool will be applied (column INPUT). The table also contains the names of the documents in which the results of the application will be stored (column RESULTS). If one result document is produced for each input document, the input name and the result name appear on the same row. If only one result document is produced for all selected input documents (this depends on the tool and/or its options), its name appears on the first row of column RESULTS. Unless the Overwrite option is set (see Options below), the second column of the table can be edited.
On the right side of the panel, the buttons for document selection and options are situated:
- Add Documents - Opens a selection dialog with the internal documents arranged in groups or in a list. The result from the selection is appended to the table.
- Remove Documents - Removes the selected rows (documents) from the table. The removal is NOT preceded by a confirm message.
- Clear All - Removes all entries from the table without a confirmation.
- Options - Opens an options dialog with settings concerning the actual application and the result forming. The dialog window looks in the following way:
Multiple Apply Options
The options dialog is divided into two sections:
- Output section - determines the way the result is formed. The possibilities are to create one result document for each input document (option Separate) or to create one result document for all input documents (option United). The second option is disabled for tools for which it is not appropriate. When the second option is active and selected, the user is expected to supply a result document name in the field Name.
- Mode section - determines the way the result document names are generated. The Overwrite option makes the newly created documents overwrite the old input documents. This is useful when the result documents are modified versions of the input documents. This option is disabled for tools for which it is not appropriate. The other option, Create New, stores the result in one or more new documents. The initial names of the result documents are formed by concatenating the input names and the suffix from the field Default Extension. The user can modify the suffix. Each tool has its own default suffix (its format is: -tool_name- ). Pressing the Reset button sets the Default Extension field back to its initial value.
At the bottom of the dialog window there are two checkbox options:
- Always save - by default, if an input document is not modified by a tool application and it serves as a result, it is not saved. If this option is selected, the result is always saved, whether or not it is the same as the original input document.
- Always overwrite - by default, when an existing document name is set as a result document name, the system raises a warning message and cancels further application. If this option is selected, no warning messages are shown and the existing internal documents (if any) are replaced with the new results.
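The naming rules for the Separate output mode can be summarized in a small sketch. It is purely illustrative and does not reflect the system's internal code: the function name plan_results, its parameters and the error raised are all hypothetical, and the default suffix simply repeats the -tool_name- format mentioned above.

```python
def plan_results(inputs, existing, mode="create_new",
                 extension="-tool_name-", always_overwrite=False):
    """Map each input document name to its result name (Separate output only)."""
    if mode == "overwrite":
        # Overwrite: every result replaces its own input document
        return {name: name for name in inputs}
    plan = {}
    for name in inputs:
        result = name + extension          # input name + Default Extension suffix
        if result in existing and not always_overwrite:
            # without "Always overwrite" a name clash stops the application
            raise ValueError("result document already exists: " + result)
        plan[name] = result
    return plan


print(plan_results(["doc1", "doc2"], existing={"doc1", "doc2"}))
# {'doc1': 'doc1-tool_name-', 'doc2': 'doc2-tool_name-'}
```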
Result Folder
When a tool is applied in Multiple Apply mode, the user has to specify a location where the result(s) should be stored (except when the result overwrites the input data). A result location group can be specified at the bottom of the panel. By default, each tool has its own specific result group ( [Corpus_name] SYSTEM : Results : <tool-name>). The user can point to any group in the Internal Documents database with one restriction: results cannot be stored in the system group [Corpus_name] SYSTEM : Queries or in any of its descendant groups. This restriction comes from the fact that these groups must contain XML documents of a special type (tool queries) which must be valid according to certain DTDs. These requirements cannot be checked for the result in advance. After pressing the Change button, the user gets the following result group (folder) chooser dialog:
A new result group can be set either by pointing to a group in the tree and pressing the Choose button, or by performing a double left-mouse-click on the target group. If the selected group is not appropriate for a result group, a warning message appears and the control is returned to the group chooser dialog. Two additional operations are available here: adding a new group (button New Group) and removing an existing group (button Remove Group).
Result DTD
For some of the tools there is one more option available: specifying a DTD for the result document(s). This option is available only for tools which produce new result documents. The user can select any DTD compiled in the system or preserve the DTD of the input document(s) (option <Original DTD>).
While the real tool application in Multiple Apply mode is performed, the user is shown an information dialog which indicates the overall status of the process. The system shows which document is currently processed, result messages after each single application and where the result is stored. In case of errors, corresponding messages are shown. An example status window is in the following picture:
During runtime the user can cancel the tool application by using the Stop button. The application is not interrupted immediately but only after the current operation has been completed (opening a document, applying a single operation or saving a result).
XML Tools Queries
The user can save different configurations of the tools in order to execute them many times. Apart from the specific tool settings, the user can specify which input documents the tool will be applied to and how the result will be formed and saved. All these settings are represented as XML documents in the Internal Documents database. Furthermore, they can be processed with all the facilities of the system. XML documents of this specific kind are called XML Tool Queries or just queries. Each tool in the CLaRK System has its own specific type of queries with their specific DTDs. The queries are located in a special place in the Internal Documents database (system groups [Corpus_name] SYSTEM : Queries : <tool_name> and all descendant sub-groups). Each XML query is valid according to its DTD.
fig. 3 Queries Panel for XPath Remove Tool
A management panel similar to the one in fig. 3 appears (with very small variations depending on the specific tool). After the Select button is chosen, the corresponding Query Manager is shown and the user can load a query into the current tool. If the Reset button is clicked, the tool settings are reset to their initial values. In this case the Update button changes to Save. The settings in the current tool dialog window can be saved by using the Save/Update button. If the user creates a new query, then after pressing the Save button s/he must supply a query name. If changes to an existing query are to be saved, the Update button asks the user to confirm the overwriting. All queries which are saved/updated are stored in the Internal Documents database.