
Menu Tools

Entity Converters

This tool handles documents that contain symbols not supported by the local hardware architecture. It substitutes such symbols with entities according to the ISO 8879 standard, and vice versa. Currently the tool supports 19 sub-sets of entity-character conversions, each of which can be activated or deactivated. One reason to exclude some of the sub-sets is that sometimes not all symbols have to be converted, for example commas, dots, colons and semicolons.

Example: ("дума" in Bulgarian is the equivalent of "word")

"дума" <-- entity conversion --> "&dcy;&ucy;&mcy;&acy;"

The tool operates on the document which is currently open in the system or on a set of documents from the Internal documents database. It can be started from the menu item Tools/Entity Converters.

The dialog window looks as follows:

The window presents a list of converters (filters) which will be used in the replacement procedure. The list content can be managed with the following buttons:

  • Add Filter - enables the addition of a new filter or a set of filters to the current list content.

    Having pressed this button, the user is shown a list of all available filters which are not yet present in the working list. The list is placed in a new dialog window with the following layout:

    Here the user selects filters to be added. The control buttons are as follows:

    • Add - includes the selected item(s) in the working list;
    • Preview - shows details of the currently selected filter in the list;
    • Cancel - closes the dialog without any other action;
    • Add All - includes all list items in the working list.

  • Remove Filter - removes the selected item(s) from the working list.
  • View Filter - shows detailed information about the selected item (filter) in the list.

    The information is visualized in table form, where each row represents a single character-to-entity mapping. The table has three columns: Entity (the literal representations of the entities), Value (the character (Unicode) codes in hexadecimal format) and Preview (the characters themselves).
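
    For example, a few rows from a Cyrillic sub-set might look like this (illustrative rows, using the ISO 8879 entities from the example above):

    Entity    Value    Preview
    &dcy;     0x434    д
    &ucy;     0x443    у
    &mcy;     0x43C    м
    &acy;     0x430    а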

The direction of conversion is determined by the two radio buttons: Entity to Character and Character to Entity.

Additionally, the user can restrict the scope of the conversion, i.e. the conversion can be applied only to certain places in the documents, leaving the rest unchanged. If no restriction is used, the conversion is applied to all attribute, text and comment nodes in the document(s). To set a restriction, the Enable Filtering checkbox must be selected. The user is expected to supply an XPath expression which will select the nodes on which the conversion will be applied. Each application on a node also includes conversion of the whole content of the node, i.e. all descendant nodes suitable for this operation. Thus, if, for example, only data contained in paragraphs must be processed, the XPath expression must select only the paragraph nodes (see the example below).
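
For instance, assuming the paragraphs of the document are encoded as p elements (an illustrative element name), the XPath expression

  //p

selects all paragraph nodes, and the conversion is then applied only to their whole content, leaving everything outside them unchanged.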

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.


Grammars

A Grammar in the CLaRK System is defined as a set of rules. Each rule consists of three regular expressions and a category (represented as an XML fragment, called Return Markup). The three regular expressions are called: Regular Expression, Left Regular Expression and Right Regular Expression. The Regular Expression determines the content which the rule can be applied to. The Left and the Right Regular Expression determine the left and the right context of the content the rule recognises (if there are no constraints over one of the contexts, the corresponding expression is empty). When the rule is applied and recognises some part of an XML document, that part is substituted by the return markup of the rule. If it is necessary to keep the recognised part, it can be referenced by using the variable \w. If the user needs the literal string \w in the return markup, he/she can escape the \w variable in the following way: ^\w
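
Here is an illustrative rule (the token values and the name element are hypothetical; the comma denotes concatenation of token descriptions, as in the expressions used later in this section):

  Regular Expression:        "Mr.", "#"
  Left Regular Expression:   (empty)
  Right Regular Expression:  (empty)
  Return Markup:             <name>\w</name>

Assuming the tokenizer produces "Mr." and "Smith" as separate tokens, the rule recognises these two tokens in a text such as "Mr. Smith" and replaces them with <name>Mr. Smith</name>; the variable \w stands for the recognised part.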

The regular grammars in the CLaRK System work over token and/or element values generated from the content of an XML document and they incorporate their results back in the document as XML mark-up.

The tokens are determined by the corresponding tokenizer.

Before being used in the grammar, each XML element is converted into a list of textual items. This list is called the element value of the XML element. The element values are defined with the help of XPath keys, which determine the important information for each element.

In the grammars, the token and element values are described by token and element descriptions. These descriptions could contain wildcard symbols and variables. The variables are shared among the token descriptions within a regular expression and can be used for the treatment of phenomena like agreement.

Here is the list of the token and element descriptions:

  1. "token" -> describes the token itself. This description can be matched to the token itself and nothing else.

  2. $TokenCategory -> describes all tokens of the category TokenCategory. In the grammar input this description is matched against exactly one token of this category.

  3. Wildcard Symbols: #, @, % -> describe substrings of a given token. # describes a substring of arbitrary length (from 0 to infinity), @ describes a substring of length 0 or 1, and % describes a substring of length exactly one. Here are some examples:

    • "lov#": matches exactly one token which starts with "lov": "lov", "love", "loves", and many others.

    • "lo#ve": matches exactly one token which starts with "lo" and ends with "ve": "love", "locative", "locomotive" etc.

    • "%og":matches exactly one token which ends in "og": "bog", "dog", "fog", "jog" etc.

    • "do":matches exactly one token which starts with "do": "do", "doe", "dog", "don" and many others.

  4. Variables: &V -> a variable describes some substring in a token when it is initialised for the first time; afterwards it matches only the same substring in the same token or in some following tokens. The scope of a variable is one grammar rule. A variable can be used in the return mark-up, in which case the value of the variable is copied into the return mark-up. Each variable consists of the symbol & followed by a single Latin letter. Each variable has positive and/or negative constraints over its possible values. Both the positive and the negative constraints over variables are given as lists of token descriptions. The value assigned to a variable during the application of the grammar has to be described by one of the positive constraints and must not be described by any of the negative constraints. These token descriptions can contain wildcard symbols, but no other variables. Here are some examples: "A&N&G", "Nc&N&G". These token descriptions can be used in a rule to ensure agreement in number and gender between an adjective and a noun (see the sketch after this list).

  5. Complex token descriptions. The user can combine the above descriptions in one token description. Some examples: "lov%#", "Vp&N&G#&P"

  6. An element description is a regular expression in angle brackets: < Regular Expression >. Here the Regular Expression is over token descriptions and is matched against the element value. Examples:

    • <w>: matches exactly one w element.

    • <$TokenCategory>: matches exactly one element whose element value is a single token of the category TokenCategory.

    • <"token"> : matches exactly one element, whose element value is the token itself and nothing else. A token can contain wildcard symbols and variables.

    • <<N>> : matches exactly one element, whose element value is the XML element N.
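
As a sketch of how variables support agreement, consider a rule built from the token descriptions in item 4 above (the tags are hypothetical positional tags in which an adjective tag starts with A and a common-noun tag with Nc, followed by number and gender characters; the variables N and G are assumed to be constrained to single characters):

  Regular Expression:  "A&N&G", "Nc&N&G"
  Return Markup:       <np>\w</np>

A tag sequence such as Asf Ncsf (both singular feminine) is recognised and wrapped in an np element, while Asf Ncpm is not, because &N cannot be bound both to s and to p within one application of the rule.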

The application of a rule works in the following way. The element value is scanned from left to right. The rule's Regular Expression is evaluated from the current point. If it recognises a part of the element value (we will call this part a match of the rule), then the regular expressions for the left and for the right contexts are evaluated (if they are not empty). If they are satisfied by the context of the match, then the match is substituted for each occurrence of the variable \w in the return markup (the user must be careful: if the return markup contains, for example, text like \word, its beginning \w will be substituted by the match; in such cases the variable must be escaped). After these substitutions, the new markup replaces the match in the XML document.

When a regular expression is evaluated from a given point within the element value, there may be several possible matches for the expression. For instance, the expression (A,B)+ over the element value L,A,B,A,B,A can recognise two matches from the second position: A,B and A,B,A,B. This allows for a non-deterministic choice at this place. One can choose either the shortest match, or the longest one, or some match in between. Generally, there are no universal principles for making such a choice. This is why the CLaRK system allows the user to define a strategy for choosing a match among several possibilities. We envisage four strategies: shortest match - the system always selects the shortest possible match; longest match - the system always selects the longest possible match; any up - the possible matches are enumerated from the shortest to the longest up to the moment when the left and/or the right context of the match satisfy the Left and/or the Right Regular Expression (if there are no Left and Right Regular Expressions, the any up strategy is the same as shortest match); any down - similar to any up, except that the possible matches are enumerated from the longest to the shortest. These strategies are specified within the grammar queries. This allows the same grammar to be applied with different strategies over different documents.

The definition and application of a grammar are separated within the CLaRK system. The grammar itself is defined in one place; the parameters for its application are defined elsewhere, in the form of grammar queries. This separation allows the same grammar to be used with different parameters such as different tokenizers, different element values, different filters etc. Each of them has an XML representation. These XML representations allow grammars and their queries to be exchanged among different users. They also allow grammars to be constructed outside the system and then imported into it.

The grammar definition consists of a set of rules, variable definitions and a context evaluation parameter (Check Context Order). The rules have already been discussed. The variable definitions are given by the positive and negative constraints over the variables. The context evaluation parameter determines which context regular expression is checked first: the left and then the right, or vice versa.

The grammar application determines: the elements which the grammar will be applied to; the element values for those elements, including the tokenizer and the filter; whether the textual elements will be normalized; the application strategy (longest, shortest match, any up and any down match).

Grammar Manager

The Grammar Manager is the user interface in the CLaRK System for the management of grammar definitions. It supports the user in the creation, modification and deletion of grammars. The main dialog is an Entry Manager with additional buttons. It contains all of the available grammars arranged in a tree hierarchy, some of their features (Editable, Compiled) and buttons for management of the grammars.

Each grammar has to be compiled in order to be used in the system. The compilation converts the regular expressions into a finite-state automaton. Because compilation is a heavy process, it is sometimes better to postpone it (for example, when a large grammar is imported into the system). The column Compiled in the table indicates whether the grammar is compiled or not. If the grammar is compiled, the corresponding box is checked.

The system also allows the user to export and import already compiled grammars. This option is very useful for large grammars, when the compilation takes a long time or the user wants to exchange just the compiled grammar, but not its source. If such a grammar is used in the system, it cannot be edited. In such cases the check box in the column Editable is not checked.

Here is the main window of the Grammar Manager:

There are several grammars. The grammar Slovnik One is editable and also compiled. Thus, if necessary, the user can open it in the Grammar Editor and modify it. The grammar Slovnik One[1] is editable but not compiled. Thus it cannot be used immediately; first it has to be compiled. The grammar Slovnik One[3] is available only in compiled form. It cannot be modified in this form, but it can be used to process the relevant documents.

The manager window consists of two main parts:

  1. The panel on the left. It contains the tree representations of the group hierarchy. When the user selects a node in this tree, the content of the corresponding group is loaded in the component on the right side.
  2. Current group monitor. This is the panel situated on the right side of the window. It is a list with the content of the currently selected group in the tree. The list entries shown in blue are sub-groups. The other entries are the grammars included in this group. They are colored in black or red, depending on whether they are valid or not according to their DTDs. The user can sort all the grammars in a group by clicking on the Name column of the table header. It is possible to rearrange the grammars in a group by using a simple drag-and-drop technique, i.e. pressing a grammar and moving it upwards or downwards until the desired position is reached.

    There are six additional buttons which can be used for modifying the content of the current group:

    • New Group - creates a new sub-group of the current one. The user is asked to give a name for it;
    • Remove - removes the selected grammars and/or groups from the list. The removal is preceded by a confirmation message. If the selection includes sub-groups, they are also removed with their entire contents.
    • Rename - gives a new name to the selected grammar in the list.
    • Copy - saves a copy of the selected grammar from the list under a different name given by the user.
    • Add Grammars - gives a list of all grammars which are not present in the current group. The user is expected to choose one or more grammars to be included in the current group.
    • Delete! - This function can be used for removing grammars from the internal grammar database. It can be applied only to single grammars, not to whole groups. Groups are excluded from any selections. The removal of grammars is preceded by a confirmation message. The grammars to be removed are excluded from all the groups they may belong to.

Navigation in the group structure can also be done in the panel on the right. When the user wants to see the content of a certain sub-group of the current group, s/he just has to double-click on the desired sub-group. This changes the current group to the new one and represents the movement from a group to a sub-group. Movement in the other direction is also possible. For each grammar group (excluding the root group Grammars), a special sub-group named ".." is included. By double-clicking on it, similarly to most file systems, the current group is changed to its parent one.

The Grammar Manager also provides a list of all the grammars, no matter which group they are included in. The following information appears for each grammar: the date when it was last modified; if it is a query, which tool it refers to; and its DTD. The user can sort all the grammars by clicking on the Name column of the table header. When selecting grammars, the right mouse button is used to open a pop-up menu with the following operations on the selected grammars:

  • Info - This item shows the following information about the selected grammars:
    • grammar name
    • grammar size
    • grammar's dtd name
    • whether the grammar is valid
    • group of the grammar


  • Add In Group... - This item shows a dialog with the hierarchical structure for the groups in the system and the user can choose a group in which to place all the selected grammars.

  • Delete! - It is described above.
  • Rename - It is described above.
  • Copy - It is described above.

Under the table with the available grammars there are five buttons (New, Edit, Compile, Apply, Exit) and three menus (Compiled Grammar, File I/O, XML Editor). They can be used to manage the grammars.

Buttons:

  • New - creates a new empty grammar with a name specified by the user and opens the Grammar Editor;

  • Edit - opens the selected grammar in the Grammar Editor;

  • Compile - compiles the selected grammars. This operation is relatively slow. For large grammars (with thousands of rules) it might take several minutes;

  • Apply - switches from the Grammar Manager dialog to the Apply Grammar dialog and allows the user to apply some of the grammars;

  • Exit - exits the Grammar Manager dialog.

Menus:

The three menus allow the import and export of grammars from and to the system. As was said above, the grammars in CLaRK have an XML representation. Thus the user can load such a grammar from an external file or from a document within the system. Also, the user can save a grammar created within the system in an external file or as a document in the system. In this way the user can exchange grammars with other users, make backup copies of them, or process them with other tools in the system, such as sorting, searching etc. Additionally, there is a format for saving and loading compiled grammars.

  • Compiled Grammar

    This menu gives the user a possibility to store and load grammars in compiled format into/from a file. It has the following items:

    • Load compiled grammar from file - the user is asked to choose a file which contains the compiled grammar. Then the system reads the file, interprets it as a CLaRK finite-state automaton and stores it in the grammar database of the system. Such a grammar cannot be modified, but it can be applied.

    • Save compiled grammar to file - the user can save a compiled grammar into a file. The file has a special format and cannot be modified in any reasonable way, thus it can be used just for exchanging grammars in compiled form.

  • File I/O

    This menu gives the user a possibility to store and load grammars in XML format into/from a file. It has the following items:

    • Load grammars from file - the user is asked to choose a file which contains the grammars in XML format. Then the system reads the file, interprets it as CLaRK grammars and stores them in the grammar database of the system in a table format. Such grammars can be modified within the system.

    • Save grammars to file - the user can save editable grammars into a file. The file is an XML document and it can be modified out of the system.

    • Save grammars with groups to file - the user can save editable grammars, together with the group structure for the selected grammars, into a file. The file is an XML document and it can be modified out of the system.

  • XML Editor

    This menu gives the user a possibility to store and load editable grammars in XML format into/from XML documents internal to the system. It has the following items:

    • Load current document as grammar - the user has to open the document which contains the grammar as the current document in the system. When this item is chosen, the system reads the document, interprets it as a CLaRK grammar and stores it in the grammar database of the system in a table format. Such a grammar can be modified within the system. This option allows the user to create grammars from the documents produced by other tools in the system and load them in the Grammar Manager.

    • Edit grammar in editor - the selected grammar is converted from table format into an XML format and is loaded as a document in the system. Thus it becomes a current document of the system. The user can manipulate the document with the tools of the system. Useful processing can include sorting, searching, etc.

Grammar Editor

This is the editor for grammars in table format in the CLaRK system. The editor contains the following elements: Rules, Context Check Order, Variables, and three buttons: Save, Compile and Exit. Here is the main window of the Grammar Editor:

Rules

The table Rules contains the rules of the grammar. Each row of the table represents one rule. The columns follow the structure of the grammar rules in CLaRK. The column Regular Expression has to contain the regular expression which will be matched against the element content. The column Return Markup is the second obligatory element of a rule. It contains the XML fragment which will replace the matched part of the element content. There are two columns for the regular expressions which determine the left and the right context of the match - Left Regular Expression and Right Regular Expression. The last column is for comments on the rule.

The content of the table cells is not checked for consistency before the compilation.

Context Check Order

This option gives the user a possibility to choose which context will be checked first - the right or the left. In this way a preference over rules can be defined. The two orders are: first the left context, then the right one (Left->Right); or first the right context, then the left one (Right->Left).

Variables

This table contains the definitions of the variables for the grammar. Each row of the table represents the definition of one variable. Each definition consists of the following elements:

  • Name - the name of the variable, which is a capital Latin letter;
  • Positive Values - the positive constraints for the variable, given as a list of token descriptions. If the cell is empty, the variable is not constrained and can have any non-empty value;
  • Negative Values - the negative constraints for the variable, given as a list of token descriptions. All the values that can be described by some of the token descriptions in the cell are forbidden as values of the variable. If the cell is empty, the variable is not negatively constrained and can have any value described by the positive constraints;
  • Match - like every token description in the CLaRK system, a variable can match several strings starting from a given position, and the user can define which match is chosen. There are two options: Longest - the variable is assigned the longest possible value; Shortest - the variable receives the shortest possible value.
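
For instance, a definition supporting the number variable from the agreement examples could look like this (the values s and p are hypothetical number codes of a positional tagset):

  Name: N    Positive Values: "s", "p"    Negative Values: (empty)    Match: Shortest

With this definition the variable &N can only be bound to the single character s or p, and the shortest possible value is always chosen.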

The management of the variable table is done by a pop-up menu which appears when the user clicks with the right button of the mouse on the cells in the table. The possible choices are: Insert row which allows the user to define a new variable; Delete row which allows the user to delete the definition of a variable; Edit Cell which allows the user to modify the list of token descriptions for the positive or the negative constraints for the variable; Up and Down allow the user to rearrange the list of the variable descriptions for her/his own convenience.

When the user edits the constraints for a variable (Edit Cell) the following dialog appears for the positive constraints:

or for the negative constraints:

In both cases the constraints are represented as a list of token descriptions (one per line). The user has the possibility to enter a new description by the button Add Value - in this case a text edit field appears and the new token description has to be entered in it. The user can delete some token descriptions by selecting them in the list and clicking on the button Remove Value(s). The buttons OK and Cancel are used for the acceptance or rejection of the changes.

Buttons

The buttons at the bottom of the dialog give the user the possibility to save the grammar (Save), to compile the grammar (Compile) and to exit the dialog (Exit). In case of errors in the grammar, a corresponding error message appears on compilation. On exit the system prompts for unsaved changes.

Apply Grammar

As it was described in the Grammars menu, the definition and application of a grammar are separated within the CLaRK system. In the Grammars menu the user can find a description of the grammars and how to construct their definitions in the CLaRK system. In this section the user can read how to apply a grammar over one or several XML documents.

Here is the main dialog of the Apply Grammar tool:

The application of a grammar requires the following types of information: the name of the grammar (text field Grammar), the target of application (text field Apply to), how the input is to be prepared (the second row of options: Tokenizer, Filter, Normalize, and Element Values), how the rules of the grammar are to be applied (the third row of options: the match options for the left context regular expression (Left), for the main regular expression (Body) and for the right context regular expression (Right)), and whether the context can be backtracked (Context Backtracking). A combination of all these options is called a grammar query.

Additionally, the dialog allows the user to consult the definitions which are connected with some DTD in the system. Generally, if some of the necessary information is not present in a particular grammar query, the system checks the corresponding information connected with the DTD of the document. This can be done via the menu Features.

The user can save the current settings of a query as an XML document by selecting the Queries check box. The user then has the possibility to save the query with some comments in the Info text field. Also, the user can select a previous query from the list of queries. The grammar query XML documents are saved in the group SYSTEM:Queries:Grammar.

Also, the user can specify whether the grammar is to be applied to the currently opened document or to documents stored internally in the system. This can be done by selecting Multiple Apply. In this case the user can select several documents to apply the grammar to.

At the bottom of the dialog there are three buttons: Apply for application of the currently stated query (also one loaded from the XML representation); Close for closing the dialog; and Select for navigation over the current document and manual annotation (see below).

Element Values

When a grammar is applied over content which contains XML elements, the system converts each such element into an element value. How exactly this conversion is performed is stated in Element Value definitions. These definitions can be connected with a particular DTD, but the grammar query allows the user to change the DTD settings and to define them locally in the query. Each element value is connected with an element tag and a sequence of XPath expressions (called keys) which define the sequence of tokens or elements for the element value. See below for examples. The dialog of the Element Values editor is as follows:

Each row of the table represents an element value for an element. The first column contains the names of the elements for which an element value is defined. In this case they are w and pt. The Keys column contains the keys for each definition. In this case the value of the element w is the text in its ph element and the text in its ta element. For the pt elements the definition says that the element value is their text content.
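
As a sketch of such definitions (the element names come from the example above; the exact key syntax shown in the dialog may differ):

  Element   Keys
  w         ph/text(), ta/text()
  pt        text()

With these definitions the value of a w element is the sequence of tokens produced from the text of its ph child followed by the tokens from its ta child, while the value of a pt element is its own tokenized text content.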

The user can edit the value of the right column by selecting the Edit Cell item from the pop-up menu which appears when the user presses the right mouse button over the cell (the menu is visible on the screen shot). There are two modes for the element value: Tool and User. In the following screen shots the w element is shown in both modes:

Each row of the table represents one key. All keys in the table define the value of one element. One key consists of a key name (option), a key value (XPath expression), a normalize option and a tokenizer name. If the element value is in Tool mode, then the normalize option and the tokenizer are taken from the grammar. The key in the table is called a Grammar Key. It can be saved to and loaded from the system memory. This is done by selecting the Load Key and Save Key menu items in the context menu shown when the user right-clicks over a cell in the table. The user can also add or remove keys with items from this menu. The normalize option and the tokenizer name determine the input word which is created for the elements when the grammar is applied. An interesting option here is the "No Tokenizer" tokenizer. If it is selected, the text nodes are treated as one token. When the OK button is clicked, all XPath expressions in the table are compiled.

The element values are calculated in the following way:

  1. If the XPath expression selects textual content or the value of an attribute, the corresponding text is tokenized by the relevant tokenizer. Then the value is a sequence of tokens.

  2. If the XPath expression selects one or more elements then each element is represented as <tagname>, where tagname is the tag for the element.

  3. If there is no element value definition for the element then it is represented as <tagname>, where tagname is the tag for the element. The difference from the previous case is that if the element value is defined by self::*, then the element value will be <<tagname>>.

Select

This button allows the user to apply a grammar in an interactive mode. In this mode the grammar is executed on the current document and for each match it stops and allows the user to see the selected content and to perform some actions like: to go to the next selection (Next button), to go to the previous selection (Previous button), to add the return mark-up to the content (Mark button), and to exit the mode (Exit).

Grammar Groups

This tool is a means for applying a set of Grammar Queries in a row. The application itself is done in a cascaded grammar style, i.e. the output from each grammar is the input for the next one. The result from the last grammar is the result of the whole tool. The advantage of this tool is the better efficiency resulting from the fact that the input for the grammars is prepared only once. Otherwise, the input would have to be prepared (preprocessed) each time a single grammar query is applied. This can be crucial for huge amounts of data.

Here is what the Grammar Groups dialog window looks like:

The tool dialog basically represents a list of Grammar queries. The user can Add Grammar Query to the end of the list and/or Remove Grammar Query from the list by using the buttons on the right side of the panel. The order of the different grammar queries can be changed by selecting a query and dragging it to the desired position.

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.


Constraints

Regular Expression Constraints

The Regular Expression Constraints are a means for setting restrictions on the content of certain nodes in a document. The restrictions are set as regular expression patterns. The syntax of the patterns is the same as in the Grammar tool. The nodes whose content the constraint will be applied to are selected by an XPath expression. The selected nodes must be of type Element only (as no other types of nodes can have content). A node satisfies a certain constraint if its content matches the pattern given in the constraint. The application of a constraint gives the user the possibility to navigate either through the nodes which satisfy the constraint or through the nodes which do not satisfy it.

The constraint application is similar to a DTD validation of certain nodes of a document, but here we offer a more powerful instrument. On the one hand, the nodes which the constraints are applied to are selected not only by name (as in the DTD), but depending on the context in which they appear (XPath-determined context). The context can be an absolute or relative document position, based on properties of the selected nodes or nodes relative to them, properties located in other documents, etc. The node selection uses the full expressive power of the XPath engine implemented in the system. On the other hand, the regular expressions of the Grammar allow writing patterns not only at the level of text nodes, but also at the level of tokens within the text nodes. Moreover, the user can specify a tokenizer which will be used for segmenting the text and a filter to discard the meaningless tokens during application. In the patterns the user can write wildcard symbols in token descriptions in the same way they are written in the Grammar tool.

With the help of these constraints the user can cover some of the features of XML Schema usage, by defining patterns for text nodes.
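
For example, a constraint could require a hypothetical price element to contain a decimal number. The XPath expression //price selects the nodes to be checked, and the pattern is

  $NUMBER, ".", $NUMBER

(assuming the chosen tokenizer produces the dot as a separate token and assigns digit sequences to a NUMBER category). Element nodes selected by //price whose text content is a number token, a dot token and another number token, such as "12.50", satisfy the constraint; all other selected nodes are reported as not satisfying it.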

Regular Expression Constraints Structure

Each Regular Expression Constraint consists of 2 main parts and 3 additional (optional) parts:

  • Constraint name (obligatory) - a unique identifier for the constraint in the system;
  • Regular expression (obligatory) - a valid regular expression which represents the constraint over the nodes' content;
  • Default XPath expression - it is an XPath expression defining the selection of the nodes to be processed by the constraint. This expression appears as a default text in the appropriate specification area;
  • Tokenizer - it is used when the constraint tests text nodes' content. If the tokenizer remains unspecified, the processor takes the tokenizer specified in the DTD of the current document;
  • Filter - it is used to filter the tokenizer categories when the text nodes' content is tested. In other words, all filtered tokens are discarded from the selection before passing it to the constraint engine.

Edit Regular Expression Constraints (REC)

This section is responsible for the regular expression constraint management. Here the REC can be created, modified, removed, saved as a file and loaded from a file. Here is a picture of the dialog window:

The left side of the window is a table with all RECs in the system. The first column contains the names of the constraints. The second one contains the regular expressions for each of the constraints. Having selected a row in this table, the user can apply a manipulation over a constraint by using the buttons on the right.

Description of the buttons on the right:

  • New - creates a new Regular Expression Constraint. Having pressed the "New" button, a new constraint editor window appears on the screen (for more details, see below);
  • Edit - the currently selected constraint is opened for editing in a new editor window;
  • Remove - removes the currently selected constraint in the table. The removal is preceded by a confirmation message;
  • OK - updates the current changes in the constraints and closes the manager window;
  • Cancel - closes the manager window without saving the changes (if any) in the constraints;
  • Save To File - serializes all the RECs into an external file in an XML format. This function can be used for two main purposes: back-ups and interaction with external applications. The description of the output XML file (the DTD) can be found in the file: regConstraint.dtd;
  • Load From File - loads the REC(s) from an external file. The external file must be an XML document, valid with respect to the DTD in the file: regConstraint.dtd;

Regular Expression Constraint Editor

Here is the interface view of the editor window for the REC:

The last three fields are optional. The tokenizer and filter lists contain all tokenizers and filters defined in the system. The regular expression may consist of: tags, token categories, token values and token value templates (wildcard descriptions).

Apply Regular Expression Constraints

The actual applying of the Regular Expression Constraints can be performed in two ways:

  • by selecting a node from the tree panel and choosing a constraint;
  • by selecting a set of nodes with the help of an XPath expression and then applying a certain constraint on each of them;

Here we describe the latter case. The user chooses 'Apply Regular Expression Constraints' from the menu Tools/Constraints/Regular Expression Constraints/Apply Regular Expression Constraints. Then the following dialog window appears:

The first input field Select nodes contains the XPath which is evaluated in order to select nodes for the constraints operation. If the default XPath expression is specified for the constraint, then it appears in this field as a default text.

The second field selects by name a constraint to be applied.

The last two fields are activated when the current constraint tokenizer and filter are ignored and new ones have to be defined explicitly.

When the Apply button is pressed, the XPath expression is evaluated and a set of nodes is selected. Then the constraint is applied to each of them. If the node's content satisfies the constraint, the node is marked as Valid; otherwise it is marked as Non Valid. In this way two groups of nodes are formed and each of them can be observed separately. Here is a picture of the navigation panel window:

In this window the user can change the group under observation by using the two radio buttons. By pushing the Next and Previous buttons the user changes the current selection in the editor. At the top of the window there is some information about the constraint and the nodes which satisfy or do not satisfy it. For the example above, the pattern '$NUMBER+,$SPACE*' concerns the content of text nodes. The items which satisfy the constraint with this pattern are all element nodes whose text content is a sequence of one or more tokens of category NUMBER, followed by zero or more tokens of category SPACE. Thus, strings which match the pattern are: "1234", "256  ", "666        ", etc.

Value Constraints

The constraint engine is a means for setting restrictions, which cannot be expressed by the DTD, on the content or other related information of nodes in XML documents. The nature of the restrictions is based on the existence of certain values (tokens and/or tags) at certain places. The constraints of this type specify the pieces of information which are restricted and define the set of admissible values for each of them (usually by pointing to a location where they are stored, or by encoding the values themselves explicitly).

Value Constraint Structure

In general a value constraint consists of two parts: a target section and a source section.

Target Section

In this part one can find a description of the nodes which the constraint will be applied to. The target nodes for a constraint are selected by an XPath expression evaluated on the document which the given constraint is to be applied to. The result of the evaluation is expected (required) to be a node set with nodes compatible with the specific constraint application. If the result set contains nodes of types other than the required ones, they are automatically excluded (example: the selection contains text and attribute nodes, but the constraint checks the child nodes of its targets). This way of target selection uses the full expressive power of the XPath language in order to express context dependencies.

Source Section

Here the possible values for the target nodes (selected by the previous section) are defined. The possible values are tag names and tokens depending on the type of the constraint. The source list can be selected by an XPath expression or by typing the choices explicitly as an XML markup. If the selection is made by a relative XPath expression, then the current target node is taken as a context node for the constraint. If a text node is selected as a source, then its text value is tokenized and the tokens are added to the source list, excluding the node itself. It is possible that the source for the constraint is an external document. The only requirements in such cases are the following: the external document has to be in the internal database of the system and the XPath expression cannot be relative.
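
As a sketch with hypothetical element names, an All Children constraint could check part-of-speech elements against a tag list kept in a separate tagset document stored in the internal database:

  Target XPath:  //pos
  Source XPath:  /tagset/tag/text()   (evaluated on the external document)

During application, the text content of every pos element is tokenized and each token must have an equal counterpart among the tokens obtained from the text of the tag elements in the tagset document.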

There are four types of value constraints currently supported by the system. They are distinguished by their target and the way they are used. Here is a description of each value constraint type:

  1. Parent Constraint

    This type of value constraint sets limits on the possible parents of a node. There are two ways of applying this constraint type: by changing the parent of a node (local) or explicitly running the constraint engine (global).

    The first possibility is changing the parent of a node (or a set of nodes at one level). The list of all the relevant parent nodes can be restricted further by applying other constraints. The final list contains the intersection between the source of the constraints and its former content. If the operation - changing the parent of a set of nodes - is performed, then all compatible (parent) constraints are applied.

    The second possibility is running the Constraint Engine. It works in the following way. First, the targets are selected (by their tag names and an XPath restriction). Then the source is compiled. If there is more than one choice, the user is asked to select one option from a list. If the choice happens to be exactly one element, it can be automatically inserted as a parent of the target. The action of a constraint depends on the Application Mode set for the constraint.

    The source list of each constraint must contain only tag names. All tokens in the list are ignored.

  2. All Children Constraint

    This type of value constraint sets limits on the names of a node's children and on the content of its text children. All children that are tags must have names coinciding with the name of some node from the source list. Then all the data in the text children is tokenized and a list A of tokens is formed. After that all the data in the text nodes in the source list is tokenized and a list B of tokens is formed. For every token in A there must exist a token in B such that the values (not the categories) of the tokens are equal. This type of value constraint can be applied (checked) from the menu item Apply All Children or from the toolbar button. The list of all nodes invalid according to the constraints is given in the Error message area together with the rest of the errors (if any). The user is given the possibility to navigate through all nodes which are invalid for the constraints.

  3. Some Children Constraint

    This is a special type of value constraint, because its main task is not only to set limits on the node's content. Instead, it can be used for value restriction when the operation of inserting a child into a node is performed. This constraint type is not applied each time a new node is inserted; these constraints are used separately. Here the target node is the node where the insertion takes place. The constraint is blocked when:

    • there is a child of the target node that is a tag and there is a node in the source list, such that both nodes have identical names.
    • there is a text node in the target node that has a token, whose value equals the value of a token in the source list.

    To sum up, when there is a non-empty intersection between the source list and the target node's content, the constraint is satisfied and there is nothing more to be done. If the source list is empty and the target content is also empty, the constraint is also satisfied.

    When the source list is not empty and there is no intersection with the target's content, the user is offered a list with the possible values from the source list for the target node. The user can choose one item to insert. The action of a constraint depends on the Application Mode set for the constraint.

  4. Some Attributes Constraint

    This constraint is very similar to the previous one. The only difference is that the target here is an attribute of an Element node. Also, the target selection includes the selection of an attribute defined in the DTD for the selected tag name. The action of a constraint depends on the Application Mode set for the constraint.

Application Mode

The Value Constraints have two modes of application, concerning the treatment of the target nodes:

  • Validation Mode - the constraint points to the target nodes which do not satisfy it, showing all the possibilities for the specific places. On demand the user can insert a value from the list of possibilities.
  • Insertion Mode - the constraint points to the target nodes which do not satisfy it and expects the user to select one of the possible values to be inserted. If the list of possibilities for a certain place contains only one entry, it is automatically inserted. Additionally, if the constraint is of type Some Children, the user can specify the way the new value is inserted. If the new value is a token, the user can specify the position in the content where it must be inserted; the first position is denoted by 1 (not 0). If a position is not specified, the new value is inserted as the last element. The elements in the content which are counted are either unfiltered tokens or Element nodes. If the new value for insertion is an Element node, the counting of the content entries is done in terms of DOM structures (Text nodes, Element nodes). See the sketch after this list.
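
A small worked sketch of the position counting for a Some Children constraint in Insertion Mode (the content and values are hypothetical; only unfiltered tokens and Element nodes are counted when the inserted value is a token):

  Target content:          "very"  <w>big</w>  "dog"
  Insert "quite" at 1  ->  "quite"  "very"  <w>big</w>  "dog"
  Insert "quite" at 2  ->  "very"  "quite"  <w>big</w>  "dog"
  No position given    ->  "very"  <w>big</w>  "dog"  "quite"

The Separator option, when specified, presumably supplies the string placed between the inserted token and the neighbouring content.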

Edit Value Constraints

The screen shot in Fig. 1 shows the dialog window of the value constraints editor.


Fig. 1

The editor is separated into 5 sections which are responsible for different parts of the constraint definition. At the top of the window there is a Summary information panel which shows the current constraint settings (Type, Mode, Target, etc.). The sections are:

  • General (Fig. 1) - here the user supplies a unique Constraint name for the constraint (free text), which is obligatory. The constraint is identified by this name later in applications. Optionally, some additional Constraint descriptions can be written in the second text box. In this section one of the most important aspects of the constraint is defined - the Type of constraint. This determines the whole behaviour of the constraint. The options are: Parent, All Children, Some Children and Some Attributes (described above).
  • Options (Fig. 2) - this section offers several options related to the application of the constraints. The options here do not concern constraints of type All Children. The first part of the section defines the Application Mode in which the constraint will be applied. For Insertion Mode the user can set an insertion Position and a token Separator (when needed) (note: the Position and Separator options concern only Some Children constraints). The position must be a positive integer, where 1 denotes the first position. Leaving this field empty means 'last position'. The separator can be an arbitrary string. For details see Application Mode.
    The remaining options in this part are as follows:
    • Show status before - indicates the number of target nodes the constraint will be applied to, i.e. the node count before the actual application;
    • Show status after - indicates the number of target nodes the constraint has already been applied to, i.e. the node count after the actual application;
    • Disable Intersection Checking !!! - disables/enables the check whether the constrained target nodes' data has common parts with the source data of the constraint. Disabling this check allows the user to insert more than one possible value in Insertion Mode during the application;
    • Restrict to a single choice run - this option is relevant when a constraint is applied on the current document in Insertion Mode. When selected, it restricts the execution to the cases where a single value is determined by the source evaluation, i.e. the constraint works only for the cases where no user decision is needed during application. If more than one entry is selected as a source, the corresponding target is skipped. The tool behaves as if it works in Multiple Apply mode, but on the current document.
    • Prompt for save on each: ... applications of constraint - this item is used for making backups of the current state of the document while applying the constraints. In order to use this option, the check box must be marked and a number must be entered in the text field. It indicates the number of successful applications after which the system prompts the user to save the document.

Fig. 2

  • Target (Fig. 3) - here the definition of the target nodes for the constraint is given. In the field Target XPath the user is expected to supply an XPath expression which will determine the target nodes for the constraint. The XPath expression must return a node-set whose nodes are of a proper type (depending on the constraint type). In this XPath expression the user may (if needed) define context restrictions for the targets. If the current constraint is of type Some Attributes, the user must supply a valid Target Attribute name. This field is disabled for all other types of constraints.

Fig. 3

  • Source (Fig. 4) - this section defines the source list for the constraint. The content of the text field is either an XPath expression or XML markup, depending on the radio button currently selected for the Source Type. If the source type is XML Mark-up, the content of the text field is XML. Otherwise it must be an XPath expression. If the selected type is Local Document, the XPath expression is evaluated with each target node as a context. If the type is External Document, the choice box is enabled and the user is expected to choose a document. The XPath expression is evaluated on this document with the root node as the context. In the latter case the XPath expression is expected to be absolute.

Fig. 4

  • Advanced (Fig. 5) - here a tokenizer can be activated (Set a Tokenizer) for the constraint, or it can be blocked in order to treat the text nodes not as a set of tokens but as a whole. Also, a filter (Use Filter) can be set in order to exclude some "garbage" categories, such as separators, from the source list. Another restriction can be set here by defining token value and category Templates. The templates are defined in the same way as those in the grammar tools (using the @ and # symbols as wildcards). Another facility which can be relied upon here is the Help Document. This option ensures the following possibility: while listing the different choices, the user can get brief information about the meaning of each choice. This information must be stored in an internal document. Its structure is described in a DTD in the file: resources/dtds/helpFile.dtd. The information about a given choice appears in the status bar of the editor when the mouse pointer is over the choice.

Fig. 5

Value Constraints Manager

In the preceding section a description of the Constraint Editor was presented. It is invoked whenever a change to a Value Constraint is needed or a new constraint is defined. The Value Constraint management is handled by the following manager dialog window:


Within the CLaRK System this module can be invoked from the menu: Tools/Constraints/Edit Value Constraints.

The Value Constraint Manager is an Entry Manager with additional buttons. It contains all of the available value constraints arranged in a tree hierarchy, some of their features (Description) and buttons for management of the constraints.

There is an additional context XPath text field located below the table, which determines the context for each constraint group and is used when applying constraint groups. First a context node is selected, and then all the constraints from the group are applied within this context. This XPath value can be changed by pressing the Edit button and entering the new value.

The manager window consists of two main parts:

  1. The panel on the left. It contains the tree representations of the group hierarchy. When the user selects a node in this tree, the content of the corresponding group is loaded in the component on the right side.
  2. Current group monitor. This is the panel situated on the right side of the window. It is a list with the content of the currently selected group in the tree. The list entries shown in blue are sub-groups. The other entries are the constraints included in this group. They are colored in black or red, depending on whether they are valid or not according to their DTDs. The user can sort all the constraints in a group by clicking on the Name column of the table header. It is possible to rearrange the constraints in a group by using a simple drag-and-drop technique, i.e. pressing a constraint and moving it upwards or downwards until the desired position is reached.

    There are six additional buttons which can be used for modifying the content of the current group:

    • New Group - creates a new sub-group of the current one. The user is asked to give a name for it;
    • Remove - removes the selected constraints and/or groups from the list. The removal is preceded by a confirmation message. If the selection includes sub-groups, they are also removed with their entire contents.
    • Rename - gives a new name to the selected constraint in the list.
    • Copy - saves a copy of the selected constraint from the list under a different name given by the user.
    • Add Constraints - gives a list of all constraints which are not present in the current group. The user is expected to choose one or more constraints to be included in the current group.
    • Delete! - This function can be used for removing constraints from the internal constraint database. It can be applied only to single constraints, not to whole groups. Groups are excluded from any selections. The removal of constraints is preceded by a confirmation message. The constraints to be removed are excluded from all the groups they may belong to.

Navigation in the group structure can also be done in the panel on the right. When the user wants to see the content of a certain sub-group of the current group, s/he just has to double-click on the desired sub-group. This changes the current group to the new one and represents the movement from a group to a sub-group. Movement in the other direction is also possible. For each constraint group (excluding the root group Value Constraints), a special sub-group named ".." is included. By double-clicking on it, similarly to most file systems, the current group is changed to its parent one.

The Value Constraints Manager also provides a list of all the constraints, no matter which group they are included in. The following information appears for each constraint: the date when it was last modified; if it is a query, which tool it refers to; and its DTD. The user can sort all the constraints by clicking on the Name column of the table header. When selecting constraints, the right mouse button is used to open a pop-up menu with the following operations on the selected constraints:

  • Info - This item shows the following information about the selected constraints:
    • constraint name
    • constraint size
    • constraint's dtd name
    • whether the constraint is valid
    • group of the constraint


  • Add In Group... - This item shows a dialog with the hierarchical structure for the groups in the system and the user can choose a group in which to place all the selected constraints.

  • Delete! - It is described above.
  • Rename - It is described above.
  • Copy - It is described above.

Under the table with the available constraints there are 5 buttons (New, Edit, Apply Constraints, Cancel, Done) and 1 menu (Load / Save Constraints). They can be used to manage the constraints.

Buttons:

  • New - creates a new Value constraint by calling the Constraint Editor;
  • Edit - edits the selected Value constraint by calling the Constraint Editor;
  • Apply Constraints - first saves the changes on the constraints (if any). Then switches from the Value Constraints Manager dialog to the Apply Constraints dialog and allows the user to apply some of the value constraints and constraint groups;
  • Cancel - closes the dialog window without saving the changes on the constraints (if any);
  • Done - closes the dialog window by saving the changes on the constraints (if any).

Menu Load / Save Constraints:

This menu gives the user a possibility to store and load value constraints in XML format into/from a file. It has the following items:

  • Load constraints from file - the user is asked to choose a file which contains the value constraints in XML format. Then the system reads the file, interprets it as CLaRK value constraints and stores them in the value constraint database of the system in a table format. Such constraints can be modified within the system.

  • Save constraints to file - the user can save value constraints into a file. The file is an XML document and it can be modified out of the system.

  • Save constraints with groups to file - the user can save value constraints, together with the group structure for the selected constraints, into a file. The file is an XML document and it can be modified outside the system.

Apply Value Constraints

This is a tool specialized in applying Value Constraints on the current document or on a set of documents. The user is expected to create a list of single Value Constraints or whole Value Constraint Groups. The constraints and groups are applied in the order they appear in the list. They can be reordered by using the drag-and-drop technique, i.e. pressing a list entry and moving it upwards or downwards until the desired position is reached. Each entry in the list contains: either a constraint name followed by the constraint description in brackets (for a constraint) or a group name followed by the '(group)' suffix (for a constraint group).

The user can modify the content of the Constraints List by pressing the following buttons:

  • Add Constraint - appends one or more constraints to the end of the list. The user is shown a list of all Value Constraints in the system, including the ones which are already in the list. In this way one constraint can be included more than once (if it is needed for certain processing);
  • Add Group - appends one or more constraint groups to the end of the list. The user has to choose from a tree of all constraint groups in the system using Value Constraints Groups Hierarchy dialog;
  • Remove - excludes the selected entries from the list (constraints or groups of constraints). The exclusion is NOT preceded by a warning message.

This tool supports two modes of application: to the current document and Multiple Apply mode. For details see Tool Application Modes. If the tool is run in Multiple Apply mode there is one significant difference in the application: if a constraint uses Insertion Mode, a real insertion is performed only if there is exactly one possible source value. If there are several choices, human intervention is needed, which Multiple Apply mode does not allow.

The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Number Constraints

This constraint type restricts numeric values related to nodes and their properties within XML documents. Such values can be the number of occurrences of some specific elements within the content, or values returned by XPath functions or operators. The target constrained values are produced from the evaluation of an XPath expression. This XPath is evaluated against the result from the evaluation of a Context XPath expression, which determines the nodes the constraint will be applied to. For each initially selected context node, one XPath result is produced independently. Depending on the result, a numeric value is formed as follows:

  • node-set - the number of the nodes;
  • string - if the string represents a valid number, the new value is this value. Otherwise the Not-A-Number identifier is produced;
  • number - the number itself;
  • boolean - if it is a true value - 1, otherwise 0.

Note: if the newly produced value is the Not-A-Number value then the corresponding context node does not satisfy the constraint.

A context node satisfies a constraint if the result numeric value of the XPath is within the range of the Minimum Size and Maximum Size values. The latter two values can be either numbers or XPath expressions which are expected to return numeric values. If an XPath returns a non-numeric value, the system tries to transform it automatically into a number. In case the Minimum Size value for a constraint is not defined, or the defined value produces a Not-A-Number value, the system assumes that there is no lower limitation and all target values which are under the corresponding maximum satisfy the constraint. By analogy, if a Maximum Size value is omitted or it is Not-A-Number, then no upper limitation is assumed. If an XPath expression is used for setting the minimal or maximal limit, its context for evaluation is each initially selected context node. In this way the boundaries of one constraint can vary for different contexts.
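
For example, a constraint requiring every sentence element to contain between one and forty word elements (the s and w tag names are only illustrative) could be defined with the following settings:

   Context XPath:  //s
   XPath:          count(w)
   Minimum Size:   1
   Maximum Size:   40

For each s node the expression count(w) is evaluated, and the node satisfies the constraint only if the resulting number lies within the given range.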

The Number Constraint Manager dialog:

In the example above, the fourth constraint has no upper limit. The fourth column (Use It?) is responsible for the activation/deactivation of the constraints. It becomes necessary when the user would like to apply only a certain subset of all the constraints. Applying the (active) constraints can be done by pressing the Apply button. This button is disabled when there is no document in the editor. After applying the constraints, the user receives information about each applied constraint and the number of the satisfied nodes (contexts) as well as the non-satisfied ones. In the picture below there is an example result dialog:

Here the user can navigate through all nodes satisfying or not satisfying a certain constraint. This can be done by selecting a row in the result info table and using the Details button. The user is shown a small navigation dialog which allows successive traversal of the nodes in forward or reverse direction. Below is a picture of the navigation dialog from the preceding example:

The dialog contains several sections:

  • Nodes - in this section the navigation can be swapped between navigating satisfying (valid) and non-satisfying (non-valid) nodes. For each class of nodes the corresponding nodes count is given in brackets.
  • Counts - in this panel the user receives dynamic information about the currently selected node (valid or non-valid). First the permitted Range is shown, followed by the Real Count for the given location.
  • Info - here is the information about the current constraint: the XPath expression used for selection of the Context nodes and the target XPath for calculating the constrained data.
  • Navi panel - here the user performs the movement from a context node to the Next or the Previous one. If during navigation the user has to modify the target nodes, s/he can use the Search & Edit button. When pressed, it closes the navigation window. The system will memorize all the nodes from the current class (valid or non-valid) and will give the possibility of resuming the navigation (only in this class) by using the Next and Previous buttons of the XPath Search on the toolbar of the main editor. In this way, if a modification is needed for all nodes from one class (usually when correcting specific errors), the user does not have to apply the same constraint many times to find all representatives of the class. During this navigation, in order for the system to indicate that the target nodes are not the result of an XPath search query, a service message is shown in the Search field on the toolbar. For example:

    The node members of the class memorized in this way are lost after applying an XPath Search query or after applying another Number Constraint in this way.


XSLT

Apply XSLT

This function applies an XSL Transformation either to the current document in the system editor or to a set of selected internal documents. If the transformation is applied to the current document, the resulting XML document is loaded automatically into the system. If the applied transformation does not produce any result, a warning message 'No result produced!' appears. Otherwise the user is asked to supply a DTD for the result document. If the transformation is applied to a set of internal documents, the user has to specify a DTD and result document names in advance.

Another non-traditional application of XSL Transformations is the so-called Local Transformations. Their application is performed in the following way: a set of nodes is extracted from a document (current or internal); for each of them an XSLT is performed independently and the result (if any) is incorporated back into the original location the extract was taken from (see the sketch after the list below). Thus, no new result document is created but the original document is modified. The nodes to be transformed are selected either by an XPath expression or by direct selection in the tree. The result of each transformation application is a Document Fragment (DOM) which substitutes the context node for which it is produced. The context node is removed from the tree and all sub-elements of the fragment are inserted at the position of the context. The application of the transformation is followed by a result information message. It contains four pieces of information:

  1. The number of nodes selected by the XPath expression as contexts;
  2. The number of context nodes which have been replaced by a result fragment;
  3. The number of context nodes to which the transformation did not produce any result;
  4. The number of context nodes which have been lost during the application of a preceding transformation. This can happen when a node and its descendant node are both selected as contexts and the transformation has succeeded on the parent node.
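
A minimal sketch of such a Local Transformation, assuming hypothetical w, tok, form and ana names, is a stylesheet that rewrites every selected w element as a tok element:

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- rewrite a w element as a tok element, keeping its text
        and turning the ana attribute into a child element -->
   <xsl:template match="w">
      <tok>
         <form><xsl:value-of select="."/></form>
         <ana><xsl:value-of select="@ana"/></ana>
      </tok>
   </xsl:template>
</xsl:stylesheet>

Applied as a Local Transformation with the selecting XPath //w, each w node is replaced in place by the tok fragment produced for it.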

If no transformations are available in the system, a warning message appears. The user can apply an XSL Transformation by means of the Multiple Apply module, or save the current settings for further use via the Queries module.

For details about the management of the transformations see module XSLT Manager.

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

XSLT Manager

This component is responsible for the management of all XSL Transformations in the CLaRK System. Although the documents containing XSLT are regular well-formed XML documents, they are treated as a separate class of read-only documents. They are stored in a special separate list and only tools using XSLT can access it. The acceptable operations on the list are adding, removing and overwriting a list entry (XSL Transformation document).

Here is the dialog window of the manager:

Buttons:

  • Add New - Opens an Internal Documents Manager window, allowing the user to select one or more documents containing XSL Transformations to add to the list of all transformations. Each document is tested for validity using the XSLT Validator module. If a document is not a valid XSLT, it is not included in the list and a message describing the error appears. Once included in the list, this XSL Transformation can be used in all system tools which deal with XSLT.
  • Add Current - Adds the current document of the system editor to the list of all available XSL Transformations. The document is cloned and thus further editing will not affect the transformation. If the transformation needs modification, it has to be extracted from the manager with the Open In Editor button, edited, and then added again.
  • Remove - Removes the selected transformation(s) from the list of all XSL Transformations. The transformation data is lost. The removal is preceded by a confirmation message. Multiple selection is allowed here.
  • Open In Editor - Loads the selected transformation(s) from the list into the system editor as XML documents;
  • Apply - Applies the selected XSL Transformation to the current document in the editor. Here only a single selection is allowed. If there is no current document this button is disabled.
  • Close - Closes the XSLT Manager dialog window. All changes on the list of transformations (addition, removal) are updated.

Validate XSLT

This option validates whether the current document in the editor can be used as an XSL Transformation. It can be used when a new transformation document is created (or imported) in the editor. The XSLT Validator checks the content and if there are no errors, an information message is shown. Otherwise, an error message appears and the location of the error in the document is pointed to.


Concordance

The Concordance is a system tool for information extraction. It allows the extraction of certain units (words, phrases, etc.) within bigger units (sentences, paragraphs, etc.). The extracted results are shown in a table, one result per row. The searched items and the left and right contexts are shown in separate columns. The tool is implemented on the basis of the XPath engine, the regular grammar engine and a sorting module.

The field at the top of the dialog (Define Context) is used for defining the context nodes within which the extraction will be done. The user is expected to supply an XPath expression which after evaluation returns a node set. The context for evaluation is the root node of the document. The user can perform two types of concordance extraction: grammar based and XPath based. They will be described in detail in the following paragraphs.

The result from the concordance is stored in an XML document and for convenience it is shown in a table. The structure of the XML document is a sequence of <L> elements standing for the found items (lines). Each concordance line has the following XML structure:

<L>
   <LC> the left context </LC>
   <I> the data we are searching for </I>
   <RC> the right context </RC>
   <!-- user commentary -->
</L>

and the corresponding table representation:

When the user sees a result from the concordance in a table, s/he must always have in mind that there is an XML structure behind the dialog window, especially when the table rows have to be sorted.

Grammar Concordance Search

This type of concordance uses regular expression patterns for searching for items within other items. The patterns are defined as grammars in the Grammar tool. The items which match the patterns (tokens, XML tags or mixed) are shown in the result table (document), accompanied by the context in which they appear. Initially the context is determined by the XPath expression mentioned above. Further restrictions on the context can be set by another grammar pattern, i.e. the target items will be extracted only from items matched by another grammar pattern as contexts, discarding everything else.
If the Text only (The mark-up will be ignored)? option is selected, the concordance engine will ignore the mark-up inside the initially selected contexts while checking. Here follows an example of how the mark-up can be ignored during data extraction. Let us have the following simple XML document as a source of extraction:
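
One possible shape for such a document (only the TEXT and verb tags below are taken from the description; the remaining mark-up is illustrative) is:

<TEXT>
   <noun>John</noun>
   <verb>loves</verb>
   <noun>Mary</noun>
</TEXT>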

If the target item of extraction is the word loves within the context of the TEXT element, the grammar pattern must only describe the word itself, without specifying that it appears in any mark-up (in this case, in the content of the verb tag). Here is the result from the query 'loves' within the context of 'TEXT':

If the target of interest is the sequence John loves Mary within the context of the TEXT element, the query pattern will be: "John","loves","Mary". Although these three words appear in different XML structures, after filtering the mark-up the result will be:

Here is how the Concordance dialog window appears in a configuration set for a Grammar concordance search:

The Concordance dialog offers three sub-dialogs for setting a grammar search query corresponding to three levels of complexity, each supplying different sets of options. Each of the three dialogs is accessible by choosing the corresponding item from the Usage Mode panel. The possible items are:

  • Simplified

    This mode of usage offers a very basic set of options, which is convenient for relatively simple search queries. The user is expected to supply a Query String which must be a regular expression (the same syntax as in the Grammar tool). For text preprocessing the user can specify a tokenizer, a filter and a normalization.

  • Normal

    In this mode the user can use a previously defined grammar from the Grammar tool. Similarly to the grammar application the user here can define Element Values for performing a more flexible search. For text preprocessing the user can specify a tokenizer, a filter and normalization.

  • Queries

    Here, the user does not specify anything directly related to the search process. The thing which is needed in advance is (at least) one grammar query (see the Apply Grammar tool description) which in turn requires a compiled grammar. In this dialog the user just points to the Search Query to be applied. Additionally a restriction on the context can be set by selecting a Restriction Query which determines the context for the Search Query. In this case the context is formed by the output of the restriction query; nodes initially selected by the XPath expression for which the restriction grammar does not match anything are excluded from further processing.

XPath Concordance Search

Searching for items in this mode of concordance extraction is based on XPath queries within the initially selected context nodes. The content which will be shown in the Item column is the result from the evaluation of the XPath expression from the field Search Elements. If for a context the returned result contains more than one node, each node will appear in a separate row in the table. For each single result node the XPath expressions from Left Context and Right Context are evaluated to form the content of the corresponding table cells for the contexts. If either of these two fields does not contain an XPath expression, the corresponding table cell will show the whole content before/after the found item.
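
For instance, with hypothetical s and w elements, where verbs carry ana="V", a query extracting all verbs together with their sentence-internal context could be set as:

   Define Context:   //s
   Search Elements:  w[@ana='V']
   Left Context:     preceding-sibling::node()
   Right Context:    following-sibling::node()

For every s element, each matching w child appears in a separate row of the table; its preceding and following siblings fill the Left Context and Right Context cells.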

Here is the Concordance dialog window in a configuration set for XPath concordance search:

Concordance Options

The options which appear in both modes of extraction are listed below (an example result line follows the list):

  • Text only(The mark-up will be ignored)?
    This option works only in mode Grammar Concordance Search. As it was described before, this option filters the XML tags from the input data for the concordance search engine and leaves plain text.
  • Add number attribute ?
    This option enables adding a number attribute to each result tag <L>, with a value enumerating the result item.
  • Add source attribute ?
    This option enables adding a source attribute to each result tag <L>, with a value showing the source document the extraction was taken from.
  • Add path attribute ?
    This option enables adding an attribute path to each single result tag <L> with a value, which is an (abbreviated) XPath expression showing the absolute address of the corresponding located item in the original source document.
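
With all three attribute options enabled, a single result line could look as follows (the attribute values are hypothetical):

<L number="1" source="novel.xml" path="/TEXT[1]/verb[1]">
   <LC>John </LC>
   <I>loves</I>
   <RC> Mary</RC>
</L>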


Table View

The Table View tool is created to represent the information extracted from the concordance tool in a more readable table form. Each line of the table represents one line of the concordance result.

If the user wants to use this feature, s/he has to open an XML document which is produced as a result from the Concordance tool. When the Table View menu item is selected, the system tries to detect the required structure in the document currently opened in the system editor. In case of failure, the system produces an appropriate error message. Otherwise, the document content is shown in a table (the picture below). The required document structure of the input for the Table View is described in the Concordance tool.

The data in the "Context" columns does not represent the whole context but only the amount of data that fits in the column width; initially this is only 30 symbols. To increase the context, the user should press the settings button and specify the context length in symbols there. The user can also set the width of the comment column. If the user wants to see the context without expanding the column data, s/he can do so by right-clicking on the "Left Context" or "Right Context" column. If the user wants to add some commentary to a concordance line, s/he can do so by filling in a value in the "Comment" column or by right-clicking a row in the "Item" column. To navigate faster through the table the user can rely on the combo box at the top for accessing a row.

The user can also sort the lines of the table. To do so, s/he must select which column each sort key applies to (i.e. which element of the concordance line - LC, I, RC or the comment - will serve as its context). If no column is selected, the key is evaluated with the whole line element as its context.

A useful option of the Table View is "Edit Layout". It allows the user to filter the tags that are shown in the table. For example, if the POS information is kept in a separate tag, the user can hide it in order to view only the text.


Extract

The task of the Extract tool is to extract nodes from a document or from multiple documents and to save them as a new document. The document data extraction is based on XPath expressions. The text field at the top of the dialog is used for defining an XPath expression which selects the elements in the document(s). The context node for this evaluation is the root node of the document(s). The result from the extraction is an XML document in which all extracted nodes are children of the root element (this element is named "Extract" by the system).

The Include subtree option allows the extraction not only of the selected nodes but the entire subtrees below them as well.

The Create result tag option allows each extracted node to have a parent element which is used to separate the different results. For example, if we extract only text nodes, then in the new document all the text nodes will be concatenated. If Create result tag is selected, a parent element is added for each result node. The name of the parent element is taken from the corresponding text field.

If the Create source attribute option is selected, then the extract tool adds an attribute with the source document name either to the auxiliary element or to the root element of each result. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.

If Create path attribute option is selected, then the extract tool adds to each result structure an attribute with a value which is an XPath expression expressing the location of the result in the original source document. In other words if this XPath expression is evaluated on the source document, the result will be exactly the extracted result node. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.

If Create number attribute option is selected then the extract tool adds an attribute with the extract result number to the auxiliary node or to the root element. The name of the attribute is taken from the corresponding text field. In case the result is not an element-rooted structure and no auxiliary element is added, this option does not change anything.
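
For instance, extracting the text of every title element (XPath //title/text()) from a document book1.xml (all names here are illustrative), with Create result tag set to result, Create source attribute set to src and Create number attribute set to n, could produce a document along these lines:

<Extract>
   <result src="book1.xml" n="1">First title</result>
   <result src="book1.xml" n="2">Second title</result>
</Extract>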

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.


Sort

The sort tool is used for reordering nodes in the tree representation of a document. The sort operation changes the order only for nodes which are siblings, i.e. nodes with the same parent. If the sort is applied on a set of nodes with different parent nodes, the nodes will be sorted only in the scope of their parents. The node selection for sorting and the sort criteria (Sort Keys) are expressed as XPath expressions.

For sorting the user has to specify the following two things:

  1. The target nodes for sorting.
  2. The keys for each node.

The first is done by defining an XPath expression in the Select Elements field. If the field is empty, the sort tool will show an error message. As context node the XPath engine takes the node selected in the tree panel of the system, or the root elements of the internal documents. The sort tool compares only element nodes which have a common parent. The sort tool splits the result returned from the XPath evaluation into groups according to the parent node. Each group is sorted separately.

Keys are calculated for every node the user wants to sort. Each row in the table represents one key. The sort tool compares two nodes key by key. The key is the list of nodes returned from the XPath engine after evaluating the expression defined in the column Key of the table. The context node in this evaluation coincides with the node for which the user wants to create the key. The other columns of the table represent settings used in the list comparison. The lists are compared node by node as follows.

  • If the nodes are both elements then the sort tool asks the DTD which one is defined to be smaller (Element Features/Sort Values).
  • If the nodes are both text nodes they are compared by their textual content.
  • The attribute nodes are compared by the textual content of their values only if they have the same name and their parents are elements with the same name.
  • The textual content (text) of text and attribute nodes is compared in the following way:
    1. The text is compared symbol by symbol.
    2. If the user chooses a tokenizer, then the symbols are compared with respect to the tokens created by the primitive tokenizer of the selected tokenizer (a tokenizer that is an ancestor of the selected tokenizer and is primitive; if the selected tokenizer is itself primitive, it is used directly for tokenization). The symbols are compared with respect to their token category (the order of the categories in primitive tokenizers) and by their position in the definition of the token category value. If the normalization option is selected, the sort engine will use the primitive tokenizer's normalization table to determine each symbol's token category and value.
    3. If the user selects [No Tokenizer], the sort tool will use the Unicode table to compare symbols. In this case the normalization option means converting capital letters to small letters for Cyrillic and Latin.
    4. If the user selects the Reverse option for the key, the text will be reversed before the comparison ("erga" => "agre").
    5. If the user selects the Trim option for the key, the text will be cleared from leading and trailing whitespace characters (TAB, SPACE, LF, CR, etc.) before comparing.
    6. If the user selects the Number option for the key, the text will be converted into numbers and compared by their numeric value.
  • If the current nodes are not of the same type, then the following order is relevant: attribute < text < element.
  • If a key value for an element contains more nodes than a key value for another element, then the first one is assumed to be smaller. This assumption is made when all nodes of the smaller key value are equal to the corresponding nodes of the bigger key.

For each key the user can define different order ( Ascending | Descending ). The order of the keys in the table is very important because this is the order in which they will be used. If two keys have equal nodes but one of them has additional elements, then the one with the smaller number of nodes is considered smaller.

The difference between the DTD sort and the Advanced one is that the sort tool takes the tokenizer and the number option from the DTD (Element Features, Attribute Features). For attribute nodes the sort tool also takes from the DTD the order of enumeration values.

Examples:

  • Example 1: Sorting a book by pages and title. The elements to sort are the book children of the context node. They will be sorted by the content of their pages element and title element. Key 1 is the text in the pages element of the book. It will be trimmed and converted to a number when sorted. In this key we do not need a tokenizer because the whole node will be converted to a number. If two elements are equal according to the first key (two books have the same number of pages), then they are compared with respect to the second key. Key 2 is the text in the title element of the book. It will be trimmed and normalized when sorted. For normalization the sort tool will use the normalization defined in the Mixed Word tokenizer. The order of this key is descending. This means that this key will sort the books by title in reverse order.
  • Example 2: Sorting TEI divisions by their heads. The sort tool takes all divisions in the document and sorts them according to the text in their head element. If a division does not have a head element then it will be assumed as smaller.

Example 1

Example 2
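
A fragment matching Example 1, together with the corresponding settings (the library wrapper and the values are hypothetical), could be:

<library>
   <book><title>Beta</title><pages>120</pages></book>
   <book><title>Alpha</title><pages>95</pages></book>
</library>

   Select Elements:  book
   Key 1:            pages/text()   (Trim, Number, Ascending)
   Key 2:            title/text()   (Trim, Normalization, Descending)

After sorting, the book with 95 pages comes before the one with 120 pages; the title key is consulted only when two books have the same number of pages.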

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.


XPath Insert ...

XPath Insert Attribute

This tool (Fig. 1) gives the possibility to set a certain attribute on nodes selected by an XPath expression. The user specifies an Attribute Name and Value. Additionally, s/he can tune the tool to set attributes only on nodes which do not have such an attribute yet. If the checkbox Skip Existing Attributes is unselected, the tool will set the given attribute with the given value on each element node returned by the XPath. Otherwise, it will skip all element nodes which already have this attribute and thus preserve their original values. If the result from the evaluation of the XPath expression includes nodes other than Element nodes, they are ignored during processing.
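
A minimal setting (the element and attribute names are illustrative) could be:

   XPath:           //p
   Attribute Name:  lang
   Value:           bg

Every p element without a lang attribute then becomes <p lang="bg">...</p>; with Skip Existing Attributes selected, paragraphs that already carry a lang attribute keep their original values.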

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 1

XPath Insert Child

This tool (Fig. 2) allows the user to insert certain child nodes in the content of other Element nodes. The target nodes which the insertion will be applied to are selected by an XPath expression. The result from the evaluation of the XPath expression must be a list of Element nodes; all other types of nodes are discarded. This tool can insert two types of child nodes: Element and Text nodes. If the new children are of type Element, the tool expects the user to supply a valid tag name. Otherwise, i.e. when the new children are Text nodes, the tool accepts any non-empty textual data. The user can also set the position at which the new children will appear in their parents' content. Here the counting starts from 0, i.e. the first child is denoted by 0, the second by 1, etc. If the position field remains empty, the new nodes will be appended to the target nodes' content. Any non-numerical data in the position field will produce an error.
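
For example (the element names are illustrative):

   XPath:      //entry
   Node type:  Element
   Tag name:   note
   Position:   0

A new empty note element is inserted as the first child of every entry element; leaving the Position field empty would append it after the last child instead.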

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 2

XPath Insert Parent

This tool (Fig. 3) enables the insertion of parent Element nodes for nodes selected by an XPath expression. The target nodes selected by the XPath expression can be either Element or Text nodes; any other types of nodes are discarded from the selection. The tool expects the user to specify a valid tag name for the new parent nodes.
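
For example (the element names are illustrative), wrapping the text of every title element in a new hi element:

   XPath:     //title/text()
   Tag name:  hi

With these settings <title>Alpha</title> becomes <title><hi>Alpha</hi></title>.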

The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 3

XPath Insert Sibling

This tool (Fig. 4) allows inserting sibling nodes of nodes selected by an XPath expression. The selected target nodes can be of any kind except Attribute nodes. If the root node is selected, it is discarded during processing. The new nodes for insertion can be of type Element or Text. If the new nodes are of type Element, the tool expects the user to supply a valid tag name. Otherwise, i.e. when the new siblings are Text nodes, the tool accepts any non-empty textual data. The user can also set the position where the new sibling nodes will appear. The options are: previous (a sibling preceding the target node) and next (a sibling following the target node).

The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 4


XPath Remove

This tool (Fig. 5) gives the possibility of removing parts from XML documents selected by an XPath expression. The target selection can list all types of nodes (including attributes). The only node which cannot be removed is the root of the document and if it is included in the selection it is discarded during the processing time. If a root node is detected in a selection, a warning message is shown. The removal can be done in two modes: either removing the selected nodes and their content (when Delete subtree is selected) or removing only the nodes without their content. In the latter case, the content of the deleted nodes is inserted in the content of their parent(s), in the places where the deletion was performed. The attribute nodes are not considered as content of the Element nodes they belong to, so they are removed in both cases.
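
For example (the element names and content are illustrative), removing note elements with the XPath //note from

   <p>See <note>editor's remark</note> here.</p>

leaves <p>See editor's remark here.</p> when Delete subtree is not selected (only the note tags are removed), and removes the remark as well when Delete subtree is selected; whether a space remains in its place depends on the Before/After settings described below.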

The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 5

This operation uses the "Before" and "After" columns from the Element Features to determine whether to insert a space symbol before and after the deleted element.

XPath Rename


This tool (Fig. 6) allows the user to rename Element nodes selected by an XPath expression in a document. The user is expected to supply a valid New name. The selected nodes are renamed without changing their attributes and content. If the selection contains nodes of a type different from Element, a warning message is shown and these nodes are discarded from further processing.

The tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Fig. 6


XPath Transformations

This is a tool for applying various transformations over a document or documents. It is specified by two main sets of nodes - source and target, and other features, which are described below. The target nodes are nodes over which information will be added. The source nodes are nodes which give the information which will be added. Here is the main window of the XPath Transformation:

There are three modes which specify which documents are related with the source and target fields and where the result will be saved. The modes are: Local Source, External Source and Distributed Source.

  • Local Source

    In this mode the Source and the Target nodes are from the same document.
    If the Multiple Apply check box is not selected, the target and the source nodes are related to the document currently open in the system.
    If the check box is selected, the target and the source nodes are related to the documents in the Input column of the Internal Documents table. The result for each document is saved as a document with the name given in the relevant row in the Result column of the same table.

  • External Source

    There is a table with one column, which specifies the source documents. The source nodes are related to it.
    If the Multiple Apply check box is not selected, the target nodes are related to the document currently open in the system.
    If the check box is selected, the target nodes are related to the documents from the Input column of the Internal Documents table. The result for each document is saved as a document with the name given in the relevant row in the Result column of the same table.

  • Distributed Source

    There is a table with two columns - Source and Target - which specifies the source and the target documents. The user can handle the Source column by the buttons on the right side of the table and the Target column by the buttons in the Internal Documents field. When the user adds a document in the Input column, this document is added in the Target column of the Source/Target table. The number of source and target documents has to be the same.

The rest of the features of the XPath Transformations dialog are listed below (a short configuration example follows the list):

  • Target

    An XPath expression defining the target list of nodes, i.e. the nodes where the source will be included.

  • as a parent

    The nodes from the source become parents (ancestors) of the target nodes. The system requires Element nodes for source and Element and Text nodes for target.

  • as a child

    The nodes from the source become children of the target nodes in the position specified in the at position field. The system requires zero or a positive integer for the position, non-Attribute nodes for source and Element nodes for target. If the returned value as a source is a number, a string or a boolean value, it is treated as a text node.

  • as a sibling

    The nodes from the source become siblings of the target nodes, in a position before or after a target node depending on the at offset field. The system requires non-Attribute nodes for source and Element nodes for target. If the returned value as a source node is a number, a string or a boolean value, it is treated as a text node.

  • as attribute

    The nodes from the source become attributes of the target nodes with the name specified in the with name field. The system requires non-Element nodes for source and Element nodes for target.

  • Relative to Source

    This check box is used only when the source is treated as an XPath expression (XML check box is not selected).
    When this check box is not selected, the target XPath is evaluated from the root of the target document.
    When this check box is selected, the target XPath is evaluated for every node from the source as a context. As a result there is a list of nodes for each node in the source.

  • Source XPath/XML

    This field specifies the source nodes. They can be either nodes returned by an XPath expression (evaluated on a specified document) or nodes specified by an XML fragment. Whether the source is treated as an XPath expression or as an XML fragment is specified by the XML checkbox.

  • All nodes from the source list will be processed for each target node.

  • Each node from the source list will be processed for each target node.

  • Equals

    If this check box is selected and there is a difference between the number of source and target nodes, the system reports an error.

  • Copy

    If this button is selected, the source nodes are copied to the target nodes in a way specified by the tool fields.

  • Move

    If this button is selected, after performing an operation for a node from the source list, the tool removes the node from the source location.

  • Include subtree

    If this check box is selected, then the source list will contain for each selected node the entire subtree. If it is not selected, then only the local information for each node is put in the source list. The local information includes the tag name and the set of attributes as well as their values. When only a node with the local information is chosen and it has to be removed, then its children are inserted as immediate children of its parent. The insertion is made in the position of the deleted node.

  • XML

    By this check box the treatment of the Source XPath/XML field is controlled.
    If it is selected, then the source is treated as XML markup data. If the XML markup data does not contain tags, then it is treated as text.
    If the check box is not selected, then the source is treated as an XPath expression.
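
As an illustration (the document and element names here are hypothetical), copying dictionary entries from an external document into the current one could be configured as:

   Mode:              External Source (source document: lexicon.xml)
   Source XPath/XML:  //entry          (XML check box not selected)
   Target:            //dictionary
   Placement:         as a child, at position 0
   Operation:         Copy, with Include subtree selected

Every entry subtree found in lexicon.xml is then copied into the content of each dictionary element of the current document, starting at position 0.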


Statistics

The Statistics tool is used for counting the number of node and/or token occurrences in XML document(s). The items to be counted are initially selected by an XPath expression (field Select (XPath)). The selection returned by the XPath evaluation is a node set. At this point the Value Keys defined by the user are taken into consideration. Each key contains an XPath expression which is meant to point out the essential properties of the selected nodes. The value keys are similar to the ones in the Sort tool. For each node of the initial selection the values from the Value Keys are calculated independently. If the calculated values for two nodes are the same, the nodes are assumed to belong to the same class. In this way each of the selected nodes is classified into one class. If the statistics has to be applied not only to XML nodes but also to tokens, the user must select a tokenizer from the Choose Tokenizer field. In this way the text nodes will be segmented into meaningful tokens. In addition, the user can filter the tokens by category in order to receive information only for certain types of tokens (using the Customize button). Only tokens whose categories are in the list will be counted; all the rest will be discarded. If no tokenizer is selected, each text node will be processed as a whole.
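
For example (the w tag and the ana attribute are purely illustrative), counting annotated words by their tags could be set up as:

   Select (XPath):  //w
   Value Key:       @ana

All w elements sharing the same ana value form one class, and the result table contains one row per distinct ana value, with its number of occurrences and percentage.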

The result from the statistics application is a list of all classes formed by the selected nodes. The information which is kept for each class is:

  • Searched Item - the item found by the selection (tag name or token);
  • Item Category - the category of the search item: if the item is a token - its token category, otherwise - <Element>;
  • Number of occurrences - the number of items from the selection which belong to the class;
  • Percentage - the percentage of items belonging to the class, relative to all selected items;
  • Keys Value - a string representation of the value(s) for the class.

This tool supports two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes. The user can also save or load the current tool settings, i.e. XML Tool Queries are supported here.

Statistics on Current Document: The result of this type of statistics is shown as a table below.

The "Category" column contains categories from the filter which exist in the chosen text nodes, or "<Element>" if the row represents an element node, and "#text" if the node is a text node.

The "Element" column contains the tokenized text (the value of the filtered tokens) or node names.

The "#" column contains the number of occurrences of the corresponding item.

The "%" column contains information about the percentage of the corresponding item.

The "Key Value" column contains the value of the sort keys created for the corresponding node, or nothing if the line contains a token.

When closing the table, the user can choose from the following options:

  • to save the result of the statistics into XML format, using the DTD definition below.

  • to open the result in the system.
Statistics on Multiple Apply:
  • the result is preserved in XML format and the document DTD has the following structure:

<!ELEMENT statistics (documents, item+, all)>
<!ELEMENT documents (document+)>
<!ELEMENT document (#PCDATA)>
<!ELEMENT item (category?, element, number?, percent?, keyvalue?)>
<!ELEMENT category (#PCDATA)>
<!ELEMENT element (#PCDATA)>
<!ELEMENT number (#PCDATA)>
<!ELEMENT percent (#PCDATA)>
<!ELEMENT keyvalue (#PCDATA)>
<!ELEMENT all (number, percent)>

The documents tag is a list of the documents selected for statistics, where each document name appears in a document tag; the item tag corresponds to a line from the result table as follows:

  • category tag corresponds to the "Category" column
  • element tag corresponds to the "Element" column
  • number tag corresponds to the "#" column
  • percent tag corresponds to the "%" column
  • keyvalue tag corresponds to the "Key Value" column

The all tag corresponds to the last row of the result table. It contains the total number of occurrences of the selected elements and the percentage information.
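
A small result document following this DTD might look as follows (all values here are illustrative):

<statistics>
   <documents>
      <document>novel.xml</document>
   </documents>
   <item>
      <category>CYRw</category>
      <element>дума</element>
      <number>12</number>
      <percent>3.4</percent>
   </item>
   <all>
      <number>350</number>
      <percent>100</percent>
   </all>
</statistics>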

Filtering XML Result Data

In many cases not all the information from the Statistics needs to be saved. Sometimes the result data is too large and its further processing is difficult. In such cases the Output Info options can be used for specifying which information should be kept and which should be removed from the result document. The options are as follows:

Only the selected items will have a representation in the result XML document.


Node Info

This tool gives information about the number of occurrences of specified tags or tokens in a set of internal documents. When the user starts this tool, s/he is asked to provide several things:

  • The type of information which is needed. The two possibilities are: counting tags and counting tokens.
  • The documents for which the information is needed.
  • An XPath expression which selects the nodes in each document for being counted.

Here is a screen-shot of the initial dialog window:

The main components of the dialog window are:

  • XPath Field - selects the nodes in each document for which the counting will be performed. If some tags are counted, then for each node from the selection of this XPath expression, its descending nodes will be counted. If some tokens are counted, then for each text node from the selection its text content will be tokenized and the result tokens will be counted.
  • Tokenizer Selector - determines which tokenizer will be used when tokenizing the text nodes for token counting. This component is disabled in case of tag counting.
  • Info Type Selector - determines the type of elements, which will be counted. The options are: "Word Info" - for token counting and "Tag Info" - for tag counting.
  • Document Selector - this component is responsible for selecting documents from the internal document database, which the counting will be applied to. This is a universal component of the CLaRK system. For more information see Document Selector in menu File.
  • Show Info button - starts calculating the information for the selected documents.
  • Cancel button - closes the window and cancels further processing.

If the Show Info button is pressed, the system starts to process the selected documents one by one. While processing the documents, the status bar of the system shows the current process state. Having processed all the selected documents, the system shows the result in a new window. Here are two example results, one for Word Info and one for Tag Info:

  • Word Info

    The first column Document contains the names of the documents chosen from the first dialog.
    The second column Category contains the categories from the tokenizer which the user has already chosen.
    The third column # contains the number of the occurrences of each category in the text.

    The content of the table can be saved in a file - if the Save in file checkbox is selected. The syntax of the result file content is XML. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.

    If Add information checkbox is selected then the relevant information will be added to each of the documents. The word information added to a document has the following form:

    <extent>
      <interpGrp>
        <interp type="LATw" value="41"></interp>
        <interp type="CYRw" value="5848"></interp>
        <interp type="NUMBER" value="181"></interp>
      </interpGrp>
    </extent>

    If the DTD for a document is TEI, then <extent> is added in the appropriate position. Otherwise, <extent> is added after the first node.

  • Tag Info

    The first column Document contains the names of the documents chosen from the first dialog.
    The second column Tag contains the tag names of all the nodes which the user has chosen with the XPath expression from the first dialog.
    The third column # contains the number of the occurrences of each tag in the documents.

    The content of the table can be saved in a file - if Save in file checkbox is selected. The syntax of the result file content is XML. When the user presses the OK button, s/he will be asked to supply a file name and a directory with a standard file chooser.

    If Add information checkbox is selected, then the relevant information is added to each of the documents. The information added to each document has the following form:

    <encodingDesc>
      <tagsDecl>
        <tagUsage gi="aa" occurs="5355"></tagUsage>
        <tagUsage gi="hi" occurs="19"></tagUsage>
        <tagUsage gi="lb" occurs="5"></tagUsage>
        <tagUsage gi="p" occurs="90"></tagUsage>
        <tagUsage gi="ph" occurs="5355"></tagUsage>
        <tagUsage gi="pt" occurs="1144"></tagUsage>
        <tagUsage gi="s" occurs="352"></tagUsage>
        <tagUsage gi="ta" occurs="5355"></tagUsage>
        <tagUsage gi="tok" occurs="301"></tagUsage>
        <tagUsage gi="w" occurs="5355"></tagUsage>
      </tagsDecl>
    </encodingDesc>

    If the DTD for a document is TEI, then <encodingDesc> is added in the appropriate position. Otherwise, it is added after the first node.


Text Replace

A default shortcut Ctrl+T
An icon on the text area frame toolbar

This tool is used for searching for patterns in the text and replacing (marking) them in an appropriate way.
The dialog has three fields (an example of a filled-in query follows the list):

  • Apply to (XPath) field is the restriction field. Here the user specifies an XPath expression that restricts the text nodes which the expression will be applied to. It is evaluated as a predicate. The tool processes only the text result nodes from the evaluation.
  • Search for (Expression) field is the search field. Here the user defines an expression to match parts of the data in the text nodes.
  • Replace with (Mark-up) field is the replace field. Here the user fills in plain text or XML mark-up which will replace the matched data. The field may remain empty; in this case the matched data will be removed from the text.
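
A minimal query of this kind, with illustrative values, could be:

   Apply to (XPath):  //p//text()
   Search for:        colour
   Replace with:      color

In Simple mode, every occurrence of "colour" in the text nodes under p elements is replaced by "color"; leaving the replace field empty would delete the matches instead.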

There are two modes of application:

  • Advanced

    If the search mode is advanced, all text nodes are tokenized by the tokenizer specified in the Settings panel. On top of them the tool executes the regular expression defined by the user. In the regular expression the user can write token values and token categories in the same way as in the Grammar tool (see Grammar - Regular Expressions). The user can apply the normalize option and filters.

  • Simple

    If the search mode is simple - the expression is taken as a whole string (Note that symbols for new paragraph, tabulation, etc. are not supported) and then searched in the text. The search can be case sensitive (Match case).

  • Multiple Apply - gives the user a possibility to replace the text in more than one document. For details see Tool Application Modes.
  • Queries - the user can save the current query in the system for further use ( XML Tool Queries).

Buttons:

  • Keep Undo - this option enables keeping undo information about the changes when the tool is applied on the current document and when necessary the previous document content can be restored. Disabling this option can be useful when large amounts of data are processed and saving memory and processing time is important. This option has no effect when the tool works in Multiple Apply mode.
  • Replace All - replaces all matches in the document with the text in the replace field.
  • Close - closes the dialog
  • Select - gives the user an opportunity to go through the text and mark or skip pattern matches:
    1. Next - finds the subsequent data that matches the expression.
    2. Previous - finds the previous data in the document that matches the expression.
    3. Mark - replaces the selected data with the text in the replace field - the user can change mark-up.
    4. Exit - returns to the Text Replace dialog.


MultiQuery Tool

This tool is designed to call other tools. It does not work directly with XML data. The tool uses a list of XML Tool Queries which are executed one by one in the order of their appearance. The result from each single tool application is an input for the next single tool application. The result from the last single tool application is the result from the MultiQuery Tool. The tools query list is represented in the manager window by a table in which each row keeps information for one query. Each row contains the Name of the query in the second column, which is its document name. The third column shows the Type of the query, i.e. the tool it represents. In the fourth column the Info data from each query is shown. This is optional text which can be saved for each query when it is created or updated. The Info input field is part of the Queries panel in the different tools.

The first column (Label) of the table contains important information about the Conditional Control Operators, or Controls for short, specific for this tool. These Controls allow changing the order of application of the different queries and/or conditional application of certain operations. For more information see the Controls description below.

Along with Queries and Controls the user can use conditional check points (Conditions) to verify that certain conditions are satisfied. These conditions determine whether the current processing procedure will proceed with the next step or (in case of failure) the decisions taken so far should be reconsidered in order to produce new intermediate results. The Conditions provide a backtracking mechanism which can be applied to Grammars and Value Constraints.

Here is what the MultiQuery Tool dialog looks like:

The operations which the user can perform during the creation of a list of queries are:

  • Add Query - adds one or more tool queries to the list. The user is shown a selection dialog where s/he can choose queries from the different system tool folders (groups).
  • Remove - removes the currently selected query/control/condition from the list (table). The removal is NOT preceded by a warning message.
  • Insert Control - inserts a new Control operator after the selected query or another Control in the table. For more details see the Controls definition below.
  • Edit Control - allows editing the currently selected Control operator in the table. For more details see the Controls definition below.
  • Insert Condition - inserts a new Condition after the selected row in the table. For more details see the Conditions definition below.
  • Edit Condition - allows editing the currently selected Condition in the table. For more details see the Conditions definition below.
  • Options - here several options concerning the whole application process are available:

    • Prompt before application - if selected, the system will ask the user for confirmation before each single tool application. Available only for the current document application mode.
    • Break on no result - if selected, the process of application will be stopped when a single tool application does not produce a result or does not change the document. Otherwise, the application will proceed to the end. This option is available only for the current document application mode.
    • Use garbage collection before processing - if selected, a garbage collection will be performed before each single tool application. The usage of this option reduces the system resources which are needed for the processing, and improves the efficiency of the next operations.
  • Reordering - this allows changing the order of the queries in the table. Reordering can be done by simply dragging the rows of the table up or down.

The tool also supports the two modes of application: on the current document and Multiple Apply mode. For details see Tool Application Modes.

The user can save or load the current tool settings, i.e. XML Tool Queries are supported here. This feature allows the creation of MultiQuery tool queries, which can themselves be included in other multi-queries. There is no limitation on the level of such inclusions. If a cyclic inclusion is detected (in the multi-query or in any of its sub-queries), the system produces an error.

Controls

The Control operator allows changing the order of application of the queries in the MultiQuery tool. The usual order of application starts from the first query and proceeds one by one up to the last one. Using Control operators, some queries can be applied only if certain conditions are true. Such conditions are: the true or false value of a result from an XPath evaluation; whether the preceding single tool application has or has not modified the working document; or unconditional (always succeeding). When the condition for a Control is true, the next query (or another Control) which will be applied is defined in the Control itself. Otherwise, the application proceeds with the next entry in the order (query or another Control). The Control operators address their targets (queries or Controls to be applied in case of success) by pointing to their labels. Each entry (row) in the table of the MultiQuery Tool can have a label (unique identifier) which can be referred to by Control operators. It is an error if a Control operator uses a target label which does not exist.

Each Control operator may consist of three parts:

  • Type - determines the type of the Control, i.e. the conditions for checking. There are several types of control:
    • IF (XPath) - the condition is an XPath expression. If the result from its evaluation on the current working document is: a non-empty list, a non-empty string, a non-negative number or a true boolean value the Control succeeds.
    • IF NOT (XPath) - the condition is an XPath expression. In contrast to the previous type, here, if the result from the XPath evaluation on the current working document is: an empty list, an empty string, a negative number or a false boolean value the Control succeeds.
    • IF CHANGED - the condition is the result from the previous single tool application. If the current working document has been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
    • IF NOT CHANGED - the condition is the result from the previous single tool application. If the current working document has NOT been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
    • GOTO - the condition is always satisfied. This is an unconditional movement to the target of the Control.
  • XPath (depends on the type) - an XPath expression is evaluated, the result of which determines the success of the control. This part is used in controls of type IF (XPath) and IF NOT (XPath).
  • Target - a label reference which points to the next location where the execution will continue in case of satisfied condition for the Control.

By using the Controls-labels technique, the user can model the well-known IF-THEN-ELSE and WHILE-CONDITION-DO structures in order to make the processing more flexible, as illustrated in the sketch below. The composition of different Controls allows the user to create various 'programs' or 'scripts' capable of doing certain jobs. It is up to the user to create efficient and reliable processing procedures.
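As an illustration, here is a minimal sketch of a WHILE-style loop followed by an IF-THEN-ELSE branch built from Controls (all labels, query names and the XPath expression are hypothetical):

    Label   Entry                           Comment
    loop    apply_rules (Grammar query)     the body of the loop
    -       IF CHANGED -> loop              repeat while the grammar still modifies the document
    -       IF (//w[not(@pos)]) -> tag      if untagged words remain, jump to label "tag"
    -       GOTO -> done                    otherwise skip the tagging step
    tag     tag_words (Grammar query)       adds the missing pos attributes
    done    normalize_markup (next query)   processing continues from here

When apply_rules no longer changes the working document, the IF CHANGED control fails and execution falls through to the IF (XPath) control, which either jumps to "tag" or, via the GOTO, skips it.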

Controls Editor

Here follows a description of the Control Operator Editor:

The editor design follows the Controls structure in three sections:

  • Type - a list of all types of Controls: IF, IF NOT, IF CHANGED, IF NOT CHANGED and GOTO.
  • XPath - an XPath expression field, active only for types IF and IF NOT.
  • Target - a list of all labels currently defined in the MultiQuery Tool table. Here, for convenience, the user can also enter target labels which have not been defined yet. At the end of the list there is one special label, <break>, which is used for suspending the processing procedure. In other words, if the condition of a Control whose target is <break> is satisfied, the system will not proceed with the application.

Conditions

The Conditions are operators which perform certain checks on the current working document. The conditions can cause the decisions taken so far to be reconsidered and new result documents to be produced. Different decisions for a certain document can be taken by the Grammar tool and the (Value) Constraints tool. The condition operators can be used ONLY in Multiple Apply mode of the tool. Usage on the current document is not allowed for efficiency reasons (multiple backtracking events can cause the system to work very slowly).

When a condition check fails, the system reconsiders the latest decision taken at a choice point and restores the working document to the state it had when that decision was taken. If a new decision can be taken there, the system proceeds with the application using it. Otherwise, the system continues searching backwards for another choice point where a new decision can be taken. If no solution for a condition is found, the system terminates the current application.

There are two types of Conditions which can be used:

  • XPath based - an XPath expression is evaluated on the working document and if the result is affirmative (a non-empty node-set, a non-empty string, a positive number or a true boolean value) the condition is satisfied (see the example after this list);
  • Value Constraint based - the condition specifies a (Value) Constraints query which contains constraints to be applied in validation mode on the working document. If one of the constraints is not satisfied, the whole condition fails.
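For example, an XPath based Condition requiring that every sentence contains at least one tagged word (the element and attribute names are hypothetical) could use the expression:

    count(//s[not(.//w[@pos])]) = 0

If some s element still contains no w element with a pos attribute, the expression evaluates to false, the Condition fails and the system backtracks to the most recent choice point left by a Grammar or (Value) Constraints application.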

Condition Editor

The Condition Editor gives the user the ability to create a new Condition or to modify an existing one. The editor dialog appears when the user selects Insert Condition or Edit Condition buttons. The editor window's layout is the following:

The first section of the dialog determines the condition type: XPath (the user specifies an XPath expression condition) or Constraints (the user selects a Constraint query).

The second section provides several common Options for the two types of conditions:

  • Enable 'CUT' operation for condition - this option (when selected) means that once the condition succeeds, no subsequent reconsideration of decisions taken before this point will be performed. This reduces the search space and the amount of memory required for the operation. The operation resembles the common 'cut' operation in many backtracking-based environments.
  • Save-on-success - this option allows saving the state of the working document when this condition succeeds. The document is saved under the name specified in the Name base field plus an additional suffix formed by an integer index in parentheses (name_base(1), name_base(2), ...). Each time the processing successfully passes through this condition, a new unique name is generated and the document is saved independently. If the Overwrite option is selected, each time a new name is generated the index increases by one and the older document with this name (if such exists) is overwritten. Additionally, the user can specify a location (Result group) where the result documents should be stored.

MultiQueryEx Tool

This tool is designed to call other tools. It does not work directly with XML data. The tool uses a list of XML Tool Queries which are executed one by one in the order of their appearance. The tool query list is represented in the manager window by a table in which each row holds the information for one query. The main difference with respect to the MultiQuery Tool is in the input/output management: the documents which will be processed and the result documents are contained in the queries themselves. The second column contains the Name of the query, which is its document name. The third column shows the Type of the query, i.e. the tool it represents. The fourth column shows the Info data of each query. This is optional text which can be saved for each query when it is created or updated. The Info input field is part of the Queries panel in the different tools.

The first column (Label) of the table contains important information about the Conditional Control Operators, or Controls for short, which are specific for this tool. These Controls allow changing the order of application of the different queries and/or conditional application of certain operations. For more information see the Controls description below.

Here is what the MultiQueryEx Tool dialog looks like:


The operations which the user can perform during the creation of a list of queries are:

  • Add Query - adds one or more tool queries to the list. The user is shown a selection dialog where s/he can choose queries from the different system tool folders (groups).
  • Edit Query - opens the corresponding tool dialog and loads the currently selected query. The user can modify the query and update it (see XML Tool Queries). When editing a query in this way, the user has an additional possibility to change the query's input documents. A new list, Non Existing Documents, is added to the standard Internal Document Manager dialog. This list contains documents which will be created when the MultiQueryEx tool is applied. All the queries added to the MultiQueryEx tool contain names of result documents, which may not exist yet. When the MultiQueryEx tool is applied, if some input documents of a query do not exist, they are removed from the input of that query.
  • Remove Query - removes the currently selected query from the list (table). The removal is NOT preceded by a warning message.
  • Insert Control - inserts a new Control operator after the selected query or another Control in the table. For more details see the Controls definition below.
  • Edit Control - allows editing the currently selected Control operator in the table. For more details see the Controls definition below.
  • View result in Editor - if this check box is checked, the result from the MultiQueryEx tool (the applied queries and the control status) will be shown in the Editor as an XML document.
  • Reordering - this allows changing the order of the queries in the table. Reordering can be done by simply dragging the rows of the table up or down.

The user can save or load the current tool settings, i.e. XML Tool Queries are supported here. This feature allows the creation of MultiQueryEx tool queries, which can themselves be included in other queries. There is no limitation on the level of such inclusions.

Controls

The Control operator allows changing the order of application of the queries in the MultiQueryEx tool. The usual order of application starts from the first query and proceeds one by one up to the last one. Using Control operators, some queries can be applied only if certain conditions are true. Such conditions are: the true or false value of a result from an XPath evaluation; whether the preceding single tool application has or has not modified the working document; or unconditional (always succeeding). When the condition for a Control is true, the next query (or another Control) which will be applied is defined in the Control itself. Otherwise, the application proceeds with the next entry in the order (query or another Control). The Control operators address their targets (queries or Controls to be applied in case of success) by pointing to their labels. Each entry (row) in the table of the MultiQueryEx Tool can have a label (unique identifier) which can be referred to by Control operators. It is an error if a Control operator uses a target label which does not exist. The main difference with respect to the MultiQuery tool is that the Controls are evaluated over documents chosen by the user when each Control is created or edited.

Each Control operator may consist of four parts:

  • Type - determines the type of the Control, i.e. the conditions for checking. There are several types of control:
    • IF (XPath) - the condition is an XPath expression. If the result from its evaluation on at least one document in the Internal Documents list is: a non-empty list, a non-empty string, a non-negative number or a true boolean value, the Control succeeds.
    • IF NOT (XPath) - the condition is an XPath expression. In contrast to the previous type, here, if the result from the XPath evaluation on the current working document is: an empty list, an empty string, a negative number or a false boolean value the Control succeeds.
    • IF CHANGED - the condition is the result from the previous single tool application. If the current working document has been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
    • IF NOT CHANGED - the condition is the result from the previous single tool application. If the current working document has NOT been modified by the previous (not necessarily preceding in the table) operation, the Control succeeds.
    • GOTO - the condition is always satisfied. This is an unconditional movement to the target of the Control.
    • DELETE - unconditional control which deletes all the documents in the Internal Documents list.
    • EXIST - if at least one document in the Internal Documents list exists, the Control succeeds.
    • IF APPLIED - the condition is the result from the last query application. If none of the input documents of that query exist in the system, the Control does not succeed.
  • XPath (depends on the type) - an XPath expression is evaluated, the result of which determines the success of the control. This part is used in controls of type IF (XPath) and IF NOT (XPath).
  • Target - a label reference which points to the next location where the execution will continue in case of satisfied condition for the Control.
  • Internal Documents list - list of documents from the system which are used by the Controls for evaluation.

By using the Controls-labels technique, the user can model the well-known IF-THEN-ELSE and WHILE-CONDITION-DO structures in order to make the processing more flexible. The composition of different Controls allows the user to create various 'programs' or 'scripts' capable of doing certain jobs, as illustrated in the sketch below. It is up to the user to create efficient and reliable processing procedures.
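For example, the document-based controls can be used to continue only when an intermediate result has actually been produced (the query, document and label names are hypothetical):

    Label   Entry                        Comment
    -       extract_terms (query)        its result document terms.xml may or may not be created
    -       EXIST -> merge               Internal Documents list: terms.xml; succeeds if the document exists
    -       GOTO -> <break>              otherwise stop the whole procedure
    merge   merge_results (query)        uses terms.xml as one of its inputs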

Controls Editor

Here follows a description of the Control Operator Editor:

The editor design follows the Controls structure in four sections:

  • Type - a list of all types of Controls: IF, IF NOT, IF CHANGED, IF NOT CHANGED, GOTO, DELETE, EXIST and IF APPLIED.
  • XPath - an XPath expression field, active only for types IF and IF NOT.
  • Target - a list of all labels currently defined in the MultiQueryEx Tool table. Here, for convenience, the user can also enter target labels which have not been defined yet. At the end of the list there is one special label, <break>, which is used for suspending the processing procedure. In other words, if the condition of a Control whose target is <break> is satisfied, the system will not proceed with the application.
  • Internal Documents list - list of documents from the system which are used by the Controls for evaluation.