CLaRK System Online Manual


Menu Definitions

Tokenizers

XML treats the content of each text element as a single string, which is unsuitable for corpus processing: the word forms, punctuation and other tokens in the text need to be distinguished. To solve this problem, the CLaRK System supports a user-defined hierarchy of tokenizers. At the most basic level, users can define a Primitive tokenizer in terms of a set of token types, where each type is defined by a set of Unicode symbols. Above the basic level there are Complex tokenizers, whose token types are defined as regular expressions over the tokens of some other tokenizer, the so-called Parent tokenizer.

The Tokenizer Manager tool shows all the tokenizers saved in the system, as well as some of their characteristics:

  • Name - each tokenizer has a unique name
  • Type - shows whether the tokenizer is Primitive or Complex
  • Parent - for Complex tokenizers - the name of the Parent tokenizer
  • Compile - shows the Compilation status. Only the compiled tokenizers can be used in other tools.

The user can create a new tokenizer; edit, compile, or remove an existing one; load tokenizers from an external file; or save tokenizer(s) to an external file.

  • New

    In order to create a new tokenizer (Complex or Primitive), the user must press the New button.

    Each row in the table represents one tokenizer category. The first column holds the category name. The content of the second column depends on the type of the tokenizer: a category value (all the symbols in the category) for a Primitive tokenizer, or a regular expression for a Complex (non-primitive) tokenizer.

    • Primitive

      If the tokenizer is primitive, the user must select the Primitive check box.

      When defining the category value for a primitive tokenizer, the user should be aware of the following rules:

      1. Characters are enclosed in single or double quotes, or referenced by a number. Examples: "." or ";" or 'k' or 32 (Space).
      2. To list more than one symbol for a category, separate the symbols with commas. Example: "a","b","c",...
      3. To define a range of symbols, connect the starting and ending symbols with a dash. Example: "A"-"Z".
      4. A character cannot belong to more than one category.
      5. A category can be defined on more than one row. The rows are then combined into a single definition. For example:



        The tokenizer tool will interpret these lines as LAT "'a'-'z','A'-'Z'".
    • Complex

      When defining a complex tokenizer, the user should follow the rules below:

      1. A parent tokenizer must be selected.
      2. Each category Expression must be a valid regular expression.
      3. No two categories may match the same token.

      Here is an example of a complex tokenizer:

      The parent of the tokenizer is the "Mixed" primitive tokenizer shown above. This complex tokenizer uses the categories "LAT" and "CYR" from the parent tokenizer in order to define the new categories "LATw" and "CYRw".

      Each complex tokenizer must be compiled in order to be used.

  • Edit

    The user can edit a tokenizer by pressing the "Edit" button. The tokenizer selected in the table is opened for editing. The user can add, remove and reorder rows using the menu that appears on right-clicking a row in the table. The parent of a tokenizer is set when the tokenizer is created; the user can change it later by pressing the Change Parent button.

    For each primitive tokenizer the user can define the sort order of the categories by clicking the Sort Order button. This sort order is used by the other tools in the system when they compare tokens. For example:

    The user can reorder the categories by selecting a row with a left mouse click and pressing the button that moves the row, or by using the menu that appears on right-clicking a row in the table.

    Also, for each category of a primitive tokenizer, a normalization of the symbols can be defined. This normalization is applied when tokens are compared. The usual normalization converts capital letters into small ones, but the user can define any correspondence between symbols. This is done by right-clicking the category line in the table. For instance, the following dialog appears when Normalize is selected for the "LAT" category of the Mixed tokenizer:

    For each symbol of the category the user can select a corresponding normalized symbol. The "New Category" combo box determines the new category that the symbols will receive after the normalization. This category can coincide with the original category of the symbols before the normalization or it can be completely different.

  • Remove - removes the selected tokenizer from the table;
  • Compile - compiles a non-primitive tokenizer;

    The user can compile a complex tokenizer with the "Compile" button. Pressing this button both compiles and saves the tokenizer. Possible error messages after compilation include:

    • "Ambiguous Categories" - two tokenizer categories recognize the same token from the input word. This error can occur even if the categories belong to different tokenizers in the hierarchy.
    • "Category not defined" - a category name used in the value of one of the tokenizer categories is not defined in the tokenizers above this tokenizer in the hierarchy.

    When a tokenizer is compiled, all the tokenizers above and below it in the hierarchy are compiled as well. This means that a change in one tokenizer can produce an error during the compilation of another tokenizer, so the error messages should be read carefully. It is also recommended to keep all tokenizers compiled.

  • Load Tokenizer - loads a tokenizer from a file;
  • Save Tokenizer - saves a tokenizer as a file;
  • Exit - closes the dialog window.
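The Primitive and Complex rules described in this section can be illustrated with a small Python sketch. It is not the CLaRK implementation: the function names, the one-token-per-character behaviour of the primitive level, the fallback to the parent token, and the expressions "LAT+" and "CYR+" are all assumptions made for illustration (the manual does not show the actual expressions behind "LATw" and "CYRw").

```python
import re

# One symbol: "x", 'x', or a decimal character code such as 32.
_SYMBOL = re.compile(r"""\s*(?:"([^"])"|'([^'])'|(\d+))\s*""")

def _read_symbol(text, pos):
    m = _SYMBOL.match(text, pos)
    if m is None:
        raise ValueError(f"cannot read a symbol at position {pos}")
    double, single, number = m.groups()
    return (double or single or chr(int(number))), m.end()

def parse_category_value(text):
    """Expand a value such as 'a'-'z',"_",32 into a set of characters."""
    symbols, pos = set(), 0
    while True:
        start, pos = _read_symbol(text, pos)
        end = start
        if pos < len(text) and text[pos] == "-":       # rule 3: a range
            end, pos = _read_symbol(text, pos + 1)
        if ord(end) < ord(start):
            raise ValueError("range end precedes range start")
        symbols.update(chr(c) for c in range(ord(start), ord(end) + 1))
        if pos == len(text):
            return symbols
        if text[pos] != ",":                           # rule 2: comma-separated
            raise ValueError(f"expected ',' at position {pos}")
        pos += 1

def build_primitive(rows):
    """rows: (category, value) pairs; repeated names are merged (rule 5)."""
    char_category = {}
    for name, value in rows:
        for ch in parse_category_value(value):
            if char_category.setdefault(ch, name) != name:   # rule 4
                raise ValueError(f"{ch!r} is in {char_category[ch]} and {name}")
    return char_category

def compile_complex(parent_names, definitions):
    """Turn expressions over parent category names into compiled regexes."""
    # Each parent category gets a one-character code so that `re` can run
    # over the sequence of parent token categories (assumed syntax).
    code = {n: chr(0xE000 + i) for i, n in enumerate(sorted(parent_names))}
    names = sorted(map(re.escape, code), key=len, reverse=True)
    name_re = re.compile("|".join(names))
    compiled = {cat: re.compile(name_re.sub(lambda m: code[m.group()], expr))
                for cat, expr in definitions.items()}
    return code, compiled

def tokenize_complex(text, char_category, code, compiled):
    """Assumes the primitive level yields one token per character."""
    cats = [char_category.get(ch, "UNKNOWN") for ch in text]
    coded = "".join(code.get(c, "\uffff") for c in cats)
    tokens, pos = [], 0
    while pos < len(text):
        best = None
        for cat, rx in compiled.items():
            m = rx.match(coded, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1]):
                best = (cat, m.end())
        if best is None:                  # no complex category applies here,
            best = (cats[pos], pos + 1)   # so keep the parent token as it is
        tokens.append((best[0], text[pos:best[1]]))
        pos = best[1]
    return tokens
```

For example, a parent built from the rows LAT 'a'-'z', LAT 'A'-'Z', CYR 1040-1103 and SPACE 32, combined with the hypothetical definitions LATw = "LAT+" and CYRw = "CYR+", splits "Hello мир" into a LATw token, a SPACE token and a CYRw token.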

Filters

This menu item starts the filter editor. To browse the filters in the system, the user can use the "Filter" combo box at the top of the dialog. The user can add token categories from different tokenizers, or add XPath expressions to filter element nodes.

A filter defines how tokens and/or elements are removed from the content of a given element when some tool processes that content. Filters are usually used in connection with grammars. A grammar is applied to the content of an element, and the content is preprocessed first: the text is tokenized and the elements are converted to a list of tokens. The resulting token list is the input for the grammar. Very often some of the tokens in this list are meaningless for the grammar, such as space tokens, some special symbols, or some special elements (an element for a parenthetical expression, for instance). To exclude these tokens, they can be filtered out of the grammar input in advance. This is the purpose of the filters in the system.

"Token Types" is a list of token categories, that will be filtered. The user can take categories from tokenizers in the system ("Choose From" list on the left side of the dialog) and add them to the list of filtered token categories with the arrow ("=>") button.

To filter an element, the user has to specify an XPath expression. This expression is evaluated on each element in the content being filtered; if it evaluates to true or returns a non-empty list of nodes, the element is filtered out of the content. An XPath expression is added by pressing the "Add XPath" button, after which it appears in the "Expression" list.
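The effect of a filter on grammar input can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Filter`, the (category, text) token tuples, and the use of plain Python predicates in place of XPath expressions are hypothetical and do not reflect how CLaRK stores or evaluates filters.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, List, Set

@dataclass
class Filter:
    # Token categories to drop (the "Token Types" list).
    token_types: Set[str] = field(default_factory=set)
    # Stand-ins for the XPath expressions: an element is dropped when
    # any test returns True for it.
    element_tests: List[Callable[[ET.Element], bool]] = field(default_factory=list)

    def apply(self, content):
        """Return the content items that survive the filter, in order."""
        kept = []
        for item in content:
            if isinstance(item, ET.Element):
                if any(test(item) for test in self.element_tests):
                    continue
            elif item[0] in self.token_types:   # item is (category, text)
                continue
            kept.append(item)
        return kept

# Content of an element after tokenization: tokens plus one child element.
content = [("LATw", "He"), ("SPACE", " "), ET.Element("paren"), ("LATw", "left")]
spaces_and_parens = Filter({"SPACE"}, [lambda e: e.tag == "paren"])
grammar_input = spaces_and_parens.apply(content)
```

Here `grammar_input` keeps only the two LATw tokens, which is the list a grammar would then see.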

To remove a token category or an XPath from one of the lists, the user must select the line and press the "Remove" button under the table.

The user has to save the filter after editing by pressing the Save button. A filter can be removed with the Remove button.

The Export Filters button saves all filters as a file.

The Import Filters button loads filters from a file. There are loading options which determine the behaviour of the import operation.


Element Features

Element Features is used for assigning information to the elements of a DTD.

The user can add the following information:

  1. A tokenizer for the DTD. This tokenizer is used to tokenize (when needed) the text content of all elements for which no specific tokenizer is set, so it acts as the default and can be overridden. For instance, if no tokenizer is defined for an element, the grammar and sort engines will look for the DTD default. After a DTD is compiled, the "Default" primitive tokenizer is set as its default tokenizer.
  2. A tokenizer for each element in the DTD. This tokenizer overrides the default DTD tokenizer. It is used for tokenization of the element's text data by the Grammar and Sort engines and some other tools.
  3. The user can state that the content of the element is treated as a number.
  4. The user can define an XPath Value (Key) for each DTD element.
  5. The user can define the order over the DTD elements. This option is used for sorting purposes.
  6. The user can define options for delete operations - Delete Subtree and Delete Node from the Tree Popup Menu and XPath Remove from the Tools menu.

To select the default tokenizer for the DTD, the user selects an item in the "Default Tokenizer" combo box. A tokenizer for an individual element can be selected in the "Tokenizer" column of the table by clicking on a table cell. The check boxes in the "Number" column are used by the sort tool to determine whether the content of the element can be treated as a number. For example, pages and price can be treated as numbers when two books are compared. The values in the "Element Value" column are used by the Grammar Engine to define the value of the element nodes (for defining values, see Edit Grammar). The check boxes in the "Before" and "After" columns determine whether a space symbol is inserted before and after an element deleted by the delete operations: Delete Subtree and Delete Node from the Tree Popup Menu, and XPath Remove from the Tools menu. After one of these operations is applied, if the Before checkbox is checked for an element, a space symbol is inserted before the tag if one is not already there; likewise, if the After checkbox is checked, a space symbol is inserted after the tag if one is not already there.
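Two of the features above, the tokenizer fallback and the "Number" flag, can be pictured with a short Python sketch. The class and method names are hypothetical and the element names "pages" and "title" are invented for the example; the sketch only mirrors the described behaviour and is not CLaRK code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ElementFeature:
    tokenizer: Optional[str] = None   # None means: use the DTD default
    number: bool = False              # the "Number" check box

@dataclass
class DtdFeatures:
    default_tokenizer: str = "Default"
    elements: Dict[str, ElementFeature] = field(default_factory=dict)

    def tokenizer_for(self, tag):
        """An element's own tokenizer overrides the DTD default."""
        feature = self.elements.get(tag)
        if feature is not None and feature.tokenizer:
            return feature.tokenizer
        return self.default_tokenizer

    def sort_key(self, tag, text):
        """Compare content as a number only when the flag is set."""
        feature = self.elements.get(tag)
        if feature is not None and feature.number:
            return float(text)
        return text

features = DtdFeatures(elements={
    "pages": ElementFeature(number=True),
    "title": ElementFeature(tokenizer="Mixed"),
})
values = ["10", "9", "100"]
as_numbers = sorted(values, key=lambda v: features.sort_key("pages", v))
as_text = sorted(values, key=lambda v: features.sort_key("title", v))
```

With the flag set, "9" sorts before "10"; as plain text it sorts last, which is why page counts and prices are better compared as numbers.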

The addition (creation) of a new Element feature can be performed by pressing the Add button. The removal of elements can be done by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.

The order of the elements in the DTD can be defined in the sort table shown when the user presses the "Sort Order" button. When two elements are compared, their positions in the sort table define their order. The user can change the position of elements by dragging their rows to the desired positions or by using the context menu that opens on right-clicking a row of the sort table. Here is an example:

The Export Element Features button saves the Element Features for the currently selected DTD as a file.

The Import Element Features button loads Element Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking the Cancel button.

If one of the Element Features uses a tokenizer that does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.


Attribute Features

Attribute Features is used to add information to the attributes of elements from a DTD or user elements.

A new Attribute feature is added (created) by pressing the Add button. Attributes are removed by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.

The "Element" column of the table contains the elements which have attributes. The same element can appear several times because it can have more than one attribute. The "Attribute" column holds the name of the attribute for which the attribute features are defined. The "Is Enumeration" column shows whether the attribute is an enumeration or not. In the "Tokenizer" column the user can select a different tokenizer for each attribute. The sort tool uses the information from the "Number" column to choose how attribute values are compared (as plain text or as numbers).

An additional feature is the order of enumerated attribute values. Attributes with enumerated values have the string "Yes" in the "Is Enumeration" column. To sort the enumerated values, the user must click on an attribute with enumerated values and then click the "Sort Values" button. Example:

The values are sorted in ascending order. The user can change the order by dragging the rows or by pressing the right mouse button.

The Export Attribute Features button saves the Attribute Features for the currently selected DTD as a file.

The Import Attribute Features button loads Attribute Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking the Cancel button.

If one of the Attribute Features uses a tokenizer that does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.


XPath Macros

XPath Macros give the user the possibility to name XPath expressions. Each macro consists of a macro name and an XPath expression and is intended for general use. Macros can be parts of XPath expressions, and thus they can be used wherever XPath is used. Within an XPath expression a macro is denoted by '%' (percent sign) followed by the macro name, i.e. %macro_name. A macro citation by itself also represents an XPath expression (a CLaRK System extension).
The picture below shows what the XPath Macros editor looks like:


The Export XPath Macros button saves selected XPath Macros as a file.

The Import XPath Macros button loads XPath Macros from a file. There are loading options which determine the behaviour of the import operation.

If one of the XPath Macros has no XPath expression, or its expression is invalid, a warning message is shown and the user can validate the definition in a validation dialog.
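The %macro_name notation described in this section can be pictured as a textual expansion step that runs before the XPath expression is evaluated. The Python sketch below is an assumption about how such an expansion could work, not a description of CLaRK internals: the function name, the set of characters allowed in a macro name, and the support for macros that cite other macros are all hypothetical. It also does not protect '%' characters inside XPath string literals.

```python
import re

# Assumed shape of a macro name: a letter or underscore, then word characters.
_MACRO = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)")

def expand_macros(expression, macros, _active=()):
    """Replace every %name in `expression` with its XPath text."""
    def replace(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined macro: {name}")
        if name in _active:
            raise ValueError(f"macro {name} refers to itself")
        # Expand nested citations, remembering which macros are open.
        return expand_macros(macros[name], macros, _active + (name,))
    return _MACRO.sub(replace, expression)

macros = {
    "words": "//w",
    "nouns": "%words[@pos='N']",
}
expanded = expand_macros("count(%nouns)", macros)
```

In this example `expanded` is the ordinary XPath expression count(//w[@pos='N']), and a macro cited on its own, such as "%words", expands to a complete expression as well.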


Keys

Keys are a means of naming XPath expressions together with some specific information that is important for some of the tools in the system. Key names are arbitrary strings, but each name must be unique. Here is what the XPath Key manager window looks like:

Each row in the table above represents one XPath Key. The content of the table cannot be modified directly from here. To modify a key entry, the user must select the corresponding row in the table and then press the Edit button. A new dialog window appears where the key specifications can be changed. A new XPath Key is added (created) by pressing the Add button. Keys are removed by selecting the corresponding row and pressing the Remove button. The Exit button closes the dialog and updates the changes (if any).

There are several types of keys. Each type has different additional options and usage:

  • Grammar Key - This key is designed for use with the Grammar tool, especially in the Element Value option. It has two additional fields: normalization on/off and a tokenizer specification. A description of the usage of this key can be found in the description of the menu option Apply Grammar. Here is what the Grammar XPath Key editor window looks like:


  • Sort Key - This key is designed for use with the Sort tool. It contains settings specific to that tool: descending/ascending order; reverse sorting; removing leading/trailing white space (trim); enabling number interpretation; enabling normalization; tokenizer specification. All these options and their usage are described in the menu option Sort.


  • Table Sort Key - This key is designed for use in the Concordance tool. It specifies the sort options for the content of the different table columns. In addition to the Sort Key options it has one more option, Prefix, which specifies the table column for which the key is defined. Here is what the editor window looks like:


The Export Keys button saves selected Keys as a file.

The Import Keys button loads Keys from a file. There are loading options which determine the behaviour of the import operation.

If one of the Keys uses a tokenizer that does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.


Shortcuts

This tool allows the user to assign key combinations to different actions in the system.

Each keyboard shortcut definition consists of two parts:

  • key combination - the combination which will activate the shortcut;
  • action definition - the description of the action which will be executed when this shortcut is activated.

A key combination includes modifier keys (at least one) and an ordinary key. The modifier keys may vary between computer architectures. Usually such keys are: Alt, Ctrl, Shift and Meta. The rest of the keys (or at least most of them) can play the role of an 'ordinary' key. Such keys are: all keys which produce graphical symbols, the function keys (F1, F2, ...), numerical keys and others.

The action definition determines the type of action and the concrete action to be performed. There are three action types: selecting a menu item, applying a tool query, and running an XPath search query. Each of them is described in detail later in this section.

When this option is selected, the following dialog window appears:

It shows a table listing all shortcuts defined in the system. Each row in the table corresponds to one shortcut. The first two cells show the key combination and the third one briefly describes the action of the shortcut. The Key column shows the key which activates the shortcut. The second column, Modifier(s), shows the modifier keys for the shortcut; there can be more than one. The Action column shows information about the shortcut's action.

The shortcut management which this dialog window offers includes adding, editing and removing shortcuts.

Here follows a description of the dialog buttons and their functions:

  • Add - creates a new shortcut. When this button is pressed, a blank Shortcut Editor window appears.
  • Edit - modifies an existing shortcut. When this button is pressed, a Shortcut Editor window appears, loaded with all data from the shortcut selected in the table. This function can also be called by double-clicking a shortcut's row in the table.
  • Remove - removes the selected shortcuts from the table WITHOUT a confirmation warning.
  • Done - updates the changes (if any) on the shortcuts and closes the manager window.
  • Reset Shortcuts - removes all currently existing shortcuts and restores the initial system ones.
  • Cancel - discards the changes (if any) on the shortcuts and closes the manager window.

Shortcut Editor

The Shortcut Editor window appears when the user presses the Edit or Add button.

The definition of a key combination can be done in sections Modifiers and Key Code. Different combinations of modifiers can be selected by clicking on the relevant checkboxes. In order to select a shortcut activation key or to change the current one, the user has to press button Change. The system will respond with a small dialog, titled Press a key and it will wait for a single key stroke (without modifiers) which will be recorded as a new activation key. If the new key recording has to be stopped, the user must press button Cancel. Having chosen a new activation key, the recording dialog will disappear and the program control will be returned to the editor. If after pressing a key, the dialog does not disappear, it means that the selected key is not suitable for an activation key (the same happens if a modifier key is pressed). If the recording is successful, the new selected key is shown in the Key Code section.

The action for the shortcut is defined in section Action. The section is separated into three sub-sections, each responsible for one type of action. The sub-sections are as follows:

  • Action Items

    This type of action covers most of the menu items which can be chosen in the system menus. The target menus are: the main system editor menu (Menu Item), the menu which appears after a right mouse click on the tree areas (Tree Item) and the menu which appears on the attribute table (Attribute Item). The type of an action (menu) item can be selected in the field Action Type.

  • Queries

    An action of this type represents an application of one or more predefined tool queries. There is no restriction on the type of the queries or on the order in which they are selected (when there is more than one). The queries are applied to the current document, taking into account the current tree area selection. If there is no current document or the tree selection is not appropriate, no action is performed.

    Query list management:

    • Add - selects one or more tool queries and appends them to the current queries list;
    • Remove - removes the selected tool query from the queries list;
    • Options - sets some shortcut execution options which are taken into account when the shortcut is activated.

      The options are:

      Use tree selection as:

      • context for selection - determines how the selected data in the tree will be handled. Almost all tool queries contain XPath expressions which select the data to which the given tool query will be applied. If this option is selected, the queries' XPath expressions are evaluated with each selected node as a context, and the union of all results forms the tool input data.
      • processing data - ignores the queries XPaths (if there are such) and uses the current tree selection as an input for the given tool(s).
      • Suppress warning / info messages - this option enables the suppression of messages showing intermediate results during application.

  • XPath Search

    This type of action performs an XPath search on the current document with a predefined query expression. It is handy when many XPath searches have to be done manually on different context nodes: instead of going to the search field on the toolbar for each newly selected context, the user can press a key combination. Additionally, if no result message dialogs are needed for the search operation, they can be suppressed by selecting Suppress info messages, so that only 'discreet' messages appear on the status bar at the bottom.
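The difference between the two "Use tree selection as" options of a Queries shortcut can be sketched as follows. This is a hedged illustration only: the function name and the option strings are hypothetical, and the limited path syntax of Python's `xml.etree.ElementTree` stands in for CLaRK's XPath engine.

```python
import xml.etree.ElementTree as ET

def tool_input(selection, query_path, use_selection_as="context"):
    """Compute the nodes a tool query would receive from a shortcut."""
    if use_selection_as == "processing data":
        # The query's own path is ignored; the selection is the input.
        return list(selection)
    # "context": evaluate the path on each selected node, then take the union.
    result = []
    for node in selection:
        for hit in node.findall(query_path):
            if not any(hit is seen for seen in result):
                result.append(hit)
    return result

doc = ET.fromstring("<s><np><w>a</w><w>b</w></np><vp><w>c</w></vp></s>")
np, vp = doc[0], doc[1]
from_context = tool_input([np, vp], ".//w")
from_overlap = tool_input([doc, np], ".//w")
as_data = tool_input([np], ".//w", "processing data")
```

With the selection used as context, the tool receives the three w elements even when the selected nodes overlap; with the selection used as processing data, it receives the selected np element itself.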


Sync Rules

Synchronization Rules are a means of establishing connections between documents opened in the system editor. A connection propagates a selection in one document to selections in other documents. The connections are based on processing and evaluating XPath expressions.

When a connection between the current document and another referent document is established each change of the selection in the current one is registered. If the new selection satisfies certain conditions, the connection is activated and a new XPath expression is generated on the basis of a pattern. The new XPath is evaluated in the referent document and the result is selected (in case it is not an empty node-set).

An example use of this facility is exploring a document while simultaneously looking words up in a dictionary document. In this case a connection between the observed document and the dictionary document can be established. Whenever a word or expression to be looked up is selected, the system extracts the necessary information and performs a search in the dictionary document. If the entry is found, it is selected in the background. Thus moving from word to word in the observed document automatically shows the corresponding entries from the dictionary.

Another application of this tool is when different documents are connected in some way by references. Whenever a reference in one document is selected, the referent entity in the corresponding document is selected.

A connection of this type requires exactly two documents: a current document in which the user works, and a referent document in which the selection moves depending on the Sync rule parameters and the selection in the current document. There is no restriction on the number of connections which can take a given document as the current one. The user can therefore connect one document with several referent documents, so that navigating in one document distributes selections to several documents at the same time.

For a connection to be established, a Sync Rule must be defined. Here is an example view of the Sync Rule Editor:

Each rule consists of three parts:

  • Restriction (XPath) - this expression defines a restriction on the nodes in the current document for which a pattern expression will be generated. When a connection is established, whenever the selection in the current document is changed, for the new selected node the restriction expression is evaluated. If the result is approving (not empty node-set, not empty string, positive number or true boolean value), the node is suitable for proceeding with pattern generation.
  • Pattern - the pattern here defines the way an XPath expression is constructed for evaluation in the referent document. The pattern expression is an XPath expression which may contain certain parameters whose values are generated on the basis of the selected node in the current document. The process of generating of an XPath includes: calculating the pattern parameters, converting the results to strings and inserting them in the pattern expression. The result is a complete XPath expression ready for evaluation in the referent document. The places in the pattern expression where parameter values are to be inserted are denoted by expressions in curly brackets ('{', '}'). The expressions between these brackets must be again XPath expressions which will be evaluated on the selection in the current document. The curly brackets in the pattern expression can be escaped by a leading '^' symbol.
  • Ref Document - determines the referent document of the rule. When a connection is established with this rule, the selected document is used for evaluating the dynamically generated pattern (XPath) expressions.
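The three parts of a rule suggest a simple procedure: test the restriction, fill in the pattern, then evaluate the result in the referent document. The Python sketch below covers the first two steps under stated assumptions. The function names are hypothetical, and `evaluate` is a caller-supplied callable standing in for XPath evaluation on the selected node; it is not a CLaRK API. Values are inserted as plain strings, so a value containing a quote character could break the generated expression.

```python
def is_approving(value):
    """Restriction test: non-empty node-set or string, positive number, or True."""
    if isinstance(value, bool):          # check bool before int/float
        return value
    if isinstance(value, (int, float)):
        return value > 0
    if isinstance(value, (str, list)):
        return len(value) > 0
    return False

def generate_xpath(pattern, evaluate):
    """Fill each {expr} in the pattern; ^{ and ^} yield literal brackets."""
    out, i = [], 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "^" and i + 1 < len(pattern) and pattern[i + 1] in "{}":
            out.append(pattern[i + 1])   # escaped bracket, copied literally
            i += 2
        elif ch == "{":
            end = pattern.find("}", i + 1)
            if end == -1:
                raise ValueError("unclosed '{' in pattern")
            out.append(str(evaluate(pattern[i + 1:end])))
            i = end + 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Stand-in for evaluating XPath expressions on the selected node.
selected_node_values = {"text()": "cat", "@id": 7}
evaluate = selected_node_values.__getitem__

xpath = generate_xpath("//entry[form='{text()}'][@n={@id}]", evaluate)
```

Here `xpath` is the complete expression //entry[form='cat'][@n=7], which would then be evaluated in the referent document.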

Once a Sync Rule is defined, it can be applied in a connection by opening a document (which will serve as a current one) in the editor and then assigning rule(s) by choosing menu item Document/Synchronize with ....


Document Index

Document indexing is a representation of the document's content in a way optimized for fast search. In CLaRK this representation can go down to the level of tokens. To use the optimized search, the user must preprocess (index) the input document(s). During indexing the system reads the data from the document and produces the index data, which is stored separately from the document. Whenever a fast search is needed, this data is loaded automatically. The ability to search in an indexed document is provided by an XPath extension function: search(). With its help the index search can be used in many places and tools within the system.

To index a document, a Document Index definition must be created first. Such a definition determines the data on which the indexing will be performed. Each definition uses an XPath expression to select the nodes to be indexed. With an additional XPath expression the user can further specify the representative value for each selected node. These values are converted to strings, optionally tokenized, and stored in the internal structure. For convenience, the user can index different parts of a document independently, forming different index repositories for one document, each of which can then be used independently. Whenever an index search is needed, the user specifies the search query and the repository in which to search. If no repository is stated, the search is performed in all available repositories for the document.

To define a Document Index, the user has to use the following manager:

It contains a list of all document index definitions saved in the system. It is visualized in a table where the first column contains the names of the definitions and the second one contains lists of repositories of each index.

The possible operations here are: creating new definitions (New Index), modifying existing definitions (Edit Index) and removing existing definitions (Remove Index). Each index definition must have a name.

Once a definition is created, it can be applied to document(s) with the button Apply on document. The user is asked to select documents for indexing, after which the selected ones are indexed according to the selected definition and the data is saved. After that, fast searching can be performed in the processed documents. If an indexed document is modified later, re-indexing might be necessary.

The following section describes the creation of a document index definition. In order to create a definition the user has to supply a name for it. Having done that, a Document Index Editor window appears:

It contains information about the repository definitions of the current index definition. Initially the table is empty.

Each repository definition contains several parts:

  • Name - the name for this repository, which later will be cited when a search is performed in it.
  • Targets - an XPath expression for selecting nodes to be indexed in this repository. The target nodes are the nodes which will be found later during searching.
  • Keys - an XPath expression which determines, for each selected target node, the value by which the node will be searched for later. In other words, a node will be found if it is searched for by its key value. Example: a dictionary in which each word entry has a certain XML structure. An appropriate indexing is: the targets are the root elements of each word structure and the keys point to the word (form) itself. Thus the search query will be a word (or a part of a word) and the result will be the structure(s) which contain(s) the relevant information.

An additional option here is setting a tokenizer. It is needed when a document must be indexed not by whole text nodes, but by text tokens within the text. Here the user selects a Tokenizer for processing the key values and the token categories to be indexed (button Customize). With it, the user can filter the categories which are not interesting for indexing. Additionally a token normalization can be used.
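A repository definition (name, targets, keys, optional tokenizer) can be read as a recipe for building a lookup table from key tokens to target nodes. The Python sketch below follows the dictionary example under stated assumptions: `build_index` and the shape of the `repositories` argument are hypothetical, `str.split` stands in for a CLaRK tokenizer, and the limited path syntax of `xml.etree.ElementTree` stands in for full XPath.

```python
import xml.etree.ElementTree as ET

def build_index(root, repositories):
    """repositories: {name: (targets_path, key_path, tokenize_or_None)}."""
    index = {}
    for name, (targets_path, key_path, tokenize) in repositories.items():
        repo = {}
        for target in root.findall(targets_path):
            # The key value is converted to a string ...
            value = target.findtext(key_path, default="")
            # ... and split into tokens only if a tokenizer is set.
            tokens = tokenize(value) if tokenize else [value]
            for token in tokens:
                nodes = repo.setdefault(token, [])
                if not any(target is n for n in nodes):
                    nodes.append(target)
        index[name] = repo
    return index

dictionary = ET.fromstring(
    "<dict>"
    "<entry><form>black cat</form><pos>noun</pos></entry>"
    "<entry><form>run</form><pos>verb</pos></entry>"
    "</dict>"
)
index = build_index(dictionary, {
    "forms": (".//entry", "form", str.split),   # indexed by tokens
    "pos": (".//entry", "pos", None),           # indexed by whole value
})
```

In the "forms" repository the first entry can be found under either "black" or "cat", while without a tokenizer it would be stored only under the whole string "black cat".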

Searching in indexed documents

Searching in indexed documents is built into the system's extension of the XPath query language. Thus the index can be used wherever an XPath search can be performed, i.e. in all major system tools.

To use this functionality, the target documents have to be indexed in advance. When an index search is performed on a document for the first time, the relevant index information is loaded automatically, which may cause a short delay before further tasks proceed. If an error occurs while the index data is loading, nothing is loaded and subsequent searches will be unsuccessful (no result will be returned). Possible reasons for a failed load are that the document has been modified after it was indexed, or that the index data files have been corrupted. Whenever something goes wrong with loading index data, the user can open the document in the editor and try to reload the data with the menu item Document/Load Index Data, which indicates the reason for the failure (if any).

The extension XPath function which allows index search is: search(). The function usage is the following:

node-set search ( string, string? )

The function's result is a node-set (possibly empty) which contains the nodes from the current document that match the given search query. The search query is given in the first argument and represents a full or partial token value description, i.e. a certain word or a description with wildcards. If no tokenization is used, the queries are matched against the whole node values stored in the index. A definition of the token value description language can be found in section Grammars.

The second function parameter is optional and allows refining the search results by considering only a certain repository within the whole index. Thus, if an index contains different repositories with different content types (for example, one containing words from text nodes and another containing ID values coming from attributes), naming the repository improves search efficiency and makes the results more precise.

Example index search queries:

  • search("noun") - returns all nodes, whose value contains the token noun stored anywhere within all available repositories for the index.
  • search("noun", "dictionary") - returns all nodes, whose value contains the token noun stored in repository dictionary for the index.
  • search("12.3.#", "IDs") - returns all nodes, whose value contains tokens starting with 12.3 ("12.3.23", "12.3.4", "12.3.6", etc.) stored in repository IDs for the index.
  • search("#aba#") - returns all nodes, whose value contains tokens containing the substring aba stored anywhere within all available repositories for the index.
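The example queries above can be imitated outside the system with the following Python sketch. It assumes, based only on these examples, that # stands for any (possibly empty) sequence of symbols; the full token description language is defined in section Grammars. The INDEX data and the Python function search are hypothetical stand-ins for a real index and for the XPath function.

```python
import re

def query_to_regex(query):
    # Assumption: "#" matches any (possibly empty) run of characters;
    # every other character of the query is matched literally.
    parts = [re.escape(part) for part in query.split("#")]
    return re.compile(".*".join(parts), re.DOTALL)

def search(index, query, repository=None):
    """index has the shape {repository name: {token: [node identifiers]}}."""
    pattern = query_to_regex(query)
    # Without a repository name, every repository of the index is consulted.
    names = [repository] if repository is not None else list(index)
    found = []
    for name in names:
        for token, nodes in index.get(name, {}).items():
            if pattern.fullmatch(token):
                for node in nodes:
                    if node not in found:
                        found.append(node)
    return found

# Hypothetical index with two repositories.
INDEX = {
    "dictionary": {"noun": ["entry1"], "cabaret": ["entry2"]},
    "IDs": {"12.3.23": ["s1"], "12.3.4": ["s2"], "12.4.1": ["s3"], "noun": ["s4"]},
}
```

In this sketch, naming a repository simply restricts the loop to one inner dictionary, which mirrors the efficiency and precision gain described for the second argument.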


Graphical Tree Layout

The Graphical Tree Layout is a means for drawing arbitrary tree structures represented in XML. The resulting graphical representation follows a number of user settings, such as the rendering of colors and shapes, the definition and filtering of text and structure, and others.

The main graphical objects which can be used for representing nodes are: rectangles, rounded rectangles and ellipses. The user can specify their outline color and thickness, background color, and the text label inside (font, color and content). The nodes in the drawing are connected with lines, the appearance of which is again user defined: color and thickness. Additionally, there are cross-branches links available. They can connect any nodes in the drawing with arcs for which the curvature can be adjusted (to avoid overlapping with other lines and arcs).

The layout itself represents a set of rules, each of which corresponds to one shape definition. Each rule has a conditional part which defines to what kind of nodes the rule is applicable (element, text or comment nodes, or nodes appearing in a certain XPath defined context). If the condition of a rule is satisfied for a certain node, a new graphical object appears in the drawing canvas. The appearance of the object is defined in the rule.

Another important part of each rule is the Children Definition section, which determines the nodes whose graphical representations will appear as children of the graphical representation of the current node. The children definition is based on an XPath expression, and this allows nodes which are not direct children of the current node (or even nodes which do not belong to the current structure at all) to be visualized as child nodes. The default value of the children definition for each rule is child::*, i.e. all direct child nodes.

If none of the rules in a layout is applicable to a certain node, there are three default rules, one of which always succeeds depending on the node type (element, text or comment).
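The fallback to the default rules can be sketched in Python as follows. This is only an illustration: the Rule class, the rule names and the representation of nodes as dictionaries are hypothetical, and the sketch assumes that the first applicable user rule wins, which the text above does not state.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Rule:
    name: str
    applies: Callable[[dict], bool]  # the conditional part of the rule

# One default rule per node type; each of them always succeeds for its type.
DEFAULT_RULES = {
    "element": Rule("default element", lambda node: True),
    "text": Rule("default text", lambda node: True),
    "comment": Rule("default comment", lambda node: True),
}

def select_rule(node, rules):
    # Assumption: user rules are tried in order and the first match is used.
    for rule in rules:
        if rule.applies(node):
            return rule
    # No user rule matched: fall back to the default rule for the node type.
    return DEFAULT_RULES[node["type"]]

# Hypothetical user rule: applies to element nodes with tag name NP.
user_rules = [
    Rule("NP box", lambda node: node["type"] == "element" and node.get("tag") == "NP"),
]
```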

An important section in each layout is the Structure roots definition section. It defines which nodes in the current document are suitable to be roots of a structure to be visualized. It contains an XPath expression which is used as a condition and if it is evaluated successfully for a certain node, the graphical representation building starts from it. Otherwise, the system searches for the closest ancestor which satisfies the condition.

In order to visualize a document (or a part of a document), it must be opened in the system editor. Having selected a node in the document, the user must choose the menu item View/Graphical Tree View, which shows the graphical representation in a separate window. The layout which will be used is defined in the DTD Tree Layout of the current document.

In order to define a graphical tree layout, the user must select menu item Definition / Graphical Tree Layout. The following dialog window appears:

The layouts manager contains several sections which are described below.

The Layouts section contains a list of all layouts which are currently available in the system. Initially the list contains only the entry Default Graphical Layout. The selection in this section determines the content of the other sections of the manager, i.e. it determines the current layout for editing or just exploring.

The Rules section contains information about the rules defined in the current layout. It is represented as a table where each row represents one rule. When a rule is selected in the table, a preview image is generated according to the rule's settings and shown in section Preview. The preview image contains fixed text and a shape with fixed dimensions. All other settings are taken from the rule.

In this section the user can add New Rule, Edit an existing rule or Remove Rule. The definition of a new rule or modification of an existing one is described in section Layout Rule Editor.

The first three rows of the table contain the default rules for the layout. They can be modified, but cannot be removed.

The Links section contains information about the cross-branches links defined in the layout. Each such link is directed (although this is not visible on the canvas). The starting point of each link is determined by an XPath expression which is evaluated on the whole document. For each selected node which is visible on the canvas, a second XPath expression is evaluated, and its result determines the ending point of the corresponding link. If a starting or ending point is not visible, no link is drawn. Each link definition is processed independently. The creation of a link definition or modification of an existing one is described in section Layout Link Editor below.

The Structure roots definition section determines which nodes from the current document can represent a root for a structure to be visualized. The XPath expression is used as a condition for each selected node and in case of success the structure building starts from the corresponding node. If a condition fails for a node, it is checked again for its parent node and so on until a suitable root node is found. If no suitable root is found the system shows an error message.
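The search for a structure root described above amounts to walking from the selected node up through its ancestors. A minimal Python sketch, assuming xml.etree.ElementTree as a stand-in for the system's document model and a Python predicate in place of the XPath condition (the function name find_structure_root and the sample document are hypothetical):

```python
import xml.etree.ElementTree as ET

def find_structure_root(doc_root, selected, is_root):
    """Return the selected node or its closest ancestor that satisfies is_root."""
    # ElementTree nodes do not know their parents, so build a child -> parent map.
    parents = {child: parent for parent in doc_root.iter() for child in parent}
    node = selected
    while node is not None:
        if is_root(node):
            return node
        node = parents.get(node)  # None once the document root has been passed
    # Corresponds to the error message shown when no suitable root exists.
    raise LookupError("no suitable structure root found")

root = ET.fromstring("<text><s><np><w>cat</w></np></s></text>")
word = root.find(".//w")
sentence = find_structure_root(root, word, lambda node: node.tag == "s")
```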

Control buttons:

  • New Layout - creates a new layout and prompts the user for a name for it;
  • Remove Layout - removes the current layout after a confirmation message;
  • Options - offers some global options concerning the graphical visualization.

    • Margins(top, bottom, left, right) - determines the spacing between the tree image and the corresponding borders in pixels;
    • Nodes H Gap - determines the minimal space (in pixels) between the nodes in horizontal alignment;
    • Nodes V Gap - determines the minimal space (in pixels) between the nodes in vertical alignment;
    • Background - determines the background color of the drawing canvas.

  • OK - updates the changes on the layouts (if there are such) and closes the dialog;
  • Cancel - discards the changes on the layouts (if there are such) and closes the dialog.

Layout Rule Editor

The Layout Rule Editor is used for creating new graphical tree layout rules or modifying existing ones. The layout of the editor is the following:

The dialog contains several sections:

  • Shape - describes the shape characteristics for this rule:
    • Nodes shape - determines the shape which will be drawn for a node: Rectangle, Rounded Rect(angle) or Ellipse;
    • Outline width - determines the outline thickness of the shape in pixels;
    • Outline color - determines the outline color of the shape;
    • Background - determines the background color of the shape;
    • Parent arc color - determines the color of the arc which connects the current node with the parent;
    • Parent arc width - determines the thickness of the arc which connects the current node with the parent in pixels.
  • Preview - shows an image preview of the current rule's shape. It changes automatically when a characteristic is changed. Except for the text content (which is context dependent) all other characteristics are applied on the preview and the shape appears in its realistic dimensions.
  • Label - this section determines the label text and its appearance within the shape. The options are:
    • Label pattern - contains a definition of the label text content. The syntax of the pattern definition is the same as the one in the DTD Tree Layout;
    • Label Color - determines the color of the label text;
    • Font Name - determines the font of the label text;
    • Font Size - determines the size of the label text;
    • Font Style - determines the style (Plain, Bold, Italic) of the label text;
  • Children Definition - determines the child nodes which will be drawn on the canvas for the current node. The children are the result of evaluating an XPath expression with the current node as context. If the result is an empty node list or if this field is empty, no children are drawn for the current node;
  • Context Restriction - this section contains the conditions which have to be satisfied by a node in order for this rule to be applied. There are two types of conditions:
    1. by tag name - this condition is satisfied if the node being tested is of type element and its tag name coincides with the value specified in this field;
    2. by xpath - the expression specified in this field is used as a predicate for the current node. If the evaluation is successful the rule is used.

Layout Link Editor

The Layout Link Editor is used for drawing cross-branches arcs in order to express relations other than parent-of or child-of. Each link of this kind is directed and it has a starting point (target) and an ending point (reference). The targets and the references are determined by XPath expressions evaluation.

Each link definition is processed independently in the following way. An XPath expression is evaluated on the whole source document. The resulting node set is reduced to those nodes which appear in the current view. This forms a set of candidates for link starting points. For each of them a second XPath expression is evaluated, and if the result is a non-empty list, a link is established between the current context node and the first entry of the result which is also represented in the image.
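The processing steps above can be sketched in Python. The sketch is an assumption-laden illustration, not the system's code: it uses xml.etree.ElementTree, whose XPath subset cannot express every relative expression, so the second expression is modeled as a Python function (by_id, hypothetical) that looks up the node referenced by a ref attribute. The sample document and attribute names are also invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: each pronoun refers to another node through "ref".
DOC = """
<s>
  <np id="a"><w>John</w></np>
  <vp><w>saw</w><pron ref="a">himself</pron><pron ref="missing">it</pron></vp>
</s>
"""

def build_links(doc_root, visible, targets, reference):
    links = []
    # Step 1: evaluate the Targets expression on the whole document.
    for start in doc_root.findall(targets):
        # Step 2: keep only the candidates which appear in the current view.
        if start not in visible:
            continue
        # Step 3: evaluate the Reference expression with the candidate as context
        # and link to the first result entry which is also visible.
        for end in reference(doc_root, start):
            if end in visible:
                links.append((start, end))
                break
    return links

def by_id(doc_root, start):
    # Stand-in for a relative XPath expression that resolves the "ref" attribute.
    return doc_root.findall(".//*[@id='%s']" % start.get("ref"))

root = ET.fromstring(DOC)
everything = set(root.iter())
links = build_links(root, everything, ".//pron[@ref]", by_id)
```

Here only the first pronoun yields a link, because the reference of the second one selects nothing. If the np node is removed from the visible set, no link is drawn at all.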

Here is the layout of the Link Editor dialog:

The components of the editor are as follows:

  • Preview - shows a dynamically updated preview image according to the current settings;
  • Color - determines the color of the link;
  • Width - determines the thickness of the link;
  • Deviation - determines the curvature of the link in order to avoid overlapping with other components. The value here is a percentage coefficient of the length of the link (positive values - bulged curve; 0 - straight lines; negative values - concave lines);
  • Targets - an XPath expression determining the starting points of the links;
  • Reference - a relative XPath expression determining the ending points of the links.


Export Definitions

The user can save the selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from the system as a file.

If a tokenizer is used in some of the exported definitions but is not exported itself, a warning message is given.


Import Definitions

The user can load selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from a file. The imported file must have been generated by the export operation of the system.

The Loading Options (see below) determine the behaviour of the import operation.

If one of the new definitions uses a tokenizer which does not exist in the system, a warning message is given and the user can validate the definition in a validation dialog. The dialog depends on the type of the definition: Element Features Validation, Attribute Features Validation, XPath Macros Validation or Keys Validation.

Loading Options

The loading options concern the cases when the system already contains a definition with the same name as a newly imported definition. There are four modes:

  1. Overwrite all - the data of the system definition is overwritten by the data from the new definition.
  2. Do not overwrite all (Skip) - the data of the new definition is skipped.
  3. Do not overwrite all (New Name) - the user can save the data of the new definition with another name.
  4. Ask for each - the user is asked whether to overwrite the data of the definition in the system with the data from the new definition or to save the new definition data with another name.
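The four modes can be summarized as one conflict-resolution routine. The Python sketch below is illustrative only: the function import_definitions, the mode names and the callbacks new_name and ask are hypothetical, and definitions are reduced to name/data pairs in a dictionary.

```python
def import_definitions(system, incoming, mode, new_name=None, ask=None):
    """Merge incoming definitions into a copy of the system definitions.

    mode is "overwrite", "skip", "new_name" or "ask". In "ask" mode the
    callback ask(name) is consulted for every name conflict and returns
    "overwrite" or "new_name".
    """
    result = dict(system)
    for name, data in incoming.items():
        if name not in result:
            result[name] = data              # no conflict: simply add
            continue
        choice = ask(name) if mode == "ask" else mode
        if choice == "overwrite":
            result[name] = data              # system data is replaced
        elif choice == "skip":
            pass                             # the new definition is dropped
        elif choice == "new_name":
            result[new_name(name)] = data    # stored under another name
        else:
            raise ValueError("unknown loading option: %r" % choice)
    return result

# Hypothetical data: one conflicting and one new definition.
system_defs = {"tok": "old"}
incoming_defs = {"tok": "new", "extra": "x"}
```

For example, import_definitions(system_defs, incoming_defs, "skip") keeps the system version of tok and still adds extra.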