XML treats the content of each text element as a single string, which is unsuitable for corpus processing. Word forms, punctuation and other tokens in the text therefore have to be distinguished. To solve this problem, the CLaRK System supports a user-defined hierarchy of tokenizers. At the most basic level, users can define a tokenizer (Primitive) in terms of a set of token types. In such a basic tokenizer each type is defined by a set of Unicode symbols. Above the basic level there are tokenizers (Complex) whose token types are defined as regular expressions over the tokens of some other tokenizer, the so-called Parent tokenizer.
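The two tokenizer levels can be illustrated with a short Python sketch. It is not the CLaRK implementation: the token type names, the one-letter type codes and the two complex token types are assumptions chosen for the example. The primitive tokenizer assigns each character to a type by set membership, and the complex tokenizer matches regular expressions over the sequence of parent token types.

```python
import re

# Hypothetical primitive tokenizer: each token type is a set of Unicode characters.
PRIMITIVE_TYPES = {
    "LATw": set("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"),
    "NUMBER": set("0123456789"),
    "SPACE": set(" \t\n"),
    "HYPHEN": set("-"),
    "DOT": set("."),
}
OTHER = "SYMBOL"  # fallback type for characters not listed above

def char_type(ch):
    for name, chars in PRIMITIVE_TYPES.items():
        if ch in chars:
            return name
    return OTHER

def primitive_tokenize(text):
    """Group maximal runs of characters of the same type into (type, value) tokens."""
    tokens = []
    for ch in text:
        t = char_type(ch)
        if tokens and tokens[-1][0] == t and t != OTHER:
            tokens[-1] = (t, tokens[-1][1] + ch)
        else:
            tokens.append((t, ch))
    return tokens

# One-letter code per parent token type, so that a regular expression
# can be written over the sequence of parent tokens.
CODES = {"LATw": "w", "NUMBER": "n", "SPACE": "s",
         "HYPHEN": "h", "DOT": "d", "SYMBOL": "p"}

# Hypothetical complex token types, defined over the parent token types.
COMPLEX_TYPES = [
    ("HYPHENATED", re.compile(r"w(?:hw)+")),  # word (hyphen word)+
    ("DECIMAL", re.compile(r"ndn")),          # number dot number
]

def complex_tokenize(text):
    """Tokenize with the parent tokenizer, then merge parent tokens by regex."""
    parent = primitive_tokenize(text)
    codes = "".join(CODES[t] for t, _ in parent)
    out, i = [], 0
    while i < len(parent):
        for name, pattern in COMPLEX_TYPES:
            m = pattern.match(codes, i)
            if m:
                out.append((name, "".join(v for _, v in parent[i:m.end()])))
                i = m.end()
                break
        else:
            out.append(parent[i])  # no complex type applies; keep the parent token
            i += 1
    return out
```

For example, complex_tokenize("state-of-the-art costs 3.14") merges the seven parent tokens of "state-of-the-art" into one HYPHENATED token and the three parent tokens of "3.14" into one DECIMAL token, while the remaining parent tokens pass through unchanged.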
Here is the Tokenizer Manager tool, which shows all the tokenizers saved in the system, as well as some of their characteristics:
The user can create a new tokenizer; edit, compile, or remove an existing one; load tokenizers from an external file; or save tokenizer(s) to an external file.
This menu item starts the filter editor. To browse the filters in the system, the user can use the "Filter" combo box at the top of the dialog. The user can add token categories from different tokenizers or add XPath expressions to filter element nodes. A filter defines how tokens and/or elements are removed from the content of a given element when a tool processes that content. Filters are usually used in connection with grammars. A grammar is applied to the content of an element, and the content is processed before the grammar is applied. The processing includes tokenization of the text in the content and conversion of elements to a list of tokens. The result is a list of tokens which is the input for the grammar. Very often some of the tokens in this list are not meaningful for the grammar: space tokens, some special symbols, some special elements (an element for a parenthetical expression, for instance). To exclude these tokens, they can be filtered out of the grammar input in advance. This is the purpose of the filters in the system.
"Token Types" is a list of token categories, that will be filtered. The user can take categories from tokenizers in the system ("Choose From" list on the left side of the dialog) and add them to the list of filtered token categories with the arrow ("=>") button.
To filter an element, the user has to specify an XPath expression. This XPath expression is evaluated on each element in the content being filtered; if it evaluates to true or returns a non-empty list of nodes, the element is filtered out of the content. An XPath expression is added by pressing the "Add XPath" button. The new XPath expression is added to the "Expression" list.
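The effect of a filter on the input of a grammar can be sketched in Python as follows. This is only an illustration under stated assumptions: the Item type, the apply_filter function and the category names are hypothetical, and a plain Python predicate stands in for the XPath expression that the system evaluates on each element.

```python
from typing import NamedTuple

class Item(NamedTuple):
    kind: str   # "token" or "element"
    label: str  # token category, or element tag name
    value: str

def apply_filter(items, token_categories, element_test=lambda item: False):
    """Drop tokens of the listed categories and elements accepted by element_test."""
    kept = []
    for item in items:
        if item.kind == "token" and item.label in token_categories:
            continue  # category is in the "Token Types" list
        if item.kind == "element" and element_test(item):
            continue  # stand-in for an XPath expression that evaluates to true
        kept.append(item)
    return kept

content = [
    Item("token", "LATw", "the"),
    Item("token", "SPACE", " "),
    Item("element", "paren", "(sic)"),
    Item("token", "SPACE", " "),
    Item("token", "LATw", "cat"),
]
grammar_input = apply_filter(content, {"SPACE"}, lambda e: e.label == "paren")
```

Here grammar_input contains only the two word tokens, so a grammar applied to it does not have to account for spaces or the parenthetical element.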
To remove a token category or an XPath from one of the lists, the user must select the line and press the "Remove" button under the table.
After editing, the user has to save the filter with the Save button. The user can remove a filter with the Remove button.
The Export Filters button saves all filters as a file.
The Import Filters button loads filters from a file. There are loading options which determine the behaviour of the import operation.
The Element Features dialog is used for assigning information to the elements of a DTD.
The user can add the following information:
To select the default tokenizer for the DTD, the user must select an item in the "Default Tokenizer" combo box. The user can select a tokenizer for each element of the DTD in the "Tokenizer" column of the table by clicking on a table cell. The check boxes in the "Number" column are used by the sort tool to determine whether the content of the current element can be treated as a number. For example, pages and price can be treated as numbers when two books are compared. The values in the "Element Value" column are used by the Grammar Engine to define the value of the element nodes (for defining a value, see Edit Grammar).
The check boxes in the "Before" and "After" columns determine whether a space symbol is inserted before and after an element deleted by one of the delete operations: Delete Subtree and Delete Node from the Tree Popup Menu, and XPath Remove from the Tools menu. After one of these operations is applied, if the Before checkbox is checked for an element, a space symbol is inserted before the tag if one is not already there; if the After checkbox is checked, a space symbol is inserted after the tag if one is not already there.
The addition (creation) of a new Element feature can be performed by pressing the Add button. The removal of elements can be done by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.
The order of the elements in the DTD can be defined in the sort table shown when the user presses the "Sort Order" button. When two elements are compared, their positions in the sort table define their order. The user can change the position of elements by dragging their rows to the desired positions or by using the context menu that opens when the user right-clicks on a row of the sort table. Here is an example:
The Export Element Features button saves the Element Features for the currently selected DTD as a file.
The Import Element Features button loads Element Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking the Cancel button.
If one of the Element Features uses a tokenizer which does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.
The Attribute Features dialog is used to add information to the attributes of elements from a DTD or of user elements.
The addition (creation) of a new Attribute feature can be performed by pressing the Add button. The removal of attributes can be done by selecting the corresponding row and pressing the Remove button. The OK button closes the dialog and updates the changes (if any). The Cancel button closes the dialog without saving the changes.
The "Element" column of the table contains the elements which have attributes. The same element can appear several times because it can have more than one attribute. The "Attribute" column holds the name of the attribute for which the attribute features are defined. The "Is Enumeration" column shows whether the attribute is an enumeration or not. In the "Tokenizer" column the user can select a different tokenizer for each attribute. The sort tool uses the information in the "Number" column to choose how the values of the attribute are compared (as plain text or as numbers).
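The difference between the two comparison modes can be seen in a small Python sketch. The function name sort_values and its as_number flag are hypothetical and only mimic the effect of the "Number" check box; they are not part of the system.

```python
def sort_values(values, as_number=False):
    """Sort string values either as plain text or by their numeric value."""
    key = float if as_number else str
    return sorted(values, key=key)

pages = ["100", "25", "9"]
as_text = sort_values(pages)                    # ['100', '25', '9']
as_numbers = sort_values(pages, as_number=True) # ['9', '25', '100']
```

Compared as plain text, "100" precedes "25" because the strings are compared character by character; compared as numbers, the values are ordered by magnitude.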
An additional feature is the order of enumerated attribute values. Attributes with enumerated values have the string "Yes" in the "Is Enumeration" column. To sort the enumerated values, the user must select an attribute with enumerated values and click the "Sort Values" button. Example:
The values are sorted in ascending order. The user can change the order by dragging the rows or by pressing the right mouse button.
The Export Attribute Features button saves the Attribute Features for the currently selected DTD as a file.
The Import Attribute Features button loads Attribute Features from a file. All previous data are replaced by the new data. The user can restore the previous data by clicking the Cancel button.
If one of the Attribute Features uses a tokenizer which does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.
XPath Macros give the user the possibility to name XPath expressions. Each macro consists of a macro name and an XPath expression. Macros are designed for general use: they can be parts of XPath expressions and thus can be used wherever XPath is used. Within an XPath expression a macro is denoted by '%' (percent sign) followed by the macro name, i.e. %macro_name. A macro citation itself also represents an XPath expression (CLaRK System extension).
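The %macro_name notation can be thought of as textual substitution, as in the following Python sketch. Several points are assumptions made for the example and are not confirmed by the text: the set of characters allowed in a macro name, the expansion of macros that cite other macros, and the depth limit that guards against circular definitions. The sketch also does not treat '%' inside XPath string literals specially.

```python
import re

# Assumed macro name syntax: a letter or underscore, then letters, digits, underscores.
MACRO_REF = re.compile(r"%([A-Za-z_][A-Za-z0-9_]*)")

def expand_macros(xpath, macros, _depth=0):
    """Replace every %name in xpath with the expression stored under that name."""
    if _depth > 20:
        raise ValueError("macro definitions appear to be circular")

    def replace(match):
        name = match.group(1)
        if name not in macros:
            raise KeyError(f"undefined XPath macro: {name}")
        # Assumption: a macro body may itself cite other macros.
        return expand_macros(macros[name], macros, _depth + 1)

    return MACRO_REF.sub(replace, xpath)

macros = {
    "word": "descendant::w",
    "noun": "%word[@pos='N']",
}
expanded = expand_macros("//s/%noun", macros)
# expanded == "//s/descendant::w[@pos='N']"
```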
The Export XPath Macros button saves selected XPath Macros as a file.
The Import XPath Macros button loads XPath Macros from a file. There are loading options which determine the behaviour of the import operation.
If one of the XPath Macros has no XPath expression or its expression is invalid, a warning message is shown and the user can validate the definition in a validation dialog.
Keys are a means of naming XPath expressions together with some specific information that is important for some of the tools in the system. Key names are arbitrary unique strings. Here is what the XPath Key manager window looks like:
Each row in the table above represents one XPath Key. The content of the table cannot be modified directly from here. To modify a key entry, the user must select the corresponding row in the table and then press the Edit button. A new dialog window appears where the key specifications can be changed. A new XPath Key can be added (created) by pressing the Add button. Keys can be removed by selecting the corresponding row and pressing the Remove button. The Exit button closes the dialog and updates the changes (if any).
There are several types of keys. Each type has different additional options and usage:
The Export Keys button saves selected Keys as a file.
The Import Keys button loads Keys from a file. There are loading options which determine the behaviour of the import operation.
If one of the Keys uses a tokenizer which does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog.
This tool allows the user to execute different actions in the system with certain key combinations.
Each keyboard shortcut definition consists of two parts:
A key combination includes modifier keys (at least one) and an ordinary key. The modifier keys may vary in the different computer architectures. Usually, such keys are:
The action definition determines what type of action will be performed and the concrete action. There are three available action types: selecting a menu item, applying a tool query, and running an XPath search query. Each of them is described in detail later in this section.
Having selected this option, the following dialog window appears:
It presents a list of all shortcuts defined in the system, shown in a table. Each row in the table corresponds to one shortcut. The first two cells show the key combination and the third briefly describes the action of the shortcut. The Key column shows the key which activates the shortcut. The second column, Modifier(s), shows the modifier keys for the shortcut; there can be more than one. The Action column shows information about the shortcut's action.
The shortcut management which this dialog window offers includes adding, editing and removing shortcuts.
Here follows a description of the dialog buttons and their functions:
The Shortcut Editor window appears when the user presses the Edit or Add button.
The key combination is defined in the Modifiers and Key Code sections. Different combinations of modifiers can be selected by clicking on the relevant checkboxes. To select a shortcut activation key or to change the current one, the user has to press the Change button. The system responds with a small dialog titled Press a key and waits for a single keystroke (without modifiers), which is recorded as the new activation key. To stop the recording, the user must press the Cancel button. Once a new activation key has been chosen, the recording dialog disappears and control returns to the editor. If the dialog does not disappear after a key is pressed, the selected key is not suitable as an activation key (the same happens if a modifier key is pressed). If the recording is successful, the newly selected key is shown in the Key Code section.
The action for the shortcut is defined in section Action. The section is separated into three sub-sections, each responsible for one type of action. The sub-sections are as follows:
Synchronization Rules are a means of establishing connections between documents opened in the system editor. A connection maps a selection in one document to selections in other documents. The connections are based on the processing and evaluation of XPath expressions.
When a connection between the current document and another referent document is established each change of the selection in the current one is registered. If the new selection satisfies certain conditions, the connection is activated and a new XPath expression is generated on the basis of a pattern. The new XPath is evaluated in the referent document and the result is selected (in case it is not an empty node-set).
An example usage of this facility is when a certain document is explored and simultaneous look-ups in a dictionary document are needed. In this case a connection between the observed document and the dictionary document can be established. Whenever a word or expression to be looked up is selected, the system extracts the necessary information and performs a search in the dictionary document. If the searched entry is found, it is selected in the background. Thus moving from word to word in the observed document automatically shows the corresponding entries from the dictionary.
Another application of this tool is when different documents are connected in some way by references. Whenever a reference in one document is selected, the referent entity in the corresponding document is selected.
To establish a connection of this type, exactly two documents are required: a current document in which the user works and a referent document in which the selection moves depending on the Sync rule parameters and the selection in the current document. There is no restriction on the number of connections which can take a certain document as the current one. The user can therefore connect one document with several referent documents, so that navigating in one document moves the selections in several documents at the same time.
In order a connection to be established, a
Each rule consists of three parts:
Once a Sync Rule is defined, it can be applied in a connection by opening a document (which will serve as a current one) in the editor and then assigning rule(s) by choosing menu item Document/Synchronize with ....
Document indexing is a representation of the document's content in a way optimized for fast search. In CLaRK this representation can go down to the level of tokens. To use this optimized search, the user must preprocess (index) the input document(s). During indexing the system reads the data from the document and produces the index data, which is stored separately from the document. Whenever a fast search is needed, this data is loaded automatically. The ability to search in an indexed document is provided by an extension function of XPath:
Before the user can index a document, a Document Index definition must be created. These definitions determine the data on which the indexing will be performed. Each definition uses an XPath expression to select the nodes to be indexed. With an additional XPath expression the user can further specify the representative value for each selected node. These values are converted to strings, optionally tokenized, and stored in the internal structure. For convenience, the user can index different parts of a document independently, forming different index repositories for one document. Each of them can then be used independently. Whenever an index search is needed, the user specifies the search query and the repository in which the search will be performed. If no repository is stated, the search is performed in all repositories available for the document.
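The idea of repositories that map stored values to nodes can be sketched in Python as a small inverted index. This is a simplified illustration, not the internal structure used by CLaRK: the class and method names are hypothetical, nodes are represented by string identifiers, and shell-style wildcards from Python's fnmatch module stand in for the system's own token value description language.

```python
from collections import defaultdict
from fnmatch import fnmatchcase

class DocumentIndex:
    """Toy index: repository name -> stored value -> set of node identifiers."""

    def __init__(self):
        self.repos = defaultdict(lambda: defaultdict(set))

    def add(self, repository, node_id, value, tokenize=None):
        # Without a tokenizer the whole value is stored as a single entry.
        entries = tokenize(value) if tokenize else [value]
        for entry in entries:
            self.repos[repository][entry].add(node_id)

    def search(self, query, repository=None):
        # With no repository stated, all repositories are searched.
        names = [repository] if repository is not None else list(self.repos)
        hits = set()
        for name in names:
            for entry, nodes in self.repos.get(name, {}).items():
                if fnmatchcase(entry, query):
                    hits |= nodes
        return hits

index = DocumentIndex()
index.add("words", "n1", "the black cat", tokenize=str.split)
index.add("words", "n2", "a black dog", tokenize=str.split)
index.add("ids", "n3", "cat")  # stored as a whole value, no tokenization

all_cat = index.search("cat")             # searches every repository
word_cat = index.search("cat", "words")   # restricted to one repository
prefix = index.search("bl*", "words")     # wildcard query
```

Restricting the search to one repository both reduces the work done and avoids hits from unrelated content, such as the identifier "cat" in the "ids" repository.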
To define a Document Index, the user has to use the following manager:
It contains a list of all document index definitions saved in the system. It is visualized in a table where the first column contains the names of the definitions and the second one contains lists of repositories of each index.
The possible operations here are: creating new definitions (New Index), modifying existing definitions (Edit Index) and removing existing definitions (Remove Index). Each index definition must have a name.
Once a definition is created, it can be applied to document(s) with the Apply on document button. The user is asked to select documents for indexing, after which the selected ones are indexed according to the selected definition and the data is saved. After that, fast searching can be performed in the processed documents. If an indexed document is modified later, re-indexing might be necessary.
The following section describes the creation of a document index definition. In order to create a definition the user has to supply a name for it. Having done that, a Document Index Editor window appears:
It contains information about the repository definitions of the current index definition. Initially the table is empty.
Each repository definition contains several parts:
An additional option here is setting a tokenizer. It is needed when a document must be indexed not by whole text nodes, but by text tokens within the text. Here the user selects a Tokenizer for processing the key values and the token categories to be indexed (button Customize). With it, the user can filter the categories which are not interesting for indexing. Additionally a token normalization can be used.
Searching in indexed documents
The searching in indexed documents is embedded in the extension of the XPath query language. Thus the indexing can be used wherever XPath search can be performed, i.e. all major system tools.
For this functionality to be used, the target documents have to be indexed in advance. When an index search is performed on a document for the first time, the relevant index information is loaded automatically. This may cause a short delay before further tasks proceed. If an error occurs while the index data is loading, nothing is loaded and subsequent searches will be unsuccessful (no result will be returned). Possible reasons for unsuccessful loading are that the document has been modified after it was indexed or that the index data files have been corrupted. Whenever something goes wrong with loading index data, the user can open the document in the editor and try to reload the data with the menu item Document/Load Index Data, which indicates the reason for the failure (if any).
The extension XPath function which allows index search is:
node-set search ( string, string? )
The function's result is a node-set (possibly empty) which contains the nodes from the current document that match the given search query. The search query is given in the first argument and represents a full or partial token value description, i.e. a certain word or a wildcard description. If no tokenization is used, the queries are matched against the whole node values stored in the index. A definition of the token value description language can be found in section Grammars.
The second function parameter is optional and allows the search results to be refined by considering only a certain repository within the whole index. In this way, if an index contains different repositories with different content types (for example, one containing words from text nodes and another containing ID values coming from attributes), the search is more efficient and the results are more precise when index repositories are cited.
Example index search queries:
The Graphical Tree Layout is a means for drawing arbitrary tree structures represented in XML. The resulting graphical representation follows various user settings, such as the rendering of colors and shapes, the definition of text and structure, filtering, and others.
The main graphical objects which can be used to represent nodes are rectangles, rounded rectangles and ellipses. The user can specify their outline color and thickness, background color, and the text label inside (font, color and content). The nodes in the drawing are connected with lines whose appearance is again user-defined: color and thickness. Additionally, cross-branches links are available. They can connect any nodes in the drawing with arcs whose curvature can be adjusted (to avoid overlapping with other lines and arcs).
The layout itself is a set of rules, each of which corresponds to one shape definition. Each rule has a conditional part which defines the kind of nodes to which the rule is applicable (element, text or comment nodes, or nodes appearing in a certain XPath-defined context). If the condition of a rule is satisfied for a certain node, a new graphical object appears on the drawing canvas. The appearance of the object is defined in the rule.
Another important part of each rule is the Children Definition section, which determines the nodes whose graphical representations will appear as children of the graphical representation of the current node. The children definition is based on an XPath expression, which allows nodes that are not direct children of the current node (or even nodes that do not belong to the current structure at all) to be visualized as child nodes. The default value of the children definition for each rule is child::*, i.e. all direct child nodes.
If none of the rules in a layout is applicable to a certain node, there are three default rules, one of which always succeeds depending on the node type (element, text or comment).
An important section in each layout is the Structure roots definition section. It defines which nodes in the current document are suitable to be roots of a structure to be visualized. It contains an XPath expression which is used as a condition and if it is evaluated successfully for a certain node, the graphical representation building starts from it. Otherwise, the system searches for the closest ancestor which satisfies the condition.
To visualize a document (or a part of a document), it must be opened in the system editor. Once a node in the document has been selected, the menu item View/Graphical Tree View must be chosen; it shows the graphical representation in a separate window. The layout which will be used is defined in the DTD Tree Layout of the current document.
In order to define a graphical tree layout, the user must select menu item Definition / Graphical Tree Layout. The following dialog window appears:
The layouts manager contains several sections which are described below.
The Layouts section contains a list of all layouts which are currently available in the system. Initially the list contains only the entry Default Graphical Layout. The selection in this section determines the content of the other sections of the manager, i.e. it determines the current layout for editing or just exploring.
The Rules section contains information about the rules defined in the current layout. It is represented in a table where each row represents one rule. Having selected a rule from the table a preview image is generated according to the rule's settings and it is shown in section Preview. The preview image contains fixed text and a shape with fixed dimensions. All other settings are taken from the rule.
In this section the user can add New Rule, Edit an existing rule or Remove Rule. The definition of a new rule or modification of an existing one is described in section Layout Rule Editor.
The first three rows of the table contain the default rules for the layout. They can be modified, but cannot be removed.
The Links section contains information about the cross-branches links defined in the layout. Each such link is directed (although this is not visible on the canvas). The starting point of each link is determined by an XPath expression which is evaluated on the whole document. For each selected node which is visible on the canvas a second XPath expression is evaluated, and its result determines the ending point of the corresponding link. If a starting or ending point is not visible, no link is drawn. Each link definition is processed independently. The creation of a link definition or modification of an existing one is described in section Layout Link Editor below.
The Structure roots definition section determines which nodes from the current document can represent a root for a structure to be visualized. The XPath expression is used as a condition for each selected node and in case of success the structure building starts from the corresponding node. If a condition fails for a node, it is checked again for its parent node and so on until a suitable root node is found. If no suitable root is found the system shows an error message.
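The upward search for a structure root can be sketched in Python as follows. The sketch rests on assumptions: it uses the standard xml.etree.ElementTree module rather than the system's own document model, the function name find_structure_root is hypothetical, and a Python predicate stands in for the XPath condition. Since ElementTree elements do not store a reference to their parent, a child-to-parent map is built first.

```python
import xml.etree.ElementTree as ET

def find_structure_root(document_root, selected, is_root):
    """Return the selected node or its closest ancestor accepted by is_root."""
    # Map every element to its parent; the document root has no entry.
    parents = {child: parent
               for parent in document_root.iter()
               for child in parent}
    node = selected
    while node is not None:
        if is_root(node):
            return node
        node = parents.get(node)  # try the parent next
    raise LookupError("no suitable structure root found")

doc = ET.fromstring("<text><s><np><w>cat</w></np></s></text>")
selected = doc.find(".//w")
structure_root = find_structure_root(doc, selected, lambda e: e.tag == "s")
# structure_root is the <s> element that contains the selected <w>
```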
The Layout Rule Editor is used for creating new graphical tree layout rules or modifying existing ones. The layout of the editor is the following:
The dialog contains several sections:
The Layout Link Editor is used for drawing cross-branches arcs in order to express relations other than parent-of or child-of. Each link of this kind is directed and it has a starting point (target) and an ending point (reference). The targets and the references are determined by XPath expressions evaluation.
Each link definition is processed independently in the following way. An XPath expression is evaluated on the whole source document. The resulting node set is reduced to those nodes which appear in the current view. This forms the set of candidates for link starting points. For each of them a second XPath expression is evaluated, and if the result is a non-empty list, a link is established between the current context node and the first entry of the result which is also represented in the image.
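The processing of a single link definition can be sketched in Python. The sketch makes several assumptions: it uses xml.etree.ElementTree, which supports only a subset of XPath, the function and attribute names (build_links, ref, id) are hypothetical, and the second expression is supplied as a Python function that receives the document root and the starting node.

```python
import xml.etree.ElementTree as ET

def build_links(root, visible, start_path, find_targets):
    """Return (start, end) pairs for one link definition."""
    links = []
    # Step 1: evaluate the first expression on the whole document.
    for start in root.findall(start_path):
        # Step 2: keep only candidates that appear in the current view.
        if start not in visible:
            continue
        # Step 3: evaluate the second expression for each candidate and
        # link to the first result that is also visible.
        for target in find_targets(root, start):
            if target in visible:
                links.append((start, target))
                break
    return links

doc = ET.fromstring(
    '<s><np id="a"/><vp><pron ref="a"/><pron ref="b"/></vp><np id="b"/></s>'
)
# Assume that every node except the second noun phrase is shown in the view.
visible = {e for e in doc.iter() if e.get("id") != "b"}

def referenced_np(root, start):
    return root.findall(f".//np[@id='{start.get('ref')}']")

links = build_links(doc, visible, ".//pron[@ref]", referenced_np)
```

In this example only the first pronoun receives a link, because the noun phrase referenced by the second pronoun is not part of the view.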
Here is the layout of the Link Editor dialog:
The components of the editor are as follows:
The user can save the selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from the system as a file.
If a tokenizer is used in some of the definitions but is not exported, a warning message is shown.
The user can load selected definitions (DTDs, Tokenizers, Filters, XPath Macros and/or Keys) from a file. The imported file must be generated by the export operation from the system.
The Loading Options (see below) determine the behaviour of the import operation.
If one of the new definitions uses a tokenizer which does not exist in the system, a warning message is shown and the user can validate the definition in a validation dialog. The dialog depends on the type of the definition: Element Features Validation, Attribute Features Validation, XPath Macros Validation, or Keys Validation.
The loading options apply to cases in which a definition already in the system has the same name as a newly imported definition. There are four modes: