Constraints Tool Demo

The goal of this demo is to show how the constraints tool of the CLaRK system can be used for:

  • Checking the validity of a document regarding a set of constraints - allows the creation of constraints for the validation of a corpus according to given requirements.
  • Supporting the linguist in his/her work during the building of a corpus - supports the underlying strategy of minimising human labour.
The general syntax of the constraints in the CLaRK system is the following:

    The following types of constraints are implemented in CLaRK:

    1. Regular Expression constraints - additional constraints over the content of given elements based on a document context.
    2. Number constraints - cardinality constraints over the content of a document.
    3. Value constraints - restriction of the possible content or parent of an element in a document based on a context.

    Regular Expression Constraints

    In this kind of constraint, the nodes to which the constraint applies are selected by XPath expressions. The content of each selected node must match a description given as a regular expression in the constraint. Constraints of this kind work in validation mode. During application the selected nodes are split into two sets: nodes that match the regular expression and nodes that do not. The user can then navigate through either set.

    These constraints can be used to simulate XML Schema constraints over textual nodes. In addition to checking the content, these constraints can also determine, via the XPath expression, the context of the elements to which they are applied. In this way they can impose regular constraints in addition to those in the DTD, making them more specific on the basis of the surrounding context.
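
    The validation behaviour described above can be sketched in Python. This is only an illustration, not CLaRK's implementation: the sample document, the helper name check_regex_constraint, and the use of Python's re module over the text content (in place of CLaRK's own regular expression language) are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; the content is illustrative only.
DOC = """<text>
  <tok>2003</tok>
  <tok>05</tok>
  <tok>2o</tok>
  <w>year</w>
</text>"""

def check_regex_constraint(root, path, pattern):
    """Split the nodes selected by `path` into (matching, failing) lists."""
    regex = re.compile(pattern)
    matching, failing = [], []
    for node in root.findall(path):
        content = (node.text or "").strip()
        # fullmatch: the whole content must fit the description
        if regex.fullmatch(content):
            matching.append(node)
        else:
            failing.append(node)
    return matching, failing

root = ET.fromstring(DOC)
ok, bad = check_regex_constraint(root, ".//tok", r"[0-9]+")
print([n.text for n in ok])   # nodes that satisfy the constraint
print([n.text for n in bad])  # nodes the user would need to inspect
```

    The two returned lists correspond to the two sets of nodes the user can navigate through after the constraint has been applied.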

    Number Restriction Constraints

    In this kind of constraint the nodes are selected by an XPath expression. For each selected node, a second XPath expression is evaluated and its result is converted to a number using the rules defined in the XPath specification. A constraint is satisfied for a node if the corresponding numeric result lies in the range given by two numbers, MIN and MAX. The MIN and MAX values can be determined dynamically for each node by two further XPath expressions that return numbers. Constraints of this kind are useful, for example, for checking that nodes of different types occur in equal numbers within a given context.
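
    A minimal Python sketch of such a check follows. It is not CLaRK code: the <s> element, the sample content and the helper names are hypothetical, and plain Python callables stand in for the value, MIN and MAX XPath expressions. The example checks that each <s> contains as many <ta> elements as <aa> elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: each <s> should hold as many <ta> as <aa> elements.
DOC = """<text>
  <s id="1"><w><aa>N;V</aa><ta>N</ta></w><w><aa>A</aa><ta>A</ta></w></s>
  <s id="2"><w><aa>N;V</aa></w><w><aa>A</aa><ta>A</ta></w></s>
</text>"""

def check_number_constraint(root, target, value, minimum, maximum):
    """Return the target nodes whose value falls outside [MIN, MAX].

    `value`, `minimum` and `maximum` are callables evaluated per node,
    standing in for the XPath expressions of the constraint.
    """
    violations = []
    for node in root.findall(target):
        v = value(node)
        if not (minimum(node) <= v <= maximum(node)):
            violations.append(node)
    return violations

def count(tag):
    """Build a per-node function that counts descendants named `tag`."""
    return lambda node: len(node.findall(f".//{tag}"))

root = ET.fromstring(DOC)
# MIN and MAX are both the number of <aa> elements, so the counts must be equal.
bad = check_number_constraint(root, ".//s", count("ta"), count("aa"), count("aa"))
print([s.get("id") for s in bad])
```

    Setting MIN and MAX to the same per-node expression is what turns the range check into an equality check.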

    Value Constraints

    These constraints determine the possible children, attributes or parent of an element in a document. They apply when the user enters a new child or a new parent of an element. In both cases the list of possible children or parents is determined by the DTD, but the context in the document may allow this list to be reduced further. When the only possible child of an element is text, or when an attribute is entered, these constraints determine the possible text values.


    Demo 1: Some Children constraints - main disambiguater

    The goal here is to manually disambiguate a morpho-syntactically annotated text. The text is first tokenized, then the possible morpho-syntactic tags are added to each wordform. Finally the text is disambiguated manually with the help of a Value constraint of type Some Children.

    The text in the document is segmented into tokens - Latin words, numbers, punctuation, etc. The following annotation is used in the text: <pt> tag for punctuation, <w> tag for Latin words and <tok> tag for other tokens like numbers.

    For each word we encode the wordform in a <ph> tag. The appropriate morpho-syntactic information from the dictionary is encoded as two elements: an <aa> element, which contains a list of morpho-syntactic tags for the wordform separated by semicolons, and a <ta> element, which has to contain the actual morpho-syntactic tag for this use of the wordform in the text.

    The value of <ta> element has to be among the values in the list presented in the element <aa> for the same wordform.

    Note that when the context determines only one possible value, it is added automatically to the content of <ta> element and thus the constraint becomes a rule.
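
    The relation between <aa> and <ta> can be illustrated with a short Python sketch. The fragment, the tag values (Nc, Vp, R, A) and the helper names are made up for illustration and are not taken from the demo files.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the annotation described above;
# the tag values are invented for illustration.
DOC = """<text>
  <w><ph>walk</ph><aa>Nc;Vp</aa><ta>Vp</ta></w>
  <w><ph>in</ph><aa>R</aa><ta></ta></w>
  <w><ph>park</ph><aa>Nc;Vp</aa><ta>A</ta></w>
</text>"""

def allowed_tags(w):
    """Split the <aa> content on semicolons and drop empty pieces."""
    return [t.strip() for t in (w.findtext("aa") or "").split(";") if t.strip()]

def classify(w):
    """Return 'ok', 'invalid', 'rule' (one candidate) or 'choice' (otherwise)."""
    chosen = (w.findtext("ta") or "").strip()
    allowed = allowed_tags(w)
    if chosen:
        # A filled <ta> must be one of the values listed in <aa>.
        return "ok" if chosen in allowed else "invalid"
    # An empty <ta> with a single candidate can be filled automatically.
    return "rule" if len(allowed) == 1 else "choice"

root = ET.fromstring(DOC)
for w in root.findall("w"):
    print(w.findtext("ph"), classify(w))
```

    Here 'rule' marks the case in which the single possible value would be inserted without asking the user.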

    If there is more than one value, the constraint lets the user choose one of the allowed values. While browsing the choices, the user can get brief information about the meaning of each one. This information must be stored in an internal document - the Help Document. Its structure is described by the DTD in the file helpFile.dtd. The information about a given choice appears in the status bar of the editor when the mouse pointer is over the choice.

    The document used for this demo is Standart20030524constr.tag.

    The tagset description is stored in the help file tag.ttt.

    The tokenizer disambiguate is used. It is in the demo directory and can be loaded within the System. To see the token types defined within this tokenizer, select the Tokenizers item from the menu Definitions.

    In order to run the demo, you have to perform the following steps:

    1. Check whether the document Standart20030524constr.tag and the help file tag.ttt are loaded and saved in the system. If they are not, load them as described in Multi Import Tool Demo. If the documents are loaded in the system, open Standart20030524constr.tag.
    2. Check whether the tokenizer disambiguate is in the tokenizers list. If it is not then you can load it from Tokenizers item from menu Definitions by Load Tokenizer button.
    3. Check whether the filter NO_SC that skips semicolons in the text is in the filters list. If it is not, you can create it as described in Filter Tool Demo.

    4. Open the dialog of the Value Constraints tool from the menu item Tools/Constraints/Value Constraints/Edit Value Constraint.
    5. Create a new Value Constraint by clicking on the button new. In General panel insert as a Constraint Name - disambiguate elements and select for Type of Constraint - Some Children. It means that the constraint will work over the children nodes of the selected element.
    6. Select the Options panel and set the Insertation Mode radio button as the Application Mode. This means that a child node of the selected element is to be inserted (or a token will be added to the text content of the element). The position of the inserted child can be set in the Position text field, where children are counted starting from 1. If the content of the element to which a token is added is non-empty, the token is separated from the rest of the text by the string given in the Separator text field.
    7. Select Show Status Before in order to see the number of the nodes which the constraint will be applied to.
    8. Select Show Status After in order to see how many nodes the constraint was applied to.
    9. Target Specification

      Target specification determines the nodes which the constraint will be applied to. In this demo the target is all <ta> elements which do not have content.

    10. In the text box Target XPath in Target panel write //ta[not(child::*)]. Note that ta is the name of the element which the constraint will be applied to. not(child::*) selects as targets of the constraint all ta elements which have no children.
    11. Source Specification

      Source specification determines where the possible values will be taken from. There are three possibilities: local document, external document, XML mark-up (including text). In the first case the value is pointed by an XPath expression. In the second, the value is in another document and the XPath expression is evaluated within this document. The third option allows the user to state explicit XML fragment or text. If the selected source is text, it is tokenized. In this demo the possible values are stored as text in the <aa> element for each word.

    12. Select the Local Document radio button in Restriction panel.
    13. In the text box XPath/XML write the following XPath expression: previous::aa/text(). This XPath specifies the source list for the constraint: all possible values for the target node are in the <aa> element holding all analyses, separated by semicolons.
    14. To determine different values, a tokenizer must be set. Choose Set a Tokenizer in Advanced panel and select disambiguate tokenizer from the list of tokenizers. This tokenizer is used for segmentation of the tagset.
    15. A filter must be set in order to avoid the separator as a possible value. To do so - choose Set a Filter and set NO_SC - it filters the semicolons.
    16. Very often the values are abbreviations which are hard to remember. In order to view the description of the selected value in the status bar, set a help file containing such a description. To do so - choose Set Help Document and select tag.ttt from the list.
    17. Then you can close the constraint editor and return to the table with all constraints. In order to use the constraint, the user must activate it. Select the corresponding check box in Active column. Now the constraint is ready to be applied.
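
    What the constraint configured in the steps above does can be approximated by the following Python sketch. It is a simplified model under stated assumptions, not CLaRK's implementation: the sample document and tag descriptions are invented, tokenize and no_sc are stand-ins for the disambiguate tokenizer and the NO_SC filter, the HELP dictionary stands in for the help document, and the source expression is read as "the <aa> element that precedes <ta> inside the same <w>".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag values are invented.
DOC = """<text>
  <w><ph>walk</ph><aa>Nc;Vp</aa><ta/></w>
  <w><ph>in</ph><aa>R</aa><ta/></w>
  <w><ph>park</ph><aa>Nc;Vp</aa><ta>Nc</ta></w>
</text>"""

# Stand-in for the help document: tag -> short description (made up).
HELP = {"Nc": "common noun", "Vp": "personal verb", "R": "preposition"}

def tokenize(text):
    """Stand-in for the tokenizer: semicolons and runs of other characters."""
    return re.findall(r";|[^;\s]+", text)

def no_sc(tokens):
    """Stand-in for the NO_SC filter: drop semicolon tokens."""
    return [t for t in tokens if t != ";"]

def apply_constraint(root, choose):
    """Fill every empty <ta>; `choose` is called only when several values remain."""
    filled = 0
    for w in root.iter("w"):
        ta = w.find("ta")
        # Target: <ta> elements with no children and no text yet.
        if ta is None or len(ta) or (ta.text or "").strip():
            continue
        # Source: text of the <aa> element in the same <w>.
        values = no_sc(tokenize(w.findtext("aa") or ""))
        if not values:
            continue
        ta.text = values[0] if len(values) == 1 else choose(w, values)
        filled += 1
    return filled

def show_and_pick_first(w, values):
    """Placeholder for the user's choice: list the options, take the first."""
    for v in values:
        print(f"{w.findtext('ph')}: {v} - {HELP.get(v, 'no description')}")
    return values[0]

root = ET.fromstring(DOC)
print(apply_constraint(root, show_and_pick_first))   # number of nodes filled
print([w.findtext("ta") for w in root.iter("w")])
```

    The choose callback plays the role of the small dialog in which the user selects a value, and the printed descriptions play the role of the status bar text.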

    The user can apply constraints in two ways:

    If the user applies the second method, he/she must:

    The dialog of the tool in this case has to be:

    The above query is saved in the document main.cnst.que in the demo directory.

    When the constraint is applied over the document, a small dialog appears for elements which have more than one possible value.

    As the user goes through the values, the description of each one is shown in the status bar. This is very helpful when the paradigm is rather complex.

    Demo 2: Some Attribute constraints - main disambiguater

    This demo is very similar to the previous one, except that the morpho-syntactic information is represented as values of attributes instead of content of elements.

    The goal is the same: to manually disambiguate a morpho-syntactically annotated text. The text is first tokenized, then the possible morpho-syntactic tags are added to each wordform. Finally the text is disambiguated manually with the help of the Value constraint of type Some Attributes.

    The text in the document is segmented into tokens - Latin words, numbers, punctuation, etc. The following annotation is used in the text: <pt> tag for punctuation, <w> tag for Latin words and <tok> tag for other tokens like numbers.

    For each wordform the appropriate morpho-syntactic information is encoded as two attributes: the aa attribute, which contains a list of possible morpho-syntactic tags for the wordform separated by semicolons, and the ana attribute, which contains the actual morpho-syntactic tag for this use of the wordform.

    The value of ana attribute has to be among the values in the list presented in the attribute aa for the same wordform.

    When the context determines only one possible value, it is added automatically to the content of ana attribute and thus the constraint becomes a rule.

    The document used for this demo is Standart20030525constr.tag.

    The tokenizer is disambiguate tokenizer which is in the demo directory and can be loaded within the System. If you would like to see the token types defined within this tokenizer, you have to select Tokenizers item from the menu Definitions.

    The constraint is almost the same as the constraint described above. For this reason only the differences will be described.

    In order to run the demo you have to perform the following steps:

    1. Check whether the document: Standart20030525constr.tag is loaded in the system. If it is not, then load it as it is described in Import XML Tool Demo. If the document is loaded in the system, open it.
    2. Check whether the tokenizer disambiguate is in the tokenizers list. If it is not, then you can load it from Tokenizers item from menu Definitions by Load Tokenizer button.
    3. Check whether the filter NO_SC is in the filters list. If it is not, you can create it as it is described in Filter Tool Demo.

    4. Open the dialog of the Value Constraints tool from the menu item Tools/Constraints/Value Constraints/Edit Value Constraint.
    5. Create a new Value Constraint. In General panel insert as a Constraint Name - disambiguate attributes and select Type of Constraint - Some Attributes. It means that the constraint will work over the attributes of the selected element.
    6. The settings in the panel Options are the same as in the constraint described above.

      Target Specification

    7. In the text box Target XPath in Target panel write //w[not(@ana)]. w is the name of the elements which the constraint will be applied to; the element name is needed because different elements can have attributes with the same name. not(@ana) selects all the elements w that have no attribute ana.
    8. In the text box Target Attribute write ana. This is the name of the attribute which the constraint will be applied to.
    9. Source Specification

    10. In the text box XPath/XML in Restriction panel write the following XPath expression: @aa. This XPath specifies the source list for the constraint: all possible values for the target attribute, which are stored as the value of the attribute aa, separated by semicolons.

    The application of the constraint is the same as above. The query is saved in the document attribute.cnst.que in the demo directory.
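
    The attribute variant can be sketched in Python in the same spirit. Again this is an illustration only: the sample document and tag values are invented, fill_ana is a hypothetical helper, and a plain split on semicolons stands in for the tokenizer and the NO_SC filter.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; tag values are invented.
DOC = """<text>
  <w aa="Nc;Vp">walk</w>
  <w aa="R">in</w>
  <w aa="Nc;Vp" ana="Nc">park</w>
</text>"""

def fill_ana(root, choose):
    """Set the ana attribute on every <w> that lacks it, using values from aa."""
    for w in root.iter("w"):
        if "ana" in w.attrib:   # target: only <w> elements without ana
            continue
        # Source: the semicolon-separated values of the aa attribute.
        values = [v for v in w.get("aa", "").split(";") if v]
        if values:
            w.set("ana", values[0] if len(values) == 1 else choose(w, values))

root = ET.fromstring(DOC)
# The lambda is a placeholder for the user's choice in the dialog.
fill_ana(root, lambda w, values: values[-1])
print([(w.text, w.get("ana")) for w in root.iter("w")])
```

    The only differences from the element-based sketch are the target test (a missing attribute instead of an empty element) and the place where the chosen value is written.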

    Demo 3: Creation of a dictionary

    The goal here is to create a very simple dictionary.

    The document used for this demo is: const_demo.xml. It only contains very simple structures of word entries without any other information. The file also contains a header element with a list of all part-of-speech category names allowed for the dictionary.

    The set of constraints in this demo builds partial sub-structures automatically when the user supplies certain information. In the cases when the user must decide what information to supply, the constraints reduce (constrain) the choices as much as possible.

    In order to run the demo you have to perform the following steps:

    1. Check whether the document: const_demo.xml is loaded and opened in the system. If it is not, then the user must compile the DTD from file: const_demo.dtd and then load the document as it is described in Import XML Tool Demo. If the document is loaded in the system, open it.
    2. Check whether the tokenizer disambiguate is in the tokenizers list. If it is not, then you can load it from Tokenizers item of the menu Definitions by Load Tokenizer button.
    3. Check whether the filter NO_SC that skips semicolons in the text is in the filters list. If it is not, you can create it as described in Filter Tool Demo.
    4. Load the constraints. The constraints are stored in a file const_demo.scr. In order to load it, the user must choose menu item Tools/Constraints/Value Constraints/Edit Value Constraint. Then choose Load From File button and point to file const_demo.src. The constraints are loaded in the dictionary group.
    5. Apply the Value Constraint group, created in the previous step.

    The user can apply constraint groups in two ways:

    The user can apply constraints from the Value Constraints Manager or from the Apply Constraints dialog:

    If the user applies the first method, he/she must:

    If the user applies the second method, he/she must: