Grammar Tool Demo

The goal of this demo is to show how the grammar tool of the CLaRK system can be used to find and mark up sequences of data in XML documents.

A Grammar in the CLaRK System is defined as a set of rules. Each rule consists of three regular expressions and a category (represented as an XML fragment, called Return Markup). The three regular expressions are called: Regular Expression, Left Regular Expression, Right Regular Expression. The Regular Expression determines the content which the rule can be applied to. The Left and the Right Regular Expression determine the left and the right context of the content the rule recognises (if there are no constraints over one of the contexts, the corresponding expression is empty). When the rule is applied and recognises some part of an XML document, that part is substituted by the return markup of the rule. If it is necessary to keep the recognised part, it can be referred to by the variable \w. If the user needs the literal string \w in the return markup, he/she can escape the variable by writing it as ^\w.
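To make these mechanics concrete, here is a minimal Python sketch of rule application. The rule syntax is simplified and is not CLaRK's actual XML rule format: Python regexes over a space-joined item list stand in for the three regular expressions.

```python
import re

def apply_rule(items, pattern, left, right, markup):
    """Apply a simplified grammar rule to a list of textual items.

    pattern/left/right are Python regexes over the space-joined items;
    markup is the return markup, where \\w stands for the match and
    ^\\w escapes a literal "\\w"."""
    text = " ".join(items)
    for m in re.finditer(pattern, text):
        # check left and right context (empty means no constraint)
        if left and not re.search(left + r"$", text[:m.start()]):
            continue
        if right and not re.match(right, text[m.end():]):
            continue
        # substitute the match for every \w, then restore escaped ^\w
        result = markup.replace(r"^\w", "\x00").replace(r"\w", m.group(0))
        return text[:m.start()] + result.replace("\x00", r"\w") + text[m.end():]
    return text

# mark a number as <num>...</num>
print(apply_rule(["price", "42", "euro"], r"\d+", "", "", r"<num>\w</num>"))
# -> price <num>42</num> euro
```

A rule that does not match leaves the text unchanged, and a return markup of `^\w` produces the literal string `\w` rather than a copy of the match.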

The regular grammars in the CLaRK System work over token and/or element values generated from the content of an XML document and they incorporate their results back in the document as XML mark-up.

The tokens are determined by the corresponding tokenizer.

Before being used in a grammar, each XML element is converted into a list of textual items. This list is called the element value of the XML element. The element values are defined with the help of XPath keys, which determine the important information for each element.

In the grammars, the token and element values are described by token and element descriptions. These descriptions could contain wildcard symbols and variables. The variables are shared among the token descriptions within a regular expression and can be used for the treatment of phenomena like agreement.

Here is the list of the token and element descriptions:

  1. "token" -> describes the token itself. This description matches exactly this token and nothing else.

  2. $TokenCategory -> describes all tokens of the category TokenCategory. In the grammar input this description is matched against exactly one token of this category.

  3. Wildcard Symbols: #, @, % -> describe substrings of a given token. # describes a substring of arbitrary length (from 0 to infinity), @ describes a substring of length 0 or 1, and % describes a substring of length exactly one.
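A hedged Python sketch of this behaviour: the mapping of the wildcards onto regex operators (`#` to `.*`, `@` to `.?`, `%` to `.`) follows the length descriptions above, not CLaRK's internal implementation.

```python
import re

# translation of the wildcard token descriptions into Python regexes:
# "#" = any substring (0+ chars), "@" = 0 or 1 chars, "%" = exactly 1 char
WILDCARDS = {"#": ".*", "@": ".?", "%": "."}

def description_to_regex(desc):
    return "".join(WILDCARDS.get(ch, re.escape(ch)) for ch in desc)

def matches(desc, token):
    return re.fullmatch(description_to_regex(desc), token) is not None

print(matches("lov%#", "loving"))   # True: "lov" + one char + any rest
print(matches("lov%#", "lov"))      # False: "%" requires one character
```

The description "lov%#" is taken from the complex-description examples below; it matches "loves", "loving", "lovely", but not the bare "lov".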

  4. Variables: &V -> a variable matches some substring in a token when it is initialised for the first time; afterwards it matches the same substring in the same token or in some following tokens. The scope of a variable is one grammar rule. A variable can be used in the return mark-up, in which case the value of the variable is copied into the return mark-up. Each variable consists of the symbol & followed by a single Latin letter. Each variable has positive and/or negative constraints over its possible values. Both the positive and the negative constraints are given by lists of token descriptions. The value assigned to a variable during the application of the grammar has to be described by one of the positive constraints and must not be described by any of the negative constraints. These token descriptions can contain wildcard symbols, but no other variables. Here are some examples: "A&N&G", "Nc&N&G". These token descriptions can be used in a rule to ensure agreement in number and gender between an adjective and a noun.
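The agreement mechanism can be sketched in Python. The assumption that a variable binds exactly one character (the number/gender letters of tags like "Asf") is ours for the illustration, and the tags themselves are invented; CLaRK's actual variable semantics are richer.

```python
import re

def match_with_vars(descs, tokens):
    """Match a sequence of token descriptions containing shared
    variables (&X) against a sequence of tokens.  Simplifying
    assumption: a variable binds exactly one character.
    Returns the variable bindings on success, or None on failure."""
    bindings = {}
    for desc, tok in zip(descs, tokens):
        parts, i = [], 0
        while i < len(desc):
            if desc[i] == "&":            # variable reference
                var = desc[i + 1]
                if var in bindings:       # already bound: must repeat its value
                    parts.append(re.escape(bindings[var]))
                else:                     # first use: capture one character
                    parts.append("(?P<%s>.)" % var)
                i += 2
            else:                         # literal character
                parts.append(re.escape(desc[i]))
                i += 1
        m = re.fullmatch("".join(parts), tok)
        if m is None:
            return None
        bindings.update(m.groupdict())
    return bindings

# adjective tag and noun tag must agree in number (&N) and gender (&G)
print(match_with_vars(["A&N&G", "Nc&N&G"], ["Asf", "Ncsf"]))   # {'N': 's', 'G': 'f'}
print(match_with_vars(["A&N&G", "Nc&N&G"], ["Asf", "Ncsm"]))   # None
```

The second call fails because &G was bound to "f" by the adjective tag, so the noun tag ending in "m" cannot match.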

  5. Complex token descriptions. The user can combine the above descriptions in one token description. Some examples: "lov%#", "Vp&N&G#&P"

  6. Element description -> a regular expression in angle brackets: < Regular Expression >. Here the Regular Expression is over token descriptions and is matched against the element value.

The application of a rule works in the following way. The element value is scanned from left to right. The rule's Regular Expression is evaluated from the current point. If it recognises a part of the element value (we will call this part a match of the rule), then the regular expressions for the left and the right contexts are evaluated (if they are not empty). If they are satisfied by the context of the match, the match is substituted for each occurrence of the variable \w in the return markup. (The user must be careful: if the return markup contains, for example, text like \word, the leading \w will be substituted by the match; in such cases the variable must be escaped.) After these substitutions, the new markup replaces the match in the XML document.

When a regular expression is evaluated from a given point within the element value, several matches may be possible. For instance, the expression (A,B)+ over the element value L,A,B,A,B,A can recognise two matches from the second position: A,B and A,B,A,B. This allows a non-deterministic choice at this point: one can choose the shortest match, the longest one, or some match in between. Generally, there are no universal principles for making such a choice. This is why the CLaRK system allows the user to define a strategy for choosing among several matches. Four strategies are envisaged:

  - shortest match - the system always selects the shortest possible match;
  - longest match - the system always selects the longest possible match;
  - any up - the possible matches are enumerated from the shortest to the longest up to the moment when the left and/or the right context of the match satisfy the Left and/or the Right Regular Expression. If there are no Left and Right Regular Expressions, any up is the same as shortest match;
  - any down - similar to any up, except that the possible matches are enumerated from the longest to the shortest.

These strategies are specified within the grammar queries. This allows the same grammar to be applied with different strategies over different documents.

The definition and the application of a grammar are separated within the CLaRK system. The grammar itself is defined in one place; the parameters for its application are defined in another place, in the form of grammar queries. This separation allows the same grammar to be used with different parameters: different tokenizers, different element values, different filters, etc. Both grammars and grammar queries have an XML representation, which allows them to be exchanged among different users. It also allows grammars to be constructed outside the system and then imported into it.

The grammar definition consists of a set of rules, variable definitions and a context evaluation parameter (Check Context Order). The rules have already been discussed. The variable definitions are given by the positive and negative constraints over each variable. The context evaluation parameter determines the order in which the contexts of a match are checked: first the left and then the right, or vice versa.

The grammar application determines: the elements which the grammar will be applied to; the element values for those elements, including the tokenizer and the filter; whether the textual elements will be normalized; and the application strategy (shortest match, longest match, any up or any down).

Demo 1: Grammar that marks tokens

Our goal here is to mark all meaningful tokens - words, punctuation, numbers etc. in the text using tokenizer categories.

The document used for this demo is: Standart20030524.tag.

The tokenizer is the MixedWord tokenizer, which is defined within the System. If you would like to see the token types defined within this tokenizer, you have to select the Tokenizers item from the menu Definitions.

In order to run the demo you have to perform the following steps:

  1. Check whether the document: Standart20030524.tag is loaded and saved in the system. If it is not, then load it as it is described in Import XML Tool Demo.
  2. Open the document in the editor.
  3. Open the dialog of the Grammar Manager tool from the menu item Tools.
  4. Create new grammar named tokenize - the grammar editor dialog is opened.
  5. The goal of the grammar is to mark all tokens in the text - words, numbers, punctuation and white spaces. Punctuation will be marked with the <pt> tag, white spaces with the <whitespace> tag, and all other tokens with the <tok> tag. The MixedWord Tokenizer separates the text into such tokens, but it sometimes splits tokens that should be treated as a whole.

    For each token we have a different rule in the grammar:

  6. The grammar in this case has to be:
  7. In order to use the grammar, you have to save and compile it, and then return to the Grammar Manager dialog via the Exit button.
  8. Applying the grammar over the document:

The above query is saved in document tokenize.gram.que in the demo directory.
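To illustrate the intended result of Demo 1, here is a small Python check of the kind of markup the grammar produces. The fragment is an invented English example (the actual demo document differs); only the <tok>/<pt>/<whitespace> wrapping follows the description above.

```python
import xml.etree.ElementTree as ET

# hypothetical fragment of the kind of markup Demo 1 produces:
# every token wrapped in <tok>, <pt> or <whitespace>
fragment = (
    "<text>"
    "<tok>Hello</tok><whitespace> </whitespace>"
    "<tok>world</tok><pt>!</pt>"
    "</text>"
)
root = ET.fromstring(fragment)
print([el.tag for el in root])      # ['tok', 'whitespace', 'tok', 'pt']
print("".join(root.itertext()))     # Hello world!
```

Note that concatenating the text of all token elements restores the original string, so the markup adds structure without losing content.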

Demo 2: Grammar for tense recognition

Our goal here is to create a grammar that will be used for a concordance of the verbal tenses in the text. We do not aim to present a precise grammar. This grammar can be used for observation of the text and can serve as a base for more precise grammars.

The document used for this demo is: Standart20030524.tag.

The tokenizer is MixedWord, which is defined within the System. If you would like to see the token types defined within this tokenizer, you have to select the Tokenizers item from the menu Definitions.

In order to run the demo you have to perform the following steps:

  1. Check whether the document: Standart20030524.tag is loaded and saved in the system. If it is not, then load it as it is described in Import XML Tool Demo.
  2. Open the document in the editor.
  3. Open the dialog of the Grammar Manager tool from the menu item Tools.
  4. Create a new grammar named tense - the grammar editor dialog is opened.
  5. This grammar will be used by the concordance tool. For this reason the found items will not be marked.

    For each tense we have a different rule in the grammar:

  6. The grammar in this case has to be:
  7. In order to use the grammar, save and compile it, and then return to the Grammar Manager dialog via the Exit button.
  8. Applying the grammar over the document:

The above query is saved in the document tense.gram.que in the demo directory.

Demo 3: Grammars that add morphological information to tokens

Our goal here is to add morphological information to tokens - all possible analyses for the current word.

Two grammars are demonstrated. The only difference between them is in the Return mark-up. Both grammars work over the result file from Demo 1.

The document used for this demo is: Standart20030524.tag.

No tokenizer is needed because the grammar works over the whole string in tok elements.

In order to run the demo you have to perform the following steps:

  1. Check whether the document: Standart20030524.tag - the result from Demo1 is loaded and saved in the system. If it is not, then load it as it is described in Import XML Tool Demo.
  2. Open the document in the editor.
  3. Open the dialog of the Grammar Manager tool from the menu item Tools.
  4. Load the two grammars dict-attr.grm and dict-element.grm in the Grammar Manager:
  5. The user can inspect each grammar by selecting it and pressing the Edit button.

    The grammars mark each token in an appropriate way and add information about all the possible morphological analyses for that token.

    In dict-element.grm the recognized words are marked with the <w> tag. For each word the wordform is encoded in a <ph> tag, and the appropriate morpho-syntactic information from the dictionary is encoded as two elements: an <aa> element, which contains a list of morpho-syntactic tags for the wordform separated by semicolons, and a <ta> element, which contains the actual morpho-syntactic tag for this use of the wordform.

    The grammar in this case is:

    In dict-attr.grm the recognized words are marked with the <w> tag. For each wordform the appropriate morpho-syntactic information from the dictionary is encoded as two attributes: the aa attribute, which contains a list of morpho-syntactic tags for the wordform separated by semicolons, and the ana attribute, which contains the actual morpho-syntactic tag for this use of the wordform.

    The grammar in this case is:

  6. Applying the grammar over the document - the query for the dict-attr.grm grammar is described here. The query for dict-element.grm is the same; only the grammar must be changed:

The above query is saved in the document Eng_Dict_Attribute in the demo directory.

The query of dict-element.grm is saved in the document Eng_Dict_Element in the demo directory.
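The two encodings can be contrasted with a small Python sketch. The fragments and the tag values (Vpip3s, Ncp) are invented for the illustration; the actual values come from the dictionary used in the demo.

```python
import xml.etree.ElementTree as ET

# dict-element.grm style: wordform and analyses as child elements
element_style = "<w><ph>walks</ph><aa>Vpip3s;Ncp</aa><ta>Vpip3s</ta></w>"
# dict-attr.grm style: analyses as attributes on <w>
attr_style = '<w aa="Vpip3s;Ncp" ana="Vpip3s">walks</w>'

w1 = ET.fromstring(element_style)
print(w1.find("ph").text, "->", w1.find("ta").text)   # walks -> Vpip3s

w2 = ET.fromstring(attr_style)
print(w2.text, "->", w2.get("ana"))                   # walks -> Vpip3s
```

Both styles carry the same information: a list of candidate tags (aa) and the single tag chosen for this use of the wordform (ta/ana).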

Demo 4 : A grammar for sentence delimitation

Our goal here is to delimit sentence boundaries.

The document used for this demo is: Standart20030525gram.tag.

The tokenizer is the Uptok tokenizer, which separates words with a capital first letter from words in small letters. This is important for named entity recognition and sentence boundary detection. The tokenizer can be loaded within the System from the Tokenizers item of the menu Definitions.

In order to run the demo, you have to perform the following steps:

  1. Check whether the tokenizer Uptok is loaded and saved in the system. If it is not, then load it as it is described in Tokenizer Tool Demo.
  2. Check whether the document: Standart20030525gram.tag is loaded and saved in the system. If it is not, then load it as it is described in Import XML Tool Demo.
  3. Open the document in the editor.
  4. Open the dialog of the Grammar Manager tool from the menu item Tools.
  5. Load the grammar in the system from File I/O, Load grammar from file - choose sentence.gram. If you would like to see the grammar rules within this grammar, you have to select Edit button of Grammar Manager Dialog.
  6. The grammar determines a sentence as a sequence of tokens - words, numbers and punctuation (we refer to the content of the <w>, <tok> and <pt> elements with Element Values). As the first element of a sentence we have a capitalized word ($LATwc) or a tok element which is a number ($NUMBER+). The sentence ends with punctuation - a full stop, question mark or exclamation mark. The sentence can be followed by another sentence, or it can be the last one in a paragraph ($$) - these features are matched by the right context of the grammar. Sometimes the sentence can be direct speech - in this case it starts with a dash.

  7. The grammar in this case has to be:
  8. Applying the grammar over the document:

The above query is saved in the document sentence.gram.que in the current directory.
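As a rough cross-check of the idea (not of the actual grammar), here is a Python regex analogue that works over a plain string rather than over CLaRK element values. The token categories $LATwc and $NUMBER are approximated by character classes, and the paragraph-end category $$ and the context expressions are ignored.

```python
import re

# a sentence starts with a capitalised word, a number, or a dash
# (direct speech), and ends with ".", "?" or "!"
SENTENCE = re.compile(
    r"(?:[A-Z][a-z]*|\d+|-)"   # first token: capitalised word, number or dash
    r"[^.?!]*"                 # any following material without end punctuation
    r"[.?!]"                   # final full stop, question or exclamation mark
)

text = "First sentence here. - Direct speech! 42 is a number?"
print(SENTENCE.findall(text))
# ['First sentence here.', '- Direct speech!', '42 is a number?']
```

This captures the three sentence shapes the grammar describes: an ordinary capitalised sentence, direct speech opened by a dash, and a sentence beginning with a number.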