The goal of this demo is to show how the grammar tool of the CLaRK system can be used to find and mark up sequences of data from XML documents.
A grammar in the CLaRK system is defined as a set of rules. Each rule consists of three regular expressions and a category (represented as an XML fragment, called Return Markup). The three regular expressions are called: Regular Expression, Left Regular Expression, Right Regular Expression. The Regular Expression determines the content to which the rule can be applied. The Left and the Right Regular Expression determine the left and the right context of the content the rule recognises (if there are no constraints over one of the contexts, then the corresponding expression is empty). When the rule is applied and recognises some part of an XML document, that part is substituted by the return markup of the rule. If it is necessary to keep the recognised part, it can be cited by using the variable \w. If the user needs the literal string \w in the return markup, the \w variable can be escaped in the following way: ^\w.
The regular grammars in the CLaRK System work over token and/or element values generated from the content of an XML document and they incorporate their results back in the document as XML mark-up.
The tokens are determined by the corresponding tokenizer.
Before being used in the grammar, each XML element is converted into a list of textual items. This list is called the element value of the XML element. Element values are defined with the help of XPath keys, which determine the important information for each element.
In the grammars, the token and element values are described by token and element descriptions. These descriptions could contain wildcard symbols and variables. The variables are shared among the token descriptions within a regular expression and can be used for the treatment of phenomena like agreement.
Here is the list of the token and element descriptions:
"token" -> describes the token itself. This description matches the token itself and nothing else.
$TokenCategory -> describes all tokens of the category TokenCategory. In the grammar input this description is matched against exactly one token of this category.
Wildcard symbols: #, @, % -> describe substrings of a given token. # describes a substring of arbitrary length (from 0 to infinity), @ describes a substring of length 0 or 1, and % describes a substring of length exactly one. Here are some examples:
"lov#": matches exactly one token which starts with "lov": "lov", "love", "loves", and many others.
"lo#ve": matches exactly one token which starts with "lo" and ends with "ve": "love", "locative", "locomotive", etc.
"%og": matches exactly one token which consists of one arbitrary character followed by "og": "bog", "dog", "fog", "jog", etc.
"do@": matches exactly one token which starts with "do" and continues with at most one more character: "do", "doe", "dog", "don", etc.
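The wildcard semantics above can be illustrated with a short Python sketch. The translation function below is our own illustration, not part of the CLaRK system:

```python
import re

# Illustration only: translate a CLaRK wildcard token description into a
# Python regular expression.
# '#' -> any substring (length 0 to infinity)
# '@' -> a substring of length 0 or 1
# '%' -> exactly one character
# Everything else matches literally.
def clark_to_regex(desc: str) -> str:
    mapping = {"#": ".*", "@": ".?", "%": "."}
    return "".join(mapping.get(ch, re.escape(ch)) for ch in desc)

def matches(desc: str, token: str) -> bool:
    return re.fullmatch(clark_to_regex(desc), token) is not None

print(matches("lov#", "loves"))       # True
print(matches("lov#", "locative"))    # False: does not start with "lov"
print(matches("lo#ve", "locomotive")) # True
print(matches("%og", "dog"))          # True
print(matches("%og", "og"))           # False: '%' needs exactly one character
print(matches("do@", "doe"))          # True
print(matches("do@", "dose"))         # False: '@' allows at most one character
```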
Variables: &V -> a variable describes some substring in a token; when it is initialised for the first time, it afterwards matches the same substring in the same token or in some following tokens. The scope of a variable is one grammar rule. A variable can also be used in the return mark-up, in which case its value is copied into the return mark-up. Each variable consists of the symbol & followed by a single Latin letter. Each variable has positive and/or negative constraints over its possible values. Both the positive and the negative constraints are given as lists of token descriptions. The value assigned to a variable during the application of the grammar has to be described by one of the positive constraints and must not be described by any of the negative constraints. These token descriptions can contain wildcard symbols, but no other variables. Here are some examples: "A&N&G", "Nc&N&G". These token descriptions can be used in a rule to ensure agreement in number and gender between an adjective and a noun.
Complex token descriptions -> the user can combine the above descriptions in one token description. Some examples: "lov%#", "Vp&N&G#&P".
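How shared variables enforce agreement can be sketched in Python using named groups and backreferences. This is an illustration under the simplifying assumption that a variable binds exactly one character (in CLaRK it may bind a longer substring, and the matcher works differently):

```python
import re

# Illustration only: mimic CLaRK variables (&N, &G, ...) with Python named
# groups and backreferences. Simplifying assumption: a variable binds
# exactly one character.
def descriptions_to_regex(descs):
    seen, parts = set(), []
    for desc in descs:
        out, i = "", 0
        while i < len(desc):
            if desc[i] == "&":                 # a variable reference
                var = desc[i + 1]
                # first occurrence captures, later ones must repeat the value
                out += "(?P=%s)" % var if var in seen else "(?P<%s>.)" % var
                seen.add(var)
                i += 2
            else:                              # a literal character
                out += re.escape(desc[i])
                i += 1
        parts.append(out)
    return " ".join(parts)                     # one pattern per token

# "A&N&G" then "Nc&N&G": adjective and noun must agree in number and gender.
pattern = descriptions_to_regex(["A&N&G", "Nc&N&G"])
print(bool(re.fullmatch(pattern, "Asf Ncsf")))  # True: both singular feminine
print(bool(re.fullmatch(pattern, "Asf Ncpm")))  # False: number/gender differ
```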
An element description is a regular expression in angle brackets: < Regular Expression >. Here the Regular Expression is over token descriptions and is matched against the element value. Examples:
<w>: matches exactly one w element.
<$TokenCategory>: matches exactly one element whose element value is a token of the category TokenCategory.
<"token">: matches exactly one element whose element value is the token itself and nothing else. The token can contain wildcard symbols and variables.
<<N>>: matches exactly one element whose element value is the XML element N.
The application of a rule works in the following way. The element value is scanned from left to right. The rule's Regular Expression is evaluated from the current point. If it recognises a part of the element value (we will call this part a match of the rule), then the regular expressions for the left and for the right contexts are evaluated (if they are not empty). If they are satisfied by the context of the match, then the match is substituted for each occurrence of the variable \w in the return markup. (The user must be careful: if the return markup contains, for example, text like \word, its beginning \w will be substituted by the match. In this case the variable must be escaped.) After these substitutions, the new markup replaces the match in the XML document.
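The matching and substitution step can be illustrated with a simplified Python sketch over a plain string. The function name and the string-based simplification are ours; CLaRK itself matches over token and element values:

```python
import re

# Simplified illustration of rule application: expr is matched between the
# left and right context expressions, and every occurrence of \w in the
# return markup is replaced by the match.
def apply_rule(text, expr, left, right, return_markup):
    pattern = (("(?<=%s)" % left) if left else "") \
              + "(%s)" % expr \
              + (("(?=%s)" % right) if right else "")
    return re.sub(pattern,
                  lambda m: return_markup.replace(r"\w", m.group(1)),
                  text)

print(apply_rule("the dog barks", "dog", "the ", " barks",
                 r"<animal>\w</animal>"))
# the <animal>dog</animal> barks
```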
When a regular expression is evaluated from a given point within the element value, several matches to the expression may be possible. For instance, the expression (A,B)+ over the element value L,A,B,A,B,A can recognise two matches from the second position: A,B and A,B,A,B. This allows for a non-deterministic choice in this place. One can choose either the shortest match, or the longest one, or some match in between. Generally, there are no universal principles for making such a choice. This is why the CLaRK system allows the user to define a strategy for choosing a match among several possibilities. We envisage four strategies: shortest match - the system always selects the shortest possible match; longest match - the system always selects the longest possible match; any up - the possible matches are enumerated from the shortest to the longest until the left and/or the right context of the match satisfies the Left and/or the Right Regular Expression (if there are no Left and Right Regular Expressions, then the any up strategy is the same as shortest match); any down - similar to any up, except that the possible matches are enumerated from the longest to the shortest. These strategies are specified within the grammar queries. This allows the same grammar to be applied with different strategies over different documents.
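The example above can be reproduced with a small Python sketch. This is our illustration: tokens are written as single letters, and CLaRK's comma concatenation becomes plain adjacency in the Python regex:

```python
import re

# All matches of expr that start at position `start`, shortest first.
def possible_matches(value, start, expr):
    return [value[start:end]
            for end in range(start + 1, len(value) + 1)
            if re.fullmatch(expr, value[start:end])]

# (A,B)+ over the element value L,A,B,A,B,A from the second position:
ms = possible_matches("LABABA", 1, "(AB)+")
print(ms)  # ['AB', 'ABAB']
# shortest match -> ms[0]; longest match -> ms[-1];
# "any up" tries ms in this order until the contexts are satisfied;
# "any down" tries reversed(ms).
```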
The definition and application of a grammar are separated within the CLaRK system. The grammar itself is defined at one place, the parameters for its application are defined at another place in the form of grammar queries. This separation allows the use of the same grammar with different parameters like different tokenizers, different element values, different filters etc. Each of them has an XML representation. These XML representations allow grammars and their queries to be exchanged among different users. Also this allows the grammars to be constructed out of the system and then imported within it.
The grammar definition consists of a set of rules, variable definitions and a context evaluation parameter (Check Context Order). The rules have already been discussed. The variable definitions are given by the positive and negative constraints over each variable. The context evaluation parameter determines in which order the contexts of a regular expression are checked - first the left and then the right, or vice versa.
The grammar application determines: the elements which the grammar will be applied to; the element values for those elements, including the tokenizer and the filter; whether the textual elements will be normalized; the application strategy (longest, shortest match, any up and any down match).
Our goal here is to mark all meaningful tokens - words, punctuation, numbers, etc. - in the text using tokenizer categories.
The document used for this demo is: Standart20030524.tag.
The tokenizer is the MixedWord tokenizer, which is defined within the System. If you would like to see the token types defined within this tokenizer, you have to select the Tokenizers item from the menu Definitions.
In order to run the demo you have to perform the following steps:
- Check that Standart20030524.tag is loaded and saved in the system. If it is not, then load it as described in the Import XML Tool Demo.
- From the menu Tools create a new grammar tokenize - the grammar editor dialog is opened.
The goal of the grammar is to mark all tokens from the text - words, numbers, punctuation and white spaces. Punctuation will be marked with the <pt> tag, white spaces with the <whitespace> tag, and all other tokens with the <tok> tag. The MixedWord tokenizer separates the text into such tokens, but sometimes it divides tokens which must be considered a whole. For each kind of token we have a different rule in the grammar:
- For a word of category LATw, the Regular Expression (RE) of the grammar rule will be $LATw.
- MP's will be interpreted as three tokens: MP, ' and s. So the rule must unite these three tokens - the RE will be $LATw,"'",$LATw.
- The two rules for words and possessive forms differ only in "'",$LATw, so they can be united: $LATw,("'",$LATw)?.
- For numbers the RE is $NUMBER+ ($NUMBER is a Token Category that recognizes digits).
- For a number followed by a word the RE is $NUMBER+,$LATw.
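As a rough illustration (ours, not the system's), the united rule $LATw,("'",$LATw)? can be approximated with a Python regex over raw text, rendering $LATw as a run of Latin letters and $NUMBER as a digit:

```python
import re

# Approximation only: $LATw -> run of Latin letters, $NUMBER -> digit.
LATW = r"[A-Za-z]+"
WORD = rf"{LATW}(?:'{LATW})?"    # $LATw,("'",$LATw)?
NUM = r"[0-9]+"                  # $NUMBER+

# The possessive "MP's" is kept as one match instead of three tokens.
print(re.findall(rf"{WORD}|{NUM}", "The MP's report from 2003"))
# ['The', "MP's", 'report', 'from', '2003']
```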
- Press the Exit button.
- Open the Grammar Manager by pressing the Apply button or from the Apply Grammar menu item.
- Select tokenize in the Grammar field by clicking the Select button and choosing it from the Grammar Manager dialog.
- In Apply to write the following XPath expression: //text/descendant-or-self::p | //head | //trailer, which selects all the parents of textual elements within the document.
- As Tokenizer choose MixedWord - the tokenizer whose token categories are used in the grammar.
- The Normalize check box is not selected, because the tokenizer makes no difference between capital and small letters.
- Select Multiple Apply. A new part of the dialog appears in which you have to select the documents from the corresponding document group via the Add Documents button.
- Select Options and then the Overwrite radio button in order to have the result from the grammar in the same document. Then return to the main dialog.
The above query is saved in the document tokenize.gram.que in the demo directory.
Our goal here is to create a grammar that will be used for a concordance of the verbal tenses in the text. We do not intend to present a precise grammar; this grammar can be used for observations over the text and can be a base for more precise grammars.
The document used for this demo is: Standart20030524.tag.
The tokenizer is MixedWord, which is defined within the System. If you would like to see the token types defined within this tokenizer, you have to select the Tokenizers item from the menu Definitions.
In order to run the demo you have to perform the following steps:
- Check that Standart20030524.tag is loaded and saved in the system. If it is not, then load it as described in the Import XML Tool Demo.
- From the menu Tools create a new grammar tense - the grammar editor dialog is opened.
This grammar will be used by the concordance tool. For this reason the found items will not be marked. For each tense we have a different rule in the grammar:
- ("was"|"were"),"not"?,"#"
- ("has"|"have"|"had"),"#"
- ("am"|"is"|"are"),"#"
- ("has"|"have"|"had"),"been","#"
- "will","not"?,"#"
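Two of the rules above can be approximated in Python as an illustration only; "#", which in the grammar matches any token, is rendered here as a single word:

```python
import re

# Approximations of two of the rules over whitespace-separated words:
# ("was"|"were"),"not"?,"#"  and  ("has"|"have"|"had"),"been","#"
WAS_RULE = r"\b(?:was|were)(?:\s+not)?\s+\w+"
BEEN_RULE = r"\b(?:has|have|had)\s+been\s+\w+"

text = "They were not going home; she has been waiting."
print(re.findall(WAS_RULE, text))   # ['were not going']
print(re.findall(BEEN_RULE, text))  # ['has been waiting']
```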
- Press the Exit button.
- Open the Grammar Manager by pressing the Apply button or from the Apply Grammar menu item.
- Select tense in the Grammar field by clicking the Select button and choosing it from the Grammar Manager dialog.
- In Apply to write the following XPath expression: //text//p | //head, which selects all the parents of textual elements that may contain tenses within the document.
- As Tokenizer choose MixedWord, which separates the text into words.
- The no-blank filter is used. It is described in the Filter Tool Demo.
- The Normalize check box can be selected, because the text may contain words with capital letters.
- Select Multiple Apply. A new part of the dialog appears in which you have to select the documents from the corresponding document group via the Add Documents button.
- Select Options and then the Overwrite Input radio button in order to have the result from the grammar in the same document. Then return to the main dialog.
The above query is saved in the document tense.gram.que in the demo directory.
Our goal here is to add morphological information to tokens - all possible analyses for the current word. Two grammars are demonstrated. The only difference between them is in the Return mark-up. The grammars work over the result file from Demo 1.
The document used for this demo is: Standart20030524.tag.
No tokenizer is needed, because the grammar works over the whole string in the tok elements.
In order to run the demo you have to perform the following steps:
- Check that Standart20030524.tag - the result from Demo 1 - is loaded and saved in the system. If it is not, then load it as described in the Import XML Tool Demo.
- From the menu Tools load the grammars dict-attr.grm and dict-element.grm in the Grammar Manager:
- Press the File I/O button at the bottom of the dialog and click on Load grammar from file - a standard file chooser appears.
- Choose dict-attr.grm and press the Load from file button; do the same for dict-element.grm.
The user can inspect the grammars by selecting each of them and pressing the Edit button.
The grammars mark each token in an appropriate way and add information with all the possible morphological analyses for that token.
In dict-element.grm the recognized words are marked with the <w> tag. For each word we encode the wordform in a <ph> tag, and the appropriate morpho-syntactic information from the dictionary is encoded as two elements: an <aa> element, which contains a list of morpho-syntactic tags for the wordform separated by semicolons, and a <ta> element, which contains the actual morpho-syntactic tag for this use of the wordform.
The grammar in this case is:
In dict-attr.grm the recognized words are marked with the <w> tag. For each wordform the appropriate morpho-syntactic information from the dictionary is encoded as two attributes: an aa attribute, which contains a list of morpho-syntactic tags for the wordform separated by semicolons, and an ana attribute, which contains the actual morpho-syntactic tag for this use of the wordform.
The grammar in this case is:
Below the query for the dict-attr.grm grammar is described. The query for dict-element.grm is the same; only the grammar must be changed:
- Open the Grammar Manager by pressing the Apply button or from the Apply Grammar menu item.
- Select dict-attr.grm in the Grammar field by clicking the Select button and choosing it from the Grammar Manager dialog.
- In Apply to write the following XPath expression: //tok, which selects all the tok elements within the document.
- As Tokenizer choose No Tokenizer.
- Select the Simple radio button as Settings Mode.
- The Normalize check box must be selected in order not to make a difference between capital and small letters.
- Select Multiple Apply. A new part of the dialog appears in which you have to select the documents from the corresponding document group via the Add Documents button.
- Select Options and then the Overwrite radio button in order to have the result from the grammar in the same document. Then return to the main dialog.
As a result, the recognized words are marked with the w tag and the morphological information is added.
The above query is saved in the document Eng_Dict_Attribute in the demo directory.
The query for dict-element.grm is saved in the document Eng_Dict_Element in the demo directory.
Our goal here is to delimit sentence boundaries.
The document used for this demo is: Standart20030525gram.tag.
The tokenizer is the Uptok tokenizer, which separates words with a capital first letter from words with small letters. This is important for named-entity recognition and sentence boundary detection. The tokenizer can be loaded within the System from the Tokenizers item of the menu Definitions.
In order to run the demo, you have to perform the following steps:
- Check that Uptok is loaded and saved in the system. If it is not, then load it as described in the Tokenizer Tool Demo.
- Check that Standart20030525gram.tag is loaded and saved in the system. If it is not, then load it as described in the Import XML Tool Demo.
- From the menu Tools select File I/O, then Load grammar from file, and choose sentence.gram. If you would like to see the grammar rules within this grammar, press the Edit button of the Grammar Manager dialog.
The grammar determines a sentence as a sequence of tokens - words, numbers and punctuation (we refer to the content of the <w>, <tok> and <pt> elements via Element Values). As its first element a sentence has a capitalized word ($LATwc) or a tok element which is a number ($NUMBER+). A sentence ends with punctuation - a full stop, question mark or exclamation mark. A sentence can be followed by another sentence, or it can be the last one in a paragraph ($$) - these features are matched by the right context of the grammar. Sometimes a sentence can be direct speech - in this case it starts with a dash.
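A rough text-level analogue of this sentence pattern (our illustration; the grammar itself works over element values, not raw text) is:

```python
import re

# A sentence starts with a capitalized word, a number, or a dash (direct
# speech), and runs to the first '.', '?' or '!'.
SENT = r"(?:[A-Z][a-z]*|\d+|-)[^.?!]*[.?!]"

print(re.findall(SENT, "He left. Did he? - Yes!"))
# ['He left.', 'Did he?', '- Yes!']
```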
- Open the Grammar Manager by pressing the Apply button or from the Apply Grammar menu item.
- Select sentence in the Grammar field by clicking the Select button and choosing it from the Grammar Manager dialog.
- In Apply to write the following XPath expression: //text/descendant::p, which selects all the paragraphs within the document.
- As Tokenizer choose Uptok - the tokenizer whose token categories are used in the grammar.
- The no-blank filter is used. It is described in the Filter Tool Demo.
- The Normalize check box is not selected, because it is important to distinguish capital letters from small ones.
- The paragraphs contain w, pt and tok elements, but we are interested in their text. For that reason Element Values are used:
  w - the text of the ph child element is taken: ph/text()
  tok - the text is taken: text()
  pt - the text child is taken: text()
As a result, the recognized sentences are marked with the s tag.
The above query is saved in the document sentence.gram.que in the current directory.