The goal of this demo is to show how the concordance tool of the CLaRK system can be used to search for linguistic phenomena within XML documents.
The concordance tool is built on the XPath engine, the regular grammar engine and a sorting module. It is useful for searching for units of one kind within bigger units: for instance, a word within a sentence or a phrase within a paragraph.
The bigger element is called here a context and the smaller element is called an item. The context is defined by an XPath expression.
The item can be defined by an XPath expression, a regular expression, a regular grammar or a grammar query. The context can additionally be restricted by a grammar query or by XPath expressions.
The found units are stored in a new document and presented to the user in a table format. The user can also open the document as an ordinary XML document and use all the tools available in the system to process it further.
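The relation between context, item and the resulting table can be sketched in a few lines of Python. This is only an illustration of the idea, not the CLaRK implementation: the contexts are plain strings (in CLaRK they would be selected by an XPath expression), the item is found with an ordinary regular expression, and the names `Row` and `concordance` as well as the sample sentences are hypothetical.

```python
import re
from typing import List, NamedTuple


class Row(NamedTuple):
    left: str   # text before the item, inside the same context
    item: str   # the unit that was searched for
    right: str  # text after the item, inside the same context


def concordance(contexts: List[str], item_pattern: str) -> List[Row]:
    """Return one row per match of item_pattern inside each context."""
    rows = []
    for context in contexts:
        for m in re.finditer(item_pattern, context):
            rows.append(Row(context[:m.start()], m.group(0), context[m.end():]))
    return rows


# Hypothetical contexts, e.g. the text content of two paragraphs.
paragraphs = ["She has gone home.", "They have seen it and he has left."]

# Hypothetical item definition: "has"/"have" followed by one more word.
rows = concordance(paragraphs, r"\b(?:has|have) \w+")

for row in rows:
    print(f"{row.left!r:>30} | {row.item!r} | {row.right!r}")
```

Each match yields one row, so a context that contains several items appears several times in the table, each time split at a different position.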
The goal here is to find all the possible uses of verbal tenses in the text. Verbal tenses are determined by the grammar tenses, described in Grammar Tool Demo.
The document used for this demo is Standart20030524.tag.
The tokenizer that segments the text into words is the MixedWord tokenizer, which is defined within the system. If you would like to see the token types defined within this tokenizer, select the Tokenizers item from the Definitions menu.
In order to run the demo you have to perform the following steps:
1. Check whether Standart20030524.tag is loaded in the system. If it is not, load it from the demo directory as described in Import XML Tool Demo. If it is, open it.
2. Check whether tense is in the system grammars. Open the Grammar Manager dialog by clicking the Select button. If the grammar is not there, you can load it by clicking the File I/O button and then Load grammar from the file. The grammar is tense.gram in the demo directory.
3. Check whether the no-blank filter is in the system. Then you can import the grammar query tense.gram.que into the Root : SYSTEM : Queries : Grammar directory of the system.
4. Start the concordance tool from the Tools menu.
5. In Define Context, write the following XPath expression: //text/descendant::p |//head. It selects all paragraphs and headings within the document(s).
6. Select the Grammar panel. The user has three possibilities: to search with a regular expression in Simplified Usage Mode, with a grammar in Normal Usage Mode, or with a grammar query in Query Usage Mode.
   To search with a grammar query:
   - Select the Queries radio button for Usage Mode.
   - Select tense.gram.que in the Search Query field by clicking the Select button and choosing it from the Internal Document Manager - Grammar Queries dialog.
   - Restriction query is used when the user wants to restrict the context for the search with another grammar query. In our case we do not define such a grammar query.
   - Select the Text only option if some of the words in the text are marked up. When this option is selected, the engine skips tags (if any) and takes their text content. It is not selected in this demo.
   - Select the Add number attribute option in order to enumerate each item that is found.
   - Select the Add source attribute option in order to add an attribute with the source document name to the item.
   - Select the Add path attribute option in order to add an attribute with the XPath value coming from the source document to the item.
   To search with a grammar:
   - Select the Normal radio button for Usage Mode.
   - Select tense in the Grammar field by clicking the Select button and choosing it from the Grammar Manager dialog.
   - For Tokenizer, choose MixedWord, which separates the text into words.
   - The no-blank filter is used. It is described in Filter Tool Demo.
   - The Normalize check box can be selected because the text can contain words with capitalized letters.
7. The result table has the columns Item, Left Context, Right Context and Comment.
8. Click the Sort Table button.
9. The result can be sorted by Item, by Left or by Right Context. Let us sort the result by the Item.
   - In the Prefix column of the sort table, select I; it indicates that Item will be sorted.
   - In the Expression column, write the following XPath expression: text(). It determines the text in the items that are found.
   - Asc stands for ascending.
   - Select the Trim option in order not to sort by the spaces around the text (if any), and normalized so that different graphical writings of the same item are sorted as equal.
   - Click the Sort button.
10. Click the Update button.
11. Settings.
When the dialog is closed, the user is asked to name the document with the result and is offered the option to view that document.
The query using a grammar is saved in the document tenseN.conc.que, and the one using a grammar query is saved in the document tenseQ.conc.que, both in the demo directory.
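The effect of the sorting options used in this demo (ascending order, Trim, normalized) can be illustrated with a small Python sketch. It is not the CLaRK sorting module: the function names and sample rows are hypothetical, and "normalized" is assumed here to mean case folding, which is only one possible reading of "different graphical writings of the same item".

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (left context, item, right context)


def sort_key(text: str, trim: bool = True, normalize: bool = True) -> str:
    """Build the comparison key for one item."""
    if trim:
        text = text.strip()      # ignore spaces around the text
    if normalize:
        text = text.casefold()   # assumption: normalization = case folding
    return text


def sort_by_item(rows: List[Row], trim: bool = True, normalize: bool = True,
                 ascending: bool = True) -> List[Row]:
    """Sort concordance rows by their item column."""
    return sorted(rows, key=lambda r: sort_key(r[1], trim, normalize),
                  reverse=not ascending)


# Hypothetical result rows with stray spaces and mixed capitalization.
rows = [
    ("He ", " was going", " home."),
    ("", "Has gone", " away."),
    ("She ", "has gone ", "."),
]

with_options = [r[1] for r in sort_by_item(rows)]
without_options = [r[1] for r in sort_by_item(rows, trim=False, normalize=False)]
print(with_options)
print(without_options)
```

With both options on, "Has gone" and "has gone " receive the same key and end up next to each other; without them, the leading space and the capital letter decide the order.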
The goal here is to find number expressions, so that precise grammars can be written for each kind of number expression.
The document used for this demo is Standart20030524.tag.
The tokenizer that segments the text into words is the MixedWord tokenizer, which is defined within the system. If you would like to see the token types defined within this tokenizer, select the Tokenizers item from the Definitions menu.
In order to run the demo you have to perform the following steps:
1. Check whether Standart20030524.tag is loaded in the system. If it is not, load it from the demo directory as described in Import XML Tool Demo. If it is, open it.
2. Start the concordance tool from the Tools menu.
3. In Define Context, write the following XPath expression: //text/descendant::p |//head. It selects all paragraphs and headings within the document(s).
4. Select the Grammar panel. The user has three possibilities: to search with a regular expression in Simplified Usage Mode, with a grammar in Normal Usage Mode, or with a grammar query in Query Usage Mode.
   - Select the Simplified radio button for Usage Mode.
   - In Query String, write $NUMBER+,$#. It selects an expression that starts with one or more numbers and is followed by a token. This can be any token with a category recognized by the tokenizer: a Latin word, punctuation, a brace, etc.
   - For Tokenizer, choose MixedWord, which separates the text into words.
   - The no-blank filter is used. It is described in Filter Tool Demo.
   - The Normalize check box can be selected because the text can contain words with capitalized letters.
   - Select the Add number attribute option in order to enumerate each item that is found.
   - Select the Add source attribute option in order to add an attribute with the source document name to the item.
   - Select the Add path attribute option in order to add an attribute with the XPath value coming from the source document to the item.
5. The result table has the columns Item, Left Context, Right Context and Comment.
6. Click the Sort Table button.
   - In the Prefix column of the sort table, select I; it indicates that Item will be sorted.
   - In the Expression column, write the following XPath expression: text(). It determines the text in the items that are found.
   - Asc stands for ascending.
   - Click the Sort button.
7. Click the Update button.
8. Settings.
When the dialog is closed, the user is asked to name the document with the result and is offered the option to view that document.
The query is saved in the document numEX.conc.que in the demo directory.
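The query string $NUMBER+,$# works on token categories rather than on characters. A rough Python sketch of that idea follows. It is an interpretation of the description above, not the CLaRK regular-expression engine: the token categories (NUMBER, WORD, SPACE, OTHER), the function names and the sample sentence are all hypothetical, and the real MixedWord tokenizer defines its own categories.

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]  # (category, text)

# Hypothetical categories standing in for those of the MixedWord tokenizer.
TOKEN_RE = re.compile(
    r"(?P<NUMBER>\d+)|(?P<WORD>[^\W\d_]+)|(?P<SPACE>\s+)|(?P<OTHER>.)"
)


def tokenize(text: str) -> List[Token]:
    """Split text into (category, token) pairs."""
    return [(m.lastgroup, m.group(0)) for m in TOKEN_RE.finditer(text)]


def no_blank(tokens: List[Token]) -> List[Token]:
    """Filter out whitespace tokens, in the spirit of the no-blank filter."""
    return [t for t in tokens if t[0] != "SPACE"]


def number_expressions(tokens: List[Token]) -> List[List[Token]]:
    """Find runs of one or more NUMBER tokens followed by any one token."""
    found = []
    i = 0
    while i < len(tokens):
        if tokens[i][0] != "NUMBER":
            i += 1
            continue
        j = i
        while j < len(tokens) and tokens[j][0] == "NUMBER":
            j += 1
        # Take the following token if there is one; at the end of the input
        # the last number of the run plays the role of "any token".
        end = min(j + 1, len(tokens))
        if end - i >= 2:
            found.append(tokens[i:end])
        i = end
    return found


sentence = "Over 15 000 people came on 24 May, 2003."
matches = number_expressions(no_blank(tokenize(sentence)))
items = [" ".join(text for _, text in match) for match in matches]
print(items)
```

Looking at the token that follows each number is what makes the result useful for grammar writing: it shows whether a number is followed by a noun, a month name, punctuation and so on.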
The goal here is to find all tok elements in the text that contain numbers. Such elements have an attribute type with the value num.
The document used for this demo is Standart20030525concord.tag.
In order to run the demo, you have to perform the following steps:
1. Check whether Standart20030525concord.tag is loaded in the system. If it is not, load it from the demo directory as described in Import XML Tool Demo. If it is, open it.
2. Start the concordance tool from the Tools menu.
3. In Define Context, write the following XPath expression: //p. It selects all paragraphs.
4. The XPath panel will be used. In the Search Elements text box, write the following XPath expression: tok[@type="num"]. It selects all tok elements containing numbers within the document(s).
   - Left Context and Right Context are used when the user wants to restrict the context.
   - Select the Add number attribute option in order to enumerate each item that is found.
   - Select the Add source attribute option in order to add an attribute with the source document name to the item.
   - Select the Add path attribute option in order to add an attribute with the XPath value coming from the source document to the item.
5. The result table has the columns Item, Left Context, Right Context and Comment.
6. Click the Sort Table button.
7. The result can be sorted by Item, by Left or by Right Context. Let us sort the result by the Item.
   - In the Prefix column of the sort table, select I; it indicates that Item will be sorted.
   - In the Expression column, write the following XPath expression: tok/text(). It determines the text in the items that are found.
   - Asc stands for ascending.
   - Select the Trim option in order not to sort by the spaces around the text (if any), and number in order to compare the elements as numbers, not as strings.
   - Click the Sort button.
8. Click the Update button.
9. Settings.
When the dialog is closed, the user is asked to name the document with the result and is offered the option to view that document.
The above query is saved in the document numbers.conc.que in the demo directory.
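The same selection can be approximated outside CLaRK with Python's standard xml.etree.ElementTree module, which supports a limited subset of XPath. The sketch below assumes a tiny hypothetical document in place of Standart20030525concord.tag; it selects every p element as context, takes the tok children with type="num" as items, and shows why the number option matters when sorting.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document; the real demo file is much larger.
SAMPLE = """<text>
  <p><tok type="num">10</tok> <tok type="word">people</tok> and <tok type="num">9</tok> cars</p>
  <p>Page <tok type="num">200</tok></p>
</text>"""

root = ET.fromstring(SAMPLE)

items = []
for p in root.iter("p"):                       # context: //p
    for tok in p.findall("tok[@type='num']"):  # item: tok[@type="num"]
        items.append(tok.text.strip())         # Trim: drop surrounding spaces

as_strings = sorted(items)             # plain string comparison
as_numbers = sorted(items, key=float)  # compare the items as numbers

print(as_strings)
print(as_numbers)
```

Sorted as strings, "200" comes before "9" because the comparison is character by character; sorted as numbers, the order follows the numeric values.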