The goal of this demo is to show how the tokenizer tool of the CLaRK system can be used to split text into tokens.
The CLaRK System supports a user-defined hierarchy of tokenizers. At the most basic level, the user can define a tokenizer in terms of a set of token types, where each token type is defined by a set of Unicode symbols. Above this basic level, the user can define further tokenizers whose token types are defined as regular expressions over the tokens of another tokenizer, the so-called parent tokenizer. Tokens are used throughout the system's processing modules. For each tokenizer an alphabetical order over the token types is defined; this order is used for operations such as comparing two tokens and sorting.
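The two-level idea can be sketched in Python. This is an illustrative approximation, not the CLaRK implementation: the category names and the mapping to single letters are assumptions made for the example. A primitive tokenizer maps each character to a category; a complex tokenizer groups the parent's tokens using regular expressions over those categories.

```python
import re

# Primitive level: each character gets a category (names are illustrative).
def char_category(ch):
    if ch.isupper():
        return "LATc"   # capital Latin letter
    if ch.islower():
        return "LATs"   # small Latin letter
    if ch.isspace():
        return "SPACE"
    return "PUNCT"

def primitive_tokenize(text):
    # Each character is a token paired with its category.
    return [(ch, char_category(ch)) for ch in text]

# Complex level: regular expressions over the parent's category string.
def complex_tokenize(text, rules):
    parent = primitive_tokenize(text)
    # Encode each parent category as one letter so we can run re over it.
    cats = "".join({"LATc": "C", "LATs": "s",
                    "SPACE": " ", "PUNCT": "."}[c] for _, c in parent)
    tokens, i = [], 0
    while i < len(cats):
        for name, pattern in rules:
            m = re.match(pattern, cats[i:])
            if m and m.end() > 0:
                tokens.append((text[i:i + m.end()], name))
                i += m.end()
                break
    return tokens

# Token types of the complex tokenizer, as regexes over category letters.
rules = [("WORD", r"[Cs]+"), ("SPACE", r" +"), ("PUNCT", r"\.+")]
print(complex_tokenize("CLaRK rocks!", rules))
```

Running this groups the per-character tokens of the primitive level into word, space, and punctuation tokens of the complex level.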
The goal of this demo is to create a tokenizer that separates words starting with a capital letter from words written in all small letters. It will be based on the Default tokenizer defined in the system, which assigns a category to each symbol.
In order to create a tokenizer, you have to perform the following steps:
1. Open the list of tokenizers from the Definitions menu. A list with all the tokenizers available in the system appears.
2. Press the New button. A dialog appears in which you must specify the name of the tokenizer and, if the tokenizer is complex, its parent; otherwise the Primitive checkbox must be selected.
3. In the Name field write Uptok.
4. Select Default as the parent tokenizer. Categories from that tokenizer will be used in the new one. (The categories of a tokenizer can be inspected from the previous dialog via the Edit button.)
5. Press the OK button; the tokenizer editor dialog opens.
6. In the Category column write LATwc, and in the Expression column write the following regular expression:

   LATc,((LATc|LATs)+|((LATc|LATs)+,("-"|"'"),(LATc|LATs)+)+)*

   These are words starting with a capital letter (LATc is the parent tokenizer's category for capital Latin letters) followed by an arbitrary number of capital or small letters (LATs). A word can contain a dash - and/or an apostrophe ', but never at the end of the word. Press Enter to finish editing the row.
7. In the Category column write LATws, and in the Expression column write the following regular expression:

   LATs,((LATc|LATs)+|((LATc|LATs)+,("-"|"'"),(LATc|LATs)+)+)*

   The regular expression is the same, but the words start with a small letter.

Sometimes the new tokenizer is a modification of an existing one. To create a copy of an existing tokenizer, select the tokenizer to be copied and then click the New button. In this case the dialog has an additional button, Use current, which must be pressed. A name for the new tokenizer must be given.
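The two token types above can be approximated in Python. This is a rough sketch only: CLaRK's regular expressions range over categories of the parent tokenizer, while here LATc is approximated by [A-Z] and LATs by [a-z].

```python
import re

LETTER = r"[A-Za-z]"
# Letters with optional internal dashes/apostrophes, never at the end,
# mirroring ((LATc|LATs)+|((LATc|LATs)+,("-"|"'"),(LATc|LATs)+)+)*.
BODY = rf"(?:{LETTER}+(?:[-']{LETTER}+)*)?"
LATWC = re.compile(rf"[A-Z]{BODY}")   # word starting with a capital letter
LATWS = re.compile(rf"[a-z]{BODY}")   # word starting with a small letter

def classify(word):
    if LATWC.fullmatch(word):
        return "LATwc"
    if LATWS.fullmatch(word):
        return "LATws"
    return None

print(classify("Anne-Marie"))  # capital first letter, internal dash
print(classify("doesn't"))     # small first letter, internal apostrophe
print(classify("Word-"))       # a dash at the end is not allowed
```

Note that, exactly as in the CLaRK expressions, a dash or apostrophe must be flanked by letters on both sides, so it can never end a word.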
Our goal here is to show how the text actually looks when different tokenizers are used.
When a new DTD is loaded into the system, a tokenizer is automatically assigned to it: the Default tokenizer defined in the system. The tokenizer assigned to a DTD is called its default tokenizer. Unless otherwise specified, this tokenizer is used for processing documents with respect to that DTD. The user can change the default tokenizer via the Element Features menu item.
The document for this demo is Standart20030524.tag, which is a valid document according to teixlite2x.dtd.
In order to see the text in the document tokenized by the tokenizer assigned to the DTD, you have to perform the following steps:
1. Make sure that Standart20030524.tag is loaded in the system. If it is not, load it as described in the Import XML Tool Demo.
2. Click on a p element with the right mouse button and choose More/Info. The text child of the selected node is tokenized according to the Default tokenizer, i.e. on symbols. For each symbol a category is assigned. The XPath expression identifying the selected node is shown.
3. Open the Element Features tool from the Definitions menu. Choose teixlite2x.dtd from the list of all DTDs compiled in the system and change the Default Tokenizer for the DTD to No Tokenizer. Right-click the same paragraph in the Tree Preview and choose Info. The text child of the selected paragraph is taken as one whole string; it is not tokenized.
4. Change the tokenizer for the DTD to Uptok. Choose the same node and see the Info. The text child is tokenized: the tokens are Latin words with a capital first letter, Latin words with a small first letter, spaces, punctuation, etc. For each token a category from the tokenizer is shown.

A tokenizer can be set not only for a DTD but also for a particular element: from Definitions/Element Features an element can be added and a tokenizer assigned to it. If there is no tokenizer selected for an element, the DTD tokenizer is used.
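The three views of the same text child can be sketched as follows. This is an illustrative approximation (the sample text, category names, and character classes are assumptions, not CLaRK output): the Default tokenizer yields one token per symbol, No Tokenizer leaves the text as a single string, and an Uptok-style tokenizer yields categorized word, space, and punctuation tokens.

```python
import re

TEXT = "Sofia is the Capital"

# Default tokenizer: one token per symbol.
default_tokens = list(TEXT)

# No tokenizer: the text child is one whole string.
no_tokens = [TEXT]

# Uptok-style tokenizer: capital-first words, small-first words,
# spaces, and punctuation, each paired with a category.
PATTERNS = [
    ("LATwc", r"[A-Z][A-Za-z]*(?:[-'][A-Za-z]+)*"),
    ("LATws", r"[a-z][A-Za-z]*(?:[-'][A-Za-z]+)*"),
    ("SPACE", r"\s+"),
    ("PUNCT", r"[^\sA-Za-z]+"),
]
SCANNER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in PATTERNS))

uptok_tokens = [(m.group(), m.lastgroup) for m in SCANNER.finditer(TEXT)]
print(uptok_tokens)
```

Printing the three variables side by side shows why the Info dialog looks so different depending on which tokenizer is assigned to the DTD.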