Tokenizer Tool Demo

The goal of this demo is to show how the Tokenizer tool of the CLaRK System can be used to split text into tokens.

The CLaRK System supports a user-defined hierarchy of tokenizers. At the most basic level the user can define a tokenizer in terms of a set of token types. In such a basic (primitive) tokenizer each token type is defined by a set of Unicode symbols. Above this basic level the user can define other tokenizers whose token types are defined as regular expressions over the tokens of some other tokenizer, the so-called parent tokenizer. The tokens are used by different processing modules in the system. For each tokenizer an alphabetical order over the token types is defined. This order is used for operations such as comparing two tokens, sorting, and the like.
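The two levels described above can be illustrated with a small Python sketch. It is not CLaRK code and does not use CLaRK's definition syntax: the type names (A, D, S, O, WORD, NUMBER, SPACE) and both functions are hypothetical, chosen only to show a primitive tokenizer defined by sets of symbols and a complex tokenizer defined by regular expressions over the parent's token types.

```python
import re
import string

# Hypothetical primitive token types, each defined by a set of symbols.
PRIMITIVE_TYPES = {
    "A": set(string.ascii_letters),  # Latin letters
    "D": set(string.digits),         # digits
    "S": set(" \t\n"),               # whitespace
}
OTHER = "O"  # fallback type for any symbol not listed above


def primitive_tokenize(text):
    """Return one (symbol, type) pair per character of the text."""
    result = []
    for ch in text:
        for name, symbols in PRIMITIVE_TYPES.items():
            if ch in symbols:
                result.append((ch, name))
                break
        else:
            result.append((ch, OTHER))
    return result


# Hypothetical complex token types: regular expressions over the *types*
# produced by the parent (primitive) tokenizer, tried in this order.
COMPLEX_TYPES = [
    ("WORD", re.compile(r"A+")),
    ("NUMBER", re.compile(r"D+")),
    ("SPACE", re.compile(r"S+")),
]


def complex_tokenize(text):
    """Group parent tokens into larger tokens by matching their type string."""
    parent = primitive_tokenize(text)
    types = "".join(t for _, t in parent)  # one type letter per symbol
    tokens, pos = [], 0
    while pos < len(parent):
        for name, pattern in COMPLEX_TYPES:
            m = pattern.match(types, pos)
            if m:
                end = m.end()
                break
        else:
            # No rule matched: the parent token is passed through unchanged.
            name, end = OTHER, pos + 1
        tokens.append((text[pos:end], name))
        pos = end
    return tokens
```

Calling complex_tokenize("abc 42!") groups the symbols into a WORD, a SPACE, a NUMBER, and a single leftover symbol of type O. Because every primitive type is one letter, positions in the type string coincide with positions in the text, which keeps the sketch short.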

Demo 1: Creation of Tokenizers

The goal of this demo is to create a tokenizer that separates words with a capital first letter from words written entirely in small letters.

It will be based on the Default tokenizer that is defined in the system. This tokenizer assigns a category to each symbol.

In order to create a tokenizer, you have to perform the following steps:

  1. Open the dialog of the Tokenizers tool from the Definitions menu. A list of all the tokenizers available in the system appears.
  2. Click the New button. A new dialog appears in which the user must specify the name of the tokenizer and, if the tokenizer is complex, its parent; otherwise the Primitive checkbox must be selected.
  3. In the Name text box, type Uptok.
  4. From the list of tokenizers available in the system, select Default. It is the parent tokenizer, and its categories will be used in the new one. (The categories of a tokenizer can be viewed from the previous dialog with the Edit button.)
  5. Press the OK button. The tokenizer editor dialog opens.

  6. Each token category has its own row in the tokenizer table: the first column is the category of the tokens; the second column is a description (as a regular expression) of the tokens that must be matched by that category. To add a row to the table, right-click on a row.
  7. The tokenizer in this case must look like this:

     [Figure: the Uptok definition in the tokenizer editor]
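Independently of the editor table, the intent of a tokenizer such as Uptok can be sketched in Python. This is only an illustration and not CLaRK's rule syntax or the actual Uptok rows: the category names (CAPITALIZED, LOWERCASE, SPACE, OTHER) and the patterns are assumptions, and the patterns work on characters directly rather than on the categories of a parent tokenizer.

```python
import re

# Hypothetical categories; alternatives are tried left to right.
UPTOK_LIKE = re.compile(
    r"(?P<CAPITALIZED>[A-Z][a-z]*)"  # capital first letter, then small letters
    r"|(?P<LOWERCASE>[a-z]+)"        # small letters only
    r"|(?P<SPACE>\s+)"               # a run of whitespace
    r"|(?P<OTHER>.)",                # any other single symbol
    re.DOTALL,
)


def tokenize(text):
    """Return (token, category) pairs covering the whole text."""
    return [(m.group(), m.lastgroup) for m in UPTOK_LIKE.finditer(text)]
```

For the input "The cat sat." this yields The as CAPITALIZED, cat and sat as LOWERCASE, the spaces as SPACE, and the full stop as OTHER. Under these particular patterns a word in all capitals is split into one CAPITALIZED token per letter; a real definition may well handle that case differently.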

Sometimes a new tokenizer is a modification of an existing one. To create a copy of an existing tokenizer, select the tokenizer to be copied and then click the New button. In this case the dialog has an additional button, Use current, which must be pressed. A name for the new tokenizer must also be given.

Demo 2: The usage of Tokenizers

Our goal here is to show what the text actually looks like when different tokenizers are used.

When a new DTD is loaded in the system, a tokenizer is automatically assigned to it. This is the Default tokenizer defined in the system. The tokenizer assigned to a DTD is called the default tokenizer for that DTD. Unless specified otherwise, this tokenizer is used for processing documents with respect to this DTD. The user can change the default tokenizer via the Element Features menu item.

The document for this demo is Standart20030524.tag, which is valid with respect to teixlite2x.dtd.

In order to see the text in the document tokenized by the tokenizer assigned to the DTD, you have to perform the following steps:

  1. Check whether the document Standart20030524.tag is loaded in the system. If it is not, load it as described in the Import XML Tool Demo.
  2. In the Tree View, right-click the first paragraph node p and choose More/Info. The text child of the selected node is tokenized according to the Default tokenizer, that is, into individual symbols. A category is assigned to each symbol. The XPath expression identifying the selected node is also shown.
  3. Change the tokenizer assigned to the DTD. Select the Element Features item from the Definitions menu. Choose teixlite2x.dtd from the list of all DTDs compiled in the system and change the Default Tokenizer for the DTD.
  4. Change the tokenizer to No Tokenizer. In the Tree View, right-click the same paragraph and choose Info. The text child of the selected paragraph is taken as a single string; it is not tokenized.
  5. Change the tokenizer to Uptok. Choose the same node and look at the Info. The text child is tokenized. The tokens are Latin words with a capital first letter, Latin words with a small first letter, spaces, punctuation, etc. For each token a category from the tokenizer is shown.
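The difference between steps 2 and 4 can be made concrete with a short Python sketch. It only approximates what the Info dialog shows: the function names are hypothetical, and Unicode general categories (Lu, Ll, Zs, ...) are used as a stand-in for the categories of CLaRK's Default tokenizer, which are not necessarily the same.

```python
import unicodedata


def default_like(text):
    """One token per symbol, each paired with a category.

    Assumption: the Unicode general category stands in for the
    categories assigned by the system's Default tokenizer.
    """
    return [(ch, unicodedata.category(ch)) for ch in text]


def no_tokenizer(text):
    """The whole text is kept as a single, uncategorized string."""
    return [(text, None)] if text else []
```

For the text "Ab 1", default_like produces four tokens (an uppercase letter, a lowercase letter, a space separator, and a decimal digit), while no_tokenizer returns the text as one item.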

A tokenizer can be assigned not only to a DTD but also to a specific element. In Definitions/Element Features, an element can be added and a tokenizer assigned to it. If no tokenizer is selected for an element, the DTD tokenizer is used.
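The fallback rule above can be summarized in a few lines of Python. This is a sketch of the lookup order only, under the assumption that assignments can be modeled as two dictionaries; the function and variable names are hypothetical, and the element-level assignment in the usage lines is invented for illustration.

```python
SYSTEM_DEFAULT = "Default"  # tokenizer assigned to every newly loaded DTD


def resolve_tokenizer(dtd, element, element_tokenizers, dtd_tokenizers):
    """Pick the tokenizer name for an element of a given DTD.

    Lookup order: element-level assignment, then the DTD's default
    tokenizer, then the system's Default tokenizer.
    """
    if (dtd, element) in element_tokenizers:
        return element_tokenizers[(dtd, element)]
    return dtd_tokenizers.get(dtd, SYSTEM_DEFAULT)


# Illustrative assignments (not taken from an actual CLaRK configuration).
dtd_tokenizers = {"teixlite2x.dtd": "Uptok"}
element_tokenizers = {("teixlite2x.dtd", "p"): "No Tokenizer"}

for_p = resolve_tokenizer("teixlite2x.dtd", "p", element_tokenizers, dtd_tokenizers)
for_head = resolve_tokenizer("teixlite2x.dtd", "head", element_tokenizers, dtd_tokenizers)
```

Here for_p is "No Tokenizer" because the element has its own assignment, while for_head falls back to the DTD tokenizer, "Uptok".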