The goal of this demo is to show how the Statistics tool of the CLaRK System can be used to extract frequency information from XML documents.
The Statistics tool counts the occurrences of tokens and/or XML elements within documents. An XPath expression lets the user specify the location and type of the data of interest.
If tokens are counted, a tokenizer can be specified and the count can be restricted to tokens of certain types. For each token the tool reports the token itself, its type, its number of occurrences, and its percentage of the total.
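As a rough illustration of the per-token report described above, the following Python sketch counts tokens and computes each token's share of the total. The regular-expression tokenizer and its type names (WORD, NUMBER, PUNCT) are assumptions made for this example; they are not the categories of any CLaRK tokenizer.

```python
import re
from collections import Counter

# Hypothetical token types; CLaRK tokenizers define their own categories.
TOKEN_PATTERN = re.compile(
    r"(?P<WORD>[^\W\d_]+)|(?P<NUMBER>\d+)|(?P<PUNCT>[^\w\s])"
)

def tokenize(text):
    """Yield (token, type) pairs; whitespace is skipped."""
    for match in TOKEN_PATTERN.finditer(text):
        yield match.group(), match.lastgroup

def token_statistics(text, wanted_types=None):
    """Return rows of (token, type, count, percentage of all counted tokens)."""
    counts = Counter(
        pair for pair in tokenize(text)
        if wanted_types is None or pair[1] in wanted_types
    )
    total = sum(counts.values())
    return [
        (token, kind, n, 100.0 * n / total)
        for (token, kind), n in counts.most_common()
    ]

rows = token_statistics("the cat saw the dog , the end .")
for token, kind, n, pct in rows:
    print(f"{token}\t{kind}\t{n}\t{pct:.1f}%")
```

Passing wanted_types (for example {"WORD"}) restricts the count to tokens of those types, and the percentages are then computed over the counted tokens only.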
If XML elements are counted, they are grouped on the basis of data connected with the elements. This data is determined by keys, which are defined by XPath expressions together with additional processing of the selected data, such as normalization and tokenization.
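The grouping of elements by keys can be sketched in Python as follows. This is only an approximation under stated assumptions: the sample document, the element name w, the attribute pos, and the normalization (trimming plus lower-casing) are all invented for the example, and the keys are plain Python functions standing in for the XPath-defined keys of the tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sample data; not taken from the demo documents.
SAMPLE = """
<doc>
  <w pos="N"> Cat </w>
  <w pos="V">saw</w>
  <w pos="N">cat</w>
  <w pos="N">dog</w>
</doc>
"""

def normalize(value):
    """Trim surrounding whitespace and fold case (an assumed normalization)."""
    return (value or "").strip().lower()

def group_elements(root, element_path, key_functions):
    """Count the selected elements, grouped by their tuple of key values."""
    groups = Counter()
    for element in root.findall(element_path):
        key = tuple(normalize(fn(element)) for fn in key_functions)
        groups[key] += 1
    return groups

root = ET.fromstring(SAMPLE)
# Two keys per element: the "pos" attribute and the element's own text.
keys = [lambda e: e.get("pos"), lambda e: e.text]
groups = group_elements(root, ".//w", keys)
for key, n in groups.most_common():
    print(key, n)
```

Elements whose key values coincide after normalization fall into the same group, so " Cat " and "cat" with the same pos value are counted together.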
The result can be stored as an XML document and later used for different purposes.
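To make the idea of storing and reusing the result concrete, here is a minimal Python sketch that writes token counts to XML and reads them back. The element and attribute names (statistics, item, type, count, total) are hypothetical and chosen for this example only; the actual layout of a CLaRK result document is not described here.

```python
import xml.etree.ElementTree as ET

def statistics_to_xml(rows):
    """Build an XML tree from (token, type, count) rows. Names are hypothetical."""
    total = sum(count for _, _, count in rows)
    root = ET.Element("statistics", total=str(total))
    for token, kind, count in rows:
        item = ET.SubElement(root, "item", type=kind, count=str(count))
        item.text = token
    return root

root = statistics_to_xml([("the", "WORD", 3), (".", "PUNCT", 1)])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Later use: read the stored result back and keep the frequent tokens.
reloaded = ET.fromstring(xml_text)
frequent = [
    item.text for item in reloaded.findall("item")
    if int(item.get("count")) >= 2
]
print(frequent)
```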
Our goal here is to construct a list of all the tokens in a set of documents, as produced by a given tokenizer, together with the number of occurrences of each token.
The documents used for this demo are Standart20030524.tag and Standart20030525.tag. The tokenizer is MixedWord, which is defined within the System. To see the token types defined within this tokenizer, select the Tokenizers item from the Definitions menu.
In order to run the demo, you have to perform the following steps:
1. Make sure that the documents Standart20030524.tag and Standart20030525.tag are loaded and saved in the system. If they are not, load them as described in the Multi Import Tool Demo.
2. Open the Statistics tool from the Tools menu.
3. In the Select (XPath) field, write the following XPath expression: //text/descendant-or-self::text() which selects all text nodes within the document(s).
4. Enter the expression self::* . Additionally, select trim so that spaces around the text (if any) are not counted, and normalized so that different writings of the same token are counted as equal.
5. Under Choose Tokenizer, choose MixedWord.
6. Select Multiple Apply. A new part of the dialog then appears, in which you select the documents from the corresponding document group via the Add Documents button.
7. Select Options, then the United radio button, and specify a name for the result document, for example standartResult.stat. Then return to the main dialog.
8. Apply the query. The result is stored in the document standartResult.stat in the group Root : SYSTEM : Results : Statistics, the standard group for results from this tool. You can change the group if you like.

The above query is saved in the document Statistics1.stat.que in the demo directory.
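For readers who want to see the logic of the steps above outside the tool, the following Python sketch approximates them: it selects the text inside text elements, trims and normalizes it, tokenizes it, and sums the counts over several documents into one united table. Everything here is an assumption made for illustration: the two in-memory documents are invented stand-ins, the regular expression is not the MixedWord tokenizer, and lower-casing is only a guess at what normalization might involve.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Two tiny in-memory stand-ins for the demo documents (invented content).
DOCUMENTS = {
    "doc1": "<doc><text><s> The cat </s><s>the dog</s></text></doc>",
    "doc2": "<doc><head>ignored</head><text><s>A cat.</s></text></doc>",
}

# Assumed tokenizer: words, digit runs, single punctuation marks.
TOKEN = re.compile(r"[^\W\d_]+|\d+|[^\w\s]")

def text_nodes(root):
    """Approximate //text/descendant-or-self::text(): all text inside <text>."""
    for text_element in root.iter("text"):
        yield from text_element.itertext()

def count_tokens(xml_source):
    """Count the tokens found in the selected text of one document."""
    counts = Counter()
    for node in text_nodes(ET.fromstring(xml_source)):
        value = node.strip().lower()  # "trim" plus an assumed normalization
        counts.update(TOKEN.findall(value))
    return counts

# "United" result: a single table summed over all selected documents.
united = Counter()
for source in DOCUMENTS.values():
    united += count_tokens(source)

for token, n in united.most_common():
    print(token, n)
```

Text outside the text elements, such as the content of head in the second document, is never selected and therefore never counted.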
The result contains all the tokens from the selected documents, tokenized by the chosen tokenizer (MixedWord). Token categories that are of no interest can be excluded by leaving them unchecked in the Tokenizer Category Filter table, which is reached via the Customize button. The categories TAB, SPACE, LF, CR and ESC can be skipped as well.