Statistics Tool Demo

The goal of this demo is to show how the Statistics tool of the CLaRK system can be used to extract frequency information from XML documents.

The Statistics tool counts the occurrences of tokens and/or XML elements within documents. Using an XPath expression, the user specifies the location and type of the data of interest.

If tokens are to be counted, a tokenizer can be specified and the count can be restricted to tokens of certain types. For each token the tool reports the token itself, its type, its number of occurrences, and its percentage of the total.
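The idea of token statistics can be illustrated outside CLaRK with a minimal Python sketch. The token categories (WORD, NUMBER, SPACE, OTHER), the regular expression, and the function name token_statistics are assumptions made for this illustration only; they are not the categories or the implementation of any CLaRK tokenizer.

```python
import re
from collections import Counter

# Hypothetical token categories for illustration; CLaRK tokenizers define their own.
TOKEN_PATTERN = re.compile(
    r"(?P<WORD>[^\W\d_]+)|(?P<NUMBER>\d+)|(?P<SPACE>\s+)|(?P<OTHER>.)",
    re.DOTALL,
)


def token_statistics(text, allowed_types=None):
    """Return (token, type, count, percent) rows, most frequent first.

    If allowed_types is given, only tokens of those types are counted,
    and the percentage is taken over the counted tokens only.
    """
    counts = Counter()
    for match in TOKEN_PATTERN.finditer(text):
        token_type = match.lastgroup  # name of the alternative that matched
        if allowed_types is None or token_type in allowed_types:
            counts[(match.group(), token_type)] += 1
    total = sum(counts.values())
    return [
        (token, token_type, n, 100.0 * n / total)
        for (token, token_type), n in counts.most_common()
    ]


rows = token_statistics(
    "the cat saw the dog 2 times", allowed_types={"WORD", "NUMBER"}
)
for token, token_type, count, percent in rows:
    print(f"{token}\t{token_type}\t{count}\t{percent:.1f}%")
```

Each row carries the same four pieces of information as the description above: the token, its type, its count, and its share of all counted tokens.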

If XML elements are counted, they are grouped on the basis of data associated with the elements. This data is determined via keys. Each key is defined by an XPath expression, optionally followed by additional processing of the selected data, such as normalization or tokenization.
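Grouping elements by a key can be sketched in Python as follows. This is an illustration under stated assumptions, not CLaRK's implementation: the sample document, the element names entry and pos, and the function element_statistics are hypothetical, normalization is assumed here to mean collapsing whitespace, and the standard library's ElementTree supports only a limited subset of XPath.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_statistics(root, select_path, key_path, normalize=True):
    """Count elements matched by select_path, grouped by the text found at key_path."""
    counts = Counter()
    for element in root.findall(select_path):
        key_node = element.find(key_path)  # key is evaluated relative to the element
        key = "".join(key_node.itertext()) if key_node is not None else ""
        if normalize:
            # Assumed meaning of normalization: trim and collapse whitespace.
            key = " ".join(key.split())
        counts[key] += 1
    return counts


# Hypothetical sample document, used only for this sketch.
SAMPLE = """<doc>
  <entry><pos>noun</pos><form>cat</form></entry>
  <entry><pos> noun </pos><form>dog</form></entry>
  <entry><pos>verb</pos><form>saw</form></entry>
</doc>"""

counts = element_statistics(ET.fromstring(SAMPLE), ".//entry", "pos")
for key, n in counts.most_common():
    print(key, n)
```

Because the key is normalized before counting, the two entry elements whose pos text differs only in surrounding spaces fall into the same group.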

The result can be stored as an XML document and later used for different purposes.

Demo 1: Statistics over tokens

Our goal here is to build a list of all tokens in a set of documents, as produced by a given tokenizer, together with the number of occurrences of each token.

The documents used for this demo are Standart20030524.tag and Standart20030525.tag.

The tokenizer is MixedWord, which is defined within the system. To see the token types defined by this tokenizer, select the Tokenizers item from the Definitions menu.

In order to run the demo, you have to perform the following steps:

  1. Check whether the two documents Standart20030524.tag and Standart20030525.tag are loaded and saved in the system. If they are not, load them as described in the Multi Import Tool Demo.
  2. Open the query dialog of the Statistics tool from the Tools menu.
  3. In the text box Select (XPath), enter the XPath expression //text/descendant-or-self::text(), which selects all text nodes within the text elements of the document(s).
  4. The key in this case is the whole element itself, which is specified by the XPath expression self::*. Additionally, select trim so that spaces around the text (if any) are not counted, and normalized so that different written forms of the same token are counted as equal.
  5. From the list Choose Tokenizer choose MixedWord.
  6. To apply the statistics to both documents, select Multiple Apply. A new part of the dialog then appears, in which you select the documents from the corresponding document group via the Add Documents button.
  7. Additionally, specify the output document for the result. By default the system offers a separate output document for each input document, but here a single document containing the result for all documents is needed. To get this, select Options, choose the United radio button, and specify a name for the result document, for example standartResult.stat. Then return to the main dialog.
  8. The dialog of the tool should now show the settings made in steps 3 to 7.
  9. Then you can run the query. The result will be saved in the document standartResult.stat in the group Root : SYSTEM : Results : Statistics, the standard group for results from this tool. You can change the group if you wish.

The above query is saved in the document Statistics1.stat.que in the demo directory.
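For readers who want to see what the steps above amount to, the following Python sketch approximates the same processing outside CLaRK. Several things here are assumptions made for illustration: the two inline sample documents stand in for the .tag files (whose content is not reproduced here), whitespace splitting stands in for the MixedWord tokenizer, normalized is taken to mean collapsing runs of whitespace, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def select_text_nodes(element, inside=False):
    """Yield text nodes as //text/descendant-or-self::text() would:
    every text node that lies inside some element named 'text'."""
    inside = inside or element.tag == "text"
    if inside and element.text:
        yield element.text
    for child in element:
        yield from select_text_nodes(child, inside)
        # A child's tail is a text node belonging to the current element.
        if inside and child.tail:
            yield child.tail


def united_statistics(documents, tokenize=str.split):
    """Count tokens over all documents into one united result."""
    counts = Counter()
    for xml_source in documents:
        root = ET.fromstring(xml_source)
        for node in select_text_nodes(root):
            key = node.strip()            # "trim": drop surrounding spaces
            key = " ".join(key.split())   # "normalized": assumed whitespace collapse
            counts.update(tokenize(key))  # stand-in for the chosen tokenizer
    return counts


# Hypothetical stand-ins for the two demo documents.
DOC_A = "<doc><head>ignored</head><text><p>the cat  saw <b>the</b> dog</p></text></doc>"
DOC_B = "<doc><text><p> the dog </p></text></doc>"

united = united_statistics([DOC_A, DOC_B])
for token, n in united.most_common():
    print(token, n)
```

Text outside the text elements (here, the content of head) is not selected, and the counts from both documents are merged into a single result, which corresponds to choosing United in the options.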

The result consists of all tokens from the selected documents, as produced by the chosen tokenizer (MixedWord). Token categories that are of no interest to the user can be excluded by unchecking them in the Tokenizer Category Filter table, which is opened with the Customize button. The categories TAB, SPACE, LF, CR and ESC can be excluded as well.