Sort Tool Demo

The goal of this demo is to show how the Sort tool of the CLaRK system can be used to order similar structures in XML documents according to given criteria.

The Sort tool is used for sorting XML elements within documents. The user specifies the elements to be sorted with an XPath expression. Only elements on the same document level can be sorted.

The XML elements are sorted on the basis of data connected with them. This data is determined via keys. Each key is defined by an XPath expression together with additional processing of the selected data, such as normalization and tokenization. For each key the user can select the order of sorting (ascending or descending) and, via the Reverse option, whether the data is compared by its endings. If the data must be compared as numbers, the Number option must be selected. Additionally, the user can select Trim so that spaces around the text (if any) are ignored in the comparison, and Normalize so that different written forms of the same token count as equal. A tokenizer can also be set.
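The effect of these options on a single key can be sketched in Python. This is only an illustrative approximation, not CLaRK's implementation: the helper name prepare_key is hypothetical, and Normalize is assumed here to mean case folding plus collapsing of inner whitespace, which may differ from what the configured tokenizer actually does.

```python
def prepare_key(raw, trim=False, normalize=False, reverse=False, number=False):
    """Turn the text selected by a key's XPath into a comparable value.

    Hypothetical helper: the parameters mirror the dialog options, but
    their exact semantics inside CLaRK are assumed, not taken from it.
    """
    text = raw
    if trim:
        # Trim: ignore spaces around the text.
        text = text.strip()
    if normalize:
        # Normalize (assumed): fold case and collapse inner whitespace.
        text = " ".join(text.casefold().split())
    if number:
        # Number: compare numerically instead of character by character.
        return float(text)
    if reverse:
        # Reverse: compare by endings, i.e. read the string right to left.
        text = text[::-1]
    return text


words = [" pear", "Apple ", "fig"]

def word_key(w):
    return prepare_key(w, trim=True, normalize=True)

ascending = sorted(words, key=word_key)                 # ascending order
descending = sorted(words, key=word_key, reverse=True)  # descending order
```

In this sketch the sort direction is kept outside the key function, so the same key can serve either order.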

To apply the Sort tool to more than one document, select Multiple Apply. A new part of the dialog then appears, in which you select the documents from the corresponding document group via the Add Documents button.

Additionally, you have to specify the output document for the result. Normally the system offers a separate output document for each input document, but in this case it is also possible to overwrite the source document with the result of the sort. To do so, select Options and then the Overwrite radio button, and return to the main dialog.

Demo 1: Sort over statistics

Ascending according to the number of occurrences

Our goal here is to sort all tokens counted in the statistics by their number of occurrences.

The document used for this demo is standartResult.stat - a statistics document. For each token counted, there is an <item> tag. Each <item> contains the token itself (<element> tag), the category of the token (<category> tag), the number of occurrences of the token in the document(s) (<number> tag), the percentage (<percent> tag) and the values of the keys according to which the statistics were compiled (<keyvalue> tag).

In order to run the demo you have to perform the following steps:

  1. Check whether the document standartResult.stat is loaded in the system. If it is not, load it as described in the Import XML Tool Demo.
  2. Open the query dialog of the Sort tool from the Tools menu.
  3. In the text box Select Elements: enter the XPath expression //item. This expression selects all item elements within the document.
  4. The key in this case is the text of the <number> child of the <item> element. This is specified by the XPath expression number/text(). Additionally, you have to select Number in order to compare the text as numbers, not as strings. Take 21 and 112 as an example: compared as numbers, 21 is less than 112; compared as strings, 112 is less than 21, because 1 precedes 2 in lexicographic order.
  5. The dialog of the tool in this case has to be:
  6. Then you can run the query.

The above query is saved in the document number.sort in the demo directory.
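The difference between comparing the <number> text as strings and as numbers can be reproduced outside CLaRK with a short Python sketch. The sample below is made up in the shape described for the statistics document (the root element name statistics is an assumption, and the other child tags are omitted); it is not the content of standartResult.stat, and the code does not use any CLaRK API.

```python
import xml.etree.ElementTree as ET

# Made-up sample; the root element name is an assumption.
SAMPLE = """
<statistics>
  <item><element>cat</element><number>112</number></item>
  <item><element>dog</element><number>21</number></item>
  <item><element>owl</element><number>3</number></item>
</statistics>
"""

root = ET.fromstring(SAMPLE)
items = root.findall(".//item")  # counterpart of the XPath //item


def number_text(item):
    # Counterpart of the key number/text().
    return item.findtext("number").strip()


# Without Number: the key values are compared character by character.
as_strings = [number_text(i) for i in sorted(items, key=number_text)]

# With Number: the key values are converted before comparison.
as_numbers = [number_text(i)
              for i in sorted(items, key=lambda i: int(number_text(i)))]

print(as_strings)  # ['112', '21', '3']
print(as_numbers)  # ['3', '21', '112']
```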

Ascending according to the number of token occurrences and descending according to the tokens

The query above can be extended so that the elements are not only sorted in ascending order by number of occurrences, but elements with an equal number of occurrences are additionally sorted in descending order by the value of the token.

This can be done with a second key.

The second key in this case is the text of the <element> child of the <item> element. This is specified by the XPath expression element/text(). Additionally, you have to select Trim so that spaces around the text (if any) are ignored in the comparison, and Normalize so that different written forms of the same token count as equal. The tokenizer MixedWord can be set.

The dialog in this case will be:

The query is saved in the document numElem.sort in the current directory.
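A two-key ordering of this kind (ascending by count, ties broken in descending order by token) can be approximated in Python with two passes of a stable sort. This is a sketch under assumptions: the sample data and the root element name are made up, Normalize is approximated by case folding, and the MixedWord tokenizer is not modelled at all.

```python
import xml.etree.ElementTree as ET

# Made-up sample; not taken from the demo files.
SAMPLE = """
<statistics>
  <item><element> pear </element><number>7</number></item>
  <item><element>apple</element><number>7</number></item>
  <item><element>fig</element><number>2</number></item>
  <item><element>Zebra</element><number>7</number></item>
</statistics>
"""

root = ET.fromstring(SAMPLE)
items = root.findall(".//item")


def token_key(item):
    # element/text() with Trim and an assumed case-folding Normalize.
    return item.findtext("element").strip().casefold()


def count_key(item):
    # number/text() compared as a number.
    return int(item.findtext("number"))


# list.sort is stable, so sort by the least significant key first.
items.sort(key=token_key, reverse=True)  # second key: descending by token
items.sort(key=count_key)                # first key: ascending by count

ordered = [(count_key(i), token_key(i)) for i in items]
print(ordered)  # [(2, 'fig'), (7, 'zebra'), (7, 'pear'), (7, 'apple')]
```

Sorting in two passes avoids having to negate a string key, which a single tuple key would require for mixed directions.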

The item elements in the result document will be ordered as follows:

Demo 2: Sort over a grammar (reverse)

Our goal here is to sort all the rules in a grammar by the endings of their Regular Expressions. We rely on the assumption that words with the same endings have the same grammatical characteristics, so that once the rules are grouped by ending it is easy to fill in these characteristics in the grammar.

The document used for this demo is stat_to_gram.gram - a grammar document. Each grammar consists of rules. For each rule there is a <line> tag. Each rule contains a Regular Expression (<RE> tag) and a Return Mark-up (<RM> tag). It can also contain a Left Context of the Regular Expression (<LC> tag) and a Right Context of the Regular Expression (<RC> tag).

In order to run the demo you have to perform the following steps:

  1. Check whether the document stat_to_gram.gram is loaded in the system. If it is not, load it as described in the Import XML Tool Demo.
  2. Open the query dialog of the Sort tool from the Tools menu.
  3. In the text box Select Elements: enter the XPath expression //line. This expression selects all line elements (rules) within the document.
  4. The key in this case is the text of the <RE> child of the <line> element. This is specified by the XPath expression RE/text(). Additionally, you have to select Reverse in order to sort by the endings of the RE, Trim so that spaces around the text (if any) are ignored in the comparison, and Normalize so that different written forms of the same token count as equal. The tokenizer MixedWord can be set.
  5. The dialog of the tool in this case has to be:
  6. Then you can run the query.

The above query is saved in the document grammar.sort in the demo directory.
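The idea of sorting by endings can be illustrated in Python by reversing each key string before comparison, so that rules whose expressions share an ending end up next to each other. The sample is made up (the root element name grammar and the simplified <RE> and <RM> contents are assumptions), Normalize is approximated by case folding, and the code does not reproduce CLaRK's tokenizer-based comparison.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape described for grammar documents.
SAMPLE = """
<grammar>
  <line><RE>walking</RE><RM>V</RM></line>
  <line><RE>tables</RE><RM>N</RM></line>
  <line><RE> talking </RE><RM>V</RM></line>
  <line><RE>chairs</RE><RM>N</RM></line>
</grammar>
"""

root = ET.fromstring(SAMPLE)
lines = root.findall(".//line")  # counterpart of the XPath //line


def ending_key(line):
    # RE/text() with Trim and an assumed case-folding Normalize,
    # then reversed so that comparison starts from the last character.
    text = line.findtext("RE").strip().casefold()
    return text[::-1]


by_ending = [l.findtext("RE").strip() for l in sorted(lines, key=ending_key)]
print(by_ending)  # ['talking', 'walking', 'tables', 'chairs']
```

In the sample, the two expressions ending in "ing" and the two ending in "s" become adjacent, which is the grouping the demo relies on.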