XSLT Tool Demo

The goal of this demo is to show how the transformation tool of the CLaRK System can be used to convert one XML document (a result produced by the statistics tool) into another XML document (a grammar).

The XSLT tool is used for transformation of XML documents. The XSL transformations can be applied locally to an XML element and its content.

The XSLT processor takes as an input a source document and a transformation document, and returns a new document that is called the result document.

A source document can be any well-formed XML document.

In the CLaRK System every transformation document is a well-formed XML document that contains transformation rules. The tag names used in transformation documents are reserved tag names defined in the XSL standard. To distinguish them from user-defined tag names, they are usually preceded by the prefix xsl followed by a colon (:), in other words, by a namespace prefix for XSL.

The root element of each transformation document must be either an xsl:transform or an xsl:stylesheet tag.

The content of an xsl:transform consists of template elements. A template typically contains instructions that select an additional list of source nodes for processing. The match attribute of a template contains a string that is interpreted as an XPath expression; it identifies the nodes to which the template will be applied. The child elements of a template element can be of any kind, but some tag names have a special meaning.
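As an illustration, a minimal transformation document might look like the sketch below. It is written against the XSLT 1.0 standard: the version attribute and the xmlns:xsl declaration are required there, but the demo transformation later in this document omits them, so whether CLaRK needs or accepts them is an assumption to verify. The entry tag is a hypothetical user-defined tag.

```xml
<xsl:transform version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The match attribute selects the nodes this template applies to. -->
  <xsl:template match="item">
    <!-- "entry" is a hypothetical user-defined tag; it has no special
         meaning and is copied to the result as it is. -->
    <entry>
      <!-- Reserved tag: continue processing with the children of item. -->
      <xsl:apply-templates/>
    </entry>
  </xsl:template>

</xsl:transform>
```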

The xsl:apply-templates tag tells the XSLT processor to continue applying the rules. In the absence of a select attribute, the instruction processes all the children of the current node, including text nodes. A select attribute can be used for processing nodes selected by an XPath expression instead of processing all children.
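The following sketch shows a select attribute restricting the nodes that are processed. It assumes standard XSLT 1.0 behaviour and uses the element names of the demo source document described later (statistics, item, number, element). The frequent and token tags and the threshold of 10 are hypothetical.

```xml
<xsl:transform version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="statistics">
    <frequent>
      <!-- Only item children whose number child is greater than 10 are
           processed; without select, all children would be processed. -->
      <xsl:apply-templates select="item[number &gt; 10]"/>
    </frequent>
  </xsl:template>

  <!-- Applied to each item selected above. -->
  <xsl:template match="item">
    <token><xsl:value-of select="element"/></token>
  </xsl:template>

</xsl:transform>
```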

The xsl:for-each tag applies its content to every node in the node set selected by its select attribute. Its content is the same as the content of a template tag and is used in the same way, but it may additionally have xsl:sort children that determine the order in which the nodes are processed.

The xsl:sort tag determines the key for sorting the list of nodes that results from the XPath expression specified in the select attribute. When no such tag occurs, the list is processed in document order. Otherwise, it is sorted first. The first sort child determines the first key, the second sort child determines the second key, and so on. The xsl:sort tags may appear as children of xsl:for-each and xsl:apply-templates tags. Elsewhere xsl:sort is treated as a normal tag without a special meaning.
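A sketch of xsl:for-each with two sort keys is given below. The data-type and order attributes come from the XSLT 1.0 standard; their support in CLaRK is an assumption that should be checked. The tokens and token tags are hypothetical.

```xml
<xsl:transform version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="statistics">
    <tokens>
      <xsl:for-each select="item">
        <!-- First key: number of occurrences, highest first. -->
        <xsl:sort select="number" data-type="number" order="descending"/>
        <!-- Second key: the token text, used when the counts are equal. -->
        <xsl:sort select="element"/>
        <token><xsl:value-of select="element"/></token>
      </xsl:for-each>
    </tokens>
  </xsl:template>

</xsl:transform>
```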

The xsl:if tag contains an XPath expression in its test attribute that is evaluated (or cast) to a boolean. If the result is true, the processor applies the content of the tag to the current node. Otherwise it does nothing.
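For example, the following sketch emits a token only for items that occur more than once. The threshold and the token tag are hypothetical, and standard XSLT 1.0 behaviour is assumed.

```xml
<xsl:transform version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="item">
    <!-- The test expression is converted to a boolean. -->
    <xsl:if test="number &gt; 1">
      <token><xsl:value-of select="element"/></token>
    </xsl:if>
    <!-- When the test is false, nothing is produced for this item. -->
  </xsl:template>

</xsl:transform>
```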

The xsl:copy-of tag contains an XPath expression that is evaluated to a list of nodes, a boolean, a number or a string. In the first case each of the selected nodes, together with its subtree, is attached to the result. In the other cases the value is converted to a string and then added to the result tree.

The xsl:copy tag copies the current node and its attributes. The subtree is not copied. No attributes are defined for this tag.
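The difference between the two copying tags can be sketched as follows, assuming a source document whose statistics root has item children. The summary, full and shallow tags are hypothetical.

```xml
<xsl:transform version="1.0"
               xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="statistics">
    <summary>
      <!-- xsl:copy-of: the first item is attached with its whole subtree. -->
      <full><xsl:copy-of select="item[1]"/></full>
      <!-- The same item is handed to the template below instead. -->
      <shallow><xsl:apply-templates select="item[1]"/></shallow>
    </summary>
  </xsl:template>

  <xsl:template match="item">
    <!-- xsl:copy: only the item node itself is copied, not its subtree. -->
    <xsl:copy/>
  </xsl:template>

</xsl:transform>
```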

The xsl:value-of element is instantiated to create a text node in the result tree. The required select attribute is an XPath expression that is evaluated and the resulting object is converted to a string.

These are the special meaning tags. Every other tag or character data is directly copied to the result tree.

For complete information about the XSLT standard see XSLT. Note that not everything stated in the standard is implemented in the CLaRK System, and some things work differently.

Demo: Transformation over Statistics

The goal is the creation of an initial grammar on the basis of tokens in some texts. The grammar will have to be further developed in order to be used for morphological annotation.

The documents used for this demo are standartResult.stat (the source document) and statToGram.xsl (the transformation document). The source document is the result of applying the system's statistics tool to text documents. For each token counted there is an <item> tag. Each <item> contains the token itself (<element> tag), the category of the token (<category> tag), the number of occurrences of the token in the document(s) (<number> tag), the percentage (<percent> tag) and the values of the keys according to which the statistics were computed (<keyvalue> tag).
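To make the structure concrete, a source document of this kind might look like the sketch below. The root name statistics is taken from the transformation that follows. All values, the order of the child tags and the category name are hypothetical and are not taken from standartResult.stat.

```xml
<statistics>
  <item>
    <element>the</element>       <!-- the token itself -->
    <category>WORD</category>    <!-- hypothetical category name -->
    <number>12</number>          <!-- occurrences in the document(s) -->
    <percent>4.8</percent>       <!-- hypothetical percentage -->
    <keyvalue>the</keyvalue>     <!-- hypothetical key value -->
  </item>
  <item>
    <element>cat</element>
    <category>WORD</category>
    <number>3</number>
    <percent>1.2</percent>
    <keyvalue>cat</keyvalue>
  </item>
</statistics>
```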

The transformation is as follows:

<xsl:transform>
 <xsl:template  match="self::statistics">
  <Grammars><grammar>
   <name>stat_to_gram</name>
   <xsl:apply-templates select="item"/>
  </grammar></Grammars>
 </xsl:template>
	
 <xsl:template match="item">
  <line>
   <RE>"<xsl:value-of select="element/text()"/>"</RE>
   <RM>&lt;w&gt;&lt;ph&gt;\w&lt;/ph&gt;&lt;aa&gt;&lt;/aa&gt;&lt;ta&gt;&lt;/ta&gt;&lt;/w&gt;</RM>
  </line>
 </xsl:template>
</xsl:transform>

The XSL processor finds the first template and uses it as the main program. The other templates in the transformation are used as sub-programs that are called from the main program or from other sub-programs. The sub-programs are invoked by the xsl:apply-templates tag. In the transformation above two templates are used.

The first one is applied to the root element of the source document (statistics) and produces the root of the result document (Grammars), the grammar node and the name of the grammar.

The second template is applied to the item nodes from the statistics, and for each item in the statistics document it produces a rule (a line element) in the grammar document.

Each rule contains an RE element, which holds the text of the element node from the statistics, and an appropriate return mark-up (RM). The RM is the category assigned to a given rule. The category has to be encoded as XML mark-up in the document, and this mark-up can differ considerably depending on the DTD in use. In the CLaRK System the return mark-up is custom mark-up that substitutes the recognized word. Since in most cases we would also like to keep the recognized word, we use the variable \w for it. The XML mark-up used in the RM (written with escaped angle brackets in the transformation) is as follows:

<w><ph>\w</ph><aa></aa><ta></ta></w>

In order to run the demo you have to perform the following steps:

  1. Check whether the document standartResult.stat and the transformation statToGram.xsl are loaded and saved in the system. If they are not, load them as described in the Multi Import Tool Demo.
  2. Open the source document in the system.
  3. Add the transformation in the XSLT Manager. This can be done from the menu Tool/XSLT/XSLT Manager with the Add New button.
  4. Apply the transformation to the source document. This can be done in two ways:
  5. Choose the transformation from the Select XSLT combo box.
  6. The dialog of the tool in this case has to look as follows: [Figure: XSLT tool dialog]

  7. In order to apply the transformation to more than one document, select Multiple Apply. A new part of the dialog then appears, in which you select the documents from the corresponding document groups via the Add Documents button.
  8. Additionally, you have to specify the output documents for the results. Normally the system offers a separate output document for each input document.
  9. Then you can run the query.

  10. The result document will be opened in the System. If the query is applied to more than one document, the results will be saved in the group Root : SYSTEM : Results : XSLT, the standard group for results from this tool. If you like, you can change it.

The above query is saved in document stat_to_gram.stat.que in the demo directory.

The result will be a grammar with a number of rules equal to the number of item elements from the statistics.
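As a sketch, for a hypothetical source document with two items whose tokens are "the" and "cat", the resulting grammar should look roughly like the fragment below. Indentation is added for readability, and the exact whitespace and serialization produced by CLaRK are not confirmed here.

```xml
<Grammars><grammar>
  <name>stat_to_gram</name>
  <!-- One line element per item element of the statistics. -->
  <line>
    <RE>"the"</RE>
    <RM>&lt;w&gt;&lt;ph&gt;\w&lt;/ph&gt;&lt;aa&gt;&lt;/aa&gt;&lt;ta&gt;&lt;/ta&gt;&lt;/w&gt;</RM>
  </line>
  <line>
    <RE>"cat"</RE>
    <RM>&lt;w&gt;&lt;ph&gt;\w&lt;/ph&gt;&lt;aa&gt;&lt;/aa&gt;&lt;ta&gt;&lt;/ta&gt;&lt;/w&gt;</RM>
  </line>
</grammar></Grammars>
```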

The result document is opened in the system. It can be loaded as a grammar (stat_to_gram) in the grammar Editor from Tool/Grammars/Grammar Manager - XML Editor/Load Current Document As Grammar. It is up to the user whether to fill all possible grammatical characteristics in aa for each word. In this way a simple tagger can be made.