The goal of this demo is to show how the transformation tool of the CLaRK System can be used to convert one XML document (a statistics result) into another XML document (a grammar).
The XSLT tool is used for transformation of XML documents. The XSL transformations can be applied locally to an XML element and its content.
The XSLT processor takes as an input a source document and a transformation document, and returns a new document that is called the result document.
A source document can be any well-formed XML document.
In the CLaRK System every transformation document is a well-formed XML document that contains transformation rules. The tag names used in transformation documents are reserved tag names defined in the XSL standard. To keep them apart from user-defined tag names, they are usually preceded by the xsl prefix followed by a colon (:), that is, by the namespace prefix for XSL.
The root element of each transformation document must be either an xsl:transform or an xsl:stylesheet tag.
The content of an xsl:transform consists of template elements. A template typically contains instructions that select an additional list of source nodes for processing. The match attribute contains a string that is interpreted as an XPath expression identifying the nodes to which the template will be applied. The child elements of a template element can be of any kind, but some tag names have a special meaning.
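As a minimal sketch of such a transformation document, the fragment below has one template and follows the convention used in this demo of writing the root tag without a namespace declaration. The token tag is a hypothetical user-defined tag chosen only for illustration. Outside the CLaRK System, a standard XSLT 1.0 processor would additionally require xmlns:xsl="http://www.w3.org/1999/XSL/Transform" and version="1.0" on the root element.

```xml
<xsl:transform>
  <!-- match holds an XPath expression: this template applies to item nodes -->
  <xsl:template match="item">
    <!-- token is a hypothetical user-defined tag; it is copied to the result -->
    <token>
      <!-- continue applying the rules to the children of the current item -->
      <xsl:apply-templates/>
    </token>
  </xsl:template>
</xsl:transform>
```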
The xsl:apply-templates
tag tells the XSLT processor to continue applying the rules. In the
absence of a select
attribute, the instruction processes all the children of the current node,
including text nodes. A select
attribute can be used for processing nodes selected by an XPath
expression instead of processing all children.
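The difference between the two forms can be sketched as follows, assuming a source document whose item elements have an element child (as in the statistics documents described in this demo). The entry and token tags are hypothetical result tags, and the root is written without a namespace declaration, as in the other examples here.

```xml
<xsl:transform>
  <xsl:template match="item">
    <entry>
      <!-- With select: only the element children of the current item
           are processed further; its other children are skipped -->
      <xsl:apply-templates select="element"/>
    </entry>
  </xsl:template>
  <xsl:template match="element">
    <token>
      <!-- Without select: all children of the current node,
           including its text nodes, are processed -->
      <xsl:apply-templates/>
    </token>
  </xsl:template>
</xsl:transform>
```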
The xsl:for-each tag applies its content to every node in the node set selected by its select attribute. Its content is the same as that of a template tag and is used in the same way, but it may additionally have sort tag children that determine the order in which the nodes are processed.
The xsl:sort tag determines the key for sorting the list that results from the XPath expression specified in the select attribute. When no such tag occurs, the list is processed in document order. Otherwise, it is sorted first. The first sort child determines the first key, the second sort child the second key, and so on. The xsl:sort tags may appear as children of xsl:for-each and xsl:apply-templates tags. Elsewhere, xsl:sort is treated as a normal tag without a special meaning.
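A short sketch of xsl:for-each with two sort keys, written against the XSLT standard; since the CLaRK System does not implement everything in the standard, the exact behaviour there is an assumption. The tokens and token tags are hypothetical result tags, while item, category and element are the tag names of the statistics documents used in this demo.

```xml
<xsl:transform>
  <xsl:template match="statistics">
    <tokens>
      <!-- visit every item child of statistics -->
      <xsl:for-each select="item">
        <!-- first key: the category of the token -->
        <xsl:sort select="category"/>
        <!-- second key: the token itself, used when categories are equal -->
        <xsl:sort select="element"/>
        <token><xsl:value-of select="element"/></token>
      </xsl:for-each>
    </tokens>
  </xsl:template>
</xsl:transform>
```

Without the two xsl:sort children, the item nodes would be processed in document order.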
The xsl:if tag contains an XPath expression in its test attribute that is evaluated (or cast) to a boolean. If the result is true, the processor applies the content of the tag to the current node. Otherwise it does nothing.
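For illustration, the sketch below emits a rule only for tokens that occur more than once. The threshold and the frequent tag are hypothetical, and the comparison assumes that the number child of item holds a numeric string and that the XPath implementation supports numeric comparison as in the standard.

```xml
<xsl:transform>
  <xsl:template match="item">
    <!-- &gt; is the escaped form of the greater-than sign -->
    <xsl:if test="number &gt; 1">
      <!-- reached only when the test evaluates to true -->
      <frequent><xsl:value-of select="element"/></frequent>
    </xsl:if>
  </xsl:template>
</xsl:transform>
```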
The xsl:copy-of tag contains an XPath expression that is evaluated to a list of nodes, a boolean, a number, or a string. In the first case each of the selected nodes, together with its subtree, is attached to the result. In the other cases the value is converted to a string and then added to the result tree.
The xsl:copy
tag copies the current node and its attributes. The subtree is not copied. No
attributes are defined for this tag.
The xsl:value-of
element is instantiated to create a text node in the result tree. The required
select
attribute is an XPath expression that is evaluated and the resulting object is converted to a
string.
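The three copying instructions can be contrasted in one sketch. The entry and form tags are hypothetical result tags; category, element and number are tag names from the statistics documents of this demo. The comments restate the behaviour described above, which is assumed here rather than verified against the CLaRK System.

```xml
<xsl:transform>
  <xsl:template match="item">
    <entry>
      <!-- copy-of: each selected category node is attached to the
           result together with its subtree -->
      <xsl:copy-of select="category"/>
      <!-- value-of: the selected element node is converted to a string,
           which becomes a text node inside form -->
      <form><xsl:value-of select="element"/></form>
      <!-- hand the number child over to the template below -->
      <xsl:apply-templates select="number"/>
    </entry>
  </xsl:template>
  <xsl:template match="number">
    <!-- copy: only the current number node is copied, not its subtree -->
    <xsl:copy/>
  </xsl:template>
</xsl:transform>
```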
These are the tags with a special meaning. Every other tag and all character data are copied directly to the result tree.
For complete information about the XSLT standard, see the XSLT specification. Note that not everything stated in the standard is implemented in the CLaRK System, and some things work differently.
The goal is the creation of an initial grammar on the basis of tokens in some texts. The grammar will have to be further developed in order to be used for morphological annotation.
The documents used for this demo are standartResult.stat (the source document) and statToGram.xsl (the transformation document).
The source document is the result of applying the system's statistics tool to text documents. For each token counted there is an <item> tag. Each <item> contains the token itself (<element> tag), the category of the token (<category> tag), the number of occurrences of the token in the document(s) (<number> tag), the percentage (<percent> tag), and the values of the keys according to which the statistics were computed (<keyvalue> tag).
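A source document of this kind might look roughly like the sketch below. Only the tag names come from the description above; every value is invented for illustration, and the real standartResult.stat may contain further elements or attributes.

```xml
<statistics>
  <item>
    <element>the</element>        <!-- the token itself -->
    <category>word</category>     <!-- hypothetical category name -->
    <number>12</number>           <!-- invented number of occurrences -->
    <percent>4.8</percent>        <!-- invented percentage -->
    <keyvalue>the</keyvalue>      <!-- hypothetical key value -->
  </item>
  <item>
    <element>cat</element>
    <category>word</category>
    <number>3</number>
    <percent>1.2</percent>
    <keyvalue>cat</keyvalue>
  </item>
</statistics>
```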
The transformation is as follows:
<xsl:transform>
  <xsl:template match="self::statistics">
    <Grammars><grammar>
      <name>stat_to_gram</name>
      <xsl:apply-templates select="item"/>
    </grammar></Grammars>
  </xsl:template>
  <xsl:template match="item">
    <line>
      <RE>"<xsl:value-of select="element/text()"/>"</RE>
      <RM><w><ph>\w</ph><aa></aa><ta></ta></w></RM>
    </line>
  </xsl:template>
</xsl:transform>
The XSL processor finds the first template and uses it as a main program. The other templates in the transformation are used as sub-programs that are called from the main program or from other sub-programs. The sub-programs are invoked by the xsl:apply-templates tag. The transformation above uses two templates.
The first one is applied to the root element of the source document, statistics, and produces the root of the result document, Grammars, together with the grammar node and the name of the grammar.
The second template is applied to the item nodes of the statistics and, for each item in the statistics document, produces a rule (a line element) in the grammar document.
Each rule contains an RE element, which holds the text of the element node of the statistics, and an appropriate return mark-up (RM). The RM is the category assigned to a given rule. The category has to be encoded as XML mark-up in the document, and this mark-up can differ considerably depending on the DTD in use. In the CLaRK System there is a custom mark-up that substitutes the recognized word. Since in most cases we would also like to keep the recognized word, we use the variable \w for it. The XML mark-up used in the RM is as follows:
w - for recognized words
ph - phonetics - orthographical representation of the word
aa - all analyses - all possible morphological characteristics associated with the current word
ta - true analysis - the appropriate morphological characteristic of the word in the current context
In order to run the demo you have to perform the following steps:
1. Make sure that the document standartResult.stat and the transformation statToGram.xsl are loaded and saved in the system. If they are not, load them as described in the Multi Import Tool Demo.
2. Open Tool/XSLT/XSLT Manager and add the transformation with the Add New button.
3. Press the Apply button.
4. Open Tool/XSLT/Apply XSLT.
5. Choose the transformation in the Select XSLT combo box.
In this case the dialog of the tool has to be set up as follows:
Select Multiple Apply. Then a new part of the dialog appears in which you have to select the documents from the corresponding document groups via the Add Documents button. Then you can run the query.
The results go to Root : SYSTEM : Results : XSLT, the standard group for results from this tool. If you like, you can change it.
The above query is saved in the document stat_to_gram.stat.que in the demo directory.
The result will be a grammar with a number of rules equal to the number of item
elements from the statistics.
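As an illustration, assume a hypothetical statistics document with two item elements whose element values are the and cat. The transformation statToGram.xsl would then be expected to produce a grammar of roughly the following shape, with one line per item. The token values are invented, and the exact whitespace and the serialization of the empty aa and ta elements may differ in the CLaRK System.

```xml
<Grammars>
  <grammar>
    <name>stat_to_gram</name>
    <!-- rule produced from the first item -->
    <line>
      <RE>"the"</RE>
      <RM><w><ph>\w</ph><aa></aa><ta></ta></w></RM>
    </line>
    <!-- rule produced from the second item -->
    <line>
      <RE>"cat"</RE>
      <RM><w><ph>\w</ph><aa></aa><ta></ta></w></RM>
    </line>
  </grammar>
</Grammars>
```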
The result document is opened in the system. It can be loaded as a grammar (stat_to_gram) in the grammar editor from Tool/Grammars/Grammar Manager - XML Editor/Load Current Document As Grammar. It is up to the user whether to fill in all possible grammatical characteristics in aa for each word. In this way a simple tagger can be made.