Bulgarian NLP pipeline in CLaRK System (BTB-Pipe)
The Bulgarian Natural Language Processing pipeline (BTB-Pipe) comprises the following modules:
- Tokenizer and sentence splitter
- Morphosyntactic tagger
- Dependency parser
The pipe achieves the following performance: POS tagging accuracy is 96.87 %, lemmatization accuracy is 95.25 %, and for dependency parsing LAS is 87.41 % and UAS is 90.83 %.
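For readers unfamiliar with the parsing metrics: UAS is conventionally the share of tokens that receive the correct head, and LAS the share that receive both the correct head and the correct relation label. The following is a minimal sketch of that conventional definition, not the evaluation script used for BTB-Pipe. The function name `attachment_scores` and the toy data are hypothetical, and details such as the treatment of punctuation are not specified here.

```python
from typing import List, Tuple

# Each token is described by (head index, dependency label).
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages over two aligned token lists."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty token list")
    # UAS counts tokens whose head matches; LAS also requires the label to match.
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_arcs = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_arcs / total


# Toy data (hypothetical): four tokens, labels are illustrative only.
gold = [(2, "subj"), (0, "ROOT"), (4, "mod"), (2, "obj")]
pred = [(2, "subj"), (0, "ROOT"), (2, "mod"), (2, "indobj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

In the toy data three of four heads are correct and two of four labelled arcs are correct, so UAS is always at least as high as LAS.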
Here we briefly describe the main functionalities of BTB-Pipe. It is implemented using the following systems: a) CLaRK System – an XML Based System for Corpora Development and Processing, and b) MATE Tools for statistical NLP, trained on Bulgarian data. CLaRK System handles the following tasks: tokenization, sentence boundary detection, lemmatization, and post-processing. MATE Tools are used for statistical POS tagging and dependency parsing. For POS tagging we use the BulTreeBank Morphosyntactic Tagset, and for dependency parsing we use the dependency relations constructed for the CoNLL 2006 shared task.
The BTB-Pipe is implemented in Java and thus does not require any installation. After downloading the archive, unzip it to an appropriate location on your hard drive. The root directory of the BTB-Pipe contains two scripts: runClark.bat for Windows and runClark.sh for Linux. Use them to start the pipe.
The user interface of BTB-Pipe is the CLaRK System. For working with CLaRK System, we advise users to consult the documentation of the system. The pipe can be used in two modes: on one document or on several documents. The documents must contain only text enclosed in the XML element <textdata> (the structure of the document is defined by the laska.dtd DTD, compiled inside the system). Here is a screenshot of one opened document within the system:
The document may contain several lines, but they are shown as one line in this view within CLaRK System. To apply the pipe to the text within the document, you have to run one of the saved scripts within the CLaRK System. They are called multiqueries. To apply them, select the MultyQuery Tool item in the Tools menu (Tools → MultyQuery Tool). The MultiQuery tool is designed to call lists of other tools, which are executed one by one in the order of their appearance. The result of each tool application is the input for the next one.
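The chaining behaviour of a multiquery can be pictured with a small sketch. It only illustrates the idea that each tool receives the previous tool's output. The function `run_multiquery` and the two stand-in tools are hypothetical and are not part of CLaRK, whose tools operate on XML documents rather than on plain strings.

```python
from functools import reduce
from typing import Callable, List

Tool = Callable[[str], str]


def run_multiquery(tools: List[Tool], document: str) -> str:
    """Apply each tool in list order, feeding each result into the next tool."""
    return reduce(lambda doc, tool: tool(doc), tools, document)


# Hypothetical stand-in tools, used only to show the chaining.
def strip_whitespace(doc: str) -> str:
    return doc.strip()


def mark_tokens(doc: str) -> str:
    return " ".join(f"[{word}]" for word in doc.split())


result = run_multiquery([strip_whitespace, mark_tokens], "  Това е пример  ")
print(result)
```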
When you select MultyQuery Tool, a dialog window appears. To choose the pipeline, click the Select button. You can then select one of the two multiqueries: runAllMateMorph (for tokenization, lemmatization and morphosyntactic tagging) or runAllAndDependency (for all previous steps plus dependency parsing). Once a multiquery is selected, start the pipeline by clicking Start in the MultyQuery Tool dialog window. The system prompts for confirmation with three options: Yes | Yes to All | Cancel. Click Yes to All in order to run all the subprocedures. After the processing has completed, you can see the result. Here are some screenshots:
Screenshot for runAllMateMorph
The result consists of the following elements: <s> for sentences and <tok> for tokens. Each tok element has the following attributes:
- aa – this attribute contains all possible morphosyntactic tags from the inflectional lexicon
- ana – this attribute contains the tag selected as correct within the current context
- len – this attribute contains the length of the token
- lm – this attribute contains the lemma for the token
- n – this attribute represents the number of the token in the text
- offset – this attribute contains the number of symbols between the current token and the previous token. Usually the values are: 0 if there are no symbols between them (for example, a punctuation mark directly after the previous word) or 1 if there is one space between two words. Depending on the text, there could be greater values.
- pos – this attribute is a short representation of the morphosyntactic tag
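The attributes above can be read with any XML library once the result is exported from CLaRK. The following is a minimal sketch in Python under several assumptions: that the token's surface form is the text content of the tok element, that the symbols counted by offset are spaces, and that the elements sit under a textdata root. The sample document is hand-written, and its attribute values are illustrative rather than real pipeline output.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    form: str      # surface form (assumed to be the element's text content)
    n: int         # number of the token in the text
    offset: int    # symbols between this token and the previous one
    length: int    # value of the len attribute
    lemma: str     # lm
    tag: str       # ana: tag chosen for the current context
    pos: str       # short representation of the tag
    all_tags: str  # aa: kept as a raw string; its internal format is not assumed


# Hand-written sample; attribute values are illustrative only.
SAMPLE = """<textdata><s>
<tok n="1" offset="0" len="4" lm="това" ana="Pde-os-n" pos="P" aa="Pde-os-n">Това</tok>
<tok n="2" offset="1" len="1" lm="съм" ana="Vxitf-r3s" pos="V" aa="Vxitf-r3s">е</tok>
<tok n="3" offset="1" len="6" lm="пример" ana="Ncmsi" pos="N" aa="Ncmsi">пример</tok>
<tok n="4" offset="0" len="1" lm="." ana="punct" pos="punct" aa="punct">.</tok>
</s></textdata>"""


def read_sentences(xml_text: str) -> List[List[Token]]:
    """Return one list of Token objects per <s> element."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = [
            Token(
                form=t.text or "",
                n=int(t.attrib["n"]),
                offset=int(t.attrib["offset"]),
                length=int(t.attrib["len"]),
                lemma=t.attrib["lm"],
                tag=t.attrib["ana"],
                pos=t.attrib["pos"],
                all_tags=t.attrib["aa"],
            )
            for t in s.iter("tok")
        ]
        sentences.append(tokens)
    return sentences


def rebuild_text(tokens: List[Token]) -> str:
    """Re-create the sentence text, assuming offsets count plain spaces."""
    return "".join(" " * t.offset + t.form for t in tokens)


sentences = read_sentences(SAMPLE)
print(rebuild_text(sentences[0]))
```

Keeping aa as a raw string avoids guessing how multiple candidate tags are separated; inspect real output before splitting it.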
Screenshot for runAllAndDependency
The result is similar to the previous one, but contains the following additional attributes:
- drel – this attribute contains the dependency relation from the token to the head token
- head – this attribute contains the number of the head token
- olia – this attribute contains the classification of the token with respect to the OLiA ontology
- pentag – this attribute contains the most appropriate POS tag from the Penn Treebank tagset
The first two attributes define the dependency tree. The last two link each token to some widely used classifications.
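As an illustration of how drel and head define a tree, the sketch below lists each token with its relation and the form of its head. It assumes that head holds the n value of the head token and that a head value matching no token (here 0) marks the root. Neither convention is confirmed above, and the sample sentence and its labels are hand-written for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

# Hand-written sample; relation labels and head values are illustrative only.
SAMPLE = """<s>
<tok n="1" head="2" drel="subj">Това</tok>
<tok n="2" head="0" drel="ROOT">е</tok>
<tok n="3" head="2" drel="comp">пример</tok>
<tok n="4" head="2" drel="punct">.</tok>
</s>"""


def dependency_edges(sentence: ET.Element) -> List[Tuple[str, str, str]]:
    """Return (dependent form, relation, head form) for every token."""
    tokens = list(sentence.iter("tok"))
    # Map token numbers to surface forms so that heads can be looked up.
    forms = {t.attrib["n"]: (t.text or "") for t in tokens}
    edges = []
    for t in tokens:
        # Assumption: a head number with no matching token denotes the root.
        head_form = forms.get(t.attrib["head"], "ROOT")
        edges.append((t.text or "", t.attrib["drel"], head_form))
    return edges


edges = dependency_edges(ET.fromstring(SAMPLE))
for dependent, relation, head in edges:
    print(f"{dependent} --{relation}--> {head}")
```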
When the pipeline is applied to several documents, first import the documents into the system via the Multi-Import item in the File menu. Then select the MultyQuery Tool again and use the Select button to choose one of the two multiqueries. Next, add the documents to be processed. Clicking the Start button applies the pipe to all documents. The results are stored internally as new documents. The system also has an option for the results to overwrite the original documents. Here is a screenshot.
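For batch processing it can be convenient to prepare the input files outside CLaRK before using Multi-Import. The following is a minimal sketch that wraps plain-text files in a textdata element. It assumes that textdata is the root element, that UTF-8 encoding is acceptable, and that no DOCTYPE declaration is needed; these points should be checked against laska.dtd and the CLaRK documentation. The function names and the .txt/.xml naming scheme are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List


def wrap_as_textdata(text: str) -> bytes:
    """Wrap plain text in a <textdata> element and serialize it as UTF-8 XML."""
    root = ET.Element("textdata")
    root.text = text  # ElementTree escapes &, < and > during serialization
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def convert_directory(src: Path, dst: Path) -> List[Path]:
    """Convert every *.txt file in src into a <textdata> XML file in dst."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for txt in sorted(src.glob("*.txt")):
        out = dst / (txt.stem + ".xml")
        out.write_bytes(wrap_as_textdata(txt.read_text(encoding="utf-8")))
        written.append(out)
    return written
```

A call such as `convert_directory(Path("texts"), Path("clark_input"))`, with hypothetical directory names, would then produce one XML file per text file, ready to be imported.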
There is also a version of the BTB-Pipe which can be run as a server application.
Using the same language resources (the BTB treebank and the inflectional lexicon), we are implementing a new and improved version, which will soon be available within the CLaDA-BG National Infrastructure (part of the CLARIN infrastructure).