22 June 2017

Part-of-Speech Tagging of the BulTreeBank (Bulgarian Taggers)

This page contains models for tagging Bulgarian. They are intended for use with three taggers: T’n’T (Thorsten Brants), SVMtool (Jesús Giménez and Lluís Márquez), and the example-based tagger of the Acopost package (Ingo Schröder).

1. T’n’T (Thorsten Brants)

T’n’T is a trigram tagger that is free to use for research purposes. It can be obtained from http://www.coli.uni-saarland.de/~thorsten/tnt/ after a license form is filled in and faxed to the developer of the tagger (Thorsten Brants). We trained three models for tagging Bulgarian. The first was trained on the training set of the BulTreeBank; evaluated on the BulTreeBank test set, it achieved an accuracy of 92.53%. The second model was trained and tested on two smaller sets of newspaper articles, reaching an accuracy of 93.46%. The third model was trained on all the newspaper-article material from the BulTreeBank. We did not evaluate it, but since it was trained on a superset of the second model's data, its accuracy should be at least as high.

The input for tagging must be given one word per line, with empty lines between the sentences. The model files (with extensions .123 and .lex) must be placed in the same folder. For more details, the reader is referred to the user manual of the T’n’T tagger: http://www.coli.uni-saarland.de/~thorsten/publications/Brants-TR-TnT.pdf
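
For illustration, a tokenized input file (hypothetically named input.txt here) would look like this, with one token per line and a blank line closing each sentence:

Котката
спи
.

Assuming the tnt binary is on the path and the model files btbdep7wd.123 and btbdep7wd.lex are in the current folder, an invocation along the lines described in the manual would be:

tnt btbdep7wd input.txt > output.txt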

  • Model 1: btbdep7wd
  • Model 2: Knews
  • Model 3: NEWS
  • A gzipped package of all three models

2. SVMtool (Jesús Giménez and Lluís Márquez)

The SVMtool has been developed by Jesús Giménez and Lluís Márquez. It uses Support Vector Machines to learn the labels of annotated text, be they part-of-speech tags or something else (e.g. semantic labels). The tagger’s code is open source (LGPL). As with the T’n’T tagger, we trained three models: the first was trained and tested on the whole BulTreeBank; the second was trained and tested on the newspaper-article register of the treebank; and the third was trained on all the newspaper-article material.

The input for tagging must be given one word per line, with empty lines between the sentences. A model for tagging consists of the following files: .DICT (dictionary), .WIN, .UNKP, .AMBP, .A4, .A4.UNK, .M4.LR.MRG, .M4.RL.MRG, .UNK.M4.LR.MRG and .UNK.M4.RL.MRG. The models were trained using the default settings for English. The accuracies of Models 1 and 2 are 92.22% and 93.45%, respectively. The tagger is invoked with the command:

./SVMtool -V 2 -S LRL -T 4 [MODELNAME] < [INFILE] > [OUTFILE]

On the whole, the SVMtool tags more slowly than the T’n’T tagger. For more details, the reader is referred to the technical manual of the SVMtool and other resources available on its web page: http://www.lsi.upc.es/~nlp/SVMTool/
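
As a concrete sketch, assuming an input file input.txt in the one-word-per-line format described above, Model 2 would be applied with:

./SVMtool -V 2 -S LRL -T 4 Knews < input.txt > output.txt

The output is expected to repeat each input word on its own line, followed by its predicted tag.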

  • Model 1: btbdep7wd
  • Model 2: Knews
  • Model 3: NEWS
  • A gzipped package of all three models

3. The example-based tagger from the Acopost package (Ingo Schröder)

The example-based tagger of the Acopost package by Ingo Schröder belongs to the class of applications implemented using history-based models. Its accuracy is lower than that of the T’n’T tagger and the SVMtool. As usual, we trained three models. The model trained on the balanced corpus achieved an accuracy of 89.91%, while the model trained on the newspaper-article corpus achieved 90.72%. The third model was trained on all the newspaper-article material from the BulTreeBank. All models were trained using the default settings of the tagger.

The tagger can be executed using the following command:

./et [ModelName].known.wtree [ModelName].unknown.wtree [ModelName].lex < [input].raw > [output].et
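
For example, assuming Model 2's files are named Knews.known.wtree, Knews.unknown.wtree and Knews.lex (following the [ModelName] pattern above), and that input.raw uses the same one-word-per-line input format as the other taggers, the call becomes:

./et Knews.known.wtree Knews.unknown.wtree Knews.lex < input.raw > output.et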

More information can be found on the Acopost web page: http://acopost.sourceforge.net/

  • Model 1: btbdep7wd
  • Model 2: Knews
  • Model 3: NEWS
  • A gzipped package of all three models

Note: If you have trained a tagger on the BulTreeBank and would like your model to be added to this page, please contact me (Atanas Chanev, e-mail: chanev at form dot unitn dot it).

This page contains taggers for Bulgarian trained by Atanas Chanev (e-mail: chanev at form dot unitn dot it).