This distribution contains only the morphological information encoded in BulTreeBank, an HPSG-based treebank of Bulgarian. It comprises about 214,000 tokens and was used to train the TreeTagger for Bulgarian.
The sentences are drawn from Bulgarian grammar textbooks, newspapers, literature and other text sources.
Full documentation of the Treebank (Style Book, Tagset description) can be found in the Publications menu.
Data Format
The morphological annotation is described in:
- Kiril Simov and Petya Osenova. BTB-TR02: BulTreeBank Text Corpus of Bulgarian: Content, Segmentation, Tokenization. BulTreeBank Project Technical Report № 02. 2004.
- Kiril Simov and Petya Osenova. BTB-TR04: BulTreeBank Morphosyntactic Annotation of Bulgarian Texts. BulTreeBank Project Technical Report № 04. 2004.
Tagset
The tagset is described in:
- Kiril Simov, Petya Osenova and Milena Slavcheva. BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Project Technical Report № 03. 2004.
Acquiring the Data
If you are interested in using BulTreeBank-Morph, please fill in the user agreement form, print it, sign it, scan it and send it to Kiril Simov. If you cannot send it electronically, please send it by regular mail to:
Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, IPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
Once we receive the completed form, we will send you the data.
Acknowledgements
The BulTreeBank is developed under the BulTreeBank Project, which is a joint project of the Linguistic Modelling Laboratory (LML), Institute for Parallel Processing, Bulgarian Academy of Sciences and Seminar für Sprachwissenschaft (SfS), Eberhard-Karls-Universität, Tübingen, Germany. The project is funded by the Volkswagen Stiftung, Federal Republic of Germany under the Programme „Cooperation with Natural and Engineering Scientists in Central and Eastern Europe“.
We would like to thank our colleagues from Tübingen!