This distribution contains only the dependency information encoded in BulTreeBank, the HPSG-based Treebank of Bulgarian. It comprises about 196,000 tokens.
Its sentences are drawn from Bulgarian grammar textbooks, newspapers, literature and other text sources.
Full documentation of the Treebank (Style Book, Tagset description) can be found in the Technical Reports section of the BulTreeBank Project site, under the Publications menu.
BulTreeBank-DP is provided in the table format required for the CoNLL-X shared task, which is described below.
Data adheres to the following rules:
- Data files contain sentences separated by a blank line.
- A sentence consists of one or more tokens, each one starting on a new line.
- A token consists of ten fields, described in the table below. Fields are separated by a single tab character. Space/blank characters are not allowed within fields.
- All data files will contain these ten fields.
- Data files are UTF-8 encoded (Unicode).
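The rules above can be sketched as a small reader. This is only an illustration, not part of the distribution: the names `read_sentences` and `read_file` are hypothetical, and the file path is whatever the caller supplies.

```python
from typing import Iterable, Iterator, List

NUM_FIELDS = 10  # every token line carries exactly ten tab-separated fields


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield sentences as lists of tokens; each token is a list of ten fields."""
    sentence: List[List[str]] = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != NUM_FIELDS:
            raise ValueError(
                f"line {line_no}: expected {NUM_FIELDS} fields, got {len(fields)}"
            )
        if any(" " in field for field in fields):
            raise ValueError(f"line {line_no}: space character inside a field")
        sentence.append(fields)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def read_file(path: str) -> List[List[List[str]]]:
    """Read a whole data file, decoding it as UTF-8."""
    with open(path, encoding="utf-8") as handle:
        return list(read_sentences(handle))
```

The reader rejects lines that break the rules instead of repairing them silently, so formatting problems surface as soon as a file is loaded.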
|Field number:||Field name:||Description:|
|1||ID||Token counter, starting at 1 for each new sentence.|
|2||FORM||Word form or punctuation symbol.|
|3||LEMMA||Lemma is not available.|
|4||CPOSTAG||Coarse-grained part-of-speech tag.|
|5||POSTAG||Fine-grained part-of-speech tag. Two or three characters from the original tag in the Treebank. The remaining features are encoded in field 6.|
|6||FEATS||Unordered set of syntactic and/or morphological features, separated by a vertical bar (|), or an underscore if not available.|
|7||HEAD||Head of the current token, which is either a value of ID or zero (‘0’).|
|8||DEPREL||Dependency relation to the HEAD. The dependency relations are given below. ‘ROOT’ value determines the root of the sentence.|
|9||PHEAD||Projective head of current token, which is either a value of ID or zero (‘0’), or an underscore if not available. The dependency structure resulting from the PHEAD column is guaranteed to be projective, whereas the structures resulting from the HEAD column will be non-projective for some sentences.|
|10||PDEPREL||Dependency relation to the PHEAD, or an underscore if not available. The dependency relations are given below. ‘ROOT’ value determines the root of the sentence.|
|Dependency relation:||Description:|
|mod||Modifier (dependants which modify nouns, adjectives, adverbs)|
|prepcomp||Complement of preposition|
|comp||Complement (arguments of: non-verbal heads, non-finite verbal heads, copula)|
|adjunct||Adjunct (optional verbal argument)|
|xprepcomp||Clausal complement of preposition|
|conj||Conjunction in coordination|
|conjarg||Argument (second, third, …) of coordination|
|marked||Marked (clauses, introduced by a subordinator)|
|obj||Object (direct argument of a non-auxiliary verbal head)|
|indobj||Indirect Object (indirect argument of a non-auxiliary verbal head)|
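The ten fields described above map naturally onto a small record type. The sketch below is illustrative only. The `Token` class, `parse_token` and `is_projective` are hypothetical names, and the projectivity test uses one common definition (no two arcs cross, with the artificial root placed at position 0), which is an assumption and not something the distribution specifies.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Token:
    id: int
    form: str
    lemma: str            # expected to be "_" here, since lemmas are not available
    cpostag: str
    postag: str
    feats: List[str]      # empty list when the FEATS field is "_"
    head: int             # 0 marks the root
    deprel: str
    phead: Optional[int]  # None when the PHEAD field is "_"
    pdeprel: Optional[str]


def parse_token(fields: Sequence[str]) -> Token:
    """Convert the ten string fields of one token line into a Token."""
    ident, form, lemma, cpostag, postag, feats, head, deprel, phead, pdeprel = fields
    return Token(
        id=int(ident),
        form=form,
        lemma=lemma,
        cpostag=cpostag,
        postag=postag,
        feats=[] if feats == "_" else feats.split("|"),
        head=int(head),
        deprel=deprel,
        phead=None if phead == "_" else int(phead),
        pdeprel=None if pdeprel == "_" else pdeprel,
    )


def is_projective(heads: Sequence[int]) -> bool:
    """Return True if no two arcs cross; heads[i] is the head of token i + 1."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a_lo, a_hi in arcs:
        for b_lo, b_hi in arcs:
            # Two arcs cross when exactly one endpoint of one lies inside the other.
            if a_lo < b_lo < a_hi < b_hi:
                return False
    return True
```

Passing the HEAD values of a sentence to `is_projective` shows whether that sentence is one of the non-projective cases mentioned for field 7.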
Training and Test Division
To allow parsers trained on the treebank to be compared, we propose dividing it into a training part and a test part. The two parts are stored in separate directories. Users are, of course, free to use the two sets in whatever manner suits them.
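One way such a comparison might be carried out on the test part is with attachment scores. The text above does not prescribe a metric, so the choice of unlabeled and labeled attachment score is an assumption of this sketch, and `attachment_scores` is a hypothetical helper. It scores every token it is given; any filtering (for example of punctuation) is left to the caller.

```python
from typing import Sequence, Tuple

Arc = Tuple[int, str]  # (HEAD, DEPREL) of one token


def attachment_scores(
    gold: Sequence[Arc], predicted: Sequence[Arc]
) -> Tuple[float, float]:
    """Return (unlabeled, labeled) attachment scores as fractions in [0, 1]."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same tokens")
    if not gold:
        raise ValueError("no tokens to score")
    # Unlabeled: only the head has to match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labeled: head and dependency relation both have to match.
    full_hits = sum(g == p for g, p in zip(gold, predicted))
    return head_hits / len(gold), full_hits / len(gold)
```

The arcs of all test sentences can be concatenated into two flat sequences before calling the function, which gives token-level scores over the whole test part.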
Differences from the BulTreeBank
BulTreeBank-DP differs from BulTreeBank in several ways. Here we list the most significant ones:
- It has not been proven that the conversion from the HPSG-based annotation into the dependency format is reversible. Thus, some information from the original encoding is missing.
- Sentences containing ellipses are completely missing, because it is not clear how to represent them in the dependency format.
- Co-referential relations encoded in the original format are completely missing.
- Ontological classification of the Named Entities is missing.
References
Petya Osenova and Kiril Simov. BTB-TR05: BulTreeBank Stylebook. BulTreeBank Project Technical Report № 05. 2004.
Kiril Simov and Petya Osenova. Practical Annotation Scheme for an HPSG Treebank of Bulgarian. In: Proc. of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-2003), Budapest, Hungary. 2003.
Kiril Simov, Gergana Popova, Petya Osenova. HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In: “A Rainbow of Corpora: Corpus Linguistics and the Languages of the World”, edited by Andrew Wilson, Paul Rayson, and Tony McEnery; Lincom-Europa, Munich 2002. pages 135-142.
Kiril Simov, Petya Osenova and Milena Slavcheva. BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Project Technical Report № 03. 2004.
Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. In Erhard Hinrichs and Kiril Simov, editors, Journal of Research on Language and Computation, Special Issue, Kluwer Academic Publishers. pages 495-522.
Acquiring the Data
If you are interested in using BulTreeBank-DP, please fill in the user agreement form, print it, scan it and send it to Kiril Simov. If it is not possible to send it electronically, please send it by regular mail to:
Linguistic Modelling Laboratory, IPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria
After receiving the completed form, we will send you the data.
The BulTreeBank is developed under the BulTreeBank Project, which is a joint project of the Linguistic Modelling Laboratory (LML), Institute for Parallel Processing, Bulgarian Academy of Sciences and Seminar für Sprachwissenschaft (SfS), Eberhard-Karls-Universität, Tübingen, Germany. The project is funded by the Volkswagen Stiftung, Federal Republic of Germany under the Programme “Cooperation with Natural and Engineering Scientists in Central and Eastern Europe”.
We would like to thank our colleagues from Tübingen!
We were invited to provide the treebank for the CoNLL-X shared task by Sabine Buchholz, Toshiba Research Europe Ltd (UK), sabine dot buchholz at crl dot toshiba dot co dot uk.
The conversion of the treebank from the original HPSG-based annotation into dependency format was done by Kiril Simov, Petya Osenova, Svetoslav Marinov and Atanas Chanev.
The BulTreeBank-DP was used by 33 groups for the CoNLL-X shared task. We thank all of them.