11th June 2017

A short description of the Dependency Part of BulTreeBank (BulTreeBank-DP)

This distribution represents only the dependency information encoded in BulTreeBank, an HPSG-based treebank of Bulgarian. It contains about 196,000 tokens.

The sentences come from Bulgarian grammar textbooks, newspapers, literature and other text sources.

Full documentation of the Treebank (Style Book, Tagset description) can be found in the Technical Reports section of the BulTreeBank Project, under the Publications menu.

BulTreeBank-DP is provided in the table format required for the CoNLL-X shared task. This format is described in the following section.

Data Format

Data adheres to the following rules:

  • Data files contain sentences separated by a blank line.
  • A sentence consists of one or more tokens, each one starting on a new line.
  • A token consists of ten fields described in the table below. Fields are separated by a single tab character. Space/blank characters are not allowed within fields.
  • All data files will contain these ten fields.
  • Data files are UTF-8 encoded (Unicode).
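
The rules above can be sketched as a small reader. This is a minimal illustration, not part of the distribution: the function name read_sentences and the sample tokens (including their tag values) are made up for the example and are not taken from the treebank.

```python
import io


def read_sentences(stream):
    """Yield sentences as lists of tokens; each token is a list of ten strings."""
    sentence = []
    for line_no, raw in enumerate(stream, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        if len(fields) != 10:
            raise ValueError(f"line {line_no}: expected 10 fields, got {len(fields)}")
        if any(" " in field for field in fields):
            raise ValueError(f"line {line_no}: space character inside a field")
        sentence.append(fields)
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Hand-made sample; the tag values are placeholders, not real treebank tags.
sample = (
    "1\tТой\t_\tP\tPp\t_\t2\tsubj\t_\t_\n"
    "2\tчете\t_\tV\tVp\t_\t0\tROOT\t_\t_\n"
    "3\t.\t_\tPunct\tPunct\t_\t2\tpunct\t_\t_\n"
    "\n"
    "1\tДа\t_\tT\tTa\t_\t0\tROOT\t_\t_\n"
)
sentences = list(read_sentences(io.StringIO(sample)))
print(len(sentences), [len(s) for s in sentences])
```

For a real data file, the stream would be opened with open(path, encoding="utf-8"), where path is whatever location the data was unpacked to.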
Field number  Field name  Description
1             ID          Token counter, starting at 1 for each new sentence.
2             FORM        Word form or punctuation symbol.
3             LEMMA       Lemma; not available in this distribution.
4             CPOSTAG     Coarse-grained part-of-speech tag.
5             POSTAG      Fine-grained part-of-speech tag: two or three characters from the original tag in the Treebank. The remaining features are encoded in field 6.
6             FEATS       Unordered set of syntactic and/or morphological features, separated by a vertical bar (|), or an underscore if not available.
7             HEAD        Head of the current token, which is either a value of ID or zero (‘0’).
8             DEPREL      Dependency relation to the HEAD. The dependency relations are given below. The value ‘ROOT’ marks the root of the sentence.
9             PHEAD       Projective head of the current token, which is either a value of ID or zero (‘0’), or an underscore if not available. The dependency structure resulting from the PHEAD column is guaranteed to be projective, whereas the structures resulting from the HEAD column will be non-projective for some sentences.
10            PDEPREL     Dependency relation to the PHEAD, or an underscore if not available. The dependency relations are given below. The value ‘ROOT’ marks the root of the sentence.
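
As an illustration of the field layout and of the projectivity remark for the HEAD column, the following sketch maps a token line to named fields and tests a sentence for crossing arcs. It assumes the artificial root (0) is placed to the left of the first token; under that convention a tree without crossing arcs is projective. The names FIELDS, to_token, is_projective and sample are hypothetical helpers, and the sample sentences are invented.

```python
FIELDS = ("ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL")


def to_token(line):
    """Map one tab-separated token line to a dict keyed by field name."""
    values = line.split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def is_projective(tokens, head_field="HEAD"):
    """Return True if no two dependency arcs of the sentence cross.

    The chosen head field must hold integers; a PHEAD column filled
    with underscores cannot be checked this way.
    """
    arcs = []
    for tok in tokens:
        head, dep = int(tok[head_field]), int(tok["ID"])
        arcs.append((min(head, dep), max(head, dep)))
    for a_left, a_right in arcs:
        for b_left, b_right in arcs:
            # Two arcs cross when one starts strictly inside the other
            # and ends strictly outside it.
            if a_left < b_left < a_right < b_right:
                return False
    return True


def sample(rows):
    """Build token dicts from (ID, HEAD) pairs; other fields are padded."""
    lines = ["\t".join([str(i), f"w{i}", "_", "_", "_", "_", str(h), "_", "_", "_"])
             for i, h in rows]
    return [to_token(line) for line in lines]


projective = sample([(1, 2), (2, 0), (3, 2)])
crossing = sample([(1, 3), (2, 4), (3, 0), (4, 3)])
print(is_projective(projective), is_projective(crossing))  # True False
```

In the second sample the arc between tokens 1 and 3 crosses the arc between tokens 2 and 4, so the check reports a non-projective structure.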

Dependency Relations

 punct  Punctuation
 clitic  Clitic form
 mod  Modifier (dependants which modify nouns, adjectives, adverbs)
 prepcomp  Complement of preposition
 comp  Complement (arguments of: non-verbal heads, non-finite verbal heads, copula)
 adjunct  Adjunct (optional verbal argument)
 subj  Subject
 xadjunct  Clausal adjunct
 xsubj  Clausal subject
 xmod  Clausal modifier
 xcomp  Clausal complement
 xprepcomp  Clausal complement of preposition
 conj  Conjunction in coordination
 conjarg  Argument (second, third, …) of coordination
 pragadjunct  Pragmatic adjunct
 marked  Marked (clauses, introduced by a subordinator)
 obj  Object (direct argument of a non-auxiliary verbal head)
 indobj  Indirect Object (indirect argument of a non-auxiliary verbal head)
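
A possible sanity check for the DEPREL column is to compare its values against the list above plus ‘ROOT’. The sketch below is only an illustration; the names DEPRELS and check_deprels are hypothetical, and the input labels are invented.

```python
from collections import Counter

# The eighteen relations listed above, plus the ROOT value.
DEPRELS = {
    "punct", "clitic", "mod", "prepcomp", "comp", "adjunct", "subj",
    "xadjunct", "xsubj", "xmod", "xcomp", "xprepcomp", "conj", "conjarg",
    "pragadjunct", "marked", "obj", "indobj", "ROOT",
}


def check_deprels(labels):
    """Return (counts of known labels, sorted list of unknown labels)."""
    counts = Counter()
    unknown = set()
    for label in labels:
        if label in DEPRELS:
            counts[label] += 1
        else:
            unknown.add(label)
    return counts, sorted(unknown)


# "nsubj" is not in the list above, so it is reported as unknown.
counts, unknown = check_deprels(["subj", "ROOT", "punct", "subj", "nsubj"])
print(counts["subj"], unknown)  # 2 ['nsubj']
```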

Training and Test Division

To allow comparison of parsers induced from the treebank, we propose that the treebank be divided into a training part and a test part. The two parts are stored in separate directories. Users are, of course, free to use the two sets in whatever way suits them.
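
This description does not prescribe an evaluation measure. One common way to compare parsers on a test part is to compute unlabeled and labeled attachment scores, that is, the share of tokens with the correct HEAD, and with both the correct HEAD and DEPREL. The sketch below assumes gold and predicted analyses are aligned token by token and counts every token, punctuation included; the function name attachment_scores and the sample values are invented for the example.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for two aligned lists of (HEAD, DEPREL) pairs."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("no tokens to score")
    # Unlabeled: only the head has to match.
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labeled: both head and relation have to match.
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / len(gold), both_ok / len(gold)


gold = [(2, "subj"), (0, "ROOT"), (2, "obj"), (2, "punct")]
pred = [(2, "subj"), (0, "ROOT"), (2, "indobj"), (3, "punct")]
uas, las = attachment_scores(gold, pred)
print(uas, las)  # 0.75 0.5
```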

Differences from the BulTreeBank

BulTreeBank-DP differs from BulTreeBank in several ways. Here we list the most significant ones:

  • It has not been proven that the conversion from the HPSG-based annotation into the dependency format is reversible. Thus, some information from the original encoding is missing.

  • Sentences containing ellipses are completely missing, because it is not clear how to represent them in the dependency format.

  • Co-referential relations encoded in the original format are completely missing.

  • Ontological classification of the Named Entities is missing.

Key Publications

  • Petya Osenova and Kiril Simov. BTB-TR05: BulTreeBank Stylebook. BulTreeBank Project Technical Report № 05. 2004.

  • Kiril Simov and Petya Osenova. Practical Annotation Scheme for an HPSG Treebank of Bulgarian. In: Proc. of the 4th International Workshop on Linguistically Interpreted Corpora (LINC-2003), Budapest, Hungary. 2003.

  • Kiril Simov, Gergana Popova, Petya Osenova. HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In: „A Rainbow of Corpora: Corpus Linguistics and the Languages of the World“, edited by Andrew Wilson, Paul Rayson, and Tony McEnery; Lincom-Europa, Munich 2002. pages 135-142.

  • Kiril Simov, Petya Osenova and Milena Slavcheva. BTB-TR03: BulTreeBank Morphosyntactic Tagset. BulTreeBank Project Technical Report № 03. 2004.

  • Kiril Simov, Petya Osenova, Alexander Simov, Milen Kouylekov. Design and Implementation of the Bulgarian HPSG-based Treebank. In Erhard Hinrichs and Kiril Simov, editors, Journal of Research on Language and Computation, Special Issue, Kluwer Academic Publishers. pages 495-522.

Acquiring the Data

If you are interested in using BulTreeBank-DP, please fill in the user agreement form, print it, scan it and send it to Kiril Simov. If you cannot send it electronically, please send it by regular mail to:

Kiril Simov
BulTreeBank Project
Linguistic Modelling Laboratory, IPP,
Bulgarian Academy of Sciences
Acad. G.Bonchev St. 25A
1113 Sofia, Bulgaria

After receiving the completed form, we will send you the data.

Acknowledgements

The BulTreeBank is developed under the BulTreeBank Project, which is a joint project of the Linguistic Modelling Laboratory (LML), Institute for Parallel Processing, Bulgarian Academy of Sciences and Seminar für Sprachwissenschaft (SfS), Eberhard-Karls-Universität, Tübingen, Germany. The project is funded by the Volkswagen Stiftung, Federal Republic of Germany under the Programme „Cooperation with Natural and Engineering Scientists in Central and Eastern Europe“.

We would like to thank our colleagues from Tübingen!

We were invited to provide the treebank for the CoNLL-X shared task by
Sabine Buchholz
Toshiba Research Europe Ltd (UK)
sabine dot buchholz at crl dot toshiba dot co dot uk .

The conversion of the treebank from the original HPSG-based annotation into the dependency format was done by Kiril Simov, Petya Osenova, Svetoslav Marinov and Atanas Chanev.

The BulTreeBank-DP was used by 33 groups for the CoNLL-X shared task. We thank all of them.