22nd June 2017

Results from EuroMatrixPlus Project

This page contains links to some of our results from the EuroMatrixPlus Project. It will be updated with more data in the future.

Bulgarian-English parallel sentences manually aligned at the word level:

    • EMP-BTB-CSLI-MWA dataset – this dataset is based on the test set distributed with the English Resource Grammar. It was translated into Bulgarian by professional translators. The source language is English and the target language is Bulgarian. It contains 893 sentence pairs. It has received a final check by a superannotator.
    • EMP-BTB-JH0-MWA dataset – this dataset is a part of the JH dataset distributed with the English Resource Grammar. It contains tourist information about Norway. It was translated into Bulgarian by professional translators. The source language is English and the target language is Bulgarian. It contains 250 sentence pairs. It has received a final check by a superannotator.
    • EMP-BTB-SETIMES-01-MWA dataset – this dataset is a part of the SETIMES dataset distributed with the OPUS parallel corpus. It contains news articles about Balkan countries. It is published in several Balkan languages as well as in English. It is not clear which language is the source and which is the target, so we assumed that Bulgarian is the source language and English is the target language. It contains 2755 sentence pairs. It has received a final check by a superannotator.

***

  • EMP-BTB-SETIMES-02-MWA dataset – this dataset is a part of the SETIMES dataset distributed with the OPUS parallel corpus. It contains news articles about Balkan countries. It is published in several Balkan languages as well as in English. It is not clear which language is the source and which is the target, so we assumed that Bulgarian is the source language and English is the target language. It contains 1900 sentence pairs. Please note that this dataset has been annotated by just one annotator and has been only partially checked by a superannotator.

EMP-BTB-SETIMES dataset – this dataset is the SETIMES dataset of the OPUS parallel corpus, whose alignments have been cleaned at the sentence level and whose Bulgarian part has been linguistically processed. The format of both parts, English and Bulgarian, is one sentence per line; tokens are in lower case and are separated by spaces. In addition, each Bulgarian token carries the following information:

wftoken | lemma | POS | GrammaticalFeatures | DependencyRelation | LemmaOfDependencyHead | POSOfDependencyHead | MRSElementaryPredicate | TypeOfMainArgument | ElementaryPredicateOfARG01 | POSOfARG01 | ElementaryPredicateOfARG02 | POSOfARG02 | ElementaryPredicateOfARG03 | POSOfARG03

POS and GrammaticalFeatures are based on the morphosyntactic tags from BTBTagset, assigned to the wordforms by the POS Tagger. POS is one or two letters from the tag, depending on the part of speech of the wordform. GrammaticalFeatures is the rest of the tag, separated by dots and padded with dashes to the length of the longest tag (9 positions).

The dataset is divided into training and testing subsets. It was used for the experiments reported in the following publications:

Rui Wang, Petya Osenova and Kiril Simov. 2012. Linguistically-Augmented Bulgarian-to-English Statistical Machine Translation Model. In Proceedings of the Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra). Avignon, France. 23 April 2012. Association for Computational Linguistics. pages 119–128.

Rui Wang, Petya Osenova and Kiril Simov. 2012. Linguistically-Enriched Models for Bulgarian-to-English Machine Translation. Accepted for publication at Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation (SSST-6), ACL 2012 / SIGMT / SIGLEX Workshop, 12 July 2012, Jeju, Korea. pages 10–19.
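
For readers who want to load the Bulgarian part of the EMP-BTB-SETIMES dataset, the token format described above could be read with a sketch like the following. It is only an illustration under stated assumptions: we assume that the fifteen fields of a token are joined by a bare "|" character with no surrounding spaces (since spaces separate tokens), and that no field value itself contains "|". The function names parse_token and parse_sentence are hypothetical and are not part of the distributed data.

```python
from typing import Dict, List

# Field names in the order listed in the dataset description.
FIELD_NAMES = [
    "wftoken", "lemma", "POS", "GrammaticalFeatures", "DependencyRelation",
    "LemmaOfDependencyHead", "POSOfDependencyHead", "MRSElementaryPredicate",
    "TypeOfMainArgument", "ElementaryPredicateOfARG01", "POSOfARG01",
    "ElementaryPredicateOfARG02", "POSOfARG02", "ElementaryPredicateOfARG03",
    "POSOfARG03",
]


def parse_token(token: str, sep: str = "|") -> Dict[str, str]:
    """Split one annotated token into a field-name -> value mapping.

    Assumption: fields are joined by `sep` with no spaces around it.
    """
    parts = token.split(sep)
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(parts)}: {token!r}"
        )
    return dict(zip(FIELD_NAMES, parts))


def parse_sentence(line: str) -> List[Dict[str, str]]:
    """Parse one line (one sentence) of space-separated annotated tokens."""
    return [parse_token(tok) for tok in line.split()]
```

If the actual files use a different separator, it can be passed to parse_token through the sep parameter. The length check makes a mismatch between this assumed layout and the real data visible immediately instead of silently shifting fields.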