Available Linguistic Resources for Bulgarian
The following linguistic resources for Bulgarian are available.
If you cannot download any of the resources listed here, please send us an e-mail and we will send you a copy of the corresponding resource.
- Frequency list. A frequency list of the first 100 000 Cyrillic tokens in the archive is available here: BTB-FreqList in UTF-16. Free for research purposes.
- Stopword list. A stopword list based on the archive is available here: BTB-StopWordList in UTF-16. Free for research purposes.
- Morphological analyzer – Slovnik. This is a system for morphological analysis and generation based on (Popov, Simov, Vidinska 1998), developed by Ognyan Chernokozhev and Atanas Kiryakov at OntoText Lab. The system recognizes the wordforms of more than 110 000 Bulgarian lexemes and assigns them the appropriate morphosyntactic characteristics. You can try the demo version of Slovnik.
We also implemented our own morphological analyzer within the CLaRK System in the form of a finite-state grammar.
- Neural Network MorphoSyntactic disambiguator for Bulgarian. This system was developed under the CLaRK Programme by Stanislava Vlaseva, Petya Osenova, and Kiril Simov. Currently we have a corpus of about 2600 sentences extracted from newspapers, narratives and textbooks, which demonstrate some of the most frequent ambiguities on the morphosyntactic level. We have trained a neural network on 1500 of these sentences, presented in different orders and with different numbers of ambiguous words. The resulting network predicted the right part of speech for 95.25% of the words in the remaining sentences of the corpus. The accuracy on the morphosyntactic level is 93.17%.
An extract from this corpus of Bulgarian sentences, marked up with part-of-speech information, can be downloaded here:
BTB-POS Corpus I (324 011 bytes) – ISO 8879:1986 encoding of the Cyrillic letters (entities).
BTB-POS Corpus I (306 966 bytes) – MS Windows encoding of the Cyrillic letters.
BTB-POS Corpus I (246 964 bytes) – Unicode encoding of the Cyrillic letters.
- Bulgarian National Reference Corpus. We have collected a set of Bulgarian texts, mainly from the Internet. Together they cover more than 400 000 000 tokens. We are converting them into TEI-compatible XML markup (see Text Encoding Initiative 1997) on the paragraph level. We intend to mark them up with morphological information using a morphological tagger developed under the BulTreeBank Project, based on (Popov, Simov, Vidinska 1998), and the corpora development tools implemented under the CLaRK Programme: the CLaRK System. 50% of the texts come from fiction, 30% from newspapers, 10% from legal and government texts, and 10% from other genres.
The Bulgarian National Reference Corpus is continuously updated with new texts.
Some of the texts are available for research purposes.
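
The frequency list and the stopword list above are distributed in UTF-16, which many tools do not read by default. The following Python sketch shows one way they might be loaded and combined. The file layout is an assumption, not something documented here: the frequency list is taken to hold one token and one count per line, separated by whitespace, and the stopword list one word per line. The function names and the sample data are hypothetical and should be adapted to the actual files.

```python
import tempfile
from pathlib import Path


def load_frequency_list(path, encoding="utf-16"):
    """Read 'token<whitespace>count' lines into a dict (assumed layout)."""
    freqs = {}
    with open(path, encoding=encoding) as handle:
        for line in handle:
            parts = line.split()
            # Skip blank lines and lines that do not match the assumed layout.
            if len(parts) != 2 or not parts[1].isdigit():
                continue
            freqs[parts[0]] = int(parts[1])
    return freqs


def load_stopwords(path, encoding="utf-16"):
    """Read one stopword per line into a set (assumed layout)."""
    with open(path, encoding=encoding) as handle:
        return {line.strip() for line in handle if line.strip()}


def content_words(freqs, stopwords):
    """Drop stopwords and sort the remaining tokens by descending count."""
    kept = [(tok, n) for tok, n in freqs.items() if tok not in stopwords]
    return sorted(kept, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Tiny made-up sample files, written only to exercise the functions.
    with tempfile.TemporaryDirectory() as tmp:
        freq_path = Path(tmp) / "freq.txt"
        stop_path = Path(tmp) / "stop.txt"
        freq_path.write_text("и\t100\nна\t90\nкнига\t7\n", encoding="utf-16")
        stop_path.write_text("и\nна\n", encoding="utf-16")
        words = content_words(load_frequency_list(freq_path),
                              load_stopwords(stop_path))
        print(len(words), "content word(s) kept")
```

Python's "utf-16" codec reads the byte-order mark to pick the byte order, so the same call should work for little-endian and big-endian files that carry a BOM. If the downloaded files turn out to use a different column order or separator, only the parsing in load_frequency_list needs to change.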