Related Bulgarian Resources
1. Linguistic Modelling Department
The first main BALRIC-LING objective is to raise awareness in the newly associated Balkan countries, Bulgaria and Romania, of the potential of the most advanced Human Language Technologies (HLT) and of the possible scientific and industrial applications of the corresponding linguistic resources. Creating such awareness is very important because these countries are on their way to full EC integration, yet no successful marketable HLT applications exist for Bulgarian or Romanian. Having been isolated for many years from the broad West European exchange of language engineering ideas and practice, and having to deal with structurally different languages, the very few advanced HLT groups in both countries still cannot assemble by themselves a critical mass of informed specialists who could raise the quality of Language Engineering (LE) applications for the newly emerging HLT markets in Bulgaria and Romania. Since HLT is a rather broad field, BALRIC-LING will focus on four topics only:
word-centered linguistic resources and annotations;
corpora and tagging;
relevant supporting tools; and
possible advanced HLT uses of the resources considered.
To meet the target of building awareness in Bulgaria and Romania, BALRIC-LING aims at the realisation of the following three main initiatives:
Development of Regional Information Centers (RICs) in Bulgaria and Romania (BULRIC and RORIC respectively). In general, these information centers are Web sites with descriptions of HLT tools, data samples, linguistic resources and prototypes of related supporting tools. The sites will host an information desk, where specialists from the consortium will prepare comprehensive overviews of the four BALRIC-LING topics and where interested organisations and individuals may post queries relevant to the RICs' themes. Documents will be available in English and, for each country, in Bulgarian or Romanian, because only in this way will the materials reach the public and the interested companies and research groups in the respective Balkan states. In addition, the RIC sites will contain information about pertinent conferences, workshops and summer schools organised in Europe and closely related to the BALRIC-LING topics. This information will be very useful for young people interested in the field, because information about such events is extremely scarce in Bulgaria and Romania.
Virtual seminars, based on the queries received at the RIC support desks, will be held every six months. Subscribers to specially organised mailing lists with broad coverage may pose questions about any of the materials published at the corresponding RICs. Specialists from the consortium will prepare answers, which will be mailed to all seminar subscribers once every six months (i.e. three times during the network's duration). These seminars will help raise awareness and spread expertise from the better informed academic units to interested industrial organisations in Bulgaria and Romania.
Biannual virtual bulletins, in Bulgarian and Romanian respectively, will be directed to subscribers of the virtual seminars and will allow for broad dissemination of BALRIC-LING initiatives, especially among linguistic units and among individuals in the countryside of Bulgaria and Romania.
The second main BALRIC-LING objective is to help Balkan research groups become better prepared for scientific co-operation at the European level. ILSP (Greece) and Sheffield University (UK) will share their rich experience and practice in conducting successful research with both European and national dimensions. The Regional Information Centers in Bulgaria and Romania will contribute to the dissemination of HLT ideas among software companies interested in the further development of advanced HLT applications for Bulgarian and Romanian.
One way to facilitate the formation of future project consortia will be the exchange of information about existing formats and the standardisation of the internal representation of some of the partners' available resources. Once prepared in unified formats, those resources can be smoothly integrated for simultaneous use in multilingual applications and developed further. All standardisation requirements set by BALRIC-LING will be publicly available, so Balkan research groups and interested software companies may refer to them as guidelines.
BALRIC-LING aims at standardisation of two formats of internal representation:
Standardisation of formats for encoding of monolingual and parallel corpora;
Standardisation of formats for internal representations of grammatical dictionaries in the three Balkan countries.
The compact BALRIC-LING configuration will allow for in-depth acquaintance with all details exchanged via narrower communication among partners. BALRIC-LING participants from EU-member countries will prepare an overview with evaluation of the progress of the awareness and dissemination tasks in the newly associated Balkan countries Bulgaria and Romania.
CGWorld - A Web Based Workbench for Conceptual Graphs Management and Applications
CGWorld is a Web-based workbench for the joint, distributed development of a large knowledge base (KB) of Conceptual Graphs (CGs), which resides on a central server. It is implemented in Java and Prolog. The workbench includes a graphical, easy-to-use CG Editor written as a Java applet for increased security. CGWorld has facilities for translating CGs between four formats (Display form, CGLex, FOPC and CGIF), an implementation of the canonical formation rules, and browsing features for easy search in large, cooperatively developed KBs. In addition, Application Server technology adds Internet access to previously developed CG applications. With the standard Internet client, the browser, it is possible to add new features based on a new presentation layer.
O CoRrect: Cyrillic and Latin OCR correction
O CoRrect: Cyrillic and Latin OCR correction using electronic dictionaries and sentence context (2002-2004)
The main objective of this project is to expand and develop methods, resources and software systems for improving the OCR correction of Bulgarian and multilingual (Bulgarian, Russian, English and German) documents.
The high-level achievements of the project are given below:
Word context based OCR correction
Further development of the word context correction method based on Levenshtein automata. This method can be refined in several directions. First, one can use the probabilities of symbol-dependent recognition errors to sort the correction candidates more precisely. To implement this option we can extend the concept of Levenshtein automata with weighted automata, which will deliver optimal efficiency. Second, we can order the possible correction candidates by word frequency.
Extension of the Bulgarian, Russian, German and English electronic dictionaries with OCR-aiding data, which makes the further correction methods possible. This includes adding information about word frequencies and recognition error risk values. The lexical resources will be formatted for efficiency. For the correction of multilingual documents, a very large consolidated Bulgarian-Russian-German-English dictionary will be constructed.
Test series for the probabilities of symbol-dependent recognition errors for Cyrillic and Latin fonts. These series will provide the font-dependent data for building the weighted Levenshtein automata. In this way the list of correction candidates can be sorted by recognition error probability.
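The ranking idea described above can be illustrated with a minimal sketch. It does not build a Levenshtein automaton; it substitutes a plain dynamic-programming weighted edit distance, which is slower but ranks candidates the same way. The confusion costs, the toy lexicon and the function names are all assumptions made for illustration, not data or code from the project.

```python
from typing import Dict, List, Tuple

# Hypothetical confusion costs: replacing the OCR symbol (first) by the
# intended symbol (second) costs less than the default 1.0 when the pair is
# a likely recognition error. The values are illustrative, not measured.
CONFUSION_COST: Dict[Tuple[str, str], float] = {
    ("0", "o"): 0.2,
    ("1", "l"): 0.2,
    ("c", "e"): 0.4,
}


def weighted_distance(ocr: str, word: str) -> float:
    """Weighted Levenshtein distance from an OCR token to a dictionary word."""
    prev = [float(j) for j in range(len(word) + 1)]
    for i, a in enumerate(ocr, start=1):
        cur = [float(i)]
        for j, b in enumerate(word, start=1):
            sub = 0.0 if a == b else CONFUSION_COST.get((a, b), 1.0)
            cur.append(min(prev[j] + 1.0,       # delete a from the OCR token
                           cur[j - 1] + 1.0,    # insert b
                           prev[j - 1] + sub))  # substitute a by b
        prev = cur
    return prev[-1]


def rank_candidates(ocr: str, lexicon: Dict[str, int],
                    max_cost: float = 1.0) -> List[str]:
    """Dictionary words within max_cost, cheapest first, then most frequent."""
    scored = [(weighted_distance(ocr, w), -freq, w) for w, freq in lexicon.items()]
    return [w for cost, _, w in sorted(scored) if cost <= max_cost]
```

With a toy lexicon such as {"word": 50, "ward": 10, "lord": 5}, the token "w0rd" ranks "word" ahead of "ward" because the 0/o confusion is cheap, and candidates with equal cost fall back to word frequency.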
Sentence context based OCR correction
Analysis of large corpora to extract a word collocation table for Bulgarian, to be used for OCR correction based on word collocation techniques.
Implementation of a robust and highly efficient correction system based on the Levenshtein automata framework and the sentence context correction. We plan to implement our approach in order to test and compare it against the traditional methods. This implementation can demonstrate the achievements of the project in order to attract industrial applications.
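As a rough illustration of sentence context correction, a collocation table can be approximated by counts of adjacent word pairs, and the table can then choose among correction candidates given the preceding word. This is only a sketch under that assumption; the project's actual collocation technique is not described here, and the function names and toy sentences are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def build_collocations(sentences: Iterable[List[str]]) -> Counter:
    """Count adjacent word pairs over tokenised sentences."""
    table: Counter = Counter()
    for tokens in sentences:
        table.update(zip(tokens, tokens[1:]))
    return table


def pick_by_context(prev_word: str, candidates: List[str], table: Counter) -> str:
    """Choose the candidate seen most often after prev_word.

    On a tie the earliest candidate wins, so an existing ranking is kept.
    """
    return max(candidates, key=lambda w: table[(prev_word, w)])
```

In practice the candidate list would come from the word context stage, so the collocation counts only need to break ties among a few plausible corrections.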
Grammatical Web server for Bulgarian
Authors: St. Mihov, Elena Paskaleva, Svetla Koeva, Petar Gulev, Anton Zinoviev
of Bulgarian Words
Speech Synthesis for Bulgarian
2. BGSpeech: A site about spoken Bulgarian.
3. Programs in Computational Linguistics
Master’s program Plovdiv University (FMI) http://www.fmi.pu.acad.bg/bg_ver/edu/predmeti_inf_zad/comp_lingv.htm
Faculty of Slavic Studies Computers in Humanities Master's Program - http://compling.ibl.bas.bg/
Faculty of Classical and Modern Philology - Master's Program - /CLProgramme/index.html
Applied Linguistics - http://www.nbu.bg/al/new/default.htm
Master’s Program – Language Technologies - http://www.math.bas.bg/mp3.html
4. Department for Computer Modeling of Bulgarian language
Head: Svetla Koeva
The Department for Computer Modeling of Bulgarian language was formed at the beginning of 1994. The scope of the research of the department includes:
Theoretical problems of the formal language description;
Formal semantic, morphological and syntactic analysis;
Information retrieval and information extraction.
INTEX "Computer representation of grammatical knowledge of Bulgarian"
INTEX is a linguistic development environment that includes large-coverage dictionaries and grammars, and parses texts of several million words in real time. INTEX includes tools to create and maintain large-coverage lexical resources, as well as morphological and syntactic grammars. Dictionaries and grammars are applied to texts in order to locate morphological, lexical and syntactic patterns, remove ambiguities, and tag simple and compound words. INTEX is used by several research centers to rapidly construct extractors that identify semantic units in large texts, such as proper names of persons, locations, technical expressions of finance, etc. INTEX can build lemmatized concordances and indices of large texts with respect to all types of finite-state patterns. INTEX is used in over 80 laboratories as an information retrieval system, to parse literary texts, to quantify language variations, to teach second languages, as a terminological extractor, and in several universities to teach computational linguistics to graduate students.
BalkaNet is an EU-financed international project which is aimed at an effective combination of the existing lexicographic resources of the Balkan countries with a view to creating a multilingual system in which cross-references between Balkan languages as well as between Balkan and West European languages will be possible.
The project continues two previous projects. WordNet, carried out at Princeton University, represents the semantic relations among words in English. Its successful results led to the launch of a subsequent European project, EuroWordNet, which expanded the WordNet system with eight other languages. EuroWordNet resulted in a huge network of words and semantic relations among them, which allows interlingual cross-references and the finding of translation equivalents, and can be used in information extraction and information retrieval via the Internet.
At this stage (through the first year of the project) the core of the Bulgarian WordNet has been formed: a system for representing synonymy, antonymy, hyperonymy and other semantic relations. The Bulgarian WordNet has now been expanded to 8000 synsets (sets of sense-equivalent words), which also contain their dictionary meanings, links to the corresponding synsets in English, and links to the corresponding hyperonym and hyponym sets where these exist. Various computer programs have been created for particular tasks in expanding the Bulgarian database and in checking its completeness and consistency. A notable achievement is the WordNet Explorer, a unified environment based on the language-independent WordNet logic formulated by the Bulgarian team specially for the purposes of the project. Unlike the other partners, who used previously developed material, the Bulgarian team started from the very beginning and currently shows the best rate of development. This fact, as well as the high quality of the Bulgarian team's work, has been pointed out many times at the partners' meetings and entered in the project's proceedings.
5. The Bulgarian Association for Computational Linguistics
Structured Bulgarian Corpus
The development of a structured Bulgarian linguistic corpus has been one of the stages of the BalkaNet project. The corpus has been modelled on the existing similar corpus at Brown University.
The corpus consists of 1,000,805 words extracted from texts published chiefly in electronic form. An important requirement, strictly observed in compiling the corpus, is that the texts were written by Bulgarian authors. Some exceptions have been made, however: the extracts from the love story and western genres are taken from foreign-language sources translated into Bulgarian, because original Bulgarian texts in these genres are lacking. The corpus is divided into 500 text units of approximately 2000 words each; the length is approximate so that sentence boundaries are preserved. The majority of the texts consist of more than 2000 words and only a small number of fewer than 2000.
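One plausible way to cut a text unit of roughly 2000 words without breaking a sentence is to keep adding whole sentences until the target is reached. The sketch below assumes the text is already split into sentences and that words are separated by whitespace; the function name is hypothetical and the corpus builders' actual procedure is not documented here.

```python
from typing import List


def take_unit(sentences: List[str], target_words: int = 2000) -> List[str]:
    """Collect whole sentences until the unit holds at least target_words."""
    unit: List[str] = []
    count = 0
    for sentence in sentences:
        if count >= target_words:
            break
        unit.append(sentence)
        count += len(sentence.split())
    return unit
```

A unit built this way slightly exceeds the target when the last sentence crosses it, and stays shorter only when the source text itself runs out.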
The texts were sampled from 15 different text categories according to the model of the Brown corpus. The number of texts in each category varies.
6. Plovdiv University (FMI)
Morphological analyser for Bulgarian texts:
The words in the language are divided into classes (inflectional types). Every class has a unique machine number for identification and a list of rules for generating a paradigm. For every word a pattern is constructed which matches all wordforms belonging to the paradigm of this word. The pattern and the inflectional type number together carry information for the whole paradigm of a particular word. A machine dictionary consists of (word-pattern, inflectional type number) entries. When an arbitrary wordform has to be classified, the analyser looks up a matching word-pattern in the dictionary. If such a pattern is found, a paradigm is generated from it using the rules of the corresponding inflectional type. If the analysed word coincides with a wordform from the generated paradigm, it obtains the grammatical features of that wordform. In this way the word is completely determined morphologically. On the basis of this methodology a morphological processor for the Bulgarian language has been built.
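The lookup procedure described above can be sketched roughly as follows. The sketch simplifies a word-pattern to a stem that matches any form beginning with it, and it uses a single invented inflectional type (number 41, for a feminine noun such as "маса"); the type number, the feature labels and the function names are assumptions, not the analyser's real data.

```python
import re
from typing import Dict, List, Optional, Tuple

# Hypothetical inflectional type 41: (ending, grammatical features) rules.
RULES: Dict[int, List[Tuple[str, str]]] = {
    41: [("а", "sg indef"), ("ата", "sg def"),
         ("и", "pl indef"), ("ите", "pl def")],
}

# Machine dictionary of (word-pattern, inflectional type number) entries.
# A pattern is simplified here to a stem.
DICTIONARY: List[Tuple[str, int]] = [("мас", 41)]


def generate_paradigm(stem: str, type_no: int) -> Dict[str, str]:
    """Apply the rules of an inflectional type to produce all wordforms."""
    return {stem + ending: features for ending, features in RULES[type_no]}


def analyse(wordform: str) -> Optional[Tuple[str, str]]:
    """Return (lemma, features) if the wordform belongs to a known paradigm."""
    for stem, type_no in DICTIONARY:
        # Step 1: look up a pattern that matches the wordform.
        if re.fullmatch(re.escape(stem) + r"\w*", wordform):
            # Step 2: generate the paradigm and check for an exact wordform.
            paradigm = generate_paradigm(stem, type_no)
            if wordform in paradigm:
                # Assumption: the first rule of a type yields the lemma.
                lemma = stem + RULES[type_no][0][0]
                return lemma, paradigm[wordform]
    return None
```

The second step matters because a loose pattern can match words outside the paradigm: "масло" begins with the same stem but is rejected, while "масите" is recognised as the definite plural of "маса".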
Syntactical agreement between two words in Bulgarian
author: Hristo Krushkov
7. Ontotext Lab
Ontotext - a Sirma Lab for Knowledge and Language Engineering
- development of tools and solutions:
knowledge management; language engineering; web services; custom reasoning
Ontologies are crucial for any kind of "intelligent" software solution. They are at the same time the source of the common sense necessary to support any kind of non-trivial text processing and the periscope necessary to interpret, understand and make use of the results of text analysis. Furthermore, ontologies (often called domain models) play a crucial role in natural language generation tasks: it is impossible to generate a reasonable, non-redundant text without deep knowledge of the domain.
8. Machine Translation
English – Bulgarian Machine translator. Commercial software + online demo
BULTRA is a software product for translation from English to Bulgarian.
An online test is available at http://online.bultra.com