Deep models of Semantic Knowledge (DemoSem)
The mission of the project is to make text data more communicative and relevant to users' needs and related societal tasks. The main goal of the project is to extend current knowledge, on both the linguistic and the formal level, about the word sense disambiguation (WSD) process. The main language of the study will be Bulgarian, but since the results of our work will also be applicable to other languages (through language-independent components and various adaptation techniques), we will perform some experiments on English as well.
- Project objectives and hypotheses
Our objectives are as follows:
- To linguistically model the key challenges in the identification, combination and representation of the semantic knowledge encoded in lexicons and realized in corpora.
- To mathematically model features in the graph-based and deep neural network algorithms that support the representation of semantic knowledge and scale its identification, combination and extraction in WSD.
- To integrate deep neural network models within graph-based approaches to WSD.
- To create test suites (benchmarks) for the evaluation of the developed formal models.
The linguistic models will play two main roles in the project. First, they will help to determine the semantic flow in the text at the level of local context (fixed word window, sentence), at the cross-sentential level, at the discourse level and at the cross-document level. Each of these levels contributes a different kind of knowledge that plays a role in WSD. Second, these models will guide the preparation of annotated texts which demonstrate their realization. These annotated texts will be used to test the neural network implementations built over the linguistic models.
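The narrower context levels mentioned above can be sketched in code. This is a minimal illustration only; the function names and the default window size are our own choices for the example, not part of the project design:

```python
# Minimal sketch of two local-context levels for WSD (illustrative only).

def word_window(tokens, index, size=2):
    """Fixed word window: up to `size` tokens on each side of the target."""
    lo = max(0, index - size)
    return tokens[lo:index] + tokens[index + 1:index + 1 + size]

def sentence_context(sentences, sent_index):
    """Sentence-level context: all tokens of the sentence holding the target."""
    return sentences[sent_index]

tokens = ["the", "old", "bank", "by", "the", "river"]
window = word_window(tokens, 2)  # context of "bank" -> ["the", "old", "by", "the"]
```

Cross-sentential, discourse and cross-document contexts extend the same idea to progressively larger spans of text.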
The knowledge graph methods will be used as the actual mechanism for performing WSD. The main effort here will be devoted to the knowledge encoded in the graph and to the selection of appropriate contexts in the text.
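As one hedged illustration of how a knowledge graph can drive WSD, the sketch below runs a Personalized PageRank-style random walk over a toy sense graph, seeded with an observed context word; the graph, sense labels and parameters are invented for the example and do not reflect the project's actual resources:

```python
# Personalized PageRank over a toy sense graph (all names are illustrative).

def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """graph: node -> list of neighbours (symmetric adjacency lists)."""
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    # Teleport mass goes only to the seed nodes (the observed context words).
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    for _ in range(iters):
        rank = {
            n: (1 - damping) * teleport[n]
            + damping * sum(rank[m] / len(graph[m]) for m in nodes if n in graph[m])
            for n in nodes
        }
    return rank

# Two candidate senses of "bank", each linked to a typical context word.
graph = {
    "bank_money": ["loan"],
    "bank_river": ["water"],
    "loan": ["bank_money"],
    "water": ["bank_river"],
}
ranks = personalized_pagerank(graph, seeds={"loan"})
best = max(["bank_money", "bank_river"], key=ranks.get)  # -> "bank_money"
```

In a real system the graph would be a wordnet-style lexicon enriched with further relations, and the seeds would be the lemmas of the selected context.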
The neural network models will learn the characteristics of the contexts used by the knowledge graph methods, and will thus govern the application of the graph-based approaches.
Our hypotheses are as follows:
- The simultaneous exploration of lexicons and corpora can improve the quality of the required semantic knowledge, since it combines paradigmatic and syntagmatic knowledge.
- Deep neural network models can complement the achievements of the knowledge graph-based ones, especially in areas where relations are difficult to identify in advance, and in the global, often highly 'noisy' connectedness provided by large volumes of text.
- A model can be developed which improves over the current state of the art in WSD for Bulgarian, and potentially for other languages.
Confirming the hypotheses presented above will mean that the goals of the proposed project have been met. The realization of semantics in the text, as an interaction with the lexicon, is the basis for exploiting the knowledge already existing in lexicons and ontologies through graph-based methods. The neural networks will complement this knowledge by learning new features that are still hard or impossible to formulate in a symbolic way. The combination of the three approaches: linguistic, graph-based and neural network-based, will allow us to improve the knowledge models for the task of word sense disambiguation.
- Approaches to accomplishing the research goals, including the interdisciplinarity of the project
Due to the interdisciplinary nature of the project, we will exploit both kinds of research approaches: qualitative and quantitative. Linguistics is an empirical science, based on observations of actual language usage in real texts and speech acts. Collections of actual language usage are called corpora (including text archives and annotated corpora). On the other hand, linguists use their intuition to generate linguistic theories on the basis of their insights. Linguistic theories are thus inductive in nature. The linguistic models in the project will reflect the relational structures (lexicons and ontologies) defining the semantic knowledge, and the formal grammars that determine the relations between the semantics in the lexicon and its realization in the text. The implementation of the relation between the lexicon and the text through a grammar is deductive.
Linguistic theories are validated by examination over appropriate samples of corpora in two ways. On the one hand, validation is done by manual annotation of new samples (or by extending the annotation of existing samples) of language data. This process ensures the quality of the linguistic theory. The quality is measured by having the same samples annotated by more than one person. In this way, we can evaluate the inter-annotator agreement and have a super-annotator manually adjudicate the discrepancies in annotation. These discrepancies will contribute to the improvement of the linguistic theory.
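Inter-annotator agreement of the kind described above is commonly measured with Cohen's kappa, which corrects raw agreement for chance; a minimal sketch for two annotators:

```python
# Cohen's kappa for two annotators over the same sample (minimal sketch).

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:  # degenerate case: both always use one shared label
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement, 0 means agreement at chance level; items on which the annotators disagree are the ones passed to the super-annotator.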
On the other hand, validation of the linguistic theory will be performed through its usage in knowledge graph models and neural network models for WSD. The performance of both approaches will be assessed by checking their predictions on unseen data with measures such as precision, recall, accuracy and F-measure. Here, both inductive and deductive approaches will be utilized in order to learn new knowledge from language data and to explicate the knowledge hidden in the lexicon as well as in ontologies.
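The evaluation measures named above can be computed as in the following sketch, using the common WSD convention that a system may abstain on some instances (marked here with None); the function name and interface are illustrative:

```python
# Precision, recall and F-measure for WSD predictions (illustrative interface).

def wsd_scores(gold, predicted):
    """gold: list of sense labels; predicted: labels, or None where the
    system abstains. Precision is over attempted instances, recall over all."""
    attempted = [(g, p) for g, p in zip(gold, predicted) if p is not None]
    correct = sum(g == p for g, p in attempted)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

When the system labels every instance, precision, recall and accuracy all coincide as the fraction of correctly disambiguated instances.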
The results from the theory validation will be a basis for further development of the formal linguistic theories and their implementation in knowledge graph and neural network models of WSD.
Thus, more specifically, we will use formal modelling of natural language, which will be expressed in the annotation scheme covering various pieces of semantic knowledge in the text, and in the representation of various relations in the lexicon. This approach will ensure the consistency of the data and the possibility to test and evaluate the appropriate supervised models on it. We will rely on the graph-based and deep neural network approaches to describe the linguistic models in a way adequate for WSD. This means that some parts of the algorithms performed by the system have to be made transparent and comprehensible for the purposes of human control, when needed in the process.
- Justification of the type of planned research (fundamental or applied)
We apply the following definition of fundamental research: Fundamental research means experimental or theoretical work undertaken primarily to acquire new knowledge of the underlying foundations of phenomena and observable facts, without any direct commercial application or use in view. (The definition is laid down in Article 2(84) of Commission Regulation No 651/2014 [1].)
We aim at the scientific exploration of various configurations of knowledge chunks incorporated from available or newly developed resources, with the help of deep neural network algorithms and without leaving out the graph-based ones, since some hybrid architectures might be tested as well.
Our findings, achieved through research and experiments on improving WSD, might be further used in scientific or industrial applications.
Our research is in accordance with the above definition, since it aims to investigate the pieces of semantic information relevant for the purposes of WSD, under the following conditions: the knowledge comes from two types of language resources, corpora and lexicons; and we would like to go beyond the graph-based algorithms, which are tailored more to paradigmatic knowledge. Thus, we will also test deep neural network algorithms, which can explore local and global syntagmatic relations in the text. These deep algorithms might be supported by the best results from the graph-based methods. We believe that the two approaches complement each other.
[1] http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=uriserv%3AOJ.L_.2014.187.01.0001.01.ENG