15th January 2019

QTLeap Resources

QTLeap Language Resources and Tools Developed by BulTreeBank Group

Knowledge-based Word Sense Disambiguation Resources

The following sets of relations were compiled within the QTLeap Project. They have been tested with the
UKB system.
The format of the relations is as required by the UKB system:

u:nodeid v:nodeid s:Source d:D

where nodeid is a node identifier. Usually, nodeid is a synset id – wnid.
In the sets below all synset ids are from WordNet 3.0.
Source describes the source of the relation. D is 0 or 1 and determines
whether the relation is directed (1) or undirected (0).
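
As an illustration of the format above, the following minimal Python sketch reads one relation line into a small record. The names Relation and parse_relation are hypothetical helpers introduced here, not part of UKB, and treating a missing d: field as undirected is an assumption of this sketch.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    u: str          # first node id
    v: str          # second node id
    source: str     # value of the s: field ("" if the field is absent)
    directed: bool  # True when d:1


def parse_relation(line: str) -> Relation:
    """Parse one 'u:... v:... s:... d:...' line into a Relation."""
    # Each whitespace-separated token is 'key:value'; split on the first colon only.
    fields = dict(token.split(":", 1) for token in line.split())
    return Relation(
        u=fields["u"],
        v=fields["v"],
        source=fields.get("s", ""),
        # Assumption of this sketch: a missing d: field means "undirected".
        directed=fields.get("d", "0") == "1",
    )


rel = parse_relation("u:n03461356-1-gl v:03593862-n s:gw d:0")
print(rel)
```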

Relations extracted from the logical analysis of Princeton WordNet glosses

These relations are extracted from
eXtended WordNet (XWN).
XWN provides semantic annotation of the words in the glosses of synsets in WordNet.
Additionally it contains a parse of the gloss and a logical form of the gloss.
For example, the synset { disyllable, dissyllable } (06290539-n)
is defined by the gloss "a word having two syllables".
After the analysis, the following synsets are selected:
06286395-n (word), 06304671-n (syllable),
02203362-v (have). The logical form for this gloss in XWN is the following:

 disyllable:NN(x1) -> 
  word:NN(x1) have:VB(e1, x1, x2) 
  two:JJ(x2) syllable:NN(x2) 

In our view, each predicate that originates
from a verbal, adjectival, adverbial, or prepositional
lemma expresses an event. In the example, have:VB(e1, x1, x2)
denotes the event e1 in which the object denoted by x1
holds the object denoted by x2.
Both of these objects are participants in the event e1.
From this we extract the following relations:

  u:02203362-v v:06286395-n
  u:02203362-v v:06304671-n
  u:06286395-n v:06304671-n

In the case of the synset { ice-cream cone }, defined by the gloss "ice cream in a crisp conical wafer",
the following logical form is presented:

 ice-cream_cone:NN(x1) -> 
  ice_cream:NN(x1) in:IN(x1, x2) 
  crisp:JJ(x2) conical:JJ(x2) 
  wafer:NN(x2)   

From it we have extracted a relation between ice cream and wafer
on the basis of the predicate in:IN(x1, x2), as well as relations between crisp and
wafer, conical and wafer, and crisp and conical,
each in the appropriate sense.

Thus, the set of relations extracted from the logical analysis of
Princeton WordNet glosses (as represented in XWN) includes the following main types of relations:

  • verb-noun relations on the basis of event-participant relations;
  • adjective-noun relations on the basis of event-participant relations, where we assume that adjectives denote a kind of event;
  • noun-noun relations on the basis of two nouns related by a preposition;
  • adjective-adjective relations on the basis of two adjectives modifying the same noun; and some others.
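
The two examples above can be approximated with a short Python sketch. It is a simplification under stated assumptions, not the procedure actually used to build the set: it relates sense-tagged predicates that share a variable, and relates nouns bound to different x-arguments of the same predicate. The function names and the senses mapping (lemma to synset id) are hypothetical; the ids are those of the disyllable example.

```python
import re
from itertools import combinations, product

# Matches predicates such as "have:VB(e1, x1, x2)".
PREDICATE = re.compile(r"(\S+?):([A-Z]+)\(([^)]*)\)")


def parse_predicates(body):
    """Return (lemma, pos, [variables]) for every predicate in a logical form body."""
    return [(lemma, pos, [v.strip() for v in args.split(",")])
            for lemma, pos, args in PREDICATE.findall(body)]


def extract_relations(body, senses):
    """Pair up sense-tagged predicates; `senses` maps lemma -> synset id."""
    predicates = parse_predicates(body)
    pairs = []

    def add(a, b):
        # Relations are treated as undirected, so (a, b) and (b, a) are the same.
        if a != b and (a, b) not in pairs and (b, a) not in pairs:
            pairs.append((a, b))

    # Rule 1 (assumed): sense-tagged predicates sharing a variable are related.
    by_variable, nouns_by_variable = {}, {}
    for lemma, pos, variables in predicates:
        if lemma not in senses:
            continue  # predicates without a sense annotation are skipped
        for var in variables:
            by_variable.setdefault(var, []).append(senses[lemma])
            if pos.startswith("NN"):
                nouns_by_variable.setdefault(var, []).append(senses[lemma])
    for synsets in by_variable.values():
        for a, b in combinations(synsets, 2):
            add(a, b)

    # Rule 2 (assumed): nouns bound to different x-arguments of one predicate
    # (e.g. the two participants of have:VB or in:IN) are related.
    for _, _, variables in predicates:
        entities = [v for v in variables if v.startswith("x")]
        for var_a, var_b in combinations(entities, 2):
            for a, b in product(nouns_by_variable.get(var_a, []),
                                nouns_by_variable.get(var_b, [])):
                add(a, b)
    return pairs


body = "word:NN(x1) have:VB(e1, x1, x2) two:JJ(x2) syllable:NN(x2)"
senses = {"word": "06286395-n", "have": "02203362-v", "syllable": "06304671-n"}
for u, v in extract_relations(body, senses):
    print(f"u:{u} v:{v}")
```

For the disyllable gloss this yields the same three pairs as listed above (node order within a pair may differ). Applied to the ice-cream cone gloss with all four lemmas sense-tagged, it yields the four relations described in the text.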

eXtended WordNet relations from logical form: WN30gl

Relations extracted from the gloss annotation with word senses in XWN

Relations can be extracted from corpora manually annotated with word senses
in different ways, as described in the publications listed below.
Here we provide a set of relations that treats the glosses as sentences:
the sense-annotated words in a gloss
are represented as connected nodes in the graph.
The connection between these nodes can be determined in at least two ways:

  • As a sequence of nodes corresponding to the order of the words in the sentence;
  • As a syntactic structure of nodes corresponding to a parse of the sentence.

The set of relations provided here is encoded according to the first approach.

For each <gloss> element in XWN we consider the element <wsd>, containing the words of the gloss with their assigned synset ids from WordNet v2.0:

<wsd>
<wf pos="DT">a</wf>
<wf lemma="kind" pos="NN" quality="silver" wn20="ENG20-05501538-n" wnsn="1">kind</wf>
<wf pos="IN">of</wf>
<wf lemma="artificial_heart" pos="NN" quality="silver" wn20="ENG20-02650459-n" wnsn="1">artificial<us/>heart</wf>
<wf pos="WDT">that</wf>
<wf lemma="have" pos="VBZ" quality="normal" wn20="ENG20-02139918-v" wnsn="1">has</wf>
<wf lemma="be" pos="VBN" quality="normal" wn20="ENG20-02526983-v" wnsn="1">been</wf>
<wf lemma="use" pos="VBN" quality="normal" wn20="ENG20-01123102-v" wnsn="1">used</wf>
<wf pos="IN">with</wf>
<wf pos="DT">some</wf>
<wf lemma="success" pos="NN" quality="normal" wn20="ENG20-06869923-n" wnsn="1">success</wf>
</wsd>

We have performed the following operations:

  • The synset id of the synset to which the gloss belongs is added as the first element;
  • The WordNet v2.0 ids are converted to WordNet v3.0 ids, as these are the ids used in the UKB knowledge graphs;
  • For each word annotated with a WordNet id we create a node that is connected to the node of the corresponding
    WordNet synset and to the node of the preceding word in the gloss that is annotated with a WordNet synset id.

Thus for the above example the representation is as follows:

u:n03461356-1-gl v:03593862-n s:gw d:0
u:n03461356-2-gl v:05839024-n s:gw d:0
u:n03461356-2-gl v:n03461356-1-gl s:gr d:0
u:n03461356-3-gl v:02745492-n s:gw d:0
u:n03461356-3-gl v:n03461356-2-gl s:gr d:0
u:n03461356-4-gl v:02203362-v s:gw d:0
u:n03461356-4-gl v:n03461356-3-gl s:gr d:0
u:n03461356-5-gl v:02604760-v s:gw d:0
u:n03461356-5-gl v:n03461356-4-gl s:gr d:0
u:n03461356-6-gl v:01158872-v s:gw d:0
u:n03461356-6-gl v:n03461356-5-gl s:gr d:0
u:n03461356-7-gl v:07319103-n s:gw d:0
u:n03461356-7-gl v:n03461356-6-gl s:gr d:0

where n03461356-*-gl are the nodes corresponding to the words in the gloss, and the other nodes correspond to synsets in WordNet v3.0.

We call this set of relations WN30glCon.
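
The three operations can be sketched in Python as follows. This is an illustrative reconstruction, not the code used to build WN30glCon: the function names are hypothetical, the <wsd> fragment is shortened to its first four words, and the small WordNet v2.0 to v3.0 mapping is inferred by aligning the two examples on this page (a real run would need a complete mapping table). The node prefix n03461356 and the gloss synset id 03593862-n are copied from the example output.

```python
import xml.etree.ElementTree as ET


def annotated_ids(wsd_xml, wn20_to_wn30):
    """Return the WordNet v3.0 ids of the sense-annotated words, in gloss order."""
    root = ET.fromstring(wsd_xml)
    # Words without a wn20 attribute (articles, prepositions, ...) are skipped.
    return [wn20_to_wn30[wf.get("wn20")] for wf in root.iter("wf") if wf.get("wn20")]


def gloss_relations(prefix, synset_ids):
    """Chain the gloss nodes in word order and link each one to its synset."""
    lines = []
    for i, synset in enumerate(synset_ids, start=1):
        node = f"{prefix}-{i}-gl"
        lines.append(f"u:{node} v:{synset} s:gw d:0")  # gloss node -> synset
        if i > 1:
            # gloss node -> preceding annotated gloss node
            lines.append(f"u:{node} v:{prefix}-{i - 1}-gl s:gr d:0")
    return lines


# Shortened fragment of the <wsd> element shown above.
WSD = """<wsd>
<wf pos="DT">a</wf>
<wf lemma="kind" pos="NN" wn20="ENG20-05501538-n" wnsn="1">kind</wf>
<wf pos="IN">of</wf>
<wf lemma="artificial_heart" pos="NN" wn20="ENG20-02650459-n" wnsn="1">artificial<us/>heart</wf>
</wsd>"""

# Illustrative mapping only, inferred from the example on this page.
WN20_TO_WN30 = {
    "ENG20-05501538-n": "05839024-n",
    "ENG20-02650459-n": "02745492-n",
}

# The synset of the gloss itself goes first, then the annotated words.
ids = ["03593862-n"] + annotated_ids(WSD, WN20_TO_WN30)
for line in gloss_relations("n03461356", ids):
    print(line)
```

On this shortened input the sketch prints the first five relations of the example above.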

Relations extracted from the SemCor corpus

SemCor
is a sense-tagged corpus created at Princeton University. The corpus is based on a subset
of the English Brown Corpus. The distribution of the corpus contains 186 files:
from br-a01 to br-r09.
In our experiments we divided the corpus into two parts: a test part and a training part. The test part consists of 49 files:
from br-a01 to br-f44. The training part consists of the rest of the files. The test part can be downloaded
here as one file in NAF format.

The training part of SemCor was parsed with the dependency parser from the
IXA pipeline.
The set of relations presented here is based on the dependency
tree for each sentence. For each node in the dependency tree we create
a new graph node for the corresponding word and relate it to the node of its head.
An additional relation links it to the node of the corresponding WordNet synset.
Because SemCor consists of text fragments, the sentences that are from the same
fragment are connected via relations between the roots of the dependency trees.
The root of the second tree is related to the root of the first sentence, the
root of the third sentence to the root of the second sentence, etc.

For example, from the dependency analysis of the sentence "Evidence that other sources of financing are unavailable must be provided." we construct the following set of relations:

u:019-043-t1145-wn v:019-043-t1153-wo s:dp d:0
u:019-043-t1145-wn v:05823932-n s:wn d:0
u:019-043-t1146-wp v:019-043-t1145-wn s:dp d:0
u:019-043-t1147-wg v:019-043-t1148-wn s:dp d:0
u:019-043-t1147-wg v:02069355-a s:wn d:0
u:019-043-t1148-wn v:019-043-t1151-wv s:dp d:0
u:019-043-t1148-wn v:04263614-n s:wn d:0
u:019-043-t1149-wp v:019-043-t1148-wn s:dp d:0
u:019-043-t1150-wn v:019-043-t1149-wp s:dp d:0
u:019-043-t1150-wn v:01099109-n s:wn d:0
u:019-043-t1151-wv v:019-043-t1146-wp s:dp d:0
u:019-043-t1151-wv v:02604760-v s:wn d:0
u:019-043-t1152-wg v:019-043-t1151-wv s:dp d:0
u:019-043-t1152-wg v:00185921-a s:wn d:0
u:019-043-t1154-wv v:019-043-t1153-wo s:dp d:0
u:019-043-t1155-wv v:019-043-t1154-wv s:dp d:0
u:019-043-t1155-wv v:02327200-v s:wn d:0
u:019-043-t1156-wo v:019-043-t1153-wo s:dp d:0
u:019-043-t1153-wo v:019-042-t1144-wo s:sr d:0

where 019-043-t****-** represents the nodes in
the dependency tree: the first three digits are the
number of the file from which the sentence is taken,
followed by the number of the sentence and then the number of
the token. Nodes like 05823932-n
are from the knowledge graph of WordNet v3.0.
The last relation u:019-043-t1153-wo v:019-042-t1144-wo s:sr d:0
is the relation between the root of the sentence and the root of the previous sentence.

The set is called GraphRelSC.
Keep in mind that the parsing of the sentences is done automatically. Thus, there might be errors.
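
The construction described above can be sketched in Python as follows. The Token record and the sentence_relations function are hypothetical names introduced for illustration, the parser output is assumed to be already available as (node, head, synset) triples, and only four tokens of the example sentence are used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    node: str                     # graph node id, e.g. "019-043-t1145-wn"
    head: Optional[str]           # node id of the head; None for the root
    synset: Optional[str] = None  # WordNet v3.0 synset id, if sense-tagged


def sentence_relations(tokens, previous_root=None):
    """Return (relation lines, root node id) for one parsed sentence."""
    lines = []
    root = None
    for tok in tokens:
        if tok.head is None:
            root = tok.node
        else:
            lines.append(f"u:{tok.node} v:{tok.head} s:dp d:0")    # word -> head
        if tok.synset:
            lines.append(f"u:{tok.node} v:{tok.synset} s:wn d:0")  # word -> synset
    if root and previous_root:
        # Link the root to the root of the previous sentence in the same fragment.
        lines.append(f"u:{root} v:{previous_root} s:sr d:0")
    return lines, root


# Four tokens taken from the example above.
sentence = [
    Token("019-043-t1145-wn", "019-043-t1153-wo", "05823932-n"),
    Token("019-043-t1146-wp", "019-043-t1145-wn"),
    Token("019-043-t1153-wo", None),
    Token("019-043-t1154-wv", "019-043-t1153-wo"),
]
relations, root = sentence_relations(sentence, previous_root="019-042-t1144-wo")
for line in relations:
    print(line)
```

Returning the root lets a caller pass it as previous_root when processing the next sentence of the same fragment.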

Experiments over SemCor test corpus

Here we present some evaluation results for
the different sets of relations over the SemCor test corpus.
We have performed experiments with two algorithms implemented in the UKB system: Static and PPRw2w.

The sets are as follows:

  • WN30. The set of relations based on WordNet v3.0. (can be downloaded from here)
  • WN30g. The set of relations based on eXtended WordNet v3.0 representing co-occurrence relations of the synset id and the annotation of the gloss. (can be downloaded from here)
  • WN30gl. The set of relations based on the analysis of
    the logical forms of the glosses in eXtended WordNet v3.0.
    (can be downloaded from here)
  • WN30glCon. The set of relations based on the word sense
    annotation of the glosses in eXtended WordNet v3.0.
    (can be downloaded from here)
  • GraphRelSC. The set of relations based on the analysis of
    the sentences in the SemCor training corpus presented above.
    (can be downloaded from here)

 Graph                                             SemCor
                                                   Static   PPRw2w
 WN30                                              56.60    56.35
 WN30g                                             56.00    57.33
 WN30 + WN30g                                      59.55    62.24
 WN30gl                                            60.46    60.35
 WN30 + WN30gl                                     66.61    67.19
 WN30 + WN30glCon                                  66.99    66.43
 WN30 + GraphRelSC                                 67.04    65.97
 WN30 + WN30gl + GraphRelSC                        68.41    68.51
 WN30 + WN30glCon + GraphRelSC                     68.74    68.14
 WN30 + WN30gl + WN30glCon + GraphRelSC            68.77    68.48
 WN30 + WN30g + WN30gl + WN30glCon + GraphRelSC    68.39    68.58


Publications:

Kiril Simov; Alexander Popov; Petya Osenova. 2015.
Improving Word Sense Disambiguation with Linguistic Knowledge from a Sense Annotated Treebank. RANLP 2015.

Kiril Simov; Alexander Popov; Petya Osenova. 2016.
The Role of the WordNet Relations in the Knowledge-based Word Sense Disambiguation Task. Global WordNet Conference 2016.