The purpose of this task is to match three resources to each other: the Thesaurus of the Netherlands Institute for Sound and Vision called the GTAA, WordNet and DBPedia. The GTAA is in Dutch, while WordNet is in English. DBPedia contains labels in both languages and can therefore be used as a mediator between the two.
All three are broad in scope. WordNet is a generic English lexical database, DBPedia is a semantic web version of Wikipedia and GTAA contains all concepts needed to index Dutch broadcast television.
The purpose of this task is twofold: (1) to map thesauri in different languages and (2) to map resources that are large, rich in semantics but weak in formal structure, i.e. realistic on the Web.
The target mapping relations are the SKOS properties skos:exactMatch, skos:broadMatch and skos:narrowMatch.
The GTAA, a Dutch acronym for "Common Thesaurus [for] Audiovisual Archives", is used at the Netherlands Institute for Sound and Vision, the archive of Dutch public audiovisual broadcasts, for indexing its documents. The thesaurus consists of 6 facets that concern the description of:
<skos:Concept rdf:about="#Subject_alternatieveenergie">
  <skos:prefLabel>alternatieve energie</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
  <skos:broader rdf:resource="#Subject_energie"/>
  <skos:narrower rdf:resource="#Subject_biobrandstoffen"/>
  <skos:narrower rdf:resource="#Subject_windenergie"/>
  <skos:narrower rdf:resource="#Subject_zonne-energie"/>
  <skos:related rdf:resource="#Subject_alcohol"/>
  <skos:related rdf:resource="#Subject_energiebeleid"/>
  <skos:related rdf:resource="#Subject_energiebronnen"/>
  <skos:related rdf:resource="#Subject_getijden"/>
  <skos:related rdf:resource="#Subject_milieubeleid"/>
  <skos:related rdf:resource="#Subject_waterkracht"/>
  <skos:related rdf:resource="#Subject_waterstof"/>
  <skos:altLabel>getijdenenergie</skos:altLabel>
</skos:Concept>
<skos:Concept rdf:about="#Person_BeatrixkoninginNederland">
  <skos:prefLabel>Beatrix (koningin Nederland)</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Person"/>
  <skos:related rdf:resource="#Person_BeatrixkroonprinsesNederland"/>
  <skos:scopeNote>va30-4-80</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Name_Abba">
  <skos:prefLabel>Abba</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
  <skos:scopeNote>popgroep Zweden</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Location_Amsterdam">
  <skos:prefLabel>Amsterdam</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Location"/>
  <skos:scopeNote>Nederland</skos:scopeNote>
</skos:Concept>
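The concept records above can be read with a standard XML parser. The following is a minimal sketch, assuming the GTAA file wraps its concepts in an rdf:RDF element with the usual RDF and SKOS namespace declarations. The wrapper, the shortened sample and the function name read_concepts are illustrative assumptions, not part of the GTAA distribution.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Assumed wrapper: one (shortened) concept from the examples inside rdf:RDF.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="#Subject_alternatieveenergie">
    <skos:prefLabel>alternatieve energie</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
    <skos:broader rdf:resource="#Subject_energie"/>
    <skos:narrower rdf:resource="#Subject_windenergie"/>
    <skos:altLabel>getijdenenergie</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(xml_text):
    """Return {concept id: {"labels": [...], "facet": str or None, "broader": [...]}}."""
    concepts = {}
    root = ET.fromstring(xml_text)
    for node in root.iter(f"{{{SKOS}}}Concept"):
        concept_id = node.get(f"{{{RDF}}}about")
        # Collect the preferred label first, then alternative labels.
        labels = [
            element.text
            for tag in ("prefLabel", "altLabel")
            for element in node.findall(f"{{{SKOS}}}{tag}")
        ]
        # The facet is taken to be the last path segment of the scheme URI.
        scheme = node.find(f"{{{SKOS}}}inScheme")
        facet = None
        if scheme is not None:
            facet = scheme.get(f"{{{RDF}}}resource").rsplit("/", 1)[-1]
        broader = [
            element.get(f"{{{RDF}}}resource")
            for element in node.findall(f"{{{SKOS}}}broader")
        ]
        concepts[concept_id] = {"labels": labels, "facet": facet, "broader": broader}
    return concepts


concepts = read_concepts(SAMPLE)
```

Keeping the facet alongside the labels may help a matcher restrict candidates, for example by comparing Location concepts only against geographic resources.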
WordNet is a lexical database of the English language. Its main building blocks are synsets: groups of words with a synonymous meaning. An example of a synset is {cliff, drop, drop-off}, described as "a steep high face of rock". In this task, the goal is to match synsets (not words). Four kinds of synsets are distinguished, containing nouns, verbs, adjectives or adverbs. In this task, we are only interested in noun-synsets. There are 7 types of relations between noun-synsets:
The main hierarchy in WordNet is built on hyponym relations between synsets, which are similar to subclass relations. Other frequently occurring relations between synsets are meronyms, which denote part-of relations. The classifiedBy relations group synsets according to their topic, region or usage.
W3C has translated WordNet version 2.0 into RDF/OWL. The details of this translation are published in a working draft. The working draft contains a download section, from which the complete dataset can be downloaded. For convenience, the data is split into separate files, one for each relation. Participants of this OAEI task are free to use (or ignore) any of these files. Although the goal is to match noun-synsets, relationships between verbs, adverbs and adjectives may be used in the process.
We recommend using at least the following: synsets, their labels (e.g. the synonymous words cliff, drop, drop-off), their descriptive sentences called 'glosses' (e.g. "a steep high face of rock"), and the hyponym hierarchy between synsets, which is comparable to a subclass hierarchy. For this you need to download four files:
The file WordNet Basic Schema contains the RDF-Schema of the dataset. This file states, for example, that a NounSynset is a subClassOf a Synset, that the property meronymOf is the inverse of holonymOf, etc.
The file Synsets contains the types of synsets (nounSynsets, verbSynsets, adjectiveSynsets or adverbSynsets), a label and an ID number. The latter is a link to the original Princeton version of WordNet and can safely be ignored. Synset URIs have the form: http://www.w3.org/2006/03/wn/wn20/instances/synset-cliff-noun-1

<wn20schema:NounSynset rdf:about="&wn20instances;synset-cliff-noun-1" rdfs:label="cliff">
  <wn20schema:synsetId>108670141</wn20schema:synsetId>
</wn20schema:NounSynset>

The file senseLabels contains all labels of the synsets (the Synsets file gives only one for each synset).

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
</rdf:Description>

The file hyponymy contains the hyponym (i.e. subclass) hierarchy between synsets.

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>

In a similar fashion, there are files for the other types of relations in the download section of the W3C working draft.

In the SKOS translation of the hyponym hierarchy, the statement

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>

becomes

<skos:Concept rdf:about="&wn20instances;synset-cliff-noun-1">
  <skos:broader rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</skos:Concept>

This translation may be used as an alternative to the hyponym file and can be downloaded from here.
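The hyponym hierarchy can be loaded into a simple parent lookup, from which both the skos:broader statements and the transitive ancestors of a synset follow. The sketch below assumes the hyponymOf statements have already been extracted as (hyponym, hypernym) URI pairs. The second pair and all function names are hypothetical and serve only as illustration.

```python
from collections import defaultdict

WN = "http://www.w3.org/2006/03/wn/wn20/instances/"
SKOS_BROADER = "<http://www.w3.org/2004/02/skos/core#broader>"

# (hyponym, hypernym) pairs. The first comes from the example in the text,
# the second is a hypothetical extra level added for illustration.
HYPONYM_OF = [
    (WN + "synset-cliff-noun-1", WN + "synset-geological_formation-noun-1"),
    (WN + "synset-geological_formation-noun-1", WN + "synset-natural_object-noun-1"),
]


def build_broader(pairs):
    """Map each synset to the set of its direct hypernyms."""
    broader = defaultdict(set)
    for child, parent in pairs:
        broader[child].add(parent)
    return broader


def ancestors(synset, broader):
    """Collect all direct and indirect hypernyms of a synset."""
    seen = set()
    stack = [synset]
    while stack:
        for parent in broader.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def to_skos_ntriples(pairs):
    """Render each hyponymOf pair as one skos:broader statement in N-Triples."""
    return [f"<{child}> {SKOS_BROADER} <{parent}> ." for child, parent in pairs]


broader = build_broader(HYPONYM_OF)
cliff_ancestors = ancestors(WN + "synset-cliff-noun-1", broader)
triples = to_skos_ntriples(HYPONYM_OF)
```

The ancestor set is what a matcher would typically consult when deciding between skos:exactMatch and skos:broadMatch for a candidate pair.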
DBPedia is an extremely rich dataset. It contains 2.18 million resources or "things", each tied to an article in the English-language Wikipedia. The "things" are described by titles and abstracts in English and often also in Dutch. DBPedia "things" have numerous properties, such as categories, properties derived from the Wikipedia 'infoboxes', links between pages within and outside Wikipedia, etc. The purpose of this task is to map the DBPedia "things" to WordNet synsets and GTAA concepts.
Note that some DBPedia "things" are already tied to WordNet synsets with a wordnet-type property. For example:

<rdf:Description rdf:about="&dbpedia;Air_New_Zealand">
  <dbpedia2:wordnet-type rdf:resource="&wn20instances;synset-airline-noun-2"/>
</rdf:Description>

In addition, most DBPedia "things" are instances in the YAGO ontology. Classes in the YAGO ontology correspond to Wikipedia categories and WordNet synsets. For example:

<rdf:Description rdf:about="&dbpedia;%22Crazy%22_Joe_Davola">
  <rdf:type rdf:resource="&yago;FictionalCharacter109587565"/>
</rdf:Description>

This information may be used to find matches (skos:exactMatch, skos:broadMatch or skos:narrowMatch) between DBPedia "things" and WordNet synsets.
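One possible way to turn wordnet-type links into mapping candidates is sketched below. It rests on a heuristic that is our assumption and not part of the task definition: if the title of a "thing" equals one of the synset's labels, the pair is proposed as skos:exactMatch; otherwise the "thing" is treated as more specific than the synset and proposed as skos:broadMatch. The DBPedia base URI, the "Airline" resource, the sense labels and the function name are likewise assumed for illustration.

```python
WN = "http://www.w3.org/2006/03/wn/wn20/instances/"
DBP = "http://dbpedia.org/resource/"  # assumed expansion of the &dbpedia; entity

# wordnet-type links: the first is the example from the text, the second is hypothetical.
WORDNET_TYPE = {
    DBP + "Air_New_Zealand": WN + "synset-airline-noun-2",
    DBP + "Airline": WN + "synset-airline-noun-2",
}
# Titles of the "things" (illustrative values).
THING_LABELS = {
    DBP + "Air_New_Zealand": "Air New Zealand",
    DBP + "Airline": "Airline",
}
# Sense labels of the synset (assumed values, as they would come from senseLabels).
SENSE_LABELS = {
    WN + "synset-airline-noun-2": ["airline", "airline business", "airway"],
}


def candidates(wordnet_type, thing_labels, sense_labels):
    """Propose (thing, relation, synset) triples from wordnet-type links."""
    result = []
    for thing, synset in sorted(wordnet_type.items()):
        label = thing_labels.get(thing, "").strip().lower()
        synonyms = {s.lower() for s in sense_labels.get(synset, [])}
        # Heuristic (assumption): same label means same concept, otherwise
        # the synset is taken to be broader than the "thing".
        relation = "skos:exactMatch" if label in synonyms else "skos:broadMatch"
        result.append((thing, relation, synset))
    return result


proposed = candidates(WORDNET_TYPE, THING_LABELS, SENSE_LABELS)
```

The same pattern could be applied to the YAGO rdf:type links, provided the YAGO classes are first related to synsets by some other means.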
All information can be downloaded from the DBPedia download site. For each type of property (title, abstracts, infobox properties, links to pages, etc.), there is a separate file that can be downloaded. Small preview files are also provided.
Every type of relation from this download site can be used in this OAEI task, but you are not obliged to use them all: pick the information that you think is useful and that your tool can handle. A reasonable choice is to use at least the following information: "things" and their labels, comments, and type information, in addition to the YAGO class hierarchy. The model is as follows:

DBPedia "thing" -- rdfs:label "its title"
                -- rdfs:comment "its abstract"
                -- rdf:type yago:SomeClass
yago:SomeClass  -- rdfs:subClassOf yago:SomeSuperClass

The file 'Titles' contains labels, in English and in Dutch, which are the titles of the corresponding Wikipedia articles. The file 'Short Abstracts' contains short abstracts in English and in Dutch. The file YAGO classes contains rdf:type links between the DBPedia "things" and the YAGO classes, and the file YAGO Class Hierarchy contains the YAGO class hierarchy. In addition, the Wikipedia Articles Categories might be useful:
DBPedia "thing" -- skos:subject dbpedia:SomeCategory
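Because the 'Titles' file carries both English and Dutch labels, DBPedia can act as a bridge between GTAA and WordNet. The following is a minimal sketch of such a label-based mediator, assuming exact matching on lowercased labels. All sample data and function names are hypothetical, and a real matcher would need more robust label normalization and disambiguation.

```python
from collections import defaultdict


def index_by_label(label_map):
    """Invert {resource: [labels]} into {normalized label: {resources}}."""
    index = defaultdict(set)
    for resource, labels in label_map.items():
        for label in labels:
            index[label.strip().lower()].add(resource)
    return index


def match_via_dbpedia(gtaa_labels, wordnet_labels, dbpedia_labels):
    """Pair GTAA concepts and synsets that share a DBPedia "thing".

    A GTAA concept is linked through the Dutch title of the "thing",
    a WordNet synset through its English title.
    """
    gtaa_index = index_by_label(gtaa_labels)
    wordnet_index = index_by_label(wordnet_labels)
    found = set()
    for thing, by_language in dbpedia_labels.items():
        dutch = by_language.get("nl", "").strip().lower()
        english = by_language.get("en", "").strip().lower()
        for concept in gtaa_index.get(dutch, ()):
            for synset in wordnet_index.get(english, ()):
                found.add((concept, synset, thing))
    return sorted(found)


# Hypothetical sample data for illustration only.
GTAA = {"#Subject_windenergie": ["windenergie"]}
WORDNET = {"synset-wind_power-noun-1": ["wind power", "wind generation"]}
DBPEDIA = {"Wind_power": {"en": "Wind power", "nl": "Windenergie"}}

matches = match_via_dbpedia(GTAA, WORDNET, DBPEDIA)
```

Each result keeps the mediating "thing", so the same pass also yields candidate GTAA-to-DBPedia and DBPedia-to-WordNet pairs.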
We will evaluate mappings between all three pairs of thesauri. Evaluation of both precision and recall will be based on a sample. More information will follow soon.
We would like to thank Chris Bizer, Fabian Suchanek and Jens Lehmann for their help with the DBPedia dataset. We would also like to thank Willem van Hage for his advice. We gratefully acknowledge the Netherlands Institute for Sound and Vision for allowing us to use the GTAA.
Send any questions, comments, or suggestions to: