The purpose of this task is to match the thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia.
All three are broad in scope. WordNet is a generic English lexical database, DBpedia is a semantic web version of Wikipedia, and GTAA contains all concepts needed to index Dutch broadcast television.
This task has two aims. First, we want to map a thesaurus in a language other than English to widely used English resources: the GTAA is in Dutch, while WordNet is in English; DBpedia contains labels in both languages and can therefore be used as a mediator between the two. Second, we want to align resources that are large and rich in semantics but weak in formal structure, which is the realistic situation on the Web. Mapping resources in languages other than English to WordNet and Wikipedia can open up archives indexed with a monolingual thesaurus to multilingual access, and enable non-native speakers to access their content.
We are aiming for two SKOS mapping relations: skos:exactMatch and skos:closeMatch.
The GTAA thesaurus, a Dutch acronym for "Common Thesaurus [for] Audiovisual Archives", is used at the Netherlands Institute for Sound and Vision, the archive of Dutch public audiovisual broadcasting, to index its documents. The thesaurus consists of 6 facets that concern the description of:
<skos:Concept rdf:about="#Subject_alternatieveenergie">
  <skos:prefLabel>alternatieve energie</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
  <skos:broader rdf:resource="#Subject_energie"/>
  <skos:narrower rdf:resource="#Subject_biobrandstoffen"/>
  <skos:narrower rdf:resource="#Subject_windenergie"/>
  <skos:narrower rdf:resource="#Subject_zonne-energie"/>
  <skos:related rdf:resource="#Subject_alcohol"/>
  <skos:related rdf:resource="#Subject_energiebeleid"/>
  <skos:related rdf:resource="#Subject_energiebronnen"/>
  <skos:related rdf:resource="#Subject_getijden"/>
  <skos:related rdf:resource="#Subject_milieubeleid"/>
  <skos:related rdf:resource="#Subject_waterkracht"/>
  <skos:related rdf:resource="#Subject_waterstof"/>
  <skos:altLabel>getijdenenergie</skos:altLabel>
</skos:Concept>

<skos:Concept rdf:about="#Person_BeatrixkoninginNederland">
  <skos:prefLabel>Beatrix (koningin Nederland)</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Person"/>
  <skos:related rdf:resource="#Person_BeatrixkroonprinsesNederland"/>
  <skos:scopeNote>va30-4-80</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Name_Abba">
  <skos:prefLabel>Abba</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
  <skos:scopeNote>popgroep Zweden</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Location_Amsterdam">
  <skos:prefLabel>Amsterdam</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Location"/>
  <skos:scopeNote>Nederland</skos:scopeNote>
</skos:Concept>
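As an illustration of how the GTAA concepts above might be read by a matching tool, the following minimal sketch extracts labels and broader links with Python's standard library. The enclosing rdf:RDF element and its namespace declarations are assumptions added so that the fragment parses on its own (the snippets on this page omit them), and the function name `read_concepts` is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Shortened copy of the "alternatieve energie" concept, wrapped in an
# rdf:RDF element (an assumption; the page shows bare concepts only).
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="#Subject_alternatieveenergie">
    <skos:prefLabel>alternatieve energie</skos:prefLabel>
    <skos:broader rdf:resource="#Subject_energie"/>
    <skos:narrower rdf:resource="#Subject_windenergie"/>
    <skos:altLabel>getijdenenergie</skos:altLabel>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(xml_text):
    """Return {concept URI: labels and broader concepts} for every skos:Concept."""
    concepts = {}
    for node in ET.fromstring(xml_text).iter(f"{{{SKOS}}}Concept"):
        uri = node.get(f"{{{RDF}}}about")
        concepts[uri] = {
            "prefLabel": node.findtext(f"{{{SKOS}}}prefLabel"),
            "altLabels": [e.text for e in node.findall(f"{{{SKOS}}}altLabel")],
            "broader": [
                e.get(f"{{{RDF}}}resource")
                for e in node.findall(f"{{{SKOS}}}broader")
            ],
        }
    return concepts


if __name__ == "__main__":
    for uri, info in read_concepts(SAMPLE).items():
        print(uri, info)
```

Both the preferred and the alternative labels are kept, since either may be the one that matches a label in the other resource.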
WordNet is a lexical database of the English language. Its main building blocks are synsets: groups of words with a synonymous meaning. An example of a synset is {cliff, drop, drop-off}, described as "a steep high face of rock". In this task, the goal is to match synsets (not words). Four kinds of synsets are distinguished, containing nouns, verbs, adjectives or adverbs. In this task, we are only interested in noun-synsets. There are 7 types of relations between noun-synsets:
The main hierarchy in WordNet is built on hyponym relations between synsets, which are similar to subclass relations. Other frequently occurring relations between synsets are meronyms, which denote part-of relations. The classifiedBy relations group synsets according to their topic, region or usage.
W3C has translated WordNet version 2.0 into RDF/OWL. The details of this translation are published in a working draft. The working draft contains a download section, from which the complete dataset can be downloaded. For convenience, the data is split up in different files, one for each relation. Participants of this OAEI task are free to use (or ignore) all files with all types of relations. Although the goal is to match noun-synsets, relationships between verbs, adverbs and adjectives may be used in the process.
We recommend using at least the following: synsets, their labels (e.g. the synonymous words cliff, drop, drop-off), their descriptive sentences called 'glosses' (e.g. "a steep high face of rock"), and the hyponym hierarchy between synsets, which is comparable to a subclass hierarchy. You would need to download four files:
The file WordNet Basic Schema contains the RDF-Schema of the dataset. This file states, for example, that a NounSynset is a subClassOf a Synset, that the property meronymOf is the inverse of holonymOf, etc.
The file Synsets contains the types of synsets (nounSynsets, verbSynsets, adjectiveSynsets or adverbSynsets), a label and an ID number. The latter is a link to the original Princeton version of WordNet and can safely be ignored. Synsets have the form: http://www.w3.org/2006/03/wn/wn20/instances/synset-cliff-noun-1
<wn20schema:NounSynset rdf:about="&wn20instances;synset-cliff-noun-1" rdfs:label="cliff">
  <wn20schema:synsetId>108670141</wn20schema:synsetId>
</wn20schema:NounSynset>
The file senseLabels contains all labels of the synsets (the Synsets file gives only one for each synset).
<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
</rdf:Description>
The hyponym file contains the hyponym relations between synsets.
<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>
In a similar fashion, there are files for the other types of relations.
The hyponym hierarchy is also available as a SKOS translation, in which
<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>
becomes
<skos:Concept rdf:about="&wn20instances;synset-cliff-noun-1">
  <skos:broader rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</skos:Concept>
This translation may be used as an alternative to the hyponym file and can be downloaded from here.
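The senseLabel and hyponymOf statements shown above could be loaded as in the sketch below, which builds a label-to-synset index and a list of (narrower, broader) pairs. It is a sketch under stated assumptions: the &wn20instances; entity is written out using the URI pattern given for synsets above, the wn20schema namespace URI is assumed to follow the same pattern, the rdf:RDF wrapper is added for self-containment, and `read_synsets` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed namespace URIs, following the synset URI pattern given in the text.
WN_SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"
WN_INST = "http://www.w3.org/2006/03/wn/wn20/instances/"

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:wn20schema="{WN_SCHEMA}">
  <rdf:Description rdf:about="{WN_INST}synset-cliff-noun-1">
    <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
    <wn20schema:senseLabel>drop</wn20schema:senseLabel>
    <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
    <wn20schema:hyponymOf
        rdf:resource="{WN_INST}synset-geological_formation-noun-1"/>
  </rdf:Description>
</rdf:RDF>"""


def read_synsets(xml_text):
    """Return (label -> set of synset URIs, list of (narrower, broader) pairs)."""
    labels = defaultdict(set)
    broader = []
    for desc in ET.fromstring(xml_text).iter(f"{{{RDF}}}Description"):
        synset = desc.get(f"{{{RDF}}}about")
        for element in desc.findall(f"{{{WN_SCHEMA}}}senseLabel"):
            labels[element.text].add(synset)
        for element in desc.findall(f"{{{WN_SCHEMA}}}hyponymOf"):
            # hyponymOf plays the role that skos:broader has in the SKOS version.
            broader.append((synset, element.get(f"{{{RDF}}}resource")))
    return dict(labels), broader


if __name__ == "__main__":
    label_index, hierarchy = read_synsets(SAMPLE)
    print(label_index)
    print(hierarchy)
```

Indexing by label rather than by synset reflects that one word can belong to several synsets, so a label lookup yields a set of candidate synsets to disambiguate.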
DBpedia is an extremely rich dataset. It contains 2.18 million resources or "things", each tied to an article in the English-language Wikipedia. The "things" are described by titles and abstracts in English and often also in Dutch. DBpedia "things" have numerous properties, such as categories, properties derived from the Wikipedia 'infoboxes', links between pages within and outside Wikipedia, etc. The purpose of this task is to map the DBpedia "things" to WordNet synsets and GTAA concepts.
All information can be downloaded from the DBpedia download site. For each type of property (titles, abstracts, infobox properties, links to pages, etc.), there is a separate file that can be downloaded. Small preview files are also provided. In the following description we link to the preview files rather than to the actual content files that you need for the alignment, to prevent multiple downloads of very large files.
Every type of relation from the download site can be used in this OAEI task. However, you are of course not obliged to use them all. You can pick and choose the information that you think is useful and that your tool can handle. A reasonable choice seems to be to use at least the following information: "things", their labels and their comments:
DBpedia "thing" -- rdfs:label -- "title of the Wikipedia page"
DBpedia "thing" -- rdfs:comment -- "abstract of the Wikipedia page"
The file Titles contains labels, available in English and Dutch, which are the titles of the corresponding Wikipedia articles. The file Short Abstracts contains short abstracts, available in English and Dutch.
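Because the titles are available in both Dutch and English, DBpedia can serve as the mediator between GTAA and WordNet mentioned in the introduction. The sketch below shows one possible way to chain exact label matches: Dutch GTAA label to Dutch title, then English title to WordNet sense label. All data in it is toy data invented for illustration (it is not taken from the actual files), and the names `normalize` and `mediate` are hypothetical.

```python
from collections import defaultdict


def normalize(label):
    """Lowercase and collapse whitespace so trivially different labels compare equal."""
    return " ".join(label.lower().split())


# Toy data invented for this illustration; not taken from the real datasets.
GTAA_LABELS = {"#Subject_windenergie": ["windenergie"]}
DBPEDIA_TITLES = {"dbpedia:Wind_power": {"nl": "Windenergie", "en": "Wind power"}}
WORDNET_SENSE_LABELS = {"synset-wind_power-noun-1": ["wind power"]}


def mediate(gtaa_labels, dbpedia_titles, wordnet_labels):
    """Return (GTAA concept, DBpedia thing, WordNet synset) candidate triples."""
    by_dutch_title = defaultdict(list)
    for thing, titles in dbpedia_titles.items():
        if "nl" in titles:
            by_dutch_title[normalize(titles["nl"])].append(thing)

    by_sense_label = defaultdict(list)
    for synset, labels in wordnet_labels.items():
        for label in labels:
            by_sense_label[normalize(label)].append(synset)

    candidates = []
    for concept, labels in gtaa_labels.items():
        for label in labels:
            for thing in by_dutch_title.get(normalize(label), []):
                english = dbpedia_titles[thing].get("en")
                if english is None:
                    continue  # no English title, so no bridge to WordNet
                for synset in by_sense_label.get(normalize(english), []):
                    candidates.append((concept, thing, synset))
    return candidates


if __name__ == "__main__":
    print(mediate(GTAA_LABELS, DBPEDIA_TITLES, WORDNET_SENSE_LABELS))
```

Exact string equality after normalization is only a baseline. The triples it produces are candidates, and a real tool would still need to decide between skos:exactMatch and skos:closeMatch, for instance by comparing abstracts with glosses.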
In addition, we consider the DBpedia ontology, categories and links to WordNet valuable sources for the current alignment task. Descriptions will be given below.
The DBpedia Ontology is an ontology of currently more than 170 classes, organised in a subsumption hierarchy. The ontology was manually created based on the most commonly used infoboxes of Wikipedia. The file DBpedia Ontology contains the classes and properties of this ontology, while the file Ontology Types contains the instances of the ontology, i.e. DBpedia "things".
DBpedia "things" are organised into categories. The file Categories(SKOS) provides the categories and the SKOS relations between them, while the file Articles Categories contains the skos:Subject links between "things" and categories.
It is important to note that some DBpedia "things" are already tied to WordNet synsets with a wordnet-type property in the file WordNet Classes. For example:
<rdf:Description rdf:about="&dbpedia;Air_New_Zealand">
  <dbpedia2:wordnet-type rdf:resource="&wn20instances;synset-airline-noun-2"/>
</rdf:Description>
In addition, most DBpedia "things" are instances in the YAGO ontology. Classes in the YAGO ontology correspond to Wikipedia categories and WordNet synsets. For example:
<rdf:Description rdf:about="&dbpedia;%22Crazy%22_Joe_Davola">
  <rdf:type rdf:resource="&yago;FictionalCharacter109587565"/>
</rdf:Description>
This information may be used to find matches between DBpedia "things" and WordNet synsets. The file YAGO Classes contains rdf:type links between the DBpedia "things" and the YAGO classes, as well as labels of the YAGO classes and the YAGO subsumption hierarchy.
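One way to exploit these typing links is to group DBpedia "things" under the WordNet synset or YAGO class they are typed with, as in the following sketch. The namespace URIs standing in for the &dbpedia;, dbpedia2, &wn20instances; and &yago; prefixes are assumptions (the page does not spell them out), the rdf:RDF wrapper is added so the fragment parses, and `things_by_class` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumed expansions of the prefixes and entities used in the examples.
DBPROP = "http://dbpedia.org/property/"
DBRES = "http://dbpedia.org/resource/"
WN_INST = "http://www.w3.org/2006/03/wn/wn20/instances/"
YAGO = "http://dbpedia.org/class/yago/"

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dbpedia2="{DBPROP}">
  <rdf:Description rdf:about="{DBRES}Air_New_Zealand">
    <dbpedia2:wordnet-type rdf:resource="{WN_INST}synset-airline-noun-2"/>
  </rdf:Description>
  <rdf:Description rdf:about="{DBRES}%22Crazy%22_Joe_Davola">
    <rdf:type rdf:resource="{YAGO}FictionalCharacter109587565"/>
  </rdf:Description>
</rdf:RDF>"""


def things_by_class(xml_text):
    """Map each WordNet synset or YAGO class URI to the things typed with it."""
    groups = defaultdict(set)
    type_properties = (f"{{{DBPROP}}}wordnet-type", f"{{{RDF}}}type")
    for desc in ET.fromstring(xml_text).iter(f"{{{RDF}}}Description"):
        thing = desc.get(f"{{{RDF}}}about")
        for prop in type_properties:
            for link in desc.findall(prop):
                groups[link.get(f"{{{RDF}}}resource")].add(thing)
    return dict(groups)


if __name__ == "__main__":
    for cls, things in things_by_class(SAMPLE).items():
        print(cls, sorted(things))
```

Note that these links express typing rather than equivalence: Air New Zealand is an instance of the airline synset, not a match for it. They are therefore better suited to filtering or disambiguating candidates than to producing skos:exactMatch mappings directly.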
We will evaluate the alignments between both pairs of thesauri: WordNet-GTAA and DBpedia-GTAA. Since mapping performance can vary over the different facets of the GTAA, we will evaluate samples of mappings for each facet and report on their quality separately.
To estimate precision, we will draw a random sample of 100 mappings from each GTAA facet in each alignment, resulting in 800 mappings to be evaluated in total per participating tool. For recall, we will create a reference alignment of 100 mappings per facet, per alignment. Considering the time-consuming nature of creating a reference alignment, we will only do this for selected facets.
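The procedure above can be sketched as follows. This is a minimal illustration, not the organisers' evaluation code: mappings are assumed to be (source, relation, target) tuples, the correctness judgements are assumed to come from a human assessor, and the function names are hypothetical.

```python
import random


def draw_sample(mappings, size=100, seed=0):
    """Draw a reproducible random sample of at most `size` mappings."""
    ordered = sorted(mappings)  # fixed order so the seed fully determines the sample
    return random.Random(seed).sample(ordered, min(size, len(ordered)))


def precision(judgements):
    """Fraction of sampled mappings judged correct.

    `judgements` maps each sampled mapping to True (correct) or False.
    """
    return sum(judgements.values()) / len(judgements)


def recall(found, reference):
    """Fraction of the reference alignment that the tool also produced."""
    return len(set(found) & set(reference)) / len(reference)


if __name__ == "__main__":
    # Toy mappings invented for illustration.
    found = {
        ("gtaa:a", "skos:exactMatch", "wn:x"),
        ("gtaa:b", "skos:closeMatch", "wn:y"),
        ("gtaa:c", "skos:exactMatch", "wn:z"),
    }
    reference = {
        ("gtaa:a", "skos:exactMatch", "wn:x"),
        ("gtaa:d", "skos:exactMatch", "wn:w"),
    }
    sample = draw_sample(found, size=2)
    print(sample)
    print(recall(found, reference))
```

Here a mapping counts towards recall only if source, relation and target all agree with the reference. Whether a skos:closeMatch should count when the reference lists skos:exactMatch is a design decision the sketch leaves open.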
We would like to thank Chris Bizer, Fabian Suchanek and Jens Lehmann for their help with the DBpedia dataset. We would also like to thank Willem van Hage for his advice. We gratefully acknowledge the Netherlands Institute for Sound and Vision for allowing us to use the GTAA.
Send any questions, comments, or suggestions to:
Initial location of this page: http://www.cs.vu.nl/~laurah/oaei/2009/