Very Large Crosslingual Resources

The purpose of this task is to match the Thesaurus of the Netherlands Institute for Sound and Vision (called GTAA, see below for more information) to two other resources: the English WordNet from Princeton University and DBpedia.

All three are broad in scope. WordNet is a generic English lexical database, DBpedia is a Semantic Web version of Wikipedia, and the GTAA contains all concepts needed to index Dutch broadcast television.

The purpose of this task is twofold. First, we want to map a thesaurus in a language other than English to widely used English resources: the GTAA is in Dutch, while WordNet is in English; DBpedia contains labels in both languages and can therefore be used as a mediator between the two. Second, we want to align resources that are large and rich in semantics but weak in formal structure, i.e. realistic examples of the data found on the Web. Mapping resources in languages other than English to WordNet and Wikipedia can open up archives indexed with a monolingual thesaurus to multilingual access, and enable non-native speakers to access their content.

We are aiming for two SKOS mapping relations: skos:exactMatch and skos:closeMatch.
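As a rough illustration of what one such mapping amounts to, the sketch below writes a single mapping as an N-Triples statement. The helper name `mapping_triple`, the `http://example.org/gtaa` base URI and the Amsterdam pairing are assumptions made for illustration only; this is not a prescribed submission format.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
ALLOWED_RELATIONS = ("exactMatch", "closeMatch")  # the two relations targeted by this task

def mapping_triple(source_uri, relation, target_uri):
    """Return one N-Triples line stating a SKOS mapping relation (hypothetical helper)."""
    if relation not in ALLOWED_RELATIONS:
        raise ValueError(f"unsupported relation: {relation}")
    return f"<{source_uri}> <{SKOS}{relation}> <{target_uri}> ."

line = mapping_triple(
    "http://example.org/gtaa#Location_Amsterdam",  # hypothetical base URI for GTAA concepts
    "exactMatch",
    "http://dbpedia.org/resource/Amsterdam",       # illustrative target, not a verified result
)
print(line)
```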


Data sets

GTAA

General Information

The GTAA (a Dutch acronym for "Common Thesaurus [for] Audiovisual Archives") is the thesaurus used at the Netherlands Institute for Sound and Vision, the archive of Dutch public audiovisual broadcasting, for indexing its documents. The thesaurus consists of six facets, which describe:

the topic of a TV program: keywords are selected from the Subject facet
the main people mentioned in a TV program: keywords are selected from the Person facet
the main "Named Entities" mentioned in a TV program (corporation names, music bands, etc.): keywords are selected from the Name facet
the main locations mentioned in a TV program or the place where it has been created: keywords are selected from the Location facet
the genre of a TV program: keywords are selected from the Genre facet
the makers and presenters of a TV program: keywords are selected from the Makers facet

The GTAA contains approximately 160,000 terms: ~3,800 Subjects, ~97,000 Persons, ~27,000 Names, ~14,000 Locations, 113 Genres and ~18,000 Makers. In this mapping experiment, we consider only four facets: Subject, Person, Name and Location. The thesaurus was originally structured along the lines of the ISO 2788 standard for thesauri, which is commonly used by companies and institutions. We have converted it to the SKOS reference model. It contains Broader Terms, Narrower Terms, Related Terms, Scope Notes (notes that help to clarify or define a term), and Preferred and Alternative Labels. Terms in all facets of the GTAA can have Related Terms and Scope Notes, but only the Subject facet has Alternative Labels and Broader Term/Narrower Term relations, the latter organizing the terms into a hierarchy. The hierarchical organisation of the Subject facet is fairly shallow: 80% of the terms are not involved in hierarchies deeper than 3 levels (the average hierarchy depth is 1.3).

SKOS Representation

The SKOS version of the GTAA consists of skos:Concepts related by skos:broader, skos:narrower and skos:related properties. In addition, some concepts are clarified with a skos:scopeNote. Samples of the datasets are available for inspection for the Subject, Person, Name and Location facets, and examples from each facet are given below. Each facet is stored in a separate file, but every concept has a skos:inScheme property that names the facet it belongs to, so all the data can be merged into a single file if necessary.

Examples of GTAA concepts

 <skos:Concept rdf:about="#Subject_alternatieveenergie">
	<skos:prefLabel>alternatieve energie</skos:prefLabel>
	<skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
	<skos:broader rdf:resource="#Subject_energie"/> 
	<skos:narrower rdf:resource="#Subject_biobrandstoffen"/>
	<skos:narrower rdf:resource="#Subject_windenergie"/>
	<skos:narrower rdf:resource="#Subject_zonne-energie"/> 
	<skos:related rdf:resource="#Subject_alcohol"/> 
	<skos:related rdf:resource="#Subject_energiebeleid"/> 
	<skos:related rdf:resource="#Subject_energiebronnen"/> 
	<skos:related rdf:resource="#Subject_getijden"/> 
	<skos:related rdf:resource="#Subject_milieubeleid"/> 
	<skos:related rdf:resource="#Subject_waterkracht"/> 
	<skos:related rdf:resource="#Subject_waterstof"/>
	<skos:altLabel>getijdenenergie</skos:altLabel> 
</skos:Concept>

 <skos:Concept rdf:about="#Person_BeatrixkoninginNederland">
	<skos:prefLabel>Beatrix (koningin Nederland)</skos:prefLabel> 
	<skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Person"/>
	<skos:related rdf:resource="#Person_BeatrixkroonprinsesNederland"/>
	<skos:scopeNote>va30-4-80</skos:scopeNote>
</skos:Concept>

 <skos:Concept rdf:about="#Name_Abba">
	<skos:prefLabel>Abba</skos:prefLabel> 
	<skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
	<skos:scopeNote>popgroep Zweden</skos:scopeNote>
</skos:Concept>

 <skos:Concept rdf:about="#Location_Amsterdam">
	<skos:prefLabel>Amsterdam</skos:prefLabel> 
	<skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Location"/>
	<skos:scopeNote>Nederland</skos:scopeNote>
</skos:Concept>
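
Concepts like the ones above could be read with a standard XML parser. The following is a minimal sketch using Python's `xml.etree.ElementTree`; it assumes the concepts are wrapped in an `rdf:RDF` root element that declares the usual `rdf` and `skos` namespaces, and the function name `read_concepts` and the chosen dictionary layout are illustrative only.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Abridged copy of two of the examples above, wrapped in an assumed rdf:RDF root.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
 <skos:Concept rdf:about="#Subject_alternatieveenergie">
  <skos:prefLabel>alternatieve energie</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
  <skos:broader rdf:resource="#Subject_energie"/>
  <skos:altLabel>getijdenenergie</skos:altLabel>
 </skos:Concept>
 <skos:Concept rdf:about="#Name_Abba">
  <skos:prefLabel>Abba</skos:prefLabel>
  <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
  <skos:scopeNote>popgroep Zweden</skos:scopeNote>
 </skos:Concept>
</rdf:RDF>"""

def read_concepts(xml_text):
    """Map each concept URI to its facet, labels, scope notes and broader terms."""
    concepts = {}
    for node in ET.fromstring(xml_text).findall(f"{{{SKOS}}}Concept"):
        scheme = node.find(f"{{{SKOS}}}inScheme")
        facet = None
        if scheme is not None:
            # The facet name is the last path segment of the skos:inScheme URI.
            facet = scheme.get(f"{{{RDF}}}resource").rsplit("/", 1)[-1]
        concepts[node.get(f"{{{RDF}}}about")] = {
            "facet": facet,
            "prefLabel": node.findtext(f"{{{SKOS}}}prefLabel"),
            "altLabels": [e.text for e in node.findall(f"{{{SKOS}}}altLabel")],
            "scopeNotes": [e.text for e in node.findall(f"{{{SKOS}}}scopeNote")],
            "broader": [e.get(f"{{{RDF}}}resource") for e in node.findall(f"{{{SKOS}}}broader")],
        }
    return concepts

concepts = read_concepts(SAMPLE)
print(concepts["#Name_Abba"]["facet"])
```

Because the facet is recovered from skos:inScheme, the same function would work on a single merged file as well as on the per-facet files.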

Obtaining the GTAA dataset

The GTAA is copyrighted material. To obtain the full datasets, please download the user agreement here. Fax the signed agreement to +31 20 5987728, for the attention of Laura Hollink, or scan the form and e-mail it to laurah at cs dot vu dot nl. You will then receive by e-mail the password needed to access the complete dataset.

WordNet

General Information

WordNet is a lexical database of the English language. Its main building blocks are synsets: groups of words with a synonymous meaning. An example of a synset is {cliff, drop, drop-off}, described as "a steep high face of rock". In this task, the goal is to match synsets (not words). Four kinds of synsets are distinguished, containing nouns, verbs, adjectives or adverbs. In this task, we are only interested in noun-synsets. There are 7 types of relations between noun-synsets:

hyponymOf
memberMeronymOf
substanceMeronymOf
partMeronymOf
classifiedByTopic
classifiedByUsage
classifiedByRegion

The main hierarchy in WordNet is built on hyponym relations between synsets, which are similar to subclass relations. Other frequently occurring relations between synsets are meronyms, which denote part-of relations. The classifiedBy relations group synsets according to their topic, region or usage.

RDF/OWL Representation

W3C has translated WordNet version 2.0 into RDF/OWL. The details of this translation are published in a working draft. The working draft contains a download section, from which the complete dataset can be downloaded. For convenience, the data is split into separate files, one for each relation. Participants in this OAEI task are free to use (or ignore) any of these files. Although the goal is to match noun-synsets, relationships between verbs, adverbs and adjectives may be used in the process.

We recommend using at least the following: synsets, their labels (e.g. the synonymous words cliff, drop, drop-off), their descriptive sentences called 'glosses' (e.g. "a steep high face of rock"), and the hyponym hierarchy between synsets, which is comparable to a subclass hierarchy. For this you need to download four files:

The file WordNet Basic Schema contains the RDF-Schema of the dataset. This file states, for example, that a NounSynset is a subClassOf a Synset, that the property meronymOf is the inverse of holonymOf, etc.

The file Synsets contains, for each synset, its type (nounSynset, verbSynset, adjectiveSynset or adverbSynset), a label and an ID number. The latter is a link to the original Princeton version of WordNet and can safely be ignored. Synset URIs have the form: http://www.w3.org/2006/03/wn/wn20/instances/synset-cliff-noun-1

 <wn20schema:NounSynset rdf:about="&wn20instances;synset-cliff-noun-1" rdfs:label="cliff">
	<wn20schema:synsetId>108670141</wn20schema:synsetId>
</wn20schema:NounSynset>

The file senseLabels contains all labels of the synsets (the Synsets file gives only one for each synset).

 <rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
	<wn20schema:senseLabel>cliff</wn20schema:senseLabel>
	<wn20schema:senseLabel>drop</wn20schema:senseLabel>
	<wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
</rdf:Description>

The file hyponymy contains the hyponym (i.e. subclass) hierarchy between synsets.

 <rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1"> 
	<wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/> 
</rdf:Description>

In a similar fashion, there are files for the other types of relations in the download section of the W3C working draft.
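
A label index over the senseLabels file is a natural starting point for matching. The sketch below is one possible way to build it, not part of the W3C distribution: it assumes the `wn20schema` prefix expands to `http://www.w3.org/2006/03/wn/wn20/schema/`, writes URIs out in full instead of using the `&wn20instances;` entity (which a plain XML parser would reject without its DTD declaration), and keeps only noun-synsets by checking for `-noun-` in the URI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WN_SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"    # assumed namespace URI
WN_INST = "http://www.w3.org/2006/03/wn/wn20/instances/"

# The senseLabels example from above, with the entity expanded by hand.
SENSE_LABELS = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:wn20schema="{WN_SCHEMA}">
 <rdf:Description rdf:about="{WN_INST}synset-cliff-noun-1">
  <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
 </rdf:Description>
</rdf:RDF>"""

def index_noun_synsets_by_label(xml_text):
    """Map each lower-cased sense label to the set of noun-synset URIs carrying it."""
    index = defaultdict(set)
    for desc in ET.fromstring(xml_text).iter(f"{{{RDF}}}Description"):
        synset = desc.get(f"{{{RDF}}}about")
        if "-noun-" not in synset:
            continue  # this task only targets noun-synsets
        for label in desc.findall(f"{{{WN_SCHEMA}}}senseLabel"):
            index[label.text.lower()].add(synset)
    return index

label_index = index_noun_synsets_by_label(SENSE_LABELS)
print(sorted(label_index))
```

Since one word can belong to several synsets, the index maps each label to a set of candidates; choosing among them would need further evidence such as glosses or the hyponym hierarchy.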

Translation to SKOS

The original WordNet model is rich and well designed. However, some tools may have problems with the fact that the synsets are instances rather than classes. Therefore, for the purpose of this OAEI task, we have translated the hyponym hierarchy into a skos:broader hierarchy, making the synsets skos:Concepts.

 <rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1"> 
	<wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>

becomes

 <skos:Concept rdf:about="&wn20instances;synset-cliff-noun-1"> 
	<skos:broader rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</skos:Concept>

This translation may be used as an alternative to the hyponym file and can be downloaded from here.
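
The translation rule itself is simple enough to state as code. The sketch below applies it to plain (subject, predicate, object) tuples; it illustrates the rule shown above and is not the script used to produce the downloadable file. The `wn20schema` namespace URI and the function name are assumptions. As in the example, only the subject of each hyponymOf statement is typed as a skos:Concept.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"
WN_SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"    # assumed namespace URI
WN_INST = "http://www.w3.org/2006/03/wn/wn20/instances/"

def hyponyms_to_skos(triples):
    """Rewrite hyponymOf statements as skos:broader, typing each subject as skos:Concept."""
    result = []
    for subject, predicate, obj in triples:
        if predicate != WN_SCHEMA + "hyponymOf":
            continue  # every other relation is left out of the SKOS view
        result.append((subject, RDF_TYPE, SKOS + "Concept"))
        result.append((subject, SKOS + "broader", obj))
    return result

source = [(
    WN_INST + "synset-cliff-noun-1",
    WN_SCHEMA + "hyponymOf",
    WN_INST + "synset-geological_formation-noun-1",
)]
converted = hyponyms_to_skos(source)
for triple in converted:
    print(triple)
```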

DBpedia

General Information

DBpedia is an extremely rich dataset. It contains 2.18 million resources or "things", each tied to an article in the English-language Wikipedia. The "things" are described by titles and abstracts in English and often also in Dutch. DBpedia "things" have numerous properties, such as categories, properties derived from the Wikipedia 'infoboxes', links between pages within and outside Wikipedia, etc. The purpose of this task is to map the DBpedia "things" to WordNet synsets and GTAA concepts.

All information can be downloaded from the DBpedia download site. For each type of property (title, abstracts, infobox properties, links to pages, etc.), there is a separate file that can be downloaded. Small preview files are also provided. In the following description we link to the preview files rather than to the actual content files that you need for the alignment, to prevent multiple downloads of very large files.

Every type of relation from the download site can be used in this OAEI task. However, you are of course not obliged to use them all. You can pick and choose the information that you think is useful and that your tool can handle. A reasonable choice seems to be to use at least the following information: "things", their labels and their comments:

 DBPedia "thing" 
		-- rdfs:label -- "title of the wikipedia page" 
		-- rdfs:comment -- "abstract of the wikipedia page"

The file Titles contains labels, available in English and Dutch, which are the titles of the corresponding Wikipedia articles. The file Short Abstracts contains short abstracts, available in English and Dutch.
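
Because titles exist in both languages, a Dutch GTAA label can be bridged to an English title through the DBpedia resource they share. The sketch below shows the idea on a handful of hand-written rows; the rows, the tuple layout and the function name are illustrative assumptions and are not taken from the actual Titles file.

```python
from collections import defaultdict

# Illustrative rows only: (DBpedia resource, language tag, title).
TITLES = [
    ("http://dbpedia.org/resource/Wind_power", "en", "Wind power"),
    ("http://dbpedia.org/resource/Wind_power", "nl", "Windenergie"),
    ("http://dbpedia.org/resource/Amsterdam", "en", "Amsterdam"),
    ("http://dbpedia.org/resource/Amsterdam", "nl", "Amsterdam"),
]

def english_titles_for(dutch_label, titles):
    """Return {resource: English title} for resources whose Dutch title equals dutch_label."""
    by_resource = defaultdict(dict)
    for resource, lang, title in titles:
        by_resource[resource][lang] = title
    wanted = dutch_label.casefold()  # case-insensitive exact match, a deliberately naive choice
    return {
        resource: labels["en"]
        for resource, labels in by_resource.items()
        if "en" in labels and labels.get("nl", "").casefold() == wanted
    }

bridged = english_titles_for("windenergie", TITLES)
print(bridged)
```

The English titles found this way could then be looked up among WordNet sense labels. Exact string matching is only a baseline; real labels would likely need normalisation and disambiguation.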

In addition, we consider the DBpedia ontology, categories and links to WordNet valuable sources for the current alignment task. Descriptions will be given below.

DBpedia ontology

The DBpedia Ontology currently has more than 170 classes, organised in a subsumption hierarchy. The ontology was manually created based on the most commonly used infoboxes of Wikipedia. The file DBpedia Ontology contains the classes and properties of this ontology, while the file Ontology Types contains the instances of the ontology, i.e. DBpedia "things".

Existing links to WordNet

Note that some DBpedia "things" are already linked to WordNet synsets by a wordnet-type property in the file WordNet Classes. For example:

 <rdf:Description rdf:about="&dbpedia;Air_New_Zealand"> 
	<dbpedia2:wordnet-type rdf:resource="&wn20instances;synset-airline-noun-2"/>
</rdf:Description>

In addition, most DBpedia "things" are instances in the YAGO ontology. Classes in the YAGO ontology correspond to Wikipedia categories and WordNet synsets. For example:

 <rdf:Description rdf:about="&dbpedia;%22Crazy%22_Joe_Davola"> 
	<rdf:type rdf:resource="&yago;FictionalCharacter109587565"/>
</rdf:Description>

This information may be used to find matches between DBpedia "things" and WordNet synsets.

The file YAGO Classes contains rdf:type links between the DBpedia "things" and the YAGO classes, as well as labels of the YAGO classes and the YAGO subsumption hierarchy.
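
The existing wordnet-type links can be collected as ready-made candidate matches. The following sketch extracts them from a fragment shaped like the Air New Zealand example; it assumes that `&dbpedia;` expands to `http://dbpedia.org/resource/` and that the `dbpedia2` prefix stands for `http://dbpedia.org/property/`, and it writes URIs out in full so that a plain XML parser accepts the input.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DBPEDIA = "http://dbpedia.org/resource/"     # assumed expansion of &dbpedia;
DBPEDIA2 = "http://dbpedia.org/property/"    # assumed namespace of the dbpedia2 prefix
WN_INST = "http://www.w3.org/2006/03/wn/wn20/instances/"

SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:dbpedia2="{DBPEDIA2}">
 <rdf:Description rdf:about="{DBPEDIA}Air_New_Zealand">
  <dbpedia2:wordnet-type rdf:resource="{WN_INST}synset-airline-noun-2"/>
 </rdf:Description>
</rdf:RDF>"""

def existing_wordnet_links(xml_text):
    """Return (DBpedia thing, WordNet synset) pairs stated via wordnet-type."""
    links = []
    for desc in ET.fromstring(xml_text).iter(f"{{{RDF}}}Description"):
        thing = desc.get(f"{{{RDF}}}about")
        for link in desc.findall(f"{{{DBPEDIA2}}}wordnet-type"):
            links.append((thing, link.get(f"{{{RDF}}}resource")))
    return links

links = existing_wordnet_links(SAMPLE)
print(links)
```

Note that such a link gives the type of a "thing" (Air New Zealand is an airline), so it is better treated as evidence for a match than as an exact match in itself.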

RDF/OWL and SKOS representation

All DBpedia files are in RDF, and some are in SKOS.

Evaluation

We will evaluate the alignments between both pairs of thesauri: WordNet-GTAA and DBpedia-GTAA. Since mapping performance can vary over the different facets of the GTAA, we will evaluate samples of mappings for each facet and report on their quality separately.

To estimate precision, we will draw a random sample of 100 mappings from each GTAA facet in each alignment, resulting in 800 mappings (4 facets × 2 alignments × 100) to be evaluated in total per participating tool. For recall, we will create a reference alignment of 100 mappings per facet, per alignment. Considering the time-consuming nature of creating a reference alignment, we will only do this for selected facets.
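
The sampling scheme can be sketched as follows. This is only an illustration of the procedure described above, under the assumption that the mappings of one alignment are available as a list per facet; the function names, the fixed random seed and the dummy data are hypothetical, and the correctness judgement would in practice come from a human evaluator.

```python
import random

FACETS = ("Subject", "Person", "Name", "Location")

def sample_per_facet(mappings_by_facet, size=100, seed=0):
    """Draw up to `size` random mappings from each facet of one alignment."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return {
        facet: rng.sample(list(mappings), min(size, len(mappings)))
        for facet, mappings in mappings_by_facet.items()
    }

def precision(sample, is_correct):
    """Fraction of sampled mappings judged correct."""
    return sum(1 for m in sample if is_correct(m)) / len(sample) if sample else 0.0

def recall(found, reference):
    """Fraction of reference mappings that the tool also produced."""
    reference = set(reference)
    return len(reference & set(found)) / len(reference) if reference else 0.0

# Dummy alignment: 250 placeholder mappings per facet.
alignment = {facet: [f"{facet}-{i}" for i in range(250)] for facet in FACETS}
samples = sample_per_facet(alignment)
print({facet: len(sample) for facet, sample in samples.items()})
```

With four facets this yields 400 sampled mappings per alignment, hence 800 for the two alignments of one tool.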

Schedule

July 10th: tests are frozen
September 1st: participants send preliminary results (for interoperability-checking)
September 28th: participants send final results and papers
October 5th: organisers publish results for comments
October 25th: final results ready and OM-2009 workshop.

Acknowledgements

We would like to thank Chris Bizer, Fabian Suchanek and Jens Lehmann for their help with the DBpedia dataset. We would also like to thank Willem van Hage for his advice. We gratefully acknowledge the Netherlands Institute for Sound and Vision for allowing us to use the GTAA.

Contacts

Send any questions, comments, or suggestions to:

Laura Hollink: laurah at cs dot vu dot nl
Véronique Malaisé: vmalaise at few dot vu dot nl

Initial location of this page: http://www.cs.vu.nl/~laurah/oaei/2009/
