Very Large Crosslingual Resources

The purpose of this task is to match three resources to each other: the Thesaurus of the Netherlands Institute for Sound and Vision called the GTAA, WordNet and DBPedia. The GTAA is in Dutch, while WordNet is in English. DBPedia contains labels in both languages and can therefore be used as a mediator between the two.
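The mediator role of DBPedia can be pictured as a two-step label lookup: a Dutch GTAA label is compared with the Dutch labels of DBPedia "things", and the English labels of the things that match are then compared with WordNet sense labels. The following is only a minimal sketch under simplifying assumptions (exact matching of normalised labels, data held in in-memory dictionaries). The function names, the dictionary layout and the sample identifiers and labels are hypothetical and are not part of the task data.

```python
from collections import defaultdict


def normalise(label):
    """Lower-case a label and collapse whitespace for loose comparison."""
    return " ".join(label.lower().split())


def pivot_candidates(gtaa_labels, dbpedia_labels, wordnet_labels):
    """Propose (GTAA concept, WordNet synset, DBPedia thing) triples.

    gtaa_labels:    {concept id: [Dutch labels]}
    dbpedia_labels: {thing id: {"nl": [labels], "en": [labels]}}
    wordnet_labels: {synset id: [English sense labels]}
    """
    # Index DBPedia things by their Dutch labels.
    nl_index = defaultdict(set)
    for thing, by_language in dbpedia_labels.items():
        for label in by_language.get("nl", []):
            nl_index[normalise(label)].add(thing)

    # Index WordNet synsets by their English sense labels.
    en_index = defaultdict(set)
    for synset, labels in wordnet_labels.items():
        for label in labels:
            en_index[normalise(label)].add(synset)

    candidates = set()
    for concept, labels in gtaa_labels.items():
        for label in labels:
            for thing in nl_index.get(normalise(label), ()):
                for en_label in dbpedia_labels[thing].get("en", []):
                    for synset in en_index.get(normalise(en_label), ()):
                        candidates.add((concept, synset, thing))
    return sorted(candidates)


# Hypothetical sample data, for illustration only.
gtaa = {"#Subject_windenergie": ["windenergie"]}
dbpedia = {"Wind_power": {"nl": ["Windenergie"], "en": ["Wind power"]}}
wordnet = {"synset-wind_power-noun-1": ["wind power"]}

candidates = pivot_candidates(gtaa, dbpedia, wordnet)
print(candidates)
```

Candidates found this way are only suggestions: exact label equality misses spelling variants and says nothing about whether the relation is exact, broader or narrower.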

All three are broad in scope. WordNet is a generic English lexical database, DBPedia is a semantic web version of Wikipedia and GTAA contains all concepts needed to index Dutch broadcast television.

The purpose of this task is twofold: (1) to map thesauri in different languages and (2) to map resources that are large, rich in semantics but weak in formal structure, i.e. typical of the resources found on the Web.

We are aiming for SKOS relations: skos:exactMatch, skos:broadMatch, skos:narrowMatch.

Data sets

GTAA

General Information

The GTAA thesaurus (a Dutch acronym for "Common Thesaurus [for] Audiovisual Archives") is used at the Netherlands Institute for Sound and Vision, the archive of Dutch public broadcasting, to index its documents. The thesaurus consists of 6 facets that concern the description of:

the topic of a TV program: keywords are selected from the Subject facet
the main people mentioned in a TV program: keywords are selected from the Person facet
the main "Named Entities" mentioned in a TV program (Corporation names, music bands etc): keywords are selected from the Name facet
the main locations mentioned in a TV program or the place where it has been created: keywords are selected from the Location facet
the genre of a TV program: keywords are selected from the Genre facet
the makers and presenters of a TV program: keywords are selected from the Makers facet

The GTAA contains approximately 160,000 terms: ~3,800 Subject keywords, ~97,000 Persons, ~27,000 Names, ~14,000 Locations, 113 Genres and ~18,000 Makers. In this mapping experiment, we consider only the following four facets: Subject, Person, Name and Location. The thesaurus was originally structured along the lines of the ISO 2788 standard for thesauri, which is commonly used by companies and institutions. We have converted it to the SKOS reference model. It contains Broader Terms, Narrower Terms, Related Terms, Scope Notes (notes that help to clarify or define a term) and Preferred and Alternative Labels. Terms in all facets of the GTAA can have Related Terms and Scope Notes, but only the Subject facet has Alternative Labels and Broader Term/Narrower Term relations, the latter organizing the terms into a hierarchy. The hierarchical organisation of the Subject facet is not very dense: 80% of the terms are not involved in hierarchies deeper than 3 levels (the average hierarchy depth is 1.3).

SKOS Representation

The SKOS version of the GTAA consists of skos:Concepts related by skos:broader, skos:narrower and skos:related properties. In addition, some concepts are clarified with a skos:scopeNote. Samples of the Subject, Person, Name and Location datasets are directly available for inspection, and examples from each facet are given below. Each facet is in a different file, but every concept has a skos:inScheme property that names the facet it belongs to, so the data can be merged into a single file if necessary.

Examples of GTAA concepts

<skos:Concept rdf:about="#Subject_alternatieveenergie">
    <skos:prefLabel>alternatieve energie</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
    <skos:broader rdf:resource="#Subject_energie"/>
    <skos:narrower rdf:resource="#Subject_biobrandstoffen"/>
    <skos:narrower rdf:resource="#Subject_windenergie"/>
    <skos:narrower rdf:resource="#Subject_zonne-energie"/>
    <skos:related rdf:resource="#Subject_alcohol"/>
    <skos:related rdf:resource="#Subject_energiebeleid"/>
    <skos:related rdf:resource="#Subject_energiebronnen"/>
    <skos:related rdf:resource="#Subject_getijden"/>
    <skos:related rdf:resource="#Subject_milieubeleid"/>
    <skos:related rdf:resource="#Subject_waterkracht"/>
    <skos:related rdf:resource="#Subject_waterstof"/>
    <skos:altLabel>getijdenenergie</skos:altLabel>
</skos:Concept>

<skos:Concept rdf:about="#Person_BeatrixkoninginNederland">
    <skos:prefLabel>Beatrix (koningin Nederland)</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Person"/>
    <skos:related rdf:resource="#Person_BeatrixkroonprinsesNederland"/>
    <skos:scopeNote>va30-4-80</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Name_Abba">
    <skos:prefLabel>Abba</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
    <skos:scopeNote>popgroep Zweden</skos:scopeNote>
</skos:Concept>

<skos:Concept rdf:about="#Location_Amsterdam">
    <skos:prefLabel>Amsterdam</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Location"/>
    <skos:scopeNote>Nederland</skos:scopeNote>
</skos:Concept>
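As an illustration of how such concepts might be read and how skos:inScheme keeps the facets apart once the files are merged, here is a minimal sketch using only the Python standard library. It assumes that the concepts are wrapped in an rdf:RDF element and that the skos prefix is bound to the SKOS core namespace; neither is shown in the excerpts above. The function names read_concepts and by_facet are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: the GTAA files bind the skos prefix to the SKOS core namespace.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Two shortened concepts from the examples, wrapped in an assumed rdf:RDF root.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="#Subject_alternatieveenergie">
    <skos:prefLabel>alternatieve energie</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Subject"/>
    <skos:broader rdf:resource="#Subject_energie"/>
    <skos:altLabel>getijdenenergie</skos:altLabel>
  </skos:Concept>
  <skos:Concept rdf:about="#Name_Abba">
    <skos:prefLabel>Abba</skos:prefLabel>
    <skos:inScheme rdf:resource="http://www.beeldengeluid.nl/Thesaurus/Name"/>
    <skos:scopeNote>popgroep Zweden</skos:scopeNote>
  </skos:Concept>
</rdf:RDF>"""


def read_concepts(xml_text):
    """Return {concept URI: {property name: [values]}} for each skos:Concept."""
    concepts = {}
    for node in ET.fromstring(xml_text).iter(f"{{{SKOS}}}Concept"):
        properties = defaultdict(list)
        for child in node:
            name = child.tag.split("}")[1]  # drop the namespace part of the tag
            # Links carry an rdf:resource attribute; literals carry text.
            value = child.get(f"{{{RDF}}}resource") or (child.text or "").strip()
            properties[name].append(value)
        concepts[node.get(f"{{{RDF}}}about")] = dict(properties)
    return concepts


def by_facet(concepts):
    """Group concept URIs by the facet named at the end of skos:inScheme."""
    facets = defaultdict(list)
    for uri, properties in concepts.items():
        for scheme in properties.get("inScheme", []):
            facets[scheme.rsplit("/", 1)[-1]].append(uri)
    return dict(facets)


concepts = read_concepts(SAMPLE)
facets = by_facet(concepts)
print(facets)
```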

Obtaining the GTAA dataset

The GTAA is copyrighted material. To obtain the full datasets, please download and sign the user agreement. Fax the signed agreement to +31 20 5987728, for the attention of Laura Hollink, or alternatively scan the form and e-mail it to laurah at cs dot vu dot nl. You will then receive the password for the complete dataset by e-mail.

WordNet

General Information

WordNet is a lexical database of the English language. Its main building blocks are synsets: groups of words with a synonymous meaning. An example of a synset is {cliff, drop, drop-off}, described as "a steep high face of rock". In this task, the goal is to match synsets (not words). Four kinds of synsets are distinguished, containing nouns, verbs, adjectives or adverbs. In this task, we are only interested in noun-synsets. There are 7 types of relations between noun-synsets:

hyponymOf
memberMeronymOf
substanceMeronymOf
partMeronymOf
classifiedByTopic
classifiedByUsage
classifiedByRegion

The main hierarchy in WordNet is built on hyponym relations between synsets, which are similar to subclass relations. Other frequently occurring relations between synsets are meronyms, which denote part-of relations. The classifiedBy relations group synsets according to their topic, region or usage.

RDF/OWL Representation

W3C has translated WordNet version 2.0 into RDF/OWL. The details of this translation are published in a working draft. The working draft contains a download section, from which the complete dataset can be downloaded. For convenience, the data is split into separate files, one for each relation. Participants in this OAEI task are free to use (or ignore) any of the files and relation types. Although the goal is to match noun-synsets, relationships between verbs, adverbs and adjectives may be used in the process.

We recommend using at least the following: synsets, their labels (e.g. the synonymous words cliff, drop, drop-off), their descriptive sentences called 'glosses' (e.g. "a steep high face of rock"), and the hyponym hierarchy between synsets, which is comparable to a subclass hierarchy. You would need to download four files:

The file WordNet Basic Schema contains the RDF-Schema of the dataset. This file states, for example, that a NounSynset is a subClassOf a Synset, that the property meronymOf is the inverse of holonymOf, etc.

The file Synsets contains the types of synsets (nounSynsets, verbSynsets, adjectiveSynsets or adverbSynsets), a label and an ID number. The latter is a link to the original Princeton version of WordNet and can safely be ignored. Synsets have the form: http://www.w3.org/2006/03/wn/wn20/instances/synset-cliff-noun-1

<wn20schema:NounSynset rdf:about="&wn20instances;synset-cliff-noun-1"
    rdfs:label="cliff">
    <wn20schema:synsetId>108670141</wn20schema:synsetId>
</wn20schema:NounSynset>

The file senseLabels contains all labels of the synsets (the Synsets file gives only one for each synset).

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
</rdf:Description>

The file hyponymy contains the hyponym (i.e. subclass) hierarchy between synsets.

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>
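The senseLabels and hyponymy excerpts above can be loaded into two lookup tables, one from a synset to its labels and one from a synset to its hypernyms. The sketch below uses only the Python standard library. It assumes that the wn20instances entity is declared in a DOCTYPE, as in the sample header, and that the wn20schema prefix is bound to http://www.w3.org/2006/03/wn/wn20/schema/. The helper name read_property is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WN20SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"
WN20INSTANCES = "http://www.w3.org/2006/03/wn/wn20/instances/"

# Assumed file header: the DOCTYPE declares the &wn20instances; entity.
HEADER = f"""<!DOCTYPE rdf:RDF [<!ENTITY wn20instances "{WN20INSTANCES}">]>
<rdf:RDF xmlns:rdf="{RDF}" xmlns:wn20schema="{WN20SCHEMA}">"""

SENSE_LABELS = HEADER + """
<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:senseLabel>cliff</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop</wn20schema:senseLabel>
  <wn20schema:senseLabel>drop-off</wn20schema:senseLabel>
</rdf:Description>
</rdf:RDF>"""

HYPONYMY = HEADER + """
<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>
</rdf:RDF>"""


def read_property(xml_text, prop, literal):
    """Map each described synset to the values of one wn20schema property."""
    values = {}
    for node in ET.fromstring(xml_text).findall(f"{{{RDF}}}Description"):
        subject = node.get(f"{{{RDF}}}about")
        for child in node.findall(f"{{{WN20SCHEMA}}}{prop}"):
            # Literal values are element text; links are rdf:resource attributes.
            value = child.text if literal else child.get(f"{{{RDF}}}resource")
            values.setdefault(subject, []).append(value)
    return values


labels = read_property(SENSE_LABELS, "senseLabel", literal=True)
hypernyms = read_property(HYPONYMY, "hyponymOf", literal=False)

cliff = WN20INSTANCES + "synset-cliff-noun-1"
print(labels[cliff], hypernyms[cliff])
```

The parser expands the entity references, so the keys of both tables are full synset URIs and can be joined directly.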

In a similar fashion, there are files for the other types of relations in the download section of the W3C working draft.

Translation to SKOS

The original WordNet model is rich and well designed. However, some tools may have problems with the fact that the synsets are instances rather than classes. Therefore, for the purpose of this OAEI task, we have translated the hyponym hierarchy into a skos:broader hierarchy, making the synsets skos:Concepts.

<rdf:Description rdf:about="&wn20instances;synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</rdf:Description>

becomes

<skos:Concept rdf:about="&wn20instances;synset-cliff-noun-1">
  <skos:broader rdf:resource="&wn20instances;synset-geological_formation-noun-1"/>
</skos:Concept>
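For readers who want to reproduce or adapt this rewrite, it can be approximated in a few lines. This is a sketch of the idea only, not the script used to produce the downloadable file. It assumes the SKOS core namespace, and the function name hyponymy_to_skos is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
WN20SCHEMA = "http://www.w3.org/2006/03/wn/wn20/schema/"
WN20INSTANCES = "http://www.w3.org/2006/03/wn/wn20/instances/"
# Assumption: the translated file uses the SKOS core namespace.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# The hyponymy excerpt, written with full URIs instead of entities.
HYPONYMY = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:wn20schema="{WN20SCHEMA}">
<rdf:Description rdf:about="{WN20INSTANCES}synset-cliff-noun-1">
  <wn20schema:hyponymOf rdf:resource="{WN20INSTANCES}synset-geological_formation-noun-1"/>
</rdf:Description>
</rdf:RDF>"""


def hyponymy_to_skos(xml_text):
    """Rewrite hyponymOf statements as skos:broader links between skos:Concepts."""
    source = ET.fromstring(xml_text)
    target = ET.Element(f"{{{RDF}}}RDF")
    for description in source.findall(f"{{{RDF}}}Description"):
        links = description.findall(f"{{{WN20SCHEMA}}}hyponymOf")
        if not links:
            continue  # nothing to translate for this synset
        concept = ET.SubElement(target, f"{{{SKOS}}}Concept")
        concept.set(f"{{{RDF}}}about", description.get(f"{{{RDF}}}about"))
        for link in links:
            broader = ET.SubElement(concept, f"{{{SKOS}}}broader")
            broader.set(f"{{{RDF}}}resource", link.get(f"{{{RDF}}}resource"))
    return target


# Use readable prefixes when serialising the result.
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)

result = hyponymy_to_skos(HYPONYMY)
print(ET.tostring(result, encoding="unicode"))
```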

This translation may be used as an alternative to the hyponym file and is available for download.

DBPedia

General Information

DBPedia is an extremely rich dataset. It contains 2.18 million resources or "things", each tied to an article in the English-language Wikipedia. The "things" are described by titles and abstracts in English and often also in Dutch. DBPedia "things" have numerous properties, such as categories, properties derived from the Wikipedia 'infoboxes', links between pages within and outside Wikipedia, etc. The purpose of this task is to map the DBPedia "things" to WordNet synsets and GTAA concepts.

Note that some DBPedia "things" are already tied to WordNet synsets with a wordnet-type property. For example:

<rdf:Description rdf:about="&dbpedia;Air_New_Zealand">
    <dbpedia2:wordnet-type rdf:resource="&wn20instances;synset-airline-noun-2"/>
</rdf:Description>

In addition, most DBPedia "things" are instances in the YAGO ontology. Classes in the YAGO ontology correspond to Wikipedia categories and WordNet synsets. For example:

<rdf:Description rdf:about="&dbpedia;%22Crazy%22_Joe_Davola">
    <rdf:type rdf:resource="&yago;FictionalCharacter109587565"/>
</rdf:Description>

This information may be used to find matches (skos:exactMatch, skos:broadMatch or skos:narrowMatch) between DBPedia "things" and WordNet synsets.
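One way to exploit the YAGO type information is to split a class name such as FictionalCharacter109587565 into a readable label and its trailing digits, so that the label can be compared with WordNet sense labels. The sketch below assumes, from the single example above, that WordNet-derived class names are CamelCase words followed by digits. The digits are returned as an opaque string, and no claim is made that they equal the synsetId values of the W3C WordNet 2.0 files. The function name split_yago_class is hypothetical.

```python
import re


def split_yago_class(local_name):
    """Split a YAGO class name into lower-case words and its trailing digits.

    Returns (words, digits); digits is None when the name has no numeric suffix.
    """
    match = re.fullmatch(r"(.*?)(\d*)", local_name)
    name, digits = match.group(1), match.group(2)
    # Insert a space at every lower-case/upper-case boundary of the CamelCase name.
    words = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name).lower()
    return words, digits or None


print(split_yago_class("FictionalCharacter109587565"))
```

The resulting words can be looked up among the WordNet sense labels. The synset found this way is a type of the DBPedia "thing", so it is more likely to be a candidate for a broader match than for an exact match.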

RDF/OWL and SKOS representation

All information can be downloaded from the DBPedia download site. For each type of property (title, abstracts, infobox properties, links to pages, etc.), there is a separate file that can be downloaded. Small preview files are also provided.

Every type of relation from this download site can be used in this OAEI task, but you are not obliged to use them all: pick the information that you think is useful and that your tool can handle. A reasonable choice is to use at least the following information: "things" and their labels, comments, and type information, in addition to the YAGO class hierarchy. The model is as follows:

DBPedia "thing"
    -- rdfs:label "its title"
    -- rdfs:comment "its abstract"
    -- rdf:type yago:SomeClass

yago:SomeClass
    -- rdfs:subClassOf yago:SomeSuperClass
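The model above can be held in memory with one small record per "thing" and a table of subclass links. Walking that table upwards from the rdf:type classes gives every YAGO class a thing belongs to. This is a minimal sketch: the Thing record, the all_classes function and the class names in the sample data are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Thing:
    """A DBPedia "thing" with the properties of the model above."""
    uri: str
    labels: dict = field(default_factory=dict)    # language -> rdfs:label
    comments: dict = field(default_factory=dict)  # language -> rdfs:comment
    types: list = field(default_factory=list)     # rdf:type YAGO classes


def all_classes(thing, subclass_of):
    """Return the thing's rdf:type classes plus all their superclasses."""
    seen = set()
    stack = list(thing.types)
    while stack:
        cls = stack.pop()
        if cls not in seen:  # also protects against cycles in the hierarchy
            seen.add(cls)
            stack.extend(subclass_of.get(cls, []))
    return seen


# Hypothetical sample data, for illustration only.
subclass_of = {
    "yago:Airline": ["yago:Company"],
    "yago:Company": ["yago:Organization"],
}
thing = Thing(
    uri="dbpedia:Air_New_Zealand",
    labels={"en": "Air New Zealand"},
    types=["yago:Airline"],
)

classes = all_classes(thing, subclass_of)
print(sorted(classes))
```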

The file 'Titles' contains labels, in English and in Dutch, which are the titles of the corresponding Wikipedia articles. The file 'Short Abstracts' contains short abstracts in English and in Dutch. The file 'YAGO classes' contains rdf:type links between the DBPedia "things" and the YAGO classes, and the file 'YAGO Class Hierarchy' contains the YAGO class hierarchy. In addition, the Wikipedia 'Articles Categories' might be useful:

DBPedia "thing"
    -- skos:subject dbpedia:SomeCategory

Evaluation

We will evaluate mappings between all three pairs of thesauri. Evaluation of both precision and recall will be based on a sample. More information will follow soon.
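Since the evaluation procedure has not been published yet, the following is only a generic sketch of how precision and recall can be computed once a reference sample exists. It assumes that mappings are written as (source, relation, target) triples and that a found mapping counts as correct only when the whole triple appears in the reference. The identifiers in the sample data are hypothetical.

```python
def precision_recall(found, reference):
    """Compare found mappings with reference mappings judged on a sample."""
    found, reference = set(found), set(reference)
    correct = found & reference
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical sample data, for illustration only.
reference = {
    ("gtaa:Location_Amsterdam", "skos:exactMatch", "dbpedia:Amsterdam"),
    ("gtaa:Name_Abba", "skos:exactMatch", "dbpedia:ABBA"),
    ("gtaa:Subject_windenergie", "skos:exactMatch", "dbpedia:Wind_power"),
    ("gtaa:Subject_energie", "skos:exactMatch", "dbpedia:Energy"),
}
found = {
    ("gtaa:Location_Amsterdam", "skos:exactMatch", "dbpedia:Amsterdam"),
    ("gtaa:Name_Abba", "skos:exactMatch", "dbpedia:ABBA"),
    ("gtaa:Subject_windenergie", "skos:exactMatch", "dbpedia:Windmill"),
    ("gtaa:Subject_energie", "skos:broadMatch", "dbpedia:Energy"),
}

precision, recall = precision_recall(found, reference)
print(precision, recall)
```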

Schedule

May 20th - May 21st: Datasets are made available and first task description is finalized.
May 20th - June 14th: all comments and questions are very welcome.
June 15th: end of commenting period
July 1st: datasets and task description are frozen
September 1st: participants send preliminary results (for interoperability-checking)
September 26th: participants send final results and papers
October 10th: organisers publish results for comments
October 26th-27th: final results ready and OM-2008 workshop.

Acknowledgements

We would like to thank Chris Bizer, Fabian Suchanek and Jens Lehmann for their help with the DBPedia dataset. We would also like to thank Willem van Hage for his advice. We gratefully acknowledge the Netherlands Institute for Sound and Vision for allowing us to use the GTAA.

Contacts

Send any questions, comments, or suggestions to:

Laura Hollink: laurah at cs dot vu dot nl
Véronique Malaisé: vmalaise at few dot vu dot nl

Initial location of this page: http://www.few.vu.nl/~laurah/oaei/.