OAEI 2012::Library Track

Libraries play an important role in the linked data web, and they widely agree that linked data technologies are ideal for integrating the data of libraries around the world and for fostering collaboration on cataloguing among libraries. Library data consists not only of the vast amount of cataloguing data, but also -- and probably more interesting for other communities -- of authority data, i.e., normed descriptions of locations, events, persons, corporate bodies, and subject concepts. The subject concepts are usually organized in more or less hierarchical knowledge organization systems, together with semantic relations between the concepts. A thesaurus is such a knowledge organization system; it is used for indexing purposes and provides quasi-synonymous, descriptive labels for each concept. Thesauri are sometimes referred to as lightweight ontologies; however, we will see that this definition can be misleading.

Thesauri, and authority data in general, have a long history in libraries and are actively used and maintained by information professionals and domain experts. Due to their high quality and their long-term development, they could function as a "backbone of the Semantic Web".

Most thesauri are domain-dependent and specialized for use within a certain field, e.g., to index publications with an economic focus. In previous experiments, we examined the topical overlap between the two thesauri used in this challenge: TheSoz (social sciences) and STW (economics). Not only do they share a lot of concepts, but there is also a manually created alignment that can be used as a reference. Many thesauri exist that cover the same or overlapping domains, often in different languages. Multilingual thesauri are an important means to bridge the gap between catalogs in different languages, so that users can search for relevant literature in their own language. Another possibility is the creation of links between concepts across different thesauri, possibly in different languages. Such alignments -- or correspondences or cross-concordances -- can be exploited to mutually add further information to both thesauri and subsequently improve retrieval. Therefore, for many selected thesauri, alignments exist that were manually created by domain experts. Nevertheless, the automatic identification of alignments is strongly desired, mainly for two reasons: First, the manual creation of alignments between all existing thesauri is not feasible, so additional alignments have to be created automatically, possibly by exploiting existing alignments (e.g., their transitivity). Second, automatically created alignments can be used to improve and enhance existing alignments, after approval by a domain expert. This is necessary because most existing alignments are not complete, and even those that are supposed to be complete have to be maintained just like the thesauri themselves, i.e., a constant effort is required to keep them up to date.

This library track is a new track within OAEI. However, there was already a library track from 2007 to 2009 using different thesauri, as well as other thesaurus tracks like the food track and the environment track. A common motivation is that these tracks use a real-world scenario, i.e., real thesauri. For us, a further motivation is to better understand how thesauri differ from ontologies and how these differences affect state-of-the-art ontology matchers. We hope that the community accepts the challenge and that significant improvements follow that raise the quality of automatic alignments between thesauri. Furthermore, we will pass the matching results to the maintainers of the reference alignment as input for improving it. While a full manual evaluation of all matching results is certainly not feasible, this way we continuously improve the reference alignment and mitigate possible weaknesses and incompleteness.

Data set

For this track, we use specific versions of the thesauri that do not include extensions such as SKOS-XL or custom extensions.
There are two ways to use the data set: you can either download it or run it on the SEALS platform.

Download

Parameters for the SEALS platform

Library SKOS Testsuite

Repository: http://seals-test.sti2.at/tdrs-web/
Suite-ID: 05297c9c-ad72-4aba-995a-bda5f0bd2067
Version-ID: e497e6b5-b6ef-4293-be24-9dccda76d0ff

Library OWL Testsuite with complete reference alignment

Repository: http://seals-test.sti2.at/tdrs-web/
Suite-ID: 05297c9c-ad72-4aba-995a-bda5f0bd2067
Version-ID: ced13a5c-b055-4d11-b9c1-a4d2a53e1f7d

Transformation

Ontology matching systems taking part in the OAEI only work on OWL ontologies and are not (yet) ready to deal with the specifics of a thesaurus. To obtain first results and to lower the barrier to taking part in this challenge, we provide OWL versions of the thesauri, generated as follows:
skos:Concept ➔ owl:Class
skos:prefLabel, skos:altLabel ➔ rdfs:label
skos:scopeNote, skos:notation ➔ rdfs:comment
skos:narrower ➔ rdfs:superClassOf
skos:broader ➔ rdfs:subClassOf
skos:related ➔ rdfs:seeAlso

This transformation is obviously not lossless. First and foremost, within the ontology it is no longer recognizable which label is the preferred one and which ones are alternative labels. Since matching systems mostly have to focus on the labels, this transformation might lead to suboptimal results. There are, however, more fundamental differences between ontologies and thesauri that we show in the next section.
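The mapping above, and the resulting loss of the preferred/alternative label distinction, can be illustrated with a minimal sketch. It is not the organizers' conversion tool: triples are modelled as plain tuples of prefixed names, the concept `ex:c1` and its labels are hypothetical, and dropping all unmapped predicates is an assumption. The target `rdfs:superClassOf` is reproduced as written in the list; it is not a term defined by RDF Schema, and how it was serialized in the provided files is not stated here.

```python
# Predicate mapping copied from the list above (prefixed names as plain strings).
PREDICATE_MAP = {
    "skos:prefLabel": "rdfs:label",
    "skos:altLabel": "rdfs:label",
    "skos:scopeNote": "rdfs:comment",
    "skos:notation": "rdfs:comment",
    "skos:narrower": "rdfs:superClassOf",  # as written in the list
    "skos:broader": "rdfs:subClassOf",
    "skos:related": "rdfs:seeAlso",
}


def skos_to_owl(triples):
    """Rewrite (subject, predicate, object) triples according to the mapping."""
    converted = []
    for subject, predicate, obj in triples:
        if predicate == "rdf:type" and obj == "skos:Concept":
            converted.append((subject, "rdf:type", "owl:Class"))
        elif predicate in PREDICATE_MAP:
            converted.append((subject, PREDICATE_MAP[predicate], obj))
        # Assumption: triples with any other predicate are dropped.
    return converted


# Hypothetical concept with one preferred and one alternative label.
example = [
    ("ex:c1", "rdf:type", "skos:Concept"),
    ("ex:c1", "skos:prefLabel", "Labour market"),
    ("ex:c1", "skos:altLabel", "Job market"),
    ("ex:c1", "skos:broader", "ex:c0"),
]
converted = skos_to_owl(example)
labels = [o for _, p, o in converted if p == "rdfs:label"]
print(labels)
```

After the conversion both labels are plain `rdfs:label` values, so a matcher reading only the OWL version has no way to tell which one was the preferred label.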

URI Transformation

We introduced a new namespace for the OWL version to avoid any confusion with the original data. First, we changed the base namespaces as follows:

http://zbw.eu/stw/ ➔ http://stw.owl
http://lod.gesis.org/thesoz/ ➔ http://thesoz.owl

Within the original data, the concepts are divided into descriptors and non-descriptors. This distinction is dropped during the transformation into OWL. Thus, the corresponding encoding in the URIs is also omitted. Moreover, only letters, digits, and dots are permitted within the local part of the URI, so a hyphen, for instance, becomes a dot. Examples:

http://zbw.eu/stw/thsys/72180 ➔ http://stw.owl#72180
http://zbw.eu/stw/descriptor/16207-5 ➔ http://stw.owl#16207.5
http://lod.gesis.org/thesoz/classification/4.1.07 ➔ http://thesoz.owl#4.1.07
http://lod.gesis.org/thesoz/concept/10034303 ➔ http://thesoz.owl#10034303

These transformations simplify the matching, but the original data can nevertheless be reconstructed.
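The forward direction of the URI rewriting can be sketched as below. This is an illustration inferred from the four examples, not the organizers' code: the function name `to_owl_uri` is hypothetical, and replacing every character other than letters, digits, and dots with a dot is an assumption generalized from the single hyphen example.

```python
import re

# Base namespace changes as listed above.
BASE_MAP = {
    "http://zbw.eu/stw/": "http://stw.owl",
    "http://lod.gesis.org/thesoz/": "http://thesoz.owl",
}


def to_owl_uri(uri):
    """Map an original thesaurus URI to its OWL-version counterpart."""
    for old_base, new_base in BASE_MAP.items():
        if uri.startswith(old_base):
            # Keep only the last path segment; the type segment
            # (e.g. "descriptor", "thsys", "concept") is omitted.
            local = uri[len(old_base):].rsplit("/", 1)[-1]
            # Assumption: any character that is not a letter, digit,
            # or dot is replaced by a dot (e.g. "16207-5" -> "16207.5").
            local = re.sub(r"[^A-Za-z0-9.]", ".", local)
            return f"{new_base}#{local}"
    raise ValueError(f"URI outside the known namespaces: {uri}")


print(to_owl_uri("http://zbw.eu/stw/descriptor/16207-5"))
```

The sketch covers only the direction from the original data to the OWL version; the reverse direction would additionally need to know which type segment each identifier belonged to.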

Test data

The library track uses two real-world thesauri that are comparable in many respects. They have roughly the same size, were both originally developed in German, are both multilingual today, both have English translations, and, most importantly, despite coming from two different domains, they overlap in large areas. Not least, both are freely available in RDF using SKOS.

STW

The STW Thesaurus for Economics provides vocabulary on any economic subject: more than 6,000 standardized subject headings (skos:Concepts, with preferred labels in English and German) and 19,000 additional keywords (skos:altLabels) in both languages. The vocabulary was developed for indexing purposes in libraries and economic research institutions and includes technical terms used in law, sociology, or politics, as well as geographic names. The entries are richly interconnected by 16,000 skos:broader/narrower and 10,000 skos:related relations. An additional hierarchy of main categories provides a high-level overview. The vocabulary is maintained on a regular basis by ZBW German National Library of Economics - Leibniz Centre for Economics and has been translated into SKOS.

TheSoz

The Thesaurus for the Social Sciences (TheSoz) serves as a crucial instrument for indexing documents and research information in the social sciences. It contains about 12,000 keywords overall, of which 8,000 are standardized subject headings (in English and German) and 4,000 are additional keywords. The thesaurus covers all topics and sub-disciplines of the social sciences. Additionally, terms from associated and related disciplines are included in order to support accurate and adequate indexing of interdisciplinary, practice-oriented, and multi-cultural documents. The thesaurus is owned and maintained by GESIS - Leibniz Institute for the Social Sciences and is available in SKOS.

Goal

Since the existing mapping of STW and TheSoz was manually created by domain experts in the KoMoHe project and does not cover the changes and enhancements in both thesauri since 2006, the evaluation is supposed to show whether, and to what degree, the creation of the alignment can be automated. Moreover, the results could possibly inform further automatic or semi-automatic mapping procedures to be implemented for regular maintenance of the mapping. Additionally, we would like to see how current state-of-the-art matching systems are able to deal with lightweight ontologies, which are very often used in practice. Due to the large number of concepts together with the many semantic relations and additional keywords, the matching systems need to find a way to deal with these conditions. Thus, it should become clear which matching techniques are best suited for such real-world tasks.

Organizers

Dominique Ritze (Mannheim University Library) dominique[.]ritze[at]bib[.]uni-mannheim[.]de
Kai Eckert (Mannheim University Library)
Benjamin Zapilko (GESIS)
Joachim Neubert (ZBW)

Original page: http://web.informatik.uni-mannheim.de/oaei-library/2012/ [cached: 24/06/2014]
