Instance data matching

This track aims at evaluating instance data matchers.

What is instance data?

In the context of the Semantic Web, instances are resources described in RDF and typed according to an ontology. Identifying equivalent resources is crucial for the proper integration of Web data. This task is also known as identity recognition, object consolidation, or, in the database world, record linkage.

What is an instance data matcher?

An instance data matcher is a system able to take as input two instance sets (two RDF graphs), and to return an alignment between these sets.
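As a rough illustration of this input/output contract, the sketch below implements a deliberately naive matcher in Python. The dictionary-based instance representation, the `label` property, and the `match_by_label` function are assumptions made for this example only; real matchers operate on RDF graphs and use far richer similarity measures.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: instance URI -> {property name: value}.
InstanceSet = Dict[str, Dict[str, str]]
# A correspondence: (URI in first set, URI in second set, confidence).
Correspondence = Tuple[str, str, float]


def normalize(value: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(value.lower().split())


def match_by_label(source: InstanceSet, target: InstanceSet,
                   prop: str = "label") -> List[Correspondence]:
    """Align instances whose `prop` values are equal after normalization."""
    # Index the target set by normalized property value.
    index: Dict[str, List[str]] = {}
    for uri, props in target.items():
        if prop in props:
            index.setdefault(normalize(props[prop]), []).append(uri)

    # Look up each source instance in the index.
    alignment: List[Correspondence] = []
    for uri, props in source.items():
        if prop in props:
            for other in index.get(normalize(props[prop]), []):
                alignment.append((uri, other, 1.0))
    return alignment


source = {
    "http://example.org/a/1": {"label": "Ontology  Matching"},
    "http://example.org/a/2": {"label": "Record Linkage"},
}
target = {
    "http://example.org/b/9": {"label": "ontology matching"},
    "http://example.org/b/8": {"label": "Data Integration"},
}
alignment = match_by_label(source, target)
```

Here `alignment` holds a single correspondence between the two "ontology matching" instances; the other instances stay unmatched.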

Data sets

The evaluation is performed on three benchmarks, using the datasets described below. The first two benchmarks each contain three datasets, which need to be matched pairwise. The third benchmark contains 37 datasets, each of which is a variant of a reference dataset. Each of these datasets needs to be matched to the reference dataset.

A-R-S benchmark

This benchmark (download Rexa and eprints, swetoDBLP) includes three datasets containing instances from the domain of scientific publications:

eprints - this dataset contains papers produced within the AKT research project and extracted using an HTML-wrapper from the source web-site.
Rexa - this dataset was extracted from the search results of the search server.
SWETO-DBLP - a version of the DBLP dataset.

T-S-D benchmark

This benchmark (download) includes three datasets covering several topics and structured according to different ontologies:

TAP - a benchmark dataset.
SWETO-testbed - another benchmark dataset.
DBPedia 3.2 - a version of DBPedia structured according to a manually constructed ontology.

Ontologies are included in the benchmarks.

IIMB benchmark

A generated benchmark built from a single dataset by modifying it according to various criteria. The benchmark is generated using the ISLab Instance Matching Benchmark, applied to a dataset from the OKKAM project.
The testbed (download) provides OWL/RDF data about actors, sport persons, and business firms taken from OKKAM. The main directory contains 37 sub-directories and the original ABox and the associated TBox (abox.owl and tbox.owl). The original ABox contains about 300 different instances.

Each sub-directory contains a modified ABox (abox.owl + tbox.owl) and the corresponding mapping with the instances in the original ABox (refalign.rdf). The introduced modifications are the following:

Directory 001: Contains an identical copy of the original ABox (only the instance IDs are randomly changed).
Directory 002 - Directory 010: Value transformations (i.e., simulation of typographical errors, use of different standards for representing the same information). In order to simulate typographical errors, property values of each instance are randomly modified. Modifications are applied to different subsets of the instances' property values and with different levels of difficulty (i.e., introducing a different number of errors).
Directory 011 - Directory 019: Structural transformations (i.e., deletion of one or more values, transformation of datatype properties into object properties, separation of a single property into several properties).
Directory 020 - Directory 029: Logical transformations (i.e., instantiation of identical individuals into different subclasses of the same class, instantiation of identical individuals into disjoint classes, instantiation of identical individuals into different classes of an explicitly declared class hierarchy).
Directory 030 - Directory 037: Several combinations of the previous transformations.
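
To give an intuition of what a value transformation with a configurable level of difficulty might look like, here is a minimal Python sketch. It is not the generator used to build IIMB; the `simulate_typos` function, its adjacent-character swap strategy, and the sample value are assumptions chosen purely for illustration.

```python
import random


def simulate_typos(value: str, n_errors: int, seed: int = 0) -> str:
    """Return `value` with `n_errors` random swaps of adjacent characters.

    Hypothetical helper: a higher `n_errors` stands for a higher
    level of difficulty. A fixed seed keeps the result reproducible.
    """
    chars = list(value)
    if len(chars) < 2:
        return value  # nothing to swap
    rng = random.Random(seed)
    for _ in range(n_errors):
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


original = "Example Actor Name"
noisy = simulate_typos(original, n_errors=2, seed=42)
```

Swapping only rearranges characters, so `noisy` has the same length and the same characters as `original`, in a possibly different order.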

Modalities

Participants are expected to provide an alignment between the datasets of each benchmark. Alignments must be provided in the Alignment format.

Evaluation will measure the following criteria:

Precision. The number of correctly retrieved mappings / the number of retrieved mappings.
Recall. The number of correctly retrieved mappings / the number of expected mappings.
F-measure. 2 x (precision x recall) / (precision + recall).
Fall-out. The number of incorrectly retrieved mappings / the number of non-expected mappings.
Execution time.
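
The first four criteria can be sketched in Python as follows. This is an illustrative computation, not the official evaluation tool: the `evaluate` function and the `n_candidate_pairs` parameter are assumptions. In particular, the number of non-expected mappings is taken here to be the total number of candidate pairs minus the number of expected mappings.

```python
from typing import Dict, Set, Tuple

Mapping = Tuple[str, str]  # (instance in first dataset, instance in second)


def evaluate(retrieved: Set[Mapping], expected: Set[Mapping],
             n_candidate_pairs: int) -> Dict[str, float]:
    """Compute precision, recall, F-measure and fall-out for an alignment.

    `n_candidate_pairs` is an assumed universe size, e.g. the product of
    the numbers of instances in the two datasets being matched.
    """
    correct = len(retrieved & expected)
    incorrect = len(retrieved - expected)
    non_expected = n_candidate_pairs - len(expected)

    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(expected) if expected else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    fall_out = incorrect / non_expected if non_expected > 0 else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "fall_out": fall_out}


# Four instances on each side, hence 16 candidate pairs.
expected = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
retrieved = {("a1", "b1"), ("a2", "b2"), ("a3", "b4"), ("a4", "b3")}
scores = evaluate(retrieved, expected, n_candidate_pairs=16)
```

In this toy case two of the four retrieved mappings are correct, so precision, recall and F-measure are all 0.5, and fall-out is 2 / 12.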

For this track, we allow matchers to be tuned for each testbed. In the results, tuned matchers will be distinguished from pure matchers, which work with a single configuration for all testbeds.

Schedule

The schedule is given at https://oaei.ontologymatching.org/2009/.

Tool

The Alignment API can be used to provide the alignment in the Alignment format. Using the Alignment API, alignments can be grounded into a set of owl:sameAs triples and encapsulated into a named graph with voiD annotations.
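
To make the grounding step concrete, the Python sketch below serializes a list of instance pairs as owl:sameAs triples in N-Triples syntax. It does not use the Alignment API and does not produce the named graph or its annotations; the `to_same_as_ntriples` function and the example URIs are assumptions for illustration.

```python
from typing import Iterable, Tuple

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def to_same_as_ntriples(pairs: Iterable[Tuple[str, str]]) -> str:
    """Serialize (source URI, target URI) pairs as owl:sameAs N-Triples.

    Assumes the URIs contain no characters that need escaping.
    """
    lines = [f"<{s}> <{OWL_SAME_AS}> <{t}> ." for s, t in pairs]
    return "\n".join(lines) + ("\n" if lines else "")


pairs = [("http://example.org/a/1", "http://example.org/b/9")]
ntriples = to_same_as_ntriples(pairs)
```

Each pair yields exactly one line of output, which any N-Triples parser can load.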

Results

Results of the experiment can be found here.

Acknowledgements

Many thanks to

The OKKAM project
The Linking Open Data initiative

for providing nice evaluation data.

Contacts

Alfio Ferrara - ferrara[at]dico[dot]unimi[dot]it
Andriy Nikolov - A[dot]Nikolov[at]open[dot]ac[dot]uk
François Scharffe - francois[dot]scharffe[at]inrialpes[dot]fr

Initial location of this page: http://www.scharffe.fr/events/oaei2009/index.html