Instance Matching at OAEI 2011 (IM@OAEI2011)

General description

IM@OAEI2011 is an initiative for the evaluation of instance matching techniques and tools. IM@OAEI2011 is a track of the Ontology Alignment Evaluation Initiative (OAEI - http://oaei.ontologymatching.org/2011), held every year in collaboration with the Ontology Matching Workshop at ISWC (http://om2011.ontologymatching.org).

IM@OAEI2011 is focused on RDF and OWL data in the context of the Semantic Web. Participants will be asked to execute their algorithms against various datasets and their results will be evaluated by comparing them with a pre-defined reference alignment provided by IM@OAEI2011. Results will be evaluated according to standard precision and recall metrics.
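As a rough illustration of the evaluation described above, the sketch below computes precision and recall for a set of found correspondences against a reference alignment. It assumes that a correspondence can be represented as a pair of URIs; the function name `precision_recall` and the sample URIs are hypothetical and are not part of the official evaluation tooling.

```python
from typing import Set, Tuple

# Assumption: a correspondence is a (source URI, target URI) pair.
Correspondence = Tuple[str, str]


def precision_recall(found: Set[Correspondence],
                     reference: Set[Correspondence]) -> Tuple[float, float]:
    """Return (precision, recall) of `found` with respect to `reference`."""
    correct = len(found & reference)
    # Precision: fraction of returned correspondences that are in the reference.
    precision = correct / len(found) if found else 0.0
    # Recall: fraction of reference correspondences that were returned.
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example data, for illustration only.
reference = {
    ("nyt:a", "dbpedia:A"),
    ("nyt:b", "dbpedia:B"),
    ("nyt:c", "dbpedia:C"),
    ("nyt:d", "dbpedia:D"),
}
found = {
    ("nyt:a", "dbpedia:A"),
    ("nyt:b", "dbpedia:B"),
    ("nyt:c", "dbpedia:X"),  # wrong target, counts against precision
}
precision, recall = precision_recall(found, reference)
```

Here two of the three returned correspondences are correct, and two of the four reference correspondences are found.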

For each task described below, we provide both the datasets to interlink and the reference alignments. Participants can thus tune their algorithms before sending us their evaluation results.

Datasets

Use the following datasets as input for your matching system. You can already test with this data and report problems (send reports to the contact email). The datasets will be frozen on July 1st.

Interlinking New-York Times Data - DOWNLOAD

Participants are requested to rebuild the links within the NYT dataset itself (see data.nytimes.com), and from the NYT dataset to the external data sources DBPedia, Geonames and Freebase. Reference alignments are provided for each resource as RDF alignments. These alignments are extracted from the links provided and curated by NYT. The whole NYT dataset is available under a Creative Commons license - CC BY 3.0.

Here are a few statistics on the datasets and their interlinks. The NYT dataset contains three areas: people, organizations and locations.

Stat	People	Organizations	Locations
Nr of NYT resources	9958	6088	3840
Total nr of sameAs links	14884	8003	8786¹
Links to Freebase	4979	3044	1920
Links to DBPedia	4977	1949	1920
Links to NYT	4979	3044	1920
Links to Geonames	0	0	1789

(1) 12 links actually do not belong to the interconnected datasets and are not considered for this task.

(2) We are interested in matching all DBpedia resources, with the exception of resources that redirect to another resource (e.g. dbpedia:Gordon_Sumner dbpedia:ontology/wikiPageRedirects dbpedia:Sting_(musician) .) and resources that represent Wikipedia disambiguation pages (e.g. http://dbpedia.org/page/Lille_(disambiguation) dbpedia-owl:wikiPageDisambiguates dbpedia:Lille .).
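The exclusion rule in note (2) can be sketched as a simple filter over triples. This is only an illustration: it assumes triples are available as (subject, predicate, object) string tuples, and that the two prefixed predicates in the note expand to the `http://dbpedia.org/ontology/` URIs used below. The function name `matchable_resources` is hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Assumed full URIs for the predicates mentioned in note (2).
REDIRECTS = "http://dbpedia.org/ontology/wikiPageRedirects"
DISAMBIGUATES = "http://dbpedia.org/ontology/wikiPageDisambiguates"


def matchable_resources(candidates: Iterable[str],
                        triples: Iterable[Triple]) -> Set[str]:
    """Drop candidates that are redirects or disambiguation pages."""
    # A resource is excluded if it is the subject of either predicate.
    excluded = {s for s, p, _ in triples if p in (REDIRECTS, DISAMBIGUATES)}
    return {uri for uri in candidates if uri not in excluded}


DBR = "http://dbpedia.org/resource/"
triples = [
    (DBR + "Gordon_Sumner", REDIRECTS, DBR + "Sting_(musician)"),
    (DBR + "Lille_(disambiguation)", DISAMBIGUATES, DBR + "Lille"),
]
candidates = [
    DBR + "Gordon_Sumner",
    DBR + "Sting_(musician)",
    DBR + "Lille_(disambiguation)",
    DBR + "Lille",
]
kept = matchable_resources(candidates, triples)
```

With the two examples from the note, only the redirect target and the disambiguated resource remain as candidates for matching.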

Instructions to access the other datasets are given below:

The DBPedia dataset is very large, and we do not recommend downloading it and doing the interconnection locally. We recommend instead a linked data approach based on searching and dereferencing URIs. Here are a few links: DBPedia dump, SPARQL endpoint, lookup service. Beware, these services might not support heavy load.
The Freebase dataset is also very large, so the same recommendation applies. There is no RDF dump available, but one can be built using the dumps in Freebase quadruple format and this RDFizer. A better approach would be to do keyword lookups using the Freebase API and then use the Freebase RDF service to get an RDF representation of the retrieved Freebase IDs.
Geonames is also a large dataset. The RDF dump is available here. The Web service allows searching for resources and retrieving their descriptions in RDF.

Synthetic data generated from Freebase data

IIMB - DOWNLOAD (~400MB zipped text, about 1500 movies from Freebase) - NOTE: the IIMB version currently online on this web page was updated on July 20, 2011 to fix a consistency problem in the ontology. Please use this latest version (the file is named iimb_2011_20072011.zip)!
IIMB is divided into tasks, and reference alignments are automatically generated by introducing controlled modifications in an initial reference ontology instance. The files contain a subdirectory for each task. Participants are requested to match the reference ontology (the one in the 000 directory) against all the others (from 001 to 080). Each directory also contains the reference alignment. Data are provided as OWL individuals in the RDF/XML format, while reference alignments are provided as RDF alignments.
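The task layout described above (reference ontology in 000, matched against 001 to 080) can be enumerated with a few lines of code. This is a minimal sketch: the root directory name `iimb` and the function name `iimb_task_pairs` are assumptions, and the names of the files inside each task directory are deliberately left out because they are not specified here.

```python
from pathlib import Path
from typing import List, Tuple


def iimb_task_pairs(root: Path, last_task: int = 80) -> List[Tuple[Path, Path]]:
    """Pair the reference directory (000) with each task directory (001..last_task)."""
    reference_dir = root / "000"
    # Task directories are zero-padded to three digits: 001, 002, ..., 080.
    return [(reference_dir, root / f"{n:03d}") for n in range(1, last_task + 1)]


# Assumption: the archive has been unpacked into a directory called "iimb".
pairs = iimb_task_pairs(Path("iimb"))
```

A matching system would then be run once per pair, producing one alignment per task directory.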

IIMB is based on the SWING methodology, developed by Alfio Ferrara, Stefano Montanelli, Jan Noessner and Heiner Stuckenschmidt. A SWING implementation is available at http://code.google.com/p/swing-generator/. Further details about SWING are available in:
```
@inproceedings{iimb_iswc_2011,
  author    = {Alfio Ferrara and Stefano Montanelli and
               Jan Noessner and Heiner Stuckenschmidt},
  title     = {Benchmarking Matching Applications on the Semantic Web},
  booktitle = {The Semantic Web: Research and Applications - 8th Extended
               Semantic Web Conference, ESWC 2011},
  publisher = {Springer},
  series    = {Lecture Notes in Computer Science},
  address   = {Heraklion, Crete, Greece},  
  year      = {2011},
  pages     = {108-122}
}
```

Modalities

Subtasks

IM@OAEI2011 is organized in two sub-tracks, namely:

Data interlinking track (DI). This year the data interlinking track focuses on the following aspect: retrieving New York Times interlinks with DBPedia, Freebase and Geonames. The dataset and the reference alignments are given in the Datasets section above. The New York Times dataset includes 4 sub-datasets (persons, locations, organizations and descriptors) that should be matched to themselves to detect duplicates, and to DBPedia, Freebase and Geonames. Note that Geonames has links only to the Locations dataset of NYT.

Synthetic data track (IIMB). The synthetic data track is focused on two main goals: i) to provide an evaluation dataset for various kinds of data transformations, including value transformations, structural transformations, and logical transformations; ii) to cover a wide spectrum of possible techniques and tools. To this end, the IIMB benchmark is generated by starting from an initial OWL knowledge base that is transformed into a set of modified knowledge bases by applying several automatic transformations of data. Participants are requested to find the correct correspondences among individuals of the first knowledge base and individuals of the others.

Participation Conditions

Participating systems are free to use any combination of matching techniques and background knowledge.

Format of submission

For each track in which you participate, your submission should contain the following folders and files.

+- imei
|  +- [trackname]
|  |  +- participant.rdf

The files participant.rdf (replace 'participant' with the name of your system) contain the mapping generated by your system. These files have to follow the format described here (standard format for submissions to the OAEI).

The reference mapping contains only correspondences between instances of the ontologies. No correspondences between concepts and properties (roles) are specified in the reference alignment.
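To give an idea of what a submission file might look like, the sketch below serializes instance correspondences as RDF/XML. It assumes that the standard OAEI submission format is the RDF/XML Alignment format with `Alignment`, `map` and `Cell` elements; the namespace, the header values and the element names should be checked against the linked format description before use. The function name `alignment_rdf` and the example URIs are hypothetical.

```python
from typing import Iterable, Tuple
from xml.sax.saxutils import quoteattr

# Assumed header of the Alignment format; verify against the format description.
HEADER = """<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns="http://knowledgeweb.semanticweb.org/heterogeneity/alignment"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xsd="http://www.w3.org/2001/XMLSchema#">
<Alignment>
  <xml>yes</xml>
  <level>0</level>
  <type>??</type>
"""
FOOTER = "</Alignment>\n</rdf:RDF>\n"


def alignment_rdf(cells: Iterable[Tuple[str, str, float]]) -> str:
    """Serialize (instance1 URI, instance2 URI, confidence) triples as RDF/XML."""
    parts = [HEADER]
    for entity1, entity2, confidence in cells:
        # quoteattr escapes characters such as '&' and adds the surrounding quotes.
        parts.append(
            "  <map>\n    <Cell>\n"
            f"      <entity1 rdf:resource={quoteattr(entity1)}/>\n"
            f"      <entity2 rdf:resource={quoteattr(entity2)}/>\n"
            f'      <measure rdf:datatype="xsd:float">{confidence}</measure>\n'
            "      <relation>=</relation>\n"
            "    </Cell>\n  </map>\n"
        )
    parts.append(FOOTER)
    return "".join(parts)


# Hypothetical instance URIs, for illustration only.
doc = alignment_rdf([
    ("http://example.org/source/a", "http://example.org/target/a", 1.0),
    ("http://example.org/source/b?x=1&y=2", "http://example.org/target/b", 0.8),
])
```

Only instance-to-instance cells are emitted, in line with the reference mapping, which contains no correspondences between concepts or properties.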

Please submit the files (preliminary and final results) directly to the contact email address. Send the results (g)zipped in a file participant.zip or participant.tgz, and include the name of your matching system in the subject line of the mail.
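The folder layout and archive naming described above could be produced along the following lines. This is a sketch under assumptions: the track names `di` and `iimb`, the system name `mysystem` and the function name `package_submission` are placeholders, and one alignment file per track is assumed as in the folder listing.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import Dict


def package_submission(system_name: str,
                       results: Dict[str, str],
                       out_dir: Path) -> Path:
    """Write <system_name>.zip containing imei/<track>/<system_name>.rdf per track.

    `results` maps a track name to the text of the alignment file for that track.
    """
    archive = out_dir / f"{system_name}.zip"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for track, rdf_text in results.items():
            zf.writestr(f"imei/{track}/{system_name}.rdf", rdf_text)
    return archive


# Placeholder content and names, for illustration only.
with tempfile.TemporaryDirectory() as tmp:
    archive = package_submission(
        "mysystem",
        {"di": "<rdf/>", "iimb": "<rdf/>"},
        Path(tmp),
    )
    with zipfile.ZipFile(archive) as zf:
        names = sorted(zf.namelist())
```

The resulting archive is named after the system, and each alignment file sits under the folder of its track.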

Schedule

May 30th: datasets are out
June 27th: end of commenting period
July 1st: tests are frozen
September 1st: participants send preliminary results for interoperability-checking
September 23rd: participants send final results
October 23rd or 24th: final results ready and OM-2011 workshop.

Acknowledgements

We would like to thank all of the participants of the previous OAEI instance matching track editions for their hints and discussions on the realization and evaluation of the track over the last years.
We would like to thank Evan Sandhaus from the New York Times for his support.

Contact

Alfio Ferrara, Università degli Studi di Milano, Italy

Laura Hollink, TU Delft, Netherlands

Andriy Nikolov, Knowledge Media Institute, The Open University, UK

Jan Noessner, University of Mannheim, Germany

Willem Robert van Hage, VU Amsterdam, Netherlands

François Scharffe, LIRMM, University of Montpellier, France

Raphael Troncy, Eurecom, France

Original page: http://www.instancematching.org/oaei/imei2011.html [cached: 06/12/2011]