IM@OAEI2011 is an initiative for the evaluation of instance matching techniques and tools. IM@OAEI2011 is a track of the Ontology Alignment Evaluation Initiative (OAEI - http://oaei.ontologymatching.org/2011), held every year in collaboration with the Ontology Matching Workshop at ISWC (http://om2011.ontologymatching.org).
IM@OAEI2011 is focused on RDF and OWL data in the context of the Semantic Web. Participants will be asked to execute their algorithms against various datasets and their results will be evaluated by comparing them with a pre-defined reference alignment provided by IM@OAEI2011. Results will be evaluated according to standard precision and recall metrics.
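As an illustration of the evaluation described above, the following is a minimal sketch of how precision and recall can be computed for a set of instance correspondences. It assumes an alignment is represented as a set of (entity1, entity2) URI pairs; the function name and the example URIs are hypothetical and are not part of the official evaluation tooling.

```python
def precision_recall(found, reference):
    """Compare a system alignment with a reference alignment.

    Both arguments are iterables of (entity1, entity2) pairs.
    This representation is an assumption made for illustration.
    """
    found, reference = set(found), set(reference)
    correct = len(found & reference)  # correspondences present in both
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical example: 4 reference links, 3 returned, 2 of them correct.
reference = {
    ("nyt:A", "dbpedia:A"),
    ("nyt:B", "dbpedia:B"),
    ("nyt:C", "dbpedia:C"),
    ("nyt:D", "dbpedia:D"),
}
found = {
    ("nyt:A", "dbpedia:A"),
    ("nyt:B", "dbpedia:B"),
    ("nyt:C", "dbpedia:X"),
}
precision, recall = precision_recall(found, reference)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Precision is the fraction of returned correspondences that are in the reference alignment; recall is the fraction of reference correspondences that were returned.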
For each task described below, we provide both the datasets to interlink and the reference alignments. Participants can thus tune their algorithms to perform as well as possible before sending us their evaluation results.
Use the following datasets as input for your matching system. You can already test with this data and report problems (send reports to contact email). The datasets will be frozen by July 1st.
Interlinking New-York Times Data - DOWNLOAD
Participants are requested to re-build the links among the NYT dataset itself (see data.nytimes.com), and to the external data sources DBPedia, Geonames and Freebase. Reference alignments are provided for each resource as RDF alignments. These alignments are extracted from the links provided and curated by NYT. The whole NYT dataset is available under Creative Commons license - CC BY 3.0.
Here are a few statistics on the datasets and their interlinks. The NYT dataset contains three areas: people, organizations and locations.
Stat | People | Organizations | Locations |
Nr of NYT resources | 9958 | 6088 | 3840 |
Total nr of sameAs links | 14884 | 8003 | 87861 |
Links to Freebase | 4979 | 3044 | 1920 |
Links to DBPedia | 4977 | 1949 | 1920 |
Links to NYT | 4979 | 3044 | 1920 |
Links to Geonames | 0 | 0 | 1789 |
(1) 12 links actually do not belong to the interconnected datasets and are not considered for this task.
(2) We are interested in matching all DBpedia resources, with the exception of resources that redirect to another resource (e.g. dbpedia:Gordon_Sumner dbpedia:ontology/wikiPageRedirects dbpedia:Sting_(musician) .) and resources that represent Wikipedia disambiguation pages (e.g. http://dbpedia.org/page/Lille_(disambiguation) dbpedia-owl:wikiPageDisambiguates dbpedia:Lille .).
Instructions to access the other datasets are given below:
Synthetic data generated from Freebase data
IIMB is divided into tasks, and reference alignments are automatically generated by introducing controlled modifications into an initial reference ontology instance. The files contain a subdirectory for each task. Participants are requested to match the reference ontology (the one in the 000 directory) against all the others (from 001 to 080). Each directory also contains the corresponding reference alignment. Data are provided as OWL individuals in RDF/XML format, while reference alignments are provided as RDF alignments.
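The task layout described above can be walked with a short loop. This is only a sketch: it assumes the benchmark has been unpacked into a local directory (here called "iimb", a hypothetical name) with subdirectories named 000 to 080, and it makes no assumption about the file names inside each directory.

```python
from pathlib import Path


def iimb_tasks(root, first=1, last=80):
    """Yield (source_dir, target_dir) pairs: 000 against 001..080.

    `root` is the directory where the benchmark was unpacked
    (an assumption; adjust it to your local copy).
    """
    root = Path(root)
    source = root / "000"  # the unmodified reference ontology
    for n in range(first, last + 1):
        yield source, root / f"{n:03d}"  # zero-padded task directory


tasks = list(iimb_tasks("iimb"))
for source, target in tasks[:3]:
    print(f"match {source} against {target}")
```

Each yielded pair corresponds to one matching task; a matcher would load the ontology from both directories and compare its output with the reference alignment stored in the target directory.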
IIMB is based on the SWING methodology, developed by Alfio Ferrara, Stefano Montanelli, Jan Noessner and Heiner Stuckenschmidt. A SWING implementation is available at http://code.google.com/p/swing-generator/. Further details about SWING are available in:
@inproceedings{iimb_iswc_2011,
  author    = {Alfio Ferrara and Stefano Montanelli and Jan Noessner and Heiner Stuckenschmidt},
  title     = {Benchmarking Matching Applications on the Semantic Web},
  booktitle = {The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011},
  publisher = {Springer},
  series    = {Lecture Notes in Computer Science},
  address   = {Heraklion, Crete, Greece},
  year      = {2011},
  pages     = {108-122}
}
IM@OAEI2011 is organized in two sub-tracks, namely:
Data interlinking track (DI). This year the Data interlinking track focuses on the following aspect: retrieving New York Times interlinks with DBPedia, Freebase and Geonames. The dataset and the reference alignments are given in the Datasets section above. The New York Times dataset includes 4 sub-datasets (persons, locations, organizations and descriptors) that should be matched to themselves to detect duplicates, and to DBPedia, Freebase and Geonames. Note that Geonames is linked only to the Locations dataset of NYT.
Synthetic data track (IIMB). The synthetic data track is focused on two main goals: i) to provide an evaluation dataset for various kinds of data transformations, including value transformations, structural transformations, and logical transformations; ii) to cover a wide spectrum of possible techniques and tools. To this end, the IIMB benchmark is generated by starting from an initial OWL knowledge base that is transformed into a set of modified knowledge bases by applying several automatic transformations of data. Participants are requested to find the correct correspondences among individuals of the first knowledge base and individuals of the others.
Participating systems are free to use any combination of matching techniques and background knowledge.
For each track in which you participate, your submission should contain the following folders and files.
+- imei
|  +- [trackname]
|  |  +- participant.rdf
The files participant.rdf (replace 'participant' by the name of your system) contain the mapping generated by your system. These files have to follow the format described here (standard format for submissions to the OAEI).
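To give an idea of what such a file may look like, here is a hedged sketch that serializes instance correspondences as an RDF/XML alignment. The element names follow the Alignment API format as commonly used for OAEI submissions, but the header values (level, type), the confidence formatting, and the example URIs are assumptions made for illustration; the format description linked above remains the authoritative reference.

```python
from xml.sax.saxutils import quoteattr

# Header and footer of the alignment document (values are assumptions).
HEADER = """<?xml version='1.0' encoding='utf-8'?>
<rdf:RDF xmlns='http://knowledgeweb.semanticweb.org/heterogeneity/alignment'
         xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
         xmlns:xsd='http://www.w3.org/2001/XMLSchema#'>
<Alignment>
  <xml>yes</xml>
  <level>0</level>
  <type>**</type>
"""

# One <map>/<Cell> block per correspondence, with an equivalence relation.
CELL = """  <map>
    <Cell>
      <entity1 rdf:resource={e1}/>
      <entity2 rdf:resource={e2}/>
      <measure rdf:datatype='http://www.w3.org/2001/XMLSchema#float'>{conf}</measure>
      <relation>=</relation>
    </Cell>
  </map>
"""

FOOTER = "</Alignment>\n</rdf:RDF>\n"


def alignment_rdf(correspondences):
    """Build the alignment document from (uri1, uri2, confidence) triples."""
    cells = [
        # quoteattr escapes the URI and wraps it in quotes for use as an attribute
        CELL.format(e1=quoteattr(u1), e2=quoteattr(u2), conf=f"{c:.2f}")
        for u1, u2, c in correspondences
    ]
    return HEADER + "".join(cells) + FOOTER


# Hypothetical correspondence between two instances.
doc = alignment_rdf([("http://example.org/a#i1", "http://example.org/b#i1", 1.0)])
print(doc)
```

The resulting string can be written to a file named after your system, for example with `open("participant.rdf", "w", encoding="utf-8")`.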
The reference mapping contains only correspondences between instances of the ontologies. No correspondences between concepts and properties (roles) are specified in the reference alignment.
Please submit the files (preliminary and final results) directly to the email address contact mail. Send the results compressed in a file named participant.zip or participant.tgz, and include the name of your matching system in the subject line of the mail.
We would like to thank all of the participants of the previous OAEI instance matching track editions for hints and discussions with respect to the realization and evaluation over the last years.
We would like to thank Evan Sandhaus from the New York Times for his support.
Alfio Ferrara, Università degli Studi di Milano, Italy
Laura Hollink, TU Delft, Netherlands
Andriy Nikolov, Knowledge Media Institute, The Open University, UK
Jan Noessner, University of Mannheim, Germany
Willem Robert van Hage, VU Amsterdam, Netherlands
François Scharffe, LIRMM, University of Montpellier, France
Raphael Troncy, Eurecom, France