MultiFarm 2011.5 Dataset

You may also want to consult the multifarm project page.

This page describes the MultiFarm dataset, a comprehensive dataset for multilingual ontology matching. The dataset can be downloaded and used for any kind of scientific purpose. Its generation and structure are briefly explained on this page; more details can be found in the following paper.

Christian Meilicke, Raúl García Castro, Fred Freitas, Willem Robert van Hage, Elena Montiel-Ponsoda, Ryan Ribeiro de Azevedo, Heiner Stuckenschmidt, Ondrej Svab-Zamazal, Vojtech Svatek, Andrei Tamilin, Cássia Trojahn, Shenghui Wang. MultiFarm: A Benchmark for Multilingual Ontology Matching. Accepted for publication at the Journal of Web Semantics.

Download the authors' version of the paper

Modifications

The following enumeration describes modifications that have been applied to the dataset after its first publication.

14.10.2011: The reference alignments for edas and ekaw have been filtered out from all language pairs. This makes it possible to use around half of the reference alignments for blind tests in OAEI or in other evaluation campaigns, while a good number of test cases remain freely available for improving current matching systems. The same changes have been applied to the data stored in the SEALS platform.
14.10.2011: The meta-description of the data stored in the SEALS platform has been changed. All testsuites can now be found by searching for multifarm here (user account required). The id cn-cz-multifarm, for example, refers to the Chinese-Czech language pair.

Evaluation campaigns

The dataset has been used in the following experiments:

OAEI 2011.5 results page

It would be nice if you could inform us (contact below) in case you use the dataset in an experimental evaluation.

Translations in raw format

The dataset has been generated by translating the existing OntoFarm dataset. The results of this first step are available as simply structured text files and can be downloaded from the following table. Please note that all files are UTF-8 encoded. Some characters might be displayed incorrectly by your browser if it does not detect the encoding correctly.

	Spanish	German	French	Russian	Portuguese	Czech	Dutch	Chinese
CMT	link	link	link	link	link	link	link	link
CONFERENCE	link	link	link	link	link	link	link	link
CONFOF	link	link	link	link	link	link	link	link
EDAS	-	-	-	-	-	-	-	-
EKAW	-	-	-	-	-	-	-	-
IASTED	link	link	link	link	link	link	link	link
SIGKDD	link	link	link	link	link	link	link	link
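Because the raw files are UTF-8 encoded, it can help to state the encoding explicitly when reading them in a program instead of relying on a platform default. The following is a minimal sketch; the function name read_translation_lines is hypothetical, and since the internal layout of the text files is not described here, the sketch only returns the non-empty lines without interpreting them.

```python
from pathlib import Path


def read_translation_lines(path):
    """Return the non-empty lines of a raw translation file, decoded as UTF-8."""
    # Decode explicitly as UTF-8 so that Czech, Russian, or Chinese labels
    # are not garbled by a different platform default encoding.
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line.strip()]
```

How each line is split into an original label and its translation depends on the file layout and is left to the reader.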

Complete bundle with ontologies and reference alignments

The results of the translation have been used to generate language-specific variants of the existing ontologies and reference alignments for all pairs of ontologies. These files are bundled in a single zip file. They can be downloaded and used in any kind of scenario or experiment.

The zip-file is structured as follows:

ont/ 
   cn/
      cmt-cn.owl
      conference-cn.owl
      [for each ontology cmt, conference, confOf, edas, ekaw, iasted, sigkdd]
   cz/ (contains 7 files)
      cmt-cz.owl
      conference-cz.owl
      ...
   de/ (contains 7 files)
      cmt-de.owl
      conference-de.owl
      ...
   [a directory for each language cn, cz, de, en, es, fr, nl, pt, ru]
ref/
   cn-cz/
      cmt-cmt-cn-cz.rdf
      cmt-conference-cn-cz.rdf
      cmt-conference-cz-cn.rdf
      cmt-confOf-cn-cz.rdf
      cmt-confOf-cz-cn.rdf
      ...
      conference-conference-cn-cz.rdf
      ...
      [overall 21*2 + 7*1 = 49 files]
   [a directory for each language pair cn-cz, cn-de, ...]

>>> Download the zipped bundle
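The naming scheme of the ref/ directories can be sketched in a few lines of Python. The sketch below enumerates the file names implied by the layout shown above for one language pair: one file for each ontology matched against its own translation (7 files) and two files, one per language direction, for each of the 21 pairs of different ontologies, giving 21*2 + 7*1 = 49 names. The function name reference_files is hypothetical, and the order of the two ontology names inside a file name is assumed to follow the ontology listing above; only the examples shown in the listing are confirmed. Since the reference alignments for edas and ekaw have been filtered out, the downloaded bundle may contain fewer files than this enumeration yields.

```python
from itertools import combinations

# Ontology names as they appear in the bundle listing.
ONTOLOGIES = ["cmt", "conference", "confOf", "edas", "ekaw", "iasted", "sigkdd"]


def reference_files(lang1, lang2, ontologies=ONTOLOGIES):
    """List the file names implied by the layout of ref/<lang1>-<lang2>/."""
    names = []
    # Same ontology in both languages: one file per ontology.
    for onto in ontologies:
        names.append(f"{onto}-{onto}-{lang1}-{lang2}.rdf")
    # Two different ontologies: one file per language direction.
    # Assumption: ontology order in the name follows ONTOLOGIES.
    for o1, o2 in combinations(ontologies, 2):
        names.append(f"{o1}-{o2}-{lang1}-{lang2}.rdf")
        names.append(f"{o1}-{o2}-{lang2}-{lang1}.rdf")
    return names


files = reference_files("cn", "cz")
print(len(files))
```

Passing a shorter ontology list, for example without edas and ekaw, yields the names for the publicly available subset under the same assumptions.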

SEALS Testsuites

The dataset can also be used via the SEALS platform, where we have prepared and stored a testsuite for each language pair, resulting in 36 testsuites. You need an account for the SEALS platform to search and retrieve them from the test data repository.

>>> Link to the SEALS platform

You can, for example, find the testsuite for the language pair Czech-German if you just type 'cz-de' in the search field of the test data repository.
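The number of testsuites follows from the nine languages of the dataset: there are 9*8/2 = 36 unordered language pairs. The sketch below enumerates them in Python. It assumes that every testsuite id follows the pattern of the cn-cz-multifarm example, with the two language codes in alphabetical order; only the cn-cz id and the cz-de search term are confirmed on this page, and the function name testsuite_ids is hypothetical.

```python
from itertools import combinations

# Language codes as used in the dataset's directory names.
LANGUAGES = ["cn", "cz", "de", "en", "es", "fr", "nl", "pt", "ru"]


def testsuite_ids(languages=LANGUAGES):
    """Build one id per unordered language pair (assumed naming pattern)."""
    return [f"{a}-{b}-multifarm" for a, b in combinations(sorted(languages), 2)]


ids = testsuite_ids()
print(len(ids))
```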

Involved people

The dataset has been generated by a collaborative initiative of the following people.

Elena Montiel-Ponsoda, Raul Garcia Castro (Spanish)
Christian Meilicke, Heiner Stuckenschmidt (German)

with support from: Dominique Ritze and Jakob Huber

Cassia Trojahn (French)
Andrei Tamilin (Russian)
Fred Freitas, Ryan Ribeiro de Azevedo (Portuguese)

with support from: Ícaro Medeiros, Fernando Lins, Eric Rommel, and Roberta Fernandes

Ondrej Zamazal, Vojtech Svatek (Czech, owner of OntoFarm dataset)
Willem Robert van Hage (Dutch)
Shenghui Wang (Chinese)

Contact

Contact Christian Meilicke or Cassia Trojahn for further information related to this dataset.

Known Bugs

Some users of the dataset have already detected some small bugs. We will fix these bugs in the future; for the moment we just list them:

Heiko Paulheim has reported (Feb 2012) that MultiFarm uses incorrect ISO language codes for some languages. The bug affects Chinese (cn instead of zh) and Czech (cz instead of cs). We recommend writing a specific workaround in your matching system that maps the wrong codes to the right ones until we have fixed the problem.
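Such a workaround can be as small as a lookup table. The following Python sketch maps the two codes used in the dataset's directory and file names (cn and cz) to the ISO 639-1 codes zh and cs and leaves all other codes unchanged. The names CODE_FIXES and iso_language_code are hypothetical and not part of the dataset or of any matching system.

```python
# Codes used by MultiFarm that differ from ISO 639-1.
CODE_FIXES = {"cn": "zh", "cz": "cs"}


def iso_language_code(multifarm_code):
    """Map a MultiFarm language code to its ISO 639-1 code."""
    code = multifarm_code.lower()
    # Codes that are already correct (de, en, es, ...) pass through unchanged.
    return CODE_FIXES.get(code, code)


print(iso_language_code("cn"), iso_language_code("cz"), iso_language_code("de"))
```

Keeping the mapping in one place makes it easy to remove once the dataset itself has been corrected.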

Colophon

The logo at the top of this page is a modified version of a logo often used to refer to the Semantic Web. We have added the Chinese characters for 'many' and 'language' to the original logo.

Original page: http://web.informatik.uni-mannheim.de/multifarm/ [cached: 04/07/2012]