Multilingual Directory Data Set (MLdirectory) - Internet Directory Alignment

This track provides alignment problems for different internet directories. The main focus of this track is the treatment of a multilingual environment (English and Japanese) and the use of instances in the data set.

Data sets

The data set is constructed from Google (Open Directory Project), Yahoo!, Lycos Japan, and Yahoo! Japan. The data set can be downloaded from here.

The data set consists of five domains: automobile, movie, outdoor, photo, and software. There are four files for each domain. Two of them (*_1.owl and *_2.owl) are English directories, and the other two (*_3.owl and *_4.owl) are Japanese directories.

Each OWL file is organized into two parts. The first is the structure part, which describes the class structure. The second is the description part, which describes the instances of the classes. The second part is one of the main differences between the MLdirectory data set and the Directory data set, which is also available for OAEI-2008. You can use snippets of the web pages in the internet directories as well as category names.
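
The two-part organization can be illustrated with a minimal Python sketch. The sample document below is hypothetical and is not taken from the data set: it assumes that classes appear as owl:Class elements, that instances are rdf:Description elements typed with a class, and that page snippets are stored in rdfs:comment. The actual files may use a different vocabulary, so treat the element names and the helper split_parts as assumptions to adapt.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# Hypothetical sample; the real files may be laid out differently.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:ID="Automobile"/>
  <owl:Class rdf:ID="Classic_Cars">
    <rdfs:subClassOf rdf:resource="#Automobile"/>
  </owl:Class>
  <rdf:Description rdf:about="http://example.org/page1">
    <rdf:type rdf:resource="#Classic_Cars"/>
    <rdfs:comment>Restoration tips for vintage cars.</rdfs:comment>
  </rdf:Description>
</rdf:RDF>
"""


def local_name(ref):
    """Drop a leading '#' from a same-document reference."""
    return ref.lstrip("#") if ref else None


def split_parts(xml_text):
    """Return (parents, instances) for an RDF/XML string.

    parents maps each class name to its parent class name (or None).
    instances maps a class name to a list of (page URL, snippet) pairs.
    """
    root = ET.fromstring(xml_text)

    # Structure part: classes and their subclass links.
    parents = {}
    for cls in root.findall(f"{{{OWL}}}Class"):
        name = cls.get(f"{{{RDF}}}ID") or local_name(cls.get(f"{{{RDF}}}about"))
        sup = cls.find(f"{{{RDFS}}}subClassOf")
        resource = sup.get(f"{{{RDF}}}resource") if sup is not None else None
        parents[name] = local_name(resource)

    # Description part: instances attached to classes, with their snippets.
    instances = {}
    for desc in root.findall(f"{{{RDF}}}Description"):
        typ = desc.find(f"{{{RDF}}}type")
        if typ is None:
            continue
        cls_name = local_name(typ.get(f"{{{RDF}}}resource"))
        snippet = desc.findtext(f"{{{RDFS}}}comment", default="")
        instances.setdefault(cls_name, []).append(
            (desc.get(f"{{{RDF}}}about"), snippet)
        )
    return parents, instances


parents, instances = split_parts(SAMPLE)
```

Keeping the class hierarchy and the instance snippets in separate dictionaries lets a matcher combine name-based and instance-based evidence independently.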

Modalities

The task for this data set is to find corresponding classes between different ontologies. Since we provide four files for each domain, six alignments can be obtained per domain, one for each of the following combinations:

*_1.owl and *_2.owl (English - English)
*_1.owl and *_3.owl (English - Japanese)
*_1.owl and *_4.owl (English - Japanese)
*_2.owl and *_3.owl (English - Japanese)
*_2.owl and *_4.owl (English - Japanese)
*_3.owl and *_4.owl (Japanese - Japanese)
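
The six combinations above are simply all unordered pairs of the four files. A minimal Python sketch follows; it assumes, hypothetically, that the `*` in the file names stands for the domain name (for example automobile_1.owl), which should be checked against the downloaded files. The helper alignment_pairs is illustrative only.

```python
from itertools import combinations

# Language of each file suffix, as stated in the track description.
LANGUAGE = {1: "English", 2: "English", 3: "Japanese", 4: "Japanese"}


def alignment_pairs(domain):
    """List (source file, target file, language pair) for one domain.

    Assumes the file prefix equals the domain name (an assumption).
    """
    pairs = []
    for a, b in combinations(sorted(LANGUAGE), 2):
        pairs.append(
            (f"{domain}_{a}.owl", f"{domain}_{b}.owl",
             f"{LANGUAGE[a]} - {LANGUAGE[b]}")
        )
    return pairs


pairs = alignment_pairs("automobile")
for source, target, languages in pairs:
    print(f"{source} and {target} ({languages})")
```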

Although we encourage you to submit all of the alignments above, you may submit only some of them, such as the English-English alignment only.

You may use background knowledge such as Japanese-English dictionaries and WordNet. In addition, you may use other data included in the MLdirectory data set for parameter tuning. For example, you can use the automobile data to adjust your system and then produce the alignment results for the movie data with that system. You may not tune your system on the same data for which you produce alignments, because a system tuned in that way would not be applicable to unseen data. For the same reason, you may not use specifically crafted background knowledge, because it would violate the assumption that nothing is known about unseen data in advance.
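
One way to respect this rule is to keep the tuning domains and the evaluation domains strictly disjoint. The following Python sketch shows such a split; the helper split_domains is hypothetical and is not part of any track tooling.

```python
# The five domains named in the data set description.
DOMAINS = ["automobile", "movie", "outdoor", "photo", "software"]


def split_domains(tuning):
    """Return (tuning domains, evaluation domains) with no overlap.

    Domains used for parameter tuning are excluded from evaluation,
    so results are only produced for data not seen during tuning.
    """
    tuning = set(tuning)
    unknown = tuning - set(DOMAINS)
    if unknown:
        raise ValueError(f"unknown domains: {sorted(unknown)}")
    evaluation = [d for d in DOMAINS if d not in tuning]
    return sorted(tuning), evaluation


# Tune on automobile, then produce alignments for the remaining domains.
tuning, evaluation = split_domains(["automobile"])
```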

Schedule

Please refer to the schedule at http://oaei.ontologymatching.org/2008/

Reference

This data set is used in the following papers:

Ryutaro Ichise, Masahiro Hamasaki, Hideaki Takeda: Discovering Relationships among Catalogs, Proceedings of the 7th International Conference on Discovery Science, LNAI 3245, pp. 371-379, Springer, (2004)
Ryutaro Ichise, Hideaki Takeda, Shinichi Honiden: Integrating Multiple Internet Directories by Instance-based Learning, Proceedings of the 18th International Joint Conference on Artificial Intelligence, pp. 22-28, (2003)

Contacts

If you have any questions or comments, please feel free to contact Ryutaro Ichise ( )

Initial location of this page: http://ri-www.nii.ac.jp/OAEI/2008/index.html.