Ontology Alignment Evaluation Initiative

FAO results

Data sets

The FAO task involved three data sets, all in OWL:

AGROVOC is a thesaurus about all matters of interest for FAO. It has been translated into OWL as a very flat ontology containing one class per term plus one instance per class (with the same name but prefixed with "i_").
ASFA is a thesaurus more specifically dedicated to fisheries. In its OWL translation, descriptors and non-descriptors are modeled as classes, so the ontology does not contain any instances. The tree structure of ASFA is relatively flat, with most concepts having no subclasses and a maximum depth of 4 levels (double check).
The fisheries ontologies are two small OWL ontologies modelling metadata for statistical series analysis (coming from the Reference Table Management System, or RTMS) about specific fishery resources: commodities and species. They have a fairly simple class structure (e.g. the species ontology has one top class and four subclasses) and a large number of instances. They contain instances in up to 3 languages (English, French and Spanish).

Concepts have annotations associated containing the English definition of the term.

Subtracks

The mapping between AGROVOC and ASFA is therefore expected to be at the class level, since both model thesaurus entries as classes (AGROVOC additionally attaches an instance to each class). For the same reason, the mapping between AGROVOC and the fisheries ontologies is expected to be at the instance level, as is the mapping between the two fisheries ontologies.

However, no strict instructions were given to participants about the exact type of mapping expected, as one of the goals of the experiments was to find out how automatic systems deal with a real-life situation, in which the ontologies given are designed according to different models and have little or no documentation.

The alignment between AGROVOC and ASFA is called agrafsa, the alignment between AGROVOC and the RTMS data is called agrorgbio, and the alignment between the two RTMS ontologies (commodities and species) is called fishbio.

Requesting equivalence mappings for the agrafsa and agrorgbio subtracks is plausible, given the similar nature of the resources involved (thesauri used for human indexing, with overlap in the domain covered). This is not true of the fishbio subtrack, as the two ontologies involved cover two domains that are disjoint, although related: commodities and fish species. The two domains are related by the fact that one or more specific species are the primary source of the goods sold, i.e. the commodity. Their relation is therefore not an equivalence relation but can rather be seen as an object property whose domain and range sit in different ontologies.

The intent of this subtrack is therefore to explore the possibility of applying the machinery available for inferring equivalence mappings to non-conventional cases.

Evaluation procedure

All participants but one, Aroma, returned equivalence mappings only. The non-equivalence correspondences of Aroma were ignored.

A reference alignment was obtained by randomly selecting a fixed number of correspondences from each system and then pooling them together. This provided a sample alignment A⁰.
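The pooling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `pool_sample`, the `per_system` parameter, the fixed seed and the toy correspondences are all hypothetical, and the report does not state the actual sample size per system or how ties and duplicates were handled.

```python
import random


def pool_sample(alignments, per_system, seed=0):
    """Draw up to `per_system` correspondences at random from each system's
    alignment and pool them into one sample set (A0 in the text).

    `alignments` maps a system name to a set of correspondences, each
    represented here as an (entity1, entity2) pair. This representation is
    an assumption made for the sketch.
    """
    rng = random.Random(seed)
    pooled = set()
    for system in sorted(alignments):
        # Sort so that the draw is reproducible for a given seed.
        items = sorted(alignments[system])
        k = min(per_system, len(items))
        # Pooling into a set merges correspondences returned by several systems.
        pooled.update(rng.sample(items, k))
    return pooled


# Toy data (hypothetical): two systems sharing one correspondence.
alignments = {
    "sysA": {("a1", "b1"), ("a2", "b2"), ("a3", "b3")},
    "sysB": {("a1", "b1"), ("a4", "b4")},
}
sample = pool_sample(alignments, per_system=2)
```

Because the pooled sample is a set, a correspondence drawn from several systems is assessed only once, which is why the pooled sample can be smaller than the number of systems times the per-system sample size.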

This sample alignment was evaluated by FAO experts for correctness. This provided a partial reference alignment R⁰. We had two assessors: one specialized in thesauri and working daily with AGROVOC (assessing the alignments of the subtrack agrafsa) and one specialized in fisheries data (assessing the subtracks agrorgbio and fishbio). Given the differences between the ontologies considered, some transformations had to be made in order to present the data to the assessors in a user-friendly manner. For example, in the case of AGROVOC, evaluators were given the English labels together with all the available UF (used-for) terms.

The sizes of the samples taken for evaluation are as follows:

dataset	tot retrieved (A*)	evaluated (A⁰)	correct (R⁰)	A⁰/A*	R⁰/A⁰
agrafsa	2588	506	226	19%	45%
agrorgbio	742	264	156	36%	59%
fishbio	1013	346	131	34%	38%
TOTAL	4343	1116	513	26%	46%

Table 1 summarizes the sample size for each data set. The second column (tot retrieved) contains the total number of distinct correspondences provided by all participants for each subtrack. The third column (evaluated) reports the size of the sample extracted for manual assessment. The fourth column (correct) reports the number of correspondences found correct by the assessors.
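The two ratio columns of Table 1 can be recomputed from the three count columns. The sketch below uses only the counts reported in the table; the variable and function names are illustrative. It prints one decimal place, so values can differ slightly from the whole percentages shown in the table (for instance, 506/2588 evaluates to about 19.6%).

```python
# Counts from Table 1: (tot retrieved, evaluated, correct) per subtrack.
table1 = {
    "agrafsa": (2588, 506, 226),
    "agrorgbio": (742, 264, 156),
    "fishbio": (1013, 346, 131),
}


def ratios(retrieved, evaluated, correct):
    """Return (share of retrieved correspondences that were evaluated,
    share of evaluated correspondences that were judged correct)."""
    return evaluated / retrieved, correct / evaluated


# Column-wise sums give the TOTAL row.
totals = tuple(sum(row[i] for row in table1.values()) for i in range(3))

rows = dict(table1, TOTAL=totals)
for name, (retrieved, evaluated, correct) in rows.items():
    sampled, accepted = ratios(retrieved, evaluated, correct)
    print(f"{name:10s} {sampled:6.1%} {accepted:6.1%}")
```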

After the manual evaluation was over, we realized that some participants did not use the correct URIs in the agrafsa dataset, so some mappings were counted as different even though they were actually the same. However, this happened only in very few cases.

For each system, precision was computed on the basis of the subset of alignments that were manually assessed, i.e., A∩A⁰. Hence,

P⁰(A,R⁰)= P(A∩A⁰,R⁰)= |A∩R⁰| / |A∩A⁰|

The same approach was taken for recall, which was computed with respect to the total number of correct correspondences per subtrack, as judged by the human assessors. Hence,

R⁰(A,R⁰)= R(A∩A⁰,R⁰)= |A∩R⁰| / |R⁰|

This recall is expected to be higher than actual recall because it is based only on correspondences that at least one system returned, leaving aside those that no system was able to return.

We call these two measures relative precision and recall because they are relative to the sample that has been extracted.
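The two measures can be sketched directly from the formulas above. The function name and the toy identifiers are illustrative, and the sketch assumes that R⁰ is a subset of A⁰ (only sampled correspondences were assessed). The last lines recompute one row of the results table (DSSim on agrorgbio) from its reported counts.

```python
def relative_precision_recall(A, A0, R0):
    """Relative precision and recall of alignment A, given the evaluated
    sample A0 and the correspondences R0 judged correct by the assessors.

    P0 = |A ∩ R0| / |A ∩ A0|   and   R0 = |A ∩ R0| / |R0|
    Returns None for a measure whose denominator is zero.
    """
    evaluated = A & A0
    correct = A & R0
    precision = len(correct) / len(evaluated) if evaluated else None
    recall = len(correct) / len(R0) if R0 else None
    return precision, recall


# Toy example with hypothetical correspondence identifiers.
A = {"c1", "c2", "c3", "c9"}         # returned by one system
A0 = {"c1", "c2", "c3", "c4", "c5"}  # pooled sample that was assessed
R0 = {"c1", "c2", "c4"}              # judged correct
p, r = relative_precision_recall(A, A0, R0)  # 2 of 3 evaluated, 2 of 3 correct

# Same formulas applied to reported counts for DSSim on agrorgbio:
# |A ∩ A0| = 214, |A ∩ R0| = 151, |R0| = 156.
dssim_precision = round(151 / 214, 6)
dssim_recall = round(151 / 156, 6)
```

A system whose correspondences were never drawn into the sample has an empty A∩A⁰, so its relative precision is undefined; the sketch returns `None` in that case rather than a number.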

Results

Table 2 summarizes the relative precision and recall values of all systems, by subtrack. The third column reports the total number of mappings returned by each system for each subtrack. All non-equivalence mappings were discarded, but this only happened for one system (Aroma). The fourth column reports the number of correspondences from the system that were evaluated, while the fifth column reports the number of those judged correct by the assessors. Finally, the sixth and seventh columns report the values of relative precision and recall computed as described above. A star (*) marks the systems that aligned properties.

	subtrack	tot retrieved	tot evaluated	tot correct	RPrecision	RRecall
SYSTEM		|A|	|A∩A⁰|	|A∩R⁰|	P⁰(A,R⁰)	R⁰(A,R⁰)
aroma ***	agrafsa	195	144	90	0.625	0.39823
	agrorgbio	2	4	0
	fishbio *	11
ASMOV	agrafsa	1
	agrorgbio	0
	fishbio *	5
DSSim	agrafsa	218	129	70	0.542636	0.309735
	agrorgbio	339	214	151	0.705607	0.967949
	fishbio	243	166	79	0.475904	0.603053
Lily	agrafsa	390	105	91	0.866667	0.402655
MapPSO	agrorgbio *	6
	fishbio *	16
RiMOM	agrafsa	743	194	158	0.814433	0.699115
	agrorgbio	395	219	149	0.680365	0.955128
	fishbio	738	217	118	0.543779	0.900763
SAMBO	agrafsa	389	176	121	0.6875	0.535398
SAMBOdtf	agrafsa	650	219	124	0.56621	0.548673

Discussion

Judging from the DSSim and RiMOM results, fishbio seems to be the most difficult task in terms of precision, and agrafsa the most difficult in terms of recall (for most of the systems).

It is really difficult to compare systems at this stage, but RiMOM seems to be the one that provided the best results. However, its results are comparable with those of the others.

The sampling method that was used is certainly not perfect. In particular, it did not permit the evaluation of two systems that returned few results (ASMOV and MapPSO), nor of most of Aroma's results. However, the results returned by these systems were not likely to yield good recall. This is related to the next remark.

The lack of instructions and the particular character of the test set clearly puzzled the participants and their systems. As a consequence, the results may not be as good as they would be on a very polished test with easily comparable data sets. This provides an honest insight into what these systems would do when confronted with these ontologies on the web. In that respect, the results are not bad.

* Mappings between properties
** Thanks to Andrew Bagdanov, Aureliano Gentile, Gudrun Johannsen.
*** In agrafsa, 40 non-equivalence mappings were ignored. In agrorgbio, 2 non-equivalence mappings were ignored.

$Id: FAO-results.html,v 1.2 2014/04/21 19:58:37 euzenat Exp $