Ontology Alignment Evaluation Initiative - OAEI-2008 Campaign

FAO results

Data sets

The FAO task involved three data sets, all in OWL: the AGROVOC thesaurus, the ASFA thesaurus, and fisheries ontologies derived from RTMS data.

Concepts have associated annotations containing the English definition of the term.

Subtracks

The mapping between AGROVOC and ASFA is therefore expected at the class level, since both model thesaurus entries as classes, though AGROVOC also adds instances to these classes. For the same reason, the mapping between AGROVOC and the fisheries ontologies is expected at the instance level, as is the mapping between the fisheries ontologies themselves.

However, no strict instructions were given to participants about the exact type of mapping expected, as one of the goals of the experiments was to find out how automatic systems deal with a real-life situation, where the ontologies given are designed according to different models and have little or no documentation.

The alignment between AGROVOC and ASFA is called agrafsa, the alignment between AGROVOC and RTMS data is called agrorgbio, and the alignment between the two RTMS ontologies is called fishbio.

The equivalence mappings to be found for the agrafsa and agrorgbio subtracks are plausible, given the similar nature of the two resources (thesauri used for human indexing, with overlap in the domain covered). In the case of the fishbio subtrack this is not true, as the two ontologies involved cover two domains that are disjoint, although related: commodities and fish species. The relation between the two domains comes from the fact that one or more specific species are the primary source of the goods sold, i.e., the commodity. Their relation is therefore not an equivalence relation but can rather be seen as an object property whose domain and range sit in different ontologies.

The intent of this subtrack is thus to explore the possibility of applying the machinery available for inferring equivalence mappings to non-conventional cases.
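To make the pattern concrete, the kind of cross-ontology relation at stake can be written down directly in OWL. The following rdflib sketch is purely illustrative: the namespaces and the isProducedFrom property are hypothetical names, not identifiers from the actual RTMS ontologies.

```python
from rdflib import Graph, Namespace, RDF, RDFS, OWL

# Hypothetical namespaces standing in for the two RTMS ontologies.
COMM = Namespace("http://example.org/rtms/commodities#")
SPEC = Namespace("http://example.org/rtms/species#")

g = Graph()

# The relation between commodities and species is not an equivalence:
# it is an object property whose domain lives in one ontology and
# whose range lives in the other.
g.add((COMM.isProducedFrom, RDF.type, OWL.ObjectProperty))
g.add((COMM.isProducedFrom, RDFS.domain, COMM.Commodity))
g.add((COMM.isProducedFrom, RDFS.range, SPEC.Species))

print(g.serialize(format="turtle"))
```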

Evaluation procedure

All participants but one (Aroma) returned equivalence mappings only. The non-equivalence correspondences returned by Aroma were ignored.

A reference alignment was obtained by randomly selecting a fixed number of correspondences from each system and pooling them together. This provided a sample alignment A0.
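A minimal sketch of this pooling step follows; the function name and the per-system sample size k are assumptions made for illustration, not the exact procedure used.

```python
import random

def pool_sample(system_alignments, k, seed=0):
    """Draw up to k correspondences at random from each system's
    alignment and pool them into a single sample alignment A0."""
    rng = random.Random(seed)
    a0 = set()
    for correspondences in system_alignments.values():
        ordered = sorted(correspondences)  # deterministic order before sampling
        a0.update(rng.sample(ordered, min(k, len(ordered))))
    return a0

# system_alignments maps a system name to a set of (entity1, entity2) pairs.
```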

This sample alignment was evaluated by FAO experts for correctness, providing a partial reference alignment R0. We had two assessors: one specialized in thesauri and working daily with AGROVOC (assessing the alignments of the agrafsa subtrack) and one specialized in fisheries data (assessing the agrorgbio and fishbio subtracks). Given the differences between the ontologies considered, some transformations had to be made in order to present the data to the assessors in a user-friendly manner. In the case of AGROVOC, evaluators were given the English labels together with all the available UF (used for) terms.
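In thesaurus terms, the UF entries are the non-preferred terms attached to a descriptor. A sketch of how such an assessor-friendly view could be extracted is given below; it assumes a SKOS-style rendering where preferred terms appear as skos:prefLabel and UF terms as skos:altLabel, which may differ from the actual AGROVOC OWL model used in the campaign.

```python
from rdflib import Graph
from rdflib.namespace import SKOS

def assessor_view(graph: Graph, concept):
    """Return the English preferred label of a concept together with its
    UF (non-preferred) terms, assuming a SKOS-style rendering."""
    pref = [str(l) for l in graph.objects(concept, SKOS.prefLabel)
            if getattr(l, "language", None) == "en"]
    uf = [str(l) for l in graph.objects(concept, SKOS.altLabel)
          if getattr(l, "language", None) == "en"]
    return pref, uf
```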

The sizes of the samples taken for evaluation are as follows:

dataset     tot retrieved (A*)  evaluated (A0)  correct (R0)  A0/A*  R0/A0
agrafsa     2588                506             226           19%    45%
agrorgbio   742                 264             156           36%    59%
fishbio     1013                346             131           34%    38%
TOTAL       4343                1116            513           26%    46%
Table 1 summarizes the sample size for each data set. The second column (tot retrieved) contains the total number of distinct correspondences provided by all participants for each track. The third column (evaluated) reports the size of the sample extracted for manual assessment. The fourth column (correct) reports the number of correspondences found correct by the assessors.

After manual evaluation was over, we realized that some participants did not use the correct URIs in the agrafsa data set, so some mappings were considered different even though they were actually the same. However, this happened only in very few cases.
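A simple canonicalization pass applied before comparing correspondences would have avoided this; the sketch below is illustrative only, and the URI rewrite it performs is a hypothetical example rather than the actual discrepancy encountered.

```python
def normalize_uri(uri: str) -> str:
    """Map known URI variants to one canonical form so that identical
    mappings written with different URIs compare as equal.
    The rewrite rule below is purely illustrative."""
    return uri.strip().replace(
        "http://www.example.org/agrovoc/", "http://example.org/agrovoc#"
    )

def normalize(correspondence):
    e1, e2 = correspondence
    return (normalize_uri(e1), normalize_uri(e2))
```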

For each system, precision was computed on the basis of the subset of alignments that were manually assessed, i.e., A∩A0. Hence,

P0(A,R0) = P(A∩A0,R0) = |A∩R0| / |A∩A0|

The same approach was used for recall, which was computed with respect to the total number of correct alignments per subtrack, as assessed by the human assessors. Hence,

R0(A,R0) = R(A∩A0,R0) = |A∩R0| / |R0|
This recall is expected to be higher than the actual recall because it is based only on correspondences that at least one system returned, leaving aside those that no system was able to return.

We call these two measures relative precision and recall because they are relative to the sample that has been extracted.
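Both measures reduce to simple set operations; a minimal sketch, assuming correspondences are represented as hashable (entity1, entity2) pairs:

```python
def relative_precision_recall(A, A0, R0):
    """Compute relative precision P0(A,R0) and relative recall R0(A,R0).

    A  : set of correspondences returned by one system
    A0 : pooled sample of correspondences that was manually assessed
    R0 : subset of A0 judged correct by the assessors
    """
    evaluated = A & A0   # A ∩ A0: the system's assessed correspondences
    correct = A & R0     # A ∩ R0: those among them judged correct
    precision = len(correct) / len(evaluated) if evaluated else 0.0
    recall = len(correct) / len(R0) if R0 else 0.0
    return precision, recall
```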

Results

Table 2 summarizes the precision and (relative) recall values of all systems, by subtrack. The third column reports the total number of mappings returned by each system for each subtrack. All non-equivalence mappings were discarded, but this only happened for one system (Aroma). The fourth column reports the number of alignments from the system that were evaluated, while the fifth column reports the number of alignments judged correct by the assessors. Finally, the sixth and seventh columns report the values of relative precision and recall computed as described above. A star (*) marks the cases in which the system also aligned properties.

SYSTEM      subtrack      tot retrieved  tot evaluated  tot correct  RPrecision  RRecall
                          |A|            |A∩A0|         |A∩R0|       P0(A,R0)    R0(A,R0)
aroma ***   agrafsa       195            144            90           0.625       0.39823
            agrorgbio     240
            fishbio *     11
ASMOV       agrafsa       1
            agrorgbio     0
            fishbio *     5
DSSim       agrafsa       218            129            70           0.542636    0.309735
            agrorgbio     339            214            151          0.705607    0.967949
            fishbio       243            166            79           0.475904    0.603053
Lily        agrafsa       390            105            91           0.866667    0.402655
MapPSO      agrorgbio *   6
            fishbio *     16
RiMOM       agrafsa       743            194            158          0.814433    0.699115
            agrorgbio     395            219            149          0.680365    0.955128
            fishbio       738            217            118          0.543779    0.900763
SAMBO       agrafsa       389            176            121          0.6875      0.535398
SAMBOdtf    agrafsa       650            219            124          0.56621     0.548673

Discussion

From the DSSim and RiMOM results, it seems that fishbio is the most difficult task in terms of precision and agrafsa the most difficult in terms of recall (for most of the systems).

It is really difficult to compare systems at this stage, but it seems that RiMOM is the one that provided the best results. However, its results are comparable with those of the other systems.

The sampling method that was used is certainly not perfect. In particular, it did not permit the evaluation of two systems which returned few results (ASMOV and MapPSO), as well as most of Aroma's alignments. However, the results returned by these systems were not likely to yield good recall. This is related to the next remark, however.

The lack of instructions, and the particular character of the test set, clearly puzzled the participants and their systems. As a consequence, the results may not be as good as those of a very polished test with easily comparable data sets. On the other hand, this provides an honest insight into what these systems would do when confronted with these ontologies on the web. In that respect, the results are not bad.


* Mappings between properties
** Thanks to Andrew Bagdanov, Aureliano Gentile, Gudrun Johannsen.
*** In agrafsa, 40 non-equivalence alignments were ignored; in agrorgbio, 2 non-equivalence mappings were ignored.