Track: Anatomy - OAEI 2007 - Results

Some of the runtime measurements are not presented correctly in the current version of the OAEI results paper. The results presented here correct these mistakes, and the results paper will also be fixed at a later time.

Test Data and Experimental Setting

The ontologies of the anatomy track are the NCI Thesaurus describing the human anatomy, published by the National Cancer Institute (NCI), and the Adult Mouse Anatomical Dictionary, which has been developed as part of the Mouse Gene Expression Database project. Both resources are part of the Open Biomedical Ontologies (OBO). We gratefully thank Martin Ringwald and Terry Hayamizu for providing the reference alignment for these ontologies. The complex and laborious task of generating the reference alignment was carried out using a combination of computational methods and extensive manual evaluation. In addition, the ontologies were extended and harmonized to increase the number of mappings between both ontologies.

The task is placed in a domain where we find large, carefully designed ontologies that are described in technical terms. Besides their large size and a conceptualization that is based on natural language only to a limited degree, they also differ from other ontologies in their use of specific annotations and roles, e.g. the extensive use of the partOf relation. The manual harmonization of the ontologies leads to a situation where a high number of rather trivial mappings can be found by simple string comparison techniques. At the same time, there is a good share of non-trivial mappings that require careful analysis and sometimes also medical background knowledge. To better understand the occurrence of non-trivial correspondences in alignment results, we implemented a straightforward matching tool that compares normalized concept labels. For every pair of concepts ⟨C, D⟩, this trivial matcher generates a correspondence if and only if the normalized label of C is identical to the normalized label of D. In general, we expect an alignment generated by this approach to be highly precise, while recall will be relatively low. For our matching task we measured approximately 99% precision and 60% recall. Notice that this recall is relatively high, which is partially caused by the harmonization process mentioned above.
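The label equality matcher described above can be sketched in a few lines of Python. This is only an illustration: the normalization rules (lowercasing, treating underscores and hyphens as spaces, collapsing whitespace) are assumptions, since the exact normalization used in the evaluation is not specified here, and the concept identifiers and labels are hypothetical toy data.

```python
from __future__ import annotations

import re


def normalize(label: str) -> str:
    """Normalize a concept label (assumed rules, not the track's exact ones)."""
    # Lowercase and treat runs of '_' or '-' as a single space.
    label = re.sub(r"[_\-]+", " ", label.lower())
    # Collapse repeated whitespace and trim both ends.
    return " ".join(label.split())


def label_equality_matcher(
    source: dict[str, str], target: dict[str, str]
) -> set[tuple[str, str]]:
    """Return all pairs (C, D) whose normalized labels are identical."""
    # Index target concepts by normalized label to avoid comparing all pairs.
    index: dict[str, list[str]] = {}
    for concept_id, label in target.items():
        index.setdefault(normalize(label), []).append(concept_id)
    return {
        (c, d)
        for c, label in source.items()
        for d in index.get(normalize(label), [])
    }


# Hypothetical toy ontologies: concept id -> label.
mouse = {"MA_1": "Urinary_Bladder", "MA_2": "hind limb"}
nci = {"NCI_1": "urinary bladder", "NCI_2": "Lower Extremity"}

alignment = label_equality_matcher(mouse, nci)
print(alignment)
```

In this toy example only the bladder concepts are matched. The pair "hind limb" and "Lower Extremity" stands for the kind of non-trivial correspondence that string equality cannot find.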

Because we assumed that all matchers would easily find the trivial mappings, we introduce an additional measure for recall called recall+. Recall+ measures how many of the non-trivial correct correspondences are found in an alignment M. Given the reference alignment R and the alignment S generated by the naive string equality matching, recall+ is defined in the following way:


$$\text{Recall}^{+} = \frac{|(R \cap M) - S|}{|R - S|}$$
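As a minimal sketch, recall+ can be computed directly with set operations when alignments are represented as sets of correspondence pairs. The representation and the toy data below are assumptions made for illustration only.

```python
def recall_plus(reference: set, alignment: set, trivial: set) -> float:
    """Share of non-trivial reference correspondences found in `alignment`.

    reference: R, the reference alignment
    alignment: M, the alignment under evaluation
    trivial:   S, the alignment of the naive string equality matcher
    """
    non_trivial = reference - trivial
    if not non_trivial:
        raise ValueError("reference contains no non-trivial correspondences")
    return len((reference & alignment) - trivial) / len(non_trivial)


# Hypothetical toy alignments over correspondence pairs.
R = {("a", "a'"), ("b", "b'"), ("c", "c'"), ("d", "d'")}
S = {("a", "a'"), ("b", "b'")}            # found by label equality
M = {("a", "a'"), ("c", "c'"), ("e", "x")}  # one non-trivial hit, one error

print(recall_plus(R, M, S))  # one of the two non-trivial pairs is found
print(recall_plus(R, S, S))  # the trivial matcher itself scores 0.0
```

By construction, the label equality matcher itself obtains a recall+ of 0.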

We divided the task of automatically generating an alignment between these ontologies into three subtasks. Task #1 was obligatory for participants of the anatomy track, while tasks #2 and #3 were optional. For task #1 the matching system has to be applied with standard settings to obtain a result that is as good as possible with respect to the expected f-value. For task #2 an alignment with increased precision has to be found. This seems to be an adequate requirement in a scenario where the automatically generated alignment will be used directly without subsequent manual evaluation. In contrast, for task #3 an alignment with increased recall has to be generated. Such an alignment could serve as the basis for subsequent expert evaluation. We believe that systems configurable with respect to these requirements will be much more useful in concrete scenarios than static systems.
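The three measures behind these subtasks can be sketched as follows. We assume here that the f-value is the usual harmonic mean of precision and recall; this assumption is consistent, up to rounding, with the values reported in the results table. The function names and toy data are illustrative only.

```python
def f_value(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of f-value)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(reference: set, alignment: set) -> tuple:
    """Return (precision, recall, f-value) of `alignment` against `reference`."""
    correct = len(reference & alignment)
    precision = correct / len(alignment) if alignment else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall, f_value(precision, recall)


# Hypothetical toy data: one of two generated correspondences is correct.
reference = {("a", "x"), ("b", "y"), ("c", "z")}
generated = {("a", "x"), ("b", "q")}
print(evaluate(reference, generated))

# Label equality matcher from the results table: precision 0.987, recall 0.605.
# The result is close to the reported f-value of 0.750.
print(f_value(0.987, 0.605))
```

A high-precision configuration (task #2) and a high-recall configuration (task #3) move along this trade-off in opposite directions, while task #1 aims at the best balance of the two.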

Main Results

In total, eleven systems participated in the anatomy task. These systems can be roughly divided into three groups. Systems of type A are highly specialized in matching biomedical ontologies and make extensive use of medical background knowledge. These systems are AOAS and Sambo. Systems of type B can solve matching problems of different domains, but include a component exploiting biomedical background knowledge (e.g. using UMLS as lexical reference system). ASMOV and RiMOM fall into this category. Finally, systems of type C can be seen as all-round matching systems that do not distinguish between medical ontologies and ontologies of other domains. Most systems in the experiment fell into this category. The following table gives an overview of the participating systems.

*Due to some mistakes in the evaluation process, the runtimes for SAMBO and ASMOV were confused. Note that SAMBO runs approx. 6 hours for track #1 and ASMOV runs 15 hours (for preliminary results it was 7 days) for track #1, and not vice versa. The information published on this page before 16.10.2007 was not correct and has now been corrected.

| System | Type | Runtime (#1) | Prec (#1) | Rec (#1) | F-val (#1) | Prec (#2) | Rec (#2) | Prec (#3) | Rec (#3) | Recall+ (#1) | Recall+ (#3) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AOAS | A | n.a. | 0.928 | 0.804 | 0.861 | - | - | - | - | 0.505 | - |
| Sambo | A | 6 h | 0.845 | 0.786 | 0.815 | - | - | - | - | 0.580 | - |
| ASMOV | B | 15 h * | 0.803 | 0.701 | 0.749 | 0.870 | 0.696 | 0.739 | 0.705 | 0.270 | 0.284 |
| RiMOM | B | 4 h | 0.377 | 0.659 | 0.480 | - | - | - | - | 0.390 | - |
| Label Eq. | - | 3 min | 0.987 | 0.605 | 0.750 | - | - | - | - | 0.0 | - |
| Falcon-AO | C | 12 min | 0.964 | 0.591 | 0.733 | 0.986 | 0.540 | 0.814 | 0.655 | 0.123 | 0.280 |
| TaxoMap | C | 5 h | 0.596 | 0.732 | 0.657 | 0.985 | 0.642 | - | - | 0.230 | - |
| AgreementM. | C | 30 min | 0.558 | 0.635 | 0.594 | 0.930 | 0.286 | 0.424 | 0.651 | 0.262 | 0.302 |
| Prior+ | C | 23 min | 0.594 | 0.590 | 0.592 | 0.663 | 0.497 | 0.371 | 0.657 | 0.338 | 0.426 |
| Lily | C | 4 days | 0.481 | 0.559 | 0.517 | 0.672 | 0.380 | 0.401 | 0.588 | 0.374 | 0.410 |
| X-SOM | C | 10 h | 0.916 | 0.248 | 0.390 | 0.942 | 0.104 | 0.783 | 0.565 | 0.008 | 0.079 |
| DSSim | C | 75 min | 0.208 | 0.187 | 0.197 | - | - | - | - | 0.067 | - |

* The runtime of the ASMOV system obtained for the preliminary results was 7 days (some paragraphs in the text are based on this information). Meanwhile, the system has been optimized and the runtime has been reduced to approx. 15 hours for track #1, as now presented in the table.

Runtime

The runtime of the systems differs significantly. (Remark: Runtime information has been provided by the participants. All alignments have been generated on similarly equipped standard PCs. Given the large differences in runtime, advantages based on hardware differences can be neglected.) On average, type-C systems were faster than systems that use medical knowledge. Falcon-AO, a system that solves large matching problems by applying a partition-based block matching strategy, solves the matching task in about 12 minutes without loss of alignment quality compared to other systems of type C. It should be investigated whether similar approaches can also be applied to systems like ASMOV or Lily to solve their runtime problems.

Type-C systems

The most astounding result is the surprisingly good performance of the naive label comparison approach compared to the alignments generated by systems of type C. For testcase #1, the results of the naive approach are better with respect to both recall and precision than those of almost all matching systems of type C. Only TaxoMap and AgreementMaker generate an alignment with higher recall, but with a significant loss in precision. We would have expected the participating systems to find more correct correspondences than straightforward label comparison does. It seems that many matching systems do not accept a correspondence even if the normalized labels of the concepts are equal. On the one hand, this might be caused by not detecting this equality at all (e.g. due to a partition-based approach). On the other hand, a detected label equality can be rejected as a correspondence because additional information related to the concepts suggests that they have a different meaning.

Type-A/B systems

Systems that use additional background knowledge related to the biomedical domain clearly generate better alignments than type-C systems. This result conforms with our expectations. The only exception is the low precision of the RiMOM system. The values for recall+ point to the advantage of using domain-related background knowledge. Both AOAS and Sambo detect about 50% or more of the non-trivial correspondences (recall+ of 0.505 and 0.580), while among the systems of type C only Lily and Prior+ achieve about 42% (0.410 and 0.426), and only for testcase #3 with a significant loss in precision. Amongst all systems, the AOAS approach generates the best alignment, closely followed by Sambo. Notice that AOAS is not available as a stand-alone system, but consists of a set of coupled programs which may require user configuration.

Discussion of main results

Obviously, the use of domain-related background knowledge is a crucial point in matching biomedical ontologies, and the additional effort of exploiting this knowledge pays off. This observation supports the claims made by other researchers about the benefits of using background knowledge. Amongst all systems, AOAS and Sambo generate the best alignments; in particular, the relatively high number of detected non-trivial correspondences deserves positive mention. Nevertheless, type-C systems are able to detect non-trivial correspondences, too. The results of Lily and Prior+ on subtask #3 in particular demonstrate this. Thus, there also seems to be a significant potential in exploiting knowledge encoded in the ontologies. Even if no medical background knowledge is used, it seems to make sense to provide a configuration that is specific to this type of domain. This is clearly demonstrated by the fact that most of the universal matching systems fail to find a significant number of trivial correspondences. While in general it makes sense for a matcher not to accept all trivial correspondences in order to avoid the problem of homonymy, there are domains like the present one where homonymy is not a problem, for example because the terminology has been widely harmonized.

One major problem of matching medical ontologies is related to their large size. Though type C systems achieve relatively low values for recall, matching large ontologies seems to be less problematic. On the other hand the extensive use of domain related background knowledge has positive effects on recall, but does not seem to scale well. Thus, a trade-off between runtime and recall has to be found.

In further research we have to distinguish between different types of non trivial correspondences. While for detecting some of these correspondences domain specific knowledge seems to be indispensable, the results indicate that there is also a large subset that can be detected by the use of alternative methods that solely rely on knowledge encoded in the ontologies. The distinction between different classes of non trivial correspondences will be an important step for combining the strengths of both domain specific and domain independent matching systems. In summary, we can conclude that the data set used in the anatomy track is well suited to measure the characteristics of different matching systems with respect to the problem of matching biomedical ontologies.

Further results

To better understand the differences and similarities between the matching approaches implemented in the participating systems, we also analyzed how similar the generated results are. To this end, we computed the Jaccard index $J(A, B) = |A \cap B| / |A \cup B|$, where A and B represent the sets of correspondences generated by two matching systems.
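A minimal sketch of this similarity computation is given below, using the standard definition of the Jaccard index as the size of the intersection divided by the size of the union. The system names and correspondence sets are hypothetical toy data, and returning 1.0 for two empty alignments is a convention chosen here, not something taken from the evaluation.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| of two sets of correspondences."""
    union = a | b
    if not union:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(union)


def pairwise_jaccard(alignments: dict) -> dict:
    """Jaccard index for every unordered pair of named alignments."""
    return {
        (x, y): jaccard(alignments[x], alignments[y])
        for x, y in combinations(sorted(alignments), 2)
    }


# Hypothetical toy alignments; integers stand in for correspondences.
alignments = {
    "Reference": {1, 2, 3, 4},
    "SystemA": {1, 2, 3},
    "SystemB": {3, 4, 5, 6},
}

for pair, value in pairwise_jaccard(alignments).items():
    print(pair, round(value, 3))
```

Since the index is symmetric, only one value per unordered pair needs to be computed; a full table simply mirrors these values across the diagonal.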

The results for all pairs of systems (including the reference mapping and the trivial label equality matcher) are listed in the following table with respect to #1. Cells marked orange have a value higher than 0.7, while cells marked yellow have a value between 0.5 and 0.7.

| | Reference | AOAS | Sambo | ASMOV | RiMOM | LabelEq. | Falcon-AO | TaxoMap | AgreementM. | Prior+ | Lily | X-Som | DSSim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Reference | - | 0.756 | 0.687 | 0.598 | 0.316 | 0.600 | 0.578 | 0.464 | 0.423 | 0.420 | 0.349 | 0.205 | 0.109 |
| AOAS | 0.756 | - | 0.738 | 0.679 | 0.308 | 0.706 | 0.657 | 0.484 | 0.435 | 0.416 | 0.334 | 0.239 | 0.115 |
| Sambo | 0.687 | 0.738 | - | 0.573 | 0.302 | 0.576 | 0.562 | 0.443 | 0.401 | 0.407 | 0.326 | 0.206 | 0.103 |
| ASMOV | 0.598 | 0.679 | 0.573 | - | 0.282 | 0.681 | 0.620 | 0.461 | 0.408 | 0.376 | 0.297 | 0.232 | 0.106 |
| RiMOM | 0.316 | 0.308 | 0.302 | 0.282 | - | 0.277 | 0.272 | 0.264 | 0.252 | 0.258 | 0.224 | 0.089 | 0.065 |
| LabelEq. | 0.600 | 0.706 | 0.576 | 0.681 | 0.277 | - | 0.814 | 0.515 | 0.444 | 0.404 | 0.308 | 0.333 | 0.121 |
| Falcon-AO | 0.578 | 0.657 | 0.562 | 0.620 | 0.272 | 0.814 | - | 0.468 | 0.431 | 0.400 | 0.300 | 0.293 | 0.106 |
| TaxoMap | 0.464 | 0.484 | 0.443 | 0.461 | 0.264 | 0.515 | 0.468 | - | 0.359 | 0.432 | 0.286 | 0.180 | 0.104 |
| AgreementM. | 0.423 | 0.435 | 0.401 | 0.408 | 0.252 | 0.444 | 0.431 | 0.359 | - | 0.341 | 0.274 | 0.166 | 0.086 |
| Prior+ | 0.420 | 0.416 | 0.407 | 0.376 | 0.258 | 0.404 | 0.400 | 0.432 | 0.341 | - | 0.286 | 0.139 | 0.089 |
| Lily | 0.349 | 0.334 | 0.326 | 0.297 | 0.224 | 0.308 | 0.300 | 0.286 | 0.274 | 0.286 | - | 0.115 | 0.067 |
| X-Som | 0.205 | 0.239 | 0.206 | 0.232 | 0.089 | 0.333 | 0.293 | 0.180 | 0.166 | 0.139 | 0.115 | - | 0.038 |
| DSSim | 0.109 | 0.115 | 0.103 | 0.106 | 0.065 | 0.121 | 0.106 | 0.104 | 0.086 | 0.089 | 0.067 | 0.038 | - |

In particular, the low similarity values between systems of type C indicate that the implemented matching methods differ to a large degree. We had expected a significant overlap of the alignment results for these systems, due to common systematic errors and missing background knowledge. Instead, it seems that the systems that generated low f-values "are barking up the wrong tree", but for each of them it seems to be a different one.

As part of these measurements we also found that AOAS is the only system that included all correspondences of the trivial label equality approach. Nearly the same holds only for the Falcon-AO system with respect to testcase #3. We have discussed some reasons for this oddity above.

Other resources

Reference Samples and further results