Ontology Alignment Evaluation Initiative - OAEI-2013 Campaign

Results OAEI 2013 -- Conference track

Matching systems (matchers) have been evaluated against the reference alignment with regard to their precision, recall and F1-measure performance. We also provide a brief report on runtimes and the degree of alignment incoherence.

Evaluation setting

For the evaluation based on the reference alignment, we first filtered out all instance-to-any_entity and owl:Thing-to-any_entity correspondences from the alignments generated using the SEALS platform before computing Precision/Recall/F1-measure, because such correspondences are not contained in the reference alignment. To compute average Precision and Recall over all those alignments, we used absolute scores (i.e. we computed precision and recall from the absolute counts of TP, FP, and FN summed across all 21 test cases). Therefore, the resulting numbers can differ slightly from those computed by the SEALS platform. Then, we computed the F1-measure in the standard way. Finally, we found the highest average F1-measure with thresholding (if possible).
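The averaging and thresholding steps described above can be sketched as follows. This is a minimal illustration, not the evaluation code actually used: the data layout (correspondences as (entity1, entity2, confidence) tuples, references as sets of entity pairs), the function names and the toy data are all assumptions, and the filtering of instance and owl:Thing correspondences is assumed to have happened already.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical data layout: (entity1, entity2, confidence).
Correspondence = Tuple[str, str, float]
Pair = Tuple[str, str]
Scores = Tuple[float, float, float]


def micro_prf(alignments: List[List[Correspondence]],
              references: List[Set[Pair]],
              threshold: float = 0.0) -> Scores:
    """Sum TP, FP and FN over all test cases, then compute P, R and F1 once."""
    tp = fp = fn = 0
    for found, reference in zip(alignments, references):
        # Keep only correspondences at or above the confidence threshold.
        kept = {(e1, e2) for e1, e2, conf in found if conf >= threshold}
        tp += len(kept & reference)
        fp += len(kept - reference)
        fn += len(reference - kept)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def best_threshold(alignments: List[List[Correspondence]],
                   references: List[Set[Pair]],
                   candidates: Iterable[float]) -> Tuple[float, Scores]:
    """Return the candidate threshold with the highest F1 and its scores."""
    scored = ((t, micro_prf(alignments, references, t)) for t in candidates)
    return max(scored, key=lambda item: item[1][2])


# Toy data with two test cases (illustrative only).
toy_alignments = [[("a", "x", 0.9), ("b", "y", 0.4)], [("c", "z", 0.8)]]
toy_references = [{("a", "x")}, {("c", "z"), ("d", "w")}]
threshold, (p, r, f1) = best_threshold(toy_alignments, toy_references,
                                       [0.0, 0.5, 0.95])
print(threshold, round(p, 3), round(r, 3), round(f1, 3))
```

Summing the counts before dividing weights every correspondence equally, whereas averaging per-test-case precision and recall would weight every test case equally. This may be one reason the numbers differ slightly from those reported by the SEALS platform.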


This year we have 22 participants (three systems were evaluated in two versions). For an overview table, please see the general information about results. For matchers which participated in either OAEI 2011.5 or OAEI 2012, we provide a comparison in terms of the highest average F1-measure below.


Participants' alignments

You can download a subset of all alignments, namely those for which there is a reference alignment. We provide the alignments as generated by the SEALS platform (afterwards we applied the small modifications explained above). Alignments are stored as follows: matcher-ontology1-ontology2.rdf.
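When processing the downloaded files, the naming scheme can be split back into its parts. The sketch below is only an illustration: the function name is hypothetical, and it assumes that ontology names contain no hyphens while matcher names may (for example AML-bk or WeSeE-Match), which is why it splits from the right.

```python
from typing import Tuple


def parse_alignment_filename(filename: str) -> Tuple[str, str, str]:
    """Split 'matcher-ontology1-ontology2.rdf' into (matcher, onto1, onto2)."""
    suffix = ".rdf"
    if not filename.endswith(suffix):
        raise ValueError(f"not an .rdf alignment file: {filename}")
    stem = filename[:-len(suffix)]
    # Split from the right so hyphens inside the matcher name are preserved.
    parts = stem.rsplit("-", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"unexpected file name: {filename}")
    matcher, onto1, onto2 = parts
    return matcher, onto1, onto2


# The file name below is an assumed example, not taken from the archive.
print(parse_alignment_filename("AML-bk-cmt-ekaw.rdf"))
```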

Reference alignments

The reference alignment contains 21 alignments (test cases), which correspond to the complete alignment space between 7 ontologies from the OntoFarm data set (7*6/2 = 21 pairs). These 7 ontologies are a subset of all ontologies within this track (16). Across all 16 ontologies, the total number of test cases is hence 120 (16*15/2 pairs). There are two variants of reference alignment:

Results of evaluation based on reference alignment


To provide some context for understanding matcher performance, we included two simple string-based matchers as baselines. StringEquiv (previously called Baseline1) is a string matcher based on string equality applied to the lowercased local names of entities (this baseline was also used within the anatomy track 2012). edna (a string edit distance matcher) was adopted from the benchmark track (with regard to performance, it is very similar to the previously used baseline2).
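The idea behind the StringEquiv baseline can be sketched in a few lines. This is an illustration of the described technique only, not the baseline's actual implementation: the function names, the way the local name is extracted from an IRI, the confidence value of 1.0 and the example IRIs are all assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def local_name(iri: str) -> str:
    """Return the part of an IRI after the last '#' or '/'."""
    return iri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]


def string_equiv(entities1: Iterable[str],
                 entities2: Iterable[str]) -> List[Tuple[str, str, float]]:
    """Match entities whose lowercased local names are identical."""
    index: Dict[str, List[str]] = {}
    for iri in entities2:
        index.setdefault(local_name(iri).lower(), []).append(iri)
    matches = []
    for iri1 in entities1:
        for iri2 in index.get(local_name(iri1).lower(), []):
            # Confidence 1.0 is an assumption; equality is all-or-nothing.
            matches.append((iri1, iri2, 1.0))
    return matches


# Hypothetical entity IRIs for two small ontologies.
onto_a = ["http://example.org/a#Paper", "http://example.org/a#Author"]
onto_b = ["http://example.org/b#paper", "http://example.org/b#Reviewer"]
print(string_equiv(onto_a, onto_b))
```

Indexing the second ontology by lowercased local name keeps the comparison linear in the number of entities rather than comparing every pair.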

Results with regard to reference alignment ra2

The table below gives the results of all 25 systems with regard to the reference alignment (ra2): precision, recall, F1-measure, F2-measure and F0.5-measure, each computed at the threshold that provides the highest average F1-measure for the given matcher. F1-measure is the harmonic mean of precision and recall. F2-measure (for beta=2) weights recall higher than precision, and F0.5-measure (for beta=0.5) weights precision higher than recall.
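All three measures are instances of the general F-beta measure, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The short sketch below computes them; the function name and the example values are illustrative and are not taken from the results table.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """General F-measure; beta > 1 favours recall, beta < 1 favours precision."""
    b2 = beta * beta
    denominator = b2 * precision + recall
    if denominator == 0.0:
        return 0.0  # both precision and recall are zero
    return (1.0 + b2) * precision * recall / denominator


# Illustrative values: a matcher that is more precise than complete.
p, r = 0.8, 0.5
print(round(f_beta(p, r, 1.0), 3),   # F1
      round(f_beta(p, r, 2.0), 3),   # F2
      round(f_beta(p, r, 0.5), 3))   # F0.5
```

For a matcher whose precision exceeds its recall, as in this example, F0.5 is higher than F1 and F2 is lower, which is why the three columns can rank matchers differently.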

The highest average F1-measure and its corresponding precision and recall for the optimal threshold for each matcher.

Matchers are ordered according to their highest average F1-measure. Depending on its position relative to the two baselines, a matcher falls into one of two basic groups:

Results with regard to reference alignment ra1

For better comparison with previous years, there is also a table summarizing performance based on the original reference alignment (ra1). With ra1, the results are slightly better in almost all cases. The order of matchers according to F1-measure is not preserved, e.g. OntoK moved above CIDER-CL.

The highest average F1-measure and its corresponding precision and recall for some threshold for each matcher.

Depending on its position relative to the two baselines, a matcher falls into one of three groups:

Note: CroMatcher was unable to process any ontology pair that included the conference.owl ontology. Therefore, it was evaluated on only 15 test cases (alignments).

Results visualization on precision/recall triangular graph

The performance of matchers with regard to average F1-measure is visualized in the figure below, where matchers of participants from the first group are represented as squares or triangles. The baseline edna is represented as a circle. The horizontal line depicts the level of precision/recall, while values of average F1-measure are depicted by areas bordered by the corresponding lines F1-measure=0.[5|6|7].

precision/recall triangular graph for conference and F1-measure

Comparison of OAEI 2011.5, 2012 and 2013

The table below summarizes the performance results of matchers that participated in more than one edition of the OAEI conference track, with regard to reference alignment ra2.

Performance results summary OAEI 2011.5, 2012 and 2013

The highest improvements between OAEI 2012 and OAEI 2013 were achieved by MapSSS (0.12 increase wrt. F1-measure) and ServOMap (0.7 increase wrt. F1-measure), see the table below.

Difference between 2011.5, 2012 and 2013 results


Next, we measured the total runtime for generating those 21 alignments. The matchers were executed on a laptop running Ubuntu with an Intel Core i5, 2.67GHz and 8GB RAM.

Total runtimes of the participating matchers are shown in the plot below.

Eleven systems finished all 21 test cases within about 1 minute (AML-bk - 16 seconds, ODGOMS - 19 seconds, LogMapLite - 21 seconds, AML, HerTUDA, StringsAuto, HotMatch, LogMap, IAMA, RIMOM2013 - 53 seconds and MaasMatch - 76 seconds). The next four systems needed less than 10 minutes (ServOMap, MapSSS, SYNTHESIS, CIDER-CL). 10 minutes were enough for three more matchers (YAM++, XMapGen, XMapSiG). Finally, three matchers needed up to 40 minutes to finish all 21 test cases (WeSeE-Match - 19 minutes, WikiMatch - 26 minutes, OntoK - 40 minutes).

Runtimes for 18 matchers

Further notes:

Results of evaluation based on logical reasoning

As in previous years, we apply the Maximum Cardinality measure to evaluate the degree of alignment incoherence. Details on this measure and its implementation can be found in [1]. We computed the average over all 21 test cases of the conference track for which a reference alignment exists. In one case (RIMOM2013, marked with an asterisk) we could not compute the exact degree of incoherence due to the combinatorial complexity of the problem; however, we were still able to compute a lower bound, so we know that the actual degree is higher. We do not provide numbers for CroMatcher, since it did not generate all 21 alignments, or for MaasMatch, since its alignments are highly incoherent and the reasoner therefore encountered exceptions.

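Alignment size | 9.333 | 9.714 | 22.857 | 9.857 | 10.524 | 8.857 | 10.714 | 9.952 | 9.857 | 9.667 | 11.762 | 8.952 | 56.81 | 11.048 | 11.048 | 8.429 | 8 | 9.762 | 12 | 18.571 | 10.714 | 8.048 | 12.524
Incoherent alignments | 0 | 0 | 21 | 9 | 9 | 7 | 0 | 7 | 0 | 9 | 13 | 7 | 20 | 4 | 0 | 9 | 1 | 10 | 1 | 0 | 0 | 1 | 0
Incoherence degree | 0% | 0% | 19.5% | 5.3% | 5% | 4% | 0% | 5.4% | 0% | 4.4% | 6.5% | 3.9% | 2.71% | 2% | 0% | 4.8% | 0.4% | 6.1% | 0.4% | 0% | 0% | 0.8% | 0%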

This year eight systems managed to generate coherent alignments: AML, AML-bk, LogMap, MapSSS, StringsAuto, XMapGen, XMapSiG and YAM++. Coherent results are not necessarily due to a specific approach that ensures coherence; they can also be an indirect effect of generating small and highly precise alignments. However, the table shows no matcher that generates particularly small alignments on average, so we conclude that all of these systems have implemented specific coherence-preserving methods. Overall, this is an important improvement compared to previous years, where only four (two) systems managed to generate (nearly) coherent alignments in 2012 (2011, respectively).


We would like to thank Christian Meilicke for his "evaluation based on logical reasoning".


Contact address is Ondřej Zamazal (ondrej.zamazal at vse dot cz).


[1] Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University of Mannheim, 2011.