Matching systems (matchers) have been evaluated against the reference alignment with regard to their precision, recall, and F1-measure. Next, we evaluated the degree of alignment incoherence, and we also provide a brief report on runtimes.
Regarding the evaluation based on the reference alignment, we first filtered out (from the alignments generated using the SEALS platform) all instance-to-any_entity and owl:Thing-to-any_entity correspondences prior to computing precision/recall/F1-measure, because they are not contained in the reference alignment. In order to compute average precision and recall over all those alignments, we used absolute scores (i.e. TP, FP, and FN). Therefore, the resulting numbers can slightly differ from those computed by the SEALS platform. We then computed the F1-measure from the average precision and recall in the standard way. Finally, we found the highest average F1-measure with thresholding (where possible).
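To make the procedure concrete, the following is a minimal Python sketch of this scoring scheme (not the official SEALS evaluation code): micro-averaged precision and recall from absolute TP/FP/FN counts, F1 computed from those averages, and a sweep over confidence thresholds. Representing correspondences as (entity1, entity2, confidence) triples is an assumption made only for illustration.

```python
def micro_scores(alignments, references, threshold=0.0):
    """alignments/references are lists aligned per test case; each alignment is a
    list of (entity1, entity2, confidence) triples, each reference a list of
    (entity1, entity2) pairs."""
    tp = fp = fn = 0
    for found, ref in zip(alignments, references):
        kept = {(e1, e2) for e1, e2, conf in found if conf >= threshold}
        ref = set(ref)
        tp += len(kept & ref)   # correct correspondences
        fp += len(kept - ref)   # spurious correspondences
        fn += len(ref - kept)   # missed correspondences
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(alignments, references, steps=101):
    """Sweep thresholds 0.0 .. 1.0 and return (precision, recall, f1, threshold)
    for the threshold that maximizes the average F1-measure."""
    candidates = []
    for i in range(steps):
        t = i / (steps - 1)
        p, r, f1 = micro_scores(alignments, references, t)
        candidates.append((p, r, f1, t))
    return max(candidates, key=lambda c: c[2])
```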
This year we had 18 participants. For an overview table, please see the general information about the results. For matchers which participated in OAEI 2011 or OAEI 2011.5, we provide a comparison in terms of the highest average F1-measure below.
You can either download all alignments generated by the SEALS platform for all test cases, where each participant has its own directory, or you can download the subset of alignments for which there is a reference alignment. In the latter case we provide the alignments after the small modifications explained above. Alignments are stored as follows: matcher-ontology1-ontology2.rdf.
The reference alignment contains 21 alignments (test cases), which corresponds to the complete alignment space between 7 ontologies from the data set (7·6/2 = 21 pairs). This is a subset of all ontologies within this track (16); the total number of test cases is hence 120 (16·15/2). There are two variants of the reference alignment:

- ra1: the original reference alignment, used in previous years, and
- ra2: the variant used for the main evaluation this year (see the table below).
In order to provide some context for understanding matcher performance, we made two simple string-based matchers as baselines. Baseline1 is a string matcher based on string equality applied to local names of entities, which were lowercased beforehand. Baseline2 enhances baseline1 with three string operations: removal of dashes, underscores, and 'has' words from all local names. A minimal sketch of both baselines is given after this paragraph.
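The following Python sketch illustrates how such baselines can be implemented. It is not the code actually used in the evaluation; in particular, stripping only a leading 'has' is a simplified reading of the description above.

```python
import re


def normalize(local_name, enhanced=False):
    """Baseline1: lowercase only. Baseline2 (enhanced=True): additionally strip
    dashes, underscores and a leading 'has' from the lowercased local name."""
    name = local_name.lower()
    if enhanced:
        name = re.sub(r"^has[_-]?", "", name)   # e.g. 'hasAuthor' -> 'author'
        name = name.replace("-", "").replace("_", "")
    return name


def string_baseline(entities1, entities2, enhanced=False):
    """Return equivalence correspondences between entities whose normalized
    local names are equal, each with confidence 1.0."""
    index = {}
    for e2 in entities2:
        index.setdefault(normalize(e2, enhanced), []).append(e2)
    return [(e1, e2, 1.0)
            for e1 in entities1
            for e2 in index.get(normalize(e1, enhanced), [])]


# Example: baseline2 matches 'hasAuthor' with 'author', baseline1 would not.
print(string_baseline(["hasAuthor", "Paper"], ["author", "paper"], enhanced=True))
```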
In the table below are the results of all 18 participants with regard to the reference alignment ra2. Precision, recall, F1-measure, F2-measure, and F0.5-measure are computed for the threshold that provides the highest average F1-measure. The F1-measure is the harmonic mean of precision and recall; the F2-measure (beta = 2) weights recall higher than precision, and the F0.5-measure (beta = 0.5) weights precision higher than recall.
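All three columns are instances of the standard F-beta measure,

$$ F_\beta = (1+\beta^2)\cdot\frac{\mathrm{precision}\cdot\mathrm{recall}}{\beta^2\cdot\mathrm{precision}+\mathrm{recall}}, $$

which reduces to the harmonic mean of precision and recall for beta = 1.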
Matchers are ordered according to their highest average F1-measure. According to a matcher's position with regard to the two baselines, it can fall into one of two basic groups:
For better comparison with previous years, there is also a table summarizing performance based on the original reference alignment (ra1). In the case of evaluation based on ra1, the results are in almost all cases better by 0.03 to 0.05 points. The order of matchers according to F1-measure is not preserved; e.g., CODI moved ahead of Optima.
According to a matcher's position with regard to the two baselines, it can fall into one of three groups:
Note: the ASE and AUTOMSv2 matchers did not generate 3 out of 21 alignments. Thus, the results for them are only approximations.
The performance of matchers with regard to average F1-measure is visualized in the figure below, where matchers of participants from the first two groups are represented as squares and baselines as circles. The horizontal line depicts the level of precision/recall, while values of average F1-measure are depicted by the areas bordered by the corresponding lines F1-measure = 0.5, 0.6, and 0.7.
The table below summarizes the performance results of matchers that participated in more than one edition of the OAEI conference track, with regard to reference alignment ra2.
The highest improvements between OAEI 2011 and OAEI 2012 were achieved by Optima (0.29 increase in F1-measure) and YAM++ (0.11 increase in F1-measure); see the table below.
We measured the total runtime for generating all 120 alignments. The experiments were executed on a laptop running Ubuntu with an Intel Core i5 (2.67 GHz) and 4 GB RAM. Since the conference track consists of rather small ontologies (up to 140 classes), we set a time limit of five hours. If a system did not finish within this limit, we stopped the experiment.
The total runtimes of the 18 matchers are shown in the plot below. The experiment with YAM++ was stopped after the five-hour time limit.
Four matchers finished all 120 test cases within one minute (LogMapLt - 44 seconds, Hertuda - 49 seconds, ServOMapLt - 50 seconds, and AROMA - 55 seconds). The next four matchers needed less than two minutes (ServOMap, HotMatch, GOMMA, and ASE). Ten minutes were enough for the next four matchers (LogMap, MapSSS, MaasMatch, and AUTOMSv2). Finally, five matchers needed up to 40 minutes to finish all 120 test cases (Optima - 22 min, MEDLEY - 30 min, WeSeE - 36 min, CODI - 39 min, and Wmatch - 40 min).
Further notes:
As in previous years, we apply the Maximum Cardinality measure to evaluate the degree of alignment incoherence. Details on this measure and its implementation can be found in [1]. The results of our experiments are depicted in the following table. In contrast to the previous year, we only compute the average over the test cases of the conference track for which there exists a reference alignment. The presented results are thus aggregated mean values for 21 test cases. In some cases we could not compute the degree of incoherence due to the combinatorial complexity of the problem. In these cases we were still able to compute a lower bound, for which we know that the actual degree is (probably only slightly) higher. Such results are marked with a *. Note that we only included those systems in our evaluation that generated alignments for all test cases of the subset with reference alignments.
Matcher | AROMA | CODI | GOMMA | Hertuda | HotMatch | LogMap | LogMapLt | MaasMatch * | MapSSS | MEDLEY * | Optima | ServOMap | ServOMapLt | WeSeE | Wmatch | YAM++ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Alignment Size | 20.9 | 11.1 | 8.2 | 9.9 | 10.5 | 10.3 | 9.9 | 83.1 | 14.8 | 55.6 | 15.9 | 9 | 6.5 | 9.4 | 9.9 | 12.5 |
Incoherence Degree | 19.4% | 0% | 1.1% | 5.2% | 5% | 0% | 5.4% | 24.3% | 12.6% | 30.5% | 7.6% | 3.3% | 0% | 3.2% | 6% | 0% |
Incoherent Alignments | 18 | 0 | 2 | 9 | 9 | 0 | 7 | 20 | 18 | 20 | 12 | 5 | 0 | 6 | 10 | 0 |
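As a rough orientation (our paraphrase, not the exact definition from [1]), the Maximum Cardinality measure reported in the "Incoherence Degree" row can be read as

$$ \mathrm{inc}(\mathcal{A}) = \frac{|\mathcal{A}| - |\mathcal{A}'|}{|\mathcal{A}|}, $$

where A' is a coherent subalignment of A with maximum cardinality, i.e. the fraction of correspondences that would have to be removed to restore coherence.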
Four matching systems generate coherent alignments: CODI, LogMap, ServOMapLt, and YAM++. However, it is not always clear whether this is due to a specific approach that tries to ensure coherency, or whether it is only an indirect effect of generating small and highly precise alignments. In particular, the coherence of the ServOMapLt alignments might be explained in the latter way, since this system generates overall the smallest alignments. Because some systems cannot generate coherent alignments even though their alignments have on average a size of 8 to 12 correspondences, it can be assumed that CODI, LogMap, and YAM++ have implemented specific coherency-preserving methods. These systems also generate between 8 and 12 correspondences on average, yet none of their alignments is incoherent. This is an important improvement compared to previous years, where we observed that only one or two systems managed to generate (nearly) coherent alignments.
Contact address is Ondřej Šváb-Zamazal (ondrej.zamazal at vse dot cz).
[1] Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University of Mannheim, 2011.