Results OAEI 2013 -- Conference track

Matching systems (matchers) have been evaluated against the reference alignment with regard to their precision, recall and F1-measure performance. We also provide a brief report on runtimes and on the degree of alignment incoherence.

Evaluation setting

Regarding evaluation based on the reference alignment, we first filtered out (from the alignments generated using the SEALS platform) all instance-to-any_entity and owl:Thing-to-any_entity correspondences prior to computing Precision/Recall/F1-measure, because they are not contained in the reference alignment. In order to compute average Precision and Recall over all those alignments we used absolute scores (i.e. we computed precision and recall from the absolute numbers of TP, FP, and FN summed across all 21 test cases). Therefore, the resulting numbers can differ slightly from those computed by the SEALS platform. Then, we computed F1-measure in the standard way. Finally, we found the highest average F1-measure with thresholding (if possible).
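The pooling of absolute scores described above can be sketched as follows. This is a minimal illustration, not the evaluation code actually used for the track: the function name micro_average and the example counts are hypothetical.

```python
def micro_average(counts):
    """Precision, recall and F1 from per-test-case (TP, FP, FN) counts.

    The counts are summed over all test cases first, and the measures
    are computed once from the pooled totals.
    """
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical (TP, FP, FN) counts for three test cases; the track uses 21.
example_counts = [(8, 2, 4), (10, 5, 5), (6, 1, 6)]
precision, recall, f1 = micro_average(example_counts)
```

Pooling the counts weights every correspondence equally, whereas averaging per-test-case scores would weight every test case equally, so the two ways of averaging generally give different numbers.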

Participants

This year we have 22 participants (three systems were evaluated in two versions). For an overview table please see the general information about results. For matchers which participated in either OAEI 2011.5 or OAEI 2012, we provide a comparison in terms of the highest average F1-measure below.

Data

Participants' alignments

You can download the subset of all alignments for which there is a reference alignment. In this case we provide the alignments as generated by the SEALS platform (afterwards we applied the small modifications explained above). Alignments are stored as follows: matcher-ontology1-ontology2.rdf.

Reference alignments

The reference alignment contains 21 alignments (test cases), which correspond to the complete alignment space between 7 ontologies from the OntoFarm data set (7 × 6 / 2 = 21 ontology pairs). This is a subset of all ontologies within this track (16); the total number of test cases over all 16 ontologies is hence 120 (16 × 15 / 2). There are two variants of the reference alignment:

original reference alignment (ra1), which you can download - please let us know how you use this reference alignment and data set outside the OAEI context (ondrej.zamazal at vse dot cz).
entailed reference alignment (ra2), generated as a transitive closure computed on the original reference alignment (ra1). In order to obtain a coherent reference alignment, conflicting correspondences have been inspected and resolved by evaluators. As a result, the degree of correctness and completeness of ra2 is probably slightly better than that of ra1. However, the differences are relatively small.
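The idea of a transitive closure over a reference alignment can be illustrated with a small sketch. It rests on assumptions that are not stated above: all correspondences are treated as equivalences, entities are written as hypothetical (ontology, name) tuples, and the manual inspection of conflicting correspondences is not modelled. It is not the procedure that was used to build ra2.

```python
from itertools import combinations


def transitive_closure(correspondences):
    """Close equivalence correspondences under symmetry and transitivity.

    Entities are (ontology, name) tuples. Only pairs linking two
    different ontologies are returned.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the equivalence classes of every matched pair.
    for a, b in correspondences:
        parent[find(a)] = find(b)

    # Group entities by the representative of their class.
    classes = {}
    for entity in list(parent):
        classes.setdefault(find(entity), []).append(entity)

    # Emit every cross-ontology pair inside each class.
    closed = set()
    for members in classes.values():
        for a, b in combinations(sorted(members), 2):
            if a[0] != b[0]:
                closed.add((a, b))
    return closed


# Hypothetical input: o1:Paper = o2:Paper and o2:Paper = o3:Article.
ra1_example = [
    (("o1", "Paper"), ("o2", "Paper")),
    (("o2", "Paper"), ("o3", "Article")),
]
ra2_example = transitive_closure(ra1_example)
```

In this example the closure adds the entailed correspondence between o1:Paper and o3:Article, which is not present in the input.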

Results of evaluation based on reference alignment

Baselines

In order to provide some context for understanding matchers' performance, we included two simple string-based matchers as baselines. StringEquiv (previously called Baseline1) is a string matcher based on string equality applied to the lowercased local names of entities (this baseline was also used within the anatomy track 2012). edna (a string edit distance matcher) was adopted from the benchmark track (with regard to performance it is very similar to the previously used baseline2).
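To make the two baseline ideas concrete, here is a minimal sketch of string equality on lowercased local names and of a plain edit distance. The helper names (local_name, string_equiv, edit_distance) and the example URIs are hypothetical, and the code is not the actual StringEquiv or edna implementation.

```python
import re


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri)[-1]


def string_equiv(entities1, entities2):
    """Pair entities whose lowercased local names are identical."""
    index = {}
    for uri in entities2:
        index.setdefault(local_name(uri).lower(), []).append(uri)
    return [
        (u1, u2)
        for u1 in entities1
        for u2 in index.get(local_name(u1).lower(), [])
    ]


def edit_distance(s, t):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Hypothetical entity URIs from two ontologies.
matches = string_equiv(
    ["http://example.org/o1#Paper", "http://example.org/o1#Author"],
    ["http://example.org/o2#paper", "http://example.org/o2#Reviewer"],
)
```

An edit-distance matcher would typically turn the distance into a similarity and keep pairs above a threshold; how edna does this exactly is not described here.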

Results with regard to reference alignment ra2

The table below shows the results of all 25 systems with regard to the reference alignment (ra2). Precision, recall, F1-measure, F2-measure and F0.5-measure are computed for the threshold that provides the highest average F1-measure for each matcher. F1-measure is the harmonic mean of precision and recall. F2-measure (for beta=2) weights recall higher than precision and F0.5-measure (for beta=0.5) weights precision higher than recall.
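The F-measures and the threshold selection mentioned above can be sketched as follows. The function names f_measure and best_threshold, the confidence values and the candidate thresholds are hypothetical, and for brevity the sketch works on a single set of correspondences rather than on the average over 21 test cases.

```python
def f_measure(precision, recall, beta=1.0):
    """General F-beta measure; beta=1 gives the harmonic mean."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_threshold(correspondences, reference, thresholds):
    """Return (best F1, threshold) over the candidate thresholds.

    correspondences maps an (entity1, entity2) pair to its confidence;
    reference is the set of correct pairs.
    """
    best = (0.0, None)
    for t in thresholds:
        kept = {pair for pair, conf in correspondences.items() if conf >= t}
        tp = len(kept & reference)
        p = tp / len(kept) if kept else 0.0
        r = tp / len(reference) if reference else 0.0
        score = f_measure(p, r)
        if score > best[0]:
            best = (score, t)
    return best


# Hypothetical matcher output with confidences, and a reference set.
found = {("a", "x"): 0.9, ("b", "y"): 0.6, ("c", "z"): 0.3}
gold = {("a", "x"), ("b", "y"), ("d", "w")}
best_f1, best_t = best_threshold(found, gold, [0.2, 0.5, 0.8])
```

With precision 0.8 and recall 0.5, f_measure gives roughly 0.71 for beta=0.5, 0.62 for beta=1 and 0.54 for beta=2, which shows how beta shifts the weight between precision and recall.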

The highest average F1-measure and its corresponding precision and recall for the optimal threshold for each matcher.

Matchers are ordered according to their highest average F1-measure. Depending on its position with regard to the two baselines, a matcher falls into one of three basic groups:

Group 1 consists of matchers (YAM++, AML-bk (AML stands for AgreementMakerLight), LogMap, AML, ODGOMS, StringsAuto, ServOMap, MapSSS, HerTUDA, WikiMatch, WeSeE-Match, IAMA, HotMatch, CIDER-CL) having better results than both baselines in terms of highest average F1-measure.
Group 2 consists of matchers (OntoK, LogMapLite, XMapSiG, XMapGen and SYNTHESIS) performing better than the baseline StringEquiv but not better than the baseline edna.
Group 3 consists of matchers (RIMOM2013, CroMatcher and MaasMatch) performing worse than the baseline StringEquiv.

Results with regard to reference alignment ra1

For better comparison with previous years there is also a table summarizing performance based on the original reference alignment (ra1). In the case of evaluation based on ra1, the results are in almost all cases slightly better. The order of matchers according to F1-measure is not preserved, e.g. OntoK moved ahead of CIDER-CL.

The highest average F1-measure and its corresponding precision and recall for some threshold for each matcher.

Depending on its position with regard to the two baselines, a matcher falls into one of three groups:

Group 1 consists of matchers (YAM++, AML-bk, LogMap, AML, ODGOMS, StringsAuto, ServOMap, MapSSS, WeSeE-Match, HerTUDA) having better (or at least the same) results than the baselines in terms of highest average F1-measure.
Group 2 consists of matchers which perform worse than edna in terms of average F1-measure but still better than StringEquiv (WikiMatch, LogMapLite, IAMA, HotMatch, OntoK, CIDER-CL, SYNTHESIS).
Group 3 consists of matchers (RIMOM2013, CroMatcher and MaasMatch) performing worse than StringEquiv.

Note: CroMatcher was unable to process any ontology pair that included the conference.owl ontology. Therefore, it was evaluated on only 15 test cases (alignments).

Results visualization on precision/recall triangular graph

Performance of matchers with regard to average F1-measure is visualized in the figure below, where matchers of participants from the first group are represented as squares or triangles. Baseline edna is represented as a circle. The horizontal line depicts the level of precision/recall, while values of average F1-measure are depicted by areas bordered by the corresponding lines F1-measure=0.[5|6|7].

precision/recall triangular graph for conference and F1-measure

Comparison of OAEI 2011.5, 2012 and 2013

The table below summarizes performance results of matchers that participated in more than one edition of the OAEI conference track, with regard to reference alignment ra2.

Performance results summary OAEI 2011.5, 2012 and 2013

The highest improvement between OAEI 2012 and OAEI 2013 was achieved by MapSSS (0.12 increase wrt. F1-measure) and ServOMap (0.7 increase wrt. F1-measure), see the table below.

Difference between 2011.5, 2012 and 2013 results

Runtimes

Next, we measured the total runtime for generating those 21 alignments. The matchers were executed on a laptop running Ubuntu, with an Intel Core i5 at 2.67GHz and 8GB RAM.

Total runtimes of the participating matchers are shown in the plot below.

Eleven systems finished all 21 test cases within about 1 minute (AML-bk - 16 seconds, ODGOMS - 19 seconds, LogMapLite - 21 seconds, AML, HerTUDA, StringsAuto, HotMatch, LogMap, IAMA, RIMOM2013 - 53 seconds and MaasMatch - 76 seconds). The next four systems needed less than 10 minutes (ServOMap, MapSSS, SYNTHESIS, CIDER-CL). 10 minutes were enough for three further matchers (YAM++, XMapGen, XMapSiG). Finally, three matchers needed up to 40 minutes to finish all 21 test cases (WeSeE-Match - 19 minutes, WikiMatch - 26 minutes, OntoK - 40 minutes).

Further notes:

we do not provide a runtime for CroMatcher since it failed to generate alignments for 6 out of 21 test cases.

Results of evaluation based on logical reasoning

As in previous years, we apply the Maximum Cardinality measure to evaluate the degree of alignment incoherence. Details on this measure and its implementation can be found in [1]. We computed the average over all 21 test cases of the conference track for which a reference alignment exists. In one case (RIMOM2013, marked with an asterisk) we could not compute the exact degree of incoherence due to the combinatorial complexity of the problem; however, we were still able to compute a lower bound, so the actual degree is not lower than the reported value. We do not provide numbers for CroMatcher, since it did not generate all 21 alignments, or for MaasMatch, since its alignments are highly incoherent and the reasoner therefore encountered exceptions.

Matcher	AML	AML-bk	CIDER_CL	HerTUDA	HotMatch	IAMA	LogMap	LogMapLite	MapSSS	ODGOMS	ODGOMS1_2	OntoK2	RIMOM2013*	ServOMap_v104	StringsAuto	SYNTHESIS	WeSeEMatch	WikiMatch	XMapGen	XMapGen1_4	XMapSiG1_3	XMapSiG1_4	YAM++
Alignment size	9.333	9.714	22.857	9.857	10.524	8.857	10.714	9.952	9.857	9.667	11.762	8.952	56.81	11.048	11.048	8.429	8	9.762	12	18.571	10.714	8.048	12.524
Incoherent alignments	0	0	21	9	9	7	0	7	0	9	13	7	20	4	0	9	1	10	1	0	0	1	0
Incoherence degree	0%	0%	19.5%	5.3%	5%	4%	0%	5.4%	0%	4.4%	6.5%	3.9%	2.71%	2%	0%	4.8%	0.4%	6.1%	0.4%	0%	0%	0.8%	0%

This year eight systems managed to generate coherent alignments: AML, AML-bk, LogMap, MapSSS, StringsAuto, XMapGen, XMapSiG and YAM++. Coherent results are not necessarily due to a specific approach ensuring coherence; they can also be indirectly caused by generating small and highly precise alignments. However, looking at the table, no matcher generates alignments that are on average too small, so we can conclude that all of these systems have implemented specific coherence-preserving methods. Overall, this is an important improvement compared to the previous years, where we observed that only four (two) systems managed to generate (nearly) coherent alignments in 2012 (2011 resp.).

Acknowledgements

We would like to thank Christian Meilicke for his "evaluation based on logical reasoning".

Contacts

Contact address is Ondřej Zamazal (ondrej.zamazal at vse dot cz).

References

[1] Christian Meilicke. Alignment Incoherence in Ontology Matching. PhD thesis, University of Mannheim, 2011.