Ontology Alignment Evaluation Initiative - OAEI 2012 Campaign

Benchmark results for OAEI 2012

In the following we present the results of the OAEI 2012 evaluation of the benchmarks track. If you notice any kind of error, do not hesitate to contact Jérôme Euzenat (see mail below).

Data set setting

The focus of this campaign was on scalability, i.e., the ability of matchers to deal with data sets of increasing size. To that end, we generated five different benchmarks against which matchers have been evaluated: benchmark1 (biblio), benchmark2, benchmark3, benchmark4 and benchmark5 (finance). The new benchmarks were generated following the same model as previous benchmarks, from seed ontologies of different domains and sizes.

The following table summarizes the information about ontologies' sizes.

Test set              biblio    benchmark2    benchmark3          benchmark4          finance
                                (commerce)    (bioinformatics)    (product design)
Ontology size
  classes+properties  97        247           354                 472                 633
  instances           112       35            681                 376                 1113
  entities            209       282           1035                848                 1746
  triples             1332      1562          5387                4262                21979

Participation

Of the 23 systems listed on the 2012 final results page, 17 participated in this track. TOAST was only able to pass the tests of the Anatomy track; OMR follows a very strange strategy, generating alignments whose correspondences involve entities that do not exist in one or both of the ontologies being matched; OntoK revealed several bugs when preliminary tests were executed, and its developers were not able to fix all of them; and the requirements for executing CODI on our machines were not met due to academic license problems. For these reasons, these tools were not evaluated. Of course, we did not consider systems participating only in the Instance Matching track.

Experimental setting

As stated before, the focus of this campaign was on scalability. We have addressed it from two aspects: the compliance of the generated alignments (precision, F-measure and recall) and the runtime needed to produce them, in both cases over data sets of increasing size.

For each aspect, all systems have been executed under the same conditions, whose specifications are given below.

Compliance raw results

We have executed all systems on two Debian virtual machines (VMs), each with two cores and 8GB RAM, running continuously in parallel, except for the finance benchmark, which required 10GB RAM for some systems. We followed the general recommendation for Linux operating systems of allocating no more than 80% of the available memory to Java processes: 6GB RAM were allocated to Java processes running in 8GB RAM VMs, and 8GB RAM were allocated to Java processes running in 10GB RAM VMs.
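
As an illustration only (the actual launch commands used in the evaluation are not reproduced here), the 80% rule amounts to the following heap settings when a matcher is started as a Java process; the launcher function and jar path below are hypothetical placeholders.

    # Illustrative sketch, not the evaluation harness itself: cap the Java heap
    # at 80% of the VM memory (8GB VM -> 6GB heap, 10GB VM -> 8GB heap).
    import subprocess

    def java_heap_gb(vm_ram_gb: int, cap: float = 0.8) -> int:
        return int(vm_ram_gb * cap)

    def run_matcher(jar_path: str, source_onto: str, target_onto: str, vm_ram_gb: int = 8):
        heap = java_heap_gb(vm_ram_gb)
        cmd = ["java", f"-Xmx{heap}g", "-jar", jar_path, source_onto, target_onto]
        return subprocess.run(cmd, check=True)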

For each benchmark seed ontology, a data set of 94 tests was automatically generated. From the whole systematic benchmark test set (111 tests), we excluded the tests that were not artificially generated: 102--104, 203--210, 230--231 and 301--304. Runs for benchmarks 2, 3 and 4 were executed in blind mode. Just one run was done for each benchmark, because previous campaigns confirmed that, even if some matchers exhibit non-deterministic behavior on a test-case basis, their average measures on the whole data set remain almost the same across different runs. The data sets used for the biblio and finance benchmarks are available here.
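
As a quick check, the excluded ranges account for 17 tests, which brings the 111 systematic tests down to the 94 used here:

    # 102-104 (3) + 203-210 (8) + 230-231 (2) + 301-304 (4) = 17 excluded tests
    excluded = [(102, 104), (203, 210), (230, 231), (301, 304)]
    removed = sum(hi - lo + 1 for lo, hi in excluded)
    print(111 - removed)  # -> 94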

The following table presents the harmonic means of precision, F-measure and recall for the five benchmark data sets for all the participants, along with their confidence-weighted values. The table also shows the measures obtained by edna, a simple edit-distance algorithm on labels which is used as a baseline. ASE was not able to handle the finance data set, and MEDLEY did not complete the evaluation of the benchmark4 and finance data sets within a reasonable amount of time (less than 12 hours). The MapSSS results presented for finance are those obtained in the OAEI2011.5 campaign, in which no time limit was imposed. The alignments produced by the tools for biblio and finance are available here.

Harmonic means of precision, F-measure and recall, along with their confidence-weighted values (in parentheses)
Matching system   biblio (P F R)  |  benchmark2 (P F R)  |  benchmark3 (P F R)  |  benchmark4 (P F R)  |  finance (P F R)
edna              0.35(0.45) 0.41(0.47) 0.5  |  0.46(0.61) 0.48(0.55) 0.5  |  0.22(0.25) 0.3(0.33) 0.5  |  0.31(0.37) 0.38(0.42) 0.5  |  0.22(0.25) 0.3(0.33) 0.5
AROMA             0.98(0.99) 0.77(0.73) 0.64(0.58)  |  0.97(0.98) 0.76(0.73) 0.63(0.58)  |  0.38(0.43) 0.53(0.54) 0.83(0.73)  |  0.96 0.73(0.7) 0.59(0.55)  |  0.94 0.72(0.7) 0.58(0.56)
ASE               0.49 0.51(0.52) 0.54  |  0.72(0.74) 0.61 0.53  |  0.27 0.36 0.54  |  0.4(0.41) 0.45 0.51  |  na na na
AUTOMSv2          0.97 0.69 0.54  |  0.97 0.68 0.52  |  0.99(1) 0.7 0.54  |  0.91(0.92) 0.65 0.51(0.5)  |  0.35* 0.42(0.39)* 0.55(0.46)*
GOMMA             0.69(0.68) 0.48(0.43) 0.37(0.31)  |  0.99 0.6(0.54) 0.43(0.38)  |  0.93(0.94) 0.64(0.59) 0.49(0.43)  |  0.99 0.57(0.51) 0.4(0.35)  |  0.96 0.6(0.53) 0.43(0.37)
Hertuda           0.9 0.68 0.54  |  0.93 0.67 0.53  |  0.94 0.68 0.54  |  0.9 0.66 0.51  |  0.72 0.62 0.55
Hotmatch          0.96 0.66 0.5  |  0.99 0.68 0.52  |  0.99 0.68 0.52  |  0.99 0.66 0.5  |  0.97 0.67 0.51
LogMap            0.73 0.56(0.51) 0.45(0.39)  |  1 0.64(0.59) 0.47(0.42)  |  0.95(0.96) 0.65(0.6) 0.49(0.44)  |  0.99(1) 0.63(0.58) 0.46(0.41)  |  0.95(0.94) 0.63(0.57) 0.47(0.4)
LogMapLt          0.71 0.59 0.5  |  0.95 0.66 0.5  |  0.95 0.65 0.5  |  0.95 0.65 0.5  |  0.9 0.66 0.52
MaasMatch         0.54(0.9) 0.56(0.63) 0.57(0.49)  |  0.6(0.93) 0.6(0.65) 0.6(0.5)  |  0.53(0.9) 0.53(0.63) 0.53(0.48)  |  0.54(0.92) 0.54(0.64) 0.54(0.49)  |  0.59(0.92) 0.59(0.63) 0.58(0.48)
MapSSS            0.99 0.87 0.77  |  1 0.86 0.75  |  1 0.82 0.7  |  1 0.81 0.68  |  0.99 0.83 0.71
MEDLEY            0.6(0.59) 0.54(0.53) 0.5(0.48)  |  0.92(0.94) 0.65(0.63) 0.5(0.48)  |  0.78 0.61(0.56) 0.5(0.43)  |  to to to  |  to to to
Optima            0.89 0.63 0.49  |  1 0.66 0.5  |  0.97 0.69 0.53  |  0.92 0.6 0.45  |  0.96 0.56 0.4
ServOMap          0.88 0.58 0.43  |  1 0.67 0.5  |  1 0.67 0.5  |  0.89 0.6(0.59) 0.45(0.44)  |  0.92 0.63 0.48
ServOMapLt        1 0.33 0.2  |  1 0.51 0.34  |  1 0.55 0.38  |  1 0.41 0.26  |  0.99 0.51 0.34
WeSeE             0.99 0.69(0.68) 0.53(0.52)  |  1 0.69(0.68) 0.52  |  1 0.7(0.69) 0.53  |  1 0.66 0.5  |  0.99 0.7(0.69) 0.54(0.53)
Wikimatch         0.74 0.62 0.54  |  0.97 0.67(0.68) 0.52  |  0.96(0.97) 0.68 0.52  |  0.94(0.95) 0.66 0.51  |  0.74(0.75) 0.62(0.63) 0.54
YAM++             0.98(0.95) 0.83(0.18) 0.72(0.1)  |  0.96(1) 0.89(0.72) 0.82(0.56)  |  0.97(1) 0.85(0.7) 0.76(0.54)  |  0.96(1) 0.83(0.7) 0.72(0.54)  |  0.97(1) 0.9(0.72) 0.84(0.57)

n/a: not able to run this benchmark
to: timeout exceeded
*: uncompleted results
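
As a reminder of how these figures are obtained, the sketch below computes precision, recall and F-measure for a single test, treating an alignment as a set of correspondences, together with a plain (unweighted) harmonic mean over the per-test values; the exact aggregation performed by the OAEI tooling may differ.

    def precision(found: set, reference: set) -> float:
        return len(found & reference) / len(found) if found else 0.0

    def recall(found: set, reference: set) -> float:
        return len(found & reference) / len(reference) if reference else 0.0

    def f_measure(p: float, r: float) -> float:
        # F-measure is the harmonic mean of precision and recall
        return 2 * p * r / (p + r) if p + r else 0.0

    def harmonic_mean(values) -> float:
        # plain harmonic mean over per-test values; a zero anywhere yields zero
        return len(values) / sum(1.0 / v for v in values) if all(values) else 0.0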

The table shows that, with few exceptions, all systems achieve higher precision than recall on all benchmarks. The exceptions are the baseline; AROMA for benchmark3; ASE for biblio, benchmark3 and benchmark4; AUTOMSv2 for finance; and MaasMatch, which produced very similar values for both measures on all benchmarks. Taking the baseline as a reference, no tool performed worse on precision, and only ServOMapLt had a significantly lower recall, with LogMap having slightly lower values for that measure.

The test-by-test results on which this table is built are given in the following tables:

  - biblio (weighted)
  - benchmark2 (weighted)
  - benchmark3 (weighted)
  - benchmark4 (weighted)
  - finance (weighted and weighted2)

Confidence-weighted values are meaningful only for those tools which generate correspondences with confidence values different from one. These measures reward systems able to provide accurate confidence values: they yield a significant increase in precision for systems like edna and MaasMatch, which presumably returned many incorrect correspondences with low confidence, and a significant decrease in recall for AROMA, GOMMA, LogMap and YAM++, which apparently returned many correct correspondences with low confidence. The variation for YAM++ is quite impressive, especially for the biblio benchmark.
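
One plausible formulation of these weighted measures (a sketch, not necessarily the exact definition used by the evaluation tools) lets each returned correspondence count with its confidence instead of 1. Under this reading, incorrect correspondences with low confidence barely hurt precision, while correct correspondences with low confidence contribute little to recall, which matches the behavior described above.

    def weighted_precision(found: dict, reference: set) -> float:
        # `found` maps each correspondence to its confidence in [0, 1]
        total = sum(found.values())
        correct = sum(conf for corr, conf in found.items() if corr in reference)
        return correct / total if total else 0.0

    def weighted_recall(found: dict, reference: set) -> float:
        correct = sum(conf for corr, conf in found.items() if corr in reference)
        return correct / len(reference) if reference else 0.0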

Based on the average F-measure over all benchmarks, shown in the next table, we observe that the group of best systems remains relatively the same across the different benchmarks. Even if no system is best on all benchmarks, YAM++, MapSSS and AROMA seem to generate the best alignments in terms of F-measure.

Variability of results based on F-measure
Matching system biblio benchmark2 benchmark3 benchmark4 finance Average F-measure
YAM++ 0.83 0.89 0.85 0.83 0.9 0.86
MapSSS 0.87 0.86 0.82 0.81 0.83 0.84
AROMA 0.77 0.76 0.53 0.73 0.72 0.70
WeSeE 0.69 0.69 0.7 0.66 0.7 0.69
GOMMA 0.67 0.69 0.7 0.63 0.66 0.67
HotMatch 0.66 0.68 0.68 0.66 0.67 0.67
Hertuda 0.68 0.67 0.68 0.66 0.62 0.66
Wikimatch 0.62 0.67 0.68 0.66 0.62 0.65
LogMapLt 0.59 0.66 0.65 0.65 0.66 0.64
ServOMap 0.58 0.67 0.67 0.6 0.63 0.63
Optima 0.63 0.66 0.69 0.6 0.56 0.63
AUTOMSv2 0.69 0.68 0.7 0.65 0.39 0.62
LogMap 0.56 0.64 0.65 0.63 0.63 0.62
MEDLEY 0.54 0.65 0.61 to to 0.60
MaasMatch 0.56 0.6 0.53 0.54 0.59 0.56
ASE 0.51 0.61 0.36 0.45 na 0.48
ServOMapLt 0.33 0.51 0.55 0.41 0.51 0.46
edna 0.41 0.48 0.3 0.38 0.3 0.37

na: not able to pass this test
to: timeout exceeded
*: uncompleted results

On average, all matchers perform better than the baseline, and they behave relatively stably across all benchmarks. Nevertheless, we observe a high variance in the results of some systems. Outliers are, for instance, the poor precision of AROMA on benchmark3 and the poor recall of ServOMapLt on biblio. These variations might depend on inter-dependencies between matching systems and data sets, and need additional analysis requiring a deep knowledge of the evaluated systems.
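
For instance, feeding a few rows of the F-measure table above to a short script makes the spread, and hence the outliers, easy to spot:

    from statistics import mean, pstdev

    # F-measures over (biblio, benchmark2, benchmark3, benchmark4, finance)
    f_measures = {
        "YAM++":  [0.83, 0.89, 0.85, 0.83, 0.90],
        "MapSSS": [0.87, 0.86, 0.82, 0.81, 0.83],
        "AROMA":  [0.77, 0.76, 0.53, 0.73, 0.72],  # benchmark3 is the outlier
    }
    for system, values in f_measures.items():
        print(f"{system:8s} mean={mean(values):.2f} stdev={pstdev(values):.2f}")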

Finally, the next table shows the results of the tools that participated in the OAEI2011 and/or OAEI2011.5 campaigns. We present only the biblio and finance benchmarks because they were used in those campaigns as well as in OAEI2012. Even though the figures shown in the table were obtained with different data sets, generated with the same test generator from the same seed ontologies, the comparison is valid: previous experiments have shown that the F-measures obtained for data sets generated under those conditions remain pretty much the same [1].

Results across OAEI2011, OAEI2011.5 and OAEI2012 campaigns
Matching system biblio finance
2011 2011.5 2012 2011 2011.5 2012
AROMA 0.76 0.76 0.77 0.70 0.70 0.72
AUTOMSv2 --- 0.69 0.69 --- na 0.39
GOMMA --- 0.67 0.67 --- 0.66 0.66
Hertuda --- 0.67 0.68 --- 0.60 0.62
LogMap 0.57 0.48 0.56 na 0.60 0.63
LogMapLt --- 0.58 0.59 --- 0.66 0.66
MaasMatch 0.58 0.50 0.56 0.61 0.52 0.59
MapSSS 0.84 0.86 0.87 to 0.83 0.83
Optima 0.65 --- 0.63 to --- 0.56
WeSeE --- 0.67 0.69 --- 0.69 0.70
YAM++ 0.86 0.83 0.83 to na 0.90

na: not able to pass this test
to: timeout exceeded

Small variations are observed in the table across the different campaigns. With respect to biblio, negative variations of 2-4% for some tools and positive variations of 1-3% for others are observed. LogMap and MaasMatch dropped by more than 10% in the OAEI2011.5 campaign, but they recovered well in OAEI2012.
For finance, the number of systems that passed the tests increased, either because bugs reported for the versions used in previous campaigns were fixed, or because we relaxed the timeout constraint imposed for this ontology in OAEI2011. For many tools that passed the tests in previous campaigns, positive variations of 1-3% are observed.

Precision/recall graphs

For the systems that provided their results with confidence measures different from 1 or 0, it is possible to draw precision/recall graphs in order to compare them; these graphs are given in the next figure. The graphs show the real precision at n% recall, and they stop when no more correspondences are available; the end point then corresponds to the precision and recall reported in the first table above.

Precision-recall graphs for Benchmark datasets

Precision/recall graphs for benchmarks. The alignments generated by the matchers are cut at the threshold necessary for achieving n% recall, and the corresponding precision is computed.
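
A sketch of this computation (the data structures here are illustrative): rank the correspondences by decreasing confidence, extend the cut until the target recall is reached, and report the precision of that cut.

    def precision_at_recall(found: dict, reference: set, target_recall: float):
        # `found` maps each correspondence to its confidence in [0, 1]
        ranked = sorted(found, key=found.get, reverse=True)
        kept, correct = 0, 0
        for corr in ranked:
            kept += 1
            if corr in reference:
                correct += 1
            if correct / len(reference) >= target_recall:
                return correct / kept
        return None  # this recall level is never reached by the alignment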

Runtime results

Runtime scalability has been evaluated from two perspectives: on the one hand, we considered the five seed ontologies from different domains and with different sizes; on the other hand, we considered the finance ontology scaled down to 25%, 50% and 75% of its original size. For both modalities, the data sets were composed of a subset of a whole systematic benchmark; the tests were carefully selected so as to be representative of all the alterations used to build a whole benchmark data set. The tests used were 101, 201-4, 202-4, 221, 228, 233, 248-4, 250-4, 253-4, 254-4, 257-4, 260-4, 261-4, 265, 266.

All the experiments were done on a 3GHz Xeon 5472 (4 cores) machine running Linux Fedora 8 with 8GB RAM. The following tables present the runtime measurements in seconds for the data sets used; systems are ordered by increasing cumulated time. Semi-log graphs of runtime against benchmark size in terms of classes and properties are given after the tables.

Runtime measurement (in seconds) for the finance data sets
Matching system    finance25%   finance50%   finance75%   finance   Total time
LogMapLt           27           34           35           35        131
ServOMapLt         30           35           41           44        150
GOMMA              43           50           44           49        186
LogMap             39           48           49           54        190
ServOMap           36           46           57           64        203
AROMA              44           44           66           77        231
YAM++              na           176          256          386       818
Hertuda            52           119          284          452       907
Hotmatch           61           136          309          522       1028
MaasMatch          94           201          407          833       1535
AUTOMSv2           244          495          869          1535      3143
Wikimatch          841          1913         3001         4350      10105
WeSeE              1174         2347         3780         5163      12464
Optima             685          2434         5588         8234      16941
MapSSS             87622        71191        78578        82841     320232
MEDLEY             62246        132987       to           to        195233
ASE                na           na           na           na

na: not able to run this benchmark
to: timeout exceeded

Runtime measurement (in seconds) for the five benchmarks
Matching system    biblio   benchmark2   benchmark3   benchmark4   finance   Total time
LogMapLt           6        6            11           11           35        69
ServOMapLt         7        9            15           13           44        88
LogMap             15       17           26           26           54        138
ServOMap           12       16           26           34           64        152
GOMMA              17       21           35           38           49        160
AROMA              8        11           127          18           77        241
Hertuda            9        38           96           46           452       641
HotMatch           13       45           144          67           522       791
YAM++              108      97           115          182          386       888
MaasMatch          24       140          487          220          833       1704
AUTOMSv2           58       161          519          421          1535      2694
WikiMatch          577      1059         2158         1532         4350      9676
WeSeE              411      1100         1878         1627         5163      10179
Optima             188      882          1972         2001         8234      13277
MapSSS             21       59           299          355          82841     83575
ASE                26       69           690          276          na        1061
MEDLEY             67       2986         65810        to           to        68863

na: not able to run this benchmark
to: timeout exceeded
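
The bookkeeping behind these tables can be sketched as follows, with a hypothetical launch(system, dataset) standing in for whatever actually runs a matcher on a data set: each run is timed in seconds and systems are then ordered by cumulated time.

    import time

    def time_runs(systems, datasets, launch):
        totals = {}
        for system in systems:
            for dataset in datasets:
                start = time.monotonic()
                launch(system, dataset)
                totals[system] = totals.get(system, 0.0) + (time.monotonic() - start)
        # least cumulated time first, as in the tables above
        return sorted(totals.items(), key=lambda kv: kv[1])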

Runtimes in Finance datasets

Runtime measurement vs. ontology size (classes+properties) for the finance data sets

Runtimes in Benchmark datasets

Runtime measurement vs. ontology size (classes+properties) for all benchmarks
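
The semi-log graphs can be reproduced from the tables above; below is a minimal matplotlib sketch using the ontology sizes from the data set table and the runtimes of two systems as examples.

    import matplotlib.pyplot as plt

    sizes = [97, 247, 354, 472, 633]   # classes+properties: biblio .. finance
    runtimes = {                       # seconds, from the benchmarks table above
        "LogMapLt": [6, 6, 11, 11, 35],
        "AROMA":    [8, 11, 127, 18, 77],
    }
    for system, seconds in runtimes.items():
        plt.semilogy(sizes, seconds, marker="o", label=system)
    plt.xlabel("ontology size (classes + properties)")
    plt.ylabel("runtime (s)")
    plt.legend()
    plt.show()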

Some observations can be made from the graphs:

  1. For the finance tests, where the structure and the knowledge domain of the ontologies are largely preserved, the majority of tools show a monotonically increasing runtime, which is what we expected. The exception is MapSSS, which exhibits an almost constant value.
  2. This is not the case for the benchmark tests, where benchmark3 breaks the monotonic behavior. The reason could be that this ontology has a more complex structure than the other ones, and that this affects some matchers more than others.
  3. There is a set of tools that stand out from the others: LogMapLt, ServOMapLt, LogMap, ServOMap, GOMMA and AROMA are the fastest tools and are able to process large ontologies in a short time. On the contrary, some tools were not able to deal with large ontologies under the same conditions: MEDLEY and MapSSS fall into this category.

The results obtained this year confirm something that was already observed in OAEI2011 and OAEI2011.5: we cannot conclude that there is a general correlation between runtime and the quality of alignments. The slowest tools do not always provide the best compliance results, nor do the tools with the shortest response times.

Contact

This track is organized by José Luis Aguirre and Jérôme Euzenat. If you have any problems working with the ontologies, or any questions or suggestions, feel free to write an email to jerome [.] euzenat [at] inria [.] fr

Bibliography

[1] Maria Roşoiu, Cássia Trojahn dos Santos, and Jérôme Euzenat. Ontology matching benchmarks: generation and evaluation. In Pavel Shvaiko, Isabel Cruz, Jérôme Euzenat, Tom Heath, Ming Mao, and Christoph Quix, editors, Proc. 6th International Workshop on Ontology Matching (OM) collocated with ISWC, Bonn (Germany), 2011.