Benchmark results for OAEI 2011.5

Below we present the results of the OAEI 2011.5 evaluation of the benchmarks track. If you notice any error (wrong numbers, incorrect information about a matching system), do not hesitate to contact José Luis Aguirre (see the email address below).

Data set setting

The focus of this campaign was on scalability, i.e., the ability of matchers to deal with data sets with an increasing number of elements. To that end, we generated four benchmarks against which matchers were evaluated: benchmark1 (biblio), benchmark2 (jerm), benchmark3 (provenance) and benchmark4 (finance). The new benchmarks were generated following the same model as previous benchmarks, from seed ontologies of different domains and sizes.

The following table summarizes the information about ontologies' sizes.


Test set	biblio	jerm	provenance	finance
	ontology size
classes+prop	97	250	431	633
instances	112	26	46	1113
entities	299	276	477	1746
triples	1332	1311	2366	21979

Participation

Of the 19 systems listed in the 2011.5 final results page, 14 participated in this track. The reason is that several systems participating for the first time required JDK 1.7 to run. The remaining systems could not be run under these conditions; none of them had presented a new version for this campaign.

As new ontologies were used, some tools failed to pass the first test of the benchmarks generated from these ontologies. We decided to give the developers of those tools an opportunity to fix the problems by sending them the logs with the exceptions thrown.

The following table lists the participants. The second column indicates whether the submitted system was new to OAEI, or whether it is the same version or a modified version of a system that participated in OAEI 2011. The third column gives comments on special conditions needed to run the tools for all or part of the campaign, and on the general behavior of the tools.

*List of participants*
Matching system	Status/Version	Comments
Aroma	2011 version	---
AUTOMSv2	New	Have to fix bug. Run only for biblio benchmark
CODI	Modified	Needed to adjust configuration for jerm benchmark. Not able to run provenance and finance benchmarks
GOMMA	New	Have to fix bug for benchmarks other than biblio
Hertuda	New	---
Lily	2011 version	Not able to run finance benchmark
LogMap	Modified	---
LogMapLt	New	---
MaasMtch	Modified	Have to fix bug for all benchmarks
MapEVO	Modified	---
MapPSO	Modified	Not able to finish some tests for provenance and finance benchmarks
MapSSS	Modified	---
WeSeE	New	---
YAM++	Modified	Have to fix bug for benchmarks other than biblio. Not able to run finance benchmark

Experimental setting

As we stated before, the focus of this campaign was on scalability. We have addressed this from two aspects:

Compliance: as usual, we compare results generated on top of the default setting against a reference alignment and compute precision, recall and f-measure.
Runtime: we analyze how far matching systems can deal with data sets of increasing number of elements including a huge ontology. We report on measured runtimes for the tools with data sets of different benchmarks.
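As an illustration of the compliance measures, the following minimal sketch computes precision, recall and F-measure using the standard set-based definitions. The function name `evaluate` and the toy correspondences are hypothetical, and the actual evaluation tooling used for the campaign may differ in its details.

```python
def evaluate(found, reference):
    """Return (precision, recall, f_measure) for two sets of correspondences."""
    found, reference = set(found), set(reference)
    correct = len(found & reference)  # correspondences that are also in the reference
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy alignments: (entity in ontology 1, entity in ontology 2)
reference = {
    ("a:Book", "b:Book"),
    ("a:Author", "b:Writer"),
    ("a:title", "b:name"),
    ("a:Year", "b:Date"),
}
found = {
    ("a:Book", "b:Book"),
    ("a:Author", "b:Writer"),
    ("a:Year", "b:Page"),  # wrong correspondence
}

precision, recall, f_measure = evaluate(found, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f-measure={f_measure:.2f}")
```

Here two of the three returned correspondences are correct and two of the four reference correspondences are found.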

For each aspect, all systems were executed under the same conditions, which are specified below.

Compliance raw results

We executed all systems on Debian virtual machines (VMs) with two cores and 8GB RAM, running continuously in parallel, except for the finance benchmark, which required 10GB RAM for some systems. We followed the general recommendation for Linux operating systems of allocating no more than 80% of the available memory to Java processes: 6GB RAM were allocated to Java processes running in 8GB RAM VMs, and 8GB RAM to Java processes running in 10GB RAM VMs. An exception is CODI, which requires specific tools that we could not install on our machines due to academic license problems. CODI was executed on an Ubuntu VM with two cores and 8GB RAM in the SEALS infrastructure.
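The memory figures above follow from the 80% rule rounded down to whole gigabytes. The sketch below reproduces that arithmetic; the helper name `java_heap_gb` is hypothetical, and expressing the limit through the JVM `-Xmx` option is an assumption, since the text does not say how the limit was actually set.

```python
def java_heap_gb(vm_ram_gb, percent=80):
    """Largest whole number of GB not exceeding `percent` % of the VM memory."""
    return vm_ram_gb * percent // 100


for vm_ram in (8, 10):
    heap = java_heap_gb(vm_ram)
    # -Xmx is the standard JVM maximum heap option; its use here is assumed
    print(f"{vm_ram}GB VM -> {heap}GB for Java (e.g. -Xmx{heap}g)")
```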

Three test suites (data sets) of 94 tests for each benchmark were used for measuring compliance. Small test suites containing samples of the benchmarks are available here.

The following table presents the average precision, F-measure and recall over the three runs for each benchmark. We confirmed, as in the previous 2011 campaign, that these three runs are not really necessary: even if some matchers exhibit non-deterministic behavior on individual test cases, their average measures over the whole data set remain almost the same. The table shows that some matchers behave similarly across the different benchmarks, while others show large variations, producing insignificant alignments in some cases.

*Precision, F-measure and recall*
Matching system	biblio			jerm			provenance			finance
	Precision	F-measure	Recall	Precision	F-measure	Recall	Precision	F-measure	Recall	Precision	F-measure	Recall
Aroma	0.97	0.76	0.63	0.99	0.96	0.93	0.78	0.60	0.49	0.90	0.70	0.57
AUTOMSv2	0.97	0.69	0.54	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a	n/a
CODI	0.93	0.75	0.63	1.00	0.96	0.93	n/a	n/a	n/a	n/a	n/a	n/a
GOMMA	0.79	0.67	0.58	0.97	0.67	0.51	0.14	0.22	0.55	0.84	0.66	0.55
Hertuda	1.00	0.67	0.50	0.96	0.66	0.50	0.59	0.54	0.50	0.75	0.60	0.50
Lily	0.95	0.75	0.62	0.93	0.71	0.58	0.92	0.68	0.54	u/r	u/r	u/r
LogMap	0.69	0.48	0.37	1.00	0.66	0.50	1.00	0.66	0.49	0.96	0.60	0.43
LogMapLt	0.70	0.58	0.50	0.98	0.67	0.51	0.99	0.66	0.50	0.90	0.66	0.52
MaasMtch	0.49	0.50	0.52	0.52	0.52	0.52	0.50	0.50	0.50	0.52	0.52	0.52
MapEVO	0.43	0.37	0.33	0.06	0.04	0.03	0.02	0.01	0.01	0.04	0.02	0.01
MapPSO	0.58	0.20	0.12	0.06	0.05	0.05	0.08^*	0.07^*	0.05^*	0.28^**	0.16^**	0.11^**
MapSSS	0.99	0.86	0.75	0.98	0.76	0.63	0.98	0.75	0.61	0.99	0.83	0.71
WeSeE	0.89	0.67	0.53	0.99	0.68	0.51	0.97	0.64	0.48	0.96	0.69	0.54
YAM++	0.99	0.83	0.72	0.99	0.72	0.56	u/r	u/r	u/r	n/a	n/a	n/a

n/a: not able to run this benchmark
u/r: uncompleted results, crashed or got stuck with some test
*: just one of three runs was completed
**: just two of three runs were completed

All results have been stored in the SEALS Raw Result Repository (RRS) and are available through the SEALS portal with the identifiers OMT-2011.5-%BENCHMARK%-benchmarks-r%RUN%-%SYSTEM%-rr for raw results and OMT-2011.5-%BENCHMARK%-benchmarks-r%RUN%-%SYSTEM%-ir for interpretations. For example OMT-2011.5-jerm-benchmarks-r2-Hertuda-rr for the raw results of the second run of the jerm benchmark for Hertuda.
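The identifier scheme can be sketched as a small helper that fills in the template. The function name `seals_result_id` and its parameters are hypothetical; only the template itself comes from the text above.

```python
def seals_result_id(benchmark, run, system, kind="rr"):
    """Build a result identifier: kind is 'rr' (raw results) or 'ir' (interpretations)."""
    if kind not in ("rr", "ir"):
        raise ValueError("kind must be 'rr' or 'ir'")
    return f"OMT-2011.5-{benchmark}-benchmarks-r{run}-{system}-{kind}"


# Raw results of the second run of the jerm benchmark for Hertuda
print(seals_result_id("jerm", 2, "Hertuda"))
```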

Based on the standard deviation of the F-measure across benchmarks, shown in the next table, the least variable matcher is MaasMtch, followed by WeSeE. The matcher with the greatest variability is GOMMA, but this is due to its result on the provenance benchmark. We exclude from this table the matchers that completed no run on the provenance benchmark and no run on the finance benchmark.

*Variability of results based on F-measure*
Matching system	biblio	jerm	provenance	finance	Standard deviation
MaasMtch	0.50	0.52	0.50	0.52	0.01
WeSeE	0.67	0.68	0.64	0.69	0.02
Lily	0.75	0.71	0.68	u/r	0.04
LogMapLt	0.58	0.67	0.66	0.66	0.04
MapSSS	0.86	0.76	0.75	0.83	0.05
Hertuda	0.67	0.66	0.54	0.60	0.06
MapPSO	0.20	0.05	0.07^*	0.16^**	0.07
LogMap	0.48	0.66	0.66	0.60	0.08
Aroma	0.76	0.96	0.60	0.70	0.15
MapEVO	0.37	0.04	0.01	0.02	0.17
GOMMA	0.67	0.67	0.22	0.66	0.22

u/r: uncompleted results, crashed or got stuck with some test
*: just one of three runs was completed
**: just two of three runs were completed
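The last column of the table above can be recomputed from the per-benchmark F-measures. The values are consistent with the sample standard deviation (denominator n - 1); this is inferred from recomputing the figures and is not stated in the text. The sketch below uses Python's `statistics.stdev`, with Lily computed over its three available benchmarks.

```python
from statistics import stdev

# F-measures per benchmark (biblio, jerm, provenance, finance), from the table above
f_measures = {
    "MaasMtch": [0.50, 0.52, 0.50, 0.52],
    "WeSeE": [0.67, 0.68, 0.64, 0.69],
    "Lily": [0.75, 0.71, 0.68],  # no finance result (u/r)
    "LogMapLt": [0.58, 0.67, 0.66, 0.66],
    "MapSSS": [0.86, 0.76, 0.75, 0.83],
    "Hertuda": [0.67, 0.66, 0.54, 0.60],
    "MapPSO": [0.20, 0.05, 0.07, 0.16],
    "LogMap": [0.48, 0.66, 0.66, 0.60],
    "Aroma": [0.76, 0.96, 0.60, 0.70],
    "MapEVO": [0.37, 0.04, 0.01, 0.02],
    "GOMMA": [0.67, 0.67, 0.22, 0.66],
}

# Sample standard deviation, rounded to two decimals as in the table
variability = {name: round(stdev(values), 2) for name, values in f_measures.items()}

for name in sorted(variability, key=variability.get):
    print(f"{name:10s} {variability[name]:.2f}")
```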

From the table we also observe that almost all of the tools obtained better results on the biblio benchmark. Finally, the group of best systems for each data set remains roughly the same across the different seed ontologies: MapSSS had the best results for three of the four benchmarks, followed by Aroma, Lily and WeSeE.

Precision/recall graphs

Runtime results

We used a 3GHz Xeon 5472 (4 cores) machine running Linux Fedora 8 with 8GB RAM. CODI was excluded from these tests as it requires specific tools that we were not able to provide due to academic license problems. AUTOMSv2 was tested only with the biblio benchmark as it throws an exception with the other benchmarks. YAM++ and Lily were not able to finish some benchmarks as they got stuck on one test for more than 12 hours.

Just one test suite of 15 tests was used for each benchmark. The following table presents the runtime measurements (in seconds) for those test suites. At first glance, we observe that some tools exhibit a more or less stable behavior with respect to the number of classes and properties contained in the seed ontologies, while others behave erratically, with large differences in runtime from one benchmark to another.

*Runtime measurement (in seconds)*
Matching system	biblio	jerm	provenance	finance
LogMapLt	7	7	8	32
Hertuda	9	25	75	94
Aroma	9	10	27	63
GOMMA	10	10	24	61
LogMap	16	16	20	53
MapSSS	26	52	98	66494
MaasMtch	36	303	1284	2341
AUTOMSv2	63	n/a	n/a	n/a
YAM++	76	3428	u/r	n/a
MapPSO	140	968	u/r	u/r
MapEVO	158	194	751	64913
Lily	433	2645	10970	u/r
WeSeE	1025	2087	5315	7446

n/a: not able to run this benchmark
u/r: uncompleted results, crashed or got stuck with some test

The following figure shows a semi-log graph of runtime against benchmark size in terms of classes and properties, with the y-axis representing the runtime on a logarithmic scale. Just three points were included for Lily as we were not able to run the finance benchmark with it. As could be expected, runtimes increase with benchmark size for all tools.

Runtimes in Benchmark datasets

Runtime measurement VS ontology size (classes+properties)

Some observations can be made from the graph:

LogMapLt is the fastest tool, followed by GOMMA, Aroma and LogMap; LogMapLt being a lightweight version of LogMap, the shapes of their graphs are almost the same. GOMMA and Aroma behave so similarly that it is difficult to distinguish one from the other in the graph. Hertuda had a good result for the biblio benchmark, almost the same as GOMMA and Aroma, but its runtime clearly degrades on the other benchmarks. The rest of the tools have higher runtimes on all benchmarks, with Lily being the slowest tool.
Even if we cannot generalize from our experiments to all ontologies available on the Web, the graph confirms what we said about behavior stability, at least for the benchmarks and test suites used. Two "families" of tools can be distinguished. The first one comprises tools whose results stay inside the same vertical slice of our logarithmic graph. We include in this "family" LogMapLt, LogMap, Aroma, GOMMA, Hertuda and WeSeE (even if WeSeE had higher runtimes than the other tools). The second one includes the rest of the tools, which exhibit big jumps between one benchmark and the next; their results start in one vertical slice and finish in a different one. Some of them jump even two or three slices between the biblio and finance benchmarks.
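One way to quantify the two "families" is to measure how many orders of magnitude separate each tool's biblio and finance runtimes. The sketch below uses the runtimes from the table above; the log10 ratio is an illustrative measure chosen here, not the method used to draw the graph, and tools without a finance runtime (AUTOMSv2, YAM++, MapPSO, Lily) are omitted.

```python
import math

# Runtime in seconds as (biblio, finance), for tools that completed both benchmarks
runtimes = {
    "LogMapLt": (7, 32),
    "Hertuda": (9, 94),
    "Aroma": (9, 63),
    "GOMMA": (10, 61),
    "LogMap": (16, 53),
    "MapSSS": (26, 66494),
    "MaasMtch": (36, 2341),
    "MapEVO": (158, 64913),
    "WeSeE": (1025, 7446),
}

# Orders of magnitude between the biblio and finance runtimes
decades = {tool: math.log10(finance / biblio) for tool, (biblio, finance) in runtimes.items()}

for tool, value in sorted(decades.items(), key=lambda item: item[1]):
    print(f"{tool:10s} {value:.2f}")
```

With these figures the tools of the first family grow by roughly one order of magnitude or less, while MaasMtch, MapEVO and MapSSS grow by clearly more.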

The experiments also show that tools are more sensitive to the number of classes and properties contained in the ontologies than to the number of triples. A graph relating runtime to the number of triples in the benchmarks could also be drawn to support this claim, but it is enough to observe that the original biblio and jerm ontologies have almost the same number of triples, while the results obtained for those benchmarks are very different for almost all of the tools.

Further analyses could be carried out to relate runtime to precision and recall. Such analyses may be included in future campaigns.

Contact

This track is organized by José Luis Aguirre and Jérôme Euzenat. If you have any problems working with the ontologies, or any questions or suggestions, feel free to write an email to jose-luis [.] aguirre [at] inria [.] fr