We have run the evaluation on a high-performance server with 16 CPUs and 15 GB of allocated RAM.
Precision, Recall and F-measure have been computed with respect to a UMLS-based reference alignment. Systems have been ordered in terms of F-measure.
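These three scores reduce to simple set operations over the system alignment and the reference alignment. The following is a minimal Python sketch; the mapping tuples in the usage example are hypothetical illustrations, not track data:

```python
def precision_recall_fmeasure(system: set, reference: set):
    """Score a system alignment against a reference alignment.

    Both arguments are sets of mappings, e.g. (source, target) tuples.
    """
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure

# Hypothetical toy mappings (not track data):
system = {("fma:Heart", "nci:Heart"), ("fma:Liver", "nci:Lung")}
reference = {("fma:Heart", "nci:Heart"), ("fma:Liver", "nci:Liver")}
print(precision_recall_fmeasure(system, reference))  # (0.5, 0.5, 0.5)
```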
In the OAEI 2013 Large BioMed Track (largebio), 13 out of the 21 participating systems were able to cope with at least one of the track's tasks.
Synthesis, WeSeEMatch and WikiMatch failed to complete the smallest task within the 18-hour timeout, while MapSSS, RiMOM, CIDER-CL, CroMatcher and OntoK threw an exception during the matching process; the latter two threw an out-of-memory exception.
In total, we have evaluated 20 system configurations (see the information about variants below).
XMap participates with two variants: XMapSiG, which uses a sigmoid function, and XMapGen, which implements a genetic algorithm. ODGOMS also participates with two versions (v1.1 and v1.2): ODGOMS-v1.1 is the originally submitted version, while ODGOMS-v1.2 includes some bug fixes and extensions.
LogMap has also been evaluated with two variants: LogMap and LogMap-BK. LogMap-BK uses normalisations and spelling variants from the general-purpose biomedical UMLS Lexicon, while LogMap has this feature deactivated.
AML has been evaluated with six different variants, depending on the use of repair techniques (R), general background knowledge (BK) and specialised background knowledge based on the UMLS Metathesaurus (SBK).
YAM++ and MaasMatch also use the general-purpose background knowledge provided by WordNet.
In addition, we have re-run the OAEI 2012 version of GOMMA; its results may vary slightly with respect to those reported in 2012, since we have used a different reference alignment.
Note that, since the reference alignment of this track is based on the UMLS Metathesaurus, we did not include the alignments provided by AML-SBK and AML-SBK-R in the results. Nevertheless, we consider their results very interesting: AML-SBK and AML-SBK-R averaged F-measures higher than 0.90 over all six tasks.
Together with Precision, Recall, F-measure and runtimes, we have also evaluated the coherence of the computed alignments. We report (1) the number of unsatisfiable classes obtained when reasoning over the input ontologies together with the computed mappings, and (2) the degree of incoherence, i.e., the ratio of unsatisfiable classes to the size of the union of the input ontologies.
We have used the OWL 2 reasoner MORe to compute the number of unsatisfiable classes. For the cases in which MORe could not cope with the input ontologies and the mappings in less than 2 hours, we provide a lower bound on the number of unsatisfiable classes (indicated by ≥) computed with the OWL 2 EL reasoner ELK.
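In spirit, this coherence check can be reproduced by loading both ontologies and the mappings (rendered as OWL axioms) into one reasoner and asking for the unsatisfiable classes. Below is a minimal sketch using owlready2 and its bundled HermiT reasoner as a stand-in for MORe/ELK; the file paths are placeholders:

```python
from owlready2 import default_world, get_ontology, sync_reasoner

# Placeholder paths: the two input ontologies plus the computed
# mappings, the latter serialised as OWL equivalence/subsumption axioms.
onto1 = get_ontology("file:///data/fma.owl").load()
onto2 = get_ontology("file:///data/nci.owl").load()
mappings = get_ontology("file:///data/system_mappings.owl").load()

sync_reasoner()  # classify the merged world (HermiT by default)

# Unsatisfiable classes are those inferred equivalent to owl:Nothing.
unsat = list(default_world.inconsistent_classes())
total = len(list(onto1.classes())) + len(list(onto2.classes()))
print(f"{len(unsat)} unsatisfiable classes "
      f"({100.0 * len(unsat) / total:.2f}% of the union)")
```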
In this OAEI edition, only three systems included mapping repair facilities, namely YAM++, AML with the (R)epair configuration, and LogMap. The results show that even the most precise alignment sets may lead to a huge number of unsatisfiable classes. This underscores the importance of using techniques to assess and repair the coherence of the generated alignments.
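To make the idea of mapping repair concrete, one simple greedy strategy is to keep removing the least-trusted mapping involved in an unsatisfiability until the merged ontology is coherent. The sketch below illustrates only this generic idea, not the actual algorithms of YAM++, AML (R) or LogMap; `unsatisfiable_classes` and `involved_mappings` are hypothetical stand-ins for a reasoner call and a diagnosis/explanation step:

```python
def greedy_repair(mappings, unsatisfiable_classes, involved_mappings):
    """Greedily remove low-confidence mappings until coherence holds.

    mappings: dict mapping_id -> confidence value
    unsatisfiable_classes(mappings): reasoner call, returns unsat classes
    involved_mappings(cls, mappings): mapping ids implicated for cls
    """
    repaired = dict(mappings)
    while True:
        unsat = unsatisfiable_classes(repaired)
        if not unsat:
            return repaired  # coherent alignment reached
        cls = next(iter(unsat))
        culprits = involved_mappings(cls, repaired)
        # Remove the implicated mapping we trust least.
        worst = min(culprits, key=lambda m: repaired[m])
        del repaired[worst]
```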
Table 1 shows which systems (including variants) were able to complete each of the matching tasks in less than 18 hours, together with the required computation times (in seconds). Systems have been ordered by the number of completed tasks and the average time required to complete them.
The last column reports the number of tasks each system completed; for example, 12 system configurations were able to complete all six tasks. The last row shows the number of systems that finished each task. The tasks involving SNOMED proved hardest with respect to both computation times and the number of systems that completed them.
System | FMA-NCI Task 1 | FMA-NCI Task 2 | FMA-SNOMED Task 3 | FMA-SNOMED Task 4 | SNOMED-NCI Task 5 | SNOMED-NCI Task 6 | Average (s) | # Tasks |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
LogMapLt | 7 | 59 | 14 | 101 | 54 | 132 | 61 | 6 |
IAMA | 13 | 139 | 27 | 217 | 98 | 206 | 117 | 6 |
AML | 16 | 201 | 60 | 542 | 291 | 569 | 280 | 6 |
AML-BK | 38 | 201 | 93 | 530 | 380 | 571 | 302 | 6 |
AML-R | 18 | 194 | 86 | 554 | 328 | 639 | 303 | 6 |
GOMMA2012 | 39 | 243 | 54 | 634 | 220 | 727 | 320 | 6 |
AML-BK-R | 42 | 204 | 121 | 583 | 397 | 635 | 330 | 6 |
YAM++ | 93 | 365 | 100 | 401 | 391 | 712 | 344 | 6 |
LogMap-BK | 44 | 172 | 85 | 556 | 444 | 1,087 | 398 | 6 |
LogMap | 41 | 161 | 78 | 536 | 433 | 1,232 | 414 | 6 |
ServOMap | 140 | 2,690 | 391 | 4,059 | 1,698 | 6,320 | 2,550 | 6 |
SPHeRe (*) | 16 | 8,136 | 154 | 20,664 | 2,486 | 10,584 | 7,007 | 6 |
XMapSiG | 1,476 | - | 11,720 | - | - | - | 6,598 | 2 |
XMapGen | 1,504 | - | 12,127 | - | - | - | 6,816 | 2 |
Hertuda | 3,403 | - | 17,610 | - | - | - | 10,507 | 2 |
ODGOMS-v1.1 | 6,366 | - | 27,450 | - | - | - | 16,908 | 2 |
HotMatch | 4,372 | - | 32,243 | - | - | - | 18,308 | 2 |
ODGOMS-v1.2 | 10,204 | - | 42,908 | - | - | - | 26,556 | 2 |
StringsAuto | 6,358 | - | - | - | - | - | 6,358 | 1 |
MaasMatch | 12,409 | - | - | - | - | - | 12,409 | 1 |
# Systems | 20 | 12 | 18 | 12 | 12 | 12 | 5,906 | 86 |
The following tables summarize the results for the tasks in the FMA-NCI matching problem.
LogMap-BK and YAM++ provided the best results in terms of both Recall and F-measure in Task 1 and Task 2, respectively. IAMA provided the best results in terms of Precision, although its Recall was below average. Hertuda obtained competitive Recall, but its low Precision hurt its final F-measure. Conversely, StringsAuto, XMapGen and XMapSiG provided alignments with high Precision, but their F-measures suffered from low Recall. Overall, the results were very positive, and many systems obtained an F-measure greater than 0.80 in the two tasks.
Note that efficiency decreases from Task 1 to Task 2. This is mostly because the larger ontologies involve many more candidate mappings, which makes it harder to keep Precision high without damaging Recall, and vice versa.
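The growth of the candidate space is easy to quantify: the number of candidate correspondences scales with the product of the ontology sizes. Using approximate, rounded class counts (illustrative assumptions, not exact track figures), the whole-ontology task is more than two orders of magnitude larger:

```python
# Approximate, rounded class counts (illustrative assumptions):
fma_small, nci_small = 3_700, 6_500     # Task 1: small fragments
fma_whole, nci_whole = 79_000, 66_700   # Task 2: whole ontologies

candidates_task1 = fma_small * nci_small   # ~24 million pairs
candidates_task2 = fma_whole * nci_whole   # ~5.3 billion pairs
print(f"Task 1: {candidates_task1:,} candidate pairs")
print(f"Task 2: {candidates_task2:,} candidate pairs")
print(f"Blow-up: ~{candidates_task2 / candidates_task1:.0f}x")  # ~219x
```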
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
LogMap-BK | 45 | 2,727 | 0.949 | 0.883 | 0.915 | 2 | 0.02% |
YAM++ | 94 | 2,561 | 0.976 | 0.853 | 0.910 | 2 | 0.02% |
GOMMA2012 | 40 | 2,626 | 0.963 | 0.863 | 0.910 | 2,130 | 20.9% |
AML-BK-R | 43 | 2,619 | 0.958 | 0.856 | 0.904 | 2 | 0.02% |
AML-BK | 39 | 2,695 | 0.942 | 0.867 | 0.903 | 2,932 | 28.8% |
LogMap | 41 | 2,619 | 0.952 | 0.851 | 0.899 | 2 | 0.02% |
AML-R | 19 | 2,506 | 0.963 | 0.823 | 0.888 | 2 | 0.02% |
ODGOMS-v1.2 | 10,205 | 2,558 | 0.953 | 0.831 | 0.888 | 2,440 | 24.0% |
AML | 16 | 2,581 | 0.947 | 0.834 | 0.887 | 2,598 | 25.5% |
LogMapLt | 8 | 2,483 | 0.959 | 0.813 | 0.880 | 2,104 | 20.7% |
ODGOMS-v1.1 | 6,366 | 2,456 | 0.963 | 0.807 | 0.878 | 1,613 | 15.8% |
ServOMap | 141 | 2,512 | 0.951 | 0.815 | 0.877 | 540 | 5.3% |
SPHeRe | 16 | 2,359 | 0.960 | 0.772 | 0.856 | 367 | 3.6% |
HotMatch | 4,372 | 2,280 | 0.965 | 0.751 | 0.845 | 285 | 2.8% |
Average | 2,330 | 2,527 | 0.896 | 0.754 | 0.810 | 1,582 | 15.5% |
IAMA | 14 | 1,751 | 0.979 | 0.585 | 0.733 | 166 | 1.6% |
Hertuda | 3,404 | 4,309 | 0.589 | 0.866 | 0.701 | 2,675 | 26.3% |
StringsAuto | 6,359 | 1,940 | 0.838 | 0.554 | 0.667 | 1,893 | 18.6% |
XMapGen | 1,504 | 1,687 | 0.833 | 0.479 | 0.608 | 1,092 | 10.7% |
XMapSiG | 1,477 | 1,564 | 0.864 | 0.461 | 0.602 | 818 | 8.0% |
MaasMatch | 12,410 | 3,720 | 0.407 | 0.517 | 0.456 | 9,988 | 98.1% |
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
YAM++ | 366 | 2,759 | 0.899 | 0.846 | 0.872 | 9 | 0.01% |
GOMMA2012 | 243 | 2,843 | 0.860 | 0.834 | 0.847 | 5,574 | 3.8% |
LogMap | 162 | 2,667 | 0.874 | 0.795 | 0.832 | 10 | 0.01% |
LogMap-BK | 173 | 2,668 | 0.872 | 0.794 | 0.831 | 9 | 0.01% |
AML-BK | 201 | 2,828 | 0.816 | 0.787 | 0.802 | 16,120 | 11.1% |
AML-BK-R | 205 | 2,761 | 0.826 | 0.778 | 0.801 | 10 | 0.01% |
Average | 1,064 | 2,711 | 0.840 | 0.770 | 0.799 | 9,223 | 6.3% |
AML-R | 194 | 2,368 | 0.892 | 0.721 | 0.798 | 9 | 0.01% |
AML | 202 | 2,432 | 0.880 | 0.730 | 0.798 | 1,044 | 0.7% |
SPHeRe | 8,136 | 2,610 | 0.846 | 0.753 | 0.797 | 1,054 | 0.7% |
ServOMap | 2,690 | 3,235 | 0.727 | 0.803 | 0.763 | 60,218 | 41.3% |
LogMapLt | 60 | 3,472 | 0.686 | 0.813 | 0.744 | 26,442 | 18.2% |
IAMA | 139 | 1,894 | 0.901 | 0.582 | 0.708 | 180 | 0.1% |
The following tables summarize the results for the tasks in the FMA-SNOMED matching problem.
YAM++ provided the best results in terms of F-measure in both Task 3 and Task 4. It also provided the best Precision in Task 3 and the best Recall in Task 4, while AML-BK provided the best Recall in Task 3 and AML-R the best Precision in Task 4.
Overall, the results were less positive than in the FMA-NCI matching problem and only YAM++ obtained an F-measure greater than 0.80 in the two tasks. Furthermore, 9 systems failed to provide a recall higher than 0.4. Thus, matching FMA against SNOMED represents a significant leap in complexity with respect to the FMA-NCI matching problem.
As in the FMA-NCI matching problem, efficiency decreases as the ontology size increases. The largest variations from Task 3 to Task 4 were the drops in Precision suffered by SPHeRe, IAMA and GOMMA.
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
YAM++ | 100 | 6,635 | 0.982 | 0.729 | 0.836 | 13,040 | 55.3% |
AML-BK | 93 | 6,937 | 0.942 | 0.731 | 0.824 | 12,379 | 52.5% |
AML | 60 | 6,822 | 0.943 | 0.720 | 0.816 | 15,244 | 64.7% |
AML-BK-R | 122 | 6,554 | 0.950 | 0.696 | 0.803 | 15 | 0.06% |
AML-R | 86 | 6,459 | 0.949 | 0.686 | 0.796 | 14 | 0.06% |
LogMap-BK | 85 | 6,242 | 0.963 | 0.672 | 0.792 | 0 | 0.0% |
LogMap | 79 | 6,071 | 0.966 | 0.656 | 0.782 | 0 | 0.0% |
ServOMap | 391 | 5,828 | 0.955 | 0.622 | 0.753 | 6,018 | 25.5% |
ODGOMS-v1.2 | 42,909 | 5,918 | 0.862 | 0.570 | 0.686 | 9,176 | 38.9% |
Average | 8,073 | 4,248 | 0.892 | 0.436 | 0.549 | 7,308 | 31.0% |
GOMMA2012 | 54 | 3,666 | 0.924 | 0.379 | 0.537 | 2,058 | 8.7% |
ODGOMS-v1.1 | 27,451 | 2,267 | 0.876 | 0.222 | 0.354 | 938 | 4.0% |
HotMatch | 32,244 | 2,139 | 0.872 | 0.209 | 0.337 | 907 | 3.9% |
LogMapLt | 15 | 1,645 | 0.973 | 0.179 | 0.302 | 773 | 3.3% |
Hertuda | 17,610 | 3,051 | 0.575 | 0.196 | 0.293 | 1,020 | 4.3% |
SPHeRe | 154 | 1,577 | 0.916 | 0.162 | 0.275 | 805 | 3.4% |
IAMA | 27 | 1,250 | 0.962 | 0.134 | 0.236 | 22,925 | 97.3% |
XMapGen | 12,127 | 1,827 | 0.694 | 0.142 | 0.236 | 23,217 | 98.5% |
XMapSiG | 11,720 | 1,581 | 0.760 | 0.134 | 0.228 | 23,025 | 97.7% |
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
YAM++ | 402 | 6,842 | 0.947 | 0.725 | 0.821 | ≥57,074 | ≥28.3% |
AML-BK | 530 | 6,186 | 0.937 | 0.648 | 0.766 | ≥40,162 | ≥19.9% |
AML | 542 | 5,797 | 0.963 | 0.624 | 0.758 | ≥39,472 | ≥19.6% |
AML-BK-R | 584 | 5,858 | 0.941 | 0.617 | 0.745 | 29 | 0.01% |
AML-R | 554 | 5,499 | 0.966 | 0.594 | 0.736 | 7 | 0.004% |
ServOMap | 4,059 | 6,440 | 0.861 | 0.620 | 0.721 | ≥164,116 | ≥81.5% |
LogMap-BK | 556 | 6,134 | 0.874 | 0.600 | 0.711 | 0 | 0.0% |
LogMap | 537 | 5,923 | 0.888 | 0.588 | 0.708 | 0 | 0.0% |
Average | 2,448 | 5,007 | 0.835 | 0.479 | 0.588 | 40,143 | 19.9% |
GOMMA2012 | 634 | 5,648 | 0.406 | 0.257 | 0.315 | 9,918 | 4.9% |
LogMapLt | 101 | 1,823 | 0.878 | 0.179 | 0.297 | ≥4,393 | ≥2.2% |
SPHeRe | 20,664 | 2,338 | 0.614 | 0.160 | 0.254 | 6,523 | 3.2% |
IAMA | 218 | 1,600 | 0.749 | 0.134 | 0.227 | ≥160,022 | ≥79.4% |
The following tables summarize the results for the tasks in the SNOMED-NCI matching problem.
LogMap-BK and ServOMap provided the best results in terms of both Recall and F-measure in Task 5 and Task 6, respectively. YAM++ provided the best results in terms of Precision in Task 5, while AML-R did so in Task 6.
As in the previous matching problems, efficiency decreases as the ontology size increases, and so does the quality of the alignments: in Task 6, for example, only ServOMap and YAM++ reached an F-measure higher than 0.7. Overall, the results were less positive than in the FMA-SNOMED matching problem, so matching NCI against SNOMED represents yet another leap in complexity.
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
LogMap-BK | 444 | 13,985 | 0.894 | 0.677 | 0.770 | ≥40 | ≥0.05% |
LogMap | 433 | 13,870 | 0.896 | 0.672 | 0.768 | ≥47 | ≥0.06% |
ServOMap | 1,699 | 12,716 | 0.933 | 0.642 | 0.761 | ≥59,944 | ≥79.8% |
AML-BK-R | 397 | 13,006 | 0.920 | 0.648 | 0.760 | ≥32 | ≥0.04% |
AML-BK | 380 | 13,610 | 0.894 | 0.658 | 0.758 | ≥66,389 | ≥88.4% |
AML-R | 328 | 12,622 | 0.924 | 0.631 | 0.750 | ≥36 | ≥0.05% |
YAM++ | 391 | 11,672 | 0.967 | 0.611 | 0.749 | ≥0 | ≥0.0% |
AML | 291 | 13,248 | 0.895 | 0.642 | 0.747 | ≥63,305 | ≥84.3% |
Average | 602 | 12,003 | 0.925 | 0.599 | 0.723 | 32,222 | 42.9% |
LogMapLt | 55 | 10,962 | 0.944 | 0.560 | 0.703 | ≥60,427 | ≥80.5% |
GOMMA2012 | 221 | 10,555 | 0.940 | 0.537 | 0.683 | ≥50,189 | ≥66.8% |
SPHeRe | 2,486 | 9,389 | 0.924 | 0.469 | 0.623 | ≥46,256 | ≥61.6% |
IAMA | 99 | 8,406 | 0.965 | 0.439 | 0.604 | ≥40,002 | ≥53.3% |
System | Time (s) | # Mappings | Precision | Recall | F-measure | Unsat. | Degree |
--- | --- | --- | --- | --- | --- | --- | --- |
ServOMap | 6,320 | 14,312 | 0.822 | 0.637 | 0.718 | ≥153,259 | ≥81.0% |
YAM++ | 713 | 12,600 | 0.881 | 0.601 | 0.714 | ≥116 | ≥0.06% |
AML-BK | 571 | 11,354 | 0.918 | 0.564 | 0.699 | ≥121,525 | ≥64.2% |
AML-BK-R | 636 | 11,033 | 0.929 | 0.555 | 0.695 | ≥41 | ≥0.02% |
LogMap-BK | 1,088 | 12,217 | 0.871 | 0.576 | 0.693 | ≥1 | ≥0.001% |
LogMap | 1,233 | 11,938 | 0.882 | 0.570 | 0.692 | ≥1 | ≥0.001% |
AML | 570 | 10,940 | 0.927 | 0.549 | 0.689 | ≥121,171 | ≥64.1% |
AML-R | 640 | 10,622 | 0.938 | 0.539 | 0.685 | ≥51 | ≥0.03% |
Average | 1,951 | 11,581 | 0.880 | 0.549 | 0.674 | 72,365 | 38.3% |
LogMapLt | 132 | 12,907 | 0.802 | 0.560 | 0.660 | ≥150,773 | ≥79.7% |
GOMMA2012 | 728 | 12,440 | 0.787 | 0.530 | 0.634 | ≥127,846 | ≥67.6% |
SPHeRe | 10,584 | 9,776 | 0.881 | 0.466 | 0.610 | ≥105,418 | ≥55.7% |
IAMA | 207 | 8,843 | 0.917 | 0.439 | 0.593 | ≥88,185 | ≥46.6% |
The following table summarizes the results for the systems that completed all six tasks of the Large BioMed Track. It shows the total time in seconds to complete all tasks, together with averages for Precision, Recall, F-measure and incoherence degree. Systems have been ordered according to their average F-measure and incoherence degree.
YAM++ was a step ahead and obtained the best average Precision, Recall and F-measure.
AML-R obtained the second best average Precision while AML-BK obtained the second best average Recall.
Regarding mapping incoherence, LogMap-BK computed, on average, the mapping sets leading to the smallest number of unsatisfiable classes. The configurations of AML using (R)epair also obtained very good results in terms of mapping coherence.
Finally, LogMapLt was the fastest system. Apart from ServOMap and SPHeRe, the remaining tools were also very fast and needed between 11 and 42 minutes to complete all six tasks. ServOMap required around 4 hours to complete the six tasks, while SPHeRe required almost 12 hours.
System | Total Time (s) | Avg. Precision | Avg. Recall | Avg. F-measure | Avg. Incoherence |
--- | --- | --- | --- | --- | --- |
YAM++ | 2,066 | 0.942 | 0.728 | 0.817 | 14.0% |
AML-BK | 1,814 | 0.908 | 0.709 | 0.792 | 44.2% |
LogMap-BK | 2,391 | 0.904 | 0.700 | 0.785 | 0.014% |
AML-BK-R | 1,987 | 0.921 | 0.692 | 0.785 | 0.027% |
AML | 1,681 | 0.926 | 0.683 | 0.783 | 43.1% |
LogMap | 2,485 | 0.910 | 0.689 | 0.780 | 0.015% |
AML-R | 1,821 | 0.939 | 0.666 | 0.776 | 0.029% |
ServOMap | 15,300 | 0.875 | 0.690 | 0.766 | 52.4% |
GOMMA2012 | 1,920 | 0.813 | 0.567 | 0.654 | 28.8% |
LogMapLt | 371 | 0.874 | 0.517 | 0.598 | 34.1% |
SPHeRe | 42,040 | 0.857 | 0.464 | 0.569 | 21.4% |
IAMA | 704 | 0.912 | 0.386 | 0.517 | 46.4% |
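The totals and averages in this table follow directly from the six per-task tables above. As a sanity check, the following sketch re-derives the YAM++ row from the numbers reported for Tasks 1 to 6:

```python
# YAM++ per-task results copied from the tables above:
# (time in seconds, Precision, Recall, F-measure) for Tasks 1-6.
yam = [
    (94,  0.976, 0.853, 0.910),
    (366, 0.899, 0.846, 0.872),
    (100, 0.982, 0.729, 0.836),
    (402, 0.947, 0.725, 0.821),
    (391, 0.967, 0.611, 0.749),
    (713, 0.881, 0.601, 0.714),
]
total_time = sum(t for t, *_ in yam)
averages = [sum(col) / len(yam) for col in zip(*yam)][1:]
print(total_time)  # 2066 seconds, i.e. the 2,066 in the summary table
print(averages)    # ≈ 0.942, 0.728, 0.817, matching the YAM++ summary row
```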
If you have any question or suggestion related to the results of this track, or if you notice any kind of error (wrong numbers, incorrect information about a matching system, etc.), feel free to write an email to ernesto [at] cs [.] ox [.] ac [.] uk or ernesto [.] jimenez [.] ruiz [at] gmail [.] com.