Ontology Alignment Evaluation Initiative - OAEI-2013 Campaign

Results for the Large Biomedical Ontology Track

Evaluation setting

We ran the evaluation on a high-performance server with 16 CPUs and 15 GB of allocated RAM.

Precision, Recall and F-measure have been computed with respect to a UMLS-based reference alignment. Systems are ordered by F-measure.
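For reference, the measures follow the standard set-based definitions (a sketch, where A denotes the set of mappings computed by a system and R the UMLS-based reference alignment):

\[
\mathrm{Precision} = \frac{|A \cap R|}{|A|}, \qquad
\mathrm{Recall} = \frac{|A \cap R|}{|R|}, \qquad
\mathrm{F\text{-}measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
\]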

Participation and success

In the OAEI 2013 largebio track, 13 out of the 21 participating OAEI 2013 systems were able to cope with at least one of the track's tasks.

Synthesis, WeSeEMatch and WikiMatch failed to complete the smallest task within the 18-hour timeout, while MapSSS, RiMOM, CIDER-CL, CroMatcher and OntoK threw an exception during the matching process; the latter two threw an out-of-memory exception.

In total, we evaluated 20 system configurations (see the information about tool variants below).

Tool variants and background knowledge

XMap participates with two variants: XMapSiG, which uses a sigmoid function, and XMapGen, which implements a genetic algorithm. ODGOMS also participates with two versions (v1.1 and v1.2): ODGOMS-v1.1 is the originally submitted version, while ODGOMS-v1.2 includes some bug fixes and extensions.

LogMap has also been evaluated with two variants: LogMap and LogMap-BK. LogMap-BK uses normalisations and spelling variants from the general-purpose (biomedical) UMLS Lexicon, while plain LogMap has this feature deactivated.

AML has been evaluated with six different variants, depending on the use of repair techniques (R), general background knowledge (BK) and specialised background knowledge based on the UMLS Metathesaurus (SBK).

YAM++ and MaasMatch also use the general purpose background knowledge provided by WordNet.

We have also re-run the OAEI 2012 version of GOMMA. Its results may vary slightly with respect to those reported in 2012 since we used a different reference alignment.

Note that, since the reference alignment of this track is based on the UMLS Metathesaurus, we did not include the alignments provided by AML-SBK and AML-SBK-R in the results. Nevertheless, we consider their results very interesting: AML-SBK and AML-SBK-R averaged F-measures higher than 0.90 over all six tasks.

Alignment coherence

Together with Precision, Recall, F-measure and runtimes, we have also evaluated the coherence of the computed alignments. We report (1) the number of unsatisfiable classes obtained when reasoning with the input ontologies together with the computed mappings, and (2) the degree of unsatisfiability, i.e., the ratio of unsatisfiable classes to the total number of classes in the union of the input ontologies.
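As a sketch of how the reported degree can be read (assuming it is simply the unsatisfiable-class count normalised by the number of classes in the union of the input ontologies O1 and O2, with the mapping set M interpreted as OWL axioms):

\[
\mathrm{Degree}(M) = \frac{\left|\{\, C \mid O_1 \cup O_2 \cup M \models C \sqsubseteq \bot \,\}\right|}{\left|\mathrm{Classes}(O_1 \cup O_2)\right|}
\]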

We have used the OWL 2 reasoner MORe to compute the number of unsatisfiable classes. For the cases in which MORe could not process the input ontologies and the mappings within 2 hours, we provide a lower bound on the number of unsatisfiable classes (indicated by ≥) computed with the OWL 2 EL reasoner ELK.
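As an illustration of how such a lower bound can be computed, the following sketch counts unsatisfiable classes with the OWL API and the OWL 2 EL reasoner ELK. It assumes the two input ontologies and the mappings (translated into OWL equivalence/subsumption axioms) have already been merged into a single file; the file name is hypothetical and this is not the exact harness used in the evaluation.

import java.io.File;

import org.semanticweb.elk.owlapi.ElkReasonerFactory;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

public class UnsatLowerBound {

    public static void main(String[] args) throws Exception {
        // Hypothetical input: the union of the two input ontologies plus the
        // computed mappings, already expressed as OWL axioms.
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLOntology aligned = manager.loadOntologyFromOntologyDocument(
                new File("fma-nci-with-mappings.owl"));

        // ELK only covers the OWL 2 EL profile, so the count below is a lower
        // bound on the real number of unsatisfiable classes (hence the "≥" in the tables).
        OWLReasonerFactory factory = new ElkReasonerFactory();
        OWLReasoner reasoner = factory.createReasoner(aligned);
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);

        // Unsatisfiable classes are those equivalent to owl:Nothing (owl:Nothing itself excluded).
        int unsat = reasoner.getUnsatisfiableClasses().getEntitiesMinusBottom().size();
        int total = aligned.getClassesInSignature().size();

        System.out.printf("Unsatisfiable classes: >= %d (degree >= %.2f%%)%n",
                unsat, 100.0 * unsat / total);
        reasoner.dispose();
    }
}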

In this OAEI edition, only three systems include mapping repair facilities, namely YAM++, AML with the (R)epair configuration, and LogMap. The results show that even the most precise alignment sets may lead to a very large number of unsatisfiable classes, which underlines the importance of using techniques to assess the coherence of the generated alignments.

Results 2013

System runtimes and task completion

Table 1 shows which systems (including variants) were able to complete each of the matching tasks in less than 18 hours, together with the required computation times. Systems are ordered by the number of completed tasks and the average time required to complete them. Times are reported in seconds.

The last column reports the number of tasks each system could complete; for example, 12 system configurations were able to complete all six tasks. The last row shows the number of systems that could finish each task. The tasks involving SNOMED were also harder in terms of both computation times and the number of systems that completed them.

System  Task 1 (FMA-NCI)  Task 2 (FMA-NCI)  Task 3 (FMA-SNOMED)  Task 4 (FMA-SNOMED)  Task 5 (SNOMED-NCI)  Task 6 (SNOMED-NCI)  Average (s)  # Tasks
LogMapLt 7 59 14 101 54 132 61 6
IAMA 13 139 27 217 98 206 117 6
AML 16 201 60 542 291 569 280 6
AML-BK 38 201 93 530 380 571 302 6
AML-R 18 194 86 554 328 639 303 6
GOMMA2012 39 243 54 634 220 727 320 6
AML-BK-R 42 204 121 583 397 635 330 6
YAM++ 93 365 100 401 391 712 344 6
LogMap-BK 44 172 85 556 444 1,087 398 6
LogMap 41 161 78 536 433 1,232 414 6
ServOMap 140 2,690 391 4,059 1,698 6,320 2,550 6
SPHeRe (*) 16 8,136 154 20,664 2,486 10,584 7,007 6
XMapSiG 1,476 - 11,720 - - - 6,598 2
XMapGen 1,504 - 12,127 - - - 6,816 2
Hertuda 3,403 - 17,610 - - - 10,507 2
ODGOMS-v1.1 6,366 - 27,450 - - - 16,908 2
HotMatch 4,372 - 32,243 - - - 18,308 2
ODGOMS-v1.2 10,204 - 42,908 - - - 26,556 2
StringsAuto 6,358 - - - - - 6,358 1
MaasMatch 12,409 - - - - - 12,409 1
# Systems 20 12 18 12 12 12 5,906 86
Table 1: System runtimes (s) and task completion. (*) Note that the times for SPHeRe were reported by the authors; SPHeRe is a special case since it relies on cloud computing resources.

Results for the FMA-NCI matching problem

The following tables summarize the results for the tasks in the FMA-NCI matching problem.

LogMap-BK and YAM++ provided the best results in terms of both Recall and F-measure in Task 1 and Task 2, respectively. IAMA provided the best results in terms of Precision, although its Recall was below average. Hertuda achieved competitive Recall, but its low Precision hurt its final F-measure. Conversely, StringsAuto, XMapGen and XMapSiG produced alignments with high Precision, but their F-measures suffered from low Recall. Overall, the results were very positive, and many systems obtained an F-measure greater than 0.80 in both tasks.

Note that efficiency decreased in Task 2 with respect to Task 1. This is mostly because larger ontologies involve more candidate mappings, which makes it harder to keep Precision high without damaging Recall, and vice versa.

Task 1: FMA-NCI small fragments

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
LogMap-BK 45 2,727 0.949 0.883 0.915 2 0.02%
YAM++ 94 2,561 0.976 0.853 0.910 2 0.02%
GOMMA2012 40 2,626 0.963 0.863 0.910 2,130 20.9%
AML-BK-R 43 2,619 0.958 0.856 0.904 2 0.02%
AML-BK 39 2,695 0.942 0.867 0.903 2,932 28.8%
LogMap 41 2,619 0.952 0.851 0.899 2 0.02%
AML-R 19 2,506 0.963 0.823 0.888 2 0.02%
ODGOMS-v1.2 10,205 2,558 0.953 0.831 0.888 2,440 24.0%
AML 16 2,581 0.947 0.834 0.887 2,598 25.5%
LogMapLt 8 2,483 0.959 0.813 0.880 2,104 20.7%
ODGOMS-v1.1 6,366 2,456 0.963 0.807 0.878 1,613 15.8%
ServOMap 141 2,512 0.951 0.815 0.877 540 5.3%
SPHeRe 16 2,359 0.960 0.772 0.856 367 3.6%
HotMatch 4,372 2,280 0.965 0.751 0.845 285 2.8%
Average 2,330 2,527 0.896 0.754 0.810 1,582 15.5%
IAMA 14 1,751 0.979 0.585 0.733 166 1.6%
Hertuda 3,404 4,309 0.589 0.866 0.701 2,675 26.3%
StringsAuto 6,359 1,940 0.838 0.554 0.667 1,893 18.6%
XMapGen 1,504 1,687 0.833 0.479 0.608 1,092 10.7%
XMapSiG 1,477 1,564 0.864 0.461 0.602 818 8.0%
MaasMatch 12,410 3,720 0.407 0.517 0.456 9,988 98.1%
Table 2.1: Results for the largebio task 1.

Task 2: FMA-NCI whole ontologies

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
YAM++ 366 2,759 0.899 0.846 0.872 9 0.01%
GOMMA2012 243 2,843 0.860 0.834 0.847 5,574 3.8%
LogMap 162 2,667 0.874 0.795 0.832 10 0.01%
LogMap-BK 173 2,668 0.872 0.794 0.831 9 0.01%
AML-BK 201 2,828 0.816 0.787 0.802 16,120 11.1%
AML-BK-R 205 2,761 0.826 0.778 0.801 10 0.01%
Average 1,064 2,711 0.840 0.770 0.799 9,223 6.3%
AML-R 194 2,368 0.892 0.721 0.798 9 0.01%
AML 202 2,432 0.880 0.730 0.798 1,044 0.7%
SPHeRe 8,136 2,610 0.846 0.753 0.797 1,054 0.7%
ServOMap 2,690 3,235 0.727 0.803 0.763 60,218 41.3%
LogMapLt 60 3,472 0.686 0.813 0.744 26,442 18.2%
IAMA 139 1,894 0.901 0.582 0.708 180 0.1%
Table 2.2: Results for the largebio task 2.

Results for the FMA-SNOMED matching problem

The following tables summarize the results for the tasks in the FMA-SNOMED matching problem.

YAM++ provided the best results in terms of F-measure in both Task 3 and Task 4. YAM++ also obtained the best Precision in Task 3 and the best Recall in Task 4, while AML-BK obtained the best Recall in Task 3 and AML-R the best Precision in Task 4.

Overall, the results were less positive than in the FMA-NCI matching problem, and only YAM++ obtained an F-measure greater than 0.80 in both tasks. Furthermore, nine systems failed to reach a Recall higher than 0.4. Matching FMA against SNOMED thus represents a significant leap in complexity with respect to the FMA-NCI matching problem.

As in the FMA-NCI matching problem, efficiency decreases as the ontology size increases. SPHeRe, IAMA and GOMMA showed the largest drops, notably in Precision.

Task 3: FMA-SNOMED small fragments

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
YAM++ 100 6,635 0.982 0.729 0.836 13,040 55.3%
AML-BK 93 6,937 0.942 0.731 0.824 12,379 52.5%
AML 60 6,822 0.943 0.720 0.816 15,244 64.7%
AML-BK-R 122 6,554 0.950 0.696 0.803 15 0.06%
AML-R 86 6,459 0.949 0.686 0.796 14 0.06%
LogMap-BK 85 6,242 0.963 0.672 0.792 0 0.0%
LogMap 79 6,071 0.966 0.656 0.782 0 0.0%
ServOMap 391 5,828 0.955 0.622 0.753 6,018 25.5%
ODGOMS-v1.2 42,909 5,918 0.862 0.570 0.686 9,176 38.9%
Average 8,073 4,248 0.892 0.436 0.549 7,308 31.0%
GOMMA2012 54 3,666 0.924 0.379 0.537 2,058 8.7%
ODGOMS-v1.1 27,451 2,267 0.876 0.222 0.354 938 4.0%
HotMatch 32,244 2,139 0.872 0.209 0.337 907 3.9%
LogMapLt 15 1,645 0.973 0.179 0.302 773 3.3%
Hertuda 17,610 3,051 0.575 0.196 0.293 1,020 4.3%
SPHeRe 154 1,577 0.916 0.162 0.275 805 3.4%
IAMA 27 1,250 0.962 0.134 0.236 22,925 97.3%
XMapGen 12,127 1,827 0.694 0.142 0.236 23,217 98.5%
XMapSiG 11,720 1,581 0.760 0.134 0.228 23,025 97.7%
Table 2.3: Results for the largebio task 3.

Task 4: FMA whole ontology with SNOMED large fragment

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
YAM++ 402 6,842 0.947 0.725 0.821 ≥57,074 ≥28.3%
AML-BK 530 6,186 0.937 0.648 0.766 ≥40,162 ≥19.9%
AML 542 5,797 0.963 0.624 0.758 ≥39,472 ≥19.6%
AML-BK-R 584 5,858 0.941 0.617 0.745 29 0.01%
AML-R 554 5,499 0.966 0.594 0.736 7 0.004%
ServOMap 4,059 6,440 0.861 0.620 0.721 ≥164,116 ≥81.5%
LogMap-BK 556 6,134 0.874 0.600 0.711 0 0.0%
LogMap 537 5,923 0.888 0.588 0.708 0 0.0%
Average 2,448 5,007 0.835 0.479 0.588 40,143 19.9%
GOMMA2012 634 5,648 0.406 0.257 0.315 9,918 4.9%
LogMapLt 101 1,823 0.878 0.179 0.297 ≥4,393 ≥2.2%
SPHeRe 20,664 2,338 0.614 0.160 0.254 6,523 3.2%
IAMA 218 1,600 0.749 0.134 0.227 ≥160,022 ≥79.4%
Table 2.4: Results for the largebio task 4.

Results for the SNOMED-NCI matching problem

The following tables summarize the results for the tasks in the SNOMED-NCI matching problem.

LogMap-BK and ServOMap provided the best results in terms of both Recall and F-measure in Task 5 and Task 6, respectively. YAM++ provided the best Precision in Task 5, while AML-R did so in Task 6.

As in the previous matching problems, efficiency decreases as the ontology size increases. Furthermore, the results were less positive than in the FMA-SNOMED matching problem; for example, in Task 6, only ServOMap and YAM++ reached an F-measure higher than 0.7. Matching SNOMED against NCI thus represents another leap in complexity.

Task 5: SNOMED-NCI small fragments

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
LogMap-BK 444 13,985 0.894 0.677 0.770 ≥40 ≥0.05%
LogMap 433 13,870 0.896 0.672 0.768 ≥47 ≥0.06%
ServOMap 1,699 12,716 0.933 0.642 0.761 ≥59,944 ≥79.8%
AML-BK-R 397 13,006 0.920 0.648 0.760 ≥32 ≥0.04%
AML-BK 380 13,610 0.894 0.658 0.758 ≥66,389 ≥88.4%
AML-R 328 12,622 0.924 0.631 0.750 ≥36 ≥0.05%
YAM++ 391 11,672 0.967 0.611 0.749 ≥0 ≥0.0%
AML 291 13,248 0.895 0.642 0.747 ≥63,305 ≥84.3%
Average 602 12,003 0.925 0.599 0.723 32,222 42.9%
LogMapLt 55 10,962 0.944 0.560 0.703 ≥60,427 ≥80.5%
GOMMA2012 221 10,555 0.940 0.537 0.683 ≥50,189 ≥66.8%
SPHeRe 2,486 9,389 0.924 0.469 0.623 ≥46,256 ≥61.6%
IAMA 99 8,406 0.965 0.439 0.604 ≥40,002 ≥53.3%
Table 2.5: Results for the largebio task 5.

Task 6: NCI whole ontology with SNOMED large fragment

System  Time (s)  # Mappings  Precision  Recall  F-measure  Unsat. classes  Unsat. degree
ServOMap 6,320 14,312 0.822 0.637 0.718 ≥153,259 ≥81.0%
YAM++ 713 12,600 0.881 0.601 0.714 ≥116 ≥0.06%
AML-BK 571 11,354 0.918 0.564 0.699 ≥121,525 ≥64.2%
AML-BK-R 636 11,033 0.929 0.555 0.695 ≥41 ≥0.02%
LogMap-BK 1,088 12,217 0.871 0.576 0.693 ≥1 ≥0.001%
LogMap 1,233 11,938 0.882 0.570 0.692 ≥1 ≥0.001%
AML 570 10,940 0.927 0.549 0.689 ≥121,171 ≥64.1%
AML-R 640 10,622 0.938 0.539 0.685 ≥51 ≥0.03%
Average 1,951 11,581 0.880 0.549 0.674 72,365 38.3%
LogMapLt 132 12,907 0.802 0.560 0.660 ≥150,773 ≥79.7%
GOMMA2012 728 12,440 0.787 0.530 0.634 ≥127,846 ≥67.6%
SPHeRe 10,584 9,776 0.881 0.466 0.610 ≥105,418 ≥55.7%
IAMA 207 8,843 0.917 0.439 0.593 ≥88,185 ≥46.6%
Table 2.6: Results for the largebio task 6.

Summary results for the top systems

The following table summarises the results for the systems that completed all six tasks of the Large BioMed Track. It shows the total time (in seconds) needed to complete all tasks and the averages for Precision, Recall, F-measure and incoherence degree. Systems are ordered by average F-measure, with ties broken by incoherence degree.

YAM++ was a step ahead and obtained the best average Precision, Recall and F-measure.

AML-R obtained the second best average Precision while AML-BK obtained the second best average Recall.

Regarding mapping incoherence, LogMap-BK computed, on average, the mapping sets leading to the smallest number of unsatisfiable classes. The configurations of AML using (R)epair also obtained very good results in terms of mapping coherence.

Finally, LogMapLt was the fastest system. The remaining tools, apart from ServOMap and SPHeRe, were also very fast and needed only between roughly 11 and 42 minutes to complete all six tasks. ServOMap required around 4 hours, while SPHeRe required almost 12 hours.

System  Total time (s)  Avg. Precision  Avg. Recall  Avg. F-measure  Avg. incoherence degree
YAM++ 2,066 0.942 0.728 0.817 14.0%
AML-BK 1,814 0.908 0.709 0.792 44.2%
LogMap-BK 2,391 0.904 0.700 0.785 0.014%
AML-BK-R 1,987 0.921 0.692 0.785 0.027%
AML 1,681 0.926 0.683 0.783 43.1%
LogMap 2,485 0.910 0.689 0.780 0.015%
AML-R 1,821 0.939 0.666 0.776 0.029%
ServOMap 15,300 0.875 0.690 0.766 52.4%
GOMMA2012 1,920 0.813 0.567 0.654 28.8%
LogMapLt 371 0.874 0.517 0.598 34.1%
SPHeRe 42,040 0.857 0.464 0.569 21.4%
IAMA 704 0.912 0.386 0.517 46.4%
Table 3: Summary Results for the top systems.

Harmonization of the mapping outputs

Mapping repair evaluation

Contact

If you have any questions or suggestions related to the results of this track, or if you notice any kind of error (wrong numbers, incorrect information about a matching system, etc.), feel free to write an email to ernesto [at] cs [.] ox [.] ac [.] uk or ernesto [.] jimenez [.] ruiz [at] gmail [.] com

Original page: http://www.cs.ox.ac.uk/isg/projects/SEALS/oaei/