We have collected all generated alignments and made them available as a zip file via the following link. These alignments are the raw results on which this report is based.
We ran each system in its standard setting and compared precision, recall, and F-measure. The only exception was AgentOM, which provided its own alignments (generated using a commercial LLM).
We used the MELT platform to run and evaluate the systems. The reported evaluation results are currently taken directly from the MELT platform.
This year, we had four participants: AgentOM, LogMap, LogMapLt, and Matcha. With regard to F-measure, LogMap achieved the highest score.
The table also shows precision, recall, F-measure, and the size of the alignments at the optimal threshold.
LogMap and MATCHA assign a weight to each pair in their generated alignments (so does AgentOM; however, all of its mappings had confidence 1). We inspected the results to see whether an optimal threshold could be found for each system, below which the generated mappings would be discarded.
The weights in LogMap's alignment range from 1.12 (?) to 0.5. Multiple matches had the lowest weight (0.5). These mappings were a mix of correct mappings (the majority) and false positives. The computed threshold that yields the highest F-measure is 0.5 (inclusive), which is consistent with our analysis of the mappings. LogMap therefore achieves its highest F-measure when all of its mappings are taken into consideration, without any threshold adjustment.
In the case of MATCHA, the weights range from 1 to 0.6. Surprisingly, all mappings with the highest weights were false positives. The lowest weight assigned to a correct mapping was 0.9306. This value is also the computed threshold that yields the highest F-measure, 0.61. A threshold of 0.9306 (inclusive) is very specific. With a more general threshold of 0.9 (inclusive), the F-measure is only slightly lower, at 0.6. When either threshold is applied, the number of correct mappings stays the same, while the number of false positives drops from 222 to 44 for threshold 0.9306 or to 46 for threshold 0.9.
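The threshold search described above can be illustrated with a short Python sketch. It is not the MELT implementation; the function names `prf` and `best_threshold` and the toy data are assumptions made for illustration only. The sketch tries every distinct weight as an inclusive threshold and keeps the one with the highest F-measure.

```python
from typing import Dict, Optional, Set, Tuple

Pair = Tuple[str, str]  # (entity 1, entity 2)


def prf(found: Set[Pair], reference: Set[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) of `found` against `reference`."""
    tp = len(found & reference)  # correct mappings
    precision = tp / len(found) if found else 0.0
    recall = tp / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


def best_threshold(
    alignment: Dict[Pair, float], reference: Set[Pair]
) -> Tuple[Optional[float], float]:
    """Try each distinct weight as an inclusive threshold (weight >= t).

    Returns the threshold with the highest F-measure; on ties, the lowest
    such threshold wins because candidates are visited in ascending order.
    """
    best_t: Optional[float] = None
    best_f = -1.0
    for t in sorted(set(alignment.values())):
        kept = {pair for pair, weight in alignment.items() if weight >= t}
        f_measure = prf(kept, reference)[2]
        if f_measure > best_f:
            best_t, best_f = t, f_measure
    return best_t, best_f


# Hypothetical toy data, not taken from the track results.
toy_alignment = {
    ("a", "x"): 1.0,
    ("b", "y"): 0.95,
    ("c", "z"): 0.7,
    ("d", "w"): 0.6,
}
toy_reference = {("a", "x"), ("b", "y"), ("e", "v")}

threshold, f_measure = best_threshold(toy_alignment, toy_reference)
print(threshold, round(f_measure, 2))
```

Only weights that actually occur in the alignment are tried, because the kept set changes only at those values. This is also why a computed optimum can be as specific as 0.9306.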
Additionally, we analysed the false positives, i.e. alignments discovered by the tools that were evaluated as incorrect. The list of false positives can be viewed here in the following structure: entity 1 (E1), entity 2 (E2), one column per tool (marked with an "x" if the tool discovered the alignment), and the reason why the alignment was discovered (Why Found).
We looked at the reasons why an alignment was discovered from a general point of view, and defined four reasons why it could have been chosen:
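As a rough illustration of that layout, the following Python sketch builds the E1, E2 and per-tool "x" columns from a set of alignments and a reference alignment. The function name `false_positive_rows`, the tool names and the entity names are hypothetical. The Why Found column is left out because it was assigned by manual inspection.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (entity 1, entity 2)


def false_positive_rows(
    alignments: Dict[str, Set[Pair]], reference: Set[Pair]
) -> List[List[str]]:
    """Build one row per false positive: E1, E2, then one cell per tool.

    A cell holds "x" if that tool produced the mapping, otherwise "".
    Tools are listed in alphabetical order.
    """
    tools = sorted(alignments)
    # Every mapping found by at least one tool that is not in the reference.
    false_positives = set().union(*alignments.values()) - reference
    rows = []
    for e1, e2 in sorted(false_positives):
        marks = ["x" if (e1, e2) in alignments[tool] else "" for tool in tools]
        rows.append([e1, e2, *marks])
    return rows


# Hypothetical toy data, not taken from the track results.
toy_alignments = {
    "ToolA": {("Paper", "Article"), ("Author", "Author")},
    "ToolB": {("Paper", "Article"), ("Review", "Reviewer")},
}
toy_reference = {("Author", "Author")}

rows = false_positive_rows(toy_alignments, toy_reference)
for row in rows:
    print(row)
```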
The results show that when an alignment was discovered because of an identical name, all or at least most tools generated the mapping. LogMap and MATCHA also generated mappings based on similar strings. AgentOM and MATCHA generated mappings based on synonyms. All four systems generated mappings where the same word was present in the entities' names. Below are some comments on the individual tools and the reasons why they discovered the false positives:
This track is organized by Jana Vataščinová, Ondřej Zamazal, Huanyu Li, Eva Blomqvist, and Patrick Lambrix. If you have any problems working with the ontologies, any questions related to tool wrapping, or any suggestions related to the ce track, feel free to write an email to oaei-ce [at] groups [dot] liu [dot] se.