We have collected all generated alignments and made them available in a zip file via the following link. These alignments are the raw results on which the following report is based.
We conducted experiments by executing each system in its standard setting and compared precision, F-measure, and recall.
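For reference, these metrics follow their standard definitions over sets of correspondences. The sketch below merely restates those definitions; the entity pairs in the example are made up, and all numbers reported here come from the MELT platform, not from this code.

```python
# Illustrative restatement of the reported metrics. A correspondence is
# modelled as an (entity1, entity2) pair; both alignments are plain sets.

def precision_recall_f1(system: set, reference: set) -> tuple:
    """Return (precision, recall, F1) of a system alignment
    against a reference alignment."""
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example with hypothetical correspondences:
reference = {("a:Battery", "b:Battery"), ("a:Reuse", "b:ReUse")}
system = {("a:Battery", "b:Battery"), ("a:Waste", "b:Water")}
print(precision_recall_f1(system, reference))  # (0.5, 0.5, 0.5)
```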
We used the MELT platform to execute the evaluations for all systems. Currently, the reported evaluation results are taken directly from the MELT platform.
This year, we had three participants: LogMap, LogMapLt, and Matcha. With regard to F1-measure, Matcha achieved the highest score.
LogMap and Matcha provide correspondences with real-valued confidence; therefore, we applied thresholding during evaluation. The table also shows the results for precision, F-measure, recall, and the size of the alignments at the optimal threshold. Regarding recall, Matcha achieved the best score.
LogMap and Matcha use weights to score each pair in their generated alignments. To see whether we can find an optimal threshold for each system (below which the generated mappings would not be taken into consideration), we inspected the results.
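Such a search amounts to sweeping the confidence values that occur in an alignment and keeping, for each candidate threshold, only the correspondences at or above it. The sketch below shows this under the assumption that an alignment is given as a list of ((entity1, entity2), confidence) pairs together with a reference alignment; the function name and input format are illustrative assumptions, not MELT's API.

```python
# Minimal threshold sweep: candidate thresholds are exactly the confidence
# values occurring in the scored alignment itself.

def best_threshold(scored: list, reference: set) -> tuple:
    """Return (threshold, f1) maximising F-measure when all correspondences
    with confidence below the threshold are discarded."""
    best = (0.0, -1.0)
    for threshold in sorted({conf for _, conf in scored}):
        # Keep correspondences at or above the threshold (inclusive).
        kept = {pair for pair, conf in scored if conf >= threshold}
        tp = len(kept & reference)
        p = tp / len(kept) if kept else 0.0
        r = tp / len(reference) if reference else 0.0
        f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
        if f1 > best[1]:
            best = (threshold, f1)
    return best
```

Using `>=` in the filter matches the inclusive thresholds reported for the individual systems below.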
The weights in LogMap's alignment range from 0.14 to 0.93 (with only one weight below 0.5). The mapping with the highest weight was a false positive, as was the mapping with the lowest weight. There were multiple mappings with the second-lowest weight (0.5); these were a mix of correct mappings and false positives. Based on these results, an optimal threshold for LogMap's results could be set to 0.5 (inclusive), which is also the computed threshold with the highest F-measure.
In the case of Matcha, the weights of its results range between 0.600378464 and 1. Mappings with the highest weight included both correct mappings and false positives. The correct mapping with the lowest weight had a weight of 0.65293388, which is only slightly above the overall minimum. However, most true positives have weights greater than 0.9. Based on these results, an optimal threshold for Matcha's results could be set to 0.9, which is also the computed threshold with the highest F-measure.
The table below repeats precision, F-measure, and recall for the matching systems with their thresholds set.
Additionally, we analysed the false positives, i.e., mappings discovered by the tools that were evaluated as incorrect. The list of false positives can be viewed here in the following structure: entity 1 (E1), entity 2 (E2), one column per tool (marked with an "x" if that tool discovered the mapping), and the reason why the mapping was discovered (Why Found).
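For illustration, a table in this structure could be assembled as follows; the input dictionary format, the function name, and the CSV output are assumptions for the sketch, not how the linked list was actually produced.

```python
import csv

def write_fp_table(false_positives: dict, tools: list, path: str) -> None:
    """Write the false-positive table: E1, E2, one "x" column per tool,
    and the Why Found reason.

    false_positives maps (e1, e2) to {"tools": set of tool names,
    "why": reason string}.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["E1", "E2", *tools, "Why Found"])
        for (e1, e2), info in sorted(false_positives.items()):
            marks = ["x" if t in info["tools"] else "" for t in tools]
            writer.writerow([e1, e2, *marks, info["why"]])
```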
We looked at the reasons why a mapping was discovered from a general point of view, and defined four reasons why it could have been chosen: (1) the entities have the same name, (2) the entities' names are similar strings, (3) the same word is present in both entities' names, and (4) the reason is not obvious.
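To make these categories concrete, the sketch below shows one way such reasons could be assigned automatically from the entities' local names; the function names, the 0.8 similarity cutoff, and the tokenisation are illustrative assumptions, and the actual classification in this report was done manually.

```python
import difflib
import re

def local_name(uri: str) -> str:
    # Last URI segment, e.g. "http://ex.org/onto#WasteWater" -> "WasteWater".
    return re.split(r"[#/]", uri)[-1]

def words(name: str) -> set:
    # Split camelCase/underscored names into lower-cased words.
    return {w.lower() for w in re.findall(r"[A-Za-z][a-z]*", name)}

def why_found(e1: str, e2: str) -> str:
    # Heuristic assignment of the four reason categories defined above.
    n1, n2 = local_name(e1), local_name(e2)
    if n1.lower() == n2.lower():
        return "same name"
    if difflib.SequenceMatcher(None, n1.lower(), n2.lower()).ratio() > 0.8:
        return "similar strings"
    if words(n1) & words(n2):
        return "same word in the names"
    return "not obvious"
```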
Looking at the results, it can be said that when the reason a mapping was discovered was the same name, all or at least most tools generated the mapping. LogMap and Matcha further generated mappings based on similar strings. All three systems generated mappings where the same word was present in the entities' names. Lastly, Matcha produced two mappings where the reason is not obvious. Below are some comments on the individual tools and the reasons why they discovered the false positives:
As a possibly interesting observation, no false positives were found that had been generated based on synonyms in the entities' names.
This track is organized by Jana Vataščinová, Ondřej Zamazal, Huanyu Li, Eva Blomqvist, and Patrick Lambrix. If you have any problems working with the ontologies, any questions related to tool wrapping, or any suggestions related to the ce track, feel free to write an email to oaei-ce [at] groups [dot] liu [dot] se.