The following content is (mainly) based on the final version of the interactive section in the OAEI results paper.
If you notice any kind of error (wrong numbers, incorrect information on a matching system), do not hesitate to contact us.
Many ontology matching systems have been developed over the last years. However, after several years of experience, the results can only be slightly improved in terms of alignment quality (precision and recall, resp. F-measure). This suggests that fully automatic ontology matching systems are slowly approaching an upper bound on the results they can achieve. By incorporating user interaction, we expect to improve the alignments further and to push this upper boundary. Semi-automatic ontology matching approaches are quite promising, since humans can genuinely help the systems, for example by detecting incorrect correspondences.
Whenever the user is directly involved, the required human effort has to be taken into account, and it has to be in an appropriate proportion to the result. Thus, besides the quality of the alignment, other measures such as the number of interactions are meaningful for deciding which matching system is best suited for a certain matching task. Until now, all OAEI tracks have focused on fully automatic matching, and semi-automatic matching has not been evaluated, although such systems already exist, e.g., LogMap2 (Jiménez-Ruiz et al., 2011). As long as the evaluation of such systems is not driven forward, it is hardly possible to systematically compare the quality of interactive matching approaches. With this track, we would like to change this unfavorable situation by explicitly offering a systematic, automated evaluation of matching systems with user interaction.
For the second edition of the interactive track, we use the well-established OAEI Conference data set. This data set covers 16 ontologies describing the domain of conference organization. We only use the test cases for which a reference alignment is publicly available (altogether 21 alignments). Over the last years, the quality of the generated alignments has increased constantly, but only by a small amount (a few percent). In 2013, the best system according to F-measure (YAM++) achieved a value of 70% (Cuenca Grau et al., 2013). This shows that there is significant room for improvement, which could be filled by interactive means.
Moreover, the Conference data set is of a suitable size, so that most systems can participate without running into run time or memory problems.
The interactive matching track was evaluated at OAEI 2014 for the second time. The goal of this evaluation is to simulate interactive matching (Paulheim et al., 2013), where a human expert is involved to validate correspondences found by the matching system. In the evaluation, we look at how interacting with the user improves the matching results. We use the conference data set with the ra1 reference alignment, where there is quite a bit of room for improvement, with the best fully automatic, i.e., non-interactive, matchers achieving an F-measure of 74%. The SEALS client was modified to allow interactive matchers to ask an oracle, which emulates a (perfect) user. The interactive matcher can present a correspondence to the oracle, which then tells the matcher whether the correspondence is right or wrong. All matchers participating in the interactive track support both interactive and non-interactive matching. This allows us to analyze how much benefit the interaction brings for the individual matchers.
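As a rough illustration of this protocol, the oracle can be pictured as a validation service over the reference alignment that counts every question it is asked. The following minimal sketch uses hypothetical names (Correspondence, Oracle, validate); it is not the actual SEALS client API.

```java
import java.util.Set;

// A correspondence between an entity of each ontology; a Java record
// provides the value-based equals/hashCode needed for set lookups.
record Correspondence(String sourceEntity, String targetEntity, String relation) {}

// A perfect oracle: it answers exactly according to the reference
// alignment and keeps track of how often it has been asked.
class Oracle {
    private final Set<Correspondence> reference;
    private int interactions = 0;

    Oracle(Set<Correspondence> referenceAlignment) {
        this.reference = referenceAlignment;
    }

    // The matcher presents a correspondence; the oracle says right or wrong.
    boolean validate(Correspondence c) {
        interactions++;                 // every question adds to the effort count
        return reference.contains(c);  // a perfect user: the reference decides
    }

    int interactionCount() {
        return interactions;
    }
}
```

The interaction counter is what allows measures such as "interactions per reference correspondence", discussed below, to be reported alongside precision and recall.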
Overall, four matchers participated in the interactive matching track: AML, Hertuda, LogMap, and WeSeE-Match. AML and LogMap have been further developed compared to last year; the other two are the same as last year. All of them implement interactive strategies that run entirely as a post-processing step to the automatic matching, i.e., they take the alignment produced by the base matcher and try to refine it by selecting a suitable subset.
AML asks the oracle whenever the variance between the similarity values computed by its different matching algorithms is significant. In addition, AML performs its alignment repair step interactively. (Last year, AML instead presented all correspondences below a certain confidence threshold to the oracle, starting with the highest confidence values.) LogMap checks all questionable correspondences using the oracle. Hertuda and WeSeE-Match try to adaptively set an optimal threshold for cutting off mappings: they perform a binary search in the space of possible thresholds, first presenting a correspondence of average confidence to the oracle. If the answer is positive, the search is continued with a lower threshold, otherwise with a higher one.
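The threshold strategy can be illustrated with the following sketch. It is our own reconstruction of the idea, not the actual Hertuda or WeSeE-Match code; ScoredCorrespondence and selectByThreshold are hypothetical names, and the Correspondence and Oracle types are reused from the sketch above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A candidate correspondence together with the base matcher's confidence.
record ScoredCorrespondence(Correspondence correspondence, double confidence) {}

class ThresholdSearch {

    // Binary search for a cut-off threshold: the candidates are sorted by
    // confidence, the correspondence in the middle of the current interval
    // is shown to the oracle, and the interval is narrowed depending on
    // the answer. All correspondences at or above the found cut are kept.
    static List<Correspondence> selectByThreshold(List<ScoredCorrespondence> candidates,
                                                  Oracle oracle) {
        List<ScoredCorrespondence> sorted = new ArrayList<>(candidates);
        sorted.sort(Comparator.comparingDouble(ScoredCorrespondence::confidence));

        int lo = 0;
        int hi = sorted.size() - 1;
        int cut = sorted.size(); // pessimistic default: keep nothing
        while (lo <= hi) {
            int mid = (lo + hi) / 2;
            if (oracle.validate(sorted.get(mid).correspondence())) {
                cut = mid;    // correct: the boundary lies at or below mid,
                hi = mid - 1; // so continue with a lower threshold
            } else {
                lo = mid + 1; // wrong: the threshold has to be raised
            }
        }

        List<Correspondence> selected = new ArrayList<>();
        for (ScoredCorrespondence sc : sorted.subList(cut, sorted.size())) {
            selected.add(sc.correspondence());
        }
        return selected;
    }
}
```

Note that such a search needs only logarithmically many oracle questions in the number of candidate correspondences, so a threshold-based strategy is inherently cheap in terms of interactions.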
The results are depicted in Table 1. The biggest improvement in F-measure, as well as the best overall result, is achieved by AML, which increases its F-measure by seven percentage points compared to the non-interactive results. Furthermore, AML shows a statistically significant increase in recall as well as precision, while all the other tools except Hertuda show a significant increase in precision. In all cases except AML, the increase in precision is higher than the increase in recall. At the same time, LogMap has the lowest number of interactions with the oracle, which shows that it also makes efficient use of the oracle: while Hertuda on average asks about almost as many correspondences as the reference alignment contains, LogMap needs only around 0.4 interactions per correspondence in the reference alignment. In a truly interactive setting, a low number of interactions means that the manual effort is minimized.
On the other hand, Hertuda and WeSeE-Match even show a decrease in recall, which the increase in precision cannot compensate for. Thus, we conclude that their strategy is not as effective as those of the other participants.
Compared to last year's results, AML improved its F-measure by almost 10%, whereas LogMap shows a slight decrease in recall and, hence, in F-measure. Compared to the results of the non-interactive Conference track, the best interactive matcher (in terms of F-measure) is better than all non-interactive matching systems.
The results show that current interactive matching tools mainly use interaction as a means to post-process an alignment found by fully automatic means. There are, however, other conceivable interactive approaches that involve the user at an earlier stage of the process, e.g., using interaction for parameter tuning (Ritze and Paulheim, 2011), or interactively determining anchor elements for structure-based matching approaches. The maximum F-measure of 0.801 achieved shows that there is still room for improvement. Furthermore, different variations of the evaluation method can be thought of, including different noise levels in the oracle's responses (i.e., simulating errors made by the human expert), or allowing other means of interaction than the validation of single correspondences, e.g., providing a random positive example, or providing the corresponding element in one ontology given an element of the other one.
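As an example of the first variation, a noisy oracle could be simulated by flipping the perfect oracle's answer with a fixed error probability. Again, this is only a hypothetical sketch building on the Oracle type from above, not an implemented evaluation mode.

```java
import java.util.Random;
import java.util.Set;

// Hypothetical sketch: a noisy oracle simulating an imperfect human
// expert by flipping the perfect answer with a fixed error probability.
class NoisyOracle extends Oracle {
    private final double errorRate; // probability of giving a wrong answer
    private final Random random;

    NoisyOracle(Set<Correspondence> referenceAlignment, double errorRate, long seed) {
        super(referenceAlignment);
        this.errorRate = errorRate;
        this.random = new Random(seed); // seeded for reproducible evaluations
    }

    @Override
    boolean validate(Correspondence c) {
        boolean truth = super.validate(c); // still counts as one interaction
        return random.nextDouble() < errorRate ? !truth : truth;
    }
}
```

Running the same matchers against oracles with increasing error rates would show how robust each interactive strategy is to mistakes made by the human expert.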