Ontology Alignment Evaluation Initiative - OAEI-2012 Campaign

Results for the Library test set

The following content is (mainly) based on the final version of the library section in the OAEI results paper.
If you notice any kind of error (wrong numbers, incorrect information on a matching system) do not hesitate to contact us.

Reference Alignment

A mapping between STW and TheSoz already exists; it was manually created by domain experts in the KoMoHe project [Mayr 2008]. However, it does not cover the changes and enhancements made to both thesauri since 2006. It is available in SKOS with the different matching types skos:exactMatch, skos:broadMatch and skos:narrowMatch. Within the reference alignment, concepts of one thesaurus can be aligned to more than one concept of the other thesaurus, so we face an n:m mapping of the concepts. All in all, 4,285 TheSoz concepts and 2,320 STW concepts are aligned with 2,839 exact matches, 34 broader matches and 1,416 narrower matches. It is important to note that the reference alignment only contains alignments between the descriptors of both thesauri, i.e., the concepts that are actually used for document indexing. The upper part of the hierarchy consists of non-descriptor concepts (or categories) that are only used to organize the descriptors below them. We take this peculiarity into account by only assessing the generated alignments between descriptors and ignoring alignments involving non-descriptors. However, this might change in the future, as the results of this track could be used to extend the reference alignment to the upper part of the hierarchy.
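The descriptor-only restriction described above can be sketched as a simple filter over the correspondences. The tuple representation and the concept identifiers below are our own illustration, not the actual format or data used in the campaign.

```python
# Sketch of the descriptor-only filtering: keep a correspondence only
# if the concepts on both sides are descriptors; alignments involving
# non-descriptor (category) concepts are ignored.
# The (source, target, relation) tuples and the identifiers are
# hypothetical, for illustration only.

def filter_reference(alignment, stw_descriptors, thesoz_descriptors):
    """Keep only correspondences between descriptors on both sides."""
    return [
        (src, tgt, rel)
        for src, tgt, rel in alignment
        if src in thesoz_descriptors and tgt in stw_descriptors
    ]

# Toy example: one descriptor-level match, one category-level match.
thesoz_descriptors = {"thesoz:10034714"}   # hypothetical concept IDs
stw_descriptors = {"stw:19040-5"}
alignment = [
    ("thesoz:10034714", "stw:19040-5", "exactMatch"),
    ("thesoz:categoryA", "stw:categoryB", "exactMatch"),  # non-descriptors
]
print(filter_reference(alignment, stw_descriptors, thesoz_descriptors))
# → [('thesoz:10034714', 'stw:19040-5', 'exactMatch')]
```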

The reference alignment we used for the evaluation is now available. Download Reference Alignment

Experimental Setting

To compare the created alignments with the reference alignment, we use the Alignment API. For this first evaluation, we only included equivalence relations (skos:exactMatch).

The generated alignments are available here.

All matching processes have been performed on a Debian machine with one 2.4GHz core and 7GB RAM allocated to each system. The evaluation has been executed using SEALS technologies. For ServOMap, ServOMapLt and Optima, we used slightly adapted ontologies as input, since these systems cannot handle URIs whose last part consists only of numbers, as is the case in the official version. Each participating system uses the OWL version. We computed precision, recall and F-measure (beta=1) for each matcher. We only consider equivalence correspondences between two descriptors, as non-descriptors are not included in the reference alignment. This filtering improves the precision (~8%) as well as the F-measure (~4%) for all systems. Moreover, we measured the runtime and the size of the created alignment, and checked whether a 1:1 alignment has been created. To assess the results of the matchers, we developed three straightforward matching strategies using the original SKOS version of the thesauri; these baselines are discussed in the results below.
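The evaluation measures above can be made concrete with a short sketch. Precision, recall and F-measure (beta=1) are computed over sets of correspondences; the sets below are invented toy data, not campaign results.

```python
# Precision, recall and F-measure (beta = 1) over sets of
# correspondences, as used for the comparison against the
# reference alignment. The example sets are illustrative only.

def evaluate(found, reference, beta=1.0):
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f

reference = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
found = {("a", "x"), ("b", "y"), ("e", "v")}  # 2 correct, 1 wrong, 2 missed
p, r, f = evaluate(found, reference)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.5 0.571
```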

Results

All systems listed in the table above are sorted according to their F-measure values. Altogether 13 of the 21 submitted matching systems were able to create an alignment. Three matching systems (MaasMatch, MEDLEY, Wmatch) did not finish within the time frame of one week while five threw an exception.

Of all these systems, GOMMA performs best in terms of F-measure, closely followed by ServOMapLt and LogMap. However, precision and recall vary a lot across the top three systems. Depending on the application, an alignment with either high precision or high recall may be preferable. If recall is the focus, the alignment created by GOMMA is probably the best choice, with a recall of about 90%. Other systems generate alignments with higher precision, e.g. ServOMap with over 70% precision, while mostly having significantly lower recall values (except for Hertuda).

From the results obtained by the matching strategies taking the different types of labels into account, we can see that a matching based on preferred labels only outperforms the other matching strategies. MatcherPref achieves the highest F-measure in these tests. The results of MatcherPrefDE and MatcherPrefEN provide an insight into the language characteristics of both thesauri and the reference alignment. MatcherPrefDE achieves the highest precision value (nearly 90%), albeit with a recall of only 60%. Both thesauri as well as the reference alignment have been developed in Germany and focus on German terms. The results of MatcherPrefEN show the difference: precision and especially recall decrease significantly when only the preferred English labels are used. On the one hand, only about 80% of the found correspondences are correct; on the other hand, less than half of all correspondences can be found this way. This can be a disadvantage for systems that use NLP techniques on English labels or rely on language-specific background knowledge like WordNet.

The high precision values of the pref matchers reflect the fact that the preferred labels are chosen specifically to unambiguously identify the concepts. Our interpretation is that the English translations are partly not as precise as the original German terms (drop in precision) and not consistent regarding the English terminology (drop in recall).

In contrast, MatcherAllLabels achieves a quite high recall (90%) but a rather low precision (54%). This means that most but not all of the correspondences can be found by simply looking for equivalent labels. However, when following this idea, nearly half of the found correspondences are incorrect. The rather high F-measure of MatcherAllLabels is therefore misleading: at least if the results were used unchecked in a retrieval system, a higher precision would clearly be preferred over a higher recall. In this respect, matchers like ServOMap show better results. In any case, it can be seen that a matching system using the original SKOS version could achieve a better result; the information loss when converting SKOS to OWL really matters.
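The two families of label-based baselines can be sketched as follows: matching on preferred labels only (MatcherPref-style) versus matching on all labels, preferred and alternative (MatcherAllLabels-style). The concept dictionaries, label keys and example terms below are our own illustration; the actual baseline implementations may differ.

```python
# Sketch of the label-based baseline strategies: equality of normalized
# preferred labels only, versus equality over all (preferred and
# alternative) labels. Concepts and labels are invented for illustration.

def normalize(label):
    return label.strip().lower()

def match_by_labels(concepts_a, concepts_b, label_key):
    """Align concepts with at least one equal normalized label.
    label_key selects which labels of a concept to compare."""
    index = {}
    for concept, labels in concepts_b.items():
        for label in label_key(labels):
            index.setdefault(normalize(label), set()).add(concept)
    matches = set()
    for concept, labels in concepts_a.items():
        for label in label_key(labels):
            for other in index.get(normalize(label), ()):
                matches.add((concept, other))
    return matches

thesoz = {"t1": {"pref": ["Arbeitsmarkt"], "alt": ["Stellenmarkt"]}}
stw = {"s1": {"pref": ["Arbeitsmarkt"], "alt": []},
       "s2": {"pref": ["Stellenmarkt"], "alt": []}}

pref_only = match_by_labels(thesoz, stw, lambda l: l["pref"])
all_labels = match_by_labels(thesoz, stw, lambda l: l["pref"] + l["alt"])
print(sorted(pref_only))   # → [('t1', 's1')]
print(sorted(all_labels))  # → [('t1', 's1'), ('t1', 's2')]
```

Including the alternative labels finds more candidates (higher recall) but also produces more ambiguous n:m correspondences, mirroring the precision drop of MatcherAllLabels.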

Concerning the runtime, LogMap as well as ServOMap are quite fast, with runtimes below 50 seconds. These values are comparable to, or in the case of LogMapLt even better than, those of the strategies computing the equivalence between preferred labels. Thus, they are very efficient at matching large ontologies while achieving very good results. Other matchers take several hours or even days and do not produce better alignments in terms of F-measure. Computing the correlation between F-measure and runtime yields a slightly negative value (-0.085), but the small number of samples is not sufficient for a statistically significant statement. However, we can say for certain that a longer runtime does not necessarily lead to better results.
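The reported correlation is a standard Pearson coefficient between the per-system F-measures and runtimes. The sketch below shows the computation; the sample values are invented, not the actual campaign measurements, so the resulting coefficient will not match the reported -0.085.

```python
# Pearson correlation between F-measure and runtime. The numbers
# below are hypothetical stand-ins for the per-system measurements.
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

f_measures = [0.72, 0.70, 0.68, 0.55, 0.40]  # hypothetical
runtimes = [45, 50, 30000, 90000, 250000]    # seconds, hypothetical
print(round(pearson(f_measures, runtimes), 3))
```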

We further observe that the n:m reference alignment affects the results, because some matching systems (ServOMap, WeSeE, HotMatch, CODI, MapSSS) only create 1:1 alignments and discard correspondences with entities that already occur in another correspondence. Whenever a system creates a lot of n:m correspondences, e.g., Hertuda and GOMMA, the recall significantly increases. This difference becomes clear when comparing ServOMapLt and ServOMap: both systems are largely based on the same methods, but ServOMapLt does not use the 1:1 filtering. Consequently, the recall increases and the precision decreases.
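A common way to enforce a 1:1 alignment is a greedy filter: correspondences are sorted by confidence and accepted only if neither entity already occurs in an accepted correspondence. The sketch below illustrates this idea with invented data; the participating systems may implement the filtering differently.

```python
# Greedy 1:1 filtering sketch: accept correspondences in order of
# decreasing confidence, discarding any whose source or target entity
# is already used. The candidate list is invented for illustration.

def one_to_one(correspondences):
    """correspondences: list of (source, target, confidence) tuples."""
    used_src, used_tgt, kept = set(), set(), []
    for src, tgt, conf in sorted(correspondences, key=lambda c: -c[2]):
        if src not in used_src and tgt not in used_tgt:
            kept.append((src, tgt, conf))
            used_src.add(src)
            used_tgt.add(tgt)
    return kept

cands = [("t1", "s1", 0.9), ("t1", "s2", 0.8),
         ("t2", "s1", 0.7), ("t2", "s3", 0.6)]
print(one_to_one(cands))  # → [('t1', 's1', 0.9), ('t2', 's3', 0.6)]
```

Against an n:m reference alignment, such filtering necessarily discards some correct correspondences, which explains the lower recall of the 1:1 systems.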

Since the reference alignment has not been updated for about six years, it does not reflect later updates of the two thesauri. Thus, new correct correspondences might be found by the matching systems but be marked as incorrect because they are not included in the reference alignment. Therefore, we applied a manual evaluation to check whether the matching systems found correct correspondences that are not included in the reference alignment at all. In turn, this information can help to improve the reference alignment.

The manual evaluation has been conducted by domain experts. All newly detected correspondences not yet contained in the reference alignment have been considered. Because exact matches have to be 1:1 relationships, only those correspondences have been examined whose terms are descriptors and not yet involved in an existing correspondence. The remaining correspondences are considered wrong, as they contain a term for which a correspondence already exists.

Since all matching systems delivered correspondences representing exact matches, they have been judged in this specific regard. That means that correspondences whose terms cannot be seen as equivalent, but possibly as related, broader or narrower, have for now been considered wrong.
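The candidate selection for the manual evaluation described above can be sketched as follows. The helper name, tuple representation and example data are our own illustration of the stated rules, not the actual evaluation tooling.

```python
# Sketch of the candidate selection for the manual evaluation: a newly
# found correspondence goes to the domain experts only if both terms
# are descriptors and neither is already involved in a reference
# correspondence; otherwise it counts as wrong. Data is invented.

def select_candidates(found, reference, descriptors):
    ref_src = {s for s, _ in reference}
    ref_tgt = {t for _, t in reference}
    examine, wrong = [], []
    for src, tgt in found:
        if src in ref_src or tgt in ref_tgt:
            wrong.append((src, tgt))     # an entity is already aligned
        elif src in descriptors and tgt in descriptors:
            examine.append((src, tgt))   # handed to the domain experts
    return examine, wrong

reference = {("t1", "s1")}
descriptors = {"t1", "t2", "s1", "s2"}
found = {("t1", "s2"), ("t2", "s2")}
examine, wrong = select_candidates(found, reference, descriptors)
print(sorted(examine), sorted(wrong))  # → [('t2', 's2')] [('t1', 's2')]
```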

The matchers detected between 38 and 251 correspondences that were not in the reference alignment before. These especially include terms with a strong syntactic similarity or equivalence. However, some matching systems even detected difficult correspondences, e.g., between the German label for "automated production" ("Automatische Produktion") and "CAM", identified via their associated non-preferred labels. Furthermore, correspondences between geographical terms have been detected, but some of the matchers were not able to distinguish between the terms for the citizens of a country, their language, and the country itself, although these differences can be derived from the structure of the thesauri.

However, the manual evaluation exposed several issues, which can be explained either by the typical behavior of matching systems or by domain-specific differences between the thesauri. There are similar terms in TheSoz and STW that are used in totally different contexts, e.g. the term "self-assessment". Even when considering the structure of both thesauri, these differences are difficult to identify. In general, term similarities often led to wrong correspondences, which is not surprising at first. Yet in some cases, syntactically equal terms were not detected at the same time. So far, we have not had the opportunity to evaluate the matching systems against the improved reference alignment, but we plan to perform this additional evaluation soon.

Conclusion

This is the first time this track has taken place, so we cannot compare the results with previous ones. As it is also the first time the participating matching systems have worked with this data, they do not have any experience with it. This has to be kept in mind if the results are compared to those of other tracks.

Nevertheless, the newly detected correspondences already constitute a useful result for the maintainers of the two thesauri. The correct correspondences can be added to the existing reference alignment, which is already applied in information portals to support search term recommendation and query expansion services across differently indexed databases. As all matching systems delivered exact matches, some of the wrong correspondences will be examined again in the future to check whether other relationship types, such as broader, narrower or related matches, apply to them.

We expect further improvements, if the matchers are tailored more specifically to the library track, i.e., if they exploit the information found in the original SKOS version. A promising approach is also the use of additional knowledge, e.g., instance data -- resources that are indexed with different thesauri.

This time, we collected the results of the matchers as a first survey and compared them to our simple string-matching strategy that takes advantage of the different types of labels. In future evaluations, we assume that better results can be achieved and that these strategies simply form a baseline.

Acknowledgements

We would like to thank Andreas Oskar Kempf from GESIS for the manual evaluation of the newly detected correspondences.

Contact

Dominique Ritze (Mannheim University Library) dominique[.]ritze[at]bib[.]uni-mannheim[.]de
Kai Eckert (Mannheim University Library)
Benjamin Zapilko (GESIS)
Joachim Neubert (ZBW)

Original page: http://web.informatik.uni-mannheim.de/oaei-library/results/2012/ [cached: 24/06/2014]