Vojtech Svatek - abstracts of selected papers

Svab-Zamazal O., Svatek V.: Analysing Ontological Structures through Name Pattern Tracking. In: EKAW 2008 - 16th International Conference on Knowledge Engineering and Knowledge Management. Springer LNCS, to appear. Full paper. Concept naming over the taxonomic structure is a useful indicator of the quality of design as well as a source of information exploitable for various tasks such as ontology refactoring and mapping. We analysed collections of OWL ontologies with the aim of determining the frequency of several combined name-and-graph patterns potentially indicating underlying semantic structures. Such structures range from simple set-theoretic subsumption to more complex constructions such as parallel taxonomies of different entity types. The final goal is to help refactor legacy ontologies as well as to ease automatic alignment among different models. The results show that in most ontologies there is a significant number of occurrences of such patterns. Moreover, their detection, even using very simple methods, has precision sufficient for a semi-automated analysis scenario.

Vacura M., Svatek V., Smrz P.: A Pattern-based Framework for Representation of Uncertainty in Ontologies. In: 11th International Conference on Text, Speech and Dialogue. Brno, Czech Republic, September 8-12, 2008. Springer LNCS, to appear. Full paper. We present a novel approach to representing uncertain information in ontologies based on design patterns. We provide a brief description of our approach, present its use in the case of fuzzy information and probabilistic information, and describe the possibility of modelling multiple types of uncertainty in a single ontology. We also briefly present an appropriate fuzzy reasoning tool and define a complex ontology architecture for well-founded handling of uncertain information.

Vacura M., Svatek V., Saathoff C., Franz T., Troncy R.: Describing Low-Level Image Features Using The COMM Ontology. In: First ICIP Workshop on Multimedia Information Retrieval 2008, October 12, 2008, San Diego, California, U.S.A. Full paper. We present an innovative approach for storing and processing extracted low-level image features based on current Semantic Web technologies. We propose to use the COMM multimedia ontology as a “semantic” alternative to the MPEG-7 standard, which is at the same time largely compliant with it. We describe how COMM can be used directly or through its associated Java API.

Karkaletsis V., Stamatakis K., Karampiperis P., Labsky M., Ruzicka M., Svatek V., Cabrera E. A., Polla M., Mayer M. A., Villaroel Gonzales D.: Management of Medical Website Quality Labels via Web Mining. In: Berka P., Rauch J., Zighed D.A. (eds.): Data Mining and Medical Knowledge Management: Cases and Applications. IGI Global Inc., to appear 2008. Full paper. The WWW is an important channel of information exchange in many domains, including the medical one. The ever increasing amount of freely available healthcare-related information generates, on the one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why medical website certification (also called ‘quality labeling’) by renowned authorities is of high importance. In this respect, it recently became obvious that the labeling process could benefit from employment of web mining and information extraction techniques, in combination with flexible methods of web-based information management developed within the semantic web initiative. Achieving such synergy is the central issue in the MedIEQ project. The AQUA (Assisting QUality Assessment) system, developed within the MedIEQ project, aims to provide the infrastructure and the means to organize and support various aspects of the daily work of labeling experts.

Karkaletsis V., Karampiperis P., Stamatakis K., Labsky M., Ruzicka M., Svatek V., Mayer M.A., Leis A., Villarroel D.: Automating Accreditation of Medical Web Content. In: 5th Prestigious Applications of Intelligent Systems Conference (PAIS 2008), Greece. Incl. in Proc. ECAI'08, IOS Press, 2008. Full paper. The increasing amount of freely available health-related web content generates, on the one hand, excellent conditions for self-education of patients as well as physicians, but on the other hand entails substantial risks if such information is trusted irrespective of low competence or even bad intentions of its authors. This is why accreditation of medical web resources by renowned authorities is of high importance. However, various health web content surveys show that the proportion of accredited web resources is insufficient, because labeling authorities have difficulty coping with the amount and dynamics of the medical web. In this paper, we address the problem of automating the accreditation of medical web content. To this end, we present a system which provides the infrastructure and the means to organize and support various aspects of the daily work of labeling experts, exploiting web content collection and information extraction techniques.

Praks P., Svatek V., Cernohorsky J.: Linear algebra for vision-based surveillance in heavy industry - convergence behavior case study. In: IEEE CBMI 2008 - Sixth International Workshop on Content-Based Multimedia Indexing. 18-20th June, 2008, Queen Mary, University of London, London, UK. pp. 346-352. Full paper. The surveillance application aims at improving the quality of technology via modelling human expert behaviour in the coking plant ArcelorMittal Ostrava, the Czech Republic. Video data on several industrial processes are captured by means of a CCD camera and classified using Latent Semantic Indexing (LSI) with respect to etalons (reference samples) classified by an expert. We also study the convergence behavior of the proposed partial eigenproblem-based dimension reduction technique and its suitability for knowledge acquisition. In our cases, increasing the computational effort of the dimension reduction technique did not improve the quality of the retrieved results.

Nemrava J., Buitelaar P., Svatek V., Declerck T.: Text Mining Support for Semantic Indexing and Analysis of A/V Streams. In: Proc. of OntoImage, Workshop at LREC 2008, Marrakech, Morocco, May 2008. Full paper. The work described here concerns the use of complementary resources in sports video analysis; soccer in our case. Structured web data such as match tables with teams, player names, score goals, substitutions, etc. and multiple, unstructured, textual web data sources (minute-by-minute match reports) are processed with an ontology-based information extraction tool to extract and annotate events and entities according to the SmartWeb soccer ontology. Through the temporal alignment of the primary A/V data (soccer videos) with the textual and structured complementary resources, these extracted and semantically organized events can be used as indicators for video segment extraction and semantic classification, i.e. occurrences of particular events in the complementary resources can be used to classify the corresponding video segment, enabling semantic indexing and retrieval of soccer videos.

Kliegr T., Chandramouli K., Nemrava J., Svatek V., Izquierdo E.: Combining Image Captions and Visual Analysis for Image Concept Classification. In: The 9th Intl. Workshop on Multimedia Data Mining, held with ACM SIGKDD'08, Las Vegas, 2008. Full paper. We present a framework for efficiently exploiting free-text annotations as a complementary resource to image classification. A novel approach called Semantic Concept Mapping (SCM) is used to classify entities occurring in the text to a custom-defined set of concepts. SCM performs unsupervised classification by exploiting the relations between common entities codified in the Wordnet thesaurus. SCM exploits Targeted Hypernym Discovery (THD) to map unknown entities extracted from the text to concepts in Wordnet. We show how the result of SCM/THD can be fused with the outcome of Knowledge Assisted Image Analysis (KAA), a classification algorithm that extracts and labels multiple segments from an image. In the experimental evaluation, THD achieved an accuracy of 75%, and SCM an accuracy of 52%. In one of the first experiments with fusing the results of a free-text and image-content classifier, SCM/THD + KAA achieved a relative improvement of 49% and 31% over the text-only and image-content-only baselines.

Chandramouli K., Kliegr T., Nemrava J., Svatek V., Izquierdo E.: Query Refinement and User Relevance Feedback for Contextualized Image Retrieval. In: The 5th IET Visual Information Engineering 2008 Conference (VIE'08), China, 2008. Full paper. The motivation of this paper is to increase the user-perceived precision of the results of Content Based Information Retrieval (CBIR) systems with Query Refinement (QR), Visual Analysis (VA) and Relevance Feedback (RF) algorithms. The proposed algorithms were implemented as modules in the K-Space CBIR system. The QR module discovers hypernyms for the given query from a free text corpus (Wikipedia) and uses these hypernyms as refinements for the original query. Extracting hypernyms from Wikipedia makes it possible to apply query refinement to more queries than in related approaches that use a static predefined thesaurus such as Wordnet. The VA module uses the K-Means algorithm for clustering the images based on low-level features. The RF module uses the preference information expressed by the user to build user profiles by applying SOM-based supervised classification, which is further optimized by a hybrid Particle Swarm Optimization (PSO) algorithm. The experiments evaluating the performance of the QR and VA modules show promising results.

Nekvasil M., Svatek V., Labsky M.: Transforming Existing Knowledge Models to Information Extraction Ontologies. In: 11th International Conference on Business Information Systems (BIS'08), Innsbruck, May 5-7, 2008. Springer LNBIP 7. Draft paper (final version available via SpringerLink). Diverse types of structured domain models are nowadays in use in various contexts. On the one hand there are generic models, especially domain ontologies, which are typically used in applications with an artificial intelligence (reasoning) flavor; on the other hand there are more specific models that only come to use in areas like software engineering or business analysis. Furthermore, the discipline of information extraction has invented very specific knowledge models called extraction ontologies, whose purpose is to help extract and semantically annotate textual data. In this paper we present a method of authoring extraction ontologies (more specifically, their abstract constituents called presentation ontologies) via reusing different types of other knowledge models, especially domain ontologies and UML models. Our priority is to maintain consistency between extracted data and those prior models.

Labsky M., Svatek V.: Combining Multiple Sources of Evidence in Web Information Extraction. In: 17th International Symposium on Methodologies for Intelligent Systems (ISMIS'08), Toronto, May 20-23, 2008. Springer LNCS 4994. Draft paper (final version available via SpringerLink). Extraction of meaningful content from collections of web pages with unknown structure is a challenging task, which can only be successfully accomplished by exploiting multiple heterogeneous resources. In the Ex information extraction tool, so-called extraction ontologies are used by human designers to specify the domain semantics, to manually provide extraction evidence, as well as to define extraction subtasks to be carried out via trainable classifiers. Elements of an extraction ontology can be endowed with probability estimates, which are used for selection and ranking of attribute and instance candidates to be extracted. At the same time, HTML formatting regularities are locally exploited.

Rak D., Svatek V., Fidalgo M., Alm O.: Detecting MeSH Keywords and Topics in the Context of Website Quality Assessment. In: The 1st International Workshop on Describing Medical Web Resources (DRMed 2008), held in conjunction with the 21st International Congress of the European Federation for Medical Informatics (MIE 2008), May 27, 2008, Goteborg, Sweden. Full paper. Automatic detection of keywords and general topics is a special-purpose auxiliary task in the website quality assessment process. We describe the approach to obtaining such information used in the MedIEQ project, discuss problems related to the type of human language used in medical websites, and illustrate them with examples.

Svatek V., Svab O.: Towards Retrieving Scholarly Literature via Ontological Relationships. In: Znalosti 2008, Bratislava, Slovakia, February 2008. Full paper. We analyse the problem of retrieving scientific literature related to a problem with complex description, and outline the skeleton of a solution. The proposed mixture of methods and approaches covers manual as well as automatic methods, with emphasis on community tagging, automated ontology learning from text and ontology mapping. Symbiosis of RDF/OWL and Topic Maps as underlying formalisms is foreseen. As a very simple proof of concept, relational annotation of five research papers has been carried out independently by two annotators, and the results were analysed.

Nemrava J., Buitelaar P., Simou N., Sadlier D., Svatek V., Declerck T., Cobet A., Sikora T., O’Connor N., Tzouvaras V., Zeiner H., Petrak J.: An Architecture for Mining Resources Complementary to Audio-Visual Streams. In: Workshop on Knowledge Acquisition from Multimedia Content (KAMC) at SAMT 2007, Genova. Full paper. In this paper we attempt to characterize resources of information complementary to audio-visual (A/V) streams and propose their usage for enriching A/V data with semantic concepts in order to bridge the gap between low-level video detectors and high-level analysis. Our aim is to extract cross-media feature descriptors from semantically enriched and aligned resources so as to detect finer-grained events in video. We introduce an architecture for complementary resource analysis and discuss domain dependency aspects of this approach related to our domain of soccer broadcasts.

Svatek V., Svab O.: Tracking Name Patterns in OWL Ontologies. In: International Workshop on Evaluation of Ontologies (EON-07) collocated with the 6th International Semantic Web Conference (ISWC-2007), Busan, Korea. Full paper. Analysis of concept naming in OWL ontologies with set-theoretic semantics could serve as a partial means for understanding their conceptual structure, detecting modelling errors and assessing their quality. We carried out experiments on three existing ontologies from public repositories, concerning the consistency of very simple name patterns (the subclass name being a certain kind of extension of the parent class name), while also considering thesaurus relationships. Several probable taxonomic errors were identified in this way.
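
The name pattern mentioned in the abstract above (a subclass name extending its parent class name) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the sample class names are hypothetical, the only pattern checked is "child name ends with the parent name", and the thesaurus relationships considered in the paper are left out. A pair that fails the check is only a candidate for inspection, not necessarily an error.

```python
import re

def tokens(name):
    """Split a CamelCase or underscore-separated class name into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def is_name_extension(child, parent):
    """True if the child's tokens end with the parent's tokens and add at least one more."""
    c, p = tokens(child), tokens(parent)
    return len(c) > len(p) and c[-len(p):] == p

def pairs_to_inspect(subclass_pairs):
    """Return (child, parent) pairs whose names do not follow the extension pattern."""
    return [(c, p) for c, p in subclass_pairs if not is_name_extension(c, p)]

# Hypothetical subclass axioms, written as (child, parent) name pairs.
pairs = [("ConferencePaper", "Paper"), ("Invited_Talk", "Talk"), ("Author", "Paper")]
print(pairs_to_inspect(pairs))
```

Matching on the trailing tokens reflects the common habit of naming a subclass by putting a modifier in front of the parent's name; other kinds of extension would need their own checks.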

Svab O., Svatek V.: In Vitro Study of Mapping Method Interactions in a Name Pattern Landscape. In: International Workshop on Ontology Matching (OM-07) collocated with the 6th International Semantic Web Conference (ISWC-2007), Busan, Korea. Full paper. Ontology mapping tools typically employ combinations of methods, the mutual effects of which deserve study. We propose an approach to the analysis of such combinations using synthetic ontologies. Initial experiments have been carried out for two string-based methods and one graph-based method. The most important target of the study was the impact of name patterns over taxonomy paths on the mapping results.

Stamatakis K., Metsis V., Karkaletsis V., Ruzicka M., Svatek V., Amigo Cabrera E., Polla E., Spyropoulos C.: Content Collection for the Labelling of Health-related Web Content. In: 11th Conference on Artificial Intelligence in Medicine (AIME 07), 7-11 July 2007, Amsterdam, The Netherlands. Draft paper (final version available via SpringerLink). As the number of health-related web sites in various languages increases, it is more than necessary to implement control mechanisms that give users an adequate guarantee that the web resources they are visiting meet a minimum level of quality standards. Based upon state-of-the-art technology in the areas of semantic web, content analysis and quality labeling, the AQUA system, designed for the EC-funded project MedIEQ, aims to support the automation of the labeling process in health-related web content. AQUA provides tools that crawl the web to locate unlabelled health web resources in different European languages, as well as tools that traverse websites, identify and extract information and, upon this information, propose labels or monitor already labeled resources. Two major steps in this automated labeling process are web content collection and information extraction. This paper focuses on content collection. We describe existing approaches, present the architecture of the content collection toolkit and how it is integrated within the AQUA system, and discuss our initial experimental results in the English language (six more languages will be covered by the end of the project).

Labsky M., Nekvasil M., Svatek V.: Towards Web Information Extraction using Extraction Ontologies and (Indirectly) Domain Ontologies. In: (Poster Paper) Int'l Conf. on Knowledge Capture (K-CAP'07), Whistler, BC, Canada, October 2007, ACM. Full paper. Extraction ontologies make it possible to proceed swiftly from initial domain modelling to running a functional prototype of a web information extraction application. We investigate the possibility of semi-automatically deriving extraction ontologies from third-party domain ontologies.

Labsky M., Svatek V., Nekvasil M., Rak D.: The Ex Project: Web Information Extraction using Extraction Ontologies. In: Proc. PriCKL'07, ECML/PKDD Workshop on Prior Conceptual Knowledge in Machine Learning and Knowledge Discovery. Warsaw, Poland, October 2007. Also as post-proceedings to appear at Univ. Bari. Full paper; and its extended version for post-proceedings. Extraction ontologies represent a novel paradigm in web information extraction (as one of the ‘deductive’ species of web mining), making it possible to proceed swiftly from initial domain modelling to running a functional prototype, without the necessity of collecting and labelling large amounts of training examples. Bottlenecks in this approach are, however, the tedium of developing an extraction ontology adequately covering the semantic scope of the web data to be processed and the difficulty of combining the ontology-based approach with inductive or wrapper-based approaches. We report on an ongoing project aiming at developing a web information extraction tool based on richly-structured extraction ontologies, with the additional possibility of (1) semi-automatically constructing these from third-party domain ontologies, (2) absorbing the results of inductive learning for subtasks where pre-labelled data abound, and (3) actively exploiting formatting regularities in the wrapper style.

Nemrava J., Buitelaar P., Svatek V., Declerck T.: Event Alignment for Cross-Media Feature Extraction in the Football Domain. In: Int'l Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS'07), Santorini, Greece, June 6-8, 2007. Full paper. This paper describes an experiment in creating cross-media descriptors from football-related text and videos. We used video analysis results and combined them with several textual resources – both semi-structured (tabular match reports) and unstructured (textual minute-by-minute match reports). Our aim was to discover the relations among six video data detectors and their behavior during a time window that corresponds to an event described in the textual data. Based on this experiment we show how football events extracted from text may be mapped to and help in analysing corresponding scenes in video.

Svatek V., Vacura M., Labsky M., ten Teije A.: Modelling Web Service Composition for Deductive Web Mining. Computing and Informatics, Vol. 26, 2007, 255-279. Full paper. Composition of simpler web services into custom applications is understood as a promising technique for information requests in a heterogeneous and changing environment. This is also relevant for applications characterised as deductive web mining (DWM). We suggest using problem-solving methods (PSMs) as templates for composed services. We developed a multi-dimensional, ontology-based framework and a collection of PSMs, which make it possible to characterise DWM applications at an abstract level; we describe several existing applications in this framework. We show that the heterogeneity and unboundedness of the web demand some modifications of the PSM paradigm used in the context of traditional artificial intelligence. Finally, as a simple proof of concept, we simulate automated DWM service composition on a small collection of services, PSM-based templates, data objects and ontological knowledge, all implemented in Prolog.

Svab O., Svatek V., Stuckenschmidt H.: A Study in Empirical and ‘Casuistic’ Analysis of Ontology Mapping Results. In: 4th European Semantic Web Conference (ESWC-2007), Innsbruck 2007. Springer LNCS 4519. Draft paper (final version available via SpringerLink). Many ontology mapping systems nowadays exist. In order to evaluate their strengths and weaknesses, benchmark datasets (ontology collections) have been created, several of which have been used in the most recent edition of the Ontology Alignment Evaluation Initiative (OAEI). While most OAEI tracks rely on straightforward comparison of the results achieved by the mapping systems with some kind of reference mapping created a priori, the ‘conference’ track (based on the OntoFarm collection of heterogeneous ‘conference organisation’ ontologies) instead encompassed multiway manual as well as automated analysis of the mapping results themselves, with ‘correct’ and ‘incorrect’ cases determined a posteriori. The manual analysis consisted in simple labelling of discovered mappings plus discussion of selected cases (‘casuistics’) within a face-to-face consensus building workshop. The automated analysis relied on two different tools: the Drago system for testing the consistency of aligned ontologies and the LISp-Miner system for discovering frequent associations in mapping meta-data including the phenomenon of graph-based mapping patterns. The results potentially provide specific feedback to the developers and users of mining tools, and generally indicate that automated mapping can rarely be successful without considering the larger context and possibly deeper semantics of the entities involved.

Svab O., Svatek V.: Ontology Mapping Enhanced using Bayesian Networks. In: Znalosti 2007, Ostrava, TU Ostrava 2007. Full paper. Bayesian networks (BNs) can capture interdependencies among ontology mapping methods and thus possibly improve the way they are combined. We outline the basic idea behind our approach and show some experiments on ontologies from the OAEI ‘conference organisation’ collection. The possibility of modelling explicit mapping patterns in combination with methods is also discussed.

(This is a long version of the OM06 paper...)

Svab O., Svatek V.: Combining Ontology Mapping Methods Using Bayesian Networks. In: International Workshop on Ontology Matching collocated with the 5th International Semantic Web Conference (ISWC-2006), November 5, 2006: GA Center, Athens, Georgia, USA. Full paper. Bayesian networks (BNs) can capture interdependencies among ontology mapping methods and thus possibly improve the way they are combined. Experiments on ontologies from the OAEI collection are shown, and the possibility of modelling explicit mapping patterns in combination with methods is discussed.

(This is a short pre-version of the Znal07 paper...)

Labsky M., Svatek V.: On the Design and Exploitation of Presentation Ontologies for Information Extraction. In: ESWC'06 Workshop on Mastering the Gap: From Information Extraction to Semantic Representation, Budva, Montenegro, June, 2006. Full paper. The structure of ontologies that are considered as input to information extraction is mostly rather simple. In this paper we report on our ongoing effort of using rich ontologies with numerous constraints over the information to be extracted. Important aspects of the approach are the coupling of user-defined ontologies with other sources of knowledge such as training data and document formatting structures, and the distinction between proper domain ontologies and so-called presentation ontologies, where the latter (as ‘pragmatic bridges’ over the ‘semantic gap’) can partially be derived from the former. The extraction tool under construction builds on experience from an ongoing application in the domain of product catalogue analysis, and attempts to offer high flexibility with respect to the availability of various input information sources.

Svatek V., Rauch J., Ralbovsky M.: Ontology-Enhanced Association Mining. In: Ackermann M. et al., eds., Semantics, Web, and Mining, Springer Verlag, LNCS 4289, 2006. Draft paper (final version available via SpringerLink). The roles of ontologies in KDD are potentially manifold. We track them through different phases of the KDD process, from data understanding through task setting to mining result interpretation and sharing over the semantic web. The underlying KDD paradigm is association mining tailored to our 4ft-Miner tool. Experience from two different application domains (medicine and sociology) is presented throughout the paper. Envisaged software support for prior knowledge exploitation via customisation of an existing user-oriented KDD tool is also discussed.
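
As background for the association mining mentioned above, the following minimal sketch shows the four-fold contingency table on which an association rule "antecedent => succedent" is typically evaluated, together with the rule's confidence. It is an illustration only and does not reproduce the 4ft-Miner interface: the function names, the attribute names and the five toy records are all invented for this example.

```python
def four_fold_table(rows, antecedent, succedent):
    """Count records by whether the antecedent and the succedent hold.

    Returns (a, b, c, d):
      a = antecedent and succedent, b = antecedent only,
      c = succedent only,           d = neither.
    """
    a = b = c = d = 0
    for row in rows:
        ant, suc = antecedent(row), succedent(row)
        if ant and suc:
            a += 1
        elif ant:
            b += 1
        elif suc:
            c += 1
        else:
            d += 1
    return a, b, c, d

def confidence(a, b):
    """Share of records satisfying the antecedent that also satisfy the succedent."""
    return a / (a + b) if a + b else 0.0

# Invented toy records; the attribute names are hypothetical.
records = [
    {"smoker": True, "hypertension": True},
    {"smoker": True, "hypertension": False},
    {"smoker": False, "hypertension": True},
    {"smoker": False, "hypertension": False},
    {"smoker": True, "hypertension": True},
]
a, b, c, d = four_fold_table(
    records, lambda r: r["smoker"], lambda r: r["hypertension"]
)
print((a, b, c, d), confidence(a, b))
```

Passing the antecedent and succedent as predicates keeps the counting independent of how the conditions are built, which is where ontology-derived groupings of attribute values could be plugged in.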

Labsky M., Svatek V., Svab O., Praks P., Kratky M., Snasel V.: Information Extraction from HTML Product Catalogues: from Source Code and Images to RDF. In: 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05), Compiegne 2005. Full paper. We describe an application of information extraction from company websites focusing on product offers. A statistical approach to text analysis is used in conjunction with different ways of image classification. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.

Svatek V., Vacura M.: Automatic Composition of Web Analysis Tools: Simulation on Classification Templates. In: First International Workshop on Representation and Analysis of Web Space (RAWS-05). Full paper. Template-based composition of web services is considered a useful middle way between their manual 'programming in the large' and fully automatic 'AI-planning-style' composition. This is also relevant for applications analysing the content and structure of the web space. As a simple proof of concept, we simulate this approach on a collection of services, templates, data objects and ontological knowledge, all implemented in Prolog. The underlying task is multi-way recognition of sites containing pornography, understood as an instance of the classification task.

Kratky M., Andrt M., Svatek V.: XML Query Support for Web Information Extraction: A Study on HTML Element Depth Distribution. In: First International Workshop on Representation and Analysis of Web Space (RAWS-05). Full paper. Knowledge-based web information extraction methods can achieve very high precision in restricted domains; they are, however, slow and suffer from performance degradation beyond their specific domain. We thus plan to adapt an existing XML storage and query engine to act as an efficient pre-processor for such methods. The critical point of the approach is the amount of information provided as the XML environment of the start-up terms/elements. For this purpose, we carried out a statistical analysis of depth distribution in the WebTREC collection.

Kavalec M., Svatek V.: A Study on Automated Relation Labelling in Ontology Learning. In: P. Buitelaar, P. Cimiano, B. Magnini (eds.), Ontology Learning and Population, IOS Press, 2005. Full paper. Ontology learning from texts has been proposed as a technology helping ontology designers in the modelling process. Within ontology learning, the discovery of non-taxonomic relations is understood as the least addressed problem. We propose a technique for the extraction of lexical items that may give cues for assigning semantic labels to otherwise ‘anonymous’ non-taxonomic relations. The technique has been implemented as an extension to the existing Text-to-Onto tool. Experiments have been carried out on a collection of texts describing tour destinations as well as on a semantically annotated general corpus. The paper also discusses evaluation aspects of relation labelling, among which the distinction between prior and posterior precision appears to be the most important.

Svatek V., ten Teije A., Vacura M.: Web Service Composition for Deductive Web Mining: A Knowledge Modelling Approach. In: Znalosti 2005, High Tatras 2005. Full paper. Composition of simpler web services into custom applications is understood as a promising technique for information requests in a heterogeneous and changing environment. This is also relevant for applications analysing the content and structure of the web. We discuss the ways in which the problem-solving-method approach studied in artificial intelligence can be adopted for template-based service composition in this problem domain; the main focus is on the classification task.

Labsky M., Praks P., Svatek V., Svab O.: Multimedia information extraction from HTML product catalogues. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'05), Ostrava 2005. Full paper. We describe a demo application of information extraction from company websites, focusing on bicycle product offers. A statistical approach (Hidden Markov Models) is used in combination with different ways of image classification, including latent semantic analysis of image collections. Ontological knowledge is used to group the extracted items into structured objects. The results are stored in an RDF repository and made available for structured search.

Nemrava J., Svatek V.: Text mining tool for ontology engineering based on use of product taxonomy and web directory. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'05), Ostrava 2005. Full paper. This paper presents our attempt to build a text mining tool for collecting specific words – verbs in our case – that usually occur together with a particular product category, as support for ontology designers. As ontologies are a cornerstone for the success of the semantic web, our effort is focused on building small and specialized ontologies, each concerning one product category and describing its frequent relations in common text. We describe the way we use web directories to obtain suitable information about the products from the UNSPSC taxonomy, and we propose a method by which the extracted information could be further processed.

Svatek V., Labsky M., Vacura M.: Knowledge Modelling for Deductive Web Mining. In: 14th International Conference on Knowledge Engineering and Knowledge Management (EKAW 2004), Whittlebury Hall, Northamptonshire, UK. Draft paper (final version will be available via SpringerLink). Knowledge-intensive methods that can altogether be characterised as deductive web mining (DWM) already act as supporting technology for building the semantic web. Reusable knowledge-level descriptions may further ease the deployment of DWM tools. We developed a multi-dimensional, ontology-based framework and a collection of problem-solving methods, which make it possible to characterise DWM applications at an abstract level. We show that the heterogeneity and unboundedness of the web demand some modifications of the problem-solving method paradigm used in the context of traditional artificial intelligence.

Kavalec M., Svatek V.: Relation Labelling in Ontology Learning: Experiments with Semantically Tagged Corpus. In: EKAW 2004 Workshop on the Application of Language and Semantic Technologies to support Knowledge Management Processes, Whittlebury Hall, Northamptonshire, UK. Full paper. Ontology learning from text can be viewed as an auxiliary technology for knowledge management application design. We proposed a technique for extraction of lexical entries that may give cues for assigning semantic labels to otherwise 'anonymous' non-taxonomic relations. In this paper we present experiments on the semantically annotated corpus SemCor, and compare them with previous experiments on plain texts.

Svatek V., Snasel V.: Formal Model of Meta-Information Acquisition from Information Resources. In: Workshop on Information Technology - Applications and Theory (ITAT2004), High Tatras 2004. Full paper. We propose an outline of a formal model describing the acquisition of ‘meta-information’ on information resources; the model should make it possible to compare the quality of different analysis procedures. It is illustrated on an example from website analysis.

Svab O., Labsky M., Svatek V.: RDF-Based Retrieval of Information Extracted from Web Product Catalogues. In: SIGIR'04 Semantic Web Workshop, Sheffield. Full paper. Extraction of relevant data from the raw source of HTML pages poses specific requirements on their subsequent RDF storage and retrieval. We describe an application of a statistical information extraction technique (Hidden Markov Models) to product catalogues, followed by conversion of the extracted data to RDF format and their structured retrieval. The domain-specific query interface, built on top of the Sesame repository, offers a simple form of navigational retrieval. Integration of further web-analysis methods, within the Rainbow architecture, is forthcoming.

Cespivova H., Rauch J., Svatek V., Kejkula M., Tomeckova M.: Roles of Medical Ontology in Association Mining CRISP-DM Cycle. In: ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies, Pisa. Full paper. We experimented with introduction of medical ontology and other background knowledge into the process of association mining. The inventory used consisted of the LISp-Miner tool, the UMLS ontology, the STULONG dataset on cardiovascular risk, and a set of simple qualitative rules. The experiment suggested that an ontology may bring benefits to all phases of the KDD cycle as described in CRISP-DM.

Labsky M., Svatek V., Svab O.: Types and Roles of Ontologies in Web Information Extraction. In: ECML/PKDD04 Workshop on Knowledge Discovery and Ontologies, Pisa. Full paper. We discuss the diverse types and roles of ontologies in web information extraction and illustrate them on a small study from the product offer domain. Attention is mainly paid to the impact of domain ontologies, presentation ontologies and terminological taxonomies.

Svab O., Svatek V., Kavalec M., Labsky M.: Querying the RDF: Small Case Study in the Bicycle Sale Domain. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'04), Ostrava 2004. Full paper. We examine the suitability of RDF, RDF Schema (as simple ontology language), and RDF repository Sesame, for providing the back-end to a prospective domain-specific web search tool, targeted at the offer of bicycles and their components. Actual data for the RDF repository are to be extracted by analysis modules of a distributed knowledge-based system named Rainbow. Attention is paid to the comparison of different query languages and to the design of application-specific templates.

Svatek V.: Design Patterns for Semantic Web Ontologies: Motivation and Discussion. In: 7th Conference on Business Information Systems, Poznaň 2004. Full paper. The relatively high level of standardisation of semantic web ontology languages is in contrast to the mostly ad hoc designed content of the ontologies themselves. An overview of existing methods supporting ontology content creation is presented. Methods based on design patterns are then discussed in more detail as they seem most promising, particularly for the business environment. Examples of elementary problems typical for semantic web ontologies are shown, and their pattern-based solution is outlined.

Ruzicka M., Svatek V.: Mark-up based analysis of narrative guidelines with the Stepper tool. In: Symposium on Computerized Guidelines and Protocols (CGP-04), Praha 2004. IOS Press. Full paper. The Stepper tool was developed to assist a knowledge engineer in developing a computable version of narrative guidelines. The system is document-centric: it formalises the initial text in multiple user-definable steps corresponding to interactive XML transformations. In this paper, we report on experience obtained by applying the tool on a narrative guideline document addressing unstable angina pectoris. Possible role of the tool and associated methodology in developing a guideline-based application is also discussed.

Svatek V., Riha A., Peleska J., Rauch J.: Analysis of guideline compliance – a data mining approach. In: Symposium on Computerized Guidelines and Protocols (CGP-04), Praha 2004. IOS Press. Full paper. While guideline-based decision support is safety-critical and typically requires human interaction, offline analysis of guideline compliance can be performed to a large extent automatically. We examine the possibility of automatic detection of potential non-compliance followed up with (statistical) association mining. Only frequent associations of non-compliance patterns with various patient data are submitted to a medical expert for interpretation. The initial experiment was carried out in the domain of hypertension management.

Kavalec M., Maedche A., Svatek V.: Discovery of Lexical Entries for Non-Taxonomic Relations in Ontology Learning. In: SOFSEM – Theory and Practice of Computer Science, Springer LNCS 2932, 2004. Full paper. Ontology learning from texts has recently been proposed as a new technology helping ontology designers in the modelling process. Discovery of non-taxonomic relations is understood as the least tackled problem therein. We propose a technique for extraction of lexical entries that may give cues for assigning semantic labels to otherwise 'anonymous' relations. The technique has been implemented as an extension to the existing Text-to-Onto tool, and tested on a collection of texts describing worldwide geographic locations from a tour-planning viewpoint.

Svatek V., Ruzicka M.: Step-by-step formalisation of medical guideline content. International Journal of Medical Informatics, 2003, 70, 2–3, 329–335. Full paper. Approaches to formalisation of medical guidelines can be divided into model-centric and document-centric. While model-centric approaches dominate in the development of clinical decision support applications, document-centric, mark-up-based formalisation is suitable for application tasks requiring the 'literal' content of the document to be transferred into the formal model. Examples of such tasks are logical verification of the document or compliance analysis of health records. The quality and efficiency of document-centric formalisation can be improved by decomposing the whole process into several explicit steps. We present a methodology and software tool supporting the step-by-step formalisation process. The knowledge elements can be marked up in the source text, refined to a tree structure with increasing level of detail, rearranged into an XML knowledge base, and, finally, exported into the operational representation. User-definable transformation rules make it possible to automate a large part of the process. The approach is being tested in the domain of cardiology. For parts of the WHO/ISH Guidelines for Hypertension, the process has been carried out through all the stages, up to an executable application generated automatically from the XML knowledge base.

Svatek V., Berka P., Kavalec M., Kosek J., Vavra V.: Discovering company descriptions on the web by multiway analysis. In: New Trends in Intelligent Information Processing and Web Mining (IIPWM'03), Zakopane 2003. Springer-Verlag, 'Advances in Soft Computing' series, 2003. Full paper. We investigate the possibility of web information discovery and extraction by means of a modular architecture in which independent knowledge-based modules separately analyse the multiple forms of information presentation, such as free text, structured text, URLs and hyperlinks. First experiments in discovering a relatively easy target, general company descriptions, suggest that web information can be efficiently retrieved in this way. Thanks to the separation of data types, individual knowledge bases can be much simpler than those used in information extraction over unified representations.

Labsky M., Svatek V.: Ontology Merging in Context of Web Analysis. In: Workshop on Databases, Texts, Specifications and Objects (DATESO'03), Ostrava 2003. Full paper (zipped). The Rainbow system aims at the analysis of websites by means of distributed modules specialized in particular types of data, such as free text, HTML structures or link topology. In order to ease the integration of services offered by the individual modules, which may come from third parties, a collection of ontologies has been developed. Parts of the ontologies contain information specific to the different ways of analysis, resulting in a need for integration. This paper describes how ontology merging, namely the FCA-Merge method, may be used to integrate the results of multiple analyses for a certain application domain.

Svatek V., Kosek J., Labsky M., Braza J., Kavalec M., Vacura M., Vavra V., Snasel V.: Rainbow - Multiway Semantic Analysis of Websites. In: 2nd International DEXA Workshop on Web Semantics (WebS03), Prague 2003, IEEE Computer Society Press. Full paper. The Rainbow project aims at the development of a reusable, modular architecture for web (particularly, website) analysis. Individual knowledge-based modules separately analyse different types of web data and communicate the results via web-service interface. The output of analysis has the form of classes (of web resources) predefined in an ontology, extracted text, and/or addresses of retrieved web resources. Within the project, several original methods of analysis as well as (analytic) knowledge acquisition have been developed. The current domains of investigation are sites of small organisations offering products or services, and pornography sites. The paper is the first systematic overview of diverse methods developed or envisaged in Rainbow.

Ruzicka M., Svatek V.: An interactive approach to rule-based transformation of XML documents. In: Datakon 2003, the annual database conference, Brno 2003, 277-288. Full paper. Transformation of XML documents is typically understood as non-interactive. In contrast, we formulate the specific task of XML-based transformation of knowledge contained in semi-formal documents, which heavily depends on human understanding of element content and thus requires frequent user intervention. Yet, many aspects of this process are pre-determined, and their automation is highly desirable. We implemented a software tool (called Stepper) supporting interactive step-by-step transformation of 'knowledge blocks'. The transformation is governed by rules expressed in a new 'interactive transformation' language (called XKBT), while its non-interactive aspects are handled by embedded XSLT rules.

Svatek V., Ruzicka M.: Step-by-step Mark-up of Medical Guideline Documents. In: (Surjan G. et al., eds.) Health Data in the Information Society. Proceedings of MIE2002, Budapest 2002. IOS Press, 591-595. Full paper (zipped PostScript). The quality of document-centric formalisation of medical guidelines can be improved by decomposing the whole process into several explicit steps. We present a methodology and a software tool supporting the step-by-step formalisation process. The knowledge elements can be marked up in the text with increasing level of detail, rearranged into an XML knowledge base and exported into the operational representation. Semi-automated transitions can be specified by means of rules. The approach has been tested in a hypertension application.

Svatek V., Kosek J., Braza J., Kavalec M., Klemperer J., Berka P.: Framework and Tools for Multiway Extraction of Web Metadata. In: Information Systems Modelling, Roznov 2002. Full paper. We outline a generic conceptual framework for automated extraction of semantic metadata on the web, and present the results of experiments aiming at the development of an integrated multiway architecture for the metadata extraction task. The architecture will consist of separate modules specialised at different types of data, and cooperating via a SOAP-based message passing protocol.

Kavalec M., Svatek V.: Information Extraction and Ontology Learning Guided by Web Directory. In: ECAI Workshop on NLP and ML for Ontology engineering. Lyon, 2002. Full paper. The paper presents our ongoing effort to create an information extraction tool for collecting general information on products and services from the free text of commercial web pages. A promising approach is that of combining information extraction with ontologies. Ontologies can improve the quality of information extraction and, on the other hand, the extracted information can be used to improve and extend the ontology. We describe the way we use Open Directory as training data, analyse this resource from the ontological point of view, present some preliminary results related to information extraction, and outline our plans for building and deploying the ontology.

Lin V., Rauch J., Svatek V.: Content-based Retrieval of Analytic Reports. In: International Workshop on Rule Markup Languages for Business Rules on the Semantic Web. Sardinia, 2002. Full paper, Slides. Analytic reports are special textual documents containing condensed results from a data mining process. Embedded knowledge enables the interpretation of the reports by automated procedures, which opens the way to content-based retrieval. We elaborate the technique for statistical association rules as specific form of discovered knowledge, demonstrate its formal apparatus on examples from the medical domain, and outline the perspectives of sharing and reusing the content of analytic reports over the Semantic Web.

Riha A., Svatek V., Nemec P., Zvarova J.: Medical guideline as prior knowledge in electronic healthcare record mining. In: 3rd International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, 25-27 September 2002, Bologna, Italy. WIT Press, to appear. Full paper (zipped MS Word). We investigate the possibility of a two-step approach to electronic healthcare record mining, in the context of analysing the compliance of healthcare practice with standards formulated in medical guidelines. Non-compliance patterns detected in the process of guideline-based data pre-processing provide additional attributes for subsequent association rule mining. The approach has been preliminarily tested on databases of hypertensive patients from different Czech hospitals. It should help reveal causes of frequent non-compliance; its sensitivity, however, depends on the quality of guideline formalisation, on the eligibility of patients for the given guideline, and on the coverage of datasets.

Svatek V., Kroupa T., Ruzicka M.: Guide-X - a Step-by-step, Markup-Based Approach to Guideline Formalisation. In: First European Workshop on Computer-based Support for Clinical Guidelines and Protocols, Leipzig 2000. IOS Press, 2001, 97-114. Full paper. The main difficulties of converting the original textual form of medical guidelines to a computer-tractable form are connected both with the ambiguity of the natural language text and with the complexity of the resulting formal (and operational) representation. Proceeding directly from one to the other is thus an extremely demanding task. The proposed Guide-X methodology addresses this problem by breaking the whole process of guideline operationalisation into several steps, each of which requires a different mixture of types (medical, knowledge representation, typographical) and degrees of expertise. The principal technology used is that of XML tagging (using both pre-existing and newly developed languages). The result of each step is connected, element-by-element, to the results of previous steps, thus making the verification and revision of the operationalisation process easier. The methodology is currently being tested in the field of hypertension treatment, within the framework of the Medical Guideline Technology project of the EU Fourth Framework Programme.

Kavalec M., Svatek V., Strossa P.: Web Directories as Training Data for Automated Metadata Extraction. In: Semantic Web Mining, Workshop at ECML/PKDD-2001, Freiburg 2001. Full paper. In this paper, we analyse the possibility of reusing the knowledge embedded in the structure of web directories in order to obtain labelled training data for Web Information Extraction with limited human effort.

Svatek V., Riha A., Zika T., Zvarova J., Jirousek R., Zdrahal Z.: Informal, Formal and Operational Modelling of Medical Guidelines. In: Hruska T., Hashimoto M. (eds.): Knowledge-Based Software Engineering. IOS Press 2000, 9-16. Full paper. Formal representation and automatic processing of medical guidelines is one of the foremost challenges in applied knowledge engineering. We describe a new, flexible model of medical guideline, which seems to be particularly suitable for analysis of guidelines with respect to patient records. To provide for clear transition between the analysis and design phases of development of guideline-processing software, we formulate the model at various levels: informal, formal and operational. For the first one, we use structured text plus diagrams, while for the latter two, the OCML (Operational Concept Modelling Language) seems to be suitable, since it enables both formal checking of concept definitions and execution of operational specifications. The model is currently being tested in the hypertension domain.

Svatek V., Zvarova J., Jirousek R.: A Two-Tiered Model of Medical Guideline. In: Hasman A., Blobel B., Dudeck J., Engelbrecht R., Gell G., Prokosch H.-U. (eds.): Telematics in Health Care - Medical Infobahn for Europe, MIE2000/GMDS2000 [CD-ROM]. Quintessenz Verlag, Berlin 2000. ISSN: 1616-2463. Full paper. One of the most important issues in designing a formal model for representing medical guidelines is the trade-off between modularity and compactness. We briefly analyse this problem, review some existing approaches, and suggest a two-tiered model that relies on a secondary structure superposed on the set of fine-grained knowledge modules. We hypothesise that such model is particularly suitable for tasks related to long-term patient management with respect to high-level, loosely structured guidelines. The model is currently being operationalised and tested in the hypertension treatment domain.

Svatek V., Kavalec M.: Supporting Case Acquisition and Labelling in the Context of Web Mining. In: (Zighed D., Komorowski J., Zytkow J.:) Principles of Data Mining and Knowledge Discovery - PKDD2000. LNAI 1910, Springer Verlag 2000, 626-631. Full paper. Case acquisition and labelling are important bottlenecks for predictive data mining. In the web context, a cascade of supporting techniques can be used, from general ones such as user interfaces, through filtering based on keyword frequency, to web-specific techniques exploiting public search engines. We show how a synergistic application of multiple techniques can be helpful in obtaining and pre-processing textual data, in particular for ILP-based web mining. The (two-fold) learning task itself consists in construction and disambiguation of categorisation rules, which are to process the results returned by web search engines.

Svatek V., Berka P.: URL as starting point for WWW document categorisation. In: (Mariani J., Harman D.:) RIAO'2000 - Content-Based Multimedia Information Access, CID, Paris, 2000, 1693-1702. Full paper. Information about the category (type) of a WWW page can be helpful for the user within search, filtering, as well as navigation tasks. We propose a multidimensional categorisation scheme, with bibliographic dimension as the primary one. We examine the possibilities and limits of performing such categorisation based on information extracted from URL, which is particularly useful for certain on-line applications such as meta-search or navigation support. In addition, we describe the problem of ambiguity of URL terms, and suggest a method for its partial overcoming by means of machine learning. As a side-effect, we show that general purpose WWW search engines can be used for providing input data for both human and computational analysis of the web.

Sramek D., Berka P., Kosek J., Svatek V.: Improving WWW Access - from Single-Purpose Systems to Agent Architectures? In: (Cerri S., Dochev D., eds.) Artificial Intelligence: Methodology, Systems, and Application. Berlin : Springer Verlag, 2000, 167-178. Full paper. Sophisticated techniques from various areas of Artificial Intelligence can be used to improve access to the WWW; the most promising ones stem from Data Mining and Knowledge Modeling. We describe the process of building two experimental systems: the VSEved system for intelligent meta-search, and the VSEtecka system for navigation support. We discuss our experience from this process, which seems to justify the hypothesis that the Multi-Agent paradigm can improve the efficiency of web access tools in the future. In this respect, we outline a web-oriented multi-agent architecture.

Vojtech Svatek, last update Jul 29, 2008