Conceptual Modeling - ER 2004: 23rd International Conference on Conceptual Modeling, Shanghai, China, November 8-12, 2004. Proceedings (Lecture Notes in Computer Science)

Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison, Lancaster University, UK
Takeo Kanade, Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler, University of Surrey, Guildford, UK
Jon M. Kleinberg, Cornell University, Ithaca, NY, USA
Friedemann Mattern, ETH Zurich, Switzerland
John C. Mitchell, Stanford University, CA, USA
Moni Naor, Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz, University of Bern, Switzerland
C. Pandu Rangan, Indian Institute of Technology, Madras, India
Bernhard Steffen, University of Dortmund, Germany
Madhu Sudan, Massachusetts Institute of Technology, MA, USA
Demetri Terzopoulos, New York University, NY, USA
Doug Tygar, University of California, Berkeley, CA, USA
Moshe Y. Vardi, Rice University, Houston, TX, USA
Gerhard Weikum, Max-Planck Institute of Computer Science, Saarbruecken, Germany

3288


Paolo Atzeni Wesley Chu Hongjun Lu Shuigeng Zhou Tok Wang Ling (Eds.)

Conceptual Modeling – ER 2004 23rd International Conference on Conceptual Modeling Shanghai, China, November 8-12, 2004 Proceedings

Springer

eBook ISBN: 3-540-30464-9
Print ISBN: 3-540-23723-2

©2005 Springer Science + Business Media, Inc.
Print ©2004 Springer-Verlag Berlin Heidelberg
All rights reserved. No part of this eBook may be reproduced or transmitted in any form or by any means, electronic, mechanical, recording, or otherwise, without written consent from the Publisher.
Created in the United States of America

Visit Springer's eBookstore at: http://ebooks.springerlink.com
and the Springer Global Website Online at: http://www.springeronline.com

Foreword

On behalf of the Organizing Committee, we would like to welcome you to the proceedings of the 23rd International Conference on Conceptual Modeling (ER 2004). This conference provided an international forum for technical discussion on conceptual modeling of information systems among researchers, developers and users. This was the third time that this conference was held in Asia; the first time was in Singapore in 1998 and the second time was in Yokohama, Japan in 2001. China is the third largest nation with the largest population in the world. Shanghai, the largest city in China and a great metropolis, famous in Asia and throughout the world, is therefore a most appropriate location to host this conference.

This volume contains the papers selected for presentation and includes the two keynote talks by Prof. Hector Garcia-Molina and Prof. Gerhard Weikum, and an invited talk by Dr. Xiao Ji. This volume also contains industrial papers and demo/poster papers. An additional volume contains papers from 6 workshops. The conference also featured three tutorials: (1) Web Change Management and Delta Mining: Opportunities and Solutions, by Sanjay Madria, (2) A Survey of Data Quality Issues in Cooperative Information Systems, by Carlo Batini, and (3) Visual SQL – An ER-Based Introduction to Database Programming, by Bernhard Thalheim.

The technical program of the conference was selected by a distinguished program committee consisting of three PC Co-chairs, Hongjun Lu, Wesley Chu, and Paolo Atzeni, and more than 70 members. They faced a difficult task in selecting 57 papers from many very good contributions. This year the number of submissions, 293, was a record high for ER conferences. We wish to express our thanks to the program committee members, external reviewers, and all authors for submitting their papers to this conference.

We would also like to thank: the Honorary Conference Chairs, Peter P. Chen and Ruqian Lu; the Coordinators, Zhongzhi Shi, Yoshifumi Masunaga, Elisa Bertino, and Carlo Zaniolo; the Workshop Co-chairs, Shan Wang and Katsumi Tanaka; the Tutorial Co-chairs, Jianzhong Li and Stefano Spaccapietra; the Panel Co-chairs, Chin-Chen Chang and Erich Neuhold; the Industrial Co-chairs, Philip S. Yu, Jian Pei, and Jiansheng Feng; the Demos and Posters Co-chairs, Mong-Li Lee and Gillian Dobbie; the Publicity Chair, Qing Li; the Publication Chair cum Local Arrangements Chair, Shuigeng Zhou; the Treasurer, Xueqing Gong; the Registration Chair, Xiaoling Wang; the Steering Committee Liaison, Arne Solvberg; and the Webmasters, Kun Yue, Yizhong Wu, Zhimao Guo, and Keping Zhao.

We wish to extend our thanks to the Natural Science Foundation of China, the ER Institute (ER Steering Committee), the K.C. Wong Education Foundation in Hong Kong, the Database Society of the China Computer Federation, ACM SIGMOD, ACM SIGMIS, IBM China Co., Ltd., Shanghai Baosight Software Co., Ltd., and the Digital Policy Management Association of Korea for their sponsorships and support.

At this juncture, we wish to remember the late Prof. Yahiko Kambayashi, who passed away on February 5, 2004 at the age of 60 and was then a workshop co-chair of the conference. Many of us will remember him as a friend, a mentor, a leader, an educator, and our source of inspiration. We express our heartfelt condolences and our deepest sympathy to his family.

We hope that the attendees found the technical program of ER 2004 to be interesting and beneficial to their research. We trust they enjoyed this beautiful city, including the night scene along the Huangpujiang River and the post-conference tours to the nearby cities, leaving a beautiful and memorable experience for all.

November 2004

Tok Wang Ling
Aoying Zhou

Preface

The 23rd International Conference on Conceptual Modeling (ER 2004) was held in Shanghai, China, November 8–12, 2004. Conceptual modeling is a fundamental technique used in analysis and design as a real-world abstraction and as the basis for communication between technology experts and their clients and users. It has become a fundamental mechanism for understanding and representing organizations, including new e-worlds, and the information systems that support them. The International Conference on Conceptual Modeling provides a major forum for presenting and discussing current research and applications in which conceptual modeling is the major emphasis. Since the first edition in 1979, the ER conference has evolved into the most prestigious one in the areas of conceptual modeling research and applications. Its purpose is to identify challenging problems facing high-level modeling of future information systems and to shape future directions of research by soliciting and reviewing high-quality applied and theoretical research findings.

ER 2004 encompassed the entire spectrum of conceptual modeling. It addressed research and practice in areas such as theories of concepts and ontologies underlying conceptual modeling, methods and tools for developing and communicating conceptual models, and techniques for transforming conceptual models into effective information system implementations. We solicited forward-looking and innovative contributions that identify promising areas for future conceptual modeling research as well as traditional approaches to analysis and design theory for information systems development. The Call for Papers attracted 295 exceptionally strong submissions of research papers from 36 countries/regions. Due to limited space, we were only able to accept 57 papers from 21 countries/regions, for an acceptance rate of 19.3%. Inevitably, many good papers had to be rejected. The accepted papers covered topics such as ontologies, patterns, workflows, metamodeling and methodology, innovative approaches to conceptual modeling, foundations of conceptual modeling, advanced database applications, systems integration, requirements and evolution, queries and languages, Web application modeling and development, schemas and ontologies, and data mining.

We are proud of the quality of this year's program, from the keynote speeches to the research papers, along with the workshops, panels, tutorials, and industrial papers. We were honored to host the outstanding keynote addresses by Hector Garcia-Molina and Gerhard Weikum. We appreciate the hard work of the organizing committee, with interactions around the clock with colleagues all over the world. Most of all, we are extremely grateful to the program committee members of ER 2004 who generously spent their time and energy reviewing submitted papers. We also thank the many external referees who helped with the review process. Last but not least, we thank the authors who wrote high-quality research papers and submitted them to ER 2004, without whom the conference would not have existed.

November 2004

Paolo Atzeni, Wesley Chu, and Hongjun Lu

ER 2004 Conference Organization

Honorary Conference Chairs
Peter P. Chen, Louisiana State University, USA
Ruqian Lu, Fudan University, China

Conference Co-chairs
Aoying Zhou, Fudan University, China
Tok Wang Ling, National University of Singapore, Singapore

Program Committee Co-chairs
Paolo Atzeni, Università Roma Tre, Italy
Wesley Chu, University of California at Los Angeles, USA
Hongjun Lu, Univ. of Science and Technology of Hong Kong, China

Workshop Co-chairs
Shan Wang, Renmin University of China, China
Katsumi Tanaka, Kyoto University, Japan
Yahiko Kambayashi1, Kyoto University, Japan

Tutorial Co-chairs
Jianzhong Li, Harbin Institute of Technology, China
Stefano Spaccapietra, EPFL Lausanne, Switzerland

Panel Co-chairs
Chin-Chen Chang, Chung Cheng University, Taiwan, China
Erich Neuhold, IPSI, Fraunhofer, Germany

Industrial Co-chairs
Philip S. Yu, IBM T.J. Watson Research Center, USA
Jian Pei, Simon Fraser University, Canada
Jiansheng Feng, Shanghai Baosight Software Co., Ltd., China

1 Prof. Yahiko Kambayashi died on February 5, 2004.


Demos and Posters Co-chairs
Mong-Li Lee, National University of Singapore, Singapore
Gillian Dobbie, University of Auckland, New Zealand

Publicity Chair
Qing Li, City University of Hong Kong, China

Publication Chair
Shuigeng Zhou, Fudan University, China

Coordinators
Zhongzhi Shi, ICT, Chinese Academy of Science, China
Yoshifumi Masunaga, Ochanomizu University, Japan
Elisa Bertino, Purdue University, USA
Carlo Zaniolo, University of California at Los Angeles, USA

Steering Committee Liaison
Arne Solvberg, Norwegian University of Sci. and Tech., Norway

Local Arrangements Chair
Shuigeng Zhou, Fudan University, China

Treasurer
Xueqing Gong, Fudan University, China

Registration
Xiaoling Wang, Fudan University, China

Webmasters
Kun Yue, Fudan University, China
Yizhong Wu, Fudan University, China
Zhimao Guo, Fudan University, China
Keping Zhao, Fudan University, China


Program Committee
Jacky Akoka, CNAM & INT, France
Hiroshi Arisawa, Yokohama National University, Japan
Sonia Bergamaschi, Università di Modena e Reggio Emilia, Italy
Mokrane Bouzeghoub, Université de Versailles, France
Diego Calvanese, Università di Roma La Sapienza, Italy
Cindy Chen, University of Massachusetts at Lowell, USA
Shing-Chi Cheung, Univ. of Science and Technology of Hong Kong, China
Roger Chiang, University of Cincinnati, USA
Stefan Conrad, Heinrich-Heine-Universität Düsseldorf, Germany
Bogdan Czejdo, Loyola University, New Orleans, USA
Lois Delcambre, Oregon Health Science University, USA
Debabrata Dey, University of Washington, USA
Johann Eder, Universität Klagenfurt, Austria
Ramez Elmasri, University of Texas at Arlington, USA
David W. Embley, Brigham Young University, USA
Johann-Christoph Freytag, Humboldt-Universität zu Berlin, Germany
Antonio L. Furtado, PUC Rio de Janeiro, Brazil
Andreas Geppert, Credit Suisse, Switzerland
Shigeichi Hirasawa, Waseda University, Japan
Arthur ter Hofstede, Queensland University of Technology, Australia
Matthias Jarke, Technische Hochschule Aachen, Germany
Christian S. Jensen, Aalborg Universitet, Denmark
Manfred Jeusfeld, Universiteit van Tilburg, Netherlands
Yahiko Kambayashi, Kyoto University, Japan
Hannu Kangassalo, University of Tampere, Finland
Kamalakar Karlapalem, Intl. Institute of Information Technology, India
Vijay Khatri, Indiana University at Bloomington, USA
Dongwon Lee, Pennsylvania State University, USA
Mong-Li Lee, National University of Singapore, Singapore
Wen Lei Mao, University of California at Los Angeles, USA
Jianzhong Li, Harbin Institute of Technology, China
Qing Li, City University of Hong Kong, Hong Kong, China
Stephen W. Liddle, Brigham Young University, USA
Ee-Peng Lim, Nanyang Technological University, Singapore
Mengchi Liu, Carleton University, Canada
Victor Zhenyu Liu, University of California at Los Angeles, USA
Ray Liuzzi, Air Force Research Laboratory, USA
Bertram Ludäscher, San Diego Supercomputer Center, USA
Ashok Malhotra, Microsoft, USA
Murali Mani, Worcester Polytechnic Institute, USA
Fabio Massacci, Università di Trento, Italy
Sergey Melnik, Universität Leipzig, Germany


Xiaofeng Meng, Renmin University of China, China
Renate Motschnig, Universität Wien, Austria
John Mylopoulos, University of Toronto, Canada
Sham Navathe, Georgia Institute of Technology, USA
Jyrki Nummenmaa, University of Tampere, Finland
Maria E. Orlowska, University of Queensland, Australia
Oscar Pastor, Universidad Politécnica de Valencia, Spain
Jian Pei, Simon Fraser University, Canada
Zhiyong Peng, Wuhan University, China
Barbara Pernici, Politecnico di Milano, Italy
Dimitris Plexousakis, FORTH-ICS, Greece
Sandeep Purao, Pennsylvania State University, USA
Sudha Ram, University of Arizona, USA
Colette Rolland, Univ. Paris 1 Panthéon-Sorbonne, France
Elke Rundensteiner, Worcester Polytechnic Institute, USA
Peter Scheuermann, Northwestern University, USA
Keng Siau, University of Nebraska-Lincoln, USA
Janice C. Sipior, Villanova University, USA
Il-Yeol Song, Drexel University, USA
Nicolas Spyratos, Université de Paris-Sud, France
Veda C. Storey, Georgia State University, USA
Ernest Teniente, Universitat Politècnica de Catalunya, Spain
Juan C. Trujillo, Universidad de Alicante, Spain
Michalis Vazirgiannis, Athens Univ. of Economics and Business, Greece
Dongqing Yang, Peking University, China
Jian Yang, Tilburg University, Netherlands
Ge Yu, Northeastern University, China
Lizhu Zhou, Tsinghua University, China
Longxiang Zhou, Chinese Academy of Science, China
Shuigeng Zhou, Fudan University, China


External Referees

A. Analyti Michael Adams Alessandro Artale Enrico Blanzieri Shawn Bowers Paolo Bresciani Linas Bukauskas Ugo Buy Luca Cabibbo Andrea Calì Cinzia Cappiello Alain Casali Yu Chen V. Christophidis Fang Chu Valter Crescenzi Michael Derntl Arnaud Giacometti Paolo Giorgini Cristina Gómez Daniela Grigori

Wynne Hsu Stamatis Karvounarakis Ioanna Koffina George Kokkinidis Hristo Koshutanski Kyriakos Kritikos Lotfi Lakhal Domenico Lembo Shaorong Liu Stéphane Lopes Bertram Ludaescher Chang Luo Gianni Mecca Massimo Mecella Carlo Meghini Paolo Merialdo Antonis Misargopoulos Paolo Missier Stefano Modafferi Wai Yin Mok Enrico Mussi

Noel Novelli Alexandros Ntoulas Phillipa Oaks Seog-Chan Oh Justin O’Sullivan Manos Papaggelis V. Phan-Luong Pierluigi Plebani Philippe Rigaux Nick Russell Ulrike Sattler Monica Scannapieco Ka Cheung Sia Riccardo Torlone Goce Trajcevski Nikos Tsatsakis Haixun Wang Moe Wynn Yi Xia Yirong Yang Fan Ye


Co-organized by
Fudan University of China
National University of Singapore

In Cooperation with
Database Society of the China Computer Federation
ACM SIGMOD
ACM SIGMIS

Sponsored by
National Natural Science Foundation of China (NSFC)
ER Institute (ER Steering Committee)
K.C. Wong Education Foundation, Hong Kong

Supported by
IBM China Co., Ltd.
Shanghai Baosight Software Co., Ltd.
Digital Policy Management Association of Korea

Table of Contents

Keynote Addresses Entity Resolution: Overview and Challenges Hector Garcia-Molina

1

Towards a Statistically Semantic Web Gerhard Weikum, Jens Graupmann, Ralf Schenkel, and Martin Theobald

3

Invited Talk The Application and Prospect of Business Intelligence in Metallurgical Manufacturing Enterprises in China Xiao Ji, Hengjie Wang, Haidong Tang, Dabin Hu, and Jiansheng Feng

18

Conceptual Modeling I Conceptual Modelling – What and Why in Current Practice Islay Davies, Peter Green, Michael Rosemann, and Stan Gallo

30

Entity-Relationship Modeling Re-revisited Don Goelman and Il-Yeol Song

43

Modeling Functional Data Sources as Relations Simone Santini and Amarnath Gupta

55

Conceptual Modeling II Roles as Entity Types: A Conceptual Modelling Pattern Jordi Cabot and Ruth Raventós

69

Modeling Default Induction with Conceptual Structures Julien Velcin and Jean-Gabriel Ganascia

83

Reachability Problems in Entity-Relationship Schema Instances Sebastiano Vigna

96

Conceptual Modeling III A Reference Methodology for Conducting Ontological Analyses Michael Rosemann, Peter Green, and Marta Indulska Pruning Ontologies in the Development of Conceptual Schemas of Information Systems Jordi Conesa and Antoni Olivé

110

122


Definition of Events and Their Effects in Object-Oriented Conceptual Modeling Languages Antoni Olivé

136

Conceptual Modeling IV Enterprise Modeling with Conceptual XML David W. Embley, Stephen W. Liddle, and Reema Al-Kamha

150

Graphical Reasoning for Sets of Functional Dependencies János Demetrovics, András Molnár, and Bernhard Thalheim

166

ER-Based Software Sizing for Data-Intensive Systems Hee Beng Kuan Tan and Yuan Zhao

180

Data Warehouse Data Mapping Diagrams for Data Warehouse Design with UML Sergio Luján-Mora, Panos Vassiliadis, and Juan Trujillo

191

Informational Scenarios for Data Warehouse Requirements Elicitation Naveen Prakash, Yogesh Singh, and Anjana Gosain

205

Extending UML for Designing Secure Data Warehouses Eduardo Fernández-Medina, Juan Trujillo, Rodolfo Villarroel, and Mario Piattini

217

Schema Integration I Data Integration with Preferences Among Sources Gianluigi Greco and Domenico Lembo Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas Qi He and Tok Wang Ling Managing Merged Data by Vague Functional Dependencies An Lu and Wilfred Ng

231

245 259

Schema Integration II Merging of XML Documents Wanxia Wei, Mengchi Liu, and Shijun Li

273

Schema-Based Web Wrapping Sergio Flesca and Andrea Tagarelli

286

Web Taxonomy Integration Using Spectral Graph Transducer Dell Zhang, Xiaoling Wang, and Yisheng Dong

300


Data Classification and Mining I Contextual Probability-Based Classification Gongde Guo, Hui Wang, David Bell, and Zhining Liao

313

Improving the Performance of Decision Tree: A Hybrid Approach LiMin Wang, SenMiao Yuan, Ling Li, and HaiJun Li

327

Understanding Relationships: Classifying Verb Phrase Semantics Veda C. Storey and Sandeep Purao

336

Data Classification and Mining II Fast Mining Maximal Frequent ItemSets Based on FP-Tree Yuejin Yan, Zhoujun Li, and Huowang Chen

348

Multi-phase Process Mining: Building Instance Graphs B.F. van Dongen and W.M.P. van der Aalst

362

A New XML Clustering for Structural Retrieval Jeong Hee Hwang and Keun Ho Ryu

377

Web-Based Information Systems Link Patterns for Modeling Information Grids and P2P Networks Christopher Popfinger, Cristian Pérez de Laborda, and Stefan Conrad

388

Information Retrieval Aware Web Site Modelling and Generation Keyla Ahnizeret, David Fernandes, João M.B. Cavalcanti, Edleno Silva de Moura, and Altigran S. da Silva

402

Expressive Profile Specification and Its Semantics for a Web Monitoring System Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy

420

Query Processing I On Modelling Cooperative Retrieval Using an Ontology-Based Query Refinement Process Nenad Stojanovic and Ljiljana Stojanovic

434

Load-Balancing Remote Spatial Join Queries in a Spatial GRID Anirban Mondal and Masaru Kitsuregawa

450

Expressing and Optimizing Similarity-Based Queries in SQL Like Gao, Min Wang, X. Sean Wang, and Sriram Padmanabhan

464


Query Processing II XSLTGen: A System for Automatically Generating XML Transformations 479 via Semantic Mappings Stella Waworuntu and James Bailey Efficient Recursive XML Query Processing in Relational Database Systems Sandeep Prakash, Sourav S. Bhowmick, and Sanjay Madria

493

Situated Preferences and Preference Repositories for Personalized Database Applications Stefan Holland and Werner Kießling

511

Web Services I Analysis and Management of Web Service Protocols Boualem Benatallah, Fabio Casati, and Farouk Toumani

524

Semantic Interpretation and Matching of Web Services Chang Xu, Shing-Chi Cheung, and Xiangye Xiao

542

Intentional Modeling to Support Identity Management Lin Liu and Eric Yu

555

Web Services II WUML: A Web Usage Manipulation Language for Querying Web Log Data Qingzhao Tan, Yiping Ke, and Wilfred Ng

567

An Agent-Based Approach for Interleaved Composition and Execution of Web Services Xiaocong Fan, Karthikeyan Umapathy, John Yen, and Sandeep Purao

582

A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows San-Yih Hwang, Haojun Wang, Jaideep Srivastava, and Raymond A. Paul

596

Schema Evolution Lossless Conditional Schema Evolution Ole G. Jensen and Michael H. Böhlen

610

Ontology-Guided Change Detection to the Semantic Web Data Li Qin and Vijayalakshmi Atluri

624

Schema Evolution in Data Warehousing Environments – A Schema Transformation-Based Approach Hao Fan and Alexandra Poulovassilis

639


Conceptual Modeling Applications I Metaprogramming for Relational Databases Jernej Kovse, Christian Weber, and Theo Härder Incremental Navigation: Providing Simple and Generic Access to Heterogeneous Structures Shawn Bowers and Lois Delcambre Agent Patterns for Ambient Intelligence Paolo Bresciani, Loris Penserini, Paolo Busetta, and Tsvi Kuflik

654

668 682

Conceptual Modeling Applications II Modeling the Semantics of 3D Protein Structures Sudha Ram and Wei Wei

696

Risk-Driven Conceptual Modeling of Outsourcing Decisions Pascal van Eck, Roel Wieringa, and Jaap Gordijn

709

A Pattern and Dependency Based Approach to the Design of Process Models Maria Bergholtz, Prasad Jayaweera, Paul Johannesson, and Petia Wohed

724

UML Use of Tabular Analysis Method to Construct UML Sequence Diagrams Margaret Hilsbos and Il- Yeol Song

740

An Approach to Formalizing the Semantics of UML Statecharts Xuede Zhan and Huaikou Miao

753

Applying the Application-Based Domain Modeling Approach to UML Structural Views Arnon Sturm and Iris Reinhartz-Berger

766

XML Modeling A Model Driven Approach for XML Database Development Belén Vela, César J. Acuña, and Esperanza Marcos

780

On the Updatability of XML Views Published over Relational Data Ling Wang and Elke A. Rundensteiner

795

XBiT: An XML-Based Bitemporal Data Model Fusheng Wang and Carlo Zaniolo

810


Industrial Presentations I: Applications Enterprise Cockpit for Business Operation Management Fabio Casati, Malu Castellanos, and Ming-Chien Shan

825

Modeling Autonomous Catalog for Electronic Commerce Yuan-Chi Chang, Vamsavardhana R. Chillakuru, and Min Wang

828

GiSA: A Grid System for Genome Sequences Assembly Jun Tang, Dong Huang, Chen Wang, Wei Wang, and Baile Shi

831

Industrial Presentations II: Ontology in Applications Analytical View of Business Data: An Example Adam Yeh, Jonathan Tang, Youxuan Jin, and Sam Skrivan

834

Ontological Approaches to Enterprise Applications Dongkyu Kim, Yuan-Chi Chang, Juhnyoung Lee, and Sang-goo Lee

838

FASTAXON: A System for FAST (and Faceted) TAXONomy Design Yannis Tzitzikas, Raimo Launonen, Mika Hakkarainen, Pekka Korhonen, Tero Leppänen, Esko Simpanen, Hannu Törnroos, Pekka Uusitalo, and Pentti Vänskä

841

CLOVE: A Framework to Design Ontology Views Rosario Uceda-Sosa, Cindy X. Chen, and Kajal T. Claypool

844

Demos and Posters iRM: An OMG MOF Based Repository System with Querying Capabilities Ilia Petrov, Stefan Jablonski, Marc Holze, Gabor Nemes, and Marcus Schneider

850

Visual Querying for the Semantic Web Sacha Berger, Franois Bry, and Christoph Wieser

852

Query Refinement by Relevance Feedback in an XML Retrieval System Hanglin Pan, Anja Theobald, and Ralf Schenkel

854

Semantics Modeling for Spatiotemporal Databases Peiquan Jin, Lihua Yue, and Yuchang Gong

856

Temporal Information Management Using XML Fusheng Wang, Xin Zhou, and Carlo Zaniolo

858

SVMgr: A Tool for the Management of Schema Versioning Fabio Grandi

860


GENNERE: A Generic Epidemiological Network for Nephrology and Rheumatology Ana Simonet, Michel Simonet, Cyr-Gabin Bassolet, Sylvain Ferriol, Cédric Gueydan, Rémi Patriarche, Haijin Yu, Ping Hao, Yi Liu, Wen Zhang, Nan Chen, Michel Forêt, Philippe Gaudin, Georges De Moor, Geert Thienpont, Mohamed Ben Saïd, Paul Landais, and Didier Guillon


862

Panel Beyond Webservices – Conceptual Modelling for Service Oriented Architectures Peter Fankhauser

865

Author Index

867


Entity Resolution: Overview and Challenges Hector Garcia-Molina Stanford University, Stanford, CA, USA

Entity resolution is a problem that arises in many information integration scenarios: We have two or more sources containing records on the same set of real-world entities (e.g., customers). However, there are no unique identifiers that tell us what records from one source correspond to those in the other sources. Furthermore, the records representing the same entity may have differing information, e.g., one record may have the address misspelled, another record may be missing some fields. An entity resolution algorithm attempts to identify the matching records from multiple sources (i.e., those corresponding to the same real-world entity), and merges the matching records as best it can. Entity resolution algorithms typically rely on user-defined functions that (a) compare fields or records to determine if they match (are likely to represent the same real-world entity), and (b) merge matching records into one, and in the process perhaps combine fields (e.g., creating a new name based on two slightly different versions of the name). In this talk I will give an overview of the Stanford SERF Project, which is building a framework to describe and evaluate entity resolution schemes. In particular, I will give an overview of some of the different entity resolution settings:

De-duplication versus fidelity enhancement. In the de-duplication problem, we have a single set of records, and we try to merge the ones representing the same real world entity. In the fidelity enhancement problem, we have two sets of records: a base set of records of interest, and a new set of acquired information. The goal is to coalesce the new information into the base records.

Clustering versus snapping. With snapping, we examine records pair-wise and decide if they represent the same entity. If they do, we merge the records into one, and continue the process of pair-wise comparisons. With clustering, we analyze all records and partition them into groups we believe represent the same real world entity. At the end, each partition is merged into one record.

Confidences. In some entity resolution scenarios we must manage confidences. For example, input records may have a confidence value representing how likely it is they are true. Snap rules (that tell us when two records match) may also have confidences representing how likely it is that two records actually represent the same real world entity. As we merge records, we must track their confidences.

Schema Mismatches. In some entity resolution scenarios we must deal, not just with resolving information on entities, but also with resolving discrepancies among the schemas of the different sources. For example, the attribute names and formats from one source may not match those of other sources.

In the talk I will address some of the open problems and challenges that arise in entity resolution. These include:


Performance. Entity resolution algorithms must perform a very large number of field and record comparisons (via the user-provided functions), so it is critical to perform only the absolute minimum number of invocations of the comparison functions. Developing efficient algorithms is analogous to developing efficient join algorithms for relational databases.

Confidences. Very little is understood as to how confidences should be manipulated in an entity resolution setting. For example, say we have two records, one reporting that "Joe" uses cell phone 123, and the other reporting that "Joseph" uses phone 456. The first record has confidence 0.9 and the second one 0.7. A snap rule tells us that "Joe" and "Joseph" are the same person with confidence 0.8. Do we assume this person has been using two phones? Or that 123 is the correct number because that record has a higher confidence? If we do merge the records, what are the resulting confidences?

Metrics. Say we have two entity resolution schemes, A and B. How do we know if A yields "better" results than B? Or say we have one base set of records, and we wish to enhance its fidelity with either new set X or new set Y. Since it costs money to acquire either new set, we only wish to use one. Based on samples of X and Y, how do we decide which set is more likely to enhance our base set? To address questions such as these we need to develop metrics that quantify not just the performance of entity resolution, but also its accuracy.

Privacy. There is a strong connection between entity resolution and information privacy. To illustrate, say Alice has given out two records containing some of her private information: Record 1 gives Alice's name, phone number and credit card number; record 2 gives Alice's name, phone and national identity number. How much information has actually "leaked" depends on how well an adversary, Bob, can piece together these two records. If Bob can determine that the records refer to the same person, then he knows Alice's credit card number and her national identity number, opening the door for, say, identity theft. If the records do not snap together, then Bob knows less and we have a smaller information leak. We need to develop good ways to model information leakage in an entity resolution context. Such a model can lead us, for example, to techniques for quantifying the leakage caused by releasing one new fact, or the decrease in leakage caused by releasing disinformation.

Additional information on our SERF project can be found at http://www-db.stanford.edu/serf. This work is joint with Qi Su, Tyson Condie, Nicolas Pombourcq, and Jennifer Widom.
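To make the pair-wise snapping process and the confidence question above concrete, here is a minimal, self-contained sketch in Python. The alias table, the match and merge rules, and the multiplicative confidence policy are illustrative assumptions for the "Joe"/"Joseph" example, not the SERF framework's actual functions.

# Illustrative pair-wise "snapping" entity resolution with a naive
# confidence policy; the match/merge rules below are made-up examples.

ALIASES = {("joe", "joseph"): 0.8}     # assumed snap-rule confidence

def match(r1, r2):
    # Return the confidence that r1 and r2 denote the same entity, or 0.0.
    n1, n2 = r1["name"].lower(), r2["name"].lower()
    if n1 == n2:
        return 1.0
    return ALIASES.get((n1, n2)) or ALIASES.get((n2, n1)) or 0.0

def merge(r1, r2, rule_conf):
    # Keep the union of the phone numbers and combine confidences
    # multiplicatively; this is one naive policy among many conceivable ones.
    return {
        "name": r1["name"] if r1["conf"] >= r2["conf"] else r2["name"],
        "phones": sorted(set(r1["phones"]) | set(r2["phones"])),
        "conf": r1["conf"] * r2["conf"] * rule_conf,
    }

def resolve(records):
    # Repeatedly snap the first matching pair until no pair matches anymore.
    records = list(records)
    changed = True
    while changed:
        changed = False
        for i in range(len(records)):
            for j in range(i + 1, len(records)):
                conf = match(records[i], records[j])
                if conf > 0.0:
                    merged = merge(records[i], records[j], conf)
                    records = [r for idx, r in enumerate(records) if idx not in (i, j)]
                    records.append(merged)
                    changed = True
                    break
            if changed:
                break
    return records

if __name__ == "__main__":
    print(resolve([
        {"name": "Joe", "phones": ["123"], "conf": 0.9},
        {"name": "Joseph", "phones": ["456"], "conf": 0.7},
    ]))

Under these assumptions the two records snap into one with phones 123 and 456 and a combined confidence of 0.504; whether that is the "right" answer is exactly the open question raised above.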

Towards a Statistically Semantic Web Gerhard Weikum, Jens Graupmann, Ralf Schenkel, and Martin Theobald Max-Planck Institute of Computer Science Saarbruecken, Germany

Abstract. The envisioned Semantic Web aims to provide richly annotated and explicitly structured Web pages in XML, RDF, or description logics, based upon underlying ontologies and thesauri. Ideally, this should enable a wealth of query processing and semantic reasoning capabilities using XQuery and logical inference engines. However, we believe that the diversity and uncertainty of terminologies and schema-like annotations will make precise querying on a Web scale extremely elusive if not hopeless, and the same argument holds for large-scale dynamic federations of Deep Web sources. Therefore, ontology-based reasoning and querying needs to be enhanced by statistical means, leading to relevanceranked lists as query results. This paper presents steps towards such a “statistically semantic” Web and outlines technical challenges. We discuss how statistically quantified ontological relations can be exploited in XML retrieval, how statistics can help in making Web-scale search efficient, and how statistical information extracted from users’ query logs and click streams can be leveraged for better search result ranking. We believe these are decisive issues for improving the quality of next-generation search engines for intranets, digital libraries, and the Web, and they are crucial also for peer-to-peer collaborative Web search.

1 The Challenge of "Semantic" Information Search

The age of information explosion poses tremendous challenges regarding the intelligent organization of data and the effective search of relevant information in business and industry (e.g., market analyses, logistic chains), society (e.g., health care), and virtually all sciences that are more and more data-driven (e.g., gene expression data analyses and other areas of bioinformatics). The problems arise in intranets of large organizations, in federations of digital libraries and other information sources, and in the most humongous and amorphous of all data collections, the World Wide Web and its underlying numerous databases that reside behind portal pages. The Web bears the potential of being the world's largest encyclopedia and knowledge base, but we are very far from being able to exploit this potential. Database-system and search-engine technologies provide support for organizing and querying information; but all too often they require excessive manual preprocessing, such as designing a schema and cleaning raw data or manually classifying documents into a taxonomy for a good Web portal, or manual postprocessing such as browsing through large result lists with too many irrelevant items or surfing in the vicinity of promising but not truly satisfactory approximate matches. The following are a few example queries where current Web and intranet search engines fall short or where data integration techniques and the use of SQL-like querying face insurmountable difficulties even on structured, but federated and highly heterogeneous databases:

Q1: Which professors from Saarbruecken in Germany teach information retrieval and do research on XML?

Q2: Which gene expression data from Barrett tissue in the esophagus exhibit high levels of gene A01g? And are there any metabolic models for acid reflux that could be related to the gene expression data?

Q3: What are the most important research results on large deviation theory?

Q4: Which drama has a scene in which a woman makes a prophecy to a Scottish nobleman that he will become king?

Q5: Who was the French woman that I met in a program committee meeting where Paolo Atzeni was the PC chair?

Q6: Are there any published theorems that are equivalent to or subsume my latest mathematical conjecture?

Why are these queries difficult (too difficult for Google-style keyword search unless one invests a huge amount of time to manually explore large result lists with mostly irrelevant and some mediocre matches)? For Q1 no single Web site is a good match; rather one has to look at several pages together within some bounded context: the homepage of a professor with his address, a page with course information linked to by the homepage, and a research project page on semistructured data management that is a few hyperlinks away from the homepage. Q2 would be easy if asked for a single bioinformatics database with a familiar query interface, but searching the answer across the entire Web and Deep Web requires discovering all relevant data sources and unifying their query and result representations on the fly. Q3 is not a query in the traditional sense, but requires gathering a substantial number of key resources with valuable information on the given topic; it would be best served by looking up a well maintained Yahoo-style topic directory, but highly specific expert topics are not covered there. Q4 cannot be easily answered because a good match does not necessarily contain the keywords "woman", "prophecy", "nobleman", etc., but may rather say something like "Third witch: All hail, Macbeth, thou shalt be king hereafter!", and the same document may contain the text "All hail, Macbeth! hail to thee, thane of Glamis!". So this query requires some background knowledge to recognize that a witch is a woman, "shalt be" refers to a prophecy, and thane is a title for a Scottish nobleman. Q5 is similar to Q4 in the sense that it also requires background knowledge, but it is more difficult because it additionally requires putting together various information fragments: conferences on which I served on the PC, found in my email archive; PC members of conferences, found on Web pages; and detailed information found on researchers' homepages. And after having identified a candidate like Sophie Cluet from Paris, one needs to infer that Sophie is a typical female first name and that Paris most likely denotes the capital of France rather than the 500-inhabitant town of Paris, Texas, that became known through a movie. Q6 finally is what some researchers call "AI-complete"; it will remain a challenge for a long time.

For a human expert who is familiar with the corresponding topics, none of these queries is really difficult. With unlimited time, the expert could easily identify relevant pages and combine semantically related information units into query answers.
The challenge is to automate or simulate these intellectual capabilities and implement them so that they can handle billions of Web pages and petabytes of data in structured (but schematically highly diverse) Deep-Web databases.


2 The Need for Statistics

What if all Web pages and all Web-accessible data sources were in XML, RDF, or OWL (a description-logic representation) as envisioned in the Semantic Web research direction [25,1]? Would this enable a search engine to effectively answer the challenging queries of the previous section? And would such an approach scale to billions of Web pages and be efficient enough for interactive use? Or could we even load and integrate all Web data into one gigantic database and use XQuery for searching it?

XML, RDF, and OWL offer ways of more explicitly structuring and richly annotating Web pages. When viewed as logic formulas or labeled graphs, we may think of the pages as having "semantics", at least in terms of model theory or graph isomorphisms1. In principle, this opens up a wealth of precise querying and logical inferencing opportunities. However, it is extremely unlikely that all pages will use the very same tag or predicate names when they refer to the same semantic properties and relationships. Making such an assumption would be equivalent to assuming a single global schema: this would be arbitrarily difficult to achieve in a large intranet, and it is completely hopeless for billions of Web pages given the Web's high dynamics, extreme diversity of terminology, and uncertainty of natural language (even if used only for naming tags and predicates). There may be standards (e.g., XML schemas) for certain areas (e.g., for invoices or invoice-processing Web Services), but these will have limited scope and influence. A terminologically unified and logically consistent Semantic Web with billions of pages is hard to imagine. So reasoning about diversely annotated pages is a necessity and a challenge.

Similarly to the ample research on database schema integration and instance matching (see, e.g., [49] and the references given there), knowledge bases [50], lexicons, thesauri [24], or ontologies [58] are considered as the key asset to this end. Here an ontology is understood as a collection of concepts with various semantic relationships among them; the formal representation may vary from rigorous logics to natural language. The most important relationship types are hyponymy (specialization into narrower concepts) and hypernymy (generalization into broader concepts). To the best of my knowledge, the most comprehensive, publicly available kind of ontology is the WordNet thesaurus hand-crafted by cognitive scientists at Princeton [24]. For the concept "woman" WordNet lists about 50 immediate hyponyms, which include concepts like "witch" and "lady" which could help to answer queries like Q4 from the previous section. However, regardless of whether one represents these hyponymy relationships in a graph-oriented form or as logical formulas, such a rigid "true-or-false" representation could never discriminate these relevant concepts from the other 48 irrelevant and largely exotic hyponyms of "woman". In information-retrieval (IR) jargon, such an approach would be called Boolean retrieval or Boolean reasoning; and IR almost always favors ranked retrieval with some quantitative relevance assessment. In fact, by simply looking at statistical correlations of using words like "woman" and "lady" together in some text neighborhood within large corpora (e.g., the Web or large digital libraries) one can infer that these two concepts are strongly related, as opposed to concepts like "woman" and "siren". Similarly, mere statistics strongly suggests that a city name "Paris" denotes the French capital and not Paris, Texas. Once making a distinction of strong vs. weak relationships and realizing that this is a full spectrum, it becomes evident that the significance of semantic relationships needs to be quantified in some manner, and the by far best known way of doing this (in terms of rigorous foundation and rich body of results) is by using probability theory and statistics.

This concludes my argument for the necessity of a "statistically semantic" Web. The following sections substantiate and illustrate this point by sketching various technical issues where statistical reasoning is key. Most of the discussion addresses how to handle non-schematic XML data; this is certainly still a good distance from the Semantic Web vision, but it is a decent and practically most relevant first step.

1 Some people may argue that all computer models are mere syntax anyway, but this is in the eye of the beholder.

3 Towards More "Semantics" in Searching XML and Web Data

Non-schematic XML data that comes from many different sources and inevitably exhibits heterogeneous structures and annotations (i.e., XML tags) cannot be adequately searched using database query languages like XPath or XQuery. Often, queries either return too many or too few results. Rather, the ranked-retrieval paradigm is called for, with relaxable search conditions, various forms of similarity predicates on tags and contents, and quantitative relevance scoring. Note that the need for ranking goes beyond adding Boolean text-search predicates to XQuery. In fact, similarity scoring and ranking are orthogonal to data types and would be desirable and beneficial also on structured attributes such as time (e.g., approximately in the year 1790), geographic coordinates (e.g., near Paris), and other numerical and categorical data types (e.g., numerical sensor readings and music style categories). Research on applying IR techniques to XML data started five years ago with the work [26,55,56,60] and has meanwhile gained considerable attention. This research avenue includes approaches based on combining ranked text search with XPath-style conditions [4,13,35,11,31,38], structural similarities such as tree-editing distances [5,54,69,14], ontology-enhanced content similarities [60,61,52], and applying probabilistic IR and statistical language models to XML [28,2]. Our own approach, the XXL2 query language and search engine [60,61,52], combines a subset of XPath with a similarity operator ~ that can be applied to element or attribute names, on one hand, and element or attribute contents, on the other hand. For example, the queries Q1 and Q4 of Section 1 could be expressed in XXL as follows (and executed on a heterogeneous collection of XML documents):

Here XML data is interpreted as a directed graph, including href or XLink/XPointer links within and across documents that go beyond a merely tree-oriented approach. End nodes of connections that match a path condition such as drama//scene are bound to node variables that can be referred to in other search conditions. Content conditions such as ~speaker = "~woman" are interpreted as keyword queries on XML elements, using IR-style measures (based on statistics like term frequencies and inverse element frequencies) for scoring the relevance of an element. In addition and most importantly, we allow expanding the query by adding "semantically" related terms taken from an ontology. In the example, "woman" could be expanded into "woman wife lady girl witch ...". The score of a relaxed match, say for an element containing "witch", is the product of the traditional score for the query "witch" and the ontological similarity of the query term and the related term, sim(woman, witch) in the particular example. Element (or attribute) name conditions such as ~course are analogously relaxed, so that, for example, tag names "teaching", "class", or "seminar" would be considered as approximate matches. Here the score is simply the ontological similarity, for tag names are only single words or short composite words. The result of an entire query is a ranked list of subgraphs of the XML data graph, where each result approximately matches all query conditions with the same binding of all variables (but different results have different bindings). The total score of a result is computed from the scores of the elementary conditions using a simple probabilistic model with independence assumptions, and the result ranking is in descending order of total scores.

Query languages of this kind work nicely on heterogeneous and non-schematic XML data collections, but the Web and also large fractions of intranets are still mostly in HTML, PDF, and other less structured formats. Recently we have started to apply XXL-style queries also to such data by automatically converting Web data into XML format. The COMPASS3 search engine that we have been building supports XML ranked retrieval on the full suite of Web and intranet data, including combined data collections that include both XML documents and Web pages [32]. For example, query Q1 can be executed on an index that is built over all of DBLP (cast into XML) and the crawled homepages of all authors and other Web pages reachable through hyperlinks. Figure 1 depicts the visual formulation of query Q1.

2 Flexible XML Search Language.

Fig. 1. Visual COMPASS Query

3 Concept-oriented Multi-format Portal-aware Search System.
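The scoring model described above can be sketched in a few lines of Python. The toy statistics, the ontological similarity values, and the helper functions below are illustrative assumptions, not the actual XXL or COMPASS implementation; they only show how a local IR-style score, the ontological similarity of a relaxed term, and a product combination of condition scores fit together.

import math

# Toy element-level statistics and ontological similarities (all assumed).
NUM_ELEMENTS = 1000
ELEMENT_FREQ = {"woman": 40, "lady": 25, "witch": 5}      # elements containing the term
ONTO_SIM = {("woman", "woman"): 1.0,
            ("woman", "lady"): 0.72,
            ("woman", "witch"): 0.61}

def local_score(term, tf):
    # Term frequency times inverse element frequency, a stand-in for the
    # IR-style relevance measure mentioned above.
    return tf * math.log(NUM_ELEMENTS / (1 + ELEMENT_FREQ.get(term, 0)))

def relaxed_content_score(query_term, element_terms):
    # Score of a content condition after ontology-based expansion:
    # local score of the substituted term times its ontological similarity.
    best = 0.0
    for term, tf in element_terms.items():
        sim = ONTO_SIM.get((query_term, term), 0.0)
        best = max(best, local_score(term, tf) * sim)
    return best

def total_score(condition_scores):
    # Scores of the elementary conditions of one variable binding are
    # combined by a simple product (independence assumption).
    total = 1.0
    for s in condition_scores:
        total *= s
    return total

if __name__ == "__main__":
    scene_element = {"witch": 2, "king": 1}    # terms of a candidate scene element
    content = relaxed_content_score("woman", scene_element)
    tag_sim = 0.85                             # assumed similarity of a relaxed tag name
    print(total_score([content, tag_sim]))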


The conversion of HTML and other formats into XML is based on relatively simple heuristic rules, for example, casting HTML headings into XML element names. For additional automatic annotation we use the information extraction component ANNIE that is part of the GATE System developed at the University of Sheffield [20]. GATE offers various modules for analyzing, extracting, and annotating text; its capabilities range from part-of-speech tagging (e.g., for noun phrases, temporal adverbial phrases, etc.) and lexicon lookups (e.g., for geographic names) to finite state transducers for annotations based on regular expressions (e.g., for dates or currency amounts). One particularly useful and fairly light-weight component is the Gazetteer Module for named entity recognition based on part-of-speech tagging and a large dictionary containing names of cities, countries, person names (e.g., common first names), etc. This way one can automatically generate tags like <person> and <location>. For example, we were able to annotate the popular Wikipedia open encyclopedia corpus this way, generating about 2 million person and location tags. And this is the key for more advanced "semantics-aware" search on the current Web. For example, searching for Web pages about the physicist Max Planck would be phrased as person = "Max Planck", and this would eliminate many spurious matches that a Google-style keyword query "Max Planck" would yield about Max Planck Institutes and the Max Planck Society4. There is a rich body of research on information extraction from Web pages and wrapper generation. This ranges from purely logic-based or pattern-matching-driven approaches (e.g., [51,17,6,30]) to techniques that employ statistical learning (e.g., Hidden Markov Models [15,16,39,57,40]) to infer structure and annotations when there is too much diversity and uncertainty in the underlying data. As long as all pages to be wrapped come from the same data source (with some hidden schema), the logic-based approaches work very well. However, when one tries to wrap all homepages of DBLP authors or the course programs of all computer science departments in the world, uncertainty is inevitable and statistics-driven techniques are the only viable ones (unless one is willing to invest a lot of manual work for traditional schema integration, writing customized wrappers and mappers). Despite advertising our own work and mentioning our competitors, the current research projects on combining IR techniques and statistical learning with XML querying are still in an early stage and there are certainly many open issues and opportunities for further research. These include better theoretical foundations for scoring models on semistructured data, relevance feedback and interactive information search, and, of course, all kinds of efficiency and scalability aspects. Applying XML search techniques to Web data is in its infancy; studying what can be done with named-entity recognition and other automatic annotation techniques and understanding the interplay of queries with such statistics-based techniques for better information organization are widely open fields.

4 Germany's premier scientific society, which encompasses 80 institutes in all fields of science.
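The dictionary-based tagging step described above can be illustrated with a few lines of code. The tiny person and location gazetteers and the tag names below are assumptions chosen for illustration; they do not reproduce GATE/ANNIE's actual modules or dictionaries.

import re

# Minimal illustrative gazetteers; a real system uses large dictionaries plus
# part-of-speech tagging and finite state transducers (as in GATE/ANNIE).
PERSONS = ["Max Planck", "Sophie Cluet"]
LOCATIONS = ["Saarbruecken", "Shanghai", "Paris"]

def annotate(text):
    # Wrap known person and location names in XML-style tags.
    for name in sorted(PERSONS, key=len, reverse=True):
        text = re.sub(re.escape(name), "<person>" + name + "</person>", text)
    for name in sorted(LOCATIONS, key=len, reverse=True):
        text = re.sub(re.escape(name), "<location>" + name + "</location>", text)
    return text

if __name__ == "__main__":
    print(annotate("Max Planck never visited the new campus in Saarbruecken."))
    # -> <person>Max Planck</person> never visited the new campus in
    #    <location>Saarbruecken</location>.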

4 Statistically Quantified Ontologies

The important role of ontologies in making information search more "semantics-aware" has already been emphasized. In contrast to most ongoing efforts for Semantic-Web ontologies, our work has focused on quantifying the strengths of semantic relationships based on corpus statistics [52,59] (see also the related work [10,44,22,36] and further references given there). In contrast to early IR work on using thesauri for query expansion (e.g., [64]), the ontology itself plays a much more prominent role in our approach, with carefully quantified statistical similarities among concepts. Consider a graph of concepts, each characterized by a set of synonyms and, optionally, a short textual description, connected by "typed" edges that represent different kinds of relationships: hypernyms and hyponyms (generalization and specialization, aka. is-a relations), holonyms and meronyms (part-of relations), and is-instance-of relations (e.g., Cinderella being an instance of a fairytale or IBM Thinkpad being a notebook), to name the most important ones.

The first step in building an ontology is to create the nodes and edges. To this end, existing thesauri, lexicons, and other sources like geographic gazetteers (for names of countries, cities, rivers, etc. and their relationships) can be used. In our work we made use of the WordNet thesaurus [24] and the Alexandria Digital Library Gazetteer [3], and also started extracting concepts from page titles and href anchor texts in the Wikipedia encyclopedia. One of the shortcomings of WordNet is its lack of instance knowledge, for example, brand names and models of cars, cameras, computers, etc. To further enhance the ontology, we crawled Web pages with HTML tables and forms, trying to extract relationships between table-header column and form-field names and the values in table cells and the pulldown menus of form fields. Such approaches are described in the literature (see, e.g., [21,63,68]). Our experimental findings confirmed the potential value of these techniques, but also taught us that careful statistical thresholding is needed to eliminate noise and incorrect inferencing, once again a strong argument for the use of statistics.

Once the concepts and relationships of a graph-based ontology are constructed, the next step is to quantify the strengths of semantic relationships based on corpus statistics. To this end we have performed focused Web crawls and use their results to estimate statistical correlations between the characteristic words of related concepts. One of the measures for the similarity of concepts $c_1$ and $c_2$ that we used is the Dice coefficient

$$\mathrm{dice}(c_1, c_2) \;=\; \frac{2 \cdot \mathit{df}(c_1 \wedge c_2)}{\mathit{df}(c_1) + \mathit{df}(c_2)}$$

where $\mathit{df}(c)$ is the number of documents in the corpus that contain concept $c$, and $\mathit{df}(c_1 \wedge c_2)$ is the number of documents that contain both concepts. In this computation we represent concept $c_i$ by the terms taken from its set of synonyms and its short textual description (i.e., the WordNet gloss). Optionally, we can add terms from neighbors or siblings in the ontological graph. A document in the corpus is considered to contain concept $c_i$ if it contains at least one word of the term set for $c_i$, and considered to contain both $c_1$ and $c_2$ if it contains at least one word from each of the two term sets. This is a heuristic; other approaches are conceivable which we are investigating.

Following this methodology, we constructed an ontology service [59] that is accessible via Java RMI or as a SOAP-based Web Service described in WSDL. The service is used in the COMPASS search engine [32], but also in other projects. Figure 2 shows a screenshot from our ontology visualization tool.

Fig. 2. Ontology Visualization

One of the difficulties in quantifying ontological relationships is that we aim to measure correlations between concepts but merely have statistical information about correlations between words. Ideally, we should first map the words in the corpus onto the corresponding concepts, i.e., their correct meanings. This is known as the word sense disambiguation problem in natural language processing [45], obviously a very difficult task because of polysemy. If this were solved it would not only help in deriving more accurate statistical measures for "semantic" similarities among concepts but could also potentially boost the quality of search results and automatic classification of documents into topic directories. Our work [59] presents a simple but scalable approach to automatically mapping text terms onto ontological concepts, in the context of XML document classification. Again, statistical reasoning, in combination with some degree of natural language parsing, is key to tackling this difficult problem.

Ontology construction is a highly relevant research issue. Compared to the ample work on knowledge representations for ontological information, the aspects of how to "populate" an ontology and how to enhance it with quantitative similarity measures have been underrated and deserve more intensive research.
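A minimal sketch of the corpus-based quantification described in this section, assuming a tiny in-memory corpus and hand-picked term sets for each concept (both are illustrative assumptions): a document "contains" a concept if it contains at least one of the concept's terms, and the Dice coefficient is computed over the resulting document counts.

# Dice-coefficient similarity of two concepts over a toy corpus, following
# the document-containment heuristic described above (all data assumed).

def contains(doc_words, concept_terms):
    # A document contains a concept if at least one of the concept's terms occurs.
    return any(term in doc_words for term in concept_terms)

def dice_similarity(corpus, terms1, terms2):
    df1 = df2 = df12 = 0
    for doc in corpus:
        words = set(doc.lower().split())
        c1 = contains(words, terms1)
        c2 = contains(words, terms2)
        df1 += c1
        df2 += c2
        df12 += c1 and c2
    return 2.0 * df12 / (df1 + df2) if df1 + df2 else 0.0

if __name__ == "__main__":
    woman = {"woman", "lady", "female"}              # synonyms/gloss terms (assumed)
    witch = {"witch", "sorceress", "enchantress"}
    siren = {"siren", "temptress", "nymph"}
    corpus = [
        "the old woman greeted the young lady",
        "a witch is a female who practices sorcery",
        "the witch cursed the woman of the house",
        "a siren lured the sailors with her song",
    ]
    print(dice_similarity(corpus, woman, witch))     # relatively high
    print(dice_similarity(corpus, woman, siren))     # zero on this toy corpus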

5 Efficient Top-k Query Processing with Probabilistic Pruning

For ranked retrieval of semistructured, "semantically" annotated data, we face the problem of reconciling efficiency with result quality. Usually, we are not interested in a complete result but only in the top-k results with the highest relevance scores. The state-of-the-art algorithm for top-k queries on multiple index lists, each sorted in descending order of relevance scores, is the Threshold Algorithm, TA for short [23,33,47]. It is applicable to both relational data such as product catalogs and text documents such as Web data. In the latter case, the fact that TA performs random accesses on very long, disk-resident index lists (e.g., all URLs or document ids for a frequently occurring word), with only short prefixes of the lists in memory, makes TA much less attractive, however.


In such a situtation, the TA variant with sorted access only, coined NRA (no random accesses), stream-combine, or TA-sorted in the literature, is the method of choice [23, 34]. TA-sorted works by maintaining lower bounds and upper bounds for the scores of the top-k candidates that are kept in a priority queue in memory while scanning the index lists. The algorithm can safely stop when the lower bound for the score of the rank-k result is at least as high as the highest upper bound for the scores of the candidates that are not among the current top-k. Unfortunately, albeit theoretically instance-optimal for computing a precise top-k result [23], TA-sorted tends to degrade in performance when operating on a large number of index lists. This is exactly the case when we relax query conditions such as ~speaker = ~woman using semantically related concepts from the ontology5. Even if the relaxation uses a threshold for the similarity of related concepts, we may often arrive at query conditions with 20 to 50 search terms. Statistics about the score distributions in the various index lists and some probabilistic reasoning help to overcome this efficiency problem and re-gain performance. In TAsorted a top-k candidate that has already been seen in the index lists in achieving score in list and has unknown scores in the index lists satisfies:

    worstscore(d) := Σ_{i ∈ E(d)} s_i(d)  ≤  s(d)  ≤  worstscore(d) + Σ_{i ∉ E(d)} high_i  =:  bestscore(d)

where s(d) denotes the total, but not yet known, score that d achieves by summing up the scores from all index lists in which d occurs, worstscore(d) and bestscore(d) are the lower and upper bounds of this score, E(d) is the set of lists in which d has already been seen, and high_i is the score that was last seen in the scan of index list i, upper-bounding the score that any candidate may still obtain in list i. A candidate d remains a candidate as long as bestscore(d) > worstscore(d_k), where d_k is the candidate that currently has rank k with regard to the candidates' lower bounds (i.e., the worst one among the current top-k). Assuming that d can achieve a score of high_i in all lists in which it has not yet been encountered is conservative and, almost always, overly conservative.
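A minimal sketch of how this worstscore/bestscore bookkeeping can be maintained during a sorted-only scan is given below; the in-memory index lists, summation scoring, and round-robin scan order are assumptions for illustration, and the code is not the implementation of [23] or [62].

def nra_top_k(index_lists, k, max_depth):
    # index_lists: one (doc_id, score) list per query term, sorted by descending score.
    seen = {}                                      # doc_id -> {list_index: score}
    high = [lst[0][1] for lst in index_lists]      # last score seen per list (high_i)
    ranked = []
    for depth in range(max_depth):
        for i, lst in enumerate(index_lists):
            if depth < len(lst):
                doc, score = lst[depth]
                high[i] = score
                seen.setdefault(doc, {})[i] = score
        bounds = {}
        for doc, scores in seen.items():
            worst = sum(scores.values())           # worstscore(d): known scores only
            best = worst + sum(h for i, h in enumerate(high) if i not in scores)
            bounds[doc] = (worst, best)            # bestscore(d): add high_i for unseen lists
        ranked = sorted(bounds.items(), key=lambda kv: kv[1][0], reverse=True)
        if len(ranked) >= k:
            min_k = ranked[k - 1][1][0]            # worstscore of the current rank-k candidate
            top_ids = {doc for doc, _ in ranked[:k]}
            rest_best = [b for doc, (_, b) in bounds.items() if doc not in top_ids]
            # Safe stop: neither a seen outsider nor a completely unseen document can overtake.
            if (not rest_best or max(rest_best) <= min_k) and sum(high) <= min_k:
                break
    return ranked[:k]

lists = [[("d1", 0.9), ("d2", 0.8), ("d3", 0.1)],
         [("d2", 0.7), ("d3", 0.6), ("d1", 0.2)]]
print(nra_top_k(lists, 2, max_depth=3))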

Rather, we could treat these unknown scores as random variables S_i and estimate the probability that the total score s(d) exceeds worstscore(d_k); d is then discarded from the candidate list if

    P[ worstscore(d) + Σ_{i ∉ E(d)} S_i > worstscore(d_k) ]  ≤  ε

with some pruning threshold ε. This probabilistic interpretation makes some small, but precisely quantifiable, potential error in that it could dismiss some candidates too early. Thus, the top-k result computed this way is only approximate. However, the loss in precision and recall, relative to the exact top-k result using the same index lists, is stochastically bounded, and the bound can be set according to the application's needs. A small value of ε seems to be acceptable in most situations.

(Note that the TA and TA-sorted algorithms can easily be modified to handle both element-name and element-content conditions, as opposed to mere keyword sets in standard IR and Web search engines.)


Technically, the approach requires computing the convolution of the random variables, based on assumed distributions (with parameter fitting) or precomputed histograms for the individual index lists and taking into account the current high_i values, and predicting the tail of the sum's distribution. Details of the underlying mathematics and the implementation techniques for this Prob-sorted method can be found in [62]. Experiments with the TREC-12 .Gov corpus and the IMDB data collection have shown that such a probabilistic top-k method gains about a factor of ten (and sometimes more) in run-time compared to TA-sorted. The outlined algorithm for approximate top-k queries with probabilistic guarantees is a versatile building block for XML ranked retrieval. In combination with ontology-based query relaxation, for example, expanding ~woman into (woman or wife or witch), it can add index lists dynamically and incrementally, rather than having to expand the query upfront based on thresholds. To this end, the algorithm considers the ontological similarity between a concept c from the original query and a concept c' in the relaxed query, and multiplies it with the high value of the index list for c' to obtain an upper bound for the score (and to characterize the score distribution) that a candidate can obtain from the relaxation to c'. This information is dynamically combined with the probabilistic prediction of the other unknown scores and their sum. The algorithm can also be combined with distance-aware path indexes for XML data (e.g., the HOPI index structure [53]). This is required when queries contain element-name and element-content conditions as well as path conditions of the form professor//course, where matches for "course" that are close to matches for "professor" should be ranked higher than matches that are far apart. Thus, the Prob-sorted algorithm covers a large fraction of an XML ranked retrieval engine.
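As a rough illustration of the probabilistic pruning test, the sketch below approximates the sum of the unknown scores with a normal distribution fitted to per-list means and variances; the actual Prob-sorted method in [62] uses convolutions of fitted distributions or histograms, so this is a simplified stand-in with hypothetical numbers.

import math

def normal_tail(x, mean, var):
    # P[X > x] for X ~ Normal(mean, var), via the complementary error function.
    if var <= 0.0:
        return 1.0 if mean > x else 0.0
    return 0.5 * math.erfc((x - mean) / math.sqrt(2.0 * var))

def can_prune(worstscore_d, unseen_lists, min_k, epsilon):
    # unseen_lists: per unseen index list, (mean, variance) of its score
    # distribution, already truncated to scores below the current high_i.
    mean = worstscore_d + sum(m for m, _ in unseen_lists)
    var = sum(v for _, v in unseen_lists)
    # Prune candidate d if it is unlikely to ever exceed the current rank-k score.
    return normal_tail(min_k, mean, var) <= epsilon

# Hypothetical candidate: known partial score 0.4, two unseen lists, rank-k score 1.2
print(can_prune(0.4, [(0.15, 0.01), (0.2, 0.02)], min_k=1.2, epsilon=0.05))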

6 Exploiting Collective Human Input

The statistical information considered so far refers to data (e.g., scores in index lists) or metadata (e.g., ontological similarities). Yet another kind of statistics is information about user behavior. This could include relatively static properties like bookmarks or embedded hyperlinks pointing to high-quality Web pages, but also dynamic properties inferred from query logs and click streams. For example, Google's PageRank views a Web page as more important if it has many incoming links and the sources of these links are themselves high authorities [9, 12]. Technically, this amounts to computing stationary probabilities for a Markov-chain model that mimics a "random surfer". What PageRank essentially does is to exploit the intellectual endorsements that many human users (or Web administrators on behalf of organizations) provide by means of hyperlinks. This rationale can be carried over to analyzing and exploiting entire surf trails and query logs of individual users or an entire user community. These trails, which can be gathered from browser histories, local proxies, or Web servers, capture implicit user judgements. For example, suppose a user clicks on a specific subset of the top 10 results returned by a search engine for a query with several keywords, based on having seen the summaries of these pages. This implicit form of relevance feedback establishes a strong correlation between the query and the clicked-on pages. Further suppose that the user refines a query by adding or replacing keywords, e.g., to eliminate ambiguities in the previous query.


Again, this establishes correlations between the new keywords and the subsequently clicked-on pages, but also, albeit possibly to a lesser extent, between the original query and the eventually relevant pages. We believe that observing and exploiting such user behavior is a key element in adding more "semantic" or "cognitive" quality to a search engine. The literature contains some very interesting work in this direction (e.g., [19, 65, 67]), but it is rather preliminary at this point. Perhaps the difficulty of obtaining comprehensive query logs and surf trails outside of big service providers is a limiting factor in this line of experimental research. Our own, very recent, work generalizes the notion of a "random surfer" into a "random expert user" by enhancing the underlying Markov chain to also incorporate query nodes and transitions from queries to query refinements as well as to clicked-on documents. Transition probabilities are derived from the statistical analysis of query logs and click streams. The resulting Markov chain converges to stationary authority scores that reflect not only the link structure but also the implicit feedback and collective human input of a search engine's users [43]. The de-facto monopoly that large Internet service providers have on being able to observe user behavior and statistically leverage this valuable information may be overcome by building next-generation Web search engines in a truly decentralized and ideally self-organized manner. Consider a peer-to-peer (P2P) system where each peer has a full-fledged Web search engine, including a crawler and an index manager. The crawler may be thematically focused, or crawl results may be postprocessed, so that the local index contents reflect the corresponding user's interest profile. With such a highly specialized and personalized "power search engine", most queries should be executed locally, but once in a while the user may not be satisfied with the local results and would then want to contact other peers. A "good" peer to which the user's query should be forwarded would have thematically relevant index contents, which could be measured by statistical notions of similarity between peers. These measures may be dependent on the current query or may be query-independent; in the latter case, statistics is used to effectively construct a "semantic overlay network" with neighboring peers sharing thematic interests [8, 42, 48, 18, 7, 66]. Both query routing and "statistically semantic" networks could greatly benefit from collective human input in addition to standard IR measures like term and document frequencies or term-wise score distributions: knowing the bookmarks and query logs of thousands of users would be a great resource to build on. Further exploring these considerations on P2P Web search should become a major research avenue in computer science. Note that our interpretation of Web search includes ranked retrieval and thus is fundamentally more difficult than Gnutella-style file sharing or simple key lookups via distributed hash tables. Further note that, although query routing in P2P Web search resembles earlier work on metasearch engines and distributed IR (see, e.g., [46] and the references given there), it is much more challenging because of the large scale and the high dynamics of the envisioned P2P system with thousands or millions of computers and users.
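To ground the Markov-chain view, here is a generic power-iteration sketch over a transition graph whose states may mix pages and query nodes; the tiny graph, the damping factor, and the uniform random-jump model are illustrative assumptions and do not reproduce the specific model of [43].

def stationary_scores(transitions, damping=0.85, iterations=100):
    # transitions: dict mapping each state (page or query node) to a list of
    # successor states; the random walker follows an outgoing edge with
    # probability `damping` and jumps to a uniformly random state otherwise.
    states = list(transitions)
    n = len(states)
    score = {s: 1.0 / n for s in states}
    for _ in range(iterations):
        nxt = {s: (1.0 - damping) / n for s in states}
        for s, succs in transitions.items():
            if not succs:                       # dangling state: spread mass uniformly
                for t in states:
                    nxt[t] += damping * score[s] / n
            else:
                for t in succs:
                    nxt[t] += damping * score[s] / len(succs)
        score = nxt
    return score

# Toy graph mixing pages (p*) and a query node (q1) with click/refinement edges
graph = {"q1": ["p1", "p2"], "p1": ["p2"], "p2": ["p1", "p3"], "p3": ["p1"]}
print(stationary_scores(graph))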

7 Conclusion


With the ongoing information explosion in all areas of business, science, and society, it will be more and more difficult for humans to keep information organized and to extract valuable knowledge in a timely manner. The intellectual time for schema design, schema integration, data cleaning, data quality assurance, manual classification, directory and search result browsing, clever formulation of sophisticated queries, etc. is already the major bottleneck today, and the situation is likely to become worse. In my opinion, this will render all attempts to master Web-scale information in a perfectly consistent, purely logic-based manner more or less futile. Rather, the ability to cope with uncertainty, diversity, and high dynamics will be mandatory. To this end, statistics and their use in probabilistic inferences will be key assets. One may envision a rich probabilistic algebra that encompasses relational or even object-relational and XML query languages, but interprets all data and results in a probabilistic manner and always produces ranked result lists rather than Boolean result sets (or bags). There are certainly some elegant and interesting, but mostly theoretical, approaches along these lines (e.g., [27, 29, 37]). However, there is still a long way to go towards practically viable solutions. Among the key challenges that need to be tackled are customizability, composability, and optimizability.

Customizability: The appropriate notions of ontological relationships, "semantic" similarities, and scoring functions are dependent on the application. Thus, the envisioned framework needs to be highly flexible and adaptable to incorporate application-specific or personalized similarity and scoring models.

Composability: Algebraic building blocks like a top-k operator need to be composable so as to allow the construction of rich queries. The desired property that operators produce ranked lists with some underlying probability (or "score mass") distribution poses a major challenge, for we need to be able to infer these probability distributions for the results of complex operator trees. This problem is related to the difficult issues of selectivity estimation and approximate query processing in a relational database, but it goes beyond the state of the art as it needs to incorporate text term distributions and has to yield full distributions at all levels of operator trees.

Optimizability: Regardless of how elegant a probabilistic query algebra may be, it would not be acceptable unless one can ensure efficient query processing. Performance optimization requires a deep understanding of rewriting complex operator trees into equivalent execution plans that have significantly lower cost (e.g., pushing selections below joins or choosing efficient join orders). At the same time, the top-k querying paradigm, which avoids computing full result sets before applying some ranking, is a must for efficiency, too. This combination of desiderata leads to a great research challenge in query optimization for a ranked retrieval algebra.

References

1. Karl Aberer et al.: Emergent Semantics Principles and Issues, International Conference on Database Systems for Advanced Applications (DASFAA) 2004
2. Mohammad Abolhassani, Norbert Fuhr: Applying the Divergence from Randomness Approach for Content-Only Search in XML Documents, ECIR 2004
3. Alexandria Digital Library Project, Gazetteer Development, http://www.alexandria.ucsb.edu/gazetteer/


4. Shurug Al-Khalifa, Cong Yu, H. V. Jagadish: Querying Structured Text in an XML Database, SIGMOD 2003 5. Sihem Amer-Yahia, Laks V. S. Lakshmanan, Shashank Pandit: FleXPath: Flexible Structure and Full-Text Querying for XML, SIGMOD 2004 6. Arvind Arasu, Hector Garcia-Molina: Extracting Structured Data from Web Pages, SIGMOD 2003 7. Mayank Bawa, Gurmeet Singh Manku, Prabhakar Raghavan: SETS: Search Enhanced by Topic Segmentation, SIGIR 2003 8. Matthias Bender, Sebastian Michel, Gerhard Weikum, Christian Zimmer: Bookmark-driven Query Routing in Peer-to-Peer Web Search, SIGIR Workshop on Peer-to-Peer Information Retrieval 2004 9. Sergey Brin, Lawrence Page: The Anatomy of a Large-Scale Hypertextual Web Search Engine, WWW Conference 1998 10. Alexander Budanitsky, Graeme Hirst: Semantic Distance in WordNet: An Experimental, Application-oriented Evaluation of Five Measures, Workshop on WordNet and Other Lexical Resources 2001 11. David Carmel, Yoëlle S. Maarek, Matan Mandelbrod, Yosi Mass, Aya Soffer: Searching XML Documents via XML Fragments, SIGIR 2003 12. Soumen Chakrabarti: Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann Publishers, 2002 13. T. Chinenyanga, N. Kushmerick: An Expressive and Efficient Language for XML Information Retrieval, Journal of the American Society for Information Science and Technology (JASIST) 53(6), 2002 14. Sara Cohen, Jonathan Mamou, Yaron Kanza, Yehoshua Sagiv: XSEarch: A Semantic Search Engine for XML, VLDB 2003 15. William W. Cohen, Matthew Hurst, Lee S. Jensen: A Flexible Learning System for Wrapping Tables and Lists in HTML Documents, in: A. Antonacopoulos, J. Hu (Editors), Web Document Analysis: Challenges and Opportunities, Word Scientific Publishing, 2004 16. William W. Cohen, Sunita Sarawagi: Exploiting Dictionaries in Named Entity Extraction: Combining Semi-markov Extraction Processes and Data Integration Methods, KDD 2004 17. Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo: RoadRunner: Towards Automatic Data Extraction from Large Web Sites, VLDB 2001 18. Arturo Crespo, Hector Garcia-Molina: Semantic Overlay Networks, Technical Report, Stanford University, 2003. 19. Hang Cui, Ji-Rong Wen, Jian-Yun Nie, Wei-Ying Ma: Query Expansion by Mining User Logs, IEEE Transactions on Knowledge and Data Engineering 15(4), 2003 20. Hamish Cunningham. GATE, a General Architecture for Text Engineering, Computers and the Humanities 36, 2002 21. Hasan Davulcu, Srinivas Vadrevu, Saravanakumar Nagarajan, I. V. Ramakrishnan: OntoMiner: Bootstrapping and Populating Ontologies from Domain-Specific Web Sites, IEEE Intelligent Systems 18(5), 2003 22. Anhai Doan, Jayant Madhavan, Robin Dhamankar, Pedro Domingos, Alon Y. Halevy: Learning to Match Ontologies on the Semantic Web, VLDB Journal 12(4), 2003 23. Ronald Fagin, Amnon Lotem, Moni Naor: Optimal Aggregation Algorithms for Middleware, Journal of Computer and System Sciences 66(4), 2003 24. Christiane Fellbaum (Editor): WordNet: An Electronic Lexical Database, MIT Press, 1998 25. Dieter Fensel, Wolfgang Wahlster, Henry Lieberman, James A. Hendler (Editors): Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential, MIT Press, 2002 26. Norbert Fuhr, Kai Großjohann: XIRQL – An Extension of XQL for Information Retrieval, SIGIR Workshop on XML and Information Retrieval 2000


27. Norbert Fuhr: Probabilistic Datalog: Implementing Logical Information Retrieval for Advanced Applications, Journal of the American Society for Information Science (JASIS) 51(2), 2000 28. Norbert Fuhr, Kai Großjohann: XIRQL: A Query Language for Information Retrieval in XML Documents, SIGIR 2001 29. Lise Getoor, Nir Friedman, Daphne Koller, Avi Pfeffer: Learning Probabilistic Relational Models, in: S. Dzeroski, N. Lavrac (Editors), Relational Data Mining, Springer, 2001 30. Georg Gottlob, Christoph Koch, Robert Baumgartner, Marcus Herzog, Sergio Flesca: The Lixto Data Extraction Project – Back and Forth between Theory and Practice, PODS 2004 31. Torsten Grabs, Hans-Jörg Schek: Flexible Information Retrieval on XML Documents. in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003 32. Jens Graupmann, Michael Biwer, Christian Zimmer, Patrick Zimmer, Matthias Bender, Martin Theobald, Gerhard Weikum: COMPASS: A Concept-based Web Search Engine for HTML, XML, and Deep Web Data, Demo Program, VLDB 2004 33. Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Optimizing Multi-Feature Queries for Image Databases, VLDB 2000 34. Ulrich Güntzer, Wolf-Tilo Balke, Werner Kießling: Towards Efficient Multi-Feature Queries in Heterogeneous Environments, International Symposium on Information Technology (ITCC) 2001 35. Lin Guo, Feng Shao, Chavdar Botev, Jayavel Shanmugasundaram: XRANK: Ranked Keyword Search over XML Documents, SIGMOD 2003 36. Maria Halkidi, Benjamin Nguyen, Iraklis Varlamis, Michalis Vazirgiannis: THESUS: Organizing Web Document Collections Based on Link Semantics, VLDB Journal 12(4), 2003 37. Joseph Y. Halpern: Reasoning about Uncertainty, MIT Press, 2003 38. Raghav Kaushik, Rajasekar Krishnamurthy, Jeffrey F. Naughton, Raghu Ramakrishnan: On the Integration of Structure Indexes and Inverted Lists, SIGMOD 2004 39. Nicholas Kushmerick, Bernd Thomas: Adaptive Information Extraction: Core Technologies for Information Agents. in: M. Klusch et al. (Editors), Intelligent Information Agents, Springer, 2003 40. Kristina Lerman, Lise Getoor, Steven Minton, Craig A. Knoblock: Using the Structure of Web Sites for Automatic Segmentation of Tables, SIGMOD 2004 41. Zhenyu Liu, Chang Luo, Junghoo Cho, Wesley W. Chu: A Probabilistic Approach to Metasearching with Adaptive Probing, ICDE 2004 42. Jie Lu, James P. Callan: Content-based Retrieval in Hybrid Peer-to-peer Networks, CIKM 2003 43. Julia Luxenburger, Gerhard Weikum: Query-log Based Authority Analysis for Web Information Search, submitted for publication 44. Alexander Maedche, Steffen Staab: Learning Ontologies for the Semantic Web, International Workshop on the Semantic Web (SemWeb) 2001 45. Christopher D. Manning, Hinrich Schütze: Foundations of Statistical Natural Language Processing, MIT Press, 1999 46. Weiyi Meng, Clement T. Yu, King-Lup Liu: Building Efficient and Effective Metasearch Engines, ACM Computing Surveys 34(1), 2002 47. Surya Nepal, M. V. Ramakrishna: Query Processing Issues in Image (Multimedia) Databases, ICDE 1999 48. Henrik Nottelmann, Norbert Fuhr: Combining CORI and the Decision-Theoretic Approach for Advanced Resource Selection, ECIR 2004 49. Erhard Rahm, Philip A. Bernstein: A Survey of Approaches to Automatic Schema Matching, VLDB Journal 10(4), 2001 50. Stuart J. Russell, Peter Norvig: Artificial Intelligence - A Modern Approach, Prentice Hall, 2002


51. Arnaud Sahuguet, Fabien Azavant: Building Light-weight Wrappers for Legacy Web Datasources using W4F, VLDB 1999 52. Ralf Schenkel, Anja Theobald, Gerhard Weikum: Ontology-Enabled XML Search. in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003 53. Ralf Schenkel, Anja Theobald, Gerhard Weikum: An Efficient Connection Index for Complex XML Document Collections, EDBT 2004 54. Torsten Schlieder, Holger Meuss: Querying and Ranking XML Documents, Journal of the American Society for Information Science and Technology (JASIST) 53(6), 2002 55. Torsten Schlieder, Holger Meuss: Result Ranking for Structured Queries against XML Documents, DELOS Workshop: Information Seeking, Searching and Querying in Digital Libraries, 2000 56. Torsten Schlieder, Felix Naumann: Approximate Tree Embedding for Querying XML Data, SIGIR Workshop on XML and Information Retrieval, 2000 57. Marios Skounakis, Mark Craven, Soumya Ray: Hierarchical Hidden Markov Models for Information Extraction, IJCAI 2003 58. Steffen Staab, Rudi Studer (Editors): Handbook on Ontologies, Springer 2004 59. Martin Theobald, Ralf Schenkel, Gerhard Weikum: Exploiting Structure, Annotation, and Ontological Knowledge for Automatic Classification of XML Data, International Workshop on Web and Databases (WebDB) 2003 60. Anja Theobald, Gerhard Weikum: Adding Relevance to XML. International Workshop on Web and Databases (WebDB) 2000, extended version in: LNCS 1997, Springer, 2001. 61. Anja Theobald, Gerhard Weikum: The Index-based XXL Search Engine for Querying XML Data with Relevance Ranking, EDBT 2002 62. Martin Theobald, Gerhard Weikum, Ralf Schenkel: Top-k Query Evaluation with Probabilistic Guarantees, VLDB 2004 63. Yuri A. Tijerino, David W. Embley, Deryle W. Lonsdale, George Nagy: Ontology Generation from Tables, WISE 2003 64. Ellen M. Voorhees: Query Expansion Using Lexical-Semantic Relations. SIGIR 1994 65. Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang: Query Clustering Using User Logs, ACM TOIS 20(1), 2002 66. Linhao Xu, Chenyun Dai, Wenyuan Cai, Shuigeng Zhou, Aoying Zhou: Towards Adaptive Probabilistic Search in Unstructured P2P Systems. Asia-Pacific Web Conference (APWeb) 2004 67. Gui-Rong Xue, Hua-Jun Zeng, Zheng Chen, Wei-Ying Ma, Hong-Jiang Zhang, Chao-Jun Lu: Implicit Link Analysis for Small Web Search, SIGIR 2003 68. Shipeng Yu, Deng Cai, Ji-Rong Wen, Wei-Ying Ma: Improving Pseudo-Relevance Feedback in Web Information Retrieval Using Web Page Segmentation, WWW Conference 2003 69. Pavel Zezula, Giuseppe Amato, Fausto Rabitti: Processing XML Queries with Tree Signatures. in: H. Blanken et al. (Editors), Intelligent Search on XML Data, Springer, 2003

The Application and Prospect of Business Intelligence in Metallurgical Manufacturing Enterprises in China

Xiao Ji, Hengjie Wang, Haidong Tang, Dabin Hu, and Jiansheng Feng
Data Strategies Dept., Shanghai Baosight Software Co., Ltd., Shanghai 201203, China

Abstract. This paper introduces the application of Business Intelligence (BI) technologies in metallurgical manufacturing enterprises in China. It sets forth the development procedure and successful cases of BI in Shanghai Baoshan Iron & Steel Co., Ltd. (Shanghai Baosteel in short), and puts forward a methodology suited to the construction of BI systems in metallurgical manufacturing enterprises in China. Finally, it looks ahead to the next generation of BI technologies in Shanghai Baosteel. It should be mentioned as well that it is the Data Strategies Dept. of Shanghai Baosight Software Co., Ltd. (Shanghai Baosight in short) and the Technology Center of Shanghai Baoshan Iron & Steel Co., Ltd. that support and conduct research on BI solutions in Shanghai Baosteel.

1 Introduction

1.1 The Application of BI Technologies in Metallurgical Manufacturing Enterprises in the World

The executives of enterprises are sometimes totally at a loss when faced with the explosively increasing data from various kinds of application systems at different levels, such as MES, ERP, CRM, SCM, etc. Statistics show that the amount of data doubles within eighteen months. But among all these data, how much do we really need, and how much can we really use for further analysis? The main advantage of BI technologies is to discover and turn these massive data into useful information for enterprise decision-making. The research and application of BI have become a hot topic in the global IT area since the term BI was first brought forward by Howard Dresner of the Gartner Group in 1989. Through our years of practice, we have come to consider BI a concept rather than an information technology. It is a business concept for solving problems in enterprise production, operation, management, etc.


Taking the enterprise data warehouse as a basis, BI technologies use professional knowledge and special data mining technologies to disclose the key factors in solving business problems, and to assist operational management and decision-making. As the most advanced metallurgical manufacturing enterprise in China, Shanghai Baosteel began to use BI technologies to solve key problems in daily production and management in the last decade. It has applied BI technologies such as data analysis and data mining, both spontaneously and deliberately, since the development of the Iron Ore Mixing Solution in 1995, and thereafter in the quality control system, SPC, IPC, and finally today's large-scale enterprise data warehouse. In the meantime, Shanghai Baosight has formed its own characteristics in applying BI to metallurgical manufacturing enterprises, especially in the quality control area. In addition, Shanghai Baosight has cultivated an experienced professional team in system development and project management. The following are some achievements in specific areas.

Data Warehouse: Considering its size, complexity, and technical level, the Shanghai Baosteel enterprise data warehouse is a rare and advanced system in China. As a successful BI case, this data warehouse system has become a model in metallurgical manufacturing today.

Quality Control and Analysis: In this area, many advanced data mining techniques have been widely applied for quality improvement, and they can be extended to other manufacturing enterprises as well.

SPC and IPC: As the basis of quality control, SPC and IPC systems with special characteristics are commonly used in Shanghai Baosteel. Of course, they are suited to other manufacturing enterprises too.

The achievements in the above three areas show that Shanghai Baosteel is leading in BI application among metallurgical manufacturing enterprises in China. With experience transfer, other metallurgical manufacturing enterprises will follow in Shanghai Baosteel's footsteps, and Shanghai Baosight will go further in the related BI application areas. Compared with international industry peers such as POSCO and the United States Steel Corporation (UEC), Shanghai Baosteel is also among the top in BI application. UEC once invited Shanghai Baosteel to introduce its experience in building a metallurgical manufacturing enterprise.

1.2 The Information System Development of Shanghai Baosteel

Shanghai Baosteel is the largest and most modernized iron and steel complex in China. Baosteel has established its status as a world steel-making giant with comprehensive advantages in reputation, talent, innovation, management, and technology. According to the publication "Guide to the World Steel Industry", Shanghai Baosteel ranks among the top three most competitive steel-makers worldwide and is also believed to be the most potentially competitive iron and steel enterprise for the future. Shanghai Baosteel specializes in producing high-tech and high-value-added steel products.


It has become the main steel supplier to the automobile, household appliance, container, oil and natural gas exploration, and pressure vessel industries in China. Meanwhile, Shanghai Baosteel exports its products to over forty countries and regions, including Japan, South Korea, and countries in Europe and America. All the facilities that the company possesses are based on the advanced technologies of contemporary steel smelting, cold and hot processing, hydraulic sensing, electronic control, and computer and information communications. They feature large scale, continuity, and automation, and are kept at the most advanced technological level in the world. Shanghai Baosteel possesses tremendous research and development strength. It has made great efforts in developing new technologies, new products, and new equipment, and has accumulated a vigorous driving force for the company's further development. Shanghai Baosteel is located in Shanghai, China. Its first-phase construction project began on 23 December 1978 and was completed and put into production on 15 September 1985. Its second-phase project went into operation in June 1991, and its third-phase project was completed before the end of 2000. Shanghai Baosteel officially became a stock company on 3 February 2000 and was successfully listed on the Shanghai Stock Exchange on 12 December of the same year. In the early days, when Shanghai Baosteel was being set up in 1978, the sponsors considered that they should build computer systems to assist management. They realized they should import the most advanced equipment, techniques, and management practices of the time from Japan, and take some factories of Nippon Steel as models. In May 1981, at the urging of the minister of the Ministry of Metallurgy and Manufacturing, Shanghai Baosteel finished the "Feasibility Research of the Synthetic Computer System" and proposed to build the Shanghai Baosteel information system with a five-level computer structure by setting up four area-control computer systems between the L3 systems and the central management information system. On 15 February 1996, Shanghai Baosteel contracted with IBM to import the advanced IBM 9672 computer system from the US as the area-level management information system for the hot and cold rolling areas in the phase-three project, replacing the phase-two arrangement in which there were two separate management systems for the hot rolling and cold rolling areas. The decision was a revolution in information system construction at Shanghai Baosteel. Subsequently, the executives of Shanghai Baosteel decided to build a comprehensive information system using the IBM 9672 to integrate the distributed information systems. They then cancelled the fifth-level management information system, and the new system was put into production in March 1998, ensuring the proper production of the 1580 hot rolling mill, the 1420 cold rolling mill, and the subsequent second steel-making system. In May 2001, Shanghai Baosteel put forward the new strategic concept of Enterprise System Innovation (ESI). The ESI system included a three-level architecture:


first, to rebuild the business processes of Shanghai Baosteel to bring up new, effective ones; second, to reconstruct the organizational structure on the basis of the new business processes; and third, to build corresponding information systems to help realize the new business processes. The main objective of the ESI system is to help Shanghai Baosteel realize its business targets, to be a good competitor among steel enterprises, and to prepare for the overall challenges after China becomes a member of the WTO. The ESI decision was made with foresight by the executives of Shanghai Baosteel, to help Baosteel realize modernized management and become one of the Global 500 companies in the world. Shanghai Baosteel has now successfully finished its third-phase information system development. In the first-phase project, several process control systems, a self-developed central management system (IBM 4341) with batch processing, and PC networks were set up. In the second-phase project, process control systems and product control systems, an imported-technology-based management information system (IBM 4381) for the 2050 hot rolling mill, a self-developed management information system (IBM RS6000) for the 2030 cold rolling mill, an iron-making regional management information system, and a steel-making regional management information system were built. In the third-phase project, better-configured process control systems, production control systems for the 1580 hot rolling mill and the 1420 and 1550 cold rolling mills, enterprise-wide OA and human resource management systems, and an ERP system, which included an integrated production and sales system and an equipment management system, were successfully developed. After the three phases of project construction, Shanghai Baosteel had formed its four-level production computer system. In recent years, with the ESI concept, many supporting information systems have been set up as well, such as an integrated equipment maintenance management system, data warehouse and data mining applications, an information services system for mills and departments, the e-business platform BSTEEL.COM online, and Supply Chain Management, etc. The architecture of Shanghai Baosteel's information system is illustrated in Fig. 1.

Fig. 1. Information Architecture of Shanghai Baosteel

2 Application of Business Intelligence in Shanghai Baosteel

2.1 The Application and Methodology of Business Intelligence in Shanghai Baosteel

As one of the most advanced metallurgical manufacturing enterprises in China, Shanghai Baosteel is now in an age of rapid development. In order to continuously reduce costs and improve competitiveness in the international and domestic markets, its executives strongly realize the importance of the following: speeding up logistics turnover and improving product turnover; stabilizing and improving product quality; promoting sales and related capabilities to expand market share; strengthening the cost and finance infrastructure; and optimizing the allocation of enterprise resources so as to best satisfy market requirements. In order to achieve these objectives, the requirement to build an enterprise data warehouse system was raised. To support the strategy of Shanghai Baosteel's information development, the data warehouse system should help Shanghai Baosteel organize every kind of data required by the enterprise analysts and deliver all needed information to end users. Shanghai Baosteel and Shanghai Baosight then started to evaluate and plan the data warehouse system. The evaluation assessed the current enterprise infrastructure and the operational environment of Shanghai Baosteel. Given the high level of information development, the data warehouse system could be built, and the first data warehouse subject area for Shanghai Baosteel - the technique and quality management data mart - was planned. Currently Shanghai Baosteel runs the enterprise data warehouse system on two IBM S85 machines, with the ERP system as the major data source. This data warehouse system includes ODS data stores and integrated subject data stores built according to the "Quick Data Warehouse Building" methodology. The first quality management data mart has accumulated much experience, and it includes decision-support information about the related products and their quality management. By now the system comprises the enterprise statistics management data mart, the technique and quality management data mart, the sales and marketing management data mart, the production management data mart, the equipment management data mart, and the finance and cost data mart (which includes planning values, metal balancing, cost analysis, BUPC, and finance analysis), as well as the production administration information system, the enterprise guidelines system, and manufacturing mill area analysis (which covers steel-making, hot rolling, cold rolling, etc.). The amount of data currently in the system is around 2 TB; the ETL task deals with about 3 GB of data every day, of which about 1 GB is newly appended. In addition, nearly 1700 static analytical reports are produced each day, and about 1600 kinds of dynamic queries are provided at the same time.


At the same time, through many years of practice and research, Shanghai Baosight has abstracted a set of effective business intelligence solutions for the manufacturing industry. The solution is significant for product design, quality management, and cost management in metallurgical manufacturing enterprises. Typically, the implementation of business intelligence for a metallurgical manufacturer consists of the following six phases, which offer a logical segmentation of the work and checkpoints on whether the project is progressing steadily. The following flow chart illustrates the overview and workflow of the development phases of this methodology.

Fig. 2. The Methodology of the BI construction

1. Assessment: In this phase the users' current situation and conditions are studied, since these factors will strongly affect the data warehouse solution. The target of this phase is to analyze the users' problems and the methods to resolve them. The initial assessment should identify and clarify the targets, and the requirements for further research to clarify those targets. This kind of assessment results in a decision to start, delay, or cancel the project.

2. Requirements investigation: In this phase, the project group gathers the high-level requirements regarding operations and information technology (IT), and collects the information required by the departments' targets. The result of this phase is a report that identifies the business purpose, meaning, information requirements, and user interfaces. These requirements are also used in other phases of the project and in the design of the data warehouse. In addition, the enterprise-level topic data model and data warehouse subjects are completed in this phase.


3. Design: For the selected subject, the project group focuses on collecting the detailed information requirements and designing the scheme of the data platform, including data, process, and application modeling. In this phase, many methods of collecting information and testing are used, such as data modeling, process modeling, meetings, and prototype presentations. The project group evaluates the technology scheme, the business requirements, and the information requirements. The gap between the existing IT landscape and the required IT scheme is often considerable, so it is advised that an appropriate data warehouse design and scheme be applied.

4. Construction: This phase includes creating the physical databases, data gathering, application testing, and code review. The manager of the data warehouse and the end-user leader should get to know the system well. After successful testing, the data platform can be used.

5. Deployment and maintenance: In this phase, the data warehouse and BI system are rolled out to business users. At the same time, user training should start. After deployment, maintenance and user feedback should be attended to.

6. Summary: In this phase, the whole project is evaluated in three steps: first, summing up the successes and the lessons learned; second, checking whether the configuration was realized as expected and changing plans if needed; and third, evaluating the influence and the benefit to the company.

2.2 Successful Cases of Shanghai Baosteel's BI Application

Shanghai Baosteel's BI involves not only knowledge of data warehousing, mathematics and statistics, and data mining and knowledge discovery, but also professional knowledge of metallurgy, automatic control, management, etc. These are the main characteristics of Shanghai Baosteel's BI application, and there have been many successful cases in Shanghai Baosteel to date.

The Production Administration System Based on the Data Warehouse. As a metallurgical manufacturing enterprise, Shanghai Baosteel requires rational production and proper administration. In order to report the latest production status to high-level executives and get the latest guidance from top managers, managers from all mills and functional departments must take part in the morning conference, which is presided over by the general manager assistant or the vice general manager. Before the data warehouse system was built, the corresponding staff had to go to the production administration center every day, and all conference information was organized by a FoxPro system with manual input, with the data coming mainly from the phone and the ERP system. The new production administration system takes full advantage of the enterprise data warehouse system. Based on the production information collected by the data warehouse, the system can automatically organize on the Web the daily production administration information, material flow charts, quality analysis results, etc., to support daily production administration and the routine executives' morning conference.


Now, with the online meeting system on the Baosteel intranet, managers can even take part in the conference and get all kinds of information from their offices. Since the system was put into production, it has won a good reputation.

Integrated Process Control Systems. Quality is the life of the enterprise. In order to meet fierce market competition, continuous improvement of quality control is needed. The IPC systems have realized quality improvement during production at the lowest cost, and they form the core capability of Shanghai Baosteel's QA management know-how. As a supporting analysis system, an IPC system strengthens the quality manager's control abilities during production processes, advances the technical staff's statistical and analytical abilities, and provides more accurate, convenient, and institutionalized approaches for operators to inspect products. These systems integrate highly visual functions and multi-layered data mining functions in a subtle way.

The Quality Data Mining System. The quality data mart was the first BI system that brought benefits to Shanghai Baosteel, and it plays an increasingly important role in daily management. On the one hand, it provides daily reports and analysis functions such as online quality analysis, capability change analysis, quality exception analysis, finished product quality analysis, quality cost statistics, index data maintenance, integrated analysis, KIV-KOV modeling, and so on. On the other hand, the data mart supports quality data mining well. A quality data mining system based on a data warehouse is a strong aid to the metallurgical industry. There have been many cases of data mining and knowledge discovery, such as reducing the sampling of steel ST12, improving the bend strength of the hot-dip galvanized products of steel ST06Z, and knowledge-based material calculation design. In the case of reducing the sampling of steel ST12, the original specification required sampling at both the head and the tail, which cost much manpower and equipment. After analysis of some key indexes such as bend strength, tensile strength, etc.,

Fig. 3. The Production Administration System of Shanghai Baosteel.

Fig. 4. The Web Page of the Storage Presentation.


similar analysis results were found for the head sampling and the tail sampling, with the tail sampling being slightly worse. After review by experts and testing in practice, after April 2004 Shanghai Baosteel released a new operational specification to test only the tail sample for steel ST12. As a result, it reduces costs by RMB 2.60 million annually.

The Iron Ore Mixing System. Raw material mixing is one of the most important jobs at the beginning of steel-making. Shanghai Baosteel once faced many problems, such as: how to evaluate a new ore that was not listed in the original ore mixing scheme? Which sintering ore most affects the final quality? Is there one scheme that can fit all the different needs? Can we improve the quality of the sinter while at the same time reducing its cost? Data mining in the Iron Ore Mixing System aims to find ways to meet the needs of all kinds of sintering ores. The system forecasts sinter quality through modeling, supports low-cost mixing methods, creates an iron ore mixing knowledge database, and provides a friendly user interface. The data mining for iron ore mixing proceeds in four steps: data preparation, iron ore evaluation with clustering analysis, modeling with neural networks, and optimization (a simplified, purely illustrative sketch of such a pipeline follows below). The evaluation results from the system are almost the same as those from experts, and the forecasting accuracy reaches above 85%.

A Defect Diagnosis Expert System. Defect diagnosis is an important basis of reliability engineering and an important component and key technology of total quality control. Computer-aided defect diagnosis can reduce and prevent the same defects from occurring repeatedly; it can also help provide information for decision-making. The system is built from experiments and the massive data recorded by technicians after real incidents happen. It was developed with computer technologies, statistical analysis, data mining technologies, and artificial intelligence, and it consists of data storage, statistical analysis, a knowledge repository, and defect diagnosis. The system contains both highly generalized visualization functions and multi-layered data mining functions.
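The following sketch of the clustering-plus-forecasting pipeline mentioned above uses scikit-learn; the feature names, the tiny data set, and the model settings are hypothetical illustrations and do not describe Baosteel's actual system or data.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical ore features: [Fe content %, SiO2 %, moisture %] and a sinter quality index
ores = np.array([[62.1, 4.5, 8.0], [61.5, 4.8, 8.5], [58.0, 6.2, 9.1],
                 [57.5, 6.5, 9.3], [64.0, 3.9, 7.2], [63.2, 4.1, 7.5]])
quality = np.array([0.82, 0.80, 0.65, 0.63, 0.90, 0.88])

scaler = StandardScaler()
X = scaler.fit_transform(ores)

# Step 2: evaluate ores by grouping them into clusters of similar composition
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 3: forecast sinter quality from ore composition with a small neural network
model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0).fit(X, quality)

new_ore = scaler.transform([[60.0, 5.5, 8.8]])     # a new ore to be evaluated
print("cluster labels:", clusters)
print("forecast quality:", model.predict(new_ore))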

Fig. 5. The Quality Control of IPC.

Fig. 6. Improvement of the Bend Strength of the Steel ST06Z Products.


3 The Next Generation of Business Intelligence in Shanghai Baosteel

Shanghai Baosteel is the main body of the Shanghai Baosteel Group. As the Shanghai Baosteel Group entered the Fortune Global 500 in 2003, the application of BI in Shanghai Baosteel will be strengthened and developed further. The following are the tasks to perform.

3.1 Carrying Out the Application at the Department Level

Shanghai Baosteel will persist in developing its own characteristics of BI, will take quality control and synthetic reports as its main goal, and will extend the combination of IPC, the data warehouse, and data mining. Quality control is an everlasting subject in manufacturing and a lasting market in which product design and development should be strengthened. Nowadays enterprises focus their strategies particularly on process improvement, in response to both continually rising client requirements and a drastically competitive market. In industrial manufacturing, especially metallurgical manufacturing, there are many factors that cause quality problems, such as equipment failure, staff carelessness, abnormal parameters, raw material differences, and fluctuating settings. Especially in large steel enterprises with complicated business and technical flows, "timely finding and forecasting of exceptions, prompt control, and quality analysis" is a necessity. Therefore, based on the six-sigma notion of quality control, applications based on data warehouse technologies, together with process control, fuzzy control, neural networks, expert systems, and data mining, can be applied to complicated working procedures such as blast furnace, iron-making, steel-making, continuous casting, and steel rolling. This is certainly the road to developing BI further at the department level.

Fig. 7. The Forecast of the RDI in the Iron Ore.

Fig. 8. The Diagnosis Expert System of the Steel Tube.

3.2 Strengthening Research on the Application of BI at the Enterprise Level

At the enterprise level there are many requirements, which can lead to a data-warehouse-based EIS, a Key Performance Indexes (KPI) system, etc. A KPI is a measurable management target by which the key parameters of an internal organizational flow's input and output can be set, sampled, calculated, and analyzed. It is a tool that can decompose the organizational strategic goal into operational tasks, and it is the basis of organizational performance management. It can assign definite responsibilities to a department manager and extend them to the staff in the department. Thus, building a definite and credible KPI system is the key to good performance management.

3.3 Following the Technical Tide of BI and Applying New Technologies in Industry

BI is a subject that overlaps many disciplines. Shanghai Baosteel and Shanghai Baosight are actively following the technical tide of BI and researching new BI techniques for metallurgical manufacturing in fields such as stream data management, text (message) data mining, KPI practice in manufacturing, position-based customized information, knowledge management, etc.

Stream data management: Data from the L3 system (the production control system) has the characteristics of stream data, so stream data management techniques can be applied when IPC systems need to analyze and mine data on the production line in a timely manner.

Text (message) data mining: Data communication between the ERP system and the other information systems of Shanghai Baosteel is implemented by messages. All the messages have been extracted and loaded into the data warehouse system, so how to use text mining techniques to analyze and resolve exceptions quickly will be a new challenge.

The practice of KPI in manufacturing, position-based customized information, and knowledge management are new subjects and trends for extending BI application in metallurgical manufacturing. Meanwhile, Shanghai Baosight and the Technology Center of Shanghai Baosteel are taking full advantage of their previous experience to develop data mining tools with independent intellectual property rights. Practical Miner from the Technology Center has been popular in Shanghai Baosteel for years, while Shanghai Baosight is developing a data mining tool according to the CWM 1.1 and CORBA standards, which is expected to be released early in 2005.

4 Conclusions

Shanghai Baosteel leads the Chinese metallurgical manufacturing industry, and it is a leader in BI application as well. Through many years of application and practice, it has benefited much from BI, and it will pursue even more ambitious BI goals in the near future.



Conceptual Modelling – What and Why in Current Practice

Islay Davies (1), Peter Green (2), Michael Rosemann (1), and Stan Gallo (2)

(1) Centre for Information Technology Innovation, Queensland University of Technology, Brisbane, Australia
{ig.davies,m.rosemann}@qut.edu.au

(2) UQ Business School, University of Queensland, Ipswich, Australia
{p.green,s.gallo}@uq.edu.au

Abstract. Much research has been devoted over the years to investigating and advancing the techniques and tools used by analysts when they model. In contrast to what academics, software providers and their resellers promote as what should be happening, the aim of this research was to determine whether practitioners still embraced conceptual modelling seriously. In addition, what are the most popular techniques and tools used for conceptual modelling? What are the major purposes for which conceptual modelling is used? The study found that the top six most frequently used modelling techniques and methods were ER diagramming, data flow diagramming, systems flowcharting, workflow modelling, RAD, and UML. However, the primary contribution of this study was the identification of the factors that uniquely influence the continued-use decision of analysts, viz., communication (using diagrams) to/from stakeholders, (lack of) internal knowledge of techniques, user expectations management, understanding of the models' integration into the business, and tool/software deficiencies.

1 Introduction

The areas of business systems analysis, requirements analysis, and conceptual modelling are well-established research directions in academic circles. Comprehensive analytical work has been conducted on topics such as data modelling, process modelling, meta modelling, model quality, and the like. A range of frameworks and categorisations of modelling techniques have been proposed (e.g. [6, 9]). However, they mostly lack an empirical foundation. Thus, it is difficult to provide solid statements on the importance and potential impact of related research on the actual practice of conceptual modelling. More recently, Wand and Weber [13, p. 364] assume "the importance of conceptual modelling" and they state "Practitioners report that conceptual modelling is difficult and that it often falls into disuse within their organizations." Unfortunately, anecdotal feedback to us from information systems (IS) practitioners largely confirmed the assertion of Wand and Weber [13]. Accordingly, as researchers involved in attempting to advance the theory of conceptual modelling in organisations, we were concerned to determine that practitioners still found conceptual modelling useful and that they were indeed still performing conceptual modelling as part of their business systems analysis processes.


Moreover, if practitioners still found modelling useful, why did they find it useful, and what were the major factors that inhibited the wider use of modelling in their projects? In this way, the research that we were performing would be relevant for the practice of information systems development (see the IS Relevance debate on ISWorld, February 2001). Hence, the research in this paper is motivated in several ways. First, we want to obtain empirical data that conceptual modelling is indeed being performed in IS practice in Australia. Such data will give overall assurance of the practical relevance of the research that we perform in conceptual modelling. Second, we want to find out the principal tools, techniques, and purposes for which conceptual modelling is currently performed in Australia. In this way, researchers can obtain valuable information to help them direct their research towards aspects of conceptual modelling that contribute most to practice. Finally, we were motivated to perform this study so that we could gather and analyse data on major problems and benefits unique to the task of conceptual modelling in practice. So, this research aims to provide current insights into actual modelling practice. The underlying research question is "Do practitioners actually use conceptual modelling in practice?" The derived and more detailed questions are: What are popular tools and techniques used for conceptual modelling in Australia? What are the purposes of modelling? What are major problems and benefits unique to modelling? In order to provide answers to these questions, an empirical study using a web-based questionnaire was designed. The goal was to determine what modelling practices are being used in business, as opposed to what academics, software providers and their resellers believe should be used. In summary, we found that the current state of business systems/conceptual modelling usage in Australia is as follows: ER diagramming, data flow diagramming, systems flowcharting, and workflow modelling are most frequently used, for database design and management, software development, and documenting and improving business processes. Moreover, this modelling work is supported in most cases by the use of Visio (in some version) as an automated tool. Furthermore, the planned use of modelling techniques and tools in the short-term future appears set to decline significantly compared to current usage levels. The remainder of the paper unfolds in the following manner. The next section reviews the related work in terms of empirical data in relation to conceptual modelling practice. The third section explains briefly the instrument and methodology used. Then, an overview of the quantitative results of the survey is given. The fifth section presents succinctly the results of the analysis of the textual data on the problems and benefits of modelling. The last section concludes and gives an indication of further work planned.

2 Related Work

Over the years, much work has been done on how to do modelling – the quality, correctness, completeness, goodness of representation, understandability, differences between novice and expert modellers, and many other aspects (e.g., [7]). Comparatively little empirical work, however, has been undertaken on modelling in practice. Floyd [3] and Necco et al. [8] conducted comprehensive empirical work into the use of modelling techniques in practice, but that work is now considerably dated. Batra and Marakas [1] attempted to address this problem of a lack of current empirical evidence.


However, their work focused on comparing the perspectives of the academic and practitioner communities regarding the applications of conceptual data modelling. Indeed, these authors simply reviewed the academic and practitioner literatures without actually collecting primary data on the issue. Moreover, their work is now dated. However, it is interesting that they (p. 189) observe that “there is a general lack of any substantive evidence, anecdotal or empirical, to suggest that the concepts are being widely used in the applied design environment.” Batra and Marakas [1, p. 190] state that “Researchers have not attempted to conduct case or field studies to gauge the cost-benefits of enterprise-wide conceptual data modelling (CDM).” This research has attempted to address the problems alluded to by Batra and Marakas [1]. Iivari [4] provided some data on these questions in a Finnish study of the perceptions of the effectiveness of CASE tools. However, he found the adoption rate of CASE tools by developers in organisations to be very low (and presumably the extent of conceptual modelling to be low as well). More recently, Persson and Stirna [10] noted the problem; however, their work was limited in that it was only an exploratory study into practice. Most recently, Chang et al. [2] conducted 11 interviews with experienced consultants in order to explore the perceived advantages and disadvantages of business process modelling. This descriptive study did not, however, investigate the critical success factors of process modelling. Sedera et al. [11] have conducted three case studies to determine a process modelling success model; however, they have not yet reported on a planned empirical study to test this model. Furthermore, the studies by Chang et al. [2] and Sedera et al. [11] are limited to the area of process modelling.

3 Methodology

This study was conducted in the form of a web-based survey issued with the assistance of the Australian Computer Society (ACS) to its members. The survey consisted of seven pages (a copy of the survey pages is available from the authors on request). The first page explained the objectives of our study. It also highlighted the available incentive, i.e., free participation in one of five workshops on business process modelling. The second page asked for the purpose of the modelling activities. In total, 17 purposes (e.g., database design and management, software development) were made available. The respondents were asked to evaluate the relevance of each of these purposes using a five-point Likert scale ranging from 1 (not relevant) to 5 (highly relevant). The third page asked for the modelling techniques used by the respondent (‘technique’ here is used as an umbrella term referring to the constructs of the technique, their rules of construction, and the heuristics and guidelines for refinement). It provided a list of 18 different modelling techniques ranging from data flow diagrams and ER diagrams, to the various IDEF standards, up to UML. For each modelling technique, the participants had to provide information about the past, current and future use of the technique. It was possible to differentiate between infrequent and frequent use. Furthermore, participants could indicate whether they knew the technique or did not use it at all. It was also possible to add further modelling techniques that they used. The fourth page was related to the modelling tools. Following the same structure as for the modelling techniques, a list of 24 modelling tools was provided. A hyperlink provided a reference to the homepage of each tool listed. It was also clarified if a tool had previously been known under a different name (e.g., Designer2000 for the Oracle9i Developer Suite).


The fifth page explored qualitative issues. Participants were asked to list major problems and issues they had experienced with modelling, as well as perceived key success factors. On the sixth page, demographic data was collected. This data included person type (practitioner, academic or student), years of experience in business systems analysis and modelling, working area (business or IT), training in modelling, and the size of the organisation. The seventh page allowed contact details to be entered for the summarised results of the study and the free workshop. The instrument was piloted with 25 members of two research centres as well as with a selected group of practitioners. Minor changes were made based on the experiences within this pilot. A major contribution of this paper is an examination of the data gathered through the fifth page of the survey. This section of the survey asked respondents to list critical success factors for them in the use of conceptual modelling, and problems or issues they encountered in successfully undertaking modelling in their organisations. The phenomenon that responses to these questions allowed us to investigate was why analysts continue (or discontinue) to use a technical method, implemented using a technological tool – in this case, conceptual modelling. To analyse this phenomenon, we used the following procedure:
1. Which responses confirm the factors we already know about in regard to this phenomenon; and
2. Which responses identify new factors that are unique to the domain of conceptual modelling?
To achieve step 1, we performed a review of the current thinking and literature in the areas of adoption and continued use of a technology. Then, using NVivo 2, one researcher classified the textual comments, where relevant, according to these known factors. This researcher’s classification was then reviewed and confirmed with a second researcher. The factors identified from the literature and used in this first phase of the process are summarised and defined in Table 1. After step 1, there remained responses that did not readily fit into one or other of the known factor categories. These unclassified responses had the potential to provide us with insight into factors unique and important to the domain of conceptual modelling. However, the question was how to derive this information in a relatively objective and unbiased manner from the textual data. We used a new state-of-the-art textual content analysis tool called Leximancer (for more information on Leximancer, see www.leximancer.com). Using this tool, we identified from the unclassified text five new factors specific to conceptual modelling. Subsequently, one researcher again classified the remaining responses using these newly identified factors. His classification was reviewed and confirmed by a second researcher. Finally, the relative importance of each of the new factors was determined.

3.1 Why Use Leximancer?

The Leximancer system allows its users to analyse large amounts of text quickly. The tool performs this analysis both systematically and graphically by creating a map of the constructs – the document map – displayed in such a manner that links to related subtext may be subsequently explored. Each of the words on the document map represents a concept that was identified. The concept is placed on the map in proximity to other concepts through a derived combination of the direct and indirect relationships between those concepts.


Essentially, the Leximancer system is a machine-learning technique based on the Bayesian approach to prediction. The procedure used for this is a self-ordering optimisation technique and does not use neural networks.


Once the optimal weighted set of words is found for each concept, it is used to predict the concepts present in fragments of related text. In other words, each concept has other concepts that it attracts (or is highly associated with contextually) as well as concepts that it repels (or is highly disassociated with contextually). The relationships are measured by the weighted sum of the number of times two concepts are found in the same ‘chunk’. An algorithm is used to weight them and to determine the confidence and relevancy of the terms relative to others in a specific chunk and across chunks. Leximancer was selected for this qualitative data analysis for several reasons:
– its ability to derive the main concepts within text and their relative importance using a scientific, objective algorithm;
– its ability to identify the strengths between concepts (how often they co-occur) – centrality of concepts;
– its ability to assist the researcher in applying grounded theory analysis to a textual dataset;
– its ability to assist in visually exploring textual information for related themes to create new ideas or theories; and
– its ability to assist in identifying similarities in the context in which the concepts occur – contextual similarity.

4 Survey Results and Discussion

Of the 674 individuals who started to fill out the survey, 370 actually completed the entire survey, which gives a completion rate of 54.8%. Moreover, of the 12,000 members of the ACS, 1,567 indicated in their most recent membership profiles that they were interested in conceptual modelling/business systems analysis. Accordingly, our 370 responses indicate a relevant response rate of 23.6%, which is very acceptable for a survey. Moreover, we offered free participation in one of five seminars on business process modelling as an inducement for members to participate. This offer was accepted by 186 of the 370 respondents. Corresponding with the nature of the ACS as a professional organisation, 87% of the participants were practitioners. The remaining respondents were academics (6%) and students (7%). It is also not a surprise that 85% of the participants characterised themselves as IT service people, while only 15% referred to themselves as businesspeople or end users. Sixty-eight percent of the respondents indicated that they gained their knowledge in Business Systems Analysis from university. Further answers were TAFE (Technical and Further Education) (6%) and the ACS (3%). Twenty-three percent indicated that they did not have any formal training in Business Systems Analysis. Forty percent of the respondents indicated that they have less than five years of experience with modelling. Thirty-eight percent have between 5 and 15 years of experience. A significant proportion, 22%, has more than 15 years of experience with modelling. These figures indicate that the average expertise of the respondents is presumably quite high. Twenty-eight percent of respondents indicated that they worked in firms employing fewer than 50 people, most likely small software consulting firms. However, a quarter of the respondents worked in organisations with 1,000 or fewer employees. So, by Australian standards, they would be involved in software projects of reasonable size. We were concerned to obtain information in three principal areas of conceptual modelling in Australia, viz., what techniques are currently used in practice, what tools are used for modelling in practice, and what are the purposes for which conceptual modelling is used.


Table 2 presents, from the data, the top six most frequently used modelling techniques. It describes the usage of techniques as not known or not used, infrequently used (which in the survey instrument was defined as used less than five times per week), and frequently used. The table clearly demonstrates that the top six most frequently used (used five or more times a week) techniques are ER diagramming, data flow diagramming, systems flowcharting, workflow modelling (a range of workflow modelling techniques), RAD, and UML. It is significant to note that, even though object-oriented analysis, design, and programming has been the predominant paradigm for systems development over the last decade, 64 percent of respondents either did not know or did not use UML. While not every available conceptual modelling technique was named in the survey, the eighteen techniques used were selected based on their popularity as reported in prior literature. It is again interesting to note that at least approximately 40 percent of respondents either do not know or do not use any of the 18 techniques named in the survey.

Moreover, while not explicitly reported in Table 2, this current situation of non-usage appears set to increase into the short-term future (next 12 months), as the planned frequent use of the top four techniques is expected to drop to less than half its current level, viz., ER diagramming (17 percent), data flow diagramming (15 percent), systems flowcharting (10 percent), and workflow modelling (12 percent). Furthermore, no balancing increase in the intention to use any of the other techniques was reported. Perhaps this short-term trend reflects the perception that the current general downturn in the IT industry will persist into the future. Accordingly, respondents perceive a significant reduction of new development work requiring business systems modelling in the short-term future. It may also simply reflect a lack of planning of future modelling activities. Our work was also interested in what tools were used to perform the conceptual modelling work currently being undertaken. Table 3 presents the top six most frequently used tools when performing business systems analysis and design. The data is reported using the same legend as that used for Table 2. Again, while not every available conceptual modelling tool was named in the survey, the twenty-four tools were selected based on their popularity as reported in prior literature. Table 3 clearly indicates that Visio (58 percent – both infrequent and frequent use) is currently the preferred tool of choice for business systems modelling. This result is not surprising, as the top four most frequently used techniques are well supported by Visio (in its various versions).


A distant second in frequent use is Rational Rose (19 percent – both infrequent and frequent use), reflecting the current level of use of object-oriented analysis and design techniques. Again, approximately 40 percent of respondents (at least) either do not know or do not use any of the 24 tools named in the survey – even a relatively simple tool like Flowcharter or Visio.

Moreover, while not explicitly reported in Table 3, into the short-term future (next 12 months) the planned frequent use of the top two tools is expected to drop significantly from current usage levels, viz., Visio (21 percent) and Rational Rose (8 percent), with no real increase reported for planned use of other tools to compensate for this drop. Again, this trend in planned tool usage appears to reflect the fact that respondents expect a significant reduction in new development work requiring business systems modelling in the short-term future. Business systems modelling (conceptual modelling) must be performed for some purpose. Accordingly, we were interested in obtaining data on the various purposes for which people might be undertaking modelling. Using a five-point Likert scale (where 5 indicates very frequent use), Table 4 presents, in rank order from the highest to the lowest score, the average score for purpose of use from the respondents.


Table 4 indicates that database design and management remains the highest-rated purpose for the use of modelling techniques. This fact links to the earlier result of ER diagramming being the most frequently used modelling technique. Moreover, software development as a purpose would support the high usage of data flow diagramming and ER diagramming noted earlier. Indeed, the relatively highly regarded purposes of documenting and improving business processes, and managing workflows, would further support the relatively high usage of workflow modelling and flowcharting indicated earlier. More specialised tasks, such as identifying activities for activity-based costing and internal control purposes in auditing, appear to be relatively infrequent purposes for modelling. This fact, however, may derive from the type of population that was used for the survey, viz., members of the Australian Computer Society.

5 Textual Analysis Results and Discussion

Nine hundred and eighty (980) individual comments were received across the questions on critical success factors and problems/issues for modelling. Using the known factors (Table 1) influencing the continued use of new technologies in firms, Table 5 shows the classification of the 980 comments after phase 1 of the analysis using NVivo.

Clearly, relative advantage (disadvantage)/usefulness from the perspective of the analyst was the major factor driving the decision to continue (discontinue) modelling.


Does conceptual modelling (and/or its supporting technology) take too much time, make my job easier, make my job harder, and make it easier/harder for me to elicit/confirm requirements with users? Comments of this kind typically contributed to this factor. Furthermore, it is not surprising to see that complexity of the method and/or tool, compatibility of the method and/or tool with the responsibilities of my job, the views of “experts”, and top management support were other major factors driving analysts’ decisions on continued use. Prior literature had led us to expect these results, in particular the key importance of top management support to the continued successful use of such key business planning and quality assurance mechanisms as conceptual modelling for systems. However, nearly one-fifth of the comments remained unclassified. Were there any new, important factors unique to the conceptual modelling domain contained in this data? Fig. 1 shows a document (concept) map produced by Leximancer from the unclassified comments.

Fig. 1. Concept map produced by Leximancer on the unclassified comments

Five factors were identified from this map using the centrality of concepts and the relatedness of concepts to each other within identifiable ‘chunks’. While the resolution of the Leximancer-generated concept map (Fig. 1) may make it difficult to read on its own here, the concepts (terms) depicted are referred to within the discussion of the relevant factors below.


A. Internal Knowledge (Lack of) of Techniques
This group centred on such concepts as knowledge, techniques, information, large, easily and lack. Related concepts were work, systems, afraid, UML and leading. Accordingly, we used these concepts to identify this factor as the degree of direct/indirect knowledge (or lack thereof) in relation to the use of effective modelling techniques. The highlighted inadequacies raise issues of the modeller’s skill level and questions of insufficient training.

B. User Expectations Management
This group centred on such concepts as expectations, stakeholders, audience and review. Understanding, involved, logic and find were related concepts. Consequently, we used these items to identify this factor as issues arising from the need to manage users’ expectations of what conceptual modelling will do for them and produce. In other words, the analyst must ensure that the stakeholders/audience for the outputs of conceptual modelling have a realistic understanding of what will be achieved. Continued (discontinued) use of conceptual modelling may be influenced by difficulties experienced (or expected) with users over such issues as acceptance, understanding and communication of the outcomes of the modelling techniques.

C. Understanding the Model’s Integration into the Business
This group centred on understanding, enterprise, high, details, architecture, logic, physical, implementation and prior. Accordingly, we identified this factor as the degree to which decisions are affected by the stakeholder’s/modeller’s perceived understanding (or lack thereof) of the model’s integration into business processes (initial and ongoing). In other words, for the user, to what extent do the current outputs of the modelling process integrate with the existing business processes and physical implementations to support the goals of the overall enterprise architecture?

D. Tool/Software Deficiencies
This group focused on such concepts as software, issues, activities, and model. Subsequently, this factor was identified as the degree to which decisions are affected by issues relating directly to the perceived lack of capability of the software and/or the tool design.

E. Communication (Using Diagrams) to/from Stakeholders
This final group involved such concepts as diagram, information, ease, communication, method, examples, and articulate. Related concepts were means, principals, inability, hard, audience, find, and stakeholders. From these key concepts, we deduced this factor as the degree to which diagrams can facilitate effective communication between analysts and key stakeholders in the organisation. In other words, to what extent can the use of diagrams enhance (or hinder) the explanation to, and understanding by, the stakeholders of the situation being modelled?

Using these five new factors, we revisited the unclassified comments and, using the same dual-coder process as before, easily confirmed a classification for those outstanding comments. Table 6 presents this classification and the relative importance of the newly identified factors.


As can be seen from Table 6, communication using diagrams and internal knowledge (lack of) of the modelling techniques are major issues specific to the continued use of modelling in organisations. To a lesser degree, properly managing users’ expectations of modelling and ensuring users understand how the outcomes of a specific modelling task support the overall enterprise systems architecture are important to the continued use of conceptual modelling. Deficiencies in software tools that support conceptual modelling frustrate the analyst’s work occasionally.

6 Conclusions and Future Work

This paper has reported the results of a survey conducted nationally in Australia on the status of conceptual modelling. It achieved 370 responses and a relevant response rate of 23.6 percent. The study found that the top six most frequently used modelling techniques were ER diagramming, data flow diagramming, systems flowcharting, workflow modelling, RAD, and UML. Furthermore, it found that Visio is clearly the preferred tool of choice for business systems modelling currently; Rational Rose and the Oracle Developer Suite were a distant second in frequent use. Database design and management remains the highest-rated purpose for the use of modelling techniques. This fact links to the result of ER diagramming being the most frequently used modelling technique. Moreover, software development as a purpose would support the high usage of data flow diagramming and ER diagramming. A major contribution of this study is the analysis of textual data concerning critical success factors and problems/issues in the continued use of conceptual modelling. Clearly, relative advantage (disadvantage)/usefulness from the perspective of the analyst was the major factor driving the decision to continue (discontinue) modelling. Moreover, using a state-of-the-art textual analysis and machine-learning software package called Leximancer, this study identified five factors that uniquely influence analysts’ continued-use decisions, viz., communication (using diagrams) to/from stakeholders, internal knowledge (lack of) of techniques, user expectations management, understanding the model’s integration into the business, and tool/software deficiencies. The results of this work are limited in several ways. Although every effort was taken to mitigate potential limitations, it still suffers from the usual problems with surveys, most notably potential bias in the responses and lack of generalisability of the results to other people and settings. More specifically, in relation to the qualitative analysis, even though a form of dual coding (with confirmation) was employed, there still remains subjectivity in the classification of comments. Furthermore, while the members of the research team all participated, the identification of the factors using the Leximancer document map and the principles of relatedness and centrality remains arguable.


We intend to extend this work in two ways. First, we will analyse the data further, investigating cross-tabulations and correlations between the quantitative data and the qualitative results reported in this paper. For example, do the factors influencing the continued-use decision vary by demographic dimensions such as source of formal training, years of modelling experience, and the like? Second, we want to administer the survey in other countries (Sweden and the Netherlands already) to address the issues of lack of generalisability in the current results and cultural differences in conceptual modelling.

References
1. Batra, D., Marakas, G.M.: Conceptual data modelling in theory and practice. European Journal of Information Systems, Vol. 4 Nr. 3 (1995) 185-193
2. Chang, S., Kesari, M., Seddon, P.: A content-analytic study of the advantages and disadvantages of process modelling. 14th Australasian Conference on Information Systems. Eds.: J. Burn, C. Standing, P. Love. Perth (2003)
3. Floyd, C.: A comparative evaluation of system development methods. Information Systems Design Methodologies: Improving the Practice. North-Holland, Amsterdam (1986) 19-37
4. Iivari, J.: Factors affecting perceptions of CASE effectiveness. IEEE Software. Vol. 4 (1995) 143-158
5. Karahanna, E., Straub, D.W., Chervany, N.L.: Information Technology Adoption Across Time: A Cross-Sectional Comparison of Pre-Adoption and Post-Adoption Beliefs. MIS Quarterly, Vol. 23 Nr. 2 (1999) 183-213
6. Karam, G.M., Casselman, R.S.: A cataloging framework for software development methods. IEEE Computer, Feb. (1993) 34-46
7. Lindland, O.I., Sindre, G., Solvberg, A.: Understanding Quality in Conceptual Modeling. IEEE Software, March (1994) 42-49
8. Necco, C.R., Gordon, C.L., Tsai, N.W.: Systems analysis and design: Current practices. MIS Quarterly, Dec. (1987) 461-475
9. Olle, T.W., Hagelstein, J., Macdonald, I.G., Rolland, C., Sol, H.G., van Assche, F.J., Verrijn-Stuart, A.A.: Information Systems Methodologies: A Framework for Understanding. Addison-Wesley, Wokingham (1991)
10. Persson, A., Stirna, J.: Why Enterprise Modelling? An Explorative Study Into Current Practice. 13th Conference on Advanced Information Systems Engineering, Switzerland (2001) 465-468
11. Sedera, W., Gable, G., Rosemann, M., Smyth, R.: A success model for business process modeling: findings from a multiple case study. Pacific Asia Conference on Information Systems (PACIS’04). Eds.: Liang, T.P., Zheng, Z. Shanghai (2004)
12. Tan, M., Teo, T.S.H.: Factors Influencing the Adoption of Internet Banking. Journal of the Association for Information Systems, Vol. 1 (2000) 1-43
13. Wand, Y., Weber, R.: Research Commentary: Information Systems and Conceptual Modeling – A Research Agenda. Information Systems Research, Vol. 13 Nr. 4 (2002) 363-376

Entity-Relationship Modeling Re-revisited

Don Goelman (Department of Computer Science, Villanova University, Villanova, PA 19085) and Il-Yeol Song (College of Information Science and Technology, Drexel University, Philadelphia, PA 19104)

Abstract. Since its introduction, the Entity-Relationship (ER) model has been the vehicle of choice in communicating the structure of a database schema in an implementation-independent fashion. Part of its popularity has no doubt been due to the clarity and simplicity of the associated pictorial Entity-Relationship Diagrams (“ERD’s”) and to the dependable mapping it affords to a relational database schema. Although the model has been extended in different ways over the years, its basic properties have been remarkably stable. Even though the ER model has been seen as pretty well “settled,” some recent papers, notably [4] and [2 (from whose paper our title is derived)], have enumerated what their authors consider serious shortcomings of the ER model. They illustrate these by some interesting examples. We believe, however, that those examples are themselves questionable. In fact, while not claiming that the ER model is perfect, we do believe that the overhauls hinted at are probably not necessary and possibly counterproductive.

1 Introduction

Since its inception [5], the Entity-Relationship (ER) model has been the primary approach for presenting and communicating a database schema at the “conceptual” level (i.e., independent of its subsequent implementation), especially by means of the associated Entity-Relationship Diagram (ERD). There is also a fairly standard method for converting it to a relational database schema. In fact, if the ER model is in some sense “correct,” then the associated relational database schema should be in pretty good normal form [15]. Of course, there have been some suggested extensions to Chen’s original ideas (e.g., specialization and aggregation as in [10, 19]), some different approaches for capturing information in the ERD, and some variations on the mapping to the relational model, but the degree of variability has been relatively minor. One reason for the remarkable robustness and popularity of the approach is no doubt the wide appreciation for the simplicity of the diagram. Consequently, the desirability of incorporating additional features in the ERD must be weighed against the danger of overloading it with so much information that it loses its visual power in communicating the structure of a database. In fact, the model’s versatility is also evident in its relatively straightforward mappability to the newer Object Data Model [7]. Now, admittedly, an industrial-strength ERD reflecting an actual enterprise would necessarily be an order of magnitude more complex than even the production numbers in standard texts [e.g., 10]. However, this does not weaken the ability of a simple ERD to capture local pieces of the enterprise, nor does it lessen the importance of ER-type thinking in communicating a conceptual model.


Quite recently, however, both Camps and Badia have demonstrated [4, and 2 (from whose paper the title of this one is derived)] some apparent shortcomings of the ER model, both in the model itself and in the processes of conversion to the relational model and its subsequent normalization. They have illustrated these problems through some interesting examples. They also make some recommendations for improvements, based on these examples. However, while not claiming that the ER model can be all things to all users, we believe that the problems presented in the examples described in those two papers are due less to the model and more to its incorrect application. Extending the ERD to represent complex multi-relation constraints or constraints at the attribute level is an interesting research topic, but such extensions are not always desirable. We claim that representing them would clutter the ERD as a conceptual model at the enterprise level; complex constraints would be better specified in a textual or language-oriented format than at the ERD level. The purpose of this paper is to take these examples as a starting point to discuss the possible shortcomings of the ER model and the necessity, or lack thereof, of modifying it in order to address them. We therefore begin by reviewing and analyzing those illustrations. Section 2 describes and critiques Camps’ scenarios; Section 3 does Badia’s. Section 4 considers some related issues, most notably a general design principle only minimally offered in the ER model. Section 5 concludes our paper.

2 The Camps Paper

In [4], the author begins by describing an apparently simple enterprise. It has a straightforward ERD that leads to an equally straightforward relational database schema. But Camps then escalates the situation in stages, to the point where the ER model is not currently able to accommodate the design, and where normalizing the associated relational database schema is also unsatisfying. Since we are primarily concerned with problems attributed to the ER model, we will concentrate here on that aspect of the paper. However, the normalization process at this point is closely tied to that model, so we will include some discussion of it as well. We now give a brief recapitulation, with commentary. At first, Camps considers an enterprise with four ingredients: Dealer, Product, State, and Concession, where Concession is a ternary relationship among the other three, which are implemented as entity types. Each ingredient has attributes with fairly obvious semantics, paraphrased here: d-Id, d-Address; p-Id, p-Type; s-Id, s-Capital; and c-Date. The last attribute represents the date on which a given state awards a concession to a given dealer for a given product. As for functional dependencies, besides the usual ones, we are told that for a given state/product combination there can only be one dealer. Thus, a minimal set of dependencies is as follows:
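Spelled out, the minimal set is plausibly the following (a reconstruction from the attribute names and the constraint just stated; the exact formulation in [4] may differ):

```latex
% Plausible reconstruction; not a verbatim quotation from [4].
\begin{aligned}
 d\text{-}Id &\to d\text{-}Address \\
 p\text{-}Id &\to p\text{-}Type \\
 s\text{-}Id &\to s\text{-}Capital \\
 s\text{-}Id,\ p\text{-}Id &\to d\text{-}Id,\ c\text{-}Date
\end{aligned}
```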


An ERD for this is given in Figure 1 (attributes are eliminated in the figures, for the sake of clarity), and the obvious relational database schema is as follows:
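Spelled out, the schema is plausibly the following (a reconstruction, with primary keys underlined; the exact notation in [4] may differ):

```latex
% Plausible reconstruction; keys underlined.
\begin{aligned}
&\text{Dealer}(\underline{d\text{-}Id},\ d\text{-}Address) \qquad
 \text{Product}(\underline{p\text{-}Id},\ p\text{-}Type) \qquad
 \text{State}(\underline{s\text{-}Id},\ s\text{-}Capital) \\
&\text{Concession}(\underline{s\text{-}Id,\ p\text{-}Id},\ d\text{-}Id,\ c\text{-}Date)
\end{aligned}
```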

Fig. 1. Example of 1:N:N relationship (from Figure 1 in [4], modified)

The foreign key constraints derive here from the two components of Concession’s key, which are primary keys of their native schemas. Since the only functional dependencies are those induced by keys, the schema is in BCNF. Here Camps imposes further constraints:
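Judging from the sentence that follows and the caption of Fig. 2, the imposed constraints are the two functional dependencies

```latex
p\text{-}Id \to d\text{-}Id, \qquad s\text{-}Id \to d\text{-}Id .
```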

In other words, if a product is offered as a concession, then it can only be with a single dealer, regardless of the state; and analogously on the state–dealer side. The author is understandably unhappy about the absence of a standard ERD approach to accommodate the resulting binary constraining relationships (using the language of [12]), which he renders in a rather UML-like fashion [17], similar to Figure 2. At this point, in order to highlight the generic structure, he introduces new notation (A, B, C, D for State, Dealer, Product, Concession, respectively). However, we will keep the current names for the sake of familiarity, while still following the structure of his narrative. He notes that the resulting relational database schema includes the non-3NF relation schema Concession(s-Id, p-Id, d-Id, c-Date). Further, when Camps wishes to impose the constraints that a state (respectively product) instance can determine a dealer if and only if there has been a concession arranged with some product (respectively state), he expresses them with these conditions:

Each of these can be viewed as a double inclusion dependency and must be expressed using the CHECK construct in SQL.
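As an illustration only – the relation StateDealer and the column names below are hypothetical and do not come from [4] – such a double inclusion dependency could be written as a standard SQL assertion, one direction requiring every StateDealer pair to appear in some Concession tuple and the other requiring every (state, dealer) pair of Concession to appear in StateDealer:

```sql
-- Illustrative sketch only; StateDealer and the column names are assumptions,
-- not Camps' actual schema.  Few DBMSs implement CREATE ASSERTION.
CREATE ASSERTION state_dealer_double_inclusion CHECK (
  NOT EXISTS (                       -- every StateDealer pair occurs in Concession
    SELECT * FROM StateDealer sd
    WHERE NOT EXISTS (
      SELECT * FROM Concession c
      WHERE c.s_Id = sd.s_Id AND c.d_Id = sd.d_Id))
  AND NOT EXISTS (                   -- every Concession (state, dealer) pair is in StateDealer
    SELECT * FROM Concession c
    WHERE NOT EXISTS (
      SELECT * FROM StateDealer sd
      WHERE sd.s_Id = c.s_Id AND sd.d_Id = c.d_Id))
);
```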


Fig. 2. Two imposed FDs (from Figure 2 of [4])

Now we note that it is actually possible to capture the structural properties of the enterprise at this stage by the simple (i.e., ternary-free) ERD of either Figure 3a [13] or Figure 3b [18]. The minimal set of associated functional dependencies in Figure 3a is as follows:
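A plausible reading of the dependency set for Figure 3a (again a reconstruction; the exact formulation may differ) is:

```latex
\begin{aligned}
 d\text{-}Id &\to d\text{-}Address \\
 p\text{-}Id &\to p\text{-}Type,\ d\text{-}Id \\
 s\text{-}Id &\to s\text{-}Capital,\ d\text{-}Id \\
 s\text{-}Id,\ p\text{-}Id &\to c\text{-}Date
\end{aligned}
```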

One, therefore, obtains the following relational database schema, which is, of course, in BCNF, since all functional dependencies are due to keys:
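The schema meant here is plausibly the following (a reconstruction; keys underlined):

```latex
\begin{aligned}
&\text{Dealer}(\underline{d\text{-}Id},\ d\text{-}Address) \qquad
 \text{Product}(\underline{p\text{-}Id},\ p\text{-}Type,\ d\text{-}Id) \qquad
 \text{State}(\underline{s\text{-}Id},\ s\text{-}Capital,\ d\text{-}Id) \\
&\text{Concession}(\underline{s\text{-}Id,\ p\text{-}Id},\ c\text{-}Date)
\end{aligned}
```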

Fig. 3a. A binary model of Figure 2 with Concession as a M:N relationship


Fig. 3b. A binary model of Figure 2 with Concession as an intersection (associate) entity

Admittedly, this approach loses something: the ternary character of Concession. However, any dealer-relevant information for a concession instance can be discovered by a simple join; a view can also be conveniently defined. The ternary relationship in Figure 2 is therefore something of a red herring when constraining binary relationships are imposed on a ternary relationship. In other words, an expansion of the standard ERD language to include n-ary relationships being constrained by m-ary ones might be a very desirable feature, but its absence is not a surprising one. Jones and Song showed that the ternary schema with the FDs imposed in Figure 2 can have a lossless decomposition, but cannot have an FD-preserving schema (Pattern 11 in [13]). Camps now arrives at the same schema (E) (by normalizing his non-3NF one, not by way of our ERD in Figure 3a). The problem he sees is incorporating the semantics of (C). The constraints he develops are:

The last two conditions seem not to make sense syntactically. The intention is most likely the following (keeping the first condition and rephrasing the other two):

At any rate, Camps shows how SQL can accommodate these conditions too, using CHECKs in the form of ASSERTIONs, but he considers it anomalous to need any conditions besides key dependencies and inclusion constraints. We feel that this is not so surprising a situation after all. The complexity of real-world database design is so great that, on the contrary, it is quite common to encounter situations where many integrity constraints are not expressible in terms of functional and inclusion dependencies alone. Instead, one must often use the type of constructions that Camps shows us, or use triggers, to implement complex real-world integrity constraints.

3 The Badia Paper

In his paper [2], Badia in turn revisits the ER model because of its usefulness and importance. He contends that, as database applications become more complex and sophisticated and the need to capture more semantics grows, the ER model should be extended with more powerful constructs to express more powerful semantics and richer constraints.


He presents six scenarios that apparently illustrate some inadequacies of the ER model; he classifies the first five as relationship constraints that the model is not up to incorporating, and the sixth as an attribute constraint. We feel that some of the examples he marshals, described below in Sections 3.3 and 3.6, are questionable, leading us to ask whether they warrant extending the model. Badia does, however, discuss the downside of overloading the model, including a thoughtful mention of the tradeoffs between minimality and power. In this section we give a brief recapitulation of the examples, together with our analyses.

3.1 Camps Redux

In this portion of his paper, Badia presents Camps’ illustrations and conclusions, which he accepts. We have already discussed these above.

3.2 Commutativity in ERD’s

In mathematical contexts, we call a diagram commutative [14] if all different routes from a common source to a common destination are equivalent. In Figure 4, from Badia’s paper (there called Figure 1), there are two different ways to navigate from Course to Department: directly, or via the Teacher entity. To say that this particular diagram commutes, then, is to say that for each course, its instructor must be a faculty member of the department that offers it. Again, there is an SQL construct for indicating this. Although Badia doesn’t use the term, his point here is that there is no mechanism for ERD’s to indicate a commutativity constraint. This is correct, of course. Consider, however, the case of representing this kind of multi-relation constraint in a diagram with over 50 entities and relationships, which is quite common in real-world applications. We believe, therefore, that this kind of multi-relation constraint is better specified in a textual or language-oriented syntax, such as OCL [17], rather than at the diagram level. In this way, a diagram can clearly deliver its major semantics without incurring visual overload and clutter.
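To make the constraint concrete: assuming, purely for illustration, that the relationships of Figure 4 are stored as Course(course_id, dept_id, teacher_id) and Teacher(teacher_id, dept_id) – names not taken from [2] – the commutativity condition could be stated as an SQL assertion along these lines:

```sql
-- Illustrative sketch; table and column names are assumptions.
-- The constraint: a course's instructor must belong to the
-- department that offers the course.
CREATE ASSERTION course_teacher_same_department CHECK (
  NOT EXISTS (
    SELECT *
    FROM   Course  c
    JOIN   Teacher t ON t.teacher_id = c.teacher_id
    WHERE  t.dept_id <> c.dept_id        -- path via Teacher disagrees with Offers
  )
);
```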

Fig. 4. An example of multiple paths between two entities (from Figure 1 in [2])


In certain limited situations [8] the Offers relationship might be superfluous and recoverable by composing the other two relationships (or, in the relational database schema, by performing the appropriate joins). We would need to be careful about dropping Offers, however. For example, if a particular course were at present unstaffed, then the Teaches link would be broken. This is the case when the Course entity has partial (optional) participation towards the Department entity. Without an explicit Offers instance, we wouldn’t know which department offers the course. This is an example of a chasm trap, which requires an explicit Offers relationship [6]. Another case where we couldn’t rely on merely dropping one of the relationship links would arise if a commutative diagram involved the composition of two relationships in each path; then we would surely need to retain them both and to implement the constraint explicitly. We note that allowing cycles and redundancies in ERD’s has been a topic of research in the past. Atzeni and Parker [1] advise against it; Markowitz and Shoshani [15] feel that it is not harmful if done right. Dullea and Song [8, 9] provide a complete analysis of redundant relationships in cyclic ERD’s. Their decision rules on redundant relationships are based on both maximum and minimum cardinality constraints.

3.3 Acyclicity of a Recursive Closure

Next, Badia considers the recursive relationship ManagerOf (on an Employee entity). He would like to accommodate the hierarchical property that nobody can be an indirect manager of oneself. Again, we agree with this observation, but we cannot comment on how desirable such an ER feature would be at the diagram level. Badia points out that this is a problem even at the level of the relational database, although some Oracle releases can now accommodate the constraint.
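While the acyclicity constraint cannot be stated declaratively in standard SQL, a violation can at least be detected with a recursive query. The sketch below is illustrative only and assumes a table Employee(emp_id, mgr_id), which is not taken from [2]:

```sql
-- Illustrative sketch; Employee(emp_id, mgr_id) is an assumed table.
-- Lists employees who are (directly or indirectly) their own managers.
WITH RECURSIVE chain (emp_id, mgr_id) AS (
  SELECT emp_id, mgr_id FROM Employee WHERE mgr_id IS NOT NULL
  UNION                                -- UNION (not UNION ALL) so cycles terminate
  SELECT c.emp_id, e.mgr_id
  FROM   chain c JOIN Employee e ON e.emp_id = c.mgr_id
  WHERE  e.mgr_id IS NOT NULL
)
SELECT DISTINCT emp_id
FROM   chain
WHERE  emp_id = mgr_id;
```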

3.4 Fan Traps

At this point the author brings Figure 5 (adapted from [6], where it appears as Figure 11.19(a); for Badia it is Figure 2) to our attention. (The original figure uses the “Merise,” or “look here,” approach [17]; we have modified it to make it consistent with the other figures in this paper.) The problem, called a fan trap, arises when one attempts to enforce the constraint that a staff person must work in a branch operated by her/his division. This ER anomaly percolates to the relational schemas as well. Further, if one attempts to patch things up by including a third binary link, between Staff and Branch, then one is faced with the commutativity dilemma of Section 3.2. In general, fan traps arise when there are two 1:N relationships from a common entity type to two different destinations. The two typical solutions for fan traps are either to add a third relationship between the two many-side entities or to rearrange the entities to make the connection unambiguous. The problem in Figure 5 is simply caused by an incorrect ERD and can be resolved by rearranging the entities as shown in Figure 6. Figure 6 avoids the difficulties at both the ER and relational levels. In fact, this fix is even exhibited in the Connolly source itself. We note that the chasm trap discussed in Section 3.2 and the fan trap are commonly called connection traps [6], which make the connection between two entities separated by a third entity ambiguous.


Fig. 5. A semantically incorrect ERD with a fan trap (from Figure 2 in [2] and Figure 11.19(a) from [6])

Fig. 6. A corrected ERD for Figure 5, after rearranging the entities

3.5 Temporal Considerations

Here Badia looks at a Works-in relationship, M:N between Employee and Project, with attributes start-date and end-date. A diagram for this might look something like Figure 7b; for the purposes of clarity, most attributes have been omitted. Badia states that the rule that, even though an employee may work on many projects, an employee may not work on two projects at the same time cannot be represented in an ERD. It does appear impossible to express the rule, although the relationship is indeed M:N. But wouldn’t this problem be solved by creating a third entity type, TimePeriod, with the two date attributes as its composite key, and letting Works-in be ternary? The new relationship would be M:N:1, as indicated in Figure 7c, with the 1 on the Project node, of course. In Figures 7a through 7d, we show several variations of this case, related to capturing the history of the Works-in relationship and the above constraint (a relational sketch of the constraint itself follows the figures). We’ll comment additionally on this in Section 4.

Fig. 7a. An employee may work on only one project, and each project can have many employees. The diagram already assumes that an employee must work on only one project at a time. This diagram is not intended to capture any history of the Works-in relationship

Fig. 7b. An employee may work on many projects, and each project may have many employees. The diagram assumes that an employee may work on many projects at the same time. This diagram is also not intended to capture any history of the Works-in relationship


Fig. 7c. An employee may work on only one project at a time. This diagram can capture a history of an employee’s Works-in relationship with projects and still satisfies the constraint that an employee may work on only one project at a time

Fig. 7d. If, in Figure 7c, the entity TimePeriod is not easily materialized, we can reify the relationship Works-in into an intersection entity. This diagram can capture the history of the Works-in relationship, but does not satisfy the constraint that an employee may work on only one project at a time
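At the relational level, the constraint discussed in Section 3.5 amounts to forbidding overlapping assignment periods for the same employee. As a purely illustrative sketch – the table Works_in(emp_id, proj_id, start_date, end_date) is an assumed shape, not taken from [2] – it could be written as:

```sql
-- Illustrative sketch; Works_in and its columns are assumptions.
-- No employee may have two assignments to different projects whose
-- periods overlap.
CREATE ASSERTION no_overlapping_assignments CHECK (
  NOT EXISTS (
    SELECT *
    FROM   Works_in w1
    JOIN   Works_in w2
      ON   w1.emp_id  = w2.emp_id
     AND   w1.proj_id <> w2.proj_id
    WHERE  w1.start_date <= w2.end_date   -- the two periods overlap
      AND  w2.start_date <= w1.end_date
  )
);
```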

3.6 Range Constraints

While the five previous cases exemplify what Badia calls relationship constraints, this one is an attribute constraint. The example given uses the following two tables: Employee(employee_id, rank_id, salary, ...) and Rank(rank_id, max_salary, min_salary). The stated problem is that the ERD that represents the above schema cannot express the fact that the salary of an employee must be within the range determined by his or her rank. Indeed, in order to enforce this constraint, explicit SQL code must be generated. Badia correctly states that the absence of information at the attribute level is a limitation and causes difficulty in resolving semantic heterogeneity. We believe, however, that information and constraints at the attribute level could be expressed at the data dictionary level or in a separate low-level diagram below the ERD level. Again, this would keep the ERD a conceptual model at the enterprise level without too much clutter. Consider the complexity of representing attribute constraints in ERDs for real-world applications that have over 50 entities and several hundred attributes. The use of a CASE tool that supports a conceptual ERD together with lower-level diagrams for attributes and/or an associated data dictionary seems the right direction for this problem.
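For concreteness, the “explicit SQL code” mentioned above could take roughly the following form; this is only a sketch over the two tables as given, and in practice CREATE ASSERTION would often have to be replaced by triggers:

```sql
-- Sketch of the cross-table range constraint of Section 3.6.
CREATE ASSERTION salary_within_rank_range CHECK (
  NOT EXISTS (
    SELECT *
    FROM   Employee e
    JOIN   Rank r ON r.rank_id = e.rank_id
    WHERE  e.salary < r.min_salary OR e.salary > r.max_salary
  )
);
```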

4 General Cardinality Constraints

While on the whole, as indicated above, we feel that many of the alleged shortcomings of the ER model claimed in recent papers are not justified, some of those points are well taken and quite interesting. However, there is another important feature of conceptual design that we shall consider here, one that the ER model really does lack. In this section, we briefly discuss McAllister’s general cardinality constraints [16] and their implications. McAllister’s setting is a general n-ary relationship R. In other words, R involves n different roles. This term is used, rather than entity types, since the entity types may not all be distinct. For example, a recursive relationship, while binary in the mathematical sense, involves only a single entity type.


Given two disjoint sets of roles A and B, McAllister defines Cmax(A,B) and Cmin(A,B) roughly as follows: for a tuple a with one component from each role in A, and a tuple b with one component from each role in B, let us denote by ab the tuple generated by the two sets of components; we recall that A and B are disjoint. Then Cmax(A,B) (respectively Cmin(A,B)) is the maximum (respectively minimum) allowable cardinality, over all tuples a, of the set of tuples b such that ab appears in R. For example, consider the Concession relationship of Figure 1. To say that Cmax({State, Product},{Dealer}) = 1 is to express the fact that a given state/product combination is associated with at most one dealer. And the condition Cmin({Product},{State,Dealer}) = 1 is equivalent to the constraint that Product is total on Concession. Now, as we see from these examples, Cmax gives us information about functional dependencies, and Cmin about participation constraints. When B is a singleton set and A its complement, this is sometimes called the “Chen” approach to cardinality [11] or “look across”; when A is a singleton set and B its complement, it is called the “Merise” approach [11] or “look here.” All told, McAllister shows that the number of possible combinations for A and B grows exponentially with n, the number of different roles. Clearly, given this explosive growth, it is impractical to include all possible cardinality constraints in a general ERD, although McAllister shows a tabular approach that works pretty well for ternary relationships. He shows further that there are many equalities and inequalities that must hold among the cardinalities, so that the entries in the table are far from independent. The question arises as to which cardinalities have the highest priorities and should thus appear in an ERD. It turns out that the Merise and Chen approaches give the same information in the binary case but not in the ternary one, which becomes the contentious case (n > 3 is rare enough not to be a serious issue). In fact, one finds both Chen [as in 10] and Merise [as in 3] systems in practice. In his article, Genova [11] feels that UML [17] made the wrong choice by using the Chen method for its Cmin’s, and he suggests that class diagrams include both sets of information (but only when either A or B is a singleton). That does not seem likely to happen, though. Still, consideration of these general cardinality constraints and McAllister’s axioms comes in handy in a couple of the settings we have discussed. The general setting helps one understand the connections between, for example, ternary and related binary relationships, as in Figure 2 and [12]. It similarly sheds light on the preservation (and loss) of information in Section 3.5 above, when a binary relationship is replaced by a ternary one. Finally, we believe that it also provides the deep structural information for describing the properties of decompositions of the associated relation schemas. It is therefore indisputable, in our opinion, that these general cardinality constraints do much to describe the fundamental structure of a relationship in the ER model, only portions of which, like the tip of an iceberg, are currently visible in a typical ERD. And yet we are not claiming that such information should routinely be included in the model.

5 Conclusion

We have reviewed recent literature ([4] and [2]) that illustrates, through some interesting examples, areas of conceptual database design that are not accommodated sufficiently at the present time by the Entity-Relationship model.


However, some of these examples seem not to hold up under scrutiny. Capabilities that the model does indeed lack are constraints on commutative diagrams (Section 3.2 above), recursive closures (3.3), and some range conditions (3.6), as pointed out by Badia. Another major conceptual modeling tool missing from the ER model is that of general cardinality constraints [16]. These constraints are the deep structure that underlies such more visible behavior as constraining and related relationships, Chen and Merise cardinality constraints, functional dependencies and decompositions, and participation constraints. How many of these missing features should actually be incorporated into the ER model is pretty much a question of triage, of weighing the benefits of a feature against the danger of circuit overload. We believe that some complex constraints, such as multi-relation constraints, are better represented in a textual or language-oriented syntax, such as OCL [17], rather than at the ER diagram level. We also believe that information and constraints at the attribute level could be expressed at the data dictionary level or in a separate low-level diagram below the ERD level. In these ways, the ERD is kept as a conceptual model at the enterprise level, delivering the major semantics without visual overload and too much clutter. Consider the complexity of an ERD for a real-world application that has over 50 entities and hundreds of attributes, and of representing all those complex multi-relation and attribute constraints in the ERD. The use of a CASE tool that supports a conceptual ERD together with lower-level diagrams for attributes and/or an associated data dictionary seems the right direction for this problem. We note that we do not claim that some of the research topics suggested by Badia, such as relationships over relationships and attributes over attributes, are uninteresting or unworthy. Research on those topics would bring interesting new insights and powerful ways of representing complex semantics. What we claim here is that the ERD itself has much value as it is now, especially for relational applications, as all the examples of Badia indicate. We believe, however, that extending the ER model to support new application semantics, such as those of biological applications, should be encouraged. The “D” in ERD connotes to many researchers and practitioners the simplicity and power of communication that account for the model’s popularity. Indeed, as the Entity-Relationship model nears its birthday, we find its robustness remarkable.

References
1. Atzeni, P. and Parker, D.S., “Assumptions in relational database theory”, in Proceedings of the ACM Symposium on Principles of Database Systems, March 1982.
2. Badia, A., “Entity-Relationship Modeling Revisited”, SIGMOD Record, 33(1), March 2004, pp. 77-82.
3. Batini, C., Ceri, S., and Navathe, S., Conceptual Database Design, Benjamin/Cummings, 1992.
4. Camps Paré, R., “From Ternary Relationship to Relational Tables: A Case against Common Beliefs”, SIGMOD Record, 31(2), June 2002, pp. 46-49.
5. Chen, P., “The Entity-Relationship Model – towards a Unified View of Data”, ACM Transactions on Database Systems, 1(1), 1976, pp. 9-36.
6. Connolly, T. and Begg, C., Database Systems, Addison-Wesley, 2002.
7. Dietrich, S. and Urban, S., Beyond Relational Databases, Prentice-Hall, to appear.
8. Dullea, J. and Song, I.-Y., “An Analysis of Cardinality Constraints in Redundant Relationships,” in Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM97), Las Vegas, Nevada, USA, Nov. 10-14, 1997, pp. 270-277.


9. Dullea, J., Song, I.-Y., and Lamprou, I., “An Analysis of Structural Validity in Entity-Relationship Modeling,” Data and Knowledge Engineering, 47(3), 2003, pp. 167-205.
10. Elmasri, R. and Navathe, S.B., Fundamentals of Database Systems, Addison-Wesley, 2003.
11. Genova, G., Llorenz, J., and Martinez, P., “The meaning of multiplicity of n-ary associations in UML”, Journal of Software and Systems Modeling, 1(2), 2002.
12. Jones, T. and Song, I.-Y., “Analysis of binary/ternary cardinality combinations in entity-relationship modeling”, Data & Knowledge Engineering, 19(1), 1996, pp. 39-64.
13. Jones, T. and Song, I.-Y., “Binary Equivalents of Ternary Relationships in Entity-Relationship Modeling: a Logical Decomposition Approach”, Journal of Database Management, 11(2), 2000 (April-June), pp. 12-19.
14. MacLane, S., Categories for the Working Mathematician, Springer-Verlag, 1971.
15. Markowitz, V. and Shoshani, A., “Representing Extended Entity-Relationship Structures in Relational Databases: A Modular Approach”, ACM Transactions on Database Systems, 17(3), 1992, pp. 423-464.
16. McAllister, A., “Complete rules for n-ary relationship cardinality constraints”, Data & Knowledge Engineering, 27, 1998, pp. 255-288.
17. Rumbaugh, J., Jacobson, I., and Booch, G., The Unified Modeling Language Reference Manual, Addison-Wesley, 1999.
18. Song, I.-Y., Evans, M., and Park, E.K., “A Comparative Analysis of Entity-Relationship Diagrams,” Journal of Computer and Software Engineering, 3(4), 1995, pp. 427-459.
19. Teorey, T., Database Modeling & Design, Morgan Kaufmann, 1999.

Modeling Functional Data Sources as Relations

Simone Santini and Amarnath Gupta*

University of California, San Diego

Abstract. In this paper we present a model of functional access to data that, we argue, is suitable for modeling a class of data repositories characterized by functional access, such as web sites. We discuss the problem of modeling such data sources as a set of relations, of determining whether a given query expressed on these relations can be translated into a combination of functions defined by the data sources, and of finding an optimal plan to do so. We show that, if the data source is modeled as a single relation, an optimal plan can be found in time linear in the number of functions in the source but that, if the source is modeled as a number of relations that can be joined, finding the optimal plan is NP-hard.

1 Introduction

These days, we see a great diversification in the type, structure, and functionality of the data repositories with which we have to deal, at least when compared with as little as fifteen or twenty years ago. Not too long ago, one could quite safely assume that almost all the data that a program had both the need and the possibility to access were stored in a relational database or, were this not the case, that the amount of data, their stability, and their format made their insertion into a relational database feasible. As of today, such a statement would be quite indefensible. A large share of the responsibility for this state of affairs must be ascribed, of course, to the rapid diffusion of data communication networks, which created a very large collection of data that a person or a program might want to use. Most of the data available on data communication networks, however, are not in relational form [1] and, due to the volume and the instability of the medium, the idea of storing them all in a stable repository is quite infeasible. The most widely known data access environment of today, the world-wide web, was created with the idea of displaying reasonably well formatted pages of material to people, and of letting them "jump" from one page to another. It followed, in other words, a rather procedural model, in which elements of the page definition language (tags) often stood for actions: a link specified a "jump" from one page to another. While a link establishes a connection between two pages, this connection is not symmetric (a link that carries you from page A to page B will not carry you from page B to page A) and therefore is not a relation between two pages (in the sense in which the term

* The work presented in this paper was done under the auspices and with the funding of NIH project NCRR RR08 605, Biomedical Informatics Research Network, which the authors gratefully acknowledge.

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 55–68, 2004. © Springer-Verlag Berlin Heidelberg 2004


“relation” is used in databases), but rather a functional connection that, given page A, will produce page B. In addition to this basic mechanism, today many web sites that contain a lot of data allow one to specify search criteria using so-called forms. A form is an input device through which a fixed set of values can be assigned to an equally fixed set of parameters, the values forming a search criterion against which the data in the web site will be matched, returning the data that satisfy the criterion. Consider the web site of a public library (an example to which we will return in the following). Here one can have a form that, given the name of an author, returns a web page (or other data structure) containing the titles of the books written by that author. This doesn't imply that a corresponding form will exist that, given the title of a book, will return its author. In other words, the dependence is not necessarily invertible. This limitation tells us that we are not in the presence of a set of relations but, rather, in the presence of a data repository with functional access. The diffusion of the internet as a source of data has, of course, generated a great interest in the conceptual modeling of web sites [2–4]. In this paper we present a formalization of the problem of representing a functional data source as a set of relations, and of translating (whenever possible) relational queries into sequences of functions.

2 The Model

For the purpose of this work, a functional data source is a set of procedures that, given a number of attributes whose value has been fixed, instruct us on how to obtain a data structure containing further attributes related to the former. To fix ideas, consider again the web site of a library. A procedure is defined that, given the name of an author, retrieves a data structure containing the titles of all the books written by that author. The procedure for doing so looks something like this:

Procedure 1: author -> set(title)
i) go to the "search by author" page;
ii) put the desired name into the "author" slot of the form that you find there;
iii) press the button labeled "go";
iv) look at the page that will be displayed next, and retrieve the list of titles.

Getting the publisher and the year of publication of a book, given its author and title, is a bit more complicated:

Procedure 2: author, title -> publisher, year
i) execute procedure 1 and get a list of titles;
ii) search for the desired title in the list;
iii) if found then
iii.1) access the book page, by clicking on the title;
iii.2) search for the publisher and year, and return them;
iv) else fail.

On the other hand, in most library web pages there is no procedure that allows one to obtain a list of all the books published by a given publisher in a given year, and a query asking for such information would be impossible to answer.
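To make the notion concrete, a minimal Python sketch of such a source — with invented catalog data and helper names, used purely for illustration — could expose exactly these two procedures and nothing else:

# Toy functional data source in the spirit of the library example; the
# catalog contents and names below are illustrative assumptions.
_CATALOG = [
    {"author": "A. Author", "title": "Book One", "publisher": "dover", "year": 1997},
    {"author": "A. Author", "title": "Book Two", "publisher": "springer", "year": 2001},
]

def titles_by_author(author):
    """Procedure 1: author -> set(title)."""
    return {row["title"] for row in _CATALOG if row["author"] == author}

def publisher_year(author, title):
    """Procedure 2: author, title -> publisher, year (fails if the book is absent)."""
    for row in _CATALOG:
        if row["author"] == author and row["title"] == title:
            return row["publisher"], row["year"]
    raise LookupError("no such book")

# Only these input/output combinations are available; there is, for instance,
# no procedure from (publisher, year) to titles, so that query is unanswerable.
AVAILABLE = {
    ("author",): ("title",),
    ("author", "title"): ("publisher", "year"),
}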


We start by giving an auxiliary definition, and then we give the definition of the kind of functional data sources that we will consider in the rest of the paper.

Definition 1. A data sort S is a pair (N, T), written S = N : T, where N is the name of the sort, and T its type. Two data sorts are equal if their names and their types coincide.

A data sort, in the sense in which we use the term here, is not quite a "physical" data type. For instance, author:string and title:string are both of the same data type (string) but they represent different data sorts¹. The set of complex sorts is the transitive closure of the set of data sorts with respect to the Cartesian product of sorts and the formation of collection types (sets, bags, and lists).

Definition 2. A functional data source is a pair (S, F), where S = {S1, ..., Sn} is a set of data sorts and F = {f1, ..., fm} is a set of functions fi : Ai → Bi, where both Ai and Bi are composite sorts made of sorts in S.

In the library web site, author:string and year:int are examples of data sorts. The procedures are instantiations of functions. Procedure 1, for example, instantiates a function author:string → set(title:string). The elements "author:string" and "title:string" are examples of composite sorts. Sometimes, when there is no possibility of confusion, we will omit the type of the sort. Our goal in this paper is to model a functional data source like this one in a way that resembles a set of relations upon which we can express our query conditions. To this end, we give the following definition.

Definition 3. A relational model of a functional data source is a set of relations R = {R1, ..., Rp}, where all the attributes of the Ri are sorts of the functional data source. Each relation Ri is called a relational façade for the underlying data source.

The problems we consider in this paper are the following: (1) given a model R of a functional data source (S, F) and a query on the model, is it possible to answer the query using the procedures defined for the functional data source? And (2) if the answer to the previous question is "yes," is it possible to find an optimal sequence of procedures that will answer the query with minimal cost? It goes without saying that not all the queries that are possible on the model are also possible on the data source. Consider again the library web site; a simple model for this data source is composed of a single relation, which we can call "book," defined as: book(name:string, title:string, publisher:string, year:int).

¹ The entities that we call data sorts are known in other quarters as "semantic data types." This name, however, entails a considerable epistemological commitment, quite out of place for a concept that, all in all, has nothing semantic about it: an author:string is as syntactic an entity as any abstract data type, and does not require extravagant semantic connotations.


A query like (N, T) :- book(N, T, 'dover', 1997), asking for the author and title of all books published by dover in 1997, is quite reasonable in the model, but there are no procedures on the web site to execute it. We will assume, to begin with, that the model of the web site contains a single relation. In this case we can also assume, without loss of generality, that the relation is defined on the Cartesian product of all the sorts in the functional data source. Throughout this paper, we will only consider non-recursive queries. It should be clear in the following that recursive queries require a certain extension of our method, but not a complete overhaul of it. Also, we will consider conjunctive queries², whose general form can be written as:

(X1, ..., Xk) :- R(...), S1 op1 c1, ..., Sn opn cn        (1)

where c1, ..., cn are constants, all the S's come from the sorts of the relation R, and the opi are comparison operators drawn from a suitable set (say, the usual comparison operators =, ≠, <, ≤, >, ≥). We will for the moment assume that the functional data source provides no mechanism for verifying conditions other than equality with a constant. The only operations allowed are retrieving data by entering values (constants) in a suitable field of a form or traversing a link in a web site with a constant as a label (such as the title of a book in the library example). Given the query (1) in a data source like this, we would execute it by first determining whether the function that takes the constants of the query as inputs and produces the required outputs can be computed. If it can, we compute it and, for each result returned, check whether the comparison conditions are verified. The complicated part of this query schema is the first step: the determination of the function that, given the constants in the query, allows us to obtain the query outputs augmented with all the quantities needed for the comparisons.
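The execution strategy just outlined can be sketched in Python as follows; the encoding of the source as a dictionary keyed by input sorts, and all names, are assumptions made only for this example.

import operator

OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def answer(source, bound, conditions, wanted):
    """source: maps frozenset(input sorts) -> (callable, output sorts);
    bound: {sort: constant}; conditions: [(sort, op, value)]; wanted: output sorts."""
    key = frozenset(bound)
    if key not in source:
        raise ValueError("no implemented function takes exactly these constants")
    func, out_sorts = source[key]
    needed = set(wanted) | {s for s, _, _ in conditions}
    if not needed <= set(out_sorts) | set(bound):
        raise ValueError("the function does not yield all sorts needed by the query")
    results = []
    for row in func(**bound):                    # each row: {sort: value}
        row = {**bound, **row}
        if all(OPS[op](row[s], v) for s, op, v in conditions):
            results.append(tuple(row[s] for s in wanted))
    return results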

3 Query Translation

Informally, the problem that we consider in this section is the following. We have a collection of data sorts. Given two composite data sorts, each defined as a Cartesian product of elements of this collection, one can define a formal (and unique) correspondence function from the first to the second. This function operates on the model of the data source (this is why we used the adjective "formal" for it: it is not necessarily a function that one can compute) and, given values for the sorts of the first, returns the corresponding values of the sorts of the second. If v1, ..., vn are the input values, this function computes the relational algebra operation

π_{Nj1,...,Njm}( σ_{Ni1 = v1 ∧ ... ∧ Nin = vn}(R) )        (2)

where the N's are the names of the sorts S, as per Definition 1. A correspondence function can be seen, in other words, as the functional counterpart of the query (2)

² Any query can, of course, be translated into disjunctive normal form, that is, into a disjunction of conjunctive queries. The system in this case will simply pose all the conjunctive queries and then take the union of all the results.


which, on a single table, is completely general. (Remember that we don't yet consider conditions other than equality with a constant.) The set of all correspondence functions contains the grounding of all queries that we might ask on the model. The functional data source, on the other hand, has procedures, each one of which implements a specific correspondence function. Our query implementation problem is then, given a query with its relative correspondence function, to find a suitable combination of implemented functions that is equal to it. In order to make this statement more precise, we need to clarify what we mean by "suitable combination of functions"; that is, we need to specify a function algebra. We will limit our algebra to three simple operations that create sequences of functions, as shown in Table 1. (We assume, pragmatically, that more complex manipulations are done by the procedures themselves.)

A function for which a procedure is defined, and that transforms a data sort S into a data sort P, can be represented as a diagram: an edge from S to P labeled with the function (3).

The operators of the function algebra generate diagrams like those in the first and third columns of Table 2. In order to obtain the individual data types, we introduce the formal operator of projection. The projection is "formal" in that it exists only in the diagrams: in practice, when we have the data type P × Q we simply select the portion of it that we need. The projections don't correspond to any procedure and their cost is zero. The dual of the projection operator is the Cartesian product which, given two data of type A and B, produces from them a datum of type A × B. This is also a formal operator with zero cost. In the corresponding diagram, a dotted line with the × symbol is there to remind us that we are using a Cartesian product operator, and the arrow goes from the type that will appear first in the product to the type that will appear second (we will omit the arrow when this indication is superfluous). The Cartesian product of two functions is represented analogously.

With these operations, and the corresponding diagrams, in place, we can arrange the correspondence functions in a diagram, which we call the computation diagram of a data source.


Definition 4. The computation diagram of a functional data source is a graph G = (N, E) with nodes labeled by a labeling function from N to S, S being the set of composite data sorts of the source, and edges labeled by a labeling function on E, such that each edge is one of the following:
1. a function edge, labeled with a function of the source that maps the sort of its source node to the sort of its destination node, and represented as in (3);
2. a projection edge;
3. a Cartesian product edge.

Let us go back now to our original problem. We have a query and a correspondence function that we need to compute, whose inputs are the data sorts for which we give values and whose outputs are the results that we desire. In order to see whether the computation is possible, we adopt the following strategy: first, we build the computation diagram of the data source; then we add to the graph a source node connected to the input sorts, as well as a sink node with edges coming from the output sorts; finally, we check whether a path exists from the source node to the sink node. If we are to find an optimal solution to the grounding of a correspondence function, we need to assign a cost to each node of the graph and, in order to do this, we need to determine the cost of traversing an edge. The cost functions of the various combinations that appear in a computation graph are defined in Table 2.

The problem of finding the optimal functional expression for a given query can therefore be reduced to that of finding the shortest path in a suitable function graph, a problem that we will now briefly elucidate. Let G be a function graph, G.V the set of its vertices, and G.E the set of its edges. For every node v, let d(v) be the distance between v and the source of the path, pred(v) the predecessor(s) of v in the minimal path, and adj(v) the set of nodes adjacent to v (accounting for the edge directions). In addition, a cost function c : vertex × vertex → real is defined such that c(u, w) is the cost of the edge from u to w; if the edge is not in G.E, the cost is infinite. The algorithm in Table 3 uses Dijkstra's shortest path algorithm to build a function graph that produces a given set of outputs from a given set of inputs, if such a graph


exists: the function returns the set of nodes in G where, for each node, the distance is set to the cost of the path from the source to it according to the cost function. Dijkstra's algorithm is a standard one and is not reported here.
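Table 3 itself is not reproduced here; the following Python sketch conveys the shape of such a search under stated assumptions: nodes encode composite sorts (for example as frozensets of sort names), function edges carry the cost of calling the corresponding procedure, projection and Cartesian product edges have cost zero, and Dijkstra's algorithm returns the cheapest sequence of edge labels from the source node to the sink node.

import heapq, itertools

def cheapest_plan(edges, source, target):
    """edges: dict node -> iterable of (neighbor, cost, label); returns
    (total cost, list of edge labels) or None if no plan exists."""
    tie = itertools.count()          # tie-breaker: the heap never compares nodes
    dist = {source: 0.0}
    pred = {}
    heap = [(0.0, next(tie), source)]
    done = set()
    while heap:
        d, _, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:           # reconstruct the sequence of procedures used
            plan = []
            while node != source:
                node, label = pred[node]
                plan.append(label)
            return d, plan[::-1]
        for nxt, cost, label in edges.get(node, ()):
            nd = d + cost            # projection/product edges contribute 0 here
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                pred[nxt] = (node, label)
                heapq.heappush(heap, (nd, next(tie), nxt))
    return None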

4 Relaxing Some Assumptions

The model presented so far is a way of solving a well-known problem: given a set of functions, determine what other functions can be computed using their combination. Our model is somewhat more satisfying from a modeling point of view because of the explicit inclusion of the Cartesian product of data sorts and the function algebra operators necessary to take them into account but, from an algorithmic point of view, what we are doing is still finding the transitive closure of a set of functional dependencies. We will now try to ease some of the restrictions on the data source. These extensions, in particular the inclusion of joins, can't be reduced to the transitive closure of a set of functional dependencies, and therein lies, from our point of view, the advantage of the particular form of our model.

Comparisons. The first limitation that we want to relax is the assumption that the data source doesn't have the possibility of expressing any of the predicates in the query (1). There are cases in which some limited capability in this sense is available.


We will assume that the following limitations are in place: firstly, the data source provides a finite number of predicate possibilities; secondly, each predicate is of the form S op R, where S and R are fixed data sorts, and "op" is an operator that can be chosen among a finite number of alternatives. The general idea here comes, of course, from an attempt to model web sites in which conditions can be expressed as part of "forms." In order to incorporate these conditions into our method, one can consider them as data sorts: each condition is a data sort that takes values in the set of triples (s, op, r), with s of sort S and r of sort R. In other words, indicating a sort as a pair N : T, where N is the name and T the data type of the sort, a comparison data sort is isomorphic to a sort whose values are triples of an S value, an operator of type S × R → 2, and an R value, where 2 is the data type of the booleans. A procedure that accepts in input a value of a data sort and a condition on the data sorts is represented as a function diagram that takes the condition sort as one of its inputs.

The only difference between condition data sorts and regular data sorts is that conditions can't be obtained as the result of a procedure, so that in a computation graph a condition should not have any incoming edge.

Joins. Let us consider now the case in which the model of the functional data source consists of a number of relations. We can assume, for the sake of clarity, that there are only two relations in the model.

Each of these relations supports intra-relational queries that can be translated into functions and executed using the computation graph of that part of the functional source that deals with the data sorts in the relation. In addition, however, we now have queries that need to join data between the two relations. Consider, as an example, two such relations and a query that joins them on a common variable.

We can compute this query in two ways. The first makes use of two correspondence functions: one, on the first relation, that produces the values of the variable on which we want to join, and one, on the second relation, that takes those values as part of its input.

To implement this query, we adopt the following procedure:


Procedure 3:
i) use the computation graph of the first relation to compute its correspondence function, returning a set of pairs;
ii) for each pair returned:
ii.1) compute the correspondence function of the second relation, using its graph, obtaining a set of results;
ii.2) for each result, form the pair with the value coming from the first relation and add it to the output.

The procedure can be represented using a computation graph in which the graphs that compute the two correspondence functions are used as components: a join like that in the example is computed by a diagram that feeds the join values produced by the first component into the input of the second.

The second possibility to compute the join is symmetric. While in this case we used the first relation to produce the variable on which we want to join and the second relation to impose the join condition, we will now do the reverse. We will use the corresponding pair of functions

and a computation diagram similar to the previous one. Checking whether the source can process the join, therefore, requires checking whether either of the two pairs of functions can be computed. The concept can be easily extended to a source with many relations and a query with many joins as follows. Take a conjunctive query, and let the set of its joins be given. We can always rewrite a query so that each variable X will appear in only one relation, possibly adding some join conditions. Consider, for example, the fragment R(A, X), P(B, X), Q(C, X), which can be rewritten as R(A, X), P(B, X'), Q(C, X''), X = X', X = X''.

We will assume that all queries are normalized in this way. Given a variable X, we will refer to the relation in which X appears. Also, given a relation R in the query, we will consider the Cartesian product of its input sorts and the Cartesian product of its output sorts.


The algorithm for query rewriting is composed of two parts. The first is a function that determines whether a function from a given set of inputs to a given set of outputs can be implemented; it is reported in Table 4. The second finds a join combination that satisfies the query. It is assumed that the set of join conditions that appear in the query is given. The algorithm, reported in Table 5, returns a computation graph that computes the query with all the required joins.
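Since Tables 4 and 5 are not shown above, the following Python fragment only illustrates the underlying idea — for every join condition, at least one of the two directions must be groundable in the computation graphs of the relations involved; the implementable predicate stands for the single-relation feasibility check of the first part and is assumed rather than defined here.

def plan_joins(joins, implementable):
    """joins: list of (rel_a, sort_a, rel_b, sort_b) join conditions.
    implementable(rel, extra_inputs, extra_outputs) -> bool wraps the assumed
    single-relation feasibility check. Returns, for each join, which side
    produces the join value and which side consumes it, or None."""
    plan = []
    for rel_a, sort_a, rel_b, sort_b in joins:
        if implementable(rel_a, (), (sort_a,)) and implementable(rel_b, (sort_b,), ()):
            plan.append((rel_a, "produces", rel_b, "consumes"))
        elif implementable(rel_b, (), (sort_b,)) and implementable(rel_a, (sort_a,), ()):
            plan.append((rel_b, "produces", rel_a, "consumes"))
        else:
            return None              # neither direction of this join is computable
    return plan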

The correctness of the algorithm is proven in the following proposition.

Proposition 1. Algorithm 1 succeeds if and only if the query with the required joins can be executed.

The proof can be found in [5]. While the algorithm "joins" is an efficient (linear in the number of joins) way of finding a plan whose correctness is guaranteed, finding an optimal plan is inherently harder.

Theorem 1. Finding the minimal set of functions that implements all the joins in the query is NP-hard.

Proof. We prove the theorem with a reduction from graph cover. Let G = (V, E) be a graph, with set of nodes V and set of edges E, and let a bound B on the size of the cover be given.


Given such a graph, we build a functional source and a query as follows. For each node of the graph we define a sort and a function that produces it; all the sorts are of the same data type. For each edge we define an equality condition between the sorts of its two endpoints. We also define a function that is the only one producing the output sort Y required by the query. Finally, we define the relations and the query

where the equality conditions are derived from the edges of the graph. The reduction procedure is clearly polynomial so, in order to prove the theorem, we only need to prove that a solution of graph cover for G exists if and only if a cost-bound plan can be found for the query.
1. Suppose that a query plan for the query exists that uses B + 1 functions (the function producing Y must obviously be part of every plan, since it is the only function that gives us the required output Y). Consider the set of nodes whose functions appear in the plan, which contains, clearly, B nodes, and consider an edge of the graph. This edge is associated with a condition in the query and, since the query has been successfully planned, the function of at least one of its two endpoints is in the plan. Consequently, at least one of the two endpoints is in the set, and the edge is covered.
2. Let now a covering of size B be given, and consider the plan composed of the function producing Y together with the functions of the nodes in the covering. The output is clearly produced correctly as long as all the join conditions are satisfied. Let a join condition be given; it corresponds to an edge of the graph and, since we have a covering, at least one of the edge's endpoints is in it. Assume that it is the first (if it is the second we can clearly carry out a similar argument). Then the plan contains the function that computes the corresponding variable, so that the variables involved and the join between them are determined by the corresponding fragment of the computation graph.

5 Related Work

The idea of modeling certain types of functional sources using a relational façade (or some modification thereof) is, of course, not new. The problem of conciliating the broad matching possibilities of a relation with the constraints deriving from the source has been solved in various ways, the most common of which, to the best of our knowledge, is the use of adornments [6, 7], which also go under the name of binding patterns.


Given a relation, a binding pattern is a classification of its variables into input variables (which must be "bound" when the relation is accessed in the query, hence the name of the technique), output variables (which must be free when the relation is accessed), and dyadic variables, which can be indifferently inputs or outputs. Any query that accesses the relation by assigning values to the input variables and requiring values for some or all of the output variables can be executed on that relational façade. A relational façade can, of course, have multiple binding patterns. If the relational façade is used to model a relation isomorphic to it, for instance, it allows all the possible bound/free binding patterns on its variables or, equivalently, all its variables are dyadic. In the following, a binding pattern for any relation will be represented as a string over the symbols i, o, and d (which stand for input, output, and dyadic, respectively, although dyadic variables will not appear in the examples that follow). Unlike our technique, which determines query feasibility at run time, binding patterns are determined as part of the model. This difference results in a number of limitations of binding patterns, some examples of which are given below.

Multiple relations with hidden sorts. Consider a source with five sorts, X, Y, P, W, Q, and the functional dependencies shown in the following diagram

We want to model this source as a pair of relations, while the sort W should not be exported. Considering the two relations and the functions needed to answer queries on them, we can see that one relation has two binding patterns while the other has only one. A query that joins the two relations would be rejected by the binding-pattern verification system, because the first relation can produce a set of X values from the query constant but the second can't take the X's as an input. Mapping the query to a functional diagram, however, produces


which is computable. Therefore, the query can be answered using the model presented in this paper.

Non-binding conditions. Binding patterns are based, as the name suggests, on the idea of binding certain variables in a relation, that is, on the idea of assigning them specific values. Because of these foundations, binding-pattern models are ill-equipped to deal with non-binding conditions (that is, essentially, with all conditions except equality and membership in a finite set). As an example, consider a source with three sorts, A, B, and C, and a function that derives B and C from A; in addition, the source has a comparison capability that allows it to compare B with a fourth sort D and return the C's, for a specified value of A, such that a specified condition is verified.

Because the condition is non-binding, it doesn't contribute any binding pattern to the relation R(A, B, C), so a query that constrains B through the comparison would be rejected by a binding-pattern system even though it can be answered through the functional diagram.

A type with this stereotype defines an event type. Like any other entity type, event types may be specialized and/or generalized. This will allow us to build a taxonomy of event types, where common elements are defined only once. It is convenient to define a root entity type, named Event, as shown in Figure 2. All event types are direct or indirect subtypes of Event. In fact, Event is defined as derived by union of its subtypes. We define in this event type the attribute time, which gives the occurrence time of each event. We define also the abstract operation effect, whose purpose will be made clear later. It is not necessary to stereotype event types explicitly, because all direct or indirect subtypes of Event will be considered event types. The view of events as entities is not restricted to domain events. We apply it also to query events.

3.2 Event Characteristics The characteristics of an event are the set of relationships in which it participates. There is at least one relationship between each event entity and a time point, representing the event occurrence time. We assume that the characteristics of an event are determined when the event occurs, and remain fixed. In a particular language, the characteristics of events should be modeled like those of ordinary entities. In the UML, we model them as attributes or associations. Figure 2 shows the definition of the external domain event type NewProduct, with four attributes (including time) and an association with Vendor. The immutability of characteristics can be defined by setting their isReadOnly property to true (not shown in the Figure) [29, p. 89+].

Fig. 2. Definition of event type NewProduct


Event characteristics may be derived. The value for a derived characteristic may be computed from other characteristics and/or the state of the IB when the event occurs, as specified by the corresponding derivation rule. The practical importance of derived characteristics is that they can be referred to in any expression (integrity constraints, effect, etc.) exactly as the base ones, but their definition appears in a single place (the derivation rule). In the UML, derived elements (attributes, associations, entity types) are marked with a slash (/). We define derivation rules by means of defining operations [26]. In the example of Figure 2, attribute vendorName gives the name of the vendor that will supply the new product. The association between NewProduct and Vendor may be derived from the vendor's name. The defining operation NewProduct::vendor(): Vendor gives the vendor associated with an event instance. In the UML 2.0, the result of operations is specified by a body expression [29, p. 76+]. Using the OCL, the formal specification of the above operation may be given as a body expression that selects the Vendor whose name equals the event's vendorName.
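Purely as an illustration of that derivation — not the paper's OCL — the idea can be sketched in Python as follows, with the class layout and attribute names being assumptions based on Figure 2.

class Vendor:
    def __init__(self, name, lead_time_days=0):
        self.name = name
        self.lead_time_days = lead_time_days

VENDORS = {}                         # stand-in for Vendor.allInstances()

class NewProduct:
    """External domain event; vendor is a derived characteristic."""
    def __init__(self, time, product_name, description, vendor_name):
        self.time = time             # occurrence time, inherited from Event
        self.product_name = product_name
        self.description = description
        self.vendor_name = vendor_name

    def vendor(self):
        # Derivation rule: the associated Vendor is the one whose name
        # equals the value of the vendorName characteristic.
        return VENDORS.get(self.vendor_name)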

3.3 Event Constraints

An event constraint is a condition an event must satisfy to occur [8]. An event constraint involves the event characteristics and the state of the IB before the event occurrence. It is assumed that the state of the IB before the event occurrence satisfies all defined constraints. Therefore, an event E can occur when the domain is in state S if: (1) state S satisfies all constraints, and (2) event E satisfies its event constraints. An IS checks event constraints when the events occur and the values of their characteristics have been established, but before the events have any effect on the IB or produce any answer. Events that do not satisfy their constraints are not allowed to occur and, therefore, they must be rejected. Event constraint checking is (assumed to be) done instantaneously. In a particular conceptual modeling language, event constraints can be represented like any other constraint. In the UML, they can be expressed as invariants or as constraint operations [27]. Event constraints are always creation-time constraints because they must be satisfied when events occur. Here we will define constraints by operations, called constraint operations, and we specify them in the OCL. In the UML, we show constraint operations graphically with a dedicated stereotype. The result of the evaluation of constraint operations must be true. A constraint of the NewProduct event (Figure 2) is that the product being added cannot exist already. We define it with the constraint operation doesNotExist.
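Its OCL text is likewise not reproduced here; as a rough Python rendering of the check it performs — evaluated against the state of the IB before the event has any effect, and assuming products are identified by name — one could write:

PRODUCTS = {}                        # stand-in for the IB's Product population

def does_not_exist(event):
    """Constraint operation of NewProduct: the product being added must not
    already exist; events violating the constraint are rejected."""
    return event.product_name not in PRODUCTS

def check_event(event, constraint_operations):
    # An IS checks event constraints after the characteristics are set but
    # before the event has any effect or produces any answer.
    if not all(check(event) for check in constraint_operations):
        raise ValueError("event rejected: an event constraint is violated")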

On the other hand, the vendor must exist. This is also an event constraint. However, in this case the constraint can be expressed as a cardinality constraint. The multiplicity 1 in the vendor role requires that each instance of NewProduct must be


linked to exactly one vendor. The constraint is violated if the vendor() operation does not return an instance of Vendor. An event constraint defined in a supertype applies to all its direct and indirect instances. This is one of the advantages of defining event taxonomies: common constraints can be defined in a generalized event type. Figure 3 shows an example. The event type ExistingProductEvent is defined as the union of NewRequirement, PurchaseOrderRelease and ProductDetails. The constraint that the product must exist is defined in ExistingProductEvent, and it applies to all its indirect instances. Note that the constraint has been defined by a cardinality constraint, as explained above. Although it is not shown in Figure 3, the event type ExistingProductEvent is a subtype of Event. Figure 3 shows also the constraint validDate in NewRequirement. The constraint is satisfied if dateRequired is greater than the event date.

Fig. 3. ExistingProductEvent is a subtype of Event (not shown here) and a common supertype of the domain event types NewRequirement and PurchaseOrderRelease and of the query event type ProductDetails

3.4 Query Events Effects The effect of a query event is an answer providing the requested information. The effect is specified by an expression whose evaluation on the IB gives the requested information. The query is written in some language, depending on the conceptual modeling language used. In the UML, we can represent the answer to a query event and the query expression in several ways. We explain one of them here, which can be used as is, or as a basis for the development of alternative ways. The answer to a query event is modeled as one or more attributes and/or associations of the event, with some predefined name. In the examples, we shall use names starting with answer. An alternative could be the use of a stereotype to indicate that an attribute or association is the answer of the event. Now, we need a way to define the value of the answer attributes and associations. To this end, we use the operation effect that we have defined in Event. This operation will have a different specification in each event type. For query events, its purpose is


to specify the values of the answer attributes and associations. The specification of the operation can be done by means of postconditions, using the OCL. Figure 3 shows the representation of the external query event type ProductDetails. The answer is given by an answer attribute, and the effect operation is specified by a postcondition that sets its value from the details of the requested product.

Alternatively, in O-O languages the answer to a query event could be specified as the invocation of some operation. The effect of this operation would then be the answer of the query event.

3.5 Domain Events Effects: The Postcondition Approach The effect of a domain event is a set of structural events. There are two main approaches to the definition of that set: the postconditions and the structural events approaches [25]. These approaches are called declarative and imperative specifications, respectively, in [34]. In the former, the definition is a condition satisfied by the IB after the application of the event effect. In the latter, the definition is an expression whose evaluation gives the corresponding structural events. Both approaches can be used in our method, although we (as many others) tend to favor the use of postconditions. We deal with the postcondition approach in this subsection, and the structural events approach in the next one.

Fig. 4. Definition of OrderReception and OrderReschedule event types

In the postcondition approach, the effect of an event Ev is defined by a condition C over the IB. The idea is that the event Ev leaves the IB in a state that satisfies C. It is also assumed that the state after the event occurrence satisfies all constraints defined over the IB. Therefore, the effect of event Ev is a state that satisfies condition C and all IB constraints.


In an O-O language, we can represent the effect of a domain event in several ways. As we did for query events, we explain one way here, which can be used as is, or as a basis for the development of alternative ways. We define a particular operation in each domain event type, whose purpose is to specify the effect. To this end, we use the operation effect that we have defined in Event. This operation will have a different specification in each event type. Now, the postcondition of this operation will be exactly the postcondition of the corresponding event. As we have been doing until now, in the UML we also use the OCL to specify these postconditions formally. As an example, consider the external domain event type NewRequirement, shown in Figure 3. The effect of one instance of this event type is the addition of one instance into entity type Requirement (see Figure 1). Therefore, in this case the specification of the effect operation is a single postcondition.

In our method, we do not define preconditions in the specification of effect operations. The reason is that we implicitly assume that the events satisfy their constraints before the application of their effect. In the example, we assume implicitly that a NewRequirement event references an existing product, and that its required date is valid. The postcondition states simply that a new instance of Requirement has been created in the IB, with the corresponding values for its attributes and association. Any implementation of the effect operation that leaves the IB in a state that satisfies the postcondition and the IB constraints is valid. Another example is the external domain event type OrderReception (see Figure 4). An instance of OrderReception occurs when a scheduled receipt is received. The event effect is that the purchase order now becomes ReceivedOrder (see Figure 1), and that the quantity on hand of the corresponding product is increased by the quantity received. We specify this effect with two postconditions of effect() in OrderReception:
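informally, the first requires that after the event the purchase order has become a ReceivedOrder, and the second that the product's quantity on hand has grown by the quantity received. A hedged Python sketch of one implementation that satisfies both conditions — class and attribute names are assumptions based on Figures 1 and 4 — is the following.

class Product:
    def __init__(self, name, qty_on_hand=0):
        self.name = name
        self.qty_on_hand = qty_on_hand

class PurchaseOrder:
    def __init__(self, product, quantity_ordered):
        self.product = product
        self.quantity_ordered = quantity_ordered

class ReceivedOrder(PurchaseOrder):
    """State reached by a purchase order once its goods have been received."""
    def __init__(self, order, quantity_received):
        super().__init__(order.product, order.quantity_ordered)
        self.quantity_received = quantity_received

def order_reception_effect(order, quantity_received):
    # Postcondition 1: the purchase order now is a ReceivedOrder.
    received = ReceivedOrder(order, quantity_received)
    # Postcondition 2: the product's quantity on hand is increased.
    received.product.qty_on_hand += quantity_received
    return received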

3.6 Domain Events Effects: The Structural Events Approach In the structural events approach, the effect of an event Ev is defined by an expression written in some language. The idea is that the evaluation of the expression gives the


set S of structural events corresponding to the effect of the event Ev. The application of S to the previous state of the IB produces the new state. The new state of the IB is the previous state plus the entities or relationships inserted, and minus the entities or relationships deleted. This approach is in contrast with the previous one, which defines a condition that characterizes the state of the IB after the event. It is assumed that the set S is such that it leaves the IB in a new state that satisfies all the constraints. Therefore, when defining the expression, one must take into account the existing constraints and ensure that the new state of the IB will satisfy all of them. Our method could be used in O-O languages that follow the structural events approach. The idea is to provide a method for the effect operations. The method is a procedural expression, written in the corresponding language, whose evaluation yields the structural events.
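A minimal sketch of the contrast, over a toy information base and with invented names: in this style the effect method evaluates to the explicit set of insertions and deletions, which are then applied to the previous state.

from dataclasses import dataclass

@dataclass(frozen=True)
class Insert:
    entity_type: str
    values: tuple

@dataclass(frozen=True)
class Delete:
    entity_type: str
    values: tuple

def new_requirement_effect(event):
    """Structural-events style: return the set S of structural events."""
    return {Insert("Requirement",
                   (event.product_name, event.quantity, event.date_required))}

def apply_structural_events(ib, structural_events):
    # New IB state = previous state plus insertions, minus deletions.
    for ev in structural_events:
        rows = ib.setdefault(ev.entity_type, set())
        if isinstance(ev, Insert):
            rows.add(ev.values)
        else:
            rows.discard(ev.values)
    return ib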

3.7 Comparison with Previous Work In most current conceptual modeling methods and languages, events are not considered objects. Instead of this, events are represented as invocations of actions or operations, or the reception of signals or messages. Event types are defined by means of operations (with their signatures) or an equivalent construct. We believe that the view of events as entities (albeit of a special kind) provides substantial benefits to behavioral modeling. The reason is that the uniform treatment of event and entity types implies that most (if not all) language constructs available for entity types can be used also for event types. In particular: (1) Event types with common characteristics, constraints, derivation rules and effects can be generalized, so that common parts are defined in a single place, instead of repeating them in each event type. We have found that, in practice, many event types have characteristics, constraints and derivation rules in common with others [14]; (2) The graphical notation related to entity types (including attributes, associations, multiplicities, generalization, packages, etc.) can be used also for event types; and (3) Event types can be specialized in a way similar to entity types, as we explain in the next section.

4 Event Specialization

One of the fundamental constructs of O-O conceptual modeling languages is the specialization of entity types. When we consider events as entities, we have the possibility of defining specializations of event types. We may use these specializations when we want to define an event type whose characteristics, constraints and/or effect are extensions and/or specializations of another event type. For example, assume that some instances of NewRequirement are special because they require a large quantity of their product and, for some reason, the quantity required must be ordered immediately from the corresponding vendor. This special behavior can be defined in a new event type, SpecialRequirement, defined as a specialization of NewRequirement, as shown in Figure 5. Note that SpecialRequirement redefines the constraint validDate and adds a new constraint called largeQuantity. The required date of the new events must be beyond


Fig. 5. Two specializations of the event type NewRequirement (Fig. 3)

the current date plus the vendor’s lead time, and the quantity required must be at least ten times the product order minimum. In the UML, the body of operations may be overridden when an operation is redefined, whereas preconditions and postconditions can only be added [29, p. 78]. Therefore, we redefine validDate as:

The new constraint largeQuantity can be defined as:

The effect of a SpecialRequirement is the same as that of a NewRequirement, but we want the system to generate an instance of PurchaseOrderRelease (see Figure 3). We define this extension as an additional postcondition of the effect operation:
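taken together with the redefined validDate and the added largeQuantity above, this is the usual specialization pattern: the subtype strengthens one constraint, adds another, and extends the inherited effect. A hedged Python analogue — class and attribute names are assumptions drawn from Figures 3 and 5, not the paper's OCL — is:

from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Vendor:
    name: str
    lead_time_days: int = 0

@dataclass
class Product:
    name: str
    order_minimum: int
    vendor: Vendor

class NewRequirement:
    def __init__(self, time, product, quantity, date_required):
        self.time, self.product = time, product
        self.quantity, self.date_required = quantity, date_required

    def valid_date(self):
        # Original constraint: the required date must lie after the event date.
        return self.date_required > self.time

    def effect(self, ib):
        ib.setdefault("Requirement", set()).add(
            (self.product.name, self.quantity, self.date_required))

class SpecialRequirement(NewRequirement):
    def valid_date(self):
        # Redefined: beyond the current date plus the vendor's lead time.
        return self.date_required > self.time + timedelta(
            days=self.product.vendor.lead_time_days)

    def large_quantity(self):
        # Added constraint: at least ten times the product's order minimum.
        return self.quantity >= 10 * self.product.order_minimum

    def effect(self, ib):
        super().effect(ib)           # same effect as NewRequirement ...
        # ... extended: the system also generates a PurchaseOrderRelease.
        ib.setdefault("PurchaseOrderRelease", set()).add(
            (self.product.name, self.quantity, self.time))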

On the other hand, we can define event types derived by specialization. A derived event type is an event type whose instances at any time can be inferred by means of a derivation rule. An event type Ev is derived by specialization of other event types when Ev is derived and its instances are also instances of those types [28]. We may use event types derived by specialization when we want to define particular constraints and/or effects for events that satisfy some condition. For example, suppose that some instances of NewRequirement are urgent because they are required within the temporal horizon of the current MRP plan (seven days), and therefore they could not have been taken into account when the plan was generated. We want a behavior similar to that of the previous example. The difference is that now we determine automatically which are the urgent requirements. We define a new


event type, UrgentRequirement, shown in Figure 5, defined as derived by specialization of NewRequirement. In the UML, the name we give to the defining operations of derived entity types is allInstances [26]. In this case, allInstances is a class operation that gives the population of the type. The derivation rule of UrgentRequirement is then an allInstances expression that selects the NewRequirement instances whose required date falls within the plan's temporal horizon.

The effect of an urgent requirement is the same as that of a new requirement, but again we want the system to generate an instance of PurchaseOrderRelease (see Figure 3). We would define this extension as an additional postcondition of the effect operation, as we did in the previous example. Comparison with Previous Work. Event specialization is not possible when events are seen as operation invocations. The consequence is that this powerful modeling construct cannot be used in methods like those mentioned in the Introduction.

5 Conclusions

In the context of O-O conceptual modeling languages, we have proposed a method that models events as entities (objects), and event types as a special kind of entity types. The method makes extensive use of language constructs such as constraints, derived types, derivation rules, type specializations, operations and operation redefinition, which are present in all complete conceptual modeling languages. The method can be adapted to most O-O languages. In this paper we have explained in detail its adaptation to the UML. The method is fully compatible with UML-based CASE tools, and thus it can be adopted in industrial projects, if it is felt appropriate. The main advantage of the method we propose is the uniform treatment we give to event and entity types. The consequence is that most (if not all) language constructs available for entity types can be used also for event types. Event types may have constraints and derived characteristics, like entity types. Characteristics, constraints and effects shared by several event types may be defined in a single place. Event specialization allows the incremental definition of new event types, as refinements of their supertypes. Historical events ease the definition of constraints, derivation rules and event effects. In summary, we believe that the view of events as entities provides substantial benefits to behavioral modeling. Among the work that remains to be done is the integration of the proposed method with state transition diagrams. These diagrams allow defining the kinds of events that were beyond the scope of this paper.

Acknowledgements I would like to thank Jordi Cabot, Jordi Conesa, Dolors Costal, Xavier de Palol, Cristina Gómez, Anna Queralt, Ruth Raventós, Maria Ribera Sancho and Ernest


Teniente for their help and many useful comments on previous drafts of this paper. This work has been partly supported by the Ministerio de Ciencia y Tecnologia and FEDER under project TIC2002-00744.

References

1. Abrial, J-R., The B-Book, Cambridge University Press, 1996, 779 p.
2. Bonner, A.J.; Kifer, M., "The State of Change: A Survey", LNCS 1472, 1998, pp. 1-36.
3. Borgida, A.; Greenspan, S., "Data and Activities: Exploiting Hierarchies of Classes", Workshop on Data Abstraction, Databases and Conceptual Modelling, 1980, pp. 98-100.
4. Bubenko, J.A. jr., "Information Modeling in the Context of System Development", Proc. IFIP 1980, North-Holland, 1980, pp. 395-411.
5. Cabot, J.; Olivé, A.; Teniente, E., "Representing Temporal Information in UML", Proc. UML'03, LNCS 2863, pp. 44-59.
6. Ceri, S.; Fraternali, P., Designing Database Applications with Objects and Rules. The IDEA Methodology, Addison-Wesley, 1997, 579 p.
7. Coleman, D.; Arnold, P.; Bodoff, S.; Dollin, C.; Gilchrist, H.; Hayes, F.; Jeremaes, P., Object-Oriented Development. The Fusion Method, Prentice Hall, 1994, 316 p.
8. Cook, S.; Daniels, J., Designing Object Systems. Object-Oriented Modelling with Syntropy, Prentice Hall, 1994, 389 p.
9. Costal, D.; Olivé, A.; Sancho, M-R., "Temporal Features of Class Populations and Attributes in Conceptual Models", Proc. ER'97, LNCS 1331, Springer, pp. 57-70.
10. D'Souza, D.F.; Wills, A.C., Objects, Components and Frameworks with UML. The Catalysis Approach, Addison-Wesley, 1999, 785 p.
11. Dardenne, A.; van Lamsweerde, A.; Fickas, S., "Goal-directed requirements acquisition", Science of Computer Programming, 20 (1993), pp. 3-50.
12. Davis, A.M., Software Requirements. Objects, Functions and States, Prentice-Hall, 1993.
13. Embley, D.W.; Kurtz, B.D.; Woodfield, S.N., Object-Oriented System Analysis. A Model-Driven Approach, Yourdon Press, 1992, 302 p.
14. Frias, L.; Olivé, A.; Queralt, A., "EU-Rent Car Rentals Specification", UPC, Research Report LSI 03-59-R, 2003, 159 p., http://www.lsi.upc.es/dept/techreps/techreps.html.
15. Gamma, E.; Helm, R.; Johnson, R.; Vlissides, J., Design Patterns. Elements of Reusable Object-Oriented Software, Addison-Wesley, 1995, 395 p.
16. Harel, D.; Gery, E., "Executable Object Modeling with Statecharts", IEEE Computer, July 1997, pp. 31-42.
17. IEEE, IEEE Standard for Conceptual Modeling Language Syntax and Semantics for IDEF1X97 (IDEFobject), IEEE Std 1320.2-1998, 1999.
18. ISO/TC97/SC5/WG3, "Concepts and Terminology for the Conceptual Schema and the Information Base", J.J. van Griethuysen (ed.), March 1982.
19. Jungclaus, R.; Saake, G.; Hartmann, T.; Sernadas, C., "TROLL – A Language for Object-Oriented Specification of Information Systems", ACM TOIS, 14(2), 1996, pp. 175-211.
20. Larman, C., Applying UML and Patterns, Prentice Hall, 2002, 627 p.
21. Martin, J.; Odell, J.J., Object-Oriented Methods: A Foundation, Prentice Hall, 1995, 412 p.
22. Martin, R.C., Agile Software Development, Principles, Patterns and Practices, Prentice Hall, 2003, 529 p.
23. Mellor, S.J.; Balcer, M.J., Executable UML. A Foundation for Model-Driven Architecture, Addison-Wesley, 2002, 368 p.
24. Mylopoulos, J.; Bernstein, P.A.; Wong, H.K.T., "A Language Facility for Designing Database-Intensive Applications", ACM TODS, 5(2), 1980, pp. 185-207.


25. Olivé, A., "Time and Change in Conceptual Modeling of Information Systems", in Brinkkemper, S.; Lindencrona, E.; Solvberg, A. (eds.), Information Systems Engineering. State of the Art and Research Themes, Springer, 2000, pp. 289-304.
26. Olivé, A., "Derivation Rules in Object-Oriented Conceptual Modeling Languages", Proc. CAiSE 2003, LNCS 2681, pp. 404-420.
27. Olivé, A., "Integrity Constraints Definition in Object-Oriented Conceptual Modeling Languages", Proc. ER 2003, LNCS 2813, pp. 349-362, 2003.
28. Olivé, A.; Teniente, E., "Derived types and taxonomic constraints in conceptual modeling", Information Systems, 27(6), 2002, pp. 391-409.
29. OMG, UML Superstructure 2.0 Final Adopted Specification, 2003, http://www.omg.org/cgi-bin/doc?ptc/2003-08-02.
30. Robinson, K.; Berrisford, G., Object-oriented SSADM, Prentice Hall, 1994, 524 p.
31. Rumbaugh, J.; Jacobson, I.; Booch, G., The Unified Modeling Language Reference Manual, Addison-Wesley, 1999, 550 p.
32. Selic, B.; Gullekson, G.; Ward, P.T., Real-Time Object-Oriented Modeling, John Wiley & Sons, 1994, 525 p.
33. Teisseire, M.; Poncelet, P.; Cichetti, R., "Dynamic Modelling with Events", Proc. CAiSE'94, LNCS 811, pp. 186-199, 1994.
34. Wieringa, R., "A survey of structured and object-oriented software specification methods and techniques", ACM Computing Surveys, 30(4), December 1998, pp. 459-527.

Enterprise Modeling with Conceptual XML

David W. Embley¹, Stephen W. Liddle², and Reema Al-Kamha¹

¹ Department of Computer Science, Brigham Young University, Provo, Utah 84602, USA
{embley,reema}@cs.byu.edu

² School of Accountancy and Information Systems, Brigham Young University, Provo, Utah 84602, USA

Abstract. An open challenge is to integrate XML and conceptual modeling in order to satisfy large-scale enterprise needs. Because enterprises typically have many data sources using different assumptions, formats, and schemas, all expressed in – or soon to be expressed in – XML, it is easy to become lost in an avalanche of XML detail. This creates an opportunity for the conceptual modeling community to provide improved abstractions to help manage this detail. We present a vision for Conceptual XML (C-XML) that builds on the established work of the conceptual modeling community over the last several decades to bring improved modeling capabilities to XML-based development. Building on a framework such as C-XML will enable better management of enterprise-scale data and more rapid development of enterprise applications.

1 Introduction

A challenge [3] for modern enterprise modeling is to produce a simple conceptual model that: (1) works well with XML and XML Schema; (2) abstracts well for conceptual entities and relationships; (3) scales to handle both large data sets and complex object interrelationships; (4) allows for queries and defined views via XQuery; and (5) accommodates heterogeneity. The conceptual model must work well with XML and XML Schema because XML is rapidly becoming the de facto standard for business data. Because conceptualizations must support both high-level understanding and high-level program construction, the conceptual model must abstract well. Because many of today’s huge industrial conglomerations have large, enterprise-size data sets and increasingly complex constraints over their data, the conceptual model must scale up. Because XQuery, like XML, is rapidly becoming the industry standard, the conceptual model must smoothly incorporate both XQuery and XML. Finally, because we can no longer assume that all enterprise data is integrated, the conceptual model must accommodate heterogeneity. Accommodating heterogeneity also supports today’s rapid acquisitions and mergers, which require fast-paced solutions to data integration. We call the answer we offer for this challenge Conceptual XML (C-XML). C-XML is first and foremost a conceptual model, being fundamentally based on object-set and relationship-set constructs. As a central feature, C-XML supports


high-level object- and relationship-set construction at ever higher levels of abstraction. At any level of abstraction the object and relationship sets are always first class, which lets us address object and relationship sets uniformly, independent of level of abstraction. These features of C-XML make it abstract well and scale well. Secondly, C-XML is “model-equivalent” [9] with XML Schema, which means that C-XML can represent each component and constraint in XML Schema and vice versa. Because of this correspondence between C-XML and XML Schema, XQuery immediately applies to populated C-XML model instances and thus we can raise the level of abstraction for XQuery by applying it to high-level model instances rather than low-level XML documents. Further, we can define high-level XQuery-based mappings between C-XML model instances over in-house, autonomous databases, and we can declare virtual views over these mappings. Thus, we can accommodate heterogeneity at a higher level of abstraction and provide uniform access to all enterprise data. Besides enunciating a comprehensive vision for the XML/conceptual-modeling challenge [3], our contributions in this paper include: (1) mappings to and from C-XML and XML Schema, (2) defined mechanisms for producing and using first-class, high-level, conceptual abstractions, and (3) XQuery view definitions over both standard and federated conceptual-model instances that are themselves conceptual-model equivalent. As a result of these contributions, C-XML and XML Schema can be fully interchangeable in their usage over both standard and heterogeneous XML data repositories. This lets us leverage conceptual model abstractions for high-level understanding while retaining all the complex details involved with low-level XML Schema intricacies, view mappings, and integration issues over heterogeneous XML repositories. We present the details of our contributions as follows. Section 2 describes C-XML. Section 3 shows that C-XML is “model-equivalent” with XML Schema by providing mappings between the two. Section 4 describes C-XML views. We report the status of our implementation and conclude in Section 5.

2 C-XML: Conceptual XML

C-XML is a conceptual model consisting of object sets, relationship sets, and constraints over these object and relationship sets. Graphically a C-XML model instance M is an augmented hypergraph whose vertices and edges are respectively the object sets and relationship sets of M, and whose augmentations consist of decorations that represent constraints. Figure 1 shows an example. In the notation boxes represent object sets – dashed if lexical and not dashed if nonlexical because their objects are represented by object identifiers. With each object set we can associate a data frame (as we call it) to provide a rich description of its value set and other properties. A data frame lets us specify, for example, that OrderDate is of type Date or that ItemNr values must satisfy the value pattern “[A-Z]{3}-\d{7}”. Lines connecting object sets are relationship sets; these lines may be hyper-lines (hyper-edges in hyper-graphs) with diamonds when they have more than two connections to object sets. Optional


Fig. 1. Customer/Order C-XML Model Instance.

or mandatory participation constraints respectively specify whether objects in a connected relationship may or must participate in a relationship set (an “o” on a connecting relationship-set line designates optional while the absence of an “o” designates mandatory). Thus, for example, the C-XML model instance in Figure 1 declares that an Order must include at least one Item but that an Item need not be included in any Order. Arrowheads on lines specify functional constraints. Thus, Figure 1 declares that an Item has a Price and a Description and is in a one-to-one correspondence with ItemNr and that an Item in an Order has one Qty and one SalePrice. In cases when optional and mandatory participation constraints along with functional constraints are insufficient to specify minimum and maximum participation, explicit min..max constraints may be specified. Triangles denote generalization/specialization hierarchies. We can constrain ISA hierarchies by partition, union, or mutual exclusion (+) among specializations. Any object-set/relationship-set connection may have a role, but a role is simply a shorthand for an object set that denotes the subset consisting of the objects that actually participate in the connection.
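As a small, purely illustrative sketch of what a data frame adds to an object set, the ItemNr restriction mentioned above can be captured and enforced as follows; the DataFrame class is our own stand-in for the construct described in the text, not a C-XML artifact.

import re

class DataFrame:
    """Rich description of an object set's value set: a type plus optional checks."""
    def __init__(self, value_type, pattern=None):
        self.value_type = value_type
        self.regex = re.compile(pattern) if pattern else None

    def accepts(self, value):
        if not isinstance(value, self.value_type):
            return False
        return self.regex is None or bool(self.regex.fullmatch(value))

item_nr = DataFrame(str, r"[A-Z]{3}-\d{7}")   # value pattern of the ItemNr data frame
assert item_nr.accepts("ABC-1234567")
assert not item_nr.accepts("AB-1234567")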

3

Translations Between C-XML and XML Schema

Many translations between C-XML and XML Schema are possible. In recent ER conferences, researchers have described varying conceptual-model translations to and/or from XML or XML DTDs or XML-Schema-like specifications. (See,


for example, [4, 6, 10].) It is not our purpose here to argue for or against a particular translation. Indeed, we would argue that a variety of translations may be desirable. For any translation, however, we require information and constraint preservation. This ensures that an XML Schema and a conceptual instantiation of an XML Schema as a C-XML model instance correspond and that a system can reflect manipulations of the one in the other. To make our correspondence exact, we need information- and constraint-preserving translations in both directions. We do not, however, require that translations be inverses of one another – translations that generate members of an equivalence class of XML Schema specifications and C-XML model instances are sufficient. In Section 3.1 we present our C-XML-to-XML-Schema translation, and in Section 3.2 we present an XML-Schema-to-C-XML translation. In Section 3.3 we formalize the notions of information and constraint preservation and show that the translations we propose preserve information and constraints.

3.1

Translation from C-XML to XML Schema

We now describe our process for translating a C-XML model instance C to an XML Schema instance. We illustrate our translation process with the C-XML model instance of Figure 1 translated to the corresponding XML Schema excerpted in Figure 2. Fully automatic translation from C is not only possible, but can be done with certain guarantees regarding the quality of the result. Our approach is based on our previous work [8], which for C generates a forest of scheme trees such that (1) the forest has a minimal number of scheme trees, and (2) XML documents conforming to it have no redundant data with respect to the functional and multivalued constraints of C. For our example in Figure 1, the algorithms in [8] generate the following two nested scheme trees.

Observe that the XML Schema in Figure 2 satisfies these nesting specifications. Item in the second scheme tree appears as an element on Line 8 with ItemNr, Description, and Price defined as its attributes on Lines 28–30. PreviousItem is nested, by itself, underneath Item, on Line 18, and Manufacturer, RequestDateTime, and Qty are nested underneath Item as a group on Lines 13–15. The XML-Schema notation that accompanies these C-XML object-set names obscures the nesting to some extent, but this additional notation is necessary either to satisfy the syntactic requirements of XML Schema or to allow us to specify the constraints of the C-XML model instance. As we continue, recall first that each C-XML object set has an associated data frame that contains specifications such as type declarations, value restrictions, and any other annotations needed to specify information about objects in


Fig. 2. XML Schema Excerpt for the C-XML Model Instance in Figure 1.

the object set. For our work here, we let the kind of information that appears in a data frame correspond exactly to the kind of data constraint information specifiable in XML Schema. One example we point out explicitly is order information, which is usually absent in conceptual models, but is unavoidably present in XML. Thus, if we wish to say that CustomerName precedes CustomerAddr, we add an annotation to the CustomerName data frame stating that it precedes CustomerAddr, and an annotation to the CustomerAddr data frame stating that it follows CustomerName. In our discussion, we assume that these annotations are in the data frames that accompany the object sets CustomerName and CustomerAddr in Figure 1. Our conversion algorithm preserves all annotations found in C-XML data frames. This is where we obtain all the type specifications in Figure 2.


We capture the order specification that CustomerName precedes CustomerAddr by making CustomerName and CustomerAddr elements (rather than attributes) and placing them, in order, in their proper place in the nesting – for our example in Lines 58 and 59 nested under CustomerDetails. In the conversion from C-XML to XML Schema we use attributes instead of elements where possible. An object set can be represented as an attribute of an element if it is lexical, is functionally dependent on the element, and has no order annotations. The object sets OrderID and OrderDate, for example, satisfy these conditions and appear as attributes of an Order element on Lines 75 and 76. Both attributes are also marked as “required” because of their mandatory connection to Order, as specified by the absence of an “o” on their connection to Order in Figure 1. When an object set is lexical but not functionally dependent, and order constraints do not hold, the object set becomes an element with minimum and maximum participation constraints. PreviousItem in Line 18 has a minimum participation constraint of 0 and a maximum of unbounded. Because XML Schema will not let us directly specify n-ary relationship sets, we convert them all to binary relationship sets by introducing a tuple identifier. We can think of each diamond in a C-XML diagram as being replaced by a nonlexical object set containing these tuple identifiers. To obtain a name for the object set containing the tuple identifiers, we concatenate the names of the non-functionally-dependent object sets. For example, given the relationship set for Order, Item, SalePrice, and Qty, we generate an OrderItem element (Line 63). If names become too long, we abbreviate using only the first letter of some object-set names. Thus, for example, we generate ItemMR (Line 11) for the relationship set connecting Item, Manufacturer, RequestDateTime, and Qty. When a lexical object set has a one-to-one relationship with a nonlexical object set, we use the lexical object set as a surrogate for the nonlexical object set and generate a key constraint. In our example, this generates key constraints for Order/OrderID in Line 35 and Item/ItemNr in Line 39. We also use these surrogate identifiers, as needed, to maintain explicit referential integrity. Observe that in the scheme trees above, Item in the first tree references Item in the root of the second scheme tree and also that PreviousItem in the second scheme tree is a role and therefore a specific specialization (or subset) of Item in the root. Thus, we generate keyref constraints, one in Lines 69–72 to ensure the referential integrity of ItemNr in the OrderItem element and another in Lines 22–25 for the PreviousItem element. Another construct in C-XML we need to translate is generalization/specialization. XML Schema uses the concept of substitution groups to allow the use of multiple element types in a given context. Thus, for example, we generate an abstract element for Customer in Line 44, but then specify in Lines 45–55 a substitution group for Customer that allows RegularCustomer and PreferredCustomer to appear in a Customer context. We model content that would normally be associated with the generalization by generating a group that is referenced in each specialization (in Lines 47 and 52). In our example, we generate the group


CustomerDetails and nest the details of Customer such as CustomerName, CustomerAddr, and Orders under CustomerDetails as we do beginning in Line 56. Further, we can nest any information that only applies to one of the specializations directly with that specialization; thus, in Line 48 we nest Discount under PreferredCustomer. Finally, XML documents need to have a single content root node. Thus, we assume the existence of an element called Document (Line 4) that serves as the universal content root.
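As an illustration of the attribute-versus-element rule just described, the following Python sketch (our own illustration, not code from the paper) decides how a connected lexical object set would be rendered when generating XML Schema.

def representation(lexical, functional_to_parent, has_order_annotation):
    # Rule from the text: lexical + functionally dependent on the element +
    # no order annotations -> XML attribute; otherwise -> nested element
    # (with minimum/maximum occurrence constraints).
    if lexical and functional_to_parent and not has_order_annotation:
        return "attribute"
    return "element"

# OrderID is lexical, functionally dependent on Order, and unordered -> attribute
print(representation(True, True, False))    # attribute
# CustomerName carries an order annotation (it precedes CustomerAddr) -> element
print(representation(True, True, True))     # element
# PreviousItem is not functionally dependent on Item -> element (minOccurs=0, maxOccurs=unbounded)
print(representation(False, False, False))  # element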

3.2

Translation from XML Schema to C-XML

We translate XML Schema instances to C-XML by separating structural XML Schema concepts (such as elements and attributes) from non-structural XML Schema concepts (such as attribute types and order constraints). Then we generate C-XML constructs for the structural concepts and annotate the generated C-XML object sets with the non-structural information. We can convert an XML Schema S to a C-XML model instance by generating object sets for each element and attribute type, connected by relationship sets according to the nesting structure of S. Figure 3 shows the result of applying our conversion process to the XML Schema instance of Figure 2. Note that we nest object and relationship sets inside one another corresponding to the nested element structure of the XML Schema instance. Whether we display C-XML object sets inside or outside one another has no semantic significance. The nested structure, however, is convenient because it corresponds to the natural XML Schema instance structure. The initial set of generated object and relationship sets is straightforward. Each element or attribute generates exactly one object set, and each element that is nested inside another element generates a relationship set connecting the two. Each attribute associated with an element also generates a corresponding object set and a relationship set connecting it to the object set generated by the element. Participation constraints for attribute-generated relationship sets are unrestricted on the attribute side and are either 1 or 0..1 on the element side. Participation constraints for relationship sets generated by element nesting require a bit more work. If the element is in a sequence or a choice, there may be specific minimum/maximum occurrence constraints we can use directly. For example, according to the constraints on Line 60 in Figure 2 a CustomerDetails element may contain a list of 0 or more Order elements. However, an Order element must be nested inside a CustomerDetails element. Thus, for the relationship set connecting CustomerDetails and Order, we place a participation constraint of 0 or more on the CustomerDetails side, and 1 on the Order side. In order to make the generated C-XML model instance less redundant, we look for certain patterns and rewrite the generated model instance when appropriate. For example, since ItemNr has a key constraint, we infer that it is one-to-one with Item. Further, the keyref constraints on ItemNr for PreviousItem and OrderItem indicate that rather than create two additional ItemNr object sets, we can instead relate PreviousItem and OrderItem to the ItemNr nested in Item.
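The following Python sketch (an assumption-laden illustration, not the paper's algorithm) shows how occurrence constraints of a nested element could be mapped to the participation constraint placed on the parent side of the generated C-XML relationship set.

def participation_from_occurs(min_occurs=1, max_occurs=1):
    # Map minOccurs/maxOccurs of a nested element to a min..max participation
    # constraint on the parent side of the generated relationship set.
    upper = "*" if max_occurs in ("unbounded", None) else str(max_occurs)
    return f"{min_occurs}..{upper}"

# CustomerDetails may contain 0 or more Order elements (Line 60 of Figure 2):
print(participation_from_occurs(0, "unbounded"))  # 0..*  on the CustomerDetails side
# An Order element must be nested inside exactly one CustomerDetails element:
print(participation_from_occurs(1, 1))            # 1..1  on the Order side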


Fig. 3. C-XML Model Instance Translated from Figure 2.

Another optimization is the treatment of substitution groups. In our example, since RegularCustomer and PreferredCustomer are substitutable for Customer, we construct a generalization/specialization for the three object sets and factor out the common substructure of the specializations into the generalization. Thus, CustomerDetails exists in a one-to-one relationship with Customer. Another complication in XML Schema is the presence of anonymous types. For example, the complex type in Line 5 of Figure 2 is a choice of 0 or more Customer or Item elements. We need a generalization/specialization to represent this, and since C-XML requires names for object sets, we simply concatenate all the top-level names to form the generalization name CustomerItem. There are striking differences between the C-XML model instances of Figures 1 and 3. The translation to XML Schema introduced new elements Document, CustomerDetails, OrderItem, and ItemMR in order to represent a top-level root node, generalization/specializations, and decomposed relationship sets. If we knew that a particular XML Schema instance was generated from


an original C-XML model instance, we could perform additional optimizations. For example, if we knew CustomerDetails was fabricated by the translation to XML Schema, we could observe that in the reverse translation to C-XML it is superfluous because it is one-to-one with Customer. Similarly, we could recognize that Document is a fabricated top-level element and omit it from the reverse translation; this would also eliminate the need for CustomerItem and its generalization/specialization. Finally, we could recognize that relationship sets have been decomposed, and in the reverse translation reconstitute them. The original C-XML to XML Schema translation could easily place annotation objects in the generated XML Schema instance marking elements for this sort of optimization.

3.3

Information and Constraint Preservation

To formalize information and constraint preservation for schema translations, we use first-order predicate calculus. We represent any schema specification in predicate calculus by generating a predicate for each tuple container and a closed formula for each constraint [7]. Using the closed-world assumption, we can then populate the predicates to form an interpretation. If all the constraints hold over the populated predicates, the interpretation is valid. For any schema specification of type A there is a corresponding valid interpretation. We can guarantee that a translation T translates a schema specification into a constraint-equivalent schema specification by checking whether the constraints of the generated predicate calculus for the schema specification of type B imply the constraints of the generated predicate calculus for the schema specification of type A. A translation T that translates a schema specification of type A into a schema specification of type B also induces a translation from an interpretation for a schema of type A to an interpretation for a schema of type B. We can guarantee that a T-induced translation translates any valid interpretation into an information-equivalent valid interpretation by translating both of the corresponding valid interpretations to predicate calculus interpretations and checking for information equivalence.

Definition 1. A translation T from one schema specification to another preserves information if there exists a procedure P that, for any valid interpretation of the source specification, computes that interpretation from the interpretation of the target specification induced by T.

Definition 2. A translation T from one schema specification to another preserves constraints if the constraints of the target specification imply the constraints of the source specification.

Lemma 1. For any valid interpretation of a populated C-XML model instance, there exists a translation that correctly represents it as a valid interpretation in predicate calculus.1

1 Due to space constraints, we have omitted all proofs in this paper.

Lemma 2. For any XML document that conforms to an XML Schema instance, there exists a translation that correctly represents the document as a valid interpretation in predicate calculus.

Theorem 1. Let T be the translation described in Section 3.1 that translates a C-XML model instance to an XML Schema instance. T preserves information and constraints.

Theorem 2. Let T be the translation described in Section 3.2 that translates an XML Schema instance to a C-XML model instance. T preserves information and constraints.

4

C-XML Views

This section describes three types of views – simple views that help us scale up to large and complex XML schemas, query-generated views over a single XML schema, and query-generated views over heterogeneous XML schemas.

4.1

High-Level Abstractions in C-XML

We create simple views in two ways. Our first way is to nest and hide C-XML components inside one another [7]. Figure 3 shows how we can nest object sets inside one another. We can pull any object set inside any other connected object set, and we can pull any object set inside any connected relationship set so long as we leave at least two object sets outside (e.g. in Figure 1 we can pull Qty and/or SalePrice inside the diamond). Whether an object set appears on the inside or outside has no effect on the meaning. Once we have object sets on the inside, we can implode the object set or relationship set and thus remove the inner object sets from the view. We can, for example, implode Customer, Item, and PreferredCustomer in Figure 3, presenting a much simpler diagram showing only five object sets and two generalization/specialization components nested in Document. To denote an imploded object or relationship set, we shade the object set or the relationship-set diamond. Later, we can explode object or relationship sets and view all details. Since we allow arbitrary nesting, it is possible that relationship-set lines may cross object- or relationship-set boundaries. In this case, when we implode, we connect the line to the imploded object or relationship set and make the line dashed to indicate that the connection is to an interior object set. Our second way to create simple views is to discard C-XML components that are not of interest. We can discard any relationship set, and we can discard all but any two connections of an n-ary relationship set. We can also discard any object set, but then must discard (1) any connecting binary relationship sets, (2) any connections to n-ary relationship sets, and (3) any specializations and relationship sets or relationship-set connections to these specializations. Figure 4 shows an example of a high-level abstraction of Figure 1. In


Fig. 4. High-Level View of Customer/Order C-XML Model Instance.

Figure 4 we have discarded Price and its associated binary relationship set, the relationship set for PreviousItem, and the connections to RequestDateTime and Qty in the relationship set involving Manufacturer. We have also hidden OrderID, OrderDate, and all customer information except CustomerName inside Order, and we have hidden SalePrice and Qty inside the Order-Item relationship set. Note that both the Order object set and the Order-Item relationship set are shaded, indicating the inclusion of C-XML components; that neither the Item object set nor the Item-Manufacturer relationship set is shaded, indicating that the original connecting information has been discarded rather than hidden within; and that the line between CustomerName and Order is dashed, indicating that CustomerName connects, not to Order directly, but rather to an object set inside Order.

Theorem 3. Simple, high-level views constructed by properly discarding C-XML components are valid C-XML model instances.

Corollary 1. Any simple, high-level view can be represented by an XML Schema.
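A minimal Python sketch of the discard operation just described (our own illustration; the dictionary-based encoding of a C-XML view is an assumption): discarding an object set also removes its binary relationship sets, its connections to n-ary relationship sets, and its specializations.

def discard_object_set(view, obj):
    # Discard an object set together with (1) its binary relationship sets,
    # (2) its connections to n-ary relationship sets, and (3) its specializations.
    view["object_sets"].discard(obj)
    view["relationship_sets"] = [
        rel for rel in view["relationship_sets"]
        if not (len(rel) == 2 and obj in rel)                       # (1)
    ]
    view["relationship_sets"] = [
        tuple(o for o in rel if o != obj) if len(rel) > 2 else rel  # (2)
        for rel in view["relationship_sets"]
    ]
    for spec in view["isa"].pop(obj, []):                           # (3)
        discard_object_set(view, spec)

view = {
    "object_sets": {"Order", "Item", "Price", "Manufacturer", "RequestDateTime", "Qty"},
    "relationship_sets": [("Item", "Price"), ("Item", "Manufacturer", "RequestDateTime", "Qty")],
    "isa": {},
}
discard_object_set(view, "Price")   # as in Figure 4: Price and its binary relationship set disappear
print(view["relationship_sets"])    # [('Item', 'Manufacturer', 'RequestDateTime', 'Qty')]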

4.2

C-XML XQuery Views

We now consider the use of C-XML views to generate XQuery views. As other researchers have pointed out [2, 5], XQuery can be hard for users to understand and manipulate. One reason XQuery can be cumbersome is that it must follow the particular hierarchical structure of an underlying XML schema, rather than the simpler, logical structure of an underlying conceptual model. Further, different XML sources might specify conflicting hierarchical representations of the same conceptual relationship [2]. Thus, it is highly desirable to be able to construct XQuery views by generating them from a high-level conceptual-model-based description. [5] describes an algorithm for generating XQuery views from ORA-SS descriptions. [2] also describes how to specify XQuery views by writing conceptual XPath expressions over a conceptual schema and then automatically generating the corresponding XQuery specifications. In a similar fashion, we can


Fig. 5. C-XQuery View of Customers Nested within Items Ordered.

generate XQuery views directly from high-level C-XML views. In some situations a graphical query language would be an excellent choice for creating C-XML views [9], but in keeping with the spirit of C-XML we define an XQuery-like textual language called C-XQuery. Figure 5 shows a high-level view written in C-XQuery over the model instance of Figure 1. We introduce a view definition with the phrase define view, and specify the contents of the view with FLWOR (for, let, where, order by, return) expressions [14]. The first for $item in Item phrase creates an iterator over objects in the Item object set. Since there is no top-level where clause, we iterate over all the items. Also, since C-XML model instances do not have “root nodes”, the idea of context is different. In this case, Item defines the Item object set as the context of the path expression. For each such item, we return an element structure populated according to the nested expressions. C-XQuery is much like ordinary XQuery, with the main distinguishing factor that our path expressions are conceptual, and so, for example, they are not concerned with the distinction between attributes and elements. Note particularly that for the data fields, such as ItemNr, CustomerName, and OrderDate, we do not care whether the generated XML treats them as attributes or elements. A more subtle characteristic of our conceptual path expressions is that since they operate over a flat C-XML structure, we can traverse the conceptual-model graph more flexibly, without regard for hierarchical structure. Thus, we generalize the notion of a path expression so that the expression A//B designates the path from A to B regardless of hierarchy or the number of intervening steps in the path [9]. This can lead to ambiguity in the presence of cycles or multiple paths between nodes, but we can automatically detect ambiguity and require the user to disambiguate the expression (say, by designating an intermediate node that fixes a unique path).
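The following Python sketch (our own illustration, with a hypothetical graph fragment) shows how a conceptual path expression A//B could be resolved over the conceptual-model graph, detecting ambiguity when more than one simple path exists.

def conceptual_paths(graph, start, goal):
    # Enumerate simple paths from start to goal in an (undirected) conceptual-model
    # graph: exactly one path -> unambiguous A//B; several -> ask the user to
    # designate an intermediate node that fixes a unique path.
    paths, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            paths.append(path)
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:            # keep paths simple (no cycles)
                stack.append((nxt, path + [nxt]))
    return paths

# Hypothetical fragment of Figure 1: Customer -- Order -- Item -- Manufacturer
graph = {
    "Customer": ["Order"],
    "Order": ["Customer", "Item"],
    "Item": ["Order", "Manufacturer"],
    "Manufacturer": ["Item"],
}
paths = conceptual_paths(graph, "Customer", "Manufacturer")
if len(paths) != 1:
    print("ambiguous path expression; designate an intermediate node")
else:
    print(paths[0])   # ['Customer', 'Order', 'Item', 'Manufacturer']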


Fig. 6. C-XQuery over the View of Customers Nested within Items Ordered.

Given a view definition, we can write queries against the view. For the view in Figure 5, for example, the query in Figure 6 finds customers who have purchased more than $300 worth of nitrogen fertilizer within the last 90 days. To execute the query, we unfold the view according to the view definition and minimize the resulting XQuery. See [13] for a discussion of the underlying principles. Figure 6 illustrates the use of views over views. Indeed, applications can use views as first-class data sources, just like ordinary sources, and we can write queries against the conceptual model and views over that model. In any case, we translate the conceptual queries to XQuery specifications over the XML Schema instance generated for the C-XML conceptual model.

Theorem 4. A C-XQuery view Q over a C-XML model instance C can be translated to an XQuery query over a corresponding XML Schema instance.

Observe that by the definition of XQuery [14], any valid XQuery instance generates an underlying XML Schema instance. By Theorem 4, we thus know that for any C-XQuery view we retain a correspondence to XML Schema. In particular, this means we can compose views of views to an arbitrary depth and still retain a correspondence to XML Schema.

4.3

XQuery Integration Mappings

To motivate the use of views in enterprise conceptual modeling, suppose through mergers and acquisitions we acquire the catalog inventory of another company. Figure 7 shows the C-XML for this assumed catalog. We can rapidly integrate this catalog into the full inventory of the parent company by creating a mapping from the acquired company’s catalog to the parent company’s catalog. Figure 8 shows such a mapping. In order to integrate the source (Figure 7) with the target (Figure 1), the mapping needs to generate target names in the source. In


Fig. 7. C-XML Model Instance for the Catalog of an Acquired Company.

Fig. 8. C-XQuery Mapping for Catalog Integration.

this example, CatalogItem, CatalogNr, and ShortName correspond respectively to Item, ItemNr, and Description. We must compute Price in the target from the MSRP and MarkupPercent values in the source, as Figure 8 shows. We assume the function CatalogNr-to-ItemNr is either a hand-coded lookup table, or a manually-programmed function to translate source catalog numbers to item numbers in the target. The underlying structure of this mapping query corresponds directly to the relevant section of the C-XML model instance in Figure 1, so integration is now immediate. The mapping in Figure 8 creates a target-compatible C-XQuery view over the acquired company’s catalog in Figure 7. When we now query the parent company’s items, we also query the acquired company’s catalog. Thus, the previous examples are immediately applicable. For example, we can find those customers who have ordered more than $300 worth of nitrogen fertilizer from either the inventory of the parent company or the inventory of the acquired company by simply issuing the query in Figure 6. With the acquired company’s catalog integrated, when the query in Figure 6 iterates over customer orders, it iterates over data instances for both Item in Figure 1 and CatalogItem in Figure 8. (Now, if the potential terrorist has purchased, say $200 worth of nitrogen fertilizer from the original company and $150 worth from the acquired company, the potential terrorist will appear on the list, whereas the potential terrorist would have appeared on neither list before.) We could also write a mapping query going in the opposite direction, with Figure 1 as the source and Figure 7 as the target. Such bidirectional integration is useful in circumstances where we need to shift between perspectives, as is often the case in enterprise application development. This is especially true because all enterprise data is rarely fully integrated.
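To make the mapping concrete, here is a small Python sketch of the value-level transformations described above. The markup formula and the lookup-table entries are assumptions for illustration; the paper only states that Price is computed from MSRP and MarkupPercent and that CatalogNr-to-ItemNr is a lookup table or hand-written function.

# Hypothetical lookup table standing in for the CatalogNr-to-ItemNr function.
CATALOGNR_TO_ITEMNR = {"C-1001": "NFR-0000042"}

def catalognr_to_itemnr(catalog_nr):
    return CATALOGNR_TO_ITEMNR[catalog_nr]

def target_price(msrp, markup_percent):
    # The exact formula is not given in the paper; a simple percentage markup
    # over MSRP is assumed here for illustration.
    return round(msrp * (1 + markup_percent / 100.0), 2)

def map_catalog_item(catalog_item):
    # Map one CatalogItem of the acquired catalog (Figure 7) to the target
    # Item structure of Figure 1, mirroring the C-XQuery mapping of Figure 8.
    return {
        "ItemNr": catalognr_to_itemnr(catalog_item["CatalogNr"]),
        "Description": catalog_item["ShortName"],
        "Price": target_price(catalog_item["MSRP"], catalog_item["MarkupPercent"]),
    }

print(map_catalog_item({"CatalogNr": "C-1001", "ShortName": "Nitrogen fertilizer, 50 lb",
                        "MSRP": 20.00, "MarkupPercent": 15}))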


In general it would be nice to have a mostly automated tool for generating integration mappings. In order to support such a tool, we require two-way mappings between both schemas and data elements. Sometimes we can use automated element matchers [1, 12] to help us with the mapping. However, in other cases the mappings are intricate and require programmer intervention (e.g. calculating Price from MSRP plus a MarkupPercent or converting CatalogNr to ItemNr). In any case, we can write C-XQuery views describing each such mapping, with or without the aid of tools (e.g. [11]), and we can compose these views to provide larger C-XQuery schema mappings. Of course there are many integration details we do not address here, such as handling dirty data, but the approach of integrating by composing C-XQuery views is sound.

Theorem 5. A C-XQuery view Q over a C-XML model instance C of an external, federated XML Schema can be translated to an XQuery query over a corresponding XML Schema instance.

5

Concluding Remarks

We have offered Conceptual XML (C-XML) as an answer to the challenge of modern enterprise modeling. C-XML is equivalent in expressive power to XML Schema (Theorems 1 and 2). In contrast to XML Schema, however, C-XML provides for high-level conceptualization of an enterprise. C-XML allows users to view schemas at any level of abstraction and at various levels of abstraction in the same specification (Theorem 3), which goes a long way toward mitigating the complexity of large data sets and complex interrelationships. Along with C-XML, we have provided C-XQuery, a conceptualization of XQuery that relieves programmers from concerns about the often arbitrary choice of nesting and the arbitrary choice of whether to represent values with attributes or with elements. Using C-XQuery, we have shown how to define views and automatically translate them to XQuery (Theorem 4). We have also shown how to accommodate heterogeneity by defining mapping views over federated data repositories and automatically translating them to XQuery (Theorem 5). Implementing C-XML is a huge undertaking. Fortunately, we have a foundation on which to build. We have already implemented tools relevant to C-XML, including graphical diagram editors, model checkers, textual model compilers, a model execution engine, and several data integration tools. We are actively continuing development of an Integrated Development Environment (IDE) for modeling-related activities. Our strategy is to plug new tools into this IDE rather than develop stand-alone programs. Our most recent implementation work consists of tools for automatic generation of XML normal form schemes. We are now working on the implementation of the algorithms to translate C-XML to XML Schema, XML Schema to C-XML, and C-XQuery to XQuery.


Acknowledgements This work is supported in part by the National Science Foundation under grant IIS-0083127 and by the Kevin and Debra Rollins Center for eBusiness at Brigham Young University.

References

1. J. Biskup and D. Embley. Extracting information from heterogeneous information sources using ontologically specified target views. Information Systems, 28(3):169–212, 2003.
2. S. Camillo, C. Heuser, and R. dos Santos Mello. Querying heterogeneous XML sources through a conceptual schema. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, pages 186–199, Chicago, Illinois, October 2003.
3. M. Carey. Enterprise information integration – XML to the rescue! In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, page 14, Chicago, Illinois, October 2003.
4. Y. Chen, T. Ling, and M. Lee. Designing valid XML views. In Proceedings of the 21st International Conference on Conceptual Modeling (ER'02), pages 463–477, Tampere, Finland, October 2002.
5. Y. Chen, T. Ling, and M. Lee. Automatic generation of XQuery view definitions from ORA-SS views. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER2003), Lecture Notes in Computer Science 2813, pages 158–171, Chicago, Illinois, October 2003.
6. R. Conrad, D. Scheffner, and J. Freytag. XML conceptual modeling using UML. In Proceedings of the Nineteenth International Conference on Conceptual Modeling (ER2000), pages 558–571, Salt Lake City, Utah, October 2000.
7. D. Embley, B. Kurtz, and S. Woodfield. Object-oriented Systems Analysis: A Model-Driven Approach. Prentice Hall, Englewood Cliffs, New Jersey, 1992.
8. D. Embley and W. Mok. Developing XML documents with guaranteed 'good' properties. In Proceedings of the 20th International Conference on Conceptual Modeling (ER2001), pages 426–441, Yokohama, Japan, November 2001.
9. S. Liddle, D. Embley, and S. Woodfield. An active, object-oriented, model-equivalent programming language. In M. Papazoglou, S. Spaccapietra, and Z. Tari, editors, Advances in Object-Oriented Data Modeling, pages 333–361. MIT Press, Cambridge, Massachusetts, 2000.
10. M. Mani, D. Lee, and R. Muntz. Semantic data modeling using XML schemas. In Proceedings of the 20th International Conference on Conceptual Modeling (ER2001), pages 149–163, Yokohama, Japan, November 2001.
11. R. Miller, L. Haas, and M. Hernandez. Schema mapping as query discovery. In Proceedings of the 26th International Conference on Very Large Databases (VLDB'00), pages 77–88, Cairo, Egypt, September 2000.
12. E. Rahm and P. Bernstein. A survey of approaches to automatic schema matching. The VLDB Journal, 10:334–350, 2001.
13. I. Tatarinov and A. Halevy. Efficient query reformulation in peer data management systems. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, 2004 (to appear).
14. XQuery 1.0: An XML Query Language, November 2003. URL: http://www.w3.org/TR/xquery/.

Graphical Reasoning for Sets of Functional Dependencies

János Demetrovics1, András Molnár2, and Bernhard Thalheim3

1 MTA SZTAKI, Computer and Automation Institute of the Hungarian Academy of Sciences, Kende u. 13-17, H-1111 Budapest, Hungary
2 Department of Information Systems, Faculty of Informatics, Eötvös Loránd University Budapest, Pázmány Péter stny. 1/C, H-1117 Budapest, Hungary
3 Computer Science and Applied Mathematics Institute, University Kiel, Olshausenstrasse 40, 24098 Kiel, Germany

Abstract. Reasoning on constraint sets is a difficult task. Classical database design is based on a step-wise extension of the constraint set and on a consideration of constraint sets through generation by tools. Since the database developer must master semantics acquisition, tools and approaches are still sought that support reasoning on sets of constraints. We propose novel approaches for presentation of sets of functional dependencies based on specific graphs. These approaches may be used for the elicitation of the full knowledge on validity of functional dependencies in relational schemata.

1

Design Problems During Database Semantics Specification and Their Solution

Specification of database structuring is based on three interleaved and dependent parts [9]:
Syntactics: Inductive specification of structures uses a set of base types, a collection of constructors and a theory of construction limiting the application of constructors by rules or by formulas in deontic logics. In most cases, the theory may be dismissed.
Semantics: Specification of admissible databases on the basis of static integrity constraints describes those database states which are considered to be legal.
Pragmatics: Description of context and intension is based either on explicit reference to the enterprise model, to enterprise tasks, to enterprise policy, and environments or on intensional logics used for relating the interpretation and meaning to users depending on time, location, and common sense.


Specification of syntactics is based on the database modeling language. Specification of semantics requires a logical language for specification of classes of constraints. Typical constraints are dependencies such as functional, multivalued, and inclusion dependencies, or domain constraints. Specification of pragmatics is often not explicit. The specification of semantics is often rather difficult due to its complexity. For this reason, it must be supported by a number of solutions supporting acquisition of and reasoning on constraints.

Prerequisites of Database Design Approaches. Results obtained during database structuring are evaluated on two main criteria: completeness [8] and unambiguity of specification. Completeness requires that all constraints that must be specified are found. Unambiguity is necessary in order to provide a reasoning system. Both criteria have found their theoretical and pragmatical solution for most of the known classes of constraints. Completeness is, however, restricted by the human ability to survey large constraint sets and to understand all possible interactions among constraints.

Theoretical Approaches to Problem Solution: A number of normalization and restructuring algorithms have been developed for functional dependencies. We do not yet know simple representation systems for surveying constraint sets and for detecting missing constraints beyond functional dependencies.

Pragmatical Approaches to Problem Solution: A step-wise constraint acquisition procedure has been developed in [7,10,12]. The approach is based on the separation of constraints into: the set of valid functional dependencies, i.e. all dependencies that are known to be valid and all those that can be implied from the set of valid and excluded functional dependencies; and the set of excluded functional dependencies, i.e. all dependencies that are known to be invalid and all those that can be implied to be invalid from the set of valid and excluded functional dependencies.

This approach leads to the following simple elicitation algorithm:
1. Basic step: Specify the obvious constraints.
2. Recursion step: Repeat until the sets of valid and excluded dependencies do not change: Find a functional dependency that is neither in the valid set nor in the excluded set. If it is valid, add it to the set of valid dependencies. If it is invalid, add it to the set of excluded dependencies. Generate the logical closures of both sets.

This algorithm can be refined in various ways. Elicitation algorithms known so far are all variations of this simple elicitation algorithm. However, neither the theoretical solutions nor the pragmatical approach provides a solution to problem 1: Define a pragmatical approach that allows simple representation of and reasoning on database constraints. This problem becomes more severe in association with the following problems.
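A minimal Python sketch of the elicitation loop described above (our own illustration): the validity test stands in for the designer's judgment, and for brevity only the closure of the valid set is consulted before asking about a candidate.

from itertools import combinations

def attribute_closure(lhs, fds):
    # Attribute-set closure of lhs under the singleton FDs in fds.
    closure, changed = set(lhs), True
    while changed:
        changed = False
        for x, a in fds:
            if set(x) <= closure and a not in closure:
                closure.add(a); changed = True
    return closure

def candidate_fds(attrs):
    attrs = sorted(attrs)
    for r in range(1, len(attrs)):
        for lhs in combinations(attrs, r):
            for a in attrs:
                if a not in lhs:
                    yield (lhs, a)

def elicit(attrs, is_valid):
    # Repeatedly pick an undecided FD, ask whether it is valid, and grow the
    # valid / excluded sets; the closure of the excluded set is omitted here.
    valid, excluded = set(), set()
    for fd in candidate_fds(attrs):
        lhs, a = fd
        if a in attribute_closure(lhs, valid) or fd in excluded:
            continue                       # already implied or already excluded
        (valid if is_valid(fd) else excluded).add(fd)
    return valid, excluded

# Toy run: the designer confirms exactly A -> B, A -> C, and B -> C.
truth = {(("A",), "B"), (("A",), "C"), (("B",), "C")}
valid, excluded = elicit({"A", "B", "C"}, lambda fd: fd in truth)
print(sorted(valid))   # [(('A',), 'B'), (('A',), 'C'), (('B',), 'C')]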


Complexity of Semantics. Typical algorithms such as normalization algorithms can only generate a correct result if the specification is complete. Such completeness is not harmful as long as constraint sets are small. The number of constraints may, however, be exponential in the number of attributes [3]. Therefore, specification of the complete set of functional dependencies may be an infeasible task. This problem is closely related to another well-known combinatorial problem presented by János Demetrovics during MFDBS'87 [11] that is still only partially solved:

Problem 2. What is the size of sets of independent functional dependencies for an n-ary relation schema?

Inter-dependence Within a Constraint Set. Constraints such as functional dependencies are not independent from each other. Typical axiomatizations use rules such as the union, transitivity and path rules. Developers do not reason this way. Therefore, the impact of adding, deleting or modifying a constraint within a constraint set is not easy to capture, and we need a system for reasoning on constraint sets.

Theoretical Approaches to Problem Solution: [14] and [1] propose to use a graph-based representation of sets of functional dependencies. This solution provides a simple survey as long as constraints are simple, i.e., use singleton sets for the left sides. [13] proposes to use a schema architecture by developing first elementary schema components and constructing the schema by application of composition operations which use these components. [4] proposes to construct a collection of interrelated lattices of functional dependencies. Each lattice represents a component of [13]. The set of functional dependencies is then constructed through folding of the lattices.

Pragmatical Approaches to Problem Solution: [6] proposes to use a fact-based approach instead of modeling of attributes. Elementary facts are 'small' objects that cannot be decomposed without losing meaning. We thus must solve problem 3: Develop a reasoning system that supports easy maintenance and development of constraint sets and highlights logical inter-dependence among constraints.

Instability of Normalization. Normalization is based on the completeness of constraint sets. This is impractical. Although database design tools can support completeness, incompleteness of specification should be considered the normal situation. Therefore, normalization approaches should be robust with regard to incompleteness.

Problem 4. [12] Find a normalization theory which is robust for incomplete constraint sets or robust according to a class of changes in constraint sets.


Problems That Currently Defy Solution. Dependency theory consists of work on about 95 different classes of dependencies, with very few classes having been treated together. Moreover, properties of sets of functional dependencies still remain unknown. In most practical cases, several negative results obtained in dependency theory do not restrict the common utilization of several classes. The reason for this is that the constraint sets used in practice do not have these properties. Therefore, we need other classification principles for describing 'real life' constraint sets.

Problem 5. [12] Classify 'real life' constraint sets which can be easily maintained and specified.

This problem is related to one of the oldest problems in database research, expressed by Joachim Biskup in the open problems session [11] of MFDBS'87:

Problem 6. Develop a specification method that supports consideration of sets of functional dependencies and derivation of properties of those sets.

Outline of the Paper1 and the Kernel Problem Behind the Open Problems. The six problems above can be solved on one common basis: Find a simple and sophisticated representation of sets of constraints that supports reasoning on constraints. This problem is infeasible in general. Therefore, we first provide a mechanism to reason on sets of functional dependencies defined on small sets of attributes. Geometrical figures such as polygons or tetrahedra nicely support reasoning on constraints. Next we demonstrate the representation for attribute sets consisting of three or four attributes. Finally we introduce the implication system for graphical representations and show how these representations lead to a very simple and sophisticated treatment of sets of functional dependencies.

2

Sets of Functional Dependencies for Small Relation Schemata

2.1

Universes of Functional Constraints

Besides functional dependencies (FDs), we use excluded functional constraints (also called negated functional dependencies): such a constraint states that the corresponding functional dependency is not valid. Treating sets of functional constraints becomes simpler if we avoid dealing with obviously redundant constraints. In our notation, a trivial constraint (a functional dependency or an excluded functional constraint) is a constraint with

1 Due to the lack of space, this paper does not contain proofs or the representation of all possible sets of functional dependencies. All technical details as well as some other means of representation can be read in a technical report available under [5].


at least one attribute of its left-hand side and right-hand side in common, or has the empty set as its right-hand side. Furthermore, a canonical (singleton) functional dependency or a singleton excluded functional constraint has exactly one attribute on its right-hand side. We introduce notations for the universes of functional dependencies, non-trivial functional dependencies and non-trivial canonical (singleton) functional dependencies, respectively, over a fixed underlying domain of attribute symbols. Similarly, we introduce notations for the universes of excluded functional constraints, non-trivial excluded constraints and non-trivial singleton excluded functional constraints (negated non-trivial, canonical dependencies) over the same set of attribute symbols. The traditional universe of functional constraints includes both functional dependencies and excluded constraints, while our graphical representations deal with sets of constraints over the restricted universe only. In other words, the graphical representations we present in this paper deal with non-trivial canonical functional dependencies and non-trivial singleton excluded functional constraints only. It will be shown that we do not lose relevant deductive power by applying this restriction to the universe of functional constraints. In most of the cases, we focus on closed sets of functional dependencies. A finite set of constraints is closed (over the restricted universe) iff it equals its closure, i.e. the set of all constraints of that universe it implies.
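The following Python sketch (our own encoding, not the paper's notation) illustrates these distinctions: triviality, canonical (singleton) form, and the decomposition of a dependency into singleton dependencies.

def is_trivial(lhs, rhs):
    # Trivial: left- and right-hand side share an attribute, or rhs is empty.
    return bool(lhs & rhs) or not rhs

def is_canonical(lhs, rhs):
    # Canonical (singleton): exactly one attribute on the right-hand side.
    return len(rhs) == 1

def to_singletons(lhs, rhs):
    # Decompose X -> A1...Ak into the singleton FDs X -> Ai, dropping trivial ones.
    return {(lhs, frozenset({a})) for a in rhs if a not in lhs}

lhs, rhs = frozenset({"A"}), frozenset({"A", "B", "C"})
print(is_trivial(lhs, rhs))                                        # True (A occurs on both sides)
print((frozenset({"A"}), frozenset({"B"})) in to_singletons(lhs, rhs))  # True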

2.2

The Notion of Dimension

For the classification of functional constraints and the attributes they refer to, we first introduce the notion of dimension. The dimension of a constraint is simply the size of its left-hand side, i.e. the number of attributes on its left-hand side; the dimension of an excluded functional constraint is defined similarly. For a single attribute A, given a set of functional dependencies, the dimension of A is denoted by [A] and is defined in terms of the functional dependencies of the set that have A as their right-hand side.

This definition is extended to the case when no functional dependency with A on its right-hand side exists in the set. The dimensions of attributes classify the sets of functional dependencies.
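A plausible formalization of the attribute dimension, under the assumption (ours, for illustration) that it is the minimum dimension among the dependencies of Σ that determine A, is:

\[ [A]_{\Sigma} \;=\; \min\{\, |X| \;:\; (X \to A) \in \Sigma \,\}, \]

left undefined when no functional dependency with right-hand side A occurs in Σ.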

2.3

Summary of the Number of Closed Sets

Let n be the number of attributes of the considered relation schema. Denote the set of closed sets of (singleton, non-trivial) functional dependencies for this n (with constant attributes disallowed). Defining an equivalence relation on these sets that classifies them into different types or cases (for two equivalent sets there exists a permutation of attributes transforming one set to


another), we obtain the set of different classes. We are focusing on these different classes and the size of this set. Another possibility is to allow attributes to be stated as constants. Performing this extension, we get a larger set of closed sets, and the different cases (types) of functional dependency sets that take these zero-dimensional constraints into account form a correspondingly larger set of classes. With these notations, Table 1 shows the number of closed sets of functional dependencies for unary, binary, ternary, quaternary and quinary relational schemata and demonstrates the combinatorial explosion of the search space.
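A brute-force Python sketch of this counting (our own illustration; the paper's counts were produced with a PROLOG program) checks, for tiny n, which subsets of the universe of non-trivial singleton FDs equal their own closure.

from itertools import chain, combinations

def closure_of_fdset(attrs, fds):
    # Closure of a set of singleton, non-trivial FDs within the same universe.
    out = set()
    for r in range(1, len(attrs)):
        for lhs in combinations(sorted(attrs), r):
            closed, changed = set(lhs), True
            while changed:
                changed = False
                for x, a in fds:
                    if set(x) <= closed and a not in closed:
                        closed.add(a); changed = True
            out |= {(lhs, a) for a in closed - set(lhs)}
    return out

def count_closed_sets(attrs):
    # Enumerate every subset of the universe and keep those equal to their closure.
    universe = sorted((lhs, a)
                      for r in range(1, len(attrs))
                      for lhs in combinations(sorted(attrs), r)
                      for a in attrs if a not in lhs)
    count = 0
    for subset in chain.from_iterable(combinations(universe, k) for k in range(len(universe) + 1)):
        fds = set(subset)
        if closure_of_fdset(attrs, fds) == fds:
            count += 1
    return count

print(count_closed_sets(["A", "B"]))   # binary case: small enough to enumerate directly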

3

The Graphical Representation of Sets of Functional Dependencies

There have been several proposals for graphical representation of sets of functional dependencies. Well-known books such as [1] and [14] have used a graph-theoretic notation. Nevertheless, these graphical notations have not made their way into practice and education. The main reason for this failure is the complexity of representation. Graphical representations are simple as long as the sets of functional dependencies are not too complex. [2] has proposed a representation for the ternary case based on either assigning an N-notation if nothing is known or assigning a 1-notation to an edge from X to Y at the Y end if the functional dependency from X to Y is valid. This representation is simple enough but already redundant in the case of ternary relationship types. Moreover, it is not generalizable to cases of n-ary relationship types with more than three components. We use a simpler notation which reflects the validity of functional dependencies in a simpler and more understandable fashion. We distinguish two kinds of functional dependencies for the ternary case: One-dimensional (singleton left sides): Functional dependencies with a singleton left-hand side can be decomposed into canonical functional dependencies with singleton right-hand sides. They are represented by endpoints of binary edges (1D shapes) in the triangular representation. Two-dimensional (two-element left sides): Functional dependencies with two-element left-hand sides cannot be decomposed. They are represented in the triangle (2D shape) on the node relating their right side to the corner.


Fig. 1. Triangular representation of sets of functional dependencies for the ternary case

We may also represent candidates for excluded functional dependencies: by crossed circles for the case that we know that the corresponding functional dependency is not valid in applications, or by small circles for the case that we do not know whether the functional dependency holds or does not hold. We now use the following notations in the figures:
Basic functional dependencies are denoted by filled circles.
Implied functional dependencies are denoted by circles.
Negated basic functional dependencies are either denoted by dots or by crossed filled circles.
Implied negated functional dependencies are either denoted by dots or by crossed circles.

Fig. 2. Examples of the triangular representation

Figure 2 shows some examples of the triangular representation. A basic functional dependency and the functional dependency it implies are shown in the left part. A set of basic functional dependencies


and their implied functional dependencies are pictured in the middle triangle. A negated functional dependency and the implied negated functional dependencies are given in the right picture. As mentioned above, the triangular representation can be generalized to a higher number of attributes. Generalization can be performed in two directions: representation in a higher-dimensional space (3D in the case of 4 attributes, resulting in the tetrahedral representation) or construction of a planar (2D, quadratic) representation. We use the same approach as before in the case of three attributes. An example is displayed in Figure 3 (implication is explained later). In this paper we concentrate on the 2D representation.

Fig. 3. The tetrahedral and quadratic representations of the set generated by the given functional dependencies

This representation can be generalized to the case of 5 attributes.

4

Implication Systems for the Graphical Representations

Excluded functional constraints and functional dependencies are axiomatizable by the following formal system [12].


The universe of the extended Armstrong implication system is the traditional universe of functional constraints (see Section 2.1), while our graphical and spreadsheet representations deal with sets2 of constraints over the restricted universe of non-trivial singleton constraints. However, the axiom and rules of the extended Armstrong implication system do not correspond to this restriction. It will be shown that an equivalent implication system can be constructed if these restrictions are applied to the universe of constraints. We develop a new implication system for graphical reasoning:

The rules presented here can directly be applied for deducing consequences of a set of constraints given in terms of the graphical or spreadsheet representation. We use the following two implication systems: the ST implication system, with rules (S) and (T) and no axioms, and the PQRST implication system, with all the presented rules and a symbolic axiom that is used for indicating contradiction. These systems are sound and complete for deducing non-trivial, singleton constraints.

Theorem 1. The ST system is sound and complete for deriving non-trivial singleton functional dependencies from finite sets of such dependencies.

Theorem 2. The PQRST system without the contradiction axiom is sound, and it is complete under the restriction that the given finite constraint set is not contradictory. Moreover, the contradiction axiom can be derived iff the given constraint set is contradictory.

The implication systems introduced above have the advantage that there exists a specific order of rule applications which provides a complete algorithmic method for obtaining all the implied functional dependencies and excluded functional constraints starting from an initial set, allowing one to determine the possible types of relationships the initial set of dependencies defines.

Theorem 3. 1. All non-trivial singleton functional dependencies implied by a finite set of such dependencies can be deduced starting from that set by using the rules (S) and (T) in such a way that no application of (T) precedes any application of (S).

2 For example, a functional dependency with a two-attribute right-hand side is represented by the two corresponding singleton dependencies. Excluded functional constraints with more than one attribute on their right-hand sides cannot be eliminated this way. However, omitting these can also be achieved (see [5]).
3 Proofs of the theorems are given in [5].


2. All non-trivial singleton constraints (functional dependencies and excluded functional constraints) implied by a finite set of such constraints can be deduced starting from that set by using the rules (S), (T), (R), (P) and (Q) in such a way that no application of (T) precedes any application of (S), no application of (R) precedes any application of (T), and no application of (P) or (Q) precedes any application of (R). The order of (P) and (Q) is arbitrary. Furthermore, (R) needs to be applied at most once.

5

Graphical Reasoning

Rules of the PQRST implication system support graphical reasoning. We will first discuss the case of three attributes.

Fig. 4. Graphical versions of rules (S), (T) and (P), (Q), (R)

Graphical versions of the rules are shown in Figure 4 for the triangular representation (case Y = {C}). The small black arrows indicate support (necessary context) while the large grey arrows show the implication effects. Rule (S) is a simple extension rule and rule (T) can be called the "rotation rule" or "reduction rule". We may call the left-hand side of a functional dependency its determinant and the right-hand side its determinate. Rule (S) can be used to extend the determinant of a dependency, resulting in another dependency of one dimension higher, while rule (T) is used for rotation, that is, to replace the determinate of a functional dependency by the support of another functional dependency of one dimension higher (the small black arrow at B indicates the support). Another possible way to interpret rule (T) is as a reduction of the determinant of a higher-dimensional dependency by omitting an attribute if a dependency holds among the attributes of the determinant. For excluded functional constraints, rule (Q) acts as the extension rule (it needs the support of a positive constraint, i.e. a functional dependency) and (R) as the rotation rule (it needs a positive support too). These two rules can also be viewed as negations of rule (T). Rule (P) is the reduction rule for excluded functional


constraints, with the opposite effect of rule (Q) (but without the need of support). Rule (Q) is also viewed as the negation of rule (S). These graphical rules can be generalized to higher-dimensional cases, where the number of attributes is more than 3. Figure 5 shows the patterns of rules (S) and (T) for the case of four attributes. We use two or three patterns for a single case since we need a way to survey constraint derivation by (not completely symmetric) 2D diagrams. We differentiate between the case that the rules (S) and (T) use functional dependencies with singleton left-hand sides and the case that the minimum dimension of the functional dependencies is two.

Fig. 5. Patterns of graphical rules (S) and (T) for the quadratic representation

Theorem 3 in Section 4 shows that, for positive dependencies, using (S) first as many times as possible and using (T) as many times as possible afterwards is a complete method for obtaining all non-trivial positive consequences of a given set of constraints. We may call it the ST algorithm4. This can be extended for the case with excluded functional constraints. We now present it as an algorithm for FD derivation based on the graphical representation.

The STRPQ Algorithm for Sets of Both Positive and Negative Constraints. Rules (P), (Q) and (R) can be applied as complements of rules (S) and (T), resulting in the following algorithm, called the STRPQ algorithm (based on part 2 of Theorem 3):

4 With some modifications, this algorithm has been used for generating and counting all sets of functional dependencies (see Section 2.3) with a PROLOG program.


1. Starting with the given initial set of non-trivial, singleton functional dependencies and excluded functional constraints as input,
2. extend the determinants of each dependency using rule (S) as many times as possible, then
3. apply rule (T) until no changes occur,
4. apply rule (R) until no changes occur,
5. reduce and extend the determinants of excluded constraints using rules (P) and (Q) as many times as possible.
6. Output the generated set.
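A compact Python sketch of the STRPQ procedure follows. The precise rule formulations are our reconstruction from the graphical descriptions in Section 5 (extension, rotation/reduction, and their negative counterparts), so the code is an illustration rather than the paper's definitive system; constraints are pairs of a left-hand attribute set and a right-hand attribute, kept in separate positive and negative sets.

def strpq(universe, positive, negative):
    # positive / negative: sets of (frozenset_lhs, rhs_attribute) pairs.
    U = frozenset(universe)
    pos, neg = set(positive), set(negative)

    def saturate(step):
        while step():
            pass

    def step_s():   # (S) extension: X -> A  yields  XB -> A
        new = {(x | {b}, a) for x, a in pos for b in U - x - {a}} - pos
        pos.update(new); return bool(new)

    def step_t():   # (T) reduction/rotation: X -> B and XB -> C  yield  X -> C
        new = {(x, c) for x, b in pos for y, c in pos
               if y == x | {b} and c not in x and c != b} - pos
        pos.update(new); return bool(new)

    def step_r():   # (R): XB -> C and X -/-> C  yield  X -/-> B
        new = {(y - {b}, b) for y, c in pos for b in y
               if (y - {b}, c) in neg} - neg
        neg.update(new); return bool(new)

    def step_p():   # (P) reduction: XB -/-> A  yields  X -/-> A
        new = {(x - {b}, a) for x, a in neg for b in x if len(x) > 1} - neg
        neg.update(new); return bool(new)

    def step_q():   # (Q) extension with support: X -/-> C and X -> B  yield  XB -/-> C
        new = {(x | {b}, c) for x, c in neg for y, b in pos
               if y == x and b != c} - neg
        neg.update(new); return bool(new)

    # Apply the rules in the order prescribed by the algorithm: S, T, R, then P and Q.
    for step in (step_s, step_t, step_r, step_p, step_q):
        saturate(step)
    return pos, neg

frz = frozenset
pos, neg = strpq("ABC",
                 {(frz("A"), "B"), (frz("B"), "C")},   # A -> B, B -> C
                 {(frz("B"), "A")})                    # B -/-> A
print((frz("A"), "C") in pos)    # True: A -> C follows via (S)/(T)
print((frz("BC"), "A") in neg)   # True: BC -/-> A follows via (Q)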

The algorithm just presented can be used for reasoning on sets of functional constraints, especially in terms of the graphical representations. The structure of the generalized triangular representations (2D–triangular, 3D–tetrahedral, etc.) may also be used for designing a data structure representing sets of functional constraints for the algorithms.

6

Applying Graphical Reasoning to Sets of Functional Dependencies

Let us consider a more complex example discussed in [12]. We are given a part of the Berlin airport management database for scheduling flights and pilots at one of its airports. Flights depart at one departure time and fly to one destination. A flight can get different pilots and can be assigned to different gates on each day of a week. In the given application we observe the following functional dependencies for the attributes Flight#, (Chief)Pilot#, Gate#, Day, Hour, Destination:
{ Flight#, Day } → { Pilot#, Gate#, Hour }
{ Flight# } → { Destination, Hour }
{ Day, Hour, Gate# } → { Flight# }
{ Pilot#, Day, Hour } → { Flight# }
As noticed in [12], we can model this database in five very different ways. Figure 6 displays one of the solutions. All types in Figure 6 are in third normal form. Additionally, the following constraints are valid for the solution in Figure 6: flies: { GateSchedule.Time, Pilot.# } → { GateSchedule }. The two schemata additionally have transitive path constraints, e.g.: flies: { GateSchedule.Time.Day, Flight.# } → { GateSchedule.Gate.# }. But the types are still in third normal form since for each functional dependency X → Y defined for the types, either X is a key or Y is a part of a key. The reason for the existence of rather complex constraints is the twofold usage of Hour. For instance, in our solution we find the equality constraint: flies.Flight.Hour = flies.GateSchedule.Time.Hour. We must now know whether the set of functional dependencies is complete. The combinatorial complexity of a brute-force consideration of dependency sets is overwhelming. Let us now apply our theoretical findings to cope with the complexity and to reason on the sets of functional dependencies. We may use the following algorithm:


Fig. 6. An extended ER Schema for the airline database with transitive path constraints

1. Consider attributes which are not used in any left-hand side of a functional dependency and check whether they are really dangling. This is done by using the STRPQ algorithm with each of these attributes and the rest of the attributes. We may strip out dangling attributes without losing reasoning power. In the example we strip out Destination.
2. Combine attributes into groups such that they appear together in left-hand sides of functional dependencies. Consider first the relations among those attribute groups using the STRPQ algorithm. In our example we consider the groups (A) Day, Hour, (B) Flight#, Day, (C) (Chief)Pilot#, and (D) Gate#. The result is shown in Figure 3.
3. Recursively apply the STRPQ algorithm to decompositions of the attribute groups.

The example shows how graphical reasoning can be directly applied to larger sets of attributes which have complex relations among them that can be expressed through functional dependencies.
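As a cross-check of what the listed dependencies imply, the classical attribute-closure computation (not the graphical ST/STRPQ method itself, which works on the triangular representation) can be run on the airline example; the small script below is only an illustration, with attribute names taken from the text.

FDS = [
    ({"Flight#", "Day"}, {"Pilot#", "Gate#", "Hour"}),
    ({"Flight#"}, {"Destination", "Hour"}),
    ({"Day", "Hour", "Gate#"}, {"Flight#"}),
    ({"Pilot#", "Day", "Hour"}, {"Flight#"}),
]

def closure(attrs, fds):
    # Attributes functionally determined by `attrs` under the given FDs.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

# {Flight#, Day} determines every attribute of the example relation:
print(closure({"Flight#", "Day"}, FDS))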

7 Conclusion

The problem of whether there exists a simple yet sophisticated representation of sets of constraints that supports reasoning on constraints is solved in this paper by introducing a more surveyable means for the representation of constraint sets: the graphical representation. It requires a different implication system than the classical Armstrong system. We therefore introduced another system and showed (Theorems 1 and 2) its soundness and completeness. This system has another useful property (Theorem 3): constraint derivation may be ordered on the basis of sequences of rules, so that rule application can be described by a regular expression over the derivation rules. This order of rule application is extremely useful whenever we want to know whether the set of generated functional constraints is full (closed), i.e., consists of all (positive, or both positive and negative) dependencies that follow from the given initial system of functional constraints. Based on this, we were able to generate all possible sets of initial functional dependencies. Graphical reasoning supports a simpler style of reasoning on constraint sets.


Completeness and soundness of systems of functional dependencies and excluded functional dependencies become surveyable. Since database design approaches rely on the completeness and soundness of constraint sets, our approach enables database designers to obtain better database design results.

Acknowledgements We would like to thank Tibor Ásványi for his help in improving the efficiency of our PROLOG program, which generates the sets of functional dependencies, and Zoltán Csaba Regéci for his assistance in running the program at MTA SZTAKI. We are also grateful to Andrea Molnár for her valuable comments on the illustration of the graphical rules and the tetrahedral representation.

References
1. P. Atzeni and V. De Antonellis. Relational database theory. Addison-Wesley, Redwood City, 1993.
2. R. Camps. From ternary relationship to relational tables: A case against common beliefs. ACM SIGMOD Record, 31(2), pages 46–49, 2002.
3. J. Demetrovics and G. O. H. Katona. Combinatorial problems of database models. In Colloquia Mathematica Societatis Janos Bolyai 42, Algebra, Combinatorics and Logic in Computer Science, pages 331–352, Györ, Hungary, 1983.
4. J. Demetrovics, L. O. Libkin, and I. B. Muchnik. Functional dependencies and the semilattice of closed classes. In Proc. MFDBS'89, LNCS 364, pages 136–147, 1989.
5. J. Demetrovics, A. Molnar, and B. Thalheim. Graphical and spreadsheet reasoning for sets of functional dependencies. Technical Report 0402, Kiel University, Computer Science Institute, http://www.informatik.uni-kiel.de/reports/2004/0402.html, 2004.
6. T. A. Halpin. Conceptual schema and relational database design. Prentice-Hall, Sydney, 1995.
7. M. Klettke. Akquisition von Integritätsbedingungen in Datenbanken. DISBIS 51, infix-Verlag, Sankt Augustin, 1998.
8. O. I. Lindland, G. Sindre, and A. Solvberg. Understanding quality in conceptual modeling. IEEE Software, 11(2):42–49, 1994.
9. C. W. Morris. Foundations of the theory of signs. In International Encyclopedia of Unified Science. University of Chicago Press, 1955.
10. V. C. Storey, H. L. Yang, and R. C. Goldstein. Semantic integrity constraints in knowledge-based database design systems. Data & Knowledge Engineering, 20:1–37, 1996.
11. B. Thalheim. Open problems in relational database theory. Bull. EATCS, 32:336–337, 1987.
12. B. Thalheim. Entity-relationship modeling – Foundations of database technology. Springer, Berlin, 2000. See also http://www.informatik.tu-cottbus.de/~thalheim/HERM.htm.
13. B. Thalheim. Component construction of database schemes. In Proc. ER'02, LNCS 2503, pages 20–34, 2002.
14. C.-C. Yang. Relational Databases. Prentice-Hall, Englewood Cliffs, 1986.

ER-Based Software Sizing for Data-Intensive Systems

Hee Beng Kuan Tan and Yuan Zhao

School of Electrical and Electronic Engineering, Block S2, Nanyang Technological University, Nanyang Avenue, Singapore 639798

Abstract. Despite the existence of well-known software sizing methods such as the Function Point method, many developers still continue to use ad-hoc methods or so-called "expert" approaches. This is mainly because the existing methods require much implementation information that is difficult to identify or estimate in the early stage of a software project. The accuracy of ad-hoc and "expert" methods is also problematic. The entity-relationship (ER) model is widely used in conceptual modeling (requirements analysis) for data-intensive systems. From our observation, a data-intensive system, and therefore the source code of its software, is well characterized by the ER diagram that models its data. Based on this observation, this paper proposes a method for building software size models from extended ER diagrams through the use of regression models. We have collected some real data from the industry to do a preliminary validation of the proposed method. The result of the validation is very encouraging. As software sizing is an important key to software cost estimation and therefore vital to the industry for managing software projects, we hope that the research and industry communities can further validate the proposed method.

1 Introduction

Estimating project size is a crucial task in any software project. Overestimates may lead to the abortion of projects or loss of projects to competitors. Underestimates pressurize project teams and may also adversely affect the quality of projects. Despite the existence of well-known software sizing methods such as the Function Point method [1], [10] and the more recent Full Function Point method [7], many practitioners and project managers continue to produce estimates based on ad-hoc or so-called "expert" approaches [2], [8], [15]. This is mainly because existing sizing methods require much implementation information that is not available in the earlier stages of a software project. However, the accuracy of ad-hoc and expert approaches is also problematic, which results in questionable project budgets and schedules. The entity-relationship (ER) model originally proposed by Chen [5] is generally regarded as the most widely used tool for the conceptual modeling of data-intensive systems. An ER model is constructed to depict the ideal organization of data, independent of the physical organization of the data and of where and how the data are used. Indeed, many requirements of data-intensive systems are reflected in the ER models that depict their data conceptually.


This paper proposes a novel method for building a software size model to estimate the size of the source code of a data-intensive system based on an extended ER diagram. It also discusses the validation effort we conducted to validate the proposed method for building software size models for data-intensive systems written in the Visual Basic and Java languages. The paper is organized as follows. Section 2 gives the background information for the paper. Section 3 discusses our observation and its rationale. Section 4 presents the proposed method for building software size models to estimate the sizes of source code for data-intensive systems. Section 5 discusses our preliminary validation of the proposed method. Section 6 concludes the paper and compares the proposed method with related methods.

2 Background

The entity-relationship (ER) model was originally proposed by Chen [5] for data modeling, and it has been extended by Chen and others subsequently [17]. In this paper, we refer to the extended ER model that has the same set of concepts as the class diagram in terms of data modeling. In summary, the extended ER model uses the concepts of entity, attribute and relationship to model the conceptual data for a problem. Each entity has a set of attributes, each of which is a property or characteristic of the entity that is of concern to the problem. Relationships can be classified into three types: association, aggregation and generalization.

There are four main stages in developing software systems: requirements capture, requirements analysis, design and implementation. The requirements are studied and specified in the requirements capture stage. They are realized conceptually in the requirements analysis stage. The design for implementing the requirements, with the target environments taken into consideration, is constructed in the design stage. In the implementation stage, the design is coded using the target programming language and the resulting code is tested to ensure its correctness.

Though UML (Unified Modeling Language) has gained popularity as a standard software modeling language, many data-intensive systems are still developed in the industry through some form of data-oriented approach. In such an approach, some form of extended entity-relationship (ER) model is constructed to model the data conceptually in the requirements capture and analysis stages, and the subsequent design and implementation activities are very much based on the extended ER model. For projects that use UML, a class diagram is usually constructed in the requirements analysis stage. Indeed, for a data-intensive system, the class diagram constructed can be viewed as an extended ER model with the extension of behavioral properties (processing). Therefore, in the early stage of software development, some form of extended ER model is more readily available than information such as external inputs, outputs and inquiries, and external logical files and external interface files that are required for the computation of function points.

3 Our Observation

Data-intensive systems constitute one of the largest domains in software. These systems usually maintain a large amount of structured data in a database built using a database management system (DBMS).


Such a system provides operational, control and management support to end-users through referencing and analyzing these data. The support is usually accomplished through accepting inputs from users, processing the inputs, updating databases, printing reports, and providing inquiries to help users in their management and decision-making processes. The proposed method for building software size models for data-intensive systems is based on our observation of these systems. Next, we discuss the observation and its rationale.

The Observation: Under the same development environment (that is, a particular programming language and tool used), the size of the source code for a data-intensive system usually depends on the extended ER diagram that models its data.

Rationale: The constituents of a data-intensive system can be classified into the following:
1) Support business operations through accepting inputs to maintain entities modeled in the ER diagram.
2) Support decision-making processes through producing outputs from information possessed by entities modeled in the ER diagram.
3) Implement business logic to support business operation and control.
4) Reference entities modeled in the ER diagram to support the first three constituents.

Since the first two and the last constituents are based on the ER diagram, they depend on the ER diagram. At first glance, it seems that the third constituent may not depend on the ER diagram. However, since a data-intensive system usually does not perform complex computation within its source code (any complex computation is usually achieved through calling pre-developed functions), business logic in the source code is mainly about navigation between entities via relationship types, with simple computation. For example, for the business logic that if a customer has two overdue invoices then no further orders will be processed, the source code implementing the business logic retrieves overdue invoices in the Invoice entity type for the customer in the Customer entity type via the relationship type that associates a customer with its invoices. There is no complex computation involved. Therefore, it is reasonable to assume that, usually, the implementation of business logic in a data-intensive system also depends on the ER diagram. This completes the rationale of the observation.
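The overdue-invoice rule mentioned above illustrates why such business logic stays close to the ER diagram: it is pure navigation from a Customer to its Invoice entities via a relationship type, with only a trivial count. The sketch below uses hypothetical entity and attribute names chosen to match the running example, not types prescribed by the paper.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Invoice:
    number: str
    overdue: bool

@dataclass
class Customer:
    name: str
    invoices: List[Invoice] = field(default_factory=list)   # relationship to Invoice

def may_process_order(customer: Customer) -> bool:
    # No further orders are processed for a customer with two overdue invoices.
    overdue = sum(1 for inv in customer.invoices if inv.overdue)
    return overdue < 2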

4 The Proposed Software Sizing

From the observation discussed in the previous section, the size of the source code for a data-intensive system usually depends, and only depends, on the structure and size of an extended ER diagram that models its data. Furthermore, the ER diagram is widely and well used in the requirements modeling and analysis stages. Thus, it is suitable to base the estimation of the size of source code for a data-intensive system on an extended ER diagram. Therefore, we propose a novel method for building software size models based on extended ER diagrams. This section discusses the method.


The proposed method builds software size models through well-known linear regression models. For a data-intensive system, the variables that sufficiently characterize the extended ER diagram for the system form the independent variables. The dependent variable is the size of its source code in thousands of lines of code (KLOC). Note that in this case, the extended ER diagram is implemented, and only implemented, by the system. That is, the extended ER diagram and the system must coincide and have a one-to-one correspondence. As such, any source code that references or updates the database designed from the extended ER diagram must be included as part of the source code. In the proposed approach, a separate software size model should be built for each development environment (that is, each programming language and tool used). For example, different software size models should be built for systems written in Visual Basic with VB Script and SQL, and for systems written in Java with JSP, Java Script and SQL.

In the most precise case, the independent variables that characterize the extended ER diagram comprise the following:
1) Total number of entity types.
2) Total number of attributes.
3) Total numbers of association types, classified based on their degrees and multiplicities: usually, the degrees can be classified exactly for those below an upper limit, and the remaining ones can all be lumped into one. Multiplicities can be classified into zero-or-one, one and many. A more precise classification can also be tried.
4) Total numbers of aggregation types, classified based on their degrees and multiplicities: same as for association types.
5) Total numbers of generalization types, classified based on the number of sub-classes: usually, the number of sub-classes can be classified exactly for those below an upper limit, and the remaining ones can all be lumped into one.

However, we do not propose to build a software size model based on a fixed set of independent variables. It all depends on the kind of ER diagrams used in the organizations for which we develop the software size model. Note that the above-mentioned association refers to association that is not aggregation. The separation of relationship types into associations, aggregations and generalizations is because of the differences in their semantics. These differences may result in some differences in navigation and updating needs in the database. We propose that the independent variables should be defined according to the type of ER diagram constructed during the requirements modeling and analysis stages, so that at least the data required for software sizing is readily available in the early stage of requirements analysis. From our experience in building the proposed software size models using data collected from the industry, hardly any relationship type is ternary or of higher order, and most of the ER diagrams do not classify their relationship types into association, aggregation and generalization. The precision of the independent variables depends on the types of extended ER diagram constructed in the requirements modeling and analysis stages in the organization. However, a larger set of independent variables will require a larger set of data for building and evaluating the model.
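To make the choice of independent variables concrete, the sketch below counts entity types, attributes and relationship types (classified by kind and degree) from a machine-readable diagram; the dictionary-based diagram format and the example values are assumptions made for illustration only.

from collections import Counter

def er_size_variables(diagram):
    entities = diagram["entities"]            # {entity name: [attribute, ...]}
    relationships = diagram["relationships"]  # [{"kind": ..., "degree": ...}, ...]
    variables = {
        "entity_types": len(entities),
        "attributes": sum(len(attrs) for attrs in entities.values()),
    }
    # Association/aggregation/generalization types, classified by degree.
    kinds = Counter((r["kind"], r.get("degree", 2)) for r in relationships)
    for (kind, degree), count in kinds.items():
        variables["%s_degree_%d" % (kind, degree)] = count
    return variables

example = {
    "entities": {"Customer": ["id", "name"], "Invoice": ["number", "overdue"]},
    "relationships": [{"kind": "association", "degree": 2}],
}
print(er_size_variables(example))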


The steps for building the proposed software size models are as follows:
1) Independent variables identification: based on the type of data model (a class or ER diagram) constructed during requirements modeling and analysis, we identify a set of independent variables that sufficiently characterize the diagram.
2) Data collection: collect the ER diagrams and the sizes of source code (in KLOC) of sufficiently many data-intensive systems. A larger set of independent variables will require a larger set of data. There are many free tools available for the automated extraction of source code size.
3) Model building and evaluation: there are quite a number of commonly used regression models [16]. Both linear and non-linear models can be considered. The size of source code (in KLOC) and the independent variables identified in the first step form the dependent and the independent variables, respectively, of the model. Statistical packages (e.g., SAS) should be used for the model building. Ideally, we should have separate data sets for model building and evaluation. However, if the data is limited, the same data set may also be used for both model building and evaluation.

Let $n$ be the number of data points and $k$ the number of independent variables. Let $y_i$ and $\hat{y}_i$ be the actual and the estimated values, respectively, of a project, and let $\bar{y}$ be the mean of all $y_i$. The evaluation of model goodness can be done by examining the following parameters.

Magnitude of relative error, MRE, and mean magnitude of relative error, MMRE. They are defined as follows:
$$MRE_i = \frac{|y_i - \hat{y}_i|}{y_i}, \qquad MMRE = \frac{1}{n}\sum_{i=1}^{n} MRE_i.$$

If the MMRE is small, then we have a good set of predictions; a usual criterion for accepting a model as good is an MMRE of at most 0.25.

Prediction at level $l$, Pred($l$), where $l$ is a percentage: it is defined as the ratio of the number of cases in which the estimates are within the absolute limit $l$ of the actual values to the total number of cases. A standard criterion for considering a model as acceptable is $Pred(0.25) \ge 0.75$.

Multiple coefficient of determination, $R^2$, and adjusted multiple coefficient of determination, $R^2_a$: these are the usual measures in regression analysis, denoting the percentage of variance accounted for by the independent variables used in the regression equations. They are computed as
$$R^2 = 1 - \frac{SSE}{SS_{yy}} \qquad\text{and}\qquad R^2_a = 1 - \frac{n-1}{n-(k+1)}\cdot\frac{SSE}{SS_{yy}},$$
where the sum of squared errors $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$ and $SS_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2$. In general, the larger the values of $R^2$ and $R^2_a$, the better the fit of the data; $R^2 = 1$ implies a perfect fit, with the model passing through every data point. However, $R^2$ can only be used as a measure to assess the usefulness of the model if the number of data points is substantially larger than the number of independent variables.

If the same data set is used for both model building and evaluation, we can further examine the following parameters to evaluate model goodness.

Relative root mean squared error, $\overline{RMS}$, defined as follows [6]:
$$\overline{RMS} = \frac{RMS}{\bar{y}}, \qquad\text{where}\qquad RMS = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}.$$
A model is considered acceptable if $\overline{RMS} \le 0.25$.

Prediction sum of squares, PRESS [16]: PRESS is a measure of how well the fitted values of a subset model can predict the observed responses $y_i$. The error sum of squares, SSE, is also such a measure. The PRESS measure differs from SSE in that each fitted value for PRESS is obtained by deleting the $i$-th case from the data set, estimating the regression function for the subset model from the remaining $n-1$ cases, and then using the fitted regression function to obtain the predicted value $\hat{y}_{i(i)}$ for the $i$-th case. That is, it is defined as
$$PRESS = \sum_{i=1}^{n}\bigl(y_i - \hat{y}_{i(i)}\bigr)^2.$$
Models with smaller PRESS values are considered good candidate models. The PRESS value is always larger than SSE because the regression fit used to predict the $i$-th case does not include that case. A smaller PRESS value supports the validity of the model built.
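As a quick sketch of how these evaluation parameters can be computed for a set of projects, the functions below implement MMRE, Pred(l), R^2 and adjusted R^2 as defined above; the actual and estimated sizes in the example call are made-up numbers, not data from the paper.

def mmre(actual, estimated):
    return sum(abs(y - yh) / y for y, yh in zip(actual, estimated)) / len(actual)

def pred(actual, estimated, l=0.25):
    hits = sum(1 for y, yh in zip(actual, estimated) if abs(y - yh) / y <= l)
    return hits / len(actual)

def r_squared(actual, estimated, k=None):
    n = len(actual)
    mean = sum(actual) / n
    sse = sum((y - yh) ** 2 for y, yh in zip(actual, estimated))
    ss_yy = sum((y - mean) ** 2 for y in actual)
    if k is None:
        return 1 - sse / ss_yy                                # R^2
    return 1 - (n - 1) / (n - (k + 1)) * (sse / ss_yy)        # adjusted R^2

actual = [12.0, 30.5, 45.2, 8.7, 21.3]       # KLOC, illustrative values only
estimated = [13.1, 28.0, 47.0, 9.5, 20.1]
print(mmre(actual, estimated), pred(actual, estimated), r_squared(actual, estimated, k=3))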

5 Preliminary Validation

As the ER diagrams constructed in most projects in the industry do not classify relationship types into associations, aggregations and generalizations, a complete validation of the proposed method is not possible. We spent much effort persuading organizations in the industry to supply us with their project data for the validation of the proposed software sizing method. As such, the whole validation took about one and a half years. This section discusses our validation.


Due to the above-mentioned constraint, the independent variables for characterizing an ER diagram in our validation are simplified as follows:
1) Number of entity types (E)
2) Number of attributes (A)
3) Number of relationship types (R)
These variables provide a reasonable and concise characterization of the ER diagram. Our validation is based on linear regression models [14] of the first-order form
$$Size = b_0 + b_1 E + b_2 A + b_3 R,$$
where Size is the total KLOC (thousands of lines of code) of all the source code developed based on the ER diagram and each $b_i$ is a coefficient to be determined.
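A minimal sketch of fitting a model of this first-order form by ordinary least squares is given below; the project rows are invented placeholders (the paper's own datasets in Tables 1-3 are not reproduced here), and numpy's lstsq is simply one convenient way to obtain the coefficients.

import numpy as np

# Columns: E (entity types), A (attributes), R (relationship types)
X = np.array([[10, 80, 12],
              [25, 190, 30],
              [15, 120, 18],
              [40, 320, 55],
              [30, 240, 41]], dtype=float)
size_kloc = np.array([22.0, 58.0, 33.0, 101.0, 74.0])   # illustrative only

design = np.column_stack([np.ones(len(X)), X])           # intercept column for b0
coeffs, *_ = np.linalg.lstsq(design, size_kloc, rcond=None)
b0, b1, b2, b3 = coeffs
print("Size = %.2f + %.2f*E + %.2f*A + %.2f*R" % (b0, b1, b2, b3))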

5.1 The Dataset

We collected three datasets from multiple organizations in the industry, including software houses and end-users such as public organizations and insurance companies. The projects cover a wide range of application domains including freight management, administrative and financial systems. The first dataset comprises 14 projects that were developed using Visual Basic with VB Scripts and SQL. The second dataset comprises 10 projects that were developed using Java with JSP, Java Script and SQL. Tables 1 and 2 show the details of these two datasets. The first and second datasets are for building the software size models for the respective development environments. The third dataset comprises 8 projects developed using the same Visual Basic development environment as the first dataset. Table 3 shows the details of the third dataset.

5.2 The Resulting Models

From the Visual Basic based project dataset (Table 1), we built a first-order model for estimating the size of source code (in KLOC) developed using Visual Basic with VB Script and SQL. The adjusted multiple coefficient of determination $R^2_a$ for this model is 0.84, which is reckoned as good.

From the Java based project dataset (Table 2), we built a first-order model for estimating the size of source code (in KLOC) developed using Java with JSP, Java Script and SQL. The adjusted multiple coefficient of determination $R^2_a$ for this model is 0.99, which is reckoned as very good.


5.3 Model Evaluation

For the first-order model built for estimating the size of source code (in KLOC) developed using Visual Basic with VB Script and SQL, we managed to collect a separate dataset for the evaluation of the model. Note that $R^2_a$ for this model has already been computed during model building and is 0.84, which is reckoned as good. MMRE and Pred(0.25) computed from the evaluation dataset are 0.16 and 0.88, respectively. These values fall well within the acceptable levels. The detailed results of the evaluation are shown together with the evaluation dataset in Table 3. Therefore, the evaluation results support the validity of the model built. For the first-order model built for estimating the size of source code (in KLOC) developed using Java with JSP, Java Script and SQL, we did not manage to collect a separate dataset for the evaluation of the model.


As such, we used the same dataset for the evaluation. Note that $R^2_a$ for this model has already been computed during model building and is 0.99, which is reckoned as very good. MMRE, Pred(0.25), SSE and PRESS computed from the same dataset are 0.07, 1.00, 10.04 and 556.84, respectively. The detailed results of the evaluation are shown in Table 4. Both MMRE and Pred(0.25) fall well within the acceptable levels. Although there is a difference between SSE and PRESS, the difference is not too substantial either. Note that $\overline{RMS}$ computed from SSE in this case is 0.02; if we replace SSE by PRESS in the computation of $\overline{RMS}$, then its value is 0.18. Both of these values fall well below the acceptable level of 0.25. Therefore, the evaluation results support the validity of the model built.

Though we managed to build only simplified software size models from the proposed approach, due to the limitations of industry practice, the evaluation results have already supported the validity of the models built. As such, our empirical validation supports the validity of the proposed method for building software size models.


6 Comparative Discussion

We have proposed a novel method for building software size models for data-intensive systems. Due to the lack of complete data for validating the proposed method from completed projects in the industry, we only managed to do a validation based on building and evaluating simplified versions of the proposed software size models. The statistical evaluation supports the validity of the proposed method. Due to the above-mentioned simplification and the limited size of our dataset, we do not claim that the models built in this paper are ready for use. However, we believe that our work has shown enough promise to study the proposed method for software sizing further.

Software size estimation is an important key to project estimation, which in turn is vital for project control and management [3], [4], [11]. There are many problems with existing software size estimation methods. As the software estimation community requires totally new datasets for the building and evaluation of software size models built using the proposed method, we call for collaboration between the industry and the research communities to validate the proposed method further and more comprehensively. Judging from the history of establishing the Function Point method, without such an effort it is not likely that usable software size models can be built.

As discussed in [15], most of the existing software sizing methods [9], [12], [13], [18] require much implementation information that is not available and is difficult to predict in the early stage of a software project. The information is not even available after the requirements analysis stage; it is only available in the design or implementation stage. For example, the Function Point method is based on external inputs, outputs and inquiries, and external logical files and external interface files. Such implementation details are not even available at the end of the requirements analysis stage. The ER diagram is well used in conceptual modeling for developing data-intensive systems, and some proposals for software projects have also included ER diagrams as part of the project requirements. As such, ER diagrams are at least practically available after the requirements analysis stage. Once the ER diagram is constructed, the proposed software size model can be applied without much difficulty. Therefore, in the worst case, we can apply the proposed approach after the requirements analysis stage. Ideally, a brief extended ER model should be constructed during the project proposal or planning stage, and the proposed software size model can be applied to estimate the software size to serve as an input for project effort estimation. Subsequently, when a more accurate extended ER model is available, the model can be reapplied for more accurate project estimation. A final revision of the project estimation should be carried out at the end of the requirements analysis stage, at which point an accurate extended ER diagram should be available. The well-known Function Point method is also mainly for data-intensive systems; as such, the domain of application of the proposed method for software sizing is similar to that of the Function Point method.


Acknowledgement We would like to thank IPACS E-Solution (S) Pte Ltd, Singapore Computer Systems Pte Ltd, NatSteel Ltd, Great Eastern Life Assurance Co. Limited, JTC Corporation and National Computer Systems Pte Ltd for providing the project data. Without their support, this work would not be possible.

References
1. A. J. Albrecht and J. E. Gaffney, Jr., "Software function, source lines of code, and development effort prediction: a software science validation," IEEE Trans. Software Eng., vol. SE-9, no. 6, Nov. 1983, pp. 639–648.
2. P. Armour, "Ten unmyths of project estimation: reconsidering some commonly accepted project management practices," Comm. ACM 45(11), Nov. 2002, pp. 15–18.
3. B. W. Boehm and R. E. Fairley, "Software estimation perspectives," IEEE Software, Nov./Dec. 2000, pp. 22–26.
4. B. W. Boehm et al., Software Cost Estimation with COCOMO II, Prentice Hall, 2000.
5. P. P. Chen, "The entity-relationship model – towards a unified view of data," ACM Trans. Database Syst. 1(1), Mar. 1976, pp. 9–36.
6. S. D. Conte, H. E. Dunsmore, and V. Y. Shen, Software Engineering Metrics and Models, Benjamin/Cummings, 1986.
7. COSMIC-Full Functions – Release 2.0, September 1999.
8. J. J. Dolado, "A validation of the component-based method for software size estimation," IEEE Trans. Software Eng., vol. SE-26, no. 10, Oct. 2000, pp. 1006–1021.
9. D. V. Ferens, "Software Size Estimation Techniques," Proceedings of the IEEE NAECON 1988, pp. 701–705.
10. D. Garmus and D. Herron, Function Point Analysis: measurement practices for successful software projects, Addison Wesley, 2000.
11. C. F. Kemerer, "An empirical validation of software project cost estimation models," Comm. ACM 30(5), May 1987, pp. 416–429.
12. R. Lai and S. J. Huang, "A model for estimating the size of a formal communication protocol application and its implementation," IEEE Trans. Software Eng., vol. 29, no. 1, pp. 46–62, Jan. 2003.
13. L. A. Laranjeira, "Software Size Estimation of Object-Oriented Systems," IEEE Trans. Software Eng., vol. 16, no. 5, May 1990, pp. 510–522.
14. J. T. McClave and T. Sincich, Statistics, Prentice Hall, 2003.
15. E. Miranda, "An evaluation of the paired comparisons method for software sizing," Proc. Intl. Conf. on Software Eng., 2000, pp. 597–604.
16. J. Neter, M. H. Kutner, C. J. Nachtsheim, and W. Wasserman, Applied Linear Regression Models, IRWIN, 1996.
17. T. J. Teorey, D. Yang, and J. P. Fry, "A logical design methodology for relational databases using the extended entity-relationship model," ACM Computing Surveys 18(2), June 1986, pp. 197–222.
18. J. Verner and G. Tate, "A software size model," IEEE Trans. Software Eng., vol. SE-18, no. 4, Apr. 1992, pp. 265–278.

Data Mapping Diagrams for Data Warehouse Design with UML

Sergio Luján-Mora(1), Panos Vassiliadis(2), and Juan Trujillo(1)

(1) Dept. of Software and Computing Systems, University of Alicante, Spain
{slujan,jtrujillo}@dlsi.ua.es
(2) Dept. of Computer Science, University of Ioannina, Hellas

Abstract. In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. In this paper, we present a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity, including transformation rules at the attribute level. Specifically, we present a disciplined framework for the modeling of the relationships between sources and targets at different levels of granularity (ranging from coarse mappings at the database and table levels to detailed inter-attribute mappings at the attribute level). In order to accomplish this goal, we extend UML (Unified Modeling Language) to model attributes as first-class citizens. In our attempt to provide complementary views of the design artifacts at different levels of detail, our framework is based on a principled approach to the usage of UML packages, to allow zooming in and out of the design of a scenario. Keywords: data mapping, ETL, data warehouse, UML

1 Introduction

In Data Warehouse (DW) scenarios, ETL (Extraction, Transformation, Loading) processes are responsible for the extraction of data from heterogeneous operational data sources, their transformation (conversion, cleaning, normalization, etc.) and their loading into the DW. DWs are usually populated with data from different and heterogeneous operational data sources such as legacy systems, relational databases, COBOL files, the Internet (XML, web logs) and so on. It is well recognized that the design and maintenance of these ETL processes (also called the DW back stage) is a key factor of success in DW projects for several reasons, the most prominent of which is their critical mass; in fact, ETL development can take up as much as 80% of the development time in a DW project [1,2].


Despite the importance of designing the mapping of the data sources to the DW structures, along with any necessary constraints and transformations, there are unfortunately few models that can be used by designers to this end. The front end of the DW has monopolized the research on the conceptual part of DW modeling, while few attempts have been made towards the conceptual modeling of the back stage [3,4]. Still, to this day, there is no model that can combine (a) the desired detail of modeling data integration at the attribute level and (b) a widely accepted modeling formalism such as the ER model or UML. One particular reason for this is that both these formalisms are simply not designed for this task; on the contrary, they treat attributes as second-class, weak entities with a descriptive role. Of particular importance is the problem that in both models attributes cannot serve as an end in an association or any other relationship. One might argue that the current way of modeling is sufficient and there is no real need to extend it in order to capture mappings and transformations at the attribute level. There are certain reasons that we can list against this argument:
- The design artifacts act as blueprints for the subsequent stages of the DW project. If the important details of this design (e.g., attribute interrelationships) are not documented, the blueprint is problematic. Actually, one of the current issues in DW research involves the efficient documentation of the overall process.
- Since design artifacts are means of communicating ideas, it is best if the formalism adopted is a widely used one (e.g., UML or ER).
- The design should reflect the architecture of the system in a way that is formal, consistent and allows the what-if analysis of subsequent changes.
Capturing attributes and their interrelations as first-class modeling elements (FCME, also known as first-class citizens) improves the design significantly with respect to all these goals. At the same time, the way this issue is handled now would involve a naive, informal documentation through UML notes. In previous lines of research [5], we have shown that by modeling attribute interrelationships, we can treat the design artifact as a graph and actually measure the aforementioned design goals. Again, this would be impossible with the current modeling formalisms. To address all the aforementioned issues, in this paper we present an approach that enables the tracing of the DW back-stage (ETL process) particularities at various levels of detail, through a widely adopted formalism (UML). This is enabled by an additional view of a DW, called the data mapping diagram. In this new diagram, we treat attributes as FCME of the model. This gives us the flexibility of defining models at various levels of detail. Naturally, since UML is not initially prepared to support this behavior, we solve the problem thanks to the extension mechanisms that it provides. Specifically, we employ a formal, strict mechanism that maps attributes to proxy classes that represent them. Once mapped to classes, attributes can participate in associations that determine the inter-attribute mappings, along with any necessary transformations and constraints. We adopt UML as our modeling language due to its wide acceptance and the possibility of using various complementary diagrams for modeling different system aspects.


Actually, from our point of view, one of the main advantages of the approach presented in this paper is that it is totally integrated in a global approach that allows us to accomplish the conceptual, logical and the corresponding physical design of all DW components by using the same notation [6–8]. The rest of the paper is structured as follows. In Section 2, we briefly describe the general framework for our DW design approach and introduce a motivating example that will be followed throughout the paper. In Section 3, we show how attributes can be represented as FCME in UML. In Section 4, we present our approach to model data mappings in ETL processes at the attribute level. In Section 5, we review related work and finally, in Section 6, we present the main conclusions and future work.

2 Framework and Motivation

In this section we discuss our general assumptions around the DW environment to be modelled and briefly give the main terminology. Moreover, we define a motivating example that we will consistently follow through the rest of the paper. The architecture of a DW is usually depicted as various layers of data in which data from one layer is derived from data of the previous layer [9]. Following this consideration, we consider that the development of a DW can be structured into an integrated framework with five stages and three levels that define different diagrams for the DW model, as explained in Table 1.

In previous works, we have presented some of these diagrams (and the corresponding profiles), such as the Multidimensional Profile [6,7] and the ETL Profile [4]. In this paper, we introduce the Data Mapping Profile. To motivate our discussion, we introduce a running example where the designer wants to build a DW from the retail system of a company.


Naturally, we consider only a small part of the DW, where the target fact table has to contain only the quarterly sales of the products belonging to the computer category, whereas the rest of the products are discarded. In Fig. 1, we zoom in on the definition of the SCS (Source Conceptual Schema), which represents the sources that feed the DW with data. In this example, the data source is composed of four entities represented as UML classes: Cities, Customers, Orders, and Products. The meaning of the classes and their attributes, as depicted in Fig. 1, is straightforward. The "..." shown in this figure simply indicates that other attributes of these classes exist, but they are not displayed for the sake of simplicity (this use of "..." is not a UML notation).

Fig. 1. Source Conceptual Schema (SCS)

Fig. 2. Data Warehouse Conceptual Schema (DWCS)

Finally, the DWCS (Data Warehouse Conceptual Schema) of our motivating example is shown in Fig. 2. The DW is composed of one fact (ComputerSales) and two dimensions (Products and Time). In this paper, we present an additional view of a DW, called the data mapping diagram, that shows the relationships between the data sources and the DW and between the DW and the clients' structures. In this new diagram, we need to treat attributes as FCME of the models, since we need to depict their relationships at the attribute level. Therefore, we also propose a UML extension to accomplish this goal in this paper. To the best of our knowledge, this is the first proposal to represent attributes as FCME in UML diagrams.

3 Attributes as First-Class Modeling Elements in UML

Both in the Entity-Relationship (ER) model and in UML, attributes are embedded in the definition of their comprising "element" (an entity in the ER model or a class in UML), and it is not possible to create a relationship between two attributes. As we have already explained in the introduction, in some situations (e.g., data integration, constraints over attributes, etc.) it is desirable to represent attributes as FCME. Therefore, in this section we present an extension of UML to accommodate attributes as FCME.


We have chosen UML instead of ER on the grounds of its higher flexibility in terms of employing complementary diagrams for the design of a certain system. Throughout this paper, we frequently use the term first-class modeling elements, or first-class citizens, for elements of our modeling languages. Conceptually, FCME refer to fundamental modeling concepts, on the basis of which our models are built. Technically, FCME involve an identity of their own, and they are possibly governed by integrity constraints (e.g., relationships must have at least two ends referring to classes). In a UML class diagram, two kinds of modeling elements are treated as FCME. Classes, as abstract representations of real-world entities, are naturally found in the center of the modeling effort. Being FCME, classes are stand-alone entities that also act as attribute containers. The relationships between classes are captured by associations. Associations can also be FCME, called association classes. Even though an association class is drawn as an association and a class, it is really just a single model element [10]. An association class can contain attributes or can be connected to other classes. However, the same is not possible with attributes. Naturally, in order to allow attributes to play the same role in certain cases, we propose the representation of attributes as FCME in UML. In our approach, classes and attributes are defined as usual in UML. However, in those cases where it is necessary to treat attributes as FCME, classes are imported into the attribute/class diagram, where attributes are automatically represented as classes; in this way, the user only has to define the classes and the attributes once. In the importing process from the class diagram to the attribute/class diagram, we refer to the class that contains the attributes as the container class and to the class that represents an attribute as the attribute class. In Table 2, we formally define attribute/class diagrams, along with the corresponding new stereotypes.
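A small sketch of the import step described above is given below: a class defined in an ordinary class diagram is brought into an attribute/class diagram, where it becomes a container class connected to one attribute class per attribute. The Python data structures and the sample attribute names are illustrative only; they are not the UML profile itself.

from dataclasses import dataclass
from typing import List

@dataclass
class AttributeClass:
    name: str            # e.g. "Orders.prod_list"
    container: str       # name of the container class it belongs to

@dataclass
class ContainerClass:
    name: str
    attribute_classes: List[AttributeClass]

def import_class(name, attributes):
    # Turn a class with embedded attributes into a container class plus attribute classes.
    return ContainerClass(
        name=name,
        attribute_classes=[AttributeClass("%s.%s" % (name, a), name) for a in attributes],
    )

orders = import_class("Orders", ["order_id", "prod_list"])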

4 The Data Mapping Diagram

Once we have introduced the extension mechanism that enables UML to treat attributes as FCME, we can proceed to define a framework for its usage.


In this section, we will introduce the data mapping diagram, which is a new kind of diagram, particularly customized for the tracing of the data flow, at various degrees of detail, in a DW environment. Data mapping diagrams are complementary to the typical class and interaction diagrams of UML and focus on the particularities of the data flow and the interconnections of the involved data stores. A special characteristic of data mapping diagrams is that a certain DW scenario is practically described by a set of complementary data mapping diagrams, each defined at a different level of detail. In this section, we will introduce a principled approach to deal with such complementary data mapping diagrams. To capture the interconnections between design elements, in terms of data, we employ the notion of mapping. Broadly speaking, when two design elements (e.g., two tables or two attributes) share the same piece of information, possibly through some kind of filtering or transformation, this constitutes a semantic relationship between them. In the DW context, this relationship involves three logical parties: (a) the provider entity (schema, table, or attribute), responsible for generating the data to be further propagated, (b) the consumer, which receives the data from the provider, and (c) their intermediate matching, which involves the way the mapping is done, along with any transformation and filtering. Since a data mapping diagram can be very complex, our approach offers the possibility to organize it in different levels thanks to the use of UML packages. Our layered proposal consists of four levels (see Fig. 3), as explained in Table 3.

At the leftmost part of Fig. 3, a simple relationship between the DWCS and the SCS exists: this is captured by a single Data Mapping package, and these three design elements constitute the data mapping diagram of the database level (or Level 0).


Assuming that there are three particular tables in the DW that we would like to populate, this particular Data Mapping package abstracts the fact that there are three main scenarios for the population of the DW, one for each of these tables. At the dataflow level (or Level 1) of our framework, the data relationships among the sources and the targets in the context of each of the scenarios are practically modeled by the respective package. If we zoom into one of these scenarios, e.g., Mapping 1, we can observe its particularities in terms of data transformation and cleaning: the data of Source 1 are transformed in two steps (i.e., they have undergone two different transformations), as shown in Fig. 3. Observe also that there is an Intermediate data store employed to hold the output of the first transformation (Step 1), before it is passed on to the second one (Step 2). Finally, at the lower right part of Fig. 3, the way the attributes are mapped to each other for the data stores Source 1 and Intermediate is depicted. Let us point out that in case we are modeling a complex and huge DW, the attribute transformations modelled at Level 3 are hidden within a package definition, thereby avoiding the use of cluttered diagrams.

Fig. 3. Data mapping levels

The constructs that we employ for the data mapping diagrams at the different levels are as follows:
- The database and dataflow diagrams (Levels 0 and 1) use traditional UML structures for their purpose. Specifically, in these diagrams we employ (a) packages for the modeling of data relationships and (b) simple dependencies among the involved entities. The dependencies state that the mapping packages are dependent upon the changes of the employed data stores.
- The table-level (Level 2) diagram extends UML with three stereotypes: (a) one used as a package that encapsulates the data interrelationships among data stores, and (b) two that explain the roles of providers and consumers for such a mapping package.
- The diagram at the attribute level (Level 3) also uses several newly introduced stereotypes for the definition of data mappings.


We will detail the stereotypes of the table level in the next section and defer the discussion of the stereotypes of the attribute level to Subsection 4.2.

4.1 The Data Mapping Diagram at the Table Level

During the integration process from the data sources into the DW, source data may undergo a series of transformations, which may vary from simple algebraic operations or aggregations to complex procedures. In our approach, the designer can segment a long and complex transformation process into simple and small parts represented by means of UML packages that are materializations of a dedicated stereotype and contain an attribute/class diagram. Moreover, packages are linked by stereotyped dependencies that represent the flow of data. During this process, the designer can create intermediate classes, represented by their own stereotype, in order to simplify or clarify the models. These classes represent intermediate storage that may or may not actually exist, but they help to understand the mappings. In Fig. 4, a schematic representation of a data mapping diagram at the table level is shown. This level specifies the data sources and the targets to which these data are directed. At this level, the classes are represented as usual in UML, with the attributes depicted inside the container class. Since all the classes are imported from other packages, the legend (from ...) appears below the name of each class. The mapping diagram is shown as a package decorated with the corresponding stereotype and hides the complexity of the mapping, because a vast number of attributes can be involved in a data mapping. This package presents two kinds of stereotyped dependencies: one to the data providers (i.e., the data sources) and another to the data consumers (i.e., the tables of the DW).
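As a rough illustration of the information a table-level mapping package carries, the sketch below records a mapping together with its provider and consumer data stores, mirroring the two kinds of stereotyped dependencies just described; the class names and the sample content are assumptions for the sketch, not part of the UML profile.

from dataclasses import dataclass
from typing import List

@dataclass
class MappingPackage:
    name: str
    providers: List[str]   # data sources feeding the mapping
    consumers: List[str]   # DW tables populated by the mapping

mapping1 = MappingPackage(
    name="Mapping 1",
    providers=["Orders", "Products"],
    consumers=["ComputerSales"],
)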

4.2 The Data Mapping Diagram at the Attribute Level

As already mentioned, at the attribute level the diagram includes the relationships between the attributes of the classes involved in a data mapping. At this level, we offer two design variants:
- Compact variant: the relationship between the attributes is represented as an association, and the semantics of the mapping are described in a UML note attached to the target attribute of the mapping.
- Formal variant: the relationship between the attributes is represented by means of a mapping object, and the semantics of the mapping are described in a tag definition of the mapping object.
With the first variant, the data mapping diagrams are less cluttered, with fewer modeling elements, but the data mapping semantics are expressed as UML notes, which are simple comments with no semantic impact. On the other hand, the size of the data mapping diagrams obtained with the second variant is larger, with more modeling elements and relationships, but the semantics are better defined as tag definitions.


Due to the lack of space, in this paper we will only focus on the compact variant. In this variant, the relationship between the attributes is represented as an association decorated with the corresponding stereotype, and the semantics of the mapping are described in a UML note attached to the target attribute of the mapping.

Fig. 4. Level 2 of a data mapping diagram

The content of the package Mapping diagram from Fig. 4 is defined in the following way (recall that Mapping diagram is a package that contains an attribute/class diagram):
- The classes DS1, DS2, ..., and Dim1 are imported into Mapping diagram. The attributes of these classes are suppressed because they are shown as classes in this package.
- The classes are connected by means of association relationships, and we use the navigability property to specify the flow of data from the data sources to the DW.
- The association relationships are adorned with the corresponding stereotype to highlight the meaning of the relationship.
- A UML note can be attached to each of the target attributes to specify how the target attribute is obtained from the source attributes. The language for the expression is a choice of the designer (e.g., a LAV vs. a GAV approach [11] can be equally followed).
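The compact-variant information content can be summarized, for illustration, as a list of inter-attribute mappings, each linking source attributes to a target attribute together with the expression that would otherwise appear in the attached UML note; the record layout and the sample expression below are assumptions, since the expression language is left to the designer.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class AttributeMapping:
    sources: List[str]
    target: str
    expression: Optional[str] = None   # None means a direct, untransformed mapping

aggregating = [
    AttributeMapping(["FilteredOrders.quantity", "Products.price"],
                     "ComputerSales.sales",
                     expression="sum(quantity * price) per product and quarter"),
]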

4.3 Motivating Example Revisited

From the DW example shown in Figs. 1 and 2, we define the corresponding data mapping diagram shown in Fig. 5. The goal of this data mapping is to calculate the quarterly sales of the products belonging to the computer category. The result of this transformation is stored in ComputerSales from the DWCS. The transformation process has been segmented into three parts: Dividing, Filtering, and Aggregating; moreover, two intermediate classes, DividedOrders and FilteredOrders, have been defined.


Fig. 5. Level 2 of a data mapping diagram

Following the data mapping example shown in Fig. 5, the attribute prod_list from the Orders table contains the list of ordered products, with a product ID and a (parenthesized) quantity for each. Therefore, Dividing splits each input order according to its prod_list into multiple orders, each with a single ordered product (prod_id) and quantity (quantity), as shown in Fig. 6. Note that in a data mapping diagram the designer does not specify the processes, but only the data relationships. We use the one-to-many cardinality in the association relationships between Orders.prod_list and DividedOrders.prod_id and DividedOrders.quantity to indicate that one input order produces multiple output orders. We do not attach any note in this diagram because the data are not transformed, so the mapping is direct.

Fig. 6. Dividing Mapping

Filtering (Fig. 7) filters out products not belonging to the computer category. We indicate this action with a UML note attached to the prod_id mapping, because it is supposed that this attribute is going to be used in the filtering process. Finally, Aggregating (Fig. 8) computes the quarterly sales for each product. We use the many-to-one cardinality to indicate that many input items are needed to calculate a single output item. Moreover, a UML note indicates how the ComputerSales.sales attribute is calculated from FilteredOrders.quantity and Products.price.


Fig. 7. Filtering Mapping

Fig. 8. Aggregating Mapping

The cardinality of the association relationship between Products.price and ComputerSales.sales is one-to-many because the same price is used in different quarters, but to calculate the total sales of a particular product in a quarter we only need one price (we consider that the price of a product never changes over time).
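An end-to-end sketch of the three mapping steps of the running example (Dividing, Filtering, Aggregating) on in-memory records is given below; the record layout and the sample data are invented for illustration, and only the attribute names follow the text (prod_list, prod_id, quantity, price, sales).

from collections import defaultdict

orders = [  # illustrative rows of the Orders table
    {"order_id": 1, "quarter": "2004Q1", "prod_list": [("P1", 2), ("P2", 1)]},
    {"order_id": 2, "quarter": "2004Q1", "prod_list": [("P1", 3)]},
]
products = {"P1": {"category": "computer", "price": 900.0},
            "P2": {"category": "printer", "price": 150.0}}

# Dividing: split each order into one row per ordered product.
divided = [{"quarter": o["quarter"], "prod_id": p, "quantity": q}
           for o in orders for p, q in o["prod_list"]]

# Filtering: keep only products of the computer category.
filtered = [r for r in divided if products[r["prod_id"]]["category"] == "computer"]

# Aggregating: quarterly sales per product (quantity * price).
sales = defaultdict(float)
for r in filtered:
    sales[(r["prod_id"], r["quarter"])] += r["quantity"] * products[r["prod_id"]]["price"]

print(dict(sales))   # {('P1', '2004Q1'): 4500.0}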

5 Related Work

There is a relatively small body of research efforts around the issue of conceptual modeling of the DW back-stage.


In [12,13], model management, a framework for supporting meta-data related applications in which models and mappings are manipulated, is proposed. In [13], two scenarios related to loading DWs are presented as case studies: on the one hand, the mapping between the data sources and the DW; on the other hand, the mapping between the DW and a data mart. In this approach, a mapping is a model that relates the objects (attributes) of two other models; each object in a mapping is called a mapping object and has three properties: domain and range, which point to objects in the source and the target respectively, and expr, which is an expression that defines the semantics of that mapping object. This is an isolated approach in which the authors propose their own graphical notation for representing data mappings. Therefore, from our point of view, there is a lack of integration with the design of the other parts of a DW. In [3] the authors attempt to provide a first model towards the conceptual modeling of the DW back-stage. The notion of provider mapping among attributes is introduced. In order to avoid the problems caused by the specific nature of ER and UML, the authors adopt a generic approach. The static conceptual model of [3] is complemented in [5] with the logical design of ETL processes as data-centric workflows. ETL processes are modeled as graphs composed of activities that include attributes as FCME. Moreover, different kinds of relationships capture the data flow between the sources and the targets. Regarding data mapping, in [14] the authors discuss issues related to data mapping in the integration of data. A set of mapping operators is introduced and a classification of possible mapping cases is presented. However, no graphical representation of data mapping scenarios is provided, thereby making it difficult to use in real-world projects. The issue of treating attributes as FCME has generated several debates since the beginning of the conceptual modeling field [15]. More recently, some object-oriented modeling approaches such as OSM (Object Oriented System Model) [16] or ORM (Object Role Modeling) [17] reject the use of attributes (attribute-free models), mainly because of their inherent instability. In these approaches, attributes are represented with entities (objects) and relationships. Although an ORM diagram can be transformed into a UML diagram, our data mapping diagram is coherently integrated in a global approach for the modeling of DWs [6,7], and particularly of ETL processes [4]. In this approach, we have used the extension mechanisms provided by UML to adapt it to our particular needs for the modeling of DWs. In this way, we always use formal extensions of UML for modeling all parts of DWs.

6 Conclusions and Future Work

In this paper, we have presented a framework for the design of the DW back-stage (and the respective ETL processes) based on the key observation that this task fundamentally involves dealing with the specificities of information at very low levels of granularity. Specifically, we have presented a disciplined framework for the modeling of the relationships between sources and targets at


different levels of granularity (i.e., from coarse mappings at the database level to detailed inter-attribute mappings at the attribute level). Unfortunately, standard modeling languages like the ER model or UML are fundamentally handicapped in treating low-granularity entities (i.e., attributes) as FCME. Therefore, in order to formally accomplish the aforementioned goal, we have extended UML to model attributes as FCME. In our attempt to provide complementary views of the design artifacts at different levels of detail, we have based our framework on a principled usage of UML packages that allows zooming in and out of the design of a scenario. Although we have developed the representation of attributes as FCME in UML in the context of DWs, we believe that our solution can be applied in other application domains as well, e.g., the definition of indexes and materialized views in databases, the modeling of XML documents, the specification of web services, etc. Currently, we are extending our proposal in order to represent attribute constraints such as uniqueness or disjunctive values.

References

1. SQL Power Group: How do I ensure the success of my DW? Internet: http://www.sqlpower.ca/page/dw_best_practices (2002)
2. Strange, K.: ETL Was the Key to this Data Warehouse's Success. Technical Report CS-15-3143, Gartner (2002)
3. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Conceptual Modeling for ETL Processes. In: Proc. of the 5th Intl. Workshop on Data Warehousing and OLAP (DOLAP 2002), McLean, USA (2002) 14–21
4. Trujillo, J., Luján-Mora, S.: A UML Based Approach for Modeling ETL Processes in Data Warehouses. In: Proc. of the 22nd Intl. Conf. on Conceptual Modeling (ER'03). Volume 2813 of LNCS, Chicago, USA (2003) 307–320
5. Vassiliadis, P., Simitsis, A., Skiadopoulos, S.: Modeling ETL Activities as Graphs. In: Proc. of the 4th Intl. Workshop on the Design and Management of Data Warehouses (DMDW'02), Toronto, Canada (2002) 52–61
6. Luján-Mora, S., Trujillo, J., Song, I.: Extending UML for Multidimensional Modeling. In: Proc. of the 5th Intl. Conf. on the Unified Modeling Language (UML'02). Volume 2460 of LNCS, Dresden, Germany (2002) 290–304
7. Luján-Mora, S., Trujillo, J., Song, I.: Multidimensional Modeling with UML Package Diagrams. In: Proc. of the 21st Intl. Conf. on Conceptual Modeling (ER'02). Volume 2503 of LNCS, Tampere, Finland (2002) 199–213
8. Luján-Mora, S., Trujillo, J.: A Comprehensive Method for Data Warehouse Design. In: Proc. of the 5th Intl. Workshop on Design and Management of Data Warehouses (DMDW'03), Berlin, Germany (2003) 1.1–1.14
9. Jarke, M., Lenzerini, M., Vassiliou, Y., Vassiliadis, P.: Fundamentals of Data Warehouses. 2nd edn. Springer-Verlag (2003)
10. Object Management Group (OMG): Unified Modeling Language Specification 1.4. Internet: http://www.omg.org/cgi-bin/doc?formal/01-09-67 (2001)
11. Lenzerini, M.: Data Integration: A Theoretical Perspective. In: Proc. of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, Madison, Wisconsin, USA (2002) 233–246


12. Bernstein, P., Levy, A., Pottinger, R.: A Vision for Management of Complex Models. Technical Report MSR-TR-2000-53, Microsoft Research (2000)
13. Bernstein, P., Rahm, E.: Data Warehouse Scenarios for Model Management. In: Proc. of the 19th Intl. Conf. on Conceptual Modeling (ER'00). Volume 1920 of LNCS, Salt Lake City, USA (2000) 1–15
14. Dobre, A., Hakimpour, F., Dittrich, K.R.: Operators and Classification for Data Mapping in Semantic Integration. In: Proc. of the 22nd Intl. Conf. on Conceptual Modeling (ER'03). Volume 2813 of LNCS, Chicago, USA (2003) 534–547
15. Falkenberg, E.: Concepts for modelling information. In: Proc. of the IFIP Conference on Modelling in Data Base Management Systems, Amsterdam, Holland (1976) 95–109
16. Embley, D., Kurtz, B., Woodfield, S.: Object-oriented Systems Analysis: A Model-Driven Approach. Prentice-Hall (1992)
17. Halpin, T., Bloesch, A.: Data modeling in UML and ORM: a comparison. Journal of Database Management 10 (1999) 4–13

Informational Scenarios for Data Warehouse Requirements Elicitation

Naveen Prakash1, Yogesh Singh2, and Anjana Gosain2

1 JIIT, A10, Sector 62, Noida 201307, India [emailprotected]
2 USIT, GGSIP University, Kashmere Gate, Delhi 110006, India [emailprotected], [emailprotected]

Abstract. We propose a requirements elicitation process for a data warehouse (DW) that identifies its information contents. These contents support the set of decisions that can be made. Thus, if the information needed to take every decision is elicited, then the total information determines the DW contents. We propose the informational scenario as the means to elicit the information for a decision. An informational scenario is written for each decision and is a sequence of pairs of the form <Query, Response>. A query requests information necessary to take a decision, and the response is the information itself. The set of responses for all decisions identifies the DW contents. We show that informational scenarios are merely another subclass of the class of scenarios.

1 Introduction

In the last decade, great interest has been shown in the development of data warehouses (DWs). We can look at data warehouse development at the design, the conceptual, and the requirements engineering levels. Two different approaches for the development of DWs have been proposed at the design level: the data-driven [9] and the requirements-driven [2,12,8,19] approaches. Given data needs, these approaches identify the logical structure of the DW. Jarke et al. [11] propose to add a conceptual layer on top of the logical layer. While they propose the basic notion of the conceptual layer, it is assumed that the conceptual objects represented in the Enterprise Model can be determined; the questions of which conceptual objects are useful for a DW and how they are to be determined are not addressed. Thus, the conceptual level does not take into account the larger context in which the DW is to function.

A relationship of the data warehouse to the organizational context is established at the requirements level. Fabio Rilson and Jaelson Freire [7] adapt traditional requirements engineering techniques to data warehouses. This approach starts with a Requirements Management Planning phase, for which the authors propose guidelines concerning the acquisition, documentation and control of selected requirements. The second phase covers a) Requirements Specification, which includes requirements elicitation through interviews, workshops,


prototyping and scenarios, b) Requirements Analysis and c) Requirements Documentation. In the third and fourth phases, requirements are conformed and validated respectively. The proposal of [7] is a “top” level proposal that builds a framework for DW requirements engineering. While providing pointers to RE approaches that may be applicable, this proposal does not establish their feasibility and does not consider any detailed technical solutions.

Boehnlein et al. [3] present a goal-driven approach that is based on the SOM (Semantic Object Model) process modeling technique. It starts with the determination of two kinds of goals: one specifies the products and services to be provided, whereas the other determines the extent to which the goal is to be pursued. Information requirements are derived by analyzing business processes in increasing detail and by transforming relevant data structures of business processes into data structures of the data warehouse. According to [19], since data warehouse systems are developed exclusively to support decision processes, a detailed business process analysis is not feasible, because the respective tasks are unique and often not structured. Moreover, knowledge workers sometimes refuse to disclose their processes in detail.

The proposal of [14] aims to identify the decisional information to be kept in the data warehouse. This process starts with the determination of the goals of an organization, uses these to arrive at its decision-making needs, and finally identifies the information needed for the decisions to be supported. The requirements engineering product is therefore a Goal-Decision-Information (GDI) diagram that uses two associations: 1) goal-decision coupling and 2) decision-information coupling. Whereas this proposal relates DW information contents to the decision-making capability embedded in organizational goals, it is not backed up by a requirements elicitation process.

In this paper, we look at the requirements elicitation process for arriving at the GDI diagram. The total process is a two-part one. In the first part, the goal-decision coupling is elicited; that is, the set of decisions that can fulfill the goals of an organization is elicited. Thereafter, in the second part, the decision-information coupling yields decisional information from the elicited decisions. Here, we deal with the second part of this process.

We base our proposal on the notion of scenarios [13,16,6,10,18]. A scenario has been considered as a typical interaction between a user and the system to be developed. We treat this as the generic notion of a scenario. This is shown in Fig. 1 as the root node of the scenario typology hierarchy. We refer to traditional scenarios as transactional scenarios, since they reveal the system functionality needed in the new system. We propose a second kind of scenario, Data Warehouse scenarios. In consonance with our two-part process, for goal-decision coupling we propose decisional scenarios and for decision-information coupling we postulate informational scenarios (see Fig. 1). As mentioned earlier, our interest in this paper is in informational scenarios.

Informational scenarios reveal the information contents of a system. An informational scenario represents a typical interaction between a decision-maker


Fig. 1. Scenario Typology.

and the decisional system. This interaction is a sequence of pairs <Q, R>, where Q represents the query input to the system by the decision-maker and R represents the response obtained. This response yields the information to be kept in the decisional system, the data warehouse. In the next section we present the GDI model. In Section 3, we define and illustrate informational scenarios. In Subsection 3.1, we position them in the 4-dimensional classification framework for scenarios. In Subsection 3.2, we show the elicitation of decisions from an informational scenario. The paper ends with a conclusion.

2 The GDI

The Goal-Decision-Information (GDI) model is shown in Fig. 2. In accordance with goal orientation [1,4], we view a goal as an aim or objective that is to be met. A goal is a passive concept: unlike an activity, process or event, it cannot perform or cause any action to be performed. A goal is set, and once so defined it needs an active component to realize it. The active component is the decision. Further, to take decisions, appropriate information is required.

Fig. 2. GDI Diagram.


As shown in Fig. 2, a goal can be either simple or complex. A simple goal cannot be decomposed into simpler ones. A complex goal is built out of other goals, which may themselves be simple or complex. This makes a goal hierarchy. The component goals of a complex one may be mandatory or optional.

A decision is a specification of an active component that causes goal fulfillment. It is not the active component itself: when a decision is selected for implementation, one or more actions may be performed to give effect to it. In other words, a decision is the intention to perform the actions that cause its implementation. Decision-making is an activity that results in the selection of the decision to be implemented. It is while performing this activity that the information needed to select the right decision is required. As shown in Fig. 2, a decision can be either simple or complex. A simple decision cannot be decomposed into simpler ones, whereas a complex decision is built out of other simple or complex decisions.

Fig. 2 shows that there is an association ‘is satisfied by’ between goals and decisions. This association identifies the decisions which, when taken, can lead to goal satisfaction. Knowledge necessary to take decisions is captured in the notion of decisional information, also shown in Fig. 2. This information is a specification of the data that will eventually be stored in the data warehouse. Fig. 2 also shows that there is an association ‘is required for’ between decisions and decisional information. This association identifies the decisional information required to take a decision.

An instance of the GDI diagram, the GDI schema, is shown in Fig. 4. It shows a goal hierarchy (solid lines between ‘Maximize profit’ and ‘increase the no. of customers’ and ‘increase sales’) and a decision hierarchy (solid lines between ‘improve the quality of the product’ and ‘introduce statistical quality control techniques’ and ‘use better quality material’) for a given set of goals and decisions. The figure shows the ‘is satisfied by’ relationship between the goal ‘increase sales’ and the decisions ‘open new sales counter’ and ‘improve the quality of the product’ by dashed lines. The ‘is required for’ relationship between decisions and associated information is shown by dotted lines.

The dynamics of the interaction between goals, decisions and information is shown in Fig. 3. A goal suggests a set of decisions that lead to its satisfaction. A decision can be taken after consulting the information relevant to it and available in the decisional system. In the reverse direction, information helps in selecting a decision, which in turn satisfies a goal. For example, the goal ‘increase sales’ suggests the decisions ‘improve the quality of the product’ and ‘open new sales counter’.

Fig. 3. The Interaction Cycle.


Fig. 4. A GDI Schema.

These decisions may modify the goal state. To implement the decision ‘open new sales counter’, the decision-maker consults the information ‘Existing product demand’ and ‘Existing service/customer centers’.
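The GDI schema admits a very direct relational reading, which can also be used when the elicited information is eventually stored. The following DDL sketch is only illustrative: the table and column names are hypothetical and are not part of the GDI model itself.

```sql
-- Hypothetical relational encoding of a GDI schema (names are illustrative).
CREATE TABLE goal (
  goal_id     INTEGER PRIMARY KEY,
  description VARCHAR(200) NOT NULL,
  parent_goal INTEGER REFERENCES goal(goal_id)    -- goal hierarchy (complex goals)
);

CREATE TABLE decision (
  dec_id      INTEGER PRIMARY KEY,
  description VARCHAR(200) NOT NULL,
  parent_dec  INTEGER REFERENCES decision(dec_id) -- decision hierarchy (complex decisions)
);

CREATE TABLE decisional_information (
  info_id     INTEGER PRIMARY KEY,
  description VARCHAR(200) NOT NULL
);

-- 'is satisfied by' association between goals and decisions
CREATE TABLE is_satisfied_by (
  goal_id INTEGER REFERENCES goal(goal_id),
  dec_id  INTEGER REFERENCES decision(dec_id),
  PRIMARY KEY (goal_id, dec_id)
);

-- 'is required for' association between decisional information and decisions
CREATE TABLE is_required_for (
  info_id INTEGER REFERENCES decisional_information(info_id),
  dec_id  INTEGER REFERENCES decision(dec_id),
  PRIMARY KEY (info_id, dec_id)
);
```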

3 Informational Scenario

In this section, we show the elicitation of decisional information. The decision-information coupling suggests that the information needed to select a decision can be obtained from knowledge of the decision itself. Thus, if we focus attention on a decision, then through a suitable elicitation mechanism we can obtain the information relevant to it. Our informational scenario is one such elicitation mechanism. It can be seen that the informational scenario is an expression of the ‘is required for’ relationship between a simple decision and information in the GDI diagram (see Fig. 2).

An informational scenario is a typical interaction between the decision-maker and the decisional system. An informational scenario is written for each simple decision of the GDI diagram, and is a sequence of pairs <Q, R>, where Q represents the query input to the decisional system and R represents the response of the decisional system. An informational scenario is thus of the form

<Q1, R1>, <Q2, R2>, ..., <Qn, Rn>.

The set of queries, Q1 through Qn, is an expression of the information relevant to the decision of the scenario. The information contents of the data warehouse can be derived from the set of responses, R1 through Rn. We represent a query Qi in SQL, and a response Ri is represented as a relation. Once a response has been received, it can be used in two ways: (a) the relation attributes identify the information type to be maintained in the warehouse, and


(b) the tuple values can suggest the formulation of supplementary queries to elicit additional information.

It is possible that all values in all tuples are non-null. In that case there is full knowledge of the data, and a certain supplementary query sequence follows from this. We refer to such a <Q, R> sequence as a normal scenario (see Fig. 6). In case a tuple contains a null value, this ‘normal’ sequence will not be followed, and the next query may be formulated to explore the null value more deeply. This breaks the normal thought process and results in a different sequence of <Q, R> pairs. We call this sequence an exceptional scenario. Fig. 5 shows these two types of informational scenarios.

Let us illustrate the notion of normal and exceptional scenarios. Let us be given the decision “Open new sales counter”. In order to make this decision, the decision-maker would like to know the units sold for different products at the various sales counters of each region. After all, a new sales counter makes sense in an area where (a) units sold are so high that existing counters are overloaded, or (b) units sold are very low and this could be merely due to the absence of a sales outlet. So, the first query is formulated to reveal this information: How many units of different products have been sold at various sales counters in each region? This query shows that Region, Sales counter, Product and Number of units sold must be made available in the data warehouse.

Select regions, sales counter, product, units sold
From sales, region

Let the response be as follows:

Regions   Sales Counter   Product   Units Sold
NR        Null            Radio     Null
NR        Null            TV        Null
SR        Lata            Radio     90
SR        Lata            TV        200
SR        Lata            Fridge    200
SR        Kanika          Radio     Null
SR        Kanika          TV        110
SR        Kanika          Fridge    110
ER        Rubina          Radio     80
ER        Rubina          TV        250
ER        Rubina          Fridge    230
CR        Null            Radio     Null
CR        Null            TV        Null

Suppose that the decision-maker is not interested in exploring the ‘null’ values for the moment. Instead, he wishes to see whether unsold stock exists in some large quantity; if so, then the opening of a sales counter might help in clearing the unsold stock. So, the decision-maker may ask for the number of units manufactured. If the manufactured quantity is not being sold, then he may think of opening a new sales counter in a particular region. This query and its response are shown in Fig. 7a. This results in the normal scenario shown in Fig. 6.
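Fig. 7a is not reproduced here; the following SQL sketch merely illustrates the kind of supplementary query it contains. The table and column names (production, units_manufactured, etc.) are hypothetical and only indicate the information types the response would expose.

```sql
-- Hypothetical supplementary query of the normal scenario (cf. Fig. 7a):
-- units manufactured per region and product, to be compared with units sold.
SELECT region, product, SUM(units_manufactured) AS units_manufactured
FROM   production
GROUP  BY region, product;
```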


Fig. 5. An Information Scenario.

Fig. 6. Normal and Exceptional Scenario.

Suppose instead that the decision-maker wishes to explore the ‘null’ values found in the sales counters of some regions. The reasoning is that if there are service centers in the regions NR and CR which are, in fact, servicing a number of company products, then there is sufficient demand in these regions; this may call for the opening of sales counters. This query and its response are shown in Fig. 7b. This query–response pair, and any further <Q, R> pairs following from it, constitute the exceptional scenario shown in Fig. 6.

If the response of Fig. 7b contains null values for the service centers of some region, the decision-maker may again wish to explore these null values. The reasoning now is that if there are neither sales counters nor service centers in region CR, then, to take the decision ‘open new sales counter’ in CR, he may ask for the number of sales counters of other companies manufacturing the same products. This query and its response are shown in Fig. 7c. This pair, and any further <Q, R> pairs following from it, constitute another exceptional scenario. This also shows that an exceptional scenario can lead to another exceptional scenario, and so on.
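Again, Figs. 7b and 7c are not reproduced here; the sketches below only suggest, under hypothetical table and column names, what the two exception-driven queries could look like.

```sql
-- Hypothetical query behind Fig. 7b: service centers and serviced units
-- in the regions whose sales counters were null (NR and CR).
SELECT region, service_center, product, units_serviced
FROM   service
WHERE  region IN ('NR', 'CR');

-- Hypothetical query behind Fig. 7c: competitors' sales counters for the
-- same products in region CR.
SELECT region, company, COUNT(sales_counter) AS num_counters
FROM   competitor_sales_counter
WHERE  region = 'CR'
GROUP  BY region, company;
```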

3.1 Positioning of Informational Scenario

Here we show that the informational scenario is a subclass of the class of scenarios in the 4-dimensional scenario classification framework proposed in [17].


Fig. 7a

Fig. 7b

Fig. 7c

The 4-dimensional framework considers scenarios along four different views, each view capturing a particular relevant aspect of the scenarios. The


Form view deals with the expression mode of a scenario. The Contents view concerns the kind of knowledge that is expressed in a scenario. The Purpose view captures the role that a scenario aims to play in the requirements engineering process. The Lifecycle view suggests considering scenarios as artifacts existing and evolving in time through the execution of operations during the requirements engineering process. A set of facets is associated with each view. Facets are viewpoints suitable for characterizing and classifying a scenario according to that view. A facet has a metric attached to it, and each facet is measured by a set of relevant attributes. Table 1 shows the views, facets, attributes and possible values of these attributes in the 4-dimensional framework, together with the attribute values that our informational scenario takes on.

Consider, for instance, the level of formalism attribute of the Form view. It takes on the value Formal because of the use of SQL and relations in the scenario expression. It is possible to express a scenario less formally by using a free format; were such a scenario to exist, its level of formalism would have the value Informal. The informational scenario proposed by us is thus characterized according to these four views.

3.2 Elicitation of Decisions

In this section we show that informational scenarios can help in eliciting decisions as well. These decisions are suggested by an analysis of the <Q, R> sequence of the scenario. Let us consider the decision “Open new sales counter” again, and let the decision-maker make the following query: What are the units sold for different products at various sales counters in each region? Let the response be as follows:

Regions   Sales Counter   Product   Units Sold
SR        Lata            Radio     30
SR        Lata            TV        100
SR        Lata            Fridge    90
SR        Kanika          Radio     25
SR        Kanika          TV        90
SR        Kanika          Fridge    90
ER        Rubina          Radio     40
ER        Rubina          TV        120
ER        Rubina          Fridge    100

The response shows that the number of units sold for the different products is very low. The decision-maker may therefore no longer be interested in continuing with the decision “open new sales counter”. Since the number of units sold is low, the decision-maker may now be interested in improving product sales. This leads to the elicitation of the new decision ‘Improve Product Sales’, and an informational scenario is now written out for this decision.


Thus it is possible to move in both directions in the decision-information coupling: an informational scenario is written for a given decision, which may lead to the elicitation of other decisions, which in turn lead to further informational scenarios.

4 Conclusions

Information systems/software engineering moved from early ‘code and fix’ approaches through design to requirements engineering; thus, considerable exploration of the problem space is performed before implementation. We can see the same evolution in DW engineering: as mentioned in the Introduction, attempts have been made to introduce the design and conceptual layers. This evolution has the same expectations as before, namely the development of systems that better fit organization needs and user requirements. Thus, we expect that today’s practice, where analysts understand DW use only after it has been implemented and used, will give way to a systematic approach satisfying the various stakeholders. Analysts will understand DW use partly through the argumentation and reasoning process of requirements engineering and partly through the use of the prototyping process model.

Just as traditional scenarios elicit the functional requirements of transactional systems, informational scenarios elicit the informational requirements of decisional systems. Both belong to the general class of scenarios and represent typical interactions between the user and the system to be developed. In traditional scenarios the interest is in functional interaction: if the user does this, then the system does that. In informational scenarios the interest is in obtaining information, and we have an information-seeking interaction: if I ask for this information, what do I get? Information may be missing or available, and depending on this the user may formulate other information-seeking interactions. We have used this to classify scenarios as normal or exceptional. Finally, informational scenarios may suggest new decisions, thus helping in decision elicitation.

We are working on framing guidelines for informational scenarios. Future work also concerns decision elicitation by exploiting the goal-decision coupling.

References

1. Anton, A.I.: Goal based requirements analysis. Proceedings of the 2nd International Conference on Requirements Engineering (ICRE’96) (1996) 136–144
2. Ballard, C., Herreman, D., Schau, D., Bell, R., Kim, E., Valencic, A.: Data Modeling Techniques for Data Warehousing. redbooks.ibm.com (1998)
3. Boehnlein, M., Ulbrich vom Ende, A.: Business Process Oriented Development of Data Warehouse Structures. Proceedings of Data Warehousing 2000, Physica Verlag (2000)
4. Bubenko, J., Rolland, C., Loucopoulos, P., De Antonellis, V.: Facilitating ‘fuzzy to formal’ requirements modelling. IEEE 1st Conference on Requirements Engineering (ICRE’94) (1994) 154–158


5. Cockburn, A.: Structuring use cases with goals. Technical report, Human and Technology, 7691 Dell Rd, Salt Lake City, UT 84121, HaT.TR.95.1 (1995)
6. Dano, B., Briand, H., Barbier, F.: A use case driven requirements engineering process. Third IEEE International Symposium on Requirements Engineering (RE’97), Annapolis, Maryland, IEEE Computer Society Press (1997)
7. Rilson, F., Freire, J.: DWARF: An Approach for Requirements Definition and Management of Data Warehouse Systems. Proceedings of the 11th IEEE International Requirements Engineering Conference, September 8–12 (2003) 1090–1099
8. Golfarelli, M., Maio, D., Rizzi, S.: Conceptual Design of Data Warehouses from E/R Schemes. Proceedings of the 31st HICSS, IEEE Press (1998)
9. Inmon, W.H.: Building the Data Warehouse. John Wiley and Sons (1996)
10. Jacobson, I.: The use case construct in object-oriented software engineering. In: Scenario-based design: envisioning work and technology in system development, J. M. Carroll (ed.), John Wiley and Sons (1995) 309–336
11. Jarke, M., Jeusfeld, A., Quix, C., Vassiliadis, P.: Architecture and Quality in Data Warehouses. Proceedings of the 10th CAiSE Conference (1998) 93–113
12. Poe, V.: Building a Data Warehouse for Decision Support. Prentice Hall (1996)
13. Potts, C., Takahashi, K., Anton, A.I.: Inquiry-based requirements analysis. IEEE Software 11(2) (1994) 21–23
14. Prakash, N., Gosain, A.: Requirements Engineering for Data Warehouse Development. Proceedings of the CAiSE’03 Forum (2003)
15. Bruckner, R.M., List, B.: Developing requirements for data warehouses using use cases. Seventh Americas Conference on Information Systems (2003)
16. Rolland, C., Souveyet, C., Achour, C.B.: Guiding goal modelling using scenarios. IEEE Transactions on Software Engineering, Special Issue on Scenario Management 24(12) (1998)
17. Rolland, C., Grosz, G., Kla, R.: A proposal for a scenario classification framework. Journal of Requirements Engineering (1998)
18. Rubin, K.S., Goldberg, A.: Object behavior analysis. Communications of the ACM 35(9) (1992) 48–62
19. Winter, R., Strauch, B.: A Method for Demand Driven Information Requirements Analysis in Data Warehouse Projects. Proceedings of the Hawaii International Conference on System Sciences, January 6–9 (2003)

Extending UML for Designing Secure Data Warehouses

Eduardo Fernández-Medina1, Juan Trujillo2, Rodolfo Villarroel3, and Mario Piattini1

1 Dep. Informática, Univ. Castilla-La Mancha, Spain {Eduardo.FdezMedina,Mario.Piattini}@uclm.es
2 Dept. Lenguajes y Sistemas Informáticos, Univ. Alicante, Spain [emailprotected]
3 Dept. Comput. e Informática, Univ. Católica del Maule, Chile [emailprotected]

Abstract. Data Warehouses (DW), Multidimensional (MD) Databases, and On-Line Analytical Processing applications are used as a very powerful mechanism for discovering crucial business information. Considering the extreme importance of the information managed by these kinds of applications, it is essential to specify security measures from the early stages of DW design in the MD modeling process, and to enforce them. In the past years, there have been some proposals for representing the main MD modeling properties at the conceptual level. Nevertheless, none of these proposals considers security measures as an important element in its model, so they do not allow us to specify confidentiality constraints to be enforced by the applications that will use these MD models. In this paper, we discuss the confidentiality problems regarding DWs and we present an extension of the Unified Modeling Language (UML) that allows us to specify the main security aspects in conceptual MD modeling, thereby allowing us to design secure DWs. Then, we show the benefit of our approach by applying this extension to a case study. Finally, we also sketch how to implement the security aspects considered in our conceptual modeling approach in a commercial DBMS.

Keywords: Secure data warehouses, UML extension, multidimensional modeling, OCL

1 Introduction

Multidimensional (MD) modeling is the foundation of Data Warehouses (DW), MD databases and On-Line Analytical Processing (OLAP) applications. These systems are used as a very powerful mechanism for discovering crucial business information in strategic decision-making processes. Considering the extreme importance of the information that a user can discover by using these kinds of applications, it is crucial to specify confidentiality measures in the MD modeling process, and to enforce them. On the other hand, information security is a serious requirement which must be carefully considered, not as an isolated aspect, but as an element present in all stages of the development lifecycle, from requirements analysis to implementation and maintenance [4, 6]. To achieve this goal, different ideas for integrating security in the system development process have been proposed [2, 8], but they consider information security only from a cryptographic point of view, without considering database and DW specific issues.

There are some proposals that try to integrate security into conceptual modeling. UMLSec [9], where UML is extended to develop secure systems, is probably the most


relevant one. This approach is very interesting, but it only deals with information systems (IS) in general, whilst conceptual database and DW design are not considered. A methodology and a set of models have recently been proposed [5] in order to design secure databases to be implemented with Oracle9i Label Security (OLS) [11]. This approach, based on the UML, is important because it considers security aspects in all stages of the development process, from requirements gathering to implementation. Together with this methodology, the proposed Object Security Constraint Language (OSCL) [14], based on the Object Constraint Language (OCL) [19] of UML, allows us to specify security constraints in the conceptual and logical database design process, and to implement these constraints in a concrete database management system (DBMS) such as OLS. Nevertheless, this methodology and its models do not consider the design of secure MD models for DWs.

In the literature, we can find several initiatives to include security in DWs [15, 16]. Many of them are focused on interesting aspects related to access control, multilevel security, its application to federated databases, applications using commercial tools, and so on. These initiatives address specific aspects that allow us to improve DW security in acquisition, storage, and access. However, none of them considers security aspects covering all stages of the system development cycle, nor security in conceptual MD modeling.

Regarding the conceptual modeling of DWs, various approaches have been proposed to represent the main MD properties at the conceptual level (due to space constraints, we refer the reader to [1] for a detailed comparison of the most relevant ones). These proposals provide their own non-standard graphical notations, and none of them has been widely accepted as a standard conceptual model for MD modeling. Recently, another approach [12, 18] has been proposed as an object-oriented conceptual MD modeling approach. This proposal is a profile of the UML [13], which uses its standard extension mechanisms (stereotypes, tagged values and constraints). However, none of these approaches considers security as an important issue in its conceptual model, so they do not solve the problem of security in DWs.

In this paper, we present an extension (profile) of the UML that allows us to represent the main security information of data and their constraints in MD modeling at the conceptual level. The proposed extension is based on the profile presented in [12] for conceptual MD modeling, because it allows us to consider the main MD modeling properties and it is based on the UML (so designers avoid learning a new specific notation or language). We consider the multilevel security model [17], focusing on read operations, because these are the most common operations for final user applications. This model allows us to classify both information and users into security classes and to enforce mandatory access control [17]. By using this approach, we are able to implement secure MD models with any commercial DBMS that can implement multilevel databases, such as OLS [11] or DB2 Universal Database (UDB) [3].

The remainder of this paper is structured as follows: Section 2 briefly summarizes the conceptual approach for MD modeling on which we base our work. Section 3 proposes the new UML extension for secure MD modeling. Section 4 presents a case study and applies our UML extension for secure MD modeling. Section 5 sketches some further implementation issues. Finally, Section 6 presents the main conclusions and introduces our immediate future work.


2 Object-Oriented Multidimensional Modeling

In this section, we outline our approach, based on the UML [12, 18], for DW conceptual modeling. This approach has been specified by means of a UML profile that contains the necessary stereotypes to represent all the main features of MD modeling at the conceptual level [7]. In this approach, structural properties are specified by a UML class diagram in which information is organized into facts and dimensions, represented by means of fact classes and dimension classes respectively. Fact classes are defined as composite classes in shared aggregation relationships with n dimension classes. The many-to-many relations between a fact and a specific dimension are specified by means of the multiplicity on the role of the corresponding dimension class. In our example in Fig. 1, we can see how the Sales fact class has a many-to-many relationship with the Product dimension.

A fact is composed of measures or fact attributes. By default, all measures are considered to be additive; for non-additive measures, additivity rules are defined as constraints. Moreover, derived measures can also be explicitly represented (marked with /) and their derivation rules are placed between braces near the fact class. Our approach also allows the definition of identifying attributes in the fact class (stereotype OID). In this way, degenerate dimensions can be considered [10], thereby representing other fact features in addition to the measures for analysis. For example, we could store the ticket number (ticket_number) as a degenerate dimension, as reflected in Fig. 1.

Fig. 1. Multidimensional modeling using the UML

Regarding dimensions, each level of a classification hierarchy is specified by a base class (stereotype Base). An association of base classes specifies the relationship between two levels of a classification hierarchy. These classes must define a Directed Acyclic Graph (DAG) rooted in the dimension class (DAG constraint). The DAG structure can represent both multiple and alternative path hierarchies. Every base class must also contain an identifying attribute (OID) and a descriptor attribute1 (stereotype D). These attributes are necessary for an automatic generation process into commercial OLAP tools, as these tools store this information in their metadata.

1 A descriptor attribute will be used as the default label in the data analysis in OLAP tools.


We can also consider non-strict hierarchies (an object at a hierarchy’s lower level belongs to more than one higher-level object) and complete hierarchies (all members belong to one higher-level object and that object consists of those members only). These characteristics are specified by means of the multiplicity of the roles of the associations and by defining the constraint {completeness} in the target associated class role, respectively. See the Store dimension in Fig. 1 for an example of all kinds of classification hierarchies. Lastly, the categorization of dimensions is considered by means of the generalization/specialization relationships of UML.
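Although the profile stays at the conceptual level, it may help to keep one possible relational reading of such a model in mind; Section 5 relies on a similar mapping when implementing MD models as tables. The following DDL sketch is purely illustrative, and all names are hypothetical rather than generated by the profile.

```sql
-- Hypothetical star-schema reading of a fact with a degenerate dimension
-- and a dimension with a classification hierarchy (names are illustrative).
CREATE TABLE product_dim (
  product_id  INTEGER PRIMARY KEY,
  name        VARCHAR(100),   -- descriptor attribute (stereotype D)
  category_id INTEGER         -- link to a base class of the hierarchy
);

CREATE TABLE sales_fact (
  ticket_number VARCHAR(20),  -- identifying attribute / degenerate dimension (OID)
  product_id    INTEGER REFERENCES product_dim(product_id),
  quantity      NUMERIC,      -- additive measure
  price         NUMERIC,      -- measure used in derivation rules
  PRIMARY KEY (ticket_number, product_id)
);
```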

3 UML Extension for Secure Multidimensional Modeling

The goal of this UML extension is to allow us to design MD conceptual models while classifying the information in order to define which properties users must own to be entitled to access the information. Therefore, we have to consider three main stages:

1. Defining precisely the organization of the users that will have access to the MD system. We can define a precise level of granularity by considering three ways of organizing users: security hierarchy levels (which indicate the clearance level of the user), user compartments (which indicate a horizontal classification of users), and user roles (which indicate a hierarchical organization of users according to their roles or responsibilities in the organization).

2. Classifying the information in the MD model. We can define the security information for each element of the model (fact class, dimension class, etc.) by using a tuple composed of a sequence of security levels, a set of user compartments, and a set of user roles. We can also specify security constraints considering this security information. This security information and these constraints indicate the properties that users must own to be able to access the information.

3. Enforcing the mandatory access control. The typical operations executed by final users in this type of system are query operations, so the mandatory access control has to be enforced for read operations, whose access control rule is as follows: a user can access a piece of information only if a) the security level of the user is greater than or equal to the security level of the information, b) all the user compartments defined for the information are owned by the user, and c) at least one of the user roles defined for the information is played by the user (a relational sketch of this rule is given below).

In this paper, we focus only on the second stage, by defining a UML extension that allows us to classify the security elements in a conceptual MD model and to specify security constraints. Furthermore, in Section 5, we sketch a prominent way to deal with the third stage by generating the needed structures in the target DBMS to enforce all security aspects represented in the conceptual MD model. Finally, let us point out that the first stage concerns security policies defined in the organization by managers, and it is out of the scope of this paper.

We define our UML extension for secure conceptual MD modeling following a schema composed of these elements: description, prerequisite extensions, stereotypes/tagged values, well-formedness rules, and comments. For the definition of the stereotypes, we consider a structure that is composed of a name, the base metaclass, the description, the tagged values and a list of constraints defined by means of OCL. For the definition of tagged values, the type of the tagged values, the multiplicity, the description, and the default value are defined.
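The following query is a minimal relational sketch of the read-access rule above, assuming hypothetical tables that store the user and information classifications (user_profile, user_compartment, user_role, info_compartment, info_role); it is not part of the UML extension itself.

```sql
-- A row of secure_info is visible to a user only if the three conditions of
-- the mandatory access-control rule hold (hypothetical schema; security
-- levels are assumed to be encoded as ordered integers, and at least one
-- role is assumed to be defined for each piece of information).
SELECT i.*
FROM   secure_info i
JOIN   user_profile u ON u.user_id = 42          -- 42 stands for the requesting user
WHERE  u.security_level >= i.security_level      -- a) level dominance
  AND  NOT EXISTS (                              -- b) user owns every compartment of the information
         SELECT 1 FROM info_compartment ic
         WHERE  ic.info_id = i.info_id
           AND  ic.compartment NOT IN (
                  SELECT uc.compartment FROM user_compartment uc
                  WHERE  uc.user_id = u.user_id))
  AND  EXISTS (                                  -- c) user plays at least one required role
         SELECT 1
         FROM   info_role ir
         JOIN   user_role ur ON ur.role = ir.role
         WHERE  ir.info_id = i.info_id
           AND  ur.user_id = u.user_id);
```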


3.1 Description

This UML extension reuses a set of stereotypes previously defined in [12], and defines new tagged values, stereotypes, and constraints, which enable us to define secure MD models. The 20 tagged values we have defined are applied to certain components that are particular to MD modeling, allowing us to represent them in the same model and in the same diagrams that describe the rest of the system. These tagged values represent the sensitive information of the different elements of MD modeling (fact class, dimension class, etc.), and they allow us to specify security constraints depending on this security information and on the value of certain attributes of the model. The stereotypes help us identify a special class that defines the profile of the system users. A set of inherent constraints is specified in order to define well-formedness rules. The correct use of our extension is assured by the definition of constraints in both natural language and OCL [19].

Fig. 2. Extension of the UML with stereotypes

Thus, we have defined 7 new stereotypes: one specializes the Class model element, two specialize the Primitive model element, and four specialize the Enumeration model element. In Fig. 2, we have represented portions of the UML metamodel2 to show where our stereotypes fit. We have only represented the specialization hierarchies, as the most important fact about a stereotype is the base class that the stereotype specializes. In these figures, new stereotypes are colored in dark grey, whereas stereotypes we reuse from our previous profile [12] are in light grey and classes from the UML metamodel remain white.

2 All the metaclasses come from the Core Package, a subpackage of the Foundation Package. We base our extension on UML 1.5 as this is the currently accepted standard; to the best of our knowledge, UML 2.0 is not the final accepted standard yet.

3.2 Prerequisite Extensions

This UML profile reuses stereotypes previously defined in another UML profile [12]. This profile provided the stereotypes, tagged values and constraints needed to accomplish


the MD modeling properly, allowing us to represent the main MD properties at the conceptual level. To facilitate the comprehension of the UML profile we present and use in this paper, we provide a brief description of these stereotypes in Table 1.

3.3 Datatypes

First of all, we need to define some new data types to be used in our tagged value definitions. The type Level (Fig. 3 (a)) is the ordered enumeration composed of all the security levels that have been considered (these values are typically unclassified, confidential, secret and topSecret, but they could be different). The type Levels (Fig. 3 (b)) is an interval of levels composed of a lower level and an upper level. The type Role (Fig. 3 (c)) represents the hierarchy of user roles that can be defined for the organization, and the type Roles is a set of role trees or subtrees. The type Compartment (Fig. 3 (d)) is the enumeration composed of all the user compartments that have been considered for the organization, and the type Compartments is a set of user compartments. The type Privilege (Fig. 3 (e)) is an ordered enumeration composed of all the different privileges that have been considered (typically read, insert, delete, update, and all). The type Attempt (Fig. 3 (f)) is an ordered enumeration composed of all the different access attempts that have been considered (typically none, all, frustratedAttempt and successfulAccess, but they could be different).

Fig. 3. New Data types

In Fig. 2 we can see the base classes from which these new stereotypes are specialized. All the information carried by these new stereotypes has to be defined for each


MD model, depending on its confidentiality properties and on the number of users and the complexity of the organization in which the MD model will be operative. Finally, we need some syntactic definitions that are not considered in standard OCL; in particular, we need the new collection type Tree with its typical operations.

3.4 Tagged Values

In this section, we provide the definition of several tagged values for the model, classes, attributes, instances and constraints.


Table 2 shows the tagged values of all the elements in this extension. All default values of the security tagged values of the model are empty collections. On the other hand, the default value of the security tagged values for each class is the least restrictive one (the lowest security level, the whole security role hierarchy that has been defined for the model, and the empty set of compartments). The default value of the security tagged values for attributes is inherited from the class to which they belong.

If we need to specify that accesses to the information of a class have to be recorded in a log file for future audit, we should use the LogType and LogCond tagged values together in that class. By default, the value of LogType is none, so no audit is performed by default. On the other hand, if we need to specify a security constraint, we can use OCL and the InvolvedClasses tagged value to specify in which situations the constraint has to be enforced; by default, the value of this tagged value is the class to which the constraint is associated. Finally, if we need to specify a special security constraint in which a user or group of users (depending on a condition) can or cannot access the corresponding class, independently of the security information of that class, we should use exceptions together with the following tagged values: InvolvedClasses, ExceptSign, ExceptPrivilege and ExceptCond. The default value of InvolvedClasses is the class itself, the default value of ExceptSign is +, and that of ExceptPrivilege is Read.

3.5 Stereotypes

By using all these tagged values, we can specify security constraints on an MD model depending on the values of attributes and tagged values. In this extension, we need to define one stereotype in order to specify other types of security constraints (Table 3). The stereotype UserProfile can be necessary to specify constraints that depend on particular information about a user or a group of users, e.g., citizenship, age, etc. The previously defined data types and tagged values are then used on the fact, dimension and base stereotypes in order to consider other security aspects.

3.6 Well-Formedness Rules

We can identify some well-formedness rules and specify them in both natural language and OCL constraints. These rules are grouped in Table 4.


3.7 Comments

Many of the previous constraints are very intuitive, although we have to ensure their fulfillment; otherwise the system can become inconsistent. Moreover, the designer can specify security constraints with OCL. If the security information of a class or an attribute depends on the value of an instance attribute, it can be specified as an OCL expression (Fig. 4). Normally, security constraints defined for stereotypes of classes (fact, dimension and base) are defined by using a UML note attached to the corresponding class instance. We do not impose any restriction on the content of these notes, in order to allow the designer the greatest flexibility, other than those imposed by the


tagged value definitions. The connection between a note and the element it applies to is shown by a dashed line without an arrowhead, as this is not a dependency [13].

4 A Case Study Applying Our Extension for Secure MD Modeling

In this section, we apply our extension to the conceptual design of a secure DW in the context of a reduced health-care system. The simplified hierarchy of the system user roles is as follows: HospitalEmployees are classified into health and non-health users; health users can be Doctors or Nurses, and non-health users can be Maintenance or Administrative. The defined security levels are unclassified, secret and topSecret.

1. Fig. 4 shows an MD model that includes a fact class (Admission), two dimensions (Diagnosis and Patient), two base classes (Diagnosis_group and City), and a class (UserProfile). The UserProfile class -stereotype UserProfile- contains the information of all users who will have access to this MD model. The Admission fact class -stereotype Fact- contains all individual admissions of patients in one or more hospitals, and can be accessed by all users who have secret or topSecret security levels -tagged value SecurityLevels (SL) of classes- and play health or administrative roles -tagged value SecurityRoles (SR) of classes-. Note that the cost attribute can only be accessed by users who play the administrative role -tagged value SR of attributes-. The Patient dimension contains the information of hospital patients, and can be accessed by all users who have secret security level -tagged value SL- and play health or administrative roles -tagged value SR-. The Address attribute can only be accessed by users who play the administrative role -tagged value SR of attributes-. The City base class contains the information of cities and allows us to group patients by city; cities can be accessed by all users who have confidential security level -tagged value SL-. The Diagnosis dimension contains the information of each diagnosis, and can be accessed by users who play the health role -tagged value SR- and have secret security level -tagged value SL-. Finally, Diagnosis_group contains a set of general groups of diagnoses; diagnosis groups can be accessed by all users who have confidential security level -tagged value SL-.

Several security constraints have been specified by using the previously defined constraints, stereotypes and tagged values (the number of each of the following numbered paragraphs corresponds to the number of the corresponding note in Fig. 4):

2. The security level of each instance of Admission is defined by a security constraint specified in the model. If the value of the description attribute of the Diagnosis_group to which the diagnosis related to the Admission belongs is cancer or AIDS, the security level -tagged value SL- of this admission will be topSecret, otherwise secret. This constraint is only applied if the user issues a query whose information comes from the Diagnosis dimension or the Diagnosis_group base class together with the Patient dimension -tagged value involvedClasses-.

3. The security level -tagged value SL- of each instance of Admission can also depend on the value of the cost attribute, which indicates the price of the admission service. In this case, the constraint is only applicable to queries that contain information of the Patient dimension -tagged value involvedClasses-.

4. The tagged value logType has been defined for the Admission class, specifying the value frustratedAttempts. This tagged value specifies that the system has to record, for future audit, the situations in which a user tries to access information of this fact class and the system denies it because of lack of permissions.


5. For confidentiality reasons, we could deny access to admission information to users whose working area is different from the area of a particular admission instance. This is specified by another exception in the Admission fact class, using the tagged values involvedClasses, exceptSign and exceptCond. If patients are special users of the system, they could access their own information as patients (e.g., for querying their personal data); this constraint is specified by using the exceptSign and exceptCond tagged values in the Patient class.

Fig. 4. Example of multidimensional model with security information and constraints3

5 Implementation

Oracle9i Label Security [11] allows us to implement multilevel databases. It defines labels, assigned to the rows and users of a database, that contain confidentiality information and authorization information for rows and users respectively. Moreover, OLS allows us to specify labeling functions and predicates that are triggered when an operation is executed and that define the value of the security labels. A secure MD model can therefore be implemented with OLS. The two main security elements that we include in this UML extension are the confidentiality information of data and the security constraints. The basic concepts of an MD model (fact, dimension and base classes) are implemented as tables in a relational database. The security information

3 Version 2 of OCL considers a special syntax for enumerations (EnumTypeName::EnumLiteralValue), but in this example, for the sake of readability, we use only EnumLiteralValue.


of the MD model can be implemented by the security labels of OLS, and the security constraints can be implemented by the labeling functions and predicates of OLS. For instance, we could consider the table Admission with the columns CodeAdmission, Type, Cost, CodeDiagnosis and PatientSSN. This table will have a special column to store the security label of each instance. For each instance, this label will contain the security information that has been specified in Fig. 4 (Security Level = Secret..TopSecret; SecurityRoles = Health, Admin). However, this security information depends on several security constraints that can be implemented by labeling functions. Table 5(1) shows an example in which we implement the security constraint labeled with number 2 in Fig. 4: if the value of the Cost column is greater than 10000, then the security label will be composed of the TopSecret security level and the Health and Admin user roles; otherwise the security label will be composed of the Secret security level and the same user roles. Table 5(2) shows how to link this labeling function to the Admission table.
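Since Table 5 is not reproduced here, the following plain-SQL sketch only illustrates the labeling logic just described; it deliberately avoids the concrete OLS labeling-function API, and the label strings shown are assumptions for illustration only.

```sql
-- Not the actual OLS syntax: a plain-SQL rendering of the labeling logic of
-- Table 5(1). In OLS this logic would live inside a labeling function
-- attached to the policy protecting the Admission table.
SELECT a.CodeAdmission,
       CASE
         WHEN a.Cost > 10000 THEN 'TS:HEALTH,ADMIN'  -- TopSecret, roles Health and Admin
         ELSE                     'S:HEALTH,ADMIN'   -- Secret, same roles
       END AS admission_label
FROM   Admission a;
```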

6 Conclusions and Future Work

In this paper, we have presented an extension of the UML that allows us to represent the main security aspects in the conceptual modeling of data warehouses. This extension contains the stereotypes, tagged values and constraints needed for a complete and powerful secure MD modeling approach. These new elements allow us to specify security aspects, such as security levels on data, compartments and user roles, on the main elements of an MD model, such as facts, dimensions and classification hierarchies. We have used the OCL to specify the constraints attached to these newly defined elements, thereby avoiding an arbitrary use of them. We have also sketched how to implement a secure MD model designed with our approach in a commercial DBMS. The most relevant advantage of this approach is that it uses the UML, a widely accepted object-oriented modeling language, which saves developers from learning a new model and its corresponding notations for specific MD modeling. Furthermore, the UML allows us to represent some MD properties that are hardly considered by other conceptual MD proposals.

Our immediate future work is to extend the implementation issues presented in this paper to allow us to use the considered security aspects when querying an MD model from OLAP tools. Moreover, we also plan to extend the set of privileges considered in this paper to allow us to specify security aspects in the ETL processes for DWs.

Acknowledgements This research is part of the CALIPO and RETISTIC projects, supported by the Dirección General de Investigación of the Ministerio de Ciencia y Tecnología.


References

1. Abelló, A., Samos, J., and Saltor, F., A Framework for the Classification and Description of Multidimensional Data Models. 12th International Conference on Database and Expert Systems Applications, LNCS 2113, 2001: pp. 668-677.
2. Chung, L., Nixon, B., Yu, E., and Mylopoulos, J., Non-functional Requirements in Software Engineering. 2000, Boston/Dordrecht/London: Kluwer Academic Publishers.
3. Cota, S., For Certain Eyes Only. DB2 Magazine, 2004. 9(1): pp. 40-45.
4. Devanbu, P. and Stubblebine, S., Software Engineering for Security: a Roadmap, in The Future of Software Engineering, Finkelstein, A., Editor. 2000, ACM Press. pp. 227-239.
5. Fernández-Medina, E. and Piattini, M., Designing Secure Databases for OLS, in Database and Expert Systems Applications: 14th International Conference (DEXA 2003), Marik, V., Retschitzegger, W., and Stepankova, O., Editors. 2003, Springer, LNCS 2736: Prague, Czech Republic. pp. 886-895.
6. Ferrari, E. and Thuraisingham, B., Secure Database Systems, in Advanced Database Technology and Design, Piattini, M. and Díaz, O., Editors. 2000, Artech House: London.
7. Gogolla, M. and Henderson-Sellers, B., Analysis of UML Stereotypes within the UML Metamodel, in UML'02, Springer, LNCS 2460: Dresden, Germany. pp. 84-99.
8. Hall, A. and Chapman, R., Correctness by Construction: Developing a Commercial Secure System. IEEE Software, 2002. 19(1): pp. 18-25.
9. Jürjens, J., UMLsec: Extending UML for Secure Systems Development, in UML'02, Springer, LNCS 2460: Dresden, Germany. pp. 412-425.
10. Kimball, R., The Data Warehousing Toolkit. 2nd edn. 1996: John Wiley.
11. Levinger, J., Oracle Label Security Administrator's Guide, Release 2 (9.2). 2002. http://www.csis.gvsu.edu/GeneralInfo/Oracle/network.920/a96578.pdf
12. Luján-Mora, S., Trujillo, J., and Song, I.Y., Extending the UML for Multidimensional Modeling, in UML'02, Springer, LNCS 2460: Dresden, Germany. pp. 290-304.
13. OMG, Object Management Group: Unified Modeling Language Specification 1.5. 2004.
14. Piattini, M. and Fernández-Medina, E., Specification of Security Constraints in UML, in 35th Annual 2001 IEEE International Carnahan Conference on Security Technology: London. pp. 163-171.
15. Priebe, T. and Pernul, G., Towards OLAP Security Design - Survey and Research Issues, in 3rd ACM International Workshop on Data Warehousing and OLAP (DOLAP'00): Washington DC, USA. pp. 33-40.
16. Rosenthal, A. and Sciore, E., View Security as the Basis for Data Warehouse Security, in 2nd International Workshop on Design and Management of Data Warehouses (DMDW'00): Sweden. pp. 8.1-8.8.
17. Samarati, P. and De Capitani di Vimercati, S., Access Control: Policies, Models, and Mechanisms, in Foundations of Security Analysis and Design, Focardi, R. and Gorrieri, R., Editors. 2000, Springer: Bertinoro, Italy. pp. 137-196.
18. Trujillo, J., Palomar, M., Gómez, J., and Song, I.Y., Designing Data Warehouses with OO Conceptual Models. IEEE Computer, special issue on Data Warehouses, 2001(34): pp. 66-75.
19. Warmer, J. and Kleppe, A., The Object Constraint Language, Second Edition: Getting Your Models Ready for MDA. 2003: Addison-Wesley.

Data Integration with Preferences Among Sources*

Gianluigi Greco¹ and Domenico Lembo²

¹ Dip. di Matematica, Università della Calabria, Italy
² Dip. di Informatica e Sistemistica, Università di Roma “La Sapienza”, Italy

Abstract. Data integration systems are today a key technological infrastructure for managing the enormous amount of information that is increasingly distributed over many data sources, often stored in heterogeneous formats. Several approaches providing transparent access to the data by means of suitable query answering strategies have been proposed in the literature. These approaches often assume that all the sources have the same level of reliability and that there is no need to prefer values “extracted” from a given source. This is mainly due to the difficulty of properly translating and reformulating source preferences in terms of properties expressed over the global view supplied by the data integration system. Nonetheless, preferences are very important auxiliary information that can be profitably exploited to refine the way in which integration is carried out. In this paper we tackle the above difficulties and propose a formal framework for both specifying and reasoning with preferences among the sources. The semantics of the system is restated in terms of preferred answers to user queries, and the computational complexity of identifying these answers is investigated as well.

1 Introduction

The enormous amount of information that is increasingly distributed over many data sources, often stored in heterogeneous formats, has boosted in recent years the interest in data integration systems. Roughly speaking, a data integration system offers transparent access to the data by providing users with the so-called global schema, which they can query in order to extract data relevant to their aims. The system is then in charge of accessing each source separately and combining the local results into the global answer. The means that the system exploits to answer users’ queries is the mapping, which specifies the relationship between the sources and the global schema [16]. However, data at the sources may turn out to be mutually inconsistent with respect to the integrity constraints specified on the global schema in order to

* This work has been partially supported by the European Commission FET Programme Project IST-2002-33570 INFOMIX.

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 231–244, 2004. © Springer-Verlag Berlin Heidelberg 2004


enhance its expressiveness. To remedy this problem, several papers (see, e.g., [3,6,4,11]) have proposed to handle the inconsistency by suitably “repairing” retrieved data. Basically, such papers extend to data integration systems previous studies focused on a single inconsistent database or on the merging of mutually inconsistent databases into a single consistent theory [2,14,17]. Intuitively, one aspect deserving particular care, which characterizes the inconsistency problem in data integration with respect to the latter works, is the presence of the mapping relating data stored at the sources with the elements of the (virtual) global schema, over which the constraints of interest for the integration application are issued. Here, the suitability of a possible repair depends on the underlying semantic assumptions adopted for the mapping and on the type of constraints on the global schema.

Roughly speaking, the assumptions for the mapping provide the means for interpreting data at the sources with respect to the intended extension of the global schema. In this respect, mappings are in general considered sound, i.e., the data that can be retrieved from the sources through the mapping are assumed to be a subset of the intended data of the corresponding global elements [16]. This is, for example, the mapping interpretation adopted in [3,6,4], where soundness is exploited for constructing those database extensions for the global schema that are enforced by the data stored at the sources and the mapping. Since the obtained global databases may be inconsistent with respect to the global constraints, suitable repairs (basically deletions and additions of tuples) are performed to restore consistency.

None of the above-mentioned works takes preference criteria into account when trying to solve inconsistencies among data sources. We could say that they implicitly assume that all the sources have the same level of reliability, and that there is no reason for preferring values coming from one source over data retrieved from another source. On the other hand, in practical applications it often happens that some sources are known to be more reliable than others, thus determining some potentially useful criteria for establishing the suitability of a repair. In other words, besides the semantic assumption on the mapping, preference criteria expressed among sources should also be taken into account when solving inconsistency.

Despite the wide interest in this field, little effort has been devoted to enriching the data integration setting with qualitative or quantitative descriptions of the sources. The first (and almost isolated) attempt is [18], where the authors introduce two parameters for characterizing each source: the soundness, which is used for assessing the confidence we can place in the answers provided by the source, and the completeness, which is used for measuring how much relevant information is stored in the source. However, the proposed framework does not fit the requirements of typical data integration systems, since it does not admit constraints over the global schema, and since it is only focused on the consistency problem, i.e., determining whether a global database exists that is consistent with all the claims of soundness and completeness of individual sources.

Other works (see, e.g., [14,10,15]) deal instead with special cases, where preferences are defined among repairs of a single database, and, hence, they do


not capture the many facets of the data integration setting. In other words, such approaches do not tackle inconsistency in the presence of a mapping between the database schema, which has to be kept consistent, and information sources that provide possibly inconsistent data. This is instead the challenging setting when tackling inconsistency in data integration in the presence of source preferences, which calls for suitable translations and reformulations, in which preferences between sources are mapped into preferences between repairs.

In this paper, we face this problem by proposing a formal framework for both specifying and reasoning on preferences among sources. Specifically, the main contributions of this paper are the following.

We introduce a new semantics which is based on repairing the data stored at the sources in the presence of global inconsistency, rather than repairing global database instances constructed according to the mapping. This approach is essentially a form of abductive reasoning [19], since it directly resolves the conflicts by isolating their causes at the sources. This part is described in Section 3.

We show that our novel repair semantics allows us to properly take into account source preferences. Following the extensive literature (see, e.g., [8,7] and the references therein) from the database community and from prioritized logics, logic programming, and decision theory, we exploit two different approaches for specifying preferences among sources. Specifically, we consider unary and binary constraints for defining quantitative properties of, and relationships between, sources, respectively.

We show how preferences expressed over the sources can be exploited for refining the way queries are answered in data integration systems. To this aim, we introduce the concept of strongly preferred answers, characterizing the answers that can be obtained after the system is repaired according to users’ preferences. We also investigate a weaker semantics that looks for weakly preferred answers, i.e., answers that are as close as possible to any strongly preferred one. This part and the above one are described in Section 4.

Finally, the computational complexity of computing both strongly and weakly preferred answers is studied, by considering the most common integrity constraints that can be issued on relational databases. We show that computing strongly preferred answers is co-NP-complete, and hence it is as difficult as computing answers without additional constraints [5]. However, when turning to the weak semantics, we evidence a small increase in complexity that does not lift the problem to higher levels of the polynomial hierarchy. Indeed, the problem is complete for a class that stays within the polynomial-time closure of NP, whose precise characterization is given in Section 5, where computational complexity is treated.

2 Relational Databases

In this section we recall the basic notions of the relational model with integrity constraints. For further background on relational database theory, we refer the reader to [1].


We assume a (possibly infinite) fixed database domain whose elements can be referenced by constants under the unique name assumption, i.e., different constants denote different objects. A relational schema is a pair where: is a set of relation symbols, each with an associated arity that indicates the number of its attributes, and is a set of integrity constraints, i.e., assertions that have to be satisfied by each database instance. We deal with quantified constraints [1], i.e., first-order formulas of the form:

where and are positive literals, are built-in literals, is a list of distinct variables, and is a list of variables occurring in only. Notice that classical constraints issued on a relational schema, such as functional, exclusion, or inclusion dependencies, can be expressed in the form (1). Furthermore, they are also typical of conceptual modelling languages.

A database instance (or simply database) for a schema is a set of facts of the form where is a relation of arity in and is a tuple of constants of We denote by the set A database for a schema is said to be consistent with if it satisfies (in the first-order logic sense) all the constraints expressed on

A relational query (or simply query) over is a formula that is intended to extract tuples of elements of We assume that queries over are Unions of Conjunctive Queries (UCQs), i.e., formulas of the form where, for each is a conjunction of atoms whose predicate symbols are in and involve and where is the arity of the query, and each and each is either a variable or a constant of is called the head of Given a database for the answer to a UCQ over denoted is the set of tuples of constants such that, when substituting each with the formula evaluates to true in
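Since the displayed formulas are missing from this copy, the following LaTeX block sketches the shapes that the prose describes — a quantified constraint of the kind referred to as form (1) and a union of conjunctive queries. The symbol names and the exact arrangement are our reconstruction and may differ from the original in detail.

```latex
% Schematic rendering, reconstructed from the prose; symbol names are ours.
\forall \vec{x}\,\bigl( A_1(\vec{x}) \land \dots \land A_n(\vec{x})
    \;\rightarrow\; \exists \vec{z}\, B_1(\vec{x},\vec{z}) \lor \dots \lor
    B_m(\vec{x},\vec{z}) \lor \varphi_1 \lor \dots \lor \varphi_k \bigr)
\qquad (1)

% A union of conjunctive queries (UCQ) of arity n, with head q(\vec{x}):
q(\vec{x}) \;\leftarrow\; \exists\vec{y}_1\, \mathit{conj}_1(\vec{x},\vec{y}_1)
    \;\lor\; \dots \;\lor\; \exists\vec{y}_r\, \mathit{conj}_r(\vec{x},\vec{y}_r)
```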

3 Data Integration Systems

Framework. A data integration system is a triple where is the global (relational) schema of the form is the source (relational) schema of the form i.e., there are no integrity constraints on the sources, and is the mapping between and We assume that the mapping is specified in the global-as-view (GAV) approach [16], where every relation of the global schema is associated with a view, i.e., a query, over the source schema. Therefore, is a set of UCQs expressed over where the predicate symbol in the head is a relation symbol of

Example 1. Consider the data integration system where the global schema consists of the relation predicates employee(Name, Dept) and

boss(Employee, Manager). The associated set of constraints contains the following two assertions (quantifiers are omitted),

stating that managers are never employees. The source schema comprises the relation symbols and the mapping contains the following UCQs

We call any database for the source schema a source database for Based on we specify the semantics of which is given in terms of database instances for called global databases for In particular, we construct a global database by evaluating each view in the mapping over Such a database is called the retrieved global database, and is denoted by

Example 1 (contd.). Let be a source database. Then, the evaluation of each view in the mapping over is

In general, the retrieved global database is not the only database that we consider to specify the semantics of w.r.t. but we account for all global databases that contain This means considering sound mappings: data retrieved from the sources by the mapping views are assumed to be a subset of the data that satisfy the corresponding global relation. This is a classical assumption in data integration, where sources in general do not provide all the intended extensions of the global schema; hence, extracted data are to be considered sound but not necessarily complete. Next, we formalize the notion of mapping satisfaction.

Definition 1. Given a data integration system and a source database for a global database for satisfies the mapping w.r.t. if

Notice that databases that satisfy the mapping might be inconsistent with respect to the dependencies in since data stored in local and autonomous sources are not in general required to satisfy constraints expressed on the global schema. Furthermore, cases might arise in which no global database exists that satisfies both the mapping and the constraints over (for example, when a key dependency on is violated by data retrieved from the sources). On the other hand, constraints issued over the global schema must be satisfied by those global databases that we want to consider “legal” for the system [16].

Repairing global databases. In order to solve inconsistency, several approaches have recently been proposed, in which the semantics of a data integration system is given in terms of the repairs of the global databases that the mapping forces to be in the semantics of the system [3,5,4]. Such papers extend to data integration previous proposals given in the field of inconsistent


databases [12,2,14], by considering a sound interpretation of the mapping. In this context, repairs are obtained by means of additions and deletions of tuples over the inconsistent database. Modifications are performed according to minimality criteria that are specific to each approach. Analogously, works on inconsistency in data integration basically propose to properly repair the global databases that satisfy the mapping in order to make them satisfy the constraints on the global schema. In this respect, we point out that [3,4] consider local-as-view (LAV) mappings, where, in contrast to the GAV approach, each source relation is associated with a query over the global schema. In such papers, the notion of retrieved global database is replaced by the notion of the minimal global databases that can be constructed according to the mapping specification and the data stored at the sources. Then, a global database satisfies the mapping if it contains at least one minimal global database. Repairs computed in [3,5,4] are in general global databases that do not satisfy the mapping. Furthermore, they cannot always be retrieved through the mapping from a source database instance. According to [16], we could say that in these approaches constraints are considered strong, whereas the mapping is considered soft.

Example 2. Consider the simple situation in which the global schema of a data integration system contains two relation symbols and both of arity 1, that are mutually disjoint, i.e., the constraint is issued over Assume that the mapping comprises the queries and where is a unary source relation symbol. Let be the source database for Then, is inconsistent w.r.t. the global constraint. In this case, the above-mentioned approaches propose to repair by eliminating from each database satisfying the mapping either or (but not both), thus producing in the two cases two different classes of global databases that are in the semantics of the system. Notice, however, that each global database that contains only or only does not satisfy the soundness of the mapping, and cannot be retrieved from any source database for

Even if effective for repairing global database instances in the presence of inconsistency, the above approaches do not seem appropriate when preferences specified over the sources should be taken into account for solving inconsistency. Indeed, in these cases, one would prefer, for example, to drop tuples coming from less reliable source relations rather than considering all possible repairs to be at the same level of reliability. Nonetheless, it is not always easy to understand how preferences over tuples stored at the sources could be mapped onto preferences over tuples of the global schema.

Example 3. Consider, for example, the simple data integration system in which the mapping contains the query and a constraint stating that the first component of the global relation is the key of Assume we have the source database and we know that source relation is more reliable than source relation Then, violates the key constraint on and it seems


reasonable to prefer dropping the fact in order to guarantee consistency, rather than according to the source preferences. However, we do not have a preference specified between these two global facts that would allow us to adopt this choice.

The above example shows that we need some mechanism to infer preferences over tuples of the global schema starting from preferences at the sources. On the other hand, it is not always obvious or easy to define such a mechanism. A different solution is to move the focus, when repairing, from tuples of the global schema to tuples of the sources, i.e., to minimally modify the source database. In this way, we can compare two repairs (at the sources) on the basis of the preferences established over the source relations.

Repairing the sources. The idea at the basis of our approach consists in finding the proper set of facts at the sources that imply as a consequence a global database that satisfies the integrity constraints. Basically, such a way of proceeding is a form of abductive reasoning [19]. Notice also that, according to this approach, we consider both the mapping and the constraints “strong”, i.e., we take into account only global databases that satisfy both the mapping and the constraints on the global schema. Furthermore, each global database that we consider can be computed by means of the mapping from a suitable source database. Let us now precisely characterize the ideas informally described above.

Definition 2. Given a data integration system and a source database for is satisfiable w.r.t. if there exists a global database for such that

where

satisfies w.r.t. and satisfies the mapping

We next introduce a partial order between source databases for which the system is satisfiable.

Definition 3. Given a data integration system where Given two source databases for such that is satisfiable w.r.t. and Then, we say that if Furthermore, if and does not hold. We say that a source database is minimal w.r.t. if there does not exist such that Furthermore, we indicate with the set of source databases that are minimal w.r.t.

Example 1 (contd.). The retrieved global database violates the constraints on the global schema, as witnessed by the facts employee(Mary, D1) and boss(John, Mary), for which Mary is both an employee and a manager. Therefore, is not satisfiable w.r.t. Then, where and

We are now able to define the semantics of a data integration system.


Definition 4. Given a data integration system and given a source database for a global database w.r.t. if satisfies satisfies the mapping i.e., there exists

w.r.t. a minimal source database such that

where is legal for

w.r.t.

The set of all the legal databases is denoted by We point out that, under the standard cautious semantics, answering a query posed over the global schema amounts to evaluating it on every legal database

Example 1 (contd.). The set contains all global databases that satisfy the global schema and that contain either or Then, the answer to the user query which asks for all employees that have a boss, is

Summarizing, our approach consists in properly repairing the source database in order to obtain another source database such that is satisfiable w.r.t. Obviously, if is satisfiable w.r.t. we do not need to repair Before concluding, we point out that the set is in general different from the set of global databases that can be obtained by repairing the retrieved global database instead of repairing the source database This is evident in Example 2, in which repairing is performed by dropping at the sources; therefore, legal databases exist that contain neither nor

We conclude this section by considering the difficulty of checking whether a global database is indeed a repair. Such a difficulty will be evaluated by following the data complexity approach [20], i.e., by considering a given problem instance having as its input the source database — this is, in fact, the approach we shall follow in all the complexity results presented in the paper.

Theorem 4 (Repair Checking). Let be a data integration system with a source database for and a global database for Then, checking whether is legal is feasible in polynomial time.
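To make the source-repair semantics tangible, here is a small Python sketch written under assumptions of our own: repairs are obtained by deleting source tuples only, the GAV mapping and the global constraints are encoded as ordinary functions, and subset-maximality stands in for the minimality order of Definition 3. The relation names echo Example 3 but are purely illustrative.

```python
# Sketch of the repair-at-the-sources semantics under simplifying assumptions:
# deletions only, GAV views and constraints given as plain Python functions.
from itertools import combinations

SOURCE = {("s1", ("a", 1)), ("s2", ("a", 2))}      # two source facts, same key

def retrieve(source_db):
    """GAV mapping: the global relation g is the union of s1 and s2."""
    return {("g", t) for (_, t) in source_db}

def consistent(global_db):
    """Global constraint: the first component of g is a key."""
    keys = [t[0] for (r, t) in global_db if r == "g"]
    return len(keys) == len(set(keys))

def source_repairs(source_db):
    """Subset-maximal source databases whose retrieved global db is consistent
    (our reading of the minimality order of Definition 3)."""
    subsets = [set(c) for k in range(len(source_db) + 1)
               for c in combinations(sorted(source_db), k)
               if consistent(retrieve(set(c)))]
    return [s for s in subsets if not any(s < t for t in subsets)]

def certain_answers(source_db, query):
    """Cautious semantics: tuples returned by the query on every legal database."""
    legal = [retrieve(rep) for rep in source_repairs(source_db)]
    return set.intersection(*(set(query(g)) for g in legal)) if legal else set()

if __name__ == "__main__":
    keys_of_g = lambda g: {t[0] for (r, t) in g if r == "g"}
    print(source_repairs(SOURCE))               # two repairs, one per source fact
    print(certain_answers(SOURCE, keys_of_g))   # {'a'}
```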

4 Formalizing Additional Properties of the Sources

In many real-world applications, users often have some additional knowledge about the data sources besides the mapping with the global schema, which can be modelled in terms of preference constraints specified over source relations. In this scenario, the aim is to exploit preference constraints for query answering in the presence of inconsistency. The framework we have introduced in Section 3 allows us to easily take such preference information into account when trying to solve inconsistency, since


repairing is performed by directly focusing on the sources, whose integration has caused inconsistency. Intuitively, when a data integration system is equipped with some additional preference constraints, we can easily exploit these further requirements for identifying, given a source database those elements of which are preferred for answering a certain query. In this respect, we distinguish between unary constraints, i.e., properties which characterize a given data source, and binary constraints, i.e., properties which are expressed over pairs of relations in the source schema

4.1 Unary and Binary Constraints

As already evidenced in [18], in order to provide accurate query answering, each relation can be equipped with two parameters: the soundness measure and the completeness measure. The former is used for assessing the confidence that we place in the answers provided by whereas the latter is used for evaluating how much relevant information is contained in In [18], the problem of querying partially sound and complete data sources has been studied in the context of data integration systems with LAV mappings and without integrity constraints on the global schema. In such a setting, it has been shown that deciding the existence of a global database satisfying some assumptions is NP-complete. Here, we extend this analysis to sound GAV mappings, in our repair semantics for data integration systems.

In this framework, we observe that the completeness measure is of no practical interest, since each is such that Therefore, constraints that can be satisfied by adding tuples to can be seen as “automatically repaired”. Indeed, in our repair semantics we do consider the addition of tuples at the sources in order to repair constraints on the global schema. Therefore, we are actually interested in bounding only the number of tuple deletions required at the sources in order to repair the system. Then, for each source relation we denote by the value of such a bound, also called a soundness constraint, whose semantics is as follows.

Definition 5. Let be a data integration system, a source database for a relation symbol in and a soundness constraint for Then, a source database satisfies if

Even though in several situations the soundness measure is not directly available for characterizing a source relation in an absolute way, the user might be able to compare the soundness of two different sources. For instance, he might not know the soundness constraints for source relations and but he might have observed that is more reliable than This intuition is formalized by the notion of binary constraints.

Let and be two relation symbols of and let A denote a set of pairs such that with and attributes of and

respectively. Any expression of the form is a binary constraint over and its semantics is as follows.

Definition 6. Let be a data integration system and a source database for Then, a source database satisfies a binary constraint of the form with for if

where indicates the projection of the tuple on the attributes in A.

Roughly speaking, satisfies if, for each tuple that has been deleted from in order to obtain a tuple sharing the same values on the attributes in A has been deleted from This behavior guarantees that is modified only if there is no way of repairing the data integration system by modifying only.

Example 1 (contd.). Assume now that we specify the binary constraint over the source schema where $2 indicates the second attribute of and $1 the first attribute of Then, violates the constraint, since it is obtained by dropping from the fact whose second component coincides with the first component of which, conversely, has not been dropped. On the contrary, it is easy to see that satisfies the constraint.
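The following Python sketch checks a candidate repair against the two kinds of preference constraints just introduced. Because the formulas of Definitions 5 and 6 are missing here, the encoding of a repair as per-relation tuple sets, the attribute-position pairs, and the direction of the binary constraint are all our own reading of the prose.

```python
# Sketch of checking preference constraints on a repair obtained by deleting
# source tuples; relation names and attribute positions are illustrative only.

def deleted(db, repaired, relation):
    """Tuples of `relation` present in db but dropped in the repair."""
    return db.get(relation, set()) - repaired.get(relation, set())

def satisfies_soundness(db, repaired, relation, bound):
    """Unary (soundness) constraint: at most `bound` deletions from relation."""
    return len(deleted(db, repaired, relation)) <= bound

def satisfies_binary(db, repaired, r1, r2, attr_pairs):
    """Binary constraint: every deletion from r1 must be matched by a deletion
    from r2 agreeing on the paired attribute positions in attr_pairs."""
    del1, del2 = deleted(db, repaired, r1), deleted(db, repaired, r2)
    return all(any(all(t1[i] == t2[j] for i, j in attr_pairs) for t2 in del2)
               for t1 in del1)

if __name__ == "__main__":
    db = {"r1": {("p1", "x")}, "r2": {("x", 10), ("y", 20)}}
    repair = {"r1": {("p1", "x")}, "r2": {("y", 20)}}        # drops ("x", 10)
    print(satisfies_soundness(db, repair, "r2", 1))          # True
    # Deleting from r2 requires a matching deletion from r1, pairing the first
    # attribute of r2 with the second attribute of r1 (cf. Example 1 contd.).
    print(satisfies_binary(db, repair, "r2", "r1", [(0, 1)]))  # False
```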

4.2 Soft Constraints

As defined in the section above, unary and binary constraints often impose very severe restrictions on the possible ways repairs can be carried out. For instance, it might even happen that no minimal source database exists that satisfies such constraints, thereby leading to a data integration system with an empty semantics. In order to face such situations, whenever necessary we can also turn to a “weak” semantics that looks for repairs that are as close as possible to the preferred ones. In this respect, preference constraints are interpreted in a soft version, and we aim at minimizing the number of violations, rather than imposing the absence of such violations.

Definition 7 (Satisfaction Factor). Let be a data integration system, a source database for and be a source database in Then, the satisfaction factor for a constraint is if is of the form the number of tuples

and

does not satisfy such that if has the form

Finally, the satisfaction factor of a set of constraints

or with with is the value

4.3 Preferred Answers

After unary and binary constraints have been formalized, we next show how they can be practically used for pruning the set of legal databases of a data integration system. Specifically, we first focus on the definition of preferred legal databases, and we then show how this notion can be exploited for defining preferred answers. In the following, given a data integration system we denote by the set of preference constraints defined over Then, the pair is also said to be a constrained data integration system. The semantics of the system is provided in terms of those legal databases that are “retrieved” from source databases of that satisfy

Definition 8. Let be a constrained data integration system with and and let be a source database for Then, a global database is a (weakly) preferred legal database for w.r.t. if satisfies where is a minimal source database w.r.t. such that no minimal source database exists with If

then is a strongly preferred legal database for w.r.t.

We next provide the notion of answers to a query posed to a constrained data integration system.

Definition 9. Given a constrained data integration system with a source database for and a query of arity over the set of the weakly preferred answers to denoted is

for each weakly preferred legal database The set of the strongly preferred answers to denoted is

for each strongly preferred legal database

Example 1 (contd.). Consider again the constraint We have already observed that only satisfies this requirement. Then, the set of strongly preferred databases is Therefore, for the query

We conclude this section by observing that the constraints we have defined can be evaluated in polynomial time on a given global database. However, they suffice for blowing up the intrinsic difficulty of identifying (preferred) global databases.

Theorem 5 (Preferred Repair Checking). Let and let be a source database for Then, checking whether a global database is (strongly) preferred for w.r.t. is NP-hard.

Proof (Sketch). NP-hardness can be proven by a reduction from the 3-colorability problem to our problem. Indeed, given a graph G, we can build a data integration system a source database and a legal database for such that is preferred if and only if G is 3-colorable.

5 Complexity Results

We next study the computational complexity of query answering in a constrained data integration system under the novel semantics proposed in the paper. Our aim is to point out the intrinsic difficulty of dealing with constraints at the sources. Specifically, given a source database for we shall face the following problems:

StronglyAnswering: given a UCQ of arity over and a tuple of constants of is

WeaklyAnswering: given a UCQ of arity over and a tuple of constants of is

where are constants of the domain which also occur in tuples of We shall consider the (common) case in which is such that contains only key dependencies (KDs), functional dependencies (FDs) and exclusion dependencies (EDs). We recall that these are classical constraints issued on a relational schema, and that they can be expressed in the form (1) introduced in Section 2. We also point out that violations of constraints of such form, e.g., two global tuples violating a key constraint, always lead to inconsistency in our framework, since they can be repaired only by means of tuple deletions from the source database. We are now ready to provide the first result of this section.

Theorem 6. Let be a constrained data integration system with where in which contains only FDs and EDs, and let be a source database for Then, the StronglyAnswering problem is co-NP-complete. Hardness holds even if contains either only KDs or only EDs, and if is empty.

Proof (Sketch). As for membership, we consider the dual problem of deciding whether and we show that it is feasible in NP. In fact, we can guess a source database obtained by removing tuples of only. Then, we can show how to verify that is minimal w.r.t. in polynomial time. Hardness for the general case can be derived from the results reported in [5] and in [9] (where the problem of query answering under different semantics in the presence of KDs is studied). Hardness for EDs only can be proven in an analogous way by a reduction from the 3-colorability problem to the complement of our problem.

The above result suggests that adding constraints to the sources enriches the representation features of a data integration system and is well behaved from a computational viewpoint. In fact, selecting preferred answers is as difficult as selecting answers without additional preference constraints, whose complexity has been widely studied in [5].

We next turn to the WeaklyAnswering problem, in which a weaker semantics is considered. Intuitively, this scenario provides an additional source of complexity, since finding weakly preferred global databases amounts to solving an implicit (NP) optimization problem. Interestingly, the increase in complexity is rather small and does not lift the problem to higher levels of the polynomial hierarchy.


Actually, the problem stays within the polynomial-time closure of NP. More precisely, it is complete for the class in which the NP oracle access is limited to queries, where is the size of the source database in input.

Theorem 7. Let with where and let be a source database for

be a constrained data integration system in which contains only FDs and EDs. Then, the WeaklyAnswering problem is

Proof (Sketch). For membership, we can preliminarily compute the maximum value, say max, that the satisfaction factor for any source database may assume. Then, by a binary search in [0, max], we can compute the best satisfaction factor, say at each step of this search, we are given a threshold and we call an NP oracle to know whether there exists a source database such that Finally, we ask another NP oracle to check whether there exists a source database with satisfaction factor such that does not belong to for the minimal satisfying the constraints in

Hardness can be proved by a reduction from the following problem: given a formula in conjunctive normal form on the variables a subset and a variable decide whether is true in all the models, where a model M (satisfying assignment) is if it has the largest i.e., if the number of variables in the set that are true w.r.t. M is the maximum over all the satisfying assignments.
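The membership argument can be phrased as a small search procedure; in the Python sketch below the NP checks are abstracted as callables, and the choice to search for the largest admissible factor over the integers is an assumption of ours, since the exact definition of the satisfaction factor is not recoverable from this copy.

```python
# Sketch of the membership argument of Theorem 7: binary search over candidate
# satisfaction factors, with the NP checks abstracted as placeholder callables.

def best_satisfaction_factor(max_factor, exists_with_factor):
    """Largest integer f in [0, max_factor] for which a repair with satisfaction
    factor >= f exists, found with O(log max_factor) oracle calls."""
    lo, hi = 0, max_factor
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if exists_with_factor(mid):   # NP oracle call (placeholder)
            lo = mid
        else:
            hi = mid - 1
    return lo

def weakly_preferred(tuple_, max_factor, exists_with_factor, counterexample):
    """tuple_ is weakly preferred iff no repair with the best factor misses it;
    `counterexample(best, tuple_)` stands for the final NP oracle call."""
    best = best_satisfaction_factor(max_factor, exists_with_factor)
    return not counterexample(best, tuple_)

if __name__ == "__main__":
    # Toy oracle: repairs with factor up to 3 exist in this made-up instance.
    print(best_satisfaction_factor(10, lambda f: f <= 3))   # -> 3
```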

6 Conclusions

In this paper we have introduced and formalized the problem of enriching data integration systems with preferences among sources. Our approach is based on a novel semantics which relies on repairing the data stored at the sources in the presence of global inconsistency. Repairs performed at the sources allow us to properly take into account preferences expressed over the sources when trying to solve inconsistency. Exploiting the presence of preference constraints, we have introduced the notion of (strongly and weakly) preferred answers. Finally, we have studied the computational complexity of computing both strongly and weakly preferred answers for classes of key, functional and exclusion dependencies, which are relevant classes of constraints for relational databases as well as for conceptual modelling languages. The complexity results given in this paper can easily be extended to the presence of inclusion dependencies on the global schema in the cases in which the problem of query answering is decidable, which have been studied in [5].

To the best of our knowledge, the present work is the first one that provides formalizations and complexity results for the problem of dealing with inconsistencies by taking into account preferences specified among data sources in a pure GAV data integration framework. Only recently, the same problem has been studied for LAV integration systems in [13].


References

1. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison Wesley Publ. Co., Reading, Massachusetts, 1995.
2. M. Arenas, L. E. Bertossi, and J. Chomicki. Consistent query answers in inconsistent databases. In Proc. of PODS'99, pages 68–79, 1999.
3. L. Bertossi, J. Chomicki, A. Cortes, and C. Gutierrez. Consistent answers from integrated data sources. In Proc. of FQAS'02, pages 71–85, 2002.
4. L. Bravo and L. Bertossi. Logic programming for consistently querying data integration systems. In Proc. of IJCAI'03, pages 10–15, 2003.
5. A. Calì, D. Lembo, and R. Rosati. On the decidability and complexity of query answering over inconsistent and incomplete databases. In Proc. of PODS'03, pages 260–271, 2003.
6. A. Calì, D. Lembo, and R. Rosati. Query rewriting and answering under constraints in data integration systems. In Proc. of IJCAI'03, pages 16–21, 2003.
7. J. Chomicki. Preference formulas in relational queries. Technical Report cs.DB/0207093, arXiv.org e-Print archive. ACM Trans. on Database Systems.
8. J. Chomicki. Querying with intrinsic preferences. In Proc. of EDBT'02, pages 34–51, 2002.
9. J. Chomicki and J. Marcinkowski. On the computational complexity of consistent query answers. Technical Report cs.DB/0204010 v1, arXiv.org e-Print archive, Apr. 2002. Available at http://arxiv.org/abs/cs/0204010.
10. P. Dell'Acqua, L. M. Pereira, and A. Vitória. User preference information in query answering. In Proc. of FQAS'02, pages 163–173, 2002.
11. T. Eiter, M. Fink, G. Greco, and D. Lembo. Efficient evaluation of logic programs for querying data integration systems. In Proc. of ICLP'03, volume 2237 of Lecture Notes in Artificial Intelligence, pages 348–364. Springer, 2003.
12. R. Fagin, J. D. Ullman, and M. Y. Vardi. On the semantics of updates in databases. In Proc. of PODS'83, pages 352–365, 1983.
13. G. De Giacomo, D. Lembo, M. Lenzerini, and R. Rosati. Tackling inconsistencies in data integration through source preferences. In Proc. of the SIGMOD Int. Workshop on Information Quality in Information Systems, 2004.
14. G. Greco, S. Greco, and E. Zumpano. A logic programming approach to the integration, repairing and querying of inconsistent databases. In Proc. of ICLP'01, volume 2237 of Lecture Notes in Artificial Intelligence, pages 348–364. Springer, 2001.
15. S. Greco, C. Sirangelo, I. Trubitsyna, and E. Zumpano. Preferred repairs for inconsistent databases. In Proc. of IDEAS'03, pages 202–211, 2003.
16. M. Lenzerini. Data integration: A theoretical perspective. In Proc. of PODS'02, pages 233–246, 2002.
17. J. Lin and A. O. Mendelzon. Merging databases under constraints. Int. J. of Cooperative Information Systems, 7(1):55–76, 1998.
18. A. O. Mendelzon and G. A. Mihaila. Querying partially sound and complete data sources. In Proc. of PODS'01, pages 162–170, 2001.
19. C. S. Peirce. Abduction and induction. In Philosophical Writings of Peirce, pages 150–156, 1955.
20. M. Y. Vardi. The complexity of relational query languages. In Proc. of STOC'82, pages 137–146, 1982.

Resolving Schematic Discrepancy in the Integration of Entity-Relationship Schemas

Qi He and Tok Wang Ling

School of Computing, National University of Singapore
{heqi,lingtw}@comp.nus.edu.sg

Abstract. In schema integration, schematic discrepancies occur when data in one database correspond to metadata in another. We define this kind of semantic heterogeneity in general using the paradigm of context, i.e., the meta-information relating to the source, classification, property, etc. of entities, relationships or attribute values in entity-relationship (ER) schemas. We present algorithms to resolve schematic discrepancies by transforming metadata into entities, keeping the information and constraints of the original schemas. Although focusing on the resolution of schematic discrepancies, our technique works seamlessly with existing techniques that resolve other semantic heterogeneities in schema integration.

1 Introduction

Schema integration involves merging several schemas into an integrated schema. More precisely, [4] defines schema integration as “the activity of integrating the schemas of existing or proposed databases into a global, unified schema”. It is an important task in building a heterogeneous database system [6, 22] (also called a multidatabase system or federated database system), in integrating data in a data warehouse, and in integrating user views in database design. In schema integration, people have identified different kinds of semantic heterogeneities among component schemas: naming conflicts (homonyms and synonyms), key conflicts, structural conflicts [3, 15], and constraint conflicts [14, 21]. A less-studied problem is schematic discrepancy, i.e., the same information is modeled as data in one database, but as metadata in another. This conflict arises frequently in practice [11, 19]. We adopt a semantic approach to solve this issue. One of the outstanding features of our proposal is that we preserve the cardinality constraints in the transformation/integration of ER schemas. Cardinality constraints, in particular functional dependencies (FDs) and multivalued dependencies (MVDs), are useful in verifying lossless schema transformation [10], schema normalization and semantic query optimization [9, 21] in multidatabase systems.

The following example illustrates schematic discrepancy in ER schemas. To focus our contribution and simplify the presentation, in the example below schematic discrepancy is the only kind of conflict among the schemas.

Example 1. Suppose we want to integrate supply information of products from several databases (Fig. 1). These databases record the same information, i.e., product numbers, product names, suppliers and supplying prices in each month, but have discrepant schemas. In DB1, suppliers and months are modeled as entity types. In DB2, months are modeled as metadata of entity types, i.e., each entity type models

P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 245–258, 2004. © Springer-Verlag Berlin Heidelberg 2004


the products supplied in one month, and suppliers are modeled as metadata of attributes, e.g., the attribute S1_PRICE records the supplying prices by supplier s1¹. In DB3, months are modeled as metadata of relationship types, i.e., each relationship type models the supply relation in one month. We propose (in Section 4) to resolve the discrepancies by transforming the metadata into entities, i.e., transforming DB2 and DB3 into the form of DB1. The statements on the right side of Fig. 1 provide the semantics of the constructs of these schemas using an ontology, which will be explained in Section 3.

Fig. 1. Schematic discrepancy: months and suppliers modeled differently in DB1, DB2 and DB3

Paper organization. The rest of the paper is organized as follows. Section 2 is an introduction to the ER approach. Sections 3 and 4 are the main contributions of this paper. In Section 3, we first introduce the concepts of ontology and context, and the mappings from schema constructs of ER schemas onto types of an ontology. Then we define schematic discrepancy in general using the paradigm of context. In Section 4, we present algorithms to resolve schematic discrepancies in schema integration, without any loss of information or cardinality constraints. In Section 5, we compare our work with related work. Section 6 concludes the whole paper.

¹ Without causing confusion, we blur the distinction between entities and identifiers of entities. E.g., we use the supplier number s1 to refer to a supplier with identifier S# = s1, i.e., s1 plays both the role of an attribute value of S# and that of an entity of supplier.


2 ER Approach

In the ER model, an entity is an object in the real world that can be distinctly identified. An entity type is a collection of similar entities that have the same set of predefined common attributes. Attributes can be single-valued, i.e., 1:1 (one-to-one) or m:1 (many-to-one), or multivalued, i.e., 1:m (one-to-many) or m:m (many-to-many). A minimal set of attributes of an entity type E which uniquely identifies E is called a key of E. An entity type may have more than one key, and we designate one of them as the identifier of the entity type. A relationship is an association among two or more entities. A relationship type is a collection of similar relationships that have the same set of predefined common attributes. A minimal set of attributes (including the identifiers of the participating entity types) in a relationship type R that uniquely identifies R is called a key of R. A relationship type may have more than one key, and we designate one of them as the identifier of the relationship type.

The cardinality constraints of ER schemas incorporate FDs and MVDs. For example, given an ER schema below, let K1, K2 and K3 be the identifiers of E1, E2 and E3; we have: and as A1 is a 1:1 attribute of E1; as A2 is an m:1 attribute of E2; as A3 is an m:m attribute of E3; as the cardinality of E3 is 1 in R; and as B is an m:1 attribute of R.
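The dependencies this sentence lists did not survive extraction; the LaTeX block below restates the standard correspondence it describes, reusing K1, K2, K3 and A1, A2, A3, B from the text. The two entries involving the relationship type R are our own reconstruction from the stated cardinalities.

```latex
% Correspondence between cardinalities and dependencies (reconstruction).
K_1 \rightarrow A_1, \quad A_1 \rightarrow K_1   % A1 is a 1:1 attribute of E1
K_2 \rightarrow A_2                              % A2 is an m:1 attribute of E2
K_3 \twoheadrightarrow A_3                       % A3 is an m:m attribute of E3
K_1 K_2 \rightarrow K_3                          % the cardinality of E3 in R is 1
K_1 K_2 K_3 \rightarrow B                        % B is an m:1 attribute of R
```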

3 Ontology and Context

In this section, we first represent the constructs of ER schemas using an ontology, and then define schematic discrepancy in general based on the schemas represented using the ontology. In this paper, we treat an ontology as the specification of a representational vocabulary for a shared domain of discourse, which includes the definitions of types (representing classes, relations, and properties) and their values. We present the ontology at a conceptual level, which could be implemented by an ontology language, e.g., OWL [20]. For example, suppose the ontology SupOnto describes the concepts in the universe of product supply. It includes the following types: product, month, supplier, supply (i.e., the supply relations among products, months and suppliers), price (i.e., the supplying prices of products), p#, pname, s#, etc. It also includes the values of these types, e.g., jan, ..., dec for month, and s1, ..., sn for supplier. Note that we use lower-case italic words to represent types and values of the ontology, in contrast to capitals for the schema constructs of an ER schema. In OWL, product, month, supplier and supply would be declared as classes, p# and pname as properties of product, s# as a property of supplier, and price as a property of supply.

Conceptual modeling is always done within a particular context. In particular, the context of an entity type, relationship type or attribute is the meta-information relating to its source, classification, property, etc. Contexts are usually at four levels: database, object class, relationship type and attribute. An entity type may “inherit” a context from a database (i.e., the context of a database applies to the entities), and so on. In general, the inheritance hierarchy of contexts at different levels is:


We’ll give a formal representation of context below. Note that, since the context of a database is handled in the object classes that inherit it, we do not consider database-level contexts any further in the rest of the paper.

Definition 1. Given an ontology, we represent an entity type (relationship type, or attribute) E as: where T,

are types in the ontology, and each is a value of for respectively have a value of which are not explicitly given. This representation means that each instance of E is a value of T and satisfies the conditions for each with The values constitute the context within which E is defined; we call them meta-attributes, and their values the metadata of E. Furthermore, with the values are from the context at a higher level (i.e., the context of a database if E is an entity type, the contexts of entity types if E is a relationship type, or the context of an entity type/relationship type if E is an attribute). We say that E inherits the meta-attributes with their values. If E inherits all the meta-attributes with values of the higher-level context, we simply represent it as: For easy reference, we call the set the self context, and the inherited context of E. In the above representation of E, either the self or the inherited context could be empty. Specifically, when the context of E is empty, we have E = T.

In the example below, we represent the entity types, relationship types and attributes in Fig. 1 using the ontology SupOnto.

Example 2. In Fig. 1, using the ontology SupOnto, the entity type JAN_PROD of DB2 is represented as: That is, the context of JAN_PROD is month=‘jan’. This means that each entity of JAN_PROD is a product supplied in Jan. Also in DB2, given an attribute S1_PRICE of the entity type JAN_PROD, we represent it as: That is, the self context of S1_PRICE is supplier=‘s1’, and the inherited context (from the entity type) is month=‘jan’. This means that each value of S1_PRICE of the entity type JAN_PROD is a price of a product supplied by supplier s1 in the month of Jan.


In DB3, given a relationship type JAN_SUP, we represent it as: This means that each relationship of JAN_SUP is a supply relationship in the month of Jan. Also in DB3, given an attribute PRICE of the relationship type JAN_SUP, we represent it as: PRICE inherits the context month=‘jan’ from the relationship type. This means that each value of PRICE of the relationship type JAN_SUP is a supplying price in Jan.

In contrast to original ER schemas, we call an ER schema whose schema constructs are represented using ontology symbols an elevated schema, such as the ER schemas with the statements given in Fig. 1. The mapping from an ER schema onto an elevated schema should be specified by users. Our work is based on elevated schemas. Now we can define schematic discrepancy in general as follows.

Definition 2. Two elevated schemas are schematically discrepant if metadata in one database correspond to attribute values or entities in the other. We call meta-attributes whose values correspond to attribute values or entities in other databases discrepant meta-attributes.

For example, in Fig. 1, month and supplier are discrepant meta-attributes in DB2, as their values correspond to entities in DB1; so is the meta-attribute month in DB3.

Before ending this section, we define the global identifier of a set of entity types. In general, two entity types (or relationship types) E1 and E2 are similar if E1=T[Cnt1] and E2=T[Cnt2], with T an ontology type, and Cnt1 and Cnt2 two (possibly empty) sets of meta-attributes with values. Intuitively, a global identifier identifies the entities of similar entity types, independent of context.

Definition 3. Given a set of similar entity types let K be an identifier of each entity type in We call K a global identifier of the entity types of provided that if two entities of the entity types of refer to the same real-world object, then the values of K of the two entities are the same, and vice versa.

For example, in Fig. 1, the PROD entity types of DB1 and DB3, and the entity types JAN_PROD, ..., DEC_PROD of DB2 are similar entity types, for they all correspond to the ontology type product, with or without a context. Suppose P# is a global identifier of these entity types, i.e., P# uniquely identifies products from all three databases. Similarly, we suppose S# is a global identifier of the SUPPLIER entity types of DB1 and DB3.

In [13], Lee et al. propose an ER-based federated database system where local schemas modeled in the relational, object-relational, network or hierarchical models are first translated into corresponding ER export schemas before they are integrated. Our approach is an extension of theirs that uses an ontology to provide the semantics necessary for schema integration. In general, local schemas could be in different data models. We first translate them into ER or ORASS schemas (ORASS is an ER-like model for semi-structured data [25]). Then we map the schema constructs of the ER schemas onto the types of the ontology and obtain elevated schemas with the help of semi-automatic tools. Finally, we integrate the elevated schemas using the semantics of the ontology; semantic heterogeneities among elevated schemas are resolved in this step. Integrity constraints on the integrated schema are derived from the constraints on the elevated schemas at the same time.
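As a quick illustration of elevated-schema constructs and of Definition 2, the Python sketch below encodes a construct as an ontology type plus self and inherited contexts and flags the meta-attributes that are discrepant with respect to another database. The ontology types and contexts come from Fig. 1 and Example 2, while the class and function names are ours.

```python
# Sketch: elevated-schema constructs as (ontology type, self context, inherited
# context), and detection of discrepant meta-attributes (Definition 2).
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    name: str                      # schema construct, e.g. "JAN_PROD"
    onto_type: str                 # ontology type, e.g. "product"
    self_ctx: tuple = ()           # ((meta_attribute, value), ...)
    inherited_ctx: tuple = ()

    def context(self):
        """Self and inherited context merged into one mapping."""
        return {**dict(self.self_ctx), **dict(self.inherited_ctx)}

# DB2 constructs from Fig. 1 / Example 2
jan_prod = Construct("JAN_PROD", "product", (("month", "jan"),))
s1_price = Construct("S1_PRICE", "price",
                     (("supplier", "s1"),), (("month", "jan"),))

# DB1 models month and supplier as entity types, i.e. their values are entities.
db1_entity_types = {"product", "month", "supplier"}

def discrepant_meta_attributes(construct, other_db_entity_types):
    """Meta-attributes whose values correspond to entities in the other database."""
    return {m for m in construct.context() if m in other_db_entity_types}

if __name__ == "__main__":
    print(discrepant_meta_attributes(jan_prod, db1_entity_types))   # {'month'}
    print(discrepant_meta_attributes(s1_price, db1_entity_types))   # {'month', 'supplier'}
```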


4 Resolving Schematic Discrepancies in the Integration of ER Schemas

In this section, we resolve schematic discrepancies in schema integration. In particular, we present four algorithms to resolve schematic discrepancies for entity types, relationship types, attributes of entity types and attributes of relationship types, respectively. This is done by transforming discrepant meta-attributes into entity types. The transformations keep the cardinalities of attributes and entity types, and therefore preserve the FDs and MVDs. Note that, in the presence of context, the values of an attribute depend not only on the identifier of an entity type/relationship type, but also on the metadata of the attribute. To simplify the presentation, we consider only the discrepant meta-attributes of entity types, relationship types and attributes, leaving the other meta-attributes out, as they will not change in schema transformation.

In the rest of this section, we first present Algorithms TRANS_ENT and TRANS_REL, the resolutions of discrepancies for entity types and relationship types, in Section 4.1, and then TRANS_ENT_ATTR and TRANS_REL_ATTR, the resolutions for attributes of entity types and attributes of relationship types, in Section 4.2. Examples are provided to illustrate each algorithm.

4.1 Resolving Schematic Discrepancies for Entity Types/Relationship Types

In this sub-section, we first show how to resolve discrepancies for entity types using the schema of Fig. 1, and then present Algorithm TRANS_ENT in general. Finally, we describe the resolution of discrepancies for relationship types by an example, omitting the general algorithm, which is similar to TRANS_ENT. As an example of removing discrepancies for entity types, we transform the schema of DB2 in Fig. 1 below.

Example 3 (Fig. 2). In Step 1, for each entity type of DB2, say JAN_PROD, we represent the meta-attribute month as an entity type MONTH consisting of the single entity jan, which is the metadata of JAN_PROD. We change the entity type JAN_PROD into PROD after removing the context, and construct a relationship type R to associate the entities of PROD with the entity of MONTH. Then we handle the attributes of JAN_PROD. As PNAME has nothing to do with the context month=‘jan’ of the entity type, it becomes an attribute of PROD. However, S1_PRICE, ..., SN_PRICE inherit the context month; they become attributes of the relationship type R. Then, in Step 2, the corresponding entity types, relationship types and attributes are merged respectively. The merged entity type MONTH consists of all the entities {jan, ..., dec} of the original MONTH entity types; so do the entity type PROD, the relationship type R and their attributes.

We give the general algorithm below.

Algorithm TRANS_ENT
Input: an elevated schema DB.
Output: a schema DB’ transformed from DB such that all the discrepant meta-attributes of entity types are transformed into entity types.


Step 1: Resolve the discrepant meta-attributes of an entity type. Let E be an entity type of DB, for a type in the ontology and discrepant meta-attributes with their values. Let K be the global identifier of E.

Step 1.1: Transform discrepant meta-attributes into entity types. Construct an entity type E’ with the global identifier K; E’ consists of the entities of E without any context. Construct an entity type with an identifier for each discrepant meta-attribute; each contains one entity, the metadata value of E. // Construct a relationship type to represent the associations among the entities of E and the meta-attribute values. Construct a relationship type R connecting the entity type E’ and the new entity types.

Step 1.2: Handle the attributes of E. Let A be an attribute (not part of the identifier) of E, and let selfCnt, a set of meta-attributes with values, be the self context of A.
If A is an m:1 or m:m attribute, then
  Case 1: attribute A has nothing to do with the context of E. Then A becomes an attribute of E’.
  Case 2: attribute A inherits all the context from E. Then A becomes an attribute of R.
  Case 3: attribute A inherits some discrepant meta-attributes with their values from E. Then construct a relationship type connecting E’ and the entity types constructed for those meta-attributes; A becomes an attribute of this relationship type.
Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities of E in the context. In this case, A should be modeled as an entity type to preserve the cardinality constraint. We keep the discrepant meta-attributes of A, and delay the resolution to Algorithm TRANS_ENT_ATTR, the resolution for attributes of entity types. Construct an attribute A’ of E’, with Cnt, the (self and inherited) context of A, as the (self) context of A’.

Step 1.3: Handle relationship types involving entity type E in DB. Let R1 be a relationship type involving E in DB.
  Case 1: R1 has nothing to do with the context of E. Then replace E with E’ in R1.
  Case 2: R1 inherits all the context from E. Then replace E with R (i.e., treat R as a high-level entity type) in R1.
  Case 3: R1 inherits some discrepant meta-attributes with their values from E. Then construct a relationship type R2 connecting E’ and the entity types constructed for those meta-attributes, and replace E with R2 in R1.

Step 2: Merge the entity types, relationship types and attributes, respectively, which correspond to the same ontology type with the same context, and union their domains.

In the resolution of schematic discrepancies for relationship types, we should deal with a set of entity types (participating in a relationship type) instead of individual ones. The steps are similar to those of Algorithm TRANS_ENT, but without Step 1.3.
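To make the two steps concrete, the following sketch replays Example 3 on a toy, dictionary-based encoding of elevated schemas; the function and field names (trans_ent, ontology, inherits_context, and so on) are ours and not part of the algorithm, and the merging of Step 2 is folded into the bookkeeping.

def trans_ent(entity_types):
    """Toy version of TRANS_ENT: turn discrepant meta-attributes into entity types."""
    merged = {}                                    # Step 2: merge by ontology type
    for e in entity_types:
        ctx = e["context"]                         # e.g. {"month": "jan"}
        prod = merged.setdefault(e["ontology"], {
            "entities": set(), "attrs": set(), "rel_attrs": set(),
            "ctx_entities": {m: set() for m in ctx}})
        prod["entities"] |= e["entities"]          # Step 1.1: E' keeps the entities of E
        for meta, value in ctx.items():            # each meta-attribute becomes an entity type
            prod["ctx_entities"][meta].add(value)  # holding its single metadata value
        for a in e["attrs"]:                       # Step 1.2: context-inheriting attributes
            target = "rel_attrs" if a["inherits_context"] else "attrs"
            prod[target].add(a["name"])            # go to the relationship type R
    return merged

db2 = [{"ontology": "product", "context": {"month": m},
        "entities": {"p1", "p2"},
        "attrs": [{"name": "PNAME", "inherits_context": False},
                  {"name": "S1_PRICE", "inherits_context": True}]}
       for m in ("jan", "feb", "dec")]
print(trans_ent(db2))   # one PROD, one MONTH entity set {jan, feb, dec}, S1_PRICE on R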


Fig. 2. Resolve schematic discrepancies for entity types

We omit the resolution algorithm TRANS_REL for lack of space, but explain it by an example below, i.e., transforming the schema of DB3 in Fig. 1.

Example 4 (Fig. 3). In Step 1, for each relationship type of DB3, say JAN_SUP, we represent the meta-attribute month as an entity type MONTH consisting of the only entity jan, which is the metadata of JAN_SUP. We change JAN_SUP into the relationship type SUP after removing the context, and relate the entity type MONTH to SUP. Attribute PRICE of JAN_SUP inherits the context month=‘jan’ from the relationship type, and therefore it becomes an attribute of SUP in the transformed schema. Then, in Step 2, the MONTH entity types are merged into one consisting of all the entities {jan, ..., dec}; the SUP relationship types are also merged, and we get the schema of DB1 in Fig. 1.

4.2 Resolving Schematic Discrepancies for Attributes

In this subsection, we first show how to resolve discrepancies for attributes of entity types using an example, then present Algorithm TRANS_ENT_ATTR in general. Finally, we describe the resolution of discrepancies for attributes of relationship types by an example, omitting the general algorithm, which is similar to TRANS_ENT_ATTR.


Fig. 3. Resolve schematic discrepancies for relationship types

The following example shows how to resolve discrepancies for attributes of entity types. Note that the discrepancies of entity types should be resolved before this step.

Example 5 (Fig. 4). Suppose we have another database DB4 recording the supplying information, in which all the suppliers and months are modeled as contexts of the attributes in an entity type PROD. The transformation is given in Fig. 4. In Step 1, for each attribute with discrepant meta-attributes, say S1_JAN_PRICE, the meta-attributes supplier and month are represented as entity types SUPPLIER and MONTH consisting of one entity s1 and jan, respectively. A relationship type SUP is constructed to connect PROD, MONTH and SUPPLIER. After removing the context, we

Fig. 4. Resolve schematic discrepancies for attributes of entity types


change S1_JAN_PRICE into PRICE, an attribute of the relationship type SUP. Then, in Step 2, we merge all the corresponding entity types, relationship types and attributes, and get the schema of DB1 in Fig. 1.

Then we give the general algorithm below.

Algorithm TRANS_ENT_ATTR
Input: an elevated schema DB.
Output: a schema DB’ transformed from DB such that all the discrepant meta-attributes of attributes of entity types are transformed into entity types.

Step 1: Resolve the discrepant meta-attributes of an attribute in an entity type. Given an entity type E of DB, let A be an attribute (not part of the identifier) of E, for a type in the ontology, with discrepant meta-attributes and their values. // Note that A has no inherited context, which has been removed in Algorithm TRANS_ENT if any. // Represent the discrepant meta-attributes as entity types. Construct an entity type with an identifier for each discrepant meta-attribute; each contains one entity, the corresponding metadata value.
If A is an m:1 or m:m attribute, then // Construct a relationship type to represent the associations among the entities of E and the meta-attribute values. Construct a relationship type R connecting the entity type E and the new entity types; attribute A becomes an attribute of R.
Else // A is a 1:1 or 1:m attribute, i.e., the values of A determine the entities of E in the context. A should be modeled as an entity type to preserve the cardinality constraint. Construct an entity type for A, with its values as the identifier, and construct a relationship type R connecting the entity type E, this new entity type and the meta-attribute entity types. Represent the corresponding FD as a cardinality constraint on R; if A is a 1:1 attribute, also represent the FD in the other direction on R.

Step 2: Merge the entity types, relationship types and attributes, respectively, which correspond to the same ontology type with the same context, and union their domains.

The resolution of schematic discrepancies for the attributes of relationship types is similar to that for the attributes of entity types, as a relationship type can be treated as a high-level entity type. We omit the resolution algorithm TRANS_REL_ATTR for lack of space, but explain it by an example below.

Example 6 (Fig. 5). Given the transformed schema of Fig. 2, we transform the attributes of the relationship type R as follows. In Step 1, for each attribute of R, say S1_PRICE, we represent the meta-attribute supplier as an entity type SUPPLIER with one entity s1, and construct a relationship type SUP to connect the relationship type R


and entity type SUPPLIER. After removing the context, we change S1_PRICE into PRICE, an attribute of SUP. Then in Step 2, we merge the SUPPLIER entity types and SUP relationship types respectively. In the merged schema, the relationship type R is redundant as it is a projection of SUP and has no attributes. Consequently, we remove R and get the schema of DB1 in Fig. 1.

Fig. 5. Resolve schematic discrepancies for attributes of relationship types

The transformations of the algorithms (in Sections 4.1 and 4.2) correctly preserve the FDs/MVDs in the presence of context, as shown in the following proposition.

Proposition 1. Let a set of similar entity types (or relationship types) have the same set of discrepant meta-attributes, and let K be the global identifier (or the set of global identifiers of the participating entity types, if they are relationship types). Suppose each entity type (or relationship type) of the set has a set of attributes with the same cardinality. Then, in the transformed schema, the discrepant meta-attributes are modeled as entity types, and the following FDs/MVDs hold:
Case 1: the attributes are m:1 attributes. Then each is modeled as an attribute and an FD holds.
Case 2: the attributes are m:m attributes. Then each is modeled as an attribute and an MVD holds.
Case 3: the attributes are 1:1 attributes. Then each is modeled as an entity type with its values as the identifier, and FDs hold in both directions.


Case 4: the attributes are 1:m attributes. Then each is modeled as an entity type with its values as the identifier, and an FD holds.

For lack of space, we only prove Case 1, when we are given a set of entity types. In a transformed schema, consider two relationships with values on A': one with k and one with k' as values (or value sets) of K, and with a and a' as values of A'. If k = k', then in the original schemas the two relationships correspond to the same entity and the same attribute. As A is an m:1 attribute, we have a = a'. That is, the FD holds in the transformed schema.

In schema integration, schematic discrepancies of different schema constructs should be resolved in order, i.e., first for entity types, then for relationship types, and finally for attributes of entity types and attributes of relationship types. The resolutions for most of the other semantic heterogeneities (introduced in Section 1) follow the resolution of schematic discrepancies.

5 Related Work

Context is the key component in capturing the semantics related to the definition of an object or association. The definition of context as a set of meta-attributes with values was originally adopted in [7, 23], but was used to solve different kinds of semantic heterogeneities. Our work complements rather than competes with theirs: their work is based on context at the attribute level only, whereas we consider contexts at different levels and the inheritance of context.

A special kind of schematic discrepancy has been studied in multidatabase interoperability, e.g., in [2, 11, 12, 16, 17, 19]. These works dealt with the discrepancy that arises when schema labels (e.g., relation names or attribute names) in one database correspond to attribute values in another. In contrast, we use contexts to capture meta-information and solve a more general problem, in the sense that schema constructs may have multiple (instead of single) discrepant meta-attributes. Furthermore, their works stay at the “structure level”, i.e., they did not consider the constraint issue in the resolution of schematic discrepancies. However, the importance of constraints can never be overestimated in both individual and multidatabase systems. In particular, we preserve FDs and MVDs during schema transformation, which are expressed as cardinality constraints in ER schemas. The purposes are also different: previous works focused on the development of a multidatabase language by which users can query across schematically discrepant databases, whereas we aim to develop an integration system which can detect and resolve schematic discrepancies automatically, given the meta-information on source schemas.

The issue of inferring view dependencies was introduced in [1, 8]. However, those works are based on views defined using relational algebra; in other words, they did not solve the inference problem for the transformations between schematically discrepant schemas. In [14, 21, 24], the derivation of constraints for integrated schemas from the constraints of component schemas has been studied. However, these works did not consider schematic discrepancy in schema integration. Our work complements theirs.


6 Conclusions and Future Work

Information integration provides a competitive advantage to businesses and has become a major area of investment by software companies today [18]. In this paper, we resolve a common problem in schema integration, schematic discrepancy in its general form, using the paradigm of context. We define context as a set of meta-attributes with values, which can be attached at the levels of databases, entity types, relationship types, and attributes. We design algorithms to resolve schematic discrepancies by transforming discrepant meta-attributes into entity types. The transformations preserve information and cardinality constraints, which are useful in verifying lossless schema transformation, schema normalization and query processing in multidatabase systems. We have implemented a schema integration tool to semi-automatically integrate schematically discrepant schemas from several relational databases. Next, we will extend our system to integrate databases in different models and semi-structured data.

References

1. S. Abiteboul, R. Hull, and V. Vianu: Foundations of Databases. Addison-Wesley, 1995, pp 216-235
2. R. Agrawal, A. Somani, Y. Xu: Storing and querying of e-commerce data. VLDB, 2001, pp 149-158
3. C. Batini, M. Lenzerini: A methodology for data schema integration in the Entity-Relationship model. IEEE Trans. on Software Engineering, 10(6), 1984
4. C. Batini, M. Lenzerini, S. B. Navathe: A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4), 1986, pp 323-364
5. P. P. Chen: The entity-relationship model: toward a unified view of data. TODS 1(1), 1976
6. A. Elmagarmid, M. Rusinkiewicz, A. Sheth: Management of Heterogeneous and Autonomous Database Systems. Morgan Kaufmann, 1999
7. C. H. Goh, S. Bressan, S. Madnick, and M. Siegel: Context interchange: new features and formalisms for the intelligent integration of information. ACM Transactions on Information Systems, 17(3), 1999, pp 270-293
8. G. Gottlob: Computing covers for embedded functional dependencies. SIGMOD, 1987
9. C. N. Hsu and C. A. Knoblock: Semantic query optimization for query plans of heterogeneous multidatabase systems. TKDE 12(6), 2000, pp 959-978
10. Qi He, Tok Wang Ling: Extending and inferring functional dependencies in schema transformation. Technical report TRA3/04, School of Computing, National University of Singapore, 2004
11. R. Krishnamurthy, W. Litwin, W. Kent: Language features for interoperability of databases with schematic discrepancies. SIGMOD, 1991, pp 40-49
12. V. Kashyap, A. Sheth: Semantic and schematic similarity between database objects: a context-based approach. The VLDB Journal 5, 1996, pp 276-304
13. Tok Wang Ling, Mong Li Lee: Issues in an entity-relationship based federated database system. CODAS, 1996, pp 60-69
14. Mong Li Lee, Tok Wang Ling: Resolving constraint conflicts in the integration of entity-relationship schemas. ER, 1997, pp 394-407
15. Mong Li Lee, Tok Wang Ling: A methodology for structural conflicts resolution in the integration of entity-relationship schemas. Knowledge and Information Systems, 5, 2003, pp 225-247
16. L. V. S. Lakshmanan, F. Sadri, S. N. Subramanian: On efficiently implementing SchemaSQL on an SQL database system. VLDB, 1999, pp 471-482
17. L. V. S. Lakshmanan, F. Sadri, S. N. Subramanian: SchemaSQL – an extension to SQL for multidatabase interoperability. TODS, 2001, pp 476-519

18. N. M. Mattos: Integrating information for on demand computing. VLDB, 2003, pp 8-14
19. R. J. Miller: Using schematically heterogeneous structures. SIGMOD, 1998, pp 189-200
20. Web ontology language, W3C recommendation. http://www.w3.org/TR/owl-guide/
21. M. P. Reddy, B. E. Prasad, Amar Gupta: Formulating global integrity constraints during derivation of global schema. Data & Knowledge Engineering, 16, 1995, pp 241-268
22. A. P. Sheth and S. K. Gala: Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Computing Surveys, 22(3), 1990
23. E. Sciore, M. Siegel, A. Rosenthal: Using semantic values to facilitate interoperability among heterogeneous information systems. TODS, 19(2), 1994, pp 254-290
24. M. W. W. Vermeer and P. M. G. Apers: The role of integrity constraints in database interoperation. VLDB, 1996, pp 425-435
25. Xiaoying Wu, Tok Wang Ling, Mong Li Lee, and Gillian Dobbie: Designing Semistructured Databases Using ORA-SS Model. WISE, 2001, pp 171-180

Managing Merged Data by Vague Functional Dependencies*

An Lu and Wilfred Ng

Department of Computer Science
The Hong Kong University of Science and Technology
Hong Kong, China
{anlu,wilfred}@cs.ust.hk

Abstract. In this paper, we propose a new similarity measure between vague sets and apply vague logic in a relational database environment with the objective of capturing the vagueness of the data. By introducing a new vague Similar Equality for comparing data values, we first generalize the classical Functional Dependencies (FDs) into Vague Functional Dependencies (VFDs). We then present a set of sound and complete inference rules. Finally, we study the validation process of VFDs by examining the satisfaction degree of VFDs, and the merge-union and merge-intersection on vague relations.

1 Introduction

The relational data model [8] has been extensively studied for over three decades. This data model basically handles precise and exact data in an information source. However, many real-life applications, such as merging data from many sources, involve imprecise and inexact data. It is well known that fuzzy database models [2, 11], based on the fuzzy set theory of Zadeh [13], have been introduced to handle inexact and imprecise data. In [5], Gau et al. point out that the drawback of using a single membership value in fuzzy set theory is that the evidence for an object and the evidence against it are in fact mixed together. (Here U is a classical set of objects, called the universe of discourse, and an element of U is denoted by u.) Therefore, they propose vague sets, which are similar to the intuitionistic fuzzy sets proposed in [1]. A true membership function and a false membership function of a vague set V are used to characterize lower bounds on the membership function of the corresponding fuzzy set F. These lower bounds are used to create a subinterval of the unit interval [0,1] in order to generalize the membership function of fuzzy sets. There have been many studies that discuss how to measure the degree of similarity or distance between vague sets or intuitionistic fuzzy sets [3, 4, 6, 7, 9, 12]. However, the proposed methods have some limitations.

* This work is supported in part by grants from the Research Grant Council of Hong Kong, Grant Nos. HKUST6185/02E and HKUST6165/03E.



For example, Hong’s similarity measure in [7] means that the similarity measure between the vague value with the most imprecise evidence (the precision of the evidence is 0) and the vague value with the most precise evidence (the precision of the evidence is 1) is equal to 0.5. In this case, the similarity measure should be equal to 0. Our view is that the similarity measure should include two factors of vague values. One is the difference between the evidences contained by the vague values; another is the difference between the precisions of the evidences. However, the proposed measures or distances consider only one factor (e.g. in [3,4]) or do not combine both the factors appropriately (e.g. in [7,9,12,6]). Our new similarity measure is able to return a more reasonable answer. In this paper, we extend the classical relational data model to deal with vague information. Our first objective is to extend relational databases to include vague domains by suitably defining the Vague Functional Dependencies (VFDs) based on our notion of similarity measure. A set of sound and complete inference rules for VFDs is then established. We discuss the satisfaction degree of VFDs and apply VFDs in merged vague relations as the second objective. The main contributions of the paper are as follows: (1) A new similarity measure between vague sets is proposed to remedy some problems for similar definitions in literature. We argue that our measure gives a more reasonable estimation; (2) A VFD is proposed in order to capture more semantics in vague relations; (3) The satisfaction degree of VFDs in merged vague relations is studied. The rest of the paper is organized as follows. Section 2 presents some basic concepts related to databases and the vague set theory. In Section 3, we propose a new similarity measure between vague sets. In Section 4, we introduce the concept of a Vague Functional Dependency (VFD) and the associated inference rules. We then explain the validation process which determines the satisfaction degree of VFDs in vague relations. In Section 5, we give the definitions of merge operators of vague relations and discuss the satisfaction degree of VFDs after merging. Section 6 concludes the paper.

2 Preliminaries

In this section, some basic concepts related to the classical relational data model and the vague set theory are given.

2.1 Relational Data Model

We assume the reader is familiar with the basic concepts of the relational data model [8]. Two operations on relations are particularly relevant in the subsequent discussion: projection and natural join. The projection of a relation of R(XYZ) over the set of attributes X is obtained by taking the restriction of its tuples to the attributes in X and eliminating duplicate tuples in what remains. Given two relations of R(XY) and R(XZ), respectively, their natural join is a relation over R(XYZ) whose tuples agree with some tuple of the first relation on XY and with some tuple of the second on XZ.


Functional Dependencies (FDs) are important integrity constraints in relational databases. An FD is a statement X → Y, where X and Y are sets of attributes. A relation satisfies the FD if, for any two of its tuples, agreement on X implies agreement on Y.
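As a small illustration of this classical definition (not part of the paper), the following check treats a relation as a list of dictionaries and tests whether any two tuples that agree on X disagree on Y; the relation used here is made up.

def satisfies_fd(relation, X, Y):
    seen = {}                                   # X-value -> Y-value seen so far
    for t in relation:
        key = tuple(t[a] for a in X)
        val = tuple(t[a] for a in Y)
        if key in seen and seen[key] != val:    # same X, different Y: FD violated
            return False
        seen[key] = val
    return True

r = [{"ID": 1, "Weight": 10, "Price": 80},
     {"ID": 2, "Weight": 10, "Price": 80},
     {"ID": 3, "Weight": 25, "Price": 90}]
print(satisfies_fd(r, ["Weight"], ["Price"]))   # True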

2.2 Vague Data Model

Let U be a classical set of objects, called the universe of discourse, where an element of U is denoted by u.

Definition 1. (Vague Set) A vague set V in a universe of discourse U is characterized by a true membership function and a false membership function. The true membership value of u is a lower bound on the grade of membership of u derived from the evidence for u; the false membership value of u is a lower bound on the grade of membership of the negation of u derived from the evidence against u; and the two values sum to at most 1.

A vague set V of the universe of discourse U can be represented by attaching to each object the interval whose lower end is the true membership value and whose upper end is one minus the false membership value. This approach bounds the grade of membership of an object to a subinterval of [0,1]: the exact grade of membership may be unknown, but it is bounded from below and above by these two values. We depict these ideas in Fig. 1. Throughout this paper, we simply speak of the true and false membership values of an object when no ambiguity about V arises.

Fig. 1. The true and false membership functions of a vague set

For a vague set V, we say that the interval formed by the true membership value and one minus the false membership value is the vague value of the object u. For example, if the vague value of u is [0.6, 0.9], then the true membership value is 0.6 and the false membership value is 0.1. It is interpreted as “the degree that object u belongs to the vague set V is 0.6; the degree that object u does not belong to the vague set V is 0.1.” In a voting process, the vague value [0.6, 0.9] can be interpreted as “the vote for a resolution is 6 in favor, 1 against, and 3 neutral (abstentions).”


The precision of the knowledge about is characterized by the difference If this is small, the knowledge about is relatively precise; if it is large, we know correspondingly little. If is equal to the knowledge about is exact, and the vague set theory reverts back to fuzzy set theory. If and are both equal to 1 or 0, depending on whether does or does not belong to V, the knowledge about is very exact and the theory reverts back to ordinary sets. Thus, any crisp or fuzzy value can be regarded as a special case of a vague value. For example, the ordinary set can be presented as the vague set while the fuzzy set (the membership of is 0.8) can be presented as the vague set Definition 2. (Empty Vague Set) A vague set V is an empty vague set, if and only if, its true membership function and false membership function for all We use to denote it. Definition 3. (Complement) The complement of a vague set V is denoted by and is defined by and Definition 4. (Containment) A vague set A is contained in another vague set B, if and only if, and Definition 5. (Equality) Two vague sets A and B are equal, written as A = B, if and only if, and that is and Definition 6. (Union) The union of two vague sets A and B is a vague set C, written as whose true membership and false membership functions are related to those of A and B by and Definition 7. (Intersection) The intersection of two vague sets A and B is a vague set C, written as whose true membership and false membership functions are related to those of A and B by and Definition 8. (Cartesian Product) Let be the Cartesian product of m universes, and be the vague sets in their respectively, corresponding universe of discourse The Cartesian product is defined to be a vague set of where the memberships are defined as follows: and
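The following sketch encodes a vague value by its pair of true and false memberships and implements the complement, union and intersection of Definitions 3, 6 and 7, assuming the usual max/min formulation for vague sets; the class name and method names are ours.

class VagueValue:
    def __init__(self, t, f):
        assert 0 <= t <= 1 and 0 <= f <= 1 and t + f <= 1
        self.t, self.f = t, f                      # evidence for / evidence against

    def interval(self):                            # the vague value [t, 1 - f]
        return (self.t, 1 - self.f)

    def imprecision(self):                         # width 1 - t - f: 0 exact, 1 unknown
        return 1 - self.t - self.f

    def complement(self):                          # Definition 3: swap the two evidences
        return VagueValue(self.f, self.t)

    def union(self, other):                        # Definition 6: max true, min false
        return VagueValue(max(self.t, other.t), min(self.f, other.f))

    def intersection(self, other):                 # Definition 7: min true, max false
        return VagueValue(min(self.t, other.t), max(self.f, other.f))

v = VagueValue(0.6, 0.1)                           # the voting example: 6 for, 1 against
print(v.interval(), v.imprecision())               # about (0.6, 0.9) and 0.3 -> 3 neutral votes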

2.3 Vague Relations

Definition 9. (Vague Relation) A vague relation on a relation scheme is a vague subset of the Cartesian product of the attribute domains; a tuple in the relation is likewise a vague subset of that product.


A relation scheme R is denoted by or simply by R if the attributes are understood. Corresponding to each attribute name the domain of is written as However, unlike classical and fuzzy relations, in vague relations, we define as a set of vague sets. Vague relations may be considered as an extension of classical relations and fuzzy relations, which can capture more information about imprecision. Example 1. Consider the vague relation over Product(ID, Weight, Price) given in Table 1. In Weight and Price are vague attributes. To make the attribute ID simple, we express it as the ordinary value. The first tuple in means the product with ID = 1 has the weight of [1,1]/10 and the price of [0.4, 0.6]/50 + [1,1]/80, which are vague sets. In the vague set [1,1]/10, [1,1] means the evidence in favor “the weight is 10” is 1 and the evidence against it is 0.

3 Similarity Measure Between Vague Sets

In this section, we review the notions of similarity measures between vague sets proposed by Chen [3, 4], Hong [7] and Li [9], together with distances between intuitionistic fuzzy sets proposed by Szmidt [12] and Grzegorzewski [6]. We show by some examples that these measures are not able to reflect our intuitions. A new similarity measure between vague sets is proposed to remedy the limitations.

3.1 Similarity Measure Between Two Vague Values

Let x and y be two vague values for a certain object. In general, two factors should be considered in measuring the similarity between two vague values. One is the difference between the differences of their true and false membership values; the other is the difference between the sums of their true and false membership values. The first factor reflects the difference between the evidences contained by the vague values, and the second factor reflects the difference between the precisions of the evidences. In [3, 4], Chen defines a similarity measure between two vague values x and y as follows:


which is equal to This similarity measure ignores the difference between the precisions of the evidences For example, consider

This means that and are equal. On the one hand, means and that is to say, we have no information about the evidence, and the precision of the evidence is zero. On the other hand, means and that is to say, we have some information about the evidence, and the precision of the evidence is not zero. So it is not intuitive to have the similarity measure of and being equal to 1. In order to solve this problem, Hong et al. [7] propose another similarity measure between vague values as follows:

However, this definition also has some problems. Here is an example. Example 2. The similarity measure between [0,1] and is equal to 0.5. This means that the similarity measure between the vague value with the most imprecise evidence (the precision of the evidence is equal to zero) and the vague value with the most precise evidence (the precision of the evidence is equal to one) is equal to 0.5. However, our intuition shows that the similarity measure in this case should be equal to 0. Li et al. in [9] also give a similarity measure in order to remedy the problems in Chen’s definition as follows:

It can be checked that This means Li’s similarity measure is just the arithmetic mean of Chen’s and Hong’s. So Li’s similarity measure still contains the same problems. [12, 6] adopt Hamming distance and Euclidean distance to measure the distances between intuitionistic fuzzy sets as follows: 1. Hamming distance is given by

2. Euclidean distance is given by

These methods also have some problems. Here is an example.


Example 3. We still consider the vague values and in Example 2. For the Hamming distance, it can be calculated that This means that the Hamming distance between and are equal to that between and In a voting process, as mentioned in Example 2, since both and have identical votes in favor and against, the Hamming distance between and should be less than that between and For the Euclidean distance, consider the Euclidean distance between [0,1] and which is equal to This means that the distance between the vague value with the most imprecise evidence and the vague value with the most precise evidence is not equal to 1. (Actually, the Euclidean distance in this case is in the interval However, our intuition shows that the distance in this case should always be equal to 1. In order to solve all the problems mentioned above, we define a new similarity measure between the vague values and as follows: Definition 10. (Similarity Measure Between Two Vague Values)

Furthermore, we define the distance between two vague values as one minus their similarity measure.

The similarity measure given in Definition 10 takes into account both the difference between the evidences contained by the vague values and the difference between the precisions of the evidences. Here is an example.

Example 4. We still consider the vague values in Example 2. It can be calculated that the similarity measure between the first pair is less than that between the second pair; as mentioned in Example 2, this result accords with our intuition. Another example is the similarity measure between [0,1] and the most precise vague value, which is equal to 0. This means that the similarity measure between the vague value with the most imprecise evidence and the vague value with the most precise evidence is equal to 0. This result also accords with our intuition.

From Definition 10, we can obtain the following theorem.

Theorem 1. The following statements are true:
1. The similarity measure is bounded between 0 and 1;
2. It equals 1 if and only if the two vague values are equal;
3. It equals 0 if and only if the two vague values are [0,0] and [1,1], or [0,1] and one of the most precise values [0,0] and [1,1];
4. The similarity measure is commutative.
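For illustration, the sketch below encodes a vague value as a pair (t, f) and computes Chen's and Hong's measures in the forms usually quoted in the literature, together with one way of combining the two factors discussed in Section 3.1 so that the properties of Theorem 1 hold; the combined function is our own illustration and is not claimed to be the exact measure of Definition 10.

def chen(x, y):
    (tx, fx), (ty, fy) = x, y                  # each value is a pair (t, f)
    return 1 - abs((tx - fx) - (ty - fy)) / 2

def hong(x, y):
    (tx, fx), (ty, fy) = x, y
    return 1 - (abs(tx - ty) + abs(fx - fy)) / 2

def combined(x, y):
    (tx, fx), (ty, fy) = x, y
    evidence_gap = abs((tx - fx) - (ty - fy)) / 2     # first factor
    precision_gap = abs((tx + fx) - (ty + fy))        # second factor
    return (1 - evidence_gap) * (1 - precision_gap)

most_imprecise = (0.0, 0.0)     # vague value [0, 1]
most_precise = (1.0, 0.0)       # vague value [1, 1]
print(chen(most_imprecise, most_precise))       # 0.5
print(hong(most_imprecise, most_precise))       # 0.5
print(combined(most_imprecise, most_precise))   # 0.0, matching the intuition above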

3.2 Similarity Measure Between Two Vague Sets

We generalize the similarity measure to two given vague sets. Definition 11. (Similarity Measure Between Two Vague Sets) Let X and Y be two vague sets, where and The similarity measure between the vague sets X and Y can be evaluated as follows:

Similarly, we define the distance between two vague sets as D(X, Y) = 1 − M(X, Y). From Definition 11, we obtain the following theorem for vague sets, which is similar to Theorem 1.

Theorem 2. The following statements related to M(X, Y) are true:
1. The similarity measure is bounded, i.e., 0 ≤ M(X, Y) ≤ 1;
2. M(X, Y) = 1 if and only if the vague sets X and Y are equal (i.e., X = Y);
3. M(X, Y) = 0 if and only if all the corresponding pairs of vague values are [0, 0] and [1, 1], or [0, 1] and one of [0, 0] and [1, 1];
4. The similarity measure is commutative, i.e., M(X, Y) = M(Y, X).

4 Vague Functional Dependencies and Inference Rules

In this section, we first give the definition of Similar Equality of vague relations, which can be used to compare vague relations. Then we present the definition of a Vague Functional Dependency (VFD). Next, we present a set of sound and complete inference rules for VFDs, analogous to Armstrong’s axioms for classical FDs.

4.1 Similar Equality of Vague Relations

Similar Equality of vague relations, defined below, can be used as a vague similarity measure to compare elements of a given domain. Suppose two tuples of a relation over the scheme R are given.

Definition 12. (Similar Equality of Tuples) The Similar Equality of two vague tuples on an attribute in a vague relation is given by:

The Similar Equality of two vague tuples on a set of attributes in a vague relation is given by:

From Definition 12 and Theorem 2, we have the following theorem. Theorem 3. The following statements of the properties of are true: 1. The similar equality is bounded: 2. if and only if, all vague sets and are equal (i.e., 3. if and only if, if and only if, all the vague values and are [0, 0] and [1, 1], or [0, 1] and

where

4. The similar equality is commutative:
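As an illustration of Definitions 11 and 12, the sketch below makes three assumptions that are not stated explicitly here: the set-level measure averages value-level similarities over the elements of the two vague sets, the tuple-level Similar Equality takes the minimum over the attributes, and value_sim is the candidate combined measure sketched in Section 3.1. The example tuples are simplified stand-ins for Table 1.

def value_sim(x, y):                               # candidate vague-value similarity (assumption)
    (tx, fx), (ty, fy) = x, y
    return (1 - abs((tx - fx) - (ty - fy)) / 2) * (1 - abs((tx + fx) - (ty + fy)))

def set_sim(X, Y):                                 # Definition 11, assuming an average over elements
    domain = set(X) | set(Y)
    absent = (0.0, 1.0)                            # an element not listed has vague value [0, 0]
    return sum(value_sim(X.get(u, absent), Y.get(u, absent)) for u in domain) / len(domain)

def similar_equality(t1, t2, attrs):               # Definition 12, assuming a minimum over attributes
    return min(set_sim(t1[a], t2[a]) for a in attrs)

t1 = {"Weight": {10: (1.0, 0.0)}, "Price": {50: (0.4, 0.4), 80: (1.0, 0.0)}}
t2 = {"Weight": {10: (1.0, 0.0)}, "Price": {80: (1.0, 0.0)}}
print(similar_equality(t1, t2, ["Weight"]), similar_equality(t1, t2, ["Price"]))   # 1.0 and about 0.7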

4.2 Vague Functional Dependencies

Informally, a VFD captures the semantics of the fact that, for any two given tuples, the Y values should not be less similar than the X values. We now give the following definition of a VFD.

Definition 13. (Vague Functional Dependency) Given a relation over a relation schema whose attribute domains are sets of vague sets, a Vague Functional Dependency (VFD) X → Y holds over the relation if, for all tuples of the relation, the Similar Equality on Y is at least the Similar Equality on X.

In the database literature [8], a set of inference rules is generally used to derive new data dependencies from a given set of dependencies. We now present a set of sound and complete inference rules for VFDs, similar to Armstrong’s axioms for FDs.

Definition 14. (Inference Rules) Consider a relation scheme and a set of VFDs F. Let X, Y, and Z be subsets of the relation scheme R. We define a set of inference rules as follows:
1. Reflexivity: if Y is a subset of X, then X → Y holds;
2. Augmentation: if X → Y holds, then XZ → YZ also holds;
3. Transitivity: if X → Y and Y → Z hold, then X → Z holds.

The following theorem holds under the assumption that each data domain contains at least two distinct elements.

Theorem 4. The inference rules given in Definition 14 are sound and complete.

The Union, Decomposition, and Pseudotransitivity rules follow from these three rules, as in the case of functional dependencies [8]. We skip the proof due to space limitations.


4.3 Validation of VFDs

In this section, we study the validation issues of VFDs. We relax the notion that a VFD fails as soon as it does not hold for one pair of tuples; instead, we allow a VFD to hold with a certain satisfaction degree over a relation. The validation process and the calculation of the satisfaction degree of a VFD are given as follows:
1. For every attribute involved in the VFD, we calculate the Similar Equality between every pair of tuples in the relation by constructing two upper triangular matrices X and A, whose dimension is the cardinality of the relation. Each row and column represents a comparison of different tuples. We ignore the lower part of each matrix and the diagonal, since Similar Equality is commutative, so each remaining entry is the comparison of one pair of tuples.
2. We check, for every pair of tuples, whether the Similar Equality on the right-hand side is at least the Similar Equality on the left-hand side. If this is true for all pairs, then we say that the VFD holds (with the satisfaction degree of 1). We construct a matrix W = X – A to check this.
3. If the result in Step 2 is not true, we count the number of entries of the matrix W (denoted by s) which are less than or equal to 0, and the satisfaction degree SD of the VFD in the relation is calculated from s by formula (11).

Obviously, if the inequality given in Definition 13 holds for all tuples, the satisfaction degree calculated by (11) is equal to 1. Suppose several VFDs hold over a relation with their respective satisfaction degrees; we use a VFD set to represent this. Then the satisfaction degree of the VFD set F over the relation is calculated by formula (12) as the arithmetic mean of the satisfaction degrees of the VFDs in F.

Here is an example to illustrate the validation process and the calculation of the satisfaction degree of the VFD. Example 5. Consider the vague relation presented in Table 1, it can be checked that the VFD Weight Price holds to a certain satisfaction degree. In step 1, we calculate for attributes X = Weight and A = Price and the results are shown by matrix X and A or Tables 2 and 3. In step 2, we check by taking the difference between the two matrices X and A. The result is shown by matrix W or Table 4. Since does not hold for every we go to step 3. In step 3, we get So the satisfaction degree SD can be calculated as follows:


Therefore, the VFD Weight Price over relation holds with the satisfaction degree 0.5. Furthermore, for the zero entries in W, we check the corresponding values in the matrix X. If the values are equal to 1, all vague sets and in X) are equal according to Theorem 3. Thus, we can remove some redundancies by decomposing the original relation into two relations. For instance, there is a value in position (3,2) is 0 in W above. We check the corresponding value in position (3,2) in matrix X, and find the value is 1. So the vague relation in Table 1 can be decomposed into two relations IW(ID, Weight), WP(Weight,Price) (Tables 5 and 6), and some redundancies have been removed.
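A compact sketch of the three validation steps is given below; it assumes the reading of (11) as the fraction of tuple pairs whose W entry is at most 0 (so that, as in Example 5, a degree of 0.5 means half of the pairs respect the VFD), and it takes the Similar Equality function as a parameter, e.g. the similar_equality sketched in Section 4.1.

from itertools import combinations

def satisfaction_degree(tuples, se, X, A):
    pairs = list(combinations(range(len(tuples)), 2))    # upper triangle, no diagonal
    s = 0
    for i, j in pairs:
        w = se(tuples[i], tuples[j], X) - se(tuples[i], tuples[j], A)   # entry of W = X - A
        if w <= 0:                                        # the VFD X -> A is respected by this pair
            s += 1
    return s / len(pairs) if pairs else 1.0               # assumed form of (11)

# usage, with the earlier sketch: satisfaction_degree(r, similar_equality, ["Weight"], ["Price"])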

5 Merge Operations of Vague Relations

In this section, we first give the definition of merge operators of vague relations and then discuss the evaluation of the satisfaction degree of VFDs over the merged vague relations.

5.1 Merge Operators

Generally speaking, when multiple data sources merge together, the result may contain objects of three cases [10]: (1) an attribute value is not provided; (2) an attribute value is provided by exactly one source; (3) an attribute value is provided by more than one source. When merging vague data, in the first case, we use an empty vague set to express the unavailable value; in the second case, we keep the original vague set; in the third case, we take the union of the vague sets provided by the source. We now define two new merge operators to serve our purpose. Definition 15. (Join Merge Operator) Let be a tuple in the vague relation over scheme and be a tuple in the vague relation over scheme and have a common ID attribute The attributes are common in both vague relations. Then we define the join merge of and denoted by as follows: with where means the union of two vague sets as defined in Definition 6. Definition 16. (Union Merge Operator) Let Then we define the union merge of and denoted by with with , where means an empty vague set.

as follows:

Since vague sets have the property of associativity given in [5], the join merge operator and the union merge operator also have the property of associativity. That is to say, and (recall that are vague relations). We can also generalize Definitions 15 and 16 to more than two data sources. Definition 16 guarantees that every tuple is contained in the new merged relation. For example, consider the following vague relations and given in Tables 7 and 8. We then have and as given in Tables 9 and 10.
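A sketch of the two operators on dictionary-based vague relations keyed by the common ID attribute is shown below; vague_union stands for the vague-set union of Definition 6, an empty dictionary plays the role of the empty vague set, and the reading of the join merge as keeping only the IDs present in both sources is our interpretation of Definition 15.

def vague_union(X, Y):
    out = dict(X)
    for u, (t, f) in Y.items():
        if u in out:
            t0, f0 = out[u]
            out[u] = (max(t0, t), min(f0, f))    # Definition 6
        else:
            out[u] = (t, f)
    return out

def join_merge(r1, r2):
    """Keep only IDs present in both sources; union the provided attribute values."""
    out = {}
    for key in r1.keys() & r2.keys():
        attrs = set(r1[key]) | set(r2[key])
        out[key] = {a: vague_union(r1[key].get(a, {}), r2[key].get(a, {})) for a in attrs}
    return out

def union_merge(r1, r2):
    """Keep every ID; missing attribute values default to the empty vague set."""
    out = {}
    for key in r1.keys() | r2.keys():
        t1, t2 = r1.get(key, {}), r2.get(key, {})
        attrs = set(t1) | set(t2)
        out[key] = {a: vague_union(t1.get(a, {}), t2.get(a, {})) for a in attrs}
    return out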

5.2 Satisfaction Degree of Merged Relations

Suppose we have a number of data sources represented by vague relations. Each relation has a set of VFDs with

the satisfaction degree defined in (12). By the union merge operator, we get a new merged relation, and we can also get a new VFD set over it. For each VFD in F, we can calculate the new satisfaction degree over the merged relation by the validation process proposed in Sect. 4. Then the satisfaction degree of the new VFD set F over the merged relation can be calculated by (12). In the case of non-overlapping sources, we can simplify the calculation as follows. Assume two data sources are represented by vague relations which have the same VFD on a common schema, with known satisfaction degrees and cardinalities. (As the sources are non-overlapping, there exists no tuple which has the same value of the ID attribute in both relations.) This implies that the cardinality of their union merge is the sum of the two cardinalities. In order to calculate the new SD of the VFD over the merged relation, we need to construct two new matrices to calculate the Similar Equality of every pair of tuples with one tuple from each source. Then we need to construct their difference matrix and count the number of its entries which are less than or equal to 0. According to (11), the satisfaction degree SD of the VFD over the merged relation can then be calculated from these quantities.
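Assuming that (11) is the fraction of tuple pairs whose W entry is at most 0, the non-overlapping case can be written as the following hedged reconstruction; the union-merge symbol, the matrix names and the cross-pair count $s_{12}$ are our notation.

\[
  s_i \;=\; SD_i \cdot \frac{n_i(n_i-1)}{2}, \qquad i = 1,2,
\]
\[
  SD(r_1 \uplus r_2) \;=\; \frac{s_1 + s_2 + s_{12}}{\,n(n-1)/2\,}, \qquad n = n_1 + n_2,
\]
where $s_{12}$ is the number of the $n_1 n_2$ cross pairs (one tuple from $r_1$, one from $r_2$) whose entry in the matrix $W_{12} = X_{12} - A_{12}$ is at most $0$.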

6 Conclusions

In this paper, we incorporate the notion of vagueness into the relational data model, with the objective of providing a generalized approach for treating imprecise data. We propose a new similarity measure between vague sets, which gives a more reasonable estimation than those proposed in the literature. We apply Similar Equality in vague relations; this equality measure can be used to compare elements of a given vague data domain. Based on the concept of similar equality of attribute values in vague relations, we develop the notion of Vague Functional Dependencies (VFDs), which is a simple and natural generalization of classical and fuzzy functional dependencies. In spite of this generalization, the inference rules for VFDs share the simplicity of Armstrong’s axioms for classical FDs. We also present the validation process of VFDs and the formula to determine the satisfaction degree of VFDs. Finally, we give the definition of merge operators of vague relations and discuss the satisfaction degree of VFDs over the merged vague data. As future work, we plan to extend the merge operations over vague data, which provide a flexible means to merge data in modern applications, such as querying Internet sources and merging the returned results. We are also studying the notion of Vague Inclusion Dependencies, which is useful for generalizing foreign keys in vague relations.

References

1. Atanassov, K.: Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems 20(1) (1986) 87–96
2. Buckles, B.P., Petry, F.E.: A Fuzzy Representation of Data for Relational Databases. Fuzzy Sets and Systems 7 (1982) 213–226
3. Chen, S.M.: Similarity Measures Between Vague Sets and Between Elements. IEEE Transactions on Systems, Man and Cybernetics 27(1) (1997) 153–159
4. Chen, S.M.: Measures of Similarity Between Vague Sets. Fuzzy Sets and Systems 74(2) (1995) 217–223
5. Gau, W.L., Danied, J.B.: Vague Sets. IEEE Transactions on Systems, Man, and Cybernetics 23(2) (1993) 610–614
6. Grzegorzewski, P.: Distances Between Intuitionistic Fuzzy Sets and/or Interval-valued Fuzzy Sets Based on the Hausdorff Metric. Fuzzy Sets and Systems (2003)
7. Hong, D.H., Kim, C.: A Note on Similarity Measures Between Vague Sets and Between Elements. Information Sciences 115 (1999) 83–96
8. Levene, M., Loizou, G.: A Guided Tour of Relational Databases and Beyond. Springer-Verlag, Berlin Heidelberg New York (1999)
9. Li, F., Xu, Z.: Measures of Similarity Between Vague Sets. Journal of Software 12(6) (2001) 922–927
10. Naumann, F., Freytag, J.C., Leser, U.: Completeness of Information Sources. Workshop on Data Quality in Cooperative Information Systems (DQCIS) (2003)
11. Raju, K.V.S.V.N., Majumdar, A.K.: Fuzzy Functional Dependencies and Lossless Join Decomposition of Fuzzy Relational Database Systems. ACM Transactions on Database Systems 13(2) (1988) 129–166
12. Szmidt, E., Kacprzyk, J.: Distances Between Intuitionistic Fuzzy Sets. Fuzzy Sets and Systems 114 (2000) 505–518
13. Zadeh, L.A.: Fuzzy Sets. Information and Control 8(3) (1965) 338–353

Merging of XML Documents

Wanxia Wei, Mengchi Liu, and Shijun Li

School of Computer Science, Carleton University,
Ottawa, Ontario, Canada, K1S 5B6
{wwei2,mengchi,shj_li}@scs.carleton.ca

Abstract. How to deal with the heterogeneous structures of XML documents, identify XML data instances, solve conflicts, and effectively merge XML documents to obtain complete information is a challenge. In this paper, we define a merging operation over XML documents that can merge two XML documents with different structures. It is similar to a full outer join in relational algebra. We design an algorithm for this operation. In addition, we propose a method for merging XML elements and handling typical conflicts. Finally, we present a merge template XML file that can support recursive processing and merging of XML elements.

1 Introduction

Information about real world objects may spread over heterogeneous XML documents. Moreover, it is critical to identify XML data instances representing the same real world object when merging XML documents, but each XML document may have different elements and/or attributes to identify objects. Furthermore, conflicts may emerge when merging these XML documents. In this paper, we present a new approach to merging XML documents. Our main contributions are as follows. First, we define a merging operation over XML documents that is similar to a full outer join in relational algebra. It can merge two XML documents with different structures. We design an algorithm for this operation. Second, we propose a method for merging XML elements and handling typical conflicts. Finally, we present a merge template XML file that can support recursive processing and merging of XML elements. The rest of the paper is organized as follows. Section 2 defines the merging operation and presents the algorithm for this operation. Section 3 studies the mechanism for identifying XML instances. Section 4 examines XML documents that this algorithm produces. Section 5 demonstrates the method for merging elements and handling conflicts. Section 6 describes the merge template XML file. Section 7 discusses related work. Finally, Section 8 concludes this paper.

2 Our Approach

The merging operation to be defined can merge two XML documents that have different structures, and create one single XML document. We assume that two

Fig. 1. The first XML document to be merged.

XML documents to be merged share many tag names and also have some tags with different tag names. We also assume that two tags that share the same tag name in these two XML documents describe the same kind of objects in the real world but their corresponding elements may have different structures. This merging operation can be formally represented as: where and are two input XML documents to be merged and is the merged XML document; and are the DTDs of and and are absolute location paths (paths for short) in XPath that designate the elements to be merged in and respectively; and are Boolean expressions that are used to control merging of XML elements in and Boolean expression is used to identify XML instances when merging and Also, it is used for merging of XML elements and handling conflicts. It consists of a number of conditional expressions connected by Boolean operator Let be one of the elements whose path is in and one of the elements whose path is in determines if in and in describe the same object. As long as is true, in and in describe the same object and they are merged. We say that in and in are matching elements if they describe the same object. Boolean expression is used to determine if in that does not have a matching in will be incorporated into It consists of several conditional expressions connected by Boolean operator

Fig. 2. The second XML document to be merged.

Example 1. The two input XML documents and in Figures 1 and 2 have different structures. They describe employees by different elements: Employee elements in and Person elements in and are merged into shown in Figure 3. The merge conditions are as follows: /Factory /Department /Employees /Employee. / FactoryInfo /People /Person. (:: Department/@DName = WorkIn/ Unit) (Name = @PName). (:: Department/@DName = WorkIn/Unit). According to the above and Employee elements in and Person elements in are merged into the result XML document Thus, for this example, is any Employee element in and is any Person element in In the above :: Department/@ DName represents the attribute DName of the ancestor Department of Employee element in and WorkIn/ Unit denotes the child Unit of the child WorkIn of Person element in (:: and @ denote an ancestor and an attribute respectively). According to an Employee element in and a Person element in describe the same employee and are merged into an Employee element in if the value of the attribute DName of the ancestor Department of an Employee is the same as the content of the descendant Unit of a Person, and the content of the child Name of an Employee is the same as the value of the attribute PName of a Person. Note that the child Name of an Employee cannot identify an Employee in because two Department elements may have Employee descendants that have the same content for the child Name.
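As an illustration only (not the paper's implementation), the matching condition of Example 1 can be evaluated with Python's standard library as follows; doc1_xml and doc2_xml stand for the documents of Figures 1 and 2, and the ancestor Department is kept explicit because ElementTree's limited XPath support has no ancestor axis.

import xml.etree.ElementTree as ET

def matching_pairs(doc1_xml, doc2_xml):
    d1, d2 = ET.fromstring(doc1_xml), ET.fromstring(doc2_xml)
    persons = d2.findall(".//Person")
    pairs = []
    for dept in d1.findall(".//Department"):
        dname = dept.get("DName")                 # ancestor Department's DName attribute
        for emp in dept.findall(".//Employee"):
            name = emp.findtext("Name")           # child Name of the Employee
            for person in persons:
                if (dname == person.findtext("WorkIn/Unit")
                        and name == person.get("PName")):
                    pairs.append((emp, person))   # they describe the same employee
    return pairs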

Fig. 3. The resulting single XML document.

According to if there exists a Department in that has an attribute DName whose value is the same as the content of the descendant Unit of a nonmatching Person, this non-matching Person is incorporated into Otherwise, this non-matching Person cannot be incorporated into because no element in can have this Person as a descendant. In relational algebra, a full outer join extracts the matching rows of two tables and preserves non-matching rows from both tables. Analogously, the merging operation defined merges XML documents and that have different structures and creates an XML document It merges in and its matching in according to and It incorporates each modified non-matching in and some modified non-matching elements in based on and Moreover, it incorporates the elements in that do not need merging. Path is the prefix path of path if is the left part of or is equal to For example, is the prefix path of It is obvious that the path of any

Fig. 4. The XML document procedure LeftOuterJoin produces for Example 1.

ancestor of an element is the prefix path of the path of this element. A path is the parent path of another path if it is a prefix path of that path and the latter contains one more element name. For Example 1, /Factory/Department/Employees is the parent path of /Factory/Department/Employees/Employee. The algorithm for the merging operation is as follows.

Algorithm xmlmerge
Input: the two XML documents to be merged and the merge conditions described above.
Output: the merged XML document.
  For the root element of the first input document, call LeftOuterJoin;
  let the intermediate result be the XML document generated by procedure LeftOuterJoin;
  for its root element, call FullOuterJoin.
End of algorithm xmlmerge

Algorithm xmlmerge merges the two input documents and generates an XML document which contains every designated element merged with its matching element, each modified non-matching designated element of the first document, and some modified non-matching elements of the second. Also, it incorporates the elements of the first document that do not need merging.


Algorithm xmlmerge calls two recursive procedures, LeftOuterJoin and FullOuterJoin. We explain FullOuterJoin in Section 4. LeftOuterJoin is as follows.

Procedure LeftOuterJoin
  if the path of the current element of the first document is not equal to the designated path then
    output the start tag of the element to the merged document;
    output all the attributes of the element to the merged document;
    for each child element:
      if the path of the child is a prefix path of the designated path then call LeftOuterJoin on the child
      else copy the child to the merged document;
    output the end tag of the element to the merged document
  else if the element has a matching element in the second document then
    output the start tag of the element to the merged document;
    for every attribute of the element call processa1;
    for every attribute of the matching element call processa2;
    for every child element of the element call processc1;
    for every child element of the matching element call processc2;
    output the end tag of the element to the merged document
  else
    output the start tag of the element to the merged document;
    output all the attributes of the element to the merged document;
    for every child element of the element:
      if it has a semantically corresponding attribute (an attribute of its semantically corresponding element in the second document) then
        output to the merged document an attribute whose name is that of the corresponding attribute and whose value is the content of the child;
    for every child element of the element:
      if it does not have such a semantically corresponding attribute then copy it to the merged document;
    output the end tag of the element to the merged document
End of procedure LeftOuterJoin

Procedure LeftOuterJoin merges each designated element of the first document with its matching element in the second, resolves conflicts by calling procedures processa1, processa2, processc1, and processc2, and produces an XML document which contains every element merged from a matching pair, every modified non-matching element of the first document, and the elements of the first document that do not need merging. For Example 1, the XML document that LeftOuterJoin produces is presented in Figure 4.

3 Instance Identification

A Skolem function returns a value for an object as the identifier of this object [4]. The computation of the matching Boolean expression has the same effect as a Skolem function. For Example 1, the constructed Skolem function concatenates the attribute DName of the ancestor Department and the child Name of an Employee element in the first document, or the descendant Unit and the attribute PName of a Person element in the second document, and returns this concatenated value for an object as its identifier. As long as the identifiers of two objects described in the two documents are equivalent, these two objects are the same object.
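A minimal sketch of such a Skolem-style identifier (the function name and the separator are arbitrary choices of ours):

def skolem_employee(unit, name):
    return f"{unit}#{name}"    # same key for the same real-world person in both documents

# the Employee "Paul Smith" under Department DName="Production" in the first document
# and the Person PName="Paul Smith" with WorkIn/Unit "Production" in the second
# document get the same identifier:
print(skolem_employee("Production", "Paul Smith"))   # Production#Paul Smith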

4 The Generated XML Documents

In LeftOuterJoin is the currently processed element in and it always has the property: is one of the elements in that need merging, or does not need merging but some of the descendants of need merging. Assume is an element in that needs merging, and is an element in that does not need merging and is not a descendant of We consider the relationship between and in There are four cases: (1) is an ancestor of (2) is a sibling of an ancestor of (3) and are siblings. (4) is a descendant of a sibling of For these four cases, is merged with its matching element in and is incorporated into When does not have a matching element in is modified and incorporated into We consider Example 1. The Employee in that has “Paul Smith” as the attribute PName and “Production” as the attribute DName of the ancestor Department is merged from the Employee in that has “Paul Smith” as the child Name and “Production” as the attribute DName of the ancestor Department and the matching Person in that has “Paul Smith” as the attribute PName and “Production” as the descendant Unit. The Employee in that has “Paul Smith” as the child Name and “Sales” as the attribute DName of the ancestor Department is a non-matching Employee. It is modified and incorporated into Its child Name is changed to attribute PName to obey the structure of the merged Employee in Department and Employees do not need merging and they are incorporated into FullOuterJoin incorporates every element in into XML document and modifies some non-matching elements and inserts the modified nonmatching elements into as child elements of some elements whose path is the parent path of FullOuterJoin modifies some non-matching elements in order to resolve conflicts and make the non-matching elements obey the structure of the merged element in Let us examine Example 1. The Person in that has “Alice Bush” as the attribute PName and “Production” as the descendant Unit is a non-matching Person. This non-matching Person in and the Employees in that has the ancestor Department that has the attribute DName with value “Production” make Boolean expression true. Therefore,


this non-matching Person is modified and embodied into as a child element of this Employees. The element name of this non-matching Person is changed to Employee. The child WorkIn of this non-matching Person is modified.

5 Merging XML Elements and Handling Typical Conflicts

First, we rephrase the assumptions about the two XML documents to be merged:
(1) They share many tag names and also have some tags with different tag names.
(2) Two tags that share the same tag name in the two documents describe the same kind of objects in the real world, but the corresponding elements can have the same structure or different structures.
(3) For tags with different tag names in the two documents, some of them can still describe the same kinds of objects; in this case, the matching Boolean expression indicates that they describe the same kinds of objects.
(4) For two tags in the two documents that describe the same kind of objects, the corresponding elements have the same cardinality.
(5) For two elements whose tags describe the same kind of objects in the two documents, their two attributes have the same attribute type and the same default value if these two attributes have the same attribute name.

Then, we introduce several notions. Elements whose tags describe the same kind of objects in the two documents can be classified into two categories: semantically identical elements and semantically corresponding elements. Two elements are semantically identical elements if their tags describe the same kind of objects and they have the same structure. Two semantically identical elements can have different element names; in this case, the Boolean expression indicates that they describe the same kind of objects. Two elements are semantically corresponding elements if their tags describe the same kind of objects but they have different structures. Two semantically corresponding elements can also have different element names; again, the Boolean expression indicates that they describe the same kind of objects. The elements to be merged in the two documents are semantically corresponding elements, because they express the same kind of objects, and they describe the same object if they make the Boolean expression true. Two attributes are said to be semantically identical attributes if they have the same name, one is an attribute of an element in one document, and the other is an attribute of the semantically identical or corresponding element in the other document. Two semantically identical attributes can likewise have different names; in this case, they are explicitly specified as semantically identical attributes. An attribute in one XML file to be merged can have a semantically corresponding element in the other XML file to be merged. An attribute and an element form a pair of semantically corresponding attribute and element if the name of the attribute is the same as the name of the element, the attribute is an attribute of an element in one XML file and the element is a child element of the semantically corresponding element of that element in the other XML file, the attribute is a required attribute of type CDATA, and the element is specified as a parsed character data element with cardinality 1 and has no attributes. The attribute name and element name of a pair of semantically corresponding attribute and element can also be different; in this case, the Boolean expression indicates that they are a pair of semantically corresponding attribute and element.

We now present the method for merging elements and handling conflicts. Conflicts may emerge when LeftOuterJoin merges an element and its matching element into an element of the merged document. Typical conflicts are: conflicts between two attributes; conflicts between an attribute of one element and a child element of the other, in either direction; conflicts between a child element or one of its descendants of one element and a child element or one of its descendants of the other; and conflicts between a child element or one of its descendants of one element and an ancestor of the other element. If an attribute of the element to be merged has a semantically identical attribute in its matching element, the two attributes should be merged into one attribute. If they are consistent with each other, redundancy is eliminated by merging them into one attribute; otherwise, a conflict is indicated in the merged attribute. Similarly, if an attribute of the element to be merged has a semantically corresponding element that is a child element of its matching element, the attribute and its semantically corresponding element are merged into an attribute. Procedure processa1 accomplishes these tasks. In Example 1, the child element Name of Employee and the attribute PName of Person are a semantically corresponding element and attribute because (Name = @PName) is specified in the Boolean expression. They are combined into the attribute PName of the merged Employee element.

The relationship between a descendant of one element to be merged and a descendant of its matching element is illustrated in Figure 5, where case (e) shows that no correspondence between an element and a semantically identical or corresponding element is found. Assume that the two descendants are semantically corresponding or identical elements. Based on the assumptions about the two XML documents to be merged, they have the same cardinality. If the cardinality is not greater than 1, they are merged into one element and conflicts between them are reported. Otherwise, they cannot simply be merged into one element: when they describe the same object, they are merged into one element; otherwise, both of them are incorporated into the merged element. Moreover, when they are semantically corresponding elements that represent the same object and one of them has some attributes and/or descendants that the other does not have, an element that combines the attributes and descendants of both is incorporated into the merged element as a descendant. The recursive procedure processc1 is responsible for completing the above tasks. We consider Example 1 again. The child Age of Employee and the child Age of Person are semantically identical elements with cardinality 1.


Fig. 5. The relationship between a descendant of an element to be merged and a descendant of its matching element

They are combined into the child Age of the merged Employee, and a conflict is reported: the child Age of the merged Employee has an or-value as its content, which implies that it is not clear which value is the correct one [7]. The child Contact of Employee contains Phone, Address, and Email child elements. The child Phone of Contact and the child Phone of Person are semantically identical elements. The cardinality of Phone is greater than 1, so the Phone children of Contact and the Phone children of Person are fused into the Phone child elements of the child Contact of the merged Employee element. The child Address of the child Contact of Employee and the child Address of Person are semantically corresponding elements with cardinality 1. There are no conflicts between them, and the child Address of Person has a child PostCode that the child Address of Contact does not have; as a result, this child PostCode is added to the child Address of the child Contact of the merged Employee. The child Email of the child Contact of Employee is embodied in the child Contact of the merged Employee.

Now assume that a descendant of the matching element has a semantically corresponding or identical element that is an ancestor of the element to be merged. Two solutions are possible for dealing with this descendant. One is to simply include it in the merged element; this results in a typical conflict between a child element (or one of its descendants) and an ancestor. The other is to simply exclude it; this also has a problem: if the descendant contains descendants of its own that are not semantically corresponding or identical elements of any ancestor, the information about them is lost in the merged element. It is appropriate to reconcile these two opposing solutions by modifying the descendant and incorporating the modified descendant into the merged element. The recursive procedure processc2 carries out these tasks. Let us examine Example 1. The child WorkIn of Person has Factory, Unit, and Group child elements. The child Factory of the child WorkIn of Person and the ancestor Factory of Employee are semantically corresponding


elements. Also, the child Unit of the child WorkIn of Person and the ancestor Department of Employee are semantically corresponding elements because (::Department/@DName = WorkIn/Unit) is specified in the Boolean expression. Consequently, a WorkIn element that has only the child element Group is included in the merged Employee as a child element.
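Putting these pieces together, the merged Employee for "Paul Smith" in the Production department would look roughly as follows. The values are illustrative, and the notation shown for the conflicting Age content follows the or-value idea of [7] only in spirit.

    <Department DName="Production">
      <Employees>
        <Employee PName="Paul Smith">
          <Age>30 | 31</Age>                 <!-- or-value: the reported conflict on Age -->
          <Contact>
            <Phone>111-1111</Phone>
            <Phone>222-2222</Phone>          <!-- Phone elements of both inputs fused -->
            <Address>12 Main Street
              <PostCode>K1A 0B1</PostCode>   <!-- PostCode added from the Person's Address -->
            </Address>
            <Email>paul@example.com</Email>
          </Contact>
          <WorkIn>
            <Group>G2</Group>                <!-- only Group is kept; Factory and Unit correspond to ancestors -->
          </WorkIn>
        </Employee>
        <Employee PName="Alice Bush">
          <!-- non-matching Person, renamed to Employee and inserted here by FullOuterJoin -->
        </Employee>
      </Employees>
    </Department>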

6 A Merge Template XML File

In our implementation, a merge template XML file is created to express the paths of the elements to be merged and the Boolean expression. Figure 6 shows an example merge template XML file for Example 1, where MergeTemplate has three child elements: P1, P2, and Key. P1 and P2 indicate the paths of the elements to be merged in the two input documents, respectively. Key gives the information for identifying XML instances and handling typical conflicts. The order of element names in P1 or P2 is significant: the first one is the name of the root element of the corresponding XML document, and the last one is the name of the elements to be merged. Moreover, each pair of consecutive element names in a path is associated with a pair of a parent and a child in the corresponding XML document, and the child element in each such pair is the only kind of child that needs merging or that has descendant elements requiring merging. All these characteristics are used to support recursive processing of XML elements and merging of the designated elements in the two documents. Each child Factor of Key describes a conditional expression of the Boolean expression, and a Factor that has a Selected attribute with value "Yes" also describes a conditional expression used for identifying XML instances. In Sections 2 and 3, we assume that the XML data of the two documents referred to in each conditional expression have the same representation. In fact, they may have different formats. We define Boolean functions to solve this problem. Consequently, the mechanism presented combines a Skolem function and user-defined Boolean functions to identify XML instances. The Boolean function samename specified in Figure 6 returns true if its two arguments actually refer to the same name, even though they have different formats.

Fig. 6. An example merge template XML file.
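Concretely, a merge template of roughly the following shape would fit Example 1; the paths, the factor syntax, and the placement of the Selected attribute are illustrative guesses based on the description above rather than a copy of Figure 6.

    <MergeTemplate>
      <P1>Factory/Department/Employees/Employee</P1>
      <P2>Persons/Person</P2>
      <Key>
        <Factor Selected="Yes">samename(Name, @PName)</Factor>
        <Factor>::Department/@DName = WorkIn/Unit</Factor>
      </Key>
    </MergeTemplate>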

7 Related Work

Bertino et al. point out that XML data integration involves reconciliation at data model level, data schema level, and data instance level [2]. In this paper, we


mainly focus on reconciliation at the data instance level to merge XML documents that have different structures. A lot of research on the semantic integration of XML data has been conducted [3, 10]. Castano et al. propose a semantic approach to the integration of heterogeneous XML data by building a domain ontology [3]. Rodríguez-Gianolli et al. present a framework that provides a tool to integrate DTDs into a common conceptual schema [10]. Several systems for processing XML or XML streams have been developed [8, 9]. The Niagara system focuses on providing query capabilities for XML documents and can handle infinite streams [9]. Lore is a semi-structured data repository that builds a database system to query XML data [8]. The merging operation defined in this paper is not available in any of these works or systems. A lot of research on merging or integrating XML data that has similar or identical structures has also been done [6, 7]. A data model for semi-structured data is introduced and an integration operator is defined in [7]; this operator integrates similarly structured XML data. Lindholm designs a 3-way merging algorithm for XML files that comply with an identical DTD [6]. In contrast, the mechanism proposed in this paper can merge two XML documents that have different structures. Merge Templates that specify how to recursively combine two XML documents are introduced by Tufte et al. [12]. Our work differs from theirs in several aspects. First, the Merge operation proposed by Tufte et al. combines two similarly structured XML documents to create aggregates over streams of XML fragments. Second, a method for merging XML elements and handling typical conflicts is proposed in this paper. When merging XML documents, it is critical to identify XML data instances that represent the same real-world object. Albert uses the term instance identification to refer to this problem [1]. This problem has been investigated in [1, 5], and these papers propose different methods to deal with it: a universal key is used in [1], while Lim et al. define the union of the keys of the data sources [5]. However, these works deal with databases and support typed data. The Skolem function is introduced in [4]; it returns a value for an object as the identifier of this object. Saccol et al. present a proposal for instance identification based on the Skolem function [11]. The mechanism presented in this paper combines the Skolem function and Boolean functions defined by designers [11] to identify XML instances.

8 Conclusion

We have defined a merging operation over XML documents that is similar to a full outer join in relational algebra. It can merge two XML documents with different structures. We have implemented a prototype to merge XML documents. We plan to investigate other operations over XML documents, such as intersection and difference.


References

1. J. Albert. Data Integration in the RODIN Multidatabase System. In Proceedings of the First IFCIS International Conference on Cooperative Information Systems (CoopIS'96), pages 48–57, Brussels, Belgium, June 19-21, 1996.
2. E. Bertino and E. Ferrari. XML and Data Integration. IEEE Internet Computing, 5(6):75–76, 2001.
3. S. Castano, A. Ferrara, G. S. Kuruvilla Ottathycal, and V. De Antonellis. Ontology-based Integration of Heterogeneous XML Datasources. In Proceedings of the 10th Italian Symposium on Advanced Database Systems (SEBD'02), pages 27–41, Isola d'Elba, Italy, June 2002.
4. J. L. Hein. Discrete Structures, Logic, and Computability. Jones and Bartlett Publishers, USA, 1995.
5. E. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity Identification in Database Integration. In Proceedings of the Ninth International Conference on Data Engineering, pages 294–301, Vienna, Austria, April 19-23, 1993. IEEE Computer Society.
6. T. Lindholm. A 3-way Merging Algorithm for Synchronizing Ordered Trees: the 3DM Merging and Differencing Tool for XML. Master's thesis, Helsinki University of Technology, Department of Computer Science, 2001.
7. M. Liu and T. W. Ling. A Data Model for Semi-structured Data with Partial and Inconsistent Information. In Proceedings of the Seventh International Conference on Extending Database Technology (EDBT 2000), pages 317–331, Konstanz, Germany, March 27-31, 2000. Springer.
8. J. McHugh, S. Abiteboul, R. Goldman, D. Quass, and J. Widom. Lore: A Database Management System for Semistructured Data. SIGMOD Record, 26(3):54–66, September 1997.
9. J. Naughton, D. DeWitt, D. Maier, et al. Niagara Internet Query System. IEEE Data Engineering Bulletin, 24(2):27–33, June 2001.
10. P. Rodríguez-Gianolli and J. Mylopoulos. A Semantic Approach to XML-based Data Integration. In Proceedings of the 20th International Conference on Conceptual Modelling (ER), pages 117–132, Yokohama, Japan, November 27-30, 2001.
11. D. B. Saccol and C. A. Heuser. Integration of XML Data. In Efficiency and Effectiveness of XML Tools and Techniques and Data Integration over the Web: VLDB 2002 Workshop EEXTT and CAiSE 2002 Workshop DIWeb, pages 68–80. Springer, 2003.
12. K. Tufte and D. Maier. Merge as a Lattice-Join of XML Documents. In Proceedings of the 28th VLDB Conference, Hong Kong, China, 2002.

Schema-Based Web Wrapping

Sergio Flesca and Andrea Tagarelli

DEIS, University of Calabria, Italy
{flesca,tagarelli}@deis.unical.it

Abstract. An effective solution to automate information integration is represented by wrappers, i.e. programs which are designed for extracting relevant contents from a particular information source, such as web pages. Wrappers allow such contents to be delivered through a self-describing and easily processable representation model. However, most existing approaches to wrapper design focus mainly on how to generate extraction rules, but do not weigh the importance of specifying and exploiting the desired schema of the extracted information. In this paper, we propose a new wrapping approach which encompasses both extraction rules and the schema of the required information in wrapper definitions. We investigate the advantages of suitably exploiting extraction schemata, and we define a clean declarative wrapper semantics by introducing (preferred) extraction models for source HTML documents with respect to a given wrapper.

1 Introduction

Information available on the Web is mainly encoded in the HTML format. Typically, HTML pages follow source-native and fairly structured styles, and are thus ill-suited for automatic processing. However, the need for extracting and integrating information from different sources into a structured format has become a primary requirement for many information technology companies. For example, one would like to monitor appealing offers about books concerning specific topics. Here, an interesting offer may consist of finding highly rated books. In this context, an effective solution for automating information integration lies in the exploitation of wrappers. Essentially, wrappers are programs designed for extracting relevant contents from a particular information source (e.g. HTML pages), and for delivering such contents through a self-describing and easily processable representation model. XML [19] is widely known as the standard for representing and exchanging data on the web, and therefore successfully fulfills the above requirements for a wrapping environment. Generally, a wrapper consists of a set of extraction rules which are used both to recognize relevant content portions within a document and to map them to specific semantics. Several wrapping technologies have recently been developed: we mention here TSIMMIS [8], FLORID [15], DEByE [13], W4F [18], XWrap [14], RoadRunner [2], and Lixto [1] as exemplary systems proposed by the research community. Traditional issues concerning wrapper systems are the


development of powerful languages for expressing extraction rules and the capability of generating these rules with the least human effort. Such issues can be addressed by a number of approaches, such as wrapper induction based on learning from annotated examples [6, 9, 12, 17] and the visual specification of wrappers [1]. The first approach suffers from negative theoretical results on the expressive power of learnable extraction rules, while visual wrapper generation allows the definition of more expressive rules [7]. However, although the schema of the required information should be carefully defined at the time of wrapper generation, most existing wrapper design approaches focus mainly on how to specify extraction rules. Indeed, while generating wrappers, such approaches ignore the potential advantages coming from the specification and usage of the extraction schema, that is, the desired schema of the documents to be created to contain the extracted information. A specific extraction schema can help to recognize and discard irrelevant or noisy information from documents resulting from the data extraction, thus improving the accuracy of a wrapper. Furthermore, the extracted information can be straightforwardly used in the data integration process, since it follows a specific organization best reflecting user requirements. As a running example, consider the excerpt of an Amazon page displayed in Fig. 1, and suppose we would like to extract the title, the author(s), the customer rate (if available), the price proposed by the Amazon site, and the publication year, for any book listed in the page. The extraction schema for the above information can be suitably represented by the following DTD:
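A DTD of roughly the following shape, assembled from the element names used in the rest of the paper (book, title, author, customer_rate, rate, no_rate, price, year), captures these requirements. The root element name books is an assumption, and the "&" connector stands for the permutation operator mentioned in the next sentence rather than for standard XML DTD syntax; the exact content model of the original schema may differ in detail.

    <!ELEMENT books   (book+)>
    <!ELEMENT book    (title & author+ & (customer_rate | no_rate) & price & year)>
    <!ELEMENT title   (#PCDATA)>
    <!ELEMENT author  (#PCDATA)>
    <!ELEMENT customer_rate (rate)>
    <!ELEMENT rate    (#PCDATA)>
    <!ELEMENT no_rate EMPTY>
    <!ELEMENT price   (#PCDATA)>
    <!ELEMENT year    (#PCDATA)>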

It is easy to see that such a schema allows the extraction of structured information with multi-value attributes (operator +), missing attributes (operator ?), and variant attribute permutations. As mentioned above, existing wrappers are not able to specify and exploit extraction schemata. Some full-fledged systems describe a hierarchical structure of the information to be extracted [1, 17], and they are mostly capable of specifying constraints on the cardinality of the extracted sub-elements. However, no such system allows complex constraints to be expressed: for instance, it is not possible to require that element customer_rate may occur alternatively to element no_rate. As a consequence, validating the extraction of elements with complex contents is not allowed. Two preliminary attempts at exploiting information on the extraction schema have recently been proposed in the information extraction [10] and wrapping [16] research areas. In the former work, schemata are represented as tree-like structures that do not allow alternative subexpressions to be expressed. Moreover, a heuristic approach is used to make a rule fit other mapping rule instances; as a consequence, rule refinement based on user feedback is needed.


Fig. 1. Excerpt of a sample Amazon page from www.amazon.com

In [10], DTD-style extraction rules exploiting enhanced content models are used in both the learning and extraction phases. The work in [3] follows a particular direction of research: turning the schema matching problem into an extraction problem based on inferring the semantic correspondence between a source HTML table and a target HTML schema. The proposed approach differs from previous ones related to schema mapping since it entails elements of table understanding and extraction ontologies. In particular, table understanding strategies are exploited to form attribute-value pairs, and then an extraction ontology performs the data extraction. It is worth noticing that all the above approaches lack a rigorous formalism for the specification of extraction rules. Moreover, they do not define any model for the construction of the documents into which the extracted information has to be inserted. Our contributions can be summarized as follows. We propose a novel wrapping approach which improves standard approaches based on hierarchical extraction by introducing the extraction schema into the wrapper generation process. Indeed, a wrapper is defined by specifying, besides a set of extraction rules, the desired schema of the XML documents to be built from the extracted information. The availability of the schema not only allows the extracted XML documents to be effectively used for further processing, but also allows the exploitation of simpler rules for extracting the desired information. For instance, to extract customer_rate from a book, a standard approach would have to express a rule that extracts the third row of a book table only if this row contains an image displaying


the “rate”. The presence of the extraction schema allows the definition of two simple rules, one for customer_rate element and one for its rate subelement: the former extracts the third row of the book table, while the latter extracts an image. Moreover, our approach in principle does not rely on any particular form of extraction rules, that is any preexisting kind of rules can be easily plugged in; however, we show that XPath extraction rules are particularly suitable for our purposes. Finally, we define a clean declarative semantics of schema-based wrappers: this is accomplished by introducing the concept of extraction models for source documents with respect to a given wrapper, and by identifying a unique preferred model.
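As a rough sketch, the target XPath expressions of those two rules could be as simple as the following; the concrete paths depend on the table layout of the Amazon page in Fig. 1 and on the rule syntax introduced in Section 3.1, so they are only indicative.

    customer_rate:  .//table/tr[3]     (applied to the node sequence of a book)
    rate:           .//img             (applied to the node sequence of a customer_rate)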

2 Preliminaries

Any XML document can be associated with a document type definition (DTD) that defines the structure of the document and what tags might be used to encode the document. A DTD is a tuple where: i) El is a finite set of element names, ii) P is a mapping from El to element type definitions, and iii) is the root element name. An element type definition is a one-unambiguous regular expression defined as follows1:
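A plausible reconstruction of this definition, using the standard DTD content-model operators together with the #PCDATA, EMPTY, and ANY cases described next (the exact notation of the original may differ), is the following:

    α ::= #PCDATA | EMPTY | ANY | e          (e an element name in El)
        | (α , α)                            (sequence)
        | (α | α)                            (choice)
        | α? | α+ | α*                       (optional, one-or-more, zero-or-more)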

where #PCDATA is an element whose content is composed of character data, EMPTY is an element without content, and ANY is an element with generic content. An element type definition specifies an element-content model that constrains the allowed types of the child elements and the order in which they are allowed to appear. A recursive DTD is a DTD with at least a recursive element type definition, i.e. an element whose definition refers to itself or an element that can be its ancestor. In other terms, a recursive DTD admits documents such that an element may contain (directly or indirectly) an element of the same type. For the sake of presentation clarity, we refer to DTDs which do not contain attribute lists. As a consequence, we consider a simplified version of XML documents, whose elements have no attributes. In our domain, the application of a wrapper to a source document can produce several candidate document results. A desirable property of a wrapping framework should be that of producing results that are ordered with respect to some criteria in order to identify a unique preferred extraction document. We accomplish this objective by exploiting partially ordered regular expressions [4], i.e. an extension of regular expressions where a partial order between strings holds. A partially ordered language over a given alphabet is a pair where L is a (standard) language over (a subset of and 1

The symbol denotes different productions with the same left part. Here we do not consider mixed content of elements [19].


is a partial order on the strings of L. Ordered regular expressions are defined by adapting classical operations for standard languages to partially ordered languages. In particular, a new set of strings and a partial order on this set can be defined for the operations of prioritized union, concatenation, and prioritized closure between languages [4]. Let be an alphabet. The ordered regular expressions over and the sets that they denote, are defined recursively as follows: 1. is a regular expression and denotes the empty language 2. for each is a regular expression and denotes the language 3. if and are regular expressions denoting languages and respectively, then i) denotes the prioritized union language ii) denotes the concatenation language iii) denotes the prioritized closure language

Proposition 1. Let be a one-unambiguous ordered regular expression. The language is linearly ordered.

3 Schema-Based Wrapping Framework

In the following we describe our proposal to extend traditional hierarchical wrappers in such a way they can effectively benefit from exploiting extraction schemata. To this purpose, we do not focus on a particular extraction language, but investigate how to build documents, for the extracted information, that are valid with respect to a predefined schema. Indeed, our approach can profitably employ different kinds of extraction rules. Therefore, before describing the schema-based wrapping approach in more detail, we introduce a general notion of extraction rule. We assume any source HTML document is represented by its parse tree, also called as XHTML document. Generally, each extraction rule works on a sequence of nodes of an HTML parse tree, providing a sequence of sequences of nodes. Notice that working on a tree-based model for HTML data is not a strong requirement, and can be easily relaxed. However, for the sake of simplicity, we do not refer to string-based extraction rules like those introduced in [1, 11, 17]. Definition 1 (Extraction rule). Given an HTML parse tree doc and a sequence of nodes in doc, an extraction rule is a function associating with a sequence S of node sequences. Extraction rules so defined can be seen as a generalization of Lixto extraction filters. The main difference with respect to Lixto filters is that our rules allow the extraction of non-contiguous portions of an HTML document. However, an extraction rule is not able to contain references to elements extracted by different rules. Moreover, we define a special type of extraction rules which turn out to be particularly useful to address the problem of wrapper evaluation [5].


Definition 2 (Monotonic extraction rule). Given a sequence of nodes in an HTML parse tree doc, a monotonic extraction rule is a function associating with a sequence S of node sequences such that, for each sequence and for each node there exists which is ancestor of Let us now introduce our notion of wrapper. A wrapper is essentially composed of: i) the desired schema of the information to be extracted from HTML documents, and ii) a set of extraction rules. As in most earlier approaches (such as [1, 17]), the extraction of the desired information proceeds in a hierarchical way. The formal definition of a wrapper is provided below. Definition 3 (Wrapper). Let be a DTD, be a set of extraction rules, and be a function associating each pair of elements with a rule A wrapper is defined as

In practice, a wrapper associates the root element of the DTD with the root of the HTML parse tree to be processed, then it recursively builds the content of by exploiting the extraction rules to identify the sequences of nodes that should be extracted. In other terms, once an element has been associated with a sequence of nodes of the source document, an extraction rule is applied to to identify the sequences that can be associated with the children of In order to devise a complete specification of a wrapper, we further propose an effective implementation of extraction rules based on the XPath language [20].

3.1 XPath Extraction Rules

The primary syntactic construct in XPath is the expression. An expression is evaluated to yield an ordered collection of nodes without duplicates, i.e. a sequence of nodes. In this work, we consider XPath expressions with variables. The evaluation of an XPath expression occurs with respect to a context and a variable binding. Variable bindings represent mappings from variable names to sequences of objects. Formally, given a variable binding and a variable name we denote with the sequence associated to by Moreover, given two disjoint variable bindings and we denote with a variable binding such that, for each (resp. if (resp. is defined, otherwise is undefined. Given an XPath expression an XHTML document doc, a sequence of nodes and a variable binding denotes the sequence of nodes provided by when is evaluated on doc, starting from and according to The relation between the result of an XPath expression and a variable is represented by the concept of XPath predicate, which is formally defined as follows. Definition 4 (XPath predicate). Given a set of variables and an XPath expression using the variables we denote an XPath predicate with Moreover, we denote a subsequence XPath predicate with


Given an XHTML document doc and a variable binding an XPath predicate is true with respect to if Analogously, a subsequence XPath predicate is true with respect to if is a subsequence of Moreover, we consider an order on node sequences which is defined according to the document order. Given two sequences and precedes if there exists an index such that and for each or is a prefix of Given an XHTML document doc, a variable binding and a subsequence XPath predicate we denote with the sequence of node sequences such that for each and is true with respect to for each XPath predicates are the basis of more complex concepts, such as extraction filters and extraction rules. An extraction filter is defined over both a target predicate and a set of other predicates which act as filter conditions. Definition 5 (XPath extraction filter). Given a set of variables an XPath extraction filter is defined as a tuple where: is a target predicate, that is a subsequence XPath predicate defining variable on the empty set of variables; is a conjunction of predicates defined on variables The application of an XPath extraction filter to a sequence of nodes yields a sequence of node sequences where: 1) for each 2) for each and 3) there exists a substitution which is disjoint with respect to such that each XPath predicate in is true with respect to We devise any extraction rule as a composition of two kinds of filters: extraction filters and external filters. The latter specify conditions on the size of the extracted sequences. In particular, we consider the following external filters: an absolute size condition filter as specified by bounds (min, max) on the size of a node sequence that is is true if a relative size condition filter rs specified by policies {minimize, maximize}, that is, given a sequence S of node sequences and a sequence is true if rs = minimize (resp. rs = maximize) and there not exists a sequence such that (resp. Definition 6 (XPath extraction rule). An XPath extraction rule is defined as where is a disjunction of extraction filters, as and rs are external filters. For any sequence of nodes, the application of an XPath extraction rule to yields a sequence of node sequences which is constructed as follows. Firstly, we build the ordered sequence that is the sequence obtained by merging the sequences produced by each extraction filter applied to Secondly, we derive the sequence of node

sequences by removing from all the sequences such that is false. Finally, we obtain by removing from all the sequences such that is false.

Example 1. Suppose we are given an extraction rule where filters and are defined respectively as:

Consider now the document tree sketched below, and suppose we apply the rule to the sequence of nodes

The target predicate of returns the sequence [[5], [5,7], [5,7,8], [7], [7,8], [8]], which is turned into [[5], [5,8], [8]] by applying the conditions in Analogously, the target predicate of returns the sequence [[11], [11,13], [11,13,14], [11,13,14,16], [13], [13,14], [13,14,16], [14], [14,16], [16]], which is simplified to [13,14]. The union between and is computed as By applying the external filters, it can straightforwardly be derived that the resulting sequence is [[5,8], [13,14]].

4 Wrapper Semantics

In this section we provide a clean declarative semantics for schema-based wrappers. This is accomplished by introducing the notion of extraction models for source HTML documents with respect to a given wrapper. Extraction models are essentially collections of extraction events. An extraction event models the extraction of a subsequence by means of an extraction rule which is applied to a context, that is a specific sequence of nodes. However, not all the extraction events turn out to be useful for the construction of the XML document dedicated to contain the extracted information: extraction models are able to identify those events that can be profitably exploited in building an XML document.


4.1 Extraction Events and Models

The notion of extraction model relies strictly on the notion of extraction event. An extraction event happens whenever an extraction rule is applied. We assume that each extraction event is associated with a unique identifier. Definition 7 (Extraction event). Given a target element name and an associated node sequence an extraction event is a tuple where id and pid denote the identifiers of the current and parent extraction event, respectively, and pos denotes the position of relative to event pid. In order to build an XML document to be extracted by a wrapper, we have to consider sets of extraction events. However, only some sets of extraction events correspond to a valid document. Therefore, we have to carefully characterize such sets of extraction events. To this purpose, let us introduce some preliminary definitions on properties of sets of extraction events. To begin with, a set of extraction events is said to be well-formed if the following conditions hold: there not exist two events and in such that i.e. an extraction event must have a unique identifier; there not exist two events and such that i.e. two sibling events cannot refer to the same position; there not exist two events and such that i.e. two identical node sequences cannot be associated to the same element. Notations for handling well-formed sets of extraction events are introduced next. Given a set of extraction events and a specific event identified by pid, we denote with the set containing all the extraction events which are children of i.e. We further describe two simple functions, namely elnames and linearize, that provide flat versions of a set of extraction events. Given an event identifier pid and a set of extraction events, we denote with the list of extraction events in such that the events are ordered by position. Moreover, we denote with the sequence of element names corresponding to formally, where Extraction events need to be characterized with respect to their conformance to a given regular expression specifying an element type. Given a regular expression on an alphabet of element names, and an event identifier pid, we say that is valid for if spells i.e. the string formed by concatenating element names in belongs to the language We are now able to characterize the validity of a set of extraction events with respect to the definition of an element. Let be a DTD and be a wrapper. We say that a well-formed set of extraction events is valid for an element name if the following conditions hold:


Fig. 2. Sketch of HTML parse tree of page in Fig.1

or or for each extraction event is valid for and for each event there not exist two extraction events in such that and and contains extraction events such that

and and does not precede

in

An extraction model is essentially a well-formed set of extraction events that conform to the definition of all the elements appearing in the DTD specified within a wrapper. Moreover, an extraction model can be represented by a tree of extraction events. Definition 8 (Extraction Model). Let be a DTD, be a wrapper, doc be an XHTML document, and be a well-formed set of extraction events. is said to be an extraction model of doc with respect to (for short, is an extraction model of if: corresponds to a tree where N is the set of extraction events, E is formed by pairs such that and is the parent event of and is a function associating an identifier to each extraction event; for each extraction event is valid for Example 2. Consider again the Amazon page displayed in Fig.1, and suppose that such a page is subject to a wrapper based on the DTD presented in the Introduction. The extraction rules used by this wrapper are reported on the third column of Table 1; we assume that (1,1) and minimize are adopted as default


external filters. The first column reports the target element names associated to each rule, whereas the parent element names can be deduced by the DTD. Extraction events occurring in the example model are reported on the second column of the table. For the sake of simplicity, we focus only on a portion of the document doc corresponding to the page of Fig.1; the parse tree associated with doc is sketched in Fig.2. Therefore, we consider only some events, according to the portion of page we have chosen. Event occurs implicitly in the model under consideration, thus it is not extracted by any rule. Offered books are stored into a unique table, which is extracted by event using filter This filter fulfills the requirement that the book table has to be preceded by a simpler table containing a selection list. Information about any book is stored into a separate table which consists of two parts: the first one contains a book picture, while the second one is another table divided into eight rows, one for each specific information about the book. Let us consider the first instance of book, whose subtree is rooted in node 25 of the parse tree. The book, which is identified by event using filter has information on title, (one) author, year, customer rate, and price. The set of events which are children of is built as Even though information on customer rate is available from the first instance of book, we can observe that event happens for element no_rate: however, such an event cannot appear in the model, because would not be a valid content for an element of type book. It is worth noting that rules for extracting information on both availability and unavailability of customer rate have been intentionally defined as identical in this example. However, both kinds of extraction events occur only in any book having customer rate, while only event for element no_rate is extracted from any book which has not customer rate. This happens since it is not possible that an event for rate occurs as a child of an event for no_rate. An extraction model is implicitly associated with a unique XML document, which is valid with respect to a previously specified schema. Given a DTD a wrapper an XHTML document doc, and an extraction model of we define the function buildDoc which takes and an event as input and returns the XML fragment relative to For any event is recursively defined as follows: if then if then if is a regular expression then where In the above definitions, denotes the concatenation of the string values of the nodes in and symbol‘+’ is used to indicate the concatenation of strings. Moreover, we denote with the application of buildDoc to the root extraction event in Definition 9 (Extracted XML document). Given a wrapper and an XHTML document doc, an XML document xdoc is extracted from doc


by applying (hereinafter referred to as if there exists an extraction model of such that Moreover, we denote with the set of all the XML documents xdoc such that

Theorem 1. Let be a wrapper and doc be an XHTML document. If is not recursive and all the extraction rules in are monotonic, then: 1. each extraction model of is finite, and the cardinality of is bounded by a polynomial with respect to the size of doc; 2. the set is finite.

4.2 Preferred Extraction Models

Extraction models provide us a characterization of the set of XML documents that encode the information extracted by a wrapper from a given XHTML document doc, i.e. the set Each document in this set represents a candidate result of the application of to doc. However, this should not be a desirable property for a wrapping framework. In this section we investigate the requirements to identify a unique document which is preferred with respect to all the candidate XML extracted documents.


Firstly, we introduce an order relation between sets of extraction events having the same parent element type. Consider two extraction models and of and two events and We say that precedes (hereinafter referred to as if the following conditions hold: precedes in the language or is equal to and there exists a position pos such that for each if and then and and if and then precedes in or is equal to and there exists a position pos such that for each if and then and and and if and then The above order relation allows us to define an order relation between sets of extraction events, and consequently between extracted documents. Given two extraction models and of we have that precedes if Moreover, given two XML documents and generated from and respectively, we say that precedes if, for each model of there exists a model of such that Definition 10 (Preferred extracted document). Let be a DTD, be a wrapper, doc be an XHTML document and xdoc be an XML document in xdoc is preferred in if, for each document holds. Theorem 2. Let be a DTD, be a wrapper, and doc be an XHTML document. There exists a unique preferred extracted document pxdoc in

5 Conclusions and Future Work

In this work, we posed the theoretical basis for exploiting the schema of the information to be extracted in a wrapping process. We provided a clean declarative semantics for schema-based wrappers, through the definition of extraction models for source HTML documents with respect to a given wrapper. We also addressed the issue of wrapper evaluation, developing an algorithm which works in polynomial time with respect to the size of a source document; the reader is referred to [5] for detailed information. We are currently developing a system that implements the proposed wrapping approach. As ongoing work, we plan to introduce enhancements to extraction schema. In particular, we are interested in considering XSchema constraints and relaxing the one-unambiguous property.


References

1. R. Baumgartner, S. Flesca, and G. Gottlob. Visual Web Information Extraction with Lixto. In Proc. 27th VLDB Conf., pages 119–128, 2001.
2. V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. In Proc. 27th VLDB Conf., pages 109–118, 2001.
3. D. W. Embley, C. Tao, and S. W. Liddle. Automatically Extracting Ontologically Specified Data from HTML Tables of Unknown Structure. In Proc. 21st ER Conf., pages 322–337, 2002.
4. S. Flesca and S. Greco. Partially Ordered Regular Languages for Graph Queries. In Proc. 26th ICALP, pages 321–330, 1999.
5. S. Flesca and A. Tagarelli. Schema-based Web Wrapping. Tech. Rep., DEIS - University of Calabria, 2004. Available at http://www.deis.unical.it/tagarelli/.
6. D. Freitag and N. Kushmerick. Boosted Wrapper Induction. In Proc. 17th AAAI Conf., pages 577–583, 2000.
7. G. Gottlob and C. Koch. Monadic Datalog and the Expressive Power of Languages for Web Information Extraction. In Proc. 21st PODS Symp., pages 17–28, 2002.
8. J. Hammer, J. McHugh, and H. Garcia-Molina. Semistructured Data: The TSIMMIS Experience. In Proc. 1st ADBIS Symp., pages 1–8, 1997.
9. C.-H. Hsu and M.-T. Dung. Generating Finite-State Transducers for Semistructured Data Extraction from the Web. Information Systems, 23(8):521–538, 1998.
10. D. Kim, H. Jung, and G. Geunbae Lee. Unsupervised Learning of mDTD Extraction Patterns for Web Text Mining. Information Processing and Management, 39(4):623–637, 2003.
11. N. Kushmerick. Wrapper Induction: Efficiency and Expressiveness. Artificial Intelligence, 118(1–2):15–68, 2000.
12. N. Kushmerick, D. S. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. In Proc. 15th IJCAI, pages 729–737, 1997.
13. A. H. F. Laender, B. A. Ribeiro-Neto, and A. S. da Silva. DEByE - Data Extraction By Example. Data & Knowledge Engineering, 40(2):121–154, 2002.
14. L. Liu, C. Pu, and W. Han. XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. In Proc. 16th ICDE, pages 611–621, 2000.
15. W. May, R. Himmeröder, G. Lausen, and B. Ludäscher. A Unified Framework for Wrapping, Mediating and Restructuring Information from the Web. In ER Workshops, pages 307–320, 1999.
16. X. Meng, H. Lu, H. Wang, and M. Gu. Data Extraction from the Web Based on Pre-Defined Schema. JCST, 17(4):377–388, 2002.
17. I. Muslea, S. Minton, and C. Knoblock. Hierarchical Wrapper Induction for Semistructured Information Sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001.
18. A. Sahuguet and F. Azavant. Building Intelligent Web Applications Using Lightweight Wrappers. Data & Knowledge Engineering, 36(3):283–316, 2001.
19. World Wide Web Consortium – W3C. Extensible Markup Language 1.0, 2000.
20. World Wide Web Consortium – W3C. XML Path Language 2.0, 2003.

Web Taxonomy Integration Using Spectral Graph Transducer

Dell Zhang (1,2), Xiaoling Wang (3), and Yisheng Dong (4)

1 Department of Computer Science, School of Computing, National University of Singapore, S15-05-24, 3 Science Drive 2, Singapore 117543
2 Computer Science Programme, Singapore-MIT Alliance, E4-04-10, 4 Engineering Drive 3, Singapore 117576
3 Department of Computer Science & Engineering, Fudan University, 220 Handan Road, Shanghai, 200433, China
4 Department of Computer Science & Engineering, Southeast University, 2 Sipailou, Nanjing, 210096, China

Abstract. We address the problem of integrating objects from a source taxonomy into a master taxonomy. This problem is not only currently pervasive on the web, but also important to the emerging semantic web. A straightforward approach to automating this process would be to train a classifier for each category in the master taxonomy, and then classify objects from the source taxonomy into these categories. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. In this paper, we attempt to use a powerful transductive learning algorithm, Spectral Graph Transducer (SGT), to attack this problem. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration.

1 Introduction

A taxonomy, or directory or catalog, is a division of a set of objects (documents, images, products, goods, services, etc.) into a set of categories. There are a tremendous number of taxonomies on the web, and we often need to integrate objects from a source taxonomy into a master taxonomy. This problem is currently pervasive on the web, given that many websites are aggregators of information from various other websites [1]. A few examples will illustrate the scenario. A web marketplace like Amazon1 may want to combine goods from multiple vendors’ catalogs into its own. A web portal like NCSTRL2 may want to combine documents from multiple libraries’ directories into its own. A company may want to merge its service taxonomy with its partners’. A researcher may want to

1 http://www.amazon.com/
2 http://www.ncstrl.org/


merge his/her bookmark taxonomy with his/her peers’. Singapore-MIT Alliance3, an innovative engineering education and research collaboration among MIT, NUS and NTU, has a need to integrate the academic resource (courses, seminars, reports, softwares, etc.) taxonomies of these three universities. This problem is also important to the emerging semantic web [2], where data has structures and ontologies describe the semantics of the data, thus better enabling computers and people to work in cooperation. On the semantic web, data often come from many different ontologies, and information processing across ontologies is not possible without knowing the semantic mappings between them. Since taxonomies are central components of ontologies, ontology mapping necessarily involves finding the correspondences between two taxonomies, which is often based on integrating objects from one taxonomy into the other and vice versa [3,4]. If all taxonomy creators and users agreed on a universal standard, taxonomy integration would not be so difficult. But the web has evolved without central editorship. Hence the correspondences between two taxonomies are inevitably noisy and fuzzy. For illustration, consider the taxonomies of two web portals Google4 and Yahoo5: what is “Arts/ Music/ Styles/” in one may be “Entertainment/ Music/ Genres/” in the other, category “Computers_and_Internet/ Software/ Freeware” and category “Computers/ Open_Source/ Software” have similar contents but show non-trivial differences, and so on. It is unclear if a universal standard will appear outside specific domains, and even for those domains, there is a need to integrate objects from legacy taxonomy into the standard taxonomy. Manual taxonomy integration is tedious, error-prone, and clearly not possible at the web scale. A straightforward approach to automating this process would be to formulate it as a classification problem which has being well-studied in machine learning area [5]. Our key insight is that the availability of the source taxonomy data could be helpful to build better classifiers in this scenario, therefore it would be beneficial to do transductive learning rather than inductive learning, i.e., learning to optimize classification performance on a particular set of test examples. In this paper, we attempt to use a powerful transductive learning algorithm, Spectral Graph Transducer (SGT) [6], to attack this problem. Noticing that the categorizations of the master and source taxonomies often have some semantic overlap, we propose to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data. Our experiments with real-world web data show substantial improvements in the performance of taxonomy integration. The rest of this paper is organized as follows. In §2, we review the related work. In §3, we give the formal problem statement. In §4, we present our approach in detail. In §5, we conduct experimental evaluations. In §6, we make concluding remarks.

2 Related Work

Most of the recent research efforts related to taxonomy integration are in the context of ontology mapping on semantic web. An ontology specifies a conceptualization of a domain in terms of concepts, attributes, and relations [7]. The concepts in an ontology 3 4 5

http://web.mit.edu/sma/ http://www.google.com/ http://www.yahoo.com/

302

Dell Zhang, Xiaoling Wang, and Yisheng Dong

are usually organized into a taxonomy: each concept is represented by a category and associated with a set of objects (called the extension of that concept). The basic goal of ontology mapping is to identify (typically one-to-one) semantic correspondences between the taxonomies of two given ontologies: for each concept (category) in one taxonomy, find the most similar concept (category) in the other taxonomy. Many works in this field use a variety of heuristics to find mappings [8-11]. Recently machine learning techniques have been introduced to further automate the ontology mapping process [3, 4, 12-14]. Some of them derive similarities between concepts (categories) based on their extensions (objects) [3, 4, 12], therefore they need to first integrate objects from one taxonomy into the other and vice versa (i.e., taxonomy integration). So our work can be utilized as a basic component of an ontology mapping system. As explained later in §3, taxonomy integration can be formulated as a classification problem. The Rocchio algorithm [15, 16] has been applied to this problem in [3]; and the Naïve Bayes (NB) algorithm [5] has been applied to this problem in [4], without exploiting information in the source taxonomy. In [1], Agrawal and Srikant proposed the Enhanced Naïve Bayes (ENB) approach to taxonomy integration, which enhances the Naïve Bayes (NB) algorithm [5]. In [17], Zhang and Lee proposed the CS-TSVM approach to taxonomy integration, which enhances the Transductive Support Vector Machine (TSVM) algorithm [18] by the distance-based Cluster Shrinkage (CS) technique. They later proposed another approach in [19], CB-AB, which enhances the AdaBoost algorithm [20-22] by the Co-Bootstrapping (CB) technique. In [23], Sarawagi, Chakrabarti and Godboley independently proposed the Co-Bootstrapping technique (which they named CrossTraining) to enhance the Support Vector Machine (SVM) [24, 25] for taxonomy integration, as well as an Expectation Maximization (EM) based approach EM2D (2Dimensional Expectation Maximization). This paper is actually an straightforward extension of [17]. Basically, the approach proposed in this paper is similar to ENB [1] and CS-TSVM [17], in the sense that they are all motivated by the same idea: to bias the learning algorithm against splitting source categories. In this paper, we compare these two state-of-the-art approaches with ours both analytically and empirically. Comparisons with other approaches are left for future work.

3

Problem Statement

Taxonomies are often organized as hierarchies. In this work, we assume for simplicity, that any objects assigned to an interior node really belong to a leaf node which is an offspring of that interior node. Since we now have all objects only at leaf nodes, we can flatten the hierarchical taxonomy to a single level and treat it as a set of categories [1]. Now we formally define the taxonomy integration problem that we are solving. Given two taxonomies: a master taxonomy with a set of categories each containing a set of objects, and a source taxonomy with a set of categories each containing a set of objects, we need to find the category in for each object in

Web Taxonomy Integration Using Spectral Graph Transducer

303

To formulate taxonomy integration as a classification problem, we take as classes, the objects in as training examples, the objects in as test examples, so that taxonomy integration can be automatically accomplished by predicting the class of each test example. It is possible that an object in belongs to multiple categories in Besides, some objects in may not fit well in any existing category in so users may want to have the option to form a new category for them. It is therefore instructive to create an ensemble of binary (yes/no) classifiers, one for each category C in When training the classifier for C, an object in

is labeled as a positive example if it is

contained by C or as a negative example otherwise. All objects in are unlabeled and wait to be classified. This is called the “one-vs-rest” ensemble method.

4

Our Approach

Here we present our approach in detail. In §4.1, we review transductive learning and explain why it is suitable to our task. In §4.1, we review Spectral Graph Transducer (SGT). In §4.3, we propose the similarity-based Cluster Shrinkage (CS) technique to enhance SGT classifiers. In §4.4, we compare our approach with ENB and CSTSVM.

4.1

Transductive Learning

Regular learning algorithms try to induce a general classifying function which has high accuracy on the whole distribution of examples. However, this so-called inductive learning setting is often unnecessarily complex. For the classification problem in taxonomy integration situations, the set of test examples to be classified are already known to the learning algorithm. In fact, we do not care about the general classifying function, but rather attempt to achieve good classification performance on that particular set of test examples. This is exactly the goal of transductive learning [26]. The transductive learning task is defined on a fixed array of n examples Each example has a desired classification where for binary classification. Given the labels for a subset of (training) examples, a transductive learning algorithm attempts to predict the labels of the remaining (test) examples in X as accurately as possible. Several transductive learning algorithms have been proposed. A famous one is Transductive Support Vector Machine (TSVM), which was introduced by [26] and later refined by [18, 27]. Why can transductive learning algorithms excel inductive learning algorithms? Transductive learning algorithms can observe the examples in the test set and potentially exploit structure in their distribution. For example, there usually exists a clustering structure of examples: the examples in same class tend to be close to each other in feature space, and such kind of knowledge is helpful to learning, especially when there are only a small number of training examples.


Most machine learning algorithms assume that both the training and test examples come from the identical data distribution. This assumption does not necessarily hold in the case of taxonomy integration. Intuitively, transductive learning algorithms seem to be more robust than inductive learning algorithms to the violation of this assumption, since transductive learning algorithms take the test examples into account during learning. This issue deserves further investigation.

4.2

Spectral Graph Transducer

Recently, Joachims introduced a new transductive learning method, the Spectral Graph Transducer (SGT) [6], which can be seen as a transductive version of the k-nearest-neighbor (kNN) classifier. SGT works in three steps. The first step is to build the k-nearest-neighbor (kNN) graph G on the set of examples X. The kNN graph G is similarity-weighted and symmetrized: each example is connected to its k nearest neighbors, the edge weights are given by the similarity of the connected examples, and the resulting adjacency matrix A is symmetrized by averaging it with its transpose.

The function sim(·,·) can be any reasonable similarity measure. In the following, we use the common cosine similarity sim(x_i, x_j) = cos(θ(x_i, x_j)) = (x_i · x_j) / (||x_i|| ||x_j||), where θ(x_i, x_j) represents the angle between x_i and x_j. The second step is to compute the spectrum of G, specifically the 2nd to the (d+1)-th smallest eigenvalues and corresponding eigenvectors of G's normalized Laplacian L = B^-1 (B - A), where B is the diagonal degree matrix with B_ii = Σ_j A_ij. The third step is to classify the examples. Given a set of training labels, SGT makes predictions by solving an optimization problem that minimizes the normalized graph cut under the constraints imposed by the training labels: it seeks the bi-partitioning of the vertices into a positive set G+ and a negative set G- that minimizes cut(G+, G-) / (|G+| · |G-|), where the cut-value cut(G+, G-) is the sum of the edge weights across the cut defined by G+ and G-, subject to the labeled examples lying on their required sides. Although this optimization problem is known to be NP-hard, there are highly efficient methods based on the spectrum of the graph that give a good approximation to the globally optimal solution [6].
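As an illustration of the first two SGT steps (kNN graph construction and spectral decomposition of a normalized Laplacian), here is a minimal numpy sketch. It is an assumption-based stand-in, not Joachims' SGT implementation; in particular, the exact edge-weight normalization and Laplacian form used by the original software may differ.

```python
# Minimal numpy sketch (assumptions, not Joachims' SGT code): build a
# symmetrized cosine-similarity kNN graph and take the spectrum of a
# normalized Laplacian, mirroring the first two SGT steps described above.
import numpy as np

def knn_graph(X, k=10):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-length rows
    S = X @ X.T                                        # cosine similarities
    np.fill_diagonal(S, -np.inf)                       # exclude self-loops
    A = np.zeros_like(S)
    for i in range(len(X)):
        nbrs = np.argsort(S[i])[-k:]                   # k nearest neighbours of x_i
        A[i, nbrs] = S[i, nbrs]
    return (A + A.T) / 2                               # symmetrize

def laplacian_spectrum(A, d=10):
    B = np.diag(A.sum(axis=1))                         # diagonal degree matrix
    L = np.linalg.inv(B) @ (B - A)                     # normalized Laplacian (assumed form)
    vals, vecs = np.linalg.eig(L)
    order = np.argsort(vals.real)
    # the 2nd to (d+1)-th smallest eigenvalues and eigenvectors
    return vals.real[order][1:d + 1], vecs.real[:, order][:, 1:d + 1]

X = np.random.rand(30, 8)                              # toy feature vectors
eigvals, eigvecs = laplacian_spectrum(knn_graph(X, k=5), d=4)
print(eigvals)
```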


For example, consider a classification problem with 6 examples whose kNN graph G is shown in Figure 1 (adopted from [6]), with line thickness indicating edge weight. Given training labels for one positive and one negative example, SGT predicts the labels of the remaining examples by cutting G into a positive part and a negative part so that the normalized cut-value is minimal while the labeled examples stay on their required sides of the cut.

Fig. 1. SGT does classification through minimizing the normalized graph cuts with constraints

Unlike most other transductive learning algorithms, SGT does not need any additional heuristics to avoid unbalanced splits [6]. Furthermore, since SGT has a meaningful relaxation that can be solved globally optimally with efficient spectral methods, it is more robust and promising than existing methods.

4.3

Similarity-Based Cluster Shrinkage

Applying SGT to taxonomy integration, we can effectively use the objects in the source taxonomy (test examples) to boost classification performance. However, thus far we have completely ignored the categorization of the source taxonomy. Although the master and source taxonomies are usually not identical, their categorizations often have some semantic overlap. Therefore the categorization of the source taxonomy contains valuable implicit knowledge about the categorization of the master taxonomy. For example, if two objects belong to the same category S in the source taxonomy, they are more likely to belong to the same category C in the master taxonomy than to be assigned to different categories. We hereby propose the similarity-based Cluster Shrinkage (CS) technique to further enhance SGT classifiers by incorporating the affinity information present in the taxonomy data.

4.3.1 Algorithm

Since SGT models the learning problem as a similarity-weighted kNN graph, it offers a large degree of flexibility for encoding prior knowledge about the relationship between individual examples in the similarity function. Our proposed similarity-based CS technique takes all categories as clusters and shrinks them by substituting the regular similarity function sim(·,·) with the CS similarity function cs-sim(·,·).

Definition 1. The center of a category S is the centroid (mean vector) of the feature vectors of the objects it contains.


Definition 2. The CS similarity function cs-sim(·,·) for two examples x_i and x_j is the linear interpolation, controlled by a parameter 0 ≤ λ ≤ 1, between the similarity of their category centers and the similarity of the examples themselves: cs-sim(x_i, x_j) = λ · sim(c_i, c_j) + (1 - λ) · sim(x_i, x_j), where c_i and c_j are the centers of the categories containing x_i and x_j, respectively.

When an example x belongs to multiple categories, its corresponding category center in the above formula is amended to a combination (the mean) of the centers of those categories. We name our approach that uses SGT classifiers enhanced by the similarity-based CS technique CS-SGT.

4.3.2 Analysis

Theorem 1. For any pair of examples x_i and x_j in the same category S, cs-sim(x_i, x_j) ≥ sim(x_i, x_j).

Proof: Suppose the center of S is c. Since x_i and x_j share the same center, sim(c_i, c_j) = sim(c, c) = 1 ≥ sim(x_i, x_j), and therefore cs-sim(x_i, x_j) = λ · 1 + (1 - λ) · sim(x_i, x_j) ≥ sim(x_i, x_j).

From the above theorem, we see that CS-SGT increases the similarity between examples that are known to be in the same category, and consequently puts more weight on the edge between them in the kNN graph. Since SGT seeks the minimum normalized graph cut, stronger connections among examples in the same category direct SGT away from splitting that category; in other words, the original categorization of the taxonomies is preserved to some degree while doing classification. By substituting the regular similarity function with the CS similarity function, the CS-SGT approach can not only make effective use of the objects in the source taxonomy, like SGT, but also make effective use of the categorization of the source taxonomy.

The interpolation parameter λ controls the influence of the original categorization on the classification. When λ = 1, CS-SGT classifies all objects belonging to one category of the source taxonomy as a whole into a single category of the master taxonomy. When λ = 0, CS-SGT is just the same as SGT. As long as the value of λ is set appropriately, CS-SGT should never be worse than SGT, because it includes SGT as a special case. The optimal value of λ can be found using a tune set (a set of objects whose categories in both taxonomies are known). The tune set can be made available via random sampling or active learning, as described in [1].
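A minimal sketch of the similarity-based Cluster Shrinkage idea follows. The interpolation formula and the center definition are reconstructions of the description above, so treat them as assumptions rather than the authors' exact code.

```python
# Hedged sketch of similarity-based Cluster Shrinkage; the interpolation
# formula and center definition follow the description above and are not
# the authors' exact code.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def category_center(vectors):
    c = np.mean(vectors, axis=0)        # assumed: mean of the member vectors
    return c / np.linalg.norm(c)

def cs_sim(x_i, x_j, c_i, c_j, lam=0.2):
    # lam = 0 recovers plain sim(.,.); lam = 1 uses category centers only
    return lam * cosine(c_i, c_j) + (1 - lam) * cosine(x_i, x_j)

# Theorem 1 in action: two objects of the same category share a center,
# so cs_sim can only raise their similarity.
x1, x2 = np.array([1.0, 0.0]), np.array([0.8, 0.2])
c = category_center([x1, x2])
print(cosine(x1, x2), cs_sim(x1, x2, c, c))
```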


4.4


Comparison with ENB and CS-TSVM

Both ENB and CS-TSVM outperform conventional machine learning methods in taxonomy integration, because they are able to leverage the source taxonomy data to improve classification. CS-SGT also follows this idea to enhance SGT for taxonomy integration. ENB [1] is based on NB [5], which is an inductive learning algorithm; in contrast, CS-TSVM is based on TSVM [18], which is a transductive learning algorithm. It has been shown that CS-TSVM is more effective than ENB [17] in taxonomy integration. However, CS-TSVM is not as efficient as ENB because TSVM runs much slower than NB. CS-SGT is based on the recently proposed transductive learning algorithm SGT [6]. We expect CS-SGT to achieve performance similar to CS-TSVM, because in theory SGT is connected to a simplified version of TSVM, and both attempt to incorporate the affinity information present in the taxonomy data into learning. This has been confirmed by our experiments. On the other hand, CS-SGT is much more efficient than CS-TSVM, for the following three reasons. (1) CS-TSVM is based on TSVM, which uses a computationally expensive greedy search to obtain a locally optimal solution; in contrast, CS-SGT is based on SGT, which uses efficient spectral methods to approximate the globally optimal solution. (2) CS-TSVM must run SVM first to get a good estimate of the fraction of positive examples in the test set [17], because TSVM requires that fraction to be fixed a priori [18]; in contrast, CS-SGT does not need this kind of extra computation, thanks to SGT's ability to automatically avoid unbalanced splits [6]. (3) CS-TSVM requires training a TSVM classifier from scratch for each master category, using the "one-vs-rest" ensemble method for multi-class multi-label classification (as stated in §3); in contrast, CS-SGT (or SGT) needs to build and decompose the kNN graph only once for a given set of examples (dataset), which saves a lot of time. It has been observed that construction of the kNN graph is the most time-consuming step of SGT, but it can be sped up using appropriate data structures such as inverted indices or kd-trees [6]. The prominent efficiency advantage of CS-SGT has also been confirmed by our experiments. In summary, the CS-SGT approach achieves performance similar to CS-TSVM in taxonomy integration while remaining as efficient as ENB.

5

Experiments

We conduct experiments with real-world web data, to demonstrate the advantage of our proposed CS-SGT approach to taxonomy integration. To facilitate comparison, we use exactly the same datasets and experimental setup as [17].

5.1

Datasets

We have collected 5 datasets from Google and Yahoo: Book, Disease, Movie, Music and News. Each dataset consists of the slice of Google's taxonomy and the slice of Yahoo's taxonomy about websites on one specific topic.


In each slice of taxonomy, we take only the top-level directories as categories; e.g., the "Movie" slice of Google's taxonomy has categories like "Action", "Comedy", "Horror", etc. In each category, we take all items listed on the corresponding directory page and its sub-directory pages as its objects. An object (listed item) corresponds to a website on the world wide web, which is usually described by its URL, its title, and optionally a short annotation about its content. The set of objects occurring in both Google and Yahoo covers only a small portion (usually less than 10%) of the set of objects occurring in Google or Yahoo alone, which suggests the great benefit of automatically integrating them. This observation is consistent with [1]. The number of categories per object in these datasets is 1.54 on average. This confirms our previous statement in §3 that an object may belong to multiple categories, and justifies our strategy of building a binary classifier for each category in the master taxonomy. The category distributions in all these datasets are highly skewed. For example, in Google's Book taxonomy, the most common category contains 21% of the objects, but 88% of the categories contain less than 3% of the objects and 49% of the categories contain less than 1% of the objects. In fact, skewed category distributions have been commonly observed in real-world applications [28].

5.2

Tasks

For each dataset, we pose two symmetric taxonomy integration tasks: integrating objects from Yahoo into Google, and integrating objects from Google into Yahoo. As described in §3, we formulate each task as a classification problem. The objects appearing in both taxonomies can be used as test examples, because their categories in both taxonomies are known to us [1]. We hide the test examples' master categories but expose their source categories to the learning algorithm in the training phase, and then compare their hidden master categories with the predictions of the learning algorithm in the test phase. Suppose the number of test examples is n. For the Yahoo-into-Google tasks, we randomly sample n objects from the set G-Y as training examples; for the Google-into-Yahoo tasks, we randomly sample n objects from the set Y-G as training examples. This is to simulate the common situation in which the training and test sets are roughly of the same size. For each task, we repeat such random sampling 5 times, and report the classification performance averaged over these 5 random samplings.

5.3

Features

For each object, we assume that the title and annotation of its corresponding website summarize its content, so each object can be considered as a text document composed of its title and annotation. (Note that this differs from [1, 23], which take actual Web pages as objects.) The most commonly used feature extraction technique for text data is to treat a document as a bag-of-words [18, 25]. For each document d in a collection of documents D, its bag-of-words is first pre-processed by removal of stop-words and by stemming.


Each document d is then represented as a feature vector whose i-th component is the importance weight w_i of term t_i (the i-th distinct word occurring in D). Following the TF×IDF weighting scheme, we set w_i to the product of the term frequency tf(t_i, d) and the inverse document frequency idf(t_i). The term frequency tf(t_i, d) is the number of occurrences of t_i in d. The inverse document frequency is defined as idf(t_i) = log(|D| / df(t_i)), where |D| is the total number of documents in D and df(t_i) is the number of documents in which t_i occurs. Finally, all feature vectors are normalized to unit length.
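The TF×IDF weighting and length normalization just described can be sketched in a few lines of plain Python; the helper name and data layout are illustrative assumptions, not the authors' code.

```python
# Plain-Python sketch of the TFxIDF weighting and length normalization
# described above (helper names and data layout are illustrative).
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (already stop-worded and stemmed)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                       # document frequencies
    n_docs = len(docs)
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vec = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})   # unit length
    return vectors

print(tfidf_vectors([["web", "movie"], ["movie", "comedy", "comedy"]]))
```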

5.4

Measures

As stated in §3, it is natural to accomplish a taxonomy integration task via an ensemble of binary classifiers, one for each category in the master taxonomy. To measure classification performance, we use the standard F-score (F1 measure) [15]. The F-score is defined as the harmonic average of precision (p) and recall (r), F = 2pr/(p + r), where precision is the proportion of correctly predicted positive examples among all predicted positive examples, and recall is the proportion of correctly predicted positive examples among all true positive examples. The F-scores can be computed for the binary decisions on each individual category first and then averaged over categories, or they can be computed globally over all the M×n binary decisions, where M is the number of categories in consideration (the number of categories in the master taxonomy) and n is the total number of test examples (the number of objects in the source taxonomy). The former way is called macro-averaging and the latter micro-averaging [28]. It is understood that the micro-averaged F-score (miF) tends to be dominated by the classification performance on common categories, whereas the macro-averaged F-score (maF) is more influenced by the classification performance on rare categories [28]. Since the category distributions are highly skewed (see §5.1), providing both kinds of scores is more informative than providing either alone.
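For concreteness, the micro- and macro-averaged F-scores over the M×n binary decisions can be computed as in the following sketch; the 0/1 matrix layout is an assumption made for illustration.

```python
# Sketch of micro- vs. macro-averaged F-scores over the M x n binary
# decisions (assumed layout: y_true and y_pred are M x n arrays of 0/1,
# one row per master category).
import numpy as np

def f_score(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_macro_f1(y_true, y_pred):
    per_cat, TP, FP, FN = [], 0, 0, 0
    for t, p in zip(y_true, y_pred):
        tp = int(np.sum((t == 1) & (p == 1)))
        fp = int(np.sum((t == 0) & (p == 1)))
        fn = int(np.sum((t == 1) & (p == 0)))
        per_cat.append(f_score(tp, fp, fn))
        TP, FP, FN = TP + tp, FP + fp, FN + fn
    return f_score(TP, FP, FN), float(np.mean(per_cat))   # (miF, maF)
```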

5.5

Settings

We use the SGT software implemented by Joachims (available at http://sgt.joachims.org/) with the following parameters: "-k 10", "-d 100", "-c 1000 -t f -p s". We set the parameter λ of the CS similarity function to 0.2. Fine-tuning λ using tune sets would presumably give better results than sticking with this pre-fixed value; in other words, the performance superiority of CS-SGT is under-estimated in our experiments.

5.6

Results

The experimental results of SGT and CS-SGT are shown in Table 1. We see that CS-SGT achieves substantially better performance than SGT for taxonomy integration.


We think this is because CS-SGT makes effective use of the affinity information present in the taxonomy data.

In Figures 2 and 3, we compare the experimental results of CS-SGT with those of ENB and CS-TSVM, which are taken from [17]. We see that CS-SGT outperforms ENB consistently and significantly. We also find that CS-SGT's macro-averaged F-scores are slightly lower than those of CS-TSVM, while its micro-averaged F-scores are comparable to those of CS-TSVM. On the other hand, our experiments demonstrated that CS-SGT was much faster than CS-TSVM: CS-TSVM took about one or two days to run all the experiments, while CS-SGT finished in several hours.

Fig. 2. Comparing the macro-averaged F-scores of ENB, CS-TSVM and CS-SGT

6

Conclusion

Our main contribution is to show how Spectral Graph Transducer (SGT) can be enhanced for taxonomy integration tasks. We have compared the proposed CS-SGT approach to taxonomy integration with two existing state-of-the-art approaches, and demonstrated that CS-SGT is both effective and efficient.


Fig. 3. Comparing the micro-averaged F-scores of ENB, CS-TSVM and CS-SGT

Future work may include comparing with the approaches in [19, 23], incorporating commonsense knowledge and domain constraints into the taxonomy integration process, and extending our method towards fully functional ontology mapping systems.

References 1. Agrawal, R., Srikant, R.: On Integrating Catalogs. In: Proceedings of the 10th International World Wide Web Conference (WWW). (2001) 603-612 2. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (2001) 3. Lacher, M. S., Groh, G.: Facilitating the Exchange of Explicit Knowledge through Ontology Mappings. In: Proceedings of the Fourteenth International Florida Artificial Intelligence Research Society Conference (FLAIRS). (2001) 305-309 4. Doan, A., Madhavan, J., Domingos, P., Halevy, A.: Learning to Map between Ontologies on the Semantic Web. In: Proceedings of the 11th International World Wide Web Conference (WWW). (2002) 662-673 5. Mitchell, T.: Machine Learning. international edn. McGraw Hill, New York (1997) 6. Joachims, T.: Transductive Learning via Spectral Graph Partitioning. In: Proceedings of the 20th International Conference on Machine Learning (ICML). (2003) 290-297 7. Fensel, D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer-Verlag (2001) 8. Chalupsky, H.: OntoMorph: A Translation System for Symbolic Knowledge. In: Proceedings of the 7th International Conference on Principles of Knowledge Representation and Reasoning (KR). (2000) 471-482 9. McGuinness, D. L., Fikes, R., Rice, J., Wilder, S.: The Chimaera Ontology Environment. In: Proceedings of the 17th National Conference on Artificial Intelligence (AAAI). (2000) 1123–1124 10. Mitra, P., Wiederhold, G., Jannink, J.: Semi-automatic Integration of Knowledge Sources. In: Proceedings of The 2nd International Conference on Information Fusion. (1999) 11. Noy, N. F., Musen, M. A.: PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment. In: Proceedings of the National Conference on Artificial Intelligence (AAAI). (2000) 450-455


12. Ichise, R., Takeda, H., Honiden, S.: Rule Induction for Concept Hierarchy Alignment. In: Proceedings of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 26-29 13. Noy, N. F., Musen, M. A.: Anchor-PROMPT: Using Non-Local Context for Semantic Matching. In: Proceedings of the Workshop on Ontologies and Information Sharing at the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 63-70 14. Stumme, G., Maedche, A.: FCA-MERGE: Bottom-Up Merging of Ontologies. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI). (2001) 225-230 15. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, New York, NY (1999) 16. Rocchio, J. J.: Relevance Feedback in Information Retrieval. In: G. Salton, (ed.) The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall (1971) 313-323 17. Zhang, D., Lee, W. S.: Web Taxonomy Integration using Support Vector Machines. In: Proceedings of the 13th International World Wide Web Conference (WWW). (2004) 18. Joachims, T.: Transductive Inference for Text Classification using Support Vector Machines. In: Proceedings of the 16th International Conference on Machine Learning (ICML). (1999) 200-209 19. Zhang, D., Lee, W. S.: Web Taxonomy Integration through Co-Bootstrapping. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). (2004) 20. Freund, Y., Schapire, R. E.: A Decision-theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences 55 (1997) 119-139 21. Schapire, R. E., Singer, Y.: BoosTexter: A Boosting-based System for Text Categorization. Machine Learning 39 (2000) 135-168 22. Schapire, R. E., Singer, Y.: Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning 37 (1999) 297-336 23. Sarawagi, S., Chakrabarti, S., Godbole, S.: Cross-Training: Learning Probabilistic Mappings between Topics. In: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). (2003) 177-186 24. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines. Cambridge University Press, Cambridge, UK (2000) 25. Joachims, T.: Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In: Proceedings of the 10th European Conference on Machine Learning (ECML). (1998) 137-142 26. Vapnik, V. N.: Statistical Learning Theory. Wiley, New York, NY (1998) 27. Bennett, K.: Combining Support Vector and Mathematical Programming Methods for Classification. In: B. Scholkopf, C. Burges, and A. Smola, (eds.): Advances in Kernel MethodsSupport Vector Learning. MIT-Press (1999) 28. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. In: Proceedings of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR). (1999) 42-49

Contextual Probability-Based Classification

Gongde Guo1,2, Hui Wang2, David Bell3, and Zhining Liao2

1 School of Computer Science, Fujian Normal University, Fuzhou, 350007, China
2 School of Computing and Mathematics, University of Ulster, BT37 0QB, UK
{G.Guo,H.Wang,Z.Liao}@ulster.ac.uk
3 School of Computer Science, Queen's University Belfast, BT7 1NN, UK
[emailprotected]

Abstract. The k-nearest-neighbor (kNN) method for classification is simple but effective in many cases. The success of kNN in classification depends on the selection of a "good" value for k. In this paper, we propose a contextual probability-based classification algorithm (CPC) which looks at multiple sets of nearest neighbors rather than just one set for classification, in order to reduce the bias of k. The proposed formalism is based on probability, and the idea is to aggregate the support of multiple neighborhoods for various classes to better reveal the true class of each new instance. To choose a series of more relevant neighborhoods for aggregation, three neighborhood selection methods (distance-based, symmetric-based, and entropy-based) are proposed and evaluated. The experimental results show that CPC obtains better classification accuracy than kNN and is indeed less biased by k after saturation is reached. Moreover, the entropy-based CPC obtains the best performance among the three proposed neighborhood selection methods.

1

Introduction

kNN is a simple but effective method for classification [1]. For an instance x to be classified, its k nearest neighbors are retrieved, and these form a neighborhood of x. Majority voting among the instances in the neighborhood is commonly used to decide the classification for x, with or without distance-based weighting. Despite its conceptual simplicity, kNN performs as well as many more sophisticated classifiers when applied to non-trivial problems. Over the last 50 years, this simple classification method has been extensively used in a broad range of applications such as medical diagnosis, text categorization [2], pattern recognition [3], data mining [4], and e-commerce. However, to apply kNN we need to choose an appropriate value for k, and the success of classification depends heavily on this value. In a sense, kNN is biased by k. There are many ways of choosing the k value; a simple one is to run the algorithm many times with different values and choose the one with the best performance, but this is not a pragmatic method in real applications.
P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 313–326, 2004. © Springer-Verlag Berlin Heidelberg 2004


In order for kNN to be less dependent on the choice of k, we propose to look at multiple sets of nearest neighbors rather than just one set of nearest neighbors. For an instance x, each neighborhood bears support for different possible classes. The proposed formalism is based on contextual probability [5], and the idea is to aggregate the support of multiple sets of nearest neighbors for various classes to give a more reliable support value, which better reveals the true class of x. However, the given data set is usually only a sample of the underlying data space, so in practice it is impossible to gather all the neighborhoods in order to aggregate the support for classifying a new instance. On the other hand, even if it were possible to gather all the neighborhoods of a given new instance, the computational cost could be unbearable. In a sense, the classification accuracy of CPC depends on a given number of chosen neighborhoods, so the method used to select more relevant neighborhoods for aggregation is important. Having identified these issues, we propose three neighborhood selection methods in this paper, aimed at choosing a set of neighborhoods as informative as possible for classification, to further improve the classification accuracy of CPC. The rest of the paper is organized as follows. Section 2 describes the contextual probability-based classification method. Section 3 introduces the three neighborhood selection methods: distance-based, symmetric-based, and entropy-based. The experimental results are described and discussed in Section 4. Section 5 ends the paper with a summary, pointing out existing problems and further research directions.

2

Contextual Probability-Based Classification

Let Ω be a finite set called a frame of discernment. A mass function is a function m defined on the subsets of Ω, with non-negative values that sum to one. The mass function is interpreted as a representation (or measure) of knowledge or belief about Ω, and m(X) is interpreted as a degree of support for X [6, 7]. To extend our knowledge to an event A that we cannot evaluate explicitly from m, we define a new function G such that, for any A ⊆ Ω, G(A) is obtained by summing, over the events X, the share of m(X) that A receives in proportion to their overlap, i.e. m(X) |A ∩ X| / |X|.

This means that the knowledge of event A may not be known explicitly in the representation of our knowledge, but we know explicitly some events X that are related to it (i.e., A overlaps with X). Part of the knowledge about X, m(X), should then be shared by A, and a measure of this part is m(X) |A ∩ X| / |X|. The mass function can be interpreted in different ways. In order to solve the aggregation problem, one interpretation is made as follows.


Let S be a finite set of class labels, and D be a finite data set each element of which has a class label in S. The labelling is denoted by a function l, so that for t ∈ D, l(t) is the class label of t. Consider a class s ∈ S, and let D_s denote the subset of D labelled s and N_s its size. The mass function for s is then defined over the subsets of D, and if the distribution over D_s is uniform it reduces to a simple count-based form. Based on the mass function, the aggregation function G_s for s is defined as in equation (4). When A is a singleton, denoted {x}, equation (4) reduces to equation (5). If the distribution over D is uniform then, for a singleton, the aggregated support can be represented as equation (6), in which a binomial coefficient, counting the number of ways of picking a given number of unordered outcomes from a set of possibilities, appears as a counting term.

Let x be an instance to be classified. If we knew the aggregated support for every class, we could assign x to the class with the largest support. Since the given data set is usually a sample of the underlying data space, we may never know the true support values; all we can do is approximate them. Equation (6) shows the relationship between the singleton support and the aggregation function, and the latter can be calculated from some given events. If the set of events is complete, we can calculate the support accurately; otherwise, if it is only partial, i.e. a subset, the calculated support is an approximation. From equation (5) we know that the more we know about the events related to x, the more accurate the approximation will be. As a result, we should try to gather as many relevant events about x as possible. In the spirit of kNN, we can deem the neighborhoods of x relevant, and therefore take neighborhoods of x as events. In practice, however, the more neighborhoods chosen for classification, the higher the computational cost. With limited computing time, the choice of the more relevant neighborhoods is non-trivial.


This is one reason that motivated us to seek a series of more relevant neighborhoods over which to aggregate the support for classification. Also in the spirit of kNN, for an instance x to be classified, the closer another instance is to x, the more it contributes to classifying x. Based on this understanding, for a given number of neighborhoods chosen for aggregation, we choose a series of specific neighborhoods that we consider relevant to the instance to be classified. Summarizing the above discussion, we propose the following procedure for CPC.

1. Determine N and N_s for every class s, and calculate the quantities that depend only on them; these numbers are valid for any instance.
2. Select a number of neighborhoods of x.
3. Calculate the mass of each selected neighborhood for each class.
4. Calculate the aggregated support for every class.
5. Calculate the singleton support of x for every class.
6. Classify x into the class that has the largest singleton support.

In its simplest form, kNN is majority voting among the k nearest neighbors of x. In our terminology, kNN can be described as follows: select one neighborhood A of x, calculate the class counts within A and the resulting per-class support, and finally classify x by the class with the largest support. We can see that kNN considers only one neighborhood, and it does not take into account the proportion of instances in a class. In this sense, kNN is a special case of our classification procedure.
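A rough sketch of the overall CPC procedure is given below. The support function used here (class proportions within each neighborhood, summed over neighborhoods) is a simplified stand-in for the paper's mass-function-based aggregation, not the authors' exact formulas.

```python
# Rough sketch of the CPC procedure: aggregate the support that several
# nested neighborhoods give to each class and predict the class with the
# largest aggregated support. The support function is a simplified
# stand-in for the paper's mass-function-based aggregation.
from collections import Counter

def cpc_predict(neighborhoods, labels, classes):
    """neighborhoods: list of index lists (the selected events A_1..A_N);
    labels: dict mapping a training index to its class label."""
    support = Counter()
    for A in neighborhoods:
        counts = Counter(labels[i] for i in A)
        for c in classes:
            support[c] += counts[c] / max(len(A), 1)
    return max(classes, key=lambda c: support[c])
```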

3

Neighborhood Selection

In practice, a given data set is usually a sample of the underlying data space, so it is impossible to gather all the neighborhoods in order to aggregate the support for classifying a new instance. On the other hand, even if it were possible to gather all neighborhoods for classification, the computational cost could be unbearable. Methods for selecting the more relevant neighborhoods for aggregation are therefore quite important. In this section, we describe the three proposed neighborhood selection methods, distance-based, symmetric-based, and entropy-based, which have been implemented in our prototype.

3.1 Distance-Based Neighborhood Selection

For a new instance x to be classified, distance-based neighborhood selection proceeds by choosing sets of nearest neighbors of different sizes as neighborhoods. One simple way, for example, is to let the k nearest neighbors of x make up a neighborhood for each value of k. With this convention the neighborhoods are nested, each containing one more neighbor than the previous one. This is the simplest neighborhood selection method. Figure 1 demonstrates the first four neighborhoods obtained with the distance-based neighborhood selection method.


Fig. 1. The first four distance-based neighborhoods around x
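Distance-based neighborhood selection is easy to sketch; the interface below (Euclidean distance and a fixed list of neighborhood sizes) is an illustrative assumption.

```python
# Sketch of distance-based neighborhood selection: the k nearest neighbours
# of x form one nested neighborhood per value of k. Distance metric and
# neighborhood sizes are illustrative assumptions.
import numpy as np

def distance_based_neighborhoods(x, X_train, sizes=(1, 2, 3, 4)):
    d = np.linalg.norm(X_train - x, axis=1)     # distances from x to all training points
    order = np.argsort(d)
    return [order[:k].tolist() for k in sizes]  # one index list per neighborhood
```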

3.2

Symmetric-Based Neighborhood Selection

Let S be a finite set of class labels and D be a finite data set; each instance in D has a class label in S, given by a labelling function l. Firstly, we project the data set into the feature space: each instance is represented as a point in that space. Then we partition the space into grids. The partitioning process proceeds as follows. For each dimension of the space, if the feature is ordinal, we partition it into equal intervals whose length is derived from the standard deviation of the values occurring for that feature, scaled by a parameter whose value is application dependent; this length is the cell width of the feature. If the feature is nominal, its discrete values provide a natural partition. At the end of the partitioning process, all the instances in the data set are distributed into the grid cells.

Assume x is an instance to be classified. Its initial cell location can be calculated as follows: for an ordinal feature, the cell entry is the interval into which the feature value of x falls; for a nominal feature, the cell entry is the set containing that feature value. All the instances covered by this cell make up the first neighborhood. Strictly speaking, each cell in the grid is a hypertuple. A hypertuple is a tuple whose entries are sets for nominal features and intervals for ordinal features, instead of single values [10].

Assume a neighborhood has been constructed, with a corresponding hypertuple. To generate the next neighborhood, the hypertuple is expanded in the following way: an ordinal feature's interval is widened by one cell width on each side, and a nominal feature's set is enlarged with further values of that feature occurring in the training instances. All the instances covered by the newly generated hypertuple make up the next neighborhood.


Figure 2 shows an example of three symmetric hypertuples around x: x is covered by the innermost hypertuple, and the instances covered by the three nested hypertuples make up the three neighborhoods, respectively.

Fig. 2. Three symmetric neighborhoods
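The grid/hypertuple mechanics for ordinal features can be sketched as follows. The cell-width rule (a fraction of each feature's standard deviation) and the zero-origin cell boundaries are assumptions; nominal features are omitted for brevity.

```python
# Hedged sketch of grid-based (symmetric) neighborhood expansion for
# ordinal features only; cell widths and boundaries are assumptions.
import numpy as np

def cell_widths(X_train, k=2):
    w = X_train.std(axis=0) / k
    return np.where(w == 0, 1.0, w)                   # guard constant features

def initial_hypertuple(x, widths):
    lows = np.floor(x / widths) * widths              # cell containing x
    return list(zip(lows, lows + widths))             # one interval per feature

def expand(hypertuple, widths):
    return [(lo - w, hi + w) for (lo, hi), w in zip(hypertuple, widths)]

def covered(hypertuple, X_train):
    mask = np.ones(len(X_train), dtype=bool)
    for j, (lo, hi) in enumerate(hypertuple):
        mask &= (X_train[:, j] >= lo) & (X_train[:, j] <= hi)
    return np.where(mask)[0].tolist()

X_train = np.random.rand(50, 3)
w = cell_widths(X_train)
h0 = initial_hypertuple(X_train[0], w)
print(len(covered(h0, X_train)), len(covered(expand(h0, w), X_train)))
```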

3.3

Entropy-Based Neighborhood Selection

We propose an entropy-based neighborhood selection method that selects a given number of neighborhoods carrying as much information for classification as possible; the goal is to improve the classification accuracy of CPC. This is a neighborhood-expansion method: the next neighborhood is generated by expanding the previous one, so the earlier neighborhood is covered by the later one. In each expansion step, we calculate the entropy of each possible expansion (candidate) and select the one with minimal entropy as the next neighborhood. The smaller the entropy of a neighborhood, the more imbalanced the class distribution of its neighbors, and the more relevant the neighbors are to the instance to be classified.

Assume a neighborhood has been constructed, with a corresponding hypertuple in the feature space, and consider one feature. If the feature is ordinal, its entry in the hypertuple is an interval; the set of instances covered by the hypertuple extended one cell to the left of that interval and the set covered by the hypertuple extended one cell to the right are two candidates for the next neighborhood. If the feature is nominal, its entry is a set of values; for every further value of that feature occurring in the training instances, the set of instances covered by the hypertuple enlarged with that value is a candidate. We then calculate each candidate's entropy according to equation (7) and choose the candidate with minimal entropy as the next neighborhood. The entropy in equation (7) is the sum, over the classes, of -p_i log2 p_i, where p_i, given by equation (8), is the proportion of the instances in the candidate that belong to class c_i.


All the instances covered by the chosen candidate make up the next neighborhood. Suppose that a new instance x to be classified initially falls into a cell of the grid, represented as a hypertuple that covers x: for an ordinal feature, the corresponding entry is the interval into which x's value falls, and for a nominal feature it is a set containing x's value. All the instances covered by this hypertuple make up the first neighborhood of our algorithm. The detailed entropy-based neighborhood selection algorithm is as follows: (1) set the first neighborhood to the set of instances covered by the initial cell of x; (2) repeatedly find, among all the candidates obtained by expanding the current hypertuple, the one with minimal entropy and take the instances it covers as the next neighborhood, until the desired number of neighborhoods has been produced.

Suppose that two candidate neighborhoods of x have the same amount of entropy. If one has smaller cardinality, we believe it is more relevant to x than the other, so in this case we prefer to choose the smaller one as the next neighborhood; otherwise, we prefer to choose the one with minimal entropy. According to equation (7), the smaller a neighborhood's entropy is, the more imbalanced its class distribution is, and consequently the more information it has for classification. So, in our algorithm, we adopt equation (7) as the criterion for neighborhood selection, and in each expansion step we select a candidate with minimal entropy as the next neighborhood.

To illustrate the method, we consider an example. For simplicity, we describe our entropy-based neighborhood selection method in 2-dimensional space. Suppose that an instance to be classified is located at cell [3, 3] in the leftmost graph of Figure 3. We collect all the instances covered by cell [3, 3] into a set, which is the first neighborhood. Then we try to expand the cell one step in each of 4 different directions (up, down, left, and right) respectively and choose the candidate with minimal entropy as the new expanded area.

Fig. 3. Neighborhood expansion process (1)


Then we look up, down, left, and right again and select a new area (e.g., the one shown in the rightmost graph of Figure 3). All the instances covered by the expanded area make up the next neighborhood, and so on. At the end of the procedure, we obtain a series of neighborhoods, as shown in Figure 3 from left to right. If the instance to be classified is located at cell [2, 3] in the leftmost graph of Figure 4, the selection process of three neighborhoods is demonstrated by Figure 4 from left to right.

Fig. 4. Neighborhood expansion process (2)
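The entropy criterion at the heart of this selection method can be sketched as below; candidate generation (the expanded hypertuples) is assumed to be supplied by the caller.

```python
# Sketch of the entropy criterion for picking the next neighborhood: among
# the candidate expansions, choose the one with minimal class entropy,
# breaking ties in favour of the smaller candidate.
import math
from collections import Counter

def class_entropy(labels):
    n = len(labels)
    if n == 0:
        return float("inf")
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def pick_next_neighborhood(candidates, labels):
    """candidates: list of index lists, one per possible expansion."""
    return min(candidates,
               key=lambda idx: (class_entropy([labels[i] for i in idx]), len(idx)))
```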

4

Experiment and Evaluation

One motivation of this work is the fact that kNN classification is heavily dependent on the choice of a 'good' value for k. The objective of this paper is therefore to come up with a method in which this dependence is reduced. A contextual probability-based classification method is proposed to solve this problem; it works in the same spirit as kNN but uses more neighborhoods. For simplicity we refer to the classification procedure presented in Section 2 as nokNN. To distinguish between the three different neighborhood selection methods, we refer to the distance-based neighborhood selection method as nokNN(d), the symmetric-based one as nokNN(s), and the entropy-based one as nokNN(e). Here we experimentally evaluate the classification procedures nokNN(d), nokNN(s), and nokNN(e) on real-world data sets in order to verify our expectations and to see if and how aggregating different neighborhoods improves the classification accuracy of kNN.

4.1

Data Sets

In the experiments, we used fifteen public data sets available from the UC Irvine Machine Learning Repository. General information about these data sets is shown in Table 1. The data sets are relatively small, but scalability is not an issue when data sets are indexed. In Table 1, the meaning of the column headings is as follows: NF-Number of Features, NN-Number of Nominal features, NO-Number of Ordinal features, NB-Number of Binary features, NI-Number of Instances, and CD-Class Distribution.


4.2 Experiments

Experiment 1. kNN and nokNN(d) were implemented in our prototype. In this experiment, 30 neighborhoods were used for every data set. kNN was run with the number of neighbors k ranging from 1 to 88 with step 3, and nokNN(d) was run with the number of neighborhoods N ranging from 1 to 30 with step 1. Each set of k nearest neighbors (for kNN) makes up a neighborhood, giving in total 30 neighborhoods corresponding to the different k values from 1 to 88 with step 3. The comparison of kNN and nokNN(d) in classification accuracy is shown in Figure 5. Each value N on the horizontal axis represents the number of neighborhoods used for aggregation by nokNN(d) and the N-th neighborhood used by kNN; the k value for kNN with respect to the N-th neighborhood is 3N - 2. The detailed experimental results for kNN and nokNN(d) are presented in two separate tables, Table 2 for nokNN(d) and Table 3 for kNN, where N is varied from 1 to 10 for both kNN and nokNN(d).

Fig. 5. A comparison of nokNN(d) and kNN in average classification accuracy


In Table 2, the heading N represents the number of neighborhoods used for aggregation. In Table 3, the heading N represents the N-th neighborhood used for kNN; that neighborhood contains k = 3N - 2 neighbors.


Fig. 6. Classification accuracy of nokNN(d) testing on Diabetes data set

Fig. 7. Classification accuracy of kNN testing on Diabetes data set

Figures 6 and 7 show the full details of the performance of nokNN(d) and kNN on the Diabetes data set as the number of neighborhoods varies from 1 to 30. We also give the worst and best performance of kNN together with the corresponding N values, and the performance of nokNN(d) when ten neighborhoods are used for aggregation, in Table 4. In this experiment, we use 10-fold cross-validation for evaluation. The experimental results show that the performance of kNN varies when different neighborhoods are used, while the performance of nokNN(d) improves with an increasing number of neighborhoods but stabilizes after a certain number of neighborhoods (see Figure 5). Furthermore, the stabilized performance of nokNN(d) is comparable (in fact slightly better in our experiment on fifteen data sets) to the best performance of kNN within 10 neighborhoods.

Experiment 2. In this experiment, our goal is to test whether the entropy-based neighborhood selection method can improve the classification accuracy of CPC. For each value of N, nokNN(e) denotes the average classification accuracy obtained when N neighborhoods are used for aggregation, and kNN denotes the average classification accuracy obtained when testing on the N-th neighborhood. A comparison of entropy-based nokNN(e) and kNN with respect to classification accuracy using 10-fold cross-validation is shown in Figure 8. To further verify our aggregation method, we also implemented the symmetric-based neighborhood selection method; refer to Section 3.2 for more details.


Fig. 8. A comparison of nokNN(e) and kNN in classification accuracy

Figure 9 shows that similar results are obtained using the symmetric-based neighborhood selection method. A comparison of entropy-based nokNN(e) with symmetric-based nokNN(s) and distance-based nokNN(d) in classification accuracy is shown in Figure 10. It is obvious that the entropy-based CPC obtains better classification accuracy than the symmetric-based CPC and the distance-based CPC, especially when the number of neighborhoods used for aggregation is relatively small. The experimental results justify our hypotheses: (1) the bias of k can be removed by CPC, and (2) the entropy-based neighborhood selection method indeed improves the classification accuracy of CPC.

5

Conclusions

In this paper we have discussed issues related to the kNN method for classification. In order for kNN to be less dependent on the choice of k, we looked at multiple sets of nearest neighbors rather than just one set.


Fig. 9. A comparison of nokNN(s) and kNN in classification accuracy

Fig. 10. A comparison of nokNN(d), nokNN(s), and nokNN(e)

A set of neighbors is called a neighborhood. For an instance x, each neighborhood bears support for different possible classes. We have presented a novel formalism based on probability to aggregate the support for various classes into a more reliable support value, which better reveals the true class of x. Based on this idea, and in contrast to the single neighborhood used in kNN, which always surrounds the instance to be classified, we have proposed a contextual probability-based classification method together with three different neighborhood selection methods. To choose a given number of neighborhoods with as much information for classification as possible, we proposed the entropy-based neighborhood selection method, which partitions a multidimensional data space into grids and at each step expands the neighborhood to the candidate with minimal information entropy. This method is independent of any particular distance or similarity metric. Experiments on public data sets have shown that with nokNN (whether nokNN(d), nokNN(s), or nokNN(e)) the classification accuracy increases as the number of neighborhoods increases, but stabilizes after a small number of neighborhoods; with kNN, however, the classification performance varies when different neighborhoods are used. Experiments have also shown that the stabilized performance of nokNN(d) is comparable to the best performance of kNN. The comparison of entropy-based, symmetric-based, and distance-based CPC has shown that the entropy-based CPC obtains the highest classification accuracy.


References 1. Hand, D., Mannila, H., and Smyth, P. Principles of Data Mining, The MIT Press, 2001. 2. Sebastiani, F. Machine Learning in Automatic Text Categorization, In ACM Computing Survey, Vol.34, No. 1, pages.1-47, March 2002. 3. Ripley, B. Pattern Recognition and Neural Networks. Cambridge University Press, 1996. 4. Mitchell, T. Machine Learning, MIT Press and McGraw-Hill, 1997. 5. Wang, H. Contextual Probability, Journal of Telecommunications and Information Technology, 4(3), pages 92-97, 2003. 6. Guan, J. and Bell, D. Generalization of the Dempster-Shafer Theory. Proc. IJCAI93, pages 592-597, 1993. 7. Shafer, G. A Mathematical Theory of Evidence, Princeton University Press, Princeton, New Jersey, 1976. 8. Feller, W. An Introduction to Probability Theory and Its Applications, Wiley, 1968. 9. Michie, D., Spiegelhalter, D. J., and Taylor, C. C. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. 10. Wang, H., Duntsch, I. and Bell, D. Data Reduction Based on Hyper Relations. Proc. of KDD98, New York, pages 349-353, 1998.

Improving the Performance of Decision Tree: A Hybrid Approach

LiMin Wang1, SenMiao Yuan1, Ling Li2, and HaiJun Li2

1 College of Computer Science and Technology, JiLin University, ChangChun 130012, China
2 School of Computer, YanTai University, YanTai 264005, China
[emailprotected]

Abstract. In this paper, a hybrid learning approach named Flexible NBTree is proposed. Flexible NBTree uses the Bayes measure to select the proper test and applies a post-discretization strategy to construct the decision tree. The final tree's internal nodes contain univariate splits, as in regular decision trees, but the leaf nodes contain General Naive Bayes, which is a variant of the standard Naive Bayesian classifier. Empirical studies on a set of natural domains show that Flexible NBTree has clear advantages with respect to generalization ability when compared against its counterpart, NBTree.

Keywords: Flexible NBTree; Bayes measure; General Naive Bayes; post-discretization

1 Introduction

Decision tree based methods of supervised learning represent one of the most popular approaches within the AI field for dealing with classification problems. They have been widely used for years in many domains such as web mining, data mining, pattern recognition, signal processing, etc. But the standard decision tree learning algorithm [1] has difficulty in capturing relations among continuous-valued data points, so learning from data consisting of both continuous and nominal variables is a key research issue. Some researchers indicate that hybrid approaches can take advantage of both symbolic and connectionist models to handle tough problems, and much research has addressed the issue of combining decision trees with other learning algorithms to construct hybrid models. Baldwin et al. [2] used mass assignment theory to translate attribute values into probability distributions over fuzzy partitions, then introduced probabilistic fuzzy decision trees in which fuzzy partitions were used to discretize continuous test universes. Tsang et al. [3] used a hybrid neural network to refine a fuzzy decision tree and extract a fuzzy decision tree with parameters, which is equivalent to a set of fuzzy production rules. Based on variable precision rough set theory, Zhang et al. [4] introduced a new concept of generalization and employed the variable precision rough sets (VPRS) model to construct multivariate decision trees.
P. Atzeni et al. (Eds.): ER 2004, LNCS 3288, pp. 327–335, 2004. © Springer-Verlag Berlin Heidelberg 2004


By redefining the test selection measure, this paper proposes a novel hybrid approach, Flexible NBTree, which attempts to utilize the advantages of both decision trees and Naive Bayes. The final classifier resembles Kohavi's NBTree [5] but differs in two respects: 1. NBTree pre-discretizes the data set by applying an entropy-based algorithm, whereas Flexible NBTree applies a post-discretization strategy to construct the decision tree. 2. NBTree uses standard Naive Bayes at the leaf nodes to handle pre-discretized and nominal attributes, whereas Flexible NBTree uses General Naive Bayes (GNB), a variant of standard Naive Bayes, at the leaf nodes to handle continuous and nominal attributes in the subspace. The remainder of this paper is organized as follows. Sections 2 and 3 introduce the post-discretization strategy and GNB, respectively. Section 4 illustrates Flexible NBTree in detail. Section 5 presents experimental results comparing Flexible NBTree and NBTree. Section 6 concludes the paper.

2

The Post-discretization Strategy

When applying post-discretization strategy to construct decision tree, at each internal node in the tree, we first select the test which is the most useful for improving classification accuracy, then apply discretization of continuous tests.

2.1

Bayes Measure

In this discussion we use capital letters such as X, Y for variable names, and lower-case letters such as x, y to denote specific values taken by those variables. Let P(·) denote a probability and p(·) a probability density function. Suppose the training set T consists of predictive attributes X1, ..., Xn and a class attribute C; each attribute is either continuous or nominal. The aim of decision tree learning is to construct a tree model that describes the relationship between the predictive attributes and the class attribute C, i.e. a tree model whose classification accuracy on the data set T is as high as possible. Correspondingly, the Bayes measure, which is introduced in this section as a test selection measure, is also based on this criterion. Let Xi represent one of the predictive attributes. According to Bayes' theorem, if Xi is nominal then

P(c | Xi = xi) = P(Xi = xi | c) P(c) / P(Xi = xi),   (1)

otherwise, if Xi is continuous, then

P(c | Xi = xi) = p(Xi = xi | c) P(c) / p(Xi = xi).   (2)


The aim of Bayesian classification is to choose the class that maximizes the posterior probability. When some instances satisfy Xi = xi, their class labels are most likely to be

c*(xi) = argmax_c P(c | Xi = xi).   (3)

Definition 1. Suppose Xi has m distinct values. We define the Bayes measure of Xi as the proportion of the training instances whose true class label equals the label inferred from (3) for their value of Xi, where N is the size of set T. Intuitively spoken, the Bayes measure of Xi is the classification accuracy obtained when the classifier consists of attribute Xi only. It describes the extent to which the model constructed by attribute Xi fits the class attribute C. The predictive attribute that maximizes the Bayes measure is the one that is most useful for improving classification accuracy.
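Reading the definition as the training-set accuracy of a one-attribute Bayesian classifier, the Bayes measure for a nominal attribute can be sketched as follows; this reading is an assumption, since the original formula is not reproduced here.

```python
# Sketch of the Bayes measure for a nominal attribute, reading it as the
# training-set accuracy of a one-attribute Bayesian classifier (each value
# is assigned its most probable class). This reading is an assumption.
from collections import Counter, defaultdict

def bayes_measure(attr_values, class_labels):
    by_value = defaultdict(Counter)
    for v, c in zip(attr_values, class_labels):
        by_value[v][c] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(class_labels)

print(bayes_measure(["a", "a", "b", "b", "b"], [0, 1, 1, 1, 0]))   # 0.6
```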

2.2

Discretization of Continuous Tests

The aim of discretization is to partition the values of a continuous test Xi into a nominal set of intervals. According to (3), the label inferred for a value of a continuous Xi is determined by the conditional probability density p(xi | c) together with the class priors. For two arbitrary values of Xi that are sufficiently close to each other, the conditional densities are almost equal, so the class labels inferred from (3) will not change within a small interval of the values of Xi. For clarification, suppose the relationship between the distribution of Xi and C is as shown in Fig. 1.

Fig. 1. The relationship between the distribution of Xi and C


We can see from Fig. 1 that the class label inferred from (3) is constant over each region of the value range of Xi. Note that these labels are inferred from (3); they are not the true class labels of the training instances. In the current example, there are three candidate boundaries corresponding to the values of Xi at which the inferred value of C changes. If we use these boundaries to discretize attribute Xi, the classification accuracy after discretization will be equal to the Bayes measure of Xi. So, the process of computing the Bayes measure is also the process of discretization: the Bayes measure can be used to automatically find the most appropriate boundaries for discretization and the number of intervals.

Although this kind of discretization method can retain classification accuracy, it may lead to too many intervals. The Minimum Description Length (MDL) principle is used in our experimental study to control the number of intervals. Suppose we have sorted sequence S into ascending order by the values of Xi, and such a sequence is partitioned by a boundary B into two subsets S1 and S2. The class information entropy of the partition is the weighted average (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2), where Ent(·) denotes the entropy function over the class proportions, i.e. the sum over classes of -p log2 p, with p standing for the proportion of the instances in the subset that belong to the class in question.

According to the MDL principle, the partitioning of S at boundary B is accepted only if the information gain, which measures the decrease of the weighted average impurity of the partitions compared with the impurity of the complete set S, exceeds an MDL-based threshold that depends on N, the number of instances in S, and on the numbers of class labels represented in S, S1 and S2. This approach can then be applied recursively to all adjacent partitions of attribute Xi, thus creating the final intervals.
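The MDL acceptance test appears to follow the standard Fayyad-Irani form; the following sketch uses that form, so the exact constants should be treated as an assumption rather than a reproduction of the paper's equation.

```python
# Sketch of the MDL acceptance test for a candidate boundary, in the
# standard Fayyad-Irani form (constants are an assumption). S, S1 and S2
# are lists of class labels; assumes len(S) >= 2 and non-empty partitions.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mdl_accepts_split(S, S1, S2):
    N = len(S)
    k, k1, k2 = len(set(S)), len(set(S1)), len(set(S2))
    gain = entropy(S) - (len(S1) / N) * entropy(S1) - (len(S2) / N) * entropy(S2)
    delta = math.log2(3 ** k - 2) - (k * entropy(S) - k1 * entropy(S1) - k2 * entropy(S2))
    return gain > (math.log2(N - 1) + delta) / N
```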

3

General Naive Bayes (GNB)

Naive Bayes comes originally from work in pattern recognition and is based on the assumption that the predictive attributes are conditionally independent given the class attribute C, which can be expressed as P(x1, ..., xn | c) = P(x1 | c) · ... · P(xn | c).


But when the instance space contains continuous attributes, the situation is different. For clarity, we first consider just two attributes, X1 and X2, and suppose the values of X2 have been discretized into a set of intervals, each corresponding to a nominal value. Then the independence assumption should be stated in terms of intervals: conditioned on the class, the probability that X1 falls into an arbitrary interval of its values and that X2 takes a given value factorizes into the product of the two conditional probabilities. This assumption, which is the basis of GNB, supports very efficient algorithms for both classification and learning. By the definition of a derivative, letting the width of the interval of X1 tend to zero, the interval probability can be replaced by the conditional probability density of X1, so the factorized form of equation (9) still holds with a density in place of the interval probability.

We now extend (9) to handle a much more common situation. Suppose the first k of the attributes are continuous and the remaining attributes are nominal. By an induction similar to that leading to (9), the class-conditional joint distribution factorizes into a product of conditional densities for the continuous attributes and conditional probabilities for the nominal attributes, giving equation (10). The classification rule of GNB, equation (11), is then to choose the class that maximizes the prior probability multiplied by this product.

The probabilities in (11) are estimated using the Laplace-estimate and the M-estimate [6], respectively.


Kernel-based density estimation [7] is the most widely used non-parametric density estimation technique. Compared with parametric density estimation techniques, it does not make any assumption about the data distribution. In this paper we choose it to estimate the conditional probability densities in Eq. (11): the density at a point is estimated as the average, over the training instances of the given class, of a kernel function K(·) centred at their attribute values and scaled by a kernel width, with the number of training instances of that class acting as the normalizing factor. This estimate converges to the true probability density function if the kernel function obeys certain smoothness properties and the kernel width is chosen appropriately. One way of measuring the difference between the true density and the estimated one is the expected cross-entropy; the kernel width is chosen to minimize the estimated cross-entropy. In our experiments, we use an exhaustive grid search with grid width 0.01 over a fixed range of kernel widths.
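A Gaussian-kernel sketch of the class-conditional density estimate is shown below; the kernel choice and interface are assumptions, and the bandwidth h would in practice be set by the cross-entropy grid search described above.

```python
# Gaussian-kernel sketch of a class-conditional density estimate; the
# kernel choice and interface are assumptions made for illustration.
import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    """samples: attribute values of the training instances of one class."""
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (len(samples) * h)

print(kde(0.5, [0.1, 0.4, 0.45, 0.9], h=0.2))
```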

4

Flexible NBTree

Kohavi proposed NBTree as a hybrid approach combining Naive Bayes and decision trees. It has been shown that NBTree frequently achieves higher accuracy than either Naive Bayes or a decision tree alone. Like NBTree, Flexible NBTree uses a tree structure to split the instance space into subspaces and generates one Naive Bayes classifier in each subspace. However, it uses a different discretization strategy and a different version of Naive Bayes. The Flexible NBTree learning algorithm is as follows.


5


Experiments

In order to evaluate the performance of Flexible NBTree and compare it against its counterpart, NBTree, we conducted an empirical study on 18 data sets from the UCI machine learning repository (ftp://ftp.ics.uci.edu/pub/machine-learning-databases). Each data set consists of a set of classified instances described in terms of varying numbers of continuous and nominal attributes. For comparison purposes, the stopping criteria in our experiments are the same: the relative reduction in error for a split is less than 5% and there are no more than 30 instances in the node. The classification performance was evaluated by ten-fold cross-validation for all the experiments on each data set. Table 1 shows the classification accuracy and standard deviation for Flexible NBTree and NBTree, respectively. A marked entry indicates that the accuracy of Flexible NBTree is higher than that of NBTree at a significance level better than 0.05, using a two-tailed pairwise t-test on the results of the 20 trials on a data set. From Table 1, the significant advantage of Flexible NBTree over NBTree in terms of higher accuracy can be clearly seen. In order to investigate the reason(s), we analyze the experimental results on the Breast-w data set in particular. Figure 2 shows the comparison of classification accuracy for Flexible NBTree and NBTree. When N (the training size of data set Breast-w) < 650, the tree structures learned by the two algorithms are almost the same.

334

LiMin Wang et al.

Fig. 2. Comparison of the classification accuracy

the decision node in the second layer of Flexible NBTree contains univariate test and that learned from NBTree contains test . Correspondingly from Fig. 2 we can see that, when N = 600 Flexible NBTree achieves 92.83% accuracy on the test set while NBTree reaches about 92.73%. When N = 650 Flexible NBTree achieves 93.51% accuracy while NBTree reaches about 92.92%. The error reduction increases from 1.38% to 8.33%. We attribute this improvement to the effectiveness of post-discretization strategy. Since no information-lossless discretization procedure is available, some helpful information may lose in the transformation from infinite numeric area to finite subintervals. We conjecture that pre-discretization does not take full advantage of the information that continuous attributes supply and this may affect the cutting points of continuous test or even test selection in the process of tree construction, thus degrade the classification performance to some extent. But post-discretization strategy applies discretization only when necessary, thus the possibility of information loss can be reduced to minimum.
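The significance results above rest on a two-tailed paired t-test at the 0.05 level over per-trial accuracies. As a minimal, hedged illustration (assuming SciPy and two equal-length lists of paired trial accuracies, not the authors' code):

```python
from scipy import stats

def significantly_better(acc_a, acc_b, alpha=0.05):
    """Two-tailed paired t-test on per-trial accuracies (e.g. 20 trials);
    True when classifier A is significantly better than B at level alpha."""
    t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
    return p_value < alpha and sum(acc_a) > sum(acc_b)
```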

6 Summary

Pre-discretization is a common choice for handling continuous attributes in machine learning, but the resulting information loss may affect classification performance negatively. In this paper, we propose a novel learning approach, Flexible NBTree, which is a hybrid of a decision tree and Naive Bayes. Flexible NBTree applies a post-discretization strategy to mitigate the negative effect of information loss. Experiments on natural domains showed that Flexible NBTree generalizes much better than NBTree.


References

1. Quinlan, J.R.: Discovering rules from large collections of examples: A case study. Expert Systems in the Micro Electronic Age, Edinburgh University Press (1979)
2. Baldwin, J.F., Karale, S.B.: Asymmetric Triangular Fuzzy Sets for Classification Models. Proceedings of the 7th International Conference on Knowledge-Based Intelligent Information and Engineering Systems, KES 2003, Oxford, UK (2003) 364-370
3. Tsang, E.C.C., Wang, X.Z., Yeung, Y.S.: Improving learning accuracy of fuzzy decision trees by hybrid neural networks. IEEE Transactions on Fuzzy Systems 8 (2000) 601-614
4. Zhang, L., Ye-Yun, M., Yu, S., Ma-Fan, Y.: A New Multivariate Decision Tree Construction Algorithm Based on Variable Precision Rough Set. Advances in Web-Age Information Management (2003) 238-246
5. Kohavi, R.: Scaling up the accuracy of naive-Bayes classifiers: A decision-tree hybrid. Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Menlo Park, CA (1996) 202-207
6. Cestnik, B.: Estimating probabilities: A crucial task in machine learning. Proceedings of the 9th European Conference on Artificial Intelligence (1990) 147-149
7. Chen, H., Meer, P.: Robust Computer Vision through Kernel Density Estimation. Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark (2002) 236-246
8. Smyth, P., Gray, A., Fayyad, U.: Retrofitting decision tree classifiers using kernel density estimation. Proceedings of the 12th International Conference on Machine Learning, Morgan Kaufmann Publishers (1995) 506-514

Understanding Relationships: Classifying Verb Phrase Semantics* Veda C. Storey1 and Sandeep Purao2 1

J. Mack Robinson College of Business, Georgia State University, Atlanta, GA 30302 [emailprotected]

2

School of Information Sciences & Technology, The Pennsylvania State University, University Park, PA 16801-3857 [emailprotected]

Abstract. Relationships are an essential part of the design of a database because they capture associations between things. Comparing and integrating relationships from heterogeneous databases is a difficult problem, partly because of the nature of the relationship verb phrases. This research proposes a multi-layered approach to classifying the semantics of relationship verb phrases to assist in the comparison of relationships. The first layer captures fundamental, primitive relationships based upon well-known work in data abstractions and conceptual modeling. The second layer captures the life cycle of natural progressions in the business world. The third layer reflects the context-dependent nature of relationships. Use of the classification scheme is illustrated by comparing relationships from various application domains with different purposes.

1 Introduction

Comparing and integrating databases is an important problem, especially in an increasingly networked world that relies on inter-organizational coordination and systems. With this comes the need to develop new methods to design and integrate disparate databases. Database integration, however, is a difficult problem and one for which semi-automated approaches would be useful. One of the main difficulties is comparing relationships because their verb phrases may be generic or dependent upon the application domain. Being able to compare the semantics of verb phrases in relationships would greatly facilitate database design comparisons. It would be even more useful if the comparison process could be automated. Fully automated techniques, however, are unlikely, so solutions to integration problems should aid integrators while requiring minimal work on their part [Biskup and Embley, 2003]. The objective of this research is to propose an ontology for understanding the semantics of relationship verb phrases by mapping the verb phrases to various categories that capture different interpretations. Doing so requires that a classification scheme be developed that captures both the domain-dependent and domain-independent nature of verb phrases. The contribution of this research is to provide a useful approach to classifying verb phrases so relationships can be compared in a semi-automated way.

* This research was partially supported by the J. Mack Robinson College of Business, Georgia State University, and Pennsylvania State University.


2 Related Work

The design of a database involves representing the universe of discourse in a structure in such a way that it accurately reflects reality. Conceptual modeling of databases is, therefore, concerned with things (entities) and associations among things (relationships) [Chen 1993; Wand et al. 1999]. A relationship R can be expressed as A verb phrase B (A vp B), where A and B are entities. Most database design practices use simple, binary associations that capture these relationships between entities. A verb phrase, which is selected by a designer with the application domain in mind, can capture much of the semantics of the relationship. Semantics, for this research, is defined as the meaning of a term or a mapping from the real world to a construct. Understanding a relationship, therefore, requires that one understand the semantics of the accompanying verb phrase. Consider the relationships from two databases:

Customer (entity) buys (verb) Product (entity)
Customer (entity) purchases (verb) Product (entity)

These relationships reflect the same aspect of the universe of discourse, and use synonymous verb phrases. Therefore, the two relationships may be mapped to a similar interpretation, recognized as identical, and integrated. Next, consider:

Customer reserves Car
Customer rents Car

These relationships reflect different concepts from the universe of discourse. The first captures the fact that a customer wants to do something; the second, that the customer has done it. These may be viewed as different states in a life cycle progression, but the two underlying relationships cannot be considered identical. Thus, they could not be mapped to the same semantic interpretation. Finally, consider:

Manager considers Agreement
Manager negotiates Agreement

The structures of the relationships suggest that both relationships represent an interaction. However, “negotiates” implies changing the status, whereas “considers” involves simply viewing the status. On the other hand,

Manager makes Agreement
Manager writes Agreement

may capture an identical notion of creation. These examples illustrate the importance of employing and understanding how a verb phrase captures the semantics of the application domain. The interpretation of verbs depends upon the nouns (entities) that surround them [Fellbaum, 1998]. Research has been carried out on defining and understanding ontology creation and use. There are different definitions and interpretations of ontologies [Weber 2002]. In general, though, ontologies deal with capturing, representing, and using surrogates for the meanings of terms. This research adopts the approach of Dahlgren [1988], who developed an ontology system as a classification scheme for speech understanding and implemented it as an interactive tool. Work on ontology development has been carried out in database design (Embley et al. [1999], Kedad and Metais [1999], Dullea and Song [1999], Bergholtz and Johannesson [2001]). These efforts
provide useful insights and build upon data abstractions. However, no comprehensive ontology for classifying relationships has been proposed.

3 Ontology for Classifying Relationships This section proposes an ontology for classifying the verb phrases of relationships. The ontology is of the type developed by Dahlgren [1988] which operates as an interactive system to classify things. The most important part is the classification scheme. It is the focus of this research and is divided into three layers (Figure 1). The layers were developed by considering: 1) prior research in data modeling, in particular, data abstractions and the inherent business life cycle; 2) the local context of the entities; and 3) the domain-dependent nature of verb phrases.

Fig. 1. Relationship classification levels

3.1 Fundamental Categories The fundamental categories are primitives that reflect a natural division in the real world. This category has three general classes that form the basis of how things in the real world can be associated with each other: status, change in status, and interaction as shown in Figure 2.

Fig. 2. Fundamental Categories


Status captures the fact that one thing has a status with respect to the other. These are primitive because they describe a permanent, or durable, association of one entity with another, expressing the fact that A with respect to B. Business applications follow a natural life cycle of conception or creation through to ownership and, eventually, destruction. The change of status category describes this transition from one status to another. Relationships in this category express the fact that A is transitioning from A with respect to B to A with respect to B. An interaction does not necessarily lead to a change of status of either entity. This happens when the effect of an interaction is worth remembering. Consider the verb phrase, ‘create.’ In some cases, it is useful to remember this as a status as in Author writes Book. In other cases, the interaction itself is important, even if it does not result in a change of status. The interaction category, therefore, expresses the fact that A with respect to B. These fundamental categories are sufficiently coarse that all verb phrases will map to them. They are also coarse enough to warrant finer categories to distinguish among the large set of relationships in each category. Thus, further refinement is needed for each fundamental category. 3.1.1 Refining the Category: Status The ‘Status’ category has been extensively studied by research on data abstractions, which focuses on the structure of relationships as a surrogate for understanding their semantics. Most data abstractions associate entities at different levels of abstraction (sub/superclass relationships) [Goldstein and Storey, 1999]. Since data abstractions infer semantics based on the structure of relationships, they, thus, provide a good start point for understanding the semantics of relationships. Research on understanding natural language also provides verb phrase categories such as auxiliary, generic and other types. The first layer captures fundamental differences between kinds of relationships and was build by considering prior, well-accepted research on data abstractions and other frequently-used verb phrases whose interpretation is unambiguous. These are independent of context. This category, thus, captures the fundamental ways in which things in the real world are related so the categories in this level can be used to distinguish among the fundamental types. Additional results from research on patterns [Coad, 1995] and linguistic analysis [Miller, 1990] results in a hierarchical classification with defined primitives at the leaves of the tree. Figure 3 shows this finer classification of the category ‘Status.’ Examples of primitive status relationships are shown in Table 1. There are two variations of one thing being assigned to another: is-assigned-to and is-subjected-to. In A is-subjected-to B, A does not have a choice with respect to its relationship with B, whereas it might in the former. Temporal relationships capture the sequence of when things happen and can be clearly categorized as before, during, and after. 3.1.2 Refining the Category: Change of Status The change-of-status primitives, in conjunction with the status primitives, capture the lifecycle transitions for each status. Although the idea of a lifecycle has been alluded
to previously [Hay 1996], prior research has not systematically recognized the lifecycle concept. Our conceptualization of the ‘Change of Status’ category is based on an extension and understanding of each primitive in the ‘Status’ category during the business lifecycle. Consider verb phrases that deal with acquiring something, as is typical of business transactions related to the status primitive ‘is-owner-of.’ The lifecycle for this status primitive has the states shown in Figure 4.

Fig. 3. Primitives for the Category ‘Status’

Fig. 4. The Relationship Life Cycle

Each state may, in turn, be mapped to different status primitives. For example, the lifecycle starts with needing something (‘has-attitude-towards’ and ‘requires’) which is followed by intending to become an owner (‘acquire’ or ‘create’), owning (‘owner’ or ‘in-control-of’) and giving up ownership (‘seller’ or ‘destroyer’). The primitives therefore illustrate a lifecycle that goes through creation or acquisition, ownership, and destruction. The life cycle can be logically divided into: intent, attempt to acquire, transition to acquiring, intent to give up, attempt to give up, and transition to giving up. Table 2 shows this additional information superimposed on the different states within the lifecycle. The sub-column under the change-of-status primitives shows the meanings captured in each: intent, attempt and the actual transition.

3.1.3 Refining the Category: Interaction ‘Interaction’ describes communication of short duration between two entities or an operation of one entity on another. The interaction may cause a change in one of the entities. For example, one entity may ‘manipulate’ another [Miller, 1990], or cause movement of the other through time or space (‘transmit,’ ‘receive’). Two entities may
interact without causing change to either (‘communicate with,’ ‘observe’). One entity may interact with another also by way of performance (‘operate,’ ‘serve’). Figure 5 shows the primitives for ‘Interaction’ with examples given in Table 3.

Fig. 5. Primitives for the Category ‘Interaction’

3.2 The Local (Internal) Context The second category captures internal context by taking into account the nature of the entities surrounding the verb phrase, highlighting the need to understand the nouns that surround verb phrases [Fellbaum, 1998]. For this research, entities are classified as: actor, action, and artifact. Actor entities are capable of performing independent actions. Action represents the performance of an act. Artifact represents an inanimate object not capable of independent action. After entities have been classified, valid primitives can be specified for each pair of entity types. For example, it does not make sense to allow the primitive ‘perform’ for two entities of the kind ‘Actor.’ On the other hand, this primitive is appropriate when one of the entities is classified as ‘Actor’ and the other as ‘Action.’ The argument can be applied both to the ‘Status’ and ‘Interaction’ primitives. Because the ‘Change of Status’ primitives capture the lifecycle of ‘Status’ primitives, constraints identified for ‘Status’ primitives apply to the ‘Change of Status’ primitives as well. Table 4 shows these constraints for ‘Status’ primitives. Similar constraints have been developed for the ‘Interaction’ primitives.


3.3 Global (External) Context

The third level captures the external context, that is, the domain in which the relationship is used, reflecting the domain-dependent nature of verb phrases. Although attempts have been made to capture taxonomies of such domain-dependent verbs, a great deal of manual effort has been involved. This research takes a more pragmatic approach in which a knowledge base of domain-dependent verb phrases may be constructed over time as the implemented ontology is being used. When the user classifies a verb phrase, its classification and application domain should be stored. Consider the use of ‘opens’ in a theatre database versus a bank database. The relationship Character opens Door in the theatre domain maps to the interaction primitive <manipulates>. In the bank application, Teller opens Account maps to a status primitive, and Customer opens Account maps to yet another primitive. If a verb phrase has already been classified by a user, it can be suggested as a preliminary classification for additional users who are interested in classifying it. If a verb phrase has already been classified by a different user for the same application domain, then that classification should be displayed to the user, who can either agree with the classification or provide a new one. New classifications will also be stored. Ideally, consensus will occur over time. In this way the knowledge base builds up, ensuring that the verbs important to different domains are captured appropriately. The following will be stored: [Relationship, Verb phrase classification, Application Domain, User]
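A minimal sketch of such a knowledge base is shown below, assuming records of the form [relationship, classification, application domain, user] and a lookup that prefers prior classifications from the same domain; the class name, method names, and the example classification strings are our own, not the paper's.

```python
from collections import defaultdict

class VerbPhraseKB:
    """Stores user classifications of relationship verb phrases and suggests
    earlier classifications of the same verb phrase (same domain preferred)."""

    def __init__(self):
        self._records = []                  # [relationship, classification, domain, user]
        self._by_key = defaultdict(list)    # (verb phrase, domain) -> classifications

    def store(self, relationship, verb_phrase, classification, domain, user):
        self._records.append([relationship, classification, domain, user])
        self._by_key[(verb_phrase.lower(), domain)].append(classification)

    def suggest(self, verb_phrase, domain):
        same_domain = self._by_key.get((verb_phrase.lower(), domain), [])
        if same_domain:
            return same_domain
        # fall back to classifications made in other application domains
        return [c for (verb, _d), cs in self._by_key.items()
                for c in cs if verb == verb_phrase.lower()]

# Hypothetical usage (classification strings chosen by a user, not the paper's):
# kb = VerbPhraseKB()
# kb.store("Character opens Door", "opens", "interaction:<manipulates>", "theatre", "u1")
# kb.suggest("opens", "bank")   # falls back to the theatre classification
```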

3.4 Use of the Ontology

The ontology can be used for comparing relationships across two databases by first comparing the entities, followed by classification of the verb phrases accompanying the relationships. Examples are shown in Table 5. The ontology consists of a verb phrase classification scheme, a knowledge base that stores the classified verb phrases, organized by user and application, and a user-questioning scheme as mentioned above. The user is instructed to classify the entities of a relationship as actor, action, or artifact. The next step is to classify the verb
phrase. First, the user is asked to select one of the three categories: ‘Status,’ ‘Interaction,’ or ‘Change of Status.’ Based on this selection, and the constraints provided by the entity types, primitives within each category are presented to the user for an appropriate classification. Suppose a user classifies a relationship as ‘Status.’ Then, knowing the nature of the entities, only certain primitives are presented as possible classifications of the relationship. Furthermore, identifying that a verb phrase is either status, change of status, or interaction restricts the subset of categories from which an appropriate classification can be obtained and, hence, the options presented to the user. If the verb phrase cannot be classified in this way, then the other levels are checked to see if they are needed.
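As a hedged illustration of this constrained presentation (the constraint entries below are examples assembled from primitives mentioned in the text, not the paper's Table 4):

```python
# Example constraint table: which 'Status' primitives may be offered for a given
# pair of entity types.  Entries are illustrative only; the paper's Table 4
# defines the actual constraints.
STATUS_CONSTRAINTS = {
    ("actor", "action"):      ["perform"],
    ("actor", "artifact"):    ["is-owner-of", "in-control-of", "is-assigned-to"],
    ("artifact", "artifact"): ["is-subjected-to"],
}

def candidate_primitives(category, left_type, right_type):
    """Primitives shown to the user, restricted by the entity-type pair."""
    allowed = STATUS_CONSTRAINTS.get((left_type, right_type), [])
    if category == "status":
        return allowed
    if category == "change of status":
        # 'Change of Status' inherits the 'Status' constraints (Section 3.2)
        return ["intent/attempt/transition of " + p for p in allowed]
    return []  # 'Interaction' constraints would be defined analogously

# candidate_primitives("status", "actor", "artifact")
# -> ['is-owner-of', 'in-control-of', 'is-assigned-to']
```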

4 Assessment Assessing an ontology is a difficult task. A plausible approach to assessment of an ontology is suggested by Gruninger and Fox [1995]. They suggest evaluating the ‘competency’ of an ontology. One of the ways to determine this ‘competency’ is to identify a list of queries that a knowledge-base, which builds on the ontology, should be able to answer (competency queries). Based on these queries, the ontology may be evaluated by posing questions such as: Does the ontology contain enough information to answer these types of queries? Do the answers require a particular level of detail or representation of a particular area? Noy and McGuiness [2001] suggest that the competency questions may be representative, but need not be exhaustive. Following our intent of classifying relationships for the purpose of comparison across databases, we attempted to assess whether the classification scheme of the ontology can provide a correct and complete classification of relationship verb phrases. To do so, a study was carried out which involved the following steps: 1) generation of the verb phrases to be classified; 2) generation of relationships using the verb phrases in different application domains; and 3) classification of all verb phrases.


Step 1: Generation of Verb Phrases. Only business-related verbs were used because the relationship ontology is intended for use with business databases; this also restricts the scope of the research. Since the SPEDE verbs [Cottam, 2000] were developed for business applications, these automatically became part of the sample set. The researchers independently selected business-related verbs from a set of 700 generated randomly from WordNet. The verbs that were common to the selections made by both researchers were added to the list from SPEDE. The same procedure was carried out for a set of 300 verbs that were randomly selected by people who support the online dictionary http://dictionary.cambridge.org/. This resulted in a total of 211 business verbs.

Step 2: Generation of Relationships Containing Verbs by Application Domain. For each verb, a definition was obtained from the on-line dictionary. Dictionaries provide examples for understanding and context, which helped to generate the relationships. Relationships were generated for seven application domains (approximately 30 verbs in each): 1) education, 2) business management, 3) manufacturing, 4) airline, 5) service, 6) marketing, and 7) retail. Examples are shown in Table 6.

After generating the relationships, the researchers independently classified them using the relationship ontology. First, 30 verbs were classified, and the researchers agreed on 80% of the cases. The remaining verbs were then classified. The next step involved assessing how many of the ontology classifications the set of 211 verbs covered, to test for completeness. The researchers generated additional relationships for ten subclasses, for a total of 225. Sample classifications are shown in Table 7. The results of this exercise were encouraging, especially given our focus on evaluating the competency of the ontology [Gruninger and Fox 1995]. The classification scheme worked well for these sample relationships. It allowed for the classification of all verb phrases. The biggest difficulty was in identifying whether to move from one level to the next. For example, Student acquires Textbook is immediately classifiable
by the primitives. In other cases, the next layer was necessary. Further research is needed to design a user interface that can explain the use and categories to the user so they can be effectively applied. A preliminary version of a prototype has been developed. This will be completed and an empirical test carried out with typical end-users, most likely, database designers.

5 Conclusion A classification scheme for comparing relationship verb phrases has been presented. It is based upon results obtained from research on conceptual modeling, common sense knowledge of a typical life cycle, and the domain-dependent nature of relationships. Further research is needed to complete the ontology system for which the classification scheme will be a part. Then, it needs to be expanded to allow for multiple classifications and the user interface refined.

References

1. Bergholtz, M., and Johannesson, P., “Classifying the Semantics of Relationships in Conceptual Modelling by Categorization of Roles,” Proceedings of the 6th International Workshop on Applications of Natural Language to Information Systems (NLDB’01), 28-29 June 2001, Madrid, Spain.
2. Biskup, J. and Embley, D.W., “Extracting Information from Heterogeneous Information Sources using Ontologically Specified Target Terms,” Information Systems, Vol. 28, No. 3, 2003.
3. Brachman, R.J., “What IS-A is and Isn’t: An Analysis of Taxonomic Links in Semantic Networks,” IEEE Computer, October 1983.
4. Brodie, M., “Association: A Database Abstraction,” Proceedings of the Entity-Relationship Conference, 1981.
5. Chen, P., “The Entity-Relationship Approach,” in Information Technology in Action: Trends and Perspectives, Englewood Cliffs: Prentice Hall, 1993, pp. 13-36.
6. Coad, P. et al., Object Models: Strategies, Patterns, & Applications. Prentice Hall, 1995.
7. Cottam, H., “Ontologies to Assist Process Oriented Knowledge Acquisition,” http://www.spede.co.uk/papers/papers.htm, 2000.
8. Dahlgren, K., Naive Semantics for Natural Language Understanding, Kluwer Academic Publishers, Hingham, MA, 1988.


9. Dullea, J. and Song, I.-Y., “A Taxonomy of Recursive Relationships and Their Structural Validity in ER Modeling,” in Akoka, J. et al. (eds.), Conceptual Modeling – ER’99, International Conference on Conceptual Modeling, Lecture Notes in Computer Science 1728, Paris, France, 15-18 November 1999, pp. 384-389.
10. Embley, D., Campbell, D.M., Jiang, Y.S., Ng, Y.K., Smith, R.D., Liddle, S.W. and Quass, D.W., “A Conceptual-modeling Approach to Web Data Extraction,” Data & Knowledge Engineering, 1999.
11. Fellbaum, C., “Introduction,” in WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, Mass., 1998, pp. 1-19.
12. Gamma, E., Helm, R., Johnson, R., and Vlissides, J., Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, 1995.
13. Goldstein, R.C. and Storey, V.C., “Data Abstractions: Why and How,” Data and Knowledge Engineering, Vol. 29, No. 3, 1999, pp. 1-18.
14. Gruninger, M. and Fox, M.S., “Methodology for the Design and Evaluation of Ontologies,” in Proceedings of the Workshop on Basic Ontological Issues in Knowledge Sharing, IJCAI-95, Montreal.
15. Hay, D.C., and Barker, R., Data Model Patterns: Conventions of Thought. Dorset House, 1996.
16. Kedad, Z., and Metais, E., “Dealing with Semantic Heterogeneity During Data Integration,” in Akoka, J. et al. (eds.), Conceptual Modeling – ER’99, International Conference on Conceptual Modeling, Lecture Notes in Computer Science 1728, Paris, France, 15-18 November 1999, pp. 325-339.
17. Larman, C., Applying UML and Patterns. Prentice-Hall, 1997.
18. Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D. and Miller, K.J., “Introduction to WordNet: An On-line Lexical Database,” International Journal of Lexicography, Vol. 3, No. 4, 1990, pp. 235-244.
19. Motschnig-Pitrik, R. and Mylopoulos, J., “Class and Instances,” International Journal of Intelligent and Cooperative Systems, Vol. 1, No. 1, 1992, pp. 61-92.
20. Motschnig-Pitrik, R., “A Generic Framework for the Modeling of Contexts and its Applications,” Data and Knowledge Engineering, Vol. 32, 2000, pp. 145-180.
21. Noy, N.F., and McGuinness, D.L., Ontology Development 101: A Guide to Creating Your First Ontology, 2001. Available at http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html. Accessed 15 March 2004.
22. Smith, J., Smith, D., “Database Abstractions: Aggregation and Generalization,” ACM Transactions on Database Systems, Vol. 2, No. 2, 1977, pp. 105-133.
23. Wand, Y., Storey, V.C. and Weber, R., “Analyzing the Meaning of a Relationship,” ACM Transactions on Database Systems, Vol. 24, No. 4, December 1999, pp. 494-528.
24. Weber, R., “Ontological Issues in Accounting Information Systems,” in Sutton, S. and Arnold, V. (eds.), Researching Accounting as an Information Systems Discipline, 2002.

Fast Mining Maximal Frequent ItemSets Based on FP-Tree Yuejin Yan*, Zhoujun Li, and Huowang Chen 1

School of Computer Science, National University of Defense Technology, Changsha 410073, China Tel: 86-731-4532956 [emailprotected] http://www.nudt.edu.cn

Abstract. Maximal frequent itemsets mining is a fundamental and important problem in many data mining applications. Since the MaxMiner algorithm introduced the enumeration trees for MFI mining in 1998, there have been several methods proposed to use depth-first search to improve performance. This paper presents FIMfi, a new depth-first algorithm based on FP-tree and MFI-tree for mining MFI. FIMfi adopts a novel item ordering policy for efficient lookaheads pruning, and a simple method for fast superset checking. It uses a variety of old and new pruning techniques to prune the search space. Experimental comparison with previous work reveals that FIMfi reduces the number of FP-trees created greatly and is more than 40% superior to the similar algorithms on average.

1 Introduction

Since the frequent itemsets mining problem (FIM) was first addressed [1], frequent itemset mining in large databases has been an important problem, for it enables essential data mining tasks such as discovering association rules, data correlations, sequential patterns, etc. There are two types of algorithms for mining frequent itemsets. The first is the candidate set generate-and-test approach [1]. The basic idea is to generate and test candidate itemsets. Each candidate itemset with k+1 items is generated only from frequent itemsets with k items. This process is repeated in a bottom-up fashion until no candidate itemset can be generated. At each level, the frequencies of all candidate itemsets are tested by scanning the database. This method therefore requires scanning the database several times; in the worst case, the number of scans equals the maximal length of the frequent itemsets. Besides this, many candidate itemsets are generated, most of which turn out to be infrequent. The other method is the data transformation approach [2, 4]: it avoids the cost of generating and testing a large number of candidate sets by growing a frequent itemset from its prefix. It constructs a sub-database related to each frequent itemset h such that all frequent itemsets that have h as a prefix can be mined using only this sub-database.

* Corresponding author.


The number of frequent itemsets increases exponentially with the length of the frequent itemsets, so mining all frequent itemsets becomes infeasible when long frequent itemsets are present. However, since the set of frequent itemsets is downward closed (every subset of a frequent itemset is frequent), it is sufficient to discover only the maximal frequent itemsets. As a result, researchers have turned to mining MFIs (maximal frequent itemsets) [5,6,9,10,4,7]. A frequent itemset is called maximal if it has no frequent superset. Given the set of MFIs, it is easy to analyze some interesting properties of the database, such as the longest pattern, the overlap of the MFIs, etc. Also, there are applications where the MFIs are adequate, for example, combinatorial pattern discovery in biological applications [3]. This paper focuses on the MFI mining problem based on the data transformation approach. We use an FP-tree to represent the sub-database containing all relevant frequency information, and an MFI-tree to store information about discovered MFIs that is useful for superset frequency pruning. With these two data structures, our algorithm adopts a novel item ordering policy and integrates a variety of old and new pruning strategies. It also uses a simple but fast superset checking method along with some other optimizations.

2 Preliminaries and Related Work This section will formally describe the MFI mining problem and the set enumeration tree that represents the searching space. Also the related works and two important data structure, FP-tree and MFI-tree, which is used in our scheme, will be introduced in this section.

2.1 Problem Revisited

Let I be a set of m distinct items. Let D denote a database of transactions, where each transaction contains a set of items. A set of items is also called an itemset, and an itemset with k items is called a k-itemset. The support of an itemset X, denoted sup(X), is the number of transactions in which X occurs as a subset. For a given D and threshold min_sup, an itemset X is frequent if its support is at least min_sup. If X is frequent and every proper superset Y of X has sup(Y) < min_sup, then X is called a maximal frequent itemset. From these definitions we have two lemmas:

Lemma 1: A proper subset of any frequent itemset is not a maximal frequent itemset.

Lemma 2: A subset of any frequent itemset is a frequent itemset; a superset of any infrequent itemset is not a frequent itemset.
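To make these definitions concrete, here is a brute-force sketch (ours, and exponential, so only for tiny examples; it is not the paper's mining algorithm):

```python
from itertools import combinations

def support(itemset, transactions):
    """sup(X): number of transactions that contain X as a subset."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= set(t))

def maximal_frequent_itemsets(transactions, min_sup):
    """Enumerate all itemsets, keep the frequent ones, then drop every
    frequent itemset that has a frequent proper superset (Lemma 1)."""
    items = sorted({i for t in transactions for i in t})
    frequent = [set(c)
                for k in range(1, len(items) + 1)
                for c in combinations(items, k)
                if support(c, transactions) >= min_sup]
    return [f for f in frequent if not any(f < g for g in frequent)]

# Hypothetical toy database:
# db = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c", "d"}]
# maximal_frequent_itemsets(db, min_sup=2)  ->  [{'a', 'b', 'c'}]
```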


Given a transactional database D, suppose I is the set of items that occur in it; then any combination of the items in I could be frequent, and all these combinations compose the search space, which can be represented by a set enumeration tree [5]. For example, suppose I = {a,b,c,d,e,f} is sorted in lexicographic order; then the search tree is shown in Figure 1. To keep the tree from growing too big, we use the subset infrequency pruning and superset frequency pruning techniques in the tree; we introduce these two pruning techniques in the next section. The root of the tree represents the empty itemset, and the nodes at level k contain all of the k-itemsets. The itemset associated with each node n will be referred to as the node's head(n). The possible extensions of the itemset are denoted con_tail(n), which is the set of items after the last item of head(n). The frequent extensions, denoted fre_tail(n), are the set of items that can be appended to head(n) to build longer frequent itemsets. In a depth-first traversal of the tree, fre_tail(n) contains only the frequent extensions of n. The itemset associated with each child node of node n is built by appending one item of fre_tail(n) to head(n). As an example, in Figure 1, suppose node n is associated with {b}; then head(n) = {b} and con_tail(n) = {c,d,e,f}. We can see that {b,f} is not frequent, so fre_tail(n) = {c,d,e}. The child node of n, {b,e}, is built by appending e from fre_tail(n) to {b}.

Fig. 1. Search space tree

The problem of MFI mining can be thought of as finding a border through the tree: all the itemsets above the border are frequent, and the others are not. All MFIs lie near the border. In our example in Figure 1, the itemsets in ellipses are MFIs.

2.2 Related Work

Given the set enumeration tree, we can describe the most recent approaches to the MFI mining problem. MaxMiner [5] employs a breadth-first traversal policy for the search. To reduce the search space according to the lemmas above, it performs not only subset infrequency pruning, which skips over itemsets that have an infrequent subset, but also superset frequency pruning (also called lookaheads pruning). To increase the effectiveness of superset frequency pruning, MaxMiner dynamically reorders the children nodes, a technique used in all the MFI algorithms that followed it [4,6,7,9,10]. Normally a depth-first approach has better performance on lookaheads, but MaxMiner uses a breadth-first approach instead to limit the number of passes over the database.


DepthProject performs a mixed depth-first traversal of the tree, and applies subset infrequency pruning and a variation of superset frequency pruning [6]. It also uses an improved counting method based on transaction projections along its branches; the original database and the projections are represented as bitmaps. The experimental results in [6] show that DepthProject outperforms MaxMiner by more than a factor of two. Mafia [7] is another depth-first algorithm; it also uses a vector bitmap representation, where the count of an itemset is based on a column of the bitmap. Besides the two pruning methods mentioned above, another pruning technique called PEP (Parent Equivalence Pruning), introduced in [8], is also used in Mafia; the experiments in [7] show that PEP prunes the search space greatly. Both DepthProject and Mafia mine a superset of the MFIs and require a post-pruning step to eliminate non-maximal frequent itemsets. GenMax [9] integrates the pruning with the mining and finds the exact MFIs by using two strategies. First, just as the transaction database is projected onto the current node, the discovered MFI set can also be projected onto the node (local MFI), which yields fast superset checking; second, GenMax uses diffset propagation for fast frequency computation. AFOPT [4] uses a data structure called the AFOPT tree, in which items are ordered by ascending frequency, to store the transactions of the original database. It also uses subset infrequency pruning, superset frequency pruning and PEP pruning to reduce the search space, and it employs the LMFI generated by a pseudo-projection technique to test whether a frequent itemset is a subset of a discovered MFI. The algorithm of Grahne and Zhu [10] is an extension of the FP-growth method, for MFI mining only. It uses an FP-tree to store the transaction projection of the original database for each node in the tree. In order to test whether a frequent itemset is a subset of any discovered MFI during lookaheads pruning, another tree structure (the MFI-tree) is used to keep track of all discovered MFIs; this makes superset checking effective. The algorithm also uses an array for each node to store the counts of all 2-itemsets that are subsets of the frequent-extension itemset; this lets the algorithm scan each FP-tree only once for each recursive call emanating from it. The experimental results in [10] show that this algorithm has the best performance on almost all the tested databases.

2.3 FP-Tree and MFI-Tree

The FP-growth method [2] builds a data structure called the FP-tree (Frequent Pattern tree) for each node of the search space tree. The FP-tree is a compact representation of all relevant frequency information of the current node: each of its paths from the root to a node represents an itemset, and the nodes along the paths are stored according to the order of the items in fre_tail(n). Each node of the FP-tree also stores the number of transactions or conditional pattern bases containing the itemset represented by the path. Compression is achieved by building the tree in such a way that overlapping itemsets share prefixes of the corresponding branches. Each FP-tree is associated with a header table. The single items in the tail, together with the support of the itemset that is the union of the head and the item, are stored in the header table in decreasing order of support. The entry for an item also contains the head of a list that links to all the corresponding nodes of the FP-tree.


To construct the FP-tree of node n, the FP-growth method first finds all the frequent items in fre_tail(n) by an initial scan of the database or of head(n)'s conditional pattern bases, which come from the FP-tree of its parent node. These items are then inserted into the header table in the order of the items in fre_tail(n). In the next and last scan, each frequent itemset that is a subset of the tail is inserted into the FP-tree as a branch. If a new itemset shares a prefix with another itemset that is already in the tree, the new itemset shares the branch representing the common prefix with the existing itemset. For example, for the database and min_sup shown in Figure 2 (a), the FP-trees of the root and of itemset {f} are shown in Figure 2 (b) and (c).

Fig. 2. Examples of FP-tree
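For readers who want something executable, the following is a compact, simplified sketch of FP-tree construction as described above (shared prefixes, support-ordered insertion, node-link lists); the class and function names are ours, and several details of the paper's trees (header-table ordering, node levels) are omitted.

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}        # item -> FPNode
        self.next = None          # node-link to the next node with the same item

class FPTree:
    """Minimal FP-tree: shared-prefix storage of frequent-item projections."""
    def __init__(self):
        self.root = FPNode(None, None)
        self.header = {}          # item -> first node in the node-link list

    def insert(self, ordered_items, count=1):
        node = self.root
        for item in ordered_items:            # items already support-ordered
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                child.next, self.header[item] = self.header.get(item), child
            child.count += count
            node = child

def build_fptree(transactions, min_sup):
    """Two passes: find frequent items, then insert each transaction's
    frequent items in decreasing order of support."""
    freq = Counter(i for t in transactions for i in t)
    freq = {i: c for i, c in freq.items() if c >= min_sup}
    order = lambda t: sorted((i for i in t if i in freq),
                             key=lambda i: (-freq[i], i))
    tree = FPTree()
    for t in transactions:
        tree.insert(order(t))
    return tree, freq
```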

The algorithm of [10] maintains, along with the FP-tree, an array for each node to avoid the first scan of the conditional pattern bases. For each 2-itemset {a,b} in the frequent-extension itemset, an array entry is used to store its support; then, when extending the tree from a node to one of its children, we can build the header of the child's FP-tree from the array and avoid scanning the FP-tree of the current node again. Consider a given MFI M at node n in depth-first MFI mining: if M contains the head of n together with its possible extensions, then none of the children of n need be considered, according to lemma 1. This is superset frequency pruning, also called lookaheads in [5]. Lookaheads need access to information about the discovered MFIs that is relevant to the current node. The algorithm of [10] uses another tree structure (the MFI-tree) to meet this need. The differences between the MFI-tree and the FP-tree of the same node are as follows: first, the nodes do not record frequency information, but instead store the length of the itemset represented by the path from the root to the current node; second, each itemset S represented by a path is such that its union with the node's head is a subset of a certain discovered MFI. In addition, when considering an offspring node of a node, the MFI-tree of the node is updated as soon as a new MFI is found. Figure 3 shows several examples of MFI-trees.

3 Mining Maximal Frequent Itemsets by FIMfi

In this section, we discuss our algorithm FIMfi in detail and explain why it is faster than some previous schemes.


Fig. 3. Examples of MFI-tree

3.1 Pruning Techniques

Subset Infrequency Pruning: Suppose n is a node in the search space tree. For each item x in con_tail(n) that could become an item of fre_tail(n), we need to compute the support of the itemset obtained by adding x to head(n). If this support is less than min_sup, we do not add x to fre_tail(n), and the node identified by that itemset is not considered any more. This is based on lemma 2: no superset of an infrequent itemset is frequent.

Superset Frequency Pruning: Superset frequency pruning is also called lookaheads pruning. Considering a node n, if the itemset formed by head(n) and fre_tail(n) is frequent, then all the children of n should be pruned (lemma 1). There are two existing methods for determining whether this itemset is frequent. The first is to count its support directly; this method is normally used in breadth-first algorithms such as MaxMiner. The second is to check whether one of its supersets is already among the discovered MFIs; this is commonly used by the depth-first MFI algorithms [4,7,9,10]. There are also other techniques, such as LMFI and MFI projection, that are used to reduce the cost of checking. For example, in the MFI-tree setting, we can finish the superset checking just by checking whether a superset of fre_tail(n) can be found among the conditional pattern bases of head(n). Here we propose a new way to do lookaheads pruning based on the FP-tree. For a given node, we can get all the conditional pattern bases of head(n) from the FP-tree of its parent node; our algorithm then tries to find a superset of fre_tail(n) among the conditional pattern bases whose last item's count is no less than the minimum support. If we find one, S, then we know that the union of head(n) and S is frequent, so the union of head(n) and fre_tail(n) is frequent by lemma 2. For example, when considering itemset {b}, the fre_tail of {b} is {a,c}, and there is a conditional pattern base of {b} of the form a:3, c:3 (Figure 2 (b)); then we know {b,a,c} is frequent, and all the children of {b} are pruned. If FIMfi finds such a superset of fre_tail(n) in the FP-tree and the resulting itemset is an undiscovered MFI, FIMfi needs to update the MFI-trees with it as described before. In addition, we also do superset frequency pruning with the itemset con_tail(n): before generating fre_tail(n) from con_tail(n), our algorithm checks whether there is a superset of con_tail(n) in the FP-tree. This is cheap because our scheme uses a very simple and fast method for superset checking (see Section 3.2).


Parent Equivalence Pruning: FIMfi also uses PEP for efficiency. Take any item x from fre_tail(n) whose support together with head(n) equals the support of head(n) itself; then any frequent itemset Z that contains head(n) but does not contain x has a frequent superset obtained by adding x. Since we only want MFIs, it is not necessary to count itemsets that contain head(n) but do not contain x. Therefore, we can move item x from fre_tail(n) to head(n). From the experimental results we find that PEP greatly reduces the number of FP-trees created, compared with the algorithm of [10].
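A hedged sketch of this check (the support function `sup` is assumed to be supplied by the caller; this is our illustration of PEP, not the paper's implementation):

```python
def parent_equivalence_pruning(head, tail, sup):
    """PEP sketch: every item x with sup(head + {x}) == sup(head) appears in
    all transactions containing head, so it is moved into the head and
    removed from the tail; returns the new head and the remaining tail."""
    head = set(head)
    base = sup(frozenset(head))
    moved = {x for x in tail if sup(frozenset(head | {x})) == base}
    return head | moved, [x for x in tail if x not in moved]
```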

3.2 Superset Checking

As discussed before, superset checking is a main operation in lookaheads pruning, because each new MFI needs to be checked before being added to the set of MFIs. MaxMiner needs to scan all the discovered MFIs and tries to match item by item against each discovered MFI. Though GenMax uses the LMFI to store all the relevant MFIs, it also needs item-by-item matching. The algorithm of [10] only needs to match fre_tail(n) item by item against the conditional pattern bases of head(n) in the MFI-tree. Our simple but fast superset checking method is based on the following lemmas:

Lemma 3: If there is a conditional pattern base of head(n) in the MFI-tree whose length is equal to the length of fre_tail(n), then the union of head(n) and fre_tail(n) is frequent.

Proof: Let S be the itemset represented by the base; then the union of head(n) and S is a subset of a discovered MFI and hence frequent. Every item of S belongs to fre_tail(n), so for bases of the same length we have S = fre_tail(n). Hence we obtain the lemma.

Lemma 4: If there is a conditional pattern base of head(n) in the MFI-tree whose length is equal to the length of con_tail(n), then the union of head(n) and con_tail(n) is frequent.

Proof: Let S be the itemset represented by the base; then the union of head(n) and S is frequent. Since con_tail(n) includes all possible extensions of head(n), S is a subset of con_tail(n), and for bases of the same length we have S = con_tail(n). Hence we obtain the lemma.

Lemma 5: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the counter associated with the last item of y is no less than min_sup, and the length of y is equal to the length of fre_tail(n), then the union of head(n) and fre_tail(n) is frequent.

Proof: Similar to Lemma 3.

Lemma 6: Suppose y is a conditional pattern base of head(n) in the FP-tree. If the counter associated with the last item of y is no less than min_sup, and the length of y is equal to the length of con_tail(n), then the union of head(n) and con_tail(n) is frequent.

Proof: Similar to Lemma 4.

According to lemma 3 and lemma 4, superset checking need not match item by item; it can be done just by comparing the lengths of two itemsets. Here the level of the last item in the base can be used as the length of the base. For more efficient length checking, the only change FIMfi makes to the MFI-tree is to store the node links of items in the header table in decreasing order of the bases' levels. Now the superset checking is very simple, for it only needs to compare the lengths of two itemsets. Similarly, the superset checking based on the FP-tree can also be made simple, according to lemma 5 and lemma 6. In this situation, we add a level to each node of the FP-tree, the level representing the length of the path from the node in question to the root, and the node links whose counts are no less than min_sup are stored in decreasing order of the levels. An example is shown in Figure 2 (d). Therefore, this superset checking is also simple, for it needs only to compare the lengths of two itemsets. Let us revisit the example in Section 3.1: when doing the superset checking at node {b}, we need only compare the length of the conditional pattern base a:3, c:3 to the length of the itemset {a,c}.
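A one-line sketch of the resulting check (our illustration; `base_lengths` would come from the level fields described above, stored in decreasing order):

```python
def superset_by_length(base_lengths, tail_size):
    """Lemma 3/5-style check: `base_lengths` are the levels (path lengths) of
    head(n)'s qualifying conditional pattern bases in decreasing order, and
    `tail_size` is |fre_tail(n)| (use |con_tail(n)| for the Lemma 4/6 variant).
    Every base is a subset of the tail, so equal length means the base is the
    tail itself, hence head(n) plus the tail is frequent and the node's
    children can be pruned."""
    return bool(base_lengths) and base_lengths[0] == tail_size
```

Only the first entry needs to be inspected because the node links are kept in decreasing order of level.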

3.3 Item Ordering Policy

The item ordering policy first appeared in [5] and has been used by almost all subsequent MFI algorithms, because it increases the effectiveness of superset frequency pruning. As we know, items with higher frequency are more likely to be members of long frequent itemsets and subsets of some discovered MFIs. For node n, after fre_tail(n) is generated and before extending to the children, the traditional scheme sorts the items of the tail in decreasing order of support. This makes the most frequent items appear in more of the itemsets that are frequent extensions of node n's offspring, so more offspring nodes are pruned. In general, this type of item ordering policy works best when lookaheads are implemented by scanning the database to count supports, as in breadth-first algorithms such as MaxMiner. All the recently proposed depth-first algorithms implement lookaheads pruning by superset checking instead, because counting supports is expensive under a depth-first policy. Since the superset checking of FIMfi is based on the MFI-tree and/or the FP-tree, we look for an item ordering policy that makes use of the information in the MFI-tree and/or the FP-tree. As we know, if S is a subset of the tail and its union with head(n) is frequent, then we can prune the corresponding child nodes, because the itemsets corresponding to these nodes and their offspring are not maximal (lemma 1). Based on the FP-tree and MFI-tree, a policy that lets S be the maximal such subset of fre_tail(n) achieves maximal pruning at the node in question. Suppose there are two candidate itemsets: one represented by the conditional pattern base of maximal length in the MFI-tree, and one represented by the conditional pattern base of maximal length in the FP-tree among the bases whose last item's count is no less than min_sup. Let S be the longer of the two; if we put the items of S at the head of fre_tail(n), we attain the maximal pruning. For example, when considering the node n identified by {e}, using the bases in Figure 2 (b), the sorted items in fre_tail(n) are in the sequence a,c,b, whereas the old decreasing order of supports is b,a,c. Using the old decreasing-order policy, one has to build FP-trees for nodes {e}, {e,a}, and {e,c}, but FIMfi with the new ordering policy only needs to build FP-trees for nodes {e} and {e,a}. Similarly, when considering the node {d}, we know fre_tail(n) = {a,c,b} as in Figure 3 (b), and the sorted items in fre_tail(n) are in the sequence a,c,b, while the old decreasing order of supports is b,a,c. The experimental results of these two policies are compared in Section 4. Furthermore, the items in fre_tail(n) - S are also sorted in decreasing order of support.
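A small sketch of this ordering (ours; the two S candidates are assumed to have been extracted from the MFI-tree and FP-tree as described above):

```python
def order_tail(fre_tail, support, s_from_mfi_tree, s_from_fp_tree):
    """Put the items of the longest known-frequent subset S first, then the
    remaining tail items in decreasing order of support (ties broken
    lexicographically)."""
    s = max(s_from_mfi_tree, s_from_fp_tree, key=len)
    s_items = [i for i in fre_tail if i in s]
    rest = sorted((i for i in fre_tail if i not in s),
                  key=lambda i: (-support[i], i))
    return s_items + rest

# Hypothetical numbers for the {e} example in the text:
# order_tail(["b", "a", "c"], {"b": 5, "a": 4, "c": 3},
#            s_from_mfi_tree={"a"}, s_from_fp_tree={"a", "c"})
# -> ['a', 'c', 'b']
```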

3.4 Optimizations

FIMfi uses the same array technique for counting frequencies as the algorithm of [10], but FIMfi does not count the whole triangular array as that algorithm does. Suppose that at node n part of the sorted fre_tail(n) is already known to be frequent together with head(n). When we extend the nodes corresponding to those items, the superset checking will return true and they will be pruned, so the array cells for the 2-itemsets inside that frequent part will never be used. Therefore, FIMfi does not count those cells when building the array. In this way, FIMfi spends less than the algorithm of [10] does when counting the array, and the larger this frequent part is, the more counting time is saved. We also use the memory management described in [10] to reduce the time consumed in allocating and deallocating space for FP-trees and MFI-trees.

Fig. 4. Pseudo-code of Algorithm FIMfi

3.5 FIMfi

Based on Sections 3.1-3.4, we show the pseudo-code of FIMfi in Figure 4. In each recursive call, each newly found MFI may be used in superset checking for ancestor nodes of the current node, so we use a parameter called M-trees to access the MFI-trees of the ancestor nodes. When the top-level call is over, all the MFIs to be mined are stored in the MFI-tree of the root of the search space tree. From line (4) to line (6), FIMfi does superset frequency pruning for the itemset con_tail(n'). When x is the last item of the header, there is no need to do the pruning,
because the pruning has already been done by the procedure that called the current one, at line (12) and/or line (22). Lines (7) to (10) use the optimized array technique. The PEP technique is used in lines (11) and (19). The superset frequency pruning for the itemset fre_tail(n') is done in lines (12) to (18); when the condition at line (17) is true, all the children of n' are pruned and fre_tail(n') need not be inserted into n.MFI-tree any more. Line (20) uses our novel item ordering policy. Line (21) builds a new FP-tree, n'.FP-tree. Lines (22) to (25) do another superset frequency pruning for fre_tail(n') in that tree. The return statements in lines (4), (6), (13), (17) and (24) mean that all the children of n after n' are pruned there, and the continue statements in lines (14), (18) and (25) mean that node n' is pruned and we can go on to consider the next child of n. After constructing n'.FP-tree and n'.MFI-tree and updating M-trees, FIMfi is called recursively with the new node n' and the new M-trees. Note that the algorithm does not employ the single-path trimming used in [10] and AFOPT: if, by constructing n'.FP-tree, we find that n'.FP-tree has only a single path, the superset checking at line (20) will return true, so there is a superset frequency pruning instead of a single-path trimming.

4 Experimental Evaluations

The first Workshop on Frequent Itemset Mining Implementations (FIMI'03) [11], which took place at ICDM'03 (the Third IEEE International Conference on Data Mining), featured several recently presented algorithms that are good at mining MFIs, such as AFOPT and Mafia; we now present performance comparisons of our FIMfi with them. All the experiments were conducted on a 2.4 GHz Pentium IV with 1024 MB of DDR memory running Microsoft Windows 2000 Professional. The codes of the other four algorithms were downloaded from [12], and the codes of all five algorithms were compiled using Microsoft Visual C++ 6.0. Due to the lack of space, only the results for three real dense datasets and one real sparse dataset are shown here. The datasets we used are selected from the 11 real datasets of FIMI'03 [12]: BMS-WebView-2 (sparse), Connect, Mushroom and Pumsb_star; their data characteristics can be found in [11].

4.1 Comparison of the Number of FP-Trees

The item ordering policy and the PEP technique are the main improvements of FIMfi. To test their contribution to pruning, we built two sub-algorithms, FIMfi-order and FIMfi-pep. Compared with FIMfi, FIMfi-order simply does not use PEP for pruning, while FIMfi-pep discards our novel item ordering policy along with the optimized array technique. We take the algorithm of [10] as the benchmark, because it is also an MFI mining algorithm based on the FP-tree and MFI-tree, and it performs MFI mining best for almost all the datasets in FIMI'03 [11]. The numbers of FP-trees created by the four algorithms are shown in Figure 5. On the datasets Mushroom, Connect and Pumsb_star, FIMfi-order and FIMfi-pep both generate fewer than half as many FP-trees as the benchmark. The combination of the ordering policy and PEP in FIMfi creates the smallest number of FP-trees among the four algorithms; in fact, at the lowest support on Mushroom, the benchmark creates more than 3 times as many FP-trees as FIMfi does. Note that Figure 5 contains no result for BMS-WebView-2: all four algorithms generate only one tree for BMS-WebView-2, so we omit it.

Fig. 5. Comparison of FP-trees’ Number

4.2 Performance Comparisons

The performance comparisons of FIMfi, the algorithm of [10], AFOPT and Mafia on the sparse dataset BMS-WebView-2 are shown in Figure 6. FIMfi is faster than AFOPT at the higher supports (those no less than 50%), while AFOPT is faster at the lower supports. FIMfi outperforms the algorithm of [10] by about 20% to 40% at all supports, and outperforms Mafia by more than 20 times.

Fig. 6. Performance on Sparse Datasets

Figure 7 gives the results of comparing the four algorithms on dense data. For all supports on the dense datasets, FIMfi has the best performance. FIMfi runs around 40% to 60% faster than the benchmark on all of the dense datasets. AFOPT is the slowest algorithm on Mushroom and Pumsb_star and runs 2 to 10 times slower than FIMfi on all of the datasets across all supports. Mafia is the slowest algorithm on Connect; it runs 2 to 5 times slower than FIMfi on Mushroom and Connect across all supports. On Pumsb_star, Mafia is outperformed by FIMfi at all supports, though it outperforms the benchmark at the lower supports.

Fig. 7. Performance on Dense Datasets

5 Conclusions

Different from the traditional item ordering policy, in which items are sorted in decreasing order of support, this paper introduces a novel item ordering policy based on the FP-tree and MFI-tree. The policy guarantees maximal pruning at each node of the search space tree and thus greatly reduces the number of FP-trees created. The experimental comparison of the number of FP-trees shows that FIMfi generates less than half the number of FP-trees the traditional policy does on dense datasets. We have also found a simple method for fast superset checking. The method reduces superset checking to checking the equality of two integers, and therefore makes superset checking cheaper. Several old and new pruning techniques are integrated into FIMfi. Among the new ones, the superset frequency pruning based on the FP-tree is introduced for the first time and makes the cutting of the search space more efficient. The PEP technique used in FIMfi greatly reduces the number of FP-trees created compared with the benchmark, as the experimental results in Section 4.1 show. In FIMfi we also present a new optimization of the array technique and use memory management to further reduce the run time. Our experimental results demonstrate that FIMfi is well optimized for mining MFI: it outperforms the benchmark by 40% on average, and on dense data it outperforms AFOPT and Mafia by 2 to more than 20 times.

Acknowledgements

We would like to thank Jianfei Zhu for providing the executable of FPMax and the corresponding code before the download website became available. We also thank Guimei Liu for providing the code of AFOPT and Doug Burdick for providing the website for downloading the code of Mafia.

References 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994. 2. J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In Proc. 2000 ACM-SIGMOD Int. Conf. on Management of Data (SIGMOD’00), Dallas, TX, May 2000. 3. L. Rigoutsos and A. Floratos. Combinatorial pattern discovery in biological sequences: The Teiresias algorithm. Bioinformatics, 14(1):55–67, 1998. 4. Guimei Liu, Hongjun Lu, Jeffrey Xu Yu, Wei Wang, and Xiangye Xiao. AFOPT: An Efficient Implementation of Pattern Growth Approach. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 19, 2003. 5. Roberto Bayardo. Efficiently mining long patterns from databases. In ACM SIGMOD Conference, 1998.

6. R. Agarwal, C. Aggarwal, and V. Prasad. A tree projection algorithm for generation of frequent itemsets. Journal of Parallel and Distributed Computing, 2001. 7. D. Burdick, M. Calimlim, and J. Gehrke. MAFIA: A Performance Study of Mining Maximal Frequent Itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 19, 2003. 8. M.J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed association rule mining. TR 99-10, CS Dept., RPI, Oct. 1999. 9. K. Gouda and M.J. Zaki. Efficiently Mining Maximal Frequent Itemsets. In Proc. of the IEEE Int. Conference on Data Mining, San Jose, 2001. 10. Gosta Grahne and Jianfei Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 19, 2003. 11. Bart Goethals and M.J. Zaki. FIMI’03: Workshop on Frequent Itemset Mining Implementations. In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, Florida, USA, November 19, 2003. 12. Codes and datasets available at http://fimi.cs.helsinki.fi/.

Multi-phase Process Mining: Building Instance Graphs B.F. van Dongen and W.M.P. van der Aalst Department of Technology Management, Eindhoven University of Technology P.O. Box 513, NL-5600 MB, Eindhoven, The Netherlands {b.f.v.dongen,w.m.p.v.d.aalst}@tm.tue.nl

Abstract. Deploying process-driven information systems is a time-consuming and error-prone task. Process mining attempts to improve this by automatically generating a process model from event-based data. Existing techniques try to generate a complete process model from the data acquired. However, unless this model is the ultimate goal of mining, such a model is not always required. Instead, a good visualization of each individual process instance can be enough. From these individual instances, an overall model can then be generated if required. In this paper, we present an approach which constructs an instance graph for each individual process instance, based on information in the entire data set. The results are represented in terms of Event-driven Process Chains (EPCs). This representation is used to connect our process mining to a widely used commercial tool for the visualization and analysis of instance EPCs. Keywords: Process mining, Event-driven process chains, Workflow management, Business Process Management.

1 Introduction

Increasingly, process-driven information systems are used to support operational business processes. Some of these information systems enforce a particular way of working. For example, Workflow Management Systems (WFMSs) can be used to force users to execute tasks in a predefined order. However, in many cases systems allow for more flexibility. For example, transactional systems such as ERP (Enterprise Resource Planning), CRM (Customer Relationship Management) and SCM (Supply Chain Management) are known to allow the users to deviate from the process specified by the system; e.g., in the context of SAP R/3 the reference models, expressed in terms of Event-driven Process Chains (EPCs, cf. [13,14,19]), are only used to guide users rather than to enforce a particular way of working. Operational flexibility typically leads to difficulties with respect to performance measurements. The ability to do these measurements, however, is what made companies decide to use a transactional system in the first place. To be able to calculate basic performance characteristics, most systems have their own built-in module. For the calculation of basic characteristics such as the average flow time of a case, no model of the process is required. However, for more complicated characteristics, such as the average time it takes to transfer work

from one person to the other, some notion of causality between tasks is required. This notion of causality is provided by the original model of the process, but deviations in execution can interfere with the causalities specified there. Therefore, in this paper, we present a way of defining certain causal relations in a transactional system. We do so without using the process definition from the system, but by only looking at a so-called process log. Such a process log contains information about the processes as they actually take place in a transactional system. Most systems can provide this information in some form, and the techniques used to infer relations between tasks from such a log are called process mining. The problem tackled in this paper has been inspired by the software package ARIS PPM (Process Performance Monitor) [12] developed by IDS Scheer. ARIS PPM allows for the visualization, aggregation, and analysis of process instances expressed in terms of instance EPCs (i-EPCs). An instance EPC describes the control-flow of a case, i.e., a single process instance. Unlike a trace (i.e., a sequence of events), an instance EPC provides a graphical representation describing the causal relations. In case of parallelism, there may be different traces having the same instance EPC. Note that in the presence of parallelism, two subsequent events do not have to be causally related. ARIS PPM exploits the advantages of having instance EPCs rather than traces to provide additional management information, i.e., instances can be visualized and aggregated in various ways. In order to do this, IDS Scheer has developed a number of adapters, e.g., there is an adapter to extract instance EPCs from SAP R/3. Unfortunately, these adapters can only create instance EPCs if the actual process is known. For example, the workflow management system Staffware can be used to export Staffware audit trails to ARIS PPM (Staffware SPM, cf. [20]) by taking projections of the Staffware process model. As a result, it is very time consuming to build adapters. Moreover, the approaches used only work in environments where explicit process models are available. In this paper, we do not focus on the visualization, aggregation, and analysis of process instances expressed in terms of instance EPCs or some other notation capturing parallelism and causality. Instead we focus on the construction of instance graphs. An instance graph can be seen as an abstraction of the instance EPCs used by ARIS PPM. In fact, we will show a mapping of instance graphs onto instance EPCs. Instance graphs also correspond to a specific class of Petri nets known as marked graphs [17], T-systems [9] or partially ordered runs [8,10]. Tools like VIPTool allow for the construction of partially ordered runs given an ordinary Petri net and then use these instance graphs for analysis purposes. In our approach we do not construct instance graphs from a known Petri net but from an event log. This enhances the applicability of commercial tools such as ARIS PPM and the theoretical results presented in [8,10]. The mapping from instance graphs to these Petri nets is not given here. However, it will become clear that such a mapping is trivial. In the remainder of this paper, we will first describe a common format to store process logs in. Then, in Section 3 we will give an algorithm to infer causality at an instance level, i.e., a model is built for each individual case. In Section 4 we will provide a translation of these models to EPCs. Section 5 shows a concrete

example and demonstrates the link to ARIS PPM. Section 6 discusses related work followed by some concluding remarks.

2 Preliminaries

This section contains most definitions used in the process of mining for instance graphs. The structure of this section is as follows. Subsection 2.1 defines a process log in a standard format. Subsection 2.2 defines the model for one instance.

2.1 Process Logs

Information systems typically log all kinds of events. Unfortunately, most systems use a specific format. Therefore, we propose an XML format for storing event logs. The basic assumption is that the log contains information about specific tasks executed for specific cases (i.e., process instances). Note that unlike ARIS PPM we do not assume any knowledge of the underlying process. Experience with several software products (e.g., Staffware, InConcert, MQSeries Workflow, FLOWer, etc.) and organization-specific systems (e.g., Rijkswaterstaat, CJIB, and several hospitals) show that these assumptions are justified. Figure 1 shows the schema definition of the XML format. This format is supported by our tools, and mappings from several commercial systems are available. The format allows for logging multiple processes in one XML file (cf. element “Process”). Within each process there may be multiple process instances (cf. element “ProcessInstance”). Each “ProcessInstance” element is composed of “AuditTrailEntry” elements. Instead of “AuditTrailEntry” we will also use the terms “log entry” or “event”. An “AuditTrailEntry” element corresponds to a single event and refers to a “WorkflowModelElement” and an “EventType”. A “WorkflowModelElement” may refer to a single task or a subprocess. The “EventType” is used to indicate the type of event. Typical events are: “schedule” (i.e., a task becomes enabled for a specific instance), “assign” (i.e., a task

Fig. 1. XML schema for process logs.

instance is assigned to a user), “start” (the beginning of a task instance), “complete” (the completion of a task instance). In total, we identify 12 events. When building an adapter for a specific system, the system-specific events are mapped onto these 12 generic events. As Figure 1 shows, the “WorkflowModelElement” and “EventType” are mandatory for each “AuditTrailEntry”. There are three optional elements: “Data”, “Timestamp”, and “Originator”. The “Data” element can be used to store data related to the event of the case (e.g., the amount of money involved in the transaction). The “Timestamp” element is important for calculating performance metrics like flow time, service times, service levels, utilization, etc. The “Originator” refers to the actor (i.e., user or organization) performing the event. The latter is useful for analyzing organizational and social aspects. Although each element is vital for the practical applicability of process mining, we focus on the “WorkflowModelElement” element. In other words, we abstract from the “EventType”, “Data”, “Timestamp”, and “Originator” elements. However, our approach can easily be extended to incorporate these aspects. In fact, our tools deal with these additional elements. However, for the sake of readability, in this paper events are identified by the task and case (i.e., process instance) involved.
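As an illustration of the format, the following Python sketch reads a small log with the structure described above and abstracts each process instance to its sequence of tasks, as done in the remainder of the paper. The root element name and the id attributes are assumptions made for the example; the real schema in Figure 1 may differ in such details.

```python
# Reading a process log in the XML format of Figure 1 (simplified).
import xml.etree.ElementTree as ET
from collections import defaultdict

EXAMPLE_LOG = """
<WorkflowLog>
  <Process id="P1">
    <ProcessInstance id="case1">
      <AuditTrailEntry>
        <WorkflowModelElement>S</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>B</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
    </ProcessInstance>
    <ProcessInstance id="case2">
      <AuditTrailEntry>
        <WorkflowModelElement>S</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>B</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
      <AuditTrailEntry>
        <WorkflowModelElement>A</WorkflowModelElement>
        <EventType>complete</EventType>
      </AuditTrailEntry>
    </ProcessInstance>
  </Process>
</WorkflowLog>
"""

def read_log(xml_text):
    """Return a dict mapping case id -> list of task names, abstracting
    from event type, timestamp, originator and data as in the paper."""
    root = ET.fromstring(xml_text)
    log = defaultdict(list)
    for instance in root.iter("ProcessInstance"):
        case = instance.get("id")
        for entry in instance.iter("AuditTrailEntry"):
            log[case].append(entry.findtext("WorkflowModelElement"))
    return dict(log)

print(read_log(EXAMPLE_LOG))
# {'case1': ['S', 'A', 'B'], 'case2': ['S', 'B', 'A']}
```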

Table 1 shows an example of a small log after abstracting from all elements except for the “WorkflowModelElement” element (i.e., the task identifier). The log shows two cases. For each case three tasks are executed. Case 1 can be described by the sequence SAB and case 2 can be described by the sequence SBA. In the remainder we will describe process instances as sequences of tasks where each element in the sequence refers to a “WorkflowModelElement” element. A process log is represented as a bag (i.e., multiset) of process instances.

Definition 2.1. (Process Instance, Process Log) Let T be a set of log entries, i.e., references to tasks. Let T+ denote the set of sequences of log entries with length at least 1. We call σ ∈ T+ a process instance (i.e., case) and W ∈ bag(T+) a process log. If σ ∈ T+ is a process instance of length n, then each element σ_i corresponds to an “AuditTrailEntry” element in Figure 1. However, since we abstract from timestamps, event types, etc., one can think of σ_i as a reference to a task. |σ| denotes the length of the process instance and σ_i its i-th element. We assume process instances to be of finite length. W denotes a

bag, i.e., a multiset of process instances. W(σ) is the number of times a process instance of the form σ appears in the log. The total number of instances in a bag is finite. Since W is a bag, we use the normal set operators where convenient. For example, we use σ ∈ W as a shorthand notation for W(σ) ≥ 1.

2.2 Instance Nets

After defining a process log, we now define an instance net. An instance net is a model of one instance. Since we are dealing with an instance that has been executed in the past, it makes sense to define an instance net in such a way that no choices have to be made. As a consequence of this, no loops will appear in an instance net. For readers familiar with Petri nets it is easy to see that instance nets correspond to “runs” (also referred to as occurrence nets) [8]. Since events that appear multiple times in a process instance have to be duplicated in an instance net, we define an instance domain. The instance domain will be used as a basis for generating instance nets.

Definition 2.2. (Instance domain) Let σ ∈ T+ be a process instance such that |σ| = n. We define dom(σ) = {1, ..., n} as the domain of σ.

Using the domain of an instance, we can link each log entry in the process instance to a specific task, i.e., i ∈ dom(σ) can be used to represent the i-th element σ_i of σ. In an instance net, the instance is extended with an ordering relation to reflect causal relations.

Definition 2.3. (Instance net) Let σ ∈ T+ be a process instance. Let dom(σ) be the domain of σ and let ≺ be an ordering on dom(σ) that is irreflexive, asymmetric and acyclic. We call N = (σ, ≺) an instance net.

The definition of an instance net given here is rather flexible, since it is defined only by a set of entries from the log and an ordering on that set. An important feature of this ordering is that if i ≺ j, then there is no k such that i ≺ k and k ≺ j. Since the set of entries is given as a log, and an instance mapping can be inferred for each instance based on textual properties, we only need to define the ordering relation based on the given log. In Section 3.1 it is shown how this can be done. In Section 4 we show how to translate an instance net into a model in a particular language (i.e., instance EPCs).

3 Mining Instance Graphs

As seen in Definition 2.3, an instance net consists of two parts. First, it requires a sequence σ of events as they appear in a specific instance. Second, an ordering ≺ on the domain of σ is required. In this section, we will provide a method

that infers such an ordering relation on T using the whole log. Furthermore, we will present an algorithm to generate instance graphs from these instance nets.

3.1 Creating Instance Nets

Definition 3.1. (Causal ordering) Let W be a process log over a set of log entries T, i.e., W ∈ bag(T+). Let a and b be two log entries. We define a causal ordering →_W on W in the following way:
– a >_W b if and only if there is an instance σ ∈ W and i ∈ {1, ..., |σ|−1} such that σ_i = a and σ_{i+1} = b;
– a ▷_W b if and only if there is an instance σ ∈ W and i ∈ {1, ..., |σ|−2} such that σ_i = σ_{i+2} = a and σ_{i+1} = b, a ≠ b, and neither a >_W a nor b >_W b;
– a →_W b if and only if (a >_W b and not b >_W a) or a ▷_W b or b ▷_W a or (a = b and a >_W a).
The basis of the causal ordering defined here is that two tasks A and B have a causal relation if in some process instance A is directly followed by B and B is never directly followed by A. However, this can lead to problems if the two tasks are in a loop of length two. Therefore, A →_W B also holds if there is a process instance containing ABA or BAB and neither A nor B can directly succeed themselves. If A directly succeeds itself, then A →_W A. For the example log presented in Table 1, T = {S, A, B} and the causal ordering inferred on T is composed of the following two elements: S →_W A and S →_W B.
By defining the →_W relation, we have defined an ordering relation on T. This relation is not necessarily irreflexive, asymmetric, nor acyclic. It can, however, be used to induce an ordering on the domain of any instance that does have these properties. This is done in two steps. First, an asymmetric order ≺ is defined on the domain of some σ ∈ W. Then, we prove that this relation is irreflexive and acyclic.
Definition 3.2. (Instance ordering) Let W be a process log over T and let σ ∈ W be a process instance. Furthermore, let →_W be a causal ordering on T. We define an ordering ≺ on the domain of σ in the following way. For all i, j ∈ dom(σ) with i < j, we define i ≺ j if and only if σ_i →_W σ_j and there is no k with i < k < j such that both σ_i →_W σ_k and σ_k →_W σ_j.
The essence of the relation defined here is in the final part. For each entry within an instance, we find the closest causal predecessor and the closest causal successor. If there is no causal predecessor or successor, then the entry is in parallel with all its predecessors or successors respectively. It is trivial to see that this can always be done for any process instance and with any causal relation. In the example log presented in Table 1 there are two process instances, case 1 and case 2. From here on, we will refer to case 1 as σ1 and to case 2 as σ2. We know that σ1 = SAB and that σ2 = SBA. Using the causal relation →_W, the relation ≺ for σ1 is inferred such that 1 ≺ 2 and 1 ≺ 3. For σ2 this also applies. It is easily seen that the ordering relation ≺ is indeed irreflexive and asymmetric, since it is only defined on i and j for which i < j. Therefore, it can easily be concluded that it is irreflexive and acyclic. Furthermore, the third property holds as well. Therefore we can now define an instance net as N = (σ, ≺).
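The following Python sketch follows the prose of Definitions 3.1 and 3.2 (and their reconstruction above): it derives the directly-follows and length-two-loop relations, combines them into a causal relation, and computes the closest-causal instance ordering. Function names are illustrative and corner-case handling is simplified.

```python
# Sketch of the causal ordering and the instance ordering, applied to
# the log of Table 1.

def relations(log):
    """log: dict case -> list of tasks. Returns (directly_follows, two_loop)."""
    df, two_loop = set(), set()
    for trace in log.values():
        for i in range(len(trace) - 1):
            df.add((trace[i], trace[i + 1]))
        for i in range(len(trace) - 2):
            if trace[i] == trace[i + 2] and trace[i] != trace[i + 1]:
                two_loop.add((trace[i], trace[i + 1]))
    return df, two_loop

def causal(log):
    """Return the causal relation ->_W as a set of (a, b) pairs."""
    df, two_loop = relations(log)
    tasks = {t for trace in log.values() for t in trace}
    cause = set()
    for a in tasks:
        for b in tasks:
            plain = (a, b) in df and (b, a) not in df
            short_loop = ((a, b) in two_loop or (b, a) in two_loop) \
                and (a, a) not in df and (b, b) not in df
            self_loop = a == b and (a, a) in df
            if plain or short_loop or self_loop:
                cause.add((a, b))
    return cause

def instance_ordering(trace, cause):
    """Closest-causal ordering on the 1-based positions of one instance."""
    order = set()
    for i in range(len(trace)):
        for j in range(i + 1, len(trace)):
            if (trace[i], trace[j]) in cause and not any(
                    (trace[i], trace[k]) in cause and (trace[k], trace[j]) in cause
                    for k in range(i + 1, j)):
                order.add((i + 1, j + 1))
    return order

log = {"case1": ["S", "A", "B"], "case2": ["S", "B", "A"]}
print(causal(log))                                    # {('S', 'A'), ('S', 'B')}
print(instance_ordering(log["case1"], causal(log)))   # {(1, 2), (1, 3)}
```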

3.2 Creating Instance Graphs

In this section, we present an algorithm to generate an instance graph from an instance net. An instance graph is a graph where each node represents one log entry of a specific instance. These instance graphs can be used as a basis to generate models in a particular language.
Definition 3.3. (Instance graph) Consider a set of nodes N and a set of edges E ⊆ N × N. We call (N, E) an instance graph of an instance net (σ, ≺) if and only if the following conditions hold.
1. N = {0, 1, ..., |σ|, |σ|+1} is the set of nodes, where 0 is the source node and |σ|+1 is the sink node.
2. The set of edges E is defined as E = {(i, j) | i ≺ j} ∪ {(0, j) | there is no i with i ≺ j} ∪ {(i, |σ|+1) | there is no j with i ≺ j}, where i and j range over dom(σ).
An instance graph as described in Definition 3.3 is a graph that typically describes an execution path of some process model. This property is what makes an instance graph a good description of an instance. It not only shows causal relations between tasks but also parallelism if parallel branches are taken by the instance. However, choices are not represented in an instance graph. The reason for that is obvious, since choices are made at the execution level and do not appear in an instance. With respect to these choices, we can also say that if the same choices are made at execution, the resulting instance graph is the same. Note that the fact that the same choices are made does not imply that the process instance is the same. Tasks that can be done in parallel within one instance can appear in any order in an instance without changing the resulting instance graph. For case 1 of the example log of Table 1 the instance graph is drawn in Figure 2. Note that in this graph, the nodes 1, 2 and 3 are actually in the domain of σ1 and therefore refer to entries in Table 1. It is easily seen that for case 2 this graph looks exactly the same, although the nodes refer to different entries. In order to make use of instance graphs, we will show that an instance graph indeed describes an instance such that an entry in the log can only appear if all predecessors of that entry in the graph have already appeared in the instance.
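Continuing the previous sketch, an instance graph in this sense can be assembled by adding an artificial source node 0 and sink node |σ|+1. The source/sink convention is part of the reconstruction above, so the sketch is illustrative rather than a definitive implementation.

```python
# Sketch of Definition 3.3: build an instance graph from the instance
# ordering. Reuses causal() and instance_ordering() from the previous sketch.

def instance_graph(trace, cause):
    n = len(trace)
    order = instance_ordering(trace, cause)   # pairs of 1-based positions
    nodes = set(range(n + 2))                 # 0 = source, n + 1 = sink
    edges = set(order)
    have_pred = {j for (_, j) in order}
    have_succ = {i for (i, _) in order}
    for k in range(1, n + 1):
        if k not in have_pred:
            edges.add((0, k))                 # entry without causal predecessor
        if k not in have_succ:
            edges.add((k, n + 1))             # entry without causal successor
    return nodes, edges

log = {"case1": ["S", "A", "B"], "case2": ["S", "B", "A"]}
nodes, edges = instance_graph(log["case1"], causal(log))
print(sorted(edges))   # [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)]
```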

Fig. 2. Instance graph for σ1.

Definition 3.4. (Pre- and postset) Let (N, E) be an instance graph and let n ∈ N. We define •n = {m ∈ N | (m, n) ∈ E} to be the preset of n, and n• = {m ∈ N | (n, m) ∈ E} to be the postset of n.
Property 3.5. (Instance graphs describe an instance) Every instance graph (N, E) of a process instance σ describes that instance in such a way that for all j ∈ dom(σ), every i ∈ •j with i ∈ dom(σ) satisfies i < j. This ensures that every entry in the process instance occurs only after all its predecessors in the instance graph have occurred in σ.
Proof. To prove that this is indeed the case for an instance graph (N, E), we consider Definition 3.3, which implies that for “internal nodes” i, j ∈ dom(σ) we have (i, j) ∈ E if and only if i ≺ j. Furthermore, from the definition of ≺ we know that i ≺ j implies i < j. For the source and sink nodes, it is also easy to show that (i, j) ∈ E implies i < j, because 0 is the smallest element of N while |σ| + 1 is the largest.

Property 3.6. (Strong connectedness) For every instance graph (N, E) of a process instance σ it holds that the short-circuited graph, obtained by adding the edge (|σ| + 1, 0) to E, is strongly connected. (A graph is strongly connected if there is a directed path from any node to any other node in the graph.)

Proof. From Definition 3.3 we know that for every node j for which there is no i with i ≺ j, we have (0, j) ∈ E. Furthermore, for every node i for which there is no j with i ≺ j, we have (i, |σ| + 1) ∈ E. Therefore, the graph is strongly connected if the edge (|σ| + 1, 0) is added to E. In the remainder of this paper, we will focus on an application of instance graphs. In Section 4 a translation from these instance graphs to a specific model is given.

4 Instance EPCs

In Section 3 instance graphs were introduced. In this section, we will present an algorithm to generate instance EPCs from these graphs. An instance EPC is a special case of an EPC (Event-driven Process Chain, [13]). For more information on EPCs we refer to [13,14,19]. These instance EPCs (or i-EPCs) can only contain AND-split and AND-join connectors, and therefore do not allow for loops to be present. These i-EPCs serve as a basis for the tool ARIS PPM (Process Performance Monitor) described in the introduction. In this section, we first provide a formal definition of an instance EPC. An instance EPC does not contain any connectors other than AND-split and AND-join connectors. Furthermore, there is exactly one initial event and one final event. Functions refer to the entries that appear in a process log; events, however, do not appear in the log. Therefore, we make the assumption here that each

event uniquely causes a function to happen and that functions result in one or more events. An exception to this assumption is made when there are multiple functions that are the start of the instance. These functions are all preceded by an AND-split connector. This connector is preceded by the initial event. Consequently, all other connectors are preceded by functions and succeeded by events.
Definition 4.1. (Instance EPC) Consider a set of events E, a set of functions F, a set of connectors C and a set of arcs A. We call (E, F, C, A) an instance EPC if and only if the following conditions hold.
1. The sets E, F and C are pairwise disjoint.
2. Functions and events alternate in the presence of connectors.
3. The graph induced by A is acyclic.
4. There exists exactly one event without incoming arcs. We call it the initial event.
5. There exists exactly one event without outgoing arcs. We call it the final event.
6. The short-circuited graph is strongly connected.
7. Each function has exactly one input and one output.
8. Each event has exactly one input and one output, except for the initial and the final event. For them the following holds: the initial event has exactly one output and no input, and the final event has exactly one input and no output.

4.1 Generating Instance EPCs

Using the formal definition of an instance EPC from Definition 4.1, we introduce an algorithm that produces an instance EPC from an instance graph as defined in Definition 3.3. In the instance EPC generated, it makes sense to label the functions according to the combination of the task name and event type as they appear in the log. The labels of the events, however, cannot be determined from the log. Therefore, we propose to label the events in the following way. The initial event will be labeled “initial”. The final event will be labeled “final”. All other events will be labeled in such a way that it is clear which function succeeds them. Connectors are labeled in such a way that it is clear whether they are split or join connectors and to which function or event they connect with their input or output respectively.
Definition 4.2. (Converting instance graphs to EPCs) Let W be a process log and let G be an instance graph for some process instance σ ∈ W.

To create an instance EPC, we need to define the four sets E, F, C and A. The set of functions F is defined such that for every entry in the process instance a function is created. The set of events E is defined such that for every function there is an event preceding it, unless the function is a minimal element of the instance graph; furthermore, there is an initial event and a final event. The set of connectors C is constructed in such a way that connectors are always preceded by a function, except in the case where the process starts with parallel functions, since then the initial event is succeeded by a split connector. The set of arcs A connects these elements accordingly.

It is easily seen that the result generated by Definition 4.2 is indeed an instance EPC, by verifying it against Definition 4.1. In Definitions 3.3 and 4.2 we have given an algorithm to generate an instance EPC for each instance graph. The result of this algorithm for both cases in the example of Table 1 can be found in Figure 3. In Section 5 we will show the practical use of this algorithm in combination with ARIS PPM.
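A simplified rendering of this conversion is sketched below, reusing the instance graph from the earlier sketches. The element names and the exact placement of events and connectors are illustrative; the sketch only aims to reproduce the alternation and AND-split/AND-join conditions of Definition 4.1 for simple cases.

```python
# Simplified sketch of Definition 4.2: derive an instance EPC from an
# instance graph. Only AND connectors are created, as in instance EPCs.

def instance_epc(trace, nodes, edges):
    """Return (elements, arcs) of a simplified instance EPC."""
    n = len(trace)
    fun = {k: f"F{k}:{trace[k - 1]}" for k in range(1, n + 1)}
    succ = {v: [w for (x, w) in edges if x == v] for v in nodes}
    pred = {v: [x for (x, w) in edges if w == v] for v in nodes}

    ev = {0: "E:initial", n + 1: "E:final"}
    for k in fun:                              # functions not directly following
        if pred[k] != [0]:                     # the initial event get a pre-event
            ev[k] = f"E{k}:pre_{trace[k - 1]}"

    emits = lambda v: fun.get(v, ev.get(v))    # last EPC element of node v
    front = lambda v: ev.get(v, fun.get(v))    # first EPC element of node v

    arcs, connectors = set(), set()
    out_point, in_point = {}, {}
    for v in nodes:
        if len(succ[v]) > 1:                   # AND-split after this element
            split = f"AND-split@{emits(v)}"
            connectors.add(split)
            arcs.add((emits(v), split))
            out_point[v] = split
        else:
            out_point[v] = emits(v)
        if len(pred[v]) > 1:                   # AND-join in front of this node
            join = f"AND-join@{front(v)}"
            connectors.add(join)
            arcs.add((join, front(v)))
            in_point[v] = join
        else:
            in_point[v] = front(v)

    for k in fun:
        if k in ev:                            # event -> the function it precedes
            arcs.add((ev[k], fun[k]))
    for (v, w) in edges:                       # wire the instance-graph edges
        arcs.add((out_point[v], in_point[w]))

    elements = set(fun.values()) | set(ev.values()) | connectors
    return elements, arcs

elements, arcs = instance_epc(log["case1"], nodes, edges)   # from earlier sketches
```

Applied to case 1 of Table 1, this yields the expected shape: initial event, function S, an AND-split, the two parallel branches for A and B (each with its own preceding event), an AND-join, and the final event.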

Fig. 3. Instance EPC for σ1 and σ2.

5 Example

In this section, we present an example illustrating the algorithms described in sections 3 and 4. We will start from a process log with some process instances. Then, we will run the algorithms to generate a set of instance EPCs that can be imported into ARIS PPM.

5.1 A Process Log

Consider a process log consisting of the following traces.

The process log in Table 2 shows the execution of tasks for a number of different instances of the same process. To save space, we abstracted from the original names of tasks and named each task with a single letter. The subscript refers to the position of that task in the process instance. Using this process log, we first generate the causal relations of Definition 3.1. Note that causal relations are defined between tasks and not between log entries; therefore, the subscripts are omitted here. This leads to a set of causal relations. Using these relations, we generate instance graphs as described in Section 3 for each process instance. Then, these instance graphs are imported into ARIS PPM and a screenshot of this tool is presented (cf. Figure 5).

5.2 Instance Graphs

To illustrate the concept of instance graphs, we present the instance graph for the first instance, “case 1”. In order to do this, we follow Definition 3.2 to generate an instance ordering for that instance. Then, using this ordering, an instance graph is generated. Applying Definition 3.2 to case 1 in the log presented in Table 2, using the causal relations given in Section 5.1, gives the

instance ordering for that case. Using this instance ordering, an instance graph can be made as described in Definition 3.3. The resulting graph can be found in Figure 4. Note that the instance graphs of all other instances are isomorphic to this graph; only the numbers of the nodes change.

Fig. 4. Instance graph for case 1.

For each process instance, such an instance graph can be made. Using the algorithm presented in Section 4, each instance can then be converted into an instance EPC. These instance EPCs can be imported directly into ARIS PPM for further analysis. Here, we would like to point out again that our tools currently provide an implementation of the algorithms in this paper, such that the instance EPCs generated can be imported into ARIS PPM directly. A screenshot of this tool can be found in Figure 5, where “case 1” is shown as an instance EPC. Furthermore, inside the boxed area, the aggregation of some cases is shown. Note that this aggregation is only part of the functionality of ARIS PPM. Using graphical representations of instances, a large number of analysis techniques are available to the user. However, creating instances without knowing the original process model is an important first step.

6 Related Work

The idea of process mining is not new [1, 3, 5–7, 11, 12, 15, 16, 18, 21] and most techniques aim at the control-flow perspective. For example, the α-algorithm allows for the construction of a Petri net from an event log [1,5]. However, process mining is not limited to the control-flow perspective. For example, in [2] we use process mining techniques to construct a social network. For more information on process mining we refer to a special issue of Computers in Industry on process mining [4] and a survey paper [3]. In this paper, unfortunately, it is impossible to do justice to all the work done in this area. To support our mining efforts we have developed a set of tools including EMiT [1], Little Thumb [21], and MinSoN [2]. These tools share the XML format discussed in this paper. For more details we refer to www.processmining.org. The focus of this paper is on the mining of the control-flow perspective. However, instead of constructing a process model, we mine for instance graphs.

Fig. 5. ARIS PPM screenshot.

The result can be represented in terms of a Petri net or an (instance) EPC. Therefore, our work is related to tools like ARIS PPM [12], Staffware SPM [20], and VIPTool [10]. Moreover, the mining result can be used as a basis for applying the theoretical results regarding partially ordered runs [8].

7 Conclusion

The focus of this paper has been on mining for instance graphs. Algorithms were presented to describe each process instance in a particular modelling language. From the instance graphs described in Section 3, other models can be created as well. The main advantage of looking at instances in isolation is twofold. First, it can provide a good starting point for all kinds of analysis, such as the ones implemented in ARIS PPM. Second, it does not require any notion of completeness of a process log to work. As long as a causal relation is provided between log entries, instance graphs can be made. Existing methods such as the α-algorithm [1,3,5] usually require some notion of completeness in order to rediscover the entire process model. The downside thereof is that it is often hard to deal with noisy process logs. In our approach noise can be filtered out before inferring the causal dependencies between log entries, without negative implications for the result of the mining process. ARIS PPM allows for the aggregation of instance EPCs into an aggregated EPC. This approach illustrates the wide applicability of instance graphs. However, the aggregation is based on simple heuristics that fail in the presence of

complex routing structures. Therefore, we are developing algorithms for the integration of multiple instance graphs into one EPC or Petri net. Early experiments suggest that such a two-step approach alleviates some of the problems existing process mining algorithms are facing [3,4].

References 1. W.M.P. van der Aalst and B.F. van Dongen. Discovering Workflow Performance Models from Timed Logs. In Y. Han, S. Tai, and D. Wikarski, editors, International Conference on Engineering and Deployment of Cooperative Information Systems (EDCIS 2002), volume 2480 of Lecture Notes in Computer Science, pages 45–63. Springer-Verlag, Berlin, 2002. 2. W.M.P. van der Aalst and M. Song. Mining Social Networks: Uncovering interaction patterns in business processes. In M. Weske, B. Pernici, and J. Desel, editors, International Conference on Business Process Management, volume 3080 of Lecture Notes in Computer Science, pages 244–260. Springer-Verlag, Berlin, 2004. 3. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A.J.M.M. Weijters. Workflow Mining: A Survey of Issues and Approaches. Data and Knowledge Engineering, 47(2):237–267, 2003. 4. W.M.P. van der Aalst and A.J.M.M. Weijters, editors. Process Mining, Special Issue of Computers in Industry, Volume 53, Number 3. Elsevier Science Publishers, Amsterdam, 2004. 5. W.M.P. van der Aalst, A.J.M.M. Weijters, and L. Maruster. Workflow Mining: Discovering Process Models from Event Logs. QUT Technical report, FIT-TR-2003-03, Queensland University of Technology, Brisbane, 2003. (Accepted for publication in IEEE Transactions on Knowledge and Data Engineering.). 6. R. Agrawal, D. Gunopulos, and F. Leymann. Mining Process Models from Workflow Logs. In Sixth International Conference on Extending Database Technology, pages 469–483, 1998. 7. J.E. Cook and A.L. Wolf. Discovering Models of Software Processes from EventBased Data. ACM Transactions on Software Engineering and Methodology, 7(3):215–249, 1998. 8. J. Desel. Validation of Process Models by Construction of Process Nets. In W.M.P. van der Aalst, J. Desel, and A. Oberweis, editors, Business Process Management: Models, Techniques, and Empirical Studies, volume 1806 of Lecture Notes in Computer Science, pages 110–128. Springer-Verlag, Berlin, 2000. 9. J. Desel and J. Esparza. Free Choice Petri Nets, volume 40 of Cambridge Tracts in Theoretical Computer Science. Cambridge University Press, 1995. 10. J. Desel, G. Juhas, R. Lorenz, and C. Neumair. Modelling and Validation with VipTool. In W.M.P. van der Aalst, A.H.M. ter Hofstede, and M. Weske, editors, International Conference on Business Process Management (BPM 2003), volume 2678 of Lecture Notes in Computer Science, pages 380–389. Springer-Verlag, 2003. 11. J. Herbst. A Machine Learning Approach to Workflow Management. In Proceedings 11th European Conference on Machine Learning, volume 1810 of Lecture Notes in Computer Science, pages 183–194. Springer-Verlag, Berlin, 2000. 12. IDS Scheer. ARIS Process Performance Manager (ARIS PPM), http://www.idsscheer.com, 2002.

13. G. Keller, M. Nüttgens, and A.W. Scheer. Semantische Processmodellierung auf der Grundlage Ereignisgesteuerter Processketten (EPK). Veröffentlichungen des Instituts für Wirtschaftsinformatik, Heft 89 (in German), University of Saarland, Saarbrücken, 1992. 14. G. Keller and T. Teufel. SAP R/3 Process Oriented Implementation. Addison-Wesley, Reading MA, 1998. 15. A.K.A. de Medeiros, W.M.P. van der Aalst, and A.J.M.M. Weijters. Workflow Mining: Current Status and Future Directions. In R. Meersman, Z. Tari, and B.C. Schmidt, editors, On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE, volume 2888 of Lecture Notes in Computer Science, pages 389–406. Springer-Verlag, Berlin, 2003. 16. M. zur Mühlen and M. Rosemann. Workflow-based Process Monitoring and Controlling – Technical and Organizational Issues. In R. Sprague, editor, Proceedings of the 33rd Hawaii International Conference on System Science (HICSS-33), pages 1–10. IEEE Computer Society Press, Los Alamitos, California, 2000. 17. T. Murata. Petri Nets: Properties, Analysis and Applications. Proceedings of the IEEE, 77(4):541–580, April 1989. 18. M. Sayal, F. Casati, U. Dayal, and M.C. Shan. Business Process Cockpit. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB’02), pages 880–883. Morgan Kaufmann, 2002. 19. A.W. Scheer. Business Process Engineering, Reference Models for Industrial Enterprises. Springer-Verlag, Berlin, 1994. 20. Staffware. Staffware Process Monitor (SPM). http://www.staffware.com, 2002. 21. A.J.M.M. Weijters and W.M.P. van der Aalst. Rediscovering Workflow Models from Event-Based Data using Little Thumb. Integrated Computer-Aided Engineering, 10(2):151–162, 2003.

A New XML Clustering for Structural Retrieval* Jeong Hee Hwang and Keun Ho Ryu Database Laboratory, Chungbuk National University, Korea {jhhwang,khryu}@dblab.chungbuk.ac.kr

Abstract. XML becomes increasingly important in data exchange and information management. The starting point for retrieving information and integrating documents efficiently is to cluster the documents that have similar structure. Thus, in this paper, we propose a new XML document clustering method based on structural similarity. Our approach first extracts the representative structures of XML documents by sequential pattern mining. We then cluster XML documents of similar structure using a clustering algorithm for transactional data, treating an XML document as a transaction and the frequent structures of the document as the items of the transaction. We also apply our technique to XML retrieval. Our experiments show the efficiency and good performance of the proposed clustering method. Keywords: Document Clustering, XML Document, Sequential Pattern, Structural Similarity, Structural Retrieval

1 Introduction

XML (eXtensible Markup Language) is a standard for data representation and exchange on the Web, and we will find large XML document collections on the Web in the near future. Therefore, it has become crucial to address the question of how we can efficiently query and search XML documents. Meanwhile, the hierarchical structure of XML has a great influence on information retrieval, document management systems, and data mining [1,2,3,4]. Since an XML document is represented as a tree structure, one can explore the relationships among XML documents using various tree matching algorithms [5,6]. A closely related problem is to find trees in a database that “match” a given pattern or query tree [7]. This type of retrieval often exploits various filters that eliminate unqualified data trees from consideration at an early stage of retrieval. The filters accelerate the retrieval process. Another approach to facilitating a search is to cluster XML documents into appropriate categories. We propose a new XML clustering technique based on structural similarity in this paper. We first extract the representative structures of frequent patterns, including hierarchical structure information, from XML documents by the sequential pattern mining method [8]. We then perform the document clustering by considering both the CLOPE algorithm [9] and large items [10], treating an XML document as a transaction and the frequent structures extracted from the document as the items of the transaction. We also apply our method to structural retrieval of XML documents in order to verify the efficiency of the proposed technique.

* This work was supported by the University IT Research Center Project and ETRI in Korea.

The remainder of the paper is organized as follows. Section 2 reviews previous research related to the structure of XML documents. Section 3 describes the method for extracting the representative structures of XML documents. In Section 4, we define our clustering criterion using large items and describe how clusters are updated, and Section 5 explains how to apply our clustering method to XML retrieval. Section 6 shows the experimental results of the clustering algorithm and of XML retrieval, and Section 7 concludes the paper.

2 Related Works

Recently, as XML documents with various structures are increasing, it has become necessary to study methods that classify documents with similar structure and retrieve them [3,4]. [11] considered XML as a tree and analyzed the similarity among documents by taking semantics into account. [12] noted the necessity of managing the increasing number of XML documents and proposed a clustering method for element tags and the text of XML documents using the k-means algorithm. In [3,4,13], two kinds of structure mining techniques for extracting the XML document structure are mentioned: intra-structure mining for one document and inter-structure mining for various documents; however, no concrete algorithm is described. [14] proposed a clustering method for DTDs based on the similarity of elements, as a way to find a mediating DTD for integrating DTDs, but it can only be applied to DTDs from the same application domain. [15] concentrated on finding the common structure of trees, but did not consider document clustering. [16] grouped trees by pairs of labels that occur frequently together and then finds a subset of the frequent trees; however, multi-relational tree structures cannot be detected, because the approach is based on label pairs. [17] proposed a method for clustering XML documents using bitmap indexing, but it requires too much space for a large number of documents. In this paper, we use the CLOPE algorithm [9], extended with the notion of large items, for document clustering. The CLOPE algorithm uses only the rate of the common items, without considering the individual items in a cluster. Therefore, it has the problems that the similarity between clusters may become high and that the number of clusters cannot be controlled. In order to address these problems, we add the notion of large items for a cluster to the CLOPE algorithm.

3 Extracting the Representative Structure of XML Documents

An XML document has a sequential and hierarchical structure of elements. Therefore, the order of the elements and the elements themselves are features that can distinguish XML documents [11,13]. Thus, we use sequential pattern mining, which considers both the frequency and the order of elements.

3.1 Element Path Sequences

We first extract representative structures of each document based on the paths from the root to the elements that have content values. Figure 1 is an example XML document used to show how to find the representative structures of a document.

Fig. 1. An XML document

Fig. 2. Element mapping table

We rename each element with a letter of the alphabet to distinguish elements easily, using the element mapping table shown in Figure 2. Based on the renamed elements of Figure 2, the element paths having content values are represented as in Figure 3, in which the element paths are regarded as sequences and the elements contained in a sequence are considered to be its items. We then find the frequent sequence structures that satisfy the given minimum support using a sequential pattern mining algorithm.
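The extraction of element path sequences can be sketched as follows. The sample document and the single-letter renaming table are invented stand-ins for Figures 1 and 2, and repeated tags are not disambiguated (the paper distinguishes, e.g., c2) in order to keep the sketch short.

```python
# Sketch of Section 3.1: root-to-leaf element paths for the elements
# that carry content values.

import xml.etree.ElementTree as ET

SAMPLE = """
<book>
  <title>Conceptual Modeling</title>
  <author><name>Lee</name></author>
  <chapter><title>Intro</title></chapter>
</book>
"""

RENAME = {"book": "a", "title": "b", "author": "c", "name": "d", "chapter": "e"}

def path_sequences(xml_text):
    """Return one element-path sequence per element with text content."""
    root = ET.fromstring(xml_text)
    sequences = []

    def walk(node, path):
        path = path + [RENAME.get(node.tag, node.tag)]
        if node.text and node.text.strip():          # element with a content value
            sequences.append(path)
        for child in node:
            walk(child, path)

    walk(root, [])
    return sequences

print(path_sequences(SAMPLE))
# [['a', 'b'], ['a', 'c', 'd'], ['a', 'e', 'b']]
```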

3.2 The Sequential Pattern Algorithm to Extract the Frequent Structure

To extract the frequent structures, we apply the PrefixSpan algorithm [8] to the sequences of Figure 3. To do this, we define the frequent structure minimum support as follows.

Fig. 3. Element path sequences

Definition 1 (Frequent Structure Minimum Support). The frequent structure minimum support is the least frequency that satisfies the given frequent structure rate with respect to the whole set of paths in a document, and the path sequences that satisfy this condition are the frequent structures. The formula is: FFMS = frequent structure rate × the number of paths of the whole document. If the frequent structure rate is 0.2, the FFMS of the sequence set of Figure 3 is 2, and the length-1 elements satisfying the FFMS, with their frequencies, are a: 6, b: 6, c2: 4. Starting from these length-1 sequential patterns, we extract the frequent pattern structures using the projected DB (refer to [8] for the detailed algorithm). According to this method, the maximal frequent structure in Figure 3 occurs at a rate of about 66% (4/6) with respect to the whole document. We also include structures whose length is above a given rate of the maximal frequent structure length (e.g., the most frequent length 5 × 80% = frequent structure length 4) in the representative structures used as the input data for clustering. The reason is that this avoids missing frequent structures in case there are various subjects in a document.
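A small sketch of the FFMS computation and the length-1 filtering step is given below. The path sequences are invented so that the counts match the numbers quoted above (a: 6, b: 6, c2: 4 with FFMS = 2 at rate 0.2); the full PrefixSpan projection step is omitted, and the rounding of FFMS is an assumption.

```python
# Sketch of Definition 1: the FFMS threshold and the length-1 frequent items.
import math
from collections import Counter

def ffms(sequences, rate):
    """FFMS = frequent structure rate x number of paths (rounded up here)."""
    return math.ceil(rate * len(sequences))

def frequent_length1(sequences, rate):
    threshold = ffms(sequences, rate)
    # count each item once per path sequence in which it appears
    counts = Counter(item for seq in sequences for item in set(seq))
    return {item: c for item, c in counts.items() if c >= threshold}

paths = [["a", "b"], ["a", "b", "c2"], ["a", "b", "c2"],
         ["a", "b", "c2"], ["a", "b"], ["a", "b", "c2"]]
print(ffms(paths, 0.2))               # 2
print(frequent_length1(paths, 0.2))   # counts a: 6, b: 6, c2: 4 (order may vary)
```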

4 Large Item Based Document Cluster

The frequent structures of each XML document are the basic data for clustering. We treat each XML document as a transaction and the frequent structures extracted from it as the items of the transaction, and then perform the document clustering using the notion of large items.

4.1 A New Clustering Criterion

The set of items contained in all the transactions is defined as the item set, the set of clusters as the cluster set, and the set of transactions representing the documents as the transaction set. As a criterion for allocating a transaction to the appropriate cluster, we define the cluster allocation gain.

Definition 2 (Cluster Allocation Gain). The cluster allocation gain is the sum, over all clusters, of the ratio of the total occurrences of items to the number of individual items. The following equation expresses this.

where G is the ratio H of total occurrences to individual items W in a cluster, with H = T (the total occurrences of the individual items) / W (the number of individual items). Gain is the criterion function for cluster allocation of a transaction: the higher the rate of common items, the larger the cluster allocation gain. Therefore we allocate a transaction to the cluster that yields the largest Gain. However, if we use only the rate of the common items without considering the individual items, as CLOPE does, problems arise, as the following example shows.

Example 1. Assume that transaction t4 = {f, c} is to be inserted, given clusters C1 = {a:3, b:3, c:1} and C2 = {d:3, e:1, c:3}, each containing three transactions. The Gain obtained by allocating t4 to C1 or C2 is smaller than the Gain obtained by allocating it to a new cluster; thus t4 is allocated to a new cluster by Definition 2. As this example shows, a considerably higher allocation gain can be obtained for a new cluster, because the Gain contributed by a new cluster is determined by the new transaction alone. Due to this, many clusters beyond the regular size are produced, which may reduce cluster cohesion. In order to address this problem, we define the large items and the cluster participation as follows.

Definition 3 (Large Items). The item support of cluster Ci is defined as the number of transactions in Ci that include the item

and its closing tag is placed after the tag. Based on the IR annotations, the intrasite search engine can improve the quality and accuracy of query results. Only pages which have IRDisplayContentType equal to “content” are indexed. Pages defined as “entry” represent those pages which are entry points to the Web site. Notice that the distinction between content and entry pages is necessary because user queries can also be classified into two categories [22,29]: (1) bookmark queries, which ask for the entry page of a specific site portion, for example, searching for the entry page of the economy section of a newspaper Web site; and (2) content queries, which are the most common sort and denote user queries that result in single content pages, for example, finding a page that describes how the stock market operates.

Pages with IRDisplayContentType “irrelevant” are automatically discarded. Annotating irrelevant pages is important as it makes the system index only pages with relevant content, making the resulting search engine more efficient and accurate. Each piece of information has a level of significance defined by the value given to the attribute IRSignificance. This feature specifies the importance of a particular piece of information with respect to the page where it is placed. This means that the same piece of information might have a different level of significance when placed in another page. In the indexing process the system stores each piece of information, its location and its IRSignificance value. In addition, the number of occurrences of each term present in the piece of information is also stored. This information is used by our information retrieval model to compute the ranking of documents for each user query submitted to the intrasite search engine. The information retrieval model adopted here is an extension of the well-known Vector Space Model [27]. This model is based on identifying how related each term (word) is to each document (page), which is expressed as a weight function w(t, d). Queries are modelled in the same way, and the function w is used to represent each element as a vector in a space determined by the set of all distinct terms. The ranking in this model is computed by the score function below for each document d in the collection and a given query q:

score(d, q) = (d · q) / (|d| |q|),   (1)

which is the cosine between the vectors d and q and expresses how similar document d is to query q. Documents with similarity higher than zero are presented to the users in descending order. The function w in Equation 1 gives a measure of how related term t and document d are. This value is usually computed as w(t, d) = tf(t, d) × idf(t), where idf(t) is the inverse document frequency and measures the importance of term t for the whole set of documents, while tf(t, d) expresses the importance of term t for document d. The idf value is usually computed as

idf(t) = log(#docs / n_t),

where #docs is the number of documents (pages) in the collection and n_t is the number of documents where term t occurs. The tf value can be computed in several ways; however, it is always a function of the term frequency in the document. Common formulae directly use the number of occurrences of t in d [16]. We here propose the use of information provided by the Web site modelling to define the function tf based not only on the term frequency, but also on the IRSignificance described during the Web site

modelling. Given a Web page p composed of k different pieces of information f_1, ..., f_k, we define

tf(t, p) = Σ_{i=1..k} occ(t, f_i) × IRSignificance(f_i),

where occ(t, f_i) gives the number of occurrences of term t in the piece of information f_i, and IRSignificance(f_i) assigns the value 0, 1, 2 or 3, corresponding to irrelevant, low, medium or high respectively, to the piece of information f_i, as derived from the Web site modelling. By using this equation, the system assigns to each piece of information a precise importance value, allowing it to rank the pages according to how the terms used in a query match the most significant pieces of information.
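The weighted tf and the resulting ranking can be sketched as follows. The piece-of-information representation, the label-to-weight table and the toy pages are assumptions made for the example; idf is the usual log(#docs / n_t), and query terms are left unweighted for brevity.

```python
# Sketch of the IRSignificance-weighted tf and a cosine-style ranking.
import math
from collections import defaultdict

SIGNIFICANCE = {"irrelevant": 0, "low": 1, "medium": 2, "high": 3}

def tf(term, pieces):
    """pieces: list of (text, significance_label) making up one page."""
    return sum(text.lower().split().count(term) * SIGNIFICANCE[label]
               for text, label in pieces)

def build_index(pages):
    """pages: dict url -> list of (text, significance_label)."""
    vocab = {t for ps in pages.values() for text, _ in ps for t in text.lower().split()}
    df = defaultdict(int)
    for t in vocab:
        df[t] = sum(1 for ps in pages.values() if tf(t, ps) > 0)
    n = len(pages)
    # terms occurring in every page get idf 0 and are simply dropped here
    return {url: {t: tf(t, ps) * math.log(n / df[t])
                  for t in vocab if tf(t, ps) > 0 and df[t] < n}
            for url, ps in pages.items()}

def rank(query, weights):
    q = query.lower().split()
    scores = {}
    for url, w in weights.items():
        norm = math.sqrt(sum(v * v for v in w.values())) or 1.0
        scores[url] = sum(w.get(t, 0.0) for t in q) / norm
    return sorted(scores.items(), key=lambda kv: -kv[1])

pages = {
    "/news/1": [("Stock market rises again", "high"), ("related links", "irrelevant")],
    "/news/2": [("Election results", "high"), ("market column", "low")],
}
print(rank("stock market", build_index(pages)))   # '/news/1' ranks first
```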

4.1 Generating a Web Site and Its Associated Intrasite Search Engine

A high-level specification of an application must be provided as the starting point of the development process. Existing conceptual models can be used for this task, as discussed earlier. Since issues related to mapping the constructs of a data model to our intermediate representation language are not in the scope of this paper, we assume that this task has already been carried out. Details of mapping procedures from an ER schema to our intermediate representation can be found in [4]. Once an intermediate representation of the Web site application is provided, the next step is the generation of the pages and the search engine. The steps to perform a complete generation are:
1. Instantiation of pieces of information, which usually involves access to databases.
2. Creation of pages. For each display unit a number of corresponding pages are created, depending on the number of instances of its pieces of information.
3. Instantiation of links.
4. Translation of the whole intermediate representation to a target language, such as HTML.
5. Application of visualization styles to all pieces of information and pages, based on style and page template definitions.
6. Creation of additional style sheets, as CSS specifications.
7. Creation of the intrasite search engine.

Visualization is described by styles of individual pieces of information, page styles (stylesheets) and page templates. A suitable interface should be offered to the designer in order to input all the necessary definitions. Currently, a standard CSS stylesheet is automatically generated, including definitions provided by the designer. The reason for making use of stylesheets is to keep the representation of our visualization styles simple. Without a stylesheet, all visual details would have to be included as arguments to the mapping procedure which translates a visualization style to HTML.

As for the creation of the intrasite search engine, its code is automatically incorporated as part of the resulting Web site. Furthermore its index is generated along with the Web site pages, according to the IRDisplayContentType and IRSignificance specifications.

5 Experiments

In this section we present experiments to evaluate the impact of our new integrated strategy for designing Web sites and intrasite search engines. For these experiments we constructed two intrasite search systems for the Brazilian Web portal “ultimosegundo”, indexing 12,903 news Web pages. The first system was constructed without considering information provided by the application data model and was implemented using the traditional vector space model [27]. The second system was constructed using our IR-aware data model described in Section 4. To construct the second system we first modelled the Web site using our IR-aware methodology, generating a new version where the IRDisplayContentType of each page and the IRSignificance of the semantic pieces of information that compose each page are available. Figure 6 illustrates a small portion of the intermediate representation of the Web site modelled using our modelling language. The structure and content of this new site are equal to those of the original version, preserving all pages with the same content. The first side effect of our methodology is that only pages with useful content are indexed. In the example only pages derived from the display unit NewsPage

Fig. 6. Example of a Partial Intermediate Representation of a Web Site


In the example, only pages derived from the display unit NewsPage are indexed. Furthermore, pieces of information that do not represent useful content are also excluded from the search system. For instance, each news Web page in the site has links to related news (OtherNews); these links are considered non-relevant pieces of information because they are included in the page as a navigation facility, not as content. As a result, the final index size was only 43% of the index file created for the original site, which means our intrasite search version uses less storage space and is faster when processing user queries. The experiments evaluating the quality of results were performed using a set of 50 queries extracted from a query log of news Web sites. The queries were randomly selected from the log and have an average length of 1.5 terms, as the majority of queries are composed of one or two terms. In order to evaluate the results, we used a precision-recall curve, which is the most widely applied method for evaluating information retrieval systems [1]. The precision at any point of this curve is computed using the set of relevant answers for each query (N) and the set of answers given by each system for this query (R). The formulae for computing precision and recall are described in Equations 4. For further details about the precision-recall curve the interested reader is referred to [1, 32].
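Equations 4 are not reproduced here; the sketch below uses the standard definitions that the surrounding text describes (precision = |relevant ∩ retrieved| / |retrieved|, recall = |relevant ∩ retrieved| / |relevant|) and a conventional interpolation rule for reading off the 11-point curve of [1]. It is an illustration under these assumptions, not the authors' evaluation code.

```python
def precision_recall_points(ranked_answers, relevant):
    """Precision and recall after each position of a ranked answer list.
    `relevant` is the human-judged set N; `ranked_answers` is one system's
    ranked list R for a single query."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, doc in enumerate(ranked_answers, start=1):
        if doc in relevant:
            hits += 1
        points.append((hits / k, hits / len(relevant)))  # (precision, recall)
    return points

def interpolated_precision(points, recall_level):
    """Standard interpolation: highest precision at any recall level
    greater than or equal to the requested one."""
    return max((p for p, r in points if r >= recall_level), default=0.0)
```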

To obtain the precision-recall curve we need human judgment for determining the set of relevant answers for each query evaluated. This set was determined here using the pooling method adopted for the Web-based collections of TREC [19]. This method consists of retrieving a fixed number of top answers from each of the evaluated systems and then forming a pool of answers which is used for determining the set of relevant documents. Each answer in the pool is analyzed by humans and classified as relevant or non-relevant for the given user query. After analyzing the answers in the pool, we use the relevant answers identified by humans as the set N in Equations 4. For each of the 50 queries in our experiments, we composed a query pool formed by the top 50 documents returned by each of the 2 intrasite search systems evaluated. The query pools contained an average of 62.2 pages (some queries had fewer than 50 documents in the answer). All documents in each query pool were submitted to a manual evaluation. The average number of relevant pages per query pool is 28.5. Figure 7 shows the precision-recall curves obtained in our experiment for both systems. Our modelling-aware intrasite search is labelled in the figure as “Modelling-aware”, while the original vector space model is labelled as “Conventional”. The figure shows that the quality of the ranking produced by our system was superior at all recall points. The precision at the first points of the curve was roughly 96% for our system, against 86.5% for the conventional system, an improvement of almost 11% in precision. For higher levels of recall the difference becomes even higher, being roughly 20% at 50% recall and 50% at 100% recall. This last result indicates that our system found on average 50% more relevant documents in this experiment. The average precision over the 11 points was 56% for the


Fig. 7. Comparison of average precision versus recall curves obtained when processing the 50 queries using the original vector space model and the IR-aware model

conventional system and 84% for the modelling-aware system, which represents an improvement of 48%. Another important observation about the experiment is that our system returned on average only 209.8 documents per query (from which we selected the top 50 for evaluation), while the original system returned 957.66 results on average. This difference is again due to the elimination of non-relevant information from the index. To give an example, the original system returned almost all pages as a result for the query “september 11th”, while our system returns fewer than 300 documents. This difference happened because almost all pages in the site had a footnote text linking to a special section about this topic in the site.

6 Conclusions

We have presented a new modelling technique for Web site design that transfers information about the model to the generated Web pages. We also presented a new intrasite search model that uses this information to improve the quality of the results presented to users and to reduce the size of the indexes generated for processing queries. In our experiments we presented one particular example of application of our method that illustrates its viability and effectiveness. The gains obtained in precision and storage space reduction may vary for different Web sites. However, this example gives a good indication that our method can be effectively deployed to solve the problem of intrasite and intranet search. For the site modelled we obtained an improvement of 48% in the average precision and, at the same time, a reduction in the index size, which occupies only 43% of the space used by the traditional implementation. This means our method produces faster and more precise intrasite search systems. As future work we are planning to study the application of our method to other Web sites in order to evaluate the gains obtained in more detail and to refine our approach. We are also studying strategies for automatically computing


the IRSignificance of pieces of information and for automatically determining the weights of each piece of information for each display unit. These automatic methods will allow the use of our approach for non-modelled Web sites, which may extend the benefits of our method to global search engines. The paradigm described here opens new possibilities for designing better intrasite search systems. Another future research direction is defining new modelling characteristics that can be useful for intrasite search systems. For instance, we are interested in finding ways of determining the semantic relations between Web pages during the modelling phase and using this information to cluster these pages in a search system. The idea is to use the cluster properties to improve the knowledge about the semantic meaning of each Web page in the site.

Acknowledgements This paper is the result of research work done in the context of the SiteFix and Gerindo projects sponsored by the Brazilian Research Council - CNPq, grants no. 55.2197/02-5 and 55.2087/05-5. The work was also supported by an R&D grant from Philips MDS-Manaus. The second author was sponsored by the Amazonas State Research Foundation - FAPEAM. The fourth author is sponsored by the Brazilian Research Council - CNPq, grant no. 300220/2002-2.

References 1. BAEZA-YATES, R., AND RIBEIRO-NETO, B. Modern Information Retrieval, 1st ed. Addison-Wesley-Longman, 1999. 2. BRIN, S., AND PAGE, L. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference (Brisbane, Australia, April 1998), pp. 107–117. 3. CAVALCANTI, J., AND ROBERTSON, D. Synthesis of Web Sites from High Level Descriptions. In Web Engineering: Managing Diversity and Complexity in Web Application Development, S. M. . Y. Deshpande, Ed., vol. 2016 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, Germany, 2001, pp. 190–203. 4. CAVALCANTI, J., AND ROBERTSON, D. Web Site Synthesis based on Computational Logic. Knowledge and Information Systems Journal (KAIS) 5, 3 (Sept. 2003), 263– 287. 5. CAVALCANTI, J., AND VASCONCELOS, W. A Logic-Based Approach for Automatic Synthesis and Maintenance of Web Sites. In Proceedings of the 14th International Conference on Software Engineering and Knowledge Engineering - SEKE’02 (Ischia, Italy, July 2002), pp. 619–626. 6. CERI, S., FRATERNALI, P., AND BONGIO, A. Web Modeling Language (WebML): a Modeling Language for Designing Web Sites. In Proceedings of the WWW9 conference (Amsterdam, the Netherlands, May 2000), pp. 137–157. 7. CHEN, M., HEARST, M., HONG, J., AND LIN, J. Cha-cha: a system for organizing intranet search results. In Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (Boulder,USA, October 1999). 8. CHEN, P. The entity-relationship model: toward a unified view of data. ACM Transactions on Database Systems 1, 1 (1976).


9. CRASWELL, N., HAWKING, D., AND ROBERTSON, S. Effective site finding using link anchor information. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (New Orleans, USA, September 2001), pp. 250–257. 10. EIRON, N., AND MCCURLEY, K. S. Analysis of anchor text for web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, Canada, July 2003), pp. 459– 460. 11. FERNÁNDEZ, M., FLORESCU, D., KANG, J., LEVY, A., AND SUCIU, D. Catching the Boat with Strudel: Experience with a A Web-site Management System. SIGMOD Record 27, 2 (June 1998), 414–425. 12. FLORESCU, D., LEVY, A., AND MENDELZON, A. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record 27, 3 (Sept. 1998), 59–74. 13. GARZOTTO, G., PAOLINI, P., AND SCHWABE, D. HDM - A Model-Based Approach to Hypertext Application Design. TOIS 11, 1 (1993), 1–26. 14. GEVREY, J., AND RÜGER, S. M. Link-based approaches for text retrieval. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 279–285. 15. GÓMEZ, J., CACHERO, C., AND PASTOR, O. Conceptual Modeling of DeviceIndependent Web Applications. IEEE Multimedia 8, 2 (Apr. 2001), 26–39. 16. G.SALTON, AND BUCKLEY, C. Term-weighting approaches in automatic text retrieval. Information Processing & Management 5, 24 (1988), 513–523. 17. HAGEN, P., MANNING, H., AND PAUL, Y. Must Search Stink ? The Forrester Report, June 2000. 18. HAWKING, D., CRASWELL, N., AND THISTLEWAITE, P. B. Overview of TREC-7 very large collection track. In The Seventh Text REtrieval Conference (TREC-7) (Gaithersburg, Maryland, USA, November 1998), pp. 91–104. 19. HAWKING, D., CRASWELL, N., THISTLEWAITE, P. B., AND HARMAN, D. Results and challenges in web search evaluation. Computer Networks 31, 11–16 (May 1999), 1321–1330. Also in Proceedings of the 8th International World Wide Web Conference. 20. HAWKING, D., VOORHEES, E., BAILEY, P., AND CRASWELL, N. Overview of trec-8 web track. In Proc. of TREC-8 (Gaithersburg MD, November 1999), pp. 131–150. 21. JIN, Y., DECKER, S., AND WIEDERHOLD, G. OntoWebber: Model-driven ontologybased Web site management. In Proceedings of the first international semantic Web working symposium (SWWS’01) (Stanford, CA, USA, July 2001). 22. KANG, I.-H., AND KIM, G. Query type classification in web document retrieval. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto, July 2003), pp. 64–71. 23. KANUNGO, T., AND ZIEN, J. Y. Integrating link structure and content information for ranking web documents. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 237–239. 24. KLEINBERG, J. M. Authoritative sources in a hyperlinked environment. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms (San Francisco, California, USA, January 1998), pp. 668–677. 25. MAEDCHE, A., STAAB, S., STOJANOVIC, N., STUDER, R., AND SURE, Y. SEAL - A framework for developing semantic Web portals. In Proceedings of the 18th British national conference on databases (BNCOD 2001) (Oxford, England, UK, July 2001).


26. MECCA, G., ATZENI, P., MASCI, A., MERIALDO, P., AND SINDONI, G. The ARANEUS Web-Base Management System. SIGMOD Record (ACM Special Interest Group on Management of Data) 27, 2 (1998), 544. 27. SALTON, G., AND MCGILL, M. J. Introduction to Modern Information Retrieval, 1st ed. McGraw-Hill, 1983. 28. SCHWABE, D., AND ROSSI, G. The Object-oriented Hypermedia Design Model. Communications of the ACM 38, 8 (Aug. 1995), 45–46. 29. UPSTILL, T., CRASWELL, N., AND HAWKING, D. Query-independent evidence in home page finding. ACM Transactions on Information Systems - ACM TOIS 21, 3 (2003), 286–313. 30. VASCONCELOS, W., AND CAVALCANTI, J. An Agent-Based Approach to Web Site Maintenance. In Proceedings of the 4th International Conference on Web Engineering and Knowledge Engineering - ICWE 2004- To Appear. (Munich, Germany, July 2004). 31. WESTERVELD, T., KRAAIJ, W., AND HIEMSTRA, D. Retrieving Web pages using content, links, URLs and anchors. In The Tenth Text REtrieval Conference (TREC-2001) (Gaithersburg, Maryland, USA, November 2001), pp. 663–672. 32. WITTEN, I., MOFFAT, A., AND BELL, T. Managing Gigabytes, second ed. Morgan Kaufmann Publishers, New York, 1999. 33. XUE, G.-R., ZENG, H.-J., CHEN, Z., MA, W.-Y., ZHANG, H.-J., AND LU, C.-J. Implicit link analysis for small web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Toronto - Canada, July 2003), pp. 56–63.

Expressive Profile Specification and Its Semantics for a Web Monitoring System*

Ajay Eppili, Jyoti Jacob, Alpa Sachde, and Sharma Chakravarthy

Information Technology Laboratory, Computer Science and Engineering, UT Arlington, Texas, USA {eppili,jacob,sachde,sharma}@cse.uta.edu

Abstract. The World Wide Web has gained a lot of prominence with respect to information retrieval and data delivery. With such prolific growth, a user interested in a specific change has to continuously retrieve/pull information from the web and analyze it. This wastes resources and, more importantly, places the burden on the user. Pull-based retrieval needs to be replaced with a push-based paradigm for efficiency and for notification of relevant information in a timely manner. WebVigiL is an efficient profile-based system to monitor, retrieve, detect and notify specific changes to HTML and XML pages on the web. In this paper, we describe the expressive profile specification language along with its semantics. We also present an efficient implementation of these profiles. Finally, we present the overall architecture of the WebVigiL system and its implementation status.

1 Introduction Information on the Internet, growing at a rapid rate, is spread over multiple repositories. This has greatly affected the way information is accessed, delivered and disseminated. Users, at present, are not only interested in the new information available on web pages but also in retrieving changes of interest in a timely manner. More specifically, users may only be interested in particular changes (such as keywords, phrases, links, etc.). Push and pull paradigms [1] are traditionally used for monitoring the pages of interest. The pull paradigm is an approach where the user performs an explicit action, in the form of a query or transaction executed on a periodic basis, on the pages of interest. Here, the burden of retrieving the required information is on the user, and changes may be missed when a large number of web sites need to be monitored. In the push paradigm, the system is responsible for accepting user needs and informs the user (or a set of users) when something of interest happens. Although this approach reduces the burden on the user, naive use of a push paradigm results in informing users about changes to web pages irrespective of the user’s interest. At present most of these systems use a mailing list to send the same compiled changes to all their subscribers. * This work was supported, in part, by the Office of Naval Research & the SPAWAR System Center–San Diego & by the Rome Laboratory (grant F30602-01-2-05430), and by NSF (grant IIS-0123730).


Hence, an approach is needed that replaces periodic polling and notifies the user of the relevant changes in a timely manner. The emphasis in WebVigiL is on selective change notification. This entails notifying the user about changes to web pages based on user-specified interests/policies. WebVigiL is a web monitoring system which uses an appropriate combination of the push and intelligent pull paradigms, with the help of active capability, to monitor customized changes to HTML and XML pages. WebVigiL intelligently pulls the information using a learning-based algorithm [2] from the web server based on the user profile and propagates/pushes only the relevant information to the end user. In addition, WebVigiL is a scalable system, designed to detect even composite changes for a large number of users. An overview of the paradigm used and the basic approach taken for effective monitoring is discussed in [3]. This paper concentrates on the expressiveness of change specification, its semantics, and its implementation. In order for the user to specify notification and monitoring requirements, an expressive change specification language is needed. The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 discusses the syntax and semantics of the change specification language, which captures the monitoring requirements of the user and in addition supports inheritance, event-based duration and composite changes. Section 4 gives an overview of the current architecture and status of the system. Section 5 concludes the paper with an emphasis on future work.

2 Related Work Many research groups have been working on detecting changes to documents. GNU diff [4] detects changes between any two text files. Most of the previous work in change detection has dealt only with flat files [5] and not structured or unstructured web documents. Several tools have been developed to detect changes between two versions of unstructured HTML documents [6]. Some change-monitoring tools, such as ChangeDetection.com [7], have been developed using the push-pull paradigm. But these tools detect changes to the entire page instead of user-specified components, and changes can be tracked only on a limited number of pages.

2.1 Approaches for User Specification Present-day users are interested in monitoring changes to pages and want to be notified based on their profiles. Hence, an expressive language is necessary to specify user intent on fetching, monitoring and propagating changes. WebCQ [8] detects customized changes between two given HTML pages and provides an expressive language for the user to specify his/her interests. But WebCQ only supports changes between the last two versions of a page of interest. As a result, flexible and expressive compare options are not provided to the user. The AT&T Internet Difference Engine [9] views an HTML document as a sequence of sentences and sentence-breaking markups. This approach may be computationally expensive, as each sentence may need to be compared with all sentences in the document. WYSIGOT [10] is a commercial application that can be used to detect changes to HTML pages. It has to be installed on the local


machine, which is not always possible. This system provides an interface to specify the specifications for monitoring a web page. It has the feature of monitoring an HTML page and also all the pages that it points to, but the granularity of change detection is at the page level. In [11], the authors allow the user to submit monitoring requests and continuous queries on the XML documents stored in the Xyleme repository. WebVigiL supports a life-span for a change monitoring request, which is akin to a continuous query. Change detection is continuously performed over the life-span. To the best of our knowledge, customized changes, inheritance, different reference selection or correlated specifications cannot be specified in Xyleme.

3 Change Specification Language The present-day web user’s interest has evolved from mere retrieval of information to monitoring the changes on web pages that are of interest. As web pages are distributed over large repositories, the emphasis is on selective and timely propagation of information/changes. Changes need to be notified to users in different ways based on their profiles/policies. In addition, the notification of these changes may have to be sent to different devices that have different storage and communication bandwidths. The language for establishing the user policies should be able to accommodate the requirements of a heterogeneous, distributed, large network-centric environment. Hence, there is a need to define an expressive and extensible specification language wherein the user can specify details such as the web page(s) to be monitored, the type of change (keywords, phrases, etc.) and the interval for comparing occurrences of changes. Users should also be able to specify how, when, and where to be notified, taking into consideration quality of service factors such as timeliness and size vs. quality of notification. WebVigiL provides an expressive language with well-defined semantics for specifying the monitoring requirements of a user pertaining to the web [12]. Each monitoring request is termed a sentinel. The change specification language developed for this purpose allows the user to create a monitoring request based on his/her requirements. The semantics of this language for WebVigiL have been formalized. The complete syntax of the language is shown in Fig 1. The following are a few monitoring scenarios that can be represented using this sentinel specification language. Example 1: Alex wants to monitor http://www.uta.edu/spring04/cse/classes.htm for the keyword “cse5331” in order to take a decision about registering for the course cse5331. The sentinel runs from May 15, 2004 to August 10, 2004 (the summer semester) and she wants to be notified as soon as a change is detected. The sentinel (s1) for this scenario is as follows:
Create Sentinel s1 Using http://www.uta.edu/spring04/cse/classes.htm
Monitor keyword (cse5331)
Fetch 1 day
From 05/15/04 To 08/10/04


Notify By email [emailprotected]
Every best effort
Compare pairwise
ChangeHistory 3
Presentation dual frame

Example 2: Alex wants to monitor the same URL as in Example 1 for regular updates on new courses being added, but is not interested in changes to images. As this request is correlated with sentinel s1, the duration is specified between the start of s1 and the end of s1. The sentinel (s2) for this scenario is:
Create Sentinel s2 Using s1
Monitor Anychange AND (NOT) images
Presentation only change
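As a minimal sketch, the two examples can be represented with a simple record type whose fields mirror the clauses of the language (Using, Monitor, Fetch, From/To, Notify, Compare, ChangeHistory, Presentation). The dataclass and the inheritance helper below are illustrative only, not WebVigiL's actual implementation, and the default values are assumptions.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass
class Sentinel:
    name: str
    target: str                  # URL or name of a previously defined sentinel
    change_type: str             # e.g. "keyword(cse5331)" or "anychange AND NOT images"
    fetch: str = "on change"     # fixed interval or "on change"
    start: str = "now"
    end: Optional[str] = None
    notify_by: str = "email"
    notify_every: str = "best effort"
    compare: str = "pairwise"
    change_history: int = 3
    presentation: str = "change-only"

def derive(parent: Sentinel, name: str, **overrides) -> Sentinel:
    """New sentinel that inherits every property of `parent` except those
    explicitly overridden (the inheritance rule of Section 3.2)."""
    return replace(parent, name=name, **overrides)

# Example 1 and Example 2 expressed with this representation.
s1 = Sentinel("s1", "http://www.uta.edu/spring04/cse/classes.htm",
              "keyword(cse5331)", fetch="1 day", start="05/15/04", end="08/10/04",
              presentation="dual-frame")
s2 = derive(s1, "s2", change_type="anychange AND NOT images",
            presentation="change-only")
```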

Fig. 1. Sentinel Syntax

3.1 Sentinel Name This specifies a name for the user’s request. The syntax of the sentinel name is Create Sentinel <sentinel-name>. For every sentinel, the WebVigiL system generates a


unique identifier. In addition, the system also allows the user to specify a sentinel name. The user is required to specify a distinct name for each of his/her sentinels. This name identifies a request uniquely. Further, it allows the user to specify another sentinel in terms of his/her previously defined sentinels.

3.2 Sentinel Target The syntax of the sentinel target is Using <sentinel-target>. The sentinel target can be either a URL or a previously defined sentinel. If a new sentinel specifies another sentinel as its target, it inherits all properties of that sentinel unless the user overrides those properties in the current specification. In Example 1, Alex is interested in monitoring the course web page for the keyword ‘cse5331’. Alex should be able to specify this URL as the target on which the system monitors changes on the keyword cse5331. Later, Alex wants to get updates on new classes being added to the page, as this may affect her decision to register for the course cse5331. She should place another sentinel for the same URL but with different change criteria. As the second case is correlated with the first, Alex can specify s1 as the sentinel target with a different change type. Sentinels are correlated if they inherit run-time properties such as the start and end time of a sentinel. Otherwise, they merely inherit static properties (e.g., URL, change type, etc.) of the sentinel. The language allows the user to specify either a web page or a previously placed sentinel as the target.

3.3 Sentinel Type WebVigiL allows the detection of customized changes in the form of a sentinel type and provides explicit semantics for the user to specify his/her desired type of change. The syntax of the sentinel type is: Monitor <sentinel-type>, where a sentinel type is either a single change type or a combination of change types formed with the operators described in Section 3.5. In Example 1, Alex is interested in ‘cse5331’. Detecting changes to the entire page leads to wasteful computations and, further, sends unwanted information to Alex. In Example 2, Alex is interested in any change to the class web page but is not interested in changes pertaining to images. WebVigiL handles such requests by introducing change types and operators in its change specification language. The contents of a web page can be any combination of objects such as sets of words, links and images. Users can specify such objects using change types and use operators over these objects. The change specification language defines primitive changes and composite changes for a sentinel type. Primitive change: the detection of a single type of change between two versions of the same page. For a keyword change, the user must specify a set of words. An exception list can also be given for anychange. For a phrase change, a set of phrases is specified. For regular expressions, a valid regular expression is given. Composite change: a combination of distinct primitive changes specified on the same page, using one of the binary operators AND and OR. The


semantics of a composite change formed by the use of an operator can be defined as follows (note that ∧, ∨, and ¬ denote the Boolean AND, OR, and NOT operators, respectively).

3.4 Change Type If two different versions of the same page are given, then a change of type t on the newer version with reference to the older version is defined to be true if the change type t is detected as an insert into the newer version or a delete from the older version, and false otherwise. The sentinel type is the change type t selected from the set T, where T = {any change, links, images, all words except <set of words>, phrase:<set of phrases>, keywords:<set of words>, table:<table id>,

list:<list id>, regular expression:<exp>}. Based on the form of information that is usually available on web pages, change types are classified as links, images, keywords, phrases, all words, table, list, regular expression and any change. Links: Corresponds to a set of hypertext references. In HTML, links are presentation-based objects represented by the hypertext anchor tag (<A href=“...”>). Given two versions of a page, if any of the old links are deleted in the new version or new links are inserted, a change is flagged. Images: Corresponds to a set of image references extracted from the image source. In HTML, images are represented by the image source tag (<IMG src=“...”>). The changes detected are similar to those for links, except that images are monitored. Keywords <set of words>: Corresponds to a set of unique words from the page. A change is flagged when any of the keywords (mentioned in the set of words) appears in or disappears from a page with respect to the previous version of the same page. Phrase <set of phrases>: Corresponds to a set of contiguous words from the page. A change is flagged on the appearance or disappearance of a given phrase in a page with respect to the previous version of the same page. An update to a phrase is also flagged, depending on the percentage of words that have been modified in the phrase. If the number of words changed exceeds a threshold, it is deemed a delete (or disappearance). Table: Corresponds to the content of the page represented in a tabular format. Though the table is a presentation object, the changes are tracked on the contents of the table. Hence, whenever the table contents are changed, this is flagged as a table change. List: Corresponds to the contents of a page represented in a list format. The list format can be bullets or numbering. Any change detected on the set of words represented in a list format is flagged as a change. Regular expression <exp>: Expressed in valid regular expression syntax for querying and extracting specific information from the document data.


All words: A page can be divided into a set of words, links and images. Any change to the set of words between two versions of the same page is detected as an all words change. All words encompasses phrases, keywords and the words in tables and lists. When considering changes to all words, presentation objects such as tables and lists are not considered as such; only the content of these presentation objects is taken into consideration. Anychange: Anychange encompasses all the above types of changes. Changes to any of the defined sets (i.e., all words, all links and all images) are flagged as anychange. Hence, the granularity is limited to a page for anychange. Anychange is the superset of all changes.
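To make the primitive change types concrete, here is a small sketch of how a few of them (keywords, links, phrases) could be detected between two versions of a page. It is an illustration only: the tokenization, the phrase-presence test, and the omission of the update-as-delete threshold rule are simplifying assumptions, not the CH-Diff/CX-Diff algorithms used by WebVigiL.

```python
def keyword_change(old_text: str, new_text: str, keywords) -> bool:
    """Flag a keyword change when any monitored keyword appears in or
    disappears from the new version relative to the old one."""
    old_words = set(old_text.lower().split())
    new_words = set(new_text.lower().split())
    return any((k.lower() in old_words) != (k.lower() in new_words) for k in keywords)

def link_change(old_links, new_links) -> bool:
    """Flag a link change when hypertext references are inserted or deleted."""
    return set(old_links) != set(new_links)

def phrase_change(old_text: str, new_text: str, phrase: str) -> bool:
    """Appearance or disappearance of a contiguous phrase between versions
    (the percentage-based update rule described above is omitted here)."""
    def present(text: str) -> bool:
        return phrase.lower() in text.lower()
    return present(old_text) != present(new_text)
```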

3.5 Operators Users may want to detect more than one type of change on a given page, or the non-occurrence of a type of change. To facilitate such detections the change specification language includes unary and binary operators. NOT: a unary operator which detects the non-occurrence of a change type. For a given change type t, NOT t holds on the new version with reference to the old version of the same page exactly when the change of type t is not detected between those versions. OR: a binary operator representing the disjunction of change types. For two primitive changes of types t1 and t2 (t1 ≠ t2) specified on the new version with reference to the old version of the same page, a change is detected if either of them is detected. AND: a binary operator representing the conjunction of change types. For two primitive changes of types t1 and t2 (t1 ≠ t2) specified on the new version with reference to the old version of the same page, a change is detected when both of them are detected. The unary operator NOT can be used to specify a constituent primitive change in a composite change. For example, for a page containing a list of fiction books, a user can specify a change type as: All words AND NOT phrase {“Lord of the Rings”}. A change will be flagged only if, given two versions of the page, some words have changed (such as the insertion of a new book and author) but the phrase “Lord of the Rings” has not changed. Hence, the user is interested in monitoring the arrival of new books or the removal of old books only as long as the book “Lord of the Rings” is available.
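The operator semantics above amount to Boolean evaluation over the set of primitive changes detected between two versions. The following sketch shows one way this could be evaluated; the nested-tuple encoding of composite changes is an assumption made for illustration, not WebVigiL's internal representation.

```python
def detect_composite(change_spec, detected: set) -> bool:
    """Evaluate a composite change over the set of primitive change types
    actually detected between two versions of a page.

    `change_spec` is either a change-type name, or a nested tuple:
    ("NOT", spec), ("AND", spec1, spec2), ("OR", spec1, spec2)."""
    if isinstance(change_spec, str):
        return change_spec in detected
    op = change_spec[0]
    if op == "NOT":
        return not detect_composite(change_spec[1], detected)
    if op == "AND":
        return all(detect_composite(s, detected) for s in change_spec[1:])
    if op == "OR":
        return any(detect_composite(s, detected) for s in change_spec[1:])
    raise ValueError(f"unknown operator {op}")

# "All words AND NOT phrase('Lord of the Rings')" from the example:
spec = ("AND", "all words", ("NOT", "phrase:Lord of the Rings"))
print(detect_composite(spec, {"all words"}))                              # True
print(detect_composite(spec, {"all words", "phrase:Lord of the Rings"}))  # False
```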


3.6 Fetch Changes can be detected for a web page only when a new version of the page is fetched. New versions can be fetched based on the freshness of the page. The page properties (or meta-data) of a web page, such as the last-modified date for static pages or a checksum for dynamic pages, define whether the page has been modified. The syntax of fetch is Fetch <time interval> | on change. The user can specify a time interval indicating how often a new page should be fetched, or can specify ‘on change’ to indicate that he/she is unaware of the change frequency of the page. On change: This option relieves the user of knowing when the page changes. WebVigiL’s fetch module uses a heuristic-based fetch algorithm, called the best-effort algorithm [13], to determine the interval at which a page should be fetched. This algorithm uses the change history and meta-data of the page. Fixed interval: The user can specify a fixed fetch interval at which a page is fetched by the system; it can be given in terms of minutes, hours, days or weeks (a non-negative integer).
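The learning-based best-effort algorithm itself is described in [13]; the sketch below only illustrates the meta-data test mentioned above, i.e. deciding from the last-modified date (static pages) or a content checksum (dynamic pages) whether a freshly fetched page actually constitutes a new version. Names and the use of MD5 are assumptions for illustration.

```python
import hashlib

def page_modified(prev_meta: dict, last_modified: str, content: bytes) -> bool:
    """Decide whether a fetched page differs from the previously stored one,
    preferring the Last-Modified header and falling back to a checksum."""
    if last_modified and prev_meta.get("last_modified"):
        return last_modified != prev_meta["last_modified"]
    checksum = hashlib.md5(content).hexdigest()
    return checksum != prev_meta.get("checksum")
```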

3.7 Sentinel Duration WebVigiL monitors a web page for changes during the lifespan of the sentinel. The lifespan of a sentinel is a closed interval formed by the start time and end time of the sentinel and is specified as: From <start time> To <end time>. Let the timeline be an equidistant discrete time domain having “0” as the origin and each time point a positive integer, as defined in [14]. In terms of this timeline, occurrences of the created sentinel S are specific points on the time line, and the duration (lifespan) defines the closed interval within which S occurs. The ‘From’ modifier denotes the start of a sentinel S and the ‘To’ modifier denotes the end of S. The start and end times of a sentinel can be specific times or can depend upon the attributes of other correlated sentinels. The user has the flexibility to specify the duration using one of the following: (a) now, (b) absolute time, (c) relative time, (d) event-based time. Now: a system-defined variable that keeps track of the current time. Absolute time: denoted as a time point T, it can be specified as a definite point on the time line. The format for specifying the time point is MM/DD/YYYY. Relative time: defined as an offset from a time point (either absolute or event-based). The offset can be specified by the time interval defined in Section 3.6. Event-based time: events such as the start and end of a sentinel can be mapped to specific time points and can be used to trigger the start or end of a new sentinel. The start of a sentinel can also depend on the active state of another sentinel and is specified by the event ‘during’. During specifies that a sentinel should be started within the closed


interval of the other sentinel’s lifespan, and its start should be mapped to Now. When a sentinel inherits from another sentinel having a start time of Now, the start time of the current sentinel is mapped to the current time, as the properties are inherited.

3.8 Notification Users need to be notified of detected changes. How, when and where to notify are important criteria for notification and should be resolved by the change specification semantics. Notification mechanism: the mechanism selected for notification is important, especially when multiple types of devices with varying capabilities are involved. The syntax for specifying the notification mechanism is: Notify By <mechanism>, which allows the user to select the appropriate mechanism for notification from the set of options O = {email, fax, PDA}. The default is email. Notification frequency: the notification module has to ensure that the detected changes are presented to the user at the specified frequency. The system should incorporate the flexibility to allow users to specify the desired frequency of notification. The syntax of the notification frequency is: Every <time interval> | best effort | immediate | interactive, where <time interval> is as defined in Section 3.6. Immediate denotes immediate (without delay) notification on change detection. Best effort is defined as notifying as soon as possible after change detection; hence, best effort is equivalent to immediate but has a lower priority than immediate for notification. Interactive is a navigational-style notification approach where the user visits the WebVigiL dashboard to retrieve the detected changes at his/her convenience.

3.9 Compare Options One of the unique aspects of WebVigiL is its compare options and their efficient implementation. Changes are detected between two versions of the same page. Each fetch of the same page is given a version number. The first version of the page is the first page fetched after a sentinel starts. Given a sequence of versions of the same page, the user may be interested in knowing changes with respect to different references. In order to facilitate this, the change specification language allows users to specify three types of compare options. The syntax of the compare option is: Compare <compare option>, where the compare option is selected from the set P = {pairwise, moving n, every n}. Pairwise: the default is pairwise, which allows change comparison between two chronologically adjacent versions, as shown in Fig 2. Every n: consider an example where the user is aware of the changes occurring on a page, such as a web developer or administrator, and is interested only in the cumulative changes over n versions. This compare option detects changes


between every n versions. For the next comparison, the nth page becomes the reference page. For example, if a user wants to detect changes between every 4 versions of the page, the versions for comparison are selected as shown in Fig 2.

Fig. 2. Compare Options

Moving n: This is a moving-window concept for tracking changes, useful, for example, when a user wants to monitor the trend of a particular stock and meaningful change detection is only possible between a particular set of pages occurring in a moving window. For moving n with n = 4, as shown in Fig 2, the first version is the reference page for the fourth one, and the next comparison is between the second and the fifth version. WebVigiL aims to give users more flexibility and options for change detection and hence incorporates several compare options in the change specification, along with efficient change detection algorithms. By default, the previous page (based on the user-defined fetch interval where appropriate) and the current page are used for change detection.
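The three compare options differ only in which (reference, new) version pairs are compared. The sketch below enumerates those pairs for a sequence of versions; the exact indexing of "every n" (counting n versions inclusively, so that V1 is compared with V4 for n = 4) follows the example in the text and is otherwise an assumption.

```python
def comparison_pairs(num_versions: int, option: str = "pairwise", n: int = 4):
    """Version pairs (reference, new) implied by the three compare options.
    Versions are numbered 1..num_versions in fetch order."""
    if option == "pairwise":                      # V1-V2, V2-V3, ...
        return [(v, v + 1) for v in range(1, num_versions)]
    if option == "every":                         # V1-V4, V4-V7, ... for n = 4
        return [(v, v + n - 1) for v in range(1, num_versions - n + 2, n - 1)]
    if option == "moving":                        # V1-V4, V2-V5, ... for n = 4
        return [(v, v + n - 1) for v in range(1, num_versions - n + 2)]
    raise ValueError(option)

print(comparison_pairs(9, "every", 4))    # [(1, 4), (4, 7)]
print(comparison_pairs(9, "moving", 4))   # [(1, 4), (2, 5), (3, 6), (4, 7), (5, 8), (6, 9)]
```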

3.10 Change History The syntax of the change history is ChangeHistory <n>. The change specification language allows the user to specify the number of previous changes to be maintained by the system, so that the user can view the last n changes detected for a particular request (sentinel). WebVigiL provides an interface for users to view and manage the sentinels they have placed; a user dashboard is provided for this purpose. The interactive option is a navigational-style notification approach where users visit the WebVigiL dashboard to retrieve the detected changes at their convenience. Through the WebVigiL dashboard users can view and query the changes generated by their sentinels. The change history specified by the user is used by the system to determine how many detected changes to maintain.

3.11 Presentation Presentation semantics are included in the language to present the detected changes to users in a meaningful manner. In Example 1, Alex is interested in viewing the content cse5331 along with its context, but in Example 2 she is interested in getting a brief


overview of the changes occurring to the page. To support these, the change specification language provides users with two types of presentation. In the change-only approach, the changes to the page, along with the type of change (insert/delete/update), are displayed in an HTML file using a tabular representation. The dual-frame approach shows both documents (involved in the change) side by side in different frames on the same page, highlighting the changes between the documents. The syntax is Presentation <presentation option>, where the presentation option is either change-only or dual-frame.

3.12 Desiderata All of the above expressiveness is of little use if it is not implemented efficiently. One of the focuses of WebVigiL was to design efficient ways of supporting the sentinel specification, to provide a truly asynchronous way of notification, and to manage the sentinels using the active capability developed by the team earlier. In the following sections, we describe the overall WebVigiL architecture and the current status of the working system. The reader is welcome to access the system at http://berlin.uta.edu:8081/webvigil/ and test the system.

4 WebVigiL Architecture and Current Status WebVigiL is a profile-based change detection and notification system. The high-level block diagram shown in Fig 3 details the architecture of WebVigiL. WebVigiL aims at investigating the specification, management and propagation of changes as requested by the user in a timely manner while meeting the quality of service requirements [15]. All the modules shown in the architecture (Fig 3) have been implemented.

Fig. 3. WebVigiL Architecture


The user specification module provides an interface for first-time users to register with the system and a dashboard for registered users to place, view, and manage their sentinels. A sentinel captures the user’s specification for monitoring a web page. The verification module is used to validate user-defined sentinels before sending the information to the knowledgebase. The knowledgebase is used to persist meta-data about each user and his/her sentinels. The change detection module is responsible for generating ECA rules [16] for the run-time management of a validated sentinel. The fetch module is used to fetch pages for all active or enabled sentinels. Currently the fetch module supports fixed-interval and best-effort approaches for fetching web pages. The version management module deals with a centralized server-based repository service that retrieves, archives, and manages versions of pages. A page is saved in the repository only if the latest copy in the repository is older than the fetched page. Subsequent requests for the web page can access the page from the cache instead of repeatedly invoking the fetch procedure. [3] discusses how each URL is mapped to a unique directory and how all the versions of this URL are stored in this directory. Versions are checked for deletion periodically, and versions no longer needed are deleted.
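The save rule of the version management module ("a page is saved only if the latest repository copy is older than the fetched page") can be illustrated with the following sketch. The in-memory store and method names are assumptions for illustration; the real module uses a directory per URL, as described in [3].

```python
class VersionRepository:
    """Illustrative in-memory stand-in for the centralized version store:
    one version list per URL, newest last."""
    def __init__(self):
        self.versions = {}   # url -> list of (fetch_time, content)

    def save_if_newer(self, url: str, fetch_time: float, content: bytes) -> bool:
        history = self.versions.setdefault(url, [])
        if history and history[-1][0] >= fetch_time:
            return False          # repository copy is at least as recent
        history.append((fetch_time, content))
        return True

    def latest(self, url: str):
        history = self.versions.get(url)
        return history[-1] if history else None
```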

Fig. 4. Presentation using the dual-frame approach for an HTML page

The change detection module [17] builds a change detection graph to efficiently detect and propagate the changes. The graph captures the relationship between the


pages and sentinels, and groups the sentinels based on the change type and target web page. Change detection is performed over the versions of the web page, and the sentinels associated with the groups are informed about the detected changes. Currently, grouping is performed only for sentinels that follow the best-effort approach for fetching pages. The WebVigiL architecture can support various page types such as XML, HTML, and TEXT in a uniform way. Currently, changes are detected between HTML pages using the specifically developed CH-Diff [2] module and between XML pages using the CX-Diff [18] module. Change detection modules for new page types can be added, or the current modules for HTML and XML page types can be replaced by more efficient modules, without disturbing the rest of the architecture. Among the change types discussed in Section 3.4, all change types except table, list and regular expressions are currently supported by WebVigiL. Currently the notification module propagates the detected changes to users via email. The presentation module supports both the change-only and dual-frame approaches for presenting the detected changes. A screenshot of the notification using the dual-frame approach for HTML pages is shown in Fig 4. This approach is visually intuitive and enhances user interpretation, since changes are presented along with their context.

5 Conclusion and Future Work In this paper we have discussed the rationale for an expressive change specification language, as well as its syntax and semantics. We have given a brief overview of the WebVigiL architecture and have discussed the current status of the system, which includes a complete implementation of the language presented. We are currently working on several extensions. The change specification language can be extended to provide the capability of supporting sentinels on multiple URLs. The current fetch module is being extended to a distributed fetch module to reduce network traffic. The deletion algorithm for the cached versions discussed in Section 4 is being improved to delete no-longer-needed pages as soon as possible, instead of the slightly conservative approach currently used.

References 1. Deolasee, P., et al., Adaptive Push-Pull: Disseminating Dynamic Web Data, in Proceeding of the 10th International WWW Conference. Hong Kong: p. 265-274, 2001. 2. Pandrangi, N., et al., WebVigiL: User Profile-Based Change Detection for HTML/XML Documents, in Twentieth British National Conference on Databases. Coventry, UK. pages 38 - 55, 2003, 3. Chakravarthy, S., et al., WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments. in Second International Workshop on Web Dynamics. Hawaii, 2002. 4. GNUDiff. http://www.gnu.org/software/diffutils/diffutils.html, 5. Hunt, J.W. and Mcllroy, M.D. An algorithm for efficient file comparison, in Technical Report. Bell Laboratories. 1975. 6. Zhang, K., A New Editing based Distance between Unordered Labeled Trees. Combinatorial Pattern Matching, vol. 1 p. 254-265, 1993.


7. Changedetection. http://www.changedetection.com, 8. Liu, L., et al., Information Monitoring on the Web: A Scalable Solution, in World Wide Web: p. 263-304, 2002. 9. Douglis, F., et al., The AT&T Internet Difference Engine: Tracking and Viewing Changes on the Web, in World Wide Web. Baltzer Science Publishers, p. 27-44. 1998. 10. WYSIGOT. http://www.wysigot.com/, 11. Nguyen, B., et al., Monitoring XML Data on the Web. in Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, 2001. 12. Jacob, J. WebVigiL: Sentinel specification and user-intent based change detection for Extensible Markup Language (XML), in MS Thesis. The University of Texas at Arlington. 2003. 13. Chakravarthy, S., et al., A Learning-Based Approach to fetching Pages in WebVigiL. in Proc of the 19th ACM Symposium on Applied Computing, March 2004. 14. Chakravarthy, S. and Mishra, D., Snoop: An Expressive Event Specification Language for Active Databases. Data and Knowledge Engineering, vol. 14(10): p. 1--26, 1994. 15. Jacob, J., et al., WebVigiL: An approach to Just-In-Time Information Propagation In Large Network-Centric Environments(to be published), in Web Dynamics Book. SpringerVerlag. 2003. 16. Chakravarthy, S., et al., Composite Events for Active Databases: Semantics, Contexts and Detection, in Proc. Int’l. Conf. on Very Large Data Bases VLDB: Santiago, Chile. p. 606-617. 1994. 17. Sanka, A. A Dataflow Approach to Efficient Change Detection of HTML/XML Documents in webVigiL, in MS Thesis. The University of Texas at Arlington, August 2003. 18. Jacob, J., Sachde, A., and Chakravarthy, S., CX-DIFF: A Change Detection Algorithm for XML Content and change Presentation Issues for WebVigiL, ER Workshops October 2003: 273-284.

On Modelling Cooperative Retrieval Using an Ontology-Based Query Refinement Process

Nenad Stojanovic¹ and Ljiljana Stojanovic²

¹ Institute AIFB, Research Group Knowledge Management, University of Karlsruhe, 76128 Karlsruhe, Germany [emailprotected]

² FZI - Research Center for Information Technologies at the University of Karlsruhe, 76128 Karlsruhe, Germany [emailprotected]

Abstract. In this paper we present an approach for the interactive refinement of ontology-based queries. The approach is based on generating a lattice of refinements that enables step-by-step tailoring of a query to the current information need of a user. These needs are elicited implicitly by analysing the user’s behaviour during the searching process. The gap between a user’s need and his query is quantified by measuring several types of query ambiguity, which are used for ranking the refinements. The main advantage of the approach is more cooperative support in the refinement process: by exploiting the ontology background, the approach supports finding “similar” results and enables efficient relaxation of failing queries.

1 Introduction Although a lot of research has been dedicated to improving the cooperativeness of the information access process [1], almost all of it has focused on resolving the problem of an empty answer set. Indeed, when a query fails it is more cooperative to identify the cause of the failure, rather than just to report the empty answer set; the failure may be due to false presuppositions concerning the content of the knowledge base, which lead to the stonewalling behaviour of the retrieval system, or due to misconceptions (concerning the schema of the domain), which cause mismatches between a user’s view of the world and the concrete conceptualisation of the domain. If there is no cause per se for the query’s failure, it is then worthwhile to report the part of the query which failed. Further, some types of query generalization [2] or relaxation [3], [4] have been proposed for weakening a user’s query in order to allow him to find some relevant results. The growing nature of the web’s information content implies a user behaviour pattern that should be treated in a more collaborative way in modern retrieval systems: users tend to make short queries which they refine (expand) subsequently. Indeed, in order to be sure to get any answer to a query, a user forms as short a query as possible and, depending on the list of answers, tries to narrow his query in several refinement steps. Probably the most expressive examples are product catalogue applications that serve as web interfaces to large product databases. The main problem here is that a user cannot express his need for a product clearly by using only 2-3 terms, i.e. a user’s query represents just an approximation of his information need [5]. Therefore, a user tries in several refinement steps to filter the list of retrieved products, so that only the products which are most relevant for his information need


remain. Unfortunately, most retrieval systems do not provide cooperative support in the query refinement process, so that a user is “forced” to change his query on his own in order to find the most suitable results. Indeed, although in an interactive query refinement process [6] a user is provided with a list of terms that appear frequently in the retrieved documents, a more semantic analysis of the relationships between these terms is missing. For example, if a user made a query for a metallic car, then refinements that include the value of the car’s colour can be treated as more relevant than refinements regarding the speed of the car, since the feature metallic is strongly related to colour. At least, such reasoning can be expected from a human shop assistant. Obviously, if a retrieval system has more information about the model of the underlying product data, then a more cooperative (human-like) retrieval process can be created. In our previous work we developed a query refinement process, called the Librarian Agent Query Refinement process, that uses an ontology for modelling an information repository [7],[8]. That process is based on incrementally and interactively tailoring a query to the current information need of a user, whereas that need is discovered implicitly by analysing the user’s behaviour during the search process. The gap between the user’s query and his information need is defined as the query ambiguity and is measured by several ambiguity parameters that take into account the used ontology as well as the content of the underlying information repository. In order to provide a user with suitable candidates for the refinement of his query, we calculate the so-called Neighbourhood of that query. It contains the query’s direct neighbours in the lattice of queries defined by considering the inclusion relation between query results. In this paper we extend this work by involving more user-related information in the query refinement phase of the process. In that way our approach ensures continual adaptation of the retrieval system to the changing preferences of users. Due to the reluctance of users to give explicit information about the quality of the retrieval process, we base our work on implicit user feedback, a very popular information retrieval technique for gathering user preferences [9]. From a user’s point of view, our approach provides a more cooperative retrieval process: in each refinement step the user is provided with a complete but minimal set of refinements, which enables him to develop/express his information need in a step-by-step fashion. Secondly, although all user interactions are anonymous, we personalize the searching process and achieve so-called ephemeral personalization by implicitly discovering a user’s need. The next benefit is the possibility to anticipate which alternative resources can be interesting for the user. Finally, this principle enables coping with user requests that cannot be fulfilled in the given repository (i.e. requests that return zero results), a problem that is hard to solve for existing information retrieval approaches. The paper is organised as follows: in Section 2 we present the extended Librarian Agent Query Refinement process and discuss its cooperative nature. Section 3 presents related work and Section 4 contains concluding remarks.

2 Librarian Agent Query Refinement Process The goal of the Librarian Agent Query Refinement process [8] is to enable a user to efficiently find results relevant for his information need in an ontology-based infor-


mation repository, even if some of the problems sketched in the previous section appear in the searching process. These problems lead to misinterpretations of a user’s need in his query, so that either a lot of irrelevant results and/or only a few relevant results are retrieved. In the Librarian Agent Query Refinement process, potential ambiguities (i.e. misinterpretations) of the initial query are first discovered and assessed (cf. the so-called Ambiguity-Discovery phase). Next, these ambiguities are interpreted with regard to the user’s information need, in order to estimate the effects of an ambiguity on the fulfilment of the user’s goals (cf. the so-called Ambiguity-Interpretation phase). Finally, the recommendations for refining the given query are ranked according to their relevance for fulfilling the user’s information need and according to their ability to disambiguate the meaning of the query (cf. the so-called Query Refinement phase). In that way, the user is provided with a list of relevant query refinements ordered according to their capability to decrease the number of irrelevant results and/or to increase the number of relevant results. In the next three subsections we explain these three phases further; the first phase is only sketched here, since its complete description is given in [8]. In order to present the approach in a more illustrative way, we refer to examples based on the ontology presented in Fig. 1. Table 1 represents a small product catalog, indexed/annotated with this ontology. Each row represents the features assigned to a product (a car); e.g., product P8 is a cabriolet, its colour is green metallic and it has an automatic gear-changing system. The features are organised in an isA hierarchy; for example, the feature (concept) “BlueColor” has two specializations, “DarkBlue” and “WhiteBlue”, which means that a dark or white blue car is also a blue colour car.

Fig. 1. The car-feature ontology used throughout the paper

2.1 Phase 1: Ambiguity Discovery We define query ambiguity as an indicator of the gap between the user’s information need and the query that results from that need. Since we have found two main factors that cause the ambiguity of a query, the vocabulary (ontology) and the information repository, we define two types of ambiguity that can arise in interpreting a query: (i) the semantic ambiguity, as a characteristic of the used ontology, and (ii) the content-related ambiguity, as a characteristic of the repository. In the next two subsections we give more details on them.


2.1.1 Semantic Ambiguity The goal of an ontology-based query is to retrieve the set of all instances which fulfil all the constraints given in that query. In such a logic query the constraints are applied to the query variables. For example, in a query containing the constraint hascolorvalue(x,BlueColour), x is a query variable and hascolorvalue(x,BlueColour) is a query constraint. The stronger these constraints are (assuming that all of them correspond to the user’s need), the more relevant the retrieved instances are for the user’s information need. Since an instance in an ontology is described through (i) the concept it belongs to and (ii) its relations to other instances, we see two factors which determine the semantic ambiguity of a query variable: the concept hierarchy (how general is the concept the variable belongs to) and the relation instantiation (how descriptive/strong are the constraints applied to that variable). Consequently, we define the following two parameters in order to estimate these values: Definition 1: Variable Generality. VariableGenerality(X) = Subconcepts(Type(X)) + 1, where Type(X) is the concept the variable X belongs to and Subconcepts(C) is the number of subconcepts of the concept C. Definition 2: Variable Ambiguity, where Relation(C) is the set of all relations defined for the concept C in the ontology, AssignedRelations(C,Q) is the set of all relations from Relation(C) which appear in the query Q, and AssignedConstraints(X,Q) is the set of all constraints related to the variable X that appear in the query Q. The total ambiguity of a variable is calculated as the product of these two parameters, in order to model uniformly the directly proportional effect of both parameters on the ambiguity. Note that the second parameter is typically less than 1. We now define the ambiguity as follows:


Finally, the Semantic Ambiguity for the query Q is calculated by combining the ambiguities of the variables in Var(Q), the set of variables that appear in the query Q. By analysing these ambiguity parameters it is possible to discover which of the query variables introduces the highest ambiguity into a query. Consequently, this variable should be refined in the query refinement phase.

2.1.2 Content-Related Ambiguity

An ontology defines just a model of how the entities of a real domain should be structured. If a part of that model is not instantiated in the given domain, then that part of the model cannot be used for calculating ambiguity. Therefore, we should use the content of the information repository to prune the results of the ontology-related analyses of a user's query.

2.1.2.1 Query Neighbourhood

We introduce here the notation for ontology-related entities that is used in the rest of this subsection:
– Q(O) is a query defined against the ontology O. The setting in this paper encompasses positive conjunctive queries; however, the approach can easily be extended to queries that include negation and disjunction.
– the set of all possible elementary queries (queries that contain only one constraint) for an ontology O;
– KB(O) is the set of all relation instances (facts) which can be proven in the given ontology O. It is called the knowledge base.
– A(Q(O)) is the set of answers (in the logical sense) for the query Q regarding the ontology O.

Definition 3: Ontology-based information repository. An ontology-based information repository IR is a structure (R, O, ann), where R is a set of elements called resources; O is an ontology, which defines the vocabulary used for annotating these resources (we say that the repository is annotated with the ontology O and a knowledge base KB(O)); and ann is a binary relation between the set of resources and the set of facts from the knowledge base KB(O). We write ann(r, i) meaning that a fact i is assigned to the resource r (i.e. the resource r is annotated with the fact i).

Definition 4: Resources-Attributes group (user's request). A Resources-Attributes group in an IR = (R, O, ann) is a tuple

where the first component is called the set of resources and the second the set of attributes. It follows that the resource set of a request Q' contains exactly those resources r with ann(r, i) for every attribute i of Q', i.e. it is the set of resources which are annotated with all attributes of the query Q'.
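As a minimal illustration of Definition 4 (product names and feature sets invented, and the isA hierarchy ignored for brevity), the resource set of a request can be computed directly from an annotation table such as Table 1:

    # Toy annotation relation: resource -> set of annotated features (attributes).
    ann = {
        "P1": {"Cabriolet", "BlueColor", "AutomaticGear"},
        "P2": {"Limousine", "DarkBlue"},
        "P8": {"Cabriolet", "GreenMetallic", "AutomaticGear"},
    }

    def resource_set(attributes):
        # the resources annotated with *all* attributes of the request Q'
        return {r for r, feats in ann.items() if attributes <= feats}

    print(resource_set({"Cabriolet"}))                   # {'P1', 'P8'}
    print(resource_set({"Cabriolet", "AutomaticGear"}))  # the same two resources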


Definition 5: Structural equivalence (=) between two user's requests: two user's requests are structurally equivalent if their sets of result resources are equivalent.

Definition 6: A cluster of users' queries (in the rest of the text: Query cluster) is the set of all structurally equivalent user's requests. Its attribute set contains the union of the attributes of all requests that are equivalent; its resource set is equal to the result-resource set of the query.

The Query cluster which contains all existing resources in IR is called the root cluster.

Definition 7: Structural subsumption (parent-child relation)

(..., group="A", Matches in Group A); (null, Brazil,

); and (null,Brazil,Brazil vs Scotland). Structure Matching. This subsystem discovers all structure mappings between elements in the XML and HTML DOMs. We adopt two constraints used in GLUE system [4] as a guide to determine whether two nodes are structurally matched: Neighbourhood Constraint: “two nodes match if nodes in their neighbourhood also match”, where the neighbourhood is defined to be the children. Union Constraint: “if every child of node A matches node B, then node A also matches node B”. Note that there could be a range of possible matching cases, depending on the completeness and precision of the match. In the ideal case, all components of the structures in the two nodes fully match. Alternatively, only some of the components are matched (a partial structural match). In the case of partial structure matching between two nodes, there are some extra nodes, i.e. children of the first node that do not match with any children of the second node; and/or vice versa. Since extra nodes do not have any match in the other document, they are ignored in the structure matching process. Therefore, the above constraints need to be modified to construct the definition of structure matching which accommodates both partial and full structure matching:


Neighbourhood Constraint: “XML node X structurally matches HTML node H if H is not an extra HTML node and every non-extra child of H either text matches or structurally matches a non-extra descendant of X”. Union Constraint: “X structurally matches H if every non-extra child of H either text matches or structurally matches X”. As stated in the above constraints, we need to examine the children of the two nodes being compared in order to determine if a structure matching exists. Therefore, structure matching is implemented using a bottom-up approach that visits each node in the HTML DOM in post-order and searches for a matching node in the entire XML DOM. If the list of substring mappings is still empty after the structure matching process finishes, we add a mapping from the XML root element to the HTML body element, if it exists, or to the HTML root element, otherwise. Revisiting the Soccer example (Fig. 2), some of the discovered structure mappings are:(null,match, tr) (neighbourhood constraint), (null,match, table) (union constraint) and (null,soccer, body). Sequence Checking. Up to this point, the mappings generated by the text matching and structure matching subsystems are limited to 1-1 mappings. In cases where the XML and HTML documents have more complex structure, these mappings may not be accurate and this can affect the quality of the XSLT rules generated from these mappings. Consider the following example: In Fig. 2, the sequence of the children of XML node soccer is made up of nodes with the same name, match; whereas the sequence of the children of the matching HTML node body follows a specific pattern: it starts with h1 and is followed repetitively by h2 and table. Using only the discovered 1-1 mappings, it is not possible to create an XSLT rule for soccer that resembles this pattern, since match maps only to table according to structure matching. In other words, there will be no template that will generate the HTML node h2. Focusing on the structure mapping (match,table) and the substring mappings {(team[1],h2),(team[2],h2)}, we can see in the DOM trees that the children of match, i.e. team[1] and team[2], are not mapped to the descendant of table. Instead, they map to the sibling of table, i.e. h2. Normally, we expect that the descendant of match maps only to the descendant of table, so that the notion of 1-1 mapping is kept. In this case, there is an intuition that match should not only map to table, but also to h2. In fact, match should map to the concatenation of nodes h2 and table, so that the sequence of the children of body is preserved when generating the XSLT rule. This is called a 1-m mapping, where an XML node maps to the concatenation of several HTML nodes. The 1-m mapping (match,h2 ++ table) can be found by examining the subelement sequence of soccer and the subelement sequence of body described above. Note that the subelement sequence of a node can be represented using a regular expression, which is a combination of symbols representing each subelement and metacharacters: To obtain this regular expression, XTRACT [5], a system for inferring a DTD of an element, is employed. In our example, the regular expression representing the subelement sequence of soccer is whereas


the one representing subelement sequence of body is We then check whether the elements in the first sequence conform to the elements in the second sequence, as follows: According to the substring mapping (soccer,group,h1), element conforms to an attribute of soccer and thus, we ignore it and remove from the second sequence. Comparing with we can see that element match should conform to elements table) since the sequence corresponds directly to the sequence i.e. they both are in repetitive pattern, denoted by However, element match conforms only to element table, as indicated by the structure matching (match,table). The verification therefore fails, which indicates that the structure matching (match,table) is not accurate. Consequently, based on the sequences and we deduce the accurate 1-m mapping: (null,match,h2 ++ table). The main objective of the sequence checking subsystem is to discover 1-m mappings using the technique of comparing two sequences described above. XSLT Stylesheet Generation. This subsystem constructs a template rule for each mapping discovered in and (the list of 1-m mappings); and puts them together to compose an XSLT stylesheet. We do not consider the substring mappings in because in substring mappings, it is possible to have a situation where the text value of the HTML node is a concatenation of text values from two or more XML nodes. Hence, it is impossible to create template rules for those XML nodes. Moreover, the HTML text value may contain substrings that do not have matching XML text value (termed as extra string). Considering these situations, we implement a procedure that generates a template for each distinct HTML node in The XSLT stylesheet generation process begins by generating the list of substring rules. We then construct a stylesheet by creating the <xsl: stylesheet> root element and subsequently, filling it with template rules for the 1-m mappings in the structure mappings in and the exact mappings in The template rules for the 1-m mappings have to be constructed first, since within that process, they may invalidate several mappings in and and thus, the template rules for those omitted mappings do not get used . In each mapping list and the template rule is constructed for each distinct mapping, to avoid having some conflicting template rules. In the next three subsections, we give more detail on the XSLT generation process. Discussion of 1-m mappings is left out due to space constraints. Substring Rule Generation. The substring rule generator creates a template from an XML node or a set of XML nodes to each distinct HTML node presented in the substring mapping list The result of this subsystem is a list of substring rules SUB_RULES, where each element of the list is a tuple (html_node, rule). Due to space constraints, we omit the detailed description of our algorithm for generating the substring rule itself. The following example illustrates how substring rule generation works. Consider the following substring mappings discovered in the Soccer example (Fig. 2): (null, Scotland,Brazil vs Scotland) and (null,


Brazil, Brazil vs Scotland). The HTML string is "Brazil vs Scotland", while the set of XML strings is {"Brazil", "Scotland"}. By replacing the parts of the HTML string that appear in the set of XML strings with the corresponding XSLT instruction, the substring rule is: <xsl:value-of select="team[1]"/> vs <xsl:value-of select="team[2]"/>, where "vs" is an extra string.

Constructing a Template Rule for an Exact Mapping. Each template rule begins with an XSLT <xsl:template> element and ends by closing that element. For an exact mapping, the pattern of the corresponding template rule is the mapping's XML component, and the only XSLT instruction used in the template is <xsl:value-of>. In this procedure, we only construct a template rule when the mapping is between an XML ELEMENT_NODE and an HTML node or a concatenation of HTML nodes. The reason that we ignore mappings involving XML ATTRIBUTE_NODEs is that the templates for such mappings are generated directly within the construction of the template rules for structure mappings and 1-m mappings. In text matching, there can be mappings from an XML node to a concatenation of HTML nodes, hence we need to create a template for each HTML node in the concatenation. For example, the template rule for an exact mapping (null, line, text() ++ br) matches line and outputs its text value followed by a br element.
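Returning to the substring rules described above, the replacement step can be sketched as follows. This is a simplification of XSLTGen's generator: the select paths are passed in directly here, whereas the real system derives them from the discovered mappings and also handles extra strings that span several text nodes.

    # Sketch of the substring-rule idea: replace every piece of the HTML text that
    # equals a matched XML text value with an <xsl:value-of> referring to that XML
    # node; whatever is left over stays as literal "extra string" text.
    def substring_rule(html_text, xml_values):
        # xml_values: list of (select_path, text_value) pairs from substring mappings
        rule = html_text
        for select, value in xml_values:
            rule = rule.replace(value, '<xsl:value-of select="%s"/>' % select)
        return rule

    print(substring_rule("Brazil vs Scotland",
                         [("team[1]", "Brazil"), ("team[2]", "Scotland")]))
    # <xsl:value-of select="team[1]"/> vs <xsl:value-of select="team[2]"/>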

Constructing a Template Rule for a Structure Mapping. Recall that in structure matching, one of the discovered structure mappings must be the mapping whose XML component is the root of the XML document. The template for this special mapping begins by copying the root of the HTML document and its subtree, excluding the mapping's HTML component and that component's subtree; its construction then follows the steps performed for the other structure mappings. For any structure mapping, the opening tag of its HTML component is created, then a template for each child of its XML component is created, and finally the tag is closed. E.g., suppose there is a structure mapping (null, match, table) discovered in Soccer (Fig. 2), together with the exact mappings (null, date, td[1]), (null, team[1], td[2]) and (null, team[2], td[3]). The template rule representing this structure mapping matches match, opens the table element, emits one cell template for each of the exact mappings above, and then closes the table.
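A rough sketch of that assembly step is given below; it builds the XSLT text for one structure mapping from the exact mappings of its children. Tag and path names come from the Soccer example (tr is used here so that the emitted cells are directly well formed; the paper's example maps match to table), and the real generator additionally handles attributes, extra nodes and 1-m mappings.

    # Hedged sketch, not XSLTGen's actual code: assemble one template rule for a
    # structure mapping (xml_name -> html_tag) from its children's exact mappings.
    def structure_template(xml_name, html_tag, child_mappings):
        # child_mappings: list of (xml_child_path, html_child_tag)
        lines = ['<xsl:template match="%s">' % xml_name, '  <%s>' % html_tag]
        for xml_child, html_child in child_mappings:
            lines.append('    <%s><xsl:value-of select="%s"/></%s>'
                         % (html_child, xml_child, html_child))
        lines += ['  </%s>' % html_tag, '</xsl:template>']
        return "\n".join(lines)

    print(structure_template("match", "tr",
                             [("date", "td"), ("team[1]", "td"), ("team[2]", "td")]))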

Refining the XSLT Stylesheet. In some cases, the (new) HTML document obtained by applying the generated XSLT stylesheet to the XML document may not be accurate, i.e. there are differences between this (new) HTML document and the original (user-defined) HTML document. By examining those differences, we can improve the accuracy of the XSLT stylesheets generated. This


step is applicable when we have a set of complete and accurate mappings between the XML and HTML documents, but the generated XSLT stylesheet is erroneous. If the discovered mappings themselves are incorrect or incomplete, then this refinement step is not effective and it is better to address the problem by improving the matching techniques. An indicator that we have complete and accurate mappings is that each element in the new HTML document corresponds exactly to the element in the original HTML document at the same depth. One possible factor that can cause the generated XSLT stylesheet to be inaccurate, is the wrong ordering of XSLT instructions within a template. This situation typically occurs when we have XML nodes with the same name but different order or sequence of children. Therefore, the main objective of the refinement step is to fix the order of the XSLT instructions within the template matches of the generated XSLT stylesheet, so that the resulting HTML document is closer to or exactly the same as the original HTML document. A naive approach to the above problem is to use brute force and attempt all possible orderings of instructions within templates until the correct one is found (there exist no differences between the new HTML and the original HTML). However, this approach is prohibitively costly. Therefore, we adopt a heuristic approach, which begins by examining the differences between the original HTML document and the one produced by the generated XSLT stylesheet. We employ a change-detection algorithm [2], that produces a sequence of edit operations needed to transform the original HTML document to the new HTML document. The types of edit operations returned are insert, delete, change, and move. To carry out the refinement, the edit operation that we focus on is the move operations, since we want to swap around the XSLT instructions in a template match to get the correct order. In order for this to work, we require that there are no missing XSLT instructions for any template match in the XSLT stylesheet. After examining all move operations, this procedure is started over using the fixed XSLT stylesheet. This repetition is stopped when no move operations are found in one iteration; or, the number of move operations found in one iteration is greater than those found in the previous iteration. The second condition is there to prevent the possibility of fixing the stylesheet incorrectly. We want the number of move operations to decrease in each iteration until it reaches zero.
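The stopping rules of this loop can be made explicit in a few lines. In the sketch below, apply_stylesheet, diff_moves and reorder_by_moves are stand-ins for the XSLT processor, the change-detection algorithm of [2] restricted to move operations, and the instruction-reordering step; they are not real APIs.

    # Sketch of the refinement loop: keep reordering while the number of 'move'
    # edit operations decreases; stop when it reaches zero or starts to grow.
    def refine(stylesheet, xml_doc, original_html,
               apply_stylesheet, diff_moves, reorder_by_moves):
        prev_moves = None
        while True:
            new_html = apply_stylesheet(stylesheet, xml_doc)
            moves = diff_moves(original_html, new_html)   # 'move' operations only
            if not moves:
                break                                     # nothing left to fix
            if prev_moves is not None and len(moves) > prev_moves:
                break                                     # getting worse: stop
            stylesheet = reorder_by_moves(stylesheet, moves)
            prev_moves = len(moves)
        return stylesheet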

3 Empirical Evaluation

We have conducted experiments to study and measure the performance of XSLTGen. To give the reader some idea of how our system performs, we evaluated XSLTGen on four examples taken from a popular XSLT book (http://www.wrox.com/books/0764543814.shtml) and a real-life dataset taken from MSN Messenger chat history. These datasets exhibit a wide variety of characteristics, ranging from 10 to 244 element nodes. Originally, they were pairs of (XML document, XSLT stylesheet). To get the HTML document associated with each dataset, we apply the original XSLT stylesheet to the XML



document using the Xalan XSLT processor (http://xml.apache.org/xalan-j/index.html). We then manually determined the correct mappings between the XML and HTML DOMs in each dataset. For each dataset, we applied XSLTGen to find the mappings between the elements in the XML and HTML DOMs, and to generate the corresponding XSLT stylesheet. We then measured the matching accuracy, i.e. the percentage of the manually determined mappings that XSLTGen discovered, and the quality of the XSLT stylesheet inferred by XSLTGen. To evaluate the quality of the XSLT stylesheet generated by XSLTGen in each dataset, we applied the generated XSLT stylesheet back to the XML document using Xalan and then compared the resulting HTML with the original HTML document using HTMLDiff (http://www.componentsoftware.com/products/HTMLDiff/). HTMLDiff is a tool for analysing changes made between two revisions of the same file. It is commonly used for analysing HTML and XML documents. The differences may be viewed visually in a browser, or be analysed at the source level.

The results of the matching accuracy are impressive. XSLTGen achieves high matching accuracy across all five datasets. Exact mappings reach 100% accuracy in four out of five datasets. In the dataset Chat Log, exact mappings reach 86% accuracy. This is caused by undiscovered mappings from XML ATTRIBUTE_NODEs to HTML ATTRIBUTE_NODEs, which violate our assumption in Section 2.2 that the value of an HTML ATTRIBUTE_NODE is usually specific to the display of the HTML document in Web browsers and is not generated from text within the XML document. Substring mappings achieve 100% accuracy in the datasets Itinerary and Soccer. In contrast, substring mappings achieve 0% accuracy in the dataset Poem. This poor performance is caused by incorrectly classifying substring mappings as exact mappings during the text matching process. In the datasets Books and Chat Log, substring mappings do not exist. Structure mappings achieve perfect accuracy in all datasets except Poem. In the dataset Poem, structure mappings achieve 80% accuracy because an XML node is incorrectly matched with an HTML TEXT_NODE in text matching, while it should be matched with another HTML node in structure matching. Following the success of the other mappings, 1-m mappings achieve 100% accuracy in the datasets Itinerary and Soccer. In the datasets Books, Poem and Chat Log, there are no 1-m mappings. These results indicate that in most of these cases, the XSLTGen system is capable of discovering complete and accurate mappings.

The results returned by HTMLDiff are also impressive. The new HTML documents have a very high percentage of correct nodes. In the datasets Itinerary and Soccer, the HTML documents being compared are identical, which is shown by the achievement of 100% in all types of nodes. In the dataset Poem, the two HTML documents have exactly the same appearance in Web browsers, but according to HTMLDiff, there are some missing whitespaces in each line within the paragraphs of the new HTML document. That is why the percentage of correct TEXT_NODEs in this dataset is very low (14%). The reason for this low percentage is that in the text matching subsystem, we remove the leading and trailing whitespaces of a string before the matching is done. The improvement



stage does not fix the stylesheet since there are no move operations. In the dataset Books, the difference occurs in the first column of the table. In the original HTML document, the first column is a sequence of numbers 1, 2, 3 and 4; whereas in the new HTML document, the first column is a sequence of 1s. The numbers 1, 2, 3 and 4 in the original HTML document are represented using four extra nodes. However, our template rule constructor assumes that all extra nodes that are cousins (their parent are siblings and have the same node name) have the same structure and values. Since the four extra nodes have different text values in this dataset, the percentage of correct TEXT_NODEs in the new HTML document is slightly affected (86%). Lastly, the differences between the original and the new HTML documents in the dataset Chat Log are caused by the undiscovered mappings mentioned in the previous paragraph. Because of this, it is not possible to fix the XSLT stylesheet. However, the percentage of correct ATTRIBUTE_NODEs is still acceptable (75%). We have tested XSLTGen on many other examples and the results are very similar to those obtained in this experiment. However, there are some problems that prevent XSLTGen from obtaining even higher matching accuracy. First, in a few cases, XSLTGen is not able to discover some mappings between XML ATTRIBUTE_NODEs and HTML ATTRIBUTE_NODEs because these mappings violate our assumption stated in Section 2.2. This problem can be alleviated by considering HTML ATTRIBUTE_NODEs in the matching process. Undiscovered mappings are also caused by incorrectly matching some nodes, which is the second problem faced in the matching process. Incorrect matchings typically occur when an XML or an HTML TEXT_NODE has some ELEMENT_NODE siblings. In some cases, these nodes should be matched during the text matching process, while in other cases they should be matched in structure matching. Here, the challenge will be in developing matching techniques that are able to determine whether a TEXT_NODE should be matched during text matching or structure matching. The third problem concerns with incorrectly classified mappings. This problem only occurs between a substring mapping and an exact mapping, when the compared strings have some leading and trailing whitespaces. Determining whether whitespaces should be kept or removed is a difficult choice. Besides this, as the theme of our text matching subsystem is text-based matching (matching two strings), the performance of the matching process decreases if the supplied documents contain mainly numerical data. In this case, the mappings discovered, especially substring mappings, are often inaccurate and conflicting, i.e. more than one HTML node is matched with an XML node. Finally, the current version of XSLTGen does not support the capability to automatically generate XSLT stylesheets with complex functions (e.g. sorting). This is a very challenging task and an interesting direction for future work.

4 Related Work

There is little work in the literature about automatic XSLT stylesheets generation. The only prior work of which we are aware of is XSLbyDemo [10], a system


that generates an XSLT stylesheet by example. In this system, the process of generating XSLT stylesheet begins with transforming the XML document to an initial HTML page, which is an HTML page using a manually created XSLT stylesheet, taking into account the DTD of the XML document. The user then modifies the initial HTML page using a WYSIWYG editor and their actions are recorded in an operation history. Based on the user’s operation history, a new stylesheet is generated. Obviously, this system is not automatic, since the user is directly involved at some stages of the XSLT generation process. Hence, it is not comparable to our fully automatic XSLTGen system. Specifically, our approach differs from XSLbyDemo in three key ways: (i) Our algorithm produces a stylesheet that transforms an XML document to an HTML document, while XSLbyDemo generates transformations from an initial HTML document to its modified HTML document, (ii) Our generated XSLT can be applied directly to other XML documents from the same document class, whereas using XSLbyDemo, the other XML documents have to be converted to their initial HTML pages before the generated stylesheet can be applied, (iii) Finally, our users do not have to be familiar with a WYSIWYG editor and the need of providing structural information through the editing actions. The only thing that they need to possess is knowledge of a basic HTML tool. In the process of generating XSLT, semantic mappings need to be found. There are a number of algorithms available for tree matching. Work done in [12, 13] on the tree distance problem or tree-to-tree correction problem and in [2] known as the change-detection algorithm, compare and discover the sequence of edit operations needed to transform the source tree into the result tree given. These algorithms are mainly based on structure matching, and their input comprises of two labelled trees of the same type, i.e. two HTML trees or two XML trees. The text matching involved is very simple and limited since it compares only the labels of the trees. Clearly, these algorithms do not accommodate our needs, since we require an algorithm that matches an XML tree with an HTML tree. However, these algorithms are certainly useful in our refinement stage since within that subsystem, we are comparing two HTML documents. In the field of semantic mapping, a significant amount of work has focused on schema matching (refer to [11] for survey). Schema matching is similar to our matching problem in the sense that two different schemas are compared, which have different sets of element names and data instances. However, the two schemas being compared are mostly from the same domain and therefore, their element names are different but comparable. Besides using structure matching, most of the schema mapping systems rely on element name matchers to match schemas. The TransSCM system [9] matches schema based on the structure and names of the SGML tags extracted from DTD files by using concept of labelled graphs. The Artemis system [1] measures similarity of element names, data types and structure to match schemas. In XSLTGen, it is impossible to compare the element names since XML and HTML have completely different tag names. XMapper [7] is another system for finding semantic mappings between structured documents within a given domain, particularly XML sources. This system


uses an inductive machine learning approach to improve accuracy of mappings for XML data sources, whose data types are either identical or very similar, and the tag names between these data sources are significantly different. In essence, this system is suitable for our matching process in XSLTGen since the tag names of XML and HTML documents are absolutely different. However, this system requires the user to select one matching tag between two documents, which violates our principle intention of creating a fully automatic system. Recent work in the area of ontology matching also focuses on the problem of Finding semantic mappings between two ontologies. One ontology matching system that we are aware of is GLUE system [4]. GLUE also employs machine learning techniques to semi-automatically create such semantic mappings. Given two ontologies: for each node in one ontology, the purpose is to find the most similar node in the other ontology using the notions of Similarity Measures and Relaxation Labelling. Similar to our matching process, the basis used in the similarity measure and relaxation labelling are data values and the structure of the ontologies, respectively. However, GLUE is only capable of finding 1-1 mappings whereas our XSLTGen matching process is able to discover not only 11 mappings but also 1-m and sometimes m-1 mappings (in substring mappings). The main difference between mapping in XSLTGen and other mapping systems, is that in XSLTGen we believe that mappings exist between the elements in the XML and HTML documents, since the HTML document is derived from the XML document by the user; whereas in other systems, the mapping may not exist. Moreover, the mappings generated by the matching process in XSLTGen are used to generate code (an XSLT stylesheet) and that is why the mappings found have to be accurate and complete, while in schema matching and ontology matching, the purpose is only to find the most similar nodes between the two sources, without further processing of the results. To accommodate the XSLT stylesheet generation, XSLTGen is capable of finding 1-1 mappings, 1-m mappings and sometimes m-1 mappings; whereas the other mapping systems focus only on discovering 1-1 mappings. Besides this, the matching subsystem in XSLTGen has the advantage of having very similar and related data sources, since the HTML data is derived from the XML data. Hence, they can be used as the primary basis to find the mappings. In other systems, the data instances in the two sources are completely different, the only association that they have is that the sources come from the same domain. Following this argument, XSLTGen discovers the mappings between two different types of document, i.e. an XML and an HTML document, whereas the other systems compare two documents of the same type. Finally, another important aspect which differs XSLTGen from several other systems, is that the process of discovering the mappings which will then be used to generate XSLT stylesheet is completely automatic.

5 Conclusion

With the upsurge in data exchange and publishing on the Web, conversion of data from its stored representation (XML) to its publishing format (HTML)


is increasingly important. XSLT plays a prominent role in transforming XML documents into HTML documents. However, it is difficult for users to learn. We have devised XSLTGen, a system for automatically generating an XSLT stylesheet, given an XML document and its corresponding HTML document. This is useful for helping users to learn XSLT. The main strengths of the generated XSLT stylesheets are accuracy and reusability. We have described how text matching, structure matching and sequence checking enable XSLTGen to discover not only 1-1 semantic mappings between the elements in the XML document and those in the HTML document, but also 1-m mappings and sometimes m-1 mappings. We have also described a fully automatic XSLT generation system that generates XSLT rules based on the mappings found. Our experiments showed that XSLTGen can achieve high matching accuracy and produce high-quality stylesheets.

References
1. S. Bergamaschi, S. Castano, S.D.C.D. Vimercati, and M. Vincini. An Intelligent Approach to Information Integration. In Proceedings of the 1st International Conference on Formal Ontology in Information Systems, pages 253–267, Trento, Italy, June 1998.
2. S.S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change Detection in Hierarchically Structured Information. In Proceedings of the 1996 International Conference on Management of Data, pages 493–504, Montreal, Canada, June 1996.
3. J. Clark. XSL Transformations (XSLT) Version 1.0. W3C Recommendation, November 1999. http://www.w3.org/TR/xslt.
4. A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to Map between Ontologies on the Semantic Web. In Proceedings of the 11th International Conference on World Wide Web, pages 662–673, Honolulu, USA, May 2002.
5. M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri, and K. Shim. XTRACT: Learning Document Type Descriptors from XML Document Collections. Data Mining and Knowledge Discovery, 7(1):23–56, January 2003.
6. M. Kay. XSLT Programmer's Reference. Wrox Press Ltd., 2000.
7. L. Kurgan, W. Swiercz, and K.J. Cios. Semantic Mapping of XML Tags using Inductive Machine Learning. In Proceedings of the 2002 International Conference on Machine Learning and Applications, pages 99–109, Las Vegas, USA, June 2002.
8. M. Leventhal. XSL Considered Harmful. http://www.xml.com/pub/a/1999/05/xsl/xslconsidered_1.html, 1999.
9. T. Milo and S. Zohar. Using Schema Matching to Simplify Heterogeneous Data Translation. In Proceedings of the 24th International Conference on Very Large Data Bases, pages 122–133, New York, USA, August 1998.
10. K. Ono, T. Koyanagi, M. Abe, and M. Hori. XSLT Stylesheet Generation by Example with WYSIWYG Editing. In Proceedings of the 2002 International Symposium on Applications and the Internet, Nara, Japan, March 2002.
11. E. Rahm and P.A. Bernstein. A Survey of Approaches to Automatic Schema Matching. VLDB Journal, 10(4):334–350, December 2001.
12. S.M. Selkow. The Tree-to-Tree Editing Problem. Information Processing Letters, 6(6):184–186, December 1977.
13. K.C. Tai. The Tree-to-Tree Correction Problem. Journal of the ACM, 26(3):422–433, July 1979.

Efficient Recursive XML Query Processing in Relational Database Systems

Sandeep Prakash1, Sourav S. Bhowmick1, and Sanjay Madria2

1 School of Computer Engineering, Nanyang Technological University, Singapore
2 Department of Computer Science, University of Missouri-Rolla, Rolla, MO 65409

Abstract. There is growing evidence that schema-conscious approaches are a better option than schema-oblivious techniques as far as XML query performance in a relational environment is concerned. However, the issue of recursive XML queries for such approaches has not been dealt with satisfactorily. In this paper we argue that it is possible to design a schema-oblivious approach that outperforms schema-conscious approaches for certain types of recursive queries. To that end, we propose a novel schema-oblivious approach called SUCXENT++ that outperforms existing schema-oblivious approaches such as XParent by up to 15 times and schema-conscious approaches (Shared-Inlining) by up to 3 times for recursive query execution. Our approach has up to 2 times smaller storage requirements than existing schema-oblivious approaches and 10% smaller than schema-conscious techniques. In addition, existing schema-oblivious approaches are hampered by poor query plans generated by the relational query optimizer. We propose optimizations in the XML-query-to-SQL translation process that generate queries with better query plans.

1 Introduction

Recursive XML queries are considered to be quite significant in the context of XML query processing [3], and yet this issue has not been addressed satisfactorily in the existing literature. Recursive XML queries are XML queries that contain the descendant axis (//). The use of '//' is quite common in XML queries due to the semi-structured nature of XML data [3]. For example, consider the XML document in Figure 2. The element item could occur either under europe or africa. Consider the scenario where a user needs to retrieve all item elements. The user will have to execute the path expression Q = /site//item. Another scenario could be that the document structure is not completely known to the user except that each item has a name and price. Suppose the user needs to


find out the price of the item with name “Gold Ignot”. Q = //item[name=“ Gold lgnot”]/price will be the corresponding path expression. Efficient execution of XML queries, recursive or otherwise, is largely determined by the underlying storage approach. There has been a substantial research effort in storing and processing XML data using existing relational databases [1, 6,2]. These approaches can be broadly classified as: (a) Schema-conscious approach: This method first creates a relational schema based on the DTD of the XML documents. Examples of such approach is the inlining approach [5]. (b) Schema-oblivious approach: This method maintains a fixed schema which is used to store XML documents irrespective of their DTD. Examples of schemaoblivious approaches are the Edge approach [1], XRel [7] and XParent [2]. Schema-oblivious approaches have obvious advantages such as the ability to handle XML schema changes better as there is no need to change the relational schema and a uniform query translation approach. Schema-conscious approaches, on the other hand, have the advantage of more efficient query processing [6]. Also, no special relational schema needs to be designed for schema-conscious approaches as it can be generated on the fly based on the DTD of the XML document (s). In this paper, we present an efficient approach to process recursive XML queries using a schema-oblivious approach. At this point, one would question the justification of this work for two reasons. First, this issue may have already been addressed. Surprisingly, this is not the case as highlighted in [3]. Second, a growing body of work suggests that schema-conscious approaches perform better than schema-oblivious approaches. In fact, Tian et al. have demonstrated in [6] that schema-conscious approaches generally perform substantially better in terms of query processing and storage size. However, the Edge approach [1] was used as the representative schema-oblivious approach for comparison. Although the Edge approach is a pioneering relational approach, we argue that it is not a good representation of the schema-oblivious approach as far as query processing is concerned. In fact, XParent [2] and XRel [7] have been shown to outperform the Edge approach by up to 20 times, with XParent outperforming XRel [2]. However, this does not mean that XParent outperforms schema-conscious approaches. In fact as we will show in Section 6, schema-conscious approaches still outperform XParent. Hence, it may seem that schema-conscious generally outperforms schema-oblivious in terms of query processing. In this paper we argue that it is indeed possible to design a schema-oblivious approach that can outperform schema-conscious approaches for certain types of recursive queries. To justify our claim, we propose a novel schema-oblivious approach, called SUCXENT++ (Schema Unconcious XML Enabled System (pronounced “succinct++”)), and investigate the performance of recursive XML queries. We only store the leaf nodes and the associated paths together with two additional attributes for efficient query processing (details follow in Section 3). SUCXENT++ outperforms existing schema-oblivious techniques, such as XParent, by up to 15 times and shared-inlining - a schema-conscious approach - by up to 3 times for recursive queries with characteristics described in Section 6. In addition,


Fig. 1. Sample DTD.

Fig. 2. Sample XML document.

SUCXENT++ can reconstruct shredded documents up to 2 times faster than Shared-Inlining. The main reasons SUCXENT++ performs better than existing approaches are 1) Significantly lower storage size and, consequently, lower I/O-cost associated with query processing, 2) Fewer number of joins in the corresponding SQL queries and, 3)Additional optimizations, discussed in Section 5, that are made to improve the query plan generated by the relational query optimizer. In summary, the main contributions of this paper are: (1) A novel schema-oblivious approach whose storage size depends only on the number of leaf nodes in the document. (2) Optimizations to improve the query plan generated by the relational query optimizer. Traditional schema-oblivious approaches


have been hampered by the poor query plan selection of the underlying relational query optimizer [6,8]. (3) To the best of our knowledge, this is the first attempt to show that it is indeed possible to design a schema-oblivious approach that can outperform schema-conscious approaches as far as the execution of certain types of recursive XML queries is concerned.

2 Related Work

All existing schema-oblivious approaches store, at the very least, every node in the XML document. The Edge approach [1] essentially captures the edge information of the tree that represents the XML document. However, resolving ancestor-descendant relationships requires the traversal of all the edges from the ancestor to the descendant (or vice versa). The system proposed by Zhang et al. in [8] labels each node with its preorder and postorder traversal numbers. Then, ancestor-descendant relationships can be resolved in constant time using the property preorder(ancestor) < preorder(descendant) and postorder(ancestor) > postorder(descendant). It still results in as many joins as there are path separators. To solve the problem of multiple joins, XRel [7] stores the path of each node in the document. Then, the resolution of path expressions only requires the paths (which can be represented as strings) to be matched using string matching operators. However, the XRel approach still makes use of the containment property mentioned above to resolve ancestor-descendant relationships. It involves joins with (< or >) operators that have been shown to be quite expensive due to the manner in which an RDBMS processes joins [8]. In fact, special algorithms such as the multi-predicate merge sort join algorithm [8] have been proposed to optimize these operations. However, to the best of our knowledge there is no off-the-shelf RDBMS that implements these algorithms. XParent [2] solves the problem of these θ-joins by using an Ancestor table that stores all the ancestors of a particular node in a single table. It then replaces them with equi-joins over this set of ancestors. However, this approach results in an explosion in the database size compared to the original document. The number of relational joins is also quite substantial: XParent requires a join between the LabelPath, Data Path, Element and Ancestor tables for each path in the query expression. The joins are quite expensive, especially when the Ancestor table is involved, as it can be quite large in size. SUCXENT++ is different from existing approaches in that it only stores leaf nodes and their associated paths. We store two additional attributes, called BranchOrder and BranchOrderSum, for each leaf node that capture the relationship between leaf nodes. Essentially, they allow the determination of common nodes between the paths of any two leaf nodes in constant time. This results in a substantial reduction in storage size and query processing time. In addition, we propose optimizations that enable the underlying relational query optimizer to generate near-optimal query plans for our approach, resulting in a substantial performance improvement. Our studies indicate that these optimizations can be applied to other schema-oblivious approaches as well.
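As a small aside, the preorder/postorder containment test used by [8] (and, through the containment property, by XRel) can be illustrated as follows; the labelling walk and the sample tree here are invented for the example.

    # Toy illustration of interval labelling: node a is an ancestor of node d iff
    # pre(a) < pre(d) and post(a) > post(d).
    def label(tree, pre=None, post=None, counter=None):
        # tree: (name, [children]); returns {name: (preorder, postorder)}
        if pre is None:
            pre, post, counter = {}, {}, {"pre": 0, "post": 0}
        name, children = tree
        counter["pre"] += 1
        pre[name] = counter["pre"]
        for c in children:
            label(c, pre, post, counter)
        counter["post"] += 1
        post[name] = counter["post"]
        return {n: (pre[n], post[n]) for n in pre}

    def is_ancestor(labels, a, d):
        return labels[a][0] < labels[d][0] and labels[a][1] > labels[d][1]

    doc = ("site", [("regions", [("europe", [("item", [("name", []), ("price", [])])])])])
    L = label(doc)
    print(is_ancestor(L, "regions", "price"))   # True
    print(is_ancestor(L, "price", "name"))      # False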

Fig. 3. XParent schema.

Fig. 4. SUCXENT++ schema.

Fig. 5. SUCXENT++: XML data in RDBMS.

Schema-oblivious approaches are not influenced by recursion in the schema. However, the Edge approach uses recursive SQL queries using the SQL99 with construct to evaluate recursive XML queries. XParent and XRel handle recursive queries like any other query. Unlike these schema-oblivious approaches, schemaconscious strategies have to treat recursion in both schema and queries as special cases. In [3], the authors propose a generic algorithm to translate recursive XML queries for schema-conscious approaches using the SQL99 with construct. However, no performance evaluation of the resulting SQL queries is presented and it is assumed that schema-conscious approaches will outperform schema-oblivious approaches. SUCXENT++ also treats recursive XML queries like other queries. It also implements optimizations to generate SQL translations of recursive XML queries that enable the relational query optimizer to produce better query plans resulting in significant performance gains.

3 Storing XML Data

In this section, we first discuss the SUCXENT++ schema. This will be followed by a formal algorithm to reconstruct XML documents from their relational form. The document in Figure 2 is used as a running example.

3.1 SUCXENT++ Schema

The schema is shown in Figure 4 and the shredded document in Figure 5. The semantics of the schema is as follows. The Document table is used for storing the names of the documents in the database. Each document has unique id recorded in DocID.Path is used to record the path of all the leaf nodes. For example, the path of the first leaf node name in Figure 2 is /site/regions/europe/item/ name. This table maintains path_ids, relative path expressions and their length recorded as instances of PathID, PathExp and Length respectively. This is to reduce the storage size so that we only need to store path_id in the PathValue table. The Length attribute is useful for resolving recursive queries. PathValue stores only the leaf nodes. The DocID attribute indicates which XML document a particular leaf node belongs to. The PathID attribute maintains the id of the path of a particular leaf node as stored in Path. LeafOrder records the node order of leaf nodes in an XML tree. For example, when the sample XML document is parsed, the leaf node name with value “Gold Ignot” is encountered as the first leaf node. Therefore, it is assigned a LeafOrder value of 1.Branchorder of a leaf node is the level at which it intersects the preceding leaf node i.e., it is the level of the highest common ancestor of the leaf nodes under consideration. Consider the leaf node with LeafOrder=2 in Figure 2. This leaf node intersects the leaf node with LeafOrder=1 at the node item which is at level 4. So, the Branchorder value for this node is 4. Similarly, the node name with value “Item2” has Branchorder=2 (intersecting the node to the left at regions). PathValue stores the textual content of the leaf nodes in the column LeafValue. The attribute Branchorder in this table is useful for reconstructing the XML documents from their shredded relational format as discussed in Section 3.2. The significance of DocumentRValue and BranchOrderSum in PathValue is elaborated in Section 4 and CPathId in Path is discussed in Section 5. For the remainder of the paper, we will refer to LeafOrder and Branchorder as Order information.
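As a toy illustration of how LeafOrder and BranchOrder could be derived while shredding a document (this is not the SUCXENT++ loader: attribute nodes, PathId assignment and the Document table are omitted, the root is counted as level 1, and the sample price value is invented):

    # Sketch: shred leaf nodes into (LeafOrder, BranchOrder, path, value) rows.
    # BranchOrder = level of the deepest node shared with the previous leaf's path
    # (0 for the first leaf).
    import xml.etree.ElementTree as ET

    def shred(xml_text):
        root = ET.fromstring(xml_text)
        rows, prev_path = [], []

        def walk(node, path):
            nonlocal prev_path
            children = list(node)
            if not children:                       # a leaf node
                branch = 0
                for a, b in zip(prev_path, path):  # deepest common ancestor level
                    if a is not b:
                        break
                    branch += 1
                rows.append((len(rows) + 1, branch,
                             "/" + "/".join(n.tag for n in path),
                             (node.text or "").strip()))
                prev_path = path
            for c in children:
                walk(c, path + [c])

        walk(root, [root])
        return rows

    for row in shred("<site><regions><europe><item><name>Gold Ignot</name>"
                     "<price>120</price></item></europe></regions></site>"):
        print(row)
    # (1, 0, '/site/regions/europe/item/name', 'Gold Ignot')
    # (2, 4, '/site/regions/europe/item/price', '120')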

3.2 Extraction of XML Documents

The algorithm for reconstruction is presented in Figure 6. The input to the algorithm is a list of leaf nodes arranged in ascending LeafOrder. Each leaf node path is first split into its constituent nodes (lines 5 to 7). If the document construction has not yet started (line 10) then the first node obtained by splitting the first leaf node path is made the root (lines 11 to 15). When the next leaf node is processed we only need to look at the nodes after Branchorder of that node as the nodes up to this level have already been added to the document


Fig. 6. Extraction algorithm.

(lines 20 to 22). The remaining are now added to the document (lines 27 to 32). Document extraction is completed once all the leaf nodes have been processed. In addition to reconstructing the whole document, this algorithm can be used to construct a document fragment given a partial list of consecutive leaf nodes.
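A companion sketch of the reconstruction idea is shown below. It is only an illustration of Fig. 6, not the actual algorithm: it assumes rows sorted by LeafOrder with element-only paths (no attributes or mixed content), and the sample rows are invented.

    # Rebuild a document from (BranchOrder, path, value) rows ordered by LeafOrder,
    # reusing the already built ancestors up to each leaf's BranchOrder.
    import xml.etree.ElementTree as ET

    def rebuild(rows):
        stack = []                       # elements on the current root-to-leaf path
        root = None
        for branch, path, value in rows:
            tags = path.strip("/").split("/")
            if root is None:
                root = ET.Element(tags[0])
                stack = [root]
                start = 1
            else:
                stack = stack[:branch]   # keep nodes up to the branch level
                start = branch
            for tag in tags[start:]:
                elem = ET.SubElement(stack[-1], tag)
                stack.append(elem)
            stack[-1].text = value
        return root

    rows = [(0, "/site/regions/europe/item/name", "Gold Ignot"),
            (4, "/site/regions/europe/item/price", "120"),
            (2, "/site/regions/africa/item/name", "Item2")]
    print(ET.tostring(rebuild(rows), encoding="unicode"))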

4 Recursive Query Processing

Consider the recursive query XQuery 1 in Figure 7. A tree representation of the query is shown in Figure 8. This query returns those price leaf nodes that intersect the constraint-satisfying text leaf node at item. Consider how XParent resolves this query. The schema for XParent is shown in Figure 3. XParent evaluates this query by locating leaf nodes from the Data table that satisfy the constraint on text. This involves a join between LabelPath and Data to satisfy the path constraint /site/regions/africa/item//text and a predicate on Data to satisfy the value constraint. Next, the LabelPath and Data tables are joined again to obtain those leaf nodes that satisfy /site/regions/africa/item/price. These two result sets are joined using the Ancestor table to find nodes that have a common ancestor at level 4 (at item). Thus, the final SQL query involves five joins: two between LabelPath and Data, two between Data and Ancestor, and one between two Ancestor tables (SQL query translation details for XParent can be found in [2]). These joins can be quite expensive due to the large size of Ancestor. XRel follows a similar approach to

Fig. 7. Running example.

Fig. 8. Query Tree.


resolving path expressions, except that it uses the ancestor-descendant containment property instead of an Ancestor table. This produces θ-joins, resulting in performance worse than XParent. A detailed evaluation of XRel vs. XParent can be found in [2].

4.1 The SUCXENT++ Approach

In order to reduce the I/O cost involved in query evaluation, SUCXENT++ only stores the leaf nodes of a document. However, the attributes discussed till now are insufficient for query processing. The schema needs to be extended as follows. An attribute BranchOrderSum, denoted as is assigned to a leaf node with LeafOrder In addition, we store an attribute RValue, in the DocumentRvalue table for each level, in the document. Essentially, these allow the determination of common nodes between the paths of any two leaf nodes in a constant time. This results in a substantial reduction in storage size and query processing time. Given an XML document with maximum depth D the RValue and BranchOrderSum assignment is done as follows. (1) RValue is assigned recursively based on the equation: where (a) is the maximum number of consecutive leaf nodes with (b) (2) Let us denote the BranchOrder of a node with LeafOrder as Then, the BranchOrderSum of this node is We illustrate the above attributes with an example. Consider the document in Figure 2. For simplicity, ignore the parlist element. Then, the depth of the document in 6. So, and This means that The maximum number of consecutive leaf nodes with BranchOrder 5 is 1. Therefore, The maximum number of consecutive leaf nodes withBranchOrder 4 is 3 (e.g., price, text, keyword under the first item element). So, BranchOrderSum of the first leaf node is 0. Since BranchOrder of the second leaf node is 4 and BranchOrderSum of the second leaf node is 3. The values for the complete document are shown in DocumentRValue and PathValue of Figure 5. Lemma 1. If then nodes with LeafOrders and intersect at a level greater than That is, where is the level at which nodes with leaf orders and intersect. The proof for the above lemma is not presented here due to space constraints. The attributes RValue and BranchOrderSum allow the determination of the intersection level between any two leaf nodes in a more or less constant time, whereas in XParent, it depends on the size of the Ancestor and Data tables as a join between these tables is required to determine the ancestor node at a particular level. This reduces the query processing time drastically. Since this is achieved without storing separate ancestor information, the storage requirements are also reduced significantly. We will now discuss how these attributes are useful in query processing. Consider XQuery1. The BranchOrderSum value for the first constraint satisfying text is 6. The BranchOrderSum value for the first price node is 3. Also,


10. Using the property proven above we conclude that these two nodes have ancestors till a since Since, item is at level 4 in both cases it is clear that they have a common item node and, therefore, satisfy the query. Similarly, we can conclude that the first text node and the item node with name Item3 intersect at a level > 1 (since and and therefore do not form a part of the query result.

4.2 SQL Translation

We have implemented an algorithm to translate XQuery queries to SQL in SUCXENT++. Due to space constraints we discuss the translation procedure informally. Consider the recursive query of Figure 7 (XQuery 1) and its corresponding SQL translation (SQL 1). The translation can be explained as follows: (1) Lines 5, 7 and 8 translate the part of the query that seeks an entry with contains(text,“Gold Ignot”). Note that we store only the leaf nodes, their textual content and path_id) in the PathValue table. The actual path expression corresponding to the leaf node is stored in the Path table. Therefore, we need to join the two to obtain leaf nodes that correspond to the path /site/regions/africa/item//text and contain the phrase “Gold Ignot”. Notice that the corresponding SQL translation has the LIKE clause to resolve the // relationship. This is how recursive queries are handled in SUCXENT++. (2) Lines 6 and 7 do the same for the extraction of leaf nodes that correspond to the path /site/regions/africa/item/price. (3) Line 9 ensures that the leaf nodes extracted in Lines 5 to 8 belong to the same document. (4) Line 10 ensures that the two sets of leaf nodes intersect at least level 4. The reason a level 4 ancestor is needed is that the two paths in the query intersect at level 4. It calculates the absolute value of the difference between the BranchOrderSum values and ensures that it is below the RValue for level 4. (5) Line 1 returns the properties of the leaf nodes corresponding to the price element. These properties are needed to construct the corresponding XML fragment based on the algorithm in Figure 6. Say, the return clause in Figure 7 was $b. Then, line 6 in the translation would change to p2.PathExp LIKE ’/site/regions/africa/item%’ to extract all leaf nodes that have paths beginning with $b. This way, elements and their children can be retrieved. Compared to XParent, SUCXENT++ uses only the PathValue, Path and DocumentRValue tables to evaluate a query. The size of the PathValue and Path tables is the same as that of the Data and LabelPath tables in XParent. DocumentRValue has the same number of rows as the depth of the document as compared to the Ancestor table in XParent which stores the ancestor list of every node in the document. This results in substantially better query performance in addition to much smaller storage size.
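Although SQL 1 itself is not reproduced here, the translation just described might look roughly as sketched below under the schema of Sect. 3. This is only an approximation: the DocumentRValue columns (DocId, Level, RValue), the selected output columns, the placement of the % wildcard for //, and the strictness of the comparison against the RValue are all assumptions, not the paper's exact SQL.

    # Rough sketch of the XQuery-to-SQL translation for XQuery 1 (Fig. 7):
    # v1/p1 handle the value constraint on //text, v2/p2 the returned price nodes.
    def translate(value_path_like, value_text, return_path, meet_level):
        return f"""
        SELECT v2.DocId, v2.LeafOrder, v2.BranchOrder, v2.LeafValue, v2.PathId
        FROM PathValue v1, Path p1, PathValue v2, Path p2, DocumentRValue r
        WHERE v1.PathId = p1.Id AND p1.PathExp LIKE '{value_path_like}'
          AND v1.LeafValue LIKE '%{value_text}%'
          AND v2.PathId = p2.Id AND p2.PathExp = '{return_path}'
          AND v1.DocId = v2.DocId
          AND r.DocId = v1.DocId AND r.Level = {meet_level}
          AND ABS(v1.BranchOrderSum - v2.BranchOrderSum) < r.RValue
        """

    print(translate("/site/regions/africa/item%/text", "Gold Ignot",
                    "/site/regions/africa/item/price", 4))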

5 Optimizations

A preliminary performance evaluation using the above translation procedure yielded some interesting results. We checked the query plans generated by the


Fig. 9. Initial query plan.

Fig. 10. Path optimization.

Fig. 11. Multiple-queries optimization.

query optimizer and noticed that the join between the Path and PathValue tables took a significant portion of the query processing time. This was because for most of the queries this join was being performed last. For example, in SQL 1 of Figure 7 the joins in lines 8 to 10 were evaluated first, and only then was the join between the Path and PathValue tables performed. The initial query plan is shown in Figure 9. We have not shown the DocumentRValue table in the plan, even though the query optimizer includes it, as it does not influence the optimization. The two Hash-Joins (labelled 1 and 2) in this plan are both very expensive. The first takes the PathValue table (with alias v2) as one of its inputs. The second join takes the result of this join as one of its inputs. Both these inputs are quite substantial in size, resulting in very expensive join operations. In order to improve the above query plans we propose three optimizations that are discussed below.

Optimization for Simple Path Expressions. The join expression v1.PathId = p1.Id and p1.PathExp = path is replaced with v1.PathId = n, where n is the PathId value corresponding to path in the table Path. Similarly, v1.PathId = p1.Id and p1.PathExp LIKE path% is replaced with a range condition v1.PathId >= n and v1.PathId <= m, where n and m delimit the PathId values whose path expressions match path%.

WUML: A Web Usage Manipulation Language for Querying Web Log Data

Fig. 1. WUML query tree

Fig. 2. Optimized WUML query tree

<selectClause> ::= SELECT <pageList>
::= query [, query...]
::= FROM <matrixIdentifier> | FROM <matrixList>
::= WHERE LINKWT integer
::= GROUP BY <pageList>
<pageList> ::= pageIdentifier [, pageIdentifier...]
<matrixList> ::= matrixIdentifier [, matrixIdentifier...]

There are four main clauses in a query expression: the select, from, condition, and group clauses. Among them the select and from clauses are compulsory, while the condition and group clauses are optional. Similar to SQL, WUML is a simple declarative language, but it is powerful enough to express queries on the log information stored as navigation matrices. We execute a WUML expression by translating it into a sequence of NLA operations using Algorithm 2. Suppose is a set of Web pages in the select clause, is a set of matrices in the from clause, is a set of

Web pages in the group clause, and are two non-negative integers. Note that the input WUML query expression is assumed to be syntactically valid. If or then Fig. 1 depicts a query tree which illustrates the basic idea. Now, we present a set of examples, which illustrates the usage of the WUML expressions and the translation into the corresponding sequence of NLA operations. Let M, and be navigation matrices. We want to know how frequently the pages and were visited. WUML expression: SELECT FROM SUM NLA operation: We want to find out the essential difference of preferences between the two groups of users in and We consider those links having the weight > 3. WUML expression: SELECT FROM DIFF WHERE LINKWT > 3. NLA operation: We want to get the navigation details in an information unit [9] consisting of the pages and We may gain insight to decide whether it is better to combine these three Web pages or not. So we consider them as a group. WUML expression: SELECT FROM SUM GROUP BY NLA operation: We want to know whether some pages were visited by users after 3 clicks. If they were seldom visited or lately visited in a user session, we may decide to remove or update them to make them more popular. WUML expression: SELECT FROM POWER 3 M. NLA operation: Now we want to get the specific information of a particular set of Web pages. WUML expression: SELECT FROM INTERSECT WHERE LINKWT > 6. NLA operation:
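The navigation matrices and NLA operators used in these examples can be pictured with a small amount of code. The following Python sketch (ours, not the authors' implementation) represents a navigation matrix as a nested dictionary of link weights and implements sum, projection and selection in the spirit of the operators above; the function names and the dictionary layout are assumptions for illustration only.

# A navigation matrix maps source page -> target page -> link weight.
# Entries absent from the dictionary are treated as zero.
def nla_sum(m1, m2):
    """Element-wise sum of two navigation matrices over the same page set."""
    result = {}
    for src in set(m1) | set(m2):
        row = {}
        for dst in set(m1.get(src, {})) | set(m2.get(src, {})):
            row[dst] = m1.get(src, {}).get(dst, 0) + m2.get(src, {}).get(dst, 0)
        result[src] = row
    return result

def nla_project(m, pages):
    """Keep only links whose source and target both belong to `pages`."""
    return {src: {dst: w for dst, w in row.items() if dst in pages}
            for src, row in m.items() if src in pages}

def nla_select(m, min_weight):
    """Keep only links whose weight exceeds `min_weight` (WHERE LINKWT > n)."""
    return {src: {dst: w for dst, w in row.items() if w > min_weight}
            for src, row in m.items()}

# Example: combine two logs' matrices, then restrict to two pages of interest.
m1 = {"p1": {"p2": 3}, "p2": {"p3": 1}}
m2 = {"p1": {"p2": 2, "p3": 4}}
print(nla_project(nla_sum(m1, m2), {"p1", "p2"}))   # {'p1': {'p2': 5}, 'p2': {}}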


Let us again consider the last query above. We will see in Sect. 6 that the running time of the NLA operators is proportional to the number of non-zero elements in the operand matrix. Therefore, the optimal plan is to first execute the NLA operators that minimize the number of non-zero elements in the matrix. For the sake of efficiency, the projection should be executed as early as possible, so a better NLA execution plan can be obtained by pushing the projection down. We now summarize some optimization rules, as depicted in Fig. 2. First, projection should be done as early as possible, since it eliminates some non-zero elements; note, however, that projection is not distributive over difference and power. Second, since selection is not distributive over some binary operators such as difference, we do not change its execution order. Finally, grouping creates a view different from the underlying Web topology; therefore, it should be done as the last step, except for operators that take another navigation matrix whose structure is the same as the grouped one. Note that these rules are simple heuristics for ordering NLA operations. A more systematic way to generate an optimized execution plan for a given WUML expression remains to be found.

5

Storage Schemes for Navigation Matrices

As the navigation matrices generated from the Web log files are usually sparse, the storage scheme of a matrix greatly affects the performance of WUML. In this section we introduce three storage schemes, COO, CSR, and CSC, to study their impact on the NLA operations. In the literature, the storage of sparse matrices has been studied intensively [3,8]. In our WUML environment, we store the navigation matrix in three separate parts: the first row (i.e., the weights of the links starting from S), the last column (i.e., the weights of the links ending in F), and the square matrix obtained by excluding the rows and columns of S and F. We employ two vectors, each containing an array of the non-zero values together with their corresponding indices, to store the first row and the last column. Tables 2 and 3 show examples using the matrix in Table 1. As for the third part, we implement the storage schemes proposed in [8] and illustrate them using the matrix in Table 1. The Coordinate (COO) storage scheme is the most straightforward structure for representing a sparse matrix. As illustrated in Table 4, it records each nonzero


entry together with its column and row index, in three arrays, in row-first order. Similar to COO, the Compressed Sparse Row (CSR) storage scheme also consists of three arrays. It differs from COO in its Compressed Row array, which stores, for each row, the location of that row's first non-zero entry. Table 5 shows the structure of CSR. The Compressed Sparse Column (CSC) storage scheme, as shown in Table 6, is similar to CSR. It has three arrays: a Nonzero array holding the non-zero values in column-first order, a Compressed Column array holding the location of the first non-zero entry of each column, and a Row array for the row indices. CSC is the transpose of CSR. There are also other sparse matrix storage schemes, such as Compressed Diagonal Storage (CDS) and Jagged Diagonal Storage (JDS) [12]. However, they are designed for banded sparse matrices, and in practice a navigation matrix is generally not banded. Therefore, these schemes are not studied in our experiments.
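As a small illustration of the difference between COO and CSR, the sketch below builds both layouts in Python for a made-up 3 x 3 matrix (not the matrix of Table 1); the dictionary field names are ours.

import numpy as np

dense = np.array([[0, 3, 0],
                  [0, 0, 2],
                  [5, 0, 0]])

# COO: one entry per non-zero value, stored in row-first order.
rows, cols = np.nonzero(dense)
coo = {"values": dense[rows, cols].tolist(),
       "row_index": rows.tolist(),
       "col_index": cols.tolist()}

# CSR: same values and column indices, but the row array is compressed into
# row_ptr, where row_ptr[i] is the position of the first non-zero of row i.
row_ptr = [0]
for i in range(dense.shape[0]):
    row_ptr.append(row_ptr[-1] + int(np.count_nonzero(dense[i])))
csr = {"values": coo["values"],
       "col_index": coo["col_index"],
       "row_ptr": row_ptr}

print(coo)   # {'values': [3, 2, 5], 'row_index': [0, 1, 2], 'col_index': [1, 2, 0]}
print(csr)   # {'values': [3, 2, 5], 'col_index': [1, 2, 0], 'row_ptr': [0, 1, 2, 3]}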

6

Experimental Results and Analysis

We carry out a set of experiments to compare the performance of the three storage schemes introduced in Sect. 5. We also study the usability and efficiency of WUML on different data sets. The data sets we use are synthetic Web logs over different Web topologies, generated by the log generator designed in [10]. The parameters used to generate the log files are described in Table 7. Among these four parameters, PageNum and MeanLink depend on the underlying Web topology, while the other two do not. The experiments are run on a machine with a Pentium 4 2.5 GHz CPU and 1 GB of RAM.

6.1

Construction Time of Storage Schemes

We choose three data sets, whose components represent the parameters LogSize, UserNum, PageNum and MeanLink, respectively. We then construct the three storage schemes from the log files generated for these data sets.


Our measurement of the system response time includes I/O processing time and CPU processing time. As shown in Fig. 3, the response time grows significantly as the parameters increase. Since most of the time is consumed in reading the log files, the construction time for the same given data set varies slightly among the three storage schemes. But it still takes more time to construct COO than the other two schemes, since there is no compressed array for COO. CSC needs more time than CSR because the storage order in CSC is column-first while reading in the file is in row-first order.

Fig. 3. Construction Time

6.2

Fig. 4. Running Four Operators

Processing Time of Binary Operators

We present the CPU processing time results of four binary operators: sum, union, difference and intersection. Each time we tune one of the four parameters to see how the processing time changes for the COO, CSR and CSC storage schemes. For each parameter, we carry out experiments on ten different sets of Web logs. We first compare the processing time of each single operator under the different storage schemes. Then we present the processing time of each storage scheme under the different operators. Tuning LogSize. We set UserNum and PageNum to 3000, and MeanLink to 5. The results are shown in Fig. 5. When LogSize increases, the processing time of the same operator on each storage scheme also increases. The reason is that the number of non-zero elements in the navigation matrix grows with the increase of LogSize, and therefore the operations need more time. Tuning PageNum. We set LogSize to 5000, UserNum to 3000, and MeanLink to 5. The results are presented in Fig. 7. With the growth of PageNum, the CPU time for each operator on a given storage scheme grows quickly. This is because PageNum is a significant parameter when constructing the navigation matrix: the more pages in the Web site, the larger the dimension of the navigation matrix, and consequently the more time needed to construct it. Tuning UserNum. Figure 6 shows the results when LogSize is 5000, PageNum is 3000 and MeanLink is 5. The processing time remains almost unchanged when UserNum grows. The main reason is that, although different users may behave differently when traversing the Web site, the number of non-zero elements in the navigation matrix is roughly the same due to the fixed LogSize.


Fig. 5. The CPU time by tuning LogSize

Tuning MeanLink. We use the log files with LogSize of 5000, UserNum of 3000 and PageNum of 3000. The results are shown in Fig. 8 which indicates that, with the increase of MeanLink, the processing time decreases. Note that for sum, COO always outperforms others, while CSR and CSC perform almost the same (see Fig. 5(a), 6(a), 7(a) and 8(a)). The similar phenomenon can be observed in Fig. 5(d), 6(d), 7(d) and 8(d) for intersection. As shown in Fig. 5(b), 6(b), 7(b) and 8(b), the processing time for union on three storage schemes has no significant difference. Finally, from Fig. 5(c), 6(c), 7(c) and 8(c), the performances of CSR and CSC are much better than COO for difference. Note also that from Fig. 4, difference requires the most processing time, and sum needs the least. The Web logs used are of 5000 LogSize, 1000 UserNum, 3000 PageNum and 5 MeanLink. The reason for this result is as follows. As we have mentioned, we do not need to check the balance of Web pages and the validity of the navigation matrix for sum. Therefore, it takes the least time. For union, we only need to check the balance of Web pages without checking the validity of the output matrix. But for difference and intersection, we have to check both the page balance and matrix validity, which is rather time-consuming. It can be found that intersection does not need much time since there are very few non-zero elements in the output matrix.


Fig. 6. The CPU time by tuning UserNum

6.3

Performance of Unary Operators

Power. We use log files with 5000 LogSize, 3000 UserNum, 5 MeanLink, and PageNum values of 100, 500 and 1000. Each matrix is raised to the power of 2 (i.e., power = 2). As shown in Fig. 9, COO performs much worse than CSR and CSC. We also see that power is a rather time-consuming operator. Projection and Selection. Since projection and selection are commutative, we study the time cost of swapping them on the navigation matrix with 5000 LogSize, 5000 PageNum, 3000 UserNum and 5 MeanLink. As shown in Fig. 10, doing projection before selection is more efficient than doing selection and then projection. According to this result, we can perform some optimization when interpreting queries. Moreover, COO outperforms CSR and CSC. From the experimental results shown above, we make the following observations. First, from the construction point of view, CSR is the best. Second, COO is the best for sum and intersection. Third, CSR and CSC perform almost the same for difference and power, and greatly outperform COO. Finally, COO, CSR and CSC perform the same for union. Taking these observations into consideration, CSR is the best choice for our WUML expressions. Although COO performs better for sum and intersection, it needs an intolerable amount of time for difference. And although CSC performs the same as CSR with respect to the operations, it needs more time to be constructed. We also observe that


Fig. 7. The CPU time by tuning PageNum

the time growth for each operator is linear to the growth of parameters, which indicates that the usability and scalability of WUML is acceptable in practice.

7

Concluding Remarks

We presented NLA, which consists of a set of operators on navigation matrices, and proposed an efficient algorithm, VALID, to ensure the validity of the matrices output by the NLA operators. Building on NLA, we developed a query language, WUML, and studied the mapping between WUML statements and NLA expressions. To choose an efficient storage scheme for the sparse navigation matrix, we carried out a set of experiments on different synthetic Web log data sets, generated by tuning parameters such as the number of pages, the mean number of links and the number of users. The experimental results on the three storage schemes COO, CSC and CSR show that the CSR scheme is relatively efficient for NLA. As for future work, we plan to develop a full-fledged WUML system to perform both analysis and mining of real Web log data sets. We are also studying a more complete set of optimization heuristics for the NLA operators in order to generate better execution plans for an input WUML expression.


Fig. 8. The CPU time by tuning MeanLink

Fig. 9. Power (in log scale)

Fig. 10. Projection and Selection

References 1. R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proc. of VLDB, 1994. 2. R. Agrawal and R. Srikant. Mining sequential patterns. In Proc. of ICDE, 1995. 3. Nawaaz Ahmed et al. A framework for sparse matrix code synthesis from high-level specifications. In Proc. of the 2000 ACM/IEEE Conf. on Supercomputing, 2000. 4. M. Baglioni et al. Preprocessing and mining web log data for web personalization. The 8th Italian Conf. on AI, 2829:237–249, September 2003.


5. B. Berendt and M. Spiliopoulou. Analyzing navigation behavior in web sites integrating multiple information systems. The VLDB Journal, 9(1), 2000. 6. L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the world wide web. Journal of Artificial Intelligence Research, 27(6):1065–1073, 1995. 7. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Web mining: Information and pattern discovery on the world wide web. ICTAI, 1997. 8. N. Goharian, A. Jain, and Q. Sun. Comparative analysis of sparse matrix algorithms for information retrieval. Journal of Sys., Cyb. and Inf., 1(1), 2003. 9. Wen-Syan Li et al. Retrieving and organizing web pages by information unit. In Proc. of WWW, pages 230–244, 2001. 10. W. Lou. loggen: A generic random log generator: Design and implementation. Technical report, CS Department, HKUST, December 2001. 11. Wilfred Ng. Capturing the semantics of web log data by navigation matrices. Semantic Issues in E-Commerce Systems. Kluwer Academic, pages 155–170, 2003. 12. Y. Saad. Krylov subspace methods on supercomputers. SIAM J. of Sci. Stat. Comput., 10:1200–1232, 1989. 13. M. Spiliopoulou and L.C. Faulstich. WUM: A web utilization miner. In Proc. of EDBT Workshop, WebDB98, 1998. 14. J. Srivastava et al. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2):12–23, 2000.

An Agent-Based Approach for Interleaved Composition and Execution of Web Services Xiaocong Fan, Karthikeyan Umapathy, John Yen, and Sandeep Purao School of Information Sciences and Technology The Pennsylvania State University University Park, PA 16802 {zfan,kxu110,jyen,spurao}@ist.psu.edu

Abstract. The emerging paradigm of web services promises to bring to distributed computing the same flexibility that the web has brought to the publication and search of information contained in documents. This new paradigm puts severe demands on composition and execution of workflows that must survive and respond to changes in the computing and business environments. Workflows facilitated by web services must, therefore, allow dynamic composition in ways that cannot be predicted in advance. Utilizing the notions of shared mental models and proactive information exchange in agent teamwork research, we propose a solution that interleaves planning and execution in a distributed manner. This paper proposes a generic model, gives the mappings of terminology between Web services and team-based agents, describes a comprehensive architecture for realizing the approach, and demonstrates its usefulness with the help of an example. A key benefit of the approach is the proactive handling of failures that may be encountered during the execution of complex web services.

1

Introduction

The mandate for effective composition of web services comes from the need to support complex business processes. Web services allow a more granular specification of tasks contained in workflows, and suggest the possibility of gracefully accommodating short-term trading relationships, which can be as brief as a single business transaction [1]. Facilitating such workflows requires dynamic composition of complex web services that must be monitored for successful execution. Drawing from research in workflow management systems [2], the realization of complex web services can be characterized by the following elements: (a) creation of an execution order of operations from the short-listed Web services; (b) enacting the execution of the services in the sequenced order; and (c) administering and monitoring the execution process. The web services composition problem has, therefore, been recognized to include both the coordination of the sequence of services execution and also managing the execution of services as a unit [3]. Much current work in web service composition continues to focus on the first ingredient, i.e. discovery of appropriate services and planning for the sequencing


and invocation of these web services [4]. Effective web services composition, however, must encompass concerns beyond the planning stage, including the ability to handle errors and exceptions that may arise in a distributed environment. Monitoring of the execution and exception handling for the web services must, therefore, be part of an effective strategy for web service composition [5]. An approach to realizing this strategy is to incorporate research from intelligent agents, in particular, team-based agents [6]. The natural match between web services and intelligent agents - as modularized intelligence - has been alluded to by several researchers [7]. The objective of the research reported in this paper is to go a step further: to develop an approach for interleaved composition and execution of web services by incorporating ideas from research on team-based agents. In particular, an agent architecture will be proposed such that agents can respond to environmental changes and adjust their behaviors proactively to achieve better QoS (quality of services).

2 2.1

Prior Work Composition of Web Services

Web services are loosely coupled, dynamically locatable software, which provides a common platform-independent framework that simplifies heterogeneous application integration. Web services use a service-oriented architecture (SOA) that communicates over the web using XML messages. The standard technologies for implementing the SOA operations include Web Services Description Language (WSDL), Universal Description, Discovery and Integration (UDDI), Simple Object Access Protocol (SOAP) [8], and Business Process Execution Language for Web Services (BPEL4WS). Such function-oriented approaches have provided guidelines for planning of web service compositions [4,9]. However, the technology to compose web services has not kept pace with the rapid growth and volatility of available opportunities [10]. While the composition of web services requires considerable effort, its benefit can be short-lived and may only support short-term partnerships that are formed during execution and disbanded on completion [10]. Web services composition can be conceived as two-phase procedure, involving planning and execution [11]. The planning phase includes determining series of operations needed for accomplishing the desired goals from user query, customizing services, scheduling execution of composed services and constructing a concrete and unambiguously defined composition of services ready to be executed. The execution phase involves process of collaborating with other services to attain desirable goals of the composed services. The overall process has been classified along several dimensions. The dimension most relevant for our discussion is: pre-compiled vs. dynamic composition [12]. Compared with the pre-compilation approach, dynamic compositions can better exploit the present state of services, provide runtime optimizations, as well as respond to changes in the business environment. But on the other hand, dynamic compositions of web services is a particularly difficult problem because of the continued need to


provide high availability, reliability, and scalability in the face of high degrees of autonomy and heterogeneity with which services are deployed and managed on the web [3]. The use of intelligent agents has been suggested to handle the challenges.

2.2

Intelligent Agents for Web Service Composition

There is increasing recognition that web services and intelligent agents represent a natural match. It has been argued that both represent a form of “modularized intelligence” [7]. The analogy has been carried further to articulate the ultimate challenge as the creation of effective frameworks, standards and software for automating web service discovery, execution, composition and interoperation on the web [13]. Following the discussion of web service composition above, the role of intelligent agents may be identified as on-demand planning, and proactively responding to changes in the environment. In particular, planning techniques have been applied to web services composition. Kay [14] describes the ATL Postmaster system that uses agent-based collaboration for service composition. A drawback of the system is that the ATL postmaster is not fault-tolerant. If a node fails, the agents residing in it are destroyed and state information is lost. Maamar and et. al. [15] propose a framework based on software agents for web services composition, but fail to tie their framework to web services standards. It is not clear how their framework will function with BPEL4WS and other web services standards and handle exceptions. Srivastava and Koehler [4], while discussing use of planning approaches to web services composition, indicate planning alone is not sufficient; and useful solutions must consider failure handling as well as composition with multiple partners. Effective web service composition, thus, requires expertise regarding available services, as well as process decomposition knowledge. A particular flavor of intelligent agents, called team-based agents, allows expertise to be distributed, making them a more appropriate fit for web services composition. Team-based agents are a special kind of intelligent agents with distributed expertise (knowledge) and emphasize on cooperativeness and proactiveness in pursuing their common goals. Several computational models of teamwork have been proposed including [16], STEAM [17] and CAST [6]. These models allow multiple agents to solve (e.g., planning, task execution) complex problems collaboratively. In web services composition, team-based agents can facilitate a distributed approach to dynamic composition, which can be scalable, facilitate learning about specific types of services across multiple compositions, and allow proactive failure handling. In particular, the CAST architecture (Collaborative Agents for Simulating Teamwork) [6] offers a feasible solution for dynamic web services composition. Two key features of CAST are (1) CAST agents can work collaboratively using a shared mental model of the changing environment; (2) CAST agents proactively inform each other of changes that they perceive to handle any exceptions that arise in achieving a team goal. By collaboratively monitoring the progress of a shared process, a team of CAST agents can not only


initiate helping behaviors proactively but can also adjust their own behaviors to the dynamically changing environment. In the rest of this paper, we first propose a generic team-based agent framework for dynamic web-service composition, and then extend the existing CAST agent architecture to realize the framework.

3

A Methodology for Interleaved Composition and Execution

We illustrate the proposed methodology with the help of an example that demonstrates how team-based agents may help with dynamic web services composition. The example concerns dynamic service outsourcing in a virtual software development organization, called 'VOSoft'. VOSoft possesses expertise in designing and developing software packages for customers from a diversity of domains. It usually employs one of two development methodologies (or software processes): the prototype-based approach (Mp) is preferred for software systems composed of tightly-coupled modules (integration problems are revealed earlier), and the unit-based approach (Mu) is preferred for software systems composed of loosely-coupled modules (more efficient due to parallel tasks). Suppose a customer "WSClient" engages VOSoft to develop CAD design software for metal casting patterns. It is required that the software be able to (a) read AutoCAD drawings automatically, (b) develop designs for metal casting patterns, and (c) maintain all the designs and user details in a database. Based on its expertise, VOSoft designs the software as being composed of three modules: database management system (DMS), CAD, and pattern design. Assume VOSoft's core competency is developing the application logic that is required for designing metal casting patterns, but it cannot develop the CAD software and the database module. Hence, VOSoft needs to compose a process where the DMS and CAD modules could be outsourced to competent service providers. In this scenario, several possible exceptions may be envisioned. We list three below to illustrate the nature and source of these exceptions. First, non-performance by a service provider will result in a service failure exception, which may be resolved by locating another service to perform the task. Second, module integration exceptions may be raised if two modules cannot interact with each other. This may be resolved by adding tasks to develop common APIs for the two modules. Third, the customer may change or add new functionality, which may necessitate the change of the entire process. It is clear that both internal (capability of web services) as well as external (objectives being pursued) changes can influence the planning and execution of composite web services in such scenarios. This requires an approach that is able to monitor service execution and proactively handle service failures.

3.1

Composing with Team-Based Agents

A team-based agent A is defined in terms of (a) a set of capabilities (service names), (b) a list of service providers under its management, and (c) an acquaintance model (the set of agents known to A, together with their respective capabilities). An agent in our framework may play multiple roles. First, every agent is a Web-service manager: an agent A knows which of its providers can offer a given service S, or at least knows how to find a provider for S (e.g., by searching the UDDI registry) if none of its known providers is capable of performing the service. Such services are primitive to agent A in the sense that it can directly delegate them to appropriate service providers. Second, an agent becomes a service composer upon being requested to perform a complex service: it is responsible for composing a process using known services when it receives a user request that falls beyond its capabilities. In such situations, the set of acquaintances forms a community of contacts available to agent A. The acquaintance model is dynamically modified based on the agent's collaboration with other agents (e.g., assigning credit to those with successful collaborations). This additional, local knowledge supplements the global knowledge about publicly advertised web services (say, in the UDDI registry). Third, an agent becomes a team leader when it tries to form a team to honor a complex service.
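The agent definition above can be pictured as a small data structure. The following Python sketch illustrates the three ingredients (capabilities, managed providers, and the acquaintance model) and the lookup a composer performs over its acquaintances; the class and field names are ours and not part of any CAST implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class TeamAgent:
    name: str
    capabilities: Set[str]                                             # services the agent can manage directly
    providers: Dict[str, List[str]] = field(default_factory=dict)      # service -> known service providers
    acquaintances: Dict[str, Set[str]] = field(default_factory=dict)   # agent name -> its capabilities

    def can_serve(self, service):
        return service in self.capabilities

    def candidates_for(self, service):
        """Acquaintances whose advertised capabilities cover the service."""
        return {a for a, caps in self.acquaintances.items() if service in caps}

# Example corresponding to the VOSoft scenario sketched in the text.
vosoft = TeamAgent(
    name="VOSoft",
    capabilities={"Pattern-WS"},
    acquaintances={"T1": {"DMS-WS"}, "T3": {"CAD-WS"}, "T4": {"CAD-WS"}},
)
print(vosoft.can_serve("CAD-WS"))        # False, so VOSoft must compose and form a team
print(vosoft.candidates_for("CAD-WS"))   # e.g. {'T3', 'T4'}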

Fig. 1. Team formation and Collaboration

3.2

Responding to Request for a Complex Service

An agent, upon receiving a complex service request, initiates a team formation process: (1) The agent (say, C) adopts "offering service S" as its persistent goal. (2) If S is within C's capabilities, agent C simply delegates S to a competent provider (or first finds a service provider, if no provider known to C is competent).


(3) If S is beyond C's own capabilities (i.e., agent C cannot directly serve S), then C tries to compose a process (say, H) using its expertise and the services it knows about (i.e., it considers its own capabilities and the capabilities of the agents in its acquaintance model), and then starts to form a team: i. Agent C identifies potential teammates by examining the agents in its acquaintance model that have the capability to contribute to the process (i.e., that can perform some of the services used in H). ii. Agent C chooses willing and competent agents from these candidates (e.g., using the contract-net protocol [18]) as teammates, and shares the process H with them with a view to jointly working on H as a team.

(4) If the previous step fails, then agent C either fails in honoring the external request (and is penalized), or, if possible, may proactively discover a different agent (either using its acquaintance model or using UDDI) and delegate S to it. [Example] Suppose agent VOSoft composes a top-level process as shown in Figure 1(a). In the process, the "contract" service is followed by a choice point, where VOSoft needs to decide which methodology (Mp or Mu) to choose. If Mu is chosen, then services DMS-WS, CAD-WS and Pattern-WS are required; if Mp is chosen, then the services need to be more refined, so that interactions between service providers in the software development process can be carried out frequently to avoid potential integration problems at later stages. Now, suppose VOSoft chooses branch Mu, and manages to form a team including agents T1, T3 and VOSoft to collaboratively satisfy the user's request. It is possible that agent T4 was asked but declined to join the team for some reason (e.g., lack of interest or capacity). After the team is formed, each agent's responsibility is determined and mutually known. As team leader, agent VOSoft is responsible for coordinating the others' behavior to work towards their common goal, and for making decisions at critical points (e.g., adjusting the process if a service fails). Agent T1 is responsible for service DMS-WS, and agent T3 is responsible for service CAD-WS. As service managers, T1 and T3 are responsible for choosing an appropriate service provider for services DMS-WS and CAD-WS, respectively.
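The request-handling steps (1)-(4) can be summarized as a short procedure. The sketch below assumes an agent object exposing can_serve and candidates_for as in the earlier sketch, and treats composition, delegation and recruitment as hypothetical callbacks; it is our reading of the steps above, not the authors' algorithm.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Process:
    services: List[str]   # services used by the composed process H

def handle_request(agent, service, compose_process, delegate, recruit):
    """Sketch of steps (1)-(4); compose_process, delegate and recruit are
    hypothetical callbacks supplied by the caller."""
    # (1) Adopt "offering `service`" as a persistent goal (kept implicit here).
    # (2) Within the agent's own capabilities: delegate to a competent provider.
    if agent.can_serve(service):
        return delegate(agent, service)
    # (3) Otherwise compose a process H from known services and form a team.
    process: Optional[Process] = compose_process(agent, service)
    if process is not None:
        needed = set(process.services) - agent.capabilities
        candidates = {s: agent.candidates_for(s) for s in needed}
        if all(candidates.values()) and recruit(agent, process, candidates):
            return "team formed; process shared with teammates"
    # (4) Composition or recruitment failed: fail (and be penalized), or try to
    #     discover a different agent and delegate the whole service to it.
    return "failed or re-delegated"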

3.3

Collaborating in Service Execution

The sharing of high-level process enables agents in a team to perform proactive teamwork behaviors during service execution. Proactive Service Discovery: Knowing the joint responsibility of the team and individual responsibility of team members, one agent can help another find web services. For example, in Figure 1(b), agent T1 is responsible for contributing service D-design. Agent T3, who happened to identify a service provider for service D-design while interacting with the external world, can proactively inform T1 about the provider. This can not only improve T1’s competency regarding service D-design, but also can enhance T3’s credibility in T1’s acquaintance model.


Proactive Service Delegation: An agent can proactively contract out services to competent teammates. For example, suppose branch Mu is selected and service CAD-WS is a complex service for T3, who has composed a process for CAD-WS as shown in Figure 1(b). Even though T3 can perform C-design and C-code, services C-test and C-debug are beyond its capabilities. In order to provide the committed service CAD-WS, T3 can proactively form another team and delegate the services to the recruited agents (i.e., T6). It might be argued that agent VOSoft would have generated a high-level process with more detailed decomposition, say, the sub-process generated by T3 for CAD-WS were embedded (in the place of CAD-WS) as a part of the process generated by VOSoft. If so, T6 would have been recruited as VOSoft’s teammate, and no delegation will be needed. However, the ability to derive a process at all decomposition levels is too stringent a requirement to place on any single agent. One benefit of using agent teams is that one agent can leverage the knowledge (expertise) distributed among team members even though each of them only have limited resources. Proactive Information Delivery: Proactive information delivery can occur in various situations. (i) A complex process may have critical choice points where several branches are specified, but which one will be selected depends on the known state of the external environment. Thus, teammates can proactively inform the team leader about those changes in states that are relevant to its decision-making, (ii) Upon making a decision, other teammates will be informed of the decision so that they can better anticipate potential collaboration needs, (iii) A web service (say, the service Test in branch Mu) may fail due to many reasons. The responsible agent can proactively report the service failures to the leader so that the leader can decide how to respond to the failure: choose an alternative branch (say, Mp), or request the responsible agent to re-attempt the service from another provider.

4

The CAST-WS Architecture

We have designed a team-based agent architecture CAST-WS (Collaborative Agents for Simulating Teamwork among Web Services) to realize our methodology (see Figure 2). In the following, we describe the components of CAST-WS and explain their relationships.

4.1

Core Representation Decisions

The core representation decisions that drive the architecture involve mapping concepts from team-based agents to composition and execution of complex web services with an underlying representation that may be common to both domains. Such a representation is found in Petri nets [19]. The CAST architecture utilizes hierarchical predicate-transition nets as the underlying representation for specifying plans created and shared among agents. In the web service domain, the dominant standard for specifying compositions, BPEL4WS can also be interpreted based on a broad interpretation of the Petri net formalism. Another


Fig. 2. The CAST-WS Architecture

key building block for realizing complex web services, protocols for conversations among web services [20], uses state-based representations that may be mapped to Petri-net based models for specifying conversation states and their evolution. As a conceptual model, therefore, a control-oriented representation of workflows, complex web services and conversations can share the Petri-net structure, with the semantics provided by each of the domains. The mapping between teambased agents and complex web services is summarized in Table 1 below.

Following this mapping, we have devised the following components of the architecture: service planning (i.e. composing complex web services), team coordination (i.e. knowledge sharing among web services), and executing (i.e. realizing the execution of complex web services).

4.2

WS-Planning Component

The Planning component is responsible for composing services and forming teams. This component includes three modules. The service discovery module is used by the service planner to look up required services in the UDDI registry. The


team formation module, together with the acquaintance model, is used to find team agents who can support the required services. A web service composition starts from a user's request. The agent that receives the request is the composer agent, who is in charge of fulfilling the request. Upon receiving a request, the composer agent turns the request into its persistent goal and invokes its service planner module to generate a business process for it. The architecture uses hierarchical predicate-transition nets (PrT nets) to represent and monitor business processes. A PrT net consists of the following components: (1) a set P of token places for controlling the process execution; (2) a set T of transitions, each of which represents either an abstraction of a sub PrT net (i.e., an invocation of some sub-plan) or an operation (e.g., a primitive web service); a transition is associated with preconditions (predicates), which are used to specify the conditions for continuing the process; (3) a set of arcs over P × T that describes the order of execution the team will follow; and (4) a labeling function on arcs, whose labels are tuples of agents and bindings for variables. The services used by the service planner for composing a process come from two sources: the UDDI directory and the acquaintance model. Assume that from the requested service we can derive a set of expected effects, which become the goals to be achieved by the CAST agents. Given any set of goals G, a partial order (binary relation) is defined over G; from this order, the pre-set and post-set of each goal are derived, two goals are independent iff neither belongs to the other's pre-set or post-set, and a goal is indetachable from G iff it cannot be separated from the remaining goals, and vice versa. The following algorithm is used by the service planner to generate a Petri-net process for a given goal (service request).
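The PrT-net components listed above (places, transitions with preconditions, arcs over P × T, and arc labels carrying agents and variable bindings) can be captured by a small data structure. The Python sketch below is an illustration only; the class names, the toy net and its labels are our assumptions and do not reproduce the planner algorithm itself.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple, Union

@dataclass
class Transition:
    name: str
    # A transition abstracts either a sub-net (sub-plan) or a primitive web service name.
    body: Union["PrTNet", str]
    precondition: Callable[[dict], bool] = lambda state: True   # predicate over the current state

@dataclass
class PrTNet:
    places: List[str]                                            # token places controlling execution
    transitions: List[Transition]
    arcs: List[Tuple[str, str]]                                  # (place, transition) pairs, a subset of P x T
    labels: Dict[Tuple[str, str], Tuple[str, dict]] = field(default_factory=dict)
    # arc -> (responsible agent, variable bindings)

# A toy net loosely modelled on the VOSoft example; the names are illustrative.
net = PrTNet(
    places=["start", "contracted", "done"],
    transitions=[Transition("contract", "Contract-WS"),
                 Transition("build", PrTNet(places=[], transitions=[], arcs=[]))],
    arcs=[("start", "contract"), ("contracted", "build")],
    labels={("start", "contract"): ("VOSoft", {})},
)
print(len(net.transitions))   # 2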

4.3

The Team Coordination Component

The team coordination component is used to coordinate with other agents during service execution. This component includes an inference engine with a built-in knowledge base, a process shared by all team members, a PrT interpreter, a plan adjustor, and an inter-agent coordination module. The knowledge base holds the (accumulated) expertise needed for service composition. The inter-agent coordination module, embedded with team coordination strategies and conversation policies [21], is used for behavior collaboration among teammates. Here we mainly focus on the process interpreter and the plan adjustor. Each agent in a team uses its PrT net interpreter to interpret the business process generated by its service planner, monitor the progress of the shared process, and take its turn to perform the tasks assigned to it. If the assigned task is a primitive web service, the agent invokes the service through its BPEL4WS process controller. If a task is assigned to multiple agents, the responsible agents coordinate their behavior (e.g., they do not compete for common resources) through the inter-agent coordination module. If an agent faces an unassigned task, it evaluates the constraints associated with the task and tries to find a competent teammate for it. If the assigned task is a complex service (i.e., further decomposition is required) and is beyond its capabilities, the agent treats it as an internal request, composes a sub-process for the task, and forms another team to solve it. The plan adjustor uses the knowledge base and inference engine to adjust and repair the process whenever an exception or a need for change in the process arises. The algorithm used by the plan adjustor utilizes the failure handling policy implemented in CAST. Due to the hierarchical organization of the team process, each CAST agent maintains a stack of active processes and sub-processes. A sub-process returns control to its parent process when its execution is completed.
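The interpreter's dispatch over an assigned task, as just described, can be summarized in code. The Task fields and the helper methods assumed on the agent object (find_teammate, coordinate, process_controller, compose_subprocess) are hypothetical names; the sketch merely restates the cases above.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Task:
    service: str
    is_primitive: bool
    assignees: Optional[List[str]] = None   # None means not yet assigned

def interpret_task(agent, task: Task):
    """Dispatch logic as described above (illustrative, not the CAST-WS API)."""
    if task.assignees is None:
        # Unassigned task: evaluate its constraints and look for a competent teammate.
        return agent.find_teammate(task)
    if len(task.assignees) > 1:
        # Task shared by several agents: coordinate behavior first.
        agent.coordinate(task.assignees)
    if task.is_primitive:
        # Primitive web service: invoke it through the BPEL4WS process controller.
        return agent.process_controller.invoke(task.service)
    # Complex service requiring further decomposition beyond this agent's
    # capabilities: treat it as an internal request and form another team.
    return agent.compose_subprocess(task.service)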


Failure handling is interleaved with (abstract) service execution: execute a service; check termination conditions; handle failures; and propagate failures to the parent process if needed. The algorithm captures four kinds of termination mode resulting from a service execution. The first (i.e., return 0) indicates that the service completed successfully. The second (i.e., return 1) indicates that the process terminated abnormally but the expected effects of the service have already been achieved "magically" (e.g., by proactive help from teammates). The third (i.e., return 2) indicates that the process is not completed and is likely at an impasse. In this case, if the current service is just one alternative of a choice point, another alternative can be selected to re-attempt the service; otherwise the failure is propagated to the upper level. The fourth (i.e., return 3) indicates that the process terminated because the service has become irrelevant. This may happen if the goal or context changes. In this case, the irrelevance is propagated to the parent service, which checks its own relevance. The plan adjustor algorithm is shown below.
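A minimal sketch of such a failure-handling loop, with the four termination modes encoded as constants and the surrounding plumbing (execute, alternatives, propagate) left as hypothetical callbacks, is given here; it is our reading of the four modes described above, not CAST's implementation.

# Termination modes as described in the text.
COMPLETED, EFFECTS_ACHIEVED, IMPASSE, IRRELEVANT = 0, 1, 2, 3

def execute_with_failure_handling(node, execute, alternatives, propagate):
    """Execute a (sub-)service, inspect its termination mode and either accept,
    retry an alternative branch, or propagate to the parent process."""
    result = execute(node)
    if result in (COMPLETED, EFFECTS_ACHIEVED):
        # Success, or the expected effects were achieved by proactive help.
        return COMPLETED
    if result == IMPASSE:
        # Re-attempt through another alternative of the enclosing choice point,
        # if one exists; otherwise propagate the failure upwards.
        for alt in alternatives(node):
            if execute_with_failure_handling(alt, execute, alternatives, propagate) == COMPLETED:
                return COMPLETED
        return propagate(node, IMPASSE)
    # result == IRRELEVANT: the goal or context changed, so the parent service
    # must check its own relevance.
    return propagate(node, IRRELEVANT)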

4.4

The WS-Execution Component

A service manager agent executes the primitive services (or a process of primitive services) through the WS-Execution component. The WS-Execution component consists of a commitment manager, a capability manager, a BPEL4WS process controller, an active process, and a failure detector. The capability manager maps services to known service providers. The commitment manager is used to schedule the services assigned to it in an appropriate order. An agent ultimately needs to delegate those contracted services to appropriate service providers. The process controller generates a BPEL4WS process based on the WSDL of the selected service providers and the sequence indicated in the PrT process. The failure detector identifies execution failure by


Fig. 3. The relations between generated team process and other modules

checking the termination conditions associated with services. If a termination condition has been reached, the failure detector throws an error and the plan adjustor module is invoked. If it is a service failure, the plan adjustor simply asks the agent to choose another service provider and re-attempt the service; if it is a process failure (unexpected changes make the process unworkable), the plan adjustor back-tracks the PrT process, tries to find another (sub-)process that would satisfy the task, and uses it to fix the one that failed.

4.5

The Example Revisited

Figure 3 shows how web service composition for VOSoft may be performed with interleaved planning and execution. The figure shows the core (hierarchical) Petri net representation used by the CAST architecture, and the manner in which each of the modules in the architecture use this representation. Due to the dynamic nature of the process, it is not feasible to show all possible paths that the execution may take. Instead, we show one plausible path, indicating the responsibilities for each of the modules in the architecture such as planning, team formation, undertaking execution, sensing changes in the internal or external environment (that may lead to exceptions), proactive information sharing, and how these will allow adapting the process to changes in the environment (proactive exception handling). The result is an interleaved process that includes planning and execution. The figure shows mapping to elements of the web service tech-


nology stack (e.g. BPEL4WS specification), which allows use of the proposed architecture with current proposals from W3C.

5

Discussion

As business processes, specified as workflows and executed with web services, need to be adaptive and flexible, approaches are needed to facilitate this evolution. The methodology and architecture we have outlined address this concern by pushing the burden of ensuring this flexibility to the web services participating in the process. To achieve this, we have adapted and extended research in the area of team-based agents. A key consequence of this choice is that our approach allows interleaving of execution with planning, providing several distinct advantages over current web service composition approaches for facilitating adaptive workflows. First, it supports an adaptive process that is suitable for the highly dynamic and distributed manner in which web services are deployed and used. The specification of a joint goal allows each team member to contribute relevant information to the composer agent, who can make decisions at critical choice points. Second, it elicits a hierarchical methodology for process management where a service composer can compose a process at a coarse level appropriate to its capability and knowledge, leaving further decomposition to competent teammates. Third, it interleaves planning with execution, providing a natural vehicle for implementing adaptive workflows. Our work in this direction so far has provided us with the fundamental insight that further progress in effective and efficient web service composition can be made by better understanding how distributed and partial knowledge about the availability and capabilities of web services, and the environment in which they are expected to operate, can be shared among the team of agents that must collaborate to perform the composed web service. Our current work involves extending the ideas to address these opportunities and concerns, and reflecting the outcomes in the ongoing implementation.

References 1. Heuvel, v.d., Maamar, Z.: Moving toward a framework to compose intelligent web services. Communications of the ACM 46 (2003) 103–109 2. Allen, R.: Workflow: An introduction. In Fisher, L., ed.: The Workflow Handbook 2001. Workflow Management Coalition (2001) 15–38 3. Pires, P., Benevides, M., Mattoso, M.: Building reliable web services compositions. In: Web, Web-Services, and Database Systems 2002. LNCS-2593. Springer (2003) 59–72 4. Koehler, J., Srivastava, B.: Web service composition: Current solutions and open problems. In: ICAPS 2003 Workshop on Planning for Web Services. (2003) 28–35 5. Oberleitner, J., Dustdar, S.: Workflow-based composition and testing of combined e-services and components. Technical Report TUV-1841-2003-25, Vienna University of Technology, Austria (2003)


6. Yen, J., Yin, J., Ioerger, T., Miller, M., Xu, D., Volz, R.: CAST: Collaborative agents for simulating teamworks. In: Proceedings of IJCAI’2001. (2001) 1135–1142 7. Bernard, B.: Agents in the world of active web-services. Digital Cities (2001) 343– 356 8. Manes, A.T.: Web services: A manager’s guide. Addison-Wesley Information Technology Series (2003) 47–82 9. Casati, F., Shan, M.C.: Dynamic and adaptive composition of e-services. Information Systems 26 (2001) 143–163 10. Sheng, Q., Benatallah, B., Dumas, M., Mak, E.: SELF-SERV: A platform for rapid composition of web services in a peer-to-peer environment. In: Demo Session of the 28th Intl. Conf. on Very Large Databases. (2002) 11. McIlraith, S., Son, T.C.: Adopting Golog for composition of semantic web services. In: Proceedings of the International Conference on knowledge representation and Reasoning (KR2002). (2002) 482–493 12. Chakraborty, D., Joshi, A.: Dynamic service composition: State-of-the-art and research directions. Technical Report TR-CS-01-19, Department of Computer Science and Electrical Engineering, University of Maryland, Baltimore, USA (2001) 13. Ermolayev, V.: Towards cooperative distributed service composition on the semantic web. Talks at Informatics Colloquium (2003) 14. Kay, J., Etzl, J., Rao, G., Thies, J.: The ATL postmaster: a system for agent collaboration and information dissemination. In: Proceedings of the second international conference on Autonomous agents, ACM (1998) 15. Maamar, Z., Sheng, Q., Benatallah, B.: Interleaving web services composition and execution using software agents and delegation. In: AAMAS’03 Workshop on web Services and Agent-based Engineering. (2003) 16. Jennings, N.R.: Controlling cooperative problem solving in industrial multi-agent systems using joint intentions. Artificial Intelligence 75 (1995) 195–240 17. Tambe, M.: Towards flexible teamwork. Journal of Artificial Intelligence Research 7 (1997) 83–124 18. Smith, R.G.: The contract net protocol: High-level communication and control in a distributed problem solver. IEEE Transactions on Computers 29 (1980) 1104–1113 19. van der Aalst, W., vanHee, K.: Workflow Management: Models, Methods, and Systems. MIT Press (2002) 20. Hanson, J.E., Nandi, P., Kumaran, S.: Conversation support for business process integration. In: Proc. of the IEEE International Enterprise Distributed Object Computing Conference. (2002) 65–74 21. Umapathy, K., Purao, S., Sugumaran, V.: Facilitating conversations among web services as speech-act based discourses. In: Proceedings of the Workshop on Information Technologies and Systems. (2003) 85–90

A Probabilistic QoS Model and Computation Framework for Web Services-Based Workflows

San-Yih Hwang 1,2,*, Haojun Wang 2,**, Jaideep Srivastava 2, and Raymond A. Paul 3

1 Department of Information Management, National Sun Yat-sen University, Kaohsiung 80424, Taiwan
2 Department of Computer Science, University of Minnesota, Minneapolis 55455, USA {haojun,srivasta}@cs.umn.edu
3 Department of Defense, United States

Abstract. Web services promise to become a key enabling technology for B2B e-commerce. Several languages have been proposed to compose Web services into workflows. The QoS of the Web services-based workflows may play an essential role in choosing constituent Web services and determining service level agreement with their users. In this paper, we identify a set of QoS metrics in the context of Web services and propose a unified probabilistic model for describing QoS values of (atomic/composite) Web services. In our model, each QoS measure of a Web service is regarded as a discrete random variable with probability mass function (PMF). We describe a computation framework to derive QoS values of a Web services-based workflow. Two algorithms are proposed to reduce the sample space size when combining PMFs. The experimental results show that our computation framework is efficient and results in PMFs that are very close to the real model.

1 Introduction

Web services have become a de facto standard for achieving interoperability among business applications over the Internet. In a nutshell, a Web service can be regarded as an abstract data type that comprises a set of operations and data (or message types). Requests to and responses from Web service operations are transmitted through SOAP (Simple Object Access Protocol), which provides XML-based message delivery over an HTTP connection. The existing SOAP protocol uses synchronous RPC for invoking operations in Web services. However, in response to an increasing need to facilitate long-running activities, new proposals have been made to extend SOAP to allow asynchronous message exchange (i.e., requests and responses are not synchronous). One notable proposal is ASAP (Asynchronous Service Access Protocol) [1], which allows the execution of long-running Web service operations,

* San-Yih Hwang was supported in part by a Fulbright Scholarship.
** Haojun Wang was supported in part by the NSF under grant ISS-0308264.



and also non-blocking Web services invocation, in a less reliable environment (e.g., wireless networks). In the following discussion, we use the term Web service, to refer to an atomic activity, which may encompass either a single Web service operation (in the case of asynchronous Web services) or a pair of invoke/respond operations (in the case of synchronous Web services), and the term WS-workflow to refer to a workflow composed of a set of Web service invocations threaded into a directed graph. Several languages have been proposed to compose Web services into workflows. Notable examples include WSFL (Web Service Flow Language) [13] and XLANG (Web Services for Business Process Design) [16]. The ideas of WSFL and XLANG have converged and been superceded by BPEL4WS (Business Process Execution Language for Web Services) specification [2]. Such Web services-based workflows may subsequently become (composite) Web services, thereby enabling nested Web Services Workflows (WS-workflows). While the syntactic description of Web services can be specified through WSDL (Web Service Description Language), their semantics and quality of service (QoS) are left unspecified. The concept of QoS has been introduced and extensively studied in computer networks, multimedia systems, and real-time systems. QoS was mainly considered as an overload management problem that measures non-functional aspects of the target system, such as timeliness (e.g., message delay ratio) and completeness (e.g., message drop percentage). More recently, the concept of QoS is finding its way into application specification, especially in describing the level of service provided by a server. Typical QoS metrics at the application level include throughput, response time, cost, reliability, fidelity, etc [12]. Some work has been devoted to the specification and estimation of workflow QoS [3, 7]. However, previous work in workflow QoS estimation either focused on the static case (e.g., computing the average or the worst case QoS values) or relied on simulation to compute workflow QoS in a broader context. While the former has limited applicability, the later requires substantial computation before reaching stable results. In this paper, we propose a probability-based QoS model on Web services and WS-workflows that allows for efficient and accurate QoS estimation. Such an estimation serves as the basis for dealing with Web services selection problem [11] and service level agreement (SLA) specification problem [6]. The main contributions of our research are: 1. We identify a set of QoS metrics tailored for Web services and WS-workflows and give an anatomy of these metrics. 2. We propose a probability-based WS-workflow QoS model and its computation framework. This computation framework can be used to compute QoS of a complete or partial WS-workflow. 3. We explore alternative algorithms for computing probability distribution functions of WS-workflow QoS. The efficiency and accuracy of these algorithms are compared.

This paper is organized as follows. In Section 2 we define the QoS model in the context of WS-workflows. In Section 3 we present the QoS computation framework for WS-workflows. In Section 4 we describe algorithms for efficiently computing the


QoS values of a WS-workflow. Section 5 presents preliminary results of our performance evaluation. Section 6 reviews related work. Finally, Section 7 concludes this paper and identifies directions for future research.

2 QoS Model for Web Services

2.1 Web Services QoS Metrics

Many workflow-related QoS metrics have been proposed in the literature [3, 7, 9, 11, 12]. Typical categories of QoS metrics include performance (e.g., response time and throughput), resources (e.g., cost, memory/cpu/bandwidth consumption), dependability (e.g., reliability, availability, and time to repair), fidelity, transactional properties (e.g., ACID properties and commit protocols), and security (e.g., confidentiality, nonrepudiation, and encryption). Some of the proposed metrics are related to the system capacity for executing a WS-workflow. For example, metrics used to measure the power of servers, such as throughput, memory/cpu/bandwidth consumption, time to repair (TTR), and availability, fall into the category called system-level QoS. However, the capacities of servers for executing Web services (e.g., man power for manual activities and computing power for automatic activities) are unlikely to be revealed due to autonomy considerations, and may change over time without notification. These metrics might be useful in some workflow contexts such as intra-organizational workflows (for determining the amount of resources to spend on executing workflows). For inter-organizational workflows, where a needed Web service may be controlled by another organization, QoS metrics in this category generally cannot be measured, and are thus excluded from further discussion. Other QoS metrics require all instances of the same Web service to share the same values. In this case, it is better to view these metrics as service classes rather than quality of service. Metrics of service class include those categorized as transactional properties and security. In this paper we focus on those WS-workflow QoS metrics that measure a WS-workflow instance and whose value may change across instances. These metrics, called instance-level QoS metrics, include response time, cost, reliability, and fidelity rating. Note that cost is a complicated metric and could be a function of service class and/or other QoS values. For example, a Web service instance that imposes weaker security requirements or incurs a longer execution time might be entitled to a lower cost. Some services may adopt a different pricing scheme that charges based on factors other than usage (e.g., a membership fee or monthly fee). In this paper, we consider the pay-per-service pricing scheme, which allows us to include cost as an instance-level QoS metric. In summary, our work considers four metrics: Response time (i.e., the time elapsed from the submission of a request to the receiving of the response), Reliability (i.e., the probability that the service can be successfully completed), Fidelity (i.e., reputation rating) and Cost (i.e., the amount of money paid for executing an activity), which are equally applicable to both atomic Web services and WS-workflows (also called composite Web services). These QoS metrics are defined such that different instances of the same Web service may have different QoS values.


2.2 Probabilistic Modeling of Web Services QoS

We use a probability model for describing Web service QoS. In particular, we use a probability mass function (PMF) on a finite scalar domain as the QoS probability model. In other words, each QoS metric of a Web service is viewed as a discrete random variable, and the PMF indicates the probability that the QoS metric assumes a particular value. For example, the fidelity F of a Web service with five grades (1-5) is described by a PMF that assigns a probability to each of the five grades.

Note that it is natural to describe Reliability, Fidelity rating and Cost as random variables and to model them as PMFs with domains being {0 (fail), 1 (success)}, a set of distinct ratings, and a set of possible costs, respectively. However, it is less intuitive to use a PMF for describing response time, whose domain is inherently continuous. By viewing response time at a coarser granularity, it is possible to model response time as a discrete random variable. Specifically, we partition the range of response time into a finite sequence of sub-intervals and use a representative number (e.g., the mean) to indicate each sub-interval. For example, suppose that the probabilities of a Web service being completed in one day, two to four days, and five to seven days are 0.2, 0.6, and 0.2, respectively. The PMF of its response time X is then Pr[X = 1] = 0.2, Pr[X = 3] = 0.6 (3 is the mean of [2, 4]), and Pr[X = 6] = 0.2 (6 is the mean of [5, 7]). As expected, a finer granularity on response time will yield more accurate estimation, with higher overhead in representation and computation. We explore these tradeoffs in our experiments.
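As a concrete illustration (a sketch in Python; the names and the helper below are ours, not the paper's), a QoS PMF can be held as a mapping from domain values to probabilities, and observed response times can be discretized into sub-intervals represented by their means:

# Illustrative only: a QoS metric as a PMF over a finite scalar domain, plus a helper
# that buckets observed response times into sub-intervals represented by their means.
def discretize_response_time(samples, boundaries):
    """boundaries: sorted upper bounds of all but the last sub-interval."""
    buckets = [[] for _ in range(len(boundaries) + 1)]
    for t in samples:
        i = sum(1 for b in boundaries if t > b)   # index of the sub-interval containing t
        buckets[i].append(t)
    n = len(samples)
    pmf = {}
    for bucket in buckets:
        if bucket:
            mean = sum(bucket) / len(bucket)      # representative value for this sub-interval
            pmf[mean] = pmf.get(mean, 0.0) + len(bucket) / n
    return pmf

# The example from the text: completion within 1 day (p=0.2), 2-4 days (p=0.6), 5-7 days (p=0.2)
response_time_pmf = {1: 0.2, 3: 0.6, 6: 0.2}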

2.3 WS-Workflow Composition

For an atomic Web service, its QoS PMFs can be derived from its past records of invocations. For a newly developed WS-workflow that is composed of a set of atomic Web services, we need a way to determine its QoS PMFs. Different workflow composition languages may provide different constructs for specifying the control-flow among constituent activities (e.g., see [14, 15] for a comparison of the expressive powers of various workflow and Web services composition languages). Kiepuszewski et al. [8] define a structured workflow model that consists of only four constructs: sequential, or-split/or-join, and-split/and-join, and loop, which allows for recursive construction of larger workflows. Although it has been shown that this structured workflow model is unable to model arbitrary workflows [8], it is nevertheless powerful enough to describe many real-world workflows. In fact, there exist some commercial workflow systems that support only structured workflows, such as SAP R/3 and FileNet Visual WorkFlo. In this paper, as an initial step of the study, we focus our attention on structured workflows. To distinguish between exclusive or and (multiple-choice) or, which is crucial in deriving WS-workflow QoS, we extend the structured workflow model to include five constructs:


1. sequential: a sequence of activities.
2. parallel (and split/and join): multiple activities that can be concurrently executed and merged with synchronization.
3. conditional (exclusive split/exclusive join): multiple activities among which only one activity can be executed.
4. fault-tolerant (and split/exclusive join): multiple activities that can be concurrently executed but merged without synchronization.
5. loop: a block of activities guarded by a condition "LC". Here we adopt the while loop in the following discussion.

3 Computing QoS Values of WS Compositions

We now describe how to compute the WS-workflow QoS values for each composition construct introduced earlier. We identify five basic operations for manipulating random variables, namely (i) addition, (ii) multiplication, (iii) maximum, (iv) minimum, and (v) conditional selection. Each of these operations takes as input a number of random variables characterized by PMFs and produces a random variable characterized by another PMF. The first four operations are quite straightforward, and their detailed descriptions are omitted here due to space limitations. For their formal definitions, interested readers are referred to [5]. The conditional selection, denoted CS, is defined as follows¹. Let X_1, …, X_n be n random variables, with p_i being the probability that X_i is selected by the conditional selection operation CS. Note that the selection of any random variable is exclusive, i.e., exactly one of them is selected. The result of CS((X_1, p_1), …, (X_n, p_n)) is a new random variable Z whose domain is the union of the domains of X_1, …, X_n.

Specifically, the PMF of Z is as follows:

Pr[Z = z] = \sum_{i=1}^{n} p_i \cdot Pr[X_i = z].
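The following sketch (a hedged Python illustration, not the authors' implementation) shows how these basic operations, including the conditional selection just defined, can be realized on PMFs represented as dictionaries mapping values to probabilities:

# Hedged sketch of the five basic PMF operations used in this section.
def pmf_combine(x, y, op):
    """PMF of op(X, Y) for independent discrete X and Y."""
    z = {}
    for xv, xp in x.items():
        for yv, yp in y.items():
            v = op(xv, yv)
            z[v] = z.get(v, 0.0) + xp * yp
    return z

def pmf_add(x, y): return pmf_combine(x, y, lambda a, b: a + b)   # e.g., sequential response times
def pmf_mul(x, y): return pmf_combine(x, y, lambda a, b: a * b)   # e.g., reliabilities of a sequence
def pmf_max(x, y): return pmf_combine(x, y, max)                  # e.g., parallel response time
def pmf_min(x, y): return pmf_combine(x, y, min)                  # e.g., fault-tolerant response time

def pmf_conditional_selection(pmfs, probs):
    """CS((X1,p1),...,(Xn,pn)): exactly one Xi is selected, Xi with probability pi,
    so Pr[Z = z] = sum_i pi * Pr[Xi = z]."""
    z = {}
    for pmf, p in zip(pmfs, probs):
        for v, q in pmf.items():
            z[v] = z.get(v, 0.0) + p * q
    return z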

For each activity a, we consider four QoS metrics, namely response time, cost, reliability, and fidelity, denoted T(a), C(a), R(a), and F(a) respectively². For a WS-workflow w composed of activities using one of these composition constructs, the QoS values of w under the various constructs are shown in Table 1. We assume that the fidelity of w using sequential or parallel composition is a weighted sum of the fidelities of its constituent activities. The fidelity weight of each activity can be either manually assigned by the designer, or automatically derived from past history, e.g. by using linear regression. For the conditional construct, exactly one activity will be selected at run-time. Thus, the fidelity of w is the conditional selection of the fidelities of its constituent activities with the associated probabilities. For the fault-tolerant construct, the fidelity of the activity that is the first to complete becomes the fidelity of w; thus F(w) is the conditional selection of the fidelities of the constituent activities, weighted by the probability that each activity completes first.

¹ Take care not to confuse the conditional selection with the weighted sum. The weighted sum results in a random variable whose domain may not be the union of the domains of the constituent activities. While the weighted sum is used for computing the average value of a set of scalar values, it should not be used to compute the PMF resulting from the conditional selection of a set of random variables.
² Note that each QoS metric of an activity is NOT a scalar value but a discrete random variable characterized by a PMF.

A loop construct is defined as a repetition of a block guarded by a condition "LC", i.e., the block is repeatedly executed until the condition "LC" no longer holds. Cardoso et al. assumed a geometric distribution on the number of iterations [3]. However, the memoryless property of the geometric distribution fails to capture a common phenomenon, namely that a repeated execution of a block usually has a better chance to exit the loop. Gillmann et al. [7] assumed the number of iterations to be uniformly distributed, which again may not hold in many applications. In this paper, rather than assuming a particular distribution, we simply regard the number of iterations of a loop structure L defined on a block a as a discrete random variable with a PMF over a finite scalar domain {0, 1, …, c}, where c is the maximum number of iterations. Let T(a), C(a), R(a), F(a) denote the PMFs of the response time, cost, reliability, and fidelity of a respectively. If a is executed l times, the response time is the l-fold sum T(a) + … + T(a). The response time of L is the conditional selection of these l-fold sums, with the probabilities given by the PMF of the number of iterations.

Similar arguments can be applied to the computation of cost and reliability. Regarding fidelity, let q be the probability of executing at least one iteration. When a is executed at least once, the fidelity of the loop structure, in our view, is determined simply by the last execution of a. The fidelity of L is therefore the conditional selection of the fidelity of a having been executed at least once and the fidelity of a not being executed, with probabilities q and 1 − q respectively.
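A minimal sketch of the loop computation, reusing the PMF helpers from the previous sketch and assuming the iteration count is supplied as a PMF iter_pmf over {0, …, c}:

# Sketch: response time of a loop as the conditional selection of l-fold sums of T(a).
def loop_response_time(t_a, iter_pmf):
    lfold = {0: {0: 1.0}}                          # zero iterations contribute zero time
    for l in range(1, max(iter_pmf) + 1):
        lfold[l] = pmf_add(lfold[l - 1], t_a)      # build T(a)+...+T(a), l times
    ls = sorted(iter_pmf)
    return pmf_conditional_selection([lfold[l] for l in ls], [iter_pmf[l] for l in ls])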

4 Efficient Computation of WS-Workflow QoS

4.1 High-Level Algorithm

A structured WS-workflow can be recursively constructed by using the five basic constructs. Figure 1 shows an example WS-workflow, namely PC order fulfillment. This WS-workflow is designed to tailor-make and deliver personal computers at a customer's request. At the highest level, the WS-workflow is a sequential construct that consists of Parts procurement, Assembly, Test, Adjustment, Shipping, and Customer notification.


Fig. 1. An example WS-workflow PC order fulfillment

Parts procurement is a parallel construct that comprises CPU procurement, HDD procurement, and CD-ROM procurement. CPU procurement in turn is a conditional construct composed of Intel CPU procurement and AMD CPU procurement. Adjustment is a loop construct on Fix&Test, which is iteratively executed until the quality of the PC is ensured.


Fig. 2. Pseudo code for computing QoS of a WS-workflow

Customer notification is a fault-tolerant construct that consists of Email notification and Phone notification. The success of either notification marks the completion of the entire WS-workflow.

The QoS of the entire WS-workflow can be recursively computed. The pseudocode is listed in Figure 2. Note that SequentialQoS(A.activities), ParallelQoS(A.activities), ConditionalQoS(A.activities), FaultTolerantQoS(A.activities), and LoopQoS(A.activities) are used to compute the four QoS metric values for the sequential, parallel, conditional, fault-tolerant, and loop constructs respectively. Their pseudocodes follow directly from our discussion in Section 3 and are omitted here for brevity.
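A hedged sketch of this recursive traversal (our own reconstruction, not the pseudo code of Fig. 2), computing only the response-time PMF and reusing the helpers sketched in Section 3; the node representation and field names are assumptions:

from functools import reduce

def response_time(node):
    """node: dict with 'kind' and, for composites, 'activities' plus construct parameters."""
    if node["kind"] == "atomic":
        return node["T"]                                    # PMF derived from past invocations
    children = [response_time(a) for a in node["activities"]]
    if node["kind"] == "sequential":                        # execution times add up
        return reduce(pmf_add, children)
    if node["kind"] == "parallel":                          # wait for the slowest branch
        return reduce(pmf_max, children)
    if node["kind"] == "conditional":                       # exactly one branch is executed
        return pmf_conditional_selection(children, node["branch_probs"])
    if node["kind"] == "fault_tolerant":                    # first completion wins
        return reduce(pmf_min, children)
    if node["kind"] == "loop":
        return loop_response_time(children[0], node["iter_pmf"])
    raise ValueError("unknown construct: " + node["kind"])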

4.2 Sample Space Reduction

When combining PMFs of discrete random variables with respect to a given operation, the sample space size of the resultant random variable may become huge. Consider adding k discrete random variables, each having n elements in their respective domains: the sample space size of the resultant random variable is, in the worst case, of the order of n^k. In order to keep the domain of a PMF after each operation at a reasonable size, we propose to group the elements in the sample space. In other words, several consecutive scalar values in the sample space are represented by a single value, and the aggregated probability is computed. The problem is formally described below.

Let the domain of X be {x_1, …, x_s}, where x_1 < x_2 < … < x_s, and let the PMF of X assign a probability Pr[X = x_i] to each x_i. We call another random variable Y an aggregate random variable of X if there exists a partition of x_1, …, x_s into consecutive subsequences such that the domain of Y consists of one representative value per subsequence, and the PMF of Y assigns to each representative value the total probability of its subsequence. The aggregate error of Y with respect to X, denoted aggregate_error(Y, X), is the mean square error of representing each x_i by the representative value of its subsequence.

Aggregate Random Variable Problem. Given a random variable X of domain size s and a desired domain size m, the problem is to find an aggregate random variable Y of domain size m such that its aggregate error with respect to X is minimized.

Dynamic Programming Method. An optimal solution to this problem can be obtained by formulating it as a dynamic program. Let e(i, j, k) be the optimal aggregate error of partitioning x_i, …, x_j into k subsequences. We have the following recurrence:

e(i, j, k) = min_{i <= l < j} { e(i, l, k-1) + error(l+1, j) },

where error(i, j) is the aggregated error introduced in representing x_i, …, x_j by a single value.
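A sketch of this dynamic program, under the assumption that error(i, j) is the probability-weighted squared deviation of x_i, …, x_j from their probability-weighted mean (the paper's exact definition may differ, e.g. in normalization):

# Sketch: optimal partition of a sorted domain into m consecutive groups; assumes m <= len(values).
def optimal_aggregate(values, probs, m):
    s = len(values)

    def error(i, j):                                   # recomputed on demand, for clarity
        p = sum(probs[i:j + 1])
        mean = sum(probs[t] * values[t] for t in range(i, j + 1)) / p
        return sum(probs[t] * (values[t] - mean) ** 2 for t in range(i, j + 1))

    INF = float("inf")
    e = [[INF] * (m + 1) for _ in range(s)]            # e[j][k]: best error for values[0..j], k groups
    for j in range(s):
        e[j][1] = error(0, j)
    for k in range(2, m + 1):
        for j in range(s):
            for i in range(k - 2, j):                  # last group is values[i+1..j]
                cand = e[i][k - 1] + error(i + 1, j)
                if cand < e[j][k]:
                    e[j][k] = cand
    return e[s - 1][m]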

The time complexity of the dynamic programming algorithm is O(s²m) and its space complexity is O(sm).

Greedy Method. To reduce the computation overhead, we propose a heuristic method for solving this problem. The idea is to continuously merge the adjacent pair of samples that gives the least error until a reasonable sample space size is reached. When an adjacent pair (x_i, x_{i+1}) is merged, a new element x' is created to replace x_i and x_{i+1}, where x' is the probability-weighted mean of the pair.

The error of merging x_i and x_{i+1}, denoted pair_error(x_i, x_{i+1}), is the aggregate error introduced by representing both x_i and x_{i+1} by x'.
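A sketch of the greedy heuristic, assuming x' is the probability-weighted mean of the merged pair and pair_error() is the weighted squared deviation of the two merged values from x'; for brevity a linear scan stands in for the priority queue described in the steps below:

# Sketch: repeatedly merge the adjacent pair with the smallest pair_error until m values remain.
def greedy_aggregate(values, probs, m):
    vals, ps = list(values), list(probs)
    while len(vals) > m:
        best_i, best_err, best = None, float("inf"), None
        for i in range(len(vals) - 1):
            p = ps[i] + ps[i + 1]
            x_new = (ps[i] * vals[i] + ps[i + 1] * vals[i + 1]) / p      # assumed merged value x'
            err = ps[i] * (vals[i] - x_new) ** 2 + ps[i + 1] * (vals[i + 1] - x_new) ** 2
            if err < best_err:
                best_i, best_err, best = i, err, (x_new, p)
        vals[best_i:best_i + 2] = [best[0]]                              # replace the pair by x'
        ps[best_i:best_i + 2] = [best[1]]
    return dict(zip(vals, ps))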

We can use a priority queue to store the errors of merging adjacent pairs. In each iteration, we perform the following steps:
1. Extract the adjacent pair with the least pair_error() value from the priority queue, say (x_i, x_{i+1}).
2. Replace x_i and x_{i+1} by the new value x' in the domain of X.
3. Compute pair_error(x_{i-1}, x') if i > 1 and pair_error(x', x_{i+2}) if i < s-1, and insert these errors into the priority queue.

greater = lambda x.lambda y.if (x > y) then x else y
lesser = lambda x.lambda y.if (x < y) then x else y

The function flatmap applies a list-valued function f to each member of a list xs and is defined in terms of fold:

flatmap f xs = fold f (++) [] xs

flatmap can in turn be used to define selection, projection and join operators and, more generally, comprehensions. For example, the following comprehension iterates through a list of students and returns those students who are not members of staff:

[x | x <- <<student>>; not (member <<staff>> x)]

and it translates into:

flatmap (lambda x.if (not (member <<staff>> x)) then [x] else []) <<student>>

Grouping operators are also definable in terms of fold. In particular, the operator group takes as an argument a list of pairs xs and groups them on their first component, while gc aggFun xs groups a list of pairs xs on their first component and then applies the aggregation function aggFun to the second component.³

³ We refer the reader to [8] for details of IQL.


There are several algebraic properties of IQL’s operators that we can use in order to incrementally compute materialised data and to reason about IQL expressions, specifically for the purposes of this paper in a schema/data evolution context (note that the algebraic properties of fold below apply to all the operators defined in terms of fold): (a) e ++[] = []++ e = e, e -- [] = e, [] -- e = [], distinct [] = sort [] = [] for any list-valued expression e. Since Void represents a construct for which no data is obtainable from a data source, it has the semantics of the empty list, and thus the above equivalences also hold if Void is substituted for []. (b) fold f op e [] = fold f op e Void = e, for any f, op, e (c) fold f op e (b1 ++ b2) = (fold f op e b1) op (fold f op e b2) for any f, op, e, b1, b2. Thus, we can always incrementally compute the value of fold-based functions if collections expand. (d) fold f op e (b1 -- b2) = (fold f op e b1) op’ (fold f op e b2) provided there is an operator op’ which is the inverse of op i.e. such that (a op b) op’ b = a for all a,b. For example, if op = + then op’ = -, and thus we can always incrementally compute the value of aggregation functions such as count, sum and avg if collections contract. Note that this is not possible for min and max since lesser and greater have no inverses. Although IQL is list-based, if the ordering of elements within lists is ignored then its operators are faithful to the expected bag semantics, and within AutoMed we generally do assume bag semantics. Under this assumption, (xs ++ ys) -- ys = xs for all xs,ys and thus we can incrementally compute the value of flatmap and all its derivative operators if collections contract4.
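To illustrate how properties (c) and (d) enable incremental recomputation, the following Python rendering (IQL itself is not executable here; the fold encoding and names are ours) maintains a materialised sum under expansion and contraction of its source collection:

# Sketch: fold f op e xs applies f to each element and combines the results with op, starting from e.
def fold(f, op, e, xs):
    acc = e
    for x in xs:
        acc = op(acc, f(x))
    return acc

def sum_view(xs):                              # sum = fold id (+) 0
    return fold(lambda x: x, lambda a, b: a + b, 0, xs)

old_view = sum_view([3, 5, 7])                 # materialised value: 15
new_view = old_view + sum_view([2, 4])         # property (c): expansion, no rescan of old data
new_view = new_view - sum_view([5])            # property (d): contraction, '-' is the inverse of '+'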

2.2 An Example

We will use schemas expressed in a simple relational data model and a simple XML data model to illustrate our techniques. However, we stress that these techniques are applicable to schemas defined in any data modelling language that has been specified within AutoMed's Model Definitions Repository.

In the simple relational model, there are two kinds of schema construct: Rel and Att. The extent of a Rel construct <<R>> is the projection of the relation R onto its primary key attributes. The extent of each Att construct <<R, a>>, where a is an attribute (key or non-key) of R, is the projection of relation R onto the primary key attributes and a. For example, the schema of table MAtab in Figure 2 consists of a Rel construct <<MAtab>> and four Att constructs <<MAtab, Dept>>, <<MAtab, CID>>, <<MAtab, SID>> and <<MAtab, Mark>>. We refer the reader to [15] for an encoding of a richer relational data model, including the modelling of constraints.

In the simple XML data model, there are three kinds of schema construct: Element, Attribute and NestSet.

The distinct operator can also be used to obtain set semantics, if needed


The extent of an Element construct <<e>> consists of all the elements with tag e in the XML document; the extent of each Attribute construct <<e, a>> consists of all pairs of an element and an attribute value such that the element has tag e and has an attribute a with that value; and the extent of each NestSet construct <<p, c>> consists of all pairs of elements such that the first element has tag p and has the second as a child element with tag c. We refer the reader to [21] for an encoding of a richer model for XML data sources, called XMLDSS, which also captures the ordering of children elements under parent elements and cardinality constraints. That paper gives an algorithm for generating the XMLDSS schema of an XML document. That paper also discusses a unique naming scheme for Element constructs so as to handle instances of the same element tag occurring at multiple positions in the XMLDSS tree.

Figure 2 illustrates the integration of three data sources, which respectively store students' marks for three departments MA, IS and CS.

Fig. 2. An example integration

Database for department MA has one table of students' marks for each course, where the relation name is the course ID. Database for department IS is an XML file containing information about course IDs, course names, student IDs and students' marks. Database for department CS has one table containing one row per student, giving the student's ID, name, and mark for the courses CSC01, CSC02 and CSC03. The conformed databases for each data source are materialised. Finally, the global database GD contains one table CourseSum(Dept,CID,Total,Avg) which gives the total and average mark for each course of each department. Note that the virtual union schema US (not shown) combines all the information from all the conformed schemas and consists of a virtual table Details(Dept,CID,SID,CName,SName,Mark).

The following transformation pathways express the schema transformation and integration processes in this example. Due to space limitations, we have not given the remaining steps for deleting/contracting the constructs in the source schema of each pathway (note that this 'growing' and 'shrinking' of schemas is characteristic of AutoMed schema transformation pathways):


The removal of the other two tables is similar.

3 Expressing Schema and Data Model Evolution

In a heterogeneous data warehousing environment, it is possible for either a data source schema or the integrated database schema to evolve. This schema evolution may be a change in the schema, or a change in the data model in which the schema is expressed, or both. AutoMed transformations can be used to express the schema evolution in all three cases:

(a) Consider first a schema S expressed in some modelling language. We can express the evolution of S to a new schema, also expressed in the same modelling language, as a series of primitive transformations that rename, add, extend, delete or contract constructs of that language. For example, suppose that the relational schema in the above example


evolves so that its three tables become a single table with an extra column for the course ID. This evolution is captured by a pathway which is identical to the pathway given above. This kind of transformation, which captures well-known equivalences between schemas, can be defined in AutoMed by means of a parametrised transformation template which is schema- and data-independent. When invoked with specific schema constructs and their extents, a template generates the appropriate sequence of primitive transformations within the Schemas & Transformations Repository – see [5] for details.

(b) Consider now a schema S expressed in one modelling language which evolves into an equivalent schema expressed in another modelling language. We can express this translation by a series of add steps that define the constructs of the new schema, in the new modelling language, in terms of the constructs of S in the original modelling language. At this stage, we have an intermediate schema that contains the constructs of both schemas. We then specify a series of delete steps that remove the constructs of the original schema (the queries within these transformations indicate that these are now redundant constructs since they can be derived from the new constructs). For example, suppose that the XML schema in the above example evolves into an equivalent relational schema consisting of a single table with one column per attribute. This evolution is captured by a pathway which is identical to the pathway given above. Again, such generic inter-model translations between one data model and another can be defined in AutoMed by means of transformation templates.

(c) Considering finally an evolution which is both a change in the schema and in the data model, this can be expressed by a combination of (a) and (b) above: either (a) followed by (b), or (b) followed by (a), or indeed by interleaving the two processes.

4 Handling Schema Evolution

In this section we consider how the general integration network illustrated in Figure 1 is evolvable in the face of evolution of a local schema or the warehouse schema. We have seen in the previous section how AutoMed transformations can be used to express the schema evolution if either the schema or the data model changes, or both. We can therefore treat schema and data model change in a uniform way for the purposes of handling schema evolution: both are expressed as a sequence of AutoMed primitive transformations, in the first case staying within the original data model, and in the second case transforming the original schema in the original data model into a new schema in a new data model. In this section we describe the actions that are taken in order to evolve the integration network of Figure 1 if the global schema GS evolves (Section 4.1) or if a local schema evolves (Section 4.2). Given an evolution pathway from a schema S to a new schema, in both cases each successive primitive transformation within the pathway is treated one at a time. Thus, we describe in Sections 4.1 and 4.2 the actions that are taken if the pathway consists of just


one primitive transformation. If it is a composite transformation, then it is handled as a sequence of primitive transformations. Our discussion below assumes that the primitive transformation being handled is adding, removing or renaming a construct of S that has an underlying data extent. We do not discuss the addition or removal of constraints here as these do not impact on the materialised data, and we make the assumption that any constraints in the pathway have been verified as being valid.

4.1 Evolution of the Global Schema

Suppose the global schema GS evolves by means of a primitive transformation into a new global schema. This is expressed by the transformation step being appended to the pathway of Figure 1. The new global schema has an associated extension; GS is now an intermediate schema in the extended pathway and it no longer has an extension associated with it. The transformation may be a rename, add, extend, delete or contract transformation. The following actions are taken in each case:

1. If it is a rename, then there is nothing further to do. GS is semantically equivalent to the new schema, and the new global database is identical to GD except that the extent of the renamed construct now appears under its new name.
2. If it is an add, then there is nothing further to do at the schema level: GS is semantically equivalent to the new schema. However, the new construct must now be populated, and this is achieved by evaluating the query of the add step over GD.
3. If it is an extend, then the new construct is populated by an empty extent. This new construct may subsequently be populated by an expansion in a data source (see Section 4.2).
4. If it is a delete or contract, then the extent of the affected construct must be removed from GD in order to create the new global database (it is assumed that this is a legal deletion/contraction, e.g. if we wanted to delete/contract a table from a relational schema, then first the constraints and then the columns would be deleted/contracted and lastly the table itself; such syntactic correctness of transformation pathways is automatically verified by AutoMed). It may now be possible to simplify the transformation network, in that if the pathway contains a matching add or extend transformation then both this and the new transformation can be removed from the pathway. This is purely an optimization – it does not change the meaning of a pathway, nor its effect on view generation and query/data translation. We refer the reader to [19] for details of the algorithms that simplify AutoMed transformation pathways.

In cases 2 and 3 above, the new construct will automatically be propagated into the schema DMS of any data mart derived from GS. To prevent this, a contract transformation can be prefixed to the pathway from GS to DMS. Alternatively, the new construct can be propagated to DMS if so desired, and materialised there. In cases 1 and 4 above, the change in GS and GD may impact on the data marts derived from GS, and we discuss this in Section 4.3.
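A hedged sketch of this case analysis (not AutoMed's actual API; the record layout, the evaluate callable and the dictionary-based store are our assumptions):

# Sketch: applying one primitive transformation on the global schema to the materialised data.
def handle_global_schema_step(step, global_db, evaluate):
    """step: dict with 'kind', 'construct' and optional 'query'/'new_name';
    global_db: dict mapping construct names to materialised extents;
    evaluate: callable evaluating a query over global_db (stands in for IQL evaluation)."""
    kind = step["kind"]
    if kind == "rename":                                   # case 1: just move the extent
        global_db[step["new_name"]] = global_db.pop(step["construct"])
    elif kind == "add":                                    # case 2: populate by evaluating q over GD
        global_db[step["construct"]] = evaluate(step["query"], global_db)
    elif kind == "extend":                                 # case 3: empty extent, may grow later
        global_db[step["construct"]] = []
    elif kind in ("delete", "contract"):                   # case 4: remove the extent
        global_db.pop(step["construct"], None)
        # a matching add/extend earlier in the pathway could now be simplified away (see [19])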


4.2 Evolution of a Local Schema

Suppose a local schema evolves by means of a primitive transformation into As discussed in Section 2, there is automatically available a reverse transformation from to and hence a pathway from to The new local schema is and its associated extension is is now just an intermediate schema in the extended pathway and it no longer has an associated extension. may be a rename, add, delete, extend or contract transformation. In 1–5 below we see what further actions are taken in each case for evolving the integration network and the downstream materialised data as necessary. We first introduce some necessary terminology: If is a pathway and is a construct in S, we denote by the constructs of which are directly or indirectly dependent on either because itself appears in or because a construct of is created by a transformation add within where the query directly or indirectly references The set can be straight-forwardly computed by traversing and inspecting the query associated with each add transformation within in. 1. If is rename then schema is semantically equivalent to The new transformation pathway is The new local database is identical to except that the extent of in is now the extent of in then has evolved to contain a new construct whose 2. If is add extent is equivalent to the expression over the other constructs of The new transformation pathway is this means that has evolved to not include a construct 3. If is delete whose extent is derivable from the expression over the other constructs of and the new local database no longer contains an extent for The new transformation pathway is

In the above three cases, schema is semantically equivalent to and nothing further needs to be done to any of the transformation pathways, schemas or databases and GD. This may not be the case if is a contract or extend transformation, which we consider next. 4. If is extend then there will be a new construct available from that was not available before. That is, has evolved to contain the new construct whose extent is not derivable from the other constructs of If we left the transformation pathway as it is, this would result in a pathway from to which would immediately drop the new construct from the integration network. That is, is consistent but it does not utilize the new data.

However, recall that we said earlier that we assume no contract steps in the pathways from local schemas to their union schemas, and that all the data in the local databases should be available to the integration network. In order to achieve this, there are four cases to consider:


appears in and has the same semantics as the newly added in Since cannot be derived from the original there must be a transformation extend in We remove from the new contract c step and this matching extend step. This propagates into and we populate its extent in the materialised database by replicating its extent from (b) does not appear in but it can be derived from by means of some transformation T. In this case, we remove from the first contract c step, so that is now present in and in We populate the extent of in by replicating its extent from To repair the other pathways and schemas for we append T to the end of each As a result, the new construct now appears in all the union schemas. To add the extent of this new construct to each materialised database for we compute it from the extents of the other constructs in using the queries within successive add steps in T. We finally append the necessary new id steps between pairs of union schemas to assert the semantic equivalence of the construct within them. (c) does not appear in and cannot be derived from In this case, we again remove from the first contract c step so that is now present in schema To repair the other pathways and schemas for we append an extend step to the end of each As a result, the new construct now appears in all the conformed schemas The construct may need further translation into the data model of the union schemas and this is done by appending the necessary sequence, T, of add/delete/rename steps to all the pathways We compute the extent of within the database from its extent within using the queries within successive add steps in T. We finally append the necessary new id steps between pairs of union schemas to assert the semantic equivalence of the new construct(s) within them. (d) appears in but has different semantics to the newly added in In this case, we rename in to a new construct The situation reverts to adding a new construct to and one of (a)-(c) above applies. (a)

We note that determining whether can or cannot be derived from the existing constructs of the union schemas in (a)–(d) above requires domain or expert human knowledge. Thereafter, the remaining actions are fully automatic. In cases (a) and (b), there is new data added to one or more of the conformed databases which needs to be propagated to GD. This is done by computing and using the algebraic equivalences of Section 2.1 to propagate changes in the extent of to each of its descendant constructs gc in GS. Using these equivalences, we can in most cases incrementally recompute the extent of gc. If at any stage in there is a transformation add where no equivalence can be applied, then we have to recompute the whole extent of


In cases (b) and (c), there is a new schema construct appearing in the This construct will automatically appear in the schema GS. If this is not desired, a transformation contract can be prefixed to 5. If is contract then the construct in will no longer be available from That is, has evolved so as to not include a construct whose extent is not derivable from the other constructs of The new local database no longer contains an extent for The new transformation pathway is Since the extent of is now Void, the materialised data in and GD must be modified so as to remove any data derived from the old extent of In order to repair we compute For each construct uc in we compute its new extent and replace its old extent in by the new extent. Again, the algebraic properties of IQL queries discussed in Section 2.1 can be used to propagate the new Void extent of construct in to each of its descendant constructs uc in Using these equivalences, we can in most cases incrementally recompute the extent of uc as we traverse the pathway In order to repair GD, we similarly propagate changes in the extent of each uc along the pathway Finally, it may also be necessary to amend the transformation pathways if there are one or more constructs in GD which now will always have an empty extent as a result of this contraction of For any construct uc in US whose extent has become empty, we examine all pathways If all these pathways contain an extend uc transformation, or if using the equivalences of Section 2.1 we can deduce from them that the extent of uc will always be empty, then we can suffix a contract gc step to for every gc in and then handle this case as paragraph 4 in Section 4.1.

4.3 Evolution of Downstream Data Marts

We have discussed how evolutions to the global schema or to a source schema are handled. One remaining question is how to handle the impact of a change to the data warehouse schema, and possibly its data, on any data marts that have been derived from it. In [7] we discuss how it is possible to express the derivation of a data mart from a data warehouse by means of an AutoMed transformation pathway. Such a pathway expresses the relationship of a data mart schema DMS to the warehouse schema GS. As such, this scenario can be regarded as a special case of the general integration scenario of Figure 1, where GS now plays the role of the single source schema, the databases and GD collectively play the role of the data associated with this source schema, and DMS plays the role of the global schema. Therefore, the same techniques as discussed in Sections 4.1 and 4.2 can be applied.


5 Concluding Remarks

In this paper we have described how the AutoMed heterogeneous data integration toolkit can be used to handle the problem of schema evolution in heterogeneous data warehousing environments so that the previous transformation, integration and data materialisation effort can be reused. Our algorithms are mainly automatic, except for the aspects that require domain or expert human knowledge regarding the semantics of new schema constructs. We have shown how AutoMed transformations can be used to express schema evolution within the same data model, or a change in the data model, or both, whereas other schema evolution literature has focussed on just one data model. Schema evolution within the relational data model has been discussed in previous work such as [11,12,18]. The approach in [18] uses a first-order schema in which all values in a schema of interest to a user are modelled as data, and other schemas can be expressed as a query over this first-order schema. The approach in [12] uses the notation of a flat scheme, and gives four operators UNITE, FOLD, UNFOLD and SPLIT to perform relational schema evolution using the SchemaSQL language. In contrast, with AutoMed the process of schema evolution is expressed using a simple set of primitive schema transformations augmented with a functional query language, both of which are applicable to multiple data models. Our approach is complementary to work on mapping composition, e.g. [20, 14], in that in our case the new mappings are a composition of the original transformation pathway and the transformation pathway which expresses the schema evolution. Thus, the new mappings are, by definition, correct. There are two aspects to our approach: (i) handling the transformation pathways and (ii) handling the queries within them. In this paper we have in particular assumed that the queries are expressed in IQL. However, the AutoMed toolkit allows any query language syntax to be used within primitive transformations, and therefore this aspect of our approach could be extended to other query languages. Materialised data warehouse views need to be maintained when the data sources change, and much previous work has addressed this problem at the data level. However, as we have discussed in this paper, materialised data warehouse views may also need to be modified if there is an evolution of a data source schema. Incremental maintenance of schema-restructuring views within the relational data model is discussed in [10], whereas our approach can handle this problem in a heterogeneous data warehousing environment with multiple data models and changes in data models. Our previous work [7] has discussed how AutoMed transformation pathways can also be used for incrementally maintaining materialised views at the data level. For future work, we are implementing our approach and evaluating it in the context of biological data warehousing.


References 1. J. Andany, M. Léonard, and C. Palisser. Management of schema evolution in databases. In Proc. VLDB’91, pages 161–170. Morgan Kaufmann, 1991. 2. Z. Bellahsene. View mechanism for schema evolution in object-oriented DBMS. In Proc. BNCOD’96, LNCS 1094. Springer, 1996. 3. B. Benatallah. A unified framework for supporting dynamic schema evolution in object databases. In Proc. ER’99, LNCS 1728. Springer, 1999. 4. M. Blaschka, C. Sapia, and G. Höfling. On schema evolution in multidimensional databases. In Proc. DaWaK’99, LNCS 1767. Springer, 1999. 5. M. Boyd, S. Kittivoravitkul, C. Lazanitis, P.J. McBrien, and N. Rizopoulos. AutoMed: A BAV data integration system for heterogeneous data sources. In Proc. CAiSE’04, 2004. 6. P. Buneman et al. Comprehension syntax. SIGMOD Record, 23(1):87–96, 1994. 7. H. Fan and A. Poulovassilis. Using AutoMed metadata in data warehousing environments. In Proc. DOLAP’03, pages 86–93. ACM Press, 2003. 8. E. Jasper, A. Poulovassilis, and L. Zamboulis. Processing IQL queries and migrating data in the AutoMed toolkit. Technical Report 20, Automed Project, 2003. 9. E. Jasper, N. Tong, P. McBrien, and A. Poulovassilis. View generation and optimisation in the AutoMed data integration framework. In Proc. 6th Baltic Conference on Databases and Information Systems, 2004. 10. A. Koeller and E. A. Rundensteiner. Incremental maintenance of schemarestructuring views. In Proc. EDBT’02, LNCS 2287. Springer, 2002. 11. L. V. S. Lakshmanan, F. Sadri, and I. N. Subramanian. On the logical foundations of schema integration and evolution in heterogeneous database systems. In Proc. DOOD’93, LNCS 760. Springer, 1993. 12. L. V. S. Lakshmanan, F. Sadri, and S. N. Subramanian. On efficiently implementing SchemaSQL on an SQL database system. In Proc. VLDB’99, pages 471–482. Morgan Kaufmann, 1999. 13. M. Lenzerini. Data integration: A theoretical perspective. In Proc. PODS’02, 2002. 14. Jayant Madhavan and Alon Y. Halevy. Composing mappings among data sources. In Proc. VLDB’03. Morgan Kaufmann, 2003. 15. P. McBrien and A. Poulovassilis. A uniform approach to inter-model transformations. In Proc. CAiSE’99, LNCS 1626, pages 333–348. Springer, 1999. 16. P. McBrien and A. Poulovassilis. Schema evolution in heterogeneous database architectures, a schema transformation approach. In Proc. CAiSE’02, LNCS 2348, pages 484–499. Springer, 2002. 17. P. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation rules. In Proc. ICDE’03, pages 227–238, 2003. 18. Renée J. Miller. Using schematically heterogeneous structures. In Proc. ACM SIGMOD’98, pages 189–200. ACM Press, 1998. 19. N. Tong. Database schema transformation optimisation techniques for the AutoMed system. In Proc. BNCOD’03, LNCS 2712. Springer, 2003. 20. Yannis Velegrakis, Renée J. Miller, and Lucian Popa. Mapping adaptation under evolving schemas. In Proc. VLDB’03. Morgan Kaufmann, 2003. 21. L. Zamboulis. XML data integration by graph restrucring. In Proc. BNCOD’04, LNCS 3112. Springer, 2004.

Metaprogramming for Relational Databases Jernej Kovse, Christian Weber, and Theo Härder Department of Computer Science Kaiserslautern University of Technology P.O. Box 3049, D-67653 Kaiserslautern, Germany {kovse,c_weber,haerder}@informatik.uni-kl.de

Abstract. For systems that share enough structural and functional commonalities, reuse in schema development and data manipulation can be achieved by defining problem-oriented languages. Such languages are often called domain-specific, because they introduce powerful abstractions meaningful only within the domain of observed systems. In order to use domain-specific languages for database applications, a mapping to SQL is required. In this paper, we deal with metaprogramming concepts required for easy definition of such mappings. Using an example domain-specific language, we provide an evaluation of mapping performance.

1 Introduction

A large variety of approaches use SQL as a language for interacting with the database, but at the same time provide a separate problem-oriented language for developing database schemas and formulating queries. A translator maps a statement in such a problem-oriented language to a series of SQL statements that get executed by the DBMS. An example of such a system is Preference SQL, described by Kießling and Köstler [8]. Preference SQL is an SQL extension that provides a set of language constructs which support easy use of soft preferences. This kind of preference is useful when searching for products and services in diverse e-commerce applications where a set of strictly observed hard constraints usually results in an empty result set, although products that approximately match the user's demands do exist. The supported constructs include approximation (clauses AROUND and BETWEEN), minimization/maximization (clauses LOWEST, HIGHEST), favorites and dislikes (clauses POS, NEG), pareto accumulation (clause AND), and cascading of preferences (clause CASCADE) (see [8] for examples).

In general, problem-oriented programming languages are also called domain-specific languages (DSLs), because they prove useful when developing and using systems from a predefined domain. The systems in a domain will exhibit a range of similar structural and functional features (see [4,5] for details), making it possible to describe them (and, in our case, query their data) using higher-level programming constructs. In turn, these constructs carry semantics meaningful only within this domain. As the activity of using these constructs is referred to as programming, defining such constructs and their mappings to languages that can be compiled or interpreted to allow their execution is referred to as metaprogramming.


This paper focuses on the application of metaprogramming for relational databases. In particular, we are interested in concepts that guide the implementation of fast mappings of custom languages, used for developing database schemas and manipulating data, onto SQL-DDL and SQL-DML.

The paper is structured as follows. First, in Sect. 2, we further motivate the need for DSLs for data management. An overview of related work is given in Sect. 3. Our system prototype (DSL-DA – domain-specific languages for database applications) that supports the presented ideas is outlined in Sect. 4. A detailed performance evaluation of a DSL for the example product line is presented in Sect. 5. Sect. 6 gives a detailed overview of metaprogramming concepts. Finally, in Sect. 7, we summarize our results and give some ideas for future work related to our approach.

2 Domain-Specific Languages The idea of DSLs is tightly related to domain engineering. According to Czarnecki and Eisenecker [5], domain engineering deals with collecting, organizing, and storing past experience in building systems in form of reusable assets. In general, we can rely that a given asset can be reused in a new system in case this system possesses some structural and functional similarity to previous systems. Indeed, systems that share enough common properties are said to constitute a system family (a more market-oriented term for a system family is a software product-line). Examples of software product-lines are extensively outlined by Clements and Northrop [4] and include satellite controllers, internal combustion engine controllers, and systems for displaying and tracing stock-market data. Further examples of more data-centric product lines include CRM and ERP systems. Our example product line for versioning systems will be introduced in Sect. 4. Three approaches can be applied to allow the reuse of “assets” when developing database schemas for systems in a data-intensive product line. Components: Schema components can be used to group larger reusable parts of a database schema to be used in diverse systems afterwards (see Thalheim [16] for an extensive overview of this approach). Generally, the modularity of system specification (which components are to be used) directly corresponds to the modularity of the resulting implementation, because a component does not influence the internal implementation of other components. This kind of specification transformations towards the implementation is referred to as vertical transformations or forward refinements [5]. Frameworks: Much like software frameworks in general (see, for example, Apache Struts [1] or IBM San Francisco [2]), schema frameworks rely on the user to extend them with system-specific parts. This step is called framework instantiation and requires certain knowledge of how the missing parts will be called by the framework. Most often, this is achieved by extending superclasses defined by the framework or implementing call-back methods which will be invoked by mechanisms such as reflection. In a DBMS, application logic otherwise captured by such methods can be defined by means of constraints, trigger conditions and actions, and stored procedures. A detailed


overview of schema frameworks is given by Mahnke [9]. Being more flexible than components, frameworks generally require more expertise from the user. Moreover, due to performance reasons, most DBMSs restrain from dynamic invocation possibilities through method overloading or reflection (otherwise supported in common OO programming languages). For this reason, schema frameworks are difficult to implement without middleware acting as a mediator for such calls. Generators: Schema generators are, in our opinion, the most advanced approach to reuse and are the central topic of this paper. A schema generator acts much like a compiler: It transforms a high-level specification of the system to a schema definition, possibly equipped with constraints, triggers, and stored procedures. In general, the modularity of the specification does not have to be preserved. Two modular parts of the specification can be interwoven to obtain a single modular part in the schema (these transformations are called horizontal transformations; in case the obtained part in the schema is also refined, for example, columns not explicitly defined in the specification are added to a table, this is called an oblique transformation, i.e., a combination of a horizontal and a vertical transformation.) It is important to note that there is no special “magic” associated with schema generators that allows them to obtain a ready-to-use schema out of a short specification. By narrowing the domain of systems, it is possible to introduce very powerful language abstractions that are used at the specification level. Due to similarities between systems, these abstractions aggregate a lot of semantics that is dispersed across many schema elements. Because defining this semantics in SQL-DDL proves labour-intensive, we rather choose to define a special domain-specific DDL (DS-DDL) for specifying the schema at a higher level of abstraction and implement the corresponding mapping to SQLDDL. The mapping represents the “reusable asset” and can be used with any schema definition in this DS-DDL. The data manipulation part complementary to DS-DDL is called DS-DML and allows the use of domain-specific query and update statements in application programs. Defining custom DS-DDLs and their mappings to SQL-DDL as well as fast translation of DS-DML statements is the topic we explore in this paper.

3 Related Work

Generators are the central idea of the OMG’s Model Driven Architecture (MDA) [13] which proposes the specification of systems using standardized modeling languages (UML) and automatic generation of implementations from models. However, even OMG notices the need of supporting custom domain-specific modeling languages. As noted by Frankel [6], this can be done in three different ways: Completely new modeling languages: A new DSL can be obtained by defining a new MOF-based metamodel. Heavyweight language extensions: A new DSL can be obtained by extending the elements of a standardized metamodel (e.g., the UML Metamodel). Lightweight language extensions: A new DSL can be obtained by defining new language abstractions using the language itself. In UML, this possibility is supported by UML Profiles.


The research area that deals with developing custom (domain-specific) software engineering methodologies well suited for particular systems is called computer-aided method engineering (CAME) [14]. CAME tools allow the user to describe an own modeling method and afterwards generate a CASE tool that supports this method. For an example of a tool supporting this approach, see MetaEdit+ [11]. The idea of a rapid definition of domain-specific programming languages and their mapping to a platform where they can be executed is materialized in Simonyi’s work on Intentional Programming (IP) [5,15]. IP introduces an IDE based on active libraries that are used to import language abstractions (also called intentions) into this environment. Programs in the environment are represented as source graphs in which each node possesses a special pointer to a corresponding abstraction. The abstractions define extension methods which are metaprograms that specify the behavior of nodes. The following are the most important extension methods in IP. Rendering and type-in methods. Because it is cumbersome to edit the source graph directly, rendering methods are used to visualize the source graph in an editable notation. Type-in methods convert the code typed in this notation back to the source graph. This is especially convenient when different notations prove useful for a single source graph. Refactoring methods. These methods are used to restructure the source graph by factoring out repeating code parts to improve reuse. Reduction methods. The most important component of IP, these methods reduce the source graph to a graph of low-level abstractions (also called reduced code or R-code) that represent programs executable on a given platform. Different reduction methods can be used to obtain the R-code for different platforms. How does this work relate to our problem? Similar as in IP, we want to support a custom definition of abstractions that form both a custom DS-DDL and a custom DS-DML. We want to support the rendering of source graphs for DS-DDL and DS-DML statements to (possibly diverse) domain-specific textual representations. Most importantly, we want to support the reduction of these graphs to graphs representing SQL statements that can be executed by a particular DBMS.

4 DSL-DA System

In our DSL-DA system, the user starts by defining a domain-specific (DS) metamodel that describes the language abstractions that can appear in the source graph for the DS-DDL (the language used for defining metamodels is a simplified variant of the MOF Model). We used the system to fully implement a DSL for the example product line of versioning systems, which we also use in the next section for the evaluation of our approach. In this product line, each system is used to store and version objects (of some object type) and relationships (of some relationship type). Thus individual systems differ in their type definitions (also called information models [3]) as well as other features illustrated in the DS-DDL metamodel in Fig. 1 and explained below.


Fig. 1. DS-DDL metamodel for the example product line

Object types can be versioned or unversioned. The number of direct successors to a version can be limited to some number (maxSuccessors) for a given versioned object type. Relationship types connect to object types using either non-floating or floating relationship ends. A non-floating relationship end connects directly to a particular version as if this version were a regular object. On the other hand, a floating relationship end maintains a user-managed subset of all object versions for each connected object. Such subsets are called candidate version collections (CVC) and prove useful for managing configurations. In unfiltered navigation from some origin object, all versions contained in every connected CVC will be returned. In filtered navigation, a version preselected for each CVC (also called the pinned version) will be returned. In case there is no pinned version, we return the latest version from the CVC. Workspace objects act as containers for other objects. However, only one version of a contained object can be present in the workspace at a time. In this way, workspaces allow a version-free view to the contents of a versioning system. When executed within a workspace, filtered navigation returns versions from the CVC that are connected to this workspace and ignores the pin setting of the CVC. Operations create object, copy, delete, create successor, attach/detach (connects/ disconnects an object to/from a workspace), freeze, and checkout/checkin (locks/ unlocks the object) can propagate across relationships. A model expressed using the DS-DDL metamodel from Fig. 1 will represent a source graph for a particular DS-DDL schema definition used to describe a given versioning system. To work with these models (manipulate the graph nodes), DSL-DA uses the DS-DDL metamodel to generate a schema editor that displays the graphs in a tree-like form (see the left-hand side of Fig. 2). A more convenient graphical notation of a source graph for our example versioning system that we will use for the evaluation in the next section is illustrated in Fig. 3. The metamodel classes define rendering and type-in methods that render the source graph to a textual representation and allow its editing (right-hand side of Fig. 2). More importantly, the metamodel classes define reduction methods that will reduce the


Fig. 2. DS-DDL schema development with the generated editor

Fig. 3. Example DS-DDL schema used in performance evaluation

source graph to its representation in SQL-DDL. In analogy with the domain-specific level of the editor, the obtained SQL-DDL schema is also represented as a source graph; the classes used for this graph are the classes defined by the package Relational of the OMG's Common Warehouse Metamodel (CWM) [12].


The rendering methods of these classes are customizable so that, by rendering the SQL-DDL source graphs, SQL-DDL schemas in the SQL dialects of diverse DBMS vendors can be obtained.

Once an SQL-DDL schema is installed in a database, how do we handle statements in DS-DML (three examples of such statements are given by Table 1)? As for the DS-DDL, there is a complementary DS-DML metamodel that describes language abstractions of the supported DS-DML statements. This metamodel can be simply defined by first coming up with an EBNF for the DS-DML and afterwards translating the EBNF symbols to class definitions in a straightforward fashion. The EBNF of our DS-DML for the sample product line for versioning systems is available through [17]. DS-DML statements can then be represented as source graphs, where each node in the graph is an instance of some class from the DS-DML metamodel. Again, metamodel classes define reduction methods that reduce the corresponding DS-DML source graph to an SQL-DML source graph, out of which SQL-DML statements can be obtained through rendering.

DS-DML is used by an application programmer to embed domain-specific queries and data manipulation statements in the application code. In certain cases, the general structure of a DS-DML statement will be known at the time the application is written and the parameters of the statement will only need to be filled with user-provided values at run time. Since these parameters do not influence the reduction, the reduction from DS-DML to SQL-DML can take place using a precompiler. Sometimes, however, especially in the case of Web applications, the structure of the DS-DML query will depend on the user's search criteria and other preferences and is thus not known at compile time. The solution in this case is to wrap the native DBMS driver into a domain-specific driver that performs the reduction at run time, passes the SQL-DML statements to the native driver, and restructures the result sets before returning them to the user, if necessary. To handle both cases, where the query structure is known at compile time and where it is not, DSL-DA can generate both the precompiler and the domain-specific driver from the DS-DML metamodel, its reduction methods, and its rendering methods for SQL-DML. We assumed the worst-case scenario, in which all SQL-DML statements need to be reduced at run time, for our evaluation in the next section, in order to examine the effect of run-time reduction in detail.
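As an illustration of the run-time path (hypothetical names; the DSL-DA implementation itself is written in Java), a domain-specific driver can be sketched as a thin wrapper around a native DB-API cursor that parses, reduces and renders before delegating execution:

# Sketch: a domain-specific cursor wrapping a native cursor; parse/reduce/render are
# injected callables standing in for the generated metaprogramming machinery.
class DomainSpecificCursor:
    def __init__(self, native_cursor, parse, reduce_to_sql, render):
        self.native = native_cursor
        self.parse = parse                  # DS-DML text -> DS-DML source graph
        self.reduce_to_sql = reduce_to_sql  # DS-DML source graph -> SQL-DML source graph(s)
        self.render = render                # SQL-DML source graph -> SQL text for one dialect

    def execute(self, dsdml_statement, params=()):
        source_graph = self.parse(dsdml_statement)
        for sql_graph in self.reduce_to_sql(source_graph):
            self.native.execute(self.render(sql_graph), params)
        return self

    def fetchall(self):
        return self.native.fetchall()       # result-set restructuring could be added here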


5 Evaluation of the Example Product Line

The purpose of the evaluation presented in this section is to demonstrate the following.
Even for structurally complex DS-DML statements, the reduction process carried out at run time represents a very small proportion of the cost needed to carry out the SQL-DML statements obtained by reduction.
DS-DDL schemas that have been reduced to SQL-DDL with certain optimizations in mind imply reduction that is more difficult to implement. Somewhat surprisingly, this does not necessarily mean that such reduction will also take more processing time. Optimization considerations can significantly contribute to a faster execution of DS-DML statements once reduced to SQL-DML.

To demonstrate both points, we implemented four very different variants of both DS-DDL and DS-DML reduction methods for the example product line. The DS-DDL schema from Fig. 3 has thus been reduced to four different SQL-DDL schemas. In all four variants, object types from Fig. 3 are mapped to tables (called object tables) with the specified attributes. An object version is then represented as a tuple in this table. The identifiers in each object table include an objectId (all versions of a particular object, i.e., all versions within the same version tree, possess the same objectId), a versionId (identifying a particular version within the version tree) and a globalId, which is a combination of an objectId and a versionId. The four reductions differ in the following way.

Variant 1: Store all relationships, regardless of relationship type, using a single "generic" table. For a particular relationship, store the origin globalId, objectId, versionId and the target rolename, globalId, objectId, and versionId as columns. Use an additional column as a flag denoting whether the target version is pinned.
Variant 2: Use separate tables for every relationship type. In case a relationship type defines no floating ends or two floating ends, this relationship type can be represented by a single table. In case only one relationship end is floating, such a relationship type requires two tables, one for each direction of navigation.
Variant 3: Improve Variant 2 by considering a maximal multiplicity of 1 on non-floating ends. For such ends, the globalId of the connected target object is stored as a column in the object table of the origin object.
Variant 4: Improve Variant 3 by considering a maximal multiplicity of 1 on floating ends. For such ends, the globalIds of the pinned version and the latest version of the CVC for the target object can be stored as columns in the object table of the origin object.

Our benchmark, consisting of 141,775 DS-DML statements, was then run using four different domain-specific drivers corresponding to the four variants of reduction. To eliminate the need to fetch metadata from the database, we assumed that, once defined, the DS-DDL schema does not change, so each driver accessed the DS-DDL schema defined in Fig. 3 directly in main memory.

The overall time for executing a DS-DML statement is the sum of the time required for DS-DML parsing, the time required for reduction, the time required for rendering all resulting SQL-DML statements, and the time used to carry out these statements. Note that the parsing time is independent of the variant, so we were mainly interested in the remaining three times as well as the overall time.


Fig. 4. Execution times for the category of select statements

Fig. 5. Execution times for the category of create relationship statements

Fig. 6. Overhead due to DS-DML parsing, reduction and rendering

The average execution times for the category of select statements are illustrated in Fig. 4. This category included queries over versioned data within and outside workspaces, containing up to four navigation steps. As evident from Fig. 4, Variant 4 demonstrates very good performance and also allows the fastest reduction. On the other hand, due to the materialization of the globalIds of pinned and latest versions for CVCs in Variant 4, Variant 2 proves faster for manipulation (i.e., creation and deletion of relationships). The execution times for the category of create relationship statements are illustrated in Fig. 5. Most importantly, the overhead time required by the domain-specific driver (t_parse + t_red + t_rend) proves to be only a small portion of the overall time t_overall. As illustrated in Fig. 6, when using Variant 4, this portion is lowest (0.8%) for the category of select statements


Fig. 7. Properties of reduction methods

and highest (9.9%) for merge statements. When merging two versions (denoted as the primary and the secondary version), their attribute values have to be compared to their so-called base (latest common) version in the version graph to decide which values should be used for the result of the merge. This comparison, which is performed in the driver, accounts for a large share of the overhead (9.1% of the overall time t_overall). Note that t_SQL is the minimal time an application spends executing SQL-DML statements in any case (with or without DS-DML available) to provide the user with equivalent results: even without DS-DML, the programmer would have to implement data flows to connect sequences of SQL-DML statements to perform a given operation (in our evaluation, we treat such data flows as part of t_SQL).

How difficult is it to implement the DS-DML reduction methods? To estimate this aspect, we used measures such as the count of expressions, statements, conditional statements, and loops, as well as McCabe’s cyclomatic complexity [10] and the Halstead effort [7], applied to our Java implementation of the reduction methods. The summarized results obtained using these measures are illustrated in Fig. 7. All measures except the count of loops confirm an increasing difficulty of implementing the reduction (e.g., the Halstead effort almost doubles from Variant 1 to Variant 4).

Is there a correlation between the Halstead effort for writing a method and the times t_red and t_rend? We try to answer this question in Fig. 8. Somewhat surprisingly, a statement whose reduction is more difficult to implement will sometimes also reduce faster (i.e., an increase in Halstead effort does not necessarily imply an increase in t_red), which is most evident for the category of select statements. The explanation is that even though the developer has to consider a large variety of different reductions for a complex variant (e.g., Variant 4), once the driver has found the right reduction (see Sect. 6), the reduction can proceed even faster than for a variant with fewer optimization considerations (e.g., Variant 1). For all categories in Fig. 8, a decreasing trend for t_red values can be observed. However, in categories that manipulate the state of the CVC (note that operations from the category


Fig. 8. Correlation of t_red and t_rend to the Halstead effort

copy object propagate across relationships and thus manipulate the CVCs), the impedance due to materializing the pin setting and the latest version comes into effect and often results in only minor differences in t_red values among Variants 2–4.
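As a rough illustration of the three-way attribute comparison performed in the driver for merge statements, the following Java sketch compares a primary and a secondary version against their base version; the method names and the conflict policy (favoring the primary version when both sides changed) are assumptions for illustration, not the implementation used in the evaluation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: three-way merge of attribute values against the base
// (latest common) version of the primary and secondary versions.
public class VersionMerger {

    // Each map holds attribute name -> attribute value for one version.
    public static Map<String, Object> merge(Map<String, Object> base,
                                            Map<String, Object> primary,
                                            Map<String, Object> secondary) {
        Map<String, Object> result = new HashMap<>();
        for (String attr : base.keySet()) {
            Object b = base.get(attr);
            Object p = primary.get(attr);
            Object s = secondary.get(attr);
            boolean primaryChanged = !valuesEqual(b, p);
            boolean secondaryChanged = !valuesEqual(b, s);
            if (primaryChanged && !secondaryChanged) {
                result.put(attr, p);   // only the primary version changed the value
            } else if (!primaryChanged && secondaryChanged) {
                result.put(attr, s);   // only the secondary version changed the value
            } else if (!primaryChanged) {
                result.put(attr, b);   // neither changed: keep the base value
            } else {
                result.put(attr, p);   // both changed: assumed policy favors the primary
            }
        }
        return result;
    }

    private static boolean valuesEqual(Object a, Object b) {
        return (a == null) ? b == null : a.equals(b);
    }
}
```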

6 Metaprogramming Concepts

Writing metacode is different from, and more difficult than, writing ordinary code, because the programmer has to consider a large variety of cases that may occur depending on the form of the statement and on the properties defined in the DS-DDL schema. Our key idea for developing reduction methods is so-called reduction polymorphism. In OO programming languages, polymorphism supports the dynamic selection of the “right” method depending on the type of the object held by a reference (since the type is not known until run time, this is usually called late binding). In this way, it is possible to avoid cluttering the code with conditional statements for explicit type checking. In a similar way, we use reduction polymorphism to avoid the explicit use of conditional statements in metacode. This means that, for an incoming DS-DML statement, the domain-specific driver will execute reduction methods that (a) match the syntactic structure of the statement and (b) apply to the specifics of the DS-DDL schema constructs used in the statement. We illustrate both concepts using a practical example. Suppose the following DS-DML statement.


Using our DS-DDL schema from Fig. 3 and reduction Variant 4, the statement gets reduced to the following SQL-DML statement (OT denotes object table, ATT the attachment relationship table, F a floating end, and NF a non-floating end).

First, any SELECT statement will match a very generic reduction method that inserts SELECT and FROM clauses into the SQL-DML source graph. A reduction method on the projection clause will reduce it to a projection of the identifiers (globalId, objectId, and versionId), the user-defined attributes, and the flag denoting whether the version is frozen. Note that because the maximal multiplicity of the end causedBy pointing from Cost to Task is 1, the table CostOT also contains the materialization of a pinned or latest version of some task, but the column for this materialization is left out of the projection, because it is irrelevant for the user. Next, a reduction method is invoked on the DS-DML FROM clause, which itself calls reduction methods on two DS-DML subnodes, one for each navigation step. Thus, the reduction of Offer-contains->Task results in the conditions in lines 5–6 and the reduction of Task-ratedCosts->Cost results in the conditions in lines 7–8. The reductions carried out in this example rely on two mechanisms, DS-DDL schema divergence and source-graph divergence.

DS-DDL schema divergence is applied in the following way. The relationship type used in the first navigation step defines only one floating end, while the one used in the second navigation step defines both ends as floating. Thus, in the reduction of the DS-DDL, we had to map the first relationship type to two distinct tables (because relationships with only one floating end are not necessarily symmetric). Therefore, the choice of the table we use (isPartOfF_containsNF) is based on the direction of navigation. The situation would be different again if the multiplicity defined for the non-floating end were 1; in that case, we would have to use a foreign key column in the object table. Another important situation where schema divergence is used in our example product line is operation propagation. To deal with DS-DDL schema divergence, each reduction method for a given node comes with a set of preconditions related to the DS-DDL schema that have to be satisfied for method execution.

Source-graph divergence is applied in the following way. In filtered navigation within a workspace, we have to use the table causedByF_ratedCostsF to arrive at costs. The obtained versions are further filtered in lines 9, 11, and 13 to arrive only at costs attached to the workspace with globalId 435532. The situation would be different outside a workspace, where another table, which stores the materialized globalIds of versions of costs that are either pinned or latest in the corresponding CVC, would have to be used for the join. Thus the reduction of the second navigation step depends on


whether the clause USE WORKSPACE is used. To deal with source-graph divergence, each reduction method for a given node comes with a set of preconditions related to the node’s neighborhood in the source graph that have to be satisfied for method execution. Due to source-graph divergence, line 3 of the DS-DML statement gets reduced to lines 9–15 of the SQL-DML statement.

Obviously, it is a good choice for the developer to shift decisions due to divergence into many “very specialized” reduction methods that can be reused in diverse superordinate methods and thus abstract from both types of divergence. In this way, the subordinate methods can be invoked by the developer using generic calls, and the driver itself selects the matching method. Four different APIs are available to the developer within a reduction method.

Source tree traversal. This API is used to explicitly traverse the neighboring nodes to make reduction decisions not automatically captured by source-graph polymorphism. The API is automatically generated from the DS-DML metamodel.

DS-DDL schema traversal. This API is used to explicitly query the DS-DDL schema to make reduction decisions not automatically captured by DS-DDL schema polymorphism. The API is automatically generated from the DS-DDL metamodel.

SQL-DML API. This API is used to manipulate the SQL-DML source graphs.

Reduction API. This API is used for the explicit invocation of reduction methods on subordinate nodes in the DS-DML source graph.
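To illustrate how reduction polymorphism, the two kinds of preconditions, and the four APIs might fit together, the following is a minimal, hypothetical Java sketch of a specialized reduction method for a navigation step; all type and method names are illustrative assumptions and not the actual generated DSL-DA interfaces.

```java
// Hypothetical sketch: a specialized reduction method for a navigation step whose
// relationship type has exactly one floating end and that is evaluated inside a
// workspace. The driver would select this method only if both precondition sets hold.
public class NavigationStepReduction {

    // Precondition on the DS-DDL schema (schema divergence):
    // the relationship type used in this step has exactly one floating end.
    public boolean schemaPrecondition(RelationshipType relType) {
        return relType.floatingEndCount() == 1;
    }

    // Precondition on the source-graph neighborhood (source-graph divergence):
    // the enclosing statement carries a USE WORKSPACE clause.
    public boolean sourceGraphPrecondition(NavigationStepNode step) {
        return step.enclosingStatement().hasUseWorkspaceClause();
    }

    // The reduction itself: adds join and filter conditions to the SQL-DML source graph.
    public void reduce(NavigationStepNode step, SqlDmlGraph sql, ReductionApi reducer) {
        RelationshipType relType = step.relationshipType();          // DS-DDL schema traversal
        String table = relType.tableForDirection(step.direction());  // e.g., isPartOfF_containsNF
        sql.addJoin(table, step.originAlias(), step.targetAlias());  // SQL-DML API
        sql.addWorkspaceFilter(step.enclosingStatement().workspaceId()); // filter to the workspace
        reducer.reduceChildren(step, sql);                           // Reduction API: subordinate nodes
    }
}

// Minimal hypothetical interfaces so the sketch is self-contained.
interface RelationshipType { int floatingEndCount(); String tableForDirection(String dir); }
interface StatementNode { boolean hasUseWorkspaceClause(); String workspaceId(); }
interface NavigationStepNode {
    StatementNode enclosingStatement();
    RelationshipType relationshipType();
    String direction();
    String originAlias();
    String targetAlias();
}
interface SqlDmlGraph {
    void addJoin(String table, String originAlias, String targetAlias);
    void addWorkspaceFilter(String workspaceId);
}
interface ReductionApi { void reduceChildren(NavigationStepNode step, SqlDmlGraph sql); }
```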

7 Conclusion and Future Work

In this paper, we examined the topic of custom schema development and data manipulation languages that facilitate increased reuse within database-oriented software product lines. Our empirical evaluation, based on an example product line for versioning systems, shows that the portion of time required for mapping domain-specific statements to SQL at run time stays below 9.9%. For this reason, we claim that domain-specific languages bring substantial benefits in terms of raising the abstraction level in schema development and data queries at practically no cost.

There is a range of topics we want to focus on in our future work. Is there a way to make DS-DMLs even faster? Complex reduction methods can clearly benefit from the following ideas. Source graphs typically consist of an unusually large number of objects that have to be created at run time; the approach could therefore benefit from instance pools that minimize object-creation overhead. Caching of SQL-DML source graphs can be applied to reuse them when reducing upcoming statements. Would it be possible to use parameterized stored procedures to answer DS-DML statements? This would make the reduction of DS-DML statements simpler, because a statement could be reduced to a single stored procedure call. On the other hand, it makes the reduction of the DS-DDL schema more complex, because stored procedures capable of answering the queries have to be prepared. We assume this approach is especially useful when many SQL-DML statements are needed to execute a DS-DML statement.


Implementing a stored procedure for a sequence of statements avoids excessive communication (a) between the domain-specific driver and the native driver and (b) between the native driver and the database. In a number of cases where a sequence of SQL-DML statements is produced as a result of reduction, these statements need not necessarily be executed sequentially. Developers of reduction methods should therefore be given the possibility to explicitly mark situations where the driver could take advantage of parallel execution.

In addition, dealing with DS-DDL schemas raises two important questions.

DS-DDL schema evolution. Clearly, supplementary approaches are required to deal with modifications of a DS-DDL schema that imply a number of changes to existing SQL-DDL constructs.

Product-line mining. Many companies develop and market a number of systems implemented independently despite their structural and functional similarities, i.e., without proper product-line support. Existing schemas for these systems could be mined to extract common domain-specific abstractions and possible reductions, which can afterwards be used in the future development of new systems.

References
1. Apache Jakarta Project: Struts, available as: http://jakarta.apache.org/struts/
2. Ben-Natan, R., Sasson, O.: IBM San Francisco Developer's Guide, McGraw-Hill, 1999
3. Bernstein, P.A.: Repositories and Object-Oriented Databases, in: SIGMOD Record 27:1 (1998), 34-46
4. Clements, P., Northrop, L.: Software Product Lines, Addison-Wesley, 2001
5. Czarnecki, K., Eisenecker, U.W.: Generative Programming: Methods, Tools, and Applications, Addison-Wesley, 2000
6. Frankel, D.S.: Model Driven Architecture: Applying MDA to Enterprise Computing, Wiley Publishing, 2003
7. Halstead, M.H.: Elements of Software Science, Elsevier, 1977
8. Kießling, W., Köstler, G.: Preference SQL – Design, Implementation, Experiences, in: Proc. VLDB 2002, Hong Kong, Aug. 2002, 990-1001
9. Mahnke, W.: Towards a Modular, Object-Relational Schema Design, in: Proc. CAiSE 2002 Doctoral Consortium, Toronto, May 2002, 61-71
10. McCabe, T.J.: A Complexity Measure, in: IEEE Transactions on Software Engineering 2:4 (1976), 308-320
11. MetaCase: MetaEdit+ Product Website, available as: http://www.metacase.com/mep/
12. OMG: Common Warehouse Metamodel (CWM) Specification, Vol. 1, Oct. 2001
13. OMG: Model Driven Architecture (MDA) – A Technical Perspective, July 2001
14. Saeki, M.: Toward Automated Method Engineering: Supporting Method Assembly in CAME, presentation at the EMSISE'03 workshop, Geneva, Sept. 2003
15. Simonyi, C.: The Death of Computer Languages, the Birth of Intentional Programming, Tech. Report MSR-TR-95-52, Microsoft Research, Sept. 1995
16. Thalheim, B.: Component Construction of Database Schemes, in: Proc. ER 2002, Tampere, Oct. 2002, 20-34
17. Weber, C., Kovse, J.: A Domain-Specific Language for Versioning, Jan. 2004, available as: http://wwwdvs.informatik.uni-kl.de/agdbis/staff/Kovse/DSVers/DSVers.pdf

Incremental Navigation: Providing Simple and Generic Access to Heterogeneous Structures*

Shawn Bowers¹ and Lois Delcambre²

¹ San Diego Supercomputer Center at UCSD, La Jolla CA 92093, USA
² OGI School of Science and Engineering at OHSU, Beaverton OR 97006, USA

Abstract. We present an approach to support incremental navigation of structured information, where the structure is introduced by the data model and schema (if present) of a data source. Simple browsing through data values and their connections is an effective way for a user or an automated system to access and explore information. We use our previously defined Uni-Level Description (ULD) to represent an information source explicitly by capturing the source’s data model, schema (if present), and data values. We define generic operators for incremental navigation that use the ULD directly along with techniques for specifying how a given representation scheme can be navigated. Because our navigation is based on the ULD, the operations can easily move from data to schema to data model and back, supporting a wide range of applications for exploring and integrating data. Further, because the ULD can express a broad range of data models, our navigation operators are applicable, without modification, across the corresponding model or schema. In general, we believe that information sources may usefully support various styles of navigation, depending on the type of user and the user’s desired task.

1 Introduction

With the WWW at our fingertips, we have grown accustomed to easily using unstructured and loosely-structured information of various kinds, from all over the world. With a web browser it is very easy to: (1) view information (typically presented in HTML), and (2) download information for viewing or manipulating in tools available on our desktops (e.g., Word, PowerPoint, or Adobe Acrobat files). In our work, we are focused on providing similar access to structured (and semi-structured) information, in which data conforms to the structures of a representation scheme or data model. There is a large and growing number of structural representation schemes in use today, including the relational, E-R, object-oriented, XML, RDF, and Topic Map models, along with special-purpose representations, e.g., for exchanging scientific data. Each representation scheme is typically characterized by its choice of constructs for representing data and schema, allowing data engineers to select the representation best suited for their needs. However, there are few tools that allow data stored in different representations to be viewed and accessed in a standard way, with a consistent interface.

* This work was supported in part by NSF grants EIA 9983518 and ITR 0225674.



The goal of this work is to provide generic access to structured information, much like a web browser provides generic access to viewable information. We are particularly interested in browsing a data source where a user can select an individual item, select a path that leads from the item, follow the path to a new item, and so on, incrementally through the source. The need for incremental navigation is motivated by the following uses. First, we believe that simple browsing tools provide people with a powerful and easy way to access data in a structured information source. Second, generic access to heterogeneous information sources supports tools that can be broadly used in the process of data integration [8,10]. Once an information source has been identified, its contents can be examined (by a person or an agent) to determine if and how it should be combined (or integrated) with other sources.

In this paper, we describe a generic set of incremental-navigation operators that are implemented against our Uni-Level Description (ULD) framework [4,6]. We consider both a low-level approach for creating detailed and complete specifications and a simple, high-level approach for defining them. The high-level approach exploits the rich structural descriptions offered by the ULD to automatically generate the corresponding detailed specifications for navigating information sources. Thus, our high-level specification language allows a user to easily define and experiment with various navigation styles for a given data model or representation scheme.

The rest of this paper is organized as follows. In Section 2 we describe motivating examples, and Section 3 briefly presents the Uni-Level Description. In Section 4, we define the incremental navigation operators and discuss approaches to specifying their implementation. Related work is presented in Section 5, and in Section 6 we discuss future work.

2 Motivating Examples

When an information agent discovers a new source (e.g., see Figure 1), it may wish to know: (1) what data model is used (is it an RDF, XML, Topic Map, or relational source?), (2) (assuming RDF) whether any classes are defined for the source (what is the source schema?), (3) which properties are defined for a given class (what properties does the film class have?), (4) which objects exist for the class (what are the instances of the film class?), and (5) what kinds of values exist for a given property of a particular object of the class (what actor objects are involved in this film object?). This example assumes the agent (or user) understands the data model of the source. For example, if the data model used was XML (e.g., see Figure 2) instead of RDF, the agent could have started navigation by asking for all of the available element types (rather than RDF classes). We call this approach data-model-aware navigation, in which the constructs of the data model can be used to guide navigation.

In contrast, we also propose a form of browsing where the user or agent need not have any awareness of the data-model structures used in a data source. The user or agent is able to navigate through the data and schema directly. As an example (again using Figure 1), the user or agent might ask for: (1) the kind of information the source contains, which in our example would include “films,” “actors,” “awards,” etc., (2) (assuming the crawler is interested in films) the things that describe films, which


Fig. 1. An example of an RDF schema and instance.

Fig. 2. An example XML DTD (left) and instance document (right).

would include “titles” and relationships to awards and actors, (3) the available films in the source, and (4) the actors of a particular film, which are obtained by stepping across the “involved” link for the film in question. We call this form of browsing simple navigation.
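To make these browsing steps concrete, the following Java sketch shows what such incremental navigation calls against the film source could look like; the interface and method names are illustrative assumptions only, not the operators defined later in this paper.

```java
import java.util.List;

// Hypothetical interface for incremental (simple) navigation over a structured source.
interface IncrementalNavigator {
    List<String> kindsOfInformation();               // e.g., "film", "actor", "award"
    List<String> describes(String kind);             // e.g., "title", "involved", "receives"
    List<Object> instancesOf(String kind);           // the available films in the source
    List<Object> follow(Object item, String link);   // step across a link, e.g., "involved"
}

// A possible browsing session for the film source of Fig. 1.
class BrowseExample {
    static void browse(IncrementalNavigator nav) {
        for (String kind : nav.kindsOfInformation()) {            // (1) what does the source contain?
            System.out.println("kind: " + kind);
        }
        System.out.println(nav.describes("film"));                // (2) what describes films?
        for (Object film : nav.instancesOf("film")) {             // (3) the available films
            List<Object> actors = nav.follow(film, "involved");   // (4) actors of this film
            System.out.println(film + " -> " + actors);
        }
    }
}
```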

3 The Uni-Level Description

The Uni-Level Description (ULD) is both a meta-data-model (i.e., capable of describing data models) and a distinct representation scheme: it can directly represent both schema and instance information expressed in terms of data-model constructs. Figure 3 shows how the ULD represents information, where a portion of an object-oriented data model is described. The ULD is a flat representation in that all information stored in the ULD is uniformly accessible (e.g., within a single query) using the logic-based operations described in Table 1. Information stored in the ULD is logically divided into three layers, denoted meta-data-model, data model, and schema and data instances. The ULD meta-data-model, shown as the top level in Figure 3, consists of construct types that denote structural primitives. The middle level uses the structural primitives to define both data and schema constructs, possibly with conforma
