KBMS overview

KBMS

edited for WWW by Gio Wiederhold, August 1995

A Precis of Research on

Knowledge-Based Management Systems

1977-1988

People

Investigators

Professor Gio Wiederhold
Departments of Medicine and Computer Science
Stanford University
Stanford, CA 94305-2140

Professor Jack Milton
Department of Mathematics
University of California at Davis
Davis, CA 95616

Prof.~Stefano Ceri (summers 1982 - 1988)
Politechnico di Milano
Milano, Italy

Participants

Peter Apers
Vrye Universiteit, Amsterdam, The Netherlands; now Prof. at the Tech. Un. Twente, Enschede, Holland.

David Beech
Hewlett-Packard Research Laboratory, Palo Alto, CA; Oracle

Robert Blum
MD, PhD, now in practice, Menlo Park CA.

Ralph Cherubini
Digital Equipment Corporation, Hudson MA

Christian Esculier
Paris and Univ.~of Grenoble, France

Barbara Grosz
PhD, SRI International; now Professor at Harvard Univ.

David Hartzband
Digital Equipment Corporation.

Gary Hendrix, PhD
SRI International; now at Symantics, Menlo Park

Jerrold Kaplan. PhD
Lotus Corp.; Go Corp.; ONSale Corp.

Prof.~Sham Navathe
University of Florida, Gainesville, FL; Georgia Tech.

Gordon Novak, PhD (1977)
SRI International, now at Univ.~of Texas at Austin

ChoChun Hsu
Beijing University

Arthur M. Keller, PhD
Stanford University and University of Texas at Austin

Kurt Konolidge, PhD
SRI International

Ingeborg M. Kuhn, PhD
Stanford AAMRS project; later at Veterans Admin., San Francisco

Prof.~Robert Paige
Rutgers University, New Brunswick NJ; now at NYU

Udo Pletat, PhD (summer 1988)
IBM Research, Stuttgart, FRG; now Professor Germany

Tore Risch, PhD (1990-1991)
HP fellow, now Professor at Linkooing Un., Sweden

Earl Sacerdoti, PhD (1977-1980)
SRI International; now at Teknowledge, Palo Alto

Daniel Sagalowicz (1977-1981)
PhD, later at Teknowledge, Palo Alto; Framentec, Monaco, and now at Oracle.

ZaiSheng Song
Beijing University

ShiWei Tang
Beijing University

Mike Walker
Stanford University (later MIS PhD); now a Consulting Professor in the MIS program.

Prof.~David Warren
SRI International; now Professor at Manchester Univ., Great Britain

Richard Weyhrauch, PhD
SRI International and Stanford University

Prof.~Marianne Winslett* (1983-1988)
Professor Univ.~of Illinois

JuanFen Yu
Beijing University

Research Assistants:

Most of the credit for the results achieved is due to the students which participated in the project. This list includes research assistants within the KBMS project as well as students who were supported by external funding or by related projects and participated as part of their academic work.

Name (year) subsequent venues

  1. James Allchin (1979-1980) PhD GA Tech 1983; Banyan Sys., Westboro MA; Microsoft
  2. Jim A. Andrews (1981)
  3. Mike Anderson (1981)
  4. Anne Beetem (1977-1981) Un.Wisconsin
  5. Robert L. Blum*, MD (RX project) (1980-1982) PhD, Stanford RADIX, Kaiser
  6. Stefano Ceri (1981) Prof. Politechnico Milano
  7. James Brinkley*, MD (supported by Dept.~of Medicine, Gyn-Ob.) (1980-1983) Prof. Univ.of Washington
  8. Sang K. Cha* (1984-1990) Prof. National Univ. of Korea, Seoul
  9. Edward Chan (1981)
  10. Surajit Chaudhuri (since 1988 supported by NAIL!) (1985-1990)
  11. Lei-Hou Chee (CVS) (1983-1984) IBM Santa Teresa Lab
  12. Francisco Corella (1981-1982) Symantec, IBM Yorktown
  13. Lori Craven (Bell OYOC) (sum.~1980) Bell Naperville
  14. James E. Davidson* (1979-1983) Teknowledge
  15. JinLi Dou (1982 summer)
  16. Steven Downs, MD (RADIX project) (1984-1985)
  17. Ramez A. ElMasri* (1977-1980) Honeywell; Prof. Univ.Houston; Univ.Texas, Arlington TX
  18. Goran Fagerstrom (1984-1985)
  19. Sheldon Finkelstein (1977-1982) IBM Res. SJ; Tandem
  20. Victor Franco (Bell OYOC) (1983) Bell
  21. Erik Gilbert* (supported by Lawrence Livermore Lab) (1978-1982)
  22. Hector Garcia-Molina* (1977-1979) Prof. Princeton; Stanford Un.
  23. Keith Hall (1986-1992) Consultant, Germany
  24. Waqar Hasan (1984-1988, 1994-1995) Hewlett-Packard
  25. Ram Kedlaya (1983 summer) UTex, Austin
  26. Arthur M. Keller* (1980-1985) Prof. UTex, Austin; Research Scientist Stanford Un.
  27. Jonathan King* (1979-1982) HP,Symantec,Teknowledge (deceased)
  28. CharLin Koo* (supported by Civ.Eng. Dep.) (1985-1987) Teknekron; Intellisys; ReliaSoft
  29. Byung Suk Lee* (1988-1991) Prof. at the National University, Seoul, Korea
  30. Wyatt Leung (1986-1987)
  31. Linda DeMichiel* (HCP, supported by Lucid) (1983-1990) IBM Research
  32. Toshimi Minoura* (Prof.~Susan Owicki) (1980-1981) Prof. USC; Oregon State Un.
  33. Robert W. Molter (US Army) (1983-1984) Army Pers. Office
  34. Diep Lan Tranh (1984)
  35. Kin-Kee Leung (1980)
  36. Kurt Lingel (Amdahl Corporation) (1985)
  37. Mohammed Olumi (part.~supported by IBM) (1979-1983) IBM Res, H-P, SAIC
  38. Edwin Pednault* (1980-1986) SRI AI center; AT&T
  39. Barbara Pernici (1983-1984) Politecnico Milano
  40. Patricia Pickett (RX project) (1979-1980) MD
  41. XiaoLei Qian (supported since 1987 by Kestrel Inst.) (1983-1990) Bell Labs, SRI International
  42. Peter Rathmann (NSF fellow 1982-1985) (1982-1990) Petrovision; Silicon Graphics
  43. Neil Rowe* (1980-1983) Prof. NPGS Monterey
  44. Thomas Rogers (1977-1981) HP; Symantec
  45. Gary Rubin (ARAMIS project) (1981)
  46. Luigi Semenzato (H-P) (1984)
  47. David E. Shaw* (1979-1980) Prof. Columbia U.; D.Shaw Ass.
  48. Kitty Shih (RLIN) (1983-1984) Stanford RLIN
  49. Mien Shih (supported by IBM) (1983-1986)
  50. Garrett Short (1979-1980) Calma, EDS
  51. John Shoch* (supported by XEROX PARC (1977-1979) XEROX; consultant
  52. Arun Swami* (supported since 1987 by HP) (1984-1990) IBM, Silicon Graphics
  53. Kyu-Young Whang* (1980-1983) IBM Yorktown; Prof. KAIST
  54. Marianne Winslett* (Bell fellow 1982-1986) (1982-1987) Prof. Univ.of Illinois
  55. John Woodfill (NSF fellow) (1984-1986)
  56. Lena Yesil (1981) Stanford Consulting Group
* obtained their PhD within the KBMS project.

Acknowledgement

This research has been mainly supported by the Defense Advanced Research Projects Agency under DARPA MDA903-77-C-322, N39-80-G-132, N39-82-C-250 as task (3), and N39-84-C-211 as task (10). The monitoring agency has been the U.S.~Navy SPAWAR office, formerly NAVALEX.

A task (6) on disjoint domains was supported directly by SPAWAR. Implementation of concepts developed earlier is supported by the DoD STARS program, through a contract with SRI International. Knowledge-acquisition research was partially supported through the RX project, funded by the National Center for Health Services Research (HS-3650 and HS-4389) and by the National Library of Medicine (LM-4334).

Work in natural-language interaction has been partially supported by NSF (IST-8023889).

Computer facilities for medically related applications were provided by the Division of Research Resources of NIH (RR-785). Additional equipment is being provided through a cooperative grant provided by Digital Equipment Corporation, through their Hudson, MA, Artificial Intelligence Research Group.

The views and conclusions contained in this document are those of the authors and should not be interpreted as representative of the official policies, either expressed or implied, of any agency of the U.S. Government.

The Knowledge-Based Management Systems Project

Summary

The Knowledge Based Management Systems (KBMS) Project addresses the problems of intelligent processing of information from large databases. Such databases are vital to the management functions of modern enterprises. In modern computer networks these databases are typically accessed indirectly, since they reside on other local and remote computers. The supporting systems may be autonomous and heterogeneous. The data is stored on files and maintained by a variety of programs. To make effective use of such data the available knowledge about the data semantics and representation has to be formalized in such a way that it can be exploited automatically.

We have investigated a wide variety of types of knowledge about data and databases, from relatively formal semantics based on relational models, to probabalistic and uncertain concepts. The goal of the individual investigations is to devise methods which make access to data more convenient, more effective, or more efficient.

Data and knowledge bases will only be of value of they are comprehensive within their scope, and up-to-date. We have devoted much of our research efforts to the problems arising with the maintenance of large databases and, more recently, large knowledge bases.

Rationale

The utility of computers and computer-based systems in large organizations has been limited more by inadequate techniques for managing large quantities of data than by our technical ability to gather and store those data. Existing systems for collecting, storing, and retrieving data are too inflexible and complex for easy extraction of relevant, concise information. To obtain information, data must be selected and processed so that it can be of value to decision-makers. Traditional output from databases produces large reports and messages from the execution of formally stated queries, but this represents primarily selected data rather than focused information. Typically, several layers of management and technical staff must intervene between the data systems and the users, impairing the efficiency and responsiveness of decision-making that depends on information stored within the databases.

Managers are aware of the value of control of and access to information; they depend on control of the data developed and used by their enterprises. The centralization of this data, which until recently was an essential part of having and sharing large databases, has resulted in a loss of control because of awkward access and dependence on remote personnel. While economic and technological considerations have favored centralization of data storage and processing during the past, recent directions in hardware development, including VLSI technology, are making available networks of smaller, yet very powerful, special purpose computers that can be used to collect, store, and process information close to its source. Loosely-coupled networks of different machines, personal workstations, etc., are likely to be the dominant type of computing environment within a few years. Consequently, local control of local information will be the rule rather than the exception. Such distributed environments pose new difficulties for the management of data resources. For instance, while control over local data has improved, access to remote data may be more difficult.

Another challenge presented by the move to distributed data distributed systems, is that the knowledge and authority required to manage the diverse sources of information are themselves distributed among multiple sites [Wiederhold;82e]. Familiar methods of data management--in which perfect information about the structure, semantics, and content of the data is normally assumed--are not adequate to deal with the less structured environment encountered in truly distributed systems. We have found such areas, where knowledge is not perfect, ideal for the application of artificial intelligence techniques.

Developments in artificial intelligence over the last two decades provide concepts that are of value in increasing the responsiveness of database systems to their user's needs. To pursue our work in applying artificial intelligence techniques to the practical domain of database management, we study the problems of data management in centralized and distributed computing environments, and the management of the knowledge that pertains to that task.

Difficulties occur in every aspect of data management: database design, understanding the information potential of the database, and actual use of the computer. Specific issues include data abstraction, query processing, security, redundancy, optimization, integrity, and decentralized processing. The management of knowledge becomes more specific when applied to databases. While the quantity of knowledge needed may be large, it has a scope defined by the corresponding databases and operates in a world which in many respects is closed relative to that scope.

Research Approach

The history of database development and research shows a continuing interaction of industrial efforts. The scenario is often that industrial achievements concentrate on handling larger tasks with adequate performance, and research contributes formalizations and clarifying methodologies [Wiederhold;84b]. We follow that scenario.

Our research probes the quantitative and complexity boundaries of data management. Large quantities of data, complex data relationships, and the limits of storage technology all require that advanced techniques be pursued if realistic problems are to be approached effectively.

In order to understand the problems formally we typically begin by investigating methods and algorithms available in database design and operation [Wiederhold;83b]. We first ascertain the limits of algorithmic approaches and new technology, and subsequently develop heuristic methods to probe beyond those boundaries. In much of our research we subsequently bring to bear the techniques developed in the artificial intelligence (AI) community for dealing with partially understood, arbitrarily structured problem domains [Barr;80], [Wiederhold;83f].

The introduction of heuristics leads to imprecision. We can quote, however, the British mathematician Alan Turing [Hodges;83], who stated in 1946:

In other words then, if a machine is expected to be infallible, it cannot also be intelligent.

The capability of symbols, manipulated through these AI techniques, to represent generalizations and abstractions, provides the basis for intelligent behavior. Recognition of uncertainty, which is an essential aspect of dealing with higher level concepts rather than with specific instances, is hence an essential feature of knowledge in our problem domain.

The specific AI tools are based on such concepts as semantic modeling and build on existing database technology. The databases are effective keepers of information describing the instances that the knowledge is applied to [Wiederhold;86a]. Although the traditional province of database management is in business data processing, we are also considering the data-management of large scale scientific problems [Wiederhold;82g]. The diagram in Section 10 of this report provides a layout of the tasks we have addressed.

To provide an experimental basis for our research we maintain several large databases. One of these includes data on $34\,000$ merchant ships, ports, and shipping information. A second database contains design specifications for the dec pdp-11 computer.

Most of the early work was carried out on dec-10 and 20 computers at SRI International and Stanford. For the new tasks, described in Section 5, we have equipment configured to emulate a modern workstation, with 8 megabytes of memory and 600 megabytes of storage, and access to the ethernet. The current computing engine is a VAX 750, with graphic and speech output capabilities. Primary system software is Common Lisp and the rdb relational database under VAX vms.

A variety of database-management approaches have been used in our research, network [codasyl;71,74], relational [Codd;70,71], [Blasgen;77], [Held;71], as well as special ones. The medical data are kept on tod [Weyl;75]. The bibliographic database is available for search in fully inverted form using the news service (ns) system on the Stanford sail computer. This bibliography can also be shipped over the arpa net. A printout can be furnished upon request, but we consider it too large for manual perusal.

We concentrate on promising task areas within our research definition rather than on building complete systems. Since this note is intended principally as a review of our own research, we have not cited all of the many publications of other authors that have provided us with pertinent results, important ideas, and useful guidance in respect to general research directions. General references which contributed significantly to our research are listed in Section 12 of this report, while the papers published under the aegis of the KBMS project are listed in Section 11.

Research Areas and Results

Our research falls into the areas of database design, human interfaces for databases, and performance models of database operation for both centralized and distributed databases. Our work is concentrated at a high conceptual level so that our results may be applied to relational, hierarchical, and network implementations. The application of knowledge about databases is a critical ingredient in all of our research [Wiederhold;84a].

We have categorized our efforts into six areas, although many issues overlap. The six sections (4 through 9) which follow provide short summarizations of our research. Many subsections represent a PhD dissertation and related research. Section 10 presents general conclusions about the results obtained.

  1. Sec.~4, Design Methodology:
    The Structural Model;
    The Data Definition Facility of critias;
    Integration of Diverse User Models;
    Integration of Diverse User Models at Diverse Sites;
    The Design of Physical Databases from the Integrated Model;
    Distributed Database Design;
    Design of Database Management Systems.
  2. Sec.~5, The Architecture of Knowledge and Databases:
    Inferencing Over a Database;
    Formal Semantics of Nulls;
    Subset Definition;
    Lexicon Management.
  3. Sec.~6, Maintenance of Databases:
    Update Constraints;
    Updating through Views;
    Natural-Language Database Updating;
    Finite Differencing;
    Maintenance of Distributed Databases;
    Distributed Resiliency Mechanisms.
  4. Sec.~7, Querying Databases:
    Graph Model of the Database;
    Cooperative Interactive Interfaces to Databases;
    Descriptive Responses to Database Queries;
    Task Models;
    Semantic Query Optimization;
    Common Expression Analysis;
    Use of a Statistical Abstract of a Large Database.
  5. Sec.~8, Applications:
    Experiments with Database Technology to Support VLSI Design;
    Planning;
    Experiments in Intelligent Data Analysis.
  6. Sec.~9, Tools:
    File Access System;
    Intelligent File Systems;
    Database Machines;
    Access to Optical Storage.

We will now present these topics in more detail and cite the relevant references. Note that this precis is limited to work actually performed. Work being planned or in progress will be reported in further editions of this precis, as well as in separate publications.

Knowledge Base Structures

Problems of knowledge maintenance in large knowledge-based systems motivate our research. Today these problems are evident in only some instances, but will become more prevalent as knowledge-based systems grow in scope and depth, and last beyond the lifetime of a PhD thesis. Some researchers from the AI community have looked towards database technology to help in dealing with issues of size and update management [Kerschberg]. Database systems have focused on simple structuring and normalization to deal with large bodies of information, and do not deal well with the complexities of structures needed to represent knowledge. We are using concepts from database research here as well, but must be very careful in intermingling database and knowledge-base representations. We need to avoid creating a combination with the weaknesses of the two fields, rather than the strengths. Future information systems will benefit from distributed knowledge sources and distributed computation. An architecture to deal with future systems must consider the technological opportunities that are becoming available. We see these systems supporting decision-makers through a two-phase process:
  1. Locating and selecting relevant factual data and aggregating it according to the decision alternatives.
  2. Processing and reducing the data so that the number of alternative choices to be decided among is small, and the parameters for each choice are aggregated to a high conceptual level.
Today most of these support tasks are carried out by human experts who mediate between the database and the decision maker. For many tasks in medicine, warfare, emergency relief, and other areas requiring rapid actions, dependence on human intermediaries introduces an intolerable delay. Future information systems will increasingly need to use automatic mediators to speed up these support processes [Wiederhold89c].

The databases, the mediators, and the applications will all reside on nodes of powerful networks. The end-users will always have computers available to serve their specific tasks. We refer to those machines as application workstations, although they may at times be large and powerful processors.

Demonstration of a Frame-based Interpreter to Demonstrate Mediation

To demonstrate the concepts of a frame-based interpreter which accepts directions and focus from frames, we have chosen the task of assigning reviewers for journal papers submission. The data sources are distributed, the mediators are distinct, but all code was combined into one demonstration program. Five SoDs serve the task:
  1. Relevance: we need reviewers with a background relevant to the submitted paper. This task is performed by matching in a keyword classification hierarchy.
  2. Quality: we prefer the most qualified reviewers. For this task we rank potential reviewers based on their published output in books, journals, etc.
  3. Conflict avoidance: we cannot assign reviewers to friends or colleagues. Colleagues are people that have been affiliated with the same institution as one of the authors with overlapping time intervals.
  4. Coauthor avoidance: coauthors of other papers with one of the authors are also to be avoided.
  5. Responsiveness: The reviewers must produce their reviews in time. Here we can look at a log of electronic-mail interactions.
We can use this example to elucidate the difference of the AI paradigms allocated to level H3 and H2. The tasks in the SoDs at layer H2 can all be defined quite formally. At the top layer H3 some unwarranted but critically important pragmatic heuristics are used to implement the reviewer selection task. For instance:
  1. Having written high quality publications in a topic area does not assure one that the candidate does equally well as reviewer. It is the best guess that our application task can make, but we all know some excellent critics who do not write much. The mapping of {\tt qualified\_writer \RIGHTARROW qualified\_reviewer} is pragmatic. The establishment of a set of {\tt qualified\_writer}s is adequately formal to justify its allocation to a SoD.
  2. Having worked together does not make one a friend, and being a friend does not imply favoritism. But we do need to weed out risky matches--in fact, due to prior publications the most likely best match is the submittor of the paper.
  3. Electronic mail responsiveness is probably only weakly correlated with fast reviewing: there are people who respond instantly to email and never respond to review requests.
The language currently used between layers H1 and H2 is lisp because it supports the extensibility essential to rapid research progress.

The data accessed by the first four SoDs are distinct views of an extensive bibliography of knowledge and database references. Information exploited includes type of publication (for the Quality-SoD), authors (the principal identifiers), author's location (for the Conflict-avoidance-SoD), publication details and sequence with dates (for the Conflict-avoidance-SoD), coauthor lists (for the Coauthor avoidance-SoD), as well as title, abstract, and classification (the last three are used by the Relevance-SoD).

Two of the SoDs are currently implemented--relevance and conflict avoidance. They are implemented as Lisp programs which have access to the object system and a commercial relational database. As we gain more experience, we intend to replace the Lisp code with a more declarative representation.

At the simplest level, the relevance SoD takes a keyword (or list of keywords) describing the subject of the paper to be reviewed, and looks in the database for authors who have written papers on the keyword(s). If the enough authors are returned by this database query, this is all that happens. If however, the database query does not find enough authors, or if the application asks for more candidate reviewers later, the relevance SoD replaces the original query by a more general one, in order to increase the cardinality of the result.

This capability is an example of query generalization [Chaudhuri;89]. It is possible because the SoD makes use of some of the semantics of the keywords. The keywords are arranged in a hierarchy, in which the parent is the more general keyword, and the children the more specific. If a query is does not return sufficiently many results, a concept of semantic distince in the hierarchy is used to suggest alternate keywords to try.

The data structures used by the SoD are designed to efficiently support this kind of iterated query style efficiently. A set of authors can be found by a succession of related queries can be answered with about the same total effort as would be needed to find that same set of authors with one more general query.

The application interface is simply a set of Lisp functions which the application can use. As our system evolves, we intend to built a higher-level interface. In our design, an application which is looking for reviewers would submit a query of the form:


select best 3 reviewer
  from relevance-SoD
where relevant = 'knowledge-base'   
      and reviewer not in 
          (select all friend
          from conflict-avoidance-SoD
          where author = 'Gio Wiederhold')

Note that this query refers to two different SoDs. Since a particular SoD can only answer queries about its own domain, this query is translated into a slightly lower level form, which specifies the individual queries to the SoDs, and the information flow between them.


X := select all friend
         from conflict-avoidance-SoD 
         where author = 'Gio Wiederhold'
Y := select best 3 reviewer
         from relevance-SoD
         where relevant = 'knowledge-base'
         and not in X

RETURN Y

Note that even this second query is fairly high level. It refers to such abstractions as relevant, which are implemented by the SoDs.

At the end of the funding period we shall be able to demonstate the complete system, with all five SoDs. In addition we expect to be able to demonstate advanced techniques for combining the knowledge and results from the five SoDs to achieve a balanced assessment of the reviewers. In particular this require a means to weigh conflicting information from the SoDs, as a given individual may be outstanding according to some criteria but poor according to others.

Integration of Object-Oriented Methods with a Relational Architecture

Person: Barsalou
This work investigates the hypothesis that an object-oriented approach can serve as a unifying framework for developing expert database system (EDSs). Objects offer the appropriate level of abstraction for EDSs [Barsalou;88a, a, 89a, b]. Storing information in the form of complex objects, however, can seriously inhibit sharing and flexibility since persistent objects bind application-dependent knowledge to the data. A desirable compromise is to exploit existing database technology by defining an object-based layer on top of a relational dbms. This approach does not call for storing objects explicitly in the database, but rather for generating and manipulating temporary object instances by binding data from base relations to predefined object templates. This work focuses both on the theoretical understanding and practical aspects of managing complex objects in a relational framework.

The primary intent of our formal model is to remedy the lack of structural abstraction in the relational model. Objects are specified recursively as follows: A base object is a non-empty set of projections defined on the base relations. A composite object is a set of other objects. Composite objects can be further subdivided into non-recursive and recursive objects. To render practical the mapping between the relational store and the object working representation, I add constraints to this general definition. Those additional constraints rely on the use of a semantic data model--the structural model--and enable me, among other things, to derive a unique hierarchical structure for base objects, to support record-valued and set-valued domains, and to guarantee termination for specific types of composite recursive objects.

Our prototype system, called penguin, defines three components for the object layer: (1) The object generator maps relations into object templates where each template can be a complex combination of join and projection operations on the base relations. (2) The object instantiator provides nonprocedural access to the actual object instances. Combining the database-access function (stored in the template), and a declarative query, penguin automatically generates the relational query and transmits it to the dbms, which in turn transmits back the set of matching relational tuples. Those tuples are then assembled into new object instances; (3) The object decomposer implements the inverse function; that is, it maps the object instances back to the base relations. This component is invoked when changes to some object instances need to be made persistent at the database level.

We expect our results to indicate that an object-based architecture provides (1) efficient access to and manipulation of complex units of information while preserving the advantages associated with persistent storage of data in relational format, and (2) a domain-independent, bidirectional communication channel between relational database systems and expert systems.

In-Query Cooperative Ad hoc Query Interface

Person: ChaSK
Ad hoc query interfaces are critical to solving open-ended problems arising from human organizational activities. Providing cooperative ad hoc query interfaces to users has become more important than before, as the growing number of personal computers and workstations leaves more end users in direct touch with databases. These users cannot afford intermediary programmers to provide handcrafted solutions for a given task.

For ad hoc querying, linear query languages have been widely used. sql and restricted forms of natural language are such languages. Most ad hoc query interfaces based on linear query languages are, however, passive. Their behavior is characterized by the passive reception of user queries and the provision of isolated textual user guidance. Passive ad hoc query interfaces are more frustrating to end users than a typical operating system interface, because database queries tend to be more complex than operating system commands, and thus more prone to various types of query failure. This query failure is a major obstacle to achieving the high productivity of end users.

Kaleidoscope is a cooperative ad hoc query interface designed with the productivity of end users in mind. Two query languages are considered: sql and a restricted form of natural language. The key feature of Kaleidoscope designed for increasing the productivity of end users is multi-step in-query cooperative guidance. Instead of receiving query constituents passively, the system leads users through a sequence of cooperative interactions to guide them away from the potential failures in a query. At each step, a set of legitimate query constituents or templates for such constituents are proposed to the user based on the system's knowledge of query language grammar and the structure and content of underlying databases. In addition, any information that is derivable by adding a query constituent to the partial query structure (but that may be unknown to the user) is displayed on a system message window. Essential to Kaleidoscope's cooperative interactions is a menu front-end that facilitates the communication between the user and the system as in the so-called menu-based natural language interfaces NLMenu and inglish.

Kaleidoscope's in-query cooperative guidance is relevant to the past work on post-query cooperative response. Many types of knowledge turn out to be commonly useful to both in-query guidance and post-query response. However, Kaleidoscope's in-query guidance is based on more active way of utilizing such knowledge than post-query cooperative systems: if the system is knowledgeable enough to suggest the post-query correction of failed queries, its knowledge should be used instead to guide users away >From query failure. Even if most types of query failure is prevented in Kaleidoscope, it is still possible that the result of a query may be unsatisfactory to the user. For instance, a query may produce no tuples, or it may produce too many tuples. In such cases, post-query cooperative response may complement in-query guidance.

Semantics of Knowledge Views

Person: Rathmann
<%still needs some editing - some of the claims are a little strong. pkr> Interacting with and comprehending almost any large and complex system is easier if that system can be decomposed into a number of smaller, simpler parts. Software designers realized this early on, and introduced subroutines and modules to provide global structure to programs. This research provides a semantics for structure and views in knowledge bases.

We consider knowledge to be represented as logical theories, perhaps augmented with non-monotonic assumptions, such as a closed world. Integrating parts of a knowledge base then becomes semantically equivalent to the process of combining subtheories to give a global theory. Since the resulting knowledge base may contain much information which is irrelevant to a given user, it is then helpful to have a capability for customizing the subsets of the integrated knowledge base to give user views.

We need a firm foundation before such operations can be defined. Therefore a portion of this work has focused on issues of semantics of knowledge bases. First order logic is not capable of representing the non-monotonic assumptions which a knowledge base needs to accurately depict the world. To gain such a non-monotonic capability, we have turned to a circumscription-based formalism.

Circumscription provides a very powerful and precise tool for defining unique name and closed world assumptions. Investigations into the unique name assumption have resulted in a new form of circumscription that is well suited for representing the semantics of knowledge bases. It is the first form of circumscription to handle equality in a principled fashion, and the first to provide unique names assumptions by default.

A prioritized extension of equality circumscription can then be used to define the semantics of integrating knowledge components and definining views on the resulting knowledge base.

A practical benefit is that the semantic description provides a guide for implementation. In particular, it provides a test for when queries to the global knnowledge base into may be decomposed into queries to the components, as well as a technique for performing the decomposition.

This formalism also suggests the beginnings of a design theory for knowledge bases. In particular, it provides checks for when a knowledge base contains redundant information, as well as a test for when a view faithfully reflects the state of the global knowledge base. <% reference Rathmann IJCAI 89>

Performing Database Operations over Mismatched Domains

Person: DeMichiel
A database system must often combine data from a variety of sources so that decisions can be based on a broad range of information. When data must be obtained from several database systems, it may occur in forms that are not directly comparable.

We present a solution to the problem of correctly performing relational database query operations despite data domain mismatch. We use the mechanisms of domain mappings, virtual attributes, and partial values to resolve domain differences and to map conflicting attribute values to common domains. We formalize operations over relations containing such virtual attributes and partial values by the definition of an extended relational algebra. We analyze the properties of this extended algebra and describe the system we have implemented that uses it to resolve domain mismatch and to perform operations over information from mismatched domains. This work enables information derived from multiple sources to be presented in an integrated form [DeMichiel;89a, b].

Optimization of Large Join Queries

Person: Arun Swami
Current query optimizers normally expect to process queries involving only a small number of joins (less than 10 joins). We expect that some future applications will require processing of queries with a much larger number of joins. In this research, we investigate the problem of optimizing Select-Project-Join queries with large numbers of joins (specifically 10 to 100 joins) [Swami;88, 89].

The results in this thesis are validated using several different synthetic benchmarks, and two different models of join processing costs: one, a conventional disk based model, and, two, a model for memory resident databases. We develop a cost model for join processing in main memory databases. This model is validated for the uniform distribution and more general distributions including the Zipf distribution. Simulations based on the cost model are used to study the performance of various join methods.

To study the characteristics of the search space of query plans, we obtain the statistical distributions of query plan costs. We observe that, as the number of joins increase, the difficulty of the optimization problem increases dramatically in the range considered. We study the effectiveness of the heuristic of postponing cross products as late as possible. We find that the number of sampled query plans of low cost increases significantly using the heuristic. For 10 join queries, the percentage of low cost query plans increases from 2.4\% to 79.6\%, and for 50 join queries from less than 0.1\% to 3.8\%.

We next address the problem of determining good methods for optimization. Taking advantage of the commonly used heuristics of pushing selections down and performing projections as soon as possible, the problem of optimizing large join queries is reduced to the problem of determining the optimal join order. The latter problem is a hard combinatorial optimization problem. We apply general techniques, such as iterative improvement and simulated annealing, that have proved effective in a variety of combinatorial optimization problems. We then study the use of heuristic techniques, such as augmentation and local improvement, in combination with the general combinatorial techniques. We observe great differences between the many different methods considered. We find that two combinations of the augmentation heuristic and the iterative improvement technique are superior to all the other methods.

Use of constraint rules

Person: Qian
Programming remains to be the primary means of instructing computers to perform work. Conventional programming is costly and error prone. It does not have precise and well-understood principles. The automation of programming has been a dream since the birth of the first computer. Although the concept has changed a lot, what was considered automated programming fourty years ago becomes compiling now, the idea is still the same: we tell the computer what to do and have the computer itself figuring out how to do it.

Database programming is an instance of programming. Besides the common properties it shares with programming, database programming requires having the knowledge of database semantics - the rules and the constraints applied to the database so it remains consistent and sharable by many users. Such knowledge is necessary both to maintain database integrity and to explore more optimization opportunities. Automated programming in databases is common. The fruitful research in query processing, semantic query optimization, and logic programming shows that automated programming of data retrieval programs is not only possible but also feasible. Similar approaches can be taken to programming data manipulation programs, namely transactions [Qian 86, 87, 88a, 88b, 88c].

The paradigm of database management makes automated programming more desirable. Since most database systems today support to various degrees the specification of database semantics, it is time to provide automated tools to exploit the use of this semantics. User-supplied transaction specifications are often small incrementals to the specification of database semantics, which is oriented towards human understanding rather than machine execution, and transaction programming involves routine translation of such combined specifications into procedural code. Programmers rarely have a complete understanding of database semantics. For reasons such as ease of use and security, databases often provide programmers with access to views, which are selected small portions of data. Transactions are reused over long period of time through changing situations, while prorammers do not stay. Moreover, tremendous opportunities for optimization exist because transactions usually perform small changes to database states. Since manual optimization carries extremely high risks of violating database semantics, it should hence be the responsibility of database systems to synthesize executable code from transaction specifications that preserve the validity of database semantics.

Database technology makes automation more feasible. Less effort is needed in writing transaction specifications because a large portion of them, namely the database semantics, is already present. Transactions are dominated by data manipulation instead of computation. On the average, eighty percent of the common transaction code is about constraint checking and enforcement. A small number of language constructs, such as insert, delete, modify, test, and iterate, constitute a reasonable transaction language. Simple algorithms and regular data structures suffice for most of the data manipulation jobs. These features significantly simplify the synthesis task.

Certain aspects of database management raise new challenges to automated programming. Transactions are imperative programs which manipulate database states. Specifications are often incomplete, requiring the synthesis of transactions with attached applicability conditions. Database semantics must be utilized to optimize both the synthesis process to reduce search effort and the transactions generated to enhance runtime performance.

We approach the synthesis of transactions which preserve the validity of semantic integrity constraints using deductive tachniques. A class of iterative transactions is defined and shown to have exactly polynomial-time expressive power. An iterative transaction logic is developed as the formalism with which the synthesis of iterative transactions can be conducted. Transactions are generated as the by-product of proving specifications in the logic. The deductive-tableau proof system developed by Manna and Waldinger is extended with inference rules for the extraction of iterative transactions from proofs, which require the coorperation of multiple tableaux. It is shown that the proof system is sound and complete for the synthesis of the subclass of first-order transactions. Properties of transaction specifications are studied to identify a class of bounded stratified specifications for which the synthesis process always terminates. Control strategies are also developed which utilize the database semantics to reduce search effort in the synthesis process and to enhance runtime efficiency of the transactions generated. Our approach is demonstrated with a prototype implementation.

Design Methodology

This area concerns research to improve the techniques of database design and modeling. The structural model, formally defined within the KBMS project, captures those semantics that are of importance in the design of the database's physical structure [Hendrix;75] [Wiederhold;77, 78a, b, 82d, 83a]. The concepts of the structural model have been expanded and applied in other research aimed at facilitating various stages of the database design process. They provide the basis for the data definition facility of critias, which implements a schema language based on the structural model [Qian;85b]. They are used in the task of integrating diverse user views, whether in a centralized or ditributed setting. And finally, they are used to aid the design of physical databases from the integrated user model.

Further avenues of design oriented investigation have been distributed database design and the design of database management systems.

The Structural Model

Person: Wiederhold
<% Subsection 4.1 > The structural model adds the concept of connections among relations to the basic concepts of relations. Connections model relationships between entities, whereas relations primarily model entities. Connections have formal properties, embodying constraints related to functional and inclusion dependencies. Rules for the maintenance of correctness under update of the database are also given. Three types, ownership, reference, and subset describe distinct constraints imposed by the user on the relationships. The structural model can be considered as having formalized the important relationship aspects of general semantic models. These models (e.g., the entity-relationship models) are used to embody the requirements of the users [Chen;77] [ElMasri;79b] [Shipman;81] [Wiederhold;79a, c, 82b, 83b].

Our work on database design is based on the use of these structurally relevant semantics, which form the basis for defined connections in the database. Three connection types are adequate to model any structural database constraint other than arbitrary functions: ownership, reference, and subset. When the expected usage loads are mapped to defined connections, the codified knowledge will provide the quantifiable guidance for the physical design. The correctness of binding decisions is necessary to make large databases capable of acceptable performance [Wiederhold;81b], [Ceri;81,85b]. The existence of a model also allows making decisions about reconfiguration of a database as the loads change over time. The problems of asynchronous operation of distributed databases are being modeled using the identity connection [Wiederhold;87].

It does not matter in what manner the connections are implemented. This permits the logical design to proceed independent of the eventual implementation. This freedom is critical for databases which are expected to outlive current technologies and for databases which may be implemented differently on multiple sites in distributed system.

The primitives of the structural model are also applicable to databases serving non-traditional functions as engineering information systems for hardware [Wiederhold;82b] and maintenance control systems for software [Olumi;83].

The Data Definition Facility of critias

In order to clarify the concepts of schemas independent of their implementation we have developed a free-standing schema definition language critias. The language provides syntactic embodiment of a semantic data model. The basic modeling concepts of critias are entity types, attributes, and three types of relationships (subtype, association, and reference), which are intended to model the conceptual objects, their properties, and relationships among them. Language constructs are provided for specifying semantic integrity constraints at all these levels of modeling concepts. We also show that there is a direct mapping from critias constructs to relational schema hence the implementation of critias is straightforward [Qian;85b].

The Integration of User Models into a Comprehensive Database Model

<% subsection{4}1.2> In order to generate a comprehensive database model many data models of individual applications may have to be integrated. The design of a large database must consider many potential users, each with a distinct perception of what the database should look like. The structural model supports the integration of many such views into a database model to support all users [ElMasri;79a, 80]. Included in the model is a formalism for the clear expression of structural-integrity constraints in each user's view of the application area and rules for resolving these constraints relative to all users. The integrated database model can remain invariant while performance and operational constraints cause the implemented version of the database to change [Wiederhold;83a. [Wiederhold;82d]. Each integrated application also obtains access to the data belonging to other applications in a consistent manner. The eventual database submodels for the applications will be richer than their original specification. These revised database submodels can be used to define views and windows for user access to the database.

Integration of Diverse User Models at Diverse Sites

Person: Wiederhold
<% subsection 4.3> In a centralized operation a database administrator is responsible for the design of a central database that supports the diverse views of individual users, each of whom can be aware of only a portion of the actual database. In truly distributed situations there may be no single individual with sufficient global knowledge (or authority) to perform this task [Wiederhold;83a]. While no central database will exist, a comprehensive data model is critical for automatic maintanance of overall consistency and for performing distributed processing of queries and updates.

An alternative approach, distributing integrity authority and responsibility to autonomous local sites has been explored in the context of networks of personal computers. The bulk of actual maintenance actions can be delegated to rule-driven agents, which inspect the update messages [Wiederhold;84d].

The Design of Physical Databases from the Integrated Model

Person: Whang
<% subsection 4.4> The structural model permits the attachment of projected usage factors >From the individual users, and the combined knowledge can be employed to design an effective database implementation. Such a database may be implemented using technology based on relational, hierarchical, or network concepts [Wiederhold;82a].

A straightforward algorithmic optimization approach is not feasible when designing a multi-file database since the combinations of the decision space greatly exceed reasonable time bounds. Addressing this problem, we have developed a method which will provide implementation guidelines for a relational database design in close to linear time [Whang;81a, b, 83a, 84, 85], [Wiederhold;83a]. The concept developed is termed separability of joins between the participating relations.

When the relations of a database are separable, the allocation of clustering and indexing choices for a database implementation can proceed relation by relation. The assumptions required for this approach appear to be fulfilled over a wide range of practical cases. We have also extended this approach to network databases [Whang;82].

A by-product of this research is an improved equation for the estimation of block accesses when retrieving records from files [Whang;85]. This formula has been adapted for the optimization of query processing in relational systems.

Distributed Database Design

Person: Ceri, Wiederhold
<% subsection 4.5> We have also explored the design issues in distributed systems and explored useful algorithms for optimality. In this process we have again found limits of design problem complexity which require the development and evaluation of heuristic approaches. We have developed and published algorithms for the partitioning of databases by selecting fragments defined as tuple sets (horizontal partitioning) or as attribute sets (vertical partitioning) for distribution. In a related project we have developed and evaluated algorithms for distribution within clustered processors.

Our work on database partitioning uses structurally relevant semantics to define connections in the database. These connections and the expected usage loads provide the guidance for the physical binding design decisions [Ceri;81].

This issue maps directly into distributed systems: the cost of connections that go via communication links becomes even more critical. We assume initial user input to select candidate fragments of the database for partitioning and a list of important transactions using those fragments. With each transaction an origin site is associated. A simple model that permits replication but ignores storage and update costs would have that site contain all its needed fragments.

When storage and update costs are considered, the optimal site assignment for the fragments is very difficult. We have been able to formulate and solve an integer programming model for this problem for a modest number of fragments and transactions when there is no replication. We have also analyzed the heuristics required when replication is permitted [Navathe;82], [Ceri;83].

Multiple processors may also be used in clusters, perhaps connected by a high-bandwidth local net, in order to achieve higher aggregate performance than a single large processor can provide. When the processors are clustered, the entry point for a transaction is of little concern. Partitioning is now constrained by limits imposed by processor, input-output, and communication capability [Sacca;83], [Sacca;85].

However, in distributed databases a major technique to achieve acceptable performance is the replication of data [Garcia;78c] [Barbera;82] We are beginning to model this technique using the concept of an identity connection [Wiederhold;83b]. When replication is used in the centralized environment the binding is implemented quite rigidly (i.e., no access is permitted until updated replicated elements are all set to identical values) [Wiederhold;81b]. We find that the cost of maintaining rigid identity connections is quite high in distributed systems, and are investigating methods that permit a relaxation of those constraints.

In all those approaches to database design, the limits of algorithmic design methods seem to have been reached. The additional complexity that occurs when the fragments can be replicated always requires heuristic approaches [Ceri;81].

When databases are not maintained through a central administrative service, we speak of a loosely coupled system. These systems will require techniques to deal with the uncertainty of identifying relevant data, locating their instances, and determining their status in terms of currency and quality. We have begun to investigate techniques where expert systems help manage these issues [Apers;84b, c]. While there is an inherent level of uncertainty in the obtained results, it appears that the uncertainty can be quantified and manipulated. Development of adequate techniques in this arena is essential so that the benefits of autonomous systems are not outweighed by the difficulties of sharing data in these systems.

The Design of Database Management Systems: DADAISM

Understanding the flow of data and the flow of knowledge to control the data has provided the critical underpinnings for the design of a proposed new DBMS. Here we have to consider the information flow in an application-independent manner. We consider all levels of the database system implementation problem starting with a simple operating system interface and going up to the level of knowledge-based user services [Wiederhold;86b].

This research is motivated by a plan to re-engineer a large multi-site control and command system, WWMCCS, using modern software principles. In accordance with Department of Defense (DoD) policies, the software for the WWMCS Information System (WIS) is designed to be implemented in Ada.

The methodology of the specific work is to start out fresh. This permits the use of modern software and data engineering approaches to database systems, using the power of Ada in the most appropriate way. Most of us believe strongly that the cost of adapting re-engineered applications to 20-year and 10-year old technology, as exemplified by existing DBMSs, is greater than the cost of rethinking and a new DBMS implementation. If we do not put the principles we have learned in the past twenty years to work now, we will be overtaken by others who do not carry the baggage of the past.

The proposed architecture takes advantage of the structure of Ada. Since Ada supports alternative code bodies with the same declarative specification, it becomes natural to maintain and use alternate modules. However, in order to keep the specification unchanged, the initial design must consider the information requirements by any candidate module. In practice, modules developed in the prototyping stage will not use information needed for an optimizing version of the same module type. Similarly, modules for a workstation database will not use much of the security information available for protection in a shared environment [Wiederhold;83e], [Friedman;86].

The context of this research project reinforced the consideration of security and access controls within the initial design, rather than treating security as an add-on. If a high level of access protection is needed, overall reliability and performance will be improved by incorporating clean and appropriate interfaces in the original system design. Whenever the requirements are less strict, the cost of having null or minimal modules in their place will be small [Spooner;85].

Proposing a fresh start for a major system carries a considerable risk. Doing this in the database area is made more feasible due to the initial orientation of Ada. Since Ada was designed as a language for embedded systems, it did not inherit any prior data-processing notions, a problem other general purpose languages such as pl/i have to cope with. Only the most primitive input-output functions are specified in Ada. Hence, we were able to start with a relatively clean slate.

This project presents an example of a data-oriented design methodology. In software projects where we expect that a common data structure will serve many application functions, a data semantics-oriented approach may be a better alternative to a top-down design starting from a procedural objective. In this data engineering approach we still proceed top-down, but start from the information requirements.

Especially when we envisage that modules are replaceable, procedural specifications do not provide a solid base. No procedure can generate information unless the required data and knowledge are made available to the module. When data and knowledge are identified initially, then, in most cases, a competent programmer will have few problems in coding the required transformation procedures and generating a successful program.

On reflection, it appears that the architectural approach proposed, as a conceptualization of architectures found in current DBMS, bears a significant resemblance to the blackboard concepts used in some Artificial Intelligence Systems. An early example of a blackboard is found in hearsay [Reddy;76] and recent work is bb1 [Hayes;85].

Making the distinction in this design process of:

  • Knowledge: The controlling and general information, typically complex and relatively small in volume.
  • Data: The manipulated and factual information, typically regular but voluminous.
  • helped reveal the design issues and the rules for information management within the design.

    The Architecture of Knowledge and Databases

    As we are becoming increasingly dependent on formalized semantics to handle large databases, we have to be concerned with the architecture of systems that embody both data and knowledge [Malachi;85], [Missikoff;84], [Shuey;86], [Wiederhold;85a, 85c, [Wiederhold;86a]. The knowledge bases have to address also issues due to the inherent incompleteness of the recorded facts [Winslett;86], [Blum;82b].

    KSYS

    The experience gained in the KBMS research and from other sources is being marshalled for a major subproject, ksys. The other sources include rx, implementing Knowledge Extraction from Databases [Blum;80,86a,86b]. Rx uses frame technology to manage a fairly large database [Wiederhold;81, [Wiederhold;85a]. Additional experience is derived from infobase, developed at DEC [Cherubini;84] and made available to us.

    The ksys project addresses specifically the need for an effective interface between AI Systems and Database Management Systems [Ceri;86]. The knowledge managed by the AI system represents an enhanced form of the meta-data used to describe databases. The DBMS maintains the facts which represent the continuously changing world and permit new inferences to be drawn by the AI systems linked to the database. We have found, however, that the problems in communicating between AI and database systems are not just in building an interface itself. Each side depends on its own representation structures and inference methods [Wiederhold;86a]. In an integrated system the strengths of each must be exploited. This direction reflects a maturation of the research in this field, where we can go beyond existence proofs and consider quantitative measures and architectural feasibility as well.

    The ksys architecture uses:

    1. A frame structure for the knowledge base.
    2. An inheritance scheme which is slot rather than frame based, where each slot is associated with one of an open-ended set of hierarchies.
    3. A minimal interpreter for knowledge interpretation which can be set to search forwards or backwards.
    4. An interface from attribute-defining frames to the relational database.
    5. Persistent storage of all instance information in the database.
    A number of research questions are being raised by such an architecture, and will be investigated as resources permit.

    Inferencing Over a Database

    When databases are large they can contain implicitly more knowledge about a domain than any single expert. To carry out such inferences, a model of the world represented by the database has to be constructed. Such a model involves abstractions and generalizations, and is hence a knowledge base for that domain [Blum;78, 80, 82a, c]. We have used a knowledge based system to control statistical procedure which discover and verify new relationships among the abstractions represented in the knowledge base, achieving a form of automated learning [Blum;82a], [Walker;86].

    To support the learning process it is neccessary to abstract details, stored in the database, to higher-level concepts. By moving to abstractions we can bridge missing data [Blum;81]. If we let the concepts extend over time, we can also start recognizing change, a stronger impetus to action than conventional reports describing a snapshot view of the world [Downs;86], leading to graphic summarizations of a patient's history [deZegher-Geets;88, 88].

    Lexicon Management

    Person: Davidson
    For natural-language systems to provide practical access for database users, they must handle realistic database queries involving terms identifying the stored entities such as names of towns, ships, and people. While formal languages expect the user to ask for, say, ``Size of City = `San Francisco' '' this formulation is unnatural. Natural language systems overcome this problem by having all such identifying names within their lexicons, in addition to the common verbs and nouns.

    However, databases are often quite large and subject to frequent updates. Both of these characteristics render impractical the encoding and maintenance of a lexicon associated solely with the query language processor, which requires separate maintenance,

    We have developed and implemented a technique for reducing the number of lexical ambiguities for unknown terms by deferring lexical decisions as long as possible. The database itself is then accessed to try to resolve the ambiguities and complete the parsing. A simple cost model selects an appropriate method for resolving any remaining ambiguities [Davidson;80].

    Maintenance of Databases

    Before a database can be exploited to yield information, it has to be loaded with data, and these data must be maintained to provide a correct and consistent representation of the real world. The treatment of updates to databases is not yet well treated in theory or practice. Updates to a database may conflict with general integrity constraints, may conflict with previous facts stored in the database, and may involve incomplete knowledge.

    Dealing with incompleteness in databases is a major thrust of the KBMS project. It is motivated by the observation that databases, in practice, do not contain data for all data attributes for all stored entities. This is in part an essential result from the concept of sharing data, and cannot be avoided by exhorting users to enter all data. A database pools data from a variety of sources and each source operates under different preconditions and in a different environment. It cannot be expected that a single person or unit will have the global, detailed knowledge necessary to manage the entire database; rather, each specific user of the database will be qualified to manage only a subset of this large collection.

    Specific areas of our research have included: view management, natural language or ad-hoc planning updates, subset definition, and as discussed earlier, the effect of null values and other forms of incomplete information.

    Formal Semantics of Nulls

    Person: Winslett, Keller
    <% subsection 5.4> Suppose one wishes to construct, use, and maintain a database of knowledge about the real world, even though the facts about that world are only partially known. In the database domain, this situation arises when database users must coax information into and out of databases in the face of null values and uncertainty. Such incomplete information arises most commonly from missing data in the database, but also through other scenarios such as updates through views. A database containing incomplete information represents a set of alternative worlds, one of which represents the real world, as opposed to the single alternative world modeled by complete information databases.

    In the AI domain, the problem of updating a database of incomplete information arises when an agent tries to incorporate new, possibly contradictory information into a database of beliefs reflecting its partial knowledge of the world. In the logic domain, one might choose to represent such a database as a logical theory, and view the models of the theory as possible states of the real world.

    How can new information (i.e., updates) be incorporated into the database? For example, given the new information that ``b or c is true," how can we get rid of all outdated information about b and c, add the new information, and yet in the process not disturb any other information in the database? The burden may be placed on the user or other omniscient authority to determine exactly which changes in the theory will bring about the desired set of models. But what's really needed is a way to specify an update intensionally, by stating some well-formed formula that the state of the world is now known to satisfy and letting the database management system automatically figure out how to accomplish that update.

    Alternatives for semantics of databases with incomplete information and the ramifications of these semantics have been summarized in a paper which won the student award at the IEEE Data Engineering conference in Los Angeles, 1984 [Keller;84a, 85b].

    We have explored a technique for updating databases containing incomplete information. Our approach embeds the incomplete database and the updates in the language of first-order logic, which we believe has strong advantages over relational tables and traditional data manipulation languages when information is incomplete. We have described semantics and found polynomial-time algorithms for update operators [Winslett;86, 85b]. An implementation of these algorithms has been coded in C. When null values are present in the database or in the updates, incorporating the updates directly in the database may lead to unacceptable expansion in the database size. Using methods reminiscent of those used in concurrency control, we have investigated a lazy evaluation technique which regulates this growth.

    Subset Definition

    Person: Wiederhold
    The subset concept, formalized within the structural model, permits the definition of subrelations which can be used to structurally define classes of entities where data entries would not be appropriate, thus reducing the need for null management. Since use of this concept leads to a large number of conceptula relations, in practice we will map most subsets kept on one processor system into a single file. Control of such mappings is best carried out at a knowledge-based level [Wiederhold;83a].

    To provide access to a specific subset, we may define a database view. Use of defined subsets cannot resolve all cases where nulls will occur.

    Update Constraints

    Person: Keller
    <% subsection{4}2.1> The structural model permits the description of constraints which is implementation independent. Most current systems have only incomplete validation facilities and require that programmers provide procedures which reject inappropriate updates. We have shown that the rules of the structural model provide a basis for efficiently validating updates prior to submitting them to a DBMS [Keller;81a]. If assurance is provided that the rules of the structural semantics have been obeyed at execution time, optimal query paths can then be selected while guaranteeing that the query result will not be affected. <>

    Updating through Views

    Person: Keller
    <% subsection{4}2.2> The form, content, and availability of a large database are never completely known to any single user. A user is typically restricted to a view or a set of views which define a window onto the database appropriate to the users requirements and authority. This restriction becomes yet stronger when working at remote nodes in a distributed network, since a user at any single site can certainly not be expected to have a global view.

    These restrictions lead to a class of problems when the database is to be updated. An update expressed through a view typically has multiple reasonable interpretations in the shared database. For that reason, in most current DBMSs, views can only be used to retrieve information; users operating within a view are not permitted to do updates. This restriction increases the burden on the centralized authority with power to update the database, and introduces errors due to separation of authority over data and authority over database operations.

    We have developed a theory which permits enumeration of all possible interpretations of a view update, and a complete collection of algorithms for the implementation of these interpretations. The theory is based on notions of formal correctness of updates [Keller;82, 85a, c].

    When the view includes the join columns, view update is generally feasible, especially when the key for the view is entirely contained in one of the underlying relations [Keller;81b]. This work has been extended to include other views formed by standard relational operators (selection, projection). This algorithmic approach is in contrast with the information-theoretic approach based on minimal complements [Bancilhon;81], [Keller;84b].

    In addition, we have developed an interactive approach to permit a database administrator to choose the appropriate interpretation for updates at view definition time. Once the interpretation has been fixed, an update expressed against a view will have an unambiguous interpretation in the shared database [Keller;86a].

    Natural-Language Database Updating

    Person: Davidson
    <% subsection{4}12> Problems conceptually similar to view updates arise in processing natural language queries. The difficulty here is that casual users of a natural-language system do not understand the scope nor the details of the underlying database, and so may make requests that:

    1. are reasonable, given their view of the domain, but are nevertheless not possible in the underlying database;
    2. are ambiguous with respect to the underlying database; or
    3. have unanticipated collateral effects upon the responses to earlier questions or upon alternative views of the database.
    The view concept here is a dynamic issue, not subject to predefinition by a database administrator.

    Although considerable research has been devoted to the problem of processing queries expressed in natural language, the issues and techniques for performing natural-language database updates have had little attention. The need to perform natural language updates arises when computer databases support ad-hoc planning. When using the database to plan for hypothetical future scenarios, update interpretation should not depend on the insights of a database administrator. For this situation we have developed artificial intelligence techniques to select a plausible update interpretation based on the user's previous interactions with the database, database semantic constraints, and the goal of minimal side effects. If one update interpretation clearly dominates, the user can proceed in the planning scenario without being forced into a tedious disambiguation dialog.

    A theory has been developed that makes it possible to identify a particular user's view of the database and to determine the legality, ambiguity, and potential side effects of updates expressed in natural language. A formal analysis of the problems associated with updates expressed on views (database submodels) is central to this work. A system, pique, has been implemented, based on this approach, that processes natural-language updates, explaining problems or options to the user in terms the user can understand and that, moreover, makes changes in the underlying database with a minimal disruption of other views [Kaplan;81a] [Davidson;82, 83a, 83b].

    Finite Differencing

    Person: Paige
    <% subsection{4}18> We have cooperated in the investigation of the transformational technique of finite differencing for the specification, implementation, and application of derived data in the context of a set-theoretic entity-relationship data model. A framework for the automatic maintenance of derived data has been presented. In this framework repeated costly computations are replaced by more efficient incremental counterparts. Applications of this approach are in integrity control and in query and transaction optimization [Paige;82, 83a, 83b], [Goldberg;83].

    Maintenance of Distributed Databases

    Person: Garcia
    <% subsection{4}19> The management of redundant, distributed data can seriously affect system performance. We have analyzed and developed algorithms for integrity maintenance [Garcia;77 ... 81]. The use of hole-lists to inform nodes of update status of transaction while passing a minimal amount of information has been shown to be effective [Garcia;79e]. The analyses has also shown that it is difficult to improve on the performance of centralized control methods if sufficient backup can be provided in the responsible nodes [Garcia;81].

    The separation of read-transactions into three classes, namely those that require no or minimal consistency, those that require auditable consistency, and those that require time-critical consistency, can improve aggregate system performance. The problems arising from serving transactions that support planning functions, which require access to large granules of the database, can be greatly reduced by lowering their consistency demands [Garcia;82].

    Distributed Resiliency Mechanisms

    Person: Minoura, Apers
    <% subsection{4}20> An implied, but rarely realized, benefit of having a distributed database is increased system availability in case of failures. Data availability can be accomplished via replication. Recovery of a node which fails disastrously requires repair and updating of the database at the failed node to achieve consistency.

    Such a system is considered resilient with respect to failures, rather than failure-proof. A local node may not be able to provide full local recoverability by itself. This can be due to economic considerations, since the cost of local secure redundant storage can be excessive for small systems, or can be due to operation in a high-risk environment. Techniques to use other nodes within the distributed network to restore failed nodes, whose local storage has been damaged, have been explored and documented [Minoura;78, 82, 83a, b]. We are obtaining feedback about their use in financial and manufacturing systems.

    Subsequent work in this area deals with a classification of transactions in relation to resiliency. By formally defining a set of transactions which can proceed when the database is partitioned, a set which can be conditionally handled, and a set which cannot proceed, we expect to enhance the operability of distributed databases in conditions of adversity [Apers;84a,d].

    Querying Databases

    To obtain information from a database, it must be queried. High-level interaction with databases by untrained personnel using natural language should be possible. We have applied front-end processors for natural-language queries based on previous work as ladder and ida [Sagalowicz;77] and [Hendrix;77].

    We have extended the approaches to natural language understanding addressed in earlier work [Gardner;81]. The intent here was to avoid reinventing the wheel, so that we can build on research performed earlier and study the many remaining problematical issues in regard to the human interaction with database [Rowe;82c].

    When querying issues of incomplete knowledge by the users arise, the structural model can provide guidance in query formulation and in response processing. We also tested the approach to response analysis developed by [Kaplan;79a, b, f, 80].

    Using the Graph Model of the Database

    The structural and similar semantic models can be exploited to obviate the need for joins for all those connections which are clearly understood. The concepts of ownership, reference, and subset identify ISA hierarchies which can be exploited to locate attributes at higher levels of abstraction. This means that queries giving only attributes, and not relation names, can be understood and processed according to the semantic model, as defined in gordas [ElMasri;81]. The computational effort appears to be much less than that required to deduce dynamically those relationships that are based on dependencies in a universal-relation model.

    Cooperative Interactive Interfaces to Databases

    Person: Kaplan, Davidson
    <% subsection{4}8> In an environment where a user may have incomplete or incorrect knowledge of the structure of available data, queries or updates may be posed that are either unanswerable, impossible to perform, or are otherwise misguided. Our work has shown that such misconceptions can often be detected directly from the phrasing of the user's request and a knowledge of the structure of the database [Kaplan;80]. This information can be used to correct the user's misconceptions and inform the user about the nature of the database and database system. The usefulness of our techniques for cooperative processing of transactions has been demonstrated in a library environment [Corella;84]. These techniques need to be extended to distributed environments, where the knowledge necessary to aid users in use of the database may partially reside at remote sites.

    Descriptive Responses to Database Queries

    Person: Kaplan, Davidson, more
    <% subsection{4}9> The typical response to a database query is to list the set of items from the database that satisfy the conditions presented in the query. This list can be excessively long and consequently may be inappropriate for a conversational system. Often, a more appropriate response to such queries is a description of the set, rather than a listing of its elements. For example, the response ``All corporate officers'' may be more concise and informative in response to ``Which employees profit share?'' than a list of $1\,000$ names. Practical techniques for producing a significant class of such responses >From existing database systems without using a separate world model or knowledge base have been been implemented [Kaplan;81b, 82a, b].

    Information from the structural model can provide useful hints for summarization. Since a referenced entity provides a higher level abstraction of the referencing objects, the typically small set of reference connections can provide the candidate relations for descriptive responses.

    Task Models

    Person: Davidson, Rowe
    <% subsection{4}10> Certain entities in a database--be they fields, attributes, relations, or more complicated constructs--have certain probabilities of being required in an interrogation of the database. These probabilities are dependent on the particular task a user may be performing and his or her focus and interests [Grosz;77a,b].

    By incorporating a task model into the KBMS, these characteristics can be used in a variety of ways to improve the utility of the system. Four main uses of task models have been explored in a KBMS:

    1. inclusion of relevant information not explicitly requested in the response to a query;
    2. organizing responses so that the more interesting items are presented first;
    3. checking for semantic irregularities in the performance of the task; and
    4. prefetching of items and fields that have not yet been requested but are likely to be in the near future [Rowe;82d, 83c] [Davidson;82].
    We found that tracking the user's focus dynamically during a single session to be beyond our capabilities.

    Semantic Query Optimization

    Person: King
    <% subsection{4}13> A request for information can often be formulated in more than one way, depending on knowledge about the subject domain and the ingenuity employed in determining the best access path to the desired information [Martin;77]. A question about all ships currently carrying iron ore, for example, can be answered by looking only at information about bulk ore carriers, assuming that it is known that only bulk ore carriers carry iron ore.

    Semantic query optimization is an approach to improve the performance of information requests that uses such domain knowledge automatically. The objective is to transform a query into a semantically equivalent one (i.e., one that produces the same answer but can be processed more efficiently).

    This work demonstrates that semantic knowledge about the database can be used in a new way to achieve efficient data retrieval. The method supplements conventional query optimization methods. Drawing on semantic, structural, and processing knowledge, it makes use of the heuristics of query processing developed in prior research. A system, quist, has been implemented to explore this approach. Improvements have been demonstrated in a range of query types by means of transformations that add or delete constraints on attributes or even entire files, as warranted by the database and query structure. In other cases, the system detects unsatisfiable query conditions without inspecting the database, or aborts inference when it determines that no improvement can be expected. Analysis overhead is low in all cases [King;79, 80a, b, 81a, b].

    Common Expression Analysis

    Person: Finkelstein
    <% subsection{4}20> Another optimization task depends on expressions and their results from prior queries. We have shown that common expression analysis can be applied in three different settings for optimization of database request. The query graph representation of queries was originated for this work. It represents the structure of requests in an intuitively clear manner. Also the decomposition into nodes (relation occurrences in queries labelled with internal selection predicates and projected attributes) and edges (join predicates) serves as a convenient structure for automatic analysis in query optimization.

    We have written a program that detects whether or not the result of one query can be used to support answering another query. Finkelstein describes both the query graph and the first two settings for optimization described below [Finkelstein;82].

    1. In an interpretive system, answers to previous queries, or temporaries formed in the process of obtaining them, may significantly reduce processing time for subsequent queries. We have defined a methodology, based on the query graph representation, for deciding cheaply whether such temporaries are helpful.
    2. A transaction which contains multiple queries may perform redundant database operations. We have described a procedure for optimization of a collection of queries, which involves computing least upper bounds on query graph nodes (which correspond to scans through relations which might profitably be combined). These are extended to maximal subexpressions between queries, and a heuristic procedure examines possible query revisions to find the best execution procedure.
    3. We have extended the concept of optimization of generated code to languages with embedded queries (such as sql in pl/i). We have analyzed how queries in loops can be optimized differentially. Also, we have studied how global flow analysis can be used to determine available database expressions at places in the program.
    This work is now being applied within the IBM San Jose Research Laboratory to prototype relational database systems.

    Menu-Guided Cooperative Ad Hoc Query Interface

    Person: Sang Cha
    Ad hoc query interfaces based on a linear syntax language burden the end users with the need of remembering the precise syntax and the semantics of the query language as well as the structure of data. This is true with not only those based on formal languages like SQL but more natural yet still artificial ones that have been supported in many natural language interface systems. While the natural language interfaces were initially motivated to get around the difficulties of using the formal ones, their success has been limited by their incomplete linguistic and domain knowledge. The imperfection of a natural language interface is more problematic to the end users, causing psychological interference between the user's habitual language and the system's accepted language. Recent work on transportable natural language interface systems provides user-friendly domain knowledge acquisition tools on top of a structured knowledge base. Such tools help to reduce the end user problem by making it easy to add the domain knowledge, but they make only a partial solution in face of an indefinite amount of system knowledge required for handling open-ended user behavior. Our research addresses the presence of the problem more in the passive and uncooperative system behavior. We propose an active ad hoc query interface system that is cooperative at the early stage of the user's query composition. This also makes our approach different from that of other cooperative systems based on post-query corrective/informative response. To implement the early cooperative response, we borrow the idea of menu-guided query creation from a so-called menu-based natural language interface NLMenu of Texas Instruments, Inc., in which uses the users are guided by a pseudo-natural query language grammar. With the menu guidance, a query is constructed incrementally through a multiple steps of menu interaction. At each step, a new menu state is predictively generated by the system for the next legitimate increment to a partially constructed query. As the user chooses one of the presented choices, the system proceeds to the next step. In addition to the grammatical knowledge of a query language, the system needs structured domain knowledge to make the menu guidance a cooperative response. We found the following knowledge is essential: (1) the conceptual knowledge over a domain, which allows to talk about objects and legitimate relationships among them and; (2) the domain and integrity constraint knowledge for pruning further the choices leading to intensional failures. The lack of such knowledge not only may lead to either a failing or a meaningless query as experienced in NLMenu but also reduces the effectiveness of menu interaction by flooding the menu space with irrelevant choices and thus increasing the user's choice search time.

    Compared with conventional natural language interface systems, our approach is based on a normative system assumption. The system works with a relatively small-sized grammar, vocabulary, and knowledge base. Besides, they can be made to grow incrementally.

    To demonstrate the effectiveness of the menu-guided cooperative ad hoc query interface described in this section, we plan to implement it on a XEROX lisp machine with a SQL DBMS configured on a remote server. The idea of menu-guided cooperative system carries over to formal query language interfaces as well. In addition to testing over a small subset of natural language grammar, we plan to create a menu-guided cooperative interface for a subset of SQL. We expect such SQL interface to be useful to casual and advanced database users, while the interface based on a natural language to be suited for wide range of users.

    Use of a Statistical Abstract of a Large Database

    Person: Rowe
    <% subsection{4}14> The size of data sets subjected to statistical analysis is increasing as computer technology becomes more sophisticated [Wiederhold;84c]. It is an attractive alternative for analysis to have quick estimates of descriptive statistics rather than exact values obtained with considerable delay. We have demonstrated a new technique for estimating database statistics that is a top-down alternative to the bottom-up method of sampling. This approach precomputes a set of general-purpose database statistics (a database abstract) and then uses a set of approximately 400 inference rules to make bounded estimates of other, arbitrary statistics requested by users.

    The inference rules comprise a new example of an artificial-intelligence expert system. There are several important advantages of this approach over sampling methods, as has been demonstrated experimentally in estimating a variety of statistics on two different databases [Rowe;81b]. [82b, d, e, f, 83a, b, d].

    Applications

    <% section 8> In order to gain experience with our concepts and validate their applicability, we also carry out research in application areas. We have obtained databases of merchant ships, of computer circuits, and of chronic disease patients to support this work.

    Experiments with Database Technology to Support VLSI Design

    Person: Milton, Beetem, Winslett
    The implementation of a large and complex device requires a massive design effort. The length of the design cycle of VLSI non-regular devices is often critical. For a complex device, approaches involving only a single designer are either too slow or make so many compromises that they utilize chip area very poorly. If instead, a team of designers work together on a project, they will require much support for communication and task scheduling. We have applied our techniques to manage databases to aid the VLSI design process. The database is used here as a communication medium among people working at different levels on the same object, namely the same device or system. This use of a database is in contrast to current design automation aids which use task specific files without automatic feedback of analysis results or decisions made by the designers.

    Logic and circuit design information of two devices, an ALU and a pdp-11 CPU, and the component library required to build them, have been placed into databases. The lower level data elements can be automatically generated using an existing design automation program. Performance measurements indicated only a performance degradation by a factor of about 2 versus use of specialized design files [Wiederhold;80b]. Modification of lower level elements during the design process is signalled automatically, using a height-first algorithm, to the related parent levels, so that this detailed knowledge can be incorporated in the higher level abstraction when these are accessed during successive design iterations [Beetem;82], [Wiederhold;82b].

    We have developed some access techniques for the design data which permit retrieval of data that is partially stored. If required elements are not stored, they are generated from higher level circuit or logic specifications. Partial storage is important. The volume of lower level elements in a VLSI design can be greater than is manageable by the largest storage devices now available, so that automatic methods for VLSI design automation will need access to a mix of generatable, regular elements and instantiated, irregular elements if they are to handle large circuits that are not totally regular [Wiederhold;82b].

    Our initial experiments utilized a codasyl database system, DBMS-20. We are in the process of evaluating our approaches for other database environments to make our results more accessible to practitioners. We have ported the database to a relational and object oriented system at XEROX Palo Alto Research Center and evaluated the performance of the VLSI-design programs in both a purely relational and an entity-relationship model implementation [Winslett;84, 85a]. The entity-relationship model improved performance, but the overriding factor in both versions was the cost of writing the results to disk.

    An important byproduct of this work is an assessment of the tradeoffs among database implementation models. Much argument has been created in the arena of relational versus network implementations. We believe that some solid experiments in this one critical area can help in quantizing the issue in an objective fashion [Milton;83].

    We have begun to consider the issue of the automatic management of design versions. This task is performed now by databases that support engineering change control [Olumi;83]. The same problem is being addressed in conventional databases that are concerned with time or planning updates, but there are yet few results to draw upon, so that in fact the design area may be able to lead in this research field [Belady;77] , [Bennett;82], [Parnas;76], [Rochkind;75]. The high capacity permanent storage provided by optical disks permits a greater retention of past versions [Rathmann;84].

    Planning

    Person: Pednault
    <% subsection{4}16> Databases are important resources for planning. Much planning is based on extrapolation from past events and includes hypothetical data defining alternate future scenarios. The intrinsic complexity of planning achievement of a goal from a current state is high.

    In research performed at the SRI International AI Center theoretical issues in planning are being investigated. A method which permits plans to be created, although not necessarily executed, efficiently is the focus of current research [Pednault;84], [Manna;86]. The method can be related to many techniques which had a heuristic base.

    Experiments in Intelligent Data Analysis (RX)

    Person: Blum
    <% subsection{4}17> If database analysis can deal effectively with both past facts and multiple future scenarios, its use as a planning tool will be greatly enhanced. There remains a need for improving the corresponding analytical support capability of planning systems. Many of the critical data which are found in large systems appear in a time sequence. Our existing systems have not dealt well with the special demands stemming from the temporal relationships among the data. In related projects (acme, aamrs, aramis, rx) we have had the opportunity to structure, collect, and analyze large medical databases. In such databases, for instance tod, time is a key attribute [Weyl;75]. Some of this work was performed at a system level [Wiederhold;80a, c, 81a, c, 82c, f], [Kuhn;82, 84], while other work analyzes the content statistically [Blum;81].

    Content analysis requires large collections of well maintained data. Many years of operational use were required to get to this stage. In this research we have developed tools that use artificial intelligence techniques to assist in the processing of time-oriented data. A frame-based model is used to capture domain knowledge. This model assists in combing the database for unexpected time-lagged correlations and reporting such events as prospective hypotheses. If analysis of the finding appears warranted, a statistical model is built that includes all possible covariates given in the knowledge base.

    The appropriate statistical tests are selected by applying a collection of rules that use the descriptions of the data from the database schema and the conditions that are satisfied by the statistical procedures. If the findings turn out to be significant, they are added to the knowledge base for use in another iteration of hypotheses generation and testing. This approach to knowledge generation complements the familiar, but less formal approach taken by expert systems [Shortliffe;73, 76, 79].

    This approach has been demonstrated by using a small subset of a time-oriented rheumatology database (aramis) [Fries;79]. The statistical methods now available are limited to multi-variate regression. Tentative hypotheses were generated and automatically tested for statistical significance. Several verified hypotheses could be added to the original loaded knowledge base [Blum;82a, b, c].

    Work is continuing in the medical arena, but its applications to other databases (e.g., military manpower management) are also under way. Basic research to deal with reduction of the inherent complexity of hypotheses generation is continuing.

    Tools

    <% section 9> As part of the research we have developed software tools. Some of these have had broader applicability and have been used as prototypes for commercial and industrial database modules. We are further investing in this area again, partially so that we may rapidly build demonstration systems, but also to demonstrate that future database and knowledge base systems can be effectively assembled from standard modules.

    Modules developed and under consideration include a symmetric file access system, a database definition language processor, a relational algebra processor, and recovery support. We hope to integrate these on the naxos system, a vax 750 dedicated to experimentation with advanced databases.

    File Access System

    Person: Allchin, Qian, Keller
    A file access system that uses symbolic keys to access variable length records based in pascal and supporting several host languages, including pascal, interlisp, and fortran has been developed and tested. The services that this system, named flash, expects from the underlying operating system are limited to:
    1. directory management for named segments of secondary storage, and
    2. access to fixed size blocks or pages of these segments.
    In a multi-user environment some locking facilities are also needed. Since this subproject is the basis for long-range database system development, reliability and efficiency have been major design and implementation objectives. It is specifically designed to provide strong and symmetric support facilities for databases, so that powerful database systems can become easier to implement than they are using conventional files designed with only programmer's needs in mind. The underlying structure uses B+ trees for storage of both primary and secondary keys [Allchin;80]. This system will be used to study various dynamic storage and retrieval strategies. The experience of implementing flash has been used to define an Input-Output package for the Ada language [Keller;86c].

    While traditionally the notions of primary key have implied both a high degree of locality as well as uniqueness of key, it is desirable for a database-driven system not to make that distinction. Included in that development is a port to dec vax equipment. The available unix software is particularly deficient in the area of file management.

    Intelligent File Systems

    Person: Shih
    <% subsection{4}22> The understanding we have developed in the file system area leads us to consider hardware implementations for these facilities. The processing capability of modern disk controllers is such that a substantial fraction of a file system could be provided outboard of the main processors.

    Such an intelligent file server would respond to high level queries. Whereas conventional equipment requires the processor to specify a device and a storage address for retrieval, we foresee having to submit only a symbolic name and a key. The objects to be retrieved will no longer be fixed-length blocks, but sets of variable length records.

    The difficulties of moving advanced file functions out of the processors realm are in the respecification of the interface. Linkages for serial access within the retrieved sets, algorithms for memory allocation and for setting and managing locks for concurrent access all need analysis and formalization.

    Database Machines

    Person: Shaw
    <% subsection{4}23> One focus of research activity within the KBMS project has involved the design and analysis of alternative hardware machine architectures oriented toward the very rapid execution of the relational operations which arise frequently in large-scale database management applications. This research has yielded certain surprising and potentially important results which may ultimately prove useful in designing cost-effective, extremely high-performance database machines [Shaw;79].

    These results have led to a ``high-level'' design for a specialized non-von Neumann machine, portions of which would be implemented in VLSI, which would support the highly efficient evaluation of the most computationally complex operators of a relational algebra, specifically projection with elimination of duplicate tuples and the equi-join. Based on a hierarchy of associative storage devices, this architecture permits an O(log n) improvement in time complexity over the best known evaluation methods for these operators on a conventional computer system, without the use of redundant storage, and using currently available and potentially competitive technology. In many cases of practical import, the proposed architecture should also permit a very significant improvement (by a factor roughly proportional to the size of the primary associative storage device) over the performance of previously implemented or proposed database machine architectures based on associative secondary storage devices [Shaw;80a, b, 81]. This work is the basis for further VLSI data-oriented architectural research in progress at Columbia University.

    Access to Optical Storage

    Person: Rathmann
    <% subsection{4}24> Optical storage technology promises to greatly increase the capacity and reduce the cost of data systems. Optical storage comes with media which are either:
    1. created from master disks and can only be read subsequently
    2. written once by appending blocks of data, but never erased
    3. erased and rewritten, similar to magnetic storage.
    The type of optical storage which we consider in particular uses optical disks which are writable once, but can be read often. It is these devices that can serve data-processing needs because of their incremental writability and their high storage density.

    While research on rewritable optical media is progressing, it seems that the linear density will be reduced by at least a factor of 2, leading to reduction in storage capacity by a factor of 4.

    With this development, new problems have to be addressed. At this time we recognize that:

    1. more data will be replicated because storage costs are dropping more rapidly than communication costs
    2. more versions of past data will be kept
    3. the index mechanisms used for smaller, rewritable devices are not directly applicable to this medium.
    While we are addressing the problem of replication in general, the problems of indexing and access to versions has our particular attention. In our databases we find that the space occupied by indexes can exceed the storage space used for the data. Traditional indexing techniques depend on wholesale or incremental update of the index trees. This means that escaping to magnetic storage for index maintenance will create other restrictions. Either past versions will not be indexed, so that that they are lost for practical purposes, since scanning of these high volume storage devices will take many hours, if not days; or the granularity of indexing will be high, so that one retrieval must fetch millions of characters, which must then be post-processed.

    We have developed a technique where indexing structures, in particular tries and various kinds of trees [Knuth;73], are stored incrementally within the data blocks as they are rewritten [Rathmann;84]. This appears to solve the basic problem, although further investigations are foreseen. The trie approach has been implemented [SYTH;85], in an optical version of flash, a language independent portable file access system [Allchin;80a].

    Conclusion

    Our work on Knowledge-Based Management Systems, the Management of Distributed Knowledge, and Management of Knowledge and Databases at Stanford and SRI International has demonstrated that substantial improvements in the utility of database systems result from the application of techniques borrowed from Artificial Intelligence. This work also incorporated considerations of distributed semantics. The incorporation of higher-level semantic knowledge into database systems has been shown to help with the design, improve the manageability, and increase the utility of data in a variety of ways. Upgrading databases from passive systems which mainly store information to active systems which understand information, by incorporating knowledge of the data and the domain into the system itself, is a natural and neccessary precondition for growth in such systems. Our research conclusions can be summarized as follows:
  • Traditional, algorithmic techniques provide a solid foundation for work in databases, but the models used cannot cope adequately with the complexities and the scale of systems of realistic proportions.
  • The techniques developed in artificial intelligence research apply well to the database world because of the intrinsic formal abstraction provided by databases in their modeling of the real world. Artificial intelligence research has been most effective where it has focused on highly structured domains that are subject to simplified abstract descriptions. Databases provide a realistic example of such a domain.
  • Within the KBMS framework we have been able to develop innovative approaches to modeling the database needs of diverse user groups and integrating various user database models; improving query processing through the use of task models, semantic query optimization, common subexpression analysis, and giving descriptive responses; as well as dealing with issues of updating large databases, including problems arising due to incomplete data, incomplete knowledge in the users view, and natural-language database updates.
  • To support this research we have developed tools, such as portable file-access systems, database technology to support VLSI design, designs for database machines, and methods for optimal design of centralized and distributed physical databases.
  • Our work is finding applications in the design and operation of information systems in the world outside the laboratory. Moreover, it is beginning to contribute to tasks being undertaken by NASA, organizations involved in military manpower planning, and other institutions, including applications for manufacturing, medical, economic planning, and financial systems [Ghosh;79], [Wiederhold;79d].
  • Our work has been characterized throughout by a conviction that future systems will have to deal with large data and knowledge bases. In such environments a user has only a limited view from which to manage the database and infer new information for decision-making. Views may be physically related to distribution of storage. We are now synthesizing and formalizing the results from our studies. To integrate the concepts developed we have chosen an approach which permits distinct maintenance of data and knowledge, while at the same time integrating the inference processes. To store the knowledge we are employing frames, while the databases uses a relational DBMS. Efficiency issues are being addressed through dynamic binding of information as it is being brought into memory.

    Conceptual~Interactions~of~KBMS~Components

                             ---------------------------
                            |           Users           |
                             ---------------------------
                       Free form||               /\             /\ data
                        queries ||               ||             || flow
                          and   ||               ||
                        updates ||               ||
                                \/               ||              ^ control
        -------       ------------------         ||              | flow
       |       |====>| Natural language |        ||
       | User  |     |   front end      |        ||
       |       |      ------------------         || Responses
       | Focus |           |  |   ||             ||
       |       | Lexical   |  |   || Knowledge   ||
       |       | data      |  |   || base        ||
       |       | inter-    |  |   || queries     ||
       |       |  action   |  |   ||             ||  
       |       |           v  |   \/             ||  
       |       |      -----------------     -------------      ------------   
       |       |     |      Domain     |   | Cooperative |    |  Database  | 
       |       |<====|     Knowledge   |===|  response   |<===|  Abstract  |  
        -------      |       Base      |   |  analysis   |    |            |  
                      -----------------     -------------      ------------   
                        ^  |  |   ||             /\  ^                 /\^      
                Organi- |  |  |   || Database    ||  | Access          ||      
                zation  |  |  |   || queries     ||  | path            ||      
                advice  |  |  |   ||             ||  | information     ||      
                        |  |  |   ||             ||  |                 ||      
     -------------\     |  |  |   ||             ||  |                 ||      
     |    Data  |  \    |  |  |   ||             ||  |                 ||      
     |    View  |   \   |  |  |   \/             ||  |                 ||
     |   Models |    \ ------------------------------------            || 
     |     of   |\         Structural Schema               |           ||      
     |  Applica-| |  / ------------------------------------            ||      
     |   tions  | | /   |  |  |   ||             ||  ^                 ||      
     |          | |/    |  |  |   || Data        ||  |                 ||      
     ------------/|     |  |  |   || base        ||  | Access          ||
                  |     |  |  |   || queries     ||  | path            ||
         Structural     |  |  |   ||             ||  | information     ||      
        information     |  |  |   ||             ||  |                 ||      
                        |  |  |   \/             ||  |     ----------  ||   
                  ------------------------------------    | Access   | ||    
                 |        Access Path Organizer       |<=>|statistics| ||    
                  ------------------------------------    |  file    | ||     
                        |  |  |   ||             /\        ---------   ||    
            directives  |  |  |   || queries     ||                    ||
                        |  v  |   \/             ||                    ||  
      -------------------    -----------------   ------------------    ||   
     | Modern File access|  | Commercial DBMS | | VLSI based hard- |   ||    
     |  software, FLASH  |  |   a.o.CODASYL   | |   ware concepts  |   ||    
      -------------------    -----------------   ------------------    ||   
                                  ||             /\                    ||     
                                  ||             ||                    ||     
                                  ||             ||                    ||     
             -------------------------------------------------------------
            |               DATA BASES                                    |     
            |  1: Merchant ships, ports, cargo, voyages                   |     
            |  2: Logic of a DEC-11 machine, circuits, components         |     
            |  3: Time-oriented data from Arthritis patients              |     
             -------------------------------------------------------------     
    

    Stanford Reports and Papers

    These books, papers and reports document research performed fully or partially under current and preceding ARPA contracts to the investigators collaborating in the KBMS project.} <% KBMS References , Maintained by Peter Rathmann> <% Homebase is [SCORE]KBMS.REFERENCES> <% used by KBMS.TEX and KBMS.PAPERS>
      Entries with `*' are KBMS documents, others are anciliary documents.
    1. [Allchin;80a]* {J.~Allchin, A.~Keller, and G.~Wiederhold: ``flash: A Language-Independent Portable File Access System''; Proceedings of the ACM-SIGMOD Conference, May 1980, pp.151-156.}
    2. [Allchin;80b] {J. Allchin: ``Modula and a Question of Time''; IEEE Transactions on Software Engineering, vol.SE-6 no.4, July 1980, pp.390-391.}
    3. [Apers;84a]* {Peter M.G. Apers, and Gio Wiederhold: ``Transaction Classification to Survive a Network Partition''; Stanford CS report STAN-CS-85-1053, August 1984.} <%submitted to ACM TODS, August 1984.>
    4. [Apers;84b]* {Peter M.G. Apers, and Gio Wiederhold: ``Expert Database System in a Loosely Coupled Environment''; Proc. First Workshop on Expert Database Systems, Kiawah Island, South Carolina, Oct.1984, Institute of Information Management, Technology and Policy, Univ. of South Carolina, vol.2, pp.611-617.}
    5. [Apers;84c]* {Peter M.G. Apers, and Gio Wiederhold: ``Transaction Handling in a Loosely Coupled Environment''; Proc. ACM Int. Conf., Florence Italy, 1985, North-Holland Pub.}
    6. [Apers;85] {P.M.G. Apers and G. Wiederhold: ``Transaction Handling in a Loosely Coupled Environement''; in Computing 85: A Broad Perspective of Current Developments, G. Bucci and G. Valle (eds), North Holland 1985, pp. 27-36.}
    7. [Arvidson;85] {Raymond Arvidson, Gio Wiederhold, et al: Issues and Recommendations Associated with Distributed Computation and Data Management Systems for the Space Sciences, volume 2; Committee on Data Management and Computation, Space Sciences Board, National Academy of Sciences, January 14, 1985.}
    8. [Banks;85] {Peter M. Banks (chairman SESAC Task Force on Scientific Uses of the Space Station): Space Station Summer Study Report; Michael Wiskerchen and Gio Wiederhold: ``Section{5}2: IOC Facility and Operations Requirements: Communications and Information Systems''; NASA, Washington DC, March 21, 1985.}
    9. [Barr;80] {A. Barr and J. Davidson: Representation of Knowledge; Stanford University CS Report CS-80-793, March 1980; also in The Handbook of Artificial Intelligence, vol I, A.Barr and E.Feigenbaum (eds), 1981.}
    10. [Barsalou;87a]* {T. Barsalou, W.A. Moore, L.A. Herzenberg, and G. Wiederhold: ``A database system to facilitate the design of {FACS} experiment protocols (abstract)''; Cytometry, 97, August 1987.}
    11. [Barsalou;87b]* {Thierry Barsalou and Gio Wiederhold: ``Applying a semantic model to an immunology database''; In W.W. Stead, editor, Proceedings of the Eleventh Symposium on Computer Applications in Medical Care, pages 871-877, IEEE Computer Society Press, Washington, D.C., November 1987.}
    12. [Barsalou;88a]* {Thierry Barsalou: ``An object-based architecture for biomedical expert database systems''; In R.A. Greenes, editor, Proceedings of the Twelfth Annual Symposium on Computer Applications in Medical Care, pages 572-578, IEEE Computer Society, Washington, D.C., November 1988.}
    13. [Barsalou;88b]* {Thierry Barsalou and Gio Wiederhold: ``Knowledge-based Mapping of Relations into Objects''; Computer Aided Design, Butterworths, May 1987.}
    14. [Barsalou;89a]* {Thierry Barsalou and Gio Wiederhold: ``A Cooperative Hypertext Interface to Relational Databases''; Stanford Knowledge Systems Laboratory Report KSL-89-18, March, 1989}
    15. [Barsalou;89b]* {Thierry Barsalou, R. Martin Chavez, and Gio Wiederhold: ``Hypertext interfaces for decision-support systems: A case study''; to be published in Proceedings of {MEDINFO'89}, IFIP, Beijing, China, October 1989.}
    16. [Beetem;82]* {Anne Beetem, Jack Milton, and Gio Wiederhold: ``Performance of Database Management Systems in VLSI Design''; IEEE Database Engineering Bulletin, vol.5 no.2, June 1982, pp.15-20.}
    17. [Blum;78]* {R.L. Blum and G. Wiederhold: ``Inferring Knowledge from Clinical Data Banks Utilizing Techniques from Artificial Intelligence''; Proc. 2nd Symp. on Computer Applications in Medical Care, Nov.1978, Washington DC, IEEE, pp.303-307.}
    18. [Blum;80]* {R.L. Blum: ``Automating The Study of Clinical Hypotheses on a Time-Oriented Database: The rx Project''; MEDINFO 80, Lindberg and Kaihara(eds.), IFIP, North-Holland, 1980, pp.456-460.}
    19. [Blum;81]* {R.L. Blum: ``Displaying Clinical Data from a Time-Oriented Database''; Computers in Biology and Medicine, vol.11 no.4, 1981.}
    20. [Blum;82a]* {R.L. Blum: ``Discovery, Confirmation, and Incorporation of Causal Relationships from a Large Time-Oriented Clinical Database: The rx Project''; Computers and Biomedical Research, vol.12 no.2, pp.164-187, 1982.}
    21. [Blum;82b] {Robert L. Blum: Discovery and Representation of Causal Relationships from a Large Time-Oriented Clinical Database: The rx Project; Lecture Notes in Medical Informatics, no.19, Lindberg and Reichertz, (eds.), Springer-Verlag, New York, 1982, 242 pp.}
    22. [Blum;82c]* {Robert L. Blum and Gio C.M. Wiederhold: ``Studying Hypotheses on a Time-Oriented Clinical Database: An Overview of the rx Project''; Proc.of the 6th Symposium on Computer Applications in Medical Care, Oct.30-Nov.2 1982, Washington, DC, IEEE 82 CH1805-1, pp.725-735.}
    23. [Blum;82d] {Robert L. Blum: Discovery and Representation of Causal Relationships from a Large Time-Oriented Clinical Database: The rx Project; Springer Verlag, Lecture Notes in Medical Informatics no.19, 1982}
    24. [Blum;86a] {R.L. Blum: ``Computer-Assisted Design of Studies Using Routine Clinical Data: Prednisone Elevates Cholesterol''; Tentatively accepted for publication in Annals of Internal Medicine, currently being revised for publication.}
    25. [Blum;86b]* {R.L. Blum: ``Two-Stage Regression: Application to a Time-Oriented Clinical Database''; Submitted for publication.}
    26. [Brinkley;83] {Brinkley,J.F.: ``Artificial Intelligence and Ultrasonic Imaging: The use of learned Shape Knowledge to Analyze 3D data''; Proc. 28th annual meeting , American Institute of Ultrasound in Medicine, New York, oct. 1983.}
    27. [Ceri;81]* {Stefano Ceri, Shamkant Navathe, and Gio Wiederhold: Optimal Design of Distributed Databases; Stanford CS Report STAN-CS-81-884, Dec.1981.}
    28. [Ceri;83]* {Stefano Ceri, Shamkant B. Navathe, and Gio Wiederhold: ``Distribution Design of Logical Database Schemas''; IEEE Transactions on Software Engineering, vol. SE-9 no.4, July 1983, pp.487-563.}
    29. [Ceri;84a] {Stefano Ceri and Guiseppe Pelagatti: Distributed Dat