Documentation of Wrapper Components

Matcher - Parser - Library - Optimizer


UPDATED
AUTHOR
07/10/1996Marcus Breunig
07/14/1996Joachim Hammer


Table of Contents

  1. About This Document
  2. The Matcher
    1. What Matches What
    2. Matcher Files
    3. Execution of Actions
    4. Example
    5. Interface and Error Handling
    6. Implementation Notes
    7. Testfile main.cc
  3. The Library
  4. The Parser
    1. "Site" Requirements
    2. Actions, Rulenames, and Desirability
    3. New Interface and Error Handling
    4. Class TRACER
    5. How to Free the Memory of the Parse Trees
    6. Changes to pdp.h
  5. The Optimizer
  6. What File Goes Where in the TSIMMIS Directory


1. About This Document

This document was written by Markus Breunig (breunig@cs.stanford.edu) and describes the work I have done in June/July 1996 on the TSIMMIS project.

2. The Matcher

My original and main project was to write code to match templates and queries for the wrappers in TSIMMIS.

2.1. What Matches What

Queries contain: variables, constants
Templates contain: variables, constants, placeholders

Query
Template
Action
Variable XVariable Y remember binding X=Y
Constant Cerror
Placeholder Perror
Constant CVariable Y generate filter Y=C
Constant Cok
Constant Derror (D !=C)
Placeholder Premember binding P=C

Template
Query
Action
Variable YVariable Xremember binding X=Y
Constant Cgenerate filter Y=C
Constant CVariable Xerror
Constant Cok
Constant D(D !=C) error
Placeholder PVariable Xerror
Constant Cremeber binding P=C

2.2. Matcher Files

The directory contains code to match a query with a template. The following files are important:

Contains the data structures used to represent both the query and the template (abstract syntax tree). It was originally written by Yannis as part of the mediator parser. I tried to document it as much as I could. I added the action*type structs to support actions and support for placeholders. I also extracted all global variable definitions from pdp.h and put them into pdp.cc. pdp.h has been moved into the include/ directory. Manage the table of what matches what. more specifically what variables in the query match what variables in the template, and what placeholder in the template is assigned what constant value in the query. Do the matching itself. Modelled very closely along the data structued in pdp.h, for every struct one function to match it (mostly at least). Also contains code to execute the actions after the matching is complete. Matching is the process of 1) finding which placeholders are bound to which constant 2) checking that the query can be answered by the template 3) generate filters for indirectly supported queries. A backtracking algorithm is used. Instantiates the templates used in the matcher. Picks the best rule out of all matching rules. See 5. Testcode for the parser and the matcher.

2.3. Execution of the Actions

The actions are executed left to right. Newly defined placeholders can be used in the next rule. $$ is evaluated, however as we currently match the query to only one template (and not multiple templates as in the more general case) $i is never defined.

2.4. Example

To illustrate lets have a look at a concrete example:

We use a template consisting of 2 rules:

C :- C: <report {<title $X>}>
// "select * from book where Title = " ^ $X //.

B :- B: <book {<author $X><title Y><date Y>}>
// $S = "select * from "
$T = $S ^ "book where ",
$$ = $T ^ "Author = " ^ $X //.

and submit the following query:

A :- A: <book {<author "Joe"><title "1991"><date "1991">}>.

The parser and matcher together will produce the following results:

** Parsing template
C_0 :-
C_0:<OID1_0 report {<OID0_0 title $X_0>}>@(null)
// "select * from book where Title = " ^ $X_0 //.

B_1 :-
B_1:<OID5_1 book {<OID2_1 author $X_1> <OID3_1 title Y_1>
<OID4_1 date Y_1>}>@(null)
// $S_1 = "select * from " ,
$T_1 = $S_1 ^ "book where " ,
$$ = $T_1 ^ "Author = " ^ $X_1 //.

** Parsing Query
A_0 :-
A_0:<OID9_0 book {<OID6_0 author Joe> <OID7_0 title 1991>
<OID8_0 date 1991>}>@(null).

** Matching

** Results
Rule chosen : 1
Action : "select * from book where Author = Joe"
Placeholders:
{
$X_1 = Joe
$S_1 = select * from
$T_1 = select * from book where
}

extractor :
A_0:<OID9_0 book {<OID6_0 author Joe> <OID7_0 title 1991>
<OID8_0 date 1991>}>

constructor :
A_0

Optimizer chose rule: 1


Note:
Rules are numbered starting from 0. The second rule (rule 1) illustrates the usage of additional placeholders.

2.5. Interface and Error Handling

The matcher has a very simple interface. Standard code to use it would look like this:

void f() {
Match *m;
struct rulelisttype *templ;
struct rulelisttype *query;
Tracer TrQuery, TrTempl;
Parser parser;
char **vtTempl, **vtQuery;

// parse the template file
templ = parser.File("TEMPLATE");
// check if the parsing was successful
if(templ==NULL) { // parse error
cerr << " Parser error parsing the template. " << endl;
} else {
// get the symbol table from the parser
vtTempl = parser.GetVT();
// initialize the tracer with the symbol table
TrTempl.SetVT(vtTempl);
// print the parsed template
TrTempl.print(templ);

// parse the query file
query = parser.File("QUERY");
// check if the parsing was successful
if(query==NULL) { // parse error
cerr << " Parser error parsing the query. " << endl;
} else {
// same as for template
vtQuery = parser.GetVT();
TrQuery.SetVT(vtQuery);
TrQuery.print(query);

// create the matcher and let him do the matching
m = new Match(query->rule, templ);
// get the results from the matcher
const MatchResult *mr = m->GetResult();

// see what the matcher says
if(mr==NULL) { // matching not successful
if(m->GetState()==S_NOMATCH) {
// matching did just fail
} else if(m->GetState()==S_ERROR) {
// error while matching
// in this case more info about the error can be found in the
// ErrMsg string the matcher provides

cerr << "ERROR: " << m->GetErrMsg() << endl);
}
} else {
// matching was successful
// use the mr->... to continue working
// optimize the result.

Optimizer o;
mr = o.optimize(mr);
// now use mr to go to the source
}
delete m;
delete vtQuery;
}
delete vtTempl;
}
}

2.6. Implementation Notes

The following is NOT a concise abstract about the implementation of the matcher, but rather just s bunch of points I think might be important for someone trying to understand the code.

Matcher-local variables:

table
The binding table. It keep which template var is bound to which constant or query var and which placeholder is bound to which constant. The current contents can be checkpointed and restored, which is used for backtracking purposes.
mr
List of all the rules the were matched successfully so far and the infos gathered during the matching process. Newly matched rules are inserted at the head of this list.
ruleNo
Number of the rule we are currently trying to match.
ruleVar
number of the var in the template rule we are currently trying to match.
trule
pointer to the rule we are currently trying to match.
query, templ
the query and the list of rules as provided by the user.
extractor
The part of the query used to extract the result.
predicates
list of all predicates in the template that could not be matched to the current rule.
in_extractor
True when we are currently working an a subtree of the template var definition in the rule.
QueryHeadVars
Set of all the vars in used in the query head.
GoodVars
All query vars accessible after the query has been executed by the source. Computed while we are traversing the rule-graph.
state:S_SUCCESS, S_NOMATCH or S_ERROR
State we are in to facilitate error handling.
errmsg
error message string as set by the function that found the error.

Algorithm used:

The matcher works by recursivly traversing the data structures as defined in pdp.h, and backtracking in case of dead ends.

Preprocess:

In order for the algorithms to work properly when predicates are present it is important, that all the predicates are at the end of the cond-list that represents the rule. This function reorders the cond-list-conds so that this condition holds. The reasons that this is necessary is that in order to check if a predicate can be evaluated we need to know all "GoodVars" (See above for def of GoodVars). However, we match one rule-cond after the other by trying to match it with all remaining query-conds. Now if e.g. the first rule-cond is a predicate it will never be matches successfully because we have not computed the GoodVars yet. If it is at the end of the condlist however, it will work.

...-functions:

The functions that do the actual matching.

m_pred...-functions:
Again, this is used to handle predicates. When traversing the pdp.h data structures we need to keep track of wether we are under a top-level predicate or under a top-level objectcondition. Depending on which one we are under different actions need to be taken during the matching process (e.g. under a objectcondition when we encounter two variables, we remember the mapping and if no other mapping exitst so far everything is fine. Under a predicate however, we do need an already existing mapping and need to do some other checks). As terms can be in both predicates and object-conds we need different versions for these, too.

isgood...-functions:

Used to check if all variables occuring in this part of the data structure are GoodVars. Applied to all preds in the query that cannot be matched to any preds in the template and therefore have to be evaulated later by the engine.

eval...-functions:

Evaluation functions for the actions. New term-functions for the actions can be added here.

addGoodVars...-function:

These functions add all variables occuring in their parameters to the GoodVars. Used to add all these vars in parts of the query tree that are under the part of the tree that corresponds to the tree under the template-head-var.

2.7. Testfile main.cc

In order to test my code I've written a test-program that read a template and a query, parses them, matches them, calls the optimizer and prints the results. This program is called main.cc, read the file TEMPLATE and the file QUERY. It can be used to test more templates and queries.

3. The Library

I noticed that the TSIMMIS project did not use a library with all the standard functions so far. As I needed classes like sets etc., I decided it would be worth to put these into a general library for reuse. The library code is contained in the subdirector L. The main header file is l.h. The library makes extensive use of C++ templates. The compiler used was + 2.7.0, so it should be portable in spite of the use of templates.

4. The Parser

During the coding phase of the wrapper some deficiencies of the parser became apparent. I was assigned the task to clear up the parser and do the necessary changed. The parser was originally written by Yannis, modified by Amit Agarwal to incorporate the actions.

Unfortunately due to a misunderstanding between Amit and Vassilis the code Amit wrote did not solve the actions in a satisfactory manner. After looking at the code I decided the best way to go was to start anew from Yannis' code.

4.1. "Site" Requirements

As the parser was originally written for the mediators it required every condition to have a site associated with it, e.g. A :- A:<book {<author $B>}>@s1 As the wrapper only runs at one site this is not necessary for the wrapper. I changed to parser so the site is only optional and not required any more. So A :- A:<book {<authro $B>} will parse without error.

4.2. Actions, Rulenames and Desirability

Actions were completely recoded. Newly introduced were names for the rules and a way to specify desirability, which is used by the optimizer. The syntax for template is now:

template ::= {rule}*

rule ::= [PreHead] Head :- Condition-List Actionlist.

PreHead ::= RuleName:Desirability

RuleName ::= Constant
Desirability ::= Constant

Head ::= as before

Actionslist ::= // NoAssAction //
             |  // AssAction { , AssAction }* //

NoAssAction ::= ActionBody

AssAction   ::=  $$ = ActionBody
             |   Placeholder = ActionBody

ActionBody  ::= ActionTerm

ActionTerm  ::= Func ( ActionTerm )
             |  ActionTerm { ^ ActionTerm }*
             |  Constant
             |  Placeholder
             |  $$

Constant    ::= "{AlphanumericalChar}*"

Placeholder ::= $UpercaseChar{AlphenumericalChar}*

Condition-List ::= similar to before, but everywhere where Variables
                   were allowed we also now allow Placeholders

Func is a function symbol. Each function symbol expects a certain
number of parameters. Currently the following function-symbols and 
associated functions are available:

uppercase(X) : converts X into uppercase letters

lowercase(X) : converts X into lowercase letters

equal(X,Y)   : if the caseinsensitive comparison of X and Y yields
               that they are the same, this function returns "TRUE",
               otherwise it returns "" (the empty string)

if(X,Y,Z)    : if X is not the empty string ("") then this
               function returns X otherwise it returns Z

cr()         : returns a "\n", i.e. a newline

quote(X)     : puts the string in quotes, i.e. returns "X"

4.3. New Interface and Error Handling

A new interface to the parser has been designed. The class Parser provides two functions for parsing: one to parse a file and one to parse a string. The code for this interface is in parse.cc and parse.h. The parser error handling has been changed from abort'ing to returning NULL as a parse string if a parse error occurs.

4.4. Class Tracer

The files trace.cc and trace.h contained C functions to print a parse tree into understandable form. These functions have been encapsulated into a "Tracer" class (files tracer.cc and tracer.h). Furthermore new functions have been written to enable printing of Placeholders and Actions.

4.5. How to Free the Memory of the Parse Trees

The easiest way to free the memory occupied by the parse trees is by calling the Free functions of the class Parser. They are overloaded functions that recursively free the whole parsetree starting at the point given. Typical code would look like this:

{

Parser parse;

struct rulelisttype *res;

res = parse.File("TEMPLATES");

...

parse.Free(res);

}

4.6. Changes to pdp.h

The pdp.h file has been changed in the following ways:

  1. all global variables have been moved to pdp.cc
  2. the ruletype has 3 new fields:
  3. changes to termtype:
  4. new struct actionlisttype
  5. new struct actiontype
  6. new struct actionoplisttype
  7. new struct actionoplistlisttype
  8. new struct actionoptype

Furthermore the file pdp.h now contains comments describing all structures

and which fields are valid for which kind values (at least to the best of

my knowledge). More comments would not hurt, though.

5. The Optimizer

The parser returns a list of rules that match. The optimizer has the job of determining which one will be the most efficient to execute. It currently uses the following heuristic: Let R be the set of all matching rules. It can be partitioned into two disjoint sets P and NP, where P contains all rules that have not enough predicates for the query, i.e. rules for which additional predicates have to be evaluated after getting the result from the source. NP contains all other matching rules. So R = P union NP, and P and NP are disjoint.

If NP it not empty then the optimizer picks as the best rule the one with the highest desirability on NP, otherwise the one with the highest desirability in P.

The interface to the optimizer is very simple:

const MatchResult *mr, *omr;

Optimizer o;

.. get mr from matcher...

omr = o.optimize(mr)

..

omr points to the best element in the list mr.

6. What File Goes Where in the TSIMMIS Directory

The following files need to be copied into the TSIMMIS directories:

~breunig/tsimmis/include/pdp.h
=> ~tsimmis/tsimmis-1.0/src/common/pdp.h

~breunig/tsimmis/parser/*
=> ~tsimmis/tsimmis-1.0/src/wrappers-96/parser/*

~breunig/tsimmis/matcher/*
=> ~tsimmis/tsimmis-1.0/src/wrappers-96/matcher/*

~breunig/tsimmis/matcher/L/*
=> ~tsimmis/tsimmis-1.0/src/wrappers-96/matcher/L/*