C H A P T E R  2
Basic Concepts

2.1 RELATIONAL DATABASES

The fundamental building blocks of databases are entities and relations. Entities represent objects presumed or hypothesized to exist in the application area of the database. Relations represent properties of those objects or relationships among them.

In our examples here, we refer to entities and relations using strings of letters, digits, and a few non-alphanumeric characters (e.g. "_"). For reasons described below, we prohibit strings beginning with upper case letters; all other combinations are acceptable. Examples include a, b, 123, comp225, and helen_heavenly.

The set of all entities that can be used in a database is called the domain of the database. Database domains can be finite or infinite. Most (but not all) database domains include strings and numbers as subsets, and hence they are infinite.

The set of all relations in a database is called the signature of the database. Signatures are always finite.

The arity of a relation is the number of objects involved in any instance of that relation. Consider, for example, the teaching schedule in a university database. Every instance of this relation involves two objects, viz. a faculty member and a course that that faculty member teaches; therefore, it has arity 2. Arity is an inherent property of a relation and never changes.

A database schema consists of a domain, a signature, and an assignment of arities for each of the relations in the signature. Our definition here departs slightly from that used in many database texts. Usually, a schema does not include a restriction on the entities used in relations, whereas we have added in a domain of possible entities here. This addition simplifies our definitions but does not change any fundamental results.

In the relational data model, the extension of an n-ary relation in a database is a set of n-tuples from the domain of the database. The cardinality of a relation in an extension is the number of tuples of objects that satisfy the relation in a particular state. Unlike arity, the cardinality of a relation can change from one extension to another.

In the database literature, it is common to present the extension of a relation as a 2-dimensional table. The number of columns in the table corresponds to the arity of the relation and the number of rows corresponds to the cardinality of the extension.

For example, the extension of the teaches relation is a set of pairs of faculty members and courses, one pair for each faculty member and each course that that faculty member teaches. If we consider a state of the world in which there are 6 pairs of faculty members and courses that they teach, we can visualize this extension as a table with 2 columns and 6 rows, as shown below. Here, the extension has arity 2 and cardinality 6.

teaches
facultycourse
donna_daringarch101
bill_boringcomp101
cathy_carefulcomp225
gary_grumpcomp235
oren_overbearingcomp257
helen_heavenlycomp310

Figure 2.1 - A Binary Relation

The student relation is a property of a person in and of itself, not with respect to other people or other objects. Since there is just one object involved in any instance of the relation, the table has just 1 column. By contrast, there are 6 rows, one row for each student. In this case, the extension has arity 1 and cardinality 6.

student
aaron_aardvark
belinda_bat
calvin_carp
george_giraffe
kitty_kat
sally_squirrel

Figure 2.2 - A Unary Relation

The gradesheet relation is a relation among students and courses and grades; and so the table requires three columns, as shown below. In this case, we have an extension with arity 3 and cardinality 4.

gradesheet
studentcoursegrade
aaron_aardvarkarch101a
aaron_aardvarkcomp101a
calvin_carparch101b
sally_squirrelcomp101a

Figure 2.3 - A Ternary Relation

This tabular representation for relations makes clear that, for a finite domain, there is an upper bound on the number of possible extensions for an n-ary relation. In particular, for a universe of discourse of size b, there are bn distinct n-tuples. Every extension of an n-ary relation is a subset of these bn tuples. Therefore, the extension of an n-ary relation must be one of at most 2bn possible sets.

2.2 SENTENTIAL DATABASES

While the relational data model is an appealing way to think about data, the sentential data model is more useful for our purposes here. Everything is the same as in the relational data model up to the definition of an extension. In the sentential data model, we encode each instance of a relation in the form of a sentence consisting of the relation and the entities involved in the instance, and we define a database extension as a set of such sentences.

More precisely, given a database, we define a datum to be a structure consisting of an n-ary relation from the signature and n objects from the domain. On occasion, we call a datum a proposition. In what follows, we write data using traditional mathematical notation. For example, if r is a binary relation and a and b are entities, then r(a,b) is a datum / proposition.

Consider the data shown in Figure 2.1. We can encode this in sentential representation as shown below. There is one sentence for each row in the original table. Each sentence here has the same relation. The relation is included so that we can combine these sentences with sentences representing other relations.

teaches(donna_daring,arch101)
teaches(bill_boring,comp101)
teaches(cathy_careful,comp225)
teaches(gary_grump,comp235)
teaches(oren_overbearing,comp257)
teaches(helen_heavenly,comp310)

The sentences corresponding to the data in Figure 2.2 are shown below. student is a unary relation, and so the sentences in this case all have one argument.

student(aaron_aardvark)
student(belinda_bat)
student(calvin_carp)
student(george_giraffe)
student(kitty_kat)
student(sally_squirrel)

Finally, we have the sentences for the data in Figure 2.3. Here, each sentence has 3 arguments, since the relation gradesheet has arity 3.

gradesheet(aaron_aardvark,arch101,a)
gradesheet(aaron_aardvark,comp101,a)
gradesheet(calvin_carp,arch101,b)
gradesheet(sally_squirrel,comp101,a)

Since there are no column headings in such sentences in this presentation, as there are in tables, the order of arguments is important. Given the intended meaning of the gradesheet relation, for example, the first argument denotes the student, the second the course, and the third the grade.

The propositional base for a database schema is the set of all propositions that can be formed from the relations and the entities in the database schema. For a schema with entities a and b and relations p and q where p has arity 1 and q has arity 2, the propositional base is {p(a), p(b), q(a,a), q(a,b), q(b,a), q(b,b)}.

A sentential database is a database in which the state is expressed in sentential representation. An extension of a sentential database is a finite subset of its propositional base.

2.3 DATALOG PROGRAMS

Demonstration

Datalog programs are built up from three disjoint classes of components, viz. relations, entities, and variables. In our examples here, we denote entities and relations as described above; and we write variables as strings of letters, digits, and special characters beginning with an upper case letter, e.g. X, Y, Z, Age, F1, F2, and so forth. A term is either an entity or a variable.

An atom is an expression formed from an n-ary relation and n terms. As with database sentences, we write Datalog atoms in traditional mathematical notation - the relation constant followed by its arguments enclosed in parentheses and separated by commas. For example, if r is a binary relation, if a and b are entities, and if Y is a variable, then r(a,b) and r(b,a) are atoms, as before, but so are r(a,Y) and r(Y,a) and r(Y,Y).

A literal is either an atom or a negation of an atom. An atom is called a positive literal. The negation of an atom is called a negative literal. In what follows, we write negative literals using the negation sign ~. For example, if r(a,b) is an atom, then ~r(a,b) denotes the negation of this atom.

A rule is an expression consisting of a distinguished atom, called the head, and 0 or more literals, together called the body. In what follows, we write rules as in the example shown below. Here, q(X,Y) is the head, and the other literals constitute the body.

q(a,Y) :- p(b,Y) & ~r(Y,d)

A Datalog program is a finite set of atoms and rules of this form. In order to simpify our definitions and analysis, we occasionally talk about infinite sets of rules. While these sets are useful for purposes of analysis, they are not themselves Datalog programs.

An expression in Datalog is said to be ground if and only if it contains no variables.

A rule in a Datalog program is safe if and only if every variable that appears in the head or in any negative literal in the body also appears in at least one positive literal in the body. A Datalog program is safe if and only if every rule in the program is safe.

The dependency graph for a Datalog program is a directed graph in which the nodes are the relations in the program and in which there is an arc from one node to another if and only if the former node appears in the body of a rule in which the latter node appears in the head. A program is recursive if and only if there is a cycle in the dependency graph.

A negation in a Datalog program is said to be stratified if and only if there is no recursive cycle in the dependency graph involving the negation. A Datalog program is stratified if and only if there are no unstratified negations.

In this book, we concentrate exclusively on Datalog programs that are both safe and stratified. While it is possible to extend the results here to other programs, such extensions are beyond the scope of this work.

The propositional base for a Datalog program is the set of all atoms that can be formed from the constants in the program's schema. Said another way, it is the set of all sentences of the form r(t1,...,tn), where r is an n-ary relation constant and t1, ... , tn are object constants.

An instance of a rule in a Datalog program is a rule in which all variables have been consistently replaced by terms from the program's domain. Consistent replacement means that, if one occurrence of a variable is replaced by a given term, then all occurrences of that variable are replaced by the same term.

An interpretation for a Datalog program is an arbitrary subset of the propositional base for the program. A model of a Datalog program is an interpretation that satisfies the program (as defined below).

An interpretation D satisfies a Datalog program P if and only if D satisfies every ground instance of every sentence in P. The notion of satisfaction is defined recursively. An interpretation D satisfies a ground atom p if and only if p is in D. D satisfies a ground negation ~p if and only if p is not in D. D satisfies a ground rule p :- p1 & ... & pn if and only if D satisfies p whenever it satisfies p1, ... , pn.

In general, a Datalog program can have more than one model, which means that there can be more than one way to satisfy the rules in the program. In order to eliminate ambiguity, we adopt the minimal model approach to Datalog program semantics, i.e. we define the meaning of a safe and stratified Datalog program to be its minimal model.

A model D of a Datalog program P is minimal if and only if no proper subset of D is a model for P. A Datalog program that does not contain any negations has one and only one minimal model. A Datalog program with negation may have more than one minimal model; however, if the program is stratified, then once again there is only one minimal model.

2.4 OPEN DATALOG PROGRAMS

Datalog programs as just defined are closed in that they fix the meaning of all relations in the program. In open Datalog programs, some of the relations (the inputs) are undefined, and other relations (the outputs) are defined in terms of these. The same program can be used with different input relations, yielding different output relations in each case.

Formally, an open program is a Datalog program together with a partition of the relation constants into two types - base relations (also called input relations) and view relations (also called output relations). View relations can appear anywhere in the program, but base relations can appear only in the subgoals of rules, not in their heads.

The input base for an open Datalog program is the set of all atoms that can be formed from the base relations of the program and the entities in the program's domain. An input model is an arbitrary subset of its input base.

The output base for an open Datalog program is the set of all atoms that can be formed from the view relations of the program and the entities in the program's domain. An output model is an arbitrary subset of its output base.

Given an open Datalog program P and an input model D, we define the overall model corresponding to D to be the minimal model of PD. The output model corresponding to D is the intersection of the overall model with the program's output base.

Finally, we define the meaning of an open Datalog program to be a function that maps each input model for the program into the corresponding output model.

As an example, consider the simple open Datalog program shown below. The universe of discourse is {a,b,c}, p and q are base relations; and r is a view relation.

r(X,Y) :- p(X,Y) & ~q(Y)

In this case, there are 212 input models, and for each there is a unique output model. The table below shows the output model for two of the possible input models.

Input ModelOutput Model
p(a,c)
q(b)
r(a,c)
p(a,c)
p(b,c)
r(a,c)
r(b,c)

2.5 DATABASE QUERIES

A query for a database D is an open Datalog program with the following restrictions. (1) The program has a distinguished output relation (which, by convention, we call query). (2) The input relations of the program are all relations in D's signature and have the same arity as in D's schema.

The value of a query Q on a database instance D (written Q(D)) is the set of all atoms in the minimal model of QD that mention the query relation.

Query Q1 is contained in query Q2 (written Q1Q2) if and only if, for every database instance D, Q1(D)⊆Q2(D). Query Q1 is id='query equivalence'>equivalent to Q2 if and only if Q1Q2 and Q2Q1.

The problem of determining whether a Datalog program Q is contained in another Datalog program Q' is, in general, undecidable [Shm87].

The problem becomes decidable if Q contains only positive literals and is not recursive. The following is a decision procedure for this case [RSUV89]. First, replace all variables in Q by distinct constants. Consider the database D that consists of the literals in the bodies of all of these instantiated rules. D is called the canonical database for Q. Evaluate Q' on D. Q is contained in Q' if and only if the instantiated heads of all rules in Q are contained in Q'(D).

The expansion of a relation r defined in a program P (written r) is the set of rules obtained by substituting the definitions in P for the relations in the bodies of all rules in P with r in the head. Variables occurring in the bodies of definitions that do not appear in the head are replaced by new variables in the expansion.

2.6 DATABASE CONSTRAINTS

In our development thus far, we have assumed that the extension of an n-ary relation may be any set of n-tuples from the domain. This is rarely the case. Often, there are constraints that limit the set of possibilities. For example, a course cannot be its own prerequisite. In some cases, constraints involve multiple relations. For example, only students should appear in the first column of the grade relation; in other words, if an entity appears in the first column of the grade relation, it must also appear as an entry in the student relation.

Database management systems can use such constraints in a variety of ways. They can be used to optimize the processing of queries. They can also be used to check that updates do not lead to unacceptable extensions.

In many database texts, constraints are written in direct form - by writing rules that say, in effect, that if certain things are true in an extension, then other things must also be true. The inclusion dependency mentioned above is an example - if an entity appears in the first column of the grade relation, it must also appear as an entry in the student relation.

In what follows, we use a slightly less direct approach - we encode limitations by writing rules that say when a database is not well-formed. For example, if an entity appears in the first column of the teaches relation and does not appear in the faculty relation, then the extension is not well-formed.

One merit of this approach is that we can use Datalog to write such rules. We simply invent a new 0-ary relation, here called error, and define it to be true in any extension that does not satisfy our constraints.

This approach works particularly well for consistency constraints like the one stating that a course cannot be its own prerequisite.

error :- prerequisite(X,X).

Using this technique, we can write the constraint on teaches as shown below. There is an error if an entity is in the first column of the grade relation and it does not occur in the student relation.

error :- grade(X,Y,Z), ~student(X).

And we can even write the existential (tuple-generating) constraint that every faculty member teaches at least one course, though in this case we must define a helper relation teacher.

error :- faculty(X), -teacher(X)
teacher(X) :- teaches(X,Y)

In fact, using this approach, we can define constraints of all sorts. In order for the direct approach to work, it is necessary to introduce explicit quantifiers into rules. While this presents no theoretical difficulty, it means that we need to extend all of our results and methods to handle such constructs, whereas, using the Datalog programming approach, we need nothing more that the theorems and algorithms described in the preceding sections.

2.7 PARTIAL DATALOG PROGRAMS

The rules in Datalog programs provide complete definitions of view relations in terms of base relations. In data integration, complete definitions are not always possible. On the other hand, we may know something about the extension of the view relation and it is desirable to capture this partial information.

In order to express what we do know about views, we can extend Datalog in various ways. In what follows, we consider two particular extensions - the introduction of functional terms and the introduction of disjunctions.

Consider, for example, the relation of parenthood and the relation of grandparenthood. While it is possible to define grandparenthood in terms of parenthood, it is not possible to define parenthood in terms of grandparenthood. On the other hand, if we were given an instance of the grandparenthood relation, we could say something about the parenthood relation. For example, if a person a is a grandparent of person c, we know that there is a person b such that a is the parent of b and b is the parent of a; we just do not know the identity of b.

Consider the kinship relations father and mother and parent. While it is possible to define parent in terms of father and mother, it is not possible to define father or mother in terms of parent. On the other hand, if we know that the parent relation holds of person a and person b, then we can know that either a is the father of b or a is the mother of b.

2.8 FUNCTIONAL DATALOG PROGRAMS

The main difference between Functional Datalog and Basic Datalog is the availability of functional terms to refer to entities in the domain of a database.

A functional term is an expression consisting of an n-ary function and n terms. In what follows, we write functional terms in traditional mathematical notation - the function followed by its arguments enclosed in parentheses and separated by commas. For example, if f is a function constant, if a is an object constant, and if Y is a variable, then f(a,Y) is a term. Functional terms can be nested within other functional terms. For example, if f(a,Y) is a functional term, then so is f(f(a,Y),Y).

In Functional Datalog, the enhanced domain for a program is the set of all entities and all functional terms that can be formed from the functions and entities in the program. Note that, in the presence of functions, the Herbrand universe is infinite. The propositional base is the set of all propositions that can be formed from the relations in the program and the objects in the enhanced domain.

An interpretation for a Functional Datalog program is an arbitrary subset of the propositional base for the program. A model of a Datalog program is an interpretation that satisfies the program (as defined above).

Note that, unlike Basic Datalog programs, Functional Datalog programs are not guaranteed to terminate since they can generate terms with arbitrarily nested functional terms. As an example, consider the following logic program.

r(X) :- p(X)
r(f(X)) :- r(X)

The result in this case contains infinitely many terms. If the extension of the p relation contains the constant a, the result includes a, f(a), f(f(a)), etc.

Fortunately, in many cases, termination can be assured, e.g. when the programs are not recursive or when all recursions are bounded in one way or another.

2.9 DISJUNCTIVE DATALOG PROGRAMS

A disjunctive Horn rule is a sentence of the following form, where every qi is an atom and every pi is a literal.

q1 | ... | qn :- p1 & ... & pm

Every variable in the head of a rule must also occur in the body of the rule. A Horn rule is a disjunctive Horn rule where the head consists of just one disjunct. A disjunctive program is a set of function-free rules.

A conjunctive program is a single non-recursive function-free Horn rule. A positive program is a set of conjunctive queries with the same relation in the head.

2.10 ENHANCED DATALOG PROGRAMS

On occasion, it is useful to write programs utilizing well-known relationships, such as arithmetic functions (such as +, *, -, /), string functions (such as concatenation), comparison operators (such as < and >), and equality (=). The definitions for most of these relations would require tables of infinite size. For this reason, we usually include these relations as part of the language itself, leading to Enhanced Datalog, Enhanced Functional Datalog, or Enhanced Disjunctive Datalog.

If the bodies of rules are allowed to contain these predefined relations, there are syntactic restrictions. For example, if rules contain literals with comparison operators, then every variable that occurs in such literals must appear in at least once in a positive literal in the body.

Even with this restriction, the presence of predefined relations can significantly affect the computational properties of the resulting programs. In the following chapters, we talk about these effects as we use various enhancements.