Chapter 3 - Queries

In Chapter 2, we saw how to represent the state of an application area as a dataset. If a dataset is large, it can be difficult to answer questions based on that dataset. In this chapter, we look at various ways of querying a dataset to find just the information that we need.

The simplest form of query is a true-or-false question. Given a factoid and a dataset, we might want to know whether or not the factoid is true in that dataset. For example, we might want to know whether a person Art is the parent of Bob. Answering an atomic true-or-false question is simply a matter of checking whether the given factoid is a member of the dataset.

A more interesting form of query is a fill-in-the-blanks question. Given a factoid with "blanks", we might want values that, when substituted for the blanks make the query true. For example, we might want to look up the children of Art or the parents of Bill or pairs of parents and children.

An even more interesting form of query is a compound question. We might want values for which a Boolean combination of conditions is true. For example, we might want whether Art is the parent of Bob or the parent of Bud. Or we might want to find all people who have sons and who have no daughters.

We begin this chapter by looking at an extension of our dataset language that allows us to express such questions. In the next section, we define the syntax of our language; and, in the section thereafter, we define its semantics. We then look at some examples of using this language to query datasets. With that introduction behind us, we look at an important syntactic restriction, called safety. And, finally, we finish by discussing useful predefined concepts (e.g. arithmetic operators) that increase the power of our query language.

3.2 Query Syntax

The language of queries includes the language of datasets but provides some additional features that make it more expressive, viz. variables and query rules. Variables allow us to write fill-in-the-blanks queries. Query rules allow us to express compound queries, notably negations (to say that a condition is false), conjunctions (to say that several conditions are all true), and disjunctions (to say that at least one of several conditions is true).

In our query language, a variable is either a lone underscore or a string of letters, digits, and underscores beginning with an upper case letter. For example, _, X23, X_23, and Somebody are all variables.

An atomic sentence, or atom, is analogous to a factoid in a dataset except that the arguments may include variables as well as symbols. For example, if p is a binary predicate and a is a symbol and Y is a variable, then p(a,Y)is an atomic sentence.

A literal is either an atom or a negation of an atom. A simple atom is called a positive literal, The negation of an atom is called a negative literal. In what follows, we write negative literals using the negation sign ~. For example, if p(a,b) is an atom, then ~p(a,b) denotes the negation of this atom. Both are literals.

A query rule is an expression consisting of a distinguished atom, called the head and a collection of zero or more literals, called the body. The literals in the body are called subgoals. The predicate in the head of a query rule must be a new predicate (i.e. not one in the vocabulary of our dataset), and all of the predicates in the body must be dataset predicates.

In what follows, we write rules as in the example shown below. Here, goal(a,b) is the head; p(a,b) & ~q(b) is the body; and p(a,b) and ~q(b) are subgoals.

As we shall see in the next section, a query rule is something like a reverse implication - it is a statement that the head of the rule (i.e. the overall goal) is true whenever the subgoals are true. For example, the rule above states that goal(a,b) is true if p(a,b) is true and q(b) is not true.

The expressive power of query rules is greatly enhanced through the use of variables. Consider, for example, the rule shown below. This is a more general version of the rule shown above. Instead of applying to just the specific objects a and b it applies to all objects. In this case, the rule states that goal is true of any object X and any object Y if p is true of X and Y and q is not true of Y.

A query is a non-empty, finite set of query rules. Typically, a query consists of just one rule. In fact, most Logic Programming systems do not support queries with multiple rules (at least not directly). However, queries with multiple rules are sometimes useful and do not add any major complexity, so in what follows we allow for the possibility of queries with multiple rules.

3.3 Query Semantics

An instance of an expression (atom, literal, or rule) is one in which all variables have been consistently replaced by ground terms (i.e. terms without variables). For example, if we have a language with symbols a and b, then the instances of goal(X,Y) :- p(X,Y) & ~q(Y) are shown below.

Given this notion, we can define the result of the application of a single rule to a dataset. Given a rule r and a dataset Δ, we define v(r,Δ) to be the set of all ψ such that (1) ψ is the head of an arbitrary instance of r, (2) every positive subgoal in the instance is a member of Δ, and (3) no negative subgoal in the instance is a member of Δ.

The extension of a query is the set of all facts that can be "deduced" on the basis of the rules in the program, i.e. it is the union of v(r_i,Δ) for each r_i in our query.

To illustrate these definitions, consider a dataset describing a small directed graph. In the sentences below, we use symbols to designate the nodes of the graph, and we use the p relation to designate the arcs of the graph.

Now suppose we were given the following query. Here, the predicate goal is defined to be true of every node that has an outgoing arc to another node and also an incoming arc from that node.

Since there are two variables here and three symbols, there are nine instances of this rule, viz. the ones shown below.

The body in the first of these instances is not satisfied. In fact, the body is true only in the sixth and eighth instances. Consequently, the extension of this query contains just the two atoms shown below.

The definition of semantics in terms of rule instances is simple and clear. However, Logic Programming systems typically do not implement query processing in this way. There are more efficient ways of computing such extensions. In subsequent chapters, we look at some algorithms of this sort.

3.4 Safety

A query rule is safe if and only if every variable that appears in the head or in any negative literal in the body also appears in at least one positive literal in the body.

The rule shown below is safe. Every variable in the head and every variable in the negative subgoal appears in a positive subgoal in the body. Note that it is okay for the body to contain variables that do not appear in the head.

By contrast, the two rules shown below are not safe. The first rule is not safe because the variable Z appears in the head but does not appear in any positive subgoal. The second rule is not safe because the variable Z appears in a negative subgoal but not in any positive subgoal.

To see why safety matters in the case of the first rule, suppose we had a database in which p(a,b) is true. Then, the body of the first rule is satisfied if we let X be a and Y be b. In this case, we can conclude that every corresponding instance of the head is true. But what should we substitute for Z? Intuitively, we could put anything there; but there could be many possibilities. While this is conceptually okay, it is practically problematic.

To see why safety matters in the second rule, suppose we had a database with just two facts, viz. p(a,b) and q(b,c). In this case, if we let X be a and Y be b and Z be anything other than c, then both subgoals are true, and we can conclude goal(a,b,a).

The main problem with this is that many people incorrectly interpret that negation as meaning there is no Z for which q(Y,Z) is true, whereas the correct reading is that q(Y,Z) needs to be false for just one value of Z. As we will see, there are various ways of expressing this second meaning without writing unsafe queries.

3.5 Predefined Concepts

In practical logic programming languages, it is common to predefine useful concepts. These typically include arithmetic functions (such as plus, times, max, min), string functions (such as concatenation), equality and inequality, aggregates (such as countofall), and so forth.

In Epilog, equality and inequality are expressed using the relations same and distinct. The sentence same(σ,τ) is true iff σ and τ are identical. The sentence distinct(σ,τ) is true if and only if σ and τ are different.

The evaluate relation is used to represent equations involving predefined functions. For example, we would write evaluate(plus(times(3,3),times(2,3),1),16) to represent the equation 3^2+2x3+1=16. If height is a binary predicate relating a figure and its height and if width is a binary predicate relating a figure and its width, we can define the area of the object as shown below. The area of X is A if the height of X is H and the width of X is W and A is the result of multiplying H and W.

In logic programming languages that provide such predefined concepts, there are usually syntactic restrictions on their use. For example, if a query contains a subgoal with a comparison relation (such as same and distinct), then every variable that occurs in that subgoal must occur in at least one positive literal in the body and that occurrence must precede the subgoal with the comparison relation. If a query uses evaluate in a subgoal, then any variable that occurs in the first argument of that subgoal must occur in at least one positive literal in the body and that occurrence must precede the subgoal with the arithmetic relation. Details are typically found in the documentation of systems that supply such built-in concepts

In practical logic programming languages, it is also common to include predefined aggregate operators, such as setofall and countofall.

Aggregate operators are typically represented as relations with special syntax. For example the following rule uses the countofall operator to request the number of a person's children. N is the number of children of X if and only if N is the count of all Y such that X is the parent of Y.

As with special relations, there are syntactic restrictions on their use. In particular, aggregate subgoals must be safe in that all variables in the second argument must be included in the first argument or must be used within positive subgoals of the rule containing the aggregate.

Exercises

Exercise 3.1: For each of the following strings, say whether it is a syntactically legal query.

	(a)	`goal(X) :- p(a,f(f(X)))`
	(b)	`goal(X,Y) :- p(X,Y) & ~p(Y,X)`
	(c)	`~goal(X,Y) :- p(X,Y) & p(Y,X)`
	(d)	`goal(P,Y) :- P(a,Y)`
	(e)	`goal(X) :- p(X,b) & p(X,p(b,c))`

Exercise 3.2: Say whether each of the following queries is safe.

	(a)	`goal(X,Y) :- p(X,Y) & p(Y,X)`
	(b)	`goal(X,Y) :- p(X,Y) & p(Y,Z)`
	(c)	`goal(X,Y) :- p(X,X) & p(X,Z)`
	(d)	`goal(X,Y) :- p(X,Y) & ~p(Y,Z)`
	(e)	`goal(X,Y) :- p(X,Y) & ~p(Y,Y)`

Exercise 3.3: What is the result of evaluating the query goal(X,Z) :- p(X,Y) & p(Y,Z) on the dataset shown below.

p(a,b)

p(b,c)

p(b,d)

p(c,d)

3.1 Introduction

3.2 Query Syntax

3.3 Query Semantics

3.4 Safety

3.5 Predefined Concepts

Exercises