A Computational Theory of Ontological Reformulation
Computer Science Department
with contributions from Abhijeet Mohapatra
and other members of the Stanford Logic Group
Abstract: Ontological Reformulation is the process of changing the ontology used in encoding information about some portion of the world, i.e. changing the objects, functions, and relations used to describe that portion of the world. Different ontologies allow one to express different amounts of information, and different ontologies can lead to implementations with different computational characteristics. Hence, the choice of ontology is important in information processing, whether that processing is done by computers or by people. In this presentation, we formalize the notion of reformulation for ontologies encoded in the form of logic programs, we talk about how to compare the computational characteristics of ontologies, and we examine a few mechanical methods for finding such reformulations.
If you are like me, when you think about the world, you think in terms of objects and relationships among these objects. Objects include things like people and offices and buildings. Relationships include things like the parenthood, ancestry, office assignments, office locations, and so forth. The collection of objects and relations we use in thinking about the world is often called an ontology.
In building physical or computational models of the world, we usually take our ontology as a guide, using objects in our model to represent objects in the world and matching relations in our model to real-world relations. From the perspective of usability, this is good, as it allows us to update and access information in familiar terms.
However, from a computational point of view, our initial conceptualization of the world is not always the best. In many cases, we can gain computational benefit by representing things in different ways. Moreover, representing things in different ways for computational reasons sometimes leads to new and valuable insights to help us in thinking about the world.
To make this point concrete, let's consider an example from the world of genealogy - the world of kinship among people. The arcs in the tree shown below represent the parent-child relation among the people represented in the nodes of the tree. Art is the parent of Bob, Bob is the parent of Cal, and so forth.
Figure 1 - parent relation
The tree shown below is an alternative representation. In this case, the ancestry relation is fundamental. Art is the ancestor of Bob, Bob is the ancestor of Cal, Art is the ancestor of Cal, and so forth.
Figure 2 - ancestor relation
This is nonstandard, but there is nothing wrong with it. We can even define parent in terms of ancestor - one person is a parent of another if and only if he is an ancestor and there are no intermediate ancestors.
Given multiple ontologies, how do we decide among the alternatives? It is beyond my expertise to talk about what makes ontologies desirable for humans. And I am not sure anyone can say anything sensible about the philosophical issues. However, I can talk the issues involved in comparing conceptualizations in computational terms. These issues may have philosophical and/or psychological relevance, but in any case there is practical value in implementing useful computer systems.
Our genealogy example illustrates these issues. Suppose that we store the parent relation. This does not take up much space; and we can still compute the ancestor relation, although it takes some time to do so. Alternatively, we can store the ancestor relation. This representation takes more storage; but the computation of ancestry is faster - it is just a look-up.
In this case, there is a trade-off between space and time. However, things are not always so difficult. In some cases, we find that one representation is better in time without being worse in space or better in space without being worse in time.
As an example of this, consider a genealogical application in which our primary interest is to manage information about whether or not two people are related, i.e. whether they are in the same family, whether they have a common ancestor within recent history.
Now, we can compute this relation from either of the representations shown above. Alternatively, if we did not need to know who is the parent of whom, we could simply record the related relation directly with a graph like the one shown below.
Figure 3 - samefamily relation
The benefit of directly recording the information is speed of look up. To determine whether or not a person is related to another using the parenthood relation or the ancestor relation takes time that is at least quadratic in the number of links in the graph. By contrast, making this determination using the samefamily relation can be computed in time proportional to the log of the number of arcs in the graph. The downside of this representation is size. The number of arcs is the samefamily graph is the square of the number of nodes in the graph, whereas the other representations take linear space.
The good news is that there is an even better representation. Instead of recording parenthood or ancestry, we record the forefather relation. One person is a forefather of another if he is an ancestor but has no ancestor of his own in the database. The graph below shows the forefather relation corresponding to the data in Figures 1 and 2.
Figure 4 - forefather relation
Here is the interesting part. The forefather relation can be stored in the same amount of space as parent. Moreover, the computation cost for the samefamily relation is vastly reduced - computing samefamily using forefather takes logarithmic time while computing it from parent or ancestor takes multiplicative time! Clearly, the forefather ontology is superior for this purpose.
What is interesting about this example is that we are not always forced to make sacrifices to gain benefits. Here, we can improve time without using up any extra space. In some cases, we can decrease space without increasing computation cost. And in some cases we can improve both space and time.
This presentation is about finding reformulations like these. In what follows, we start with some basic definitions about ontologies. We then formalize the notion of ontological reformulation, we talk about how to compare ontologies, and we examine a few mechanical methods for reformulation.
While graphs, like those in my genealogy example, are an appealing way to visualize information, for our purposes here, it is more convenient to notate kinship in the form of sentences rather than arcs in a graph. The essentials are all the same in this sentential encoding as in the graphical encoding; only the visual appearance is different.
In what follows, we start with some basic definitions about ontologies. We then define the notion of reformulation, we talk about how to compare ontologies, and we examine a few mechanical methods for reformulation.
A vocabulary is a collection of symbols (representing objects in the world) together with a collection of predicates (representing relationships among these objects). Each predicate has an associated arity, i.e. the number of objects involved in any instance of the corresponding relation. For example, in our kinship example, we might use the symbols art, bob, bud, cal, cam, coe, and cory to represent people and the binary predicate p to represent the parent-child relationship between two people.
A datum is an expression formed from an n-ary predicate and n symbols. In this presentation, we write data in mathematical notation. For example, we write p(art,bob) to express the fact that art is the parent of bob. The Herbrand base for a vocabulary is the set of all data that can be formed from the symbols and predicates in the vocabulary. A dataset is any subset of the Herbrand base. Intuitively, we think of the items in a dataset as expressing what we believe to be true in the world; data that are not in the dataset are assumed to be false.
Given our kinship vocabulary, we can represent the parentage information in our kinship example using the dataset shown here.
3. Logic Programs
As mentioned in the introduction it is possible to define some relationships in terms of others. In this presentation, we encode such definitions in a logic language called Datalog. We could equally well use other declarative languages; however, Datalog is easy to well-studied; it is easy to understand, and it is sufficient for our purposes today.
In Datalog, there are three types of expressions - atoms, literals, and rules. Atoms are analogous to the data in datasets except that they can optionally contain variables in place of symbols. Literals are either atoms or negations of atoms. A rule is an expression consisting of a distinguished atom, called the head, and zero or more literals, together called the body. We write rules as in the example shown below. In this case, q(X,Y) is the head, and the other literals constitute the body. Words beginning with capital letters are variables, and those beginning with lower case letters are ordinary symbols.
A logic program is a finite set of atoms and rules as just defined. In order to simplify our definitions and analysis, we occasionally talk about infinite sets of rules. While these sets are useful, they are not themselves logic programs.
Unfortunately, the language of rules, as defined so far, allows for logic programs with some unpleasant properties. To avoid programs of this sort, it is common in deductive databases to add a couple of restrictions that together eliminate these problems.
The first restriction is safety. A rule in a logic program is safe if and only if every variable that appears in the head or in any negative literal in the body also appears in at least one positive literal in the body. A logic program is safe if and only if every rule in the program is safe.
All of the examples above are safe. By contrast, the two rules shown below are not safe. The first rule is not safe because the variable Z appears in the head but does not appear in any positive subgoal. The second rule is not safe because the variable Z appears in a negative subgoal but not in any positive subgoal.
To see why safety is matters in the case of the first rule, suppose we had a database in which p(a,b) is true. Then, the body of the first rule is satisfied if we let X be a and Y be b. In this case, we can conclude that every corresponding instance of the head is true. But what should we substitute for Z? Intuitively, we could put anything there; but there could be infinitely many possibilities. For example, we could write any number there. While this is conceptually okay, it is practically problematic.
To see why safety matters in the second rule, suppose we had a database with just two facts, viz. p(a,b) and q(b,c). In this case, if we let X be a and Y be b and Z be anything other than c, then both subgoals true, and we can conclude t(a,b). The main problem with this is that many people incorrectly interpret that negation as meaning there is no Z for which q(Y,Z) is true, whereas the correct reading is that q(Y,Z) needs to be false for just one binding of Z. As we will see in our examples below, there is a simple way of expressing this other meaning without writing unsafe rules.
In logic programming, these problems are avoided by requiring all rules to be safe. While this does restrict what one can say, the good news is that it is usually possible to ensure safety by adding additional subgoals to rules to ensure that the restrictions are satisfied.
The second restriction is called stratified negation. It is essential in order to avoid ambiguities. Unfortunately, it is a little more difficult to understand than safety.
The dependency graph for a logic program is a directed graph with two type of arcs, positive and negative. The nodes in the dependency graph for a program represent the relations in the program. There is a positive arc in the graph from one node to another if and only if the former node appears in a positive subgoal of a rule in which the latter node appears in the head. There is a negative arc from one node to another if and only if the former node appears in a negative subgoal of a rule in which the latter node appears in the head.
As an example, consider the following logic program. r(X,Y) is true if p(X,Y) and q(Y) are true. s(X,Y) is true if r(X,Y) is true and s(Y,X) is false.
A negation in a logic program is said to be stratified with respect to negation if and only if there is no negative arc in any cycle in the dependency graph. The logic program just shown is not stratified with respect to negation because there is a cycle involving a negative arc.
The problem with unstratified logic programs is that there is a potential ambiguity. As an example, consider the program above and assume we had a database containing p(a,b), p(b,a), q(a), and q(b). From these facts we can conclude r(a,b) and r(b,a) are both true. So far so good. But what can we say about s? If we take s(a,b) to be true and s(b,a) to be false, then the second rule is satisfied. If we take s(a,b) to be false and s(b,a) to be true, then the second rule is again satisfied. We can also take them both to be true. The upshot is that there is ambiguity about s. By concentrating exclusively on programs that are stratified with respect to negation, we avoid such ambiguities.
The immediate value of a rule on a given dataset D is the set of all rule heads obtained by consistently substituting symbols from our vocabulary for variables in such a way that all of the positive subgoals are all in D and none of the negative subgoals is in D. Note that, if we add the immediate values of a rule to a dataset, this can lead to additional values for that rule or other rules. The overall value of a rule or an entire ruleset on a given dataset is the fixpoint of this process. For safe rulesets with stratified negation, the overall value on a given dataset is unique and independent of the order in which the rules are processed.
As mentioned earlier, we can use rules to define concepts in terms of other concepts. For example, we can define the ancestor relation a in terms of the parent relation p as shown below.
Alternatively, we can define parent in terms of ancestor. Note that we need a helper relation in this case. One person is remotely related to another if and only if there is an intermediate ancestor. We then say that one person is a parent of another if and only is the first person is an ancestor of the second and is not remotely related.
Here is an example of defining multiple relationships in terms of multiple relationships. Here, we define the father relation f and the mother relation m in terms of parent p and male male. In this case, unlike the previous example, we cannot write the definitions in the other direction, since we do not know the gender of people who do not have children.
In our development thus far, we have assumed that the extension of an n-ary relation may be any set of n-tuples from the domain. This is rarely the case. Often, there are constraints that limit the set of possibilities. For example, a person cannot be his own parent. In some cases, constraints involve multiple relations. For example, all parents are adults; in other words, if an entity appears in the first column of the parent relation, it must also appear as an entry in the adult relation.
In many database texts, constraints are written in direct form - by writing rules that say, in effect, that if certain things are true in an extension, then other things must also be true. The inclusion dependency mentioned above is an example - if an entity appears in the first column of the parent relation, it must also appear as an entry in the adult relation.
In what follows, we use a slightly less direct approach - we encode limitations by writing rules that say when a database is not well-formed. We simply invent a new 0-ary relation, here called illegal, and define it to be true in any extension that does not satisfy our constraints.
This approach works particularly well for consistency constraints like the one stating that a person cannot be his own parent.
It also works well for mutual exclusion constraints like the one below, which states that a person cannot be in both the male and the female relations.
Using this technique, we can also write the inclusion dependency mentioned earlier. There is an error if an entity is in the first column of the parent relation and it does not occur in the adult relation.
Database management systems can use such constraints in a variety of ways. They can be used to optimize the processing of queries. They can also be used to check that updates do not lead to unacceptable extensions.
In what follows, we use the word ontology to refer a vocabulary together with a ruleset defining some or all of the predicates in the vocabulary. In most cases, the vocabulary is implicit in the ruleset; and, for our purposes here today, you can think of an ontology as being synonymous with the rules in the associated ruleset.
Reformulation is a relationship between ontologies. To be non-trivial, the ontologies must be different from each other. At the same time, to be useful, the ontologies must have something in common. We formalize these common properties using the notion of an application.
An application of an ontology has two parts - inputs and outputs. The inputs are relations in the ontology that are used in writing datasets that are supplied to the application by its users, and output relations are relations that users wish to retrieve. Input relations should be independent of each other, i.e. they should not be defined in terms of each other. The output relations must be either inputs or other relations that are defined entirely in terms of the inputs by the rules of the ontology.
Once again consider the ancestor-based ontology alluded to in the introduction. The rules are shown below. The family relation (here represented by sf for "same family") is defined in terms of a, and a is defined in terms of p. For simplicity in our presentation here, we assume exactly two generations of people are recorded. Multiple generations can be done in similar fashion or recursively, but the restriction to two generations makes the example simpler to follow.
There are various possible applications of this ontology. For example, we could designate p to be an input relation and sf to be an output relation. In this case, a is useful in defining sf in terms of p but is neither an input nor an output.
Another example. Here are the definitions for an ontology based on the forefather relation (here represented by ff). Again, we designate p to be an input relation and sf to be an output relation. In this case, ff is useful in defining sf in terms of p but is neither an input nor an output.
Now, using the notion of application, we can define the concept of reformulation. We say that one ontology is a reformulation of another with respect to a common application if and only if, for any dataset D in the input vocabulary of the application, the value of the output relations are found by applying the rules from the second ontology to D is the same as the value of the output relations found by applying the rules from the first ontology to D.
The same family example from the introduction illustrates. Given the rules shown above, it is easy to see that the the ontologies are equivalent. We expand the definitions of sf in terms of p. See below. (Note that the expansions here include new variables to avoid variable conflict.) In this case, the expansions are the same in both cases. (More often, the comparison requires a more complicated but still computable "containment" test.)
Note that, if our application were to include a, the second ontology would not be a reformulation of the first with respect to this application, since the second does not even contain a.
In order to assess the computational characteristics of an ontology in a given application, we need to say how the output relations are computed from the input relations. There are multiple possibilities here - the computation can be top-down or bottom-up, the evaluator might re-order subgoals or treat them in fixed order, the procedure might cache some or all of its intermediate results, it might index relations completely or partially or not at all, and so forth.
This multiplicity of choices complicates comparative analysis of ontologies. Just as the equivalence of ontologies is dependent on an external factor, viz. the application in question, so our analysis of the computational properties depends on an external factor as well, viz. our choice of evaluation procedure.
Much of the work that has been done on reformulation thus far assumes a particular evaluation procedure, viz. bottom-up computation using nested-loop joins with full indexing of input relations. This method is widely used in evaluating database queries and is amenable to computational analysis. Suppose, for example, we have an ontology with n symbols and a relation r defined as shown below. In order to compute r we would consider a worst case of n^2 possibilities for the p subgoal; and, for each of these, we would consider up to n values for q, leading to a worst-case cost of n^3. (Without indices, the worst-case cost would be n^4.)
As an example of comparative analysis, consider an ontology with the rules shown below; and let's assume we have an application with p, q, r, and s and inputs and goal as output.
Now let's consider a reformulation of this ontology. It is similar to the ontology. The only difference is that we define and use an intermediate relation t that is defined as the union of r and s.
The computational cost for the first ontology is 2*n. By contrast, the computational cost for the second ontology is just n. The reason for the difference is that we must compute p(X) & q(X) twice with the first ontology, whereas we need to do that only once in the second ontology. The improvement is only a factor of 2 in this case, but the factor could be larger if there were more cases being considered.
Note that, in some applications, bottom-up execution is not practical. This happens, for example, when the output relations are very large. In fact, if we switch from Datalog to Prolog, the relations can be infinite. In these cases, top-down evaluation of individual output tuples (rather than entire relations) is the only option.
The good news is that, through the services of the magic sets transformation, it is possible to convert such problems from top-down computation to bottom-up computation without incurring significant overhead. The magic sets transformation is a beautiful example of reformulation in its own right. Unfortunately, it is a little complicated for me to present in this brief overview, and so I will skip over it with just that one teasing remark. The upshot for us today is simply that we can assume that the analysis of bottom-up computation is by and large a good approximation to the analysis of top-down computation as well.
In the preceding section, we talked about comparing the computational characteristics of ontologies on the assumption that there is no persistent storage. On each step, we are given an input set and asked to compute the output set with no regard to previous or future uses of the input data. In practice, it is often the case that an input dataset is used to answer multiple queries. Said another way, the it is often the case that data changes slowly with respect to the queries. This latter scenario is common in information processing applications, particularly in database systems.
Obviously, in order to deal with such situations, we need to save the essential content of the input dataset for subsequent use. This is easy enough to do - we simply record and re-use the input dataset until a new dataset comes along. However, there are other possibilities. For example, instead of saving the input relations, we could compute and save the output relations. Alternatively, we could compute and save some of the intermediate relations. Storing such results is often called materialization, and the stored relations are often said to be materialized.
Over the years, various researchers have studied the question of which datasets to materialize. The decision is usually influenced by constraints on the sizes of the materialized relations and computational cost. For example, it is common to require that the materialized relations consume no more space than the input relations. (Ideally, we would prefer a materialization policy that requires *less* space.) And it is common to require that the cost of computing the materialized relations from the input relations plus the cost of computing the output relations from the materialized relations is less than or equal to the cost of computing the output relations directly from the input relations. (Ideally, we would prefer a materialization policy that supports more efficient conputation.)
Note that it is sometimes desirable to allow limited growth in one of these dimensions to gain significant or important benefit in another. For example, we might allow a factor of n more space to get n^2 improvement in performance or something like that. Alternatively, we might allow even more space growth to gain benefit in time in cases where time is critical. Or we might allow time growth in cases where space is limited.
Now comes the interesting part. This research on materialization tells us which relations in an ontology to materialize for a given application. However, it leaves open the question of whether it is possible to do even better by using a different ontology altogether.
Let's look at an example drawn from the world of genealogy. We start with data about parentage. Art is the parent of Bob, Bob is the parent of Carl, and so forth. From this, we define the notion of ancestor. One person is an ancestor of another if and only if the first person is a parent of the second or a parent of a parent, and so forth. Using ancestry, we define the family relation. Two people are in the same family if and only if they have a common ancestor. The graph shown below encodes some sample ancestry data. In this case, everyone is in the same family.
Figure 5 - Ancestor versus forefather
Now consider a different conceptualization of this world. Again, we start with parentage data. However, instead of defining ancestry, we define the forefather relation. One person is a forefather of another if he is an ancestor but has no ancestor of his own in the database. Two people are in the same family if and only if they have a common forefather.
By contrast, the forefather ontology offers excellent opportunities for materialization. The forefather relation can be stored in the same amount of space as parent. Moreover, the computation cost is vastly reduced - computing sf using ff takes constant time while computing from p or a takes multiplicative time!
Here is the interesting part. The forefather relation can be stored in the same amount of space as parent. Moreover, the computation cost for the family relation is vastly reduced - computing family using forefather takes constant time while computing it from parent or ancestor takes multiplicative time! Clearly, the forefather ontology is superior for this purpose.
What is interesting about this example is that we are not always forced to make sacrifices to gain benefits. Here, we can improve time without using up any extra space. In some cases, we can decrease space without increasing computation cost. And in some cases we can improve both space and time.
The same family application is an example where changing ontologies helps. Looking at the ancestry ontology, we see that we cannot store any relations other than the inputs if we are permitted no more than linear growth in space. In general, the ancestor relation is multiplicatively large the parent relation, and the same family relation is larger still.
Okay, given this new setting, let's consider the problem of reformulation more generally. For a given ontology and application, how many equivalent reformulations are there? For a given ontology and application, how many reformulations are there that dominate the given ontology with respect to materialization? Can we effectively generate these reformulations?
Let's start by looking at the case of ontologies where relations are defined as simple conjunctions. Rada Chirkova studied this particular case in her PhD thesis. Hers is the most interesting work in this area, and I want to talk about a couple of her results.
Now let's generalize this. We say that a rule defining a relation in terms of a given set of relations is bounded with respect to an equivalent rule if and only the body of every rule used in defining the relations in the subgoals of the new rules is a refinement of a subset of the subgoals used in the original rule.
Theorem: For every rule r' that is equivalent to r, there is a rule r'' that is bounded with respect to r and that dominates r'.
In simple terms, this means that, in looking for new relations to use in defining a relation, we need to consider only definitions whose bodies are subsets of the literals in the original definition. There are exponentially many possibilities to try (in terms of the size of the original definition), but the number we need to look at is still finite. It is a powerful result.
Although Rada's initial work focussed on conjunctive views, her results do not stop there. She has also studied reformulation in the presence of views with more complex definitions (with negations and disjunctions and quantifiers and aggregates).
6. Domain Reformulation
The reformulation techniques we have just seen are examples of relational reformulation, i.e. they produce changes in the relations defined in an ontology. There are also results on changing the domains of ontologies, i.e. the constituent objects.
One common reformulation of this sort is reification, i.e. making relations into objects, which allows us to assert properties of those relations and quantify over those relations.
There is also the reverse of reification, viz. relationalization. Combining relationalization and reification is a common way to change from one ontology to another.
Indistinguishability abstraction is another form of object reformulation. If several objects mentioned in a set of rules satisfy all of the same conditions, under appropriate circumstances, it is possible to abstract the objects to a single object that does not distinguish the identities of the individuals. This can decrease the cost of processing queries by avoiding redundant computation in which the only difference is the identities of these objects.
The classic missionaries and cannibals problem is an example. Three missionaries and three cannibals come to a river and want to get across. There is a boat available but it holds only two people. Moreover, if at any point there are more cannibals than missionaries on either side of the river, the cannibals there will eat the missionaries. The problem is to come up with a schedule of crossings that gets everyone to the other side of the river without anyone being eaten.
If we treat the missionaries as distinct from each other and the cannibals as distinct form each other, we end search a space that is larger than it needs to be since the identity of individuals is not important. It does not matter which missionaries are in the boat with which cannibals, so long as the number of missionaries is greater than the number of cannibals. Using Indistinguishability Abstraction, we can simplify the missionaries and cannibals in this way and thereby achieve significant savings.
I am something of a dabbler in the world of databases, but I find this work to be quite cool. If you are not a database person, it may not appear particularly relevant to you. However, the ideas here are interesting beyond the world of traditional databases. Many applications that one would not normally think of as database applications can nevertheless be thought of as database applications and thereby benefit from these techniques and results. Examples include constraint satisfaction problems, planning, and general game playing.
The upshot is that the theory of reformulation and reformulation techniques may be applicable to a broad class of applications, not just database applications. Once an application is mapped into this framework, we can do analysis on the data, perhaps reformulate, and map the results back to the ordinary procedures, hopefully with improved data structures.
Whether we are casting problems as database applications or simulating in accordance with the logic programming hypothesis, many of us who are working on reformulation are yearning for unified theory of reformulation to ensure applicability of the results in a wide variety of problem areas. As physical scientists use calculus on a side variety of problems, so too should computer scientists consider the use of logic and reformulation in their work and not be forced to rediscover the work of researchers working in other areas.
Okay, I have now come to the end of my remarks. We have covered quite a bit of ground here. Let me summarize for you. main points: (1) Ontologies and reformulation are well-defined notions. (2) There has been interesting progress on analysis and automation of reformulation. (3) The framework discussed can inform reformulation in other areas and may ultimately serve as the basis for a unified theory of reformulation. (4) It is a part of what is needed to build practical General Systems.