Thinking Different: A Computational Theory of Ontological Reformulation

Michael Genesereth
Computer Science Department, Stanford University
with contributions from Abhijeet Mohapatra and other members of the Stanford Logic Group

Abstract: Ontological Reformulation is the process of changing the ontology used in encoding information about some portion of the world, i.e. changing the objects, functions, and relations used to describe that portion of the world. Different ontologies allow one to express different amounts of information, and different ontologies can lead to implementations with different computational characteristics. Hence, the choice of ontology is important in information processing, whether that processing is done by computers or by people. In this presentation, we formalize the notion of reformulation for ontologies encoded in the form of logic programs, we talk about how to compare the computational characteristics of ontologies, and we examine a few mechanical methods for finding such reformulations.

## 1. Introduction

If you are like me, when you think about the world, you think in terms of objects and relationships among these objects. Objects include things like people and offices and buildings. Relationships include things like parenthood, ancestry, office assignments, office locations, and so forth. The collection of objects and relations we use in thinking about the world is often called an ontology. In building physical or computational models of the world, we usually take our ontology as a guide, using objects in our model to represent objects in the world and matching relations in our model to real-world relations. From the perspective of usability, this is good, as it allows us to update and access information in familiar terms. However, from a computational point of view, our initial conceptualization of the world is not always the best. In many cases, we can gain computational benefit by representing things in different ways.
Moreover, representing things in different ways for computational reasons sometimes leads to new and valuable insights that help us in thinking about the world. To make this point concrete, let's consider an example from the world of genealogy - the world of kinship among people. The arcs in the tree shown below represent the parent-child relation among the people represented in the nodes of the tree. Art is the parent of Bob, Bob is the parent of Cal, and so forth.

Figure 1 - parent relation

The tree shown below is an alternative representation. In this case, the ancestry relation is fundamental. Art is the ancestor of Bob, Bob is the ancestor of Cal, Art is the ancestor of Cal, and so forth.

Figure 2 - ancestor relation

This is nonstandard, but there is nothing wrong with it. We can even define parent in terms of ancestor - one person is a parent of another if and only if he is an ancestor and there are no intermediate ancestors. Given multiple ontologies, how do we decide among the alternatives? It is beyond my expertise to talk about what makes ontologies desirable for humans, and I am not sure anyone can say anything sensible about the philosophical issues. However, I can talk about the issues involved in comparing conceptualizations in computational terms. These issues may have philosophical and/or psychological relevance, but in any case they have practical value in implementing useful computer systems. Our genealogy example illustrates these issues. Suppose that we store the parent relation. This does not take up much space, and we can still compute the ancestor relation, although it takes some time to do so. Alternatively, we can store the ancestor relation. This representation takes more storage, but the computation of ancestry is faster - it is just a look-up. In this case, there is a trade-off between space and time. However, things are not always so difficult.
In some cases, we find that one representation is better in time without being worse in space, or better in space without being worse in time. As an example of this, consider a genealogical application in which our primary interest is to manage information about whether or not two people are related, i.e. whether they are in the same family, whether they have a common ancestor within recent history. Now, we can compute this relation from either of the representations shown above. Alternatively, if we did not need to know who is the parent of whom, we could simply record the related relation directly with a graph like the one shown below.

Figure 3 - samefamily relation

The benefit of directly recording the information is speed of look-up. Determining whether or not one person is related to another using the parenthood relation or the ancestor relation takes time that is at least quadratic in the number of links in the graph. By contrast, making this determination using the samefamily relation takes time proportional to the log of the number of arcs in the graph. The downside of this representation is size. The number of arcs in the samefamily graph is the square of the number of nodes in the graph, whereas the other representations take linear space. The good news is that there is an even better representation. Instead of recording parenthood or ancestry, we record the forefather relation. One person is a forefather of another if he is an ancestor but has no ancestor of his own in the database. The graph below shows the forefather relation corresponding to the data in Figures 1 and 2.

Figure 4 - forefather relation

Here is the interesting part. The forefather relation can be stored in the same amount of space as parent. Moreover, the computational cost for the samefamily relation is vastly reduced - computing samefamily using forefather takes logarithmic time, while computing it from parent or ancestor takes multiplicative time!
Clearly, the forefather ontology is superior for this purpose. What is interesting about this example is that we are not always forced to make sacrifices to gain benefits. Here, we can improve time without using up any extra space. In some cases, we can decrease space without increasing computation cost. And in some cases we can improve both space and time. This presentation is about finding reformulations like these. While graphs, like those in my genealogy example, are an appealing way to visualize information, for our purposes here it is more convenient to notate kinship in the form of sentences rather than arcs in a graph. The essentials are all the same in this sentential encoding as in the graphical encoding; only the visual appearance is different. In what follows, we start with some basic definitions about ontologies. We then formalize the notion of ontological reformulation, we talk about how to compare ontologies, and we examine a few mechanical methods for reformulation.

## 2. Databases

Given our kinship vocabulary, we can represent the parentage information in our kinship example using the dataset shown here.
## 3. Logic Programs

As mentioned in the introduction, it is possible to define some relationships in terms of others. In this presentation, we encode such definitions in a logic language called Datalog. We could equally well use other declarative languages; however, Datalog is well-studied, it is easy to understand, and it is sufficient for our purposes today. In Datalog, there are three types of expressions - atoms, literals, and rules. Atoms are analogous to the data in datasets except that they can optionally contain variables in place of symbols. Literals are either atoms or negations of atoms. A rule is an expression consisting of a distinguished atom, called the head, and zero or more literals, together called the body. We write rules as in the example shown below. In this case, q(a,Y) is the head, and the other literals constitute the body. Words beginning with capital letters are variables, and those beginning with lower case letters are ordinary symbols.

```
q(a,Y) :- p(b,Y) & ~r(Y,d)
```
Unfortunately, the language of rules, as defined so far, allows for logic programs with some unpleasant properties. To avoid programs of this sort, it is common in deductive databases to add a couple of restrictions that together eliminate these problems. The first restriction is safety - every variable that appears in the head of a rule or in a negative subgoal must also appear in a positive subgoal in the body. All of the examples above are safe. By contrast, the two rules shown below are not safe. The first rule is not safe because it contains a head variable that does not appear in any positive subgoal.
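The original example rules did not survive in this copy, so the following two rules are a hypothetical reconstruction of the sort of unsafe rules being described (the relation names p, q, r, and s are my own):

```
q(X,Y) :- p(X)
r(X) :- p(X) & ~s(X,Y)
```

In the first rule, the head variable Y does not appear in any positive subgoal. In the second rule, the variable Y appears in a negative subgoal but in no positive subgoal.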
To see why safety matters in the case of the first rule, suppose we had a database in which the body of the rule was satisfied. The rule would then sanction conclusions for every possible value of the unbound head variable, so the value of the rule would depend on the vocabulary rather than on the data. To see why safety matters in the second rule, suppose we had a database with just two facts. A similar problem arises - the value of the rule would depend on the possible bindings for the unbound variable in the negative subgoal. In logic programming, these problems are avoided by requiring all rules to be safe. While this does restrict what one can say, the good news is that it is usually possible to ensure safety by adding additional subgoals to rules to ensure that the restrictions are satisfied. The second restriction is called stratification. As an example, consider the following logic program.
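The program itself was lost in this copy, but a program consistent with the dependency structure described next might look like this (a hypothetical reconstruction):

```
r :- p & q
s :- r & ~s
```

The first rule derives r from p and q; the second derives s from r and the negation of s itself.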
We can visualize the dependencies in a logic program as a graph with a node for each relation - in this case p, q, r, and s - and an arc from each relation mentioned in the body of a rule to the relation in its head, labeled positive or negative according to the sign of the corresponding subgoal. Due to the first rule, there is a positive arc from p to r and a positive arc from q to r. Due to the second rule, there is a positive arc from r to s and a negative arc from s to itself.
A negation in a logic program is said to be stratified if and only if it does not lie on a cycle in this graph, i.e. there is no cycle of arcs passing through the corresponding negative arc. A logic program is stratified if and only if every negation in it is stratified; the program above is not stratified, since the negative arc from s to itself forms such a cycle. The problem with unstratified logic programs is that there is a potential ambiguity. As an example, consider the program above and assume we had a database containing p and q. The immediate value of a rule on a given dataset D is the set of all rule heads obtained by consistently substituting symbols from our vocabulary for variables in such a way that all of the positive subgoals are in D and none of the negative subgoals is in D. Note that, if we add the immediate values of a rule to a dataset, this can lead to additional values for that rule or other rules. The overall value of a rule or an entire ruleset on a given dataset is the fixpoint of this process. For safe rulesets with stratified negation, the overall value on a given dataset is unique and independent of the order in which the rules are processed. As mentioned earlier, we can use rules to define concepts in terms of other concepts. For example, we can define the ancestor relation a in terms of the parent relation p as shown below.
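The definition just described can be written as follows (a sketch in the notation above, using a for ancestor and p for parent as in the text):

```
a(X,Y) :- p(X,Y)
a(X,Z) :- p(X,Y) & a(Y,Z)
```

The first rule says that every parent is an ancestor; the second says that a parent of an ancestor is an ancestor.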
Alternatively, we can define parent in terms of ancestor. Note that we need a helper relation in this case. One person is remotely related to another if and only if there is an intermediate ancestor. We then say that one person is a parent of another if and only if the first person is an ancestor of the second and is not remotely related.
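A sketch of these definitions follows; the name rr for the helper relation is my own, since the original symbol is not preserved here:

```
rr(X,Z) :- a(X,Y) & a(Y,Z)
p(X,Y) :- a(X,Y) & ~rr(X,Y)
```

rr(X,Z) holds when there is an intermediate ancestor Y between X and Z; a parent is then an ancestor with no such intermediary. Both rules are safe, and the negation is stratified.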
Here is an example of defining multiple relationships in terms of multiple other relationships. Here, we define the father relation f and the mother relation m in terms of the parent relation p and the male relation. In this case, unlike the previous example, we cannot write the definitions in the other direction, since we do not know the gender of people who do not have children.
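The definitions might look like the following (a sketch; I use male as the relation name, and defining mother by negation assumes that everyone in the database is classified as male or not):

```
f(X,Y) :- p(X,Y) & male(X)
m(X,Y) :- p(X,Y) & ~male(X)
```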
In our development thus far, we have assumed that the extension of an input relation can be an arbitrary set of facts. In practice, there are often limitations on which extensions are acceptable. In many database texts, constraints are written in direct form - by writing rules that say, in effect, that if certain things are true in an extension, then other things must also be true. In what follows, we use a slightly less direct approach - we encode limitations by writing rules that say when a database is unacceptable. This approach works particularly well for consistency constraints like the one stating that a person cannot be his own parent, and it extends to other sorts of constraints as well.
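For instance, the self-parenthood constraint can be encoded as shown below (a sketch; the relation name illegal is my assumption, not taken from the text):

```
illegal :- p(X,X)
```

A database is deemed unacceptable exactly when illegal is derivable from it.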
Database management systems can use such constraints in a variety of ways. They can be used to optimize the processing of queries. They can also be used to check that updates do not lead to unacceptable extensions. In what follows, we use the word constraint for rules of this sort.

## 4. Reformulation

Reformulation is a relationship between ontologies. To be non-trivial, the ontologies must be different from each other. At the same time, to be useful, the ontologies must have something in common. We formalize these common properties using the notion of an application. An application designates some of the relations of an ontology as inputs and some as outputs. Once again consider the ancestor-based ontology alluded to in the introduction. The rules are shown below; they define the family relation in terms of the ancestor relation.
There are various possible applications of this ontology. For example, we could designate the parent relation as the input and the family relation as the output. Another example. Here are the definitions for an ontology based on the forefather relation.
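The original rule listings and relation symbols did not survive here, so the following is a sketch of what the forefather-based definitions might look like; the names hasanc, ff, and sf are my own:

```
hasanc(X) :- a(Y,X)
ff(X,Y) :- a(X,Y) & ~hasanc(X)
sf(X,Y) :- ff(Z,X) & ff(Z,Y)
```

hasanc(X) holds when X has a recorded ancestor, ff picks out topmost ancestors, and sf relates two people with a common forefather. In the ancestor-based ontology, the family relation would instead be defined directly as sf(X,Y) :- a(Z,X) & a(Z,Y). Note that, as written, a person with no recorded ancestor is not his own forefather; the actual ontology may add such a case so that roots are related to their descendants.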
Now, using the notion of application, we can define the concept of reformulation. We say that one ontology is a reformulation of another with respect to a common application if and only if, for any dataset D in the input vocabulary of the application, the value of the output relations found by applying the rules of the second ontology to D is the same as the value of the output relations found by applying the rules of the first ontology to D. The same family example from the introduction illustrates this. Given the rules shown above, it is easy to see that the ontologies are equivalent. We expand the definitions of the output relations in each ontology and check that the expansions produce the same values.
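Using the sketch names introduced above (hasanc, ff, and sf, all my own), the expansion argument might run as follows. Expanding ff inside the forefather-based definition of sf gives:

```
sf(X,Y) :- a(Z,X) & a(Z,Y) & ~hasanc(Z)
```

The ancestor-based definition is sf(X,Y) :- a(Z,X) & a(Z,Y). On finite, acyclic parentage data the two agree: whenever two people have a common ancestor, tracing that ancestor's own ancestry upward must terminate in someone with no recorded ancestor, so the two people also have a common topmost ancestor.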
Note that, if our application were to include other relations among its outputs, the ontologies might no longer be equivalent.

## 5. Computation

In order to assess the computational characteristics of an ontology in a given application, we need to say how the output relations are computed from the input relations. There are multiple possibilities here - the computation can be top-down or bottom-up, the evaluator might re-order subgoals or treat them in fixed order, the procedure might cache some or all of its intermediate results, it might index relations completely or partially or not at all, and so forth. This multiplicity of choices complicates comparative analysis of ontologies. Just as the equivalence of ontologies is dependent on an external factor, viz. the application in question, so our analysis of the computational properties depends on an external factor as well, viz. our choice of evaluation procedure. Much of the work that has been done on reformulation thus far assumes a particular evaluation procedure, viz. bottom-up computation using nested-loop joins with full indexing of input relations. This method is widely used in evaluating database queries and is amenable to computational analysis.
As an example of comparative analysis, consider an ontology with the rules shown below, and let's assume we have an application with input relations of a given size.
Now let's consider a reformulation of this ontology. It is similar to the original ontology. The only difference is that we define and use an intermediate relation.
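Since the original rules did not survive in this copy, here is a hypothetical illustration of the kind of reformulation described - two rules that share a common join, factored through an intermediate relation i (all relation names are my own):

```
g1(X,Z) :- p(X,Y) & q(Y,Z) & r(Z)
g2(X,Z) :- p(X,Y) & q(Y,Z) & s(Z)
```

```
i(X,Z)  :- p(X,Y) & q(Y,Z)
g1(X,Z) :- i(X,Z) & r(Z)
g2(X,Z) :- i(X,Z) & s(Z)
```

In the reformulation, the join of p and q is computed once rather than twice.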
The computational cost for the first ontology is roughly double that of the reformulation, since the shared work is done twice rather than once. Note that, in some applications, bottom-up execution is not practical. This happens, for example, when the output relations are very large. In fact, if we switch from Datalog to Prolog, the relations can be infinite. In these cases, top-down evaluation of individual output tuples (rather than entire relations) is the only option. The good news is that, through the services of the magic sets transformation, it is possible to convert such problems from top-down computation to bottom-up computation without incurring significant overhead. The magic sets transformation is a beautiful example of reformulation in its own right. Unfortunately, it is a little complicated for me to present in this brief overview, and so I will skip over it with just that one teasing remark. The upshot for us today is simply that we can assume that the analysis of bottom-up computation is by and large a good approximation to the analysis of top-down computation as well.

## 6. Materialization

In the preceding section, we talked about comparing the computational characteristics of ontologies on the assumption that there is no persistent storage. On each step, we are given an input set and asked to compute the output set with no regard to previous or future uses of the input data. In practice, it is often the case that an input dataset is used to answer multiple queries. Said another way, it is often the case that data changes slowly with respect to the queries. This latter scenario is common in information processing applications, particularly in database systems. Obviously, in order to deal with such situations, we need to save the essential content of the input dataset for subsequent use. This is easy enough to do - we simply record and re-use the input dataset until a new dataset comes along. However, there are other possibilities. For example, instead of saving the input relations, we could compute and save the output relations.
Alternatively, we could compute and save some of the intermediate relations. Storing such results is often called materialization. Over the years, various researchers have studied the question of which datasets to materialize. The decision is usually influenced by constraints on the sizes of the materialized relations and computational cost. For example, it is common to require that the materialized relations consume no more space than the input relations. (Ideally, we would prefer a materialization policy that requires *less* space.) And it is common to require that the cost of computing the materialized relations from the input relations plus the cost of computing the output relations from the materialized relations is less than or equal to the cost of computing the output relations directly from the input relations. (Ideally, we would prefer a materialization policy that supports more efficient computation.) Note that it is sometimes desirable to allow limited growth in one of these dimensions to gain significant benefit in another. For example, we might allow a small factor of growth in space in exchange for a large improvement in time. Now comes the interesting part. This research on materialization tells us which relations in an ontology to materialize for a given application. However, it leaves open the question of whether it is possible to do even better by using a different ontology altogether. Let's look at an example drawn from the world of genealogy. We start with data about parentage. Art is the parent of Bob, Bob is the parent of Cal, and so forth. From this, we define the notion of ancestor. One person is an ancestor of another if and only if the first person is a parent of the second or a parent of a parent, and so forth. Using ancestry, we define the family relation. Two people are in the same family if and only if they have a common ancestor. The graph shown below encodes some sample ancestry data. In this case, everyone is in the same family.
Figure 5 - Ancestor versus forefather

Now consider a different conceptualization of this world. Again, we start with parentage data. However, instead of defining ancestry, we define the forefather relation. One person is a forefather of another if he is an ancestor but has no ancestor of his own in the database. Two people are in the same family if and only if they have a common forefather. The same family application is an example where changing ontologies helps. Looking at the ancestry ontology, we see that we cannot store any relations other than the inputs if we are permitted no more than linear growth in space. In general, the ancestor relation is multiplicatively larger than the parent relation, and the same family relation is larger still. By contrast, the forefather ontology offers excellent opportunities for materialization. The forefather relation can be stored in the same amount of space as parent. Moreover, the computation cost for the family relation is vastly reduced - computing family using forefather takes constant time while computing it from parent or ancestor takes multiplicative time! Clearly, the forefather ontology is superior for this purpose. What is interesting about this example is that we are not always forced to make sacrifices to gain benefits. Here, we can improve time without using up any extra space. In some cases, we can decrease space without increasing computation cost. And in some cases we can improve both space and time. Okay, given this new setting, let's consider the problem of reformulation more generally. For a given ontology and application, how many equivalent reformulations are there? For a given ontology and application, how many reformulations are there that dominate the given ontology with respect to materialization?
Can we effectively generate these reformulations? Let's start by looking at the case of ontologies where relations are defined as simple conjunctions. Rada Chirkova studied this particular case in her PhD thesis. Hers is the most interesting work in this area, and I want to talk about a couple of her results. Now let's generalize this. We say that a rule defining a relation in terms of a given set of relations is bounded with respect to an equivalent rule if and only if the body of every rule used in defining the relations in the subgoals of the new rule is a refinement of a subset of the subgoals used in the original rule.
In simple terms, this means that, in looking for new relations to use in defining a relation, we need to consider only definitions whose bodies are subsets of the literals in the original definition. There are exponentially many possibilities to try (in terms of the size of the original definition), but the number we need to look at is still finite. It is a powerful result. Although Rada's initial work focussed on conjunctive views, her results do not stop there. She has also studied reformulation in the presence of views with more complex definitions (with negations and disjunctions and quantifiers and aggregates).

## 7. Domain Reformulation

The reformulation techniques we have just seen are examples of relational reformulation, i.e. they produce changes in the relations defined in an ontology. There are also results on changing the domains of ontologies, i.e. the constituent objects. One common reformulation of this sort is reification.
There is also the reverse of reification.
The classic missionaries and cannibals problem is an example. Three missionaries and three cannibals come to a river and want to get across. There is a boat available, but it holds only two people. Moreover, if at any point there are more cannibals than missionaries on either side of the river, the cannibals there will eat the missionaries. The problem is to come up with a schedule of crossings that gets everyone to the other side of the river without anyone being eaten. If we treat the missionaries as distinct from each other and the cannibals as distinct from each other, we end up searching a space that is larger than it needs to be, since the identity of individuals is not important. It does not matter which missionaries are in the boat with which cannibals, so long as the number of missionaries is at least as great as the number of cannibals. Using Indistinguishability Abstraction, we can simplify the missionaries and cannibals problem in this way and thereby achieve significant savings.

## 8. Conclusion

I am something of a dabbler in the world of databases, but I find this work to be quite cool. If you are not a database person, it may not appear particularly relevant to you. However, the ideas here are interesting beyond the world of traditional databases. Many applications that one would not normally think of as database applications can nevertheless be cast as database applications and thereby benefit from these techniques and results. Examples include constraint satisfaction problems, planning, and general game playing. The upshot is that the theory of reformulation and reformulation techniques may be applicable to a broad class of applications, not just database applications. Once an application is mapped into this framework, we can do analysis on the data, perhaps reformulate, and map the results back to the ordinary procedures, hopefully with improved data structures.
Whether we are casting problems as database applications or simulating in accordance with the logic programming hypothesis, many of us who are working on reformulation are yearning for a unified theory of reformulation to ensure applicability of the results in a wide variety of problem areas. As physical scientists use calculus on a wide variety of problems, so too should computer scientists consider the use of logic and reformulation in their work and not be forced to rediscover the work of researchers working in other areas. Okay, I have now come to the end of my remarks. We have covered quite a bit of ground here. Let me summarize the main points: (1) Ontologies and reformulation are well-defined notions. (2) There has been interesting progress on analysis and automation of reformulation. (3) The framework discussed can inform reformulation in other areas and may ultimately serve as the basis for a unified theory of reformulation. (4) It is a part of what is needed to build practical General Systems.