Lesson 1: Overview
Infomaster is an information integration system. It provides integrated access to distributed, heterogeneous information sources, thus giving its users the illusion of a centralized, homogeneous information system. Information consumers can ask questions, confident that the system will provide information from all relevant information sources. Authorized information suppliers can add information updates, confident that the system will distribute those updates appropriately.
Infomaster differs from other information systems in three ways. First, unlike many Web search engines, Infomaster is concerned exclusively with structured information. All of Infomaster's data comes from structured data sources, such as databases, structured files (using standards like XML), and application programs with data interfaces (using protocols like LDAP and X.500).
Second, Infomaster automatically integrates information from multiple information sources, providing a single answer to each question rather than a long list of potentially irrelevant or redundant documents. In handling updates, Infomaster automatically disseminates the new information to appropriate information consumers.
Third, Infomaster performs these functions in the face of differences in the structure and vocabulary (i.e. schemas) of the sources and providers. Dealing with heterogeneity of this sort requires some additional effort up front. However, once that initial effort is expended, multiple parties can manage information using their own schemas yet automatically share that information with parties using other schemas, with no further effort.
All digital documents possess structure of one form or another. Meta-information about the "type" of a document dictates how the constituent bits are to be interpreted. This structure makes it possible for computers to interpret those documents and process them in useful ways.
As an example of this, consider an ordinary text file coded in ASCII. In such a file, the bits are interpreted as characters. This structure allows computers to display such files, format them, check for misspellings, and so forth. However, ASCII coding alone does not provide a way of interpreting the meaning of the text within such files. It says nothing about the concepts being discussed or their interrelationships.
As another example, consider image files. In this case, the bits are interpreted as pixels, and these pixels can be used to paint images on computer screens. However, an image file by itself does not directly reveal what kinds of objects are in the picture or what is going on in the scene.
The problem is that most documents in use today are only partially structured. Many documents contain information that humans can discern but that is not expressed in the document's explicit structure. Since this information is not explicit, it lies beyond the processing power of machines.
In an attempt to address this problem, information specialists have begun to shift their emphasis from relatively "unstructured" documents to semantically structured documents (those in which content is expressed explicitly), thus making that content accessible to the computer for processing.
As an example, consider a company database in which personnel and offices and phone numbers are explicitly represented, along with their relationships to each other. In this case, it is possible for a computer to find all personnel who share the same office or all personnel who share the same office but do not share the same phone number or all personnel who share an office with their managers, and so forth.
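The three queries just mentioned can be sketched in a few lines of Python. This is only an illustration of what explicit structure makes possible, not how Infomaster itself stores or queries data; the names, offices, and phone numbers below are hypothetical.

```python
from itertools import combinations

# Hypothetical personnel records (all values invented for illustration).
PERSONNEL = {
    "alice": {"office": "101", "phone": "555-0101", "manager": "carol"},
    "bob":   {"office": "101", "phone": "555-0102", "manager": "carol"},
    "carol": {"office": "102", "phone": "555-0103", "manager": None},
    "dave":  {"office": "102", "phone": "555-0103", "manager": "carol"},
}

def share_office(people):
    """Pairs of people assigned to the same office."""
    return [(a, b) for a, b in combinations(sorted(people), 2)
            if people[a]["office"] == people[b]["office"]]

def share_office_not_phone(people):
    """Pairs who share an office but have different phone numbers."""
    return [(a, b) for a, b in share_office(people)
            if people[a]["phone"] != people[b]["phone"]]

def share_office_with_manager(people):
    """People whose manager sits in the same office."""
    return [name for name, rec in sorted(people.items())
            if rec["manager"] is not None
            and people[rec["manager"]]["office"] == rec["office"]]

print(share_office(PERSONNEL))
print(share_office_not_phone(PERSONNEL))
print(share_office_with_manager(PERSONNEL))
```

None of these questions could be answered mechanically if the same facts appeared only as free text in a staff directory page.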
Without structured information, finding information on the World Wide Web can be time-consuming and error-prone. The user can search for relevant documents using a subject hierarchy (as in Yahoo) or by specifying one or more keywords to find in the body of the document (as in Google). In either case, the result is a potentially large list of sometimes relevant but sometimes irrelevant documents. The user must then peruse these documents himself in the hope of finding the desired information.
Furthermore, the interaction is almost invariably "pull" rather than "push". There is little support for notification of changes. The next day, the user must do his search again, to see whether anything has changed.
Finally, in such systems, there is no support for update. It is not possible for an authorized user to specify information and have it distributed to appropriate sources.
One important advantage of structured information is that it enables a solution to these shortcomings.
In Infomaster, the user can ask sophisticated questions and get answers instead of long and often incomplete lists of sometimes irrelevant documents. Information from multiple sources is automatically aggregated into a single document.
The user can also post a request to be notified if additional relevant information becomes available and be sure of receiving only relevant notifications and receiving them only once.
Infomaster also provides support for updates. In addition to local updates to individual information sources, it supports global updates -- it accepts new information from users and, provided the users are suitably authorized, ensures that that information is automatically disseminated to the appropriate information sources for automatic or manual update. Along the way, it is capable of sending notifications to users who have expressed interest in such changes.
One of the key features of Infomaster is its ability to deal with heterogeneity among the various suppliers and consumers of information. The problem of heterogeneity can be broken into two parts.
The first problem concerns the format of the data (e.g. XML, TDT, etc.) and/or the communication protocol (e.g. LDAP, ODBC, and so forth). There are a few dozen of these, and it is easy to deal with their differences by hand coding access modules. Also, we are seeing a move toward standardization on common formats, such as XML, that will diminish this problem in the future.
The second and more serious problem is conceptual heterogeneity, i.e. mismatches in the data structure and vocabulary used by different clients and sources: different words and different shapes and sizes of data tables. In technical terms, each organization uses its own schema for its data. While there are only a few dozen different formats, there can be thousands of different schemas. Integrated product information management requires mapping between these various schemas and keeping those mappings up-to-date.
The solution to the mapping problem is the use of relational logic. Relational logic provides the conceptual framework for data integration in much the same way that, twenty years ago, relational algebra provided the conceptual framework for relational databases.
Like relational algebra, relational logic allows one to write rules to express relationships between different schemas. However, relational algebra is restricted in that the underlying database is assumed to be complete. While this is possible within the confines of a single corporation, it is a bad assumption when dealing with the Internet. The primary advantage of relational logic is that it allows one to write partial mappings, i.e. constraints expressing partial information rather than complete definitions.
In Infomaster, all mapping information is encoded in rules. One can also write business rules for an individual company, e.g. the fact that a company never uses non-corrosive metals in its products, and one can express industry knowledge as well, e.g. the fact that no one puts Teflon on glass.
Capturing mapping information, business rules, and industry knowledge in the form of rules has multiple benefits. Rules are modular; each source or client is described by rules that do not interfere with those for other clients and suppliers. Rules are mechanically analyzable to detect inconsistencies or incompleteness. Most importantly, rules are the basis for a simple graphical user interface that allows nonprogrammers to encode mapping information, thus putting that job in the hands of domain experts rather than programmers.
Standing behind Infomaster's use of relational logic is its data transformation technology. Using the rules supplied to the system, Infomaster ensures that queries or updates in any one known schema are translated accurately and completely to other schemas as appropriate.
Mapping data directly between end user schemas works well when there are just a few schemas involved. However, as the number of schemas grows, this approach becomes unwieldy. With 1000 clients and 1000 sources, there can be 1,000,000 mappings needed.
The solution to this problem is the use of Infomaster's master schema. In this approach, each end user schema is mapped to the master schema rather than to other end user schemas. This is a much more scalable approach. With 1000 consumers and 1000 suppliers, this approach requires only 2000 mappings.
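The arithmetic behind the two figures can be checked with a short Python sketch. The function names are hypothetical; the only assumption is that, in the direct approach, every consumer schema may need its own mapping to every supplier schema.

```python
def direct_mappings(consumers: int, suppliers: int) -> int:
    """Worst case when every consumer schema is mapped to every supplier schema."""
    return consumers * suppliers

def master_schema_mappings(consumers: int, suppliers: int) -> int:
    """One mapping per end user schema, each to the master schema."""
    return consumers + suppliers

print(direct_mappings(1000, 1000))         # 1000000
print(master_schema_mappings(1000, 1000))  # 2000
```

The direct approach grows with the product of the two counts, while the master schema approach grows with their sum.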
Note that using a master schema does not mean that data must actually be written in the master schema. The master schema is used only as a reference in writing mapping rules. In handling a transaction, Infomaster uses these rules to synthesize the mapping necessary to transform the suppliers' data directly into the consumer's schema.
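A minimal sketch of this idea in Python follows. It assumes the mapping rules are simple term-to-term tables, which is far weaker than the rules Infomaster actually uses; the table contents and function names are hypothetical. The point is only that the two mappings are composed once, so that data passes from the supplier's vocabulary to the consumer's without ever being written in the master schema.

```python
# Hypothetical rules mapping a supplier's terms to master schema terms.
SUPPLIER_TO_MASTER = {"frying pan": "M:Skillet", "stock pot": "M:Stockpot"}

# Hypothetical rules mapping master schema terms to a consumer's terms.
MASTER_TO_CONSUMER = {"M:Skillet": "skillet", "M:Stockpot": "stockpot"}

def synthesize(first, second):
    """Compose two term mappings, keeping only terms both can translate."""
    return {src: second[mid] for src, mid in first.items() if mid in second}

# The synthesized mapping goes straight from supplier terms to consumer terms.
SUPPLIER_TO_CONSUMER = synthesize(SUPPLIER_TO_MASTER, MASTER_TO_CONSUMER)

def translate(record, mapping):
    """Rewrite each value of a record using the synthesized mapping."""
    return {attr: mapping.get(value, value) for attr, value in record.items()}

print(translate({"type": "frying pan"}, SUPPLIER_TO_CONSUMER))
```

Master schema terms such as "M:Skillet" appear only in the rule tables, never in the translated record.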
A major challenge in creating an Infomaster system is coming up with a suitable master schema. Inventing a schema for any application area is difficult. Inventing one for the entire World Wide Web is harder still. The solution is to break the problem down into pieces, with responsibility for different parts falling on different organizations.
In this section, we look at a demonstration of Infomaster in action. The example is a system built in 1996 at the behest of the National Housewares Manufacturers Association. The goal was to create a virtual catalog for cookware, drawing information from the catalogs of manufacturers (in this case, Corning, Mirro, and Regal) and making the information available in integrated form to retailers (here Costco, Payless, and Sears).
Architecture. The system consists of databases of product information for each of the three suppliers and interfaces for each of the three buyers. Infomaster connects these various databases and interfaces together, ensuring integrated but customized access to each of the buyers.
Illustration 1. The opening page provides the user with two types of search. The high-level categories allow the user to search by category. The type-in box allows the user to search by keyword.
Illustration 2. Clicking on a category, such as Product, brings up a parametric search page for products in that category. Clicking on the triangle next to a category name expands that category to show its subcategories. This process can be repeated until one finds a sufficiently narrow category. In this case, we have expanded the Product category three times to get Domestic cookware.
Illustration 3. Clicking on a category name brings up a parametric search page. The left pane shows the category hierarchy. The middle pane shows attributes appropriate to the selected category, each with appropriate values. The right pane shows entries that match the search criteria specified. In this case, no attribute values are specified, so the system shows all 260 entries in the Cookware category.
Illustration 4. Selecting values for attributes pares down the set of possible answers. As each choice is made, the answers in the right pane are automatically refreshed. For example, selecting skillet decreases the possibilities to 60.
Illustration 5. Selecting Teflon for interior decreases the list to 49 items.
Illustration 6. Entering 15 inches for diameter narrows the set of solutions to just 5 products.
Illustration 7. Pressing the Display button brings up a table showing further information about the selected products.
Illustration 8. Clicking any link brings up an "Inspect" page containing information about the associated concept. These pages are constructed on the fly from the databases available at the time. Each page contains information about a single concept and contains all information about that concept. For example, clicking on Aluminum shows information about the material aluminum, such as its type (metallic), its possible uses (on the stove top and in the oven but not in the microwave), and the various products that contain aluminum.
Illustration 9. Similarly, clicking on the name of a company brings up information about that company. For example, here we see that Corning is a US company. There is also a place for suppliers, but none are known.
Illustration 10. Clicking on Regal brings up analogous information. In this case, we see that Regal is a company in the UK. (This is not actually true, but this is just a demo.)
Illustration 11. Browsing works well when there are just a few entries, as in this case. However, when there are thousands of entries, it can be tedious. Parametric search works well when the features of interest are attributes of the type of item being sought. In some cases, the features are indirect, i.e. they are attributes of the values of attributes. Since the search in such cases involves items of multiple categories, it is often called "cross-category" search. In order to do cross-category searches in Infomaster, the user clicks on the small icon beside the Attribute name. This opens up a sub-search that allows the user to enter features of the value of that attribute. In this case, we have opened up the Manufacturer attribute to allow us to specify properties of companies.
Illustration 12. Here, we have gone further and opened up the Nationality attribute of companies to allow us to specify properties on countries. Clicking on North America here causes the system to display products with the specified features made by companies incorporated in countries on the North American continent. Note that there are only 4 answers. The Regal product is no longer an answer, as its manufacturer is incorporated in the UK.
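Cross-category search can be pictured as following a chain of attributes from one category to the next. The Python sketch below assumes three small tables (products, companies, countries) linked by attribute values. The product names are hypothetical, and the nationalities are the ones shown in the demo rather than real-world facts.

```python
# Toy tables; product names are invented, nationalities follow the demo.
COUNTRIES = {"US": {"continent": "North America"}, "UK": {"continent": "Europe"}}
COMPANIES = {"Corning": {"nationality": "US"}, "Regal": {"nationality": "UK"}}
PRODUCTS = [
    {"name": "skillet-A", "manufacturer": "Corning"},
    {"name": "skillet-B", "manufacturer": "Regal"},
]

# Attributes whose values are themselves entries in another category.
TABLES = {"manufacturer": COMPANIES, "nationality": COUNTRIES}

def follow(record, path):
    """Follow a chain of attributes, opening up each value that is a concept."""
    value = record
    for attr in path:
        value = value[attr]
        if attr in TABLES:
            value = TABLES[attr][value]
    return value

def cross_category_search(products, path, wanted):
    """Names of products whose attribute chain ends in the wanted value."""
    return [p["name"] for p in products if follow(p, path) == wanted]

path = ["manufacturer", "nationality", "continent"]
print(cross_category_search(PRODUCTS, path, "North America"))
```

As in the illustration, the Regal product drops out because its manufacturer is recorded as incorporated in the UK.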
Illustration 13. In order to illustrate Infomaster's support for conceptual heterogeneity, let's compare the source data for a product from one of the suppliers with the view available to one of the buyers. Here we have the supplier's view in the window on the left and the buyer's view in the window on the right. In the Regal catalog, the product is listed as a frying pan, whereas, in the Payless catalog, it is a skillet. In the Regal catalog, the interior and exterior are arc-sprayed, whereas in the Payless catalog, the surface is listed as Teflon. The Regal catalog contains diameter and height, both in inches, whereas the Payless catalog has capacity in quarts and diameter in inches. The Regal catalog lists Price in Pounds Sterling, whereas the Payless catalog lists the price in US dollars.
Illustration 14. Now, let's change some of the information about this product. We click on the Change button to convert the Inspect page into a Change page. Then we can make modifications, e.g. switching the product to be a steamer insert, changing the exterior to be Duracote, changing the price to 100 pounds.
Illustration 15. We then click on Change to send these changes to Infomaster. The Inspect page shows that the changes have been accepted. Of course, the Payless page is the same, since it has not been refreshed.
Illustration 16. However, clicking the Refresh button causes the new values to be displayed. The product is now listed as an accessory in the Payless terminology. Its exterior is still Teflon. However, the color, computed from the actual exterior material, is listed as black. Finally, the price has changed proportionally.
Illustration 17. Finally, let's take a look at Infomaster's ability to deal with incomplete information. One thing we know about Regal is that it makes its products out of just two materials, aluminum and stainless. Unfortunately, the Regal catalog does not list any material for its products. Consequently, Infomaster cannot fill in this field. On the other hand, things are not so bleak. Let's consider what happens when we search for Regal products. Here we see the Payless search page. Selecting Regal as manufacturer gives 33 products.
Illustration 18. Naturally, clicking on Aluminum as material leads to zero hits, as none of the Regal products are known for sure to be made of aluminum.
Illustration 19. Similarly, clicking on stainless leads to zero hits.
Illustration 20. However, when we click both aluminum and stainless, all of the products reappear. Although Infomaster does not know exactly which material is in a product, it knows that in either case it satisfies the user's request.
Note that this is a very different answer than one would get from a traditional database system, where a disjunctive query like this one is handled by forming the union of the answers to queries formed from each disjunct. In this case, the result would be zero hits in a traditional database, despite the fact that all Regal products satisfy the user's request. Why is it that database systems do not provide the correct answer in this case? The fact is that most database systems were designed for use in corporate settings, where the data schema could be designed to ensure that no information is missing. Sadly, in the Internet setting, there is no overarching authority requiring that all fields be present. Yet we want to get as much information from the available databases as possible. The logical reasoning in Infomaster ensures that this criterion is met.
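The difference between the two behaviors can be reproduced with a small Python sketch. It is not Infomaster's reasoning procedure. It assumes a toy model in which the rule "every Regal product is aluminum or stainless" is represented as a set of possible materials, and a product counts as an answer only when every material it could have is among those requested. The three product identifiers stand in for the 33 products in the demo.

```python
# Toy stand-ins for Regal products; None means the catalog lists no material.
PRODUCTS = {"regal-1": None, "regal-2": None, "regal-3": None}

# Assumed encoding of the rule: Regal uses only aluminum or stainless.
POSSIBLE = {"aluminum", "stainless"}

def possible_materials(known):
    """Materials a product might have, given what the catalog says."""
    return {known} if known is not None else set(POSSIBLE)

def union_of_disjuncts(products, wanted):
    """Traditional evaluation: union of one lookup per requested material."""
    hits = set()
    for material in wanted:
        hits |= {p for p, m in products.items() if m == material}
    return hits

def certain_answers(products, wanted):
    """Products that satisfy the request whichever possible material they have."""
    return {p for p, m in products.items()
            if possible_materials(m) <= set(wanted)}

print(len(certain_answers(PRODUCTS, {"aluminum"})))
print(len(certain_answers(PRODUCTS, {"aluminum", "stainless"})))
print(len(union_of_disjuncts(PRODUCTS, {"aluminum", "stainless"})))
```

Asking for aluminum alone yields no certain answers, asking for either material yields every product, and the union-of-disjuncts evaluation yields nothing in both cases because no stored value matches either material.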