Distributed and Parallel Information Retrieval

Providing timely access to text collections both locally and across the Internet is instrumental in making information retrieval (IR) systems truly useful. In order to users to effectively access these collections, IR systems must provide coordinated, concurrent, and distributed access. We investigate different architectures for distributed and parallel information retrieval in order to provide efficient and scalable IR services. In particular, we use partial collection replication.

Performance of Distributed Information Retrieval Architectures

Distributed IR Architecture

We developed a prototype distributed information retrieval system which allows multiple users to search multiple document collections concurrently. The Distributed Information Retrieval Architecture consists of a set of clients (users) that connect to a set of IR systems via a connection server. The connection server is responsible for all messages sent between the clients and the IR systems. Since it is difficult to test the performance of large distributed systems, we created a simulation model of the distributed system in order to analyze the performance of the system under a variety of conditions. Furthermore, we are able to easily change the architecture and evaluate the effectiveness of the change. Our results show that our architecture is able to handle a large number of clients and text collections by using a surprisingly small number of connection servers.

Parallel Information Retrieval Servers

Parallel IR Server

We have also developed a Parallel IR Server to investigate how best to exploit a symmetric multiprocessors when building high performance IR servers. Although it is possible to create systems with lots of CPU and disk resources, the important questions are how much of which hardware and what software structure is needed to effectively exploit hardware resources. In some cases, adding hardware degrades performance rather than improves it. We also compare the performance of a multi-threaded IR server to a multitasking IR server. As expected, the multi-threaded server is faster, but the multitasking server is competitive. Thus, for a large legacy system, it may not be worth the effort to convert a single threaded system into a multi-threaded system.

Replicating Information Retrieval Collections

Distributing excessive workloads and searching as little data as possible while maintaining acceptable retrieval accuracy are two ways to improve performance and scalability of a distributed IR system. Replication and partial replication of a collection serves these two purposes. Replicating collections on multiple servers can distribute excessive workloads posed on a single server; replicating collections on servers that are closer to their users can improve performance by reducing network traffic and minimizing network latency. We investigate how to build and rank partial replicas in a information retrieval system. We build a hierarchy of replicas based on query frequency and available resources (the number of servers and their size).

Replication Hierarchy

Here we give an example to show how this architecture works. When a user in the user cluster 1 issues a query, the query goes to replica selection index 1 to determine relevance of Replica 1-1, Replica 1, and the original collection. If Replica 1-1 is ranked as the top one, query may go to any of Replica 1-1, Replica 1, and the original collection for processing, based on server load and network traffic. If Replica 1 is ranked as the top one, query may go to either Replica 1 or the original collection. Otherwise the query only goes to the original collection.

Query Locality

In our work, we examine queries from real system logs and show that there is sufficient query locality in real systems to justify partial collection replication. We use 40 days of queries from THOMAS, the Library of Congress, and one day of Excite logs to study locality patterns in real systems. We find there is sufficient query locality that remains high over long periods of time which will enable partial replication to maintain effectiveness and significantly improve performance. For THOMAS, updating replicas hourly or even daily is unnecessary. However, we need to some mechanism to deal with bursty events. We propose two simple updating strategies that trigger updates based on events and performance, instead of regular updating. In our traces, query exact match misses many overlaps between queries with different terms that in fact return the same top documents, whereas partial replication with an effective replica selection function will find the similarities. We believe this trend will hold for other query sets against text collections and for web queries.

Finding the Appropriate Replica

We extend the inference network model to rank and select partial replicas. We compare our new replica selection function to previous work on collection selection over a range of tuning parameters. For a given query, our replica selection algorithm correctly determines the most relevant of the replicas or original collection, and thus maintains the highest retrieval effectiveness while searching the least data as compared with the other ranking functions. Simulation results show that with load balancing, partial replication consistently improves performance over collection partitioning on multiple disks of a shared-memory multiprocessor and it requires only modest query locality.

Searching a Terabyte of Text

We use our architecture to investigate how to search a terabyte of text using partial replication. We build a hierarchy of replicas based on query frequency and available resources, and use the InQuery retrieval system for the replicas and the original text database. We demonstrate the performance of our system searching a terabyte of text using a validated simulator. We compare the performance of partial replication with partitioning over additional servers. Our results show that partial replication is more effective at reducing execution times than partitioning on significantly fewer resources. For example, using 1 or 2 additional servers for replica(s) achieves similar or better performance than partitioning over 32 additional servers, even when the largest replica satisfies only 20% of commands. Higher query locality further widens the performance differences. Our work porting and validating InQuery and the simulator from a slower processor to the Alpha, as well as experiments with faster querying times which are reported elsewhere, lead us to believe the performance trends will hold for faster systems using fewer resources.




This work is supported by grants from Digital Equipment, NSF grant EIA-9726401, and an NSF Infrastructure grant CDA-9502639, Library of Congress and Department of Commerce under cooperative agreement number EEC-9209623, and by the Defense Advanced Research Projects Agency/ITO under ARPA order number D468, issued by ESC/AXS contract number F19628-95-C-0235. Kathryn S. McKinley is supported by an NSF CAREER Award CCR-9624209. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.
Return to ALI home page.
(Last changed: April 3, 1998.)