RFC 4981

Survey of Research towards Robust Peer-to-Peer Networks: Search Methods

Pages: 91
Informational

Part 1 of 5 – Pages 1 to 14

RFC4981 - Page 1

Network Working Group                                          J. Risson
Request for Comments: 4981                                      T. Moors
Category: Informational                    University of New South Wales
                                                          September 2007


       Survey of Research towards Robust Peer-to-Peer Networks:
                            Search Methods

Status of This Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard of any kind.  Distribution of this
   memo is unlimited.

IESG Note

   This RFC is not a candidate for any level of Internet Standard.  The
   IETF disclaims any knowledge of the fitness of this RFC for any
   purpose and notes that the decision to publish is not based on IETF
   review apart from IESG review for conflict with IETF work.  The RFC
   Editor has chosen to publish this document at its discretion.  See
   RFC 3932 for more information.

Abstract

   The pace of research on peer-to-peer (P2P) networking in the last
   five years warrants a critical survey.  P2P has the makings of a
   disruptive technology -- it can aggregate enormous storage and
   processing resources while minimizing entry and scaling costs.

   Failures are common amongst massive numbers of distributed peers,
   though the impact of individual failures may be less than in
   conventional architectures.  Thus, the key to realizing P2P's
   potential in applications other than casual file sharing is
   robustness.

   P2P search methods are first couched within an overall P2P taxonomy.
   P2P indexes for simple key lookup are assessed, including those based
   on Plaxton trees, rings, tori, butterflies, de Bruijn graphs, and
   skip graphs.  Similarly, P2P indexes for keyword lookup, information
   retrieval and data management are explored.  Finally, early efforts
   to optimize range, multi-attribute, join, and aggregation queries
   over P2P indexes are reviewed.  Insofar as they are available in the
   primary literature, robustness mechanisms and metrics are highlighted
   throughout.  However, the low-level mechanisms that most affect
   robustness are not well isolated in the literature.  Recommendations
   are given for future research.

RFC4981 - Page 2

Table of Contents

   1. Introduction ....................................................3
      1.1. Related Disciplines ........................................6
      1.2. Structured and Unstructured Routing ........................7
      1.3. Indexes and Queries ........................................9
   2. Index Types ....................................................10
      2.1. Local Index (Gnutella) ....................................10
      2.2. Central Index (Napster) ...................................12
      2.3. Distributed Index (Freenet) ...............................13
   3. Semantic Free Index ............................................15
      3.1. Origins ...................................................15
           3.1.1. Plaxton, Rajaraman, and Richa (PRR) ................15
           3.1.2. Consistent Hashing .................................16
           3.1.3. Scalable Distributed Data Structures (LH*) .........16
      3.2. Dependability .............................................17
           3.2.1. Static Dependability ...............................17
           3.2.2. Dynamic Dependability ..............................18
           3.2.3. Ephemeral or Stable Nodes -- O(log n) or
                  O(1) Hops ..........................................19
           3.2.4. Simulation and Proof ...............................20
      3.3. Latency ...................................................21
           3.3.1. Hop Count and the O(1)-Hop DHTs ....................21
           3.3.2. Proximity and the O(log n)-Hop DHTs ................22
      3.4. Multicasting ..............................................23
           3.4.1. Multicasting vs. Broadcasting ......................23
           3.4.2. Motivation for DHT-based Multicasting ..............23
           3.4.3. Design Issues ......................................24
      3.5. Routing Geometries ........................................25
           3.5.1. Plaxton Trees (Pastry, Tapestry) ...................25
           3.5.2. Rings (Chord, DKS) .................................27
           3.5.3. Tori (CAN) .........................................28
           3.5.4. Butterflies (Viceroy) ..............................29
           3.5.5. de Bruijn (D2B, Koorde, Distance Halving, ODRI) ....30
           3.5.6. Skip Graphs ........................................32
   4. Semantic Index .................................................33
      4.1. Keyword Lookup ............................................34
           4.1.1. Gnutella Enhancements ..............................36
           4.1.2. Partition-by-Document, Partition-by-Keyword ........38
           4.1.3. Partial Search, Exhaustive Search ..................39
      4.2. Information Retrieval .....................................39
           4.2.1. Vector Model (PlanetP, FASD, eSearch) ..............41
           4.2.2. Latent Semantic Indexing (pSearch) .................43
           4.2.3. Small Worlds .......................................43
   5. Queries ........................................................44
      5.1. Range Queries .............................................45
      5.2. Multi-Attribute Queries ...................................48
      5.3. Join Queries ..............................................50

RFC4981 - Page 3

      5.4. Aggregation Queries .......................................50
   6. Security Considerations ........................................52
   7. Conclusions ....................................................52
   8. Acknowledgments ................................................53
   9. References .....................................................54
      9.1. Informative References ....................................54

1.  Introduction

   Peer-to-peer (P2P) networks are those that exhibit three
   characteristics: self-organization, symmetric communication, and
   distributed control [1].  A self-organizing P2P network
   "automatically adapts to the arrival, departure and failure of nodes"
   [2].  Communication is symmetric in that peers act as both clients
   and servers.  It has no centralized directory or control point.
   USENET servers and BGP peers have these traits [3] but the emphasis
   here is on the flurry of research since 2000.  Leading examples
   include Gnutella [4], Freenet [5], Pastry [2], Tapestry [6], Chord
   [7], the Content Addressable Network (CAN) [8], pSearch [9], and
   Edutella [10].  Some have suggested that peers are inherently
   unreliable [11].  Others have assumed well-connected, stable peers
   [12].

   This critical survey of P2P academic literature is warranted, given
   the intensity of recent research.  At the time of writing, one
   research database lists over 5,800 P2P publications [13].  One vendor
   surveyed P2P products and deployments [14].  There is also a tutorial
   survey of leading P2P systems [15].  DePaoli and Mariani recently
   reviewed the dependability of some early P2P systems at a high level
   [16].  The need for a critical survey was flagged in the peer-to-peer
   research group of the Internet Research Task Force (IRTF) [17].

   P2P is potentially a disruptive technology with numerous
   applications, but this potential will not be realized unless it is
   demonstrated to be robust.  A massively distributed search technique
   may yield numerous practical benefits for applications [18].  A P2P
   system has potential to be more dependable than architectures relying
   on a small number of centralized servers.  It has potential to evolve
   better from small configurations -- the capital outlays for high
   performance servers can be reduced and spread over time if a P2P
   assembly of general purpose nodes is used.  A similar argument
   motivated the deployment of distributed databases -- one thousand,
   off-the-shelf PC processors are more powerful and much less expensive
   than a large mainframe computer [19].  Storage and processing can be
   aggregated to achieve massive scale.  Wasteful partitioning between
   servers or clusters can be avoided.  As Gedik and Liu put it, if P2P
   is to find its way into applications other than casual file sharing,
   then reliability needs to be addressed [20].

RFC4981 - Page 4

   The taxonomy of Figure 1 divides the entire body of P2P research
   literature along four lines: search, storage, security, and
   applications.  This survey concentrates on search aspects.  A P2P
   search network consists of an underlying index (Sections 2 to 4) and
   queries that propagate over that index (Section 5).

   Search [18, 21-29]
      Semantic-Free Indexes [2, 6, 7, 30-52]
         Plaxton Trees
         Rings
         Tori
         Butterflies
         de Bruijn Graphs
         Skip Graphs
      Semantic Indexes [4, 53-71]
         Keyword Lookup
         Peer Information Retrieval
         Peer Data Management
      Queries [20, 22, 23, 25, 32, 38, 41, 56, 72-100]
         Range Queries
         Multi-Attribute Queries
         Join Queries
         Aggregation Queries
         Continuous Queries
         Recursive Queries
         Adaptive Queries

   Storage
      Consistency & Replication [101-112]
         Eventual consistency
         Trade-offs
      Distribution [39, 42, 90, 92, 113-131]
         Epidemics, Bloom Filters
      Fault Tolerance [40, 105, 132-139]
         Erasure Coding
         Byzantine Agreement
      Locality [24, 43, 47, 140-160]
      Load Balancing [37, 86, 100, 107, 151, 161-171]

RFC4981 - Page 5

   Security
      Character [172-182]
         Identity
         Reputation and Trust
         Incentives
      Goals [25, 27, 71, 183-197]
         Availability
         Authenticity
         Anonymity
         Access Control
         Fair Trading

   Applications [1, 198-200]
      Memory [32, 90, 142, 201-222]
         File Systems
         Web
         Content Delivery Networks
         Directories
      Service Discovery
      Publish / Subscribe ...
   Intelligence [223-228]
      GRID
      Security...
   Communication [12, 92, 119, 229-247]
      Multicasting
      Streaming Media
      Mobility
      Sensors...

            Figure 1: Classification of P2P Research Literature

   This survey is concerned with two questions.  The first, "How do P2P
   search networks work?"  This foundation is important given the pace
   and breadth of P2P research in the last five years.  In Section 2, we
   classify indexes as local, centralized and distributed.  Since
   distributed indexes are becoming dominant, they are given closer
   attention in Sections 3 and 4.  Section 3 compares distributed P2P
   indexes for simple key lookup; in particular, their origins (Section
   3.1), dependability (Section 3.2), latency (Section 3.3), and their
   support for multicast (Section 3.4).  It classifies those indexes
   according to their routing geometry (Section 3.5) -- Plaxton trees,
   rings, tori, butterflies, de Bruijn graphs and skip graphs.  Section
   4 reviews distributed P2P indexes supporting keyword lookup (Section
   4.1) and information retrieval (Section 4.2).  Section 5 probes the
   embryonic research on P2P queries; in particular, range queries
   (Section 5.1), multi-attribute queries (Section 5.2), join queries
   (Section 5.3), and aggregation queries (Section 5.4).

RFC4981 - Page 6

   The second question, "How robust are P2P search networks?"  Insofar
   as it is available in the research literature, we tease out the
   robustness mechanisms and metrics throughout Sections 2 to 5.
   Unfortunately, robustness is often more sensitive to low-level design
   choices than it is to the broad P2P index structure, yet these
   underlying design choices are seldom isolated in the primary
   literature [248].  Furthermore, there has been little consensus on
   P2P robustness metrics (Section 3.2).  Section 8 gives
   recommendations to address these important gaps.

1.1.  Related Disciplines

   Peer-to-peer research draws upon numerous distributed systems
   disciplines.  Networking researchers will recognize familiar issues
   of naming, routing, and congestion control.  P2P designs need to
   address routing and security issues across network region boundaries
   [152].  Networking research has traditionally been host-centric.  The
   Web's Universal Resource Identifiers are naturally tied to specific
   hosts, making object mobility a challenge [216].

   P2P work is data-centric [249].  P2P systems for dynamic object
   location and routing have borrowed heavily from the distributed
   systems corpus.  Some have used replication, erasure codes, and
   Byzantine agreement [111].  Others have used epidemics for durable
   peer group communication [39].

   Similarly, P2P research is set to benefit from database research
   [250].  Database researchers will recognize the need to reapply
   Codd's principle of physical data independence, that is, to decouple
   data indexes from the applications that use the data [23].  It was
   the invention of appropriate indexing mechanisms and query
   optimizations that enabled data independence.  Database indexes like
   B+ trees have an analog in P2P's distributed hash tables (DHTs).
   Wide-area, P2P query optimization is a ripe, but challenging, area
   for innovation.

   More flexible distribution of objects comes with increased security
   risks.  There are opportunities for security researchers to deliver
   new methods for availability, file authenticity, anonymity, and
   access control [25].  Proactive and reactive mechanisms are needed to
   deal with large numbers of autonomous, distributed peers.  To build
   robust systems from cooperating but self-interested peers, issues of
   identity, reputation, trust, and incentives need to be tackled.
   Although it is beyond the scope of this paper, robustness against
   malicious attacks also ought to be addressed [195].

   Possibly the largest portion of P2P research has majored on basic
   routing structures [18], where research on algorithms comes to the

RFC4981 - Page 7

   fore.  Should the overlay be "structured" or "unstructured"?  Are the
   two approaches competing or complementary?  Comparisons of the
   "structured" approaches (hypercubes, rings, toroids, butterflies, de
   Bruijn, and skip graphs) have weighed the amount of routing state per
   peer and the number of links per peer against overlay hop counts.
   While "unstructured" overlays initially used blind flooding and
   random walks, overheads usually trigger some structure, for example,
   super-peers and clusters.

   P2P applications rely on cooperation between these disciplines.
   Applications have included file sharing, directories, content
   delivery networks, email, distributed computation, publish-subscribe
   middleware, multicasting, and distributed authentication.  Which
   applications will be suited to which structures?  Are there adaptable
   mechanisms that can decouple applications from the underlying data
   structures?  What are the criteria for selection of applications
   amenable to a P2P design [1]?

   Robustness is emphasized throughout the survey.  We are particularly
   interested in two aspects.  The first, dependability, was a leading
   design goal for the original Internet [251].  It deserves the same
   status in P2P.  The measures of dependability are well established:
   reliability, a measure of the mean-time-to-failure (MTTF);
   availability, a measure of both the MTTF and the mean-time-to-repair
   (MTTR); maintainability; and safety [252].  The second aspect is the
   ability to accommodate variation in outcome, which one could call
   adaptability.  Its measures have yet to be defined.  In the context
   of the Internet, it was only recently acknowledged as a first-class
   requirement [253].  In P2P, it means planning for the tussles over
   resources and identity.  It means handling different kinds of queries
   and accommodating changeable application requirements with minimal
   intervention.  It means "organic scaling" [22], whereby the system
   grows gracefully, without a priori data center costs or architectural
   breakpoints.

   In the following section, we discuss one notable omission from the
   taxonomy of P2P networking in Figure 1 -- routing.

1.2.  Structured and Unstructured Routing

   P2P routing algorithms have been classified as "structured" or
   "unstructured".  Peers in unstructured overlay networks join by
   connecting to any existing peers [254].  In structured overlays, the
   identifier of the joining peer determines the set of peers that it
   connects to [254].  Early instantiations of Gnutella were
   unstructured -- keyword queries were flooded widely [255].  Napster
   [256] had decentralized content and a centralized index, so it only
   partially satisfies the distributed control criteria for P2P systems.

RFC4981 - Page 8

   Early structured algorithms included Plaxton, Rajaraman and Richa
   (PRR) [30], Pastry [2], Tapestry [31], Chord [7], and the Content
   Addressable Network [8].  Mishchke and Stiller recently classified
   P2P systems by the presence or absence of structure in routing tables
   and network topology [257].

   Some have cast unstructured and structured algorithms as competing
   alternatives.  Unstructured approaches have been called "first
   generation", implicitly inferior to the "second generation"
   structured algorithms [2, 31].  When generic key lookups are
   required, these structured, key-based routing schemes can guarantee
   location of a target within a bounded number of hops [23].  The
   broadcasting unstructured approaches, however, may have large routing
   costs, or fail to find available content [22].  Despite the apparent
   advantages of structured P2P, several research groups are still
   pursuing unstructured P2P.

   There have been two main criticisms of structured systems [61].  The
   first relates to peer transience, which in turn, affects robustness.
   Chawathe, et al. opined that highly transient peers are not well
   supported by DHTs [61].  P2P systems often exhibit "churn", with
   peers continually arriving and departing.  One objection to concerns
   about highly transient peers is that many applications use peers in
   well-connected parts of the network.  The Tapestry authors analyzed
   the impact of churn in a network of 1000 nodes [31].  Others opined
   that it is possible to maintain a robust DHT at relatively low cost
   [258].  Very few papers have quantitatively compared the resilience
   of structured systems.  Loguinov, Kumar, et al. claimed that there
   were only two such works [24, 36].

   The second criticism of structured systems is that they do not
   support keyword searches and complex queries as well as unstructured
   systems.  Given the current file-sharing deployments, keyword
   searches seem more important than exact-match key searches in the
   short term.  Paraphrased, "most queries are for hay, not needles"
   [61].

   More recently, some have justifiably seen unstructured and structured
   proposals as complementary, and have devised hybrid models [259].
   Their starting point was the observation that unstructured flooding
   or random walks are inefficient for data that is not highly
   replicated across the P2P network.  Structured graphs can find keys
   efficiently, irrespective of replication.  Castro, et al. proposed
   Structella, a hybrid of Gnutella built on top of Pastry [259].
   Another design used structured search for rare items and unstructured
   search for massively replicated items [54].

RFC4981 - Page 9

   However, the "structured versus unstructured routing" taxonomy is
   becoming less useful, for two reasons, Firstly, most "unstructured"
   proposals have evolved and incorporated structure.  Consider the
   classic "unstructured" system, Gnutella [4].  For scalability, its
   peers are either ultrapeers or leaf nodes.  This hierarchy is
   augmented with a query routing protocol whereby ultrapeers receive a
   hashed summary of the resource names available at leaf nodes.
   Between ultrapeers, simple query broadcast is still used, though
   methods to reduce the query load here have been considered [260].
   Secondly, there are emerging schema-based P2P designs [59], with
   super-node hierarchies and structure within documents.  These are
   quite distinct from the structured DHT proposals.

1.3.  Indexes and Queries

   Given that most, if not all, P2P designs today assume some structure,
   a more instructive taxonomy would describe the structure.  In this
   survey, we use a database taxonomy in lieu of the networking
   taxonomy, as suggested by Hellerstein, Cooper, and Garcia-Molina [23,
   261].  The structure is determined by the type of index (Sections 2 ,
   3, and 4).  Queries feature in lieu of routing (Section 5).  The DHT
   algorithms implement a "semantic-free index" [216].  They are
   oblivious of whether keys represent document titles, meta-data, or
   text.  Gnutella-like and schema-based proposals have a "semantic
   index".

   Index engineering is at the heart of P2P search methods.  It captures
   a broad range of P2P issues, as demonstrated by the Search/Index
   Links model [261].  As Manber put it, "the most important of the
   tools for information retrieval is the index -- a collection of terms
   with pointers to places where information about documents can be
   found" [262].  Sen and Wang noted that a "P2P network" usually
   consists of connections between hosts for application-layer
   signaling, rather than for the data transfer itself [263].
   Similarly, we concentrate on the "signaled" indexes and queries.

   Our focus here is the dependability and adaptability of the search
   network.  Static dependability is a measure of how well queries route
   around failures in a network that is normally fault-free.  Dynamic
   dependability gives an indication of query success when nodes and
   data are continually joining and leaving the P2P system.  An
   adaptable index accommodates change in the data and query
   distribution.  It enables data independence, in that it facilitates
   changes to the data layout without requiring changes to the
   applications that use the data [23].  An adaptable P2P system can
   support rich queries for a wide range of applications.  Some
   applications benefit from simple, semantic-free key lookups [264].
   Others require more complex, Structured Query Language (SQL)-like

RFC4981 - Page 10

   queries to find documents with multiple keywords, or to aggregate or
   join query results from distributed relations [22].

2.  Index Types

   A P2P index can be local, centralized, or distributed.  With a local
   index, a peer only keeps the references to its own data, and does not
   receive references for data at other nodes.  The very early Gnutella
   design epitomized the local index (Section 2.1).  In a centralized
   index, a single server keeps references to data on many peers.  The
   classic example is Napster (Section 2.2).  With distributed indexes,
   pointers towards the target reside at several nodes.  One very early
   example is Freenet (Section 2.3).  Distributed indexes are used in
   most P2P designs nowadays -- they dominate this survey.

   P2P indexes can also be classified as non-forwarding and forwarding.
   When queries are guided by a non-forwarding index, they jump to the
   node containing the target data in a single hop.  There have been
   semantic and semantic-free one-hop schemes [138, 265, 266].  Where
   scalability to a massive number of peers is required, these schemes
   have been extended to two hops [267, 268].  More common are the
   forwarding P2Ps, where the number of hops varies with the total
   number of peers, often logarithmically.  The related trade-offs
   between routing state, lookup latency, update bandwidth, and peer
   churn are critical to total system dependability.

2.1.  Local Index (Gnutella)

   P2Ps with a purely local data index are becoming rare.  In such
   designs, peers flood queries widely and only index their own content.
   They enable rich queries - the search is not limited to a simple key
   lookup.  However, they also generate a large volume of query traffic
   with no guarantee that a match will be found, even if it does exist
   on the network.  For example, to find potential peers on the early
   instantiations of Gnutella, 'ping' messages were broadcast over the
   P2P network and the 'pong' responses were used to build the node
   index.  Then, small 'query' messages, each with a list of keywords,
   are broadcast to peers that respond with matching filenames [4].

   There have been numerous attempts to improve the scalability of
   local-index P2P networks.  Gnutella uses fixed time-to-live (TTL)
   rings, where the query's TTL is set less than 7-10 hops [4].  Small
   TTLs reduce the network traffic and the load on peers, but also
   reduce the chances of a successful query hit.  One paper reported,
   perhaps a little too bluntly, that the fixed "TTL-based mechanism
   does not work" [67].  To address this TTL selection problem, they
   proposed an expanding ring, known elsewhere as iterative deepening
   [29].  It uses successively larger TTL counters until there is a

RFC4981 - Page 11

   match.  The flooding, ring, and expanding ring methods all increase
   network load with duplicated query messages.  A random walk, whereby
   an unduplicated query wanders about the network, does indeed reduce
   the network load but massively increases the search latency.  One
   solution is to replicate the query k times at each peer.  Called
   random k-walkers, this technique can be coupled with TTL limits, or
   periodic checks with the query originator, to cap the query load
   [67].  Adamic, Lukose, et al. suggested that the random walk searches
   be directed to nodes with a higher degree, that is, with larger
   numbers of inter-peer connections [269].  They assumed that higher-
   degree peers are also capable of higher query throughputs.  However,
   without some balancing design rule, such peers would be swamped with
   the entire P2P signaling traffic.  In addition to the above
   approaches, there is the 'directed breadth-first' algorithm [29].  It
   forwards queries within a subset of peers selected according to
   heuristics on previous performance, like the number of successful
   query results.  Another algorithm, called probabilistic flooding, has
   been modeled using percolation theory [270].

   Several measurement studies have investigated locally indexed P2Ps.
   Jovanovic noted Gnutella's power law behaviour [70].  Sen and Wang
   compared the performance of Gnutella, Fasttrack [271], and Direct
   Connect [263, 272, 273].  At the time, only Gnutella used local data
   indexes.  All three schemes now use distributed data indexes, with
   hierarchy in the form of Ultrapeers (Gnutella), Super-Nodes
   FastTrack), and Hubs (Direct Connect).  It was found that a very
   small percentage of peers have a very high degree and that the total
   system dependability is at the mercy of such peers.  While peer up-
   time and bandwidth were heavy-tailed, they did not fit well with the
   Zipf distribution.  Fortunately for Internet Service Providers,
   measures aggregated by IP prefix and Autonomous System (AS) were more
   stable than for individual IP addresses.  A study of University of
   Washington traffic found that Gnutella and Kazaa together contributed
   43% of the university's total TCP traffic [274].  They also reported
   a heavy-tailed distribution, with 600 external peers (out of 281,026)
   delivering 26% of Kazaa bytes to internal peers.  Furthermore,
   objects retrieved from the P2P network were typically three orders of
   magnitude larger than Web objects -- 300 objects contributed to
   almost half the total outbound Kazaa bandwidth.  Others reported
   Gnutella's topology mismatch, whereby only 2-5% of P2P connections
   link peers in the same Autonomous System (AS), despite over 40% of
   peers being in the top 10 ASs [65].  Together these studies
   underscore the significance of multimedia sharing applications.  They
   motivate interesting caching and locality solutions to the topology
   mismatch problem.

   These same studies bear out one main dependability lesson: total
   system dependability may be sensitive to the dependability of high-

RFC4981 - Page 12

   degree peers.  The designers of Scamp translated this observation to
   the design heuristic, "have the degree of each node be of nearly
   equal size" [153].  They analyzed a system of N peers, with mean
   degree c.log(n), where link failures occur independently with
   probability e.  If d>0 is fixed and c>(1+d)/(-log(e)), then the
   probability of graph disconnection goes to zero as N->infinity.
   Otherwise, if c<(1-d)/(-log(e)), then the probability of
   disconnection goes to one as N->infinity.  They presented a
   localizer, which finds approximate minima to a global function of
   peer degree and arbitrary link costs using only local information.
   The Scamp overlay construction algorithms could support any of the
   flooding and walking routing schemes above, or other epidemic and
   multicasting schemes for that matter.  Resilience to high churn rates
   was identified for future study.

2.2.  Central Index (Napster)

   Centralized schemes like Napster [256] are significant because they
   were the first to demonstrate the P2P scalability that comes from
   separating the data index from the data itself.  Ultimately, 36
   million Napster users lost their service not because of technical
   failure, but because the single administration was vulnerable to the
   legal challenges of record companies [275].

   There has since been little research on P2P systems with central data
   indexes.  Such systems have also been called 'hybrid' since the index
   is centralized but the data is distributed.  Yang and Garcia-Molina
   devised a four-way classification of hybrid systems [276]: unchained
   servers, where users whose index is on one server do not see other
   servers' indexes; chained servers, where the server that receives a
   query forwards it to a list of servers if it does not own the index
   itself; full replication, where all centralized servers keep a
   complete index of all available metadata; and hashing, where keywords
   are hashed to the server where the associated inverted list is kept.
   The unchained architecture was used by Napster, but it has the
   disadvantage that users do not see all indexed data in the system.
   Strictly speaking, the other three options illustrate the distributed
   data index, not the central index.  The chained architecture was
   recommended as the optimum for the music-swapping application at the
   time.  The methods by which clients update the central index were
   classified as batch or incremental, with the optimum determined by
   the query-to-login ratio.  Measurements were derived from a clone of
   Napster called OpenNap[277].  Another study of live Napster data
   reported wide variation in the availability of peers, a general
   unwillingness to share files (20-40% of peers share few or no files),
   and a common understatement of available bandwidth so as to
   discourage other peers from sharing one's link [202].

RFC4981 - Page 13

   Influenced by Napster's early demise, the P2P research community may
   have prematurely turned its back on centralized architectures.
   Chawathe, Ratnasamy, et al. opined that Google and Yahoo demonstrate
   the viability of a centralized index.  They argued that "the real
   barriers to Napster-like designs are not technical but legal and
   financial" [61].  Even this view may be a little too harsh on the
   centralized architectures -- it implies that they always have an up-
   front capital hurdle that is steeper than for distributed
   architectures.  The closer one looks at scalable 'centralized'
   architectures, the less the distinction with 'distributed'
   architectures seems to matter.  For example, it is clear that
   Google's designers consider Google a distributed, not centralized,
   file system [278].  Google demonstrates the scale and performance
   possible on commodity hardware, but still has a centralized master
   that is critical to the operation of each Google cluster.  Time may
   prove that the value of emerging P2P networks, regardless of the
   centralized-versus-distributed classification, is that they smooth
   the capital outlays and remove the single points of failure across
   the spectra of scale and geographic distribution.

2.3.  Distributed Index (Freenet)

   An important early P2P proposal for a distributed index was Freenet
   [5, 71, 279].  While its primary emphasis was the anonymity of peers,
   it did introduce a novel indexing scheme.  Files are identified by
   low-level "content-hash" keys and by "secure signed-subspace" keys,
   which ensure that only a file owner can write to a file while anyone
   can read from it.  To find a file, the requesting peer first checks
   its local table for the node with keys closest to the target.  When
   that node receives the query, it too checks for either a match or
   another node with keys close to the target.  Eventually, the query
   either finds the target or exceeds time-to-live (TTL) limits.  The
   query response traverses the successful query path in reverse,
   depositing a new routing table entry (the requested key and the data
   holder) at each peer.  The insert message similarly steps towards the
   target node, updating routing table entries as it goes, and finally
   stores the file there.  Whereas early versions of Gnutella used
   breadth-first flooding, Freenet uses a more economic depth-first
   search [280].

   An initial assessment has been done of Freenet's robustness.  It was
   shown that in a network of 1000 nodes, the median query path length
   stayed under 20 hops for a failure of 30% of nodes.  While the
   Freenet designers considered this as evidence that the system is
   "surprisingly robust against quite large failures" [71], the same
   datapoint may well be outside meaningful operating bounds.  How many
   applications are useful when the first quartile of queries have path
   lengths of several hundred hops in a network of only 1000 nodes, per

RFC4981 - Page 14

   Figure 4 of [71]?  To date, there has been no analysis of Freenet's
   dynamic robustness.  For example, how does it perform when nodes are
   continually arriving and departing?

   There have been both criticisms and extensions of the early Freenet
   work.  Gnutella proponents acknowledged the merit in Freenet's
   avoidance of query broadcasting [281].  However, they are critical on
   two counts: the exact file name is needed to construct a query; and
   exactly one match is returned for each query.  P2P designs using
   DHTs, per Section 3, share similar characteristics -- a precise query
   yields a precise response.  The similarity is not surprising since
   Freenet also uses a hash function to generate keys.  However, the
   query routing used in the DHTs has firmer theoretical foundations.
   Another difference with DHTs is that Freenet will take time, when a
   new node joins the network, to build an index that facilitates
   efficient query routing.  By the inventor's own admission, this is
   damaging for a user's first impressions [282].  It was proposed to
   download a copy of routing tables from seed nodes at startup, even
   though the new node might be far from the seed node.  Freenet's slow
   startup motivated Mache, Gilbert, et al. to amend the overlay after
   failed requests and to place additional index entries on successful
   requests -- they claim almost an order of magnitude reduction in
   average query path length [280].  Clarke also highlighted the lack of
   locality or bandwidth information available for efficient query
   routing decisions [282].  He proposed that each node gather response
   times, connection times, and proportion of successful requests for
   each entry in the query routing table.  When searching for a key that
   is not in its own routing table, it was proposed to estimate response
   times from the routing metrics for the nearest known keys and
   consequently choose the node that can retrieve the data fastest.  The
   response time heuristic assumed that nodes close in the key space
   have similar response times.  This assumption stemmed from early
   deployment observations that Freenet peers seemed to specialize in
   parts of the keyspace -- it has not been justified analytically.
   Kronfol drew attention to Freenet's inability to do keyword searches
   [283].  He suggested that peers cache lists of weighted keywords in
   order to route queries to documents, using Term Frequency Inverse
   Document Frequency (TFIDF) measures and inverted indexes (Section
   4.2.1).  With these methods, a peer can route queries for simple
   keyword lists or more complicated conjunctions and disjunctions of
   keywords.  Robustness analysis and simulation of Kronfol's proposal
   remain open.

   The vast majority of P2P proposals in following sections rely on a
   distributed index.

(next page on part 2)