RFC 6940

REsource LOcation And Discovery (RELOAD) Base Protocol

Pages: 176
Proposed Standard
→ Errata

Part 5 of 7 – Pages 93 to 124

RFC6940 - Page 93 prevText

7.3.  Access Control Policies

   Every Kind which is storable in an overlay MUST be associated with an
   access control policy.  This policy defines whether a request from a
   given node to operate on a given value should succeed or fail.  It is
   anticipated that only a small number of generic access control
   policies are required.  To that end, this section describes a small
   set of such policies, and Section 14.4 establishes a registry for new
   policies, if required.  Each policy has a short string identifier
   which is used to reference it in the Configuration Document.

   In the following policies, the term "signer" refers to the signer of
   the StoredValue object and, in the case of non-replica stores, to the
   signer of the StoreReq message.  That is, in a non-replica store,
   both the signer of the StoredValue and the signer of the StoreReq
   MUST conform to the policy.  In the case of a replica store, the
   signer of the StoredValue MUST conform to the policy, and the
   StoreReq itself MUST be checked as described in Section 7.4.1.1.

7.3.1.  USER-MATCH

   In the USER-MATCH policy, a given value MUST be written (or
   overwritten) if and only if the signer's certificate has a user name
   which hashes (using the hash function for the overlay) to the
   Resource-ID for the resource.  Recall that the certificate may,
   depending on the overlay configuration, be self-signed.

7.3.2.  NODE-MATCH

   In the NODE-MATCH policy, a given value MUST be written (or
   overwritten) if and only if the signer's certificate has a specified
   Node-ID which hashes (using the hash function for the overlay) to the
   Resource-ID for the resource and that Node-ID is the one indicated in
   the SignerIdentity value cert_hash.

7.3.3.  USER-NODE-MATCH

   The USER-NODE-MATCH policy may be used only with dictionary types.
   In the USER-NODE-MATCH policy, a given value MUST be written (or
   overwritten) if and only if the signer's certificate has a user name
   which hashes (using the hash function for the overlay) to the
   Resource-ID for the resource.  In addition, the dictionary key MUST
   be equal to the Node-ID in the certificate, and that Node-ID MUST be
   the one indicated in the SignerIdentity value cert_hash.

RFC6940 - Page 94

7.3.4.  NODE-MULTIPLE

   In the NODE-MULTIPLE policy, a given value MUST be written (or
   overwritten) if and only if the signer's certificate contains a
   Node-ID such that H(Node-ID || i) is equal to the Resource-ID for
   some small integer value of i and that Node-ID is the one indicated
   in the SignerIdentity value cert_hash.  When this policy is in use,
   the maximum value of i MUST be specified in the Kind definition.

   Note that because i is not carried on the wire, the verifier MUST
   iterate through potential i values, up to the maximum value, to
   determine whether a store is acceptable.

7.4.  Data Storage Methods

   RELOAD provides several methods for storing and retrieving data:

   o  Store values in the overlay.

   o  Fetch values from the overlay.

   o  Stat: Get metadata about values in the overlay.

   o  Find the values stored at an individual peer.

   These methods are described in the following sections.

7.4.1.  Store

   The Store method is used to store data in the overlay.  The format of
   the Store request depends on the data model, which is determined by
   the Kind.

7.4.1.1.  Request Definition

   A StoreReq message is a sequence of StoreKindData values, each of
   which represents a sequence of stored values for a given Kind.  The
   same Kind-ID MUST NOT be used twice in a given store request.  Each
   value is then processed in turn.  These operations MUST be atomic.
   If any operation fails, the state MUST be rolled back to what it was
   before the request was received.

RFC6940 - Page 95

   The store request is defined by the StoreReq structure:

       struct {
           KindId                 kind;
           uint64                 generation_counter;
           StoredData             values<0..2^32-1>;
       } StoreKindData;

       struct {
           ResourceId             resource;
           uint8                  replica_number;
           StoreKindData          kind_data<0..2^32-1>;
       } StoreReq;

   A single Store request stores data of a number of Kinds to a single
   resource location.  The contents of the structure are:

   resource
      The resource at which to store.

   replica_number
      The number of this replica.  When a storing peer saves replicas to
      other peers, each peer is assigned a replica number, starting from
      1, that is sent in the Store message.  This field is set to 0 when
      a node is storing its own data.  This allows peers to distinguish
      replica writes from original writes.  Different topologies may
      choose to allocate or interpret the replica number differently
      (see Section 10.4).

   kind_data
      A series of elements, one for each Kind of data to be stored.

   The peer MUST check that it is responsible for the resource if the
   replica number is zero; if it is not, the peer must reject the
   request.  The peer MUST check that it expects to be a replica for the
   resource and that the request sender is consistent with being the
   responsible node (i.e., that the receiving peer does not know of a
   better node) if the replica number is nonzero; if the request sender
   is not consistent, it should reject the request.

   Each StoreKindData element represents the data to be stored for a
   single Kind-ID.  The contents of the element are:

   kind
      The Kind-ID.  Implementations MUST reject requests corresponding
      to unknown Kinds.

RFC6940 - Page 96

   generation_counter
      The expected current state of the generation counter
      (approximately the number of times that this object has been
      written; see below for details).

   values
      The value or values to be stored.  This may contain one or more
      stored_data values, depending on the data model associated with
      each Kind.

   The peer MUST perform the following checks:

   o  The Kind-ID is known and supported.

   o  The signatures over each individual data element, if any, are
      valid.  If this check fails, the request MUST be rejected with an
      Error_Forbidden error.

   o  Each element is signed by a credential which is authorized to
      write this Kind at this Resource-ID.  If this check fails, the
      request MUST be rejected with an Error_Forbidden error.

   o  For original (non-replica) stores, the StoreReq is signed by a
      credential which is authorized to write this Kind at this
      Resource-ID.  If this check fails, the request MUST be rejected
      with an Error_Forbidden error.

   o  For replica stores, the StoreReq is signed by a Node-ID which is a
      plausible node to either have originally stored the value or have
      been in the replica set.  What this means is overlay specific, but
      in the case of the Chord-based DHT defined in this specification,
      replica StoreReqs MUST come from nodes which are either in the
      known replica set for a given resource or which are closer than
      some node in the replica set.  If this check fails, the request
      MUST be rejected with an Error_Forbidden error.

   o  For original (non-replica) stores, the peer MUST check that if the
      generation counter is nonzero, it equals the current value of the
      generation counter for this Kind.  This feature allows the
      generation counter to be used in a way similar to the HTTP ETag
      feature.

   o  For replica Stores, the peer MUST set the generation counter to
      match the generation counter in the message and MUST NOT check the
      generation counter against the current value.  Replica Stores MUST
      NOT use a generation counter of 0.

RFC6940 - Page 97

   o  The storage time values are greater than that of any values which
      would be replaced by this Store.

   o  The size and number of the stored values are consistent with the
      limits specified in the overlay configuration.

   o  If the data is signed with identity_type set to "none" and/or
      SignatureAndHashAlgorithm values set to {0, 0} ("anonymous" and
      "none"), the StoreReq MUST be rejected with an Error_forbidden
      error.  Only synthesized data returned by the storage can use
      these values (see Section 7.4.2.2)

   If all these checks succeed, the peer MUST attempt to store the data
   values.  For non-replica stores, if the store succeeds and the data
   is changed, then the peer MUST increase the generation counter by at
   least 1.  If there are multiple stored values in a single
   StoreKindData, it is permissible for the peer to increase the
   generation counter by only 1 for the entire Kind-ID or by 1 or more
   than 1 for each value.  Accordingly, all stored data values MUST have
   a generation counter of 1 or greater. 0 is used in the Store request
   to indicate that the generation counter should be ignored for
   processing this request.  However, the responsible peer should
   increase the stored generation counter and should return the correct
   generation counter in the response.

   When a peer stores data previously stored by another node (e.g., for
   replicas or topology shifts), it MUST adjust the lifetime value
   downward to reflect the amount of time the value was stored at the
   peer.  The adjustment SHOULD be implemented by an algorithm
   equivalent to the following: at the time the peer initially receives
   the StoreReq, it notes the local time T.  When it then attempts to do
   a StoreReq to another node, it should decrement the lifetime value by
   the difference between the current local time and T.

   Unless otherwise specified by the usage, if a peer attempts to store
   data previously stored by another node (e.g., for replicas or
   topology shifts) and that store fails with either an
   Error_Generation_Counter_Too_Low or an Error_Data_Too_Old error, the
   peer MUST fetch the newer data from the peer generating the error and
   use that to replace its own copy.  This rule allows resynchronization
   after partitions heal.

   When a network partition is being healed and unless otherwise
   specified, the default merging rule is to act as if all the values
   that need to be merged were stored and as if the order they were
   stored in corresponds to the stored time values associated with (and
   carried in) their values.  Because the stored time values are those
   associated with the peer which did the writing, clock skew is

RFC6940 - Page 98

   generally not an issue.  If two nodes are on different partitions,
   write to the same location, and have clock skew, this can create
   merge conflicts.  However, because RELOAD deliberately segregates
   storage so that data from different users and peers is stored in
   different locations, and a single peer will typically only be in a
   single network partition, this case will generally not arise.

   The properties of stores for each data model are as follows:

   single-value:  A store of a new single-value element creates the
      element if it does not exist and overwrites any existing value
      with the new value.

   array:  A store of an array entry replaces (or inserts) the given
      value at the location specified by the index.  Because arrays are
      sparse, a store past the end of the array extends it with
      nonexistent values (exists = False) as required.  A store at index
      0xffffffff places the new value at the end of the array,
      regardless of the length of the array.  The resulting StoredData
      has the correct index value when it is subsequently fetched.

   dictionary:  A store of a dictionary entry replaces (or inserts) the
      given value at the location specified by the dictionary key.

RFC6940 - Page 99

   The following figure shows the relationship between these structures
   for an example store which stores the following values at resource
   "1234":

   o  The value "abc" is in the single-value location for Kind X.

   o  The value "foo" at index 0 is in the array for Kind Y.

   o  The value "bar" at index 1 is in the array for Kind Y.

                                     Store
                                resource=1234
                              replica_number = 0
                                   /      \
                                  /        \
                      StoreKindData        StoreKindData
                  kind=X (Single-Value)    kind=Y (Array)
                generation_counter = 99    generation_counter = 107
                           |                    /\
                           |                   /  \
                       StoredData             /    \
             storage_time = xxxxxxx          /      \
                   lifetime = 86400         /        \
                   signature = XXXX        /          \
                           |               |           |
                           |        StoredData       StoredData
                           |    storage_time =       storage_time =
                           |          yyyyyyyy       zzzzzzz
                           |  lifetime = 86400       lifetime = 33200
                           |  signature = YYYY       signature = ZZZZ
                           |               |           |
                    StoredDataValue        |           |
                     value="abc"           |           |
                                           |           |
                                  StoredDataValue  StoredDataValue
                                        index=0      index=1
                                     value="foo"    value="bar"

RFC6940 - Page 100

7.4.1.2.  Response Definition

   In response to a successful Store request, the peer MUST return a
   StoreAns message containing a series of StoreKindResponse elements,
   which contains the current value of the generation counter for each
   Kind-ID, as well as a list of the peers where the data will be
   replicated by the node processing the request.

        struct {
          KindId                  kind;
          uint64                  generation_counter;
          NodeId                  replicas<0..2^16-1>;
        } StoreKindResponse;

        struct {
          StoreKindResponse       kind_responses<0..2^16-1>;
        } StoreAns;

   The contents of each StoreKindResponse are:

   kind
      The Kind-ID being represented.

   generation_counter
      The current value of the generation counter for that Kind-ID.

   replicas
      The list of other peers at which the data was/will be replicated.
      In overlays and applications where the responsible peer is
      intended to store redundant copies, this allows the storing node
      to independently verify that the replicas have in fact been
      stored.  It does this verification by using the Stat method (see
      Section 7.4.3).  Note that the storing node is not required to
      perform this verification.

   The response itself is just StoreKindResponse values packed end to
   end.

   If any of the generation counters in the request precede the
   corresponding stored generation counter, then the peer MUST fail the
   entire request and respond with an Error_Generation_Counter_Too_Low
   error.  The error_info in the ErrorResponse MUST be a StoreAns
   response containing the correct generation counter for each Kind and
   the replica list, which will be empty.  For original (non-replica)
   stores, a node which receives such an error SHOULD attempt to fetch
   the data and, if the storage_time value is newer, replace its own
   data with that newer data.  This rule improves data consistency in
   the case of partitions and merges.

RFC6940 - Page 101

   If the data being stored is too large for the allowed limit by the
   given usage, then the peer MUST fail the request and generate an
   Error_Data_Too_Large error.

   If any type of request tries to access a data Kind that the peer does
   not know about, the peer MUST fail the request and generate an
   Error_Unknown_Kind error.  The error_info in the Error_Response is:

              KindId        unknown_kinds<0..2^8-1>;

   which lists all the Kinds that were unrecognized.  A node which
   receives this error MUST generate a ConfigUpdate message which
   contains the appropriate Kind definition (assuming which, in fact, a
   Kind which was defined in the configuration document was used).

7.4.1.3.  Removing Values

   RELOAD does not have an explicit Remove operation.  Rather, values
   are Removed by storing "nonexistent" values in their place.  Each
   DataValue contains a boolean value called "exists" which indicates
   whether a value is present at that location.  In order to effectively
   remove a value, the owner stores a new DataValue with "exists" set to
   False:

      exists = False

      value = {} (0 length)

   The owner SHOULD use a lifetime for the nonexistent value that is at
   least as long as the remainder of the lifetime of the value it is
   replacing.  Otherwise, it is possible for the original value to be
   accidentally or maliciously re-stored after the storing node has
   expired it.  Note that a window of vulnerability for replay attack
   still exists after the original lifetime has expired (as with any
   store).  This attack can be mitigated by doing a nonexistent store
   with a very long lifetime.

   Storing nodes MUST treat these nonexistent values the same way they
   treat any other stored value, including overwriting the existing
   value, replicating them, and aging them out as necessary when the
   lifetime expires.  When a stored nonexistent value's lifetime
   expires, it is simply removed from the storing node, as happens when
   any other stored value expires.

   Note that in the case of arrays and dictionaries, expiration may
   create an implicit, unsigned "nonexistent" value to represent a gap
   in the data structure, as might happen when any value is aged out.

RFC6940 - Page 102

   However, this value isn't persistent, nor is it replicated.  It is
   simply synthesized by the storing node.

7.4.2.  Fetch

   The Fetch request retrieves one or more data elements stored at a
   given Resource-ID.  A single Fetch request can retrieve multiple
   different Kinds.

7.4.2.1.  Request Definition

   Fetch requests are defined by the FetchReq structure:

        struct {
          int32            first;
          int32            last;
        } ArrayRange;

        struct {
          KindId                  kind;
          uint64                  generation;
          uint16                  length;

          select (DataModel) {
            case single_value: ;    /* Empty */

            case array:
                 ArrayRange       indices<0..2^16-1>;

            case dictionary:
                 DictionaryKey    keys<0..2^16-1>;

            /* This structure may be extended */

          } model_specifier;
        } StoredDataSpecifier;

        struct {
          ResourceId              resource;
          StoredDataSpecifier     specifiers<0..2^16-1>;
        } FetchReq;

   The contents of the Fetch requests are as follows:

   resource
      The Resource-ID to fetch from.

RFC6940 - Page 103

   specifiers
      A sequence of StoredDataSpecifier values, each specifying some of
      the data values to retrieve.

   Each StoredDataSpecifier specifies a single Kind of data to retrieve
   and, if appropriate, the subset of values that are to be retrieved.
   The contents of the StoredDataSpecifier structure are as follows:

   kind
      The Kind-ID of the data being fetched.  Implementations SHOULD
      reject requests corresponding to unknown Kinds unless specifically
      configured otherwise.

   DataModel
      The data model of the data.  This is not transmitted on the wire,
      but comes from the definition of the Kind.

   generation
      The last generation counter that the requesting node saw.  This
      may be used to avoid unnecessary fetches, or it may be set to
      zero.

   length
      The length of the rest of the structure, thus allowing
      extensibility.

   model_specifier
      A reference to the data value being requested within the data
      model specified for the Kind.  For instance, if the data model is
      "array", it might specify some subset of the values.

   The model_specifier is as follows:

   o  If the data model is single value, the specifier is empty.

   o  If the data model is array, the specifier contains a list of
      ArrayRange elements, each of which contains two integers.  The
      first integer is the beginning of the range, and the second is the
      end of the range.  0 is used to indicate the first element, and
      0xffffffff is used to indicate the final element.  The first
      integer MUST be less than or equal to the second.  While multiple
      ranges MAY be specified, they MUST NOT overlap.

   o  If the data model is dictionary, then the specifier contains a
      list of the dictionary keys being requested.  If no keys are
      specified, then this is a wildcard fetch and all key-value pairs
      are returned.

RFC6940 - Page 104

   The generation counter is used to indicate the requester's expected
   state of the storing peer.  If the generation counter in the request
   matches the stored counter, then the storing peer returns a response
   with no StoredData values.

7.4.2.2.  Response Definition

   The response to a successful Fetch request is a FetchAns message
   containing the data requested by the requester.

         struct {
           KindId                 kind;
           uint64                 generation;
           StoredData             values<0..2^32-1>;
         } FetchKindResponse;

         struct {
           FetchKindResponse      kind_responses<0..2^32-1>;
         } FetchAns;

   The FetchAns structure contains a series of FetchKindResponse
   structures.  There MUST be one FetchKindResponse element for each
   Kind-ID in the request.

   The contents of the FetchKindResponse structure are as follows:

   kind
      The Kind that this structure is for.

   generation
      The generation counter for this Kind.

   values
      The relevant values.  If the generation counter in the request
      matches the generation counter in the stored data, then no
      StoredData values are returned.  Otherwise, all relevant data
      values MUST be returned.  A nonexistent value (i.e., one which the
      node has no knowledge of) is represented by a synthetic value with
      "exists" set to False and has an empty signature.  Specifically,
      the identity_type is set to "none", the SignatureAndHashAlgorithm
      values are set to {0, 0} ("anonymous" and "none", respectively),
      and the signature value is of zero length.  This removes the need
      for the responding node to do signatures for values which do not
      exist.  These signatures are unnecessary, as the entire response
      is signed by that node.  Note that entries which have been removed
      by the procedure given in Section 7.4.1.3 and which have not yet
      expired also have exists = False, but have valid signatures from
      the node which did the store.

RFC6940 - Page 105

   Upon receipt of a FetchAns message, nodes MUST verify the signatures
   on all the received values.  Any values with invalid signatures
   (including expired certificates) MUST be discarded.  Note that this
   implies that implementations which wish to store data for long
   periods of time must have certificates with appropriate expiration
   dates or must re-store periodically.  Implementations MAY return the
   subset of values with valid signatures, but in that case, they SHOULD
   somehow signal to the application that a partial response was
   received.

   There is one subtle point about signature computation on arrays.  If
   the storing node uses the append feature (where the
   index=0xffffffff), then the index in the StoredData that is returned
   will not match that used by the storing node, which would break the
   signature.  In order to avoid this issue, the index value in the
   array is set to zero before the signature is computed.  This implies
   that malicious storing nodes can reorder array entries without being
   detected.

7.4.3.  Stat

   The Stat request is used to get metadata (length, generation counter,
   digest, etc.) for a stored element without retrieving the element
   itself.  The name is from the UNIX stat(2) system call, which
   performs a similar function for files in a file system.  It also
   allows the requesting node to get a list of matching elements without
   requesting the entire element.

7.4.3.1.  Request Definition

   The Stat request is identical to the Fetch request.  It simply
   specifies the elements to get metadata about.

        struct {
          ResourceId              resource;
          StoredDataSpecifier     specifiers<0..2^16-1>;
        } StatReq;

RFC6940 - Page 106

7.4.3.2.  Response Definition

   The Stat response contains the same sort of entries that a Fetch
   response would contain.  However, instead of containing the element
   data, it contains metadata.

        struct {
          Boolean                exists;
          uint32                 value_length;
          HashAlgorithm          hash_algorithm;
          opaque                 hash_value<0..255>;
        } MetaData;

        struct {
          uint32                 index;
          MetaData               value;
        } ArrayEntryMeta;

        struct {
          DictionaryKey          key;
          MetaData               value;
        } DictionaryEntryMeta;

        struct {
          select (DataModel) {
            case single_value:
              MetaData              single_value_entry;

            case array:
              ArrayEntryMeta        array_entry;

            case dictionary:
              DictionaryEntryMeta   dictionary_entry;

            /* This structure may be extended */
          };
        } MetaDataValue;

        struct {
          uint32                  value_length;
          uint64                  storage_time;
          uint32                  lifetime;
          MetaDataValue           metadata;
        } StoredMetaData;

RFC6940 - Page 107

        struct {
          KindId                 kind;
          uint64                 generation;
          StoredMetaData         values<0..2^32-1>;
        } StatKindResponse;

        struct {
          StatKindResponse      kind_responses<0..2^32-1>;
        } StatAns;

   The structures used in StatAns parallel those used in FetchAns: a
   response consists of multiple StatKindResponse values, one for each
   Kind that was in the request.  The contents of the StatKindResponse
   are the same as those in the FetchKindResponse, except that the
   values list contains StoredMetaData entries instead of StoredData
   entries.

   The contents of the StoredMetaData structure are the same as the
   corresponding fields in StoredData, except that there is no signature
   field and the value is a MetaDataValue rather than a StoredDataValue.

   A MetaDataValue is a variant structure, like a StoredDataValue,
   except for the types of each arm, which replace DataValue with
   MetaData.

   The only new structure is MetaData, which has the following contents:

   exists
      Same as in DataValue.

   value_length
      The length of the stored value.

   hash_algorithm
      The hash algorithm used to perform the digest of the value.

   hash_value
      A digest using hash_algorithm on the value field of the DataValue,
      including its 4 leading length bytes.

7.4.4.  Find

   The Find request can be used to explore the Overlay Instance.  A Find
   request for a Resource-ID R and a Kind-ID T retrieves the
   Resource-ID, if any, of the resource of Kind T known to the target
   peer which is closest to R.  This method can be used to walk the
   Overlay Instance by iteratively fetching R_n+1=nearest(1 + R_n).

RFC6940 - Page 108

7.4.4.1.  Request Definition

   The FindReq message contains a Resource-ID and a series of Kind-IDs
   identifying the resource the peer is interested in.

     struct {
       ResourceId                 resource;
       KindId                     kinds<0..2^8-1>;
     } FindReq;

   The request contains a list of Kind-IDs which the Find is for, as
   indicated below:

   resource
      The desired Resource-ID.

   kinds
      The desired Kind-IDs.  Each value MUST appear only once.
      Otherwise, the request MUST be rejected with an error.

7.4.4.2.  Response Definition

   A response to a successful Find request is a FindAns message
   containing the closest Resource-ID on the peer for each Kind
   specified in the request.

    struct {
      KindId                      kind;
      ResourceId                  closest;
    } FindKindData;

    struct {
      FindKindData                results<0..2^16-1>;
    } FindAns;

   If the processing peer is not responsible for the specified
   Resource-ID, it SHOULD return an Error_Not_Found error code.

   For each Kind-ID in the request, the response MUST contain a
   FindKindData indicating the closest Resource-ID for that Kind-ID,
   unless the Kind is not allowed to be used with Find, in which case a
   FindKindData for that Kind-ID MUST NOT be included in the response.
   If a Kind-ID is not known, then the corresponding Resource-ID MUST be
   0.  Note that different Kind-IDs may have different closest
   Resource-IDs.

RFC6940 - Page 109

   The response is simply a series of FindKindData elements, one per
   Kind, concatenated end to end.  The contents of each element are:

   kind
      The Kind-ID.

   closest
      The closest Resource-ID to the specified Resource-ID.  It is 0 if
      no Resource-ID is known.

   Note that the response does not contain the contents of the data
   stored at these Resource-IDs.  If the requester wants this, it must
   retrieve it using Fetch.

7.4.5.  Defining New Kinds

   There are two ways to define a new Kind.  The first is by writing a
   document and registering the Kind-ID with IANA.  This is the
   preferred method for Kinds which may be widely used and reused.  The
   second method is to simply define the Kind and its parameters in the
   Configuration Document using the section of Kind-ID space set aside
   for private use.  This method MAY be used to define ad hoc Kinds in
   new overlays.

   However a Kind is defined, the definition MUST include:

   o  The meaning of the data to be stored (in some textual form).

   o  The Kind-ID.

   o  The data model (single value, array, dictionary, etc.).

   o  The access control model.

   In addition, when Kinds are registered with IANA, each Kind is
   assigned a short string name which is used to refer to it in
   Configuration Documents.

   While each Kind needs to define what data model is used for its data,
   this does not mean that it must define new data models.  Where
   practical, Kinds should use the existing data models.  The intention
   is that the basic data model set be sufficient for most applications/
   usages.

RFC6940 - Page 110

8.  Certificate Store Usage

   The Certificate Store Usage allows a node to store its certificate in
   the overlay.

   A user/node MUST store its certificate at Resource-IDs derived from
   two Resource Names:

   o  The user name in the certificate.

   o  The Node-ID in the certificate.

   Note that in the second case, the certificate for a peer is not
   stored at its Node-ID but rather at a hash of its Node-ID.  The
   intention here (as is common throughout RELOAD) is to avoid making a
   peer responsible for its own data.

   New certificates are stored at the end of the list.  This structure
   allows users to store an old and a new certificate that both have the
   same Node-ID, which allows for migration of certificates when they
   are renewed.

   This usage defines the following Kinds:

   Name:  CERTIFICATE_BY_NODE

   Data Model:  The data model for CERTIFICATE_BY_NODE data is array.

   Access Control:  NODE-MATCH

   Name:  CERTIFICATE_BY_USER

   Data Model:  The data model for CERTIFICATE_BY_USER data is array.

   Access Control:  USER-MATCH

9.  TURN Server Usage

   The TURN Server Usage allows a RELOAD peer to advertise that it is
   prepared to be a TURN server, as defined in [RFC5766].  When a node
   starts up, it joins the overlay network and forms several connections
   in the process.  If the ICE stage in any of these connections returns
   a reflexive address that is not the same as the peer's perceived
   address, then the peer is behind a NAT and SHOULD NOT be a candidate
   for a TURN server.  Additionally, if the peer's IP address is in the
   private address space range as defined by [RFC1918], then it is also

RFC6940 - Page 111

   SHOULD NOT be a candidate for a TURN server.  Otherwise, the peer
   SHOULD assume that it is a potential TURN server and follow the
   procedures below.

   If the node is a candidate for a TURN server, it will insert some
   pointers in the overlay so that other peers can find it.  The overlay
   configuration file specifies a turn-density parameter that indicates
   how many times each TURN server SHOULD record itself in the overlay.
   Typically, this should be set to the reciprocal of the estimate of
   what percentage of peers will act as TURN servers.  If the turn-
   density is not set to zero, for each value, called d, between 1 and
   turn-density, the peer forms a Resource Name by concatenating its
   Node-ID and the value d.  This Resource Name is hashed to form a
   Resource-ID.  The address of the peer is stored at that Resource-ID
   using type TURN-SERVICE and the TurnServer object:

        struct {
          uint8                   iteration;
          IpAddressPort           server_address;
        } TurnServer;

   The contents of this structure are as follows:

   iteration
      The d value.

   server_address
      The address at which the TURN server can be contacted.

   Note:  Correct functioning of this algorithm depends on having turn-
      density be a reasonable estimate of the reciprocal of the
      proportion of nodes in the overlay that can act as TURN servers.
      If the turn-density value in the configuration file is too low,
      the process of finding TURN servers becomes more expensive, as
      multiple candidate Resource-IDs must be probed to find a TURN
      server.

   Peers that provide this service need to support the TURN extensions
   to STUN for media relay, as defined in [RFC5766].

   This usage defines the following Kind to indicate that a peer is
   willing to act as a TURN server:

   Name:  TURN-SERVICE

   Data Model:  The TURN-SERVICE Kind stores a single value for each
      Resource-ID.

RFC6940 - Page 112

   Access Control:  NODE-MULTIPLE, with a maximum iteration of counter
      20.

   Peers MAY find other servers by selecting a random Resource-ID and
   then doing a Find request for the appropriate Kind-ID with that
   Resource-ID.  The Find request gets routed to a random peer based on
   the Resource-ID.  If that peer knows of any servers, they will be
   returned.  The returned response may be empty if the peer does not
   know of any servers, in which case the process gets repeated with
   some other random Resource-ID.  As long as the ratio of servers
   relative to peers is not too low, this approach will result in
   finding a server relatively quickly.

   Note to implementers: The certificates used by TurnServer entries
   need to be retained, as described in Section 6.3.4.

10.  Chord Algorithm

   This algorithm is assigned the name CHORD-RELOAD to indicate that it
   is an adaptation of the basic Chord-based DHT algorithm.

   This algorithm differs from the Chord algorithm that was originally
   presented in [Chord].  It has been updated based on more recent
   research results and implementation experiences, and to adapt it to
   the RELOAD protocol.  Here is a short list of differences:

   o  The original Chord algorithm specified that a single predecessor
      and a successor list be stored.  The CHORD-RELOAD algorithm
      attempts to have more than one predecessor and successor.  The
      predecessor sets help other neighbors learn their successor list.

   o  The original Chord specification and analysis called for iterative
      routing.  RELOAD specifies recursive routing.  In addition to the
      performance implications, the cost of NAT traversal dictates
      recursive routing.

   o  Finger Table entries are indexed in the opposite order.  Original
      Chord specifies finger[0] as the immediate successor of the peer.
      CHORD-RELOAD specifies finger[0] as the peer 180 degrees around
      the ring from the peer.  This change was made to simplify
      discussion and implementation of variable-sized Finger Tables.
      However, with either approach, no more than O(log N) entries
      should typically be stored in a Finger Table.

   o  The stabilize() and fix_fingers() algorithms in the original Chord
      algorithm are merged into a single periodic process.
      Stabilization is implemented slightly differently because of the
      larger neighborhood, and fix_fingers is not as aggressive to

RFC6940 - Page 113

      reduce load, nor does it search for optimal matches of the Finger
      Table entries.

   o  RELOAD allows for a 128-bit hash instead of a 160-bit hash, as
      RELOAD is not designed to be used in networks with close to or
      more than 2^128 nodes or objects (and it is hard to see how one
      would assemble such a network).

   o  RELOAD uses randomized finger entries, as described in
      Section 10.7.4.2.

   o  The CHORD-RELOAD algorithm allows the use of either reactive or
      periodic recovery.  The original Chord paper used periodic
      recovery.  Reactive recovery provides better performance in small
      overlays, but is believed to be unstable in large overlays
      (greater than 1000) with high levels of churn
      [handling-churn-usenix04].  The overlay configuration file
      specifies a "chord-reactive" element that indicates whether
      reactive recovery should be used.

10.1.  Overview

   The algorithm described here, CHORD-RELOAD, is a modified version of
   the Chord algorithm.  In Chord (and in the algorithm described here),
   nodes are arranged in a ring, with node n being adjacent to nodes n-1
   and n+1 and with all arithmetic being done modulo 2^{k}, where k is
   the length of the Node-ID in bits, so that node 2^{k} - 1 is directly
   before node 0.

   Each peer keeps track of a Finger Table and a Neighbor Table.  The
   Neighbor Table contains at least the three peers before and after
   this peer in the DHT ring.  There may not be three entries in all
   cases, such as small rings or while the ring topology is changing.
   The first entry in the Finger Table contains the peer halfway around
   the ring from this peer, the second entry contains the peer that is
   1/4th of the way around, the third entry contains the peer that is
   1/8th of the way around, and so on.  Fundamentally, the Chord DHT can
   be thought of as a doubly linked list formed by knowing the
   successors and predecessor peers in the Neighbor Table, sorted by the
   Node-ID.  As long as the successor peers are correct, the DHT will
   return the correct result.  The pointers to the prior peers are kept
   to enable the insertion of new peers into the list structure.
   Keeping multiple predecessor and successor pointers makes it possible
   to maintain the integrity of the data structure even when consecutive
   peers simultaneously fail.  The Finger Table forms a skip list
   [wikiSkiplist] so that entries in the linked list can be found in
   O(log(N)) time instead of the typical O(N) time that a linked list
   would provide, where N represents the number of nodes in the DHT.

RFC6940 - Page 114

   The Neighbor Table and Finger Table entries contain logical Node-IDs
   as values, but the actual mapping of an IP level addressing
   information to reach that Node-ID is kept in the Connection Table.

   A peer, x, is responsible for a particular Resource-ID, k, if k is
   less than or equal to x and k is greater than p, where p is the
   Node-ID of the previous peer in the Neighbor Table.  Care must be
   taken when computing to note that all math is modulo 2^128.

10.2.  Hash Function

   For this Chord-based Topology Plug-in, the size of the Resource-ID is
   128 bits.  The hash of a Resource-ID MUST be computed using SHA-1
   [RFC3174], and then the SHA-1 result MUST be truncated to the most
   significant 128 bits.

10.3.  Routing

   The Routing Table is conceptually the union of the Neighbor Table and
   the Finger Table.

   If a peer is not responsible for a Resource-ID k, but is directly
   connected to a node with Node-ID k, then it MUST route the message to
   that node.  Otherwise, it MUST route the request to the peer in the
   Routing Table that has the largest Node-ID that is in the interval
   between the peer and k. If no such node is found, the peer finds the
   smallest Node-ID that is greater than k and MUST route the message to
   that node.

10.4.  Redundancy

   When a peer receives a Store request for Resource-ID k and it is
   responsible for Resource-ID k, it MUST store the data and return a
   success response.  It MUST then send a Store request to its successor
   in the Neighbor Table and to that peer's successor, incrementing the
   replica number for each successor.  Note that these Store requests
   are addressed to those specific peers, even though the Resource-ID
   they are being asked to store is outside the range that they are
   responsible for.  The peers receiving these SHOULD check that they
   came from an appropriate predecessor in their Neighbor Table and that
   they are in a range that this predecessor is responsible for.  Then,
   they MUST store the data.  They do not themselves perform further
   Stores, because they can determine that they are not responsible for
   the Resource-ID.

   Note that this Topology Plug-in does not use the replica number for
   purposes other than knowing the difference between a replica and a
   non-replica.

RFC6940 - Page 115

   Managing replicas as the overlay changes is described in
   Section 10.7.3.

   The sequential replicas used in this overlay algorithm protect
   against peer failure but not against malicious peers.  Additional
   replication from the Usage is required to protect resources from such
   attacks, as discussed in Section 13.5.4.

10.5.  Joining

   The join process for a Joining Node (JN) with Node-ID n is as
   follows:

   1.  JN MUST connect to its chosen bootstrap node, as specified in
       Section 11.4.

   2.  JN SHOULD send an Attach request to the Admitting Peer (AP) for
       Resource-ID n+1.  The "send_update" flag can be used to acquire
       the Routing Table of AP.

   3.  JN SHOULD send Attach requests to initiate connections to each of
       the peers in the Neighbor Table as well as to the desired peers
       in the Finger Table.  Note that this does not populate their
       Routing Tables, but only their Connection Tables, so JN will not
       get messages that it is expected to route to other nodes.

   4.  JN MUST enter into its Routing Table all the peers that it has
       successfully contacted.

   5.  JN MUST send a Join to AP.  The AP MUST send the response to the
       Join.

   6.  AP MUST do a series of Store requests to JN to store the data
       that JN will be responsible for.

   7.  AP MUST send JN an Update explicitly labeling JN as its
       predecessor.  At this point, JN is part of the ring and is
       responsible for a section of the overlay.  AP MAY now forget any
       data which is assigned to JN and not AP.  AP SHOULD NOT forget
       any data where AP is the replica set for the data.

   8.  The AP MUST send an Update to all of its neighbors (including JN)
       with the new values of its neighbor set (including JN).

   9.  JN MUST send Updates to all of the peers in its Neighbor Table.

RFC6940 - Page 116

   If JN sends an Attach to AP with send_update, it immediately knows
   most of its expected neighbors from AP's Routing Table update and MAY
   directly connect to them.  This is the RECOMMENDED procedure.

   If for some reason JN does not get AP's Routing Table, it MAY still
   populate its Neighbor Table incrementally.  It SHOULD send a Ping
   directed at Resource-ID n+1 (directly after its own Resource-ID).
   This allows JN to discover its own successor.  Call that node p0.  JN
   then SHOULD send a Ping to p0+1 to discover its successor (p1).  This
   process MAY be repeated to discover as many successors as desired.
   The values for the two peers before p will be found at a later stage,
   when n receives an Update.  An alternate procedure is to send
   Attaches to those nodes rather than Pings, which form the connections
   immediately, but may be slower if the nodes need to collect ICE
   candidates.

   In order to set up its i'th Finger Table entry, JN MUST send an
   Attach to peer n+2^(128-i).  This will be routed to a peer in
   approximately the right location around the ring.  (Note that the
   first entry in the Finger Table has i=1 and not i=0 in this
   formulation.)

   The Joining Node MUST NOT send any Update message placing itself in
   the overlay until it has successfully completed an Attach with each
   peer that should be in its Neighbor Table.

10.6.  Routing Attaches

   When a peer needs to Attach to a new peer in its Neighbor Table, it
   MUST source-route the Attach request through the peer from which it
   learned the new peer's Node-ID.  Source-routing these requests allows
   the overlay to recover from instability.

   All other Attach requests, such as those for new Finger
   Table entries, are routed conventionally through the overlay.

RFC6940 - Page 117

10.7.  Updates

   An Update for this DHT is defined as:

        enum { invalidChordUpdateType(0),
               peer_ready(1), neighbors(2), full(3), (255) }
             ChordUpdateType;

        struct {
           uint32                 uptime;
           ChordUpdateType        type;
           select (type){
            case peer_ready:                   /* Empty */
              ;

            case neighbors:
              NodeId              predecessors<0..2^16-1>;
              NodeId              successors<0..2^16-1>;

            case full:
              NodeId              predecessors<0..2^16-1>;
              NodeId              successors<0..2^16-1>;
              NodeId              fingers<0..2^16-1>;
          };
        } ChordUpdate;

   The "uptime" field contains the time this peer has been up in
   seconds.

   The "type" field contains the type of the update, which depends on
   the reason the update was sent.

   peer_ready
      This peer is ready to receive messages.  This message is used to
      indicate that a node which has Attached is a peer and can be
      routed through.  It is also used as a connectivity check to non-
      neighbor peers.

   neighbors
      This version is sent to members of the Chord Neighbor Table.

   full
      This version is sent to peers which request an Update with a
      RouteQueryReq.

RFC6940 - Page 118

   If the message is of type "neighbors", then the contents of the
   message will be:

   predecessors
      The predecessor set of the Updating peer.

   successors
      The successor set of the Updating peer.

   If the message is of type "full", then the contents of the message
   will be:

   predecessors
      The predecessor set of the Updating peer.

   successors
      The successor set of the Updating peer.

   fingers
      The Finger Table of the Updating peer, in numerically ascending
      order.

   A peer MUST maintain an association (via Attach) to every member of
   its neighbor set.  A peer MUST attempt to maintain at least three
   predecessors and three successors, even though this will not be
   possible if the ring is very small.  It is RECOMMENDED that O(log(N))
   predecessors and successors be maintained in the neighbor set.  There
   are many ways to estimate N, some of which are discussed in
   [DHT-RELOAD].

10.7.1.  Handling Neighbor Failures

   Every time a connection to a peer in the Neighbor Table is lost (as
   determined by connectivity pings or the failure of some request), the
   peer MUST remove the entry from its Neighbor Table and replace it
   with the best match it has from the other peers in its Routing Table.
   If using reactive recovery, the peer MUST send an immediate Update to
   all nodes in its Neighbor Table.  The update will contain all the
   Node-IDs of the current entries of the table (after the failed one
   has been removed).  Note that when replacing a successor, the peer
   SHOULD delay the creation of new replicas for the successor
   replacement hold-down time (30 seconds) after removing the failed
   entry from its Neighbor Table in order to allow a triggered update to
   inform it of a better match for its Neighbor Table.

   If the neighbor failure affects the peer's range of responsible IDs,
   then the Update MUST be sent to all nodes in its Connection Table.

RFC6940 - Page 119

   A peer MAY attempt to reestablish connectivity with a lost neighbor
   either by waiting additional time to see if connectivity returns or
   by actively routing a new Attach to the lost peer.  Details for these
   procedures are beyond the scope of this document.  In the case of an
   attempt to reestablish connectivity with a lost neighbor, the peer
   MUST be removed from the Neighbor Table.  Such a peer is returned to
   the Neighbor Table once connectivity is reestablished.

   If connectivity is lost to all successor peers in the Neighbor Table,
   then this peer SHOULD behave as if it is joining the network and MUST
   use Pings to find a peer and send it a Join.  If connectivity is lost
   to all the peers in the Finger Table, this peer SHOULD assume that it
   has been disconnected from the rest of the network, and it SHOULD
   periodically try to join the DHT.

10.7.2.  Handling Finger Table Entry Failure

   If a Finger Table entry is found to have failed (as determined by
   connectivity pings or the failure of some request), all references to
   the failed peer MUST be removed from the Finger Table and replaced
   with the closest preceding peer from the Finger Table or Neighbor
   Table.

   If using reactive recovery, the peer MUST initiate a search for a new
   Finger Table entry, as described below.

10.7.3.  Receiving Updates

   When a peer x receives an Update request, it examines the Node-IDs in
   the UpdateReq and at its Neighbor Table and decides if this UpdateReq
   would change its Neighbor Table.  This is done by taking the set of
   peers currently in the Neighbor Table and comparing them to the peers
   in the Update request.  There are two major cases:

   o  The UpdateReq contains peers that match x's Neighbor Table, so no
      change is needed to the neighbor set.

   o  The UpdateReq contains peers that x does not know about that
      should be in x's Neighbor Table; i.e., they are closer than
      entries in the Neighbor Table.

   In the first case, no change is needed.

   In the second case, x MUST attempt to Attach to the new peers, and if
   it is successful, it MUST adjust its neighbor set accordingly.  Note
   that x can maintain the now inferior peers as neighbors, but it MUST
   remember the closer ones.

RFC6940 - Page 120

   After any Pings and Attaches are done, if the Neighbor Table changes
   and the peer is using reactive recovery, the peer MUST send an Update
   request to each member of its Connection Table.  These Update
   requests are what end up filling in the predecessor/successor tables
   of peers that this peer is a neighbor to.  A peer MUST NOT enter
   itself in its successor or predecessor table and instead should leave
   the entries empty.

   If peer x is responsible for a Resource-ID R and x discovers that the
   replica set for R (the next two nodes in its successor set) has
   changed, it MUST send a Store for any data associated with R to any
   new node in the replica set.  It SHOULD NOT delete data from peers
   which have left the replica set.

   When peer x detects that it is no longer in the replica set for a
   resource R (i.e., there are three predecessors between x and R), it
   SHOULD delete all data associated with R from its local store.

   When a peer discovers that its range of responsible IDs has changed,
   it MUST send an Update to all entries in its Connection Table.

10.7.4.  Stabilization

   There are four components to stabilization:

   1.  Exchange Updates with all peers in its Neighbor Table to exchange
       state.

   2.  Search for better peers to place in its Finger Table.

   3.  Search to determine if the current Finger Table size is
       sufficiently large.

   4.  Search to determine if the overlay has partitioned and needs to
       recover.

10.7.4.1.  Updating the Neighbor Table

   A peer MUST periodically send an Update request to every peer in its
   Neighbor Table.  The purpose of this is to keep the predecessor and
   successor lists up to date and to detect failed peers.  The default
   time is about every ten minutes, but the configuration server SHOULD
   set this in the Configuration Document using the "chord-update-
   interval" element (denominated in seconds).  A peer SHOULD randomly
   offset these Update requests so they do not occur all at once.

RFC6940 - Page 121

10.7.4.2.  Refreshing the Finger Table

   A peer MUST periodically search for new peers to replace invalid
   entries in the Finger Table.  For peer x, the i'th Finger Table entry
   is valid if it is in the range [ x+2^( 128-i ),
   x+2^( 128-(i-1) )-1 ].  Invalid entries occur in the Finger
   Table when a previous Finger Table entry has failed or when no peer
   has been found in that range.

   Two possible methods for searching for new peers for the Finger
   Table entries are presented:

   Alternative 1: A peer selects one entry in the Finger Table from
   among the invalid entries.  It pings for a new peer for that Finger
   Table entry.  The selection SHOULD be exponentially weighted to
   attempt to replace earlier (lower i) entries in the Finger Table.  A
   simple way to implement this selection is to search through the
   Finger Table entries from i=1, and each time an invalid entry is
   encountered, send a Ping to replace that entry with probability 0.5.

   Alternative 2: A peer monitors the Update messages received from its
   connections to observe when an Update indicates a peer that would be
   used to replace an invalid Finger Table entry, i, and flags that
   entry in the Finger Table.  Every "chord-ping-interval" seconds, the
   peer selects from among those flagged candidates using an
   exponentially weighted probability, as above.

   When searching for a better entry, the peer SHOULD send the Ping to a
   Node-ID selected randomly from that range.  Random selection is
   preferred over a search for strictly spaced entries to minimize the
   effect of churn on overlay routing [minimizing-churn-sigcomm06].  An
   implementation or subsequent specification MAY choose a method for
   selecting Finger Table entries other than choosing randomly within
   the range.  Any such alternate methods SHOULD be employed only on
   Finger Table stabilization and not for the selection of initial
   Finger Table entries unless the alternative method is faster and
   imposes less overhead on the overlay.

   A peer SHOULD NOT send Ping requests looking for new finger table
   entries more often than the configuration element "chord-ping-
   interval", which defaults to 3600 seconds (one per hour).

   A peer MAY choose to keep connections to multiple peers that can act
   for a given Finger Table entry.

RFC6940 - Page 122

10.7.4.3.  Adjusting Finger Table Size

   If the Finger Table has fewer than 16 entries, the node SHOULD
   attempt to discover more fingers to grow the size of the table to 16.
   The value 16 was chosen to ensure high odds of a node maintaining
   connectivity to the overlay even with strange network partitions.

   For many overlays, 16 Finger Table entries will be enough, but as an
   overlay grows very large, more than 16 entries may be required in the
   Finger Table for efficient routing.  An implementation SHOULD be
   capable of increasing the number of entries in the Finger Table to
   128 entries.

   Although log(N) entries are all that are required for optimal
   performance, careful implementation of stabilization will result in
   no additional traffic being generated when maintaining a Finger
   Table larger than log(N) entries.  Implementers are encouraged to
   make use of RouteQuery and algorithms for determining where new
   Finger Table entries may be found.  Complete details of possible
   implementations are outside the scope of this specification.

   A simple approach to sizing the Finger Table is to ensure that the
   Finger Table is large enough to contain at least the final successor
   in the peer's Neighbor Table.

10.7.4.4.  Detecting Partitioning

   To detect that a partitioning has occurred and to heal the overlay, a
   peer P MUST periodically repeat the discovery process used in the
   initial join for the overlay to locate an appropriate bootstrap node,
   B.  P SHOULD then send a Ping for its own Node-ID routed through B.
   If a response is received from peer S', which is not P's successor,
   then the overlay is partitioned and P SHOULD send an Attach to S'
   routed through B, followed by an Update sent to S'.  (Note that S'
   may not be in P's Neighbor Table once the overlay is healed, but the
   connection will allow S' to discover appropriate neighbor entries for
   itself via its own stabilization.)

   Future specifications may describe alternative mechanisms for
   determining when to repeat the discovery process.

RFC6940 - Page 123

10.8.  Route Query f.in 3

       For CHORD-RELOAD, the RouteQueryReq contains no additional
       information.  The RouteQueryAns contains the single Node-ID of
       the next peer to which the responding peer would have routed the
       request message in recursive routing:

      struct {
         NodeId                  next_peer;
      } ChordRouteQueryAns;

   The contents of this structure are as follows:

   next_peer
      The peer to which the responding peer would route the message in
      order to deliver it to the destination listed in the request.

   If the requester has set the send_update flag, the responder SHOULD
   initiate an Update immediately after sending the RouteQueryAns.

10.9.  Leaving

   To support extensions, such as [DHT-RELOAD], peers SHOULD send a
   Leave request to all members of their Neighbor Table before exiting
   the Overlay Instance.  The overlay_specific_data field MUST contain
   the ChordLeaveData structure, defined below:

              enum { invalidChordLeaveType(0),
                      from_succ(1), from_pred(2), (255) }
                    ChordLeaveType;

               struct {
                 ChordLeaveType         type;

                  select (type) {
                    case from_succ:
                      NodeId            successors<0..2^16-1>;

                    case from_pred:
                      NodeId           predecessors<0..2^16-1>;
                  };
               } ChordLeaveData;

RFC6940 - Page 124

   The "type" field indicates whether the Leave request was sent by a
   predecessor or a successor of the recipient:

   from_succ
      The Leave request was sent by a successor.

   from_pred
      The Leave request was sent by a predecessor.

   If the type of the request is "from_succ", the contents will be:

   successors
      The sender's successor list.

   If the type of the request is "from_pred", the contents will be:

   predecessors
      The sender's predecessor list.

   Any peer which receives a Leave for a peer n in its neighbor set MUST
   follow procedures as if it had detected a peer failure as described
   in Section 10.7.1.

(page 124 continued on part 6)