  1. Core Server
  2. SERVER-75802

Data provenance collection

    • Type: New Feature
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Replication

      This is a request to create a collection that helps trace where the data in a MongoDB "dbpath" originated. This is useful when doing forensics on validation results in a replica set, where invalid data on one node can affect the validation results of other nodes.

      The collection must remain small enough that it never needs to be truncated. Each document is estimated at roughly 300 bytes [1], with tens of documents created per year.

      A document will be written at startup iff the most recent "version" of the provenance document does not match the version that would otherwise be created.
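
The startup rule can be sketched as a field-by-field comparison between the newest stored document and the document the node would write now. This is only an illustration: `needsNewProvenanceDoc`, `comparableFields`, and `IGNORED_FIELDS` are hypothetical names, plain JavaScript objects with string values stand in for BSON documents, and excluding `_id` and `date` from the "version" is an assumption, not something the ticket specifies.

```javascript
// Hypothetical sketch: decide at startup whether a new provenance document is needed.
// Assumption: `_id` and `date` differ for every document, so they are not part of
// the "version" being compared.
const IGNORED_FIELDS = new Set(["_id", "date"]);

// Return the sorted names of the fields that take part in the comparison.
function comparableFields(doc) {
  return Object.keys(doc)
    .filter((key) => !IGNORED_FIELDS.has(key))
    .sort();
}

// `latest` is the most recent stored document, or null if the collection is empty.
// `candidate` is the document that would be written for the current startup.
function needsNewProvenanceDoc(latest, candidate) {
  if (latest === null) return true;
  const a = comparableFields(latest);
  const b = comparableFields(candidate);
  if (a.length !== b.length) return true;
  // Values are compared with ===, which is enough for the string stand-ins used here.
  return a.some((key, i) => key !== b[i] || latest[key] !== candidate[key]);
}

const stored = { _id: 1, date: "t0", datafiles: "uuid-A", binaryVersion: "7.1.12-rc1" };
console.log(needsNewProvenanceDoc(stored, { datafiles: "uuid-A", binaryVersion: "7.1.12-rc1" }));
console.log(needsNewProvenanceDoc(stored, { datafiles: "uuid-A", binaryVersion: "8.0" }));
```

Under these assumptions a plain restart yields no write, while a binary upgrade does.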

      Additional documents can also be written after startup as runtime events occur, which may be accompanied by different data formats or schemas being written to disk.

      The schema of a document:

      {
          _id: ObjectId(),
          date: Date(), // Technically redundant with `_id`.
          datafiles: UUID(), // Generated fresh for the first document. Subsequent updates will copy this UUID over.
          syncedFromId: ObjectId(), // A copy of the `_id` field from a logical initial sync source.
          syncedFromDatafiles: UUID(), // A copy of the `datafiles` field from a logical initial sync source. This is redundant with `syncedFromId` when the originating sync source still exists.
          binaryVersion: "7.1.12-rc1",
          fcv: "7.0", // The logical FCV value as read from the fcv collection.
          hostIdentifier: "<hostname>:<port>", // The hostname in a replica set config that the node identifies itself as.
          architecture: "arm", // x86, arm, ...?
      }
      

      An example lifetime:

      • Mongod starts up for the first time. The `admin.system.dataProvenance` collection is empty.
        • Create the collection with:
          {
              _id: ObjectId(),
              date: Date(),
              datafiles: UUID(),
              binaryVersion: "7.1.12-rc1",
              architecture: "x86",
              fcv: ??? // Unsure if we postpone creating this for a fresh instance.
          }
          
      • A replica set primary reaches out and gives this new node a replica set config.
        • The node recognizes itself as part of a replica set.
        • Writes a new document: <previous document> + `hostIdentifier: <hostname:port>`
      • The node does logical initial sync from <syncSource>:
        • Writes out a new document <previous document> + `syncedFromId` + `syncedFromDatafiles`
      • During initial sync, the node learns of FCV
        • Writes out a new document <previous document> + `fcv`
      • Initial sync completes
      • The node restarts
        • Nothing has changed, no new data provenance documents
      • The node restarts in standalone mode
        • Nothing has changed, no new data provenance documents
      • The node restarts in replica set mode but with a port that does not match
        • We didn't identify ourselves in the replica set config, so we won't accept writes. No new data provenance documents.
      • The node restarts with binaryVersion: "8.0"
        • Write a new document with the updated `binaryVersion`
      • FCV is upgraded to "8.0"
        • Write a new document with the new FCV.
        • Maybe we're interested in the intermediate "upgrading to 8.0" states.
      • Node is shut down. Atlas copies data files to arm.
        • Write a new document with the updated `architecture`.
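
The lifetime above can be walked through with a small in-memory simulation. This is a sketch under stated assumptions: `recordChange` is a hypothetical helper, an array stands in for `admin.system.dataProvenance`, strings stand in for ObjectId and UUID values, and the host name and identifiers are placeholders.

```javascript
// Hypothetical in-memory stand-in for the `admin.system.dataProvenance` collection.
// Appends "<previous document> + changes" only when at least one value differs.
function recordChange(history, changes) {
  const previous = history.length > 0 ? history[history.length - 1] : {};
  const changed = Object.keys(changes).some((key) => previous[key] !== changes[key]);
  if (!changed) return false;
  // Carry every field forward except the per-document `_id` and `date`.
  const { _id, date, ...carried } = previous;
  history.push({
    _id: history.length + 1, // placeholder for ObjectId()
    date: new Date().toISOString(),
    ...carried,
    ...changes,
  });
  return true;
}

const history = [];
// First startup: the collection is empty.
recordChange(history, { datafiles: "uuid-A", binaryVersion: "7.1.12-rc1", architecture: "x86" });
// The node finds itself in a replica set config.
recordChange(history, { hostIdentifier: "host1.example:27017" });
// Logical initial sync from a sync source.
recordChange(history, { syncedFromId: "source-doc-id", syncedFromDatafiles: "uuid-S" });
// The node learns the FCV during initial sync.
recordChange(history, { fcv: "7.0" });
// Plain restart: nothing changed, so nothing is written.
recordChange(history, { binaryVersion: "7.1.12-rc1" });
// Binary upgrade, FCV upgrade, then a move to a different architecture.
recordChange(history, { binaryVersion: "8.0" });
recordChange(history, { fcv: "8.0" });
recordChange(history, { architecture: "arm" });

console.log(history.length); // 7
```

In this sketch the `datafiles` value written at first startup is carried into every later document, which is what lets the chain of documents be traced back to a single set of data files.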

      Note that it would be great if logical initial sync was made to be aware of this document. But there's less need for physical initial sync to have special knowledge. When a physical initial sync swaps out its underlying data files, it will include the data provenance collection from the source node. And it will write out a new document because the `hostIdentifier` will have changed.

      Ideally the new data provenance document that includes the `hostIdentifier` change is written out right after a physical initial sync completes. But it would be acceptable to ignore that change until a restart takes place if it were convenient to do so.

      [1]

      > Object.bsonsize({
      ...     _id: ObjectId(),
      ...     date: Date(), // Technically redundant with `_id`.
      ...     datafiles: UUID(), // Generated fresh for the first document. Subsequent updates will copy this UUID over.
      ...     syncedFromId: ObjectId(), // A copy of the `_id` field from a logical initial sync source.
      ...     syncedFromDatafiles: UUID(), // A copy of the `datafiles` field from a logical initial sync source. This is redundant with `syncedFromId` when the originating sync source still exists.
      ...     binaryVersion: "7.1.12-rc1",
      ...     fcv: "7.0", // The logical FCV value as read from the fcv collection.
      ...     hostIdentifier: "fruits-apple-07.ec2-102-40-28-101.mongodb.com:37001", // The hostname in a replica set config that the node identifies itself as.
      ...     architecture: "arm", // x86, arm, ...?
      ... })
      309
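
The 309-byte figure can be reproduced by hand from the BSON element layout. The sketch below assumes the `date` value was serialized as a 39-character string, since calling `Date()` without `new` returns a string in JavaScript; this is an inference from the arithmetic rather than something stated in the ticket, and the timestamp used is an illustrative placeholder. The helper names (`cstring`, `bsonSize`, `provenanceFields`) are hypothetical.

```javascript
// Hand-computed BSON sizes. Each element is:
//   1 type byte + field name as a C string (name + NUL) + value bytes.
const utf8Length = (s) => new TextEncoder().encode(s).length;
const cstring = (s) => utf8Length(s) + 1;

const OBJECT_ID = 12;                       // 12 raw bytes
const UUID_BINARY = 4 + 1 + 16;             // int32 length + subtype byte + 16 bytes
const BSON_DATE = 8;                        // int64 milliseconds since the epoch
const stringValue = (s) => 4 + cstring(s);  // int32 length + bytes + NUL

// Document = 4-byte length prefix + elements + 1-byte terminator.
function bsonSize(fields) {
  return fields.reduce((sum, [name, valueSize]) => sum + 1 + cstring(name) + valueSize, 4 + 1);
}

// Field list from the shell session above; only the size of `date` varies.
function provenanceFields(dateSize) {
  return [
    ["_id", OBJECT_ID],
    ["date", dateSize],
    ["datafiles", UUID_BINARY],
    ["syncedFromId", OBJECT_ID],
    ["syncedFromDatafiles", UUID_BINARY],
    ["binaryVersion", stringValue("7.1.12-rc1")],
    ["fcv", stringValue("7.0")],
    ["hostIdentifier", stringValue("fruits-apple-07.ec2-102-40-28-101.mongodb.com:37001")],
    ["architecture", stringValue("arm")],
  ];
}

// Placeholder 39-character timestamp in the style `Date()` returns as a string.
const dateAsString = stringValue("Tue Apr 04 2023 10:00:00 GMT-0400 (EDT)");
const sizeWithStringDate = bsonSize(provenanceFields(dateAsString));
const sizeWithBsonDate = bsonSize(provenanceFields(BSON_DATE));

console.log(sizeWithStringDate, sizeWithBsonDate); // 309 273
```

If the field were stored as a true BSON date (for example via `new Date()`), the same document would come to 273 bytes under this calculation, which still fits the rough 300-byte estimate.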
      

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            daniel.gottlieb@mongodb.com Daniel Gottlieb (Inactive)