  1. Core Server
  2. SERVER-67492

Failed chunk migrations can lead recipient shard to have divergent config.transactions records between primary and secondaries

    • Fully Compatible
    • ALL
    • v6.0, v5.0, v4.4, v4.2
    • Sharding 2022-07-25, Sharding 2022-08-08
    • 13
    • 3

      As part of chunk migration, the recipient shard writes op=n no-op oplog entries and updates its config.transactions records to account for retryable writes and transactions touching the range being migrated which were previously run on the donor shard. This procedure ensures a retryable write still cannot be performed a second time even after the chunk has been migrated and the retries are now targeted to the recipient shard.

      A retryable findAndModify command stores, depending on the MongoDB version, a copy of the preImage or postImage document either (a) in an op=n no-op oplog entry or (b) in the config.image_collection collection on the donor shard. The recipient shard always writes an op=n no-op oplog entry containing the preImage or postImage document. One property of the op=n no-op oplog entries written by the recipient shard is that they always have their 'o2' field filled in. In particular, for a preImage or postImage document the 'o2' field will be an empty BSONObj.

      Notably, the primary of the recipient shard skips updating its config.transactions record when writing the op=n no-op oplog entry containing the preImage or postImage document. Instead, the op=n no-op oplog entry encapsulating the originating update or delete for the preImage or postImage document causes the config.transactions record on the primary of the recipient shard to be updated. However, the SessionUpdateTracker class, which secondaries use to update the config.transactions record during secondary oplog application, does not behave symmetrically. A secondary of the recipient shard updates its config.transactions record when processing the op=n no-op oplog entry containing the preImage or postImage document, because those oplog entries, when written by session migration, do have an 'o2' field.

      // Ignore pre/post image no-op oplog entries. These entries will not have an o2 field.
      if (entry.getOpType() == OpTypeEnum::kNoop) {
          if (!entry.getFromMigrate() || !*entry.getFromMigrate()) {
              return {};
          }
      
          if (!entry.getObject2()) {
              return {};
          }
      }
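
      The asymmetry described above can be illustrated with a small standalone model. This is only a sketch: OplogEntryModel, secondaryUpdatesSessionRecord, and primaryUpdatesSessionRecord are hypothetical names that do not exist in the server, the 'o2' field is represented as plain text, and every entry is assumed to carry an lsid and txnNumber. The secondary-side function mirrors the check quoted above; the primary-side function encodes the behavior described in this ticket, namely that the image no-op written by session migration is skipped.

```cpp
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an oplog entry (not the server's OplogEntry class).
enum class OpType { kNoop, kInsert, kUpdate, kDelete };

struct OplogEntryModel {
    OpType opType;
    std::optional<bool> fromMigrate;  // absent unless the entry was written by a migration
    std::optional<std::string> o2;    // textual form of the 'o2' field, if present
};

// Mirrors the quoted check: true when a secondary would go on to update
// its config.transactions record for this entry.
bool secondaryUpdatesSessionRecord(const OplogEntryModel& e) {
    if (e.opType == OpType::kNoop) {
        if (!e.fromMigrate || !*e.fromMigrate) {
            return false;  // ordinary no-op, ignored
        }
        if (!e.o2) {
            return false;  // no 'o2' field, treated as a pre/post image entry
        }
    }
    return true;
}

// Assumed model of the primary: the pre/post image no-op written by session
// migration has an empty 'o2' and does not update config.transactions.
bool primaryUpdatesSessionRecord(const OplogEntryModel& e) {
    const bool migratedImage = e.opType == OpType::kNoop &&
        e.fromMigrate.value_or(false) && e.o2 && *e.o2 == "{}";
    if (migratedImage) {
        return false;
    }
    return secondaryUpdatesSessionRecord(e);
}
```

      For an entry shaped like the migrated image no-op (op=n, fromMigrate: true, o2: {}), the secondary-side function returns true while the primary-side function returns false, which is the mismatch this ticket describes.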
      

      This bug does not enable retryable writes to be executed more than once. This is because the only way for the primary of the recipient shard to skip updating its config.transactions record is if the chunk migration ends up failing for some reason and the recipient shard never processes the oplog entry of the originating update or delete for the preImage or postImage document. However, the chunk migration failing means the range still belongs to the donor shard and so any retries will continue to be targeted to the donor shard, which will correctly not execute the retryable write more than once.
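
      The failure scenario in the preceding paragraph can be sketched as a replay over the no-op entries the recipient managed to write before the migration stopped. The types and functions here (MigratedNoop, MigratedEntry, SessionRecord, replay, diverges) are hypothetical and only model the behavior described in this ticket: the primary ignores the image entry, while a secondary records the txnNumber for every migrated entry.

```cpp
#include <optional>
#include <vector>

// Hypothetical model of the two kinds of no-op entries written by session migration.
enum class MigratedNoop { kImage, kOriginatingOp };

struct MigratedEntry {
    MigratedNoop kind;
    long long txnNumber;
};

// Last txnNumber recorded in config.transactions for the session, if any.
using SessionRecord = std::optional<long long>;

// Replays the entries that were written before the migration stopped.
// Assumption from the ticket text: the primary skips the image entry,
// while a secondary updates its record for every migrated entry.
SessionRecord replay(const std::vector<MigratedEntry>& written, bool isPrimary) {
    SessionRecord record;
    for (const auto& entry : written) {
        if (isPrimary && entry.kind == MigratedNoop::kImage) {
            continue;
        }
        record = entry.txnNumber;
    }
    return record;
}

// True when the primary and a secondary end up with different records.
bool diverges(const std::vector<MigratedEntry>& written) {
    return replay(written, true) != replay(written, false);
}
```

      In this model, a migration that fails after writing only the image entry leaves the primary with no record and the secondary with one. A migration that also writes the entry for the originating update or delete leaves both with the same txnNumber.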


      It can be helpful to see an example of what the oplog entries look like before and after session migration. The following oplog entries from the donor shard

      { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { _id: 0.0, x: 10.0, y: 2.0 }, stmtId: 0, ts: Timestamp(0, 0), t: -1, v: 2, wall: new Date(1656027741947), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
      { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "u", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $v: 2, diff: { i: { y: 2.0 } } }, o2: { x: 10.0, _id: 0.0 }, needsRetryImage: "postImage", stmtId: 0, ts: Timestamp(1656027741, 109), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
      

      are transformed into the following oplog entries on the recipient shard.

      { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { _id: 0.0, x: 10.0, y: 2.0 }, o2: {}, stmtId: 0, fromMigrate: true, ts: Timestamp(0, 0), t: 1, v: 2, wall: new Date(1656027741947), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
      { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $sessionMigrateInfo: 1 }, o2: { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "u", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $v: 2, diff: { i: { y: 2.0 } } }, o2: { x: 10.0, _id: 0.0 }, needsRetryImage: "postImage", stmtId: 0, ts: Timestamp(1656027741, 109), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }, postImageOpTime: { ts: Timestamp(1656027741, 121), t: 1 }, stmtId: 0, fromMigrate: true, ts: Timestamp(0, 0), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
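
      The shape of this transformation, as it appears in the entries above, can be sketched as follows. This is not the server's session migration code: EntryModel, migrateImageEntry, and migrateOriginatingEntry are hypothetical names, documents are carried as plain text, and fields such as lsid, ts, and postImageOpTime are omitted for brevity.

```cpp
#include <optional>
#include <string>

// Hypothetical, text-based stand-in for the oplog entry fields of interest.
struct EntryModel {
    std::string op;                 // "n", "u", "d", ...
    long long txnNumber;
    int stmtId;
    std::string o;                  // textual form of the 'o' field
    std::optional<std::string> o2;  // textual form of the 'o2' field, if present
    bool fromMigrate = false;
};

// Image entry: the document stays in 'o', 'o2' becomes an empty object,
// and the entry is marked as written by a migration.
EntryModel migrateImageEntry(const EntryModel& donorImage) {
    EntryModel recipient = donorImage;
    recipient.o2 = "{}";
    recipient.fromMigrate = true;
    return recipient;
}

// Originating update or delete: rewritten as an op=n entry whose 'o' is the
// $sessionMigrateInfo marker and whose 'o2' holds the donor entry itself.
EntryModel migrateOriginatingEntry(const EntryModel& donorOp, const std::string& donorOpText) {
    EntryModel recipient;
    recipient.op = "n";
    recipient.txnNumber = donorOp.txnNumber;
    recipient.stmtId = donorOp.stmtId;
    recipient.o = "{ $sessionMigrateInfo: 1 }";
    recipient.o2 = donorOpText;
    recipient.fromMigrate = true;
    return recipient;
}
```

      Both resulting entries are op=n, carry fromMigrate: true, and have an 'o2' field, which is why a check based only on the presence of 'o2' cannot tell the image entry apart from the entry for the originating operation.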
      

            Assignee:
            abdul.qadeer@mongodb.com Abdul Qadeer
            Reporter:
            max.hirschhorn@mongodb.com Max Hirschhorn