[SERVER-67492] Failed chunk migrations can lead recipient shard to have divergent config.transactions records between primary and secondaries Created: 23/Jun/22  Updated: 29/Oct/23  Resolved: 02/Aug/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.2.0, 4.4.0, 5.0.0, 6.0.0-rc11
Fix Version/s: 4.4.17, 6.0.1, 5.0.11, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Abdul Qadeer
Resolution: Fixed Votes: 0
Labels: neweng, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-36004 SessionUpdateTracker should ignore no... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0, v5.0, v4.4, v4.2
Sprint: Sharding 2022-07-25, Sharding 2022-08-08
Participants:
Linked BF Score: 13
Story Points: 3

 Description   

As part of chunk migration, the recipient shard writes op=n no-op oplog entries and updates its config.transactions records to account for retryable writes and transactions touching the range being migrated which were previously run on the donor shard. This procedure ensures a retryable write still cannot be performed a second time even after the chunk has been migrated and the retries are now targeted to the recipient shard.

A retryable findAndModify command stores, depending on the MongoDB version, a copy of the preImage or postImage document either (a) in an op=n no-op oplog entry or (b) in the config.image_collection collection on the donor shard. The recipient shard always writes an op=n no-op oplog entry containing the preImage or postImage document. One property of the op=n no-op oplog entries written by the recipient shard is that they always have the 'o2' field filled in. In particular, for a preImage or postImage document the 'o2' field will be an empty BSONObj.

Notably, the primary of the recipient shard skips updating its config.transactions record when writing the op=n no-op oplog entry containing the preImage or postImage document. Instead the op=n no-op oplog entry encapsulating the originating update or delete for the preImage or postImage document will cause the config.transactions record on the primary of the recipient shard to be updated. However, the SessionUpdateTracker class used by secondaries to update the config.transactions record as part of secondary oplog application doesn't have symmetric behavior. A secondary of the recipient shard will update its config.transactions record when processing the op=n no-op oplog entry containing the preImage or postImage document because those oplog entries when written by session migration do have an 'o2' field.

// Ignore pre/post image no-op oplog entries. These entries will not have an o2 field.
if (entry.getOpType() == OpTypeEnum::kNoop) {
    if (!entry.getFromMigrate() || !*entry.getFromMigrate()) {
        return {};
    }
 
    if (!entry.getObject2()) {
        return {};
    }
}
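
The asymmetry can be modeled in isolation. The following is a minimal sketch, not the server code: NoopEntry, migratedImage, migratedWrite, updatesTxnTableBefore and updatesTxnTableAfter are hypothetical names, and the "after" predicate is only an assumption about the shape of the fix, inferred from the commit message "Skip updating txn table for op=n when o2 is empty".

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an op=n oplog entry; not the server's OplogEntry.
struct NoopEntry {
    std::optional<bool> fromMigrate;
    // std::nullopt models an absent 'o2' field; an empty map models 'o2: {}'.
    std::optional<std::map<std::string, std::string>> o2;
};

// Pre/post image entry as written by session migration: fromMigrate is true, o2 is {}.
NoopEntry migratedImage() {
    NoopEntry e;
    e.fromMigrate = true;
    e.o2 = std::map<std::string, std::string>{};
    return e;
}

// Entry wrapping the originating write: o2 carries the donor's oplog entry.
NoopEntry migratedWrite() {
    NoopEntry e;
    e.fromMigrate = true;
    e.o2 = std::map<std::string, std::string>{{"op", "u"}};
    return e;
}

// Mirrors the check quoted above: only a missing o2 is treated as a pre/post image.
bool updatesTxnTableBefore(const NoopEntry& e) {
    if (!e.fromMigrate || !*e.fromMigrate) return false;
    if (!e.o2) return false;
    return true;
}

// Assumed shape of the fix: an o2 that is present but empty is skipped as well.
bool updatesTxnTableAfter(const NoopEntry& e) {
    if (!e.fromMigrate || !*e.fromMigrate) return false;
    if (!e.o2 || e.o2->empty()) return false;
    return true;
}
```

Under this model, the migrated image entry triggers a config.transactions update with the first predicate but not with the second, while the entry wrapping the originating write triggers an update with both.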

This bug does not enable retryable writes to be executed more than once. This is because the only way for the primary of the recipient shard to skip updating its config.transactions record is if the chunk migration ends up failing for some reason and the recipient shard never processes the oplog entry of the originating update or delete for the preImage or postImage document. However, the chunk migration failing means the range still belongs to the donor shard and so any retries will continue to be targeted to the donor shard, which will correctly not execute the retryable write more than once.


It can be helpful to see an example of what the oplog entries look like before and after session migration. The following oplog entries from the donor shard

{ lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { _id: 0.0, x: 10.0, y: 2.0 }, stmtId: 0, ts: Timestamp(0, 0), t: -1, v: 2, wall: new Date(1656027741947), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
{ lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "u", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $v: 2, diff: { i: { y: 2.0 } } }, o2: { x: 10.0, _id: 0.0 }, needsRetryImage: "postImage", stmtId: 0, ts: Timestamp(1656027741, 109), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }

are transformed into the following oplog entries on the recipient shard.

{ lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { _id: 0.0, x: 10.0, y: 2.0 }, o2: {}, stmtId: 0, fromMigrate: true, ts: Timestamp(0, 0), t: 1, v: 2, wall: new Date(1656027741947), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }
{ lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "n", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $sessionMigrateInfo: 1 }, o2: { lsid: { id: UUID("809ef603-fdf2-4911-9a74-89d9def7c5c4"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 2, op: "u", ns: "test.mycoll", ui: UUID("93b18fb7-935d-40ac-bf81-aab5bcffb3aa"), o: { $v: 2, diff: { i: { y: 2.0 } } }, o2: { x: 10.0, _id: 0.0 }, needsRetryImage: "postImage", stmtId: 0, ts: Timestamp(1656027741, 109), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }, postImageOpTime: { ts: Timestamp(1656027741, 121), t: 1 }, stmtId: 0, fromMigrate: true, ts: Timestamp(0, 0), t: 1, v: 2, wall: new Date(1656027741872), prevOpTime: { ts: Timestamp(0, 0), t: -1 } }



 Comments   
Comment by Githook User [ 11/Aug/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-67492 Skip updating txn table for op=n when o2 is empty
Branch: v4.4
https://github.com/mongodb/mongo/commit/92d24c1d5d7d092638445651db44d1be9bf0e32c

Comment by Max Hirschhorn [ 11/Aug/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-67492 Skip updating txn table for op=n when o2 is empty

(cherry picked from commit 3f6c0845204161336f052d3665817ad5adc1a1b5)
Branch: v5.0
https://github.com/mongodb/mongo/commit/0d3499eef76a29bc28015113dd4cc7cb412abf27

Comment by Githook User [ 04/Aug/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-67492 Skip updating txn table for op=n when o2 is empty

(cherry picked from commit 3f6c0845204161336f052d3665817ad5adc1a1b5)
Branch: v6.0
https://github.com/mongodb/mongo/commit/89bc2f7533ecf43dd57c413b40066928e4c24e24

Comment by Githook User [ 02/Aug/22 ]

Author:

{'name': 'Abdul Qadeer', 'email': 'abdul.qadeer@mongodb.com', 'username': 'zorro786'}

Message: SERVER-67492 Skip updating txn table for op=n when o2 is empty
Branch: master
https://github.com/mongodb/mongo/commit/3f6c0845204161336f052d3665817ad5adc1a1b5
