[SERVER-49354] Correct how _isTopologyChanged in sharded_backup_restore.js looks for removeShard oplog entries Created: 08/Jul/20  Updated: 29/Oct/23  Resolved: 10/Jul/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Pierlauro Sciarelli
Resolution: Fixed Votes: 0
Labels: PM-1645-Milestone-3, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-49359 Enable sharded_backup_restore_add_rem... Closed
Problem/Incident
is caused by SERVER-47406 Implement the persistence and trackin... Closed
Related
is related to SERVER-49358 Blacklist sharded_backup_restore_add_... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 38

 Description   

The _isTopologyChanged() function checks for any oplog entries that are document deletions on config.shards. However, SERVER-47406 changed these oplog entries into an applyOps command (so that the topologyTime of another shard can be updated, much like how chunk versions are updated). As a result, this code no longer notices removeShard events (even when they are supposed to happen), causing test failures. The function needs to be updated to look for a "d" op on config.shards nested inside the ops array of an applyOps entry (much like the ConfigServerOpObserver does). If these tests run in any multiversion suites, then the existing check for plain "d" ops on config.shards should probably also be retained.
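
For illustration, a check covering both formats could look like the following sketch. This is not the committed fix; the helper name and the assumption that the relevant oplog entries are available as an array of plain documents are mine.

 // Illustrative sketch only (not the actual patch); the helper name is
 // hypothetical. Returns true if any of the given oplog entries records a
 // removeShard, covering both the old and the new format.
 function containsRemoveShard(oplogEntries) {
     return oplogEntries.some(entry => {
         // Old format: a plain document deletion on config.shards.
         if (entry.op === "d" && entry.ns === "config.shards") {
             return true;
         }
         // New format (SERVER-47406): the deletion is nested inside the ops
         // array of an applyOps command on config.$cmd.
         if (entry.op === "c" && entry.ns === "config.$cmd" &&
             entry.o && Array.isArray(entry.o.applyOps)) {
             return entry.o.applyOps.some(
                 innerOp => innerOp.op === "d" && innerOp.ns === "config.shards");
         }
         return false;
     });
 }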



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 21/Jul/20 ]

john.morales, is that information sufficient for you?

Comment by Kevin Pulo [ 21/Jul/20 ]

Apologies for this oversight.

The only change to addShard oplog entries is the addition of a new topologyTime field inside the o field (i.e. inside the document being inserted into the config.shards collection), which I assume is a negligible change; see the sample entry below and the query sketch after it.

 {
     "op" : "i",
     "ns" : "config.shards",
     "ui" : UUID("87aff8a9-1b10-460e-85a1-e3b35a6f6329"),
     "o" : {
         "_id" : "shard02",
         "host" : "shard02/localhost:27019",
         "state" : 1,
+        "topologyTime" : Timestamp(1595312397, 7)
     },
     "ts" : Timestamp(1595312397, 8),
     "t" : NumberLong(1),
     "wall" : ISODate("2020-07-21T06:19:57.459Z"),
     "v" : NumberLong(2)
 }
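
For reference, addShard events can still be detected with a plain oplog query. A minimal sketch, assuming it is run on a config server replica set member:

 // Illustrative query: addShard events appear as inserts into config.shards.
 db.getSiblingDB("local").oplog.rs.find({op: "i", ns: "config.shards"})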

For removeShard, this was previously just a document deletion in config.shards, but it is now an applyOps command containing both the document deletion and an update on another document (both in config.shards). For multiversion purposes, both formats should probably be detected. As you can see, they're reasonably similar, in the sense that the document deletion is still present, but nested inside the applyOps command; a query sketch follows the entry below.

 {
+    "op" : "c",
+    "ns" : "config.$cmd",
+    "o" : {
+        "applyOps" : [
+            {
                 "op" : "d",
                 "b" : false,
                 "ns" : "config.shards",
                 "o" : {
                     "_id" : "shard02"
                 },
                 "ui" : UUID("87aff8a9-1b10-460e-85a1-e3b35a6f6329")
+            },
+            {
+                "op" : "u",
+                "b" : false,
+                "ns" : "config.shards",
+                "o" : {
+                    "$set" : {
+                        "topologyTime" : Timestamp(1595312619, 1)
+                    }
+                },
+                "o2" : {
+                    "_id" : "shard01"
+                },
+                "ui" : UUID("87aff8a9-1b10-460e-85a1-e3b35a6f6329")
+            }
+        ],
+        "alwaysUpsert" : false,
+        "writeConcern" : {
+            "w" : 1,
+            "wtimeout" : 0
+        },
+        "$db" : "config"
+    },
     "ts" : Timestamp(1595312619, 2),
     "t" : NumberLong(1),
     "wall" : ISODate("2020-07-21T06:23:39.233Z"),
     "v" : NumberLong(2)
 }
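
A query matching this new nested format could look like the following sketch (again assuming a config server replica set member; this is not the test's actual code):

 // Illustrative query: removeShard events now appear as a "d" op on
 // config.shards nested inside an applyOps command on config.$cmd.
 db.getSiblingDB("local").oplog.rs.find({
     op: "c",
     ns: "config.$cmd",
     "o.applyOps": {$elemMatch: {op: "d", ns: "config.shards"}}
 })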

EDIT: And there is no change to the config.shards update that commences draining (technically this doesn't actually alter the topology); a query sketch follows the entry below.

 {
     "op" : "u",
     "ns" : "config.shards",
     "ui" : UUID("87aff8a9-1b10-460e-85a1-e3b35a6f6329"),
     "o" : {
         "$v" : 1,
         "$set" : {
             "draining" : true
         }
     },
     "o2" : {
         "_id" : "shard02"
     },
     "ts" : Timestamp(1595312615, 2),
     "t" : NumberLong(1),
     "wall" : ISODate("2020-07-21T06:23:35.008Z"),
     "v" : NumberLong(2)
 }
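
Since this is an ordinary "u" op on config.shards, a topology-change check keyed on deletions (plain or nested) will not be tripped by it. An illustrative query that surfaces such updates:

 // Illustrative query: the draining update (like any plain update on
 // config.shards) matches this filter, which is why bare "u" ops should not
 // be treated as removeShard events.
 db.getSiblingDB("local").oplog.rs.find({op: "u", ns: "config.shards"})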

Comment by Githook User [ 10/Jul/20 ]

Author:

{'name': 'Pierlauro Sciarelli', 'email': 'pierlauro.sciarelli@mongodb.com', 'username': 'pierlauro'}

Message: SERVER-49354 Correct how _isTopologyChanged in sharded_backup_restore.js looks for removeShard oplog entries
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/6c6030eff65b919566c52e4a300ca3dd541f8e8e

Comment by Daniel Gottlieb (Inactive) [ 10/Jul/20 ]

Pinging john.morales as this corrects the downstream impact from SERVER-47406. kevin.pulo or pierlauro.sciarelli, can one of you leave a formal description for John of how a client reading the oplog can detect a topology change, including the specific fields on the oplog entry to look for and what they mean?
