[SERVER-45709] Sharding support for queryable backup Created: 22/Jan/20  Updated: 27/Oct/23  Resolved: 29/Jan/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Gregory Wlodarek
Assignee: Kaloian Manassiev
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Operating System: ALL
Sprint: Sharding 2020-02-10
Participants:

 Description   

The new backup implementation for 4.2 servers broke the queryable backup interface. As part of the project to make queryable backup work again on 4.2 and later, we've been working with the Cloud Backup team, who have been running integration tests since the fixes went in.

We're seeing problems with sharded clusters: the sharding subsystem fails during startup in "read-only mode", which is what queryable backup uses. The subsystem attempts a write to update the config connection string while the readOnly mode flag is set (a trait of queryable backup mode).

CONTROL  [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
CONTROL  [initandlisten] MongoDB starting : pid=625822 port=27703 dbpath=/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath/ 64-bit host=triceratops-linux
CONTROL  [initandlisten] db version v0.0.0
CONTROL  [initandlisten] git version: unknown
CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.1.1d  10 Sep 2019
CONTROL  [initandlisten] allocator: tcmalloc
CONTROL  [initandlisten] modules: enterprise ninja 
CONTROL  [initandlisten] build environment:
CONTROL  [initandlisten]     distarch: x86_64
CONTROL  [initandlisten]     target_arch: x86_64
CONTROL  [initandlisten] options: { config: "/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath/conf.yaml", net: { bindIp: "0.0.0.0", port: 27703, tls: { CAFile: "/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath/ca.pem", certificateKeyFile: "/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath/serverIdentity.pem", mode: "requireTLS" } }, queryableBackup: { apiUri: "127.0.0.1:8097", snapshotId: "5e286df22e9d245688eceff7" }, security: { authorization: "enabled", clusterAuthMode: "x509" }, setParameter: { authenticationMechanisms: "MONGODB-X509", recoverToOplogTimestamp: "{timestamp:Timestamp(1579645657,1)}" }, sharding:{ _overrideShardIdentity: "{"_id": "shardIdentity", "clusterId": ObjectId("578fd7a27c55c01d2b1dc1fd"), "shardName": "queryable_shard_1", "configsvrConnectionString": "queryable_...", clusterRole: "shardsvr" }, storage: { dbPath: "/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath/", engine: "queryable_wt", queryableBackupMode: true, wiredTiger: { engineConfig: { cacheSizeGB: 1.0 } } }, systemLog: { destination: "file", path: "/data/backups/daemon/queryable/5e286df22e9d245688eceff7/mongod.log" } }
STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1024M,cache_overflow=(file_max=0M),session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),extensions=[local={entry=queryableWtFsCreate,early_load=true,config={apiUri="127.0.0.1:8097",snapshotId="5e286df22e9d245688eceff7",dbpath="/data/backups/daemon/queryable/5e286df22e9d245688eceff7/dbpath"}},],
RECOVERY [initandlisten] WiredTiger recoveryTimestamp. Ts: Timestamp(1579645657, 1)
STORAGE  [initandlisten] Timestamp monitor starting
STORAGE  [initandlisten] Detected configuration for non-active storage engine wiredTiger when current storage engine is queryable_wt
STORAGE  [initandlisten] Flow Control is enabled on this deployment.
STORAGE  [initandlisten] Running in queryable backup mode. Unable to create authorization indexes
SHARDING [initandlisten] initializing sharding state with: { shardName: "queryable_shard_1", clusterId: ObjectId('578fd7a27c55c01d2b1dc1fd'), configsvrConnectionString: "queryable_config/triceratops-linux:27702" }
NETWORK  [initandlisten] Starting new replica set monitor for queryable_config/triceratops-linux:27702
SHARDING [thread1] creating distributed lock ping thread for process triceratops-linux:27703:1579707942:2362364890367342457 (sleeping for 30000ms)
CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to triceratops-linux:27702
SHARDING [initandlisten] Finished initializing sharding components for primary node.
NETWORK  [ReplicaSetMonitor-TaskExecutor] Confirmed replica set for queryable_config is queryable_config/localhost:27702
SHARDING [Sharding-Fixed-0] Updating config server with confirmed set queryable_config/localhost:27702
CONNPOOL [ReplicaSetMonitor-TaskExecutor] Connecting to localhost:27702
SHARDING [Sharding-Fixed-0] Updating ShardRegistry connection string for shard config from: queryable_config/triceratops-linux:27702 to: queryable_config/localhost:27702
SHARDING [updateShardIdentityConfigString] Error encountered while trying to update config connection string to queryable_config/localhost:27702 :: caused by :: IllegalOperation: Cannot execute a write operation in read-only mode
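
For the integration tests, a small log scan could flag this failure early. The following is only a sketch: it assumes the log messages keep the exact wording shown in the excerpt above, and the function names are made up for illustration, not part of any existing tooling.

```python
import re

# Matches the ShardRegistry update line as worded in the log excerpt above.
# Assumption: the message text is stable across builds.
UPDATE_RE = re.compile(
    r"Updating ShardRegistry connection string for shard (?P<shard>\S+) "
    r"from: (?P<old>\S+) to: (?P<new>\S+)"
)
READ_ONLY_ERROR = "Cannot execute a write operation in read-only mode"


def find_connection_string_changes(log_lines):
    """Return (shard, old, new) for every ShardRegistry update line found."""
    changes = []
    for line in log_lines:
        match = UPDATE_RE.search(line)
        if match:
            changes.append((match["shard"], match["old"], match["new"]))
    return changes


def has_read_only_write_failure(log_lines):
    """True if any line reports a write rejected in read-only mode."""
    return any(READ_ONLY_ERROR in line for line in log_lines)


# The two relevant lines, copied from the excerpt above.
sample_log = [
    "SHARDING [Sharding-Fixed-0] Updating ShardRegistry connection string "
    "for shard config from: queryable_config/triceratops-linux:27702 "
    "to: queryable_config/localhost:27702",
    "SHARDING [updateShardIdentityConfigString] Error encountered while "
    "trying to update config connection string to "
    "queryable_config/localhost:27702 :: caused by :: IllegalOperation: "
    "Cannot execute a write operation in read-only mode",
]

for shard, old, new in find_connection_string_changes(sample_log):
    print(f"shard {shard}: {old} -> {new}")
print("read-only write failure:", has_read_only_write_failure(sample_log))
```

A connection string change followed by the read-only error is the pattern seen in this ticket; either signal alone may be benign.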

As this is the furthest they got, I'm not aware of any further problems that may pop up. Part of the work here will require collaboration with the Cloud Backup team to determine whether there are any other problems related to the sharding subsystem.



 Comments   
Comment by Kaloian Manassiev [ 27/Jan/20 ]

The shards do store the FQDN of the config server in the shardIdentity document in config.system.version, and yes, that is where the difference between localhost and triceratops-linux above comes from.

The _overrideShardIdentity parameter, which you pointed me to, is a way to override that, so yes, this should work.

Comment by Kaloian Manassiev [ 27/Jan/20 ]

gregory.wlodarek, it has always been the case that in a read-only cluster none of the components are writable. Did this ever work in versions before 4.2?

I think what happens here is that the cluster was backed up with the config server persisted as triceratops-linux:27702, but when you start it in read-only mode, it somehow reports itself as localhost:27702, which for the purposes of sharding is a different node. So this is a matter of ensuring that the FQDN under which nodes report themselves doesn't change.
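
To make the mismatch concrete, here is a rough sketch of a pre-flight check that compares the persisted and reported connection strings. It assumes the `setName/host:port[,host:port...]` layout seen in the logs and does a plain textual comparison of set name and host list; it is not how the server itself decides equality, and the function names are hypothetical.

```python
def parse_connection_string(conn_str):
    """Split 'setName/host1:port,host2:port' into (setName, set of hosts).

    Assumption: hosts are compared as lowercase strings, with no DNS
    resolution, so 'localhost' and a machine's hostname never match.
    """
    set_name, sep, hosts = conn_str.partition("/")
    if not sep or not set_name or not hosts:
        raise ValueError(f"not a replica set connection string: {conn_str!r}")
    return set_name, frozenset(h.strip().lower() for h in hosts.split(","))


def same_config_server(persisted, reported):
    """True if both strings name the same set with the same host list."""
    return parse_connection_string(persisted) == parse_connection_string(reported)


# Values taken from the log excerpt in the description.
persisted = "queryable_config/triceratops-linux:27702"
reported = "queryable_config/localhost:27702"

if not same_config_server(persisted, reported):
    print(f"mismatch: persisted {persisted}, node reports {reported}")
```

If the two strings differ, the shard will try to persist the new one, which is the write that fails in read-only mode.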
