[SERVER-74663] Transitioning a server from configsvr to shardsvr doesn't succeed, but also doesn't error Created: 07/Mar/23  Updated: 14/Apr/23  Resolved: 14/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Joanna Cheng Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Duplicate Votes: 0
Labels: skunkelodeon-odcs
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File shard-2-mongodb.log     Text File test-other-shard-mongodb.log    
Issue Links:
Duplicate
duplicates SERVER-74311 Add sanity check assertions that only... Closed
Assigned Teams:
Sharding NYC
Participants:

 Description   

Steps to reproduce:

  1. Set up a sharded cluster with the new "config-shard" feature
  2. Spin up a new mongod as a second shard, but forget to change the arguments (so it starts with --configsvr instead of --shardsvr):

    /Users/joanna/skunkworks/odcs-arm64/mongod --setParameter featureFlagCatalogShard=true --setParameter enableTestCommands=true --configsvr --replSet shard-2 --dbpath sh2-dbpath --fork --logpath sh2-dbpath/mongodb.log --logappend --port 27020
    

  3. Run rs.initiate() on this shard. This succeeds.
  4. Try to sh.addShard() from your mongos. This fails with:

    	"errmsg" : "Cannot add shard-2/localhost:27020 as a shard since it is a config server",
    

  5. Shut down the node and restart it with --shardsvr:

    % /Users/joanna/skunkworks/odcs-arm64/mongod --setParameter enableTestCommands=true --shardsvr --replSet shard-2 --dbpath sh2-dbpath --fork --logpath sh2-dbpath/mongodb.log --logappend --port 27020
    {"t":{"$date":"2023-03-07T04:57:42.270Z"},"s":"I",  "c":"CONTROL",  "id":5760901, "ctx":"thread1","msg":"Applied --setParameter options","attr":{"serverParameters":{"enableTestCommands":{"default":false,"value":true}}}}
    about to fork child process, waiting until server is ready for connections.
    forked process: 89394
    ^C
    

    This hangs (although the fork seems to complete, since I have a forked PID). The ctrl-C is from me after I got impatient.

The shard now has an identity crisis - it's started, and listening on the right port, but still can't connect to itself. (I assume this is because it's expecting a config server on port 27020, but is getting a not-set-up-properly shard server)

{"t":{"$date":"2023-03-07T15:44:06.567+11:00"},"s":"W",  "c":"SHARDING", "id":22074,   "ctx":"initandlisten","msg":"Started with --shardsvr, but no shardIdentity document was found on disk. This most likely means this server has not yet been added to a sharded cluster","attr":{"namespace":"admin.system.version"}}
....
{"t":{"$date":"2023-03-07T15:44:06.691+11:00"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"/tmp/mongodb-27020.sock"}}
{"t":{"$date":"2023-03-07T15:44:06.692+11:00"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"127.0.0.1"}}
{"t":{"$date":"2023-03-07T15:44:06.693+11:00"},"s":"I",  "c":"NETWORK",  "id":23016,   "ctx":"listener","msg":"Waiting for connections","attr":{"port":27020,"ssl":"off"}}
...
{"t":{"$date":"2023-03-07T16:19:25.068+11:00"},"s":"I",  "c":"-",        "id":4333222, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"RSM received error response","attr":{"host":"localhost:27020","error":"NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit","replicaSet":"shard-2","response":{}}}
{"t":{"$date":"2023-03-07T16:19:25.069+11:00"},"s":"I",  "c":"NETWORK",  "id":4712102, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Host failed in replica set","attr":{"replicaSet":"shard-2","host":"localhost:27020","error":{"code":202,"codeName":"NetworkInterfaceExceededTimeLimit","errmsg":"Couldn't get a connection within the time limit"},"action":{"dropConnections":false,"requestImmediateCheck":false,"outcome":{"host":"localhost:27020","success":false,"errorMessage":"NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit"}}}}
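The excerpts above are structured log lines, one JSON document per line, with severity in "s", component in "c" and the message in "msg". A minimal sketch for pulling the warnings and errors out of an attached log follows. The function name `filter_log_lines` and its parameters are hypothetical, and the sample messages are abbreviated from the lines quoted in this ticket.

```python
import json


def filter_log_lines(lines, severities=("W", "E", "F"), components=None):
    """Return parsed log entries whose severity ("s") is in `severities`
    and, if `components` is given, whose component ("c") is in it."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip elisions such as "..." and shell output
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        if entry.get("s") not in severities:
            continue
        if components is not None and entry.get("c") not in components:
            continue
        matches.append(entry)
    return matches


# Two lines abbreviated from the log in this ticket, plus an elision marker.
sample = [
    '{"t":{"$date":"2023-03-07T15:44:06.567+11:00"},"s":"W",  "c":"SHARDING", '
    '"id":22074,   "ctx":"initandlisten","msg":"Started with --shardsvr, but no '
    'shardIdentity document was found on disk"}',
    "....",
    '{"t":{"$date":"2023-03-07T15:44:06.693+11:00"},"s":"I",  "c":"NETWORK",  '
    '"id":23016,   "ctx":"listener","msg":"Waiting for connections",'
    '"attr":{"port":27020,"ssl":"off"}}',
]

for entry in filter_log_lines(sample):
    print(entry["s"], entry["c"], entry["id"], entry["msg"])
```

With the default severities, only the SHARDING warning is kept; the informational NETWORK lines are dropped unless "I" is requested explicitly.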

The node itself is also not connectable via the shell:

% mongo --port 27020
MongoDB shell version v5.0.9
connecting to: mongodb://127.0.0.1:27020/?compressors=disabled&gssapiServiceName=mongodb
Error: couldn't connect to server 127.0.0.1:27020, connection attempt failed: SocketException: Error connecting to 127.0.0.1:27020 :: caused by :: Operation timed out :
connect@src/mongo/shell/mongo.js:372:17
@(connect):2:6
exception: connect failed
exiting with code 1

Repeating the same steps on the same version, without the special featureFlagCatalogShard:

  1. Start mongod with --configsvr

    % /Users/joanna/skunkworks/odcs-arm64/mongod --configsvr --replSet shard-test --dbpath test-other-shard --fork --logpath test-other-shard/mongodb.log --logappend --port 27030
    

  2. Run rs.initiate()
  3. Shut down mongod and restart with --shardsvr instead
    The mongod refuses to start:

    % /Users/joanna/skunkworks/odcs-arm64/mongod --shardsvr --replSet shard-test --dbpath test-other-shard --fork --logpath test-other-shard/mongodb.log --logappend --port 27030
    about to fork child process, waiting until server is ready for connections.
    forked process: 89713
    ERROR: child process failed, exited with 1
    To see additional information in this output, start without the "--fork" option.
    

    Error in the logs:

    {"t":{"$date":"2023-03-07T16:02:38.455+11:00"},"s":"E",  "c":"REPL",     "id":21415,   "ctx":"ReplCoord-0","msg":"Locally stored replica set configuration is invalid; See http://www.mongodb.org/dochub/core/recover-replica-set-from-invalid-config for information on how to recover from this","attr":{"error":{"code":2,"codeName":"BadValue","errmsg":"Nodes being used for config servers must be started with the --configsvr flag"},"localConfig":{"_id":"shard-test","version":1,"term":1,"members":[{"_id":0,"host":"localhost:27030","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1}],"configsvr":true,"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"6406c55bc060c6ffdffd9ca7"}}}}}
    ...
    {"t":{"$date":"2023-03-07T16:02:38.455+11:00"},"s":"F",  "c":"ASSERT",   "id":23091,   "ctx":"ReplCoord-0","msg":"Fatal assertion","attr":{"msgid":28544,"file":"src/mongo/db/repl/replication_coordinator_impl.cpp","line":619}}
    {"t":{"$date":"2023-03-07T16:02:38.455+11:00"},"s":"I",  "c":"CONTROL",  "id":20710,   "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"ShardingStateNotInitialized: sharding state is not yet initialized"}}
    {"t":{"$date":"2023-03-07T16:02:38.455+11:00"},"s":"I",  "c":"NETWORK",  "id":23015,   "ctx":"listener","msg":"Listening on","attr":{"address":"/tmp/mongodb-27030.sock"}}
    {"t":{"$date":"2023-03-07T16:02:38.456+11:00"},"s":"F",  "c":"ASSERT",   "id":23092,   "ctx":"ReplCoord-0","msg":"\n\n***aborting after fassert() failure\n\n"}
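The fatal error above comes from the locally stored replica set config carrying "configsvr":true while the node was started with --shardsvr. The rule can be sketched as below. This is an illustration of the check implied by the error message, not the server's actual code; `check_startup_role` and the `cluster_role` labels are hypothetical.

```python
def check_startup_role(local_config, cluster_role):
    """Return an error string if the stored replica set config marks the set
    as a config server replica set but the node was started in another role.

    `cluster_role` is a hypothetical label: "configsvr", "shardsvr" or None.
    """
    is_config_replset = bool(local_config.get("configsvr", False))
    if is_config_replset and cluster_role != "configsvr":
        return ("BadValue: Nodes being used for config servers must be "
                "started with the --configsvr flag")
    return None


# Trimmed version of the "localConfig" shown in the log line above.
local_config = {"_id": "shard-test", "version": 1, "configsvr": True}

print(check_startup_role(local_config, "shardsvr"))   # mismatch: error string
print(check_startup_role(local_config, "configsvr"))  # consistent: None
```

In the second reproduction this mismatch makes the restart fail immediately, which is the behaviour the first reproduction was missing.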
    



 Comments   
Comment by Jack Mulrow [ 14/Apr/23 ]

I ran the repro and now restarting the --configsvr node with --shardsvr (and the same dbpath) fails to start up because of the shard identity document validation added in SERVER-74311, logging: 

{"t":{"$date":"2023-04-14T15:57:32.775+00:00"},"s":"E",  "c":"CONTROL",  "id":20557,   "ctx":"initandlisten","msg":"DBException in initAndListen, terminating","attr":{"error":"UnsupportedFormat: Invalid shard identity document found when initializing sharding state :: caused by :: Invalid shard identity document: the shard name for a shard server cannot be \"config\""}} 

The broken URL (presumably the dochub link in the "Locally stored replica set configuration is invalid" error above) is still an issue, but a separate one, so I'll file a new ticket for replication to look into it and close this as a duplicate of SERVER-74311.
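For reference, the startup validation described in the comment above can be sketched roughly as follows. This is only an illustration of the rule stated in the error message, not the code added in SERVER-74311; `validate_shard_identity`, the `cluster_role` label and the use of a `shardName` field are assumptions.

```python
def validate_shard_identity(doc, cluster_role):
    """Raise ValueError if a node started as a shard server finds a shard
    identity document whose shard name is "config".

    `doc` is a dict standing in for the stored shard identity document;
    the `shardName` key is an assumption made for this sketch.
    """
    shard_name = doc.get("shardName")
    if cluster_role == "shardsvr" and shard_name == "config":
        raise ValueError(
            "Invalid shard identity document: the shard name for a shard "
            'server cannot be "config"'
        )


# A former config server restarted with --shardsvr on the same dbpath.
stale_doc = {"_id": "shardIdentity", "shardName": "config"}

try:
    validate_shard_identity(stale_doc, "shardsvr")
except ValueError as exc:
    print("startup would terminate:", exc)
```

Failing at startup like this replaces the silent hang from the original report with an explicit error.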

Generated at Thu Feb 08 06:28:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.