[SERVER-23398] Cannot start mongos with secondary config servers available, if they have never seen a primary Created: 29/Mar/16  Updated: 07/Mar/22  Resolved: 07/Mar/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Viacheslav Kulyk Assignee: Andrew Witten (Inactive)
Resolution: Done Votes: 0
Labels: neweng, sharding-nyc-subteam2
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File fix.txt     PNG File screenshot-1.png    
Issue Links:
Related
is related to SERVER-23041 Shards starting or entering primary m... Closed
Participants:
Story Points: 2

 Description   

Hi,

mongos does not start if the primary config server is not available, even though a secondary config server is available.
The version is 3.2.3; the same happens with 3.2.4. The config servers are configured as a replica set, the cluster is sharded, and WiredTiger is used.

Expected: mongos reads the configuration from a secondary config server and starts successfully.

This behavior is critically bad for our customers, as they cannot restart a node that runs both mongos and a config server while that node is disconnected from the network.



 Comments   
Comment by Lamont Nelson [ 07/Mar/22 ]

See andrew.witten's explanation in the previous comment.

Comment by Andrew Witten (Inactive) [ 07/Feb/22 ]

I think this is no longer happening. It seems to me that config_rs_no_primary.js tests this case. Some relevant logs from a run of this test are included below.

First, the expected config server node is elected primary:

[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:42.702+00:00 D4 ELECTION 4615651 [OplogApplier-0] "Scheduling election timeout callback","attr":{"when":{"$date":"2022-02-05T03:15:53.859Z"}}
[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:42.702+00:00 I  ELECTION 4615652 [OplogApplier-0] "Starting an election, since we've seen no PRIMARY in election timeout period","attr":{"electionTimeoutPeriodMillis":10000}
[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:42.702+00:00 I  ELECTION 21438   [OplogApplier-0] "Conducting a dry run election to see if we could be elected","attr":{"currentTerm":0}
...
[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:42.709+00:00 I  ELECTION 21450   [ReplCoord-0] "Election succeeded, assuming primary role","attr":{"term":1}
[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:42.709+00:00 I  REPL     21358   [ReplCoord-0] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}

Then the other two are stopped:

[js_test:config_rs_no_primary] c21522| 2022-02-05T03:15:50.108+00:00 I  -        4784931 [SignalHandler] "Dropping the scope cache for shutdown"
[js_test:config_rs_no_primary] c21522| 2022-02-05T03:15:50.108+00:00 I  FTDC     20626   [SignalHandler] "Shutting down full-time diagnostic data capture"
[js_test:config_rs_no_primary] c21522| 2022-02-05T03:15:50.110+00:00 I  CONTROL  20565   [SignalHandler] "Now exiting"
[js_test:config_rs_no_primary] c21522| 2022-02-05T03:15:50.111+00:00 I  CONTROL  23138   [SignalHandler] "Shutting down","attr":{"exitCode":0}
[js_test:config_rs_no_primary] | 2022-02-05T03:15:50.131Z I  -        22821   [js] "shell: Stopped mongo program on port","attr":{"port":21522}
[js_test:config_rs_no_primary] ReplSetTest stop *** Mongod in port 21522 shutdown with code (0) ***
[js_test:config_rs_no_primary] ReplSetTest stop *** Shutting down mongod in port 21523, wait for process termination: true ***

and

[js_test:config_rs_no_primary] c21523| 2022-02-05T03:15:51.087+00:00 I  STORAGE  22279   [SignalHandler] "shutdown: removing fs lock..."
[js_test:config_rs_no_primary] c21523| 2022-02-05T03:15:51.087+00:00 I  -        4784931 [SignalHandler] "Dropping the scope cache for shutdown"
[js_test:config_rs_no_primary] c21523| 2022-02-05T03:15:51.087+00:00 I  FTDC     20626   [SignalHandler] "Shutting down full-time diagnostic data capture"
[js_test:config_rs_no_primary] c21523| 2022-02-05T03:15:51.091+00:00 I  CONTROL  20565   [SignalHandler] "Now exiting"
[js_test:config_rs_no_primary] c21523| 2022-02-05T03:15:51.091+00:00 I  CONTROL  23138   [SignalHandler] "Shutting down","attr":{"exitCode":0}
[js_test:config_rs_no_primary] | 2022-02-05T03:15:51.102Z I  -        22821   [js] "shell: Stopped mongo program on port","attr":{"port":21523}

as a result, the primary no longer has a quorum and therefore steps down:

[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:58.966+00:00 I  REPL     21809   [ReplCoord-2] "Can't see a majority of the set, relinquishing primary"
[js_test:config_rs_no_primary] c21521| 2022-02-05T03:15:58.966+00:00 I  REPL     21475   [ReplCoord-2] "Stepping down from primary in response to heartbeat"

Then we successfully bring up a new mongos.

So this scenario appears to already be covered by that test.
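
For reference, here is a rough jstest-style sketch of the flow above (an illustration only, not the actual contents of config_rs_no_primary.js; it assumes the standard ShardingTest/MongoRunner test helpers):

// Rough sketch: 3-node config server replica set, one shard, one mongos.
var st = new ShardingTest({shards: 1, config: 3});

// Stop two of the three config servers; the remaining member steps down /
// stays secondary because it can no longer see a majority.
st.configRS.stop(1);
st.configRS.stop(2);

// Starting an additional mongos should still succeed even though only a
// primary-less config secondary is reachable.
var newMongos = MongoRunner.runMongos({configdb: st.configRS.getURL()});
assert.neq(null, newMongos);

MongoRunner.stopMongos(newMongos);
st.stop();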

Comment by Lamont Nelson [ 30/Nov/21 ]

This ticket has been put into our backlog. We will check if this is still an issue.

Comment by Kaloian Manassiev [ 15/Nov/21 ]

Given that MongoS is allowed to be stale (since the shards have the authoritative versions), it should be possible to start a router without having seen a primary/done a majority write.

Shards, on the other hand, cannot do so if they had an in-progress migration when they started, because they need to recover the latest shard version.
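
As a small illustration of that point (example only, not a workaround for this ticket): the router's routing table is just a cache that can be discarded at any time and is rebuilt lazily via the shard versioning protocol, e.g.:

// Run against a mongos: drop the cached routing metadata; subsequent
// operations rebuild it on demand, with the shards rejecting stale versions.
db.adminCommand({flushRouterConfig: 1});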

Comment by Viacheslav Kulyk [ 23/Nov/16 ]

Any updates on a possible fix date?

Comment by Viacheslav Kulyk [ 30/May/16 ]

Hi Andy,

I am trying to find a case where this would happen with the setup we described (please see the attachment), and I cannot.
When one node becomes disconnected from the network, the majority of config servers remains among the other nodes. The disconnected node cannot accept writes at the moment of disconnection, as no write majority is available. It steps down to secondary (if it was primary) or remains secondary (if it already was one), so it will not accept any new writes. On reconnection, the node catches up to the latest state from the majority.
Could you please explain in more detail how a rollback can happen in the case we described?

Also, the change Denis proposes only takes effect if explicitly enabled, so it will not break any existing clients.

Regards
Viacheslav

Comment by Andy Schwerin [ 27/May/16 ]

That's not quite the point. You cannot know, when you bring the other nodes back up, which node they will elect as primary. If they do not elect the former primary, it is possible that the new primary will be missing some writes from the old primary, and the old primary will be forced to roll them back. If somebody manages to read from the old primary before that occurs, they will have inconsistent routing tables.

Comment by Denis Bystrov [ 17/May/16 ]

I get your point, but in our model there are no writes to the secondary while the network or the primary (majority) is down, so when a primary becomes available again the secondary node will be in sync with that primary and mongos will get consistent data.

Comment by Andy Schwerin [ 17/May/16 ]

Thanks for the patch, denis.bystrov@avid.com. The problem with the proposal that you and vkulyk submitted is what happens when the secondary config servers finally see a primary. It is possible when that happens that they will roll back some or all of their data, but neither mongos nor the shard servers have a mechanism to deal with config data rollback. As a result, the routing tables maintained by mongos and the shards may become inconsistent, leading to writes being accepted at the wrong nodes and loss of data.

This patch is only sufficient if you somehow know that the config server you contact will never roll back data that is visible in the local read concern, and this cannot be established automatically.

Comment by Denis Bystrov [ 17/May/16 ]

Hey,

Here is our pull request:
https://github.com/mongodb/mongo/pull/1082

Comment by Ramon Fernandez Marina [ 17/May/16 ]

Thanks for your submission vkulyk. While we're not able to provide support in the SERVER project, I'd encourage you to read the Contributor Guide and submit a pull request via github so we can consider your patch. Don't forget to sign the Contributor Agreement if you haven't done so already.

Thanks,
Ramón.

Comment by Viacheslav Kulyk [ 17/May/16 ]

Hi,

Attached is a fix candidate for the issue, in the form of a diff against the source code (fix.txt).

It contains changes that allow reading from a secondary node of the config replica set. The behavior is configurable, so it will not break existing behavior:
mongos --configdb <LIST> --allowInconsistentConfigReads
Previously, mongos required the config replica set primary to be available and read only from it; in addition, the read concern for config reads was strictly set to "majority".
With the fix, it is no longer mandatory to have a primary node available. If one is available, it is used; if not, reads from a secondary are allowed. In addition, config reads use a different read concern, which allows fetching data from the node with the latest available data.

Could you please review the code and let us know:
1) Is this fix applicable, or does it have drawbacks we are not aware of?
2) If the fix is fine, could you please add it to your sources for an upcoming release?

Thank you
Viacheslav

Comment by Ramon Fernandez Marina [ 12/Apr/16 ]

vkulyk, as per Andy's comment it is not clear at the moment how to implement this functionality using the existing technology, and a lot of discussion and design is needed. I would not expect this ticket to be part of a stable release in less than 12 months, possibly more.

Comment by Viacheslav Kulyk [ 12/Apr/16 ]

I see that the issue was added to Planning Bucket A. Please advise, at least very roughly, how soon we can expect the issue to be solved and released: in 1/3/6/12 months?
Thank you,
Viacheslav

Comment by Viacheslav Kulyk [ 07/Apr/16 ]

Thank you! Will be waiting for updates.

Comment by Andy Schwerin [ 06/Apr/16 ]

I was talking with spencer about this, and it might be possible to allow mongos nodes to read config data with local read concern, since the shard servers enforce the versioning protocol. It will take a great deal of further study to confirm, but it's an interesting possibility.

Comment by Viacheslav Kulyk [ 01/Apr/16 ]

I'd like to add:
With 3 mirrored config servers, this issue is not observed on the 3 data centers/nodes that host a config server; such a node can be rebooted and can still read and write. But MongoDB supports at most 3 mirrored config servers, so that does not meet all our needs.
We hoped that, with the new feature of running the config servers as a replica set, we could have up to 50 nodes that can be rebooted while disconnected. But in fact we can only reboot nodes in one data center, the one holding the majority. From this point of view, replica set config servers have degraded functionality compared with mirrored ones and are useless for us.

I hope you will find a way to start mongos reading from a secondary config server. I believe it is possible, and it is critical for us. I also think it would make MongoDB more powerful, being ready for the cross-data-center configuration described above.

Regards
Viacheslav

Comment by Viacheslav Kulyk [ 31/Mar/16 ]

Hi Ramon,

Thank you for going on investigating.

I've attached a screenshot with a configuration example and a case description.
The configuration is meant to withstand inter-data-center disconnections. Even during a disconnection, both data centers can still read data from all the shards (the services use read preference primaryPreferred/nearest) and can even write documents tagged to their own data center (we use tag-aware sharding with a data center tag). It works fine, but only until a reboot. After a reboot, mongos in data center 2 hangs "starting" forever.
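
A minimal sketch of this kind of tag-aware setup (shard names, tag names, namespace, and shard key are illustrative, not taken from the attached screenshot):

// Pin data-center-tagged ranges of the shard key to the shards in each data center.
sh.addShardTag("shardDC1", "DC1");
sh.addShardTag("shardDC2", "DC2");
sh.addTagRange("mydb.docs", {dc: "DC1", _id: MinKey}, {dc: "DC1", _id: MaxKey}, "DC1");
sh.addTagRange("mydb.docs", {dc: "DC2", _id: MinKey}, {dc: "DC2", _id: MaxKey}, "DC2");

// Services use a relaxed read preference so a lost primary does not block reads.
db.getMongo().setReadPref("primaryPreferred");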

As for "Even if mongos was able to start no metadata operations would be allowed": I believe this is fine, as we do not care about re-chunking; we care about the ability to read and write documents. Let re-chunking happen after the connection to a majority of config servers is restored. In fact, this is what happens by design when the connection is lost but no reboot is done. I would expect mongos to behave the same way after a reboot: start, but do not allow metadata operations until the connection to a majority of config servers is restored.

Thank you
Viacheslav

Comment by Ramon Fernandez Marina [ 30/Mar/16 ]

vkulyk, this ticket remains open to investigate what the options going forward are.

However, it would be helpful if you could elaborate on your use case. What's the end goal of removing a config server from the replica set? Even if mongos were able to start, no metadata operations would be allowed on the cluster. Also, what's the state of the data-bearing nodes in the cluster? Can they talk to the config server primary? If yes, then I'm concerned that the mongos that's talking to the secondary node may get an inconsistent view of the cluster. Can you please provide more information about what you're trying to do?

Thanks,
Ramón.

Comment by Viacheslav Kulyk [ 30/Mar/16 ]

Is there a chance of having a fix in the next MongoDB release? For example, saving the latest snapshot on a secondary config server and using it if the majority is not available, or creating a snapshot based on the secondary's config data (as all the needed configuration is actually there) when the majority is not available?
This could really make our customers refuse to switch to MongoDB; unfortunately, it is a quite realistic case.
In effect it means that, although we can have up to 50 config servers, config secondaries are useless for us.

Comment by Spencer Brody (Inactive) [ 30/Mar/16 ]

Unfortunately, a replica set config server is only available for reads if, at some point since it was started, it has been in contact with a majority of the config servers.

Comment by Viacheslav Kulyk [ 30/Mar/16 ]

Thank you for the clarification.
Is there a way to fix this via config server or mongos parameters?

Comment by Spencer Brody (Inactive) [ 30/Mar/16 ]

This is due to the lack of a snapshot for serving read concern "majority" reads. If the only reachable config servers have never been in contact with a majority during the lifetime of their process, they will have no snapshot that they can confirm has been committed to a majority of the config servers, so all reads from the mongos will time out.
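
To illustrate from the shell (host name is an example; method names assume a reasonably recent legacy mongo shell): connected directly to a config secondary that has never contacted a majority since it started, a "majority" read has no committed snapshot to serve, while a "local" read returns whatever data the node holds:

var conn = new Mongo("cfg-b.example.net:27019");  // a config server secondary
conn.setSecondaryOk();                            // allow reads while it is SECONDARY
var chunks = conn.getDB("config").chunks;

// Blocks/times out: no majority-committed snapshot is available on this node.
chunks.find().readConcern("majority").maxTimeMS(5000).itcount();

// Returns the locally available (possibly not majority-committed) routing data.
chunks.find().readConcern("local").itcount();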

Comment by Viacheslav Kulyk [ 30/Mar/16 ]

Hi Thomas,

Thank you for a quick response.

I just noticed a typo in the case 2 description. I cannot edit it, so here is the corrected version:
Case 2:

  • create a sharded setup where one of the nodes runs both mongos and a config server
  • disconnect this node from the network
  • restart the config server on this node
  • restart the mongos
  • mongos does NOT start <--

I'd also like to add that the issue is even broader than "one node is disconnected from the network". If we have data centers A and B, the primary config server is in data center A, and the network connection between the data centers is lost, then no node with mongos in data center B can be restarted, even though every B node has a secondary config server.

Regards
Viacheslav

Comment by Kelsey Schubert [ 29/Mar/16 ]

Hi vkulyk,

Thank you for the report. We are able to reproduce this behavior and are investigating. Please continue to watch this ticket for updates.

Kind regards,
Thomas
