[SERVER-41646] Hitting CannotVerifyAndSignLogicalTime on 3.6 trying to convert to sharded cluster Created: 11/Jun/19  Updated: 16/Sep/19  Resolved: 04/Sep/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Francis Cheng Assignee: Danny Hatcher (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   
Original Summary

Backport SERVER-32672 fix to 3.6

Original Description

A bug related to LogicalTimeValidator causes conversion of a 3.6 replica set to a sharded cluster to fail. It was fixed in 4.0. We need a backport of this fix to 3.6.

Details:

I was following the steps in https://docs.mongodb.com/manual/tutorial/convert-replica-set-to-replicated-shard-cluster/ to convert a replica set to a sharded cluster.
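
For reference, the conversion step that fails below amounts to roughly the following (a minimal sketch based on the tutorial; the replica set name rs0, the dbpath, and everything except the qm3:27101 host from the error below are illustrative, not taken from this report):

# Restart each replica set member with the shard server role
mongod --shardsvr --replSet rs0 --port 27101 --dbpath /data/rs0-1

# From a mongos connected to the config servers, add the replica set as a shard
sh.addShard("rs0/qm3:27101")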

I failed at the step where the 3.6 replica set is restarted as a shard. Once I restart it, I can no longer connect to the replica set/shard. I keep getting errors like:

 

Command failed with error 210 (CannotVerifyAndSignLogicalTime): 'Cannot accept logicalTime: { ts: Timestamp(1560272596, 1) }. May not be a part of a sharded cluster' on server qm3:27101. The full response is {"ok": 0.0, "errmsg": "Cannot accept logicalTime: { ts: Timestamp(1560272596, 1) }. May not be a part of a sharded cluster", "code": 210, "codeName": "CannotVerifyAndSignLogicalTime"}

The same procedure has no problem with 3.4 or 4.0.

I believe the bug is related to https://jira.mongodb.org/browse/SERVER-32672

We are currently using MongoDB 3.6 and we need to convert our replica sets to sharded clusters. It is not feasible for us to upgrade to 4.0 soon. Could you backport this fix to 3.6, so that we'll be able to continue using our database?

 



 Comments   
Comment by Danny Hatcher (Inactive) [ 16/Sep/19 ]

sombra2eternity@gmail.com, because we were never able to diagnose the problem in this case, could you please open another SERVER ticket with a description of your problem, logs from the Primary processes, and the "diagnostic.data" folders from those processes?

Comment by Marcos Fernándex [ 16/Sep/19 ]

I hit this bug, or something very similar, a few minutes ago. I tried to set up a sharded cluster but failed, so I rolled back to single-server mode. I restarted without the replication or sharding sections in the config, and even ran:

use local;
db.dropDatabase();

But from time to time (about 1 in 6 queries) I still get:

Cannot accept logicalTime: { ts: Timestamp(1568629799, 62) }. May not be a part of a sharded cluster

In a few minutes I will be dumping and restoring the databases into a new installation, because it seems there is some kind of residual misconfiguration left inside mongod after a sharded/replica configuration.
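
If residual sharding state is the suspicion, one document worth checking is the shardIdentity entry (a sketch; this assumes leftover shard metadata is the cause, which has not been confirmed here):

use admin
db.system.version.find({ _id: "shardIdentity" })   // a result here means the node still considers itself part of a sharded cluster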

Mongo: 

MongoDB shell version v4.0.9
git version: fc525e2d9b0e4bceff5c2201457e564362909765

Comment by Danny Hatcher (Inactive) [ 29/Jul/19 ]

fcheng, have you had the chance to review my previous comment?

Comment by Danny Hatcher (Inactive) [ 08/Jul/19 ]

Hello fcheng, as we've been unable to reproduce internally based on the docs, would you be able to provide a full list of reproduction steps that cause the problem to occur?

Comment by Misha Tyulenev [ 28/Jun/19 ]

daniel.hatcher There should not be a problem when converting a 3.6 replica set to a sharded cluster: if the server was not started with --shardsvr, the default featureCompatibilityVersion on a clean startup is the upgrade version. If it was started with --shardsvr, the default featureCompatibilityVersion is the downgrade version, so that it can be safely added to a downgrade-version cluster. The config server will run setFeatureCompatibilityVersion as part of addShard.
The symptom indicates that the failing node is already on 3.6 but has not been added to the sharded cluster (and hence does not have a validator). If the scenario goes as expected, it should be on FCV 3.4 at this stage. Could you please confirm the server was started with the --shardsvr option?
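
A quick way to check both on the failing member (a sketch using standard shell commands, run directly against that mongod):

// Current featureCompatibilityVersion of the node
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Options the process was started with; look for a "shardsvr" cluster role under the parsed sharding options
db.adminCommand({ getCmdLineOpts: 1 })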

SERVER-32672 is specific to our testing infrastructure behavior in 4.0.

Could you please clarify the exact release where the issue occurred, and provide the repro scenario if one exists?

Comment by Francis Cheng [ 25/Jun/19 ]

I tried setting the featureCompatibilityVersion to 3.4 but that failed. Finding out which member of a dead replica set is broken and adding it back is not a good workaround for us; it is too risky for a production rollout. I guess this is not the way MongoDB is supposed to be used?

Comment by Danny Hatcher (Inactive) [ 17/Jun/19 ]

Have you tried adding the problematic member back into the replica set? If you do so and it fails, could you please set the featureCompatibilityVersion of the replica set to "3.4" and try to add the member back again? Please note that doing so may break the operation of some backwards-incompatible features.
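
For reference, the commands involved would be roughly the following (a sketch; the host:port comes from the error message in the description and may differ in your deployment):

// On the replica set primary: lower the featureCompatibilityVersion
db.adminCommand({ setFeatureCompatibilityVersion: "3.4" })

// Then try re-adding the removed member
rs.add("qm3:27101")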

Comment by Francis Cheng [ 17/Jun/19 ]

Another thing I want to add: once we removed the problematic member, which is qm3 in the test associated with the uploaded log, the replica set went back to normal.

Comment by Francis Cheng [ 17/Jun/19 ]

Hi Daniel,

I've uploaded logs from all replica members. No sharded cluster was involved in this stage so I have no mongos logs to provide.

My "restart as shard" operation happened at around 2019-06-17T16:30 ~ 2019-06-17T16:40.

I set the log level to 5 across the replica set, but it seems this only affected one of the nodes.
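
(Log verbosity is a per-process setting and is not replicated, so it has to be set on each member individually; a minimal sketch using standard shell commands, run against each node:)

db.setLogLevel(5)
// or equivalently:
db.adminCommand({ setParameter: 1, logLevel: 5 })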

Please ask if you need more input from us.

Comment by Danny Hatcher (Inactive) [ 13/Jun/19 ]

Hello,

In order for us to verify the situation you are experiencing, could you please upload the mongod and mongos logs from all the nodes covering your conversion attempt to our Secure Upload Portal? Please note that only MongoDB engineers will be able to see the files that you upload.
