[SERVER-36159] Log whenever the gossiped config server opTime term changes Created: 17/Jul/18 Updated: 29/Oct/23 Resolved: 11/Jul/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.15, 4.2.0-rc0, 4.0.13 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Kevin Pulo |
| Resolution: | Fixed | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Backport Requested: | v4.0, v3.6, v3.4 |
| Sprint: | Sharding 2018-08-13, Sharding 2019-02-11, Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25, Sharding 2019-05-20, Sharding 2019-06-03, Sharding 2019-06-17, Sharding 2019-07-01, Sharding 2019-08-12 | ||||||||||||
| Participants: | |||||||||||||
| Case: | (copied to CRM) | ||||||||||||
| Description |
|
In MongoDB 3.4 and earlier, the nodes of a sharded cluster gossip the config server's opTime to ensure they always read the latest routing metadata. This opTime contains both a timestamp and a term and, since it is only used internally between the cluster nodes, it is not signed or verified in any way. As part of a customer support case, we observed the gossiped config server opTime term jump forward without it actually having changed on the config server itself. Such a jump could potentially happen due to a DNS misconfiguration causing members of a sharded cluster to inadvertently talk to the wrong host. Since there is no validation in 3.4, the term jumping forward could have disastrous consequences for the entire cluster. To help diagnose such issues, shard nodes should log whenever the config server's opTime term changes. Ideally, such logging should also include the node from which the new term came, so that it can be traced back to the first node which caused it. |
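The requested logging could look roughly like the following Python sketch. It is only an illustration of the idea, not the server's implementation: `OpTime`, `ConfigOpTimeTracker`, `advance`, and the host names in the usage note are hypothetical, and the sketch assumes opTimes are ordered by term first and timestamp second.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("config_optime")


@dataclass(frozen=True)
class OpTime:
    """Hypothetical stand-in for a gossiped config server opTime."""
    timestamp: int
    term: int


class ConfigOpTimeTracker:
    """Tracks the newest config opTime seen and logs term changes (sketch)."""

    def __init__(self) -> None:
        self._last: Optional[OpTime] = None

    def advance(self, received: OpTime, source: str) -> bool:
        """Record a gossiped opTime; return True if it changed the term."""
        previous = self._last
        # Assumption: opTimes compare by (term, timestamp); ignore stale values.
        if previous is not None and (
            (received.term, received.timestamp)
            <= (previous.term, previous.timestamp)
        ):
            return False
        self._last = received
        # The first observation has nothing to compare against, so it is
        # not treated as a term change.
        term_changed = previous is not None and received.term != previous.term
        if term_changed:
            # Include the sender so the jump can be traced to its origin.
            logger.warning(
                "config opTime term changed from %d to %d (received from %s)",
                previous.term, received.term, source,
            )
        return term_changed
```

For example, after `advance(OpTime(110, 1), "shard-b:27018")`, a call to `advance(OpTime(50, 6), "mongos-x:27017")` would log the term moving from 1 to 6 together with the node that sent it, which is the information needed to trace a spurious jump.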
| Comments |
| Comment by Githook User [ 13/Sep/19 ] |
|
Author: Kevin Pulo (username: devkev, email: kevin.pulo@mongodb.com) Message: (cherry picked from commit f6bee9fab63e45bd7ef30e73aff6a21edca16aa2) |
| Comment by Githook User [ 26/Aug/19 ] |
|
Author: Kevin Pulo (username: devkev, email: kevin.pulo@mongodb.com) Message: (cherry picked from commit c2c6ed338f617e89600f4a221abc19045431c46e) |
| Comment by Kevin Pulo [ 11/Jul/19 ] |
|
Remaining work will be continued in |
| Comment by Githook User [ 30/May/19 ] |
|
Author: Kevin Pulo (username: devkev, email: kevin.pulo@mongodb.com) Message: |
| Comment by Eric Sommer [ 18/Feb/19 ] |
|
Additionally, we should extend the error message returned as part of the WriteConcernError (or the ExceededTimeLimit error) to include text like: "Current term is 1 but request is asking for 6" or something similar. This would help alert users that a mismatched/unsatisfiable term is causing the error. |
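A minimal Python sketch of the suggested message follows. The helper name `describe_term_mismatch` is hypothetical, and the sketch assumes a request is unsatisfiable only when it asks for a term newer than the node's current one.

```python
from typing import Optional


def describe_term_mismatch(current_term: int, requested_term: int) -> Optional[str]:
    """Return extra error text when the requested term cannot be satisfied."""
    # Assumption: only a requested term ahead of the current term is a problem.
    if requested_term <= current_term:
        return None
    return (
        f"Current term is {current_term} "
        f"but request is asking for {requested_term}"
    )
```

The returned text could be appended to the existing error string when it is not `None`.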