[SERVER-35377] Operations on new clients get latest in-memory clusterTime as operationTime Created: 04/Jun/18 Updated: 29/Oct/23 Resolved: 14/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.6.7, 4.0.0-rc6, 4.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Misha Tyulenev |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | Sharding36Regression | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Sharding 2018-06-18 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 60 |
| Description |
|
When computing the operationTime for a response, the "client operation time" at the start of a command is compared to the time at the end. Some error contexts have no start operation time, so if the start time is a null logical time, the latest in-memory cluster time is returned instead. The start time is computed from the opTime of the last operation on the current client. If the client has never been used, however, that opTime can be null, which produces a null start time, and the returned operationTime becomes the latest in-memory cluster time even for successful operations.

This does not violate causal consistency, but it can create problems in our test infrastructure when the no-op writer is off, because the latest in-memory clusterTime can be higher than the latest opTime in a replica set primary's oplog. In particular, when secondaries force metadata refreshes on the primary, they use the operationTime of _forceRouterConfigUpdate to do a waitUntilOpTime (which bypasses forced no-op writes) and can hang forever. Possible solutions would be:
This is new behavior introduced by the refactoring in |
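For illustration, here is a minimal, self-contained sketch of the null-start fallback described above. This is not the actual server code: LogicalTime here is a simplified stand-in for the server's LogicalTime, and latestInMemoryClusterTime and lastOpForClient are placeholder names for the LogicalClock's gossiped cluster time and the client's lastOp.

```cpp
#include <cstdint>
#include <iostream>

// Simplified stand-in for the server's LogicalTime; a default-constructed
// value plays the role of the null logical time.
struct LogicalTime {
    std::uint64_t secs = 0;
    std::uint64_t inc = 0;
    bool isNull() const { return secs == 0 && inc == 0; }
};

// Placeholders for the latest cluster time gossiped in memory and the opTime
// of the last operation performed on the current client.
const LogicalTime latestInMemoryClusterTime{100, 0};
const LogicalTime lastOpForClient{};  // null: this client has never run an op

LogicalTime computeOperationTime(LogicalTime startOperationTime,
                                 LogicalTime endOperationTime) {
    // Error contexts legitimately have no start operation time, so a null
    // start falls back to the latest in-memory cluster time. A brand-new
    // client also has a null lastOp, so successful operations on such a
    // client take the same branch and can report an operationTime ahead of
    // the primary's oplog, which is the behavior this ticket describes.
    if (startOperationTime.isNull()) {
        return latestInMemoryClusterTime;
    }
    return endOperationTime;
}

int main() {
    LogicalTime start = lastOpForClient;  // null for a fresh client
    LogicalTime end{90, 1};               // opTime the command actually produced
    LogicalTime opTime = computeOperationTime(start, end);
    std::cout << "reported operationTime: " << opTime.secs << ", "
              << opTime.inc << "\n";
    return 0;
}
```

Run as written, this sketch reports an operationTime of (100, 0) rather than (90, 1), mirroring how a fresh client's null lastOp causes the latest in-memory clusterTime to be reported even though the operation's own opTime is behind it.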
| Comments |
| Comment by Misha Tyulenev [ 02/Aug/18 ] |
|
glajchs I don't think that this issue is directly related, but do not exclude it completely. Check whether this backport reduces the amount of logging you currently see. |
| Comment by Scott Glajch [ 30/Jul/18 ] |
|
Could this have anything to do with the issue I was seeing in this comment here: Basically, after upgrading to 3.6, when I do the stepdown (so there's now a primary in the shard that is 3.6), I see 80-150K log lines in the first minute or two that look something like this: The number of these messages does diminish over time, but they don't seem to ever fully go away. |
| Comment by Githook User [ 27/Jul/18 ] |
|
Author: {'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}
Message: (cherry picked from commit bf17a390cac465e19a8bd243620b1010d0bad2bb) |
| Comment by Githook User [ 14/Jun/18 ] |
|
Author: {'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}
Message: (cherry picked from commit bf17a390cac465e19a8bd243620b1010d0bad2bb) |
| Comment by Githook User [ 14/Jun/18 ] |
|
Author: {'username': 'mikety', 'name': 'Misha Tyulenev', 'email': 'misha@mongodb.com'}
Message: |
| Comment by Jack Mulrow [ 08/Jun/18 ] |
|
kaloian.manassiev |
| Comment by Jack Mulrow [ 04/Jun/18 ] |
|
max.hirschhorn, I think so. Before the refactoring in |
| Comment by Max Hirschhorn [ 04/Jun/18 ] |
|
jack.mulrow, misha.tyulenev, is this the cause of |