[SERVER-29396] maxAcceptableLogicalClockDrift of 0 causes spurious op failures Created: 30/May/17  Updated: 30/Oct/23  Resolved: 02/Jun/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.5.7
Fix Version/s: 3.5.9

Type: Bug Priority: Major - P3
Reporter: Kevin Pulo Assignee: Kevin Pulo
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File output-maxAcceptableLogicalClockDrift-0.gz     File output-maxAcceptableLogicalClockDrift-1.gz     File simpleload-noisy.js    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  • Start a sharded cluster on separate machines with --setParameter maxAcceptableLogicalClockDrift=0
  • Create a basic sharded collection, e.g. test.test sharded by _id: hashed
  • Run a basic write workload, e.g. simpleload-noisy.js
  • Observe operations failing (and sometimes errors in the logs) when the wallclock seconds transition
Participants:

 Description   

The maxAcceptableLogicalClockDrift startup-only parameter is currently restricted to non-negative values. However, setting it to 0 causes operations to spuriously fail to set the logical clock (due to the rate-limiting mechanism) whenever the wallclock time naturally rolls over to the next second. This is because the wallclocks are on different hosts and so are not in perfect lockstep (i.e. there is no global clock, which is the whole reason for having a logical clock in the first place).
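A minimal sketch of the rate-limiter check being described (the function name and signature are illustrative, not the actual server implementation): a new cluster time is accepted only if it does not exceed the local wall clock by more than the configured drift, so with a drift of 0 a remote node whose clock has just ticked over to the next second is rejected.

```python
# Hypothetical sketch of the logical clock rate limiter; names are
# illustrative, not the actual MongoDB C++ implementation.
def rate_limit_ok(new_cluster_secs: int, wall_clock_secs: int,
                  max_drift_secs: int) -> bool:
    """Accept a new cluster time only if it is within max_drift_secs
    of this node's wall clock (both in whole seconds)."""
    return new_cluster_secs <= wall_clock_secs + max_drift_secs

# The values from the failed operation below: the remote cluster time is
# 1496037921 while the local wall clock still reads 1496037920.
assert rate_limit_ok(1496037921, 1496037920, 1) is True   # drift=1: accepted
assert rate_limit_ok(1496037921, 1496037920, 0) is False  # drift=0: rejected
```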

  • Example failed user operation:

    {
      "nMatched" : 0,
      "nUpserted" : 0,
      "nModified" : 0,
      "writeError" : {
        "code" : 83,
        "errmsg" : "write results unavailable from testcc-7.kevincm1.0633.mongodbdns.com:27000 :: caused by :: ClusterTimeFailsRateLimiter: New cluster time, 1496037921, is too far from this node's wall clock time, 1496037920."
      }
    }
    

  • Example failed system operations (note the timestamps are all very close to the "next second"):

    2017-05-29T05:50:09.986+0000 I REPL     [replication-1] Restarting oplog query due to error: ClusterTimeFailsRateLimiter: error in fetcher batch callback: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.. Last fetched optime (with hash): { ts: Timestamp 1496036996000|4, t: 1 }[4820298219506527748]. Restarts remaining: 3
    2017-05-29T05:50:09.987+0000 I REPL     [replication-0] Restarting oplog query due to error: ClusterTimeFailsRateLimiter: error in fetcher batch callback: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.. Last fetched optime (with hash): { ts: Timestamp 1496036996000|4, t: 1 }[4820298219506527748]. Restarts remaining: 2
    2017-05-29T05:50:09.989+0000 I REPL     [replication-1] Restarting oplog query due to error: ClusterTimeFailsRateLimiter: error in fetcher batch callback: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.. Last fetched optime (with hash): { ts: Timestamp 1496036996000|4, t: 1 }[4820298219506527748]. Restarts remaining: 1
    2017-05-29T05:50:09.991+0000 I REPL     [replication-0] Error returned from oplog query (no more query restarts left): ClusterTimeFailsRateLimiter: error in fetcher batch callback: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.
    2017-05-29T05:50:09.991+0000 W REPL     [rsBackgroundSync] Fetcher stopped querying remote oplog with error: ClusterTimeFailsRateLimiter: error in fetcher batch callback: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.
    2017-05-29T05:50:09.992+0000 I REPL_HB  [ReplicationExecutor] Error in heartbeat (requestId: 148) to testcc-4.kevincm1.0633.mongodbdns.com:27000, response status: ClusterTimeFailsRateLimiter: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.
    2017-05-29T05:50:09.994+0000 I REPL_HB  [ReplicationExecutor] Error in heartbeat (requestId: 151) to testcc-4.kevincm1.0633.mongodbdns.com:27000, response status: ClusterTimeFailsRateLimiter: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.
    2017-05-29T05:50:09.996+0000 I REPL_HB  [ReplicationExecutor] Error in heartbeat (requestId: 155) to testcc-4.kevincm1.0633.mongodbdns.com:27000, response status: ClusterTimeFailsRateLimiter: New cluster time, 1496037010, is too far from this node's wall clock time, 1496037009.
    

When this happens, it appears (to the rate limiter) that the time difference between the nodes is 1 second (which is unacceptable, since the parameter is 0), but this is actually just an artifact of the per-second resolution; in fact the nodes could be arbitrarily close together in (wallclock) time.
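The per-second resolution artifact can be shown with two hypothetical wall clocks only a couple of milliseconds apart that happen to straddle a second boundary: truncated to whole seconds, they appear a full second apart.

```python
# Two wall clocks straddling a second boundary, skewed by only ~2 ms
# (illustrative values based on the timestamps in the logs above).
t_node_a = 1496037009.999   # just before the rollover
t_node_b = 1496037010.001   # just after the rollover

# Truncated to whole seconds, the apparent difference is a full second...
apparent_drift = int(t_node_b) - int(t_node_a)
assert apparent_drift == 1

# ...even though the actual skew is tiny, far below one second.
actual_skew = t_node_b - t_node_a
assert actual_skew < 0.01
```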

This does not occur when maxAcceptableLogicalClockDrift is 1:

$ zgrep -c ClusterTimeFailsRateLimiter output-maxAcceptableLogicalClockDrift-*
output-maxAcceptableLogicalClockDrift-0.gz:48
output-maxAcceptableLogicalClockDrift-1.gz:0
$ gzip -dc output-maxAcceptableLogicalClockDrift-0.gz | wc -l
57499
$ gzip -dc output-maxAcceptableLogicalClockDrift-1.gz | wc -l
60490

The maxAcceptableLogicalClockDrift parameter should instead be restricted to positive values only, i.e. >= 1.
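A sketch of the suggested validation (illustrative Python, not the actual C++ server code; the fix below also renames the parameter to maxAcceptableLogicalClockDriftSecs): reject any value below 1 at startup rather than only rejecting negatives.

```python
# Hypothetical startup validation for the drift parameter: values < 1
# (including the problematic 0) are rejected.
def validate_max_drift_secs(value: int) -> int:
    if value < 1:
        raise ValueError(
            "maxAcceptableLogicalClockDriftSecs must be a positive integer")
    return value

assert validate_max_drift_secs(1) == 1
try:
    validate_max_drift_secs(0)
    assert False, "0 should have been rejected"
except ValueError:
    pass
```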



 Comments   
Comment by Githook User [ 02/Jun/17 ]

Author: Kevin Pulo (devkev) <kevin.pulo@mongodb.com>

Message: SERVER-29396 Fix logical clock unit test for kMaxAcceptableLogicalClockDriftSecs rename
Branch: master
https://github.com/mongodb/mongo/commit/b1232f05862ab624912319acc673b37f591a1619

Comment by Githook User [ 02/Jun/17 ]

Author: Kevin Pulo (devkev) <kevin.pulo@mongodb.com>

Message: SERVER-29396 maxAcceptableLogicalClockDriftSecs must be positive

Rename parameter to have the "Secs" unit at the end, and prevent it from being 0.
Branch: master
https://github.com/mongodb/mongo/commit/0827a5b8aeef04b66914a5c6841f54165035c2a4

Comment by Andy Schwerin [ 30/May/17 ]

We should have put a unit in that argument, too. It should end in Secs.

Generated at Thu Feb 08 04:20:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.