[SERVER-44675] server_status_metrics.js fails due to racy repl.buffer.count metric in serverStatus Created: 15/Nov/19  Updated: 29/Oct/23  Resolved: 19/Nov/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.15
Fix Version/s: 3.6.16, 4.2.2, 4.0.14, 4.3.2

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.2, v4.0, v3.6
Sprint: Repl 2019-11-18, Repl 2019-12-02
Participants:
Linked BF Score: 8

 Description   

The test fails (very rarely) due to a race in how the repl.buffer.count metric is calculated. There's a period when the rsBackgroundSync thread has added oplog entries to the buffer but hasn't yet incremented repl.buffer.count. During this period, the ReplBatcher thread can clear the buffer and decrement repl.buffer.count. Since the count can be decremented before it's incremented, it can be briefly negative. The server_status_metrics.js test doesn't expect this race.

First, the test inserts 1000 docs with w: 2. The secondary's oplog buffer fills and empties, the metric is incremented by 1000 and decremented by 1000. The test calls serverStatus on the secondary and checks that repl.buffer.count >= 0, in fact it's 0, and the assertion passes. 

Next, the test updates all 1000 docs with w: 2. Events proceed perhaps in this order:

  1. the rsBackgroundSync thread in BackgroundSync::_enqueueDocuments buffers 1000 oplog entries, bufferCountGauge is still 0
  2. the ReplBatcher thread in SyncTail::tryPopAndWaitForMore calls bufferCountGauge.decrement(1) a thousand times, now it's -1000
  3. the test calls serverStatus, repl.buffer.count is -1000 so the test will fail
  4. the rsBackgroundSync thread in BackgroundSync::_enqueueDocuments calls bufferCountGauge.increment(1000)


 Comments   
Comment by Githook User [ 22/Nov/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: SERVER-44675 Handle race in repl metrics test
Branch: v4.0
https://github.com/mongodb/mongo/commit/e461d954679f23ae6b465d5fccd8e7afd492f41c

Comment by Githook User [ 22/Nov/19 ]

Author:

{'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis'}

Message: SERVER-44675 Handle race in repl metrics test
Branch: v4.2
https://github.com/mongodb/mongo/commit/257f01131fd83915c7ac6c5964075ce7a805569c

Comment by Githook User [ 22/Nov/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: SERVER-44675 Handle race in repl metrics test
Branch: v3.6
https://github.com/mongodb/mongo/commit/204ab367a130a4fd2db1c54b02cd6a86e4e07f56

Comment by A. Jesse Jiryu Davis [ 19/Nov/19 ]

The BF that motivated this change is on the 3.6 branch. Let's backport all the way to 3.6.

Comment by Githook User [ 19/Nov/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: SERVER-44675 Handle race in repl metrics test
Branch: master
https://github.com/mongodb/mongo/commit/04e8cf864b283af0d5355bc242e6bbab3b925246

Comment by A. Jesse Jiryu Davis [ 19/Nov/19 ]

Now that I've looked at that code, I don't think it's the right answer for this bug. The code in check_transaction_server_status_invariants.js asserts that only a small percentage of serverStatus calls returned inconsistent metrics. For server_status_metrics.js however, we only have one sample, so it's either right or wrong and we should handle wrong metrics by retrying for a short period.

Comment by A. Jesse Jiryu Davis [ 18/Nov/19 ]

Consider factoring the proposed retry loop with that in the transactions concurrency suite's serverStatus metrics tests, which also compensate for races like this one.

Generated at Thu Feb 08 05:06:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.