[SERVER-44675] server_status_metrics.js fails due to racy repl.buffer.count metric in serverStatus Created: 15/Nov/19 Updated: 29/Oct/23 Resolved: 19/Nov/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.15 |
| Fix Version/s: | 3.6.16, 4.2.2, 4.0.14, 4.3.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Backport Requested: |
v4.2, v4.0, v3.6
|
||||||||
| Sprint: | Repl 2019-11-18, Repl 2019-12-02 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 8 | ||||||||
| Description |
|
The test fails (very rarely) because of a race in how the repl.buffer.count metric is calculated. There is a window in which the rsBackgroundSync thread has added oplog entries to the buffer but has not yet incremented repl.buffer.count. During this window, the ReplBatcher thread can clear the buffer and decrement repl.buffer.count. Because the count can be decremented before it is incremented, it can be briefly negative.

The server_status_metrics.js test doesn't expect this race. First, the test inserts 1000 docs with w: 2. The secondary's oplog buffer fills and empties, so the metric is incremented by 1000 and decremented by 1000. The test calls serverStatus on the secondary and checks that repl.buffer.count >= 0; in fact it is 0, so the assertion passes. Next, the test updates all 1000 docs with w: 2. Events proceed perhaps in this order:
|
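The race described above (count decremented before it is incremented) can be modeled with a small deterministic sketch. This is a toy illustration, not server code: `buffer`, `bufferCount`, and `observed` are hypothetical names, and the three steps are run one after another to stand in for the two threads named in the description.

```javascript
// Toy model of the interleaving; all names are illustrative, not server code.
const buffer = [];
let bufferCount = 0;   // stands in for the repl.buffer.count metric
const observed = [];   // values a serverStatus call could see at each step

const batch = Array.from({length: 1000}, (_, i) => ({op: "u", id: i}));

// Step 1: the producer (rsBackgroundSync in the description) adds entries
// to the buffer but has not yet incremented the metric.
buffer.push(...batch);
observed.push(bufferCount);

// Step 2: the consumer (ReplBatcher) clears the buffer and decrements the metric.
const drained = buffer.splice(0, buffer.length);
bufferCount -= drained.length;
observed.push(bufferCount);

// Step 3: the producer finally increments the metric.
bufferCount += batch.length;
observed.push(bufferCount);

console.log(observed);  // [ 0, -1000, 0 ]
```

A serverStatus call that samples the metric between steps 2 and 3 would see a negative count, even though the value settles back to 0.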
| Comments |
| Comment by Githook User [ 22/Nov/19 ] |
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}Message: |
| Comment by Githook User [ 22/Nov/19 ] |
|
Author: {'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis'}Message: |
| Comment by Githook User [ 22/Nov/19 ] |
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}Message: |
| Comment by A. Jesse Jiryu Davis [ 19/Nov/19 ] |
|
The BF that motivated this change is on the 3.6 branch. Let's backport all the way to 3.6. |
| Comment by Githook User [ 19/Nov/19 ] |
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}Message: |
| Comment by A. Jesse Jiryu Davis [ 19/Nov/19 ] |
|
Now that I've looked at that code, I don't think it's the right answer for this bug. The code in check_transaction_server_status_invariants.js asserts that only a small percentage of serverStatus calls returned inconsistent metrics. For server_status_metrics.js, however, we only have one sample, so it's either right or wrong, and we should handle a wrong sample by retrying for a short period. |
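A minimal sketch of the retry idea, assuming a single metric read can be repeated until it looks sane. `readMetricWithRetry`, `fakeBufferCount`, and the attempt-based limit are hypothetical and are not taken from the actual fix. The comment speaks of retrying "for a short period", so a real test would more likely bound the loop by time (the mongo shell's `assert.soon` is a helper of that kind).

```javascript
// Hypothetical helper, not part of the MongoDB test libraries.
// Calls readMetric() until isValid(value) is true or maxAttempts is reached.
function readMetricWithRetry(readMetric, isValid, maxAttempts = 10, pause = () => {}) {
    let value;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        value = readMetric();
        if (isValid(value)) {
            return {value, attempts: attempt};
        }
        pause();  // a real test would sleep briefly here
    }
    throw new Error(`metric still invalid after ${maxAttempts} attempts: ${value}`);
}

// Stand-in for serverStatus on the secondary: the first sample catches the
// transient negative count, later samples see the settled value.
const samples = [-1000, 0];
const fakeBufferCount = () => (samples.length > 1 ? samples.shift() : samples[0]);

const result = readMetricWithRetry(fakeBufferCount, (count) => count >= 0);
console.log(result);  // { value: 0, attempts: 2 }
```

Retrying suits this test because it takes only one sample: a transient negative value is tolerated, while a value that stays negative still fails.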
| Comment by A. Jesse Jiryu Davis [ 18/Nov/19 ] |
|
Consider factoring the proposed retry loop together with the one in the transactions concurrency suite's serverStatus metrics tests, which also compensate for races like this one. |