[SERVER-48909] Fix a bug where a sharded backup/restore test loses a record
| Created: | 11/Jun/20 | Updated: | 28/Nov/23 | Resolved: | 28/Nov/23 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alexander Gorrod | Assignee: | Lamont Nelson |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | sharding-nyc-subteam2 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding NYC |
| Sprint: | Sharding 2020-07-13, Sharding 2020-06-29, Sharding 2020-07-27, Sharding 2020-08-10, Sharding 2020-08-24, Sharding 2022-02-21, Cluster Scalability 2023-11-27, Cluster Scalability 2023-12-11 |
| Participants: | |
| Linked BF Score: | 18 |
| Story Points: | 8 |
| Description |
The MongoDB test suite has seen a failure where a sharded backup/restore test does not see the expected set of records at the end of the test. The test is run with majority read concern off. The test case is:
Stacktrace:
Logs:
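For context, the following is a minimal, hypothetical sketch (mongo shell JavaScript) of the kind of workload this test exercises: causally consistent inserts across shards, an fsync to force a checkpoint, further inserts that live only in the journal, and a post-restore check for a specific document. It is not the actual test; the connection names and collection name are placeholders.

{code:javascript}
// Hypothetical sketch only -- NOT the actual test from the suite.
// Assumes a mongos connection in `db`, a direct connection to a shard
// primary in `shardConn`, and a placeholder collection name.
const session = db.getMongo().startSession({causalConsistency: true});
const coll = session.getDatabase("test").getCollection("backup_restore");

// Writes 0-99: ordered across shards because the session is causally consistent.
for (let i = 0; i < 100; i++) {
    assert.commandWorked(coll.insert({_id: i}));
}

// Force a checkpoint on the shard; later writes live only in the journal.
assert.commandWorked(shardConn.getDB("admin").runCommand({fsync: 1}));

// Document 100 is the first write after the checkpoint.
for (let i = 100; i < 200; i++) {
    assert.commandWorked(coll.insert({_id: i}));
}

// A backup is then taken (e.g. via a backup cursor), the files are copied and
// restored elsewhere, and the test asserts every inserted document survived.
// The check that fails here would look something like:
//   assert.neq(null, restoredColl.findOne({_id: 100}), "Doc 100 is missing");
{code}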
| Comments |
| Comment by Lamont Nelson [ 28/Nov/23 ] |
The last occurrence of this was in 2021. Closing this issue.
| Comment by Rachita Dhawan [ 11/Feb/22 ] |
After analyzing the logs from https://jira.mongodb.org/browse/BFG-939373 (last occurrence on Dec 9, 2021), and also looking at the archived data files brought up with --setParameter recoverFromOplogAsStandalone=true:
4. The document 100 insert timestamp from the BF log: {"$timestamp":{"t":1638974152,"i":125}}
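As a hypothetical illustration of that kind of inspection (not taken from the ticket), once mongod has been started on the archived files with --setParameter recoverFromOplogAsStandalone=true, the insert of document 100 can be looked up in the recovered oplog and its timestamp compared with the one from the BF log. The namespace used below is an assumption.

{code:javascript}
// Hypothetical sketch, not from the ticket: inspect the recovered standalone
// started with
//   mongod --dbpath <archived files> --setParameter recoverFromOplogAsStandalone=true
// The namespace "test.backup_restore" is a placeholder assumption.
const oplog = db.getSiblingDB("local").getCollection("oplog.rs");
const entry = oplog.findOne({op: "i", ns: "test.backup_restore", "o._id": 100});
if (entry === null) {
    print("No oplog entry found for the insert of document 100");
} else {
    // Expected to match the timestamp in the BF log: Timestamp(1638974152, 125).
    printjson(entry.ts);
}
{code}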
| Comment by Kaloian Manassiev [ 15/Nov/21 ] |
Putting this ticket in Needs Triage and on the Sharding NYC backlog since it is not related to the catalog.
| Comment by Lamont Nelson [ 08/Jul/20 ] |
Since
| Comment by Lamont Nelson [ 08/Jul/20 ] |
Yes, in the originating BF for this ticket the restore point was also after the backup point. I didn't see evidence in the log for the timestamp of document 100, but the restore point was after the timestamp for document 101. I'm unable to reproduce this locally, and the logs do not indicate whether this is a problem in writing the data or in reading from the backup.
| Comment by Daniel Gottlieb (Inactive) [ 18/Jun/20 ] |
Perhaps the same root cause as
| Comment by Judah Schvimer [ 17/Jun/20 ] |
I'm currently just looking at the session ids and assuming they're causal or not based on what the test does. 3c3a7373-4def-4c02-8099-143f5f3d161 seems to be the first session, so I'm assuming that's the non-causal one. We also see this line:
6036b3f8-307e-412f-8106-5aea0828bd45 comes next, so I think that's the causal one, but we never see it get started, likely due to some details in how the test infrastructure works involving the second shell we start. If we saw afterClusterTime arguments we'd know causal consistency was enabled. I don't think we log that for inserts here (since we don't log the entire command in slow-operation logs), so I think we don't really know. I would guess causal consistency is enabled, though, and that it's not sufficient for the test due to some other test bug, or that there is a server bug, which would be very concerning. If I were to choose a non-storage team, I'd start with sharding.
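For reference, a hedged sketch of what would make causal consistency observable: a causally consistent session records the operation time of each write, and the shell/driver attaches it as readConcern.afterClusterTime on later reads, which is the argument that would confirm it if the full command were logged. The names below are placeholders.

{code:javascript}
// Hypothetical sketch; connection and collection names are placeholders.
const causalSession = db.getMongo().startSession({causalConsistency: true});
const causalColl = causalSession.getDatabase("test").getCollection("backup_restore");

assert.commandWorked(causalColl.insert({_id: 100}));

// The operation time the session will gate subsequent reads on.
printjson(causalSession.getOperationTime());

// This read is sent with readConcern: {afterClusterTime: <operationTime>},
// so it cannot observe collection state older than the insert above.
causalColl.find({_id: 100}).itcount();
{code}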
| Comment by Susan LoVerso [ 16/Jun/20 ] |
If this is a backup test issue, the first thing (if it isn't already done) would be for the test to keep a copy of the backup that we restart from. I thought the tests already copied it into some kind of shadow directory, but I might be remembering wrong. With that directory we should be able to determine whether the backup and logs are the issue.
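A hypothetical sketch of that suggestion, assuming the legacy shell's resetDbpath()/copyDbpath() test helpers and placeholder paths; the actual test infrastructure may handle this differently.

{code:javascript}
// Hypothetical sketch, assuming the legacy shell's resetDbpath()/copyDbpath()
// helpers; both paths are placeholders.
const backupPath = MongoRunner.dataPath + "backup_files";        // files copied from the backup
const shadowPath = MongoRunner.dataPath + "backup_files_shadow"; // untouched copy for later analysis

resetDbpath(shadowPath);             // make sure the shadow directory is empty
copyDbpath(backupPath, shadowPath);  // preserve the backup before restarting from it

// The restart-from-backup step then runs only against backupPath, leaving
// shadowPath available to determine whether the backup/journal or the restore
// path lost the document.
{code}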
| Comment by Daniel Gottlieb (Inactive) [ 16/Jun/20 ] |
The test uses causal consistency because it's for sharding. E.g.: the opTime for insert #90 must be less than that of #91 even though they'll end up on different shards. Write #100 is special because it's the first write after forcing a checkpoint with the fsync command. A backup cursor opened after that point will likely capture writes 0-99 in the checkpoints (spread across the shards), and writes 100+ will be part of the journal that's copied. Given there have been challenges getting WT to replay the log from backup data, I'd first rule that out. I tried figuring out which of these tickets has a patch that should have stuck, but I'm struggling to untangle things. They're all closed as duplicates of each other, and the two with commits have a corresponding revert:
That said, this could very well be a MongoDB bug. What I just described is how I triage whether this is something for WT or SERVER to look at.
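A hedged sketch of that triage step: force the checkpoint, open a backup cursor on a shard, and read its metadata to see which writes must come from the copied journal rather than the checkpoint. shardConn and the exact metadata field layout are assumptions.

{code:javascript}
// Hypothetical sketch; `shardConn` is a placeholder for a direct connection
// to one shard's primary, and the metadata field names are an assumption
// about the usual $backupCursor output.
assert.commandWorked(shardConn.getDB("admin").runCommand({fsync: 1}));

const backupCursor = shardConn.getDB("admin").aggregate([{$backupCursor: {}}]);

// The first document returned is metadata; checkpointTimestamp is the point
// up to which data is already durable in the checkpoint. Writes with later
// timestamps (documents 100+ here) must be recovered from the journal files
// named by the remaining documents.
const metadata = backupCursor.next().metadata;
printjson(metadata.checkpointTimestamp);

while (backupCursor.hasNext()) {
    printjson(backupCursor.next().filename);  // a file to copy as part of the backup
}
backupCursor.close();
{code}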
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
Thanks judah.schvimer - could you help find the right backlog to move this to for initial analysis? I've moved it to a SERVER ticket to make that more obvious. Even if this ends up being a WiredTiger bug, I think we'll need help in characterizing what the test is expecting and where it's going wrong.
| Comment by Judah Schvimer [ 16/Jun/20 ] |
This is definitely worth digging deeper into. It certainly could make sense though for someone on repl/sharding/execution to look first for a test bug.
| Comment by Alexander Gorrod [ 16/Jun/20 ] |
judah.schvimer, you recently made a change to the test referenced here in
The failure being reported, {{assert failed : Doc 100 is missing}}, seems interesting to me. The test log shows that there are a lot more documents than that ({{There are 22170 documents in total.}}), which makes me wonder why that specific document is the one missing. The question I have explicitly is why the failure comes from a function called _verifyDataIsCausallyConsistent when the test output is sprinkled very liberally with {{options: { "causalConsistency" : false }}}. For example:
You are more familiar with the test than I am. Do you think it's likely that this is a test problem, or is it something worth digging deeper into? We've seen the failure exactly once, so I don't have much signal about when the problem was introduced.
| Comment by Alexander Gorrod [ 14/Jun/20 ] |
I kicked off relevant testing on a number of intermediate jobs, to see if we can get more information about when this started and how often it occurs:
| Comment by Alexander Gorrod [ 11/Jun/20 ] |
The first step here is understanding the test case and the failure. We should check whether any changes we've made since the failure happened could explain the symptom (especially those around recovery and rollback to stable). It's possible that the failure isn't due to a bug in WiredTiger, but that seems unlikely, since WiredTiger is where backup and recovery behavior has been changing most actively.