[SERVER-79912] CheckReplDBHash reports failure with system.buckets collections due to invalid BSON
| Created: | 08/Aug/23 | Updated: | 29/Oct/23 | Resolved: | 30/Aug/23 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.2.0-rc0, 7.0.2, 7.1.0-rc1, 6.0.11 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Felipe Gasper | Assignee: | Yuhong Zhang |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Storage Execution NAMER |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v7.1, v7.0, v6.0 |
| Sprint: | Execution NAMR Team 2023-08-21, Execution NAMR Team 2023-09-04 |
| Participants: |
| Description |
|
The key phrase is: "checkReplicatedDataHashes, the two nodes have a different hash for the collection test.system.buckets.timeseries_internal_bounded_sort". This is from the destination cluster in a mongosync run. Notably, mongosync creates time-series collections by first creating a system.buckets.mongosync.tmp.$srcUUID collection, then renaming it to the correct system.buckets name, and finally creating the view separately. That may be part of the cause. |
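The failing check compares per-collection hashes between two nodes. The following is a minimal sketch of that comparison, assuming each node's result has the shape returned by the `dbHash` command (a `collections` mapping from collection name to hash). The helper name `diff_collection_hashes`, the `other_coll` collection, and the hash values are made up for illustration; this is not the actual test-hook code.

```python
def diff_collection_hashes(primary_result, secondary_result):
    """Return sorted names of collections whose hashes differ,
    including collections that exist on only one node."""
    primary = primary_result.get("collections", {})
    secondary = secondary_result.get("collections", {})
    names = set(primary) | set(secondary)
    return sorted(n for n in names if primary.get(n) != secondary.get(n))


# Placeholder hash values, not taken from the failing run.
primary_result = {
    "collections": {
        "system.buckets.timeseries_internal_bounded_sort": "aaa111",
        "other_coll": "ccc333",
    }
}
secondary_result = {
    "collections": {
        "system.buckets.timeseries_internal_bounded_sort": "bbb222",
        "other_coll": "ccc333",
    }
}

mismatched = diff_collection_hashes(primary_result, secondary_result)
print(mismatched)
```

Only the buckets collection is reported here, because it is the only entry whose hash differs between the two sample results.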
| Comments |
| Comment by Githook User [ 01/Sep/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: (cherry picked from commit 806f6d42ee6c2d0a8c1cc7079fb589301961124e) | ||
| Comment by Githook User [ 01/Sep/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: (cherry picked from commit 923f5a7bf7a2929fad74bc7c798f3c223f106d10) | ||
| Comment by Githook User [ 31/Aug/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: (cherry picked from commit 806f6d42ee6c2d0a8c1cc7079fb589301961124e) | ||
| Comment by Githook User [ 31/Aug/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: (cherry picked from commit 923f5a7bf7a2929fad74bc7c798f3c223f106d10) | ||
| Comment by Githook User [ 29/Aug/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: | ||
| Comment by Githook User [ 29/Aug/23 ] | ||
|
Author: {'name': 'Yuhong Zhang', 'email': 'yuhong.zhang@mongodb.com', 'username': 'YuhongZhang98'}Message: | ||
| Comment by Felipe Gasper [ 23/Aug/23 ] | ||
|
yuhong.zhang@mongodb.com It’s hard to test right now due to some temporary breakage. I’ll hopefully confirm later today. | ||
| Comment by Yuhong Zhang [ 22/Aug/23 ] | ||
|
We used to assume that the time-series diff update oplog entries applied on secondaries could only be generated from user inserts on the view namespace, which are guaranteed to be monotonically increasing, so we added an optimization that skips checking whether a delta insert field already exists. That assumption stopped holding once we introduced $_internalApplyOplogUpdate, which can try to diff insert a field that already exists. It is only used by mongosync, which is why we only caught this bug now. I will lower the priority to P3, as it does not affect other types of collections and is not user-facing. | ||
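The effect of skipping the existence check can be illustrated with a toy model. This sketch treats a document as an ordered list of (field name, value) pairs and is not the server's diff-application code. It assumes the malformed result takes the form of duplicate field names, which the ticket does not state explicitly. All function names are hypothetical, and the checked variant is only one possible behavior, not necessarily what the fix does.

```python
def apply_insert_unchecked(fields, inserts):
    """Append every inserted field without checking for an existing one."""
    return list(fields) + list(inserts)


def apply_insert_checked(fields, inserts):
    """Overwrite a field that already exists, otherwise append it."""
    result = list(fields)
    for name, value in inserts:
        for idx, (existing, _) in enumerate(result):
            if existing == name:
                result[idx] = (name, value)
                break
        else:
            # No existing field with this name was found.
            result.append((name, value))
    return result


def has_duplicate_fields(fields):
    """True if any field name appears more than once."""
    names = [name for name, _ in fields]
    return len(names) != len(set(names))


# Toy bucket column keyed by measurement index (illustrative values).
bucket_column = [("0", 10), ("1", 11)]
# A replayed diff insert: "1" already exists, "2" is new.
replayed = [("1", 11), ("2", 12)]

unchecked = apply_insert_unchecked(bucket_column, replayed)
checked = apply_insert_checked(bucket_column, replayed)
print(has_duplicate_fields(unchecked), has_duplicate_fields(checked))
```

With strictly increasing inserts the two variants agree, which is why the shortcut was safe under the original assumption; they diverge only when an insert names a field that is already present.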
| Comment by Yuhong Zhang [ 18/Aug/23 ] | ||
|
The problematic document has 10 change events:
The first oplog entry Max mentioned contained the first 6 change events. The second oplog entry should correspond to a refetch, which, according to the logs, kept hitting write conflicts for about 30 seconds. The last occurrence, similar to the previous ones, looks like
Note that it shows tsOfLastApply as {T:1691503442 I:308}, which is where the first oplog entry stopped. After the refetch, we saw several logs like this
They showed that we skipped the first 6 change events because they happened before tsOfLastFetch {T:1691503442 I:308}. This could also explain why the third oplog entry contained the rest of the change events. So it looks like the refetch should fetch the document at TS {T:1691503442 I:308}, which contains only the first 6 change events instead of all 10. Could this be a problem? Another question, for my own understanding: after mongosync receives the change events, what exact commands does it run on the destination cluster? I would like to better understand how the oplog entry gets generated. | ||
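The skipping behavior described in this comment can be sketched as a timestamp filter. This is only an illustration: the function name `events_to_apply`, the event layout, and the individual increment values are assumptions. It also assumes that events at or before the cutoff are the ones skipped, based on the remark that the first oplog entry stopped at {T:1691503442 I:308}.

```python
def events_to_apply(events, ts_of_last_fetch):
    """Keep only events strictly newer than the last-fetch timestamp.

    Timestamps are (T, I) tuples, which Python compares lexicographically,
    matching seconds-then-increment ordering.
    """
    return [e for e in events if e["ts"] > ts_of_last_fetch]


ts_of_last_fetch = (1691503442, 308)

# Ten hypothetical change events; the increments 303..312 are invented
# so that the sixth event lands exactly on the cutoff.
events = [{"seq": i + 1, "ts": (1691503442, 303 + i)} for i in range(10)]

remaining = events_to_apply(events, ts_of_last_fetch)
print([e["seq"] for e in remaining])
```

Under these assumptions the first 6 events are dropped and events 7 through 10 remain, matching the split between the first and third oplog entries described above.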
| Comment by Felipe Gasper [ 15/Aug/23 ] | ||
|
I’m marking this as a blocker for REP-2771. It may not be directly related to time-series, but mongosync’s (still-in-development) time-series support seems to be what creates it. | ||
| Comment by Alexander Komyagin [ 08/Aug/23 ] | ||
|
Timeseries support is one of the key things on the short-term roadmap (originally planned for the Aug/Sep release). We would definitely appreciate your help here. | ||
| Comment by Gregory Noma [ 08/Aug/23 ] | ||
|
alan.zheng@mongodb.com alex.komyagin@mongodb.com any input on the urgency of this? We could use some clarity on the timeline for mongosync supporting time-series collections |