[SERVER-78321] MongoDB 6.0: Adding a new shard renders all preceding resume tokens invalid Created: 22/Jun/23 Updated: 06/Nov/23 Resolved: 06/Nov/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 6.0.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Oded Raiches | Assignee: | Romans Kasperovics |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | Bug | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Sprint: | QE 2023-09-04, QE 2023-09-18, QE 2023-10-02, QE 2023-10-16, QE 2023-10-30, QE 2023-11-13 | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
Very similar to what was observed in https://jira.mongodb.org/browse/SERVER-42232: we are seeing the same issue in MongoDB 6.0. More info: we are using MongoDB 6.0.6. For the shards, we self-maintain the storage and decide to add a shard from time to time.
In this case, the client, which is connected to the mongos router, is unable to proceed unless we move the start_at_operation_time pointer to ~14:30 so it can continue reading (and this also makes us lose all the updates from ~12:30 to ~14:30, which is not acceptable).
Following is a community thread where the question was raised a few days ago but received no answer: |
| Comments |
| Comment by Romans Kasperovics [ 21/Sep/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com,

Regarding the second incident (with the initial sync): from the log lines below, it looks like this was a node that was either newly added to this replica set, or that had already been part of it but had been restarted with its data cleared. So the first thing it needed to do was an initial sync from another member of the set; that means its oplog will be truncated to the clusterTime at which the initial sync starts.

If you initiate a new set with all its members present, then each member will have the “initializing replica set” message in its logs, and so when that replica set gets added as a shard, it will be able to accept resume tokens that pre-date the shard. But if you add a new member to the set after that point, or if you reset a member by clearing its data, it will have to perform an initial sync and therefore will not have the “initializing replica set” message. It will thus not be able to service change stream resume requests, and will throw ChangeStreamHistoryLost.
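One way an application can cope with this situation is to catch ChangeStreamHistoryLost (server error code 286) and fall back to a fresh, unresumed stream. Below is a dependency-free sketch of that decision: the `watch` callable stands in for a driver method such as pymongo's `Collection.watch`, and `StreamError` stands in for the driver's error type carrying the server code; both are illustrative names, not part of any real API.

```python
# Sketch: resume a change stream, falling back to a fresh stream when the
# server reports ChangeStreamHistoryLost. `watch` and `StreamError` are
# stand-ins for the driver's watch method and error type.
CHANGE_STREAM_HISTORY_LOST = 286  # server error code for this exception


class StreamError(Exception):
    """Stand-in for a driver error that carries the server error code."""
    def __init__(self, code):
        super().__init__(code)
        self.code = code


def open_stream(watch, saved_token):
    """Try to resume from saved_token; on history loss, start from 'now'.

    Starting fresh skips the events in the gap, so whether this fallback
    is acceptable is an application-level decision.
    """
    try:
        return watch(resume_after=saved_token)
    except StreamError as exc:
        if exc.code != CHANGE_STREAM_HISTORY_LOST:
            raise
        return watch()  # token pre-dates the shard's oplog; start fresh
```

Whether to fall back silently or alert an operator depends on how much the application can tolerate missing events.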
Regarding the first incident: it appears that you are recording the clusterTime from each event you see, but that means that if events are rare, your resume point can become very stale or even roll off the oplog, as happened in this incident. Instead, you should adopt the recommended paradigm for recording resume tokens:
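A minimal sketch of that paradigm, assuming pymongo (whose change stream cursor exposes `try_next()` and a `resume_token` property); `load_token`, `save_token`, and `handle` are hypothetical application-supplied helpers:

```python
# Sketch of the recommended resume-token pattern: persist the cursor's
# resume_token on every iteration, not just when an event arrives.
# load_token/save_token/handle are hypothetical application helpers.

def resume_options(saved_token, operation_time=None):
    """Build watch() kwargs, preferring resumeAfter over startAtOperationTime."""
    if saved_token is not None:
        return {"resume_after": saved_token}
    if operation_time is not None:
        return {"start_at_operation_time": operation_time}
    return {}


def tail_changes(coll, load_token, save_token, handle):
    """Tail a collection's change stream, saving the token each loop."""
    with coll.watch(**resume_options(load_token())) as stream:
        while stream.alive:
            event = stream.try_next()  # non-blocking; None if no new event
            # resume_token advances even when try_next() returns None,
            # because the server reports progress with each batch.
            save_token(stream.resume_token)
            if event is not None:
                handle(event)
```

`resume_options` makes the preference explicit: a saved token always wins over a timestamp, which matches the advice below to resume with resumeAfter rather than startAtOperationTime.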
This ensures that even if no events are returned from the cursor, the resume token you get back will keep advancing. This way, even if the events are infrequent, you will still have an up-to-date resume point. You should then resume your stream using resumeAfter rather than startAtOperationTime. | |||||||||||||||||
| Comment by Oded Raiches [ 07/Sep/23 ] | |||||||||||||||||
|
Thanks romans.kasperovics@mongodb.com. Why would an initial sync happen here? How can we prevent this? | |||||||||||||||||
| Comment by Romans Kasperovics [ 07/Sep/23 ] | |||||||||||||||||
|
In the case of the second incident on 2023-08-02, the cause was that the node underwent an initial sync shortly before the ChangeStreamHistoryLost exceptions started showing up. mongodb_logs2/node1/mongodb/mongos_7503.log contains: {"t":{"$date":"2023-08-02T06:32:10.337+00:00"},"s":"I", "c":"INITSYNC", "id":21173, "ctx":"ReplCoordExtern-0","msg":"Initial syncer oplog truncation finished","attr":{"durationMillis":0}} | |||||||||||||||||
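Since mongod logs are structured JSON (one document per line), entries like the one above can be filtered by their component field. A small sketch, using an illustrative sample rather than a real log file:

```python
# Sketch: scan structured (JSON-per-line) mongod log lines for
# initial-sync entries, i.e. those with component "c" == "INITSYNC".
import json


def initsync_entries(lines):
    """Return (timestamp, id, msg) tuples for INITSYNC log entries."""
    out = []
    for line in lines:
        try:
            rec = json.loads(line)
        except ValueError:
            continue  # skip non-JSON lines
        if rec.get("c") == "INITSYNC":
            out.append((rec["t"]["$date"], rec.get("id"), rec.get("msg")))
    return out


# Illustrative sample shaped like the log line quoted above.
sample = [
    '{"t":{"$date":"2023-08-02T06:32:10.337+00:00"},"s":"I","c":"INITSYNC",'
    '"id":21173,"msg":"Initial syncer oplog truncation finished"}',
    '{"t":{"$date":"2023-08-02T06:32:11.000+00:00"},"s":"I","c":"REPL",'
    '"id":21358,"msg":"Heartbeat"}',
]
```

Seeing such an entry shortly before ChangeStreamHistoryLost errors is the signature described in this comment.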
| Comment by Romans Kasperovics [ 06/Sep/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com,

OplogStartMissing happens when the syncing node detects a 'gap' between the batches of oplog entries it fetches. In this case, the syncing node might truncate recent oplog entries until it reaches the entries present on both nodes. In this context, having default write concern = 1 carries the risk of data loss.

Nevertheless, we think the OplogStartMissing error is not relevant here. Neither is adding a new shard. The change stream is trying to resume from 2023-06-17 15:54:18 GMT, but the oplog's first time, as seen in the OplogStartMissing message, is 2023-06-18 12:56:45. In other words, the events between 2023-06-17 15:54:18 and the first resume attempt at 2023-06-20 13:15:39 probably do not fit into 50GB. I cannot tell why startAtOperationTime = 2023-06-17 15:54:18 of the resume attempt is so far in the past, but perhaps there were no majority-committed events in the cluster since then. In any case, we recommend using the resumeAfter parameter instead of startAtOperationTime for the reasons described in | |||||||||||||||||
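The check the server effectively performs here is simple: a resume point older than the oplog's first entry cannot be served. A sketch using the timestamps from this incident:

```python
# Sketch: a resume point earlier than the oplog's first entry has rolled
# off and cannot be served; the stream fails instead of silently skipping.
from datetime import datetime


def resume_point_available(resume_ts, oplog_first_ts):
    """True iff the requested resume point is still covered by the oplog."""
    return resume_ts >= oplog_first_ts


# Times from this incident: the resume attempt targets 2023-06-17
# 15:54:18, but the oplog now begins at 2023-06-18 12:56:45.
resume = datetime(2023, 6, 17, 15, 54, 18)
oplog_first = datetime(2023, 6, 18, 12, 56, 45)
```

With these values the resume point falls before the oplog window, so the attempt must fail, matching the error observed.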
| Comment by Oded Raiches [ 04/Sep/23 ] | |||||||||||||||||
|
Hi romans.kasperovics@mongodb.com,
We have seen this happen in P-S-S as well, but did not acquire the logs. Mainly this happens during large migrations with new shards. Does it mean we need to resize the oplog? Or can this "truncation" happen at any size regardless? We currently use defaultWriteConcern=1 on these P-S-A setups and would like to move to defaultWriteConcern=2, dynamically changing it when monitoring detects a failure, while also protecting against rollback with on-the-fly priority changes. Do you think this might help as well?
| |||||||||||||||||
| Comment by Romans Kasperovics [ 04/Sep/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, I am looking at the mongodb-logs.zip / mongodb-node-1 / mongos_7502.log.1
I am not sure what causes all the HostUnreachable errors, but in general, in a P-S-A cluster, if the P or the S goes down, the changes during this period cannot be majority-committed, and so no change stream events are emitted. If the oplog is truncated during this period, the change stream history can get lost. Do you know the oplog size on your system? Could it be that there were many changes to the underlying collections between 13:14:19 and 13:15:17? | |||||||||||||||||
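Because the oplog is a fixed-size capped collection, its retention window shrinks as the write rate grows, which is why a burst of changes can push a resume point off the end. A back-of-the-envelope sketch (the 50 GB figure appears earlier in this thread; the write rate is a hypothetical input you would measure on your own system):

```python
# Back-of-the-envelope sketch: how long entries survive in a fixed-size
# oplog before rolling off. Inputs are illustrative, not measured values.

def oplog_retention_hours(oplog_size_gb, write_gb_per_hour):
    """Rough retention window of a capped oplog at a given write rate."""
    return oplog_size_gb / write_gb_per_hour


# e.g. a 50 GB oplog absorbing 25 GB/hour of writes retains ~2 hours
# of history; a migration burst can shrink that window dramatically.
```

If the estimated window is shorter than the longest outage or consumer lag you must tolerate, resizing the oplog (or saving resume tokens more aggressively) is worth considering.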
| Comment by Oded Raiches [ 21/Aug/23 ] | |||||||||||||||||
|
Hi sebastien.mendez@mongodb.com, Is there any update about this issue?
| |||||||||||||||||
| Comment by Ana Meza [ 14/Aug/23 ] | |||||||||||||||||
|
sebastien.mendez@mongodb.com could you find an assignee for this work please? | |||||||||||||||||
| Comment by Yuan Fang [ 10/Aug/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, I have reviewed the log and FTDC data. However, I am now moving this to the team for a more thorough investigation. Please stay tuned for updates. Thank you for your patience! Regards,
| |||||||||||||||||
| Comment by Oded Raiches [ 09/Aug/23 ] | |||||||||||||||||
|
Were you able to look at the logs and find anything? Thanks. | |||||||||||||||||
| Comment by Oded Raiches [ 06/Aug/23 ] | |||||||||||||||||
|
Yes, I uploaded with the full path, but it seemed to me that it worked. Uploaded it now again as follows:
| |||||||||||||||||
| Comment by Yuan Fang [ 04/Aug/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, Thank you for uploading the data. Unfortunately, it seems that the data wasn't uploaded properly, and I couldn't see it in the portal. I've tested the link on my end, and it worked. Would you please try to upload it again? Please make sure that the files are fully uploaded and you should see the progress after initiating the command. One possible reason I've noticed that could cause the upload to fail is if you provided the full path of the file. In other words, you'll need to navigate to the current location of the file and replace <filename> with just the file name (e.g., "mongodb-logs.tgz") without the preceding path. Please let us know if you made another attempt to upload, and thank you so much for your time! | |||||||||||||||||
| Comment by Oded Raiches [ 02/Aug/23 ] | |||||||||||||||||
|
Thanks for the fast reply! Uploaded the logs to the link. The timeline was around {"$date":"2023-08-02T06:30:41.227+00:00"} in the logs.
| |||||||||||||||||
| Comment by Yuan Fang [ 02/Aug/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, Thank you for reaching out and bringing to our attention that the issue has occurred again. Here is the new uploader link, for the issue that happened this time:
Regards, | |||||||||||||||||
| Comment by Oded Raiches [ 02/Aug/23 ] | |||||||||||||||||
|
We got the issue happening again on another system today. This time I collected the logs and $dbpath/diagnostic.data about an hour after the occurrence, so hopefully we have all the info you requested. Can you share a new upload link so I can send the logs once more? | |||||||||||||||||
| Comment by Oded Raiches [ 19/Jul/23 ] | |||||||||||||||||
|
I was not able to reproduce the same on a different system. I have uploaded the logs of the two nodes in a zipped file to your upload portal; note that the $dbpath/diagnostic.data was lost. | |||||||||||||||||
| Comment by Yuan Fang [ 19/Jul/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, Have you had a chance to reproduce the issue? If not, we can start by looking at the log, could you upload them to the upload portal? Thank you. Regards, | |||||||||||||||||
| Comment by Oded Raiches [ 27/Jun/23 ] | |||||||||||||||||
|
Hi yuan.fang@mongodb.com, thanks for the reply. Unfortunately, we only have the mongod/mongos logs, since the $dbpath/diagnostic.data got rotated. We will try to reproduce, but are the mongod/mongos logs sufficient in the meantime? | |||||||||||||||||
| Comment by Yuan Fang [ 26/Jun/23 ] | |||||||||||||||||
|
Hi oraiches@zadarastorage.com, Thank you for your report. It seems the new shard experienced an issue with synchronization of the change stream. In order to understand why the change stream prior to ~14:30 is unavailable on the new shard, we need more diagnostic data for further investigation. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time. For each node in the replica set, spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:
Regards, |