[SERVER-78321] MongoDB 6.0: Adding a new shard renders all preceding resume tokens invalid Created: 22/Jun/23  Updated: 06/Nov/23  Resolved: 06/Nov/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Oded Raiches Assignee: Romans Kasperovics
Resolution: Works as Designed Votes: 0
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-42232 Adding a new shard renders all preced... Closed
related to SERVER-42290 Target change streams to the primary ... Backlog
Assigned Teams:
Query Execution
Operating System: ALL
Sprint: QE 2023-09-04, QE 2023-09-18, QE 2023-10-02, QE 2023-10-16, QE 2023-10-30, QE 2023-11-13
Participants:

 Description   

Very similar to what was observed here:

https://jira.mongodb.org/browse/SERVER-42232

Seeing the same issue in MongoDB 6.0.

More info:

Using MongoDB 6.0.6.
We have a DB with sharded collections.
The DB has a few shards, and every shard is a P-S-A replica set (two data-bearing nodes and an arbiter).
My application uses change streams to gather statistics and constantly reads operations, while also maintaining the clusterTime timestamp so that, if the app crashes, we can continue from the last handled point.

For the shards, we self-maintain the storage and decide to add a shard from time to time.
We have noticed that when a shard is added, the following happens:

  1. We added a shard at ~9:15 AM; the shard comes up and rebalances its data with the other shards.
  2. A few hours later, at ~12:30 PM, "Resume of change stream was not possible" errors start showing up, coming from the port of the newly added shard.

    {"t":{"$date":"2023-08-02T06:32:40.283+00:00"},"s":"W",  "c":"QUERY",    "id":20478,   "ctx":"conn188","msg":"getMore command executor error","attr":{"error":{"code":286,"codeName":"ChangeStreamHistoryLost","errmsg":"Resume of change stream was not possible, as the resume point may no longer be in the oplog."},"stats":{},"cmd":{"getMore":2383202557420931747,"collection":"$cmd.aggregate","maxTimeMS":1000}}}

  3. Also, the first available event on the new shard is delayed considerably. Here we added the shard around ~9:30 AM, but its first available event was only at ~14:30 PM, many hours later, while for the other shards the first event is earlier than that. We are not sure how this makes sense, since this is a totally new shard.

In this case, the client, which is connected to the mongos router, is unable to proceed unless we move the start_at_operation_time pointer to 14:30 PM so it can continue reading (and this also makes us lose all the updates from ~12:30 PM to ~14:30 PM, which is not acceptable).
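For reference, a minimal sketch of the resume approach described above, assuming a Python/pymongo client (the connection string, names, and stored timestamp are illustrative, not our exact application code):

# Illustrative sketch only (pymongo assumed); not the exact application code.
from pymongo import MongoClient
from bson.timestamp import Timestamp

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos address
coll = client["mydb"]["mycoll"]                      # hypothetical sharded collection

def handle(event):
    # Application-specific processing (statistics gathering in our case).
    print(event["operationType"])

# clusterTime persisted from the last handled event (illustrative value).
last_cluster_time = Timestamp(1690965160, 1)

# After a crash, resume reading from the stored clusterTime.
with coll.watch(start_at_operation_time=last_cluster_time) as stream:
    for event in stream:
        handle(event)
        last_cluster_time = event["clusterTime"]  # persisted for the next restart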

  • Why is this happening? Isn't the change stream supposed to continue regularly and add the new shard's updates once it is ready? Failing like this does not look like normal behavior.
  • Is there a safe way to add a shard and keep reading incoming updates from the other shards through the mongos router without being stopped by this unsynced shard?
  • Is this happening because of the P-S-A configuration, and will it not occur in P-S-S? If so, why?

This follows this community thread, where the question was raised a few days ago but received no answer:

https://www.mongodb.com/community/forums/t/change-stream-resume-point-lost-after-adding-a-new-shard/232005



 Comments   
Comment by Romans Kasperovics [ 21/Sep/23 ]

Hi oraiches@zadarastorage.com

Regarding the second incident (with the initial sync): from the log lines below, it looks like this was a node that was either newly added to this replica set, or that had already been part of it but had been restarted with its data cleared. The first thing it needed to do was an initial sync from another member of the set, but that means its oplog is truncated to the clusterTime at which the initial sync starts.

If you initiate a new set with all its members present, then each member will have the “initializing replica set” message in its logs, and so when that replica set gets added as a shard, it will be able to accept resume tokens that pre-date the shard. But if you add a new member to the set after that point, or if you reset a member by clearing its data, it will have to perform an initial sync and therefore will not have the “initializing replica set” message. It will thus not be able to service change stream resume requests, and will throw ChangeStreamHistoryLost.

{"t":{"$date":"2023-08-02T06:31:54.989+00:00"},"s":"I",  "c":"REPL",     "id":21313,   "ctx":"initandlisten","msg":"Did not find local replica set configuration document at startup","attr":{"error":{"code":47,"codeName":"NoMatchingDocument","errmsg":"Did not find replica set configuration document in local.system.replset"}}}
 
{"t":{"$date":"2023-08-02T06:32:09.302+00:00"},"s":"I",  "c":"REPL",     "id":21400,   "ctx":"conn30","msg":"Scheduling heartbeat to fetch a new config since we are not a member of our current config","attr":{"senderHost":"10.3.23.62:7801"}}
 
{"t":{"$date":"2023-08-02T06:32:09.303+00:00"},"s":"I",  "c":"REPL",     "id":6015317, "ctx":"ReplCoord-0","msg":"Setting new configuration state","attr":{"newState":"ConfigSteady","oldState":"ConfigHBReconfiguring"}}
 
{"t":{"$date":"2023-08-02T06:32:09.303+00:00"},"s":"I",  "c":"REPL",     "id":21392,   "ctx":"ReplCoord-0","msg":"New replica set config in use","attr":{"config":{"_id":"shard1","version":103851,"members":[{"_id":0,"host":"10.3.23.52:7501","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1},{"_id":1,"host":"10.3.23.53:7501","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1},{"_id":2,"host":"10.3.23.62:7801","arbiterOnly":true,"buildIndexes":true,"hidden":false,"priority":0,"tags":{},"secondaryDelaySecs":0,"votes":1},{"_id":3,"host":"10.3.23.52:7503","arbiterOnly":false,"buildIndexes":true,"hidden":false,"priority":1,"tags":{},"secondaryDelaySecs":0,"votes":1}],"protocolVersion":1,"writeConcernMajorityJournalDefault":true,"settings":{"chainingAllowed":true,"heartbeatIntervalMillis":2000,"heartbeatTimeoutSecs":10,"electionTimeoutMillis":10000,"catchUpTimeoutMillis":-1,"catchUpTakeoverDelayMillis":30000,"getLastErrorModes":{},"getLastErrorDefaults":{"w":1,"wtimeout":0},"replicaSetId":{"$oid":"64c8a08028e92a92739f52d5"}}}}}
 
{"t":{"$date":"2023-08-02T06:32:09.303+00:00"},"s":"I",  "c":"REPL",     "id":21393,   "ctx":"ReplCoord-0","msg":"Found self in config","attr":{"hostAndPort":"10.3.23.52:7503"}}
 
{"t":{"$date":"2023-08-02T06:32:09.303+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "ctx":"ReplCoord-0","msg":"Replica set state transition","attr":{"newState":"STARTUP2","oldState":"STARTUP"}}
 
{"t":{"$date":"2023-08-02T06:32:09.304+00:00"},"s":"I",  "c":"REPL",     "id":21306,   "ctx":"ReplCoord-0","msg":"Starting replication storage threads"}
 
{"t":{"$date":"2023-08-02T06:32:09.305+00:00"},"s":"I",  "c":"REPL",     "id":4280513, "ctx":"ReplCoord-0","msg":"Initial sync required. Attempting to start initial sync..."} 

Regarding the first incident, it appears that you are recording the clusterTime from each event you see, but that means that if events are rare, your resume point can become very stale or even roll off the oplog, as happened in this incident. Instead, you should adopt the recommended paradigm for recording resume tokens:

event = cursor.tryNext()               // next event, or null if none is currently available
resumeToken = cursor.getResumeToken()  // advances even when no events are returned

This ensures that even if no events are returned from the cursor, the resume token you get back will keep advancing. This way, even if the events are infrequent, you will still have an up-to-date resume point. You should then resume your stream using resumeAfter rather than startAtOperationTime.
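A minimal sketch of this pattern in Python with pymongo (an assumption; the connection string, collection names, and persistence step are illustrative):

# Illustrative sketch of the recommended resume-token pattern (pymongo assumed).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos address
coll = client["mydb"]["mycoll"]                      # hypothetical collection

resume_token = None  # in a real application, load the last persisted token here

with coll.watch(resume_after=resume_token) as stream:
    while stream.alive:
        event = stream.try_next()           # None when no event is currently available
        resume_token = stream.resume_token  # keeps advancing even without events
        if event is not None:
            print(event["operationType"])   # application-specific processing
        # persist resume_token durably here; on restart, pass it as resume_after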

Comment by Oded Raiches [ 07/Sep/23 ]

Thanks romans.kasperovics@mongodb.com ,

Why would an initial sync happen here? How can we prevent it?

Comment by Romans Kasperovics [ 07/Sep/23 ]

In the case of the second incident on 2023-08-02, the cause was that the node underwent an initial sync shortly before the ChangeStreamHistoryLost exceptions started showing up. mongodb_logs2/node1/mongodb/mongos_7503.log contains:

{"t":\{"$date":"2023-08-02T06:32:10.337+00:00"}

,"s":"I", "c":"INITSYNC", "id":21173, "ctx":"ReplCoordExtern-0","msg":"Initial syncer oplog truncation finished","attr":{"durationMillis":0}}
The initial sync process truncates the node’s oplog to just the most recent entry, which in this case meant that it no longer had the ability to resume from an older point in time.

Comment by Romans Kasperovics [ 06/Sep/23 ]

Hi oraiches@zadarastorage.com,

OplogStartMissing happens when the syncing node detects a 'gap' between the batches of oplog entries it fetches. In this case, the syncing node may truncate its recent oplog entries until it reaches entries present on both nodes. In this context, having a default write concern of 1 carries a risk of data loss.

Nevertheless, we think the OplogStartMissing error is not relevant here, and neither is adding a new shard. The change stream is trying to resume from 2023-06-17 15:54:18 GMT, but the oplog's first time, as seen in the OplogStartMissing message, is 2023-06-18 12:56:45. In other words, the events between 2023-06-17 15:54:18 and the first resume attempt at 2023-06-20 13:15:39 probably do not fit into the 50 GB oplog.

I cannot tell why the resume attempt's startAtOperationTime = 2023-06-17 15:54:18 is so far in the past, but perhaps there were no majority-committed events in the cluster since then. In any case, we recommend using the resumeAfter parameter instead of startAtOperationTime, for the reasons described in SERVER-35740.
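As a side note, the oplog window of a shard can be inspected directly to see how far back a resume is still possible. A minimal sketch, assuming pymongo and a connection to a shard member (the address is hypothetical):

# Illustrative sketch: inspect a shard's oplog window (pymongo assumed).
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://shard-member:27017")  # hypothetical shard member address
oplog = client["local"]["oplog.rs"]

first = oplog.find().sort("$natural", 1).limit(1).next()   # oldest retained entry
last = oplog.find().sort("$natural", -1).limit(1).next()   # newest entry

def to_utc(ts):
    # 'ts' is a BSON Timestamp; .time is seconds since the Unix epoch.
    return datetime.fromtimestamp(ts.time, tz=timezone.utc)

print("oplog first:", to_utc(first["ts"]))
print("oplog last: ", to_utc(last["ts"]))
# A change stream can only resume from a point at or after the oplog's first entry.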

Comment by Oded Raiches [ 04/Sep/23 ]

Hi romans.kasperovics@mongodb.com ,

  • What does OplogStartMissing mean? How can it be missing?
  • Also, what does "truncated oplog" mean? Is it lost, like a ROLL-BACK after a double failure? And even if so, how is it being lost so fast?
  • As you observed, the node was unreachable for a short time, but I am still not sure why the oplog would be so overwhelmed in such a short time.
  • The oplog size is 50GB.
  • The shortest oplog window I saw on one of my shards while adding a new shard (initial migration) was ~30 minutes, with a transfer rate of about 20K docs/sec.

We have seen this happen in P-S-S as well, but did not acquire the logs.

This mainly happens during large migrations involving new shards.

Does this mean we need to resize the oplog, or can this "truncation" happen regardless of the oplog size?

We currently use defaultWriteConcern=1 on these P-S-A setups, and would like to move to defaultWriteConcern=2 and change it dynamically when our monitoring detects a failure, while also protecting against ROLL-BACK with on-the-fly priority changes - do you think this might help as well?
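For reference, roughly the change we have in mind, sketched with pymongo against the mongos (the hostname and exact values are illustrative):

# Illustrative sketch: raise the cluster-wide default write concern (pymongo assumed).
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-host:27017")  # hypothetical mongos address

# Move the cluster-wide default from w:1 to w:2.
client.admin.command({"setDefaultRWConcern": 1, "defaultWriteConcern": {"w": 2}})

# Verify the new default.
print(client.admin.command({"getDefaultRWConcern": 1}))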

 

Comment by Romans Kasperovics [ 04/Sep/23 ]

Hi oraiches@zadarastorage.com,

I am looking at the mongodb-logs.zip / mongodb-node-1 / mongos_7502.log.1

  • There is a server restart at 13:15:15.
  • Just before that (from 13:14:19 to 13:15:30), the log contains HostUnreachable errors for 10.3.19.39:7502, 10.3.19.38:7503, 10.3.19.38:7504, 10.3.19.39:7504.
  • All 4 shards are re-added to the topology around 13:15:17. There are 3 hosts (10.3.19.38 - 10.3.19.40), with config servers listening on ports 7601 and 7901, and shards listening on ports 7501, 7502, 7503, 7504, 7802, 7803, 7804.
  • The HostUnreachable errors continue after the restart, from 13:15:17 to 13:15:30.
  • At 13:15:18 there is an OplogStartMissing error.
  • ChangeStreamHistoryLost errors start to show up from 13:15:39.

I am not sure what causes all the HostUnreachable errors, but in general, in a P-S-A replica set, if either the primary or the secondary goes down, the changes during that period cannot be majority-committed, and so no change stream events are emitted. If the oplog is truncated during this period, the change stream history can be lost.

Do you know the oplog size on your system? Could it be that there were many changes to the underlying collections between 13:14:19 and 13:15:17?

Comment by Oded Raiches [ 21/Aug/23 ]

Hi sebastien.mendez@mongodb.com,

Is there any update about this issue?

 

Comment by Ana Meza [ 14/Aug/23 ]

sebastien.mendez@mongodb.com could you find an assignee for this work please?

Comment by Yuan Fang [ 10/Aug/23 ]

Hi oraiches@zadarastorage.com,

I have reviewed the log and FTDC data. However, I am now moving this to the team for a more thorough investigation. Please stay tuned for updates. Thank you for your patience!

Regards,
Yuan

 

Comment by Oded Raiches [ 09/Aug/23 ]

Hi yuan.fang@mongodb.com ,

Were you able to look at the logs and find anything?

Thanks.

Comment by Oded Raiches [ 06/Aug/23 ]

Hi yuan.fang@mongodb.com

Yes, I uploaded using the full path, but it seemed to me that it worked.

Uploaded now again as follows:

curl -X POST https://upload.box.com/api/2.0/files/content \
> -H 'Authorization: Bearer 1!xWlQ4ggMMjG26rADGlkBg_aj6qpIzwRyLnJaRJ_M50u0TF_M6HRUlgDBsNyTALLOHmYIw33Cj5af0pyJk3QP4qrVpwwyL6b_I-_P-ullBWX1hS6iHarTnlq1q_U9vntPVxUQtZN05xypkNqhXavI0slAkYZ8XA5uSbPIqX4E6HmWlbJtSGDq98UTt6JGYPpx_Q0tdy2BQL0rinCJvBkhnOi54QqV_sKtB8r4mebBJezsTuRtMZysISKRe7W2hB1QipoLySFNNRAVOgi2qWEZBHIO4_lVwiTpKVRcdONIl8Lgn4yRYcAn7L-k-ZjtHj0g2lCw21Gj6lRzyFQend4pc7SORrdOHdIBfHPyA1fkDRDimcvv4wrsaqZvFv5XVPO9qKswrQWxkfRzn7Kt_-cWYv818xY.' \
> -H 'Content-Type: multipart/form-data' \
> -F attributes='{"name": "mongodb_logs2.zip", "parent": {"id": "214562146277"}}' \
> -F file=@mongodb_logs2.zip > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 17.8M  100  1170  100 17.8M    213  3330k  0:00:05  0:00:05 --:--:-- 3471k

 

Comment by Yuan Fang [ 04/Aug/23 ]

Hi oraiches@zadarastorage.com,

Thank you for uploading the data. Unfortunately, it seems that the data wasn't uploaded properly, and I couldn't see it in the portal. I've tested the link on my end, and it worked. Would you please try to upload it again?

Please make sure that the files are fully uploaded; you should see the progress after initiating the command. One possible cause of upload failure I've noticed is providing the full path of the file. In other words, you'll need to navigate to the file's current location and replace <filename> with just the file name (e.g., "mongodb-logs.tgz"), without the preceding path.

Please let us know if you made another attempt to upload, and thank you so much for your time!

Comment by Oded Raiches [ 02/Aug/23 ]

Hi yuan.fang@mongodb.com

Thanks for the fast reply!

Uploaded the logs to the link.

The timeline was around {"$date":"2023-08-02T06:30:41.227+00:00"} in the logs.

 

Comment by Yuan Fang [ 02/Aug/23 ]

Hi oraiches@zadarastorage.com,

Thank you for reaching out and bringing to our attention that the issue has occurred again. Here is the new uploader link, for the issue that happened this time:

  • Please ensure that you upload the logs/FTDC covering the time that the incident occurred.
  • Could you also provide a clear issue timeline?

Regards,
Yuan

Comment by Oded Raiches [ 02/Aug/23 ]

Hi yuan.fang@mongodb.com ,

We got the issue happening again on another system today.

This time I collected the logs and $dbpath/diagnostic.data about an hour after the occurrence, so hopefully we have all the info you requested.

Can you share a new upload link to send the logs once more?

Comment by Oded Raiches [ 19/Jul/23 ]

Hi yuan.fang@mongodb.com

I was not able to reproduce the same issue on a different system.

I have uploaded the logs of the 2 nodes in a zipped file to your upload portal; as a reminder, the $dbpath/diagnostic.data was lost.

Comment by Yuan Fang [ 19/Jul/23 ]

Hi oraiches@zadarastorage.com,

Have you had a chance to reproduce the issue? If not, we can start by looking at the logs; could you upload them to the upload portal? Thank you.

Regards,
Yuan

Comment by Oded Raiches [ 27/Jun/23 ]

Hi yuan.fang@mongodb.com, thanks for the reply.

Unfortunately, we only have the mongod/mongos logs, since the $dbpath/diagnostic.data got rotated.

We will try to reproduce, but are the mongod/mongos logs sufficient in the meantime?

Comment by Yuan Fang [ 26/Jun/23 ]

Hi oraiches@zadarastorage.com,

Thank you for your report. It seems the new shard experienced an issue with synchronization of the change stream. In order to understand why the change stream prior to 14:30 PM is unavailable on the new shard, we need more diagnostic data for further investigation:

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod/mongos logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Regards,
Yuan
 

Generated at Thu Feb 08 06:38:02 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.