[SERVER-35214] Invariant failure starting up mongos after automated restore from backup Created: 24/May/18  Updated: 27/Oct/23  Resolved: 07/Jun/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.0.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Louisa Berger Assignee: Kaloian Manassiev
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File config_40WT (1).tar.gz     Text File mongos_after_restore_fassert.log    
Issue Links:
Depends
Operating System: ALL
Steps To Reproduce:

See the spec for sharded cluster automated restore for reference on exact procedure for sharded cluster restore from backup.

Sprint: Sharding 2018-06-18
Participants:

 Description   

When performing a sharded cluster automated backup restore on 4.0.0-rc0 and then restarting the mongos after the rest of the cluster has been restored, we get the following invariant failure in the mongos logs:

2018-05-24T19:36:22.400+0000 I SHARDING [LogicalSessionCacheRefresh] Refreshing chunks for collection config.system.sessions based on version 0|0||000000000000000000000000 
2018-05-24T19:36:22.400+0000 F - [ConfigServerCatalogCacheLoader-0] Invariant failure _shardId == _history.front().getShard() src/mongo/s/chunk.cpp 67 
2018-05-24T19:36:22.400+0000 F - [ConfigServerCatalogCacheLoader-0] ***aborting after invariant() failure

Full logs attached.

The failure recurs for about 20 minutes (we retry the start every 3 minutes or so), and then the mongos successfully starts up.

I can try to narrow down a better repro if needed; this is from our E2E test runs, so there's not as much information as usual. If this is not sufficient information, I'll generate a repro tomorrow that we can look at more closely. kaloian.manassiev suggested I file a ticket for the failure.



 Comments   
Comment by Kaloian Manassiev [ 07/Jun/18 ]

Thanks for attaching the snapshot - luckily, the snapshot itself is correct.

The problem is in the restore procedure, specifically this step:

Change the value of the shard field from sourceShardName to destShardName for every document in config.chunks where {shard: sourceShardName}

For 4.0+, this step needs to be augmented to also clear the history array (that is easier than also performing the renames there), because after the step above the history entries will no longer match the renamed shards.

So basically: chunk.history should be set to [].
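
For illustration, a minimal sketch of the augmented step as it might be run from the mongo shell against the restored config server (sourceShardName and destShardName are placeholders from the restore procedure, not literal values):

  // Rename the shard on every chunk owned by the source shard and, for 4.0+,
  // clear the per-chunk history so it cannot contradict the new shard name.
  db.getSiblingDB("config").chunks.updateMany(
      { shard: "sourceShardName" },
      { $set: { shard: "destShardName", history: [] } }
  );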

Comment by Louisa Berger [ 07/Jun/18 ]

Yes, it was taken from 4.0-rc0 with fcv 4.0.

Attaching config snapshot

Comment by Kaloian Manassiev [ 07/Jun/18 ]

louisa.berger, how did you take the dump which you are using for these restore tests? I presume it was taken from 4.0-rc0 with FCV 4.0, but can you confirm?

Also, can you please attach the dump from the config database? I would like to examine what it looks like.

Comment by Kelsey Schubert [ 07/Jun/18 ]

Thanks for the note about impact, we're looking into it!

Comment by Louisa Berger [ 07/Jun/18 ]

Note: this is blocking us from completing support for 4.0 automated restores. Thank you!

Comment by Louisa Berger [ 07/Jun/18 ]

Procedure for mongos:

  1. Shut down
  2. Wait for all config servers to have finished their restores, restarted, and elected a primary.
  3. Restart

Details of what the config servers are doing during the restore can be found in the restore spec here: https://docs.google.com/document/d/16oENund7VwCjKe_QEJDXoPzOljLxruVhWKpAYFhYw5s/edit#heading=h.s8fu4tms63b3
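
For step 2, a minimal sketch of one way to confirm the config server replica set has elected a primary before restarting the mongos, assuming shell access to one of the restored config servers:

  // Run in the mongo shell connected to a restored config server.
  var status = rs.status();
  var hasPrimary = status.members.some(function (m) { return m.stateStr === "PRIMARY"; });
  print(hasPrimary ? "CSRS primary elected; safe to restart mongos" : "No primary yet; keep waiting");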

Comment by Louisa Berger [ 07/Jun/18 ]

Hi kaloian.manassiev

Re-opening because I'm seeing these again. Attached the mongos logs. Can easily repro if you need other logs or information.

Comment by Louisa Berger [ 04/Jun/18 ]

Sorry Kal, I was working on trying to repro and I think I figured out how to make this issue go away. I was waiting to close until I could confirm, but it looks like we're all set. Thank you!

Comment by Kaloian Manassiev [ 01/Jun/18 ]

louisa.berger, there are no logs attached to this ticket. Will you be able to include a repro or attach the logs? Otherwise it is unclear from the message alone what caused this invariant failure.

Comment by Kaloian Manassiev [ 24/May/18 ]

This invariant failure points at an incomplete FCV upgrade or an incomplete restore. Would it be possible to provide the repro steps that you used and also attach the complete logs from that node?
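
For reference, a sketch of how both conditions could be checked from the mongo shell (assuming the chunk history entries store the owning shard in a shard field, as the invariant in chunk.cpp suggests):

  // 1. Confirm the node's featureCompatibilityVersion.
  db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });

  // 2. On the config server, look for chunks whose history contradicts the shard field,
  //    which is the condition the invariant checks.
  db.getSiblingDB("config").chunks.find().forEach(function (c) {
      if (c.history && c.history.length > 0 && c.history[0].shard !== c.shard) {
          printjson({ _id: c._id, shard: c.shard, historyShard: c.history[0].shard });
      }
  });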
