[SERVER-35214] Invariant failure starting up mongos after automated restore from backup Created: 24/May/18 Updated: 27/Oct/23 Resolved: 07/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.0.0-rc0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louisa Berger | Assignee: | Kaloian Manassiev |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL | ||||
| Steps To Reproduce: | See the sharded cluster automated restore spec for the exact procedure for restoring a sharded cluster from backup. |
||||
| Sprint: | Sharding 2018-06-18 | ||||
| Description |
|
During an automated restore of a sharded cluster from backup on 4.0.0-rc0, restarting the mongos after the rest of the cluster has been restored produces the following invariant failure in the mongos logs:
Full logs attached. This happens for about 20 minutes (we retry the start roughly every 3 minutes), after which the mongos starts up successfully. I can try to narrow down a better repro if needed; this is from our E2E test runs, so there is less information than usual. If this is not sufficient, I'll generate a repro tomorrow that we can look at more closely. kaloian.manassiev suggested I file a ticket for the failure. |
| Comments |
| Comment by Kaloian Manassiev [ 07/Jun/18 ] | |
|
Thanks for attaching the snapshot. Luckily, the snapshot itself is correct. The problem is the restore procedure, specifically this step:
For 4.0+, this step must also clear the history array (clearing it is easier than also performing the renames there), because after the step above the history entries will no longer match the renamed shards. In short: chunk.history should be set to []. | |
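The history fix described in this comment can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual restore tooling: it works on plain dictionaries shaped like chunk documents, and the function name, the sample shard names, and the `validAfter` values are hypothetical. Only the `shard` and `history` fields come from the comment itself.

```python
from copy import deepcopy


def rename_shards_and_clear_history(chunks, shard_renames):
    """Return copies of chunk documents with shards renamed and history emptied.

    chunks:        list of dicts shaped like chunk documents (assumed layout)
    shard_renames: dict mapping old shard names to new shard names
    """
    updated = []
    for chunk in chunks:
        doc = deepcopy(chunk)  # leave the caller's documents untouched
        # Rename the owning shard; keep the name if it is not in the mapping.
        doc["shard"] = shard_renames.get(doc["shard"], doc["shard"])
        # The old history entries still name the pre-rename shards, so drop
        # them instead of rewriting each entry.
        doc["history"] = []
        updated.append(doc)
    return updated


# Hypothetical sample data for illustration only.
chunks = [
    {
        "_id": "test.coll-x_MinKey",
        "shard": "oldShard0",
        "history": [{"validAfter": 1, "shard": "oldShard0"}],
    },
    {
        "_id": "test.coll-x_0",
        "shard": "oldShard1",
        "history": [{"validAfter": 2, "shard": "oldShard1"}],
    },
]
restored = rename_shards_and_clear_history(
    chunks, {"oldShard0": "newShard0", "oldShard1": "newShard1"}
)
```

A real restore would apply the equivalent update to the chunk documents in the config database. Emptying the array avoids having to rewrite every history entry to the new shard names.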
| Comment by Louisa Berger [ 07/Jun/18 ] | |
|
Yes, it was taken from 4.0-rc0 with FCV 4.0. Attaching the config snapshot. | |
| Comment by Kaloian Manassiev [ 07/Jun/18 ] | |
|
louisa.berger, how did you take the dump that you are using for these restore tests? I presume it was taken from 4.0-rc0 with FCV 4.0, but can you confirm? Also, can you please attach the dump of the config database? I would like to examine what it looks like. | |
| Comment by Kelsey Schubert [ 07/Jun/18 ] | |
|
Thanks for the note about impact, we're looking into it! | |
| Comment by Louisa Berger [ 07/Jun/18 ] | |
|
Note: this is blocking us from completing support for 4.0 automated restores. Thank you! | |
| Comment by Louisa Berger [ 07/Jun/18 ] | |
|
Procedure for mongos:
Details of what the config servers do during the restore can be found in the restores spec: https://docs.google.com/document/d/16oENund7VwCjKe_QEJDXoPzOljLxruVhWKpAYFhYw5s/edit#heading=h.s8fu4tms63b3 | |
| Comment by Louisa Berger [ 07/Jun/18 ] | |
|
Hi kaloian.manassiev, re-opening because I'm seeing these failures again. I've attached the mongos logs and can easily repro if you need other logs or information. | |
| Comment by Louisa Berger [ 04/Jun/18 ] | |
|
Sorry Kal, was working on trying to repro and I think I figured out how to make this issue go away. Was waiting to close until I could confirm, but it looks like we're all set. Thank you! | |
| Comment by Kaloian Manassiev [ 01/Jun/18 ] | |
|
louisa.berger, there are no logs attached to this ticket. Will you be able to include a repro or attach the logs? Otherwise it is unclear what caused this invariant from just looking at the message. | |
| Comment by Kaloian Manassiev [ 24/May/18 ] | |
|
This invariant points to an incomplete FCV upgrade or an incomplete restore. Would it be possible to provide the repro steps that you used and also attach the complete logs from that node? |