[SERVER-34671] Server shuts down on startup with 4.0 for mmap csrs Created: 25/Apr/18 Updated: 27/Oct/23 Resolved: 25/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louisa Berger | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Participants: |
| Description |
|
Note: I know that only WT CSRS has been supported since 3.6, but kaloian.manassiev recommended filing a bug since it causes the server to crash. If you start and initiate an mmapv1 CSRS on 3.7.5, the secondaries crash with the following fassert:
Then, if you try to restart that process after the fassert, you get a different FCV fassert:
Attached the full log files for all 3 members of the csrs. |
| Comments |
| Comment by Eric Milkie [ 26/Apr/18 ] |
|
A brand new server that is started with --replSet or --shardsvr cannot simply write an FCV document without first becoming part of a replica set or sharded system, because FCV determination is synchronized in those environments. |
| Comment by Kaloian Manassiev [ 25/Apr/18 ] |
|
Ah ok, this makes sense - thank you for the explanation! Regardless of this, shouldn't a brand new server write its FCV as the first thing it does after recovery when starting up? This would avoid situations like this. |
| Comment by Eric Milkie [ 25/Apr/18 ] |
|
I think this is happening because initial sync, upon exhausting its retry attempts, does not do a final "clean everything out" before it shuts down the server. I don't think it's worth it to fix that. |
| Comment by Andy Schwerin [ 25/Apr/18 ] |
|
There's nothing wrong here other than maybe the text of the message. Config servers have to use WT as their storage engine. |
| Comment by Kaloian Manassiev [ 25/Apr/18 ] |
|
Oh sorry, I didn't see there was a question. milkie, what seemed odd to me in this failure is the message "Unable to start up mongod due to missing featureCompatibilityVersion document", which indicates that a node managed to create some collections before the FCV document was written. Apart from that, I don't care if the secondaries fassert due to being unable to perform initial sync. |
| Comment by Louisa Berger [ 25/Apr/18 ] |
|
I assume you want kaloian.manassiev to weigh in here, but from the Cloud perspective I really don't have an opinion – it's not blocking us, I just came across this while testing. |
| Comment by Eric Milkie [ 25/Apr/18 ] |
|
It appears that initial sync doesn't work because, due to the existing conversion logic that helps users convert from mirrored config servers to CSRS, any MMAP nodes are automatically moved to the "REMOVED" state when they join a cluster. This state prevents them from completing initial sync. |
| Comment by Eric Milkie [ 25/Apr/18 ] |
|
Note that the server isn't "rendered unusable"; the log message immediately following the error tells you how to fix it – by restarting the server with some different parameters to override the error detection and fix the problem. |
| Comment by Kaloian Manassiev [ 25/Apr/18 ] |
|
Assigning to the storage team, because it looks like the "Unable to start up mongod due to missing featureCompatibilityVersion document" message appears as a result of the server being left in an inconsistent state, where collections were created before the FCV document was written. Not sure if this is a scenario we care about, but rendering the server unusable in this state seems wrong. |