[SERVER-34671] Server shuts down on startup with 4.0 for mmap csrs Created: 25/Apr/18  Updated: 27/Oct/23  Resolved: 25/Apr/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Louisa Berger Assignee: Backlog - Storage Execution Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File csrs_primary.log     Text File csrs_secondary1_first_startup.log     Text File csrs_secondary1_second_startup.log     Text File csrs_secondary2_first_startup.log    
Assigned Teams:
Storage Execution
Operating System: ALL
Steps To Reproduce:
  1. Start up 3 config servers on mmapv1 with the following config files:

    net:
      bindIp: 0.0.0.0
      port: 5006
    processManagement:
      fork: "true"
    replication:
      replSetName: csrs_set
    sharding:
      clusterRole: configsvr
    storage:
      dbPath: /tmp/mms-automation/data/config1
      engine: mmapv1
    systemLog:
      destination: file
      path: /tmp/mms-automation/logs/config1_run.log
    

  2. Initiate the replica set (on 5006) :

    > rs.initiate({_id : "csrs_set", "members" : [ { _id : 0, host : "louisamac:5006"}, {_id : 1, host : "louisamac:5007"}, {_id : 2, host : "louisamac:5008" } ] })
    2018-04-25T11:33:20.237-0400 E QUERY    [js] Error: error doing query: failed: network error while attempting to run command 'replSetInitiate' on host '127.0.0.1:5006'  :
    DB.prototype.runCommand@src/mongo/shell/db.js:168:1
    DB.prototype.adminCommand@src/mongo/shell/db.js:186:16
    rs.initiate@src/mongo/shell/utils.js:1270:12
    @(shell):1:1
    2018-04-25T11:33:20.238-0400 I NETWORK  [js] trying reconnect to 127.0.0.1:5006 failed
    2018-04-25T11:33:20.242-0400 I NETWORK  [js] reconnect 127.0.0.1:5006 ok
    

  3. Wait a while.. and see the first initial sync fassert on the secondaries.
  4. Restart the secondary with the same options. Immediately see the second fcv fassert on the secondaries.
Participants:

 Description   

Note: I know that only WT csrs is supported since 3.6, but kaloian.manassiev recommended filing a bug since it causes the server to crash.

If you start and initiate a mmapv1 CSRS on 3.7.5, the secondaries crash with the following fassert:

2018-04-25T11:35:01.441-0400 I REPL     [replication-0] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 9, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1524670402251), initialSyncAttempts: [ { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" } ] }
2018-04-25T11:35:01.441-0400 E REPL     [replication-0] Initial sync attempt failed -- attempts left: 0 cause: InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.
2018-04-25T11:35:01.442-0400 F REPL     [replication-0] The maximum number of retries have been exhausted for initial sync.
2018-04-25T11:35:01.443-0400 I STORAGE  [replication-0] Finishing collection drop for local.temp_oplog_buffer (9c32c293-5d0b-491b-921a-6c2b83700ff9).
2018-04-25T11:35:01.444-0400 E REPL     [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
2018-04-25T11:35:01.445-0400 F -        [replication-0] Fatal assertion 40088 InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync. at src/mongo/db/repl/replication_coordinator_impl.cpp 685
2018-04-25T11:35:01.445-0400 F -        [replication-0]
 
***aborting after fassert() failure

Then, if you try to restart that process after the fassert, you get a different FCV fassert:

2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Unable to start up mongod due to missing featureCompatibilityVersion document.
2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Please run with --journalOptions 4 to recover the journal. Then run with --repair to restore the document.
2018-04-25T11:47:13.206-0400 F -        [initandlisten] Fatal Assertion 40652 at src/mongo/db/repair_database_and_check_version.cpp 527
2018-04-25T11:47:13.206-0400 F -        [initandlisten]
 
***aborting after fassert() failure

Attached the full log files for all 3 members of the csrs.



 Comments   
Comment by Eric Milkie [ 26/Apr/18 ]

A brand new server that is started with --replSet or --shardsvr cannot simply write an FCV document without first becoming part of a replica set or sharded system, because FCV determination is synchronized in those environments.

Comment by Kaloian Manassiev [ 25/Apr/18 ]

Ah ok, this makes sense - thank you for the explanation!

Regardless of this, shouldn't a brand new server starting write its FCV as the first thing it does after recovery? This would avoid situations like this.

Comment by Eric Milkie [ 25/Apr/18 ]

I think the reason this is happening is because initial sync, upon exhausting retry attempts, does not do a final "clean everything out" before it shuts down the server. I don't think it's worth it to fix that.

Comment by Andy Schwerin [ 25/Apr/18 ]

There's nothing wrong here other than maybe the text of the message. Config servers have to use WT as their storage engine.

Comment by Kaloian Manassiev [ 25/Apr/18 ]

Oh sorry, I didn't see there is a question.

milkie, what seemed odd to me in this failure is this message Unable to start up mongod due to missing featureCompatibilityVersion document, which indicates that a node managed to create some collections before the FCV document has been written. Apart from that, I don't care if the secondaries fassert due to being unable to perform initial sync.

Comment by Louisa Berger [ 25/Apr/18 ]

Assuming you want kaloian.manassiev to weigh in here, but from Cloud perspective I really don't have an opinion – it's not blocking us, I just came across this while testing.

Comment by Eric Milkie [ 25/Apr/18 ]

It appears that initial sync doesn't work because, due to the existing conversion logic that helps users convert from mirrored config servers to CSRS, any MMAP nodes are automatically moved to "REMOVED" state when they join a cluster. This state is prohibiting them from completing initial sync.
I'm not sure what the expected behavior here is. Would you want the servers to not shut down, but simply remain running as REMOVED members?

Comment by Eric Milkie [ 25/Apr/18 ]

Note that the server isn't "rendered unusable"; the log message immediately following the error tells you how to fix it – by restarting the server with some different parameters to override the error detection and fix the problem.
However, it is still a problem that initial sync is failing for this case. We'll still need to investigate that.

Comment by Kaloian Manassiev [ 25/Apr/18 ]

Assigning to the storage team, because it looks like the "Unable to start up mongod due to missing featureCompatibilityVersion document" message appears as a result of the server being left in inconsistent state, where collections were created before the FCV document was written. Not sure if this is a scenario we care about, but rendering the server unusable in this state seems wrong.

Generated at Thu Feb 08 04:37:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.