Core Server / SERVER-34671

Server shuts down on startup with 4.0 for mmapv1 CSRS


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Works as Designed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Storage
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:
      1. Start up 3 config servers on mmapv1 with the following config files:

        net:
          bindIp: 0.0.0.0
          port: 5006
        processManagement:
          fork: "true"
        replication:
          replSetName: csrs_set
        sharding:
          clusterRole: configsvr
        storage:
          dbPath: /tmp/mms-automation/data/config1
          engine: mmapv1
        systemLog:
          destination: file
          path: /tmp/mms-automation/logs/config1_run.log
        

      2. Initiate the replica set (on port 5006):

        > rs.initiate({_id : "csrs_set", "members" : [ { _id : 0, host : "louisamac:5006"}, {_id : 1, host : "louisamac:5007"}, {_id : 2, host : "louisamac:5008" } ] })
        2018-04-25T11:33:20.237-0400 E QUERY    [js] Error: error doing query: failed: network error while attempting to run command 'replSetInitiate' on host '127.0.0.1:5006'  :
        DB.prototype.runCommand@src/mongo/shell/db.js:168:1
        DB.prototype.adminCommand@src/mongo/shell/db.js:186:16
        rs.initiate@src/mongo/shell/utils.js:1270:12
        @(shell):1:1
        2018-04-25T11:33:20.238-0400 I NETWORK  [js] trying reconnect to 127.0.0.1:5006 failed
        2018-04-25T11:33:20.242-0400 I NETWORK  [js] reconnect 127.0.0.1:5006 ok
        

      3. Wait a while, then observe the first fassert (initial sync failure) on the secondaries.
      4. Restart a secondary with the same options. The second fassert (missing FCV document) occurs immediately.
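
Step 1 shows only the first member's config file. The sketch below renders one file per member, assuming (this is not stated in the report) that the other two members differ only in the port (5007, 5008, matching the hosts in the rs.initiate call) and in the "configN" part of the data and log paths. The helper name render_configs and the .conf file names are hypothetical.

```python
# Hypothetical helper: renders one config file per CSRS member.
# Assumption: members 2 and 3 differ from the posted file only in the port
# and in the "configN" path component; the report shows member 1 only.
TEMPLATE = """\
net:
  bindIp: 0.0.0.0
  port: {port}
processManagement:
  fork: "true"
replication:
  replSetName: csrs_set
sharding:
  clusterRole: configsvr
storage:
  dbPath: /tmp/mms-automation/data/config{n}
  engine: mmapv1
systemLog:
  destination: file
  path: /tmp/mms-automation/logs/config{n}_run.log
"""


def render_configs(base_port=5006, members=3):
    """Return a dict mapping a file name to the YAML text for each member."""
    return {
        f"config{n}.conf": TEMPLATE.format(port=base_port + n - 1, n=n)
        for n in range(1, members + 1)
    }


if __name__ == "__main__":
    for name, text in render_configs().items():
        print(f"# {name}")
        print(text)
```

Each rendered file can then be passed to a mongod process with --config.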

      Description

      Note: I know that only WiredTiger CSRS has been supported since 3.6, but Kaloian Manassiev recommended filing a bug because this configuration crashes the server.

      If you start and initiate an mmapv1 CSRS on 3.7.5, the secondaries crash with the following fassert:

      2018-04-25T11:35:01.441-0400 I REPL     [replication-0] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 9, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1524670402251), initialSyncAttempts: [ { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" } ] }
      2018-04-25T11:35:01.441-0400 E REPL     [replication-0] Initial sync attempt failed -- attempts left: 0 cause: InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.
      2018-04-25T11:35:01.442-0400 F REPL     [replication-0] The maximum number of retries have been exhausted for initial sync.
      2018-04-25T11:35:01.443-0400 I STORAGE  [replication-0] Finishing collection drop for local.temp_oplog_buffer (9c32c293-5d0b-491b-921a-6c2b83700ff9).
      2018-04-25T11:35:01.444-0400 E REPL     [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
      2018-04-25T11:35:01.445-0400 F -        [replication-0] Fatal assertion 40088 InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync. at src/mongo/db/repl/replication_coordinator_impl.cpp 685
      2018-04-25T11:35:01.445-0400 F -        [replication-0]
       
      ***aborting after fassert() failure
      

      Then, if you try to restart that process after the fassert, you get a different fassert, this time about the featureCompatibilityVersion (FCV) document:

      2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Unable to start up mongod due to missing featureCompatibilityVersion document.
      2018-04-25T11:47:13.205-0400 F STORAGE  [initandlisten] Please run with --journalOptions 4 to recover the journal. Then run with --repair to restore the document.
      2018-04-25T11:47:13.206-0400 F -        [initandlisten] Fatal Assertion 40652 at src/mongo/db/repair_database_and_check_version.cpp 527
      2018-04-25T11:47:13.206-0400 F -        [initandlisten]
       
      ***aborting after fassert() failure
      

      The full log files for all 3 members of the CSRS are attached.
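
For anyone triaging logs like these, the following is a minimal sketch of pulling fassert codes out of mongod log lines. It relies only on the two message shapes quoted above ("Fatal assertion 40088 ..." and "Fatal Assertion 40652 at ..."); the function name find_fasserts is hypothetical and other log formats are not handled.

```python
import re

# Matches both spellings seen in the excerpts above:
# "Fatal assertion 40088 ..." and "Fatal Assertion 40652 at ...".
FASSERT_RE = re.compile(r"\bFatal assertion (\d+)\b", re.IGNORECASE)


def find_fasserts(lines):
    """Return the fassert codes found in an iterable of log lines, in order."""
    codes = []
    for line in lines:
        match = FASSERT_RE.search(line)
        if match:
            codes.append(int(match.group(1)))
    return codes


if __name__ == "__main__":
    sample = [
        "2018-04-25T11:35:01.445-0400 F -        [replication-0] Fatal assertion 40088 "
        "InitialSyncOplogSourceMissing: No valid sync source found in current replica set "
        "to do an initial sync. at src/mongo/db/repl/replication_coordinator_impl.cpp 685",
        "2018-04-25T11:47:13.206-0400 F -        [initandlisten] Fatal Assertion 40652 "
        "at src/mongo/db/repair_database_and_check_version.cpp 527",
    ]
    print(find_fasserts(sample))
```

To scan a whole file, pass an open file object, since the function accepts any iterable of lines.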

        Attachments

        1. csrs_primary.log
          6.82 MB
        2. csrs_secondary1_first_startup.log
          8 kB
        3. csrs_secondary1_second_startup.log
          29 kB
        4. csrs_secondary2_first_startup.log
          29 kB

            People

            Assignee:
            backlog-server-execution Backlog - Storage Execution Team
            Reporter:
            louisa.berger Louisa Berger