Core Server / SERVER-36874

Fatal Assertion 40526 while migrating chunks

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical - P2
    • None
    • Affects Version/s: 3.6.5, 3.6.7
    • Component/s: Sharding
    • Labels:
      None
    • Environment:
      Ubuntu 16.04.4 LTS (GNU/Linux 4.4.0-1065-aws x86_64)
    • ALL

      We have a four-shard cluster undergoing balancing. Each shard is a three-member replica set.

      Right after a migration occurs, the to-shard's primary occasionally crashes. On 3.6.7, we estimate that this happens in 1 out of 5,000 migrations. Once a single crash happens, the next migration to the same shard (with a different replica set member as primary) is likely to lead to a crash as well.

      We experienced this problem on 3.6.5 as well, but because SERVER-35658 prevented us from balancing at all, we didn't think much of the issue.

      Log from the to-shard primary:

      2018-08-26T19:34:32.238+0000 I SHARDING [migrateThread] Starting receiving end of migration of chunk <REDACTED> for collection <REDACTED> from rs0/<REDACTED> at epoch 5b244c7c70330ec95919bfff with session id rs0_rs2_5b8300c7f321a73b73bed889
      2018-08-26T19:34:32.238+0000 I NETWORK  [migrateThread] Successfully connected to rs0/<REDACTED> (1 connections now open to rs0/<REDACTED> with a 0 second timeout)
      2018-08-26T19:34:32.241+0000 I SHARDING [migrateThread] Scheduling deletion of any documents in <REDACTED> range <REDACTED> before migrating in a chunk covering the range
      2018-08-26T19:34:32.242+0000 I SHARDING [Collection Range Deleter] No documents remain to delete in <REDACTED> range <REDACTED>
      2018-08-26T19:34:32.242+0000 I SHARDING [Collection Range Deleter] Waiting for majority replication of local deletions in <REDACTED> range <REDACTED>
      2018-08-26T19:34:32.242+0000 I SHARDING [Collection Range Deleter] Finished deleting documents in <REDACTED> range <REDACTED>
      2018-08-26T19:34:32.242+0000 I SHARDING [migrateThread] Finished deleting <REDACTED> range <REDACTED>
      2018-08-26T19:34:32.553+0000 I COMMAND  [sessionCatalogMigration-rs0_rs2_5b8300c7f321a73b73bed889] command local.oplog.rs command: find { find: "oplog.rs", filter: { ts: Timestamp(1535228547, 1972), t: 23 }, ntoreturn: 1, singleBatch: true, oplogReplay: true, $db: "local" } planSummary: COLLSCAN keysExamined:0 docsExamined:1 cursorExhausted:1 numYields:1 nreturned:1 reslen:1860 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, Collection: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 2 } } } protocol:op_msg 112ms
      2018-08-26T19:34:32.619+0000 F STORAGE  [sessionCatalogMigration-rs0_rs2_5b8300c7f321a73b73bed889] Statement id 0 from transaction [ { id: UUID("dbcd2345-9052-4e7d-b12d-518b577191d6"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }:350208 ] was committed once with opTime { ts: Timestamp(1535310545, 1153), t: 23 } and a second time with opTime { ts: Timestamp(1534883468, 882), t: 20 }. This indicates possible data corruption or server bug and the process will be terminated.
      2018-08-26T19:34:32.619+0000 F -        [sessionCatalogMigration-rs0_rs2_5b8300c7f321a73b73bed889] Fatal Assertion 40526 at src/mongo/db/session.cpp 67
      2018-08-26T19:34:32.619+0000 F -        [sessionCatalogMigration-rs0_rs2_5b8300c7f321a73b73bed889] 
      
      ***aborting after fassert() failure
      
      2018-08-26T19:34:32.636+0000 F -        [sessionCatalogMigration-rs0_rs2_5b8300c7f321a73b73bed889] Got signal: 6 (Aborted).
      
       0x5634687c23b1 0x5634687c15c9 0x5634687c1aad 0x7f639ebba390 0x7f639e814428 0x7f639e81602a 0x563466f1d1ae 0x5634674d661a 0x5634674dad9e 0x5634674dcbda 0x5634675cf9bd 0x5634688d1610 0x7f639ebb06ba 0x7f639e8e641d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[{"b":"563466589000","o":"22393B1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"563466589000","o":"22385C9"},{"b":"563466589000","o":"2238AAD"},{"b":"7F639EBA9000","o":"11390"},{"b":"7F639E7DF000","o":"35428","s":"gsignal"},{"b":"7F639E7DF000","o":"3702A","s":"abort"},{"b":"563466589000","o":"9941AE","s":"_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},{"b":"563466589000","o":"F4D61A"},{"b":"563466589000","o":"F51D9E","s":"_ZN5mongo7Session26refreshFromStorageIfNeededEPNS_16OperationContextE"},{"b":"563466589000","o":"F53BDA","s":"_ZN5mongo14SessionCatalog18getOrCreateSessionEPNS_16OperationContextERKNS_16LogicalSessionIdE"},{"b":"563466589000","o":"10469BD","s":"_ZN5mongo34SessionCatalogMigrationDestination31_retrieveSessionStateFromSourceEPNS_14ServiceContextE"},{"b":"563466589000","o":"2348610"},{"b":"7F639EBA9000","o":"76BA"},{"b":"7F639E7DF000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.6.7", "gitVersion" : "2628472127e9f1826e02c665c1d93880a204075e", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1065-aws", "version" : "#75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "563466589000", "elfType" : 3, "buildId" : "5D78F445F57AA961C35C97316819BF42C1939FFF" }, { "b" : "7FFE739E2000", "elfType" : 3, "buildId" : "98B173804DDFA7204D8EC8829DB1D865B54DCD24" }, { "b" : "7F639FD9E000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "6EF73266978476EF9F2FD2CF31E57F4597CB74F8" }, { "b" : "7F639F95A000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "250E875F74377DFC74DE48BF80CCB237BB4EFF1D" }, { "b" : "7F639F6F1000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "513282AC7EB386E2C0133FD9E1B6B8A0F38B047D" }, { "b" : "7F639F4ED000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F639F2E5000", 
"path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F639EFDC000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F639EDC6000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F639EBA9000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F639E7DF000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F639FFB9000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x5634687c23b1]
       mongod(+0x22385C9) [0x5634687c15c9]
       mongod(+0x2238AAD) [0x5634687c1aad]
       libpthread.so.0(+0x11390) [0x7f639ebba390]
       libc.so.6(gsignal+0x38) [0x7f639e814428]
       libc.so.6(abort+0x16A) [0x7f639e81602a]
       mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x563466f1d1ae]
       mongod(+0xF4D61A) [0x5634674d661a]
       mongod(_ZN5mongo7Session26refreshFromStorageIfNeededEPNS_16OperationContextE+0x12FE) [0x5634674dad9e]
       mongod(_ZN5mongo14SessionCatalog18getOrCreateSessionEPNS_16OperationContextERKNS_16LogicalSessionIdE+0xDA) [0x5634674dcbda]
       mongod(_ZN5mongo34SessionCatalogMigrationDestination31_retrieveSessionStateFromSourceEPNS_14ServiceContextE+0x100D) [0x5634675cf9bd]
       mongod(+0x2348610) [0x5634688d1610]
       libpthread.so.0(+0x76BA) [0x7f639ebb06ba]
       libc.so.6(clone+0x6D) [0x7f639e8e641d]
      -----  END BACKTRACE  -----
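
For readers matching the JSON backtrace against the raw address list, each absolute address appears to be the module base ("b") plus the frame offset ("o"), both in hexadecimal. The following is a minimal sketch of that arithmetic using three frames copied from the backtrace above; the helper name `absolute_address` is hypothetical and not part of any MongoDB tooling.

```python
# Illustrative only: resolve backtrace frames as base + offset.
# Frame values are copied from the "backtrace" JSON in the log above.
frames = [
    {"b": "563466589000", "o": "22393B1", "s": "_ZN5mongo15printStackTraceERSo"},
    {"b": "7F639E7DF000", "o": "35428", "s": "gsignal"},
    {"b": "563466589000", "o": "9941AE",
     "s": "_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},
]


def absolute_address(frame):
    """Return the absolute address of a frame (hypothetical helper)."""
    # Both fields are hex strings without a 0x prefix.
    return int(frame["b"], 16) + int(frame["o"], 16)


resolved = [hex(absolute_address(frame)) for frame in frames]

for frame, address in zip(frames, resolved):
    print(f"{address}  {frame['s']}")
```

The three results correspond to the addresses 0x5634687c23b1, 0x7f639e814428 and 0x563466f1d1ae listed in the symbolized frames above.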
      
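
The fatal STORAGE message reports that statement id 0 of transaction 350208 was found committed with two different opTimes. As a rough illustration of that kind of consistency check, here is a small sketch. It is not the code in src/mongo/db/session.cpp; `OpTime`, `TransactionHistory` and `DuplicateCommitError` are hypothetical names, and only the numeric values are taken from the log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class OpTime:
    """Hypothetical stand-in for an opTime: timestamp (secs, inc) and term."""
    secs: int
    inc: int
    term: int


class DuplicateCommitError(Exception):
    """Raised when one statement id maps to two different opTimes."""


@dataclass
class TransactionHistory:
    """Tracks which statement ids of one transaction have been committed."""
    txn_number: int
    committed: dict = field(default_factory=dict)

    def record_commit(self, stmt_id, op_time):
        previous = self.committed.get(stmt_id)
        # Seeing the same statement with the same opTime is harmless;
        # seeing it with a different opTime violates the invariant.
        if previous is not None and previous != op_time:
            raise DuplicateCommitError(
                f"Statement id {stmt_id} from transaction {self.txn_number} "
                f"was committed once with {previous} and a second time "
                f"with {op_time}"
            )
        self.committed[stmt_id] = op_time


# Values copied from the fatal log line above.
first = OpTime(1535310545, 1153, 23)
second = OpTime(1534883468, 882, 20)

history = TransactionHistory(txn_number=350208)
history.record_commit(0, first)
try:
    history.record_commit(0, second)
except DuplicateCommitError as err:
    print(err)

# Gap between the two commit timestamps, in seconds.
gap_seconds = first.secs - second.secs
print(gap_seconds)
```

With the values from the log, the two opTimes are 427,077 seconds (just under five days) apart and belong to different terms (23 and 20).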

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            epkugelmass Elan Kugelmass
            Votes:
            0
            Watchers:
            13
