  1. Core Server
  2. SERVER-36139

Cluster-wide crash due to segfaults in LockPinger thread when an SCCC member is started with replication.replSetName option

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.2.17
    • Component/s: Sharding, Stability
    • Labels: None
    • Sharding
    • ALL
    • Steps To Reproduce:
      1. Deploy a 3.2.17 sharded cluster with 3 SCCC config servers
      2. Add the replication.replSetName option to the configuration file of one of the config servers and restart it so that the change takes effect
      3. Observe that the mongos and the shard processes crash with a segfault

      With the legacy SCCC config server configuration, if one of the config servers is accidentally restarted with the replication.replSetName option set, the LockPinger thread on the shards and on the mongos segfaults, apparently because of the unexpected notMaster response it receives when it tries to work with the config server that has the replSetName option set.

      This will likely cause cluster-wide crashes, that is, an outage!
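
Since a single misconfigured config server is enough to trigger the crash, a check before restart may help catch the mistake. The following is a minimal sketch, not part of any MongoDB tooling: the function name `is_risky_sccc_member`, the host names, and the idea of passing in an already-parsed configuration file as a dict are all assumptions made for illustration. Only the option names `replication.replSetName` and `sharding.clusterRole` come from the mongod configuration file format.

```python
def is_risky_sccc_member(config: dict, sccc_hosts: set, host: str) -> bool:
    """Return True if `host` is a known SCCC config server whose parsed
    configuration file sets replication.replSetName."""
    replication = config.get("replication") or {}
    return host in sccc_hosts and bool(replication.get("replSetName"))


# Hypothetical host names and parsed config files, for illustration only.
sccc_hosts = {"cfg-0:27017", "cfg-1:27017", "cfg-2:27017"}
safe = {"sharding": {"clusterRole": "configsvr"}}
risky = {
    "sharding": {"clusterRole": "configsvr"},
    "replication": {"replSetName": "csrs"},
}

print(is_risky_sccc_member(safe, sccc_hosts, "cfg-0:27017"))   # False
print(is_risky_sccc_member(risky, sccc_hosts, "cfg-1:27017"))  # True
```

A check like this only flags the configuration described in this ticket; it does not validate anything else about the deployment.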

      The backtrace looks like this:

      "The backtrace"
      2018-07-16T00:53:01.415+0000 F - [LockPinger] Invalid access at address: 0x20b7000
      2018-07-16T00:53:01.436+0000 F - [LockPinger] Got signal: 11 (Segmentation fault).
      
      0x13c6902 0x13c5a59 0x13c5dd8 0x7fa978a4e330 0x7fa97870fb10 0x142a92e 0x1bfe00d 0xa2ec08 0xa2ecae 0x1288c5b 0x12c3c0c 0xa9fc8d 0xa40825 0xaa5a2e 0xaa6132 0xa2fdfb 0xa33005 0xa334c1 0x1202eaf 0x1206d1a 0x120789a 0x1bf42e0 0x7fa978a46184 0x7fa97877303d
      ----- BEGIN BACKTRACE -----
      {"backtrace":[\{"b":"400000","o":"FC6902","s":"_ZN5mongo15printStackTraceERSo"},\{"b":"400000","o":"FC5A59"},\{"b":"400000","o":"FC5DD8"},\{"b":"7FA978A3E000","o":"10330"},\{"b":"7FA978675000","o":"9AB10"},\{"b":"400000","o":"102A92E","s":"_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag"},\{"b":"400000","o":"17FE00D","s":"_ZNSsC1EPKcmRKSaIcE"},\{"b":"400000","o":"62EC08","s":"_ZN5mongo16ConnectionStringC1ENS_10StringDataESt6vectorINS_11HostAndPortESaIS3_EE"},\{"b":"400000","o":"62ECAE","s":"_ZN5mongo16ConnectionString13forReplicaSetENS_10StringDataESt6vectorINS_11HostAndPortESaIS3_EE"},\{"b":"400000","o":"E88C5B","s":"_ZN5mongo29ShardingNetworkConnectionHook16validateHostImplERKNS_11HostAndPortERKNS_8executor21RemoteCommandResponseEb"},\{"b":"400000","o":"EC3C0C"},\{"b":"400000","o":"69FC8D"},\{"b":"400000","o":"640825","s":"_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortE"},\{"b":"400000","o":"6A5A2E","s":"_ZN5mongo21SyncClusterConnection8_connectERKSs"},\{"b":"400000","o":"6A6132","s":"_ZN5mongo21SyncClusterConnectionC1ERKSt4listINS_11HostAndPortESaIS2_EEd"},\{"b":"400000","o":"62FDFB","s":"_ZNK5mongo16ConnectionString7connectERSsd"},\{"b":"400000","o":"633005","s":"_ZN5mongo16DBConnectionPool3getERKSsd"},\{"b":"400000","o":"6334C1","s":"_ZN5mongo18ScopedDbConnectionC2ERKSsd"},\{"b":"400000","o":"E02EAF","s":"_ZN5mongo20LegacyDistLockPinger19_distLockPingThreadENS_16ConnectionStringERKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEE"},\{"b":"400000","o":"E06D1A","s":"_ZN5mongo20LegacyDistLockPinger18distLockPingThreadENS_16ConnectionStringExRKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEE"},\{"b":"400000","o":"E0789A","s":"_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFSt7_Mem_fnIMN5mongo20LegacyDistLockPingerEFvNS4_16ConnectionStringExRKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEEEEPS5_S6_xSsSD_EEvEEE6_M_runEv"},\{"b":"400000","o":"17F42E0","s":"execute_native_thread_routine"},\{"b":"7FA978A3E000","o":"8184"},\{"b":"7FA978675000","o
":"FE03D","s":"clone"}],"processInfo":\{ "mongodbVersion" : "3.2.17", "gitVersion" : "186656d79574f7dfe0831a7e7821292ab380f667", "compiledModules" : [ "enterprise" ], "uname" : { "sysname" : "Linux", "release" : "3.13.0-105-generic", "version" : "#152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016", "machine" : "x86_64" }, "somap" : [ \{ "elfType" : 2, "b" : "400000", "buildId" : "C64FB7E39A41884970087FED27370726E8FF84C6" }, \{ "b" : "7FFC238BE000", "elfType" : 3, "buildId" : "9C7CBCF6C957D8FC8E55B45A3C7A1556B38A3097" }, \{ "b" : "7FA97ABCF000", "path" : "/usr/lib/x86_64-linux-gnu/libsasl2.so.2", "elfType" : 3, "buildId" : "666B276BD134B0E9579B67D4EE333F2D0FB813CD" }, \{ "b" : "7FA97A762000", "path" : "/usr/lib/x86_64-linux-gnu/libnetsnmpmibs.so.30", "elfType" : 3, "buildId" : "8047EB46F312235A7AD5E88665194B9B79823731" }, \{ "b" : "7FA97A553000", "path" : "/usr/lib/x86_64-linux-gnu/libsensors.so.4", "elfType" : 3, "buildId" : "859FDBFDD82F0EFDEB44A433D9D8020A232A35E2" }, \{ "b" : "7FA97A34F000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "034D6A4EE9DCAB4A34ABD644345CBBB42DC63088" }, \{ "b" : "7FA97A0E6000", "path" : "/usr/lib/x86_64-linux-gnu/libnetsnmpagent.so.30", "elfType" : 3, "buildId" : "440F4DBA9B84E851695DA5087266A215A17F05AF" }, \{ "b" : "7FA979EDC000", "path" : "/lib/x86_64-linux-gnu/libwrap.so.0", "elfType" : 3, "buildId" : "54FCBC5B0F994A13A9B0EAD46F23E7DA7F7FE75B" }, \{ "b" : "7FA979C02000", "path" : "/usr/lib/x86_64-linux-gnu/libnetsnmp.so.30", "elfType" : 3, "buildId" : "3FA90E3998BC0E2B00C1E751A3690FE919E12042" }, \{ "b" : "7FA979826000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "CE5EE930D4F0B1F47EDFDACC388EAC6C4DE5CDD2" }, \{ "b" : "7FA9795DF000", "path" : "/usr/lib/x86_64-linux-gnu/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "55F72A23CB9C0F7529F0E0BEE43981864B74C4FE" }, \{ "b" : "7FA9792D9000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : 
"300C7884CDEB5667BEA2357D2B8E7A76397562D6" }, \{ "b" : "7FA97907A000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "920BD37B19B7BD04CA38CE35155D6CDCD744EAB5" }, \{ "b" : "7FA978E72000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "4F930712D3609C93E380E5BE5DF73E7AD273531C" }, \{ "b" : "7FA978C5C000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, \{ "b" : "7FA978A3E000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "F64B8AD471FBA1B7A3A64EFB01551E694975E1F7" }, \{ "b" : "7FA978675000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "D9A10B8EF90300628DD0A3A535106967714D7328" }, \{ "b" : "7FA97ADEA000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "2CA513EDC89C7BC06EC183D1A3A03CC0F606319C" }, \{ "b" : "7FA9782EC000", "path" : "/usr/lib/libperl.so.5.18", "elfType" : 3, "buildId" : "C0DB67A9F9ACDD77265A72E03557AC3AF3DCB362" }, \{ "b" : "7FA9780D2000", "path" : "/lib/x86_64-linux-gnu/libnsl.so.1", "elfType" : 3, "buildId" : "77E8046EDCD924AF0081170F3E3BDC4317CCE6A0" }, \{ "b" : "7FA977E07000", "path" : "/usr/lib/x86_64-linux-gnu/libkrb5.so.3", "elfType" : 3, "buildId" : "77287B3AF8DD293D7367EEF27C652C04353752EC" }, \{ "b" : "7FA977BD8000", "path" : "/usr/lib/x86_64-linux-gnu/libk5crypto.so.3", "elfType" : 3, "buildId" : "49E3D743C2B3741229AD3892B22C4794C646E1F2" }, \{ "b" : "7FA9779D4000", "path" : "/lib/x86_64-linux-gnu/libcom_err.so.2", "elfType" : 3, "buildId" : "8D56938ABD6462C4C29822D8E48A131BE1C61F6A" }, \{ "b" : "7FA9777C9000", "path" : "/usr/lib/x86_64-linux-gnu/libkrb5support.so.0", "elfType" : 3, "buildId" : "0B3ABC152466DE0C69954405A0E980B6E0D4B78F" }, \{ "b" : "7FA977590000", "path" : "/lib/x86_64-linux-gnu/libcrypt.so.1", "elfType" : 3, "buildId" : "A2CA559CCEB691EF8623361D52671E146DC0B06C" }, \{ "b" : "7FA97738C000", "path" : 
"/lib/x86_64-linux-gnu/libkeyutils.so.1", "elfType" : 3, "buildId" : "0F03635F97B93D3DACD84F0ED363C56BD266044F" }, \{ "b" : "7FA977171000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "AD304AFCE6847F7A4D66D22853E87CCBF5A66966" } ] }}
       mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x13c6902]
       mongod(+0xFC5A59) [0x13c5a59]
       mongod(+0xFC5DD8) [0x13c5dd8]
       libpthread.so.0(+0x10330) [0x7fa978a4e330]
       libc.so.6(+0x9AB10) [0x7fa97870fb10]
       mongod(_ZNSs12_S_constructIPKcEEPcT_S3_RKSaIcESt20forward_iterator_tag+0x7E) [0x142a92e]
       mongod(_ZNSsC1EPKcmRKSaIcE+0x1D) [0x1bfe00d]
       mongod(_ZN5mongo16ConnectionStringC1ENS_10StringDataESt6vectorINS_11HostAndPortESaIS3_EE+0x78) [0xa2ec08]
       mongod(_ZN5mongo16ConnectionString13forReplicaSetENS_10StringDataESt6vectorINS_11HostAndPortESaIS3_EE+0x4E) [0xa2ecae]
       mongod(_ZN5mongo29ShardingNetworkConnectionHook16validateHostImplERKNS_11HostAndPortERKNS_8executor21RemoteCommandResponseEb+0x83B) [0x1288c5b]
       mongod(+0xEC3C0C) [0x12c3c0c]
       mongod(+0x69FC8D) [0xa9fc8d]
       mongod(_ZN5mongo18DBClientConnection7connectERKNS_11HostAndPortE+0x765) [0xa40825]
       mongod(_ZN5mongo21SyncClusterConnection8_connectERKSs+0x2EE) [0xaa5a2e]
       mongod(_ZN5mongo21SyncClusterConnectionC1ERKSt4listINS_11HostAndPortESaIS2_EEd+0x2A2) [0xaa6132]
       mongod(_ZNK5mongo16ConnectionString7connectERSsd+0xEB) [0xa2fdfb]
       mongod(_ZN5mongo16DBConnectionPool3getERKSsd+0x145) [0xa33005]
       mongod(_ZN5mongo18ScopedDbConnectionC2ERKSsd+0x61) [0xa334c1]
       mongod(_ZN5mongo20LegacyDistLockPinger19_distLockPingThreadENS_16ConnectionStringERKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEE+0x2BF) [0x1202eaf]
       mongod(_ZN5mongo20LegacyDistLockPinger18distLockPingThreadENS_16ConnectionStringExRKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEE+0x10A) [0x1206d1a]
       mongod(_ZNSt6thread5_ImplISt12_Bind_simpleIFSt5_BindIFSt7_Mem_fnIMN5mongo20LegacyDistLockPingerEFvNS4_16ConnectionStringExRKSsNSt6chrono8durationIlSt5ratioILl1ELl1000EEEEEEPS5_S6_xSsSD_EEvEEE6_M_runEv+0x11A) [0x120789a]
       mongod(execute_native_thread_routine+0x20) [0x1bf42e0]
       libpthread.so.0(+0x8184) [0x7fa978a46184]
       libc.so.6(clone+0x6D) [0x7fa97877303d]
      ----- END BACKTRACE -----
      2018-07-16T00:53:01.436+0000 F - [LockPinger] /proc/self/maps:
      00400000-020b7000 r-xp 00000000 ca:01 394237 /var/lib/mongodb-mms-automation/mongodb-linux-x86_64-3.2.17-ent/bin/mongod
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 020b8000-02186000 r--p 01cb7000 ca:01 394237 /var/lib/mongodb-mms-automation/mongodb-linux-x86_64-3.2.17-ent/bin/mongod
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 02186000-0218e000 rw-p 01d85000 ca:01 394237 /var/lib/mongodb-mms-automation/mongodb-linux-x86_64-3.2.17-ent/bin/mongod
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 0218e000-021ff000 rw-p 00000000 00:00 0 
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 037da000-043da000 rw-p 00000000 00:00 0 [heap]
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 043da000-54b9c000 rw-p 00000000 00:00 0 [heap]
      2018-07-16T00:53:01.436+0000 F - [LockPinger] 7fa645c1a000-7fa6c5b1a000 rw-p 00000000 ca:50 100663430 /data/test.5
      <...>
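
The numbers in the backtrace can be cross-checked with a little arithmetic. The sketch below, with helper names invented for illustration, shows two things: each JSON frame's base ("b") plus offset ("o") gives the absolute address listed in the symbolized frames, and the faulting address 0x20b7000 does not fall inside any of the /proc/self/maps ranges quoted above. The ranges are copied from the log with the file paths omitted; the listing is elided in the ticket, so the second result only covers the quoted lines.

```python
import re

# Faulting address from the "Invalid access at address" line.
FAULT_ADDRESS = 0x20B7000

# Ranges and permissions from the /proc/self/maps lines quoted in the
# log (paths omitted; one line keeps its log prefix to show it parses).
MAPS_LINES = [
    "00400000-020b7000 r-xp",
    "2018-07-16T00:53:01.436+0000 F - [LockPinger] 020b8000-02186000 r--p",
    "02186000-0218e000 rw-p",
    "0218e000-021ff000 rw-p",
    "037da000-043da000 rw-p",
    "043da000-54b9c000 rw-p",
    "7fa645c1a000-7fa6c5b1a000 rw-p",
]

MAPS_RE = re.compile(r"([0-9a-f]+)-([0-9a-f]+) ([rwxps-]{4})")


def absolute_address(base_hex: str, offset_hex: str) -> int:
    """Combine a frame's load base ("b") and offset ("o")."""
    return int(base_hex, 16) + int(offset_hex, 16)


def parse_mapping(line: str):
    """Return (start, end, perms) for a maps line, or None."""
    m = MAPS_RE.search(line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16), m.group(3)


def containing_mapping(address: int, lines):
    """Return the mapping whose half-open range [start, end) holds address."""
    for line in lines:
        parsed = parse_mapping(line)
        if parsed and parsed[0] <= address < parsed[1]:
            return parsed
    return None


# First JSON frame: {"b":"400000","o":"FC6902"}
print(hex(absolute_address("400000", "FC6902")))     # 0x13c6902
print(containing_mapping(FAULT_ADDRESS, MAPS_LINES))  # None
```

The faulting address is exactly the end of the r-xp mapping of the mongod binary, one page before the next quoted mapping begins. That is consistent with the std::string constructor frames reading past the end of a mapped region, but it does not by itself establish the root cause.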
      

      This problem appears to have already been fixed in v3.2.20, where the LockPinger instead logs a warning that the CSRS replica set is not initialized:

      "Works fine on 3.2.20"
      2018-07-16T01:44:02.021+0000 W SHARDING [LockPinger] distributed lock pinger 'myCluster-config-0.dryabtsev-test.4125.mongodbdns.com:27017,myCluster-config-1.dryabtsev-test.4125.mongodbdns.com:27017,myCluster-config-2.dryabtsev-test.4125.mongodbdns.com:27017/dmn-apple-test-4:27017:1531705231:-1629418921' detected an exception while pinging. :: caused by :: CSRS replica set is not initialized
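
The two behaviours leave different signatures in the log, so a simple scan can tell them apart. This is a rough sketch: the function name and labels are made up for illustration, the matched substrings are taken from the log lines in this ticket, and the lock pinger name in the second sample is a shortened, hypothetical stand-in for the real one.

```python
def classify_lockpinger_line(line: str):
    """Label a log line as the 3.2.17 crash, the 3.2.20 warning, or None."""
    if "[LockPinger]" not in line:
        return None
    if "Got signal: 11 (Segmentation fault)" in line:
        return "crash"
    if "CSRS replica set is not initialized" in line:
        return "csrs-not-initialized"
    return None


samples = [
    "2018-07-16T00:53:01.436+0000 F - [LockPinger] Got signal: 11 (Segmentation fault).",
    # Hypothetical, shortened lock pinger name.
    "2018-07-16T01:44:02.021+0000 W SHARDING [LockPinger] distributed lock pinger "
    "'cfg-0:27017,cfg-1:27017,cfg-2:27017/host:27017:1:1' detected an exception "
    "while pinging. :: caused by :: CSRS replica set is not initialized",
]

for line in samples:
    print(classify_lockpinger_line(line))
# crash
# csrs-not-initialized
```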
      

            Assignee: backlog-server-sharding [DO NOT USE] Backlog - Sharding Team
            Reporter: dmitry.ryabtsev@mongodb.com Dmitry Ryabtsev
            Votes: 0
            Watchers: 4
