Core Server / SERVER-14261

stepdown during migration range delete can abort mongod

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • 2.6.4, 2.7.3
    • Affects Version/s: 2.6.1
    • Component/s: Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

      Issue Status as of Aug 6, 2014

      ISSUE SUMMARY
      When cleaning up after a chunk migration, if the shard is a replica set the primary node may wait for the cleanup (the deletion of the migrated range's documents) to be replicated to the secondaries. If the node steps down and becomes non-primary during this wait, it throws an exception and aborts.

      The only configurations affected by this issue are sharded clusters where shards are replica sets, the balancer is enabled, and chunk migrations have occurred.
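
      As a rough way to check whether a cluster matches these conditions, one can inspect the balancer state and recent migrations from a mongos shell (the exact changelog entry names may vary by version):

      use config
      sh.getBalancerState()                             // true if the balancer is enabled
      db.changelog.find({ what: /moveChunk/ }).limit(5) // recent chunk migrations, if any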

      USER IMPACT
      Under the conditions described above, the cleanup operation fails with an exception and the node aborts.

      WORKAROUNDS
      A possible workaround is to enable synchronous migration cleanups:

      use config
      db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});
      

      Note, however, that this makes chunk migrations slower, so this workaround has a performance impact.
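
      To confirm the setting took effect, and to revert it once the shards are running a fixed release, something like the following should work from a mongos shell (reverting restores the faster asynchronous cleanups):

      use config
      db.settings.find({ _id: 'balancer' })  // _waitForDelete should now be true
      db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: false }});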

      AFFECTED VERSIONS
      MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.

      FIX VERSION
      The fix is included in the 2.6.4 production release.

      RESOLUTION DETAILS
      Catch the exception and log a message rather than aborting the node. The affected node is no longer primary, so the delete operation should not be restarted.

      Original description

      One of our mongods crashed. The last log lines are:

      2014-06-16T01:06:07.662+0200 [conn170044] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3982 3926153ms
      2014-06-16T01:06:07.662+0200 [conn170732] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 838061, after repl: 838351, at end: 838351 }
      2014-06-16T01:06:07.662+0200 [conn170732] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:45 reslen:3982 857713ms
      2014-06-16T01:06:07.664+0200 [conn170015] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 4003515, after repl: 4003805, at end: 4003805 }
      2014-06-16T01:06:07.664+0200 [conn170015] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:62 reslen:3982 4063527ms
      2014-06-16T01:06:07.679+0200 [conn170894] end connection 172.16.64.36:57234 (50 connections now open)
      2014-06-16T01:06:07.680+0200 [conn170839] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:53226]
      2014-06-16T01:06:07.680+0200 [conn170008] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:34421]
      2014-06-16T01:06:07.680+0200 [conn170072] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:40371]
      2014-06-16T01:06:07.680+0200 [conn170120] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:44809]
      2014-06-16T01:06:07.680+0200 [conn169952] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45250]
      2014-06-16T01:06:07.680+0200 [conn169973] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45466]
      2014-06-16T01:06:07.680+0200 [conn170677] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:37881]
      2014-06-16T01:06:07.680+0200 [conn170193] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:51427]
      2014-06-16T01:06:07.680+0200 [conn169990] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:33256]
      2014-06-16T01:06:07.680+0200 [conn170545] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50537]
      2014-06-16T01:06:07.680+0200 [conn170836] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:52316]
      2014-06-16T01:06:07.680+0200 [conn170578] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50862]
      2014-06-16T01:06:07.680+0200 [conn169975] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:59916]
      2014-06-16T01:06:07.680+0200 [conn170811] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:49906]
      2014-06-16T01:06:07.680+0200 [conn170094] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:42575]
      2014-06-16T01:06:07.708+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
       /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
       /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
       /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
       /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
       /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
       /usr/bin/mongod() [0x1205829]
       /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
       /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]
      2014-06-16T01:06:07.731+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform):
      0x11c0e91 0x11c0118 0x7f9f8995c166 0x7f9f8995c193 0x7f9f8995c28e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
       /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
       /usr/bin/mongod() [0x11c0118]
       /usr/lib/libstdc++.so.6(+0xcb166) [0x7f9f8995c166]
       /usr/lib/libstdc++.so.6(+0xcb193) [0x7f9f8995c193]
       /usr/lib/libstdc++.so.6(+0xcb28e) [0x7f9f8995c28e]
       /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
       /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
       /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
       /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
       /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
       /usr/bin/mongod() [0x1205829]
       /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
       /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]
      

      The crashed mongod is a member of a replSet consisting of 3 members. The replSet is one of 5 shards. The cluster is running with authentication.

            Assignee:
            randolph@mongodb.com Randolph Tan
            Reporter:
            kay.agahd@idealo.de Kay Agahd
            Votes:
            0
            Watchers:
            8
