[SERVER-14261] stepdown during migration range delete can abort mongod Created: 16/Jun/14  Updated: 11/Mar/15  Resolved: 29/Jul/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.1
Fix Version/s: 2.6.4, 2.7.3

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-14274 awaitReplication in LegacyReplication... Closed
Related
related to SERVER-14389 segmentation fault in RangeDeleter::c... Closed
is related to SERVER-14551 Runner yield during migration cleanup... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Participants:

 Description   
Issue Status as of Aug 6, 2014

ISSUE SUMMARY
When cleaning up after a chunk migration, if the shard is a replica set the primary node may wait for the cleanup to be replicated to secondaries. If during this time the primary becomes non-primary, the node throws an exception and aborts.

The only configurations affected by this issue are sharded clusters where shards are replica sets, the balancer is enabled, and chunk migrations have occurred.
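
A quick way to confirm whether a cluster matches this configuration is to check the balancer state from a mongos shell (a minimal check; sh.getBalancerState() is the standard helper, and the query simply inspects the existing balancer settings document, if any):

sh.getBalancerState()                      // true if the balancer is enabled
use config
db.settings.find({ _id: 'balancer' })      // shows the balancer settings document, if one exists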

USER IMPACT
Under the conditions described above, the cleanup operation fails with an exception and the node aborts.

WORKAROUNDS
A possible workaround is to enable synchronous migration cleanups:

use config
db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});

Note however that this makes migration cleanups slower, so this workaround has a performance impact.
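
Once a fixed release is running, the setting can be reverted with the same style of update (a sketch mirroring the workaround command above):

use config
db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: false }});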

AFFECTED VERSIONS
MongoDB 2.6 production releases up to 2.6.3 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.4 production release.

RESOLUTION DETAILS
Catch the exception and log a message to the log file. The affected node is no longer primary, so the operation should not be restarted.

Original description

One of our mongods crashed. The last log lines are:

2014-06-16T01:06:07.662+0200 [conn170044] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:39 reslen:3982 3926153ms
2014-06-16T01:06:07.662+0200 [conn170732] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 838061, after repl: 838351, at end: 838351 }
2014-06-16T01:06:07.662+0200 [conn170732] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:45 reslen:3982 857713ms
2014-06-16T01:06:07.664+0200 [conn170015] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after indexCounters: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after recordStats: 4003515, after repl: 4003805, at end: 4003805 }
2014-06-16T01:06:07.664+0200 [conn170015] command admin.$cmd command: serverStatus { serverStatus: 1 } keyUpdates:0 numYields:0 locks(micros) r:62 reslen:3982 4063527ms
2014-06-16T01:06:07.679+0200 [conn170894] end connection 172.16.64.36:57234 (50 connections now open)
2014-06-16T01:06:07.680+0200 [conn170839] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:53226]
2014-06-16T01:06:07.680+0200 [conn170008] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:34421]
2014-06-16T01:06:07.680+0200 [conn170072] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:40371]
2014-06-16T01:06:07.680+0200 [conn170120] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:44809]
2014-06-16T01:06:07.680+0200 [conn169952] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45250]
2014-06-16T01:06:07.680+0200 [conn169973] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:45466]
2014-06-16T01:06:07.680+0200 [conn170677] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:37881]
2014-06-16T01:06:07.680+0200 [conn170193] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:51427]
2014-06-16T01:06:07.680+0200 [conn169990] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:33256]
2014-06-16T01:06:07.680+0200 [conn170545] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50537]
2014-06-16T01:06:07.680+0200 [conn170836] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:52316]
2014-06-16T01:06:07.680+0200 [conn170578] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.110:50862]
2014-06-16T01:06:07.680+0200 [conn169975] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:59916]
2014-06-16T01:06:07.680+0200 [conn170811] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:49906]
2014-06-16T01:06:07.680+0200 [conn170094] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [172.16.64.36:42575]
2014-06-16T01:06:07.708+0200 [RangeDeleter] 0x11c0e91 0x1163109 0x11477e6 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0xe6) [0x11477e6]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
 /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]
2014-06-16T01:06:07.731+0200 [RangeDeleter] SEVERE: terminate() called, printing stack (if implemented for platform):
0x11c0e91 0x11c0118 0x7f9f8995c166 0x7f9f8995c193 0x7f9f8995c28e 0x114788e 0xebb7ab 0xeb5cfd 0xde2830 0xddd2dd 0x1205829 0x7f9f89db38ca 0x7f9f8916692d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c0118]
 /usr/lib/libstdc++.so.6(+0xcb166) [0x7f9f8995c166]
 /usr/lib/libstdc++.so.6(+0xcb193) [0x7f9f8995c193]
 /usr/lib/libstdc++.so.6(+0xcb28e) [0x7f9f8995c28e]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x18e) [0x114788e]
 /usr/bin/mongod(_ZN5mongo13SlaveTracking18opReplicatedEnoughENS_6OpTimeERKSs+0x8b) [0xebb7ab]
 /usr/bin/mongod(_ZN5mongo18opReplicatedEnoughENS_6OpTimeENS_11BSONElementE+0xdd) [0xeb5cfd]
 /usr/bin/mongod(_ZN5mongo17RangeDeleterDBEnv11deleteRangeERKNS_10StringDataERKNS_7BSONObjES6_S6_bPSs+0xbb0) [0xde2830]
 /usr/bin/mongod(_ZN5mongo12RangeDeleter6doWorkEv+0x86d) [0xddd2dd]
 /usr/bin/mongod() [0x1205829]
 /lib/libpthread.so.0(+0x68ca) [0x7f9f89db38ca]
 /lib/libc.so.6(clone+0x6d) [0x7f9f8916692d]

The crashed mongod is a member of a replSet consisting of 3 members. The replSet is one of 5 shards. The cluster is running with authentication.



 Comments   
Comment by Ramon Fernandez Marina [ 28/Aug/14 ]

kay.agahd@idealo.de, the workaround does prevent the crash reported in this ticket. The crash you reported in SERVER-14389 is a separate issue altogether.

Regards,
Ramón.

Comment by Kay Agahd [ 08/Aug/14 ]

WORKAROUNDS A possible workaround is to enable synchronous migration cleanups:

This workaround didn't work for us: waitForDelete=true didn't help, and even with the balancer disabled, mongod crashed.

Comment by Githook User [ 21/Jul/14 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-14261 stepdown during migration range delete can abort mongod
Branch: v2.6
https://github.com/mongodb/mongo/commit/fa0ca23e0d48269f08d677bb4b1102b9bada6adc

Comment by Randolph Tan [ 23/Jun/14 ]

Fixed by commit:
https://github.com/mongodb/mongo/commit/017fc448ba39e0ee731b84b21991c7a3646a8724

Comment by Kay Agahd [ 17/Jun/14 ]

Thanks Eric, much appreciated!

Comment by Eric Milkie [ 17/Jun/14 ]

The error can happen whenever a node steps down for any reason. By not executing the replSetFreezeCommand, you are likely, but not guaranteed, to avoid the problem.
Randolph's suggestion of using synchronous balancer deletes, on the other hand, will completely avoid the problem.

Comment by Kay Agahd [ 16/Jun/14 ]

If the node stepped down for other reasons, it wouldn't have received a replSetFreeze command, since we only send that command manually. So the error shouldn't occur in that case. Does this make sense?

Comment by Randolph Tan [ 16/Jun/14 ]

kay.agahd@idealo.de You can do that as well, but the issue can still occur if the node steps down for other reasons.

Comment by Kay Agahd [ 16/Jun/14 ]

Thank you for the workaround Randolph!
Could we also avoid the error by NOT executing the replSetFreeze command on the PRIMARY? We could adapt our compress script to do so, because the primary does not need to be frozen since it will be stepped down in any case. What do you think? Thanks!

Comment by Randolph Tan [ 16/Jun/14 ]

kay.agahd@idealo.de One way to work around this issue while waiting for the fix is to change the balancer setting so it always performs synchronous migration cleanup. This can be done by setting the _waitForDelete field in the config.settings document to true, like this:

use config
db.settings.update({ _id: 'balancer' }, { $set: { _waitForDelete: true }});

The side effect is that migrations will take longer to complete, because each one will wait for the deletion to finish.
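
To confirm the setting took effect, the same document can be inspected afterwards (a minimal check against the config.settings collection):

use config
db.settings.find({ _id: 'balancer' })      // the result should now include _waitForDelete: true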

Comment by Randolph Tan [ 16/Jun/14 ]

This is a bug, and it is scheduled to be fixed in 2.6.3.

Comment by Kay Agahd [ 16/Jun/14 ]

Am I right that this is a bug? Will you fix it, or should we avoid the replSetFreeze command?
We compact our databases on a regular basis. When the primary needs to be compacted, we run the replSetFreeze command on the secondaries with the least free RAM. Thus, once the primary has stepped down, the new primary will be the server with the most free RAM.
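
For context, a minimal sketch of that procedure in the mongo shell could look like the following; rs.freeze() and rs.stepDown() are the standard helpers wrapping replSetFreeze and replSetStepDown, and the durations are illustrative:

// connected to each secondary that should NOT become primary:
rs.freeze(300)      // refuse to become primary for 300 seconds

// connected to the current primary:
rs.stepDown(60)     // step down; an unfrozen, eligible secondary is then elected

// after compacting the former primary, the freeze can be lifted early:
rs.freeze(0)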

Comment by Randolph Tan [ 16/Jun/14 ]

Based on the logs, the node stepped down after receiving a replSetFreeze command. The RangeDeleter thread asserted because the node was no longer primary, but the exception was never caught, so it terminated the server.

Comment by Kay Agahd [ 16/Jun/14 ]

Greg, I've uploaded the log file mongod.log.1.tgz to you:

root@s124:/home/admin# scp -P 722 /data/mongod/log/mongod.log.1.tgz SERVER-14261@www.mongodb.com:
The authenticity of host '[www.mongodb.com]:722 ([75.101.156.249]:722)' can't be established.
DSA key fingerprint is 8d:cf:ca:17:1e:00:66:e0:a0:9f:08:0b:e6:71:e1:ff.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '[www.mongodb.com]:722,[75.101.156.249]:722' (DSA) to the list of known hosts.
SERVER-14261@www.mongodb.com's password: 
mongod.log.1.tgz   
root@s124:/home/admin# 

If you need more information, just tell me please.
Thanks!

Comment by Greg Studer [ 16/Jun/14 ]

Is there any more of the log file you can provide us from this mongod - ideally something close to an hour before/after? This looks like it happened due to a replica set state change while waiting for replication, but more context would help us further confirm this.
