[SERVER-29293] Recipient shard fails to abort migration on stepdown Created: 19/May/17  Updated: 30/Oct/23  Resolved: 14/Jul/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.4.4, 3.5.10
Fix Version/s: 3.4.11, 3.5.11

Type: Bug Priority: Major - P3
Reporter: Wayne Wang Assignee: Nathan Myers
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File shard3.log.rar    
Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v3.4
Steps To Reproduce:

No reliable steps to reproduce; the problem appears to occur during chunk migration.


 Description   

Environment:
MongoDB sharded cluster on version 3.4.4, with 3 shards, each deployed as a 2-member replica set.
Symptom:
The shard3 primary and secondary mongod processes aborted unexpectedly.

2017-05-18T23:28:26.240+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.11.123.241:22003
2017-05-18T23:28:26.243+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.11.123.241:22003 - HostUnreachable: Connection refused
2017-05-18T23:28:26.243+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 10.11.123.241:22003 due to failed operation on a connection
2017-05-18T23:28:26.244+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 10.11.123.241:22003; HostUnreachable: Connection refused
2017-05-18T23:28:26.244+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to 10.11.123.241:22003
2017-05-18T23:28:26.245+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to 10.11.123.241:22003 - HostUnreachable: Connection refused
2017-05-18T23:28:26.245+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to 10.11.123.241:22003 due to failed operation on a connection
2017-05-18T23:28:26.246+0800 I REPL     [ReplicationExecutor] Error in heartbeat request to 10.11.123.241:22003; HostUnreachable: Connection refused
2017-05-18T23:28:26.252+0800 I SHARDING [migrateThread] about to log metadata event into changelog: { _id: "Juyuan202-01-2017-05-18T23:28:26.251+0800-591dbd9a22d91a38ec6bb40f", server: "Juyuan202-01", clientAddr: "", time: new Date(1495121306251), what: "moveChunk.to", ns: "SafetyAnalysis.dns.record", details: { min: { _id: "4bfb5160c15087" }, max: { _id: "4bfb52c2b2ea8a" }, step 1 of 6: 32, step 2 of 6: 2, step 3 of 6: 10026, step 4 of 6: 0, note: "aborted", errmsg: "Cannot go to critical section because secondaries cannot keep up" } }
2017-05-18T23:28:26.266+0800 I -        [migrateThread] Invariant failure it != _receivingChunks.end() src/mongo/db/s/metadata_manager.cpp 234
2017-05-18T23:28:26.273+0800 I -        [migrateThread]
 
***aborting after invariant() failure
 
 
2017-05-18T23:28:26.325+0800 F -        [migrateThread] Got signal: 6 (Aborted).
 
 0x7f824a9ade61 0x7f824a9acf59 0x7f824a9ad43d 0x7f824810a7e0 0x7f8247d995e5 0x7f8247d9adc5 0x7f8249c41a12 0x7f824a51c9d8 0x7f824a52dabc 0x7f824a533e02 0x7f824a534550 0x7f824b41ee10 0x7f8248102aa1 0x7f8247e4faad
----- BEGIN BACKTRACE -----
 
mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x7f824a9ade61]
 mongod(+0x157CF59) [0x7f824a9acf59]
 mongod(+0x157D43D) [0x7f824a9ad43d]
 libpthread.so.0(+0xF7E0) [0x7f824810a7e0]
 libc.so.6(gsignal+0x35) [0x7f8247d995e5]
 libc.so.6(abort+0x175) [0x7f8247d9adc5]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x0) [0x7f8249c41a12]
 mongod(_ZN5mongo15MetadataManager13forgetReceiveERKNS_10ChunkRangeE+0x2E8) [0x7f824a51c9d8]
 mongod(_ZN5mongo27MigrationDestinationManager14_forgetPendingEPNS_16OperationContextERKNS_15NamespaceStringERKNS_7BSONObjES8_RKNS_3OIDE+0x36C) [0x7f824a52dabc]
 mongod(_ZN5mongo27MigrationDestinationManager14_migrateThreadENS_7BSONObjES1_S1_NS_16ConnectionStringENS_3OIDENS_19WriteConcernOptionsE+0x172) [0x7f824a533e02]
 mongod(+0x1104550) [0x7f824a534550]
 mongod(+0x1FEEE10) [0x7f824b41ee10]
libpthread.so.0(+0x7AA1) [0x7f8248102aa1]
 libc.so.6(clone+0x6D) [0x7f8247e4faad]
-----  END BACKTRACE  -----
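
For context on the crash above: mongod's invariant() aborts the whole process with SIGABRT when its condition is false, which is exactly the "Got signal: 6 (Aborted)" seen here. Below is a minimal, self-contained model of the failing check in MetadataManager::forgetReceive(); it is an illustrative approximation rather than the actual mongod source (only the _receivingChunks name and the invariant condition are taken from the trace above), but it shows why forgetting a chunk that is no longer registered takes down the entire process.

// Illustrative model of the failing invariant (not mongod source).
// Assumes a registry of in-flight chunk receives keyed by the chunk's min bound.
#include <cstdlib>
#include <iostream>
#include <map>
#include <string>

struct ChunkRange {
    std::string min;  // e.g. "4bfb5160c15087"
    std::string max;  // e.g. "4bfb52c2b2ea8a"
};

// Stand-in for mongod's invariant(): abort the process when the condition fails.
#define INVARIANT(cond)                                               \
    do {                                                              \
        if (!(cond)) {                                                \
            std::cerr << "Invariant failure " #cond << std::endl;     \
            std::abort();  /* raises SIGABRT, as in the log above */  \
        }                                                             \
    } while (0)

class MetadataManagerModel {
public:
    void beginReceive(const ChunkRange& range) {
        _receivingChunks[range.min] = range.max;
    }

    // Fails hard if the chunk is no longer in the registry, e.g. because the
    // registry was discarded (here: reset) before the migrate thread got here.
    void forgetReceive(const ChunkRange& range) {
        auto it = _receivingChunks.find(range.min);
        INVARIANT(it != _receivingChunks.end());
        _receivingChunks.erase(it);
    }

private:
    std::map<std::string, std::string> _receivingChunks;
};

int main() {
    MetadataManagerModel mm;
    ChunkRange chunk{"4bfb5160c15087", "4bfb52c2b2ea8a"};

    mm.beginReceive(chunk);
    mm = MetadataManagerModel{};  // simulate the registry being discarded mid-migration
    mm.forgetReceive(chunk);      // invariant fires and the process aborts
}

In the real server the registry entry disappears because the collection metadata is discarded while the migrate thread is still running; the comments below analyze how that happens.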



 Comments   
Comment by Githook User [ 09/Nov/17 ]

Author:

{'name': 'Nathan Myers', 'username': 'nathan-myers-mongo', 'email': 'nathan.myers@10gen.com'}

Message: SERVER-29293 Recipient shard to abort migration on stepdown

(cherry picked from commit 6d921f47c0fcb29266c57286f1dd0d411cc459f0)
Branch: v3.4
https://github.com/mongodb/mongo/commit/2a576865f2ac132eeb76a6fbdc630787b1430e8c

Comment by Githook User [ 14/Jul/17 ]

Author:

{u'username': u'nathan-myers-mongo', u'name': u'Nathan Myers', u'email': u'nathan.myers@10gen.com'}

Message: SERVER-29293 Recipient shard to abort migration on stepdown
Branch: master
https://github.com/mongodb/mongo/commit/6d921f47c0fcb29266c57286f1dd0d411cc459f0

Comment by Nathan Myers [ 29/Jun/17 ]

BTW, I would bet that the 600 minute timeout was supposed to be 600 seconds.

Comment by Nathan Myers [ 29/Jun/17 ]

This log is from a shard primary getting ready to begin requesting documents for a chunk from a donor shard.

The relevant part of the (filtered) log seems to be:

2017-05-18T13:28:16.071+0800 I SHARDING [conn61] Marking collection SafetyAnalysis.dns.record as sharded with collection version: 1169|1||59141106552811d553bf9a42, shard version: 1168|0||59141106552811d553bf9a42
2017-05-18T13:28:16.072+0800 I SHARDING [migrateThread] Starting receiving end of migration of chunk { _id: "4bfb5160c15087" } -> { _id: "4bfb52c2b2ea8a" } for collection SafetyAnalysis.dns.record from shard2ReplSet/10.11.123.231:22002,10.11.123.241:22002 at epoch 59141106552811d553bf9a42 with session id shard2ReplSet_shard3ReplSet_591d30ef3f51cc9d3a34e8be
2017-05-18T13:28:16.104+0800 I SHARDING [migrateThread] Deleter starting delete for: SafetyAnalysis.dns.record from { _id: "4bfb5160c15087" } -> { _id: "4bfb52c2b2ea8a" }, with opId: 2874873
2017-05-18T13:28:16.104+0800 I SHARDING [migrateThread] rangeDeleter deleted 0 documents for SafetyAnalysis.dns.record from { _id: "4bfb5160c15087" } -> { _id: "4bfb52c2b2ea8a" }
2017-05-18T13:28:23.927+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending idle connection to host 10.11.123.231:22003 because the pool meets constraints; 1 connections to that host remain open
"4bfb5160c15087" } -> { _id: "4bfb52c2b2ea8a" } waiting for: { ts: Timestamp 1495085306000|591, t: 270 }
2017-05-18T13:29:39.649+0800 I ASIO     [NetworkInterfaceASIO-Replication-0] Ending connection to host 10.11.123.231:22003 due to bad connection status; 0 connections to that host remain open
2017-05-18T13:29:39.650+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending idle connection to host 10.11.123.221:21000 because the pool meets constraints; 1 connections to that host remain open
2017-05-18T13:29:18.895+0800 I -        [conn44] end connection 10.11.123.241:58160 (21 connections now open)
2017-05-18T13:29:39.651+0800 I NETWORK  [replSetDistLockPinger] Marking host 10.11.123.231:21000 as failed :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 2721966 -- target:10.11.123.231:21000 db:config expDate:2017-05-18T13:29:14.621+0800 cmd:{ findAndModify: "lockpings", query: { _id: "Juyuan202-01:22003:1494982120:7736367257400017735" }, update: { $set: { ping: new Date(1495085324621) } }, upsert: true, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
2017-05-18T13:29:39.651+0800 I SHARDING [replSetDistLockPinger] Operation timed out with status ExceededTimeLimit: Operation timed out, request was RemoteCommand 2721966 -- target:10.11.123.231:21000 db:config expDate:2017-05-18T13:29:14.621+0800 cmd:{ findAndModify: "lockpings", query: { _id: "Juyuan202-01:22003:1494982120:7736367257400017735" }, update: { $set: { ping: new Date(1495085324621) } }, upsert: true, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
2017-05-18T13:29:39.651+0800 W SHARDING [replSetDistLockPinger] pinging failed for distributed lock pinger :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 2721966 -- target:10.11.123.231:21000 db:config expDate:2017-05-18T13:29:14.621+0800 cmd:{ findAndModify: "lockpings", query: { _id: "Juyuan202-01:22003:1494982120:7736367257400017735" }, update: { $set: { ping: new Date(1495085324621) } }, upsert: true, writeConcern: { w: "majority", wtimeout: 15000 }, maxTimeMS: 30000 }
2017-05-18T13:29:19.456+0800 I -        [conn45] end connection 10.11.123.241:58162 (21 connections now open)
2017-05-18T13:29:20.020+0800 I -        [conn42] end connection 10.11.123.241:58155 (21 connections now open)
2017-05-18T13:29:20.577+0800 I -        [conn43] end connection 10.11.123.241:58157 (21 connections now open)
2017-05-18T13:29:39.650+0800 I ASIO     [NetworkInterfaceASIO-ShardRegistry-0] Ending connection to host 10.11.123.231:21000 due to bad connection status; 1 connections to that host remain open
2017-05-18T13:29:39.653+0800 I SHARDING [shard registry reload] Operation timed out  :: caused by :: ExceededTimeLimit: Operation timed out, request was RemoteCommand 2721995 -- target:10.11.123.231:21000 db:admin cmd:{ isMaster: 1 }
2017-05-18T13:29:39.653+0800 I SHARDING [shard registry reload] Periodic reload of shard registry failed  :: caused by :: 50 could not get updated shard list from config server due to Operation timed out, request was RemoteCommand 2721995 -- target:10.11.123.231:21000 db:admin cmd:{ isMaster: 1 }; will retry after 30s
2017-05-18T13:29:51.696+0800 I REPL     [ReplicationExecutor] can't see a majority of the set, relinquishing primary
2017-05-18T13:31:02.179+0800 I COMMAND  [conn60] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "shard3ReplSet", configVersion: 3, from: "10.11.123.231:22003", fromId: 1, term: 270 } numYields:0 reslen:527 locks:{} protocol:op_command 69028ms

This is followed, ten hours later and after many primary step-ups and step-downs, by aborting the migration. By then the structure the invariant checks no longer exists, so the invariant fires as designed.

The failure appears to be that the migration is not aborted when the "Periodic reload of shard registry" fails; or, at the latest, when the receiving host steps down as primary.
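
As a way to picture the fix direction named in the ticket title (the recipient shard aborts the migration on stepdown), here is a rough, assumption-laden sketch: the stepdown path interrupts the in-flight receive and cleans up its registry entry while that entry still exists, and the migrate thread checks an abort flag instead of unconditionally asserting that the entry is still present. This is an illustration only, not the actual patch; beginReceive(), onStepDown(), and migrateThread() are hypothetical names, and the real change is in the commits referenced earlier in this thread.

// Hypothetical sketch of "recipient aborts the receive on stepdown" (not the real patch).
#include <chrono>
#include <iostream>
#include <mutex>
#include <set>
#include <string>
#include <thread>

class RecipientModel {
public:
    void beginReceive(const std::string& chunkMin) {
        std::lock_guard<std::mutex> lk(_mutex);
        _receivingChunks.insert(chunkMin);
    }

    // Stepdown path: abort the in-flight receive and drop its registry entry
    // while the entry is still guaranteed to exist.
    void onStepDown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _aborted = true;
        _receivingChunks.clear();
    }

    // Migrate thread: before forgetting its chunk, check whether the migration
    // was already aborted by a stepdown instead of asserting the entry exists.
    void migrateThread(const std::string& chunkMin) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));  // stand-in for cloning work
        std::lock_guard<std::mutex> lk(_mutex);
        if (_aborted) {
            std::cout << "receive already aborted on stepdown; nothing to forget\n";
            return;
        }
        _receivingChunks.erase(chunkMin);  // safe: no stepdown, so the entry is still there
    }

private:
    std::mutex _mutex;
    std::set<std::string> _receivingChunks;
    bool _aborted = false;
};

int main() {
    RecipientModel recipient;
    recipient.beginReceive("4bfb5160c15087");

    std::thread migrator(&RecipientModel::migrateThread, &recipient,
                         std::string("4bfb5160c15087"));
    recipient.onStepDown();  // stepdown races with the migrate thread
    migrator.join();         // no invariant failure in either interleaving
}

Either interleaving ends cleanly: if the stepdown wins, the migrate thread sees the abort flag and does nothing; if the migrate thread wins, its registry entry is still present when it forgets it.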

Comment by Wayne Wang [ 27/Jun/17 ]

The same problem occurs almost every day, so I must restart the mongod process of the corresponding shard manually.

[root@203-01 ~]# ps -e l|grep mongo
1 0 1037 1 20 0 1829744 733960 futex_ Sl ? 653:37 /home/mongodb3.4.4/bin/mongod --configsvr --replSet cfgReplSet --dbpath /home/mongodb3.4.4/config/data --port 21000 --logpath /home/mongodb3.4.4/config/log/config.log --fork
1 0 1257 1 20 0 16069820 12866120 futex_ Sl ? 2176:52 /home/mongodb3.4.4/bin/mongod --shardsvr --replSet shard2ReplSet --port 22002 --dbpath /home/mongodb3.4.4/shard2/data --logpath /home/mongodb3.4.4/shard2/log/shard2.log --fork --nojournal
1 0 1343 1 20 0 474052 18268 futex_ Sl ? 393:19 /home/mongodb3.4.4/bin/mongod --shardsvr --replSet shard3ReplSet --port 22003 --dbpath /home/mongodb3.4.4/shard3/data --logpath /home/mongodb3.4.4/shard3/log/shard3.log --fork --nojournal
1 0 1408 1 20 0 434976 22440 futex_ Sl ? 198:43 /home/mongodb3.4.4/bin/mongos --configdb cfgReplSet/10.11.123.221:21000,10.11.123.231:21000,10.11.123.241:21000 --port 20000 --logpath /home/mongodb3.4.4/mongos/log/mongos.log --fork

[root@203-01 ~]# /home/mongodb3.4.4/bin/mongod --shardsvr --replSet shard1ReplSet --port 22001 --dbpath /home/mongodb3.4.4/shard1/data --logpath /home/mongodb3.4.4/shard1/log/shard1.log --fork --nojournal
about to fork child process, waiting until server is ready for connections.
forked process: 24586

Comment by Ramon Fernandez Marina [ 19/May/17 ]

Thanks for your report wayne80, and for uploading the full logs.

This is MongoDB 3.4.4; using the corresponding debug symbols, the demangled stack trace is below:

 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/util/stacktrace_posix.cpp:172:0: mongo::printStackTrace(std::ostream&)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/util/signal_handlers_synchronous.cpp:178:0: mongo::(anonymous namespace)::printSignalAndBacktrace(int)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/util/signal_handlers_synchronous.cpp:235:0: mongo::(anonymous namespace)::abruptQuit(int)
 ??:0:0: ??
 ??:0:0: ??
 ??:0:0: ??
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/util/assert_util.cpp:154:0: mongo::invariantFailed(char const*, char const*, unsigned int)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/db/s/metadata_manager.cpp:234:0: mongo::MetadataManager::forgetReceive(mongo::ChunkRange const&)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/db/s/migration_destination_manager.cpp:1074:0: mongo::MigrationDestinationManager::_forgetPending(mongo::OperationContext*, mongo::NamespaceString const&, mongo::BSONObj const&, mongo::BSONObj const&, mongo::OID const&)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/db/s/migration_destination_manager.cpp:428:0: mongo::MigrationDestinationManager::_migrateThread(mongo::BSONObj, mongo::BSONObj, mongo::BSONObj, mongo::ConnectionString, mongo::OID, mongo::WriteConcernOptions)
 /data/mci/b9c1c92514a49d7446ed28ea59697136/src/src/mongo/db/s/migration_destination_manager.cpp:319:0: operator()
 /opt/mongodbtoolchain/v2/include/c++/5.4.0/functional:1531:0: _M_invoke<>
 /opt/mongodbtoolchain/v2/include/c++/5.4.0/functional:1520:0: operator()
 /opt/mongodbtoolchain/v2/include/c++/5.4.0/thread:115:0: std::thread::_Impl<std::_Bind_simple<mongo::MigrationDestinationManager::start(mongo::NamespaceString const&, mongo::ScopedRegisterReceiveChunk, mongo::MigrationSessionId const&, mongo::ConnectionString const&, mongo::ShardId const&, mongo::ShardId const&, mongo::BSONObj const&, mongo::BSONObj const&, mongo::BSONObj const&, mongo::OID const&, mongo::WriteConcernOptions const&)::{lambda()#1} ()> >::_M_run()
 /data/mci/6bbc2c0020aea5cd322127b03219e9a7/toolchain-builder/build-gcc-v2.sh-EyP/x86_64-mongodb-linux/libstdc++-v3/src/c++11/../../../../../gcc-5.4.0/libstdc++-v3/src/c++11/thread.cc:84:0: execute_native_thread_routine
 ??:0:0: ??

Stay tuned for updates.

Regards,
Ramón.
