|
schwerin It looks like the donor shard did terminate. It was printing out this message on the migration thread:
2014-12-22T17:38:58.031+0000 I SHARDING [conn69] MigrateFromStatus::done About to acquire global lock to exit critical section
|
2014-12-22T17:39:58.113+0000 I - [conn69] LockerId 94 has been waiting to acquire lock for more than 30 seconds. MongoDB will print the lock manager state and the stack of the thread that has been waiting, for diagnostic purposes. This message does not necessary imply that the server is experiencing an outage, but might be an indication of an overload.
|
2014-12-22T17:39:58.124+0000 I - [conn69] Dumping LockManager @ 0x1ae7480
|
2014-12-22T17:39:58.198+0000 I - [conn69] Lock @ 0x4020700: {2305843009213693953: Global, 1}
|
2014-12-22T17:39:58.198+0000 I - [conn69]
|
2014-12-22T17:39:59.981+0000 I - [conn69]
|
|
...
|
|
// repeat multiple times
|
|
2014-12-23T03:53:51.120+0000 I - [conn69] LockerId 94 has been waiting to acquire lock for more than 30 seconds. MongoDB will print the lock manager state and the stack of the thread that has been waiting, for diagnostic purposes. This message does not necessary imply that the server is experiencing an outage, but might be an indication of an overload.
|
2014-12-23T03:53:51.120+0000 I - [conn69] Dumping LockManager @ 0x1ae7480
|
2014-12-23T03:53:51.120+0000 I - [conn69] Lock @ 0x4020700: {2305843009213693953: Global, 1}
|
2014-12-23T03:53:51.120+0000 I - [conn69]
|
2014-12-23T03:53:51.126+0000 I - [conn69]
|
2014-12-23T03:53:51.530+0000 F - [conn69] terminate() called. An exception is active; attempting to gather more information
|
2014-12-23T03:53:51.557+0000 F - [conn69] std::exception::what(): std::exception
|
|
These were the last few lines of the logs:
2014-12-23T03:53:51.120+0000 I - [conn69] LockerId 94 has been waiting to acquire lock for more than 30 seconds. MongoDB will print the lock manager state and the stack of the thread that has been waiting, for diagnostic purposes. This message does not necessary imply that the server is experiencing an outage, but might be an indication of an overload.
|
2014-12-23T03:53:51.120+0000 I - [conn69] Dumping LockManager @ 0x1ae7480
|
2014-12-23T03:53:51.120+0000 I - [conn69] Lock @ 0x4020700: {2305843009213693953: Global, 1}
|
GRANTED:
|
LockRequest 2790 @ 0x43BA5F100: Mode = IX; ConvertMode = NONE; EnqueueAtFront = 0; CompatibleFirst = 0;
|
|
PENDING:
|
LockRequest 94 @ 0x10FABC00: Mode = S; ConvertMode = NONE; EnqueueAtFront = 1; CompatibleFirst = 1;
|
2014-12-23T03:53:51.120+0000 I - [conn69]
|
2014-12-23T03:53:51.126+0000 I - [conn69]
|
0xf0bd99 0x99966a 0x9a13cc 0x99a84b 0x9917fd 0xda565c 0xda751a 0x9a9784 0x9aa5d3 0x9ab08b 0xb77c2a 0xa8af55 0x7e1770 0xec9d61 0x7f3424bd1f18 0x7f3423ce3b9d
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"B0BD99"},{"b":"400000","o":"59966A"},{"b":"400000","o":"5A13CC"},{"b":"400000","o":"59A84B"},{"b":"400000","o":"5917FD"},{"b":"400000","o":"9A565C"},{"b":"400000","o":"9A751A"},{"b":"400000","o":"5A9784"},{"b":"400000","o":"5AA5D3"},{"b":"400000","o":"5AB08B"},{"b":"400000","o":"777C2A"},{"b":"400000","o":"68AF55"},{"b":"400000","o":"3E1770"},{"b":"400000","o":"AC9D61"},{"b":"7F3424BCA000","o":"7F18"},{"b":"7F3423C01000","o":"E2B9D"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc3", "gitVersion" : "2d679247f17dab05a492c8b6d2c250dab18e54f2", "uname" : { "sysname" : "Linux", "release" : "3.10.42-52.145.amzn1.x86_64", "version" : "#1 SMP Tue Jun 10 23:46:43 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFE7AFE000", "elfType" : 3 }, { "b" : "7F3424BCA000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F34249C2000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F34247BE000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F34244BA000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F34241BC000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F3423FA6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F3423C01000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F3424DE6000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
|
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf0bd99]
|
mongod(+0x59966A) [0x99966a]
|
mongod(_ZN5mongo10LockerImplILb0EE12lockCompleteENS_10ResourceIdEjb+0x18C) [0x9a13cc]
|
mongod(_ZN5mongo10LockerImplILb0EE10lockGlobalENS_8LockModeEj+0x2B) [0x99a84b]
|
mongod(_ZN5mongo4Lock10GlobalReadC2EPNS_6LockerEj+0x2D) [0x9917fd]
|
mongod(_ZN5mongo17MigrateFromStatus4doneEPNS_16OperationContextE+0x8C) [0xda565c]
|
mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CDA) [0xda751a]
|
mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9a9784]
|
mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9aa5d3]
|
mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9ab08b]
|
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb77c2a]
|
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8af55]
|
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e1770]
|
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xec9d61]
|
libpthread.so.0(+0x7F18) [0x7f3424bd1f18]
|
libc.so.6(clone+0x6D) [0x7f3423ce3b9d]
|
----- END BACKTRACE -----
|
2014-12-23T03:53:51.530+0000 F - [conn69] terminate() called. An exception is active; attempting to gather more information
|
2014-12-23T03:53:51.557+0000 F - [conn69] std::exception::what(): std::exception
|
Actual exception type: mongo::DBTryLockTimeoutException
|
|
0xf0bd99 0xf0b890 0x7f34245186c6 0x7f3424517789 0x7f342451833a 0x7f3423fb5913 0x7f3423fb5e47 0x991854 0xda565c 0xda751a 0x9a9784 0x9aa5d3 0x9ab08b 0xb77c2a 0xa8af55 0x7e1770 0xec9d61 0x7f3424bd1f18 0x7f3423ce3b9d
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"B0BD99"},{"b":"400000","o":"B0B890"},{"b":"7F34244BA000","o":"5E6C6"},{"b":"7F34244BA000","o":"5D789"},{"b":"7F34244BA000","o":"5E33A"},{"b":"7F3423FA6000","o":"F913"},{"b":"7F3423FA6000","o":"FE47"},{"b":"400000","o":"591854"},{"b":"400000","o":"9A565C"},{"b":"400000","o":"9A751A"},{"b":"400000","o":"5A9784"},{"b":"400000","o":"5AA5D3"},{"b":"400000","o":"5AB08B"},{"b":"400000","o":"777C2A"},{"b":"400000","o":"68AF55"},{"b":"400000","o":"3E1770"},{"b":"400000","o":"AC9D61"},{"b":"7F3424BCA000","o":"7F18"},{"b":"7F3423C01000","o":"E2B9D"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc3", "gitVersion" : "2d679247f17dab05a492c8b6d2c250dab18e54f2", "uname" : { "sysname" : "Linux", "release" : "3.10.42-52.145.amzn1.x86_64", "version" : "#1 SMP Tue Jun 10 23:46:43 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFE7AFE000", "elfType" : 3 }, { "b" : "7F3424BCA000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F34249C2000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F34247BE000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F34244BA000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F34241BC000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F3423FA6000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F3423C01000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F3424DE6000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
|
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf0bd99]
|
mongod(+0xB0B890) [0xf0b890]
|
libstdc++.so.6(+0x5E6C6) [0x7f34245186c6]
|
libstdc++.so.6(+0x5D789) [0x7f3424517789]
|
libstdc++.so.6(__gxx_personality_v0+0x52A) [0x7f342451833a]
|
libgcc_s.so.1(+0xF913) [0x7f3423fb5913]
|
libgcc_s.so.1(_Unwind_Resume+0x57) [0x7f3423fb5e47]
|
mongod(_ZN5mongo4Lock10GlobalReadC2EPNS_6LockerEj+0x84) [0x991854]
|
mongod(_ZN5mongo17MigrateFromStatus4doneEPNS_16OperationContextE+0x8C) [0xda565c]
|
mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CDA) [0xda751a]
|
mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9a9784]
|
mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9aa5d3]
|
mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9ab08b]
|
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb77c2a]
|
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8af55]
|
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e1770]
|
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xec9d61]
|
libpthread.so.0(+0x7F18) [0x7f3424bd1f18]
|
libc.so.6(clone+0x6D) [0x7f3423ce3b9d]
|
----- END BACKTRACE -----
|
|
|
|
Based from the logs, it looks like the migration was taking too long (> 1day) and the from side decided to stop (https://github.com/mongodb/mongo/blob/r2.8.0-rc5/src/mongo/s/d_migrate.cpp#L1287).
Log excerpt:
2014-12-21T17:03:08.044+0000 I SHARDING [conn69] received moveChunk request: { moveChunk: "sbtest.sbtest1", from: "rs1/172.31.39.197:27017,ip-172-31-42-25:27017", to: "rs3/172.31.40.76:27017,ip-172-31-36-70:27017", fromShard: "rs1", toShard: "rs3", min: { _id: MinKey }, max: { _id: -9071177708216106847 }, maxChunkSizeBytes: 67108864, shardId: "sbtest.sbtest1-_id_MinKey", configdb: "172.31.38.156:27019,172.31.39.168:27019,172.31.35.248:27019", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, epoch: ObjectId('5495957883b4d32dac74360d') }
|
2014-12-21T17:03:08.085+0000 I SHARDING [conn69] distributed lock 'sbtest.sbtest1/ip-172-31-42-25:27017:1419089272:1651494422' acquired, ts : 5496fd4c3e0df20b54401602
|
2014-12-21T17:03:08.088+0000 I SHARDING [conn69] about to log metadata event: { _id: "ip-172-31-42-25-2014-12-21T17:03:08-5496fd4c3e0df20b54401603", server: "ip-172-31-42-25", clientAddr: "172.31.42.45:54993", time: new Date(1419181388088), what: "moveChunk.start", ns: "sbtest.sbtest1", details: { min: { _id: MinKey }, max: { _id: -9071177708216106847 }, from: "rs1", to: "rs3" } }
|
2014-12-21T17:03:08.093+0000 I SHARDING [conn69] remotely refreshing metadata for sbtest.sbtest1 based on current shard version 4|208||5495957883b4d32dac74360d, current metadata version is 4|208||5495957883b4d32dac74360d
|
2014-12-21T17:03:08.094+0000 I SHARDING [conn69] updating metadata for sbtest.sbtest1 from shard version 4|208||5495957883b4d32dac74360d to shard version 4|208||5495957883b4d32dac74360d
|
2014-12-21T17:03:08.094+0000 I SHARDING [conn69] collection version was loaded at version 4|220||5495957883b4d32dac74360d, took 0ms
|
2014-12-21T17:03:08.094+0000 I SHARDING [conn69] moveChunk request accepted at version 4|208||5495957883b4d32dac74360d
|
2014-12-21T17:03:17.612+0000 I SHARDING [conn69] moveChunk number of documents: 165171
|
|
...
|
|
2014-12-21T17:03:18.249+0000 I SHARDING [conn69] moveChunk data transfer progress: { active: true, ns: "sbtest.sbtest1", from: "rs1/172.31.39.197:27017,ip-172-31-42-25:27017", min: { _id: MinKey }, max: { _id: -9071177708216106847 }, shardKeyPattern: { _id: "hashed" }, state: "ready", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
|
|
...
|
|
|
2014-12-22T17:38:58.030+0000 I SHARDING [conn69] moveChunk data transfer progress: { active: true, ns: "sbtest.sbtest1", from: "rs1/172.31.39.197:27017,ip-172-31-42-25:27017", min: { _id: MinKey }, max: { _id: -9071177708216106847 }, shardKeyPattern: { _id: "hashed" }, state: "clone", counts: { cloned: 1473, clonedBytes: 319641, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 7
|
2014-12-22T17:38:58.031+0000 I SHARDING [conn69] About to check if it is safe to enter critical section
|
2014-12-22T17:38:58.031+0000 E SHARDING [conn69] moveChunk cannot enter critical section before all data is cloned, 90225 locs were not transferred but to-shard reported { active: true, ns: "sbtest.sbtest1", from: "rs1/172.31.39.197:27017,ip-172-31-42-25:27017", min: { _id: MinKey }, max: { _id: -9071177708216106847 }, shardKeyPattern: { _id: "hashed" }, state: "clone", counts: { cloned: 1473, clonedBytes: 319641, catchup: 0, steady: 0 }, ok: 1.0 }
|
|