We currently have a setup with 3 shards that we are trying to consolidate down to a single shard. After reviewing the documentation we started the removeShard process; it seemed to go fine and spent the next several days draining chunks.
At this point it has been stuck for a long time:
db.getSiblingDB('admin').runCommand( { removeShard: "shard0001" } );
{
    "msg": "draining ongoing",
    "state": "ongoing",
    "remaining": {
        "chunks": NumberLong("8"),
        "dbs": NumberLong("0")
    },
    "ok": 1
}
After a lot of reading, it would seem this is because some of the chunks are indivisible or too big. The output of
db.getSiblingDB("config").chunks.find({shard: "shard0001"})
shows that they are all marked as jumbo, which makes sense given the docs. However, running the following shows that they aren't big, just possibly indivisible:
// Shard key pattern for the collection (used as keyPattern below)
var key = db.getSiblingDB("config").collections.findOne({ _id: "interface.data_3600" }).key;

db.getSiblingDB("config").chunks.find({ ns: "interface.data_3600", shard: "shard0001" }).forEach(function(chunk) {
    var ds = db.getSiblingDB("interface").runCommand({ datasize: chunk.ns, keyPattern: key, min: chunk.min, max: chunk.max });
    print("Chunk: " + chunk._id + " has a size of " + ds.size + ", and includes " + ds.numObjects + " objects (took " + ds.millis + "ms)");
});
Output:
Chunk: interface.data_3600-identifier_"eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7"start_1443600000end_1447200000 has a size of 0, and includes 0 objects (took 364ms)
Chunk: interface.data_3600-identifier_"f3bd3220e046b1093b1c976edf4bbdd0a80e11d3a1c08af483654e1929a4617b"start_1447200000end_1450800000 has a size of 0, and includes 0 objects (took 67ms)
Chunk: interface.data_3600-identifier_"f58989626ca054a08a89291eb888ad55997cc44ede5a55bf2723f2a9a6d7dadf"start_1447200000end_1450800000 has a size of 1319889, and includes 4 objects (took 290ms)
Chunk: interface.data_3600-identifier_"f608484cf555af6a1dcd6ef85e24c0b4ca9880b2c8fd459beedd9a6b158bac31"start_1447200000end_1450800000 has a size of 1715344, and includes 5 objects (took 193ms)
Chunk: interface.data_3600-identifier_"f85d1cb52fbda215a828f994aa7da465892c9aa9d6a423d3555797b9a394258e"start_1447200000end_1450800000 has a size of 1888620, and includes 4 objects (took 102ms)
Chunk: interface.data_3600-identifier_"fc89195d311b61c3d38f38ee3d29dcb111a4a4752ca460eb07c305220f45a720"start_1450800000end_1454400000 has a size of 0, and includes 0 objects (took 0ms)
Chunk: interface.data_3600-identifier_"ff26d379eb71a09c56438310289ad4a6556e7e4e4f3b6a14f1a04411c599ae66"start_1447200000end_1450800000 has a size of 1869036, and includes 4 objects (took 70ms)
Chunk: interface.data_3600-identifier_"ffff18518b122a12d14aa6c229c96fb824ae0a04328a80efeb87585deda6920b"start_1450800000end_1454400000 has a size of 1525992, and includes 4 objects (took 49ms)
These all appear to be empty or nearly empty chunks. There was a period when we deleted a lot of data, so they may be left over from that. However, I cannot get them to move so that the draining can finish.
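For reference, the manual moves were attempted with sh.moveChunk, roughly in this form (shard-key values abbreviated here; a complete invocation is shown further down):

sh.moveChunk(
    "interface.data_3600",
    { "identifier": "f3bd3220e046...", "start": NumberLong("1447200000"), "end": NumberLong("1450800000") },
    "shard0000"
);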
Trying to perform a manual move resulted in the following:
{
    "cause": {
        "chunkTooBig": true,
        "estimatedChunkSize": NumberLong("169236184162472618"),
        "ok": 0,
        "errmsg": "chunk too big to move"
    },
    "ok": 0,
    "errmsg": "move failed"
}
After some more reading, someone suggested deleting the documents in the chunk, clearing the jumbo flag manually, and then moving it. I deleted the documents and cleared the flag roughly as sketched below.
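The flag-clearing step was an update against the config database along these lines (run via mongos with the balancer stopped; the chunk _id is abbreviated here):

// Stop the balancer before touching config metadata
sh.stopBalancer();

// Clear the jumbo flag on the stuck chunk (chunk _id abbreviated)
db.getSiblingDB("config").chunks.update(
    { _id: "interface.data_3600-identifier_\"f3bd3220e046...\"start_1447200000end_1450800000" },
    { $unset: { jumbo: "" } }
);

// Have mongos reload its cached routing metadata afterwards
db.adminCommand({ flushRouterConfig: 1 });

With the documents deleted and the flag cleared, running a manual move now shows: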
sh.moveChunk("interface.data_3600", {"identifier": "f3bd3220e046b1093b1c976edf4bbdd0a80e11d3a1c08af483654e1929a4617b", "start": NumberLong("1447200000"), "end": NumberLong("1450800000")}, "shard0000");
{
    "cause": {
        "ok": 0,
        "errmsg": "could not acquire collection lock for interface.data_3600 to migrate chunk [{ : MinKey },{ : MaxKey }) :: caused by :: Lock for migrating chunk [{ : MinKey }, { : MaxKey }) in interface.data_3600 is taken."
    },
    "ok": 0,
    "errmsg": "move failed"
}
The lock in question seems to be perpetually stuck, and the regular balancer keeps logging the same message. In the interest of debugging, I tried manually clearing the lock (roughly as sketched below) and re-running the move.
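The lock was cleared with something along these lines against config.locks (reproduced from memory; in config.locks, state 2 means locked and 0 means unlocked):

// With the balancer stopped, force-release the stuck distributed lock
sh.stopBalancer();
db.getSiblingDB("config").locks.update(
    { _id: "interface.data_3600", state: 2 },
    { $set: { state: 0 } }
);

After clearing the lock and retrying the move, the migrate FROM shard segfaulted with this: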
2016-06-02T13:41:01.534+0000 I SHARDING [conn3] received moveChunk request: { moveChunk: "interface.data_3600", from: "tsds.bldc.grnoc.iu.edu:27026", to: "tsds.bldc.grnoc.iu.edu:27025", fromShard: "shard0001", toShard: "shard0000", min: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 14
43600000, end: 1447200000 }, max: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1447200000, end: 1450800000 }, maxChunkSizeBytes: 268435456, shardId: "interface.data_3600-identifier_"eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7"start_1443600000end_1447200000", conf
igdb: "tsds.bldc.grnoc.iu.edu:27019,tsds.bldc.grnoc.iu.edu:27020,tsds.bldc.grnoc.iu.edu:27021", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, epoch: ObjectId('5592cb956e74ba75920e0bda') }
2016-06-02T13:41:01.797+0000 I SHARDING [conn3] distributed lock 'interface.data_3600/tsds.bldc.grnoc.iu.edu:27026:1464874713:1439290629' acquired, ts : 5750376d7a625e027e304639
2016-06-02T13:41:01.797+0000 I SHARDING [conn3] about to log metadata event: { _id: "tsds.bldc.grnoc.iu.edu-2016-06-02T13:41:01-5750376d7a625e027e30463a", server: "tsds.bldc.grnoc.iu.edu", clientAddr: "140.182.45.176:52095", time: new Date(1464874861797), what: "moveChunk.start", ns: "interface.data_3600", details: { min: { id
entifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1443600000, end: 1447200000 }, max: { identifier: "eb680bf0edff481d8029cda660b46438157ec1af94850abf62bda93a6989f0c7", start: 1447200000, end: 1450800000 }, from: "shard0001", to: "shard0000" } }
2016-06-02T13:41:01.799+0000 I SHARDING [conn3] remotely refreshing metadata for interface.data_3600, current shard version is 0|0||000000000000000000000000, current metadata version is 0|0||000000000000000000000000
2016-06-02T13:41:01.822+0000 I SHARDING [conn3] collection interface.data_3600 was previously unsharded, new metadata loaded with shard version 4642|1||5592cb956e74ba75920e0bda
2016-06-02T13:41:01.822+0000 I SHARDING [conn3] collection version was loaded at version 4642|1||5592cb956e74ba75920e0bda, took 23ms
2016-06-02T13:41:01.822+0000 I SHARDING [conn3] moveChunk request accepted at version 4642|1||5592cb956e74ba75920e0bda
2016-06-02T13:41:01.822+0000 F - [conn3] Invalid operation at address: 0xdfa98e
2016-06-02T13:41:01.828+0000 F - [conn3] Got signal: 8 (Floating point exception).
0xf758c9 0xf74f42 0xf7529e 0x3fda60f7e0 0xdfa98e 0xdfcef4 0x9cf824 0x9d07ad 0x9d14bb 0xba2f4a 0xab4830 0x7fbcfd 0xf26b9b 0x3fda607aa1 0x3fda2e893d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B758C9"},{"b":"400000","o":"B74F42"},{"b":"400000","o":"B7529E"},{"b":"3FDA600000","o":"F7E0"},{"b":"400000","o":"9FA98E"},{"b":"400000","o":"9FCEF4"},{"b":"400000","o":"5CF824"},{"b":"400000","o":"5D07AD"},{"b":"400000","o":"5D14BB"},{"b":"400000","o":"7A2F4A"},{"b":"400000","o":"6B4830"},{"b"
:"400000","o":"3FBCFD"},{"b":"400000","o":"B26B9B"},{"b":"3FDA600000","o":"7AA1"},{"b":"3FDA200000","o":"E893D"}],"processInfo":{ "mongodbVersion" : "3.0.7", "gitVersion" : "6ce7cbe8c6b899552dadd907604559806aa2e9bd", "uname" : { "sysname" : "Linux", "release" : "2.6.32-504.23.4.el6.x86_64", "version" : "#1 SMP Tue Jun 9 20:57:
37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "E359E52F4A5E2A16804BCA34965CC29081397CDF" }, { "b" : "7FFF3C0CD000", "elfType" : 3, "buildId" : "FF4CBAAE51A93124ED31C2B1386CE92FF24AEBC3" }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "C56DD1B811FC0D9263248EBB3
08C73FCBCD80FC1" }, { "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "40BEA6554E64FC0C3D5C7D0CD91362730515102F" }, { "path" : "/usr/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "FC4EFD7502ACB3B9D213D28272D15A165857AD5A" }, { "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "95159178F1A4A3DBDC7819F
BEA2C80E5FCDD6BAC" }, { "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "29B61382141595ECBA6576232E44F2310C3AAB72" }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "F07F2E7CF4BFB393CC9BBE8CDC6463652E14DB07" }, { "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "989FE3A42CA8CEBDCC185A743896F
23A0CF537ED" }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "246C3BAB0AB093AFD59D34C8CBF29E786DE4BE97" }, { "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "8E6FA4C4B0594C355C1B90C1D49990368C81A040" }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "959C5E10A47EE8A633E7681B64B4B9
F74E242ED5" }, { "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "54BA6B78A9220344E77463947215E42F0EABCC62" }, { "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6797403AA5F8FAD8ADFF683478B45F528CE4FB0E" }, { "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "8CE28F280150E62296240E70ECA
C64E4A57AB826" }, { "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "05733977F4E41652B86070B27A0CFC2C1EA7719D" }, { "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "5FA8E5038EC04A774AF72A9BB62DC86E1049C4D6" }, { "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "E3FA235F3BA3F776A01A18ECA7
37C9890F445923" }, { "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "AF374BAFB7F5B139A0B431D3F06D82014AFF3251" }, { "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "C39D7FFB49DFB1B55AD09D1D711AD802123F6623" }, { "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "E6798A06BEE17CF102BBA44F
D512FF8B805CEAF1" } ] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf758c9]
mongod(+0xB74F42) [0xf74f42]
mongod(+0xB7529E) [0xf7529e]
libpthread.so.0(+0xF7E0) [0x3fda60f7e0]
mongod(_ZN5mongo17MigrateFromStatus16storeCurrentLocsEPNS_16OperationContextExRSsRNS_14BSONObjBuilderE+0x36E) [0xdfa98e]
mongod(_ZN5mongo16MoveChunkCommand3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1CD4) [0xdfcef4]
mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9cf824]
mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC1D) [0x9d07ad]
mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9d14bb]
mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x77A) [0xba2f4a]
mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab4830]
mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x7fbcfd]
mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf26b9b]
libpthread.so.0(+0x7AA1) [0x3fda607aa1]
libc.so.6(clone+0x6D) [0x3fda2e893d]
----- END BACKTRACE -----
Upon restart of the process, the lock goes back into a perpetually locked state again.
2016-06-02T13:59:58.908+0000 W SHARDING [conn3] could not acquire collection lock for interface.data_3600 to migrate chunk [{ : MinKey },{ : MaxKey }) :: caused by :: Lock for migrating chunk [{ : MinKey }, { : MaxKey }) in interface.data_3600 is taken
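For completeness, the corresponding lock document can be inspected in config.locks, e.g.:

db.getSiblingDB("config").locks.find({ _id: "interface.data_3600" }).pretty();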
All processes involved have been restarted in case something was in a weird state that a restart would clear. Our environment is MongoDB 3.0.7 on CentOS 6, 64-bit. At this point upgrading is not a possibility due to driver concerns, and we are trying to avoid long downtime since this is a production service. The data set is somewhat large (> 1 TB).
Any help would be appreciated. If I can provide anything else, I am more than happy to do so.