Description
The Balancer logged some very odd output in a test that failed:
m30999| Fri Nov 23 09:15:43.837 [Balancer] Refreshing MaxChunkSize: 1
m30999| Fri Nov 23 09:15:43.838 [Balancer] about to acquire distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383:
m30999| { "state" : 1,
m30999| "who" : "bs-rhel-57-32-1:30999:1353680073:1804289383:Balancer:846930886",
m30999| "process" : "bs-rhel-57-32-1:30999:1353680073:1804289383",
m30999| "when" : { "$date" : "Fri Nov 23 09:15:43 2012" },
m30999| "why" : "doing balance round",
m30999| "ts" : { "$oid" : "50af850f919c10ab73f4f08e" } }
m30999| { "_id" : "balancer",
m30999| "state" : 0,
m30999| "ts" : { "$oid" : "50af850e919c10ab73f4f08b" } }
m30999| Fri Nov 23 09:15:43.842 [Balancer] distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383' acquired, ts : 50af850f919c10ab73f4f08e
m30999| Fri Nov 23 09:15:43.842 [Balancer] *** start balancing round
m30001| Fri Nov 23 09:15:43.973 [cleanupOldData-50af84fee796a8ccca22ede0] moveChunk deleted 663 documents for test.foo from { a: 175.1680213492364 } -> { a: 181.7463950719684 }
m30001| Fri Nov 23 09:15:43.974 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk starting delete for: test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk deleted 711 documents for test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84ede796a8ccca22edd4] moveChunk starting delete for: test.foo from { a: 129.4025427196175 } -> { a: 136.3868271000683 }
m30999| Fri Nov 23 09:15:44.040 [Balancer] shard0001 has more chunks me:93 best: shard0000:34
m30999| Fri Nov 23 09:15:44.040 [Balancer] collection : test.foo
m30999| Fri Nov 23 09:15:44.040 [Balancer] donor : shard0001 chunks on 93
m30999| Fri Nov 23 09:15:44.040 [Balancer] receiver : shard0000 chunks on 34
m30999| Fri Nov 23 09:15:44.040 [Balancer] threshold : 2
m30999| Fri Nov 23 09:15:44.040 [Balancer] ns: test.foo going to move { _id: "test.foo-a_262.6239235978574", lastmod: Timestamp 33000|1, lastmodEpoch: ObjectId('50af84c9919c10ab73f4f06a'), ns: "test.foo", min: { a: 262.6239235978574 }, max: { a: 269.6408361662179 }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
m30999| Fri Nov 23 09:15:44.042 [Balancer] shard0001 has more chunks me:108 best: shard0000:108
m30999| Fri Nov 23 09:15:44.042 [Balancer] collection : test.mrShardedOut
m30999| Fri Nov 23 09:15:44.042 [Balancer] donor : shard0000 chunks on 108
m30999| Fri Nov 23 09:15:44.042 [Balancer] receiver : shard0000 chunks on 108
m30999| Fri Nov 23 09:15:44.042 [Balancer] threshold : 2
m30999| Fri Nov 23 09:15:44.042 [Balancer] moving chunk ns: test.foo moving ( ns:test.fooshard: shard0001:localhost:30001lastmod: 33|1||000000000000000000000000min: { a: 262.6239235978574 }max: { a: 269.6408361662179 }) shard0001:localhost:30001 -> shard0000:localhost:30000
The Balancer seems to be balancing two different collections at the same time (test.foo and test.mrShardedOut). For the collection test.mrShardedOut it reports that 108 chunks is more than 108 chunks and that it will therefore move a chunk from shard0000 to shard0000. I don't know whether it is the balancing logic or the logging that is flaky here; there is nothing in the log to suggest that it actually tried to move a chunk from shard0000 to shard0000.
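For what it's worth, here is only an illustrative sketch (not the actual Balancer source) of how a donor/receiver selection that uses non-strict comparisons and no donor != receiver guard would degenerate when two shards are tied at 108 chunks; the function name and structure are made up for the example:

// Illustrative only -- not MongoDB code. Picks the shard with the most
// chunks as donor and the shard with the fewest as receiver.
function pickDonorAndReceiver(chunkCounts) {
    var donor = null, receiver = null;
    for (var shard in chunkCounts) {
        // With >= / <= a tie still "wins", so on equal counts the same
        // shard can end up selected as both donor and receiver.
        if (donor === null || chunkCounts[shard] >= chunkCounts[donor])
            donor = shard;
        if (receiver === null || chunkCounts[shard] <= chunkCounts[receiver])
            receiver = shard;
    }
    return { donor: donor, receiver: receiver };
}

printjson(pickDonorAndReceiver({ shard0000: 108, shard0001: 108 }));
// -> donor and receiver come out as the same shard. Unless the caller also
//    checks donor != receiver and that the count difference meets the
//    threshold, it will log a "move" exactly like the bogus line above.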
This sequence (report a real migration followed by a bogus one) appears 24 times in the log, always with the same two collections in the same order.
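To confirm that none of these rounds actually attempted a shard0000 -> shard0000 migration, the config.changelog (which moveChunk writes to) could be queried from a mongos; this is just a suggested check and assumes the usual changelog field names (what, ns, details.from, details.to):

// Suggested check, run against a mongos of the test cluster.
var changelog = db.getSiblingDB("config").changelog;
changelog.find({
    what: /^moveChunk/,
    ns: "test.mrShardedOut",
    "details.from": "shard0000",
    "details.to": "shard0000"
}).forEach(printjson);
// Expect no documents if the bogus shard0000 -> shard0000 move was never
// actually attempted.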
The event that actually causes the test to fail happens later and seems to be unrelated:
m30001| Fri Nov 23 09:16:54.233 [conn3] mmap() failed for /data/db/mrShardedOutput1/test.5 len:536608768 errno:12 Cannot allocate memory
|
m30001| Fri Nov 23 09:16:54.233 [conn3] mmap failed with out of memory. You are using a 32-bit build and probably need to upgrade to 64
|
m30001| Fri Nov 23 09:16:54.238 [conn3] ClientCursor::YieldLock not closed properly
|
m30001| Fri Nov 23 09:16:54.238 [conn3] mr failed, removing collection :: caused by :: 10084 can't map file memory - mongo requires 64 bit build for larger datasets
|
m30001| Fri Nov 23 09:16:54.238 [conn3] CMD: drop test.tmp.mr.foo_2
|
m30001| Fri Nov 23 09:16:54.244 [conn3] CMD: drop test.tmp.mr.foo_2_inc
|
m30001| Fri Nov 23 09:16:54.245 [conn3] command test.$cmd command: { mapreduce: "foo", map: function map2() { emit(this._id, {count: 1, y: this.y}); }, reduce: function reduce2(key, values) { return values[0]; }, out: "tmp.mrs.foo_1353680195_1", shardedFirstPass: true, splitInfo: 1048576 } ntoreturn:1 keyUpdates:0 numYields: 1080 locks(micros) r:19580648 w:7480433 reslen:142 18495ms
|
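As a side note, the 32-bit limitation that the mmap() failure points at can be checked from the shell against the affected mongod (m30001); buildInfo and serverStatus are standard commands, and the fields used below (bits, mem.mapped) should be present on this build, though this is only a suggested sanity check:

// Suggested sanity check against the affected mongod (m30001).
var build = db.runCommand({ buildInfo: 1 });
print("mongod " + build.version + ", " + build.bits + "-bit build");

// On a 32-bit build the total mapped data files are limited to roughly 2 GB,
// which is what the mmap() of test.5 (len:536608768) runs into here.
var mem = db.serverStatus().mem;
printjson(mem);  // mem.bits and mem.mapped (in MB) show how close to the limit the node is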