Core Server / SERVER-7753

Nonsense Balancer logging running jstests/sharding/mrShardedOutput.js


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major - P3
    • Components: Logging, Sharding
    • Environment: RHEL 32-bit build slave
    • Sharding
    • Operating System: ALL

    Description

      The Balancer logged some very odd output in a test that failed:

      http://buildlogs.mongodb.org/Linux%20RHEL%2032-bit/builds/1085/test/recent%20failures/mrShardedOutput.js

       m30999| Fri Nov 23 09:15:43.837 [Balancer] Refreshing MaxChunkSize: 1
       m30999| Fri Nov 23 09:15:43.838 [Balancer] about to acquire distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383:
       m30999| { "state" : 1,
       m30999|   "who" : "bs-rhel-57-32-1:30999:1353680073:1804289383:Balancer:846930886",
       m30999|   "process" : "bs-rhel-57-32-1:30999:1353680073:1804289383",
       m30999|   "when" : { "$date" : "Fri Nov 23 09:15:43 2012" },
       m30999|   "why" : "doing balance round",
       m30999|   "ts" : { "$oid" : "50af850f919c10ab73f4f08e" } }
       m30999| { "_id" : "balancer",
       m30999|   "state" : 0,
       m30999|   "ts" : { "$oid" : "50af850e919c10ab73f4f08b" } }
       m30999| Fri Nov 23 09:15:43.842 [Balancer] distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383' acquired, ts : 50af850f919c10ab73f4f08e
       m30999| Fri Nov 23 09:15:43.842 [Balancer] *** start balancing round
       m30001| Fri Nov 23 09:15:43.973 [cleanupOldData-50af84fee796a8ccca22ede0] moveChunk deleted 663 documents for test.foo from { a: 175.1680213492364 } -> { a: 181.7463950719684 }
       m30001| Fri Nov 23 09:15:43.974 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk starting delete for: test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
       m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk deleted 711 documents for test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
       m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84ede796a8ccca22edd4] moveChunk starting delete for: test.foo from { a: 129.4025427196175 } -> { a: 136.3868271000683 }
       m30999| Fri Nov 23 09:15:44.040 [Balancer] shard0001 has more chunks me:93 best: shard0000:34
       m30999| Fri Nov 23 09:15:44.040 [Balancer] collection : test.foo
       m30999| Fri Nov 23 09:15:44.040 [Balancer] donor      : shard0001 chunks on 93
       m30999| Fri Nov 23 09:15:44.040 [Balancer] receiver   : shard0000 chunks on 34
       m30999| Fri Nov 23 09:15:44.040 [Balancer] threshold  : 2
       m30999| Fri Nov 23 09:15:44.040 [Balancer]  ns: test.foo going to move { _id: "test.foo-a_262.6239235978574", lastmod: Timestamp 33000|1, lastmodEpoch: ObjectId('50af84c9919c10ab73f4f06a'), ns: "test.foo", min: { a: 262.6239235978574 }, max: { a: 269.6408361662179 }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
       m30999| Fri Nov 23 09:15:44.042 [Balancer] shard0001 has more chunks me:108 best: shard0000:108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] collection : test.mrShardedOut
       m30999| Fri Nov 23 09:15:44.042 [Balancer] donor      : shard0000 chunks on 108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] receiver   : shard0000 chunks on 108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] threshold  : 2
       m30999| Fri Nov 23 09:15:44.042 [Balancer] moving chunk ns: test.foo moving ( ns:test.fooshard: shard0001:localhost:30001lastmod: 33|1||000000000000000000000000min: { a: 262.6239235978574 }max: { a: 269.6408361662179 }) shard0001:localhost:30001 -> shard0000:localhost:30000
       

      The Balancer seems to be balancing two different collections at the same time (test.foo and test.mrShardedOut). For test.mrShardedOut, it reports that shard0001's 108 chunks are more than shard0000's 108, and that it will therefore move a chunk from shard0000 to shard0000. I don't know whether it is the logic or only the logging that is broken here; nothing in the log suggests that it actually tried to move a chunk from shard0000 to shard0000.

      This sequence (report a real migration followed by a bogus one) appears 24 times in the log, always with the same two collections in the same order.
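
      For illustration only, here is a minimal C++ sketch of a balancer round for one collection. The names (suggestMigration, MigrationSuggestion) are hypothetical and this is not the actual Balancer code; it only shows how strict comparisons plus logging that happens before the imbalance threshold is checked could produce this kind of report: with equal chunk counts, both the "donor" and the "receiver" resolve to the same (first) shard, and the donor/receiver lines are printed even though no migration is attempted.

       #include <iostream>
       #include <map>
       #include <string>

       // Hypothetical sketch, not the real MongoDB Balancer code.
       struct MigrationSuggestion {
           std::string from;
           std::string to;
       };

       // Returns true and fills 'out' only if the chunk imbalance between the
       // fullest and emptiest shard reaches 'threshold'.
       bool suggestMigration(const std::map<std::string, int>& chunksPerShard,
                             int threshold,
                             MigrationSuggestion* out) {
           auto donor = chunksPerShard.begin();     // shard with the most chunks
           auto receiver = chunksPerShard.begin();  // shard with the fewest chunks
           for (auto it = chunksPerShard.begin(); it != chunksPerShard.end(); ++it) {
               if (it->second > donor->second) donor = it;        // strict >
               if (it->second < receiver->second) receiver = it;  // strict <
           }

           // Logging here, before the threshold check, reproduces the symptom in
           // the ticket: donor and receiver lines are printed even when both
           // counts are 108 and both iterators point at the same shard.
           std::cout << "donor      : " << donor->first << " chunks on " << donor->second << "\n"
                     << "receiver   : " << receiver->first << " chunks on " << receiver->second << "\n";

           if (donor->second - receiver->second < threshold)
               return false;  // balanced enough, nothing to move

           out->from = donor->first;
           out->to = receiver->first;
           return true;
       }

       int main() {
           // Chunk counts for test.mrShardedOut as reported in the log above.
           std::map<std::string, int> counts = {{"shard0000", 108}, {"shard0001", 108}};
           MigrationSuggestion m;
           if (!suggestMigration(counts, /*threshold=*/2, &m)) {
               std::cout << "no migration attempted\n";
           }
           return 0;
       }

      With the counts from the log (108 and 108) and threshold 2, this prints the donor/receiver lines but never attempts a move, which would be consistent with nothing later in the log showing a real shard0000 -> shard0000 migration. Whether the actual Balancer logs in this order is only a guess from the output.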

      The event that will cause the test to fail happens later and seems to be unrelated:

       m30001| Fri Nov 23 09:16:54.233 [conn3]   mmap() failed for /data/db/mrShardedOutput1/test.5 len:536608768 errno:12 Cannot allocate memory
       m30001| Fri Nov 23 09:16:54.233 [conn3] mmap failed with out of memory. You are using a 32-bit build and probably need to upgrade to 64
       m30001| Fri Nov 23 09:16:54.238 [conn3] ClientCursor::YieldLock not closed properly
       m30001| Fri Nov 23 09:16:54.238 [conn3] mr failed, removing collection :: caused by :: 10084 can't map file memory - mongo requires 64 bit build for larger datasets
       m30001| Fri Nov 23 09:16:54.238 [conn3] CMD: drop test.tmp.mr.foo_2
       m30001| Fri Nov 23 09:16:54.244 [conn3] CMD: drop test.tmp.mr.foo_2_inc
       m30001| Fri Nov 23 09:16:54.245 [conn3] command test.$cmd command: { mapreduce: "foo", map: function map2() { emit(this._id, {count: 1, y: this.y}); }, reduce: function reduce2(key, values) { return values[0]; }, out: "tmp.mrs.foo_1353680195_1", shardedFirstPass: true, splitInfo: 1048576 } ntoreturn:1 keyUpdates:0 numYields: 1080 locks(micros) r:19580648 w:7480433 reslen:142 18495ms
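
      The failing operation is an mmap() of a ~512 MB data file on a 32-bit build, where the process has only about 2-3 GB of virtual address space; once the earlier data files are mapped, there is no contiguous region left for another mapping and the call fails with errno 12 (ENOMEM, "Cannot allocate memory"). Below is a minimal standalone sketch of such a call; the path and length are copied from the log line, and this is not the server's actual storage code.

       #include <cerrno>
       #include <cstddef>
       #include <cstdio>
       #include <cstring>
       #include <fcntl.h>
       #include <sys/mman.h>
       #include <unistd.h>

       int main() {
           // Path and length taken from the log line above.
           const char* path = "/data/db/mrShardedOutput1/test.5";
           const size_t len = 536608768;  // ~512 MB

           int fd = open(path, O_RDWR);
           if (fd < 0) {
               perror("open");
               return 1;
           }

           // On a 32-bit process, mapping another 512 MB on top of the data
           // files that are already mapped can exhaust the virtual address
           // space, so mmap() returns MAP_FAILED with errno == ENOMEM (12).
           void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
           if (p == MAP_FAILED) {
               fprintf(stderr, "mmap() failed len:%zu errno:%d %s\n",
                       len, errno, strerror(errno));
               close(fd);
               return 1;
           }

           munmap(p, len);
           close(fd);
           return 0;
       }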


    People

      Assignee: Backlog - Sharding Team (backlog-server-sharding [DO NOT USE])
      Reporter: Tad Marshall
      Votes: 0
      Watchers: 1
