Core Server / SERVER-7753

Nonsense Balancer logging running jstests/sharding/mrShardedOutput.js


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major - P3
    • Components: Logging, Sharding
    • Environment: RHEL 32-bit build slave
    • Sharding
    • Operating System: ALL

    Description

      The Balancer logged some very odd output in a test that failed:

      http://buildlogs.mongodb.org/Linux%20RHEL%2032-bit/builds/1085/test/recent%20failures/mrShardedOutput.js

       m30999| Fri Nov 23 09:15:43.837 [Balancer] Refreshing MaxChunkSize: 1
       m30999| Fri Nov 23 09:15:43.838 [Balancer] about to acquire distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383:
       m30999| { "state" : 1,
       m30999|   "who" : "bs-rhel-57-32-1:30999:1353680073:1804289383:Balancer:846930886",
       m30999|   "process" : "bs-rhel-57-32-1:30999:1353680073:1804289383",
       m30999|   "when" : { "$date" : "Fri Nov 23 09:15:43 2012" },
       m30999|   "why" : "doing balance round",
       m30999|   "ts" : { "$oid" : "50af850f919c10ab73f4f08e" } }
       m30999| { "_id" : "balancer",
       m30999|   "state" : 0,
       m30999|   "ts" : { "$oid" : "50af850e919c10ab73f4f08b" } }
       m30999| Fri Nov 23 09:15:43.842 [Balancer] distributed lock 'balancer/bs-rhel-57-32-1:30999:1353680073:1804289383' acquired, ts : 50af850f919c10ab73f4f08e
       m30999| Fri Nov 23 09:15:43.842 [Balancer] *** start balancing round
       m30001| Fri Nov 23 09:15:43.973 [cleanupOldData-50af84fee796a8ccca22ede0] moveChunk deleted 663 documents for test.foo from { a: 175.1680213492364 } -> { a: 181.7463950719684 }
       m30001| Fri Nov 23 09:15:43.974 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk starting delete for: test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
       m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84e2e796a8ccca22edc6] moveChunk deleted 711 documents for test.foo from { a: 67.9901554249227 } -> { a: 75.34292479977012 }
       m30001| Fri Nov 23 09:15:44.033 [cleanupOldData-50af84ede796a8ccca22edd4] moveChunk starting delete for: test.foo from { a: 129.4025427196175 } -> { a: 136.3868271000683 }
       m30999| Fri Nov 23 09:15:44.040 [Balancer] shard0001 has more chunks me:93 best: shard0000:34
       m30999| Fri Nov 23 09:15:44.040 [Balancer] collection : test.foo
       m30999| Fri Nov 23 09:15:44.040 [Balancer] donor      : shard0001 chunks on 93
       m30999| Fri Nov 23 09:15:44.040 [Balancer] receiver   : shard0000 chunks on 34
       m30999| Fri Nov 23 09:15:44.040 [Balancer] threshold  : 2
       m30999| Fri Nov 23 09:15:44.040 [Balancer]  ns: test.foo going to move { _id: "test.foo-a_262.6239235978574", lastmod: Timestamp 33000|1, lastmodEpoch: ObjectId('50af84c9919c10ab73f4f06a'), ns: "test.foo", min: { a: 262.6239235978574 }, max: { a: 269.6408361662179 }, shard: "shard0001" } from: shard0001 to: shard0000 tag []
       m30999| Fri Nov 23 09:15:44.042 [Balancer] shard0001 has more chunks me:108 best: shard0000:108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] collection : test.mrShardedOut
       m30999| Fri Nov 23 09:15:44.042 [Balancer] donor      : shard0000 chunks on 108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] receiver   : shard0000 chunks on 108
       m30999| Fri Nov 23 09:15:44.042 [Balancer] threshold  : 2
       m30999| Fri Nov 23 09:15:44.042 [Balancer] moving chunk ns: test.foo moving ( ns:test.fooshard: shard0001:localhost:30001lastmod: 33|1||000000000000000000000000min: { a: 262.6239235978574 }max: { a: 269.6408361662179 }) shard0001:localhost:30001 -> shard0000:localhost:30000
       

      The Balancer seems to be balancing two different collections at the same time (test.foo and test.mrShardedOut). For test.mrShardedOut, it reports that shard0001's 108 chunks are more than shard0000's 108, and that it will therefore move a chunk from shard0000 to shard0000. I don't know whether it is the logic or only the logging that is broken here; nothing in the log suggests that it actually tried to move a chunk from shard0000 to shard0000.

      This sequence (report a real migration followed by a bogus one) appears 24 times in the log, always with the same two collections in the same order.
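
      For illustration only, here is a minimal C++ sketch of a balancer round for one collection. The names (suggestMigration, MigrationSuggestion) are hypothetical and this is not the actual Balancer code; it only shows how strict comparisons plus logging that happens before the imbalance threshold is checked could produce this kind of report: with equal chunk counts, both the "donor" and the "receiver" resolve to the same (first) shard, and the donor/receiver lines are printed even though no migration is attempted.

       #include <iostream>
       #include <map>
       #include <string>

       // Hypothetical sketch, not the real MongoDB Balancer code.
       struct MigrationSuggestion {
           std::string from;
           std::string to;
       };

       // Returns true and fills 'out' only if the chunk imbalance between the
       // fullest and emptiest shard reaches 'threshold'.
       bool suggestMigration(const std::map<std::string, int>& chunksPerShard,
                             int threshold,
                             MigrationSuggestion* out) {
           auto donor = chunksPerShard.begin();     // shard with the most chunks
           auto receiver = chunksPerShard.begin();  // shard with the fewest chunks
           for (auto it = chunksPerShard.begin(); it != chunksPerShard.end(); ++it) {
               if (it->second > donor->second) donor = it;        // strict >
               if (it->second < receiver->second) receiver = it;  // strict <
           }

           // Logging here, before the threshold check, reproduces the symptom in
           // the ticket: donor and receiver lines are printed even when both
           // counts are 108 and both iterators point at the same shard.
           std::cout << "donor      : " << donor->first << " chunks on " << donor->second << "\n"
                     << "receiver   : " << receiver->first << " chunks on " << receiver->second << "\n";

           if (donor->second - receiver->second < threshold)
               return false;  // balanced enough, nothing to move

           out->from = donor->first;
           out->to = receiver->first;
           return true;
       }

       int main() {
           // Chunk counts for test.mrShardedOut as reported in the log above.
           std::map<std::string, int> counts = {{"shard0000", 108}, {"shard0001", 108}};
           MigrationSuggestion m;
           if (!suggestMigration(counts, /*threshold=*/2, &m)) {
               std::cout << "no migration attempted\n";
           }
           return 0;
       }

      With the counts from the log (108 and 108) and threshold 2, this prints the donor/receiver lines but never attempts a move, which would be consistent with nothing later in the log showing a real shard0000 -> shard0000 migration. Whether the actual Balancer logs in this order is only a guess from the output.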

      The event that will cause the test to fail happens later and seems to be unrelated:

       m30001| Fri Nov 23 09:16:54.233 [conn3]   mmap() failed for /data/db/mrShardedOutput1/test.5 len:536608768 errno:12 Cannot allocate memory
       m30001| Fri Nov 23 09:16:54.233 [conn3] mmap failed with out of memory. You are using a 32-bit build and probably need to upgrade to 64
       m30001| Fri Nov 23 09:16:54.238 [conn3] ClientCursor::YieldLock not closed properly
       m30001| Fri Nov 23 09:16:54.238 [conn3] mr failed, removing collection :: caused by :: 10084 can't map file memory - mongo requires 64 bit build for larger datasets
       m30001| Fri Nov 23 09:16:54.238 [conn3] CMD: drop test.tmp.mr.foo_2
       m30001| Fri Nov 23 09:16:54.244 [conn3] CMD: drop test.tmp.mr.foo_2_inc
       m30001| Fri Nov 23 09:16:54.245 [conn3] command test.$cmd command: { mapreduce: "foo", map: function map2() { emit(this._id, {count: 1, y: this.y}); }, reduce: function reduce2(key, values) { return values[0]; }, out: "tmp.mrs.foo_1353680195_1", shardedFirstPass: true, splitInfo: 1048576 } ntoreturn:1 keyUpdates:0 numYields: 1080 locks(micros) r:19580648 w:7480433 reslen:142 18495ms
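
      The failing operation is an mmap() of a ~512 MB data file on a 32-bit build, where the process has only about 2-3 GB of virtual address space; once the earlier data files are mapped, there is no contiguous region left for another mapping and the call fails with errno 12 (ENOMEM, "Cannot allocate memory"). Below is a minimal standalone sketch of such a call; the path and length are copied from the log line, and this is not the server's actual storage code.

       #include <cerrno>
       #include <cstddef>
       #include <cstdio>
       #include <cstring>
       #include <fcntl.h>
       #include <sys/mman.h>
       #include <unistd.h>

       int main() {
           // Path and length taken from the log line above.
           const char* path = "/data/db/mrShardedOutput1/test.5";
           const size_t len = 536608768;  // ~512 MB

           int fd = open(path, O_RDWR);
           if (fd < 0) {
               perror("open");
               return 1;
           }

           // On a 32-bit process, mapping another 512 MB on top of the data
           // files that are already mapped can exhaust the virtual address
           // space, so mmap() returns MAP_FAILED with errno == ENOMEM (12).
           void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
           if (p == MAP_FAILED) {
               fprintf(stderr, "mmap() failed len:%zu errno:%d %s\n",
                       len, errno, strerror(errno));
               close(fd);
               return 1;
           }

           munmap(p, len);
           close(fd);
           return 0;
       }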


    People

      Assignee: Backlog - Sharding Team (backlog-server-sharding [DO NOT USE])
      Reporter: Tad Marshall
      Votes: 0
      Watchers: 1
