  1. Core Server
  2. SERVER-7765

Draining a shard stalled because writebacksQueued is stalled

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical - P2
    • None
    • Affects Version/s: 2.0.7
    • Component/s: None
    • Labels: None
    • Linux

      I started draining the last of four shards in a live sharded MongoDB cluster (v2.0.7), where each shard is a 3-node replica set. It went fine until 16 chunks remained; draining has now been stuck there for more than four hours.

      mongos> db.runCommand({removeShard: "mongo-live-d"})
      {
        "msg" : "draining ongoing",
        "state" : "ongoing",
        "remaining" : { "chunks" : NumberLong(16), "dbs" : NumberLong(0) },
        "ok" : 1
      }
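
      For anyone polling removeShard repeatedly, the reply above can be reduced to a one-line status. The following is a minimal sketch in plain JavaScript; `describeDrain` is a hypothetical helper name, not a MongoDB function, and it assumes the reply has the shape shown in this report, with `state` becoming "completed" once draining finishes.

```javascript
// Hypothetical helper (not part of MongoDB): interpret a removeShard reply.
function describeDrain(reply) {
  // NumberLong values in the mongo shell expose toNumber(); plain numbers pass through.
  function asNumber(v) {
    return (v && typeof v.toNumber === "function") ? v.toNumber() : Number(v || 0);
  }
  if (!reply || reply.ok !== 1) {
    return { done: false, chunks: null, dbs: null, text: "removeShard did not return ok: 1" };
  }
  var remaining = reply.remaining || {};
  var chunks = asNumber(remaining.chunks);
  var dbs = asNumber(remaining.dbs);
  // Assumption: a finished drain is reported with state "completed".
  var done = reply.state === "completed";
  var text = done
    ? "draining completed"
    : "draining " + reply.state + ": " + chunks + " chunk(s), " + dbs + " database(s) remaining";
  return { done: done, chunks: chunks, dbs: dbs, text: text };
}
```

      In a mongos shell this could be called as `describeDrain(db.adminCommand({removeShard: "mongo-live-d"})).text`; if `chunks` stays at the same value across many calls, the drain is stalled rather than slow.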

      The mongos log shows this:

      Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' acquired, ts : 50adc1d2538fcedc6aa3cf93
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' unlocked.

      When I check writeBacksQueued on the shard's primary, the total number of queued ops never goes down; it keeps increasing over time:

      PRIMARY> db.adminCommand("writeBacksQueued")
      {
        "hasOpsQueued" : true,
        "totalOpsQueued" : 603910,
        "queues" : {
          "50787cba376f032868ac165e" : { "n" : 0, "minutesSinceLastCall" : 2 },
          "50787cba4a4a812e093429a5" : { "n" : 341466, "minutesSinceLastCall" : 0 },
          "50787cba5df1e05fedab56ff" : { "n" : 1, "minutesSinceLastCall" : 40 },
          "50787cbadc8a4a2ee5bab98f" : { "n" : 262443, "minutesSinceLastCall" : 0 },
          "50787cbafb83be34cb49a885" : { "n" : 0, "minutesSinceLastCall" : 1 }
        },
        "ok" : 1
      }

      The "totalOpsQueued" and various "n" values keep going up. I don't see anything interesting in the troublesome shard's mongod log. I'd try restarting everything, but I'm worried that the queued data would be lost.
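
      To put numbers on "keeps going up", two writeBacksQueued samples taken some time apart can be compared. This is a rough sketch in plain JavaScript, assuming the output shape shown above; `summarizeWriteBacks` and `writeBackGrowth` are hypothetical helper names, and the array helpers used here may be missing from older shells.

```javascript
// Convert shell numeric wrappers (e.g. NumberLong) or plain numbers to a JS number.
function toNum(v) {
  return (v && typeof v.toNumber === "function") ? v.toNumber() : Number(v);
}

// Hypothetical helper: sum the per-queue "n" values and list queues, largest first.
function summarizeWriteBacks(doc) {
  var queues = doc.queues || {};
  var rows = Object.keys(queues).map(function (id) {
    return {
      id: id,
      n: toNum(queues[id].n),
      idleMinutes: toNum(queues[id].minutesSinceLastCall)
    };
  });
  rows.sort(function (a, b) { return b.n - a.n; });
  var sum = rows.reduce(function (acc, r) { return acc + r.n; }, 0);
  return { total: toNum(doc.totalOpsQueued), sumOfQueues: sum, rows: rows };
}

// Hypothetical helper: compare two samples; a positive delta means the queue grew.
function writeBackGrowth(prev, curr) {
  var perQueue = {};
  Object.keys(curr.queues || {}).forEach(function (id) {
    var before = (prev.queues && prev.queues[id]) ? toNum(prev.queues[id].n) : 0;
    perQueue[id] = toNum(curr.queues[id].n) - before;
  });
  return {
    totalDelta: toNum(curr.totalOpsQueued) - toNum(prev.totalOpsQueued),
    perQueue: perQueue
  };
}
```

      Applied to the output in this report, the per-queue "n" values add up to 603910, the same as "totalOpsQueued", and nearly all of it sits in two queues (341466 and 262443), both with "minutesSinceLastCall" of 0.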

            Assignee: Barrie Segal (barrie)
            Reporter: Justin Patrin (papercrane)
            Votes: 0
            Watchers: 6