  1. Core Server
  2. SERVER-7765

Draining a shard stalled because queued writebacks (writeBacksQueued) never drain


Details

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Critical - P2
    • Affects Version: 2.0.7
    • Operating System: Linux

    Description

      I started draining the last of four shards in a live sharded MongoDB cluster (v2.0.7), where each shard is a 3-node replica set. It went fine until 16 chunks remained. The draining has now been stuck there for more than four hours.

      mongos> db.runCommand({removeShard:"mongo-live-d"})
      {
        "msg" : "draining ongoing",
        "state" : "ongoing",
        "remaining" : { "chunks" : NumberLong(16), "dbs" : NumberLong(0) },
        "ok" : 1
      }
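
      For anyone re-checking the drain over time, a minimal sketch of how successive removeShard results like the one above could be compared. The helper names (toNum, drainStatus, isStalled) are hypothetical and not part of MongoDB; the sketch assumes only the result fields shown in this report, and that "completed" is the state reported once draining finishes.

```javascript
// Hypothetical helper: accept plain numbers as well as shell NumberLong values.
function toNum(v) {
  if (typeof v === "number") return v;
  if (v && typeof v.toNumber === "function") return v.toNumber();
  return Number(v);
}

// Reduce a removeShard result document to the fields that matter for draining.
function drainStatus(res) {
  var remaining = res.remaining || {};
  return {
    state: res.state,
    chunks: toNum(remaining.chunks || 0),
    dbs: toNum(remaining.dbs || 0),
    done: res.state === "completed"
  };
}

// Hypothetical check: two successive polls with no drop in remaining chunks.
function isStalled(prev, curr) {
  return !curr.done && curr.chunks >= prev.chunks;
}
```

      In the shell, drainStatus(db.runCommand({removeShard:"mongo-live-d"})) could be saved from one poll and passed to isStalled together with the next one; how long to wait between polls is left open here.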

      The mongos log shows this:

      Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' acquired, ts : 50adc1d2538fcedc6aa3cf93
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
      Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' unlocked.

      When I check writeBacksQueued, the total number of queued ops never goes down; it keeps increasing over time:

      PRIMARY> db.adminCommand("writeBacksQueued")
      {
        "hasOpsQueued" : true,
        "totalOpsQueued" : 603910,
        "queues" : {
          "50787cba376f032868ac165e" : { "n" : 0, "minutesSinceLastCall" : 2 },
          "50787cba4a4a812e093429a5" : { "n" : 341466, "minutesSinceLastCall" : 0 },
          "50787cba5df1e05fedab56ff" : { "n" : 1, "minutesSinceLastCall" : 40 },
          "50787cbadc8a4a2ee5bab98f" : { "n" : 262443, "minutesSinceLastCall" : 0 },
          "50787cbafb83be34cb49a885" : { "n" : 0, "minutesSinceLastCall" : 1 }
        },
        "ok" : 1
      }
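
      To make the output above easier to compare between checks, a minimal sketch that adds up the per-queue "n" values and lists the non-empty queues, largest first. The function name summarizeWriteBacks is hypothetical; the sketch assumes only the document shape shown in this report.

```javascript
// Hypothetical helper: total the queued ops and list the queues that hold any.
function summarizeWriteBacks(res) {
  var queues = res.queues || {};
  var total = 0;
  var busy = [];
  for (var id in queues) {
    if (!Object.prototype.hasOwnProperty.call(queues, id)) continue;
    var q = queues[id];
    total += q.n;
    if (q.n > 0) {
      busy.push({ id: id, n: q.n, minutesSinceLastCall: q.minutesSinceLastCall });
    }
  }
  // Largest queue first, so the main contributors are at the top.
  busy.sort(function (a, b) { return b.n - a.n; });
  return { total: total, busy: busy };
}
```

      Applied to the output above, the per-queue values sum to 603910, which matches "totalOpsQueued", and two of the five queues account for all but one of the queued ops.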

      The "totalOpsQueued" value and the per-queue "n" values keep going up. I don't see anything interesting in the troublesome shard's mongod log. I'd try restarting everything, but I'm worried that the queued data would be lost.

          People

            barrie Barrie Segal
            papercrane Justin Patrin