[SERVER-7765] Draining a shard stalled due to writebacksQueued Created: 26/Nov/12  Updated: 08/Mar/13  Resolved: 19/Feb/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Justin Patrin Assignee: Barrie Segal
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

I started draining the last of four shards in a live sharded MongoDB cluster (v2.0.7); each shard is a 3-node replica set. Draining went fine until it got down to 16 chunks remaining, and it has now been stuck there for more than four hours.

mongos> db.runCommand({ removeShard : "mongo-live-d" })
{
        "msg" : "draining ongoing",
        "state" : "ongoing",
        "remaining" : {
                "chunks" : NumberLong(16),
                "dbs" : NumberLong(0)
        },
        "ok" : 1
}

The mongos log shows this:

Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' acquired, ts : 50adc1d2538fcedc6aa3cf93
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] biggest shard mongo-live-b has unprocessed writebacks, waiting for completion of migrate
Wed Nov 21 22:10:26 [Balancer] distributed lock 'balancer/mongo-live-a-1:27017:1350073653:1804289383' unlocked.

When I check writeBacksQueued, the total ops count never goes down; it keeps increasing over time:

PRIMARY> db.adminCommand("writeBacksQueued")
{
        "hasOpsQueued" : true,
        "totalOpsQueued" : 603910,
        "queues" : {
                "50787cba376f032868ac165e" : { "n" : 0, "minutesSinceLastCall" : 2 },
                "50787cba4a4a812e093429a5" : { "n" : 341466, "minutesSinceLastCall" : 0 },
                "50787cba5df1e05fedab56ff" : { "n" : 1, "minutesSinceLastCall" : 40 },
                "50787cbadc8a4a2ee5bab98f" : { "n" : 262443, "minutesSinceLastCall" : 0 },
                "50787cbafb83be34cb49a885" : { "n" : 0, "minutesSinceLastCall" : 1 }
        },
        "ok" : 1
}

The "totalOpsQueued" and various "n" values keep going up. I don't see anything interesting in the troublesome shard's mongod log. I'd try restarting everything but I'm worried that this queued data would be lost.



 Comments   
Comment by Barrie Segal [ 08/Feb/13 ]

Justin,

Just checking in: what was the outcome of draining the shard?

Barrie

Comment by Justin Patrin [ 30/Nov/12 ]

Well, we tried restarting the mongod processes on the secondaries of the misbehaving shard's replica set. No change. We stepped down the primary. No change. Shutting down the former primary once it had become a secondary, however, took quite a while (~5 minutes). Once it was down we started it back up, and the cluster is now draining the shard as it originally was. I hope we haven't lost any data....
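
For the record, a rough mongo-shell sketch of the recovery sequence described above (an illustration, not the exact commands that were run):

// On the primary of the misbehaving shard: step down so a secondary takes over.
rs.stepDown();
// On the former primary, once it has become a secondary: shut it down cleanly,
// then restart the mongod process; after it rejoined the set, draining resumed.
db.getSiblingDB("admin").shutdownServer();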

Comment by Justin Patrin [ 27/Nov/12 ]

All of the mongos logs have the same Balancer entries as listed above, nothing different.

Looking at the mongod logs, I'm seeing a fair number of entries like this:

Tue Nov 27 10:44:43 [conn3062] command admin.$cmd command: { writebacklisten: ObjectId('50787cbae13db536af5c54a0') } ntoreturn:1 reslen:44 300008ms

Comment by Eliot Horowitz (Inactive) [ 27/Nov/12 ]

Can you check the mongos logs?
One of them should be processing a lot of writebacks or have a lot of errors.
