[SERVER-16312] endless "waiting for write concern" Created: 25/Nov/14  Updated: 08/Apr/15  Resolved: 08/Apr/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kay Agahd Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Intro

We are running a sharded cluster of 7 shards. Every shard is a replicaset of 3 replicas.

We do pre-splitting in order to distribute newly inserted documents evenly among all shards, proportionally to their amount of RAM. Our database must fit in RAM, otherwise performance drops significantly. The balancer is switched off because it assumes all servers have the same amount of RAM, which would overload the shards that have less RAM.
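
A minimal sketch of that setup, purely for illustration (the namespace, split points, and shard name below are made up; our real script derives the split points from each shard's RAM):

sh.stopBalancer();                                  // keep automatic balancing off
// Pre-split the _id range and place the chunks by hand, so that shards with
// more RAM receive a larger share of the range:
[NumberLong(1000000000), NumberLong(2000000000)].forEach(function(point) {
  sh.splitAt("mydb.offer", {_id: point});
});
sh.moveChunk("mydb.offer", {_id: NumberLong(1000000000)}, "shard0001");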

From time to time, e.g. when adding a new shard, we need to balance manually. Our script first executes sh.moveChunk and, when it returns ok, deletes the moved documents from the source shard, because we observed that sh.moveChunk does not always clean up 100% of them.

In pseudo code it looks like this:

// fromCollection is the sharded collection object; donorShard and toShard are shard names
var chunkNo = 0;
db.getSiblingDB("config").chunks.find({ns: fromCollection.getFullName(), shard: donorShard}).forEach(function(chunk) {
  print("move " + (++chunkNo) + " of id: " + chunk.min._id);
  var move = sh.moveChunk(fromCollection.getFullName(), {_id: chunk.min._id}, toShard);
  if (move.ok) {
    var del = fromCollection.remove({_id: {$gte: chunk.min._id, $lt: chunk.max._id}},
                                    {writeConcern: {w: "majority"}});
    print("deleted: " + del.nRemoved);
  }
});

Problem

We observed that mongo may wait forever when we add the writeConcern "majority" to the remove command.
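
For comparison, a remove whose write concern also carries a wtimeout would return a write-concern error once the timeout expires instead of blocking indefinitely. This is only a sketch (our script does not set a wtimeout, and the database name is assumed):

var res = db.offer.remove(
  {_id: {$gte: NumberLong("2577460009"), $lt: NumberLong("2577494734")}},
  {writeConcern: {w: "majority", wtimeout: 30000}}   // give up waiting after 30s
);
// On a timeout the shell's WriteResult reports it:
if (res.hasWriteConcernError()) printjson(res.getWriteConcernError());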

In the currentOp we see something like this:

{
  "opid" : 135910798,
  "active" : true,
  "secs_running" : 2362,
  "microsecs_running" : NumberLong("2362545153"),
  "op" : "query",
  "ns" : "",
  "query" : {
	  "delete" : "offer",
	  "deletes" : [
		  {
			  "q" : {
				  "_id" : {
					  "$gte" : NumberLong("2577460009"),
					  "$lt" : NumberLong("2577494734")
				  }
			  },
			  "limit" : 0
		  }
	  ],
	  "ordered" : true,
	  "writeConcern" : {
		  "w" : "majority"
	  }
  },
  "client" : "172.30.2.130:51426",
  "desc" : "conn4708952",
  "threadId" : "0x7f2035ca5700",
  "connectionId" : 4708952,
  "waitingForLock" : false,
  "msg" : "waiting for write concern",
  "numYields" : 0,
  "lockStats" : {
	  "timeLockedMicros" : {
		  
	  },
	  "timeAcquiringMicros" : {
		  
	  }
  }
}

While mongo was waiting, we checked the number of documents within this chunk's range on all replica set members of the donor shard. It was 0 on every member! So all documents of this chunk had already been removed from the donor shard, yet mongo was still waiting for the write concern. Why?
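
For reference, a sketch of how such a per-member check can be done from the shell (hostnames and the database name are placeholders, not our real ones):

["shard3a:27017", "shard3b:27017", "shard3c:27017"].forEach(function(host) {
  var conn = new Mongo(host);
  conn.setSlaveOk();                                 // allow reads on secondaries
  var n = conn.getDB("mydb").offer.count(
    {_id: {$gte: NumberLong("2577460009"), $lt: NumberLong("2577494734")}});
  print(host + ": " + n);                            // every member reported 0
});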

Even weirder, the output of the above pseudo script looked like this:

move 1 of id: 11605025
deleted: 5623
move 2 of id: 183729058
deleted: 7152
move 3 of id: 2577460009
deleted: 3561
move 4 of id: 2977458123

Move 4 hung forever.
As you can see, mongo was endlessly waiting on the delete with $gte:2577460009. However, that delete had already returned its result, since the script printed the number of deleted documents (3561). Only the following chunk move #4 got stuck. Why?
Also, as you can see, our script always had to delete documents that sh.moveChunk should already have deleted, shouldn't it?



 Comments   
Comment by Sam Kleinman (Inactive) [ 23/Mar/15 ]

Without a clearer reproduction case that we can use to investigate, it will be hard to find and resolve the root cause of this issue.

  • Do you have a wtimeout value set for the driver that's executing this operation?
  • Does this always happen after migrating three shards? Are there other clients or operations happening on the cluster while you're performing this maintenance?
  • The cleanupOrphaned command (added in 2.6) should remove the documents left over after chunk migrations, just as your script does (see the sketch below). We might see clearer logging in this case, and you may have a better experience, although again, without more information it's unclear whether this will actually resolve the underlying issue.
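
For reference, a sketch of the documented cleanupOrphaned loop, run while connected to the primary of the donor shard (not through mongos); the namespace is assumed and must be adjusted:

var nextKey = {};
while (nextKey != null) {
  var result = db.adminCommand({cleanupOrphaned: "mydb.offer", startingFromKey: nextKey});
  if (result.ok != 1) {
    printjson(result);                               // failure or timeout
    break;
  }
  nextKey = result.stoppedAtKey;                     // undefined once the whole range is done
}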

Hope this helps.

Regards,
sam

Comment by Kay Agahd [ 11/Mar/15 ]

I'm not able to reproduce it on a smaller dataset. It happens only in the cluster of 7 shards, which I can't share with you because its dataset is simply too huge. The repro script is the one I've posted above. Could you indicate steps to nail down the issue?

Comment by Andy Schwerin [ 10/Mar/15 ]

kay.agahd@idealo.de, do you have a repro script and dataset you could share with us?
