[SERVER-3266] mongos consistently locks up distributing parallel updates to multiple shards -- cluster unusable Created: 15/Jun/11  Updated: 12/Jul/16  Resolved: 22/Jun/11

Status: Closed
Project: Core Server
Component/s: Concurrency, Sharding, Stability
Affects Version/s: 1.8.1, 1.8.2, 1.8.3
Fix Version/s: 1.8.3

Type: Bug Priority: Blocker - P1
Reporter: Rob Giardina Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

centos, two machines, four shards each, 30 simultaneous updaters


Attachments: Text File mongos-thread-dump.txt    
Operating System: Linux
Participants:

 Description   

a data set of 120m docs in one collection is being updated to add data to each document.

every few million records processed (mixed reads and writes, many fewer writes), mongos becomes unresponsive. this threshold has gradually come down from 20m until it now locks up after only a few million.

queries against each of the shard and config mongod instances show them to be responsive to requests.

queries against mongos hang indefinitely, as does db.stats().

gdb stack trace (attached) shows many threads in mongo::ChunkManager::getShardsForQuery waiting to obtain mongo::rwlock::rwlock.
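
For reference, the responsiveness checks described above can be reproduced from the mongo shell along these lines (hostnames, ports, database, and collection names are placeholders, not taken from this ticket):

// direct connections to a shard mongod and a config mongod respond normally
var shard = new Mongo("shard-host-1:27018").getDB("mydb")
shard.things.findOne()                      // returns promptly
var config = new Mongo("config-host-1:27019").getDB("config")
config.chunks.findOne()                     // returns promptly

// the same operations routed through mongos never return
var routed = new Mongo("mongos-host:27017").getDB("mydb")
routed.things.findOne()                     // hangs indefinitely
routed.stats()                              // db.stats() also hangs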



 Comments   
Comment by Greg Studer [ 22/Jun/11 ]

No worries, reopen if you see the problem again.

Comment by Rob Giardina [ 22/Jun/11 ]

Hi Greg,

Unfortunately, I had to tear down the sharded config to make progress and I
am now depending on the system in question. For a real live test, I need to
simulate it on other systems which will take a while to build.

Thanks for the speedy fix, you've restored my faith in sharding, I'll come
back to it soon.

Thanks,
Rob

Comment by Greg Studer [ 21/Jun/11 ]

just pinging for an update on your status...

Comment by Greg Studer [ 17/Jun/11 ]

pretty much - the codepath usually works, but I'm pretty sure interleaved writes can cause issues. The fix didn't make it into 1.8.2 unfortunately, but if you grab that branch checkout it should be exactly 1.8.2 plus the patch.
I'll close if/when we confirm things on your end.

Comment by Rob Giardina [ 17/Jun/11 ]

thx for the workaround, i've had to decommission the cluster and move to a single instance for the moment so i can't test this now. your fix seems pretty definitive – I didn't read the surrounding code, but the diff looks like you're no longer trying to reacquire a (non-reentrant?) lock. I'm very optimistic.

Comment by auto [ 17/Jun/11 ]

Author:

{u'login': u'gregstuder', u'name': u'gregs', u'email': u'greg@10gen.com'}

Message: don't reacquire read lock when getting all shards SERVER-3266
Branch: v1.8
https://github.com/mongodb/mongo/commit/1f9df58e76cd47a19475c7532b114e4ec55af9b5

Comment by Greg Studer [ 17/Jun/11 ]

Is your shard key included in each of the queries? One potential workaround may be to always ensure that the (first part of) your shard key is explicitly bounded in your queries by a min or max value (an actual value, not $MinKey/$MaxKey), for example:

key : { $gt : -1000000 }
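
A hypothetical illustration of that suggestion (the database, collection, field, and shard key names below are placeholders, not taken from this ticket):

// assume the collection is sharded on { userId: 1 }
use mydb

// unbounded query: mongos must consider every chunk for the multi-update
db.things.update({ status: "pending" }, { $set: { enriched: true } }, false, true)

// workaround: explicitly bound the first shard key field with a real value (not $MinKey/$MaxKey)
db.things.update(
  { userId: { $gt: -1000000 }, status: "pending" },
  { $set: { enriched: true } },
  false,   // upsert
  true     // multi
)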

Comment by Greg Studer [ 16/Jun/11 ]

Ah yes, missed it, thanks.

Comment by Rob Giardina [ 16/Jun/11 ]

I sent a tar file via private email to you and Eliot. Did you get it?

Comment by Greg Studer [ 16/Jun/11 ]

Very understandable. Trying a 1.9 mongos with v1.8 mongods may be less risky since the mongos only routes your requests and doesn't store data - though again, this is not a configuration we test.

Do you have the logs for the mongos process from before and after the slowdown and eventual hang?

Comment by Rob Giardina [ 16/Jun/11 ]

this is production data – i'm only using the 1.8.x versions; i consider the odd-numbered versions a subtle invitation to lose data

Comment by Greg Studer [ 16/Jun/11 ]

are you seeing this in a dev environment or production system? The need for locking in some of these places has been removed in v1.9.

Comment by Eliot Horowitz (Inactive) [ 15/Jun/11 ]

Looks like chunk updates are slow.

Can you attach mongos log?
