[SERVER-3266] mongos consistently locks up distributing parallel updates to multiple shards -- cluster unusable Created: 15/Jun/11 Updated: 12/Jul/16 Resolved: 22/Jun/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Sharding, Stability |
| Affects Version/s: | 1.8.1, 1.8.2, 1.8.3 |
| Fix Version/s: | 1.8.3 |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Rob Giardina | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
centos, two machines, four shards each, 30 simultaneous updaters |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
A data set of 120M docs in one collection is being updated to add data to each document. Every few million records processed (mixed reads and writes, many fewer writes), mongos becomes unresponsive. This number has come down gradually from 20M until mongos now locks up after only a few million. Queries against each of the shard and config mongod instances show them to be responsive to requests; queries against mongos hang indefinitely, as does db.stats(). A gdb stack trace (attached) shows many threads in mongo::ChunkManager::getShardsForQuery waiting to obtain mongo::rwlock::rwlock. |
| Comments |
| Comment by Greg Studer [ 22/Jun/11 ] |
|
No worries, reopen if you see the problem again. |
| Comment by Rob Giardina [ 22/Jun/11 ] |
|
Hi Greg,
Unfortunately, I had to tear down the sharded config to make progress and I
Thanks for the speedy fix, you've restored my faith in sharding, I'll come
Thanks, |
| Comment by Greg Studer [ 21/Jun/11 ] |
|
just pinging for an update on your status... |
| Comment by Greg Studer [ 17/Jun/11 ] |
|
Pretty much - the codepath usually works, but I'm pretty sure interleaved writes can cause issues. The fix didn't make it into 1.8.2, unfortunately, but if you grab that checked-out version it should be exactly 1.8.2 + the patch. |
| Comment by Rob Giardina [ 17/Jun/11 ] |
|
Thanks for the workaround; I've had to decommission the cluster and move to a single instance for the moment, so I can't test this now. Your fix seems pretty definitive - I didn't read the surrounding code, but the diff looks like you're no longer trying to re-acquire a (non-reentrant?) lock. I'm very optimistic. |
| Comment by auto [ 17/Jun/11 ] |
|
Author: gregstuder (gregs, greg@10gen.com)
Message: don't reacquire read lock when getting all shards |
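For context, a minimal sketch of the failure mode this commit addresses, assuming (as the commit message and stack trace suggest) a non-reentrant, writer-preferring reader-writer lock guarding the chunk map. None of the names below are MongoDB source; with such a lock, a thread that already holds the read lock and tries to take it again while a writer is queued can wait behind that writer, which in turn waits on the original read hold:

    // Illustrative only: std::shared_mutex (C++17) stands in for mongo::rwlock.
    // Recursive shared locking is not permitted by the standard; whether it hangs
    // is implementation-dependent, but writer-preferring locks deadlock here.
    #include <chrono>
    #include <iostream>
    #include <shared_mutex>
    #include <thread>

    std::shared_mutex chunkLock;  // hypothetical stand-in for the chunk-map rwlock

    void reader() {
        std::shared_lock<std::shared_mutex> outer(chunkLock);         // first read lock held
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // give the writer time to queue
        // Re-acquiring the same read lock: the request parks behind the queued
        // writer, the writer waits for 'outer' to be released, and this thread
        // never proceeds -- the mongos-style hang.
        std::shared_lock<std::shared_mutex> inner(chunkLock);
        std::cout << "reader finished (only if the lock happens to be reentrant)\n";
    }

    void writer() {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        std::unique_lock<std::shared_mutex> w(chunkLock);  // blocks behind 'outer'
        std::cout << "writer finished\n";
    }

    int main() {
        std::thread r(reader), w(writer);
        r.join();
        w.join();
        return 0;
    }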
| Comment by Greg Studer [ 17/Jun/11 ] |
|
Is your shard key included in each of the queries? One potential workaround may be to always ensure that (the first part of) your shard key is explicitly bounded in your queries by a min or max value (an actual value, not $MinKey/$MaxKey) - { key : { $gt : -1000000 } }, for example. |
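To illustrate the workaround (a sketch, not part of the ticket): using the legacy C++ driver bundled with 1.8-era releases, "key" below is a placeholder for the collection's actual shard key and "mydb.mycoll" for the namespace. Bounding the shard key with a concrete value lets mongos target a subset of chunks rather than consulting the full chunk map:

    // Assumed header path for the 1.8-era driver; adjust to your build layout.
    #include "client/dbclient.h"
    #include <iostream>
    #include <memory>
    #include <string>

    using namespace mongo;

    int main() {
        DBClientConnection conn;
        std::string errmsg;
        if (!conn.connect("localhost:27017", errmsg)) {   // connect to mongos
            std::cout << "connect failed: " << errmsg << std::endl;
            return 1;
        }

        // Explicit lower bound on the (hypothetical) shard key 'key', as suggested
        // above: a real value, not MinKey/MaxKey.
        BSONObj boundedQuery = BSON("key" << GT << -1000000);

        std::auto_ptr<DBClientCursor> cur = conn.query("mydb.mycoll", Query(boundedQuery));
        while (cur->more()) {
            std::cout << cur->next().toString() << std::endl;
        }
        return 0;
    }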
| Comment by Greg Studer [ 16/Jun/11 ] |
|
Ah yes, missed it, thanks. |
| Comment by Rob Giardina [ 16/Jun/11 ] |
|
I sent a tar file via private email to you and Eliot. Did you get it? |
| Comment by Greg Studer [ 16/Jun/11 ] |
|
Very understandable. Do you have the logs for the mongos process from before and after the slowdown and eventual hang? |
| Comment by Rob Giardina [ 16/Jun/11 ] |
|
This is production data - we're only using the 1.8.x versions; I consider the odd-numbered versions a subtle invitation to lose data. |
| Comment by Greg Studer [ 16/Jun/11 ] |
|
Are you seeing this in a dev environment or a production system? The need for locking in some of these places has been removed in v1.9. |
| Comment by Eliot Horowitz (Inactive) [ 15/Jun/11 ] |
|
Looks like chunk updates are slow. Can you attach the mongos log? |