[SERVER-16333] hot replication mutex Created: 26/Nov/14 Updated: 10/Dec/14 Resolved: 04/Dec/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.8.0-rc1 |
| Fix Version/s: | 2.8.0-rc2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Asya Kamsky | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
I'm running an insert workload on a very fast machine. It looks like it should sustain very high throughput, but I'm seeing pauses that don't appear to be related to logging or checkpointing. I collected some stack traces, and many of the threads appear to be waiting for replicationCoordinator. This is a standalone mongod, which makes it particularly puzzling. |
| Comments |
| Comment by Andy Schwerin [ 04/Dec/14 ] |
|
Fixed in 2.8.0-rc2 by Eliot's commit, above. |
| Comment by Githook User [ 02/Dec/14 ] |
|
Author: Eliot Horowitz (username: erh, email: eliot@10gen.com) Message: |
| Comment by Andy Schwerin [ 26/Nov/14 ] |
|
One option might be to require replica set config changes to hold the global lock in MODE_S, so that the replica set configuration could be observed by anyone holding the global lock in MODE_IX. We'd have to change the getlasterror command to acquire the global lock in MODE_IX, but otherwise there would be no additional locking on potentially hot code paths. It would put an additional constraint on reconfig, which might add complexity to heartbeat reconfig. |
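To illustrate why the MODE_S / MODE_IX pairing in the comment above would work, here is a minimal sketch of the textbook intent-lock compatibility matrix. The enum values, `kCompatible`, `isCompatible`, and `checkProposal` are hypothetical names chosen for this sketch and are not taken from the server's lock manager; the point is only that MODE_IX holders can coexist with each other but conflict with a MODE_S holder.

```cpp
#include <cassert>

// Illustrative lock modes; the numbering is an assumption made for this sketch.
enum LockMode { MODE_IS = 0, MODE_IX = 1, MODE_S = 2, MODE_X = 3 };

// Standard multi-granularity compatibility matrix: kCompatible[held][requested].
constexpr bool kCompatible[4][4] = {
    /* held IS */ {true,  true,  true,  false},
    /* held IX */ {true,  true,  false, false},
    /* held S  */ {true,  false, true,  false},
    /* held X  */ {false, false, false, false},
};

// Returns true when 'requested' can be granted while 'held' is already granted.
constexpr bool isCompatible(LockMode held, LockMode requested) {
    return kCompatible[held][requested];
}

// Checks the two properties the proposal relies on.
inline void checkProposal() {
    // Many writers can hold MODE_IX at once, so the hot path adds no contention.
    assert(isCompatible(MODE_IX, MODE_IX));
    // A reconfig holding MODE_S excludes MODE_IX holders (and vice versa),
    // so the config cannot change while an MODE_IX holder is reading it.
    assert(!isCompatible(MODE_S, MODE_IX));
    assert(!isCompatible(MODE_IX, MODE_S));
}
```

Under this matrix, a reconfig that takes MODE_S has to wait for all in-flight MODE_IX holders to finish, which is the extra constraint on reconfig mentioned above.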
| Comment by Andy Schwerin [ 26/Nov/14 ] |
|
Thanks for the report. The WriteBatchExecutor and the implementation of the getlasterror command both unconditionally read the "default" get last error value, which is a property of a replica set configuration. Caching that value someplace where the mutex isn't required, preferably in an already-parsed form, should resolve this problem. |
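The caching idea in the comment above can be sketched as follows. This is not the fix that was committed; it is a minimal illustration that assumes the parsed default reduces to two 32-bit numbers. `DefaultWriteConcern`, `ConfigCache`, `reconfig`, `getDefault`, and `exampleUsage` are hypothetical names. Rare configuration changes still take the mutex, while the hot read path performs a single atomic load.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical already-parsed form of the "default" getLastError settings.
struct DefaultWriteConcern {
    std::uint32_t w;           // number of nodes to wait for
    std::uint32_t wtimeoutMs;  // timeout in milliseconds
};

class ConfigCache {
public:
    // Rare path: configuration changes are still serialized by the mutex.
    void reconfig(DefaultWriteConcern wc) {
        std::lock_guard<std::mutex> lk(_configMutex);
        // (The rest of the configuration would be updated here.)
        _packed.store(pack(wc), std::memory_order_release);
    }

    // Hot path: one atomic load, no mutex acquisition.
    DefaultWriteConcern getDefault() const {
        return unpack(_packed.load(std::memory_order_acquire));
    }

private:
    // Both fields share one 64-bit word so readers always see a consistent pair.
    static std::uint64_t pack(DefaultWriteConcern wc) {
        return (static_cast<std::uint64_t>(wc.w) << 32) | wc.wtimeoutMs;
    }
    static DefaultWriteConcern unpack(std::uint64_t v) {
        return DefaultWriteConcern{static_cast<std::uint32_t>(v >> 32),
                                   static_cast<std::uint32_t>(v & 0xffffffffu)};
    }

    std::mutex _configMutex;
    std::atomic<std::uint64_t> _packed{pack(DefaultWriteConcern{1, 0})};
};

// Usage: writers call getDefault() on every operation without touching the mutex.
inline void exampleUsage() {
    ConfigCache cache;
    assert(cache.getDefault().w == 1);
    cache.reconfig(DefaultWriteConcern{2, 5000});
    assert(cache.getDefault().w == 2);
    assert(cache.getDefault().wtimeoutMs == 5000);
}
```

Packing both fields into one word keeps the pair consistent for readers; a richer parsed form (for example, one carrying a string mode) would need a different publication scheme, such as swapping a pointer to an immutable snapshot.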