[SERVER-16333] hot replication mutex Created: 26/Nov/14  Updated: 10/Dec/14  Resolved: 04/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.8.0-rc1
Fix Version/s: 2.8.0-rc2

Type: Bug
Priority: Major - P3
Reporter: Asya Kamsky
Assignee: Andy Schwerin
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-16500 Reduce contention on ReplicationCoord... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

While running an insert workload on a very fast machine, one that seems like it could sustain very high throughput, I saw pauses that do not appear to be related to logging or checkpointing. I took some stack traces, and a lot of the threads seem to be waiting for replicationCoordinator.

This happens to be a standalone mongod, which makes it particularly puzzling.



 Comments   
Comment by Andy Schwerin [ 04/Dec/14 ]

Fixed in 2.8.0-rc2 by Eliot's commit, above.

Comment by Githook User [ 02/Dec/14 ]

Author: Eliot Horowitz (username: erh, email: eliot@10gen.com)

Message: SERVER-16333: don't serialize and deserialize default write concerns
Branch: master
https://github.com/mongodb/mongo/commit/e7baa714a95e0cb43ad54f4497eec512e774fbec

Comment by Andy Schwerin [ 26/Nov/14 ]

One option might be to require replica set config changes to hold the global lock in MODE_S, so that the replica set configuration could be observed by anyone holding the global lock in MODE_IX. We'd have to change the getlasterror command to acquire the global lock in MODE_IX, but otherwise there would be no additional locking on potentially hot code paths. It would put an additional constraint on reconfig, which might add complexity to heartbeat reconfig.
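The reasoning in this option rests on the standard multi-granularity lock compatibility rules: a shared lock (MODE_S) conflicts with an intent-exclusive lock (MODE_IX), so while any thread holds MODE_IX no reconfig holding MODE_S can be in progress. A minimal sketch of that matrix follows. The mode names are taken from the comment; the table and the `isCompatible` helper are illustrative assumptions, not the server's actual lock manager code.

```cpp
// Illustrative lock modes, named after the ones mentioned in the comment.
enum LockMode { MODE_IS = 0, MODE_IX = 1, MODE_S = 2, MODE_X = 3 };

// Textbook multi-granularity compatibility matrix (rows: held, cols: requested).
constexpr bool kCompatible[4][4] = {
    //            IS     IX     S      X
    /* IS */ {true,  true,  true,  false},
    /* IX */ {true,  true,  false, false},
    /* S  */ {true,  false, true,  false},
    /* X  */ {false, false, false, false},
};

// Hypothetical helper: can `requested` be granted while `held` is held?
constexpr bool isCompatible(LockMode held, LockMode requested) {
    return kCompatible[held][requested];
}

// A reconfig in MODE_S excludes MODE_IX holders, and vice versa, so a
// MODE_IX holder could read the config without taking another mutex.
static_assert(!isCompatible(MODE_S, MODE_IX), "S blocks IX");
static_assert(!isCompatible(MODE_IX, MODE_S), "IX blocks S");
// Writers in MODE_IX do not block each other, so the hot path stays concurrent.
static_assert(isCompatible(MODE_IX, MODE_IX), "IX is compatible with IX");
```

Under these rules, ordinary operations pay no extra cost, while a reconfig must wait for in-flight MODE_IX holders to drain, which is the added constraint on reconfig that the comment mentions.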

Comment by Andy Schwerin [ 26/Nov/14 ]

Thanks for the report. The WriteBatchExecutor and the implementation of the getlasterror command both unconditionally read the "default" get last error value, which is a property of a replica set configuration. Caching that value somewhere that doesn't require the mutex, preferably in an already-parsed form, should resolve this problem.
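The caching idea can be sketched roughly as follows: config changes still serialize on the mutex, but they also publish an already-parsed copy of the default that hot paths read with a single atomic load. Everything here is hypothetical and simplified for illustration (`ParsedWriteConcern`, `ConfigHolder`, and the two-integer layout are assumptions; a real write concern also carries a mode string), and it is not the code from the actual fix.

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

// Hypothetical, simplified "already-parsed" default write concern.
// Kept to 8 trivially copyable bytes so std::atomic can typically be lock-free.
struct ParsedWriteConcern {
    std::int32_t w;           // number of nodes that must acknowledge
    std::int32_t wtimeoutMs;  // 0 means no timeout
};

// Hypothetical stand-in for the object that owns the replica set config.
class ConfigHolder {
public:
    // Slow path: config changes are rare and still take the mutex.
    void installConfig(ParsedWriteConcern defaultWriteConcern) {
        std::lock_guard<std::mutex> lk(_mutex);
        ++_configVersion;
        // Publish the parsed value for lock-free readers.
        _cachedDefault.store(defaultWriteConcern, std::memory_order_release);
    }

    // Hot path: no mutex and no re-parsing, just an atomic load.
    ParsedWriteConcern getDefaultWriteConcern() const {
        return _cachedDefault.load(std::memory_order_acquire);
    }

    // Other config state remains protected by the mutex.
    int configVersion() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _configVersion;
    }

private:
    mutable std::mutex _mutex;
    int _configVersion = 0;
    std::atomic<ParsedWriteConcern> _cachedDefault{ParsedWriteConcern{1, 0}};
};
```

With this shape, a write path would call `getDefaultWriteConcern()` on every operation without touching `_mutex`, and only `installConfig` contends on it. A value too large for a single atomic would need a different publication scheme, such as swapping a pointer to an immutable snapshot.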
