[SERVER-14982] replSetMaintenance command should not block Created: 21/Aug/14  Updated: 29/Nov/14  Resolved: 17/Oct/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.9, 2.6.4
Fix Version/s: 2.7.8

Type: Bug Priority: Major - P3
Reporter: Alexander Komyagin Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File maintenance_non-blocking.js    
Issue Links:
Related
is related to SERVER-14983 Ability to immediately mark the node ... Open
Tested
Backwards Compatibility: Major Change
Operating System: ALL
Steps To Reproduce:
  1. create a replica set (I used Azure Windows without drive cache to make sure disk IO is super slow)
  2. insert 10000000 records into a test collection with random values for some field
  3. build a foreground index on that field (I used hashed index in my experiment)
  4. once the index build has started on the secondaries (mongostat against the secondary I tested was frozen at this point), try enabling maintenance mode on that secondary
  5. observe that the command blocks and that heartbeats are queued in db.currentOp()
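
The steps above can be sketched roughly as follows for the 2.4/2.6-era mongo shell. This is only an illustration, not the attached script: the function names, the collection name "test_coll" and the field name "x" are hypothetical, and primaryDb / secondaryDb are assumed to be database handles opened against the primary and one secondary.

```javascript
// Sketch of reproduction steps 2-4. Everything named here (functions,
// "test_coll", field "x") is a placeholder, not taken from the ticket.

function loadAndIndex(primaryDb, totalDocs, batchSize) {
    var coll = primaryDb.getCollection("test_coll");
    var inserted = 0;
    while (inserted < totalDocs) {
        var batch = [];
        for (var i = 0; i < batchSize && inserted + i < totalDocs; i++) {
            batch.push({ x: Math.random() });
        }
        coll.insert(batch);            // step 2: random values for field x
        inserted += batch.length;
    }
    // Step 3: hashed index; no background option is passed, so the build
    // runs in the foreground.
    coll.ensureIndex({ x: "hashed" });
    return inserted;
}

function enterMaintenance(secondaryDb) {
    // Step 4: issue this while the secondary is replicating the index build.
    return secondaryDb.adminCommand({ replSetMaintenance: true });
}
```

Calling loadAndIndex(primaryDb, 10000000, 1000) and then enterMaintenance(secondaryDb) from a second shell should correspond to steps 2 through 4; on an affected version the second call is expected to hang until the index build finishes.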
Participants:

 Description   

The replSetMaintenance command appears to take some internal locks, so it can be blocked by things like foreground index builds (tested on 2.6.4 and 2.4.9).

Further, while this command is queued, replSetHeartbeat commands queue up behind it, resulting in missed heartbeats.
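
One way to spot the queued heartbeats is to filter the output of db.currentOp(). The helper below is a hypothetical sketch: it assumes each in-progress entry exposes the command document under a "query" field, which should be checked against the actual output of the server version in use.

```javascript
// Returns the in-progress operations that look like replSetHeartbeat
// commands. currentOpResult is the document returned by db.currentOp().
// The "query.replSetHeartbeat" layout is an assumption, not confirmed here.
function pendingHeartbeats(currentOpResult) {
    var ops = (currentOpResult && currentOpResult.inprog) || [];
    return ops.filter(function (op) {
        return op.query !== undefined && op.query !== null &&
               op.query.replSetHeartbeat !== undefined;
    });
}
```

In the shell, pendingHeartbeats(db.currentOp()).length on the affected secondary gives a quick count; a number that keeps growing while replSetMaintenance is pending would match the behaviour described above.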

Ideally, this command should be lock-less, allowing the operator to effectively hide the node from application servers or MongoS routers in critical circumstances, e.g. the node being overloaded.



 Comments   
Comment by Githook User [ 17/Oct/14 ]

Author: Scott Hernandez (scotthernandez) <scotthernandez@gmail.com>

Message: SERVER-14982: don't take lock to schedule maint mode
Branch: master
https://github.com/mongodb/mongo/commit/aaf740d1748b5ed1ce890d88c489bd6f9399aeac

Comment by Scott Hernandez (Inactive) [ 25/Sep/14 ]

Put up a simple test using fsync + lock and maintenance mode on the secondary. Once we switch over to the new code (heartbeating, maintenance mode, etc.), this may simply start to pass, since we will no longer depend on a db lock. We should know in a few weeks.
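
For illustration only, a check along those lines might look like the sketch below. It is not the test referred to above: the function name and the returned shape are hypothetical, and secondaryDb is assumed to be a shell database handle connected to a secondary.

```javascript
// Hypothetical check: hold fsync+lock on a secondary, request maintenance
// mode, and report how long the command took to return.
function maintenanceUnderFsyncLock(secondaryDb) {
    secondaryDb.fsyncLock();
    try {
        var start = Date.now();
        var res = secondaryDb.adminCommand({ replSetMaintenance: true });
        return { ok: res.ok, elapsedMs: Date.now() - start };
    } finally {
        // Always release the lock, even if the command throws.
        secondaryDb.fsyncUnlock();
    }
}
```

On a version where the command still waits on the lock, the adminCommand call would presumably never return on this connection, so the unlock would have to be issued from a second shell. After a successful run, maintenance mode can be cleared with secondaryDb.adminCommand({ replSetMaintenance: false }).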
