[SERVER-9944] assumePrimary() waits 10 seconds (for nothing?) Created: 17/Jun/13 Updated: 20/Jul/13 Resolved: 17/Jun/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.4, 2.5.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Henrik Ingo (Inactive) | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 12.04 on a Dell XPS13 laptop |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Participants: | |||||||||||||
| Description |
|
When a primary node has died and a secondary has won an election, it proceeds to assume primary role. However, assumePrimary() will wait for producerThread() which is sleeping 10 seconds. The reason it is sleeping is the same reason we are assuming primary, the old primary node has disconnected. I've attached a verbose syslog where all 3 replicaset members run on the same host, but the summary is: port 27001 is the primary:
mongod on ports 27002 and 27003 immediately notice the connection was closed:
This causes the following to happen in producerThread():
Soon after, well at most 2 seconds later, 27002 has won the election and proceeds to assume primary role
In the code, we end up here, and step 2 is blocked for next 10 seconds:
And 10 seconds later, without much fanfare, 27002 actually is PRIMARY:
It seems to me this is an unintended wait that arises from 2 threads depending on each other. The producer thread in isolation thinks "oh no, couldn't read from primary, I'll try again (10 sec) later". At the same time the assumePrimary thread correctly asks the producer to finish what it might be doing, but unfortunately all it is doing is sleeping. Note that while there might be an argument to insert delay before electing a new primary, that is not what is happening here: the replica set members have already called for election and agreed on the winner, it's just the state change itself that is blocked for <10 seconds. |
| Comments |
| Comment by Henrik Ingo (Inactive) [ 17/Jun/13 ] |
|
Verbose syslog |