[SERVER-18190] Secondary reads may block replication Created: 23/Apr/15 Updated: 19/Sep/15 Resolved: 08/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Querying |
| Affects Version/s: | 3.0.2 |
| Fix Version/s: | 3.0.4, 3.1.3 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Geert Bosch |
| Resolution: | Done | Votes: | 2 |
| Labels: | ET |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Quint Iteration 3 |
| Participants: | |
| Description |
|
Issue Status as of Jun 09, 2015

ISSUE SUMMARY

USER IMPACT
In extreme cases the affected node may become "stale". Stale nodes need to be resynchronized. If enough nodes in a replica set become stale availability may be impacted.

WORKAROUNDS
Alternatively, the oplog size can be increased on secondary nodes. This is only a suitable workaround if the nodes undergo periods of no reads so replication can catch up.

AFFECTED VERSIONS

FIX VERSION

Original description
|
| Comments |
| Comment by Ramon Fernandez Marina [ 11/Jun/15 ] | |
|
m.cuk, apologies for the inaccuracies, I'll update JIRA. 3.0.4 was delayed about a week, but the 3.0.4-rc0 release candidate contains a fix for this issue and is available for download. If you were affected by this bug it would be very helpful if you could try 3.0.4-rc0 out and confirm that your problem is indeed fixed. Thanks, | |
| Comment by Bruce Lucas (Inactive) [ 11/Jun/15 ] | |
|
A release candidate 3.0.4-rc0 is available for testing (only) in the "development releases" section of the download site. It is not ready for production use yet, but if this release candidate passes our tests it will become the production 3.0.4 release. | |
| Comment by Matjaž Čuk [ 11/Jun/15 ] | |
|
So the JIRA versions page says : Today is 11/Jun/15 and under downloads there is still only 3.0.3. | |
| Comment by Githook User [ 13/May/15 ] | |
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com> Message: (cherry picked from commit 465ba933e8d6f5ad9173c4c806686b915bfffe1c) Conflicts: | |
| Comment by Ramon Fernandez Marina [ 11/May/15 ] | |
|
m.cuk, we're currently working on 3.0.3. Once we have a timeframe for 3.0.4 we'll update the JIRA versions page. | |
| Comment by Matjaž Čuk [ 11/May/15 ] | |
|
Hi, do you have a time estimate for when 3.0.4 will be released? | |
| Comment by Githook User [ 08/May/15 ] | |
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com> Message: | |
| Comment by Githook User [ 07/May/15 ] | |
|
Author: Geert Bosch (GeertBosch) <geert@mongodb.com> Message: | |
| Comment by Daniel Pasette (Inactive) [ 29/Apr/15 ] | |
|
The patch is in progress. The commit will show up as a comment on this ticket as usual. | |
| Comment by David Murphy [ 29/Apr/15 ] | |
|
Is there a GitHub commit for this yet? We are trying to do some testing with 3.0, but this is a blocker for one of the tests. Even a manual patch that lets the test proceed would be appreciated until 3.0.4 is tagged on GitHub. Thanks | |
| Comment by Kaloian Manassiev [ 23/Apr/15 ] | |
|
From looking at the stacks, I think the problem is that yielding (QueryYield::yieldAllLocks) does not know about the parallel-batch-writer (PBWR) lock, which is acquired directly by the RAII objects and is not managed by the lock manager. As a result, even though all other locks get yielded, the PBWR lock is still held. I think the only way to fix this would be to move the PBWR lock onto the lock manager, so that Locker::saveLockStateAndUnlock would release it as well. This is definitely a regression from 2.6, because back then the yielding code went directly through the RAII objects on the context. | |
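The failure mode described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not MongoDB's actual code: ToyLock, ToyLocker, UntrackedSharedGuard and applierCanRunDuringYield are hypothetical names invented for the illustration, and the locks merely count holders rather than block. The only point it makes is that a yield can release just the locks the locker knows about, so a lock taken through a separate RAII guard stays held across the yield.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Toy shared/exclusive lock: tracks holders only, never blocks.
struct ToyLock {
    int readers = 0;
    bool writer = false;

    bool tryShared() {
        if (writer) return false;
        ++readers;
        return true;
    }
    void unlockShared() {
        assert(readers > 0);
        --readers;
    }
    bool tryExclusive() {
        if (writer || readers > 0) return false;
        writer = true;
        return true;
    }
    void unlockExclusive() {
        assert(writer);
        writer = false;
    }
};

// Hypothetical per-operation locker: it can only yield locks it tracks.
class ToyLocker {
public:
    bool lockShared(ToyLock& lock) {
        if (!lock.tryShared()) return false;
        held_.push_back(&lock);
        return true;
    }
    // Loose analogue of Locker::saveLockStateAndUnlock.
    std::vector<ToyLock*> saveAndUnlock() {
        std::vector<ToyLock*> saved = held_;
        for (ToyLock* lock : held_) lock->unlockShared();
        held_.clear();
        return saved;
    }
    // Reacquire the saved locks once the yield is over.
    bool restore(const std::vector<ToyLock*>& saved) {
        for (ToyLock* lock : saved) {
            if (!lockShared(*lock)) return false;
        }
        return true;
    }

private:
    std::vector<ToyLock*> held_;
};

// RAII guard that holds a lock without registering it with the locker.
class UntrackedSharedGuard {
public:
    explicit UntrackedSharedGuard(ToyLock& lock)
        : lock_(lock), owned_(lock.tryShared()) {}
    ~UntrackedSharedGuard() {
        if (owned_) lock_.unlockShared();
    }
    UntrackedSharedGuard(const UntrackedSharedGuard&) = delete;
    UntrackedSharedGuard& operator=(const UntrackedSharedGuard&) = delete;

private:
    ToyLock& lock_;
    bool owned_;
};

// Simulates one reader yielding and reports whether the replication applier
// could take the batch-writer lock exclusively while the reader is yielded.
bool applierCanRunDuringYield(bool pbwTrackedByLocker) {
    ToyLock pbw;         // stands in for the parallel-batch-writer lock
    ToyLock collection;  // stands in for an ordinary lock-manager lock
    ToyLocker reader;
    std::unique_ptr<UntrackedSharedGuard> guard;

    if (pbwTrackedByLocker) {
        reader.lockShared(pbw);  // proposed fix: the locker owns the lock
    } else {
        guard = std::make_unique<UntrackedSharedGuard>(pbw);  // reported bug
    }
    reader.lockShared(collection);

    std::vector<ToyLock*> saved = reader.saveAndUnlock();  // reader yields
    bool applierRan = pbw.tryExclusive();                  // applier's attempt
    if (applierRan) pbw.unlockExclusive();
    reader.restore(saved);                                 // reader resumes

    reader.saveAndUnlock();  // operation finishes
    return applierRan;
}
```

In this model, calling applierCanRunDuringYield(false) corresponds to the reported behavior (the guard keeps the lock across the yield, so the applier cannot proceed), while applierCanRunDuringYield(true) corresponds to the proposed direction of moving the lock onto the lock manager.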
| Comment by Bruce Lucas (Inactive) [ 23/Apr/15 ] | |
|
Log shows that the table scans are yielding, but that does not seem to be sufficient to avoid blocking replication.
| |
| Comment by Eric Milkie [ 23/Apr/15 ] | |
|
I would have expected the table scans to yield to other operations, including the replication applier. The investigation may want to start by examining the behavior there. |