[SERVER-12109] getMore with tailable cursor, projection, and Query_OplogReplay may fail to return new data Created: 16/Dec/13 Updated: 11/Jul/16 Resolved: 29/Jan/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.5.4 |
| Fix Version/s: | 2.5.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Dannenberg | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Steps To Reproduce: |
|
||||
| Participants: | |||||
| Description |
|
The logic behind this is as follows. In 2.4, the oplogreader that notifies the primary of a secondary's sync progress sends a handshake only when it first connects (in 2.6, we notify the primary of this progress via the SyncSourceFeedback class). This handshake is how the secondary gets added to the primary's ghost cache, which is how the primary tracks the secondary's sync progress for the sake of write concerns. On a reconfig where only tags are affected, 2.6 members clear the ghost cache as well as the member list, but do not close all connections, in order to avoid triggering an election. When a 2.6 node is syncing from a 2.6 node, this does not cause a problem: the 2.6 secondary sends an update, hears back that the primary does not know who the secondary is, and then sends a handshake. When a 2.4 node is syncing from a 2.6 primary, however, the primary does not know who the 2.4 secondary is after the reconfig, and the 2.4 secondary does not send a new handshake. So the secondary continues to send updates and the primary ignores them. After 10 minutes the oplogreader timeout is triggered and a reconnect occurs. |
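The sequence described above can be sketched as a toy model. This is only an illustration of the handshake and ghost cache interaction, not MongoDB's actual code: the class names, method names, and the `rehandshakes` flag are all hypothetical, and the 10-minute oplogreader timeout is not modelled.

```python
class Primary:
    """Toy stand-in for a 2.6 primary (hypothetical, not the server code)."""

    def __init__(self):
        # Ghost cache: member id -> last reported optime.
        self.ghost_cache = {}

    def handshake(self, member_id):
        # A handshake is the only way a member enters the ghost cache.
        self.ghost_cache[member_id] = None

    def update_position(self, member_id, optime):
        # Updates from members missing from the ghost cache are dropped.
        if member_id not in self.ghost_cache:
            return False
        self.ghost_cache[member_id] = optime
        return True

    def reconfig_tags_only(self):
        # Clears the ghost cache but leaves connections open.
        self.ghost_cache.clear()


class Secondary:
    """Toy secondary; rehandshakes=True models 2.6, False models 2.4."""

    def __init__(self, member_id, primary, rehandshakes):
        self.member_id = member_id
        self.primary = primary
        self.rehandshakes = rehandshakes
        # Both versions handshake once, when they first connect.
        primary.handshake(member_id)

    def report(self, optime):
        ok = self.primary.update_position(self.member_id, optime)
        if not ok and self.rehandshakes:
            # 2.6 behaviour: the primary does not know us, so handshake again.
            self.primary.handshake(self.member_id)
            ok = self.primary.update_position(self.member_id, optime)
        return ok


primary = Primary()
new_node = Secondary("s26", primary, rehandshakes=True)
old_node = Secondary("s24", primary, rehandshakes=False)

primary.reconfig_tags_only()

print(new_node.report(5))  # True: the 2.6 secondary recovers via a new handshake
print(old_node.report(5))  # False: the 2.4 secondary's update is ignored
```

In this model the 2.4 secondary stays invisible to the primary until something forces a fresh connection, which in the real system is the reconnect that follows the oplogreader timeout.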
| Comments |
| Comment by Githook User [ 29/Jan/14 ] |
|
Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com> Message: |
| Comment by Matt Dannenberg [ 14/Jan/14 ] |
|
The repro script passes up to and including this commit. The next commit enables the new projection code, which I imagine has something to do with the root cause. I know that 2.4 secondaries query the primary's oplog with a projection of only the ts field. But I am struggling to narrow it down any further, due in no small part to my lack of knowledge of both the old and new query frameworks. |
| Comment by Matt Dannenberg [ 14/Jan/14 ] |
|
Attaching a repro script. |
| Comment by Matt Dannenberg [ 02/Jan/14 ] |
|
The failure went from lasting "for 10 mins" to lasting indefinitely with this commit, which causes getMore() to reset the cursor timeout: |