[SERVER-12109] getMore with tailable cursor, projection, and Query_OplogReplay may fail to return new data Created: 16/Dec/13  Updated: 11/Jul/16  Resolved: 29/Jan/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.5.4
Fix Version/s: 2.5.5

Type: Bug Priority: Major - P3
Reporter: Matt Dannenberg Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 12109repro.js    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:
  1. Create a two node replica set with a 2.4 secondary syncing from a 2.6 primary.
  2. Send a reconfig that only contains changes to member tags.
  3. Insert something and do a getLastError() check with w:2.
  4. Observe that the response takes 10 minutes to arrive. (If a timeout is specified, the call returns sooner, but any subsequent getLastError() will not succeed until 10 minutes have passed.)
Participants:

 Description   

The logic behind this is as follows:

In 2.4, the oplogreader, which notifies the primary of a secondary's sync progress, only sends a handshake when it first connects (in 2.6, we notify the primary of this progress via the SyncSourceFeedback class). This handshake is how the secondary gets added to the primary's ghost cache, which is how the primary tracks the secondary's sync progress for the sake of write concerns.

On a reconfig where only tags are affected, 2.6 members clear the ghost cache as well as the member list, but do not close all connections, in order to avoid triggering an election. When a 2.6 node is syncing from a 2.6 node, this does not cause a problem because the 2.6 secondary will send an update, hear back that the primary does not know who the secondary is, and then send a handshake.

After the reconfig, the 2.6 primary does not know who the 2.4 secondary is, but the 2.4 secondary does not send a new handshake. So the secondary will continue to send updates and the primary will ignore them. After 10 minutes the oplogreader timeout is triggered and a reconnect occurs.
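
The sequence above can be illustrated with a small toy model. This is only a sketch of the described behavior, not the server's code: the class names (Primary, Secondary), the methods (handshake, updatePosition, reconfigTagsOnly, acknowledgedBy), and the simulate helper are all hypothetical, and the 10 minute oplogreader timeout is modeled as an explicit reconnect() call.

```javascript
// Toy model only: every name here is hypothetical and does not mirror server code.
class Primary {
  constructor() {
    // Ghost cache: member id -> last optime that member reported.
    this.ghostCache = new Map();
  }
  // A handshake registers the member in the ghost cache.
  handshake(memberId) {
    this.ghostCache.set(memberId, 0);
  }
  // Progress updates from unknown members are ignored (returns false).
  updatePosition(memberId, optime) {
    if (!this.ghostCache.has(memberId)) return false;
    this.ghostCache.set(memberId, optime);
    return true;
  }
  // Tags-only reconfig: the cache is cleared but connections stay open.
  reconfigTagsOnly() {
    this.ghostCache.clear();
  }
  // Number of tracked secondaries known to have reached the given optime.
  acknowledgedBy(optime) {
    let count = 0;
    for (const reported of this.ghostCache.values()) {
      if (reported >= optime) count++;
    }
    return count;
  }
}

class Secondary {
  // rehandshakes = true models the 2.6 behavior, false models 2.4.
  constructor(id, primary, rehandshakes) {
    this.id = id;
    this.primary = primary;
    this.rehandshakes = rehandshakes;
    this.primary.handshake(this.id); // handshake on first connect
  }
  report(optime) {
    if (this.primary.updatePosition(this.id, optime)) return true;
    if (!this.rehandshakes) return false; // 2.4: keeps sending ignored updates
    this.primary.handshake(this.id);      // 2.6: handshake again, then retry
    return this.primary.updatePosition(this.id, optime);
  }
  // Stands in for the reconnect that follows the oplogreader timeout.
  reconnect() {
    this.primary.handshake(this.id);
  }
}

function simulate(rehandshakes) {
  const primary = new Primary();
  const secondary = new Secondary("s1", primary, rehandshakes);
  secondary.report(1);
  primary.reconfigTagsOnly();
  secondary.report(2); // the write that a w:2 check is waiting on
  const beforeReconnect = primary.acknowledgedBy(2);
  secondary.reconnect();
  secondary.report(2);
  const afterReconnect = primary.acknowledgedBy(2);
  return { beforeReconnect, afterReconnect };
}
```

In this model, simulate(false) leaves the write unacknowledged by the secondary until the reconnect, which corresponds to the stalled w:2 check, while simulate(true) recovers as soon as the first rejected update triggers a new handshake.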



 Comments   
Comment by Githook User [ 29/Jan/14 ]

Author: Hari Khalsa (hkhalsa) <hkhalsa@10gen.com>

Message: SERVER-12109 projection stage should try to work child even if child is EOF
Branch: master
https://github.com/mongodb/mongo/commit/4b9ab5bd393d0742f1be81a8189fa1260469b0c5

Comment by Matt Dannenberg [ 14/Jan/14 ]

The repro script passes up to and including this commit.

The next commit enables the new projection code. I imagine that has something to do with the root cause. I know that 2.4 secondaries query the primary's oplog with a projection of only the ts field. But I am struggling to narrow it down any further, due in no small part to my lack of knowledge of both the old and new query frameworks.

Comment by Matt Dannenberg [ 14/Jan/14 ]

Attaching a repro script.

Comment by Matt Dannenberg [ 02/Jan/14 ]

The hang went from lasting 10 minutes to lasting indefinitely with this commit, which causes getMore() to reset the cursor timeout:
https://github.com/mongodb/mongo/commit/6de0da15e46398dfc1b1747ec2c27c61c2e8bca9

Generated at Thu Feb 08 03:27:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.