[SERVER-35616] Oplog query on initial syncing node can cause segmentation fault
Created: 15/Jun/18  Updated: 29/Oct/23  Resolved: 24/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.0.3, 4.1.3

Type: Bug
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: Tess Avitabile (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File bf-9529.diff    
Issue Links:
Backports
Depends
Related
is related to SERVER-34249 Oplog query on uninitiated replica se... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6
Sprint: Repl 2018-07-16, Repl 2018-08-27
Participants:
Linked BF Score: 60

 Description   

If an oplog query with read concern afterClusterTime is run on a node that is in initial sync and does not yet have an oplog, the node will seg-fault. This is similar to SERVER-34249, except that this time the node's config is initialized. I think this is probably possible on 3.6 as well, though the implementer should check.
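
To illustrate the failure mode, here is a minimal Python sketch of the kind of guard the fix title describes ("Do not allow afterClusterTime reads before oplog exists"). It is a toy model, not the server code: `Node`, `OplogMissingError`, and `can_serve_after_cluster_time` are hypothetical names, timestamps are plain integers, and the real error code returned by the server is not stated in this ticket.

```python
from dataclasses import dataclass
from typing import List, Optional


class OplogMissingError(Exception):
    """Hypothetical error type; the ticket does not name the real error code."""


@dataclass
class Node:
    # Oplog entry timestamps; None models a node in initial sync
    # whose oplog has not been created yet.
    oplog: Optional[List[int]] = None


def can_serve_after_cluster_time(node: Node, after_cluster_time: int) -> bool:
    """Return True if the node has applied up to after_cluster_time."""
    # Guard: fail cleanly instead of dereferencing a missing oplog.
    if node.oplog is None:
        raise OplogMissingError("afterClusterTime read rejected: no oplog yet")
    # An empty oplog has applied nothing, so the read would have to wait.
    return bool(node.oplog) and node.oplog[-1] >= after_cluster_time
```

In this model, a node with `oplog=None` gets a clean error where the unguarded path would have touched a missing oplog.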



 Comments   
Comment by Githook User [ 28/Aug/18 ]

Author: Tess Avitabile <tess.avitabile@mongodb.com> (tessavitabile)

Message: SERVER-35616 Do not allow afterClusterTime reads before oplog exists

(cherry picked from commit f7c2600036ce876bb389f3eb3adc8eada6932d8b)
Branch: v4.0
https://github.com/mongodb/mongo/commit/7d3019f834c41368c6f16e48250b0f022c51a9ae

Comment by Githook User [ 24/Aug/18 ]

Author: Tess Avitabile <tess.avitabile@mongodb.com> (tessavitabile)

Message: SERVER-35616 Do not allow afterClusterTime reads before oplog exists
Branch: master
https://github.com/mongodb/mongo/commit/f7c2600036ce876bb389f3eb3adc8eada6932d8b

Comment by Tess Avitabile (Inactive) [ 22/Aug/18 ]

The script does not cause the seg-fault on 3.6. On 3.6, we do not start advancing the cluster time until the FCV is 3.6. Until the cluster time becomes non-null, all non-null afterClusterTime queries will uassert (and null afterClusterTime queries always uassert). During initial sync, the FCV is not set until after the oplog is created, so by the time afterClusterTime queries can succeed, the oplog already exists and accessing it cannot seg-fault.
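
The ordering argument above can be sketched as a small Python simulation. This is only an illustration under stated assumptions: the step names, `simulate`, and `has_unsafe_window` are hypothetical, and the second ordering in the usage note is a made-up contrast case, not a claim about how any particular release sequences these steps.

```python
from typing import List, Tuple

# Assumed 3.6 ordering during initial sync, per the comment above:
# the oplog is created before the FCV is set.
STEPS_3_6 = ["start_initial_sync", "create_oplog", "set_fcv"]


def simulate(steps: List[str]) -> List[Tuple[str, bool, bool]]:
    """Record (step, query_can_proceed, oplog_exists) after each step."""
    oplog_exists = False
    cluster_time_non_null = False
    observations = []
    for step in steps:
        if step == "create_oplog":
            oplog_exists = True
        elif step == "set_fcv":
            # Cluster time only starts advancing once the FCV is set.
            cluster_time_non_null = True
        # An afterClusterTime query gets past the uassert only when
        # the cluster time is non-null.
        observations.append((step, cluster_time_non_null, oplog_exists))
    return observations


def has_unsafe_window(observations: List[Tuple[str, bool, bool]]) -> bool:
    """True if a query could proceed at a point where no oplog exists."""
    return any(proceeds and not oplog for _, proceeds, oplog in observations)
```

With `STEPS_3_6`, `has_unsafe_window(simulate(STEPS_3_6))` is `False`. Swapping the last two steps, so that the cluster time is non-null before the oplog exists, makes it `True`, which is the window in which the seg-fault becomes possible.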

I tried to make the cluster time non-null during initial sync by using the resync command. However, when we are in initial sync due to a resync command, the oplog is not dropped, so we do not hit the seg-fault.

Since I have been unable to reproduce this on 3.6, I am going to decline to backport the fix to 3.6.

Comment by Tess Avitabile (Inactive) [ 06/Jul/18 ]

No, I don't think it makes sense to put this in the Initial Sync Semantics epic. That epic is to ensure that the presence of an initial syncing node cannot cause majority acknowledged writes to roll back and to ensure that once a node exits initial sync, it can restart or roll back without requiring a full resync.

Comment by Judah Schvimer [ 15/Jun/18 ]

Reproduced on top of 3bec524a983f2a21bfd40b1d39c937189e12db07 with the attached bf-9529.diff.
