[SERVER-30559] Sharding tests which run under continuous stepdown must allow for lastVisibleOp to go backwards
Created: 08/Aug/17
Updated: 27/Oct/23
Resolved: 26/Mar/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Matthew Russotto
Assignee: Kevin Pulo
Resolution: Gone away
Votes: 0
Labels: sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-30615 _configsvrShardCollection can retry u... Closed
related to SERVER-32375 Config metadata commands should not u... Closed
Sprint: Sharding 2020-03-23, Sharding 2020-04-06
Participants:
Linked BF Score: 0

 Description   

In view_rewrite.js, the test enables sharding on a database and then shards a collection. Sharding a collection requires that the database already have sharding enabled. However, the second step can fail in the rare case where a stepdown occurs between enabling sharding on the database and sharding the collection: the write to the config server recording that the database has sharding enabled may temporarily no longer be visible (though it is guaranteed to become visible eventually).



 Comments   
Comment by Kaloian Manassiev [ 26/Mar/20 ]

Thanks for the detailed write-up, kevin.pulo. Yes, the reasoning sounds right: the local read concern permits the write to be seen even though it is not yet majority committed, and any decision (i.e., subsequent write) made based on that read will itself be majority committed.

This looks like it's gone away.

Comment by Kevin Pulo [ 26/Mar/20 ]

I'm 99% sure this was fixed by SERVER-30615 (and later further strengthened by SERVER-32375), which adjusted ShardLocal to do these local configsvr reads with a readConcern level of local rather than majority (as it was when this ticket was filed). This means that the enableSharding op is no longer hidden by reading from the slightly-old majority snapshot. This was the approach taken at the time, rather than trying to do afterOpTime/afterClusterTime majority reads, to solve similar visibility problems when stepdowns occurred between sequential sharding operations. It would also explain why there haven't been any recent recurrences of this problem, since SERVER-30615 was fixed about a month after this ticket was filed.
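
To illustrate the visibility difference described above, here is a minimal toy model in plain JavaScript. It is not MongoDB code: `ToyConfigServer`, `replicateToMajority`, and `isShardingEnabled` are hypothetical names invented for this sketch, and the model deliberately ignores elections, stepdowns, and snapshots. It only shows why a read at level majority can miss a write that a read at level local already sees.

```javascript
// Toy model only: every name here is an assumption made for illustration,
// not part of the server or of ShardLocal.
class ToyConfigServer {
  constructor() {
    this.oplog = [];            // every write applied on this node, in order
    this.majorityCommitted = 0; // count of leading entries that are majority committed
  }

  // Apply a write locally; it is not majority committed yet.
  write(entry) {
    this.oplog.push(entry);
  }

  // Advance the majority commit point to cover everything written so far.
  replicateToMajority() {
    this.majorityCommitted = this.oplog.length;
  }

  // Return the entries visible at the given read concern level.
  read(level) {
    if (level === "local") return this.oplog.slice();
    if (level === "majority") return this.oplog.slice(0, this.majorityCommitted);
    throw new Error("unsupported read concern level: " + level);
  }
}

// Hypothetical helper: is sharding enabled on dbName, as seen at this level?
function isShardingEnabled(server, dbName, level) {
  return server.read(level).some(e => e.op === "enableSharding" && e.db === dbName);
}

const configsvr = new ToyConfigServer();
configsvr.write({op: "enableSharding", db: "test"});

// Before the majority commit point catches up, only the local read sees the write.
const seenAtMajorityBefore = isShardingEnabled(configsvr, "test", "majority");
const seenAtLocal = isShardingEnabled(configsvr, "test", "local");

// Once the commit point advances, the majority read sees it as well.
configsvr.replicateToMajority();
const seenAtMajorityAfter = isShardingEnabled(configsvr, "test", "majority");

console.log({seenAtMajorityBefore, seenAtLocal, seenAtMajorityAfter});
```

In this model the majority read misses the enableSharding entry until the commit point advances, while the local read sees it immediately. That mirrors the reasoning for why a follow-up check made at level local does not spuriously fail.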
