[SERVER-50697] Do not enforce fast count on 'config.system.indexBuilds' Created: 02/Sep/20  Updated: 29/Oct/23  Resolved: 11/Nov/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Storage
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Gregory Wlodarek Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2020-11-16
Participants:
Linked BF Score: 70

 Description   

This is related to a build failure on a two-node replica set: we build an index, restart each node individually in standalone mode, and then encounter an incorrect fast count.

 

More precisely, this is the timeline of events leading up to the incorrect fast count:

1. The primary node starts an index build {x: 1} with UUID f50b6510-8a48-4143-b8f4-fc318d8cbd2a.

2. The secondary starts building the index too, and eventually, the commit quorum is satisfied.

3. Both nodes finish the index build and the commit timestamp for both the primary and secondary is Timestamp(1596309184, 9).

4. Both nodes are shut down, as they will be started up in standalone mode.

5. The primary node starts up in standalone mode with recoveryTimestamp Timestamp(1596309184, 8) and sees that it has a single unfinished index in the 'config.system.indexBuilds' collection: "Found index from unfinished build".

However, during shutdown, the collections are validated and the 'config.system.indexBuilds' fast count is incorrect. 
"fast count (0) does not match number of records (1) for collection 'config.system.indexBuilds'"

with the contents of the collection being:

{
	"_id" : UUID("f50b6510-8a48-4143-b8f4-fc318d8cbd2a"),
	"collectionUUID" : UUID("afdc996d-0602-47e5-adbb-ee3e02809050"),
	"commitQuorum" : "votingMembers",
	"indexNames" : [
		"x_1"
	],
	"commitReadyMembers" : [
		"...:20270",
		"...:20271"
	]
}

 

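From what I can tell, the fsync performed when shutting down the nodes in step 4 synced the fast count data to disk "too early": the persisted fast count already reflects the removal of the index build entry from 'config.system.indexBuilds' in step 3, even though that removal was committed after the recoveryTimestamp. On standalone startup the record is therefore still present, while the fast count says the collection is empty.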
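
The mismatch can be illustrated with a toy model. This is only a sketch under stated assumptions, not the storage engine's actual behavior: the ToyCollection class, the insert timestamp Timestamp(1596309184, 5), and the assumption that the entry was removed exactly at the commit timestamp are all hypothetical. The model keeps timestamped record writes but an untimestamped fast count, then "recovers" the records to the recoveryTimestamp.

```python
from dataclasses import dataclass


@dataclass
class Write:
    ts: tuple   # (seconds, increment), like Timestamp(1596309184, 9)
    op: str     # "insert" or "delete"
    key: str    # _id of the affected record


class ToyCollection:
    """Hypothetical model: timestamped records plus an untimestamped fast count."""

    def __init__(self):
        self.history = []     # timestamped writes
        self.fast_count = 0   # persisted as-is, never rolled back to a timestamp

    def insert(self, ts, key):
        self.history.append(Write(ts, "insert", key))
        self.fast_count += 1

    def delete(self, ts, key):
        self.history.append(Write(ts, "delete", key))
        self.fast_count -= 1

    def records_at(self, recovery_ts):
        """Records visible when only writes with ts <= recovery_ts are kept."""
        visible = set()
        for w in sorted(self.history, key=lambda w: w.ts):
            if w.ts > recovery_ts:
                break
            if w.op == "insert":
                visible.add(w.key)
            else:
                visible.discard(w.key)
        return visible


def validate(coll, recovery_ts):
    """Return (matches, fast_count, num_records) as seen after recovery."""
    num_records = len(coll.records_at(recovery_ts))
    return coll.fast_count == num_records, coll.fast_count, num_records


build_uuid = "f50b6510-8a48-4143-b8f4-fc318d8cbd2a"
coll = ToyCollection()
coll.insert((1596309184, 5), build_uuid)  # hypothetical insert timestamp
coll.delete((1596309184, 9), build_uuid)  # assumed: removed at the commit timestamp

result = validate(coll, (1596309184, 8))  # recoveryTimestamp from step 5
print(result)  # (False, 0, 1)
```

In this model the fast count is 0 while one record is visible, matching the "fast count (0) does not match number of records (1)" message above. Validating at Timestamp(1596309184, 9) or later would show the two in agreement.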

Restarting replica set members in standalone mode can make the database contents appear inconsistent with the writes that were visible while the node was running as part of a replica set. Perhaps we can simply turn off fast count validation for 'config.system.indexBuilds' in 'absent_ns_field_in_index_specs.js'.



 Comments   
Comment by Githook User [ 11/Nov/20 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-50697 Do not enforce fast count on 'config.system.indexBuilds'
Branch: master
https://github.com/mongodb/mongo/commit/cba2cfc0e778ffd1c21e6d03976c2aee21a059b1

Comment by Gregory Wlodarek [ 11/Nov/20 ]

After discussing with benety.goh, we've agreed to exclude the 'config.system.indexBuilds' collection from validation's fast count enforcement. This collection is internal to building indexes and should not be queried outside of that.
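
The idea of the exclusion can be sketched as follows. This is a minimal illustration, not the code from the linked commit: the FAST_COUNT_EXEMPT set and the validate_fast_count function are hypothetical names, and only the error message text is taken from the log quoted in the description.

```python
# Hypothetical set of namespaces exempt from fast count enforcement.
FAST_COUNT_EXEMPT = {"config.system.indexBuilds"}


def validate_fast_count(ns, fast_count, num_records, exempt=FAST_COUNT_EXEMPT):
    """Return a list of validation errors for one collection.

    Exempt namespaces are skipped entirely; any other namespace must have a
    fast count equal to its actual number of records.
    """
    if ns in exempt:
        return []
    if fast_count != num_records:
        return [
            f"fast count ({fast_count}) does not match number of records "
            f"({num_records}) for collection '{ns}'"
        ]
    return []


# The internal collection no longer fails validation on a mismatch...
exempt_errors = validate_fast_count("config.system.indexBuilds", 0, 1)
# ...while an ordinary collection (hypothetical namespace) still does.
normal_errors = validate_fast_count("test.coll", 0, 1)
print(exempt_errors, normal_errors)
```

Keeping the exemption as an explicit set of namespaces leaves fast count enforcement unchanged for every other collection.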

Generated at Thu Feb 08 05:23:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.