Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 4.9.0
Affects Version/s: None
Component/s: Index Maintenance, Storage
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Execution Team 2020-11-16
Linked BF Score:
70
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

This is related to a build failure on a two-node replica set: we build an index, then restart each node individually in standalone mode, and encounter an incorrect fast count.

More precisely, this is the timeline of events leading up to the incorrect fast count:

1. The primary node starts an index build {x: 1} with UUID f50b6510-8a48-4143-b8f4-fc318d8cbd2a.

2. The secondary starts building the index too, and eventually, the commit quorum is satisfied.

3. Both nodes finish the index build and the commit timestamp for both the primary and secondary is Timestamp(1596309184, 9).

4. Both nodes are shut down, as they will be started up in standalone mode.

5. The primary node starts up in standalone mode with recoveryTimestamp Timestamp(1596309184, 8) and sees a single unfinished index build in the 'config.system.indexBuilds' collection: "Found index from unfinished build".

However, during shutdown, the collections are validated and the fast count for 'config.system.indexBuilds' is found to be incorrect:
"fast count (0) does not match number of records (1) for collection 'config.system.indexBuilds'"

with the contents of the collection being:

{
	"_id" : UUID("f50b6510-8a48-4143-b8f4-fc318d8cbd2a"),
	"collectionUUID" : UUID("afdc996d-0602-47e5-adbb-ee3e02809050"),
	"commitQuorum" : "votingMembers",
	"indexNames" : [
		"x_1"
	],
	"commitReadyMembers" : [
		"...:20270",
		"...:20271"
	]
}

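From what I can tell, the fsync performed when shutting down the nodes in step 4 synced the fast count data to disk "too early": the persisted count already reflects the removal of the index build entry from the 'config.system.indexBuilds' collection in step 3, even though that removal was committed after the recoveryTimestamp. On standalone startup the entry is therefore still present (1 record), while the fast count says 0.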
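To make the suspected mechanism concrete, here is a minimal toy model in Python. It is only a sketch of the hypothesis above, not the server's storage code: `ToyCollection`, `records_as_of`, and `validate` are hypothetical names, the insert timestamp `(1596309184, 1)` is invented for illustration, and it assumes the entry was removed at the index build's commit timestamp. The only idea it captures is that records are rolled back to the recoveryTimestamp while the fast count is not.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCollection:
    """Toy model: writes are timestamped, the fast count is not."""
    # Each entry is (timestamp, op, record_id); op is "insert" or "remove".
    history: list = field(default_factory=list)
    # Maintained eagerly on every write, outside the timestamped history.
    fast_count: int = 0

    def insert(self, ts, record_id):
        self.history.append((ts, "insert", record_id))
        self.fast_count += 1

    def remove(self, ts, record_id):
        self.history.append((ts, "remove", record_id))
        self.fast_count -= 1

    def records_as_of(self, recovery_ts):
        """Replay only the writes at or before recovery_ts."""
        live = set()
        for ts, op, record_id in sorted(self.history):
            if ts > recovery_ts:
                break
            if op == "insert":
                live.add(record_id)
            else:
                live.discard(record_id)
        return live


def validate(coll, recovery_ts, ns):
    """Compare the persisted fast count with the records visible at recovery_ts."""
    num_records = len(coll.records_as_of(recovery_ts))
    if coll.fast_count != num_records:
        return (f"fast count ({coll.fast_count}) does not match number of "
                f"records ({num_records}) for collection '{ns}'")
    return None


build_uuid = "f50b6510-8a48-4143-b8f4-fc318d8cbd2a"
coll = ToyCollection()
# Hypothetical timestamp for when the index build entry was written.
coll.insert((1596309184, 1), build_uuid)
# Assumption: the entry is removed at the commit timestamp from step 3.
coll.remove((1596309184, 9), build_uuid)

# Standalone startup recovers to Timestamp(1596309184, 8), before the removal.
message = validate(coll, (1596309184, 8), "config.system.indexBuilds")
print(message)
```

In this model the removal at `(1596309184, 9)` is invisible at recoveryTimestamp `(1596309184, 8)`, so one record remains while the eagerly maintained count has already dropped to 0, which is the same shape of mismatch reported by validation.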

Restarting replica set members in standalone mode makes the database contents prone to appear inconsistent with the writes that were visible when the node was running as part of a replica set, so perhaps we can just turn off the fast count validation for 'config.system.indexBuilds' in 'absent_ns_field_in_index_specs.js'.

Assignee: Gregory Wlodarek
Reporter: Gregory Wlodarek
Participants: Githook User, Gregory Wlodarek
Votes: 0
Watchers: 5

Created: Sep 02 2020 04:38:25 PM UTC
Updated: Oct 29 2023 10:03:42 PM UTC
Resolved: Nov 11 2020 10:42:43 PM UTC
Confidence Status Last Update: 11/Nov/20 6:54 PM
