[SERVER-30983] Investigate background index behavior with causal consistency and secondary read preference Created: 07/Sep/17  Updated: 27/Oct/23  Resolved: 08/Dec/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Works as Designed Votes: 0
Labels: sharding36-passthrough-testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2017-10-02, Sharding 2017-10-23
Participants:

 Description   

Possible bug with secondary read preference and causal consistency enabled. The reindex_background.js workload started failing somewhat frequently with an "IndexNotFound" error in the concurrency_sharded_causal_consistency(_and_balancer) suites once I turned secondary read preference on.
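For reference, the failing pattern boils down to something like the following mongo shell sketch (a hedged reconstruction; the session options, collection name, and field names are illustrative rather than copied from the workload):

    // Causally consistent session with secondary reads; names are illustrative.
    const session = db.getMongo().startSession({causalConsistency: true});
    const coll = session.getDatabase("test").getCollection("reindex_background");

    // Kick off a background text index build; the command returns before the
    // build has finished on the secondaries.
    assert.commandWorked(coll.createIndex({text: "text"}, {background: true}));

    // A $text query routed to a secondary can still fail with IndexNotFound
    // ("text index required for $text query"), because the causal wait only
    // covers the createIndexes oplog entry, not completion of the build.
    coll.find({$text: {$search: "ipsum"}}).readPref("secondary").itcount();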

Example stack trace:

[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000 2017-08-31T22:33:38.519+0000 E QUERY    [thread1] Error: 3 threads threw
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         Foreground jstests/concurrency/fsm_workloads/reindex_background.js
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         Error: error: {
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"ok" : 0,
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"errmsg" : "text index required for $text query",
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"code" : 27,
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"codeName" : "IndexNotFound",
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"$clusterTime" : {
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         		"clusterTime" : Timestamp(1504218621, 2),
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         		"signature" : {
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         			"hash" : BinData(0,"zON3is6ERc9naFc6aUw6RnuPsrU="),
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         			"keyId" : NumberLong("6460566295715774465")
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         		}
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	},
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         	"operationTime" : Timestamp(1504218621, 2)
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         _getErrorWithCode@src/mongo/shell/utils.js:25:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         DBCommandCursor@src/mongo/shell/query.js:702:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         DBQuery.prototype._exec@src/mongo/shell/query.js:117:28
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         DBQuery.prototype.hasNext@src/mongo/shell/query.js:288:5
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.519+0000         DBQuery.prototype.itcount@src/mongo/shell/query.js:415:12
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         query/<@jstests/concurrency/fsm_workloads/reindex.js:76:25
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         assertWithLevel/assertWithLevel@jstests/concurrency/fsm_libs/assert.js:75:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         query@jstests/concurrency/fsm_workloads/reindex.js:69:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         runFSM@jstests/concurrency/fsm_libs/fsm.js:37:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         @<unknown> line 6 > eval:10:9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         main@jstests/concurrency/fsm_libs/worker_thread.js:135:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         @<unknown> line 6 > eval:7:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         @<unknown> line 6 > eval:5:24
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000         _threadStartWrapper@:24:16
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000  :
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 throwError@jstests/concurrency/fsm_libs/runner.js:344:23
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 runWorkloads@jstests/concurrency/fsm_libs/runner.js:739:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 serial@jstests/concurrency/fsm_libs/runner.js:752:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 @jstests\concurrency\fsm_all_sharded_causal_consistency_and_balancer.js:105:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-08-31T22:33:38.520+0000 failed to load: jstests\concurrency\fsm_all_sharded_causal_consistency_and_balancer.

evergreen link: https://evergreen.mongodb.com/task/mongodb_mongo_master_windows_64_2k8_debug_concurrency_sharded_causal_consistency_and_balancer_WT_patch_c351caa6815218c5b4a9801342ccbb1b050f6aea_59a87624e3c3314259001bc1_17_08_31_20_58_16



 Comments   
Comment by Spencer Brody (Inactive) [ 05/Dec/17 ]

Agreed with Will's summary. I don't think there's anything replication can do about this at the moment. It sounds like this likely depends on PM-253 (index builds with two-phase commit), which is on the storage backlog.

Sending back to Sharding.

Comment by William Schultz (Inactive) [ 05/Dec/17 ]

misha.tyulenev, to your comment, this sort of seems like a fundamental incompatibility between replicated indexes and causal consistency, at least as the system stands today. As far as I understand, the only operation about an index build that we actually log in the oplog is the createIndex op.

If the index creation is synchronous (i.e. background: false), then the client can explicitly wait until the build has finished and then, presumably, use the operationTime of the index build operation to read after. If the build is asynchronous (i.e. background: true), then the only information the client can receive about that index's lifecycle is the optime of the createIndex operation. The actual operations carried out internally to build the physical index are not explicitly "replicated", since they are not logged in the oplog. Additionally, there would be no way, from the oplog's perspective, to determine when a background index build "finished".

So, it seems like there may be no existing way for a client to check for the completion of a background index build simply by using the causal consistency machinery. The most it could do is issue a read that is guaranteed to happen after the start of the index build, since that seems to be the only operation time that actually exists in relation to the lifetime of a replicated index. Let me know if I am misunderstanding any of the fundamental concepts here.

There doesn't really seem to be a fix for this specific issue unless the behavior of replicated index builds is significantly changed.
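To make the distinction between the two cases above concrete, here is a hedged mongo shell sketch (collection and index names are illustrative):

    const session = db.getMongo().startSession({causalConsistency: true});
    const coll = session.getDatabase("test").getCollection("c");

    // Foreground build: the secondary builds the index while applying the
    // createIndexes entry, so a causal read that has waited for that entry's
    // operationTime should (per the comment above) see the finished index.
    assert.commandWorked(coll.createIndex({a: 1}, {background: false}));
    coll.find({a: 1}).readPref("secondary").itcount();

    // Background build: only the createIndexes entry is replicated, and the
    // secondary starts an asynchronous build when applying it. The causal wait
    // therefore only guarantees the read happens after the *start* of the
    // secondary's build, so the index may not be usable yet.
    assert.commandWorked(coll.createIndex({b: 1}, {background: true}));
    coll.find({b: 1}).hint({b: 1}).readPref("secondary").itcount(); // may error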

Comment by Benety Goh [ 31/Oct/17 ]

Assigning to william.schultz to get a minimal reproduction of the issue.

Comment by Misha Tyulenev [ 27/Oct/17 ]

max.hirschhorn, until this issue is resolved, all tests that involve background index building are not causally consistent.

Comment by Misha Tyulenev [ 27/Oct/17 ]

The issue is caused by how the ensureIndex({background: true}) command is replicated:
1. On the primary, the command is logged to the oplog when it finishes.
2. When that entry is replicated to the secondary, the secondary starts its own background build.
Hence the afterClusterTime in the find command is not able to wait until index creation has finished on the secondary.
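Concretely, the causally consistent secondary read boils down to a find with readConcern.afterClusterTime; the sketch below is an approximate command shape (the timestamp is the $clusterTime from the log above, everything else is illustrative):

    // The secondary only waits until its cluster time reaches afterClusterTime,
    // i.e. until the createIndexes oplog entry has been applied and the
    // background build started; it does not wait for the build to complete.
    db.getSiblingDB("test").runCommand({
        find: "reindex_background",
        filter: {$text: {$search: "ipsum"}},
        readConcern: {level: "local", afterClusterTime: Timestamp(1504218621, 2)},
        $readPreference: {mode: "secondary"}
    });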
Over to the replication team to triage. Please file a DOCS ticket to document this if it's not being fixed in 3.6.

Comment by Esha Maharishi (Inactive) [ 20/Oct/17 ]

Could this be related to BF-6752, where there is a race between migrations and createIndexes that causes an index not to be created on a recipient shard?

Comment by Max Hirschhorn [ 09/Oct/17 ]

misha.tyulenev, the failures in the reindex_background.js FSM workload are not caused by the issue described in SERVER-31456, as the reindex_background.js FSM workload has a no-op $config.setup() function and creates/drops indexes as part of its $config.states functions.
