[SERVER-30984] Investigate agg_base.js workload with causal consistency and secondary read preference Created: 07/Sep/17  Updated: 27/Oct/23  Resolved: 10/Nov/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Misha Tyulenev
Resolution: Works as Designed Votes: 0
Labels: sharding36-passthrough-testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-31934 Setting orphanCleanupDelaySecs=0 in o... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding 2017-10-02, Sharding 2017-11-13
Participants:
Linked BF Score: 0

 Description   

Possible bug involving secondary read preference and causal consistency: the agg_base.js workload failed because an aggregation with an empty match expression did not return the expected number of documents.

Example stack trace:

[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.064+0000 2017-09-06T22:52:22.064+0000 E QUERY    [thread1] Error: 5 exceptions were thrown, 3 of which were unique:
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.064+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.064+0000         3 threads with tids [0,1,4] threw
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.064+0000         Foreground jstests/concurrency/fsm_workloads/agg_base.js
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.064+0000         Error: [922] != [1000] are not equal : undefined
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         quietlyDoAssert@jstests/concurrency/fsm_libs/assert.js:53:15
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         assert.eq@src/mongo/shell/assert.js:54:5
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         wrapAssertFn@jstests/concurrency/fsm_libs/assert.js:60:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         assertWithLevel/</assertWithLevel[fn]@jstests/concurrency/fsm_libs/assert.js:99:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         query@jstests/concurrency/fsm_workloads/agg_base.js:44:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.065+0000         runFSM@jstests/concurrency/fsm_libs/fsm.js:37:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000         @<unknown> line 6 > eval:10:9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000         main@jstests/concurrency/fsm_libs/worker_thread.js:139:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000         @<unknown> line 6 > eval:7:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000         @<unknown> line 6 > eval:5:24
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000         _threadStartWrapper@:24:16
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.066+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.067+0000         1 thread with tids [2] threw
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.067+0000         Foreground jstests/concurrency/fsm_workloads/agg_base.js
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.067+0000         Error: [725] != [1000] are not equal : undefined
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.067+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.067+0000         quietlyDoAssert@jstests/concurrency/fsm_libs/assert.js:53:15
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.068+0000         assert.eq@src/mongo/shell/assert.js:54:5
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.068+0000         wrapAssertFn@jstests/concurrency/fsm_libs/assert.js:60:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.068+0000         assertWithLevel/</assertWithLevel[fn]@jstests/concurrency/fsm_libs/assert.js:99:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.068+0000         query@jstests/concurrency/fsm_workloads/agg_base.js:44:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.068+0000         runFSM@jstests/concurrency/fsm_libs/fsm.js:37:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000         @<unknown> line 6 > eval:10:9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000         main@jstests/concurrency/fsm_libs/worker_thread.js:139:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000         @<unknown> line 6 > eval:7:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000         @<unknown> line 6 > eval:5:24
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000         _threadStartWrapper@:24:16
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.069+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000         1 thread with tids [3] threw
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000         Foreground jstests/concurrency/fsm_workloads/agg_base.js
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000         Error: [856] != [1000] are not equal : undefined
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.070+0000         quietlyDoAssert@jstests/concurrency/fsm_libs/assert.js:53:15
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.071+0000         assert.eq@src/mongo/shell/assert.js:54:5
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.071+0000         wrapAssertFn@jstests/concurrency/fsm_libs/assert.js:60:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.071+0000         assertWithLevel/</assertWithLevel[fn]@jstests/concurrency/fsm_libs/assert.js:99:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.071+0000         query@jstests/concurrency/fsm_workloads/agg_base.js:44:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.071+0000         runFSM@jstests/concurrency/fsm_libs/fsm.js:37:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000         @<unknown> line 6 > eval:10:9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000         main@jstests/concurrency/fsm_libs/worker_thread.js:139:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000         @<unknown> line 6 > eval:7:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000         @<unknown> line 6 > eval:5:24
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000         _threadStartWrapper@:24:16
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.072+0000 
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.073+0000  :
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.073+0000 throwError@jstests/concurrency/fsm_libs/runner.js:344:23
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.073+0000 runWorkloads@jstests/concurrency/fsm_libs/runner.js:739:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.073+0000 serial@jstests/concurrency/fsm_libs/runner.js:752:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.073+0000 @jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js:112:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-09-06T22:52:22.074+0000 failed to load: jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js

evergreen link: https://evergreen.mongodb.com/task/mongodb_mongo_master_ubuntu1604_debug_asan_concurrency_sharded_causal_consistency_and_balancer_patch_a644bf58213f0786b8415277c1d54277b4f14581_59b04453e3c33147710011de_17_09_06_18_57_25/0
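For context, the failing assertion at agg_base.js:44 corresponds to a check of roughly this shape (a simplified mongo-shell sketch, not the exact workload source; `numDocs` and the collection handle are placeholders):

```javascript
// Simplified sketch of the check that failed (hypothetical names, not
// the exact agg_base.js code). Each FSM thread runs an aggregation with
// an empty pipeline and asserts it sees every document loaded in setup.
var numDocs = 1000;  // documents inserted during setup, per the "!= [1000]" errors

function query(db, collName) {
    // With causal consistency and secondary read preference, the count can
    // come up short (e.g. 922, 725, 856 above) if documents are deleted out
    // from under the open cursor between the aggregate and its getMores.
    var count = db[collName].aggregate([]).itcount();
    assert.eq(count, numDocs);
}
```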



 Comments   
Comment by Misha Tyulenev [ 10/Nov/17 ]

Investigated.
From the log https://logkeeper.mongodb.org/build/4010a77909a233be24f1ded0d6556473/test/5a0404b8f84ae8522009f5e5?raw=1
it is evident that every thread that ran an aggregate query, followed by a balancer chunk migration, followed by a getMore, asserted because the number of documents returned was less than the number expected. This is because orphanCleanupDelaySecs is set to 0 for these tests, so the range deleter removes documents without any regard for chunk filtering or clusterTime. The default delay is 15 minutes, but in these tests it is 0.
Hence the scenario is:
1. An aggregate (with secondary read preference) starts running and establishes a cursor.
2. A chunk migration finishes and the range deleter starts.
3. The getMore reads data on the donor shard while that data is being deleted by the range deleter.

This will be fixed by setting the orphanCleanupDelaySecs parameter to a value greater than 0 seconds. The same race may also be the source of intermittent BFs in sharded balancer tests that show the pattern:
1. A query starts.
2. A chunk moves.
3. A getMore runs.

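The proposed fix amounts to restoring a non-zero orphan-cleanup delay. In the mongo shell this can be set on each shard primary roughly as follows (a sketch; test suites would normally pass it as a mongod startup parameter instead):

```javascript
// Sketch: give in-flight cursors time to drain before orphaned documents
// are deleted. orphanCleanupDelaySecs is a mongod server parameter;
// 900 seconds (15 minutes) is its default, 0 is what the failing suites used.
db.adminCommand({setParameter: 1, orphanCleanupDelaySecs: 900});

// Equivalently, at mongod startup:
//   mongod --setParameter orphanCleanupDelaySecs=900
```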
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.655+0000 !!! tid=2 request={ "query" : { "aggregate" : "coll4", "pipeline" : [ ], "cursor" : { }, "lsid" : { "id" : UUID("87a5afc4-af26-4ebd-aded-ae813a19603d") }, "$clusterTime" : { "clusterTime" : Timestamp(1510212904, 3), "signature" : { "hash" : BinData(0,"QnmvlncMeptYCTgbxGWTY977NdI="), "keyId" : NumberLong("6486314762094247955") } }, "readConcern" : { "afterClusterTime" : Timestamp(1510212904, 3) } }, "$readPreference" : { "mode" : "secondary" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.664+0000 d20012| 2017-11-09T07:35:04.663+0000 D REPL_HB [replexec-2] Sending heartbeat (requestId: 1544) to ip-10-81-156-171:20011, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20012", fromId: 2, term: 1 }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.665+0000 d20012| 2017-11-09T07:35:04.663+0000 D REPL_HB [replexec-3] Sending heartbeat (requestId: 1545) to ip-10-81-156-171:20010, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20012", fromId: 2, term: 1 }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.665+0000 d20011| 2017-11-09T07:35:04.662+0000 D REPL_HB [replexec-2] Sending heartbeat (requestId: 1547) to ip-10-81-156-171:20010, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20011", fromId: 1, term: 1 }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.666+0000 d20011| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn5] Received heartbeat request from ip-10-81-156-171:20012, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20012", fromId: 2, term: 1, $replData: 1, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 4279AF96770C7A9B5809381BC4659363DEFB35D2), keyId: 6486314762094247955 } }, $db: "admin" }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.667+0000 d20011| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn5] Processed heartbeat from ip-10-81-156-171:20012 and generated response, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", syncingTo: "ip-10-81-156-171:20010", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.668+0000 d20010| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn3] Received heartbeat request from ip-10-81-156-171:20011, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20011", fromId: 1, term: 1, $replData: 1, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 4279AF96770C7A9B5809381BC4659363DEFB35D2), keyId: 6486314762094247955 } }, $db: "admin" }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.668+0000 d20010| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn3] Processed heartbeat from ip-10-81-156-171:20011 and generated response, { ok: 1.0, electionTime: new Date(6486314594590523393), state: 1, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.669+0000 d20010| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn2] Received heartbeat request from ip-10-81-156-171:20012, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20012", fromId: 2, term: 1, $replData: 1, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 4279AF96770C7A9B5809381BC4659363DEFB35D2), keyId: 6486314762094247955 } }, $db: "admin" }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.670+0000 d20010| 2017-11-09T07:35:04.663+0000 D REPL_HB [conn2] Processed heartbeat from ip-10-81-156-171:20012 and generated response, { ok: 1.0, electionTime: new Date(6486314594590523393), state: 1, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.672+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-4] Received response to heartbeat (requestId: 1544) from ip-10-81-156-171:20011, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", syncingTo: "ip-10-81-156-171:20010", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 }, operationTime: Timestamp 1510212904000|6, $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1510212904000|3, t: 1 }, lastOpVisible: { ts: Timestamp 1510212904000|3, t: 1 }, configVersion: 2, replicaSetId: ObjectId('5a0404c1ecce3160a4e4202a'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp 0|0, electionId: ObjectId('000000000000000000000000') }, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp 1510212901000|27, t: 1 } } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.672+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-4] Scheduling heartbeat to ip-10-81-156-171:20011 at 2017-11-09T07:35:06.664Z
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.674+0000 d20011| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-0] Received response to heartbeat (requestId: 1547) from ip-10-81-156-171:20010, { ok: 1.0, electionTime: new Date(6486314594590523393), state: 1, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 }, operationTime: Timestamp 1510212904000|6, $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1510212904000|3, t: 1 }, lastOpVisible: { ts: Timestamp 1510212904000|3, t: 1 }, configVersion: 2, replicaSetId: ObjectId('5a0404c1ecce3160a4e4202a'), primaryIndex: 0, syncSourceIndex: -1 }, $gleStats: { lastOpTime: Timestamp 0|0, electionId: ObjectId('7fffffff0000000000000001') }, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp 1510212904000|5, t: 1 } } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.676+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-2] Received response to heartbeat (requestId: 1545) from ip-10-81-156-171:20010, { ok: 1.0, electionTime: new Date(6486314594590523393), state: 1, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|3, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 }, operationTime: Timestamp 1510212904000|6, $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1510212904000|3, t: 1 }, lastOpVisible: { ts: Timestamp 1510212904000|3, t: 1 }, configVersion: 2, replicaSetId: ObjectId('5a0404c1ecce3160a4e4202a'), primaryIndex: 0, syncSourceIndex: -1 }, $gleStats: { lastOpTime: Timestamp 0|0, electionId: ObjectId('7fffffff0000000000000001') }, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp 1510212904000|5, t: 1 } } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.676+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-2] Scheduling heartbeat to ip-10-81-156-171:20010 at 2017-11-09T07:35:06.664Z
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.677+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [conn6] Received heartbeat request from ip-10-81-156-171:20011, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20011", fromId: 1, term: 1, $replData: 1, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 4279AF96770C7A9B5809381BC4659363DEFB35D2), keyId: 6486314762094247955 } }, $db: "admin" }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.678+0000 d20012| 2017-11-09T07:35:04.664+0000 D REPL_HB [conn6] Processed heartbeat from ip-10-81-156-171:20011 and generated response, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", syncingTo: "ip-10-81-156-171:20010", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|6, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.678+0000 d20011| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-0] Scheduling heartbeat to ip-10-81-156-171:20010 at 2017-11-09T07:35:06.664Z
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.679+0000 d20011| 2017-11-09T07:35:04.664+0000 D REPL_HB [replexec-1] Sending heartbeat (requestId: 1549) to ip-10-81-156-171:20012, { replSetHeartbeat: "fsm_all_sharded_causal_consistency_and_balancer-rs0", configVersion: 2, from: "ip-10-81-156-171:20011", fromId: 1, term: 1 }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.680+0000 d20011| 2017-11-09T07:35:04.665+0000 D REPL_HB [replexec-2] Received response to heartbeat (requestId: 1549) from ip-10-81-156-171:20012, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "fsm_all_sharded_causal_consistency_and_balancer-rs0", syncingTo: "ip-10-81-156-171:20010", term: 1, primaryId: 0, durableOpTime: { ts: Timestamp 1510212904000|6, t: 1 }, opTime: { ts: Timestamp 1510212904000|6, t: 1 }, operationTime: Timestamp 1510212904000|6, $replData: { term: 1, lastOpCommitted: { ts: Timestamp 1510212904000|3, t: 1 }, lastOpVisible: { ts: Timestamp 1510212904000|3, t: 1 }, configVersion: 2, replicaSetId: ObjectId('5a0404c1ecce3160a4e4202a'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp 0|0, electionId: ObjectId('000000000000000000000000') }, $clusterTime: { clusterTime: Timestamp 1510212904000|6, signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $configServerState: { opTime: { ts: Timestamp 1510212901000|27, t: 1 } } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.680+0000 d20011| 2017-11-09T07:35:04.665+0000 D REPL_HB [replexec-2] Scheduling heartbeat to ip-10-81-156-171:20012 at 2017-11-09T07:35:06.665Z
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.681+0000 d20010| 2017-11-09T07:35:04.670+0000 I SHARDING [ConfigServerCatalogCacheLoader-0] Cache loader remotely refreshed for collection db4.coll4 from collection version 1|1||5a040525184af75d8f602ef9 and found collection version 2|1||5a040525184af75d8f602ef9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.681+0000 d20010| 2017-11-09T07:35:04.672+0000 I SHARDING [ConfigServerCatalogCacheLoader-0] Cache loader found enqueued metadata from 2|0||5a040525184af75d8f602ef9 to 2|1||5a040525184af75d8f602ef9 and persisted metadata from 1|1||5a040525184af75d8f602ef9 to 1|1||5a040525184af75d8f602ef9, GTE cache version 1|1||5a040525184af75d8f602ef9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.682+0000 d20010| 2017-11-09T07:35:04.673+0000 I SHARDING [ConfigServerCatalogCacheLoader-0] Refresh for collection db4.coll4 took 68 ms and found version 2|1||5a040525184af75d8f602ef9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.682+0000 d20010| 2017-11-09T07:35:04.673+0000 I SHARDING [conn20] Updating collection metadata for db4.coll4 from collection version: 1|1||5a040525184af75d8f602ef9, shard version: 1|1||5a040525184af75d8f602ef9 to collection version: 2|1||5a040525184af75d8f602ef9, shard version: 2|1||5a040525184af75d8f602ef9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.682+0000 d20010| 2017-11-09T07:35:04.673+0000 I SHARDING [conn20] Migration succeeded and updated collection version to 2|1||5a040525184af75d8f602ef9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.683+0000 d20010| 2017-11-09T07:35:04.673+0000 I SHARDING [conn20] about to log metadata event into changelog: { _id: "ip-10-81-156-171-2017-11-09T07:35:04.673+0000-5a040528ecce3160a4e42b09", server: "ip-10-81-156-171", clientAddr: "10.81.156.171:57598", time: new Date(1510212904673), what: "moveChunk.commit", ns: "db4.coll4", details: { min: { _id: MinKey }, max: { _id: 0 }, from: "fsm_all_sharded_causal_consistency_and_balancer-rs0", to: "fsm_all_sharded_causal_consistency_and_balancer-rs1" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.684+0000 !!! tid=3 request={ "query" : { "aggregate" : "coll4", "pipeline" : [ ], "cursor" : { }, "lsid" : { "id" : UUID("4c6023b7-c58e-4f18-8fe1-9cc553bec4cd") }, "$clusterTime" : { "clusterTime" : Timestamp(1510212904, 3), "signature" : { "hash" : BinData(0,"QnmvlncMeptYCTgbxGWTY977NdI="), "keyId" : NumberLong("6486314762094247955") } }, "readConcern" : { "afterClusterTime" : Timestamp(1510212904, 3) } }, "$readPreference" : { "mode" : "secondary" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.702+0000 !!! tid=1 request={ "query" : { "aggregate" : "coll4", "pipeline" : [ ], "cursor" : { }, "lsid" : { "id" : UUID("9a5a1b90-0d5c-4316-a8c9-f0ed272a9834") }, "$clusterTime" : { "clusterTime" : Timestamp(1510212904, 3), "signature" : { "hash" : BinData(0,"QnmvlncMeptYCTgbxGWTY977NdI="), "keyId" : NumberLong("6486314762094247955") } }, "readConcern" : { "afterClusterTime" : Timestamp(1510212904, 3) } }, "$readPreference" : { "mode" : "secondary" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.703+0000 d20010| 2017-11-09T07:35:04.702+0000 I SHARDING [conn20] Scheduling deferred deletion of db4.coll4 range [{ _id: MinKey }, { _id: 0 })
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.703+0000 d20010| 2017-11-09T07:35:04.702+0000 I SHARDING [conn20] Leaving cleanup of db4.coll4 range [{ _id: MinKey }, { _id: 0 }) to complete in background
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.704+0000 d20010| 2017-11-09T07:35:04.703+0000 I SHARDING [conn20] about to log metadata event into changelog: { _id: "ip-10-81-156-171-2017-11-09T07:35:04.702+0000-5a040528ecce3160a4e42b11", server: "ip-10-81-156-171", clientAddr: "10.81.156.171:57598", time: new Date(1510212904702), what: "moveChunk.from", ns: "db4.coll4", details: { min: { _id: MinKey }, max: { _id: 0 }, step 1 of 6: 0, step 2 of 6: 5, step 3 of 6: 120, step 4 of 6: 2058, step 5 of 6: 477, step 6 of 6: 136, to: "fsm_all_sharded_causal_consistency_and_balancer-rs1", from: "fsm_all_sharded_causal_consistency_and_balancer-rs0", note: "success" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-11-09T07:35:04.714+0000 !!! tid=2 request={ "getMore" : NumberLong("7610086770093917222"), "collection" : "coll4", "lsid" : { "id" : UUID("87a5afc4-af26-4ebd-aded-ae813a19603d") }, "$clusterTime" : { "clusterTime" : Timestamp(1510212904, 6), "signature" : { "hash" : BinData(0,"QnmvlncMeptYCTgbxGWTY977NdI="), "keyId" : NumberLong("6486314762094247955") } } }

Comment by Charlie Swanson [ 25/Oct/17 ]

Possibly related to BF-6988? It is unclear why agg_sort.js would be expected to work if agg_base.js is not.

Comment by Max Hirschhorn [ 15/Oct/17 ]

After running the agg_base.js FSM workload many more times on top of my changes from SERVER-31456, with the following patch applied to collect additional diagnostics, I've determined there is still an issue and won't be unblacklisting the agg_base.js FSM workload in SERVER-31456 after all. It appears that pre-splitting chunks when using a hashed shard key allows an initial moveChunk command (issued before the "shardCollection" and "_configsvrShardCollection" commands would return) to fail and be retried later by the balancer. I'm not sure whether that is new behavior in MongoDB 3.6 due to changes in how the balancer works, or whether it has always behaved this way and the failures from the agg_base.js FSM workload are a manifestation of the issue described in SERVER-31275.
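The pre-splitting behavior described above can be reproduced in the shell roughly as follows (a sketch; the concurrency suite uses its own setup helpers, and the database/collection names are placeholders taken from the logs):

```javascript
// Sketch: sharding on a hashed key pre-splits the collection into chunks
// and immediately schedules initial moveChunk operations to spread them
// across shards -- the migrations observed failing with
// ConflictingOperationInProgress and later retried by the balancer.
sh.enableSharding("db19");
sh.shardCollection("db19.coll19", {_id: "hashed"});
```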

[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:28.104+0000 c20016| 2017-10-15T00:57:28.103+0000 W SHARDING [conn15] couldn't move chunk shard: fsm_all_sharded_causal_consistency_and_balancer-rs0, lastmod: 1|1||59e2b277ea6e9afd2cd1b7d7, [{ _id: 0 }, { _id: MaxKey }) to shard fsm_all_sharded_causal_consistency_and_balancer-rs1:fsm_all_sharded_causal_consistency_and_balancer-rs1/ip-10-166-197-179:20013,ip-10-166-197-179:20014,ip-10-166-197-179:20015 while sharding collection db19.coll19 :: caused by :: ConflictingOperationInProgress: Unable to start new migration because this shard is currently donating chunk [{ _id: MinKey }, { _id: 0 }) for namespace db19.coll19 to fsm_all_sharded_causal_consistency_and_balancer-rs1
...
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:30.577+0000 !!! tid=0 request={  "query" : {  "aggregate" : "coll19",  "pipeline" : [ ],  "cursor" : {   },  "lsid" : {  "id" : UUID("f7001bc4-fec4-4c80-a9d1-51162f4e0273") },  "$clusterTime" : {  "clusterTime" : Timestamp(1508029050, 2),  "signature" : {  "hash" : BinData(0,"nzm/aAm3sh2S0J7YjLtPAbcFZXM="),  "keyId" : NumberLong("6476934489095274505") } },  "readConcern" : {  "afterClusterTime" : Timestamp(1508029050, 2) } },  "$readPreference" : {  "mode" : "secondary" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:30.618+0000 !!! tid=0 request={  "getMore" : NumberLong("4731112085947059242"),  "collection" : "coll19",  "lsid" : {  "id" : UUID("f7001bc4-fec4-4c80-a9d1-51162f4e0273") },  "$clusterTime" : {  "clusterTime" : Timestamp(1508029050, 2),  "signature" : {  "hash" : BinData(0,"nzm/aAm3sh2S0J7YjLtPAbcFZXM="),  "keyId" : NumberLong("6476934489095274505") } } }
...
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:30.778+0000 d20010| 2017-10-15T00:57:30.777+0000 I COMMAND  [conn20] command admin.$cmd appName: "MongoDB Shell" command: moveChunk { moveChunk: "db19.coll19", shardVersion: [ Timestamp 1000|0, ObjectId('59e2b277ea6e9afd2cd1b7d7') ], epoch: ObjectId('59e2b277ea6e9afd2cd1b7d7'), configdb: "fsm_all_sharded_causal_consistency_and_balancer-configRS/ip-10-166-197-179:20016,ip-10-166-197-179:20017,ip-10-166-197-179:20018", fromShard: "fsm_all_sharded_causal_consistency_and_balancer-rs0", toShard: "fsm_all_sharded_causal_consistency_and_balancer-rs1", min: { _id: MinKey }, max: { _id: 0 }, maxChunkSizeBytes: 67108864, waitForDelete: false, takeDistLock: false, $clusterTime: { clusterTime: Timestamp 1508029047000|38, signature: { hash: BinData(0, BFB54198FC70EF323373094EBC5E06A4159264B3), keyId: 6476934489095274505 } }, $configServerState: { opTime: { ts: Timestamp 1508029047000|38, t: 1 } }, $db: "admin" } numYields:0 reslen:288 locks:{ Global: { acquireCount: { r: 24, w: 14 } }, Database: { acquireCount: { r: 5, w: 12, W: 2 } }, Collection: { acquireCount: { r: 5, W: 6 } }, Metadata: { acquireCount: { w: 4 } }, oplog: { acquireCount: { w: 6 } } } protocol:op_msg 2758ms
...
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:30.930+0000 !!! tid=0 request={  "query" : {  "aggregate" : "coll19",  "pipeline" : [ ],  "cursor" : {   },  "lsid" : {  "id" : UUID("f7001bc4-fec4-4c80-a9d1-51162f4e0273") },  "$clusterTime" : {  "clusterTime" : Timestamp(1508029050, 154),  "signature" : {  "hash" : BinData(0,"nzm/aAm3sh2S0J7YjLtPAbcFZXM="),  "keyId" : NumberLong("6476934489095274505") } },  "readConcern" : {  "afterClusterTime" : Timestamp(1508029050, 5) } },  "$readPreference" : {  "mode" : "secondary" } }
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:57:30.993+0000 !!! tid=0 request={  "getMore" : NumberLong("5561859436560032256"),  "collection" : "coll19",  "lsid" : {  "id" : UUID("f7001bc4-fec4-4c80-a9d1-51162f4e0273") },  "$clusterTime" : {  "clusterTime" : Timestamp(1508029050, 269),  "signature" : {  "hash" : BinData(0,"nzm/aAm3sh2S0J7YjLtPAbcFZXM="),  "keyId" : NumberLong("6476934489095274505") } } }
...
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000 2017-10-15T00:58:36.965+0000 E QUERY    [thread1] Error: 5 exceptions were thrown, 5 of which were unique:
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000         1 thread with tids [0] threw
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000         Foreground jstests/concurrency/fsm_workloads/agg_base.js
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000         Error: [746] != [1000] are not equal : undefined
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.965+0000
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         quietlyDoAssert@jstests/concurrency/fsm_libs/assert.js:53:15
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         assert.eq@src/mongo/shell/assert.js:54:5
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         wrapAssertFn@jstests/concurrency/fsm_libs/assert.js:60:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         assertWithLevel/</assertWithLevel[fn]@jstests/concurrency/fsm_libs/assert.js:99:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         query@jstests/concurrency/fsm_workloads/agg_base.js:44:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         runFSM@jstests/concurrency/fsm_libs/fsm.js:37:13
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.966+0000         @<unknown> line 6 > eval:10:9
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.967+0000         main@jstests/concurrency/fsm_libs/worker_thread.js:140:17
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.967+0000         @<unknown> line 6 > eval:7:1
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.967+0000         @<unknown> line 6 > eval:5:24
[js_test:fsm_all_sharded_causal_consistency_and_balancer] 2017-10-15T00:58:36.967+0000         _threadStartWrapper@:24:16

diff --git a/jstests/concurrency/fsm_all_sharded_causal_consistency.js b/jstests/concurrency/fsm_all_sharded_causal_consistency.js
index 1d2f861..0b4a7e0 100644
--- a/jstests/concurrency/fsm_all_sharded_causal_consistency.js
+++ b/jstests/concurrency/fsm_all_sharded_causal_consistency.js
@@ -95,8 +95,6 @@ var blacklist = [
 
     'reindex_background.js',  // TODO SERVER-30983
 
-    'agg_base.js',  // TODO SERVER-30984
-
     'indexed_insert_multikey.js',                         // TODO SERVER-31000
     'indexed_insert_multikey_noindex.js',                 // TODO SERVER-31000
     'indexed_insert_text_multikey.js',                    // TODO SERVER-31000
diff --git a/jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js b/jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js
index 7ee65f8..5a4e6ca 100644
--- a/jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js
+++ b/jstests/concurrency/fsm_all_sharded_causal_consistency_and_balancer.js
@@ -101,8 +101,6 @@ var blacklist = [
 
     'reindex_background.js',  // TODO SERVER-30983
 
-    'agg_base.js',  // TODO SERVER-30984
-
     'indexed_insert_multikey.js',                         // TODO SERVER-31000
     'indexed_insert_multikey_noindex.js',                 // TODO SERVER-31000
     'indexed_insert_text_multikey.js',                    // TODO SERVER-31000
diff --git a/jstests/concurrency/fsm_libs/runner.js b/jstests/concurrency/fsm_libs/runner.js
index 4997a7b..3f9156a 100644
--- a/jstests/concurrency/fsm_libs/runner.js
+++ b/jstests/concurrency/fsm_libs/runner.js
@@ -713,6 +713,8 @@ var runner = (function() {
                 bgThreadMgr.checkFailed(0);
 
                 var schedule = scheduleWorkloads(workloads, executionMode, executionOptions);
+                schedule = Array.from({length: 100},
+                                      _ => ['jstests/concurrency/fsm_workloads/agg_base.js']);
                 printWorkloadSchedule(schedule, bgWorkloads);
 
                 schedule.forEach(function(workloads) {
diff --git a/jstests/concurrency/fsm_libs/worker_thread.js b/jstests/concurrency/fsm_libs/worker_thread.js
index d03e739..7a74288 100644
--- a/jstests/concurrency/fsm_libs/worker_thread.js
+++ b/jstests/concurrency/fsm_libs/worker_thread.js
@@ -26,6 +26,9 @@ var workerThread = (function() {
 
         globalAssertLevel = args.globalAssertLevel;
 
+        TestData = TestData || {};
+        TestData.tid = args.tid;
+
         try {
             if (Cluster.isStandalone(args.clusterOptions)) {
                 myDB = db.getSiblingDB(args.dbName);
diff --git a/src/mongo/shell/session.js b/src/mongo/shell/session.js
index ae49db6..664bab3 100644
--- a/src/mongo/shell/session.js
+++ b/src/mongo/shell/session.js
@@ -296,6 +296,10 @@ var {
         this.runCommand = function runCommand(driverSession, dbName, cmdObj, options) {
             cmdObj = prepareCommandRequest(driverSession, cmdObj);
 
+            if (typeof TestData !== "undefined" && TestData.tid !== undefined) {
+                print("!!! tid=" + TestData.tid + " request=" + tojsononeline(cmdObj));
+            }
+
             const res = runClientFunctionWithRetries(
                 driverSession, cmdObj, client.runCommand, [dbName, cmdObj, options]);

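The worker_thread.js and session.js hunks above are a throwaway diagnostic: each worker thread records its tid in TestData, and session.js prints every command request on a line tagged `!!! tid=<N> request=<json>`. A minimal node sketch of pulling one thread's requests back out of such a log — the sample lines and the `requestsForTid` helper are illustrative, not part of the test framework:

```javascript
// Each diagnostic line has the shape: "... !!! tid=<N> request=<json>".
// requestsForTid is a hypothetical helper for post-processing the log.
function requestsForTid(logLines, tid) {
    const marker = new RegExp("!!! tid=" + tid + " request=(.*)$");
    return logLines
        .map(line => line.match(marker))
        .filter(m => m !== null)
        .map(m => m[1]);
}

const sampleLog = [
    '... !!! tid=0 request={ "aggregate" : "coll19", "pipeline" : [ ] }',
    '... !!! tid=1 request={ "getMore" : 42, "collection" : "coll19" }',
    '... !!! tid=0 request={ "getMore" : 7, "collection" : "coll19" }',
];

console.log(requestsForTid(sampleLog, 0));  // both of tid 0's requests
```

Grouping requests per tid this way makes it possible to line up one worker's aggregate and getMore commands (as in the `!!! tid=0` lines above) against the moveChunk activity in the server log.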
Comment by Max Hirschhorn [ 09/Oct/17 ]

I've confirmed that the agg_base.js FSM workload no longer fails in the concurrency_sharded_causal_consistency*.yml test suites with my changes for SERVER-31456. The workload inserts its 1000 documents in its $config.setup() function and then tries to iterate over all of them in its $config.states.query() function. However, a read from a secondary may not have waited long enough for all of the insert operations to have been applied.
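A minimal sketch of the gate a causally consistent secondary read depends on — plain node, not driver or server code; the `Secondary` class and the integer timestamps are illustrative. A read carrying readConcern `{afterClusterTime: T}` must not return until the secondary has applied operations up to at least T; returning earlier can miss writes the client has already had acknowledged, which is the shape of the undercount seen above.

```javascript
// Illustrative model of a secondary's applied-optime gate.
class Secondary {
    constructor() {
        this.appliedOpTime = 0;   // highest oplog timestamp applied so far
        this.documents = [];      // data visible at appliedOpTime
    }

    // Replication applies a batch and advances the applied optime.
    applyBatch(docs, opTime) {
        this.documents.push(...docs);
        this.appliedOpTime = opTime;
    }

    // A causal read at afterClusterTime T is only satisfiable once the
    // secondary has caught up to T. (The server waits here; this sketch
    // just reports "not yet" as null.)
    readAfterClusterTime(afterClusterTime) {
        if (this.appliedOpTime < afterClusterTime) {
            return null;
        }
        return this.documents.slice();
    }
}

const secondary = new Secondary();
secondary.applyBatch([1, 2, 3], /* opTime */ 5);

// A client whose acknowledged writes advanced its operationTime to 10
// cannot be served by this secondary yet.
console.log(secondary.readAfterClusterTime(10));  // null: must wait
console.log(secondary.readAfterClusterTime(5));   // [1, 2, 3]
```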

I'm planning to unblacklist the agg_base.js FSM workload as part of my changes for SERVER-31456.

Comment by Jack Mulrow [ 08/Sep/17 ]

Again here (execution 3): https://evergreen.mongodb.com/task/mongodb_mongo_master_osx_1010_ssl_concurrency_sharded_causal_consistency_and_balancer_patch_a644bf58213f0786b8415277c1d54277b4f14581_59b04453e3c33147710011de_17_09_06_18_57_25/2

Generated at Thu Feb 08 04:25:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.