[SERVER-22543] multi_write_target.js should not rely on the order of shard ids Created: 15/Jan/16  Updated: 18/Nov/16  Resolved: 10/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.2.4, 3.3.2

Type: Bug Priority: Minor - P4
Reporter: Spencer Jackson Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: test-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Sharding 10 (02/19/16)
Participants:

 Description   

sharding_csrs_continuous_config_stepdown_WT failed on enterprise-rhel-62-64-bit

multi_write_target.js - Logs | History

BF Ticket Generated by spencer.jackson



 Comments   
Comment by Githook User [ 17/Feb/16 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-22543 multi_write_target.js should not rely on order of shard ids
Branch: v3.2
https://github.com/mongodb/mongo/commit/bfe0a9ad30e1c5c6d0f62526a164388eb802db8a

Comment by Githook User [ 10/Feb/16 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-22543 multi_write_target.js should not rely on order of shard ids
Branch: master
https://github.com/mongodb/mongo/commit/4b4cbf4dc3349744bec757382dfe01a3e363d0fc

Comment by Kaloian Manassiev [ 09/Feb/16 ]

This test failed because the line in the test that fetches the shard documents returned them not sorted by their shard id, but in this order: shard0000, shard0002, shard0001. This is evident from the actual moveChunk commands, which get executed later on:

assert.commandWorked(admin.runCommand({ moveChunk : coll + "", find : { skey : 0 }, to : shards[1]._id }));
[js_test:multi_write_target] 2016-01-15T21:46:15.567+0000 d20510| 2016-01-15T21:46:15.172+0000 I COMMAND  [conn6] command admin.$cmd command: moveChunk { moveChunk: "foo.bar", from: "ip-10-169-241-28:20510", to: "ip-10-169-241-28:20512", fromShard: "shard0000", toShard: "shard0002", min: { skey: 0.0 }, max: { skey: 100.0 }, maxChunkSizeBytes: 52428800, configdb: "test-configRS/ip-10-169-241-28:20513,ip-10-169-241-28:20514,ip-10-169-241-28:20515", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, shardVersion: [ Timestamp 1000|4, ObjectId('569968a64848062e2db13294') ], epoch: ObjectId('569968a64848062e2db13294') } keyUpdates:0 writeConflicts:0 numYields:0 reslen:74 locks:{ Global: { acquireCount: { r: 11, w: 5, R: 2 } }, Database: { acquireCount: { r: 2, w: 2, W: 3 } }, Collection: { acquireCount: { r: 2, W: 2 } } } protocol:op_command 66ms
 
assert.commandWorked(admin.runCommand({ moveChunk : coll + "", find : { skey : 100 }, to : shards[2]._id }));
[js_test:multi_write_target] 2016-01-15T21:46:16.122+0000 d20510| 2016-01-15T21:46:15.561+0000 I COMMAND  [conn6] command admin.$cmd command: moveChunk { moveChunk: "foo.bar", from: "ip-10-169-241-28:20510", to: "ip-10-169-241-28:20511", fromShard: "shard0000", toShard: "shard0001", min: { skey: 100.0 }, max: { skey: MaxKey }, maxChunkSizeBytes: 52428800, configdb: "test-configRS/ip-10-169-241-28:20513,ip-10-169-241-28:20514,ip-10-169-241-28:20515", secondaryThrottle: true, waitForDelete: false, maxTimeMS: 0, shardVersion: [ Timestamp 2000|1, ObjectId('569968a64848062e2db13294') ], epoch: ObjectId('569968a64848062e2db13294') } keyUpdates:0 writeConflicts:0 numYields:0 reslen:74 locks:{ Global: { acquireCount: { r: 10, w: 4, R: 2 } }, Database: { acquireCount: { r: 2, w: 2, W: 2 } }, Collection: { acquireCount: { r: 2, W: 2 } } } protocol:op_command 386ms

Because the rest of the test relies on the shards being sorted by their id, the expected checks later on did not match.
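
A minimal sketch of the kind of change the ticket title describes, assuming the test builds its shards array from an unordered find() against config.shards (the mongos handle and variable names below are illustrative, not the test's actual code): sorting the array by _id before indexing into it makes the moveChunk targets deterministic.

// Illustrative sketch only: fetch the shard documents and sort them by _id so
// that shards[1] and shards[2] always refer to the same shard ids, regardless
// of the RecordId (insertion) order on the config servers.
var shards = mongos.getDB("config").shards.find().toArray();
shards.sort(function(a, b) {
    return a._id < b._id ? -1 : (a._id > b._id ? 1 : 0);
});

assert.commandWorked(admin.runCommand({ moveChunk : coll + "", find : { skey : 0 }, to : shards[1]._id }));
assert.commandWorked(admin.runCommand({ moveChunk : coll + "", find : { skey : 100 }, to : shards[2]._id }));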

The only explanation I have for this outcome is that the third config server in the CSRS set was not part of the majority when the shard entries were inserted, and then, when replaying the oplog, happened to insert shard0002 first, so it got a lower RecordId than shard0001. This is very likely, since one of the nodes in the CSRS set was lagging behind for a while:

[js_test:multi_write_target] 2016-01-15T21:46:06.765+0000 c20513| 2016-01-15T21:46:06.297+0000 D REPL     [conn14] received notification that node with memberID 1 in config with version 1 has reached optime: (term: -1, timestamp: Jan 15 21:45:52:1)
[js_test:multi_write_target] 2016-01-15T21:46:06.766+0000 c20513| 2016-01-15T21:46:06.297+0000 D REPL     [conn14] received notification that node with memberID 2 in config with version 1 has reached optime: (term: 1, timestamp: Jan 15 21:46:06:11)
...
[js_test:multi_write_target] 2016-01-15T21:46:06.781+0000 c20513| 2016-01-15T21:46:06.304+0000 D REPL     [conn14] received notification that node with memberID 1 in config with version 1 has reached optime: (term: -1, timestamp: Jan 15 21:45:52:1)
[js_test:multi_write_target] 2016-01-15T21:46:06.782+0000 c20513| 2016-01-15T21:46:06.304+0000 D REPL     [conn14] received notification that node with memberID 2 in config with version 1 has reached optime: (term: 1, timestamp: Jan 15 21:46:06:13)

In the above example, memberID 1 is server 20514:

"members":[
   {
      "_id":0,
      "host":"ip-10-169-241-28:20513"
   },
   {
      "_id":1,
      "host":"ip-10-169-241-28:20514"
   },
   {
      "_id":2,
      "host":"ip-10-169-241-28:20515"
   }
]

And later on, it is this host that becomes the new primary after the stepdown thread runs:

[js_test:multi_write_target] 2016-01-15T21:46:11.866+0000 c20514| 2016-01-15T21:46:11.848+0000 I REPL     [ReplicationExecutor] transition to PRIMARY

Based on this evidence, the behavior is expected, and the test needs to be fixed so that it does not rely on this order.
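
For reference, lag like the one shown in the REPL log lines above can be observed directly against the CSRS with the standard rs.status() shell helper; the snippet below is purely illustrative and not part of the test or the fix.

// Illustrative only: run in a shell connected to a config server
// (e.g. ip-10-169-241-28:20513) and compare each member's optime
// against the primary's.
var status = rs.status();
var primary = status.members.filter(function(m) { return m.stateStr === "PRIMARY"; })[0];
status.members.forEach(function(m) {
    var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
    print(m.name + " (" + m.stateStr + ") is " + lagSecs + "s behind the primary");
});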

Comment by Spencer Jackson [ 15/Jan/16 ]

https://logkeeper.mongodb.org/build/569964cb9041300b275639eb/test/5699688e9041300b2756ae2a#L7526

[js_test:multi_write_target] 2016-01-15T21:46:15.733+0000 d20511| 2016-01-15T21:46:15.519+0000 I COMMAND  [conn5] command admin.$cmd command: isMaster { isMaster: 1, hostInfo: "ip-10-169-241-28:20510" } keyUpdates:0 writeConflicts:0 numYields:0 reslen:178 locks:{} protocol:op_query 0ms
[js_test:multi_write_target] 2016-01-15T21:46:15.734+0000 assert: [null] != [null] are equal : undefined
[js_test:multi_write_target] 2016-01-15T21:46:15.734+0000 d20511| 2016-01-15T21:46:15.520+0000 D COMMAND  [conn5] run command admin.$cmd { ismaster: 1 }
[js_test:multi_write_target] 2016-01-15T21:46:15.734+0000 d20511| 2016-01-15T21:46:15.520+0000 D STORAGE  [migrateThread] stored meta data for foo.bar @ RecordId(2)
[js_test:multi_write_target] 2016-01-15T21:46:15.734+0000 d20511| 2016-01-15T21:46:15.520+0000 I COMMAND  [conn5] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:215 locks:{} protocol:op_command 0ms
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 doassert@src/mongo/shell/assert.js:15:14
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 assert.neq@src/mongo/shell/assert.js:125:5
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 @jstests/sharding/multi_write_target.js:50:1
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 d20511| 2016-01-15T21:46:15.520+0000 D COMMAND  [conn5] run command admin.$cmd { _recvChunkStatus: 1 }
[js_test:multi_write_target] 2016-01-15T21:46:15.735+0000 2016-01-15T21:46:15.613+0000 E d20511| 2016-01-15T21:46:15.520+0000 D COMMAND  [conn5] command: _recvChunkStatus
[js_test:multi_write_target] 2016-01-15T21:46:15.736+0000 QUERY    [thread1] Error: [null] != [null] are equal : undefined :
[js_test:multi_write_target] 2016-01-15T21:46:15.736+0000 doassert@src/mongo/shell/assert.js:15:14
[js_test:multi_write_target] 2016-01-15T21:46:15.736+0000 assert.neq@src/mongo/shell/assert.js:125:5
[js_test:multi_write_target] 2016-01-15T21:46:15.736+0000 @jstests/sharding/multi_write_target.js:50:1
[js_test:multi_write_target] 2016-01-15T21:46:15.736+0000 

Looks like something with sharding?
