[SERVER-22553] mongos_shard_failure_tolerance.js should not rely on order of shard ids Created: 19/Jan/16  Updated: 26/Apr/18  Resolved: 10/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 3.2.4, 3.3.2, 3.4.15

Type: Bug Priority: Minor - P4
Reporter: Spencer Jackson Assignee: Kaloian Manassiev
Resolution: Done Votes: 0
Labels: test-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Backport Requested: v3.4
Sprint: Sharding 10 (02/19/16)
Participants:
Linked BF Score: 18

 Description   

sharding_csrs_continuous_config_stepdown_WT failed on enterprise-rhel-62-64-bit

mongos_shard_failure_tolerance.js

BF Ticket Generated by spencer.jackson



 Comments   
Comment by Githook User [ 26/Apr/18 ]

Author:

{'email': 'misha@mongodb.com', 'username': 'mikety', 'name': 'Misha Tyulenev'}

Message: SERVER-22553 mongos_rs_shard_failure_tolerance.js should not rely on order of shard ids
Branch: v3.4
https://github.com/mongodb/mongo/commit/3275fbf2affabe89a7ae9c604d631d0b6a60e8bf

Comment by Githook User [ 17/Feb/16 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-22553 mongos_shard_failure_tolerance.js should not rely on order of shard ids
Branch: v3.2
https://github.com/mongodb/mongo/commit/ffe9971c736ded4f9d797eb2152ef27ec98c8a70

Comment by Githook User [ 10/Feb/16 ]

Author:

{u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}

Message: SERVER-22553 mongos_shard_failure_tolerance.js should not rely on order of shard ids
Branch: master
https://github.com/mongodb/mongo/commit/5abb483dcf51701f48bc371e0944057412ce2515

Comment by Kaloian Manassiev [ 10/Feb/16 ]

I think this is a similar problem to SERVER-22543. The test relies on the shards being returned in a fixed order, which won't always be true. I'll fix it by removing that assumption.
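For illustration only, here is a minimal sketch of one way to drop that kind of assumption: derive which shard owns a chunk from the config metadata instead of assuming that the first shard the test started received the first shard id. The collection name and setup below are hypothetical and are not the committed fix.

// Hypothetical sketch, not the committed patch: resolve which shard owns a
// chunk from config.chunks instead of relying on shard id ordering.
var st = new ShardingTest({ shards: 2, mongos: 1 });
var admin = st.s0.getDB("admin");
var coll = st.s0.getCollection("foo.bar");

assert.commandWorked(admin.runCommand({ enableSharding: "foo" }));
assert.commandWorked(admin.runCommand({ shardCollection: coll.toString(), key: { _id: 1 } }));
assert.commandWorked(admin.runCommand({ split: coll.toString(), middle: { _id: 0 } }));

// Ask the config server which shard actually owns the chunk containing
// { _id: -1 }, rather than assuming it is "shard0000".
var lowChunk = st.s0.getDB("config").chunks.findOne({ ns: coll.toString(), "min._id": MinKey });
jsTest.log("The { _id: -1 } chunk lives on shard " + lowChunk.shard);
st.stop();

The actual commit may do this differently; the point is only that shard membership is looked up from the cluster metadata rather than inferred from the order in which shard ids were assigned.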

Comment by Matt Cotter [ 04/Feb/16 ]

In both cases, one of the mongods is shut down while a different part of the test is still trying to connect to it.
In the original report from Spencer, c20265 is shut down, and then the test fails to connect to it:

[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:22.470+0000 c20265| 2016-01-19T18:50:21.659+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
....
 
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:26.429+0000 ReplSetTest Could not call ismaster on node connection to ip-10-47-166-201:20264: Error: error doing query: failed: network error while attempting to run command 'ismaster' on host 'ip-10-47-166-201:20264'
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:26.430+0000 2016-01-19T18:50:26.429+0000 I NETWORK  [thread2] trying reconnect to ip-10-47-166-201:20265 (10.47.166.201) failed
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:26.430+0000 2016-01-19T18:50:26.430+0000 W NETWORK  [thread2] Failed to connect to 10.47.166.201:20265, reason: errno:111 Connection refused
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:26.431+0000 2016-01-19T18:50:26.430+0000 I NETWORK  [thread2] reconnect ip-10-47-166-201:20265 (10.47.166.201) failed failed
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:26.431+0000 ReplSetTest Could not call ismaster on node connection to ip-10-47-166-201:20265: Error: socket exception [CONNECT_ERROR] for couldn't connect to server ip-10-47-166-201:20265, connection attempt failed

Something similar happens in Charlie's patch build.
To me this doesn't look like a networking error. Bouncing over to sharding.

Comment by Charlie Swanson [ 03/Feb/16 ]

Looks like it happened again in my patch build.

Comment by Spencer Jackson [ 19/Jan/16 ]

Huh... The error is:

[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.814+0000 ----
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.814+0000 Testing active connection...
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.814+0000 ----
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.814+0000 
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.814+0000 
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.815+0000 c20264| 2016-01-19T18:50:21.377+0000 D REPL     [SyncSourceFeedback] Sending slave oplog progress to upstream updater: { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('000000000000000000000000'), optime: { ts: Timestamp 1453229419000|6, t: 2 }, memberId: 0, cfgver: 2 }, { _id: ObjectId('569e8555754ab418b8071cfa'), optime: { ts: Timestamp 1453229421000|5, t: 2 }, memberId: 1, cfgver: 2 }, { _id: ObjectId('000000000000000000000000'), optime: { ts: Timestamp 1453229412000|18, t: 1 }, memberId: 2, cfgver: 2 } ] }
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.816+0000 c20264| 2016-01-19T18:50:21.377+0000 D ASIO     [NetworkInterfaceASIO-BGSync-0] Starting asynchronous command 277 on host ip-10-47-166-201:20265
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.816+0000 c20264| 2016-01-19T18:50:21.378+0000 D ASIO     [NetworkInterfaceASIO-BGSync-0] Failed to time operation 277 out: Operation aborted.
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.816+0000 c20264| 2016-01-19T18:50:21.378+0000 D REPL     [rsBackgroundSync-0] fetcher read 0 operations from remote oplog
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.817+0000 c20264| 2016-01-19T18:50:21.379+0000 D COMMAND  [conn10] run command config.$cmd { find: "databases", filter: { _id: "fooUnsharded" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp 1453229421000|5, t: 2 } }, limit: 1, maxTimeMS: 30000 }
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.817+0000 c20264| 2016-01-19T18:50:21.379+0000 D QUERY    [conn10] Using idhack: query: { _id: "fooUnsharded" } sort: {} projection: {} limit: 1
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.818+0000 2016-01-19T18:50:21.653+0000 E QUERY    [thread1] Error: error: { "ok" : 0, "errmsg" : "Connection refused", "code" : 6 } :
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.818+0000 _getErrorWithCode@src/mongo/shell/utils.js:23:13
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.818+0000 DBCommandCursor@src/mongo/shell/query.js:679:1
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.818+0000 DBQuery.prototype._exec@src/mongo/shell/query.js:105:28
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.818+0000 DBQuery.prototype.hasNext@src/mongo/shell/query.js:267:5
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.819+0000 DBCollection.prototype.findOne@src/mongo/shell/collection.js:215:12
[js_test:mongos_shard_failure_tolerance] 2016-01-19T18:50:21.819+0000 @jstests/sharding/mongos_shard_failure_tolerance.js:65:18

From the JSTest in question, starting at line 64:

assert.neq(null, mongosConnActive.getCollection( collSharded.toString() ).findOne({ _id : -1 }));
assert.neq(null, mongosConnActive.getCollection( collSharded.toString() ).findOne({ _id : 1 }));

So it seems we make two consecutive findOne calls, and the second one fails with some sort of connection-refused error. Maybe this is something with networking? samantha.ritter, could you take a look at this, or forward it on to someone else who might have a better idea about what's going on here?
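For context (paraphrasing the test setup, not a verbatim excerpt): collSharded is split at { _id : 0 } and its two chunks live on different shards, so the two findOne calls above are routed to different mongods. A condensed, approximate sketch of that kind of setup, with otherShardName standing in for however the test picks the destination shard:

// Approximate sketch; names are illustrative, not copied from the test.
assert.commandWorked(admin.runCommand({ shardCollection : collSharded.toString(),
                                        key : { _id : 1 } }));
assert.commandWorked(admin.runCommand({ split : collSharded.toString(),
                                        middle : { _id : 0 } }));
// Put the { _id : 1 } chunk on the other shard, so the two findOne calls
// above target different shards.
assert.commandWorked(admin.runCommand({ moveChunk : collSharded.toString(),
                                        find : { _id : 0 },
                                        to : otherShardName }));

If the test's assumption about which shard id maps to which shard is wrong (see Kaloian's diagnosis above), a call the test expects to hit a healthy shard can instead be routed to one that is down, which would explain a "Connection refused" failure on the second findOne at line 65.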
