[SERVER-21531] Shard's connection blocks forever when attempting autosplit and first config server (SCCC) in TCP blackhole from shards Created: 18/Nov/15  Updated: 06/Dec/22  Resolved: 06/Jan/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.0-rc3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File blackhole_first_config_server_from_shards_and_autosplit.js     Text File stacks.log     Text File test_output.log    
Issue Links:
Duplicate
is duplicated by SERVER-16690 All inserts are delayed by 5 sec when... Closed
is duplicated by SERVER-22486 query to the router never done when t... Closed
is duplicated by SERVER-22812 Connection lasts long time when one c... Closed
Assigned Teams:
Sharding
Operating System: ALL
Steps To Reproduce:

python buildscripts/resmoke.py --executor sharding_legacy blackhole_first_config_server_from_shards_and_autosplit.js

Sprint: Sharding E (01/08/16)
Participants:

 Description   

This test covers the scenario in which the first config server discards all messages from the shards, but not those from the mongos. Note that this prevents the autosplit from succeeding.

  1. Initialize the sharded cluster
  2. Enable sharding on the "test" database
  3. Create a sharded collection called "server16690" with shard key {_id: 1}
  4. Configure the mongobridge corresponding to the first config server to discard messages from each of the shards
  5. Insert a few documents into the test.server16690 collection to trigger an autosplit

[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.841-0500 d20010| 2015-11-18T13:34:19.840-0500 I SHARDING [conn7] received splitChunk request: { splitChunk: "test.server16690", keyPattern: { _id: 1.0 }, min: { _id: MinKey }, max: { _id: MaxKey }, from: "shard0000", splitKeys: [ { _id: ObjectId('564cc4ab70832ec51069b10c') }, { _id: ObjectId('564cc4ab70832ec51069b110') } ], configdb: "hanamizu:20015,hanamizu:20017,hanamizu:20019", shardVersion: [ Timestamp 1000|0, ObjectId('564cc4aba24c0213685f09d7') ], epoch: ObjectId('564cc4aba24c0213685f09d7') }
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.841-0500 d20010| 2015-11-18T13:34:19.840-0500 D SHARDING [conn7] created new distributed lock for test.server16690 on hanamizu:20015,hanamizu:20017,hanamizu:20019 ( lock timeout : 900000, ping interval : 30000, process : 0 )
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.841-0500 d20010| 2015-11-18T13:34:19.841-0500 D NETWORK  [conn7] creating new connection to:hanamizu:20015
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.842-0500 d20010| 2015-11-18T13:34:19.841-0500 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.842-0500 d20010| 2015-11-18T13:34:19.841-0500 D NETWORK  [conn7] connected to server hanamizu:20015 (127.0.1.1)
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.842-0500 b20015| 2015-11-18T13:34:19.841-0500 I NETWORK  [main] connection accepted from 127.0.0.1:41967 #14 (1 connection now open)
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:34:19.842-0500 b20015| 2015-11-18T13:34:19.841-0500 I BRIDGE   [thread1] Discarding "isMaster" command with arguments { isMaster: 1, hostInfo: "hanamizu:20010" } from hanamizu:20010

This network request does not appear to time out, even after several minutes.



 Comments   
Comment by Andy Schwerin [ 06/Jan/16 ]

This is a somewhat esoteric scenario; it doesn't cause corruption or misrouted queries, and SCCC is slated for elimination. We'll let this bug be.

Comment by Max Hirschhorn [ 18/Nov/15 ]

When a message is "discarded", the mongobridge process reads the message from the socket (so the TCP layer acknowledges it), but neither replies on the socket nor forwards the message on to the destination. This is intended to trigger a socket timeout on the process that sent the message.
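The "discard" behavior described above can be illustrated outside MongoDB with plain sockets. The following is only a minimal sketch, not mongobridge's implementation: `start_blackhole_server` and `request_with_timeout` are hypothetical helper names, the payload is an arbitrary byte string rather than a real wire-protocol message, and the 0.2 second timeout is chosen only to keep the demo short. It shows that a peer which reads but never replies is detected only if the sender has a socket timeout set.

```python
import socket
import threading


def start_blackhole_server():
    """Accept one connection and read its bytes, but never reply (hypothetical helper)."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    def serve():
        conn, _ = listener.accept()
        with conn:
            # Drain whatever the client sends. The kernel acknowledges the
            # data at the TCP level, but no application-level reply is written.
            while conn.recv(4096):
                pass

    threading.Thread(target=serve, daemon=True).start()
    return listener


def request_with_timeout(port, timeout_secs):
    """Send one request and wait for a reply; return True if the wait timed out."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_secs) as client:
        client.sendall(b"isMaster")  # placeholder payload, not a real wire message
        try:
            client.recv(4096)
        except socket.timeout:
            return True
    return False


listener = start_blackhole_server()
timed_out = request_with_timeout(listener.getsockname()[1], timeout_secs=0.2)
print("timed out:", timed_out)
```

Without the `timeout` argument the `recv` call would block indefinitely, which matches the symptom reported in the description.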

Additionally, the test finishes when run with CSRS. The shard tries to send 3 messages to the first config server, each of which triggers a socket timeout after 5 seconds:

[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:59:37.049-0500 Write operation 1 of 10 took 3 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:59:37.050-0500 Write operation 2 of 10 took 2 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:59:37.051-0500 Write operation 3 of 10 took 2 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T13:59:37.052-0500 Write operation 4 of 10 took 3 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:00:07.053-0500 Write operation 5 of 10 took 30004 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:00:22.957-0500 Write operation 6 of 10 took 15902 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:00:38.956-0500 Write operation 7 of 10 took 16000 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:00:54.960-0500 Write operation 8 of 10 took 16004 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:01:10.964-0500 Write operation 9 of 10 took 16004 milliseconds
[js_test:blackhole_first_config_server_from_shards_and_autosplit] 2015-11-18T14:01:26.971-0500 Write operation 10 of 10 took 16007 milliseconds
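As a rough consistency check on the timings above, three sequential 5-second socket timeouts would account for 15 seconds of each slow write. The sketch below does that arithmetic on the durations logged for writes 6 through 10. It assumes the three timeouts occur once per write and back to back, which the comment implies but does not state. Write 5 (30004 ms) is left out because the source does not explain its longer duration.

```python
SOCKET_TIMEOUT_MS = 5_000  # "socket timeout after 5 seconds", from the comment
ATTEMPTS_PER_WRITE = 3     # "3 messages", assumed here to mean per write

# Durations copied from the log lines for write operations 6 through 10.
observed_ms = {6: 15902, 7: 16000, 8: 16004, 9: 16004, 10: 16007}

expected_ms = SOCKET_TIMEOUT_MS * ATTEMPTS_PER_WRITE
residuals = {op: ms - expected_ms for op, ms in observed_ms.items()}

for op, extra in residuals.items():
    print(f"write {op}: {extra} ms beyond the {expected_ms} ms spent in timeouts")
```

The remaining time of roughly one second per write is not accounted for by the timeouts alone, and the source does not say where it goes.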

Running the test with CSRS:

python buildscripts/resmoke.py --executor sharding blackhole_first_config_server_from_shards_and_autosplit.js
