[SERVER-38798] shard registry busy-loops in commitTransaction Created: 28/Dec/18  Updated: 16/Jan/19  Resolved: 16/Jan/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-37880 Add in backoff for retrying sending c... Closed
Related
Operating System: ALL
Sprint: Sharding 2019-01-28
Participants:
Linked BF Score: 9

 Description   

If a node is down, the shard registry will hammer retries with no backoff. This generates so much system log that it overloads logkeeper and makes it difficult to diagnose problems:

[ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 I REPL     [replexec-4] Member localhost:20004 is now in state RS_DOWN
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD
ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard1:secondary1] 2018-12-23T23:49:10.133+0000 D REPL_HB  [replexec-4] Scheduling heartbeat to localhost:20004 at 2018-12-23T23:49:10.633Z
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9
d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) 
:: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [TransactionCoordinatorService-6] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37
e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinD
ata(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [TransactionCoordinatorService-0] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.133+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-1] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-2] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 D TXN      [TransactionCoordinatorService-5] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.134+0000 I ASIO     [ShardRegistry] Connecting to localhost:20004
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO     [ShardRegistry] Failed to connect to localhost:20004 - HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 I ASIO     [ShardRegistry] Dropping all pooled connections to localhost:20004 due to HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [ShardRegistry] Coordinator shard got response {} for { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard-rs1
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [ShardRegistry] Coordinator shard received a retryable error in response to { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } }from shard shard-rs1 :: caused by :: HostUnreachable: Error connecting to localhost:20004 (127.0.0.1:20004) :: caused by :: Connection refused, resending command.
[ShardedClusterFixture:job0:shard0:primary] 2018-12-23T23:49:10.135+0000 D TXN      [TransactionCoordinatorService-3] Coordinator going to send command { commitTransaction: 1, commitTimestamp: Timestamp(1545608949, 136), lsid: { id: UUID("c8a55cb6-25f8-4b51-8491-ae9d4735c37e"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 362, autocommit: false, writeConcern: { w: "majority" } } to shard shard-rs1

This behavior happened in the linked BF, but it's very difficult to actually download the logs from that failure; I have pasted in a portion of the log here for reference.



 Comments   
Comment by Kaloian Manassiev [ 16/Jan/19 ]

The transaction coordinator retry logic was rewritten as part of SERVER-37880 and now there will not be busy loop and also there is now test to ensure that no infinite-length chain is crated by the retry loop.

Comment by Eric Milkie [ 07/Jan/19 ]

I think this also can result in this process crash:

2018-12-07T21:23:53.186+0000 F - [TransactionCoordinatorService-18] Invariant failure depth < kMaxDepth src/mongo/util/future.h 413
 
2018-12-07T21:23:53.186+0000 F - [TransactionCoordinatorService-18] 
 
***aborting after invariant() failure
 
2018-12-07T21:23:53.197+0000 F - [TransactionCoordinatorService-18] Got signal: 6 (Aborted).
 
0x7fb600d49a21 0x7fb600d48c39 0x7fb600d4911d 0x7fb5fd3087e0 0x7fb5fcf97495 0x7fb5fcf98c75 0x7fb5ff2b166d 0x7fb5ff5a2d6f 0x7fb5ff5a3335 0x7fb5ff5c3496 0x7fb5ff5b4a6e 0x7fb600413a05 0x7fb600413fdc 0x7fb6004149c6 0x7fb600e672f0 0x7fb5fd300aa1 0x7fb5fd04dbdd

This invariant only fires in debug mode. It looks like the future system can get overwhelmed when it goes into this tight loop.

Comment by Eric Milkie [ 28/Dec/18 ]

Note that this particular failure generated 14 log messages per millisecond, for two hours until the test timed out.

Generated at Thu Feb 08 04:50:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.