[SERVER-47839] CSRS member fails to be killed on 4.4.0-rc3 Created: 29/Apr/20  Updated: 08/Jan/24  Resolved: 08/May/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Louisa Berger Assignee: Janna Golden
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File member1_9007.log     Text File member2_9008.log     Text File member3_9009.log    
Issue Links:
Depends
Operating System: ALL
Sprint: Sharding 2020-05-18
Participants:

 Description   

While running a sharded cluster on 4.4.0-rc3, I killed all 3 CSRS members by running a kill <pid> command against each one. The first and third members shut down successfully, but the second member did not. My best guess at the most relevant error is this line:

{"t":{"$date":"2020-04-28T14:26:47.203+00:00"},"s":"W", "c":"STORAGE", "id":20561,  "ctx":"SignalHandler","msg":"Error stepping down in non-command initiated shutdown path","attr":{"error":{"code":189,"codeName":"PrimarySteppedDown","errmsg":"While waiting for secondaries to catch up before stepping down, this node decided to step down for other reasons"}}}

Attaching the full logs. "member2_9008" is the process that failed to be killed.

Note: this doesn't happen every time, but I can reliably trigger it again.

Spoke to judah.schvimer and he recommended filing a bug directly.
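
For reference, a rough sketch of the kill-and-verify sequence described above (Python, with placeholder pids; kill <pid> sends SIGTERM, which is the signal-based shutdown path the SignalHandler log lines refer to):

# Rough reproduction sketch, not a polished script. Assumes the three CSRS
# members of the 4.4.0-rc3 cluster are already running on this host; the
# pid values below are placeholders and must be replaced with the real pids.
import os
import signal
import time

CSRS_PIDS = [11111, 22222, 33333]  # placeholders for the three config server pids

def is_alive(pid):
    # Signal 0 performs no action but checks whether the process still exists.
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False

# Plain `kill <pid>` sends SIGTERM, i.e. the signal-based shutdown path.
for pid in CSRS_PIDS:
    os.kill(pid, signal.SIGTERM)

# Give the members a generous window to exit, then report which ones hung.
deadline = time.time() + 120
while time.time() < deadline and any(is_alive(p) for p in CSRS_PIDS):
    time.sleep(1)

for pid in CSRS_PIDS:
    state = "still running (failed to shut down)" if is_alive(pid) else "exited"
    print(f"pid {pid}: {state}")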



 Comments   
Comment by Janna Golden [ 08/May/20 ]

This was fixed by this commit in 4.4.0-rc4.

Comment by Louisa Berger [ 08/May/20 ]

Sure, go ahead!

Comment by Janna Golden [ 07/May/20 ]

Yay! I can close this ticket if you feel comfortable doing so.

Comment by Louisa Berger [ 07/May/20 ]

Ran our tests on rc4 and it looks resolved!

Comment by Janna Golden [ 06/May/20 ]

I think this should be fixed in rc4. I have not been able to reproduce the issue, but looking at the logs I'm pretty convinced this is the same issue described in SERVER-46841, which was backported to rc4. That ticket specifically covers a hang shutting down the logical session cache while it's refreshing, but I believe the same problem can happen with the PeriodicShardedIndexConsistencyChecker. We see the PeriodicShardedIndexConsistencyChecker log

| 2020-04-28T14:26:37.204+00:00 I  SHARDING 22049   [PeriodicShardedIndexConsistencyChecker] "Checking consistency of sharded collection indexes across the cluster" 

just before the SignalHandler logs that it received a shutdown signal. The PeriodicShardedIndexConsistencyChecker runs an agg pipeline through the ARS, so we can get stuck here when running PeriodicShardedIndexConsistencyChecker::onShutDown() in the same way as the LSC.
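
To make the mechanism concrete, here is a deliberately simplified sketch (illustrative Python, not the server's actual C++ code; class and method names are made up) of how a periodic background job whose in-flight remote call is never interrupted can wedge shutdown at the join:

# Illustrative only; not MongoDB's implementation.
import threading
import time

class PeriodicCheckerSketch:
    def __init__(self):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, name="checker")
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            # Stand-in for the index-consistency aggregation dispatched to the
            # shards; if the cluster is going down, the response may never come
            # and nothing here interrupts the wait.
            self._blocking_remote_call()
            self._stop.wait(60)  # wait for the next periodic run

    def _blocking_remote_call(self):
        time.sleep(3600)  # simulates a remote request that never completes

    def on_shutdown(self):
        self._stop.set()
        # Shutdown blocks here: the worker is still inside
        # _blocking_remote_call(), so join() never returns and later shutdown
        # stages (e.g. tearing down the TransportLayer) are never reached.
        self._thread.join()

Setting the stop flag alone isn't enough; the outstanding operation itself has to be interrupted or waited out before the join, which is the shape of the problem SERVER-46841 describes for the logical session cache refresh.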

louisa.berger, rc4 is getting cut tonight, so I think it might be easiest to see if rc4 fixes the issues you're seeing. If it doesn't, I'm happy to keep trying to repro afterwards.

Comment by Tess Avitabile (Inactive) [ 30/Apr/20 ]

We catch the error from the stepdown attempt here and continue.

We can see that we proceed in shutting down the MirrorMaestro here by this log line:

| 2020-04-28T14:26:47.203+00:00 I  REPL     40447   [TopologyVersionObserver] "Stopped {topologyVersionObserverName}","attr":{"topologyVersionObserverName":"TopologyVersionObserver"}

However, we never reach shutting down the TransportLayer, since we don't see this log line.

This means there's a hang somewhere in this section.
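
If this reproduces again, one way to pin down exactly which task in that section is blocked would be to dump all thread stacks from the hung member, e.g. via gdb (the pid below is a placeholder for member2's actual pid):

# Hypothetical debugging helper: attach gdb in batch mode and print every
# thread's backtrace so the blocked shutdown task is visible.
import subprocess

HUNG_PID = 12345  # placeholder: substitute the pid of the member that won't die

subprocess.run(
    ["gdb", "-p", str(HUNG_PID), "-batch", "-ex", "thread apply all bt"],
    check=True,
)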

Sending to the Sharding team to investigate.

Comment by Lingzhi Deng [ 29/Apr/20 ]

SERVER-47190 has not been backported yet, and it was not expected to change the behavior of signal-based shutdown.

Comment by Judah Schvimer [ 29/Apr/20 ]

lingzhi.deng, any idea if this is related to SERVER-47190?
mira.carey@mongodb.com, am I wrong that this is unexpected? Does this sound related to any of Service Arch's recent shutdown work?
