[SERVER-47839] CSRS member fails to be killed on 4.4.0-rc3 Created: 29/Apr/20 Updated: 08/Jan/24 Resolved: 08/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louisa Berger | Assignee: | Janna Golden |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | | ||
| Issue Links: | | ||
| Operating System: | ALL | ||||
| Sprint: | Sharding 2020-05-18 | ||||
| Participants: | |||||
| Description |
|
Running a sharded cluster on 4.4.0-rc3. I killed all 3 CSRS members by running a kill <pid> command. The first and third members shut down successfully, but the second member did not. My random guess at the most relevant error is here:
Attaching the full logs. "member2_9008" is the process that failed to be killed. Note – this doesn't happen every time, but I can reliably trigger it again. Spoke to judah.schvimer and he recommended filing a bug directly. |
| Comments |
| Comment by Janna Golden [ 08/May/20 ] | |
|
This was fixed by this commit in 4.4.0-rc4. | |
| Comment by Louisa Berger [ 08/May/20 ] | |
|
Sure, go ahead! | |
| Comment by Janna Golden [ 07/May/20 ] | |
|
Yay! I can close this ticket if you feel comfortable doing so. | |
| Comment by Louisa Berger [ 07/May/20 ] | |
|
Ran our tests on rc4 and it looks resolved! | |
| Comment by Janna Golden [ 06/May/20 ] | |
|
I think this should be fixed in rc4. I have not been able to reproduce the issue, but looking at the logs I'm pretty convinced that this is the same issue described in the linked ticket, based on what the logs show just before the SignalHandler logs that it received a shutdown signal. The PeriodicShardedIndexConsistencyChecker runs an agg pipeline through the ARS, so we can get stuck here when running PeriodicShardedIndexConsistencyChecker::onShutDown(), in the same way as the LSC. louisa.berger, rc4 is getting cut tonight, so I think it might be easiest to see whether rc4 fixes the issue you're seeing. If it doesn't, I'm happy to keep trying to repro after. |
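A minimal C++ sketch of the shutdown pattern being described, using invented stand-in names (FakeAsyncSender, PeriodicCheckerSketch) rather than the real ARS or PeriodicShardedIndexConsistencyChecker classes: the periodic job's onShutDown() joins work that is blocked waiting on a remote response that can no longer arrive, so the join never returns. Running this sketch deliberately hangs in onShutDown().

```cpp
// Sketch only: not the actual MongoDB code. Illustrates a periodic job whose
// onShutDown() blocks on an in-flight remote request that will never complete.
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Stand-in for the async request machinery (the real code goes through the ARS).
class FakeAsyncSender {
public:
    // Returns a future that completes only if a response is ever delivered.
    std::future<int> scheduleRemoteAggregation() {
        return _response.get_future();
    }
    // In this sketch no response is ever delivered, mimicking a request that is
    // stranded once the cluster is already shutting down.
private:
    std::promise<int> _response;
};

// Stand-in for a periodic checker whose onShutDown() joins its in-flight work.
class PeriodicCheckerSketch {
public:
    void run() {
        _worker = std::thread([this] {
            auto fut = _sender.scheduleRemoteAggregation();
            fut.wait();  // Blocks forever: the response never arrives.
        });
    }

    void onShutDown() {
        std::cout << "onShutDown(): waiting for in-flight aggregation...\n";
        if (_worker.joinable()) {
            _worker.join();  // Hangs here, like the shutdown hang in this ticket.
        }
        std::cout << "onShutDown(): done\n";  // Never reached in this sketch.
    }

private:
    FakeAsyncSender _sender;
    std::thread _worker;
};

int main() {
    PeriodicCheckerSketch checker;
    checker.run();
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    checker.onShutDown();  // Demonstrates the hang; the process never exits cleanly.
}
```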
| Comment by Tess Avitabile (Inactive) [ 30/Apr/20 ] | |
|
We catch the error from the stepdown attempt here and continue. We can see that we proceed to shut down the MirrorMaestro here, per this log line:
However, we never reach shutting down the TransportLayer, since we don't see this log line. That means there's a hang somewhere in the section between those two points. Sending to the Sharding team to investigate. |
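A minimal C++ sketch of the log-bracketing reasoning above, with invented names and log text (not the real mongod shutdown code): each shutdown step prints a line, so seeing the first line but never the second localizes the hang to the step in between. Running it prints the first line and then hangs.

```cpp
// Sketch only: not the actual shutdown sequence. Shows how the presence of one
// log line and the absence of the next brackets a hang to the code between them.
#include <condition_variable>
#include <iostream>
#include <mutex>

namespace {
std::mutex mtx;
std::condition_variable cv;
bool pendingWorkDrained = false;  // Never set in this sketch, mimicking the hang.
}  // namespace

void shutdownSequenceSketch() {
    std::cout << "Shutting down the MirrorMaestro\n";  // This log line IS observed.

    // Somewhere in the section between the two log lines, a step blocks waiting
    // for work that never completes (e.g. an in-flight remote request).
    std::unique_lock<std::mutex> lk(mtx);
    cv.wait(lk, [] { return pendingWorkDrained; });  // Hangs here.

    std::cout << "Shutting down the TransportLayer\n";  // This line is NEVER observed.
}

int main() {
    shutdownSequenceSketch();
}
```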
| Comment by Lingzhi Deng [ 29/Apr/20 ] | |
|
| |
| Comment by Judah Schvimer [ 29/Apr/20 ] | |
|
lingzhi.deng, any idea if this is related to |