[SERVER-55817] Request timed out while host is locked Created: 06/Apr/21 Updated: 23/Apr/21 Resolved: 22/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Wernfried Domscheit | Assignee: | Edwin Zhou |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Steps To Reproduce: | I am not able to reproduce the error in reliable way. My application runs about 20 of these jobs in parallel, they are all the same, just the db and collection names differ. On 5-10 jobs it fails, the others are running fine. |
| Participants: |
| Description |
|
I try to setup a backup procedure according to Back Up a Sharded Cluster with File System Snapshots I have 3-member Config Replica Set with 4 shards. Each shard is a PSA Replica Set. While backup my application is running of course. I run db.fsyncLock() on one Config SECONDARY. While the config RS member is locked my aggregation pipeline fails:
Error:
Host d-mipmdb-cfg-01.xxx.xxxx.xxx is the locked Config SECONDARY. Looks like, the error is not related to aggregation pipeline itself, the same error appears also on other commands, e.g.
|
| Comments |
| Comment by Edwin Zhou [ 22/Apr/21 ] | ||||||||
|
Hi wernfried.domscheit@sunrise.net, Thanks for following up. I will go ahead and close this ticket. Best, | ||||||||
| Comment by Wernfried Domscheit [ 21/Apr/21 ] | ||||||||
|
Hi Edwin
Please see attached logfile. Yes, the aggregation runs on mongos. My application logging is this one, it should help you to find relevant messages from logfile.
Host "d-mipmdb-cfg-01.swi.srse.net" is the primary CSRS, host "d-mipmdb-cfg-02.swi.srse.net" is the locked secondary CSRS. Please note, by proper scheduling of my backup I can prevent this error in my application. It's up to you whether you like to investigate this issue further. Of course, I will further support you in this case.
Best Regards [^mongos-exp.log.gz]
| ||||||||
| Comment by Edwin Zhou [ 20/Apr/21 ] | ||||||||
|
Hi wernfried.domscheit@sunrise.net, Thanks for following up with the added clarity. It's certainly strange that these commands are only timing out intermittently. Could you provide some more information for me?
We look forward to hearing from you! Best, | ||||||||
| Comment by Wernfried Domscheit [ 20/Apr/21 ] | ||||||||
|
Hi Edwin I locked only one (out of 3) CSRS secondary, all the shards are also unlocked. I don't understand why a $merge command requires any write operation on CSRS, the target collections are existing already. Why does it fail when 2 out of 3 CSRS members are up and running unlocked? Just note again: most of the time these commands are running fine, the timeout occurs occasionally only. Best Regards
| ||||||||
| Comment by Edwin Zhou [ 19/Apr/21 ] | ||||||||
|
Hi wernfried.domscheit@sunrise.net, I was able to reproduce this error message by running db.fsyncLock() on all shard secondaries and CSRS secondaries. I suspect this behavior is intended when using db.fsyncLock() and attempting to perform a write using $merge. db.fsyncLock() will prevent additional writes until the user releases the lock with a corresponding db.fsyncUnlock() command. Since $merge is a write operation, we'd expect issues to arise when attempting to write after using db.fsyncLock(). Upon performing db.fsyncUnlock(), the error messages ceased. Does my description clear up this behavior? Best, |