[SERVER-55817] Request timed out while host is locked Created: 06/Apr/21  Updated: 23/Apr/21  Resolved: 22/Apr/21

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Wernfried Domscheit Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

I am not able to reproduce the error in reliable way. My application runs about 20 of these jobs in parallel, they are all the same, just the db and collection names differ. On 5-10 jobs it fails, the others are running fine.

Participants:

 Description   

I try to setup a backup procedure according to Back Up a Sharded Cluster with File System Snapshots

I have 3-member Config Replica Set with 4 shards. Each shard is a PSA Replica Set.

While backup my application is running of course. I run db.fsyncLock() on one Config SECONDARY. While the config RS member is locked my aggregation pipeline fails:

db.getSiblingDB('lau01mipmed03').getCollection('sessions#0.096-1500').aggregate([
   { '$match': { 'n': { '$ne': 'dummy' }, 't': { '$gte': ISODate('2021-04-06T13:00:00.000Z'), '$lt': ISODate('2021-04-06T19:00:00.000Z') } } },
   { '$unset': '_id' },
   { '$merge': { 'into': { 'db': 'lau01mipmed03', 'coll': 'sessions#0' } } }
], { 'allowDiskUse': true })

Error:

"Request 13963276 timed out, deadline was 2021-04-06T15:00:42.947+02:00, op was RemoteCommand 13963276 -- target:[d-mipmdb-cfg-01.xxx.xxxx.xxx:27019] db:config expDate:2021-04-06T15:00:42.947+02:00 cmd:{ find: \"collections\", filter: { _id: \"lau01mipmed03.sessions#0\" }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1617714012, 1), t: 2 } }, limit: 1, maxTimeMS: 30000 }"

Host d-mipmdb-cfg-01.xxx.xxxx.xxx is the locked Config SECONDARY.

Looks like, the error is not related to aggregation pipeline itself, the same error appears also on other commands, e.g.

db.getSiblingDB('t-mipmed-as-01').getCollection('sessions.096-1500.stats').aggregate([
   { '$project': { 'stats': 1 } },
   { '$unwind': '$stats' },
   { '$replaceRoot': { 'newRoot': '$stats' } },
   { '$merge': { 'into': { 'db': 'mip', 'coll': 'session.stats.raw' } } }
], { allowDiskUse: true })

 

 

 

 

 

 



 Comments   
Comment by Edwin Zhou [ 22/Apr/21 ]

Hi wernfried.domscheit@sunrise.net,

Thanks for following up. I will go ahead and close this ticket.

Best,
Edwin

Comment by Wernfried Domscheit [ 21/Apr/21 ]

Hi Edwin

 

Please see attached logfile. Yes, the aggregation runs on mongos.

My application logging is this one, it should help you to find relevant messages from logfile.

{
  "name" : "mergeSessions", 
  "command" : "db.getSiblingDB('pzur01mipmed0').getCollection('sessions#0.111-1000').aggregate([{'$match':{'n':{'$ne':'dummy'},'t':{'$gte':'2021-04-21T08:00:00.000Z','$lt':'2021-04-21T14:00:00.000Z'}}},{'$unset':'_id'},{'$merge':{'into':{'db':'pzur01mipmed0','coll':'sessions#0'}}}], {'allowDiskUse':true})", 
  "ts" : ISODate("2021-04-21T10:00:41.858+0200"), 
  "state" : "error", 
  "error" : "Request 107938050 timed out, deadline was 2021-04-21T10:00:41.851+02:00, op was RemoteCommand 107938050 -- target:[d-mipmdb-cfg-01.swi.srse.net:27019] db:config expDate:2021-04-21T10:00:41.851+02:00 cmd:{ find: \"collections\", filter: { _id: \"pzur01mipmed0.sessions#0\" }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1618992011, 21576), t: 4 } }, limit: 1, maxTimeMS: 30000 }", 
  "duration" : 30.13
}

Host "d-mipmdb-cfg-01.swi.srse.net" is the primary CSRS, host "d-mipmdb-cfg-02.swi.srse.net" is the locked secondary CSRS.

Please note, by proper scheduling of my backup I can prevent this error in my application. It's up to you whether you like to investigate this issue further. Of course, I will further support you in this case.

 

Best Regards
Wernfried

 [^mongos-exp.log.gz]

 

Comment by Edwin Zhou [ 20/Apr/21 ]

Hi wernfried.domscheit@sunrise.net,

Thanks for following up with the added clarity. It's certainly strange that these commands are only timing out intermittently. Could you provide some more information for me?

  1. Are you running your aggregation on the mongos, and are you finding these log lines in the mongos.log?
  2. Can you share the log file that covers the log line you shared, and when the aggregation is run?
  3. If the log file doesn't cover the version of MongoDB you're using, can you please provide that information as well?

We look forward to hearing from you!

Best,
Edwin

Comment by Wernfried Domscheit [ 20/Apr/21 ]

Hi Edwin

I locked only one (out of 3) CSRS secondary, all the shards are also unlocked. I don't understand why a $merge command requires any write operation on CSRS, the target collections are existing already. Why does it fail when 2 out of 3 CSRS members are up and running unlocked?

Just note again: most of the time these commands are running fine, the timeout occurs occasionally only.

Best Regards
Wernfried

 

 

 

 

 

 

 

Comment by Edwin Zhou [ 19/Apr/21 ]

Hi wernfried.domscheit@sunrise.net,

I was able to reproduce this error message by running db.fsyncLock() on all shard secondaries and CSRS secondaries. I suspect this behavior is intended when using db.fsyncLock() and attempting to perform a write using $merge.

db.fsyncLock() will prevent additional writes until the user releases the lock with a corresponding db.fsyncUnlock() command. Since $merge is a write operation, we'd expect issues to arise when attempting to write after using db.fsyncLock().

Upon performing db.fsyncUnlock(), the error messages ceased.

Does my description clear up this behavior?

Best,
Edwin

Generated at Thu Feb 08 05:37:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.