[SERVER-26735] Slow response to queries and no way to investigate it Created: 22/Oct/16  Updated: 05/Mar/17  Resolved: 19/Nov/16

Status: Closed
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: 3.2.10
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Tudor Aursulesei Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

I have an 8-shard cluster, with a 3-member replica set for each shard, and I'm running a query which affects each shard equally. The query is an update targeting almost all documents in a collection, so it has to be a full COLLSCAN. Most of the shards finish the query very fast (<5 seconds), but randomly one or two of the shards take a very long time to execute it, as much as 1-2 minutes. Whenever this issue appears, it takes place on a different shard than the last time. I've run db.currentOp() on the primary of the affected shard's replica set, and the only query running is the update. I've also tried mongotop, mongostat and even setProfilingLevel, but to no avail; I'm unable to find what is slowing me down. Running the command below twice, I've noticed that the number of yields grows very fast:

rs6:PRIMARY> ct = 0 ;db.currentOp().inprog.forEach(   function(op) {     if(op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") {ct += 1; printjson(op);}   } ); ct
        ...
        "secs_running" : 26,
        ...
        "numYields" : 1461,
        "locks" : {
                "Global" : "w",
                "local" : "w",
                "Database" : "w",
                "Collection" : "w",
                "Metadata" : "w"
        },
        "waitingForLock" : false,
        "lockStats" : {
                "Global" : {
                        "acquireCount" : {
                                "r" : NumberLong(7942),
                                "w" : NumberLong(7942)
                        }
                },
                "Database" : {
                        "acquireCount" : {
                                "w" : NumberLong(7942)
                        }
                },
                "Collection" : {
                        "acquireCount" : {
                                "w" : NumberLong(1462)
                        }
                },
                "Metadata" : {
                        "acquireCount" : {
                                "w" : NumberLong(6480)
                        }
                },
                "oplog" : {
                        "acquireCount" : {
                                "w" : NumberLong(6480)
                        }
                }
        }
}
1
 
rs6:PRIMARY> ct = 0 ;db.currentOp().inprog.forEach(   function(op) {     if(op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") {ct += 1; printjson(op);}   } ); ct
        ...
        "secs_running" : 27,
        ...
        "numYields" : 1352,
        "locks" : {
                "Global" : "w",
                "local" : "w",
                "Database" : "w",
                "Collection" : "w",
                "Metadata" : "w"
        },
        "waitingForLock" : false,
        "lockStats" : {
                "Global" : {
                        "acquireCount" : {
                                "r" : NumberLong(7513),
                                "w" : NumberLong(7513)
                        }
                },
                "Database" : {
                        "acquireCount" : {
                                "w" : NumberLong(7513)
                        }
                },
                "Collection" : {
                        "acquireCount" : {
                                "w" : NumberLong(1353)
                        }
                },
                "Metadata" : {
                        "acquireCount" : {
                                "w" : NumberLong(6160)
                        }
                },
                "oplog" : {
                        "acquireCount" : {
                                "w" : NumberLong(6160)
                        }
                }
        }
}
1
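
As an aside (not part of the original report), the same check can be scripted so the yield rate is computed directly instead of eyeballed between two runs. This is a minimal mongo shell sketch, assuming the same fields shown in the output above (secs_running, numYields, opid) and an arbitrary 10-second sampling interval:

// Sample currentOp twice and estimate yields/second for long-running ops.
function longRunningOps() {
    return db.currentOp().inprog.filter(function(op) {
        return op.secs_running > 0 &&
               op.ns != "local.oplog.rs" &&
               op.ns != "local.replset.minvalid";
    });
}

var first = {};
longRunningOps().forEach(function(op) { first[op.opid] = op.numYields; });

sleep(10 * 1000);  // wait 10 seconds between samples

longRunningOps().forEach(function(op) {
    // Only compare ops that were already running in the first sample.
    if (op.opid in first) {
        print(op.opid + " " + op.ns + ": " +
              (op.numYields - first[op.opid]) / 10 + " yields/sec");
    }
});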

I have very good hardware, with more than enough free RAM. I've noticed that the primaries of the slow shards tend to have much less free RAM than the others, so I decided to restart all the mongod processes in the cluster, and I had no problems for 2-3 days. After that, the issue appeared again. How can I investigate this issue further? What is causing the operation to yield? I've been restarting the servers for almost 10 days now. Thanks
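
Not from the original thread, but two follow-up checks that are often useful here, run on the primary of a slow shard while the update is in flight. This is a hedged mongo shell sketch: "mydb" is a placeholder database name, the 100 ms slow-op threshold is arbitrary, and the WiredTiger stat names are the ones reported by db.serverStatus() on 3.2-era builds:

var slowDb = db.getSiblingDB("mydb");

// 1. Record every operation slower than 100 ms in mydb.system.profile,
//    then inspect what the profiler captured for updates.
slowDb.setProfilingLevel(1, 100);

slowDb.system.profile.find({ op: "update" })
      .sort({ millis: -1 })
      .limit(5)
      .forEach(printjson);

// 2. Check WiredTiger cache usage; heavy cache pressure (used close to max,
//    large dirty fraction) can stall operations in eviction.
var cache = db.serverStatus().wiredTiger.cache;
print("cache used:  " + cache["bytes currently in the cache"]);
print("cache max:   " + cache["maximum bytes configured"]);
print("dirty bytes: " + cache["tracked dirty bytes in the cache"]);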



 Comments   
Comment by Tudor Aursulesei [ 05/Mar/17 ]

This was caused by using ext4. I switched to XFS and the problem went away.

https://jira.mongodb.org/browse/SERVER-22606

Comment by Kelsey Schubert [ 19/Nov/16 ]

Hi thestick613,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Thomas

Comment by Ramon Fernandez Marina [ 25/Oct/16 ]

thestick613, can you please upload the mongod.log and the contents of the diagnostic.data directory for one of the affected nodes during a period of slow queries?
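
(Editor's note, not from the original thread: both items live under the paths the mongod was started with, which can be read back from the affected node itself. A minimal mongo shell sketch, assuming the paths were set via config file or command-line options; the fallbacks are the usual defaults.)

var opts = db.adminCommand({ getCmdLineOpts: 1 }).parsed;
// Fall back to the built-in default dbPath if none was configured explicitly.
var dbPath  = (opts.storage && opts.storage.dbPath) || "/data/db";
var logPath = (opts.systemLog && opts.systemLog.path) || "(logging to stdout/syslog)";
print("diagnostic.data directory: " + dbPath + "/diagnostic.data");
print("mongod log: " + logPath);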

Thanks,
Ramón.
