Core Server / SERVER-26735

Slow response to queries and no way to investigate it

    • Type: Bug
    • Resolution: Incomplete
    • Priority: Minor - P4
    • Affects Version/s: 3.2.10
    • Component/s: Querying, Sharding
    • Labels: None
    • Operating System: ALL

      I have an 8-shard cluster, with a 3-member replica set for each shard, and I'm running a query which affects each shard equally. The query targets an update for almost all elements in a collection, but it has to be a full COLLSCAN. Most of the shards finish the query very fast (<5 seconds), but randomly, one or two of the shards take a very long time to execute, even as much as 1-2 minutes. Whenever this issue appears, it hits a different shard than the last time. I've run db.currentOp() on the primary of the slow shard's replica set, and the only query running is the update. I've also tried mongotop, mongostat and even setProfilingLevel, but to no avail; I'm unable to find what is slowing me down. Running this command twice in succession, I've noticed that the number of yields grows very fast:

      rs6:PRIMARY> ct = 0; db.currentOp().inprog.forEach(function(op) { if (op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") { ct += 1; printjson(op); } }); ct
              ...
              "secs_running" : 26,
              ...
              "numYields" : 1461,
              "locks" : {
                      "Global" : "w",
                      "local" : "w",
                      "Database" : "w",
                      "Collection" : "w",
                      "Metadata" : "w"
              },
              "waitingForLock" : false,
              "lockStats" : {
                      "Global" : {
                              "acquireCount" : {
                                      "r" : NumberLong(7942),
                                      "w" : NumberLong(7942)
                              }
                      },
                      "Database" : {
                              "acquireCount" : {
                                      "w" : NumberLong(7942)
                              }
                      },
                      "Collection" : {
                              "acquireCount" : {
                                      "w" : NumberLong(1462)
                              }
                      },
                      "Metadata" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6480)
                              }
                      },
                      "oplog" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6480)
                              }
                      }
              }
      }
      1
      
      rs6:PRIMARY> ct = 0; db.currentOp().inprog.forEach(function(op) { if (op.secs_running > 0 && op.ns != "local.oplog.rs" && op.ns != "local.replset.minvalid") { ct += 1; printjson(op); } }); ct
              ...
              "secs_running" : 27,
              ...
              "numYields" : 1352,
              "locks" : {
                      "Global" : "w",
                      "local" : "w",
                      "Database" : "w",
                      "Collection" : "w",
                      "Metadata" : "w"
              },
              "waitingForLock" : false,
              "lockStats" : {
                      "Global" : {
                              "acquireCount" : {
                                      "r" : NumberLong(7513),
                                      "w" : NumberLong(7513)
                              }
                      },
                      "Database" : {
                              "acquireCount" : {
                                      "w" : NumberLong(7513)
                              }
                      },
                      "Collection" : {
                              "acquireCount" : {
                                      "w" : NumberLong(1353)
                              }
                      },
                      "Metadata" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6160)
                              }
                      },
                      "oplog" : {
                              "acquireCount" : {
                                      "w" : NumberLong(6160)
                              }
                      }
              }
      }
      1
      
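      The filter used in the shell one-liner above can be pulled into a small reusable helper. The sketch below is plain JavaScript (it also runs outside the mongo shell); `slowOps` and the sample array are illustrative names, not part of this report:

```javascript
// Return the subset of a db.currentOp().inprog array that represents
// user operations running for more than `minSecs`, skipping the
// internal replication namespaces that are always active on a primary.
function slowOps(inprog, minSecs) {
    var skip = { "local.oplog.rs": true, "local.replset.minvalid": true };
    return inprog.filter(function (op) {
        return op.secs_running > minSecs && !skip[op.ns];
    });
}

// Illustrative sample shaped like the output above:
var sample = [
    { ns: "local.oplog.rs", secs_running: 120 },
    { ns: "mydb.mycoll", secs_running: 26, numYields: 1461 },
    { ns: "local.replset.minvalid", secs_running: 5 }
];
var result = slowOps(sample, 0);  // keeps only the mydb.mycoll update
```

      Passing `db.currentOp().inprog` and a threshold of 0 reproduces what the one-liner prints, while letting you vary the threshold or the skip list in one place.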

      I have very good hardware and more than enough free RAM. I've noticed that the primaries of the slow shards tend to have much less free RAM than the others, so I decided to restart all the mongod processes in the cluster, and I had no problems for 2-3 days. After that, the issue appeared again. How can I investigate this issue further? What is causing the operation to yield? I've been restarting the servers for almost 10 days now. Thanks
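      One way to put a number on the yielding described above is to turn the raw counters into a rate. This is a hypothetical helper, not an existing shell command; note that in 3.2 a long write yields periodically by design, and additionally whenever it has to wait (e.g. for data to be read from disk), which would fit the low-free-RAM observation:

```javascript
// Yields per second for one entry from db.currentOp().inprog.
// A steady rate mostly reflects MongoDB's periodic cooperative
// yielding; a rate that climbs as free RAM falls suggests the scan
// is increasingly waiting on disk reads instead.
function yieldRate(op) {
    if (!op.secs_running) return 0;           // avoid divide-by-zero
    return op.numYields / op.secs_running;
}

// Values taken from the first sample above:
var rate = yieldRate({ secs_running: 26, numYields: 1461 });
// rate is roughly 56 yields/second
```

      Comparing this rate across the fast and slow shards, and against their free-RAM figures, would show whether the slow members are yielding disproportionately often.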

            Assignee:
            Unassigned
            Reporter:
            Tudor Aursulesei (thestick613)
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: