[SERVER-23283] RangeDeleter does not log cursor ids correctly in deleteNow() Created: 22/Mar/16  Updated: 25/Jan/17  Resolved: 28/Mar/16

Status: Closed
Project: Core Server
Component/s: Diagnostics, Sharding
Affects Version/s: 3.0.9
Fix Version/s: 3.0.12, 3.2.5, 3.3.4

Type: Bug Priority: Major - P3
Reporter: Steffen Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File range_delete_stuck.log     File stack.log.gz    
Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Sharding 12 (04/01/16)
Participants:

 Description   

RangeDeleter currently uses the logCursorsWaiting helper to log the list of cursors it is waiting on. The helper method extracts this information from the cursorsToWait member of the RangeDeleteEntry. The problem is that the blocking method (deleteNow) keeps track of the cursors independently and does not update the cursors in the RangeDeleteEntry, so they are not logged properly:

https://github.com/mongodb/mongo/blob/r3.3.3/src/mongo/db/range_deleter.cpp#L320
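
For illustration, below is a minimal, simplified sketch of the pattern described above. It is not the actual server source from range_deleter.cpp; the type and function names mirror the ticket, but the bodies are invented to show the mismatch: the blocking path keeps the cursors it is waiting on in a local set, while the logging helper only reads RangeDeleteEntry::cursorsToWait, which that path never populates, so the log line always reports an empty id list.

#include <iostream>
#include <set>

typedef long long CursorId;

struct RangeDeleteEntry {
    std::set<CursorId> cursorsToWait;  // the only thing logCursorsWaiting() looks at
};

// Helper: logs whatever cursor ids are recorded on the entry itself.
void logCursorsWaiting(const RangeDeleteEntry& entry) {
    std::cout << "waiting for open cursors before removing range, cursor ids: [";
    for (std::set<CursorId>::const_iterator it = entry.cursorsToWait.begin();
         it != entry.cursorsToWait.end(); ++it) {
        std::cout << *it << " ";
    }
    std::cout << "]" << std::endl;
}

// Blocking path: keeps its own copy of the open cursors and never copies them
// into entry.cursorsToWait, so the helper above always prints "[]".
void deleteNow(RangeDeleteEntry& entry, std::set<CursorId> openCursors) {
    while (!openCursors.empty()) {
        logCursorsWaiting(entry);   // BUG: entry.cursorsToWait is still empty
        // ... sleep, re-poll which of openCursors are still open, and erase
        //     the ones that have gone away ...
        openCursors.clear();        // sketch only, to terminate the loop
    }
}

int main() {
    RangeDeleteEntry entry;
    std::set<CursorId> open;
    open.insert(7251362891LL);      // hypothetical open cursor over the range
    deleteNow(entry, open);         // prints "... cursor ids: []"
    return 0;
}

The linked commits presumably make the blocking path report the cursors it is actually tracking (either by populating entry.cursorsToWait or by logging from the local set); the sketch only illustrates the mismatch, not the actual patch.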

Original title: Chunk range deleter gets stuck with empty id list
Original description:

We are using the MongoDB balancer to balance chunks evenly across all shards with tag-aware sharding.

Since balancer deletes can hit our I/O very hard, we set waitForDelete to true.

Balancer settings
{ "_id" : "balancer", "_secondaryThrottle" : true, "_waitForDelete" : true, "stopped" : false, "activeWindow" : { "start" : "19:00", "stop" : "1:00" } }

This time we captured enough log entries to show when the balancer gets stuck on the range delete (attaching the complete log entries).

/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] doing delete inline for cleanup of chunk data
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] MigrateFromStatus::done About to acquire global lock to exit critical section
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:16:56.513+0000 I SHARDING [conn702709] waiting for open cursors before removing range [{ source: "AU:2073920479" }, { source: "AU:2073975491" }) in database001.collection1, cursor ids: []
/var/log/mongodb/mongodb-shard.log-20160316.gz:2016-03-15T19:31:57.011+0000 I SHARDING [conn702709] waiting for open cursors before removing range [{ source: "AU:2073920479" }, { source: "AU:2073975491" }) in database001.collection1, elapsed secs: 900, cursor ids: []

Interesting here is that the list of cursor ids is empty from the start.
This operation has now been running for more than 6 days and is blocking other migrations on this shard and, due to our balancer setting, the balancer itself.

{
        "inprog" : [
                {
                        "desc" : "conn702709",
                        "threadId" : "0x19c4aa40",
                        "connectionId" : 702709,
                        "opid" : "repset5:-1973023014",
                        "active" : true,
                        "secs_running" : 574207,
                        "microsecs_running" : NumberLong("574207823437"),
                        "op" : "query",
                        "ns" : "database001.collection1",
                        "query" : {
                                "$msg" : "query not recording (too large)"
                        },
                        "client_s" : "172.X.X.X:39474",
                        "msg" : "step 5 of 6",
                        "killPending" : true,
                        "numYields" : 527,
                        "locks" : {
 
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1792878),
                                                "w" : NumberLong(2),
                                                "R" : NumberLong(2)
                                        }
                                },
                                "MMAPV1Journal" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(899439),
                                                "w" : NumberLong(2)
                                        },
                                        "acquireWaitCount" : {
                                                "r" : NumberLong(5686)
                                        },
                                        "timeAcquiringMicros" : {
                                                "r" : NumberLong(357486150)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(896437),
                                                "w" : NumberLong(2)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "R" : NumberLong(896437),
                                                "W" : NumberLong(2)
                                        },
                                        "acquireWaitCount" : {
                                                "R" : NumberLong(3000)
                                        },
                                        "timeAcquiringMicros" : {
                                                "R" : NumberLong(110598818)
                                        }
                                }
                        }
                }
        ]
}



 Comments   
Comment by Ramon Fernandez Marina [ 28/Mar/16 ]

steffen, the bug in the logging of cursor ids has been fixed and backported to 3.0.11 and 3.2.5. While there's no tentative release date for 3.0.11 at the time of this writing, a 3.2.5-rc0 release candidate is scheduled for early April. If this is a critical issue for you, you may want to consider upgrading to 3.2.5.

Regards,
Ramón.

EDIT: this fix was moved from 3.0.11 to 3.0.12.

Comment by Githook User [ 28/Mar/16 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-23283 RangeDeleter does not log cursor ids correctly in deleteNow()

(cherry picked from commit 26603490725d969247044de4f36f487972264023)
Branch: v3.2
https://github.com/mongodb/mongo/commit/15520881fd4e961943df89aa35f79b4085d1653b

Comment by Githook User [ 28/Mar/16 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-23283 RangeDeleter does not log cursor ids correctly in deleteNow()

(cherry picked from commit 26603490725d969247044de4f36f487972264023)
Branch: v3.0
https://github.com/mongodb/mongo/commit/5a3b985bb4a26ba09a324bc2eed168613b70999c

Comment by Githook User [ 28/Mar/16 ]

Author:

{u'username': u'renctan', u'name': u'Randolph Tan', u'email': u'randolph@10gen.com'}

Message: SERVER-23283 RangeDeleter does not log cursor ids correctly in deleteNow()
Branch: master
https://github.com/mongodb/mongo/commit/26603490725d969247044de4f36f487972264023

Comment by Randolph Tan [ 22/Mar/16 ]

Hi,

I have found the problem. The chunk cleanup is waiting for some open cursors to go away, but it is not logging them correctly. I have updated the description to reflect the issue. There is currently no good way to get out of this situation other than restarting the server, unless you know which cursor it is waiting on (most likely one opened with no timeout, since mongod also cleans up cursors that have been idle for 10 minutes). There are also plans to add a new command that will list all cursors in the server (https://jira.mongodb.org/browse/SERVER-3090), which would help in situations like this.

Thanks!

Comment by Steffen [ 22/Mar/16 ]

Result of this command:

sudo gdb -ex "thread apply all bt"  --batch -p 142216 > stack.log

Comment by Randolph Tan [ 22/Mar/16 ]

Hi,

Alternatively, you can use this gdb command, which does not require attaching to the process interactively:

gdb -ex "thread apply all bt"  --batch -p <process id of the instance, which is stuck> > stack.log

Note that the command might need higher privileges, so you may need to run it with sudo.

Comment by Randolph Tan [ 22/Mar/16 ]

Hi,

Is it possible to attach gdb to the process and capture the backtrace for all threads? It would also be helpful if you could capture it a few times, around 3 to 5, with an interval of about a second. This will give us more of a clue as to where the chunk deleter is stuck.

Thanks!

Comment by Steffen [ 22/Mar/16 ]

Forgot to say that we sent a db.killOp("repset5:-1973023014") after 5 days, which didn't kill it.
