[SERVER-27009] Replication initial sync creates cursors with no timeout
Created: 11/Nov/16  Updated: 06/Dec/22  Resolved: 17/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.0, 3.2.0, 3.4.0
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Backlog - Replication Team
Resolution: Done
Votes: 1
Labels: dogfooding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-6036 Disable cursor timeout for cursors th... Closed
related to SERVER-31688 W SHARDING [conn161595] can't accept ... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

Both the cloner and oplog fetcher in replication initial sync use a cursor with no timeout:

2016-11-03T19:58:56.081+0000 I COMMAND  [conn47601] command buildlogs.logs command: find { find: "logs", noCursorTimeout: true, batchSize: 13981010 } planSummary: COLLSCAN cursorid:45904553724 keysExamined:0 docsExamined:822 numYields:14 nreturned:821 reslen:16750452 locks:{ Global: { acquireCount: { r: 30 } }, Database: { acquireCount: { r: 15 } }, Collection: { acquireCount: { r: 15 } } } protocol:op_command 447ms

Both components shut down gracefully and clean up the cursors they open. However, if the network fails or the secondary node crashes, these cursors are leaked and never get cleaned up.
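The leak can be illustrated with a toy model of an idle-cursor reaper. This is only a sketch, not MongoDB's implementation: the `Cursor` class and `reap_idle_cursors` function are hypothetical names invented for illustration, and the 600-second timeout assumes the default `cursorTimeoutMillis` of 10 minutes. The point is that once the client disappears without killing its cursors, a cursor opened with `noCursorTimeout: true` is never eligible for reaping, no matter how long it sits idle.

```python
from dataclasses import dataclass

# Assumed default idle timeout for server-side cursors (cursorTimeoutMillis = 600000 ms).
IDLE_TIMEOUT_SECS = 600


@dataclass
class Cursor:
    """Hypothetical stand-in for a server-side cursor."""
    cursor_id: int
    no_timeout: bool        # mirrors the noCursorTimeout flag on the find command
    last_used_secs: float   # time of the last find/getMore on this cursor


def reap_idle_cursors(cursors, now_secs, idle_timeout_secs=IDLE_TIMEOUT_SECS):
    """Return the cursors that survive one pass of a hypothetical idle reaper."""
    survivors = []
    for c in cursors:
        idle_for = now_secs - c.last_used_secs
        # A no-timeout cursor is kept regardless of how long it has been idle.
        if c.no_timeout or idle_for < idle_timeout_secs:
            survivors.append(c)
    return survivors


cursors = [
    Cursor(cursor_id=1, no_timeout=False, last_used_secs=0.0),
    Cursor(cursor_id=45904553724, no_timeout=True, last_used_secs=0.0),
]

# The client that opened both cursors crashed at t=0 and never killed them.
remaining = reap_idle_cursors(cursors, now_secs=499819)
leaked_ids = [c.cursor_id for c in remaining]
print(leaked_ids)  # [45904553724]
```

In this model only the ordinary cursor is reclaimed; the no-timeout cursor remains until something kills it explicitly.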

This is especially problematic with replica set shards, because having a cursor open on a sharded collection will eventually block migrations to that shard:

2016-11-09T16:09:06.572+0000 I SHARDING [RangeDeleter] waiting for open cursors before removing range [{ build_id: "337bc5b6432ea606a010e4c95a5e5f9a", test_id: ObjectId('57f3eb919041302d8b03ffdf'), seq: 1 }, { build_id: "337c88bdf0f88e7c95d9ba482d042e71", test_id: ObjectId('57d1b969be07c42b9805e57f'), seq: 2 }) in buildlogs.logs, elapsed secs: 499819, cursor ids: [45904553724]
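The two log lines above can be correlated mechanically. The following is a minimal sketch that assumes the log formats shown in this ticket; the helper names are hypothetical, the regular expressions are written against these two lines only, and the log strings are abbreviated copies (the range bounds are replaced with `[...)`).

```python
import re

# Abbreviated copies of the two log lines quoted in this ticket.
find_line = (
    'command: find { find: "logs", noCursorTimeout: true, batchSize: 13981010 } '
    'planSummary: COLLSCAN cursorid:45904553724'
)
range_deleter_line = (
    'waiting for open cursors before removing range [...) in buildlogs.logs, '
    'elapsed secs: 499819, cursor ids: [45904553724]'
)


def no_timeout_cursor_id(line):
    """Return the cursor id if the line logs a find with noCursorTimeout, else None."""
    if "noCursorTimeout: true" not in line:
        return None
    match = re.search(r"cursorid:(\d+)", line)
    return int(match.group(1)) if match else None


def blocking_cursors(line):
    """Return (elapsed_secs, cursor_ids) from a RangeDeleter wait message, else None."""
    match = re.search(r"elapsed secs: (\d+), cursor ids: \[([^\]]*)\]", line)
    if match is None:
        return None
    elapsed = int(match.group(1))
    ids = [int(part) for part in match.group(2).split(",") if part.strip()]
    return elapsed, ids


leaked = no_timeout_cursor_id(find_line)
elapsed, ids = blocking_cursors(range_deleter_line)
elapsed_days = round(elapsed / 86400, 2)

print(leaked in ids)   # True: the initial sync cursor is the one blocking the deleter
print(elapsed_days)    # 5.78
```

The cursor opened by the initial sync `find` is the same one the RangeDeleter reports, and it had been waiting for almost six days.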



 Comments   
Comment by Matthew Russotto [ 17/Apr/20 ]

Cloning uses exhaust cursors, except when disabled by a server parameter only used in tests, so it should no longer be an issue after SERVER-44699.

We have always used no-timeout cursors on the collection cloners; I don't know the original reason. Probably whatever it was could be handled by the resume mechanism now, but with exhaust cursors it isn't necessary.

Comment by Judah Schvimer [ 17/Apr/20 ]

matthew.russotto, do you know if this was intentional for the cloners, or if this is still a bug?

Comment by Lingzhi Deng [ 09/Apr/20 ]

The cursor cleanup issue (for exhaust cursors) no longer applies after SERVER-44699, where exhaust cursors are cleaned up on network failures. The cursors used by OplogFetcher should have a timeout. However, I think cloners still use cursors without a timeout.

Comment by Siyuan Zhou [ 09/Apr/20 ]

ldeng, do you think this is still a problem after the 4.4 projects for resumable initial sync and for using exhaust cursors for oplog fetching?

Generated at Thu Feb 08 04:13:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.