[SERVER-48283] Cursor iteration freezes after first batch if no batchLimit is set Created: 18/May/20  Updated: 21/Aug/20  Resolved: 28/Jul/20

Status: Closed
Project: Core Server
Component/s: Internal Client, Performance, Shell
Affects Version/s: 4.0.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Puiu Hrenciuc Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive mongodb-batches-issue.zip    
Issue Links:
Related
related to SERVER-35106 cursor iteration freeze after the 101... Closed
Operating System: ALL
Steps To Reproduce:

Don't know how to get to this behavior, but once you're there, can't really get read of it (100% reproduction).

Participants:

 Description   

Hello,

I have the following situation:

  • I have a GridFS bucket with 810,959 files (.files count) stored in 2,499,160 chunks (.chunks count)
  • I want to iterate over all entries in .chunks and print (just for sake of simplicity) the _id of each entry
  • for this I use the following command
  • db.sandboxReports.chunks.find({},{_id:1}).forEach(r=>print(r._id)) 

During the time the query executes, I have the following output from currentOp:

                {
                        "host" : "HOST_NAME_HERE:27017",
                        "desc" : "conn11650",
                        "connectionId" : 11650,
                        "client" : "127.0.0.1:48568",
                        "appName" : "MongoDB Shell",
                        "clientMetadata" : {
                                "application" : {
                                        "name" : "MongoDB Shell"
                                },
                                "driver" : {
                                        "name" : "MongoDB Internal Client",
                                        "version" : "4.0.18"
                                },
                                "os" : {
                                        "type" : "Linux",
                                        "name" : "Ubuntu",
                                        "architecture" : "x86_64",
                                        "version" : "18.04"
                                }
                        },
                        "active" : true,
                        "currentOpTime" : "2020-05-18T21:37:24.801+0000",
                        "opid" : 9652856,
                        "lsid" : {
                                "id" : UUID("ab8e0656-ddd0-496e-8c1b-6c95dc9e35b3"),
                                "uid" : BinData(0,"lrHY+QTs83BmclcR9MKKYcSvfFmsJ+jW+bP+OfRjiFM=")
                        },
                        "secs_running" : NumberLong(906),
                        "microsecs_running" : NumberLong(906036759),
                        "op" : "getmore",
                        "ns" : "DB_NAME.sandboxReports.chunks",
                        "command" : {
                                "getMore" : NumberLong("171272634643"),
                                "collection" : "sandboxReports.chunks",
                                "lsid" : {
                                        "id" : UUID("ab8e0656-ddd0-496e-8c1b-6c95dc9e35b3")
                                },
                                "$clusterTime" : {
                                        "clusterTime" : Timestamp(1589836937, 5),
                                        "signature" : {
                                                "hash" : BinData(0,"v8OUtyPICaqX5BTz+tZ3u/rAsV8="),
                                                "keyId" : NumberLong("6785026434701197326")
                                        }
                                },
                                "$db" : "sandboxReports"
                        },
                        "originatingCommand" : {
                                "find" : "sandboxReports.chunks",
                                "filter" : {                                },
                                "projection" : {
                                        "_id" : 1
                                },
                                "lsid" : {
                                        "id" : UUID("ab8e0656-ddd0-496e-8c1b-6c95dc9e35b3")
                                },
                                "$clusterTime" : {
                                        "clusterTime" : Timestamp(1589816462, 13),
                                        "signature" : {
                                                "hash" : BinData(0,"ZTJm56Md0AIWuKYHN/EiyyopmLY="),
                                                "keyId" : NumberLong("6785026434701197326")
                                        }
                                },
                                "$db" : "sandboxReports"
                        },
                        "planSummary" : "COLLSCAN",
                        "numYields" : 42856,
                        "locks" : {
                                "Global" : "r",
                                "Database" : "r",
                                "Collection" : "r"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(42857)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(42857)
                                        }
                                },
                                "Collection" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(42857)
                                        }
                                }
                        }
                },
 

Server is MongoDB 4.0.18 running on Ubuntu 18.04 setup in replica set mode.

I have tried to run the command on different servers from replica set

If I can help you with other information, please let me know.

Any help is greatly appreciated.

 

Regards,
Puiu



 Comments   
Comment by Dmitry Agranat [ 28/Jul/20 ]

Hi,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Dmitry Agranat [ 09/Jul/20 ]

Hi phrenciuc@bitdefender.com,

I do not believe the requested queries are useful after removing documents unless you still see the same issue.

Thanks,
Dima

Comment by Puiu Hrenciuc [ 06/Jul/20 ]

Hi,

Since then, I removed most of the documents in .chunks collection (migrated to S3), think running the above queries are still useful?

Thanks,

Puiu

Comment by Carl Champain (Inactive) [ 24/Jun/20 ]

Hi phrenciuc@bitdefender.com,

Sorry for the delay!
We are still investigating this ticket and would need some additional information. Can you please provide the output of explain("executionStats") for the following queries?

db.sandboxReports.chunks.find({},{_id:1}).explain("executionStats")
db.sandboxReports.chunks.find({}, {_id:1}).sort({_id:1}).explain("executionStats")
db.sandboxReports.chunks.find({},{files_id:1}).sort({_id:1}).explain("executionStats")
db.sandboxReports.chunks.find({},{files_id:1}).sort({files_id:1}).explain("executionStats")
db.sandboxReports.chunks.find({},{_id:0, files_id:1}).sort({files_id:1}).explain("executionStats")

Kind regards,
Carl

Comment by Puiu Hrenciuc [ 28/May/20 ]

See bellow

gridfs:PRIMARY> db.sandboxReports.chunks.find({},{_id:1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "sandboxReports.sandboxReports.chunks",
                "indexFilterSet" : false,
                "parsedQuery" : {                },
                "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                                "_id" : 1
                        },
                        "inputStage" : {
                                "stage" : "COLLSCAN",
                                "direction" : "forward"
                        }
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "SERVER_NAME",
                "port" : 27017,
                "version" : "4.0.12",
                "gitVersion" : "5776e3cbf9e7afe86e6b29e22520ffb6766e95d4"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1590649007, 12),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1590649007, 12),
                "signature" : {
                        "hash" : BinData(0,"lIt7aEhadl8nKmCGOAsKpfIjAHU="),
                        "keyId" : NumberLong("6801136504841175046")
                }
        }
}
gridfs:PRIMARY> db.sandboxReports.chunks.find({}, {_id:1}).sort({_id:1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "sandboxReports.sandboxReports.chunks",
                "indexFilterSet" : false,
                "parsedQuery" : {                },
                "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                                "_id" : 1
                        },
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                        "_id" : 1
                                },
                                "indexName" : "_id_",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                        "_id" : [ ]
                                },
                                "isUnique" : true,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 1,
                                "direction" : "forward",
                                "indexBounds" : {
                                        "_id" : [
                                                "[MinKey, MaxKey]"
                                        ]
                                }
                        }
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "SERVER_NAME",
                "port" : 27017,
                "version" : "4.0.12",
                "gitVersion" : "5776e3cbf9e7afe86e6b29e22520ffb6766e95d4"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1590649034, 162),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1590649034, 162),
                "signature" : {
                        "hash" : BinData(0,"hY368bsU0JrqrTMp3eyh8moYBHM="),
                        "keyId" : NumberLong("6801136504841175046")
                }
        }
}
gridfs:PRIMARY> db.sandboxReports.chunks.find({},{files_id:1}).sort({_id:1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "sandboxReports.sandboxReports.chunks",
                "indexFilterSet" : false,
                "parsedQuery" : {                },
                "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                                "files_id" : 1
                        },
                        "inputStage" : {
                                "stage" : "FETCH",
                                "inputStage" : {
                                        "stage" : "IXSCAN",
                                        "keyPattern" : {
                                                "_id" : 1
                                        },
                                        "indexName" : "_id_",
                                        "isMultiKey" : false,
                                        "multiKeyPaths" : {
                                                "_id" : [ ]
                                        },
                                        "isUnique" : true,
                                        "isSparse" : false,
                                        "isPartial" : false,
                                        "indexVersion" : 1,
                                        "direction" : "forward",
                                        "indexBounds" : {
                                                "_id" : [
                                                        "[MinKey, MaxKey]"
                                                ]
                                        }
                                }
                        }
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "SERVER_NAME",
                "port" : 27017,
                "version" : "4.0.12",
                "gitVersion" : "5776e3cbf9e7afe86e6b29e22520ffb6766e95d4"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1590649052, 47),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1590649052, 47),
                "signature" : {
                        "hash" : BinData(0,"R1cU8ZuycmfV0m8kiSgpMTL+Aks="),
                        "keyId" : NumberLong("6801136504841175046")
                }
        }
}
gridfs:PRIMARY> db.sandboxReports.chunks.find({},{files_id:1}).sort({files_id:1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "sandboxReports.sandboxReports.chunks",
                "indexFilterSet" : false,
                "parsedQuery" : {                },
                "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                                "files_id" : 1
                        },
                        "inputStage" : {
                                "stage" : "FETCH",
                                "inputStage" : {
                                        "stage" : "IXSCAN",
                                        "keyPattern" : {
                                                "files_id" : 1,
                                                "n" : 1
                                        },
                                        "indexName" : "files_id_1_n_1",
                                        "isMultiKey" : false,
                                        "multiKeyPaths" : {
                                                "files_id" : [ ],
                                                "n" : [ ]
                                        },
                                        "isUnique" : true,
                                        "isSparse" : false,
                                        "isPartial" : false,
                                        "indexVersion" : 1,
                                        "direction" : "forward",
                                        "indexBounds" : {
                                                "files_id" : [
                                                        "[MinKey, MaxKey]"
                                                ],
                                                "n" : [
                                                        "[MinKey, MaxKey]"
                                                ]
                                        }
                                }
                        }
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "SERVER_NAME",
                "port" : 27017,
                "version" : "4.0.12",
                "gitVersion" : "5776e3cbf9e7afe86e6b29e22520ffb6766e95d4"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1590649088, 46),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1590649088, 46),
                "signature" : {
                        "hash" : BinData(0,"VYc4WpJbdjJBqLyWGjZumeTfvcU="),
                        "keyId" : NumberLong("6801136504841175046")
                }
        }
}
gridfs:PRIMARY> db.sandboxReports.chunks.find({},{_id:0, files_id:1}).sort({files_id:1}).explain()
{
        "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "sandboxReports.sandboxReports.chunks",
                "indexFilterSet" : false,
                "parsedQuery" : {                },
                "winningPlan" : {
                        "stage" : "PROJECTION",
                        "transformBy" : {
                                "_id" : 0,
                                "files_id" : 1
                        },
                        "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                        "files_id" : 1,
                                        "n" : 1
                                },
                                "indexName" : "files_id_1_n_1",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                        "files_id" : [ ],
                                        "n" : [ ]
                                },
                                "isUnique" : true,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 1,
                                "direction" : "forward",
                                "indexBounds" : {
                                        "files_id" : [
                                                "[MinKey, MaxKey]"
                                        ],
                                        "n" : [
                                                "[MinKey, MaxKey]"
                                        ]
                                }
                        }
                },
                "rejectedPlans" : [ ]
        },
        "serverInfo" : {
                "host" : "SERVER_NAME",
                "port" : 27017,
                "version" : "4.0.12",
                "gitVersion" : "5776e3cbf9e7afe86e6b29e22520ffb6766e95d4"
        },
        "ok" : 1,
        "operationTime" : Timestamp(1590649107, 40),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1590649107, 40),
                "signature" : {
                        "hash" : BinData(0,"yNFO4irvrpj0WXJ0hrf5Ck82exM="),
                        "keyId" : NumberLong("6801136504841175046")
                }
        }
}
gridfs:PRIMARY> db.sandboxReports.chunks.getIndexes()
[
        {
                "v" : 1,
                "unique" : true,
                "key" : {
                        "files_id" : 1,
                        "n" : 1
                },
                "name" : "files_id_1_n_1",
                "ns" : "sandboxReports.sandboxReports.chunks",
                "background" : false
        },
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "name" : "_id_",
                "ns" : "sandboxReports.sandboxReports.chunks"
        }
]
 

Comment by Carl Champain (Inactive) [ 27/May/20 ]

Hi phrenciuc@bitdefender.com,

Thank you for the logs and diagnostic.data.

We would need a few more things from you to help us understand what's happening with your queries. Can you please provide:

  1. the explain output for the following queries:

db.sandboxReports.chunks.find({},{_id:1}).explain()
db.sandboxReports.chunks.find({}, {_id:1}).sort({_id:1}).explain()
db.sandboxReports.chunks.find({},{files_id:1}).sort({_id:1}).explain()
db.sandboxReports.chunks.find({},{files_id:1}).sort({files_id:1}).explain()
db.sandboxReports.chunks.find({},{_id:0, files_id:1}).sort({files_id:1}).explain()

      2. the getIndexes output for the sandboxReports.chunks collection:

db.sandboxReports.chunks.getIndexes()

Kind regards,
Carl  

Comment by Puiu Hrenciuc [ 21/May/20 ]

Hi Carl,

So, I ran the first command from the issue, meaning:

 db.sandboxReports.chunks.find({},{_id:1}).forEach(r=>print(r._id))

I captured all lines in mongod.log related to that specific MongoShell connection (mongod.log in attached file). As you can see, first batch is returned ~1 sec, second in over 2h, and 3rd in ~20m. I also have attached content from diagnostic.data/ folder for today. We do not use sharding, so there is no mongos.log file.

Let me know if there is anything else I may help you with.

 

Regards,
Puiu

Comment by Carl Champain (Inactive) [ 20/May/20 ]

Hi phrenciuc@bitdefender.com,

Thank you for the report.
To help us investigate this issue, can you please:

  1. Archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket.
  2. Attach the mongod.log and mongos.log files covering this behavior.

Kind regards,
Carl

Comment by Puiu Hrenciuc [ 19/May/20 ]

Some more updates:

So, thinking I have found the solution, I tried to run the query with a slight modification (since I needed another field from db, rather than _id), so I did:

 db.sandboxReports.chunks.find({},{files_id:1}).sort({_id:1}).forEach(r=>print(r.files_id))

what do you know? It failed graciously. Same behavior. Stuck after first batch.

I was thinking that it is important to select only fields that are in the index used for sorting, so I tried:

db.sandboxReports.chunks.find({},{files_id:1}).sort({files_id:1}).forEach(r=>print(r.files_id))

No luck. Same behavior after first batch. Then another idea came to my mind: what if this happens because the query also returns the id field which is not in the used index (the index being used is _

{ files_id:1, n:1 }

_ , so I decided to give another try and used this:

db.sandboxReports.chunks.find({},{_id:0, files_id:1}).sort({files_id:1}).forEach(r=>print(r.files_id))

And it worked perfectly! So maybe a proper description of the bug should be something like:

When iterating over a large collection, if you do not force an index by using sort AND if the projection does not include ONLY fields that are in the used index, the query stalls after each batch, due to a full collection scan.

Hope it helps.

Comment by Puiu Hrenciuc [ 19/May/20 ]

Some updates:

It seems that the cursor does not actually freeze, but the time it takes to load a new batch is huge (I missed this because I did not printed a timestamp in the previous tries), somewhere around 2h / batch.

In MongoDB Log I have found entries like this:

2020-05-19T04:50:15.940+0000 I COMMAND  [conn21966] command sandboxReports.sandboxReports.chunks appName: "MongoDB Shell" command: getMore { getMore: 1705
29766110, collection: "sandboxReports.chunks", lsid: { id: UUID("1e4e64b3-e524-4066-87a2-231a35c7b606") }, $clusterTime: { clusterTime: Timestamp(15898572
99, 4), signature: { hash: BinData(0, 8A8AAE686E9E29DF5A7A3DE27F82A23F51F6E6A8), keyId: 6785026434701197326 } }, $db: "sandboxReports" } originatingComman
d: { find: "sandboxReports.chunks", filter: {}, projection: { _id: 1.0 }, lsid: { id: UUID("1e4e64b3-e524-4066-87a2-231a35c7b606") }, $clusterTime: { clus
terTime: Timestamp(1589838338, 11), signature: { hash: BinData(0, 5F58F1E4F3F504753AFB2EEA7375B1692B4E04E8), keyId: 6785026434701197326 } }, $db: "sandbox
Reports" } planSummary: COLLSCAN cursorid:170529766110 keysExamined:0 docsExamined:562942 numYields:300856 nreturned:562942 reslen:16777399 locks:{ Global
: { acquireCount: { r: 300857 } }, Database: { acquireCount: { r: 300857 } }, Collection: { acquireCount: { r: 300857 } } } storage:{ data: { bytesRead: 1
25119702733, timeReadingMicros: 6481554692 } } protocol:op_msg 6489080ms

what intrigues me is the "storage" section which said it read 125,119,702,733 bytes and that took 6,481,554,692 microseconds (roughly the entire duration of operation is spent on disk i/o).

So I thought that for each batch the cursor does a full collection scan, so I tried this:

 db.sandboxReports.chunks.find({},{_id:1}).sort({_id:1}).forEach(r=>print(r._id))

and it works perfectly, iterating through the entire collection takes only 15 seconds ! That's the MongoDB I know

So, the problem seems to reproduce when:

  • you have a big collection
  • you are using the natural sort order
  • ... (TBD)

 

Generated at Thu Feb 08 05:16:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.