[SERVER-34747] find() and findOne() hangs when invoked from mongos Created: 30/Apr/18  Updated: 16/Nov/21  Resolved: 23/May/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sasa Skevin Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

1) Connect to mongos

2) Select the database

3) Invoke findOne()

(It hangs and does not return a result.)

 

The sequence that works:

1) Connect to the primary of any of the shards

2) Select the database

3) Invoke findOne()

(The result is returned immediately.)


 Description   

A MongoDB v3.6.4 deployment on Debian 9.4 (WiredTiger storage engine) with a large collection sharded across two shards hangs when findOne() or find() is issued without any parameters. The collection has only the default index on the _id key.

If findOne() or find() is invoked directly on any of the shards, it returns immediately. If it is invoked through a mongos, it hangs.

db.currentOp() shows the following after findOne() is invoked:

{
    "host" : "db1:27017",
    "desc" : "conn277",
    "connectionId" : 277,
    "client" : "10.240.137.10:57294",
    "appName" : "MongoDB Shell",
    "clientMetadata" : {
        "application" : {
            "name" : "MongoDB Shell"
        },
        "driver" : {
            "name" : "MongoDB Internal Client",
            "version" : "3.6.4"
        },
        "os" : {
            "type" : "Linux",
            "name" : "PRETTY_NAME=\"Debian GNU/Linux 9 (stretch)\"",
            "architecture" : "x86_64",
            "version" : "Kernel 4.9.0-6-amd64"
        },
        "mongos" : {
            "host" : "m1:27017",
            "client" : "10.240.0.0:38190",
            "version" : "3.6.4"
        }
    },
    "active" : true,
    "currentOpTime" : "2018-04-28T08:13:26.371+0200",
    "opid" : 2148656,
    "secs_running" : NumberLong(4),
    "microsecs_running" : NumberLong(4344989),
    "op" : "query",
    "ns" : "somedb.somecol",
    "command" : {
        "find" : "somecol",
        "limit" : NumberLong(1),
        "shardVersion" : [
            Timestamp(188696, 1),
            ObjectId("5ac6b2abbd8bbc9f42f34a39")
        ],
        "$clusterTime" : {
            "clusterTime" : Timestamp(1524896001, 1),
            "signature" : {
                "hash" : BinData(0,"RuV/v6Qm7H9AvPMVRNH0jkdIwRM="),
                "keyId" : NumberLong("6540622458888650772")
            }
        },
        "$client" : {
            "application" : {
                "name" : "MongoDB Shell"
            },
            "driver" : {
                "name" : "MongoDB Internal Client",
                "version" : "3.6.4"
            },
            "os" : {
                "type" : "Linux",
                "name" : "PRETTY_NAME=\"Debian GNU/Linux 9 (stretch)\"",
                "architecture" : "x86_64",
                "version" : "Kernel 4.9.0-6-amd64"
            },
            "mongos" : {
                "host" : "m1:27017",
                "client" : "10.240.0.0:38190",
                "version" : "3.6.4"
            }
        },
        "$configServerState" : {
            "opTime" : {
                "ts" : Timestamp(1524896001, 1),
                "t" : NumberLong(3)
            }
        },
        "$db" : "feedback"
    },
    "planSummary" : "COLLSCAN",
    "numYields" : 803,
    "locks" : {
        "Global" : "r",
        "Database" : "r",
        "Collection" : "r"
    },
    "waitingForLock" : false,
    "lockStats" : {
        "Global" : {
            "acquireCount" : {
                "r" : NumberLong(1608)
            }
        },
        "Database" : {
            "acquireCount" : {
                "r" : NumberLong(804)
            }
        },
        "Collection" : {
            "acquireCount" : {
                "r" : NumberLong(804)
            }
        }
    }
},
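
The telling fields in the entry above are "planSummary" : "COLLSCAN", a non-trivial "secs_running", and a high "numYields" for a query with "limit" : 1. A minimal sketch of picking such entries out of the inprog array returned by db.currentOp() might look like the following. The helper name slowCollScans and the sample array are hypothetical; the code is plain JavaScript for Node.js, and it assumes that Number() converts the shell's NumberLong values as expected.

```javascript
// Returns a short summary of each active operation that has been running for at
// least minSecs seconds and is executing a full collection scan.
// inprog: the `inprog` array from db.currentOp() (plain objects in this sketch).
function slowCollScans(inprog, minSecs) {
  return inprog
    .filter(function (op) {
      return op.active === true &&
             op.planSummary === "COLLSCAN" &&
             Number(op.secs_running) >= minSecs;
    })
    .map(function (op) {
      return {
        opid: op.opid,
        ns: op.ns,
        secs: Number(op.secs_running),
        numYields: op.numYields
      };
    });
}

// Trimmed-down copy of the entry above, plus a hypothetical indexed query for contrast.
var sample = [
  { opid: 2148656, active: true, ns: "somedb.somecol", secs_running: 4,
    planSummary: "COLLSCAN", numYields: 803 },
  { opid: 1, active: true, ns: "somedb.other", secs_running: 0,
    planSummary: "IXSCAN { _id: 1 }", numYields: 0 }
];

var suspicious = slowCollScans(sample, 3);
console.log(suspicious);
```

Only the first sample entry matches, since the second is neither a collection scan nor long-running.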



 Comments   
Comment by Kelsey Schubert [ 23/May/18 ]

Thanks for the update!

Comment by Sasa Skevin [ 23/May/18 ]

After two weeks the cleanup finished, and find() and findOne() now work as expected.

Comment by Sasa Skevin [ 08/May/18 ]

Hi @Kelsey,

Well yes, we moved quite a lot of chunks in the last couple of weeks, and it seems that the majority of them became orphans.

We initiated orphan cleanup, but at the current speed it will take about two weeks to finish. I'll get back to you then.

Thanks,

Sasa

 

Comment by Kelsey Schubert [ 08/May/18 ]

Hi ssasa,

Thanks for reporting this issue. First, I'd like to clarify that MongoDB 3.6.4 is not supported on Debian 9. SERVER-29463 adds support for Debian 9, and the next minor release, 3.6.5, will include binaries built specifically for Debian 9.

That said, my suspicion is that there are a number of orphan documents in your sharded cluster. As a result, a findOne() must iterate through documents until it finds one that belongs to the shard on which it resides. This process delays the first batch returned from the mongod, resulting in the behavior that you describe.
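
To illustrate the suspected mechanism, here is a toy model in plain JavaScript (runnable in Node.js), not the server's actual implementation. It assumes a shard stores documents in key order, owns a single chunk range, and still holds orphans from migrated chunks. The names ownsKey and findOneOnShard and the sample data are hypothetical. The unfiltered case models a direct connection to the shard, on the assumption that such a connection applies no ownership filter.

```javascript
// Returns true if shardKey falls inside one of the chunk ranges [min, max)
// that the shard currently owns.
function ownsKey(ownedRanges, shardKey) {
  return ownedRanges.some(function (range) {
    return shardKey >= range.min && shardKey < range.max;
  });
}

// Scans documents in storage order and returns the first owned one,
// together with the number of documents examined along the way.
function findOneOnShard(storedDocs, ownedRanges) {
  var examined = 0;
  for (var i = 0; i < storedDocs.length; i++) {
    examined++;
    if (ownsKey(ownedRanges, storedDocs[i].key)) {
      return { doc: storedDocs[i], examined: examined };
    }
  }
  return { doc: null, examined: examined };
}

// Hypothetical shard: it now owns only keys in [1000, 2000) but still stores
// 1000 orphaned documents (keys 0..999) left behind by chunk migrations.
var storedDocs = [];
for (var k = 0; k < 1200; k++) {
  storedDocs.push({ key: k });
}
var ownedRanges = [{ min: 1000, max: 2000 }];

// Through mongos: orphans are filtered out, so every one of them is examined first.
var viaMongos = findOneOnShard(storedDocs, ownedRanges);
console.log("via mongos, examined:", viaMongos.examined);

// Direct connection: modeled as an unbounded range, so the first document matches.
var direct = findOneOnShard(storedDocs, [{ min: -Infinity, max: Infinity }]);
console.log("direct, examined:", direct.examined);
```

In this model the filtered scan examines all 1000 orphans plus one owned document, while the unfiltered scan returns after a single document. That matches the difference reported between mongos and a direct shard connection.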

Would you please execute the cleanupOrphaned command against your shards, taking care to note the potential performance implications if running on a production system, and let us know if it resolves the issue?
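
As a rough sketch of how that could be driven, the loop below follows the pattern in the 3.6 documentation for cleanupOrphaned: the command cleans one range per call and returns stoppedAtKey, which is fed back as startingFromKey until the field is no longer returned. The wrapper name cleanupAllOrphans is hypothetical. It accepts any object exposing adminCommand(), on the assumption that in practice this is db in a mongo shell connected directly to a shard primary (not to mongos).

```javascript
// Repeatedly runs cleanupOrphaned until the shard's whole key space is covered.
// adminDb: any object exposing adminCommand(cmd), e.g. `db` in a mongo shell
//          connected directly to a shard primary.
// ns:      full namespace of the sharded collection, e.g. "somedb.somecol".
// Returns the number of cleanupOrphaned calls that were made.
function cleanupAllOrphans(adminDb, ns) {
  var nextKey = {};   // an empty document starts from the lowest key
  var passes = 0;
  while (nextKey != null) {
    var result = adminDb.adminCommand({
      cleanupOrphaned: ns,
      startingFromKey: nextKey
    });
    if (result.ok != 1) {
      throw new Error("cleanupOrphaned failed: " + result.errmsg);
    }
    passes++;
    // stoppedAtKey is absent once the uppermost range has been processed,
    // which makes nextKey undefined and ends the loop.
    nextKey = result.stoppedAtKey;
  }
  return passes;
}
```

In a shell session on each shard primary this would be invoked as cleanupAllOrphans(db, "somedb.somecol"). Each call deletes documents, so on a production system it may be worth pausing between passes.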

Thank you,
Kelsey
