[SERVER-9200] Cannot Connect During Long Query Created: 01/Apr/13  Updated: 10/Dec/14  Resolved: 03/Jul/13

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.2.1, 2.2.3, 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Matthew Cross Assignee: Thomas Rueckstiess
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu Linux, OSX


Issue Links:
Duplicate
duplicates SERVER-9986 Refactor $geoNear searches into new q... Closed
Related
is related to SERVER-9983 Authenticating as internal user shoul... Closed
Operating System: ALL
Steps To Reproduce:

1. Start a mongod.
2. Set up auth.
3. Run a long query using the credentials of a user.
4. Attempt to connect with mongo during this time using the same DB and user as above.

Participants:

 Description   

In production we have noticed that we cannot connect to the DB while it is busy doing a long query. I tried to reproduce this locally and could not until I realized that we have auth turned on in production. When I enabled auth I also saw this behavior in my development environment. I CAN successfully connect to a DIFFERENT DB than the currently busy db.

I assume this is because auth requires a lock and the DB query is holding the lock.

I have not looked at the connection with wireshark to see if the connection completes and THEN the client blocks waiting to auth or whether there is never a connection established in the first place.

The upshot of what I'm saying is that the title of this ticket may be misleading.

In case it is important, the long-running query in question is a geo query.



 Comments   
Comment by Thomas Rueckstiess [ 21/Jun/13 ]

Matthew,

It took us a while, but we may have found the problem that is causing this. I've written up an explanation on SERVER-9986. Basically, if a long-running geo query with authentication enabled does not match documents for a long time, it causes the behavior you are seeing.

I wasn't able to reproduce this before because my geo queries would always match documents and yield.

If you are still experiencing this problem, it would be helpful to see whether your geo query is also not yielding for some time. If you turned the count from above into a find and attached the .explain() output, that would be very interesting to see. We'd expect a very low number of matched documents and a small number of yields.
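Something along these lines would do (a sketch only; the polygon below is a stand-in, please substitute the actual polygon from your count):

var poly = [ [ -99.09, 38.74 ], [ -91.30, 51.88 ], [ -117.90, 65.49 ], [ -144.50, 51.88 ], [ -99.09, 38.74 ] ]   // stand-in polygon, replace with the real one
db.contact.find({ "address.gisLocation": { $within: { $polygon: poly } } }).explain()
// in 2.2/2.4 the explain output's n, nscanned and nYields fields show how many
// documents matched and how often the query yielded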

Let me know if you are still able to reproduce the issue and can send us the explain.

Regards,
Thomas

Comment by Thomas Rueckstiess [ 18/Jun/13 ]

Hi Matthew,

Have you been able to go through the last comment my colleague David sent you? We haven't heard back from you in a while and would still need further information to continue the diagnosis.

If we can close this issue instead, please let us know.

Regards,
Thomas

Comment by David Hows [ 09/May/13 ]

Hi Matthew,

I've discussed your issue further with Thomas and I have a theory about what may be causing your issue.

Much of this will be related to performance on your system, so I would like to gather a performance baseline and see the impact that running the query has on your system.
Would you be able to collect the following statistics for a few minutes (5 should be plenty), then execute the query and record its execution time? Running the stats collection prior to executing the query is important, as it gives us a baseline to compare against. Let the statistics collection run until the query finishes and you are able to log in again. The stats I would like you to collect are:

  1. mongostat
  2. iostat -xmt 1

These should give us indicators of what is going on within your MongoD instance and in your system's CPU/disk subsystems.

Could you also attach a copy of the system.indexes collection for the database you run this query against?
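For example, from the shell (a sketch; center0 is taken from the log excerpt further down in this ticket, adjust if your query runs against a different database):

use center0
db.system.indexes.find().forEach(printjson)   // dumps every index definition for that database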

My suspicion is that the polygon query is consuming a significant amount of system resources and blocking other activity from occurring within the MongoD node while the query runs. This includes sending periodic heartbeats to other nodes. It also blocks all other queries from executing, which means the connections sending those queries would be left waiting as well. This would explain both the replica set issues you saw and the number of connections being established.
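(Not something we strictly need, but if you want to sanity-check this theory yourself, running something like the following from a shell that is already connected, while the geo query is in flight, should show the long-running query and anything queued behind the lock:)

db.currentOp().inprog.forEach(function (op) {
    if (op.secs_running >= 5 || op.waitingForLock) printjson(op);   // long-running ops and ops waiting on the lock
})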

Please let me know if you can collect these statistics so I can get an understanding on what is happening on the node at a low level.

Regards,
David

Comment by Matthew Cross [ 08/May/13 ]

Yes we are still having this problem. This is a replica set. It is NOT sharded. I did not open any other tickets.

Comment by Thomas Rueckstiess [ 02/May/13 ]

Hi Matthew,

We haven't heard back from you on this issue for a while. Did you go ahead and open a Community Private ticket? And are you still seeing the problems of not being able to connect during the geo queries?

Is this a sharded cluster or a single replica set?

Regards,
Thomas

Comment by Thomas Rueckstiess [ 11/Apr/13 ]

Hi Matthew,

At this point it would probably be best to inspect your production system logfiles. You can either attach them to this ticket (bear in mind that this is a public project, visible to anyone) or if you're concerned about privacy issues, you can open a new issue in our Community Private project. If you do, please refer to this ticket number: SERVER-9200.

We would like to see the logfile of the node that the geo query was running on, and to know the exact timeframe of the last connection issue. Other logs (for example from your application) for the same time window would be helpful as well, if they show any connection problems.

In your original description you provided a number of operating systems and versions. I think it's best if we focus on one specific environment for now, to reduce the number of unknown variables. What OS and MongoDB version is your production system running on?

Can you also confirm the driver and the version of the driver you're using to connect? Do you experience the issue in the same way from the driver and through the mongo shell?

Do you use MMS to monitor this replica set in production? If not, the output of mongostat during the time when the issue occurs would also be interesting to look at.

Once we've gathered all this information, we'll look through the logfile and see if anything stands out.

Regards,
Thomas

Comment by Matthew Cross [ 08/Apr/13 ]

Also, today in production we had this issue again. At the time we were unable to connect, mongo was being clobbered with geo queries that started off taking about 3-4s, but as time went on and the queries continued, the query time climbed to 100s or longer.

During this time there were no issues with the replica set or the number of connections/files open (under 200 according to /proc/<PID>/fd), or anything else that I could see in the log. It was simply very heavily loaded and connections did not happen.

So I can only suggest that you take a slow-ish geo query and run it in a fairly tight loop while trying to connect.
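Something like this is what I have in mind (illustrative only; the polygon below is a stand-in for the real one in my 04/Apr comment):

var poly = [ [ -99, 38 ], [ -91, 51 ], [ -117, 65 ], [ -144, 51 ], [ -99, 38 ] ]   // stand-in polygon
for (var i = 0; i < 100; i++) {
    var start = new Date();
    db.contact.count({ "address.gisLocation": { $within: { $polygon: poly } } });
    print("iteration " + i + " took " + (new Date() - start) + "ms");
}
// while this runs, keep trying to connect with a second mongo shell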

Comment by Matthew Cross [ 08/Apr/13 ]

On our QA box ulimits are 1024. In production they are unlimited so perhaps there are two different problems here.

However, I think the real issue is that connections escalate the way they do during this query and that the shell cannot connect (the connection count when the shell first tries to connect is only in the 300s); once the process hits the ulimit for open files, things seem to return to normal.

So I think this is a potentially more complex issue.

1. I fire the long geo query
2. Replicas can't talk to the server doing the query, vote it out as primary, and go crazy making connections until the ulimit is reached.
3. The ulimit is reached and/or the query completes (I'm not sure which) and things go back to normal.

Am I right? In other words I think raising the ulimit will simply make the problem window last LONGER. Just a theory.

Also on my laptop I have neither a replica set nor a ulimit problem and I still experience the inability to connect during that long geo query.

I'm sorry if I'm mixing different issues into the same ticket here. Please advise and I can file new issues as necessary.

// This is QA. I will check prod when I can.
ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 20
file size (blocks, -f) unlimited
pending signals (-i) 16382
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) unlimited
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

Comment by Thomas Rueckstiess [ 04/Apr/13 ]

Hi Matthew,

This line is indicative of too low ulimits:

Thu Apr 4 13:08:04.060 [initandlisten] connection refused because too many open connections: 819

MongoDB uses 80% of available file handles. I have a suspicion that those are set to 1024 on your system (1024*0.8 = 819.2).
Can you please get the output of ulimit -a and attach it in the comment to confirm?

Can you also check your maxConns setting and see what it is set to? You'd find it in your mongod.conf or as a command-line parameter on startup of your mongod. If it's not set, the default of 20000 applies.
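For example, from a shell that is already connected to the node:

db.serverStatus().connections               // { current: ..., available: ... } for this mongod
db.adminCommand({ getCmdLineOpts: 1 })      // shows maxConns if it was set at startup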

If we ruled these out we can then find a way to get your logfiles here securely for inspection and investigate further.

Thanks,
Thomas

Comment by Matthew Cross [ 04/Apr/13 ]

This is the exact query I ran in our QA environment. On my laptop I have different data but I run a similar large within->polygon query.

I tailed the mongod log on my laptop during the hang and I don't see anything.

On the QA system however I see some pretty horrible things going on.

There is some sensitive info in the log so I don't want to add it to this ticket, but basically as soon as I issue the query the replica set seems to have problems, connections start growing, and I cannot connect with another mongo. The query I sent only runs after all the replica set issues resolve.

I can send you the logs offline if that helps.

Here is a synopsis of what I see in the log...

I see myself authenticating. Then I see other activity unrelated to me.

Then the replica set starts having issues and I see a lot of activity saying that the other replicas think we are dead and another replica becomes the primary.

Then connections continue to grow, I assume from the other replicas connecting. I continue to see the "xyz thinks we are dead" messages during this time.

After several minutes of thrashing I see mongo saying it has too many open connections and it starts refusing them.

Once it starts refusing connections I finally see my query hit the log:

Thu Apr 4 13:08:04.060 [initandlisten] connection refused because too many open connections: 819
Thu Apr 4 13:08:04.543 [conn25264] command center0.$cmd command: { count: "contact", query: { address.gisLocation: { $within:

{ $polygon: [ [ -99.08820343017558, 38.7394878963234 ], [ -96.37745612886617, 41.17195559141593 ], [ -94.19658642284462, 43.79587996123102 ], [ -92.59929456215912, 46.51446201284167 ], [ -91.62491117351669, 49.2358776049858 ], [ -91.29742880976087, 51.87850271375808 ], [ -91.62491117351669, 54.37444870105993 ], [ -92.59929456215912, 56.67112749421409 ], [ -94.19658642284462, 58.73097272549451 ], [ -96.37745612886617, 60.52974294311186 ], [ -99.08820343017558, 62.05396442131694 ], [ -102.2620807317463, 63.29804847804755 ], [ -105.8209366409317, 64.26150093784622 ], [ -109.6771403140776, 64.94649457382098 ], [ -113.7357392187376, 65.35594562673927 ], [ -117.8967971801747, 65.49214493363145 ], [ -122.0578551416119, 65.35594562673927 ], [ -126.1164540462719, 64.94649457382135 ], [ -129.9726577194178, 64.26150093784662 ], [ -133.5315136286032, 63.29804847804797 ], [ -136.7053909301739, 62.05396442131735 ], [ -139.4161382314833, 60.52974294311232 ], [ -141.5970079375049, 58.73097272549499 ], [ -143.1942997981904, 56.67112749421457 ], [ -144.1686831868328, 54.37444870106044 ], [ -144.4961655505886, 51.8785027137592 ], [ -144.1686831868328, 49.23587760498638 ], [ -143.1942997981904, 46.51446201284228 ], [ -141.5970079375049, 43.79587996123168 ], [ -139.4161382314833, 41.17195559141661 ], [ -136.7053909301739, 38.73948789632409 ], [ -133.5315136286032, 36.59399375378268 ], [ -129.9726577194178, 34.82328056262398 ], [ -126.1164540462719, 33.50165998733384 ], [ -122.0578551416119, 32.68535325541297 ], [ -117.8967971801747, 32.40930335578322 ], [ -113.7357392187376, 32.68535325541297 ], [ -109.6771403140776, 33.5016599873331 ], [ -105.8209366409317, 34.82328056262325 ], [ -102.2620807317463, 36.59399375378197 ], [ -99.08820343017558, 38.7394878963234 ] ] }

} }, fields: {} } ntoreturn:1 keyUpdates:0 locks(micros) r:1466609019 reslen:48 1466609ms
Thu Apr 4 13:08:04.565 [conn17856] getMore: cursorid not found local.oplog.rs 799726737618179828

//HERE IS THE QUERY
db.contact.count({ "address.gisLocation" : { "$within" :

{ "$polygon" : [ [ -99.08820343017558 , 38.7394878963234] , [ -96.37745612886617 , 41.17195559141593] , [ -94.19658642284462 , 43.795879961231016] , [ -92.59929456215912 , 46.51446201284167] , [ -91.6249111735167 , 49.235877604985795] , [ -91.29742880976087 , 51.87850271375808] , [ -91.6249111735167 , 54.37444870105993] , [ -92.59929456215912 , 56.67112749421409] , [ -94.19658642284462 , 58.73097272549451] , [ -96.37745612886617 , 60.52974294311186] , [ -99.08820343017558 , 62.05396442131694] , [ -102.2620807317463 , 63.29804847804755] , [ -105.82093664093168 , 64.26150093784622] , [ -109.67714031407759 , 64.94649457382098] , [ -113.7357392187376 , 65.35594562673927] , [ -117.89679718017474 , 65.49214493363145] , [ -122.05785514161188 , 65.35594562673927] , [ -126.11645404627191 , 64.94649457382135] , [ -129.9726577194178 , 64.26150093784662] , [ -133.5315136286032 , 63.298048478047974] , [ -136.7053909301739 , 62.053964421317346] , [ -139.41613823148333 , 60.529742943112325] , [ -141.59700793750486 , 58.73097272549499] , [ -143.19429979819037 , 56.67112749421457] , [ -144.1686831868328 , 54.37444870106044] , [ -144.49616555058861 , 51.878502713759204] , [ -144.1686831868328 , 49.235877604986385] , [ -143.19429979819037 , 46.51446201284228] , [ -141.59700793750486 , 43.79587996123168] , [ -139.41613823148333 , 41.17195559141661] , [ -136.7053909301739 , 38.739487896324086] , [ -133.5315136286032 , 36.593993753782684] , [ -129.9726577194178 , 34.82328056262398] , [ -126.11645404627191 , 33.50165998733384] , [ -122.05785514161188 , 32.68535325541297] , [ -117.89679718017474 , 32.40930335578322] , [ -113.7357392187376 , 32.68535325541297] , [ -109.67714031407759 , 33.5016599873331] , [ -105.82093664093168 , 34.823280562623246] , [ -102.2620807317463 , 36.59399375378197] , [ -99.08820343017558 , 38.7394878963234]]}

}})

Comment by Thomas Rueckstiess [ 04/Apr/13 ]

Hi Matthew,

I tested this as well and could not reproduce the problem, even with authentication.

These are the steps I followed:

  1. I've started a mongod (single instance) with authentication and created a user on the test database
  2. I filled a collection with millions of coordinate points
  3. I ran a geo query ($geoWithin) on that collection, taking a long time to return (minutes)
  4. Meanwhile, I successfully logged in from a different console with the same user credentials to the same database and ran a separate command (a count on the same collection) which returned instantly
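In shell terms, the above boils down to roughly this (a sketch; the collection name, user name and data sizes are illustrative, and mongod was started with --auth):

use test
db.addUser("geoUser", "geoPassword")          // 2.2/2.4-era shell helper
db.auth("geoUser", "geoPassword")
db.points.ensureIndex({ loc: "2d" })
for (var i = 0; i < 5000000; i++) {           // millions of coordinate points
    db.points.insert({ loc: [ Math.random() * 360 - 180, Math.random() * 170 - 85 ] })
}
// session 1: a $geoWithin query that takes a long time to return on enough data
db.points.find({ loc: { $geoWithin: { $polygon: [ [ -50, -50 ], [ -50, 50 ], [ 50, 50 ], [ 50, -50 ], [ -50, -50 ] ] } } }).count()
// session 2 (separate mongo shell, same credentials, same database):
//   db.auth("geoUser", "geoPassword"); db.points.count()   // returned instantly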

There were no delays as you describe them.

Is there anything that is significantly different to your tests? Can you please let us know the exact steps you executed? For example, what kind of operations did you run?

Ideally, if you had a short JavaScript snippet with the commands executed in each session that reliably reproduces the issue, that would be very helpful.

Thanks,
Thomas

Generated at Thu Feb 08 03:19:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.