[SERVER-9200] Cannot Connect During Long Query Created: 01/Apr/13 Updated: 10/Dec/14 Resolved: 03/Jul/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.2.1, 2.2.3, 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Matthew Cross | Assignee: | Thomas Rueckstiess |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu Linux, OSX |
| Issue Links: |
|
| Operating System: | ALL |
| Steps To Reproduce: | 1. Start a mongod. |
| Participants: | |
| Description |
|
In production we have noticed that we cannot connect to the DB while it is busy doing a long query. I tried to reproduce this locally and could not, until I realized that we have auth turned on in production. When I enabled auth I also saw this behavior in my development environment. I CAN successfully connect to a DIFFERENT DB than the currently busy one. I assume this is because auth requires a lock and the long query is holding that lock. I have not looked at the connection with Wireshark to see whether the connection completes and THEN the client blocks waiting to auth, or whether a connection is never established in the first place. The upshot is that the title of this ticket may be misleading. In case it is important, the long-running query in question is a geo query. |
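The lock hypothesis in this description can be sketched as a toy simulation, assuming one lock per database and auth needing that same lock to read credentials. This is an illustration of the proposed mechanism only, not MongoDB's actual locking code:

```python
import threading
import time

# One lock per database -- an assumption for illustration; 2.2-era
# MongoDB actually used a mix of global and per-DB locking.
db_locks = {"busy_db": threading.Lock(), "other_db": threading.Lock()}

def long_query(db):
    with db_locks[db]:      # the geo query holds the DB lock...
        time.sleep(0.5)     # ...for its entire runtime

def authenticate(db, timeout):
    # auth needs the same lock to read the database's credential data
    if db_locks[db].acquire(timeout=timeout):
        db_locks[db].release()
        return True
    return False

t = threading.Thread(target=long_query, args=("busy_db",))
t.start()
time.sleep(0.1)             # let the query grab the lock first

blocked = authenticate("busy_db", timeout=0.1)    # stalls, then gives up
other_ok = authenticate("other_db", timeout=0.1)  # a different DB connects fine
t.join()

print(blocked, other_ok)    # False True
```

This reproduces the reported symptom shape: auth against the busy database hangs, while the same auth against any other database succeeds immediately.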
| Comments |
| Comment by Thomas Rueckstiess [ 21/Jun/13 ] | |
|
Matthew, It took us a while, but we may have found the problem that is causing this; I've written up an explanation on I wasn't able to reproduce this before because my geo queries would always match documents and yield. If you are still experiencing this problem, it would be helpful to confirm that your geo query is also not yielding for some time. If you turned the count from above into a find and attached an .explain(), that would be very interesting to see: we'd expect a very low number of matched documents and a small number of yields. Let me know if you are still able to reproduce the issue and can send us the explain. Regards, | |
| Comment by Thomas Rueckstiess [ 18/Jun/13 ] | |
|
Hi Matthew, Have you been able to go through the last comment my colleague David sent you? We haven't heard back from you in a while, and we still need further information to continue the diagnosis. If this is no longer a problem and we can close this issue instead, please let us know. Regards, | |
| Comment by David Hows [ 09/May/13 ] | |
|
Hi Matthew, I've discussed your issue further with Thomas and I have a theory about what may be causing it. Much of this relates to performance on your system, so I would like to gather a performance baseline and see the impact that running the query has on your system.
These should give indicators of what is going on within your MongoD instance and in your system's CPU/disk subsystems. Could you also attach a copy of the system.indexes collection for the database you run this query against? My suspicion is that the polygon query is consuming a significant amount of system resources and blocking other activity from occurring within the MongoD node while the query runs. This includes sending periodic heartbeats to other nodes. It also blocks all other queries from executing, which means the connections sending those queries would be waiting to execute as well. This would explain both the replica set issues you saw and the number of connections being established. Please let me know if you can collect these statistics so I can get an understanding of what is happening on the node at a low level. Regards, | |
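David's theory above (one long operation starving both heartbeat replies and other queries) can be sketched as a toy simulation. The single coarse lock and the 0.2 s heartbeat timeout below are assumptions for illustration, not MongoDB's actual replication code:

```python
import threading
import time

# Toy model: a single coarse lock serializes all server activity,
# so heartbeat replies stop while a long query holds it.
server_lock = threading.Lock()
last_heartbeat = time.monotonic()
stop = threading.Event()

def heartbeat_responder():
    global last_heartbeat
    while not stop.is_set():
        with server_lock:                # blocked while the query holds the lock
            last_heartbeat = time.monotonic()
        time.sleep(0.02)

hb = threading.Thread(target=heartbeat_responder)
hb.start()
time.sleep(0.1)                          # heartbeats flowing normally

with server_lock:                        # the long geo query arrives
    time.sleep(0.4)                      # no heartbeats can be answered here
    silence = time.monotonic() - last_heartbeat

stop.set()
hb.join()

# A peer using a ~0.2s heartbeat timeout would declare this node dead:
print(silence > 0.2)    # True
```

The heartbeat gap grows to roughly the length of the blocking operation, which is the mechanism that would make the other replicas report "xyz thinks we are dead" and elect a new primary.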
| Comment by Matthew Cross [ 08/May/13 ] | |
|
Yes we are still having this problem. This is a replica set. It is NOT sharded. I did not open any other tickets. | |
| Comment by Thomas Rueckstiess [ 02/May/13 ] | |
|
Hi Matthew, We haven't heard back from you on this issue for a while. Did you go ahead and open a Community Private ticket? And are you still seeing the problems of not being able to connect during the geo queries? Is this a sharded cluster or a single replica set? Regards, | |
| Comment by Thomas Rueckstiess [ 11/Apr/13 ] | |
|
Hi Matthew, At this point it would probably be best to inspect your production system logfiles. You can either attach them to this ticket (bear in mind that this is a public project, visible to anyone) or, if you're concerned about privacy, you can open a new issue in our Community Private project. If you do, please refer to this ticket number: SERVER-9200. We would like to see the logfile of the node that the geo query was running on, and to know the exact timeframe of the last connection issue. Other logs (for example from your application) for the same time window would be helpful as well, if they show any connection problems. In your original description you listed several operating systems and versions. I think it's best if we focus on one specific environment for now, to reduce the number of unknown variables. What OS and MongoDB version is your production system running on? Can you also confirm the driver, and the driver version, you're using to connect? Do you experience the issue in the same way from the driver and through the mongo shell? Do you use MMS to monitor this replica set in production? If not, the output of mongostat while the issue occurs would also be interesting to look at. Once we've gathered all this information, we'll look through the logfile and see if anything stands out. Regards, | |
| Comment by Matthew Cross [ 08/Apr/13 ] | |
|
Also, today in production we had this issue again. At the time of the inability to connect, mongod was being clobbered with geo queries that started off taking about 3-4s, but as time went on and the queries continued, the query time went up to 100s or longer. During this time there were no issues with the replica set or with the number of connections/files open (under 200 according to /proc/<PID>/fd), or anything else that I could see in the log. It was simply very heavily loaded and connections did not happen. So I can only suggest that you take a slow-ish geo query and run it in a fairly tight loop while trying to connect. | |
| Comment by Matthew Cross [ 08/Apr/13 ] | |
|
On our QA box ulimits are 1024. In production they are unlimited, so perhaps there are two different problems here. However, I think the real issue is that connections escalate the way they do during this query, and that the shell cannot connect (the connection count when the shell first tries to connect is only in the 300s); once the process hits the ulimit for open files, the connection counts seem to return to normal. So I think this is a potentially more complex issue. 1. I fire the long geo query Am I right? In other words, I think raising the ulimit will simply make the problem window last LONGER. Just a theory. Also, on my laptop I have neither a replica set nor a ulimit problem, and I still experience the inability to connect during that long geo query. I'm sorry if I'm mixing different issues into the same ticket here. Please advise and I can file new issues as necessary. // This is QA. I will check prod when I can. | |
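Matthew's theory here (a higher ulimit only lengthens the problem window) can be put into back-of-envelope numbers. The ~300-connection baseline is from his observation above; the arrival rate of 10 connections/s is a made-up illustrative assumption:

```python
# If connections pile up at a steady rate while the query blocks
# everything, a larger ulimit only delays the point of refusal.
# baseline and rate_per_sec are illustrative numbers, not measurements.
def seconds_until_refused(ulimit, baseline=300, rate_per_sec=10):
    budget = int(ulimit * 0.8) - baseline   # 80%-of-handles rule from the 04/Apr comment
    return budget / rate_per_sec

print(seconds_until_refused(1024))    # 51.9   -> refusals after ~52s on the QA box
print(seconds_until_refused(20000))   # 1570.0 -> a bigger limit, a longer window
```

Under these assumptions the window before "too many open connections" scales linearly with the ulimit, which is exactly the "raising the ulimit makes the window last LONGER" prediction.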
| Comment by Thomas Rueckstiess [ 04/Apr/13 ] | |
|
Hi Matthew, This line is indicative of ulimits that are too low:
MongoDB uses 80% of available file handles. I have a suspicion that those are set to 1024 on your system (1024 * 0.8 = 819.2). Can you also check your maxConns setting? You'd find it in your mongod.conf or as a command-line parameter on startup of your mongod. If it's not set, the default of 20000 applies. Once we've ruled these out, we can find a way to get your logfiles here securely for inspection and investigate further. Thanks, |
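The arithmetic above combines with the maxConns cap mentioned in the same comment; a minimal sketch, assuming the effective ceiling is simply the lower of the two values and that the 80% figure is truncated to an integer:

```python
# Effective connection ceiling: the lower of maxConns (default 20000)
# and 80% of the file-handle ulimit, per the comment above.
# The exact truncation behavior is an assumption for illustration.
def effective_conn_limit(ulimit_nofile, max_conns=20000):
    return min(max_conns, int(ulimit_nofile * 0.8))

print(effective_conn_limit(1024))       # 819   -- a ulimit of 1024 explains the logs
print(effective_conn_limit(1_000_000))  # 20000 -- capped by maxConns instead
```

A result of 819 for a 1024 ulimit matches the "too many open connections: 819" message quoted elsewhere in this ticket, which is what points the diagnosis at the ulimit rather than maxConns.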
| Comment by Matthew Cross [ 04/Apr/13 ] | |
|
This is the exact query I ran in our QA environment. On my laptop I have different data, but I run a similar large within->polygon query. I tailed the mongod log on my laptop during the hang and I don't see anything. On the QA system, however, I see some pretty horrible things going on. There is some sensitive info in the log so I don't want to add it to this ticket, but basically as soon as I make the query the replica set seems to have problems, connections start growing, and I cannot connect with another mongo shell. The query I sent only runs after all the replica set issues resolve. I can send you the logs offline if that helps. Here is a synopsis of what I see in the log: I see myself authenticating. Then I see other activity unrelated to me. Then the replica set starts having issues and I see a lot of activity saying that the other replicas think we are dead, and another replica becomes the primary. Then connections continue to grow, I assume from the other replicas connecting. I continue to see the "xyz thinks we are dead" messages during this time. After several minutes of thrashing I see mongo saying it has too many open connections and it starts refusing them. Once it starts refusing connections I finally see my query hit the log:
Thu Apr 4 13:08:04.060 [initandlisten] connection refused because too many open connections: 819
} }, fields: {} } ntoreturn:1 keyUpdates:0 locks(micros) r:1466609019 reslen:48 1466609ms //HERE IS THE QUERY }}) | |
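One detail worth pulling out of the quoted slow-query line: the `locks(micros) r:` value and the total duration agree almost exactly, which is consistent with the query holding the read lock for essentially its entire run. A quick check:

```python
import re

# Tail of the slow-query log line quoted in the comment above.
line = "ntoreturn:1 keyUpdates:0 locks(micros) r:1466609019 reslen:48 1466609ms"

read_lock_s = int(re.search(r"r:(\d+)", line).group(1)) / 1_000_000
total_s = int(re.search(r"(\d+)ms", line).group(1)) / 1000

print(round(read_lock_s))   # 1467 -- ~24.4 minutes holding the read lock
print(round(total_s))       # 1467 -- i.e. the lock was held for the whole query
```

A read lock held continuously for ~24 minutes would account for every symptom in this synopsis: blocked auth, missed heartbeats, a failover, and connections piling up until the 819-handle limit was hit.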
| Comment by Thomas Rueckstiess [ 04/Apr/13 ] | |
|
Hi Matthew, I tested this as well and could not reproduce the problem, even with authentication. These are the steps I followed:
There were no delays as you describe them. Is there anything that is significantly different in your tests? Can you please let us know the exact steps you executed? For example, what kind of operations did you run? Ideally, if you had a short JavaScript snippet with the commands executed in either session that reliably reproduces the issue, that would be very helpful. Thanks, |