[SERVER-3157] Replica set becomes inaccessible and unstable after mapreduce job Created: 27/May/11  Updated: 30/Mar/12  Resolved: 04/Jun/11

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: thomas Assignee: Unassigned
Resolution: Cannot Reproduce Votes: 1
Labels: connection, mapreduce, php, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Replica set with 2 nodes (24 GB RAM) + 2 arbiters, all openSUSE 11.3, kernel 2.6.34, 64-bit


Attachments: Zip Archive logs_and_mongostat_complete.zip     Text File lx03_mongostat_cutted.txt     PNG File munin_graphs.png     HTML File overview_mongod_after_midnight.html     HTML File overview_mongod_normal.html     HTML File overview_mongod_today_morning.html    
Operating System: Linux
Participants:

 Description   

After we run a mapreduce job which updates thousands of records, the primary mongodb server becomes inaccessible. It was not possible to connect via a PHP web node or the local mongo shell. In a short time the server reached its connection limit (in normal operation we have around 10/s; after the mapreduce job they stepped up to > 13000; the PHP web nodes use non-persistent connections; see lx03_mongostat_cutted.txt). The 13000 connections were fully established but idle (see attachment overview_mongod_after_midnight.html).
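A spike like this is visible in the attached mongostat dump. As a minimal sketch of how one might flag it automatically, assuming a mongostat layout with a `conn` column (the column name and the sample lines below are assumptions, not taken from lx03_mongostat_cutted.txt):

```python
# Hypothetical helper to spot connection spikes in mongostat output.
# The header/sample lines are invented for illustration.

def find_connection_spikes(lines, threshold=1000):
    """Return (line_number, connections) pairs where the 'conn'
    column of a mongostat dump exceeds `threshold`."""
    spikes = []
    header = None
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        if "conn" in fields:          # header row defines column positions
            header = fields
            continue
        if header is None or len(fields) != len(header):
            continue                  # skip lines that don't match the header
        try:
            conn = int(fields[header.index("conn")])
        except ValueError:
            continue
        if conn > threshold:
            spikes.append((lineno, conn))
    return spikes


sample = [
    "insert query update delete conn time",
    "10     50    5      0      12    22:10:01",
    "300    80    900    0      13042 22:10:31",
]
print(find_connection_spikes(sample))  # [(3, 13042)]
```

Run against the real mongostat file, this would show exactly when the idle connections began piling up relative to the mapreduce job.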

Our first action was to shut down the PHP webserver nodes. Connections jumped back to 10 and the system became accessible again.

Second action was to shut down the secondary and start the mapreduce job again. Everything ran smoothly, seemingly without problems. During the mapreduce job, used RAM increased steadily (see munin graphs). When the job was finished we started the secondary again. From here everything worked as expected, replaying the operations from the oplog. After a short sleep we saw in the morning that connections had jumped again, to 1000. So I decided to stop and start the current primary and let the secondary take over, to get a clean state again.

The attachments contain mongostats, munin graphs, mongodb.logs and the home view from mongodb's internal webserver. The munin graphs contain some gaps where the primary was inaccessible for gathering data. (Server lx03 is the primary, lx04 the secondary.)

In the past we had seen this situation once or twice per month, probably triggered by a cron job starting another mapreduce operation. But until yesterday we couldn't track it down.



 Comments   
Comment by thomas [ 03/Jun/11 ]

Currently we are unable to use another server for redis. It wouldn't matter if only performance went down for a moment. The bigger problems are the leaked connections (they are established but not used anymore - which could also be a PHP driver problem) and not being able to connect to the server at all during the block. But I know it's a special case with our setup and hard to reproduce without the traffic and load of a production environment. At the moment it seems our stripped-down solution is working. I still have to observe it and will inform you if the problem occurs again.

Comment by Eliot Horowitz (Inactive) [ 03/Jun/11 ]

It's possible that if something else on the box starts eating resources, mongo performance can spiral down as more connections pile on, etc...
Is it possible to split redis and mongo?

Comment by thomas [ 03/Jun/11 ]

Yesterday I found out the problems could be related to a redis instance running on the same server. Writing the redis dump file takes around 15 sec. In this time the CPU usage of mongodb increases rapidly and new connections to mongodb begin to leak or fail. After redis has finished its dump, the mongodb logfile suddenly shows a bunch of operations needing more than 100 ms to complete. When we shut down the redis instance, mongodb runs fine.
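The cluster of slow operations after each redis dump could be pulled out of the logfile and lined up against the dump window. A minimal sketch, assuming the 1.8-era mongod log format where a slow operation line ends in "&lt;duration&gt;ms" (the sample lines are invented, not taken from the attached mongodb.logs):

```python
import re

# Hypothetical helper to extract slow operations from a mongod logfile.
# Assumes slow-op lines end in "<duration>ms", as in 1.8-era logs.

SLOW_OP = re.compile(r"(\d+)ms\s*$")

def slow_ops(lines, min_ms=100):
    """Yield (duration_ms, line) for log lines at or above `min_ms`."""
    for line in lines:
        m = SLOW_OP.search(line)
        if m and int(m.group(1)) >= min_ms:
            yield int(m.group(1)), line

# Invented sample lines imitating the old log format:
sample_log = [
    "Thu Jun  2 00:12:01 [conn8123] query app.items ntoreturn:1 reslen:96 2ms",
    "Thu Jun  2 00:12:15 [conn8124] update app.items 482ms",
    "Thu Jun  2 00:12:16 [conn8125] getmore app.items 131ms",
]
for ms, line in slow_ops(sample_log):
    print(ms, line)
```

Bucketing the timestamps of these slow ops against the redis save times would make the correlation with the 15-second dump visible at a glance.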

The server has one partition only, with a hardware RAID10 on SAS disks. As there is enough memory (24 GB) we thought it would be no problem to run a redis instance beside mongodb. Also, the disk- and system-related munin graphs show no obvious problems in this direction, I think.

We have now changed the application to reduce the number of connections per second made to mongodb - no problems have occurred since the change, despite running mongodb and redis in parallel. But unfortunately we cannot use it as heavily as we want.

Generated at Thu Feb 08 03:02:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.