[SERVER-4522] inserts take long time after several hours insert operation Created: 19/Dec/11  Updated: 11/Jul/16  Resolved: 17/Apr/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: kobevaliant Assignee: Tad Marshall
Resolution: Done Votes: 0
Labels: hang, insert
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

windows 2008
Intel Xeon E5504 @2GHz
4G RAM


Attachments: JPEG File dbstats.jpg     JPEG File status1.jpg     JPEG File status2.jpg     JPEG File vmmap.JPG    
Issue Links:
Related
is related to SERVER-2942 MapViewOfFileEx failed during large i... Closed
is related to SERVER-5663 Private pages accumulate in Windows R... Closed
is related to SERVER-6311 FlushViewOfFile failures can result i... Closed
Operating System: Windows
Participants:

 Description   

My application is write intensive. The op type is only insert. During the begin hours every thing is ok, but after that the insert op may takes a very very long time. I have no idea of why that happen. The log:

Mon Dec 19 12:57:36 [conn10] insert agis.test.tp 234ms
Mon Dec 19 13:00:03 [clientcursormon] mem (MB) res:3806 virt:65558 mapped:32705
Mon Dec 19 13:02:38 [conn34] insert agis.test.tp 390ms
Mon Dec 19 13:05:04 [clientcursormon] mem (MB) res:3809 virt:65558 mapped:32705
Mon Dec 19 13:06:19 [conn34] insert agis.test.tp 109ms
Mon Dec 19 13:06:45 [conn12] insert agis.test._records 390ms
Mon Dec 19 13:06:51 [journal] old journal file will be removed: data/journal/j._9
Mon Dec 19 13:07:46 [conn33] insert agis.test._records 109ms
Mon Dec 19 13:07:58 [conn34] insert agis.test.tp 577ms
Mon Dec 19 13:08:37 [conn34] insert agis.test._records 124ms
Mon Dec 19 13:08:38 [conn34] insert agis.test.tp 265ms
Mon Dec 19 13:10:05 [clientcursormon] mem (MB) res:3788 virt:69652 mapped:34752
Mon Dec 19 13:10:39 [conn32] insert agis.test._records 280ms
Mon Dec 19 13:10:45 [conn16] insert agis.test._records 795ms
Mon Dec 19 13:11:04 [conn10] insert agis.test._records 109ms
Mon Dec 19 13:11:11 [conn28] insert agis.test._records 124ms
Mon Dec 19 16:12:23 [conn34] insert agis.test.tp 10828681ms      ****** look at this line *******
Mon Dec 19 16:12:25 [conn33] insert agis.test._records 1544ms
Mon Dec 19 16:12:25 [conn33] command agis.$cmd command: { getlasterror: 1, w: 1, wtimeout: 0 } ntoreturn:1 reslen:78 280ms
Mon Dec 19 16:12:25 [conn28] command agis.$cmd command: { getlasterror: 1, w: 1, wtimeout: 0 } ntoreturn:1 reslen:78 234ms
Mon Dec 19 16:12:25 [conn34] command agis.$cmd command: { getlasterror: 1, w: 1, wtimeout: 0 } ntoreturn:1 reslen:78 280ms

currentOp show that one insert operation is processing while others are waiting for lock.

The data size is about 12G. The index data is small against RAM.

After restart(I must kill mongod process) it will run normally hours again, but hang again at last.



 Comments   
Comment by kobevaliant [ 17/Apr/12 ]

I test on latest nightly build(2.1.1 pre). Everything is ok now, the memory remains low. Thank you for your work. MongoDB is now shining brilliantly again

Comment by Tad Marshall [ 16/Apr/12 ]

Need to debug/fix the remaining issues with remapPrivateView.

Comment by Tad Marshall [ 04/Apr/12 ]

We have fixed a problem related to FlushViewOfFile for version 2.1.1 that may have been involved in poor memory usage and we have a work-in-progress for either version 2.1.1 or 2.1.2 that may make a serious improvement in memory usage on Windows. Without knowing the specifics of what caused a query to take 8 hours to complete, the probability seems high that our upcoming change to memory-mapped file usage on Windows will fix this problem.

Turning off journaling on Windows does sidestep this issue. If the work-in-progress code change works as we hope it will, you can reenable journaling and see only the expected results (double the mapped memory, additional disk I/O) and not horrible results like out of memory or multi-hour response times.

Comment by kobevaliant [ 14/Feb/12 ]

I think something should had happened if there were something...

Comment by Tad Marshall [ 13/Feb/12 ]

I have a long-running test going on (English) Windows Server 2008 R2 looking for memory problems; I'll check on it tomorrow and see if it has anything to tell me.

Comment by kobevaliant [ 01/Feb/12 ]

I use --nojournal switch to prevent such problem temporarily.
The Empty Working Set command has no effect.
I use Chinese version Windows Server 2008 R Enterprise:
CPU: Intel(R) Xeon(R) E5504 @ 2.00GHz 2.00GHz
RAM: 4.00 GB

I tried on another computer, there's also the same problem.

Comment by Tad Marshall [ 31/Jan/12 ]

I still can't duplicate this. Are you still hitting this problem? Did you try using VMMap's Empty Working Set command, and did it fix it for you or help?

Comment by Tad Marshall [ 23/Dec/11 ]

Thanks for the VMMap screenshot. The entry for Private in the Mapped File line (3,329,940 KB == 3.3 GB) is the problem ... this number should never be anything like this big, because the modified data is supposed to be flushed back to the data file and stop being private. This does not appear to be working right.

What happens if you use VMMap's Empty Working Set command from the View menu? Does that flush the private data and drop your working set to a more reasonable size, or does it stay huge?

Journaling changes the way that memory mapped files are managed in MongoDB. With journaling enabled (the default in current 64-bit versions), every data file is mapped twice, once as read-only data and once as a copy that is changed page-by-page to CopyOnWrite as needed for updating as the journal is written. The CopyOnWrite pages are supposed to be reset as the data is flushed to disk, so they should be flagged as "private" only briefly. The huge accumulation of private mapped pages suggests that this is going wrong.

Thanks also for the Java program, I'll try testing with it and see if I can reproduce what you're seeing. For some reason, I haven't had much luck duplicating some of these Windows memory issues. At this level, it might even be a difference in CPUs (I'm guessing, just looking for any explanation) where mine handles paging slightly differently than yours. I've been testing on Intel Core i7 systems, one with 12 GB RAM and one with 8 GB RAM, both with Windows 7 64-bit with all updates. Any details you can provide that might help me duplicate the problem could be helpful, thanks!

Comment by kobevaliant [ 23/Dec/11 ]

I find if I use --nojournal switch, the private commit memory will be much more small and stable.
Does journal consume large memory? Or there's memory leak in code about journal? SERVER-3911

After searching through internet, I find many persons seem to being facing to this problem.

http://groups.google.com/group/mongodb-user/browse_thread/thread/df47745722250789?pli=1
http://stackoverflow.com/questions/7783536/make-mongodb-work-for-production

The memory alloced by journal seems stay forever, unless I call closeAllDatabases. But when I call closeAllDatabases while insert operation exists, the server will report error: [DataFileSync] FlushFileBuffers failed 6

Comment by kobevaliant [ 23/Dec/11 ]

I post vmmap screenshot.
I wrote some code to test, I think it is easy to reproduce the problem. I set paging file size to 1G, and I have 4G RAM.
code:

public class TestMongo {
public static void main(String[] args) throws Exception {
Mongo mongo = new Mongo("localhost");
DBCollection c = mongo.getDB("test").getCollection("test");
c.ensureIndex("idx");
byte[] data = new byte[30 * 1024];
int i = 0;
while (true) {
System.out.println("insert... " + i);
c.insert(new BasicDBObject().append("idx", i++).append("data", data), WriteConcern.SAFE);
}
}
}

Comment by Tad Marshall [ 23/Dec/11 ]

Our usage of memory mapped files is not supposed to result in private committed memory, but the evidence is suggesting that heavy writing does cause data to consume page file space. The lines "extra_info":

{ "availPageFileMB": 585, "note": "fields vary by platform", "page_faults": 8880461, "ramMB": 4086, "totalPageFileMB": 16345, "usagePageFileMB": 15177 }

and "mem":

{ "bits": 64, "mapped": 20423, "mappedWithJournal": 40846, "resident": 2645, "supported": true, "virtual": 40998 }

are certainly suggestive of a problem like this; you have a 4 GB RAM machine ("ramMB": 4086) with a 16 GB page file ("totalPageFileMB": 16345), the page file is about exhausted ("usagePageFileMB": 15177) and one copy of your mapped data is 20 GB ("mapped": 20423).

And actually, RAM plus page file is the maximum memory available to the system, and memory allocations will start to fail across the board when this limit is hit. So perhaps releasing a lock requires a memory allocation somehow, and it is blocked. I would expect many failures in many things if the system got into this state, so I'm surprised that you could remote desktop in to a machine in this state. Can you look at the event log and see if it is full of out-of-memory complaints?

Can you try running VMMap ( http://technet.microsoft.com/en-us/sysinternals/dd535533 ) and see what it says the mongod.exe process's memory looks like when the writes start blocking like this? Perhaps starting VMMap before the problem happens would be best so you would be sure that it was running when the problem happened. Thanks!

Comment by kobevaliant [ 23/Dec/11 ]

It seems that when Memory-Commit Size of mongod process reach the sum of RAM size and Paging File size insert operation will be blocked. I doubt virtual memory space size of a process is limited as RAM size plus Paging File size? Or is there memory leak? File mapping doesn't increase Memory-Commit Size, am I wrong?

Every time when I log in the remote desktop of the server several hours later, the blocked insert operation get through, but after several seconds another insert operation will be blocked again.

Comment by Tad Marshall [ 22/Dec/11 ]

Unless there is a reporting bug or some other explanation, this line sort of jumps out at me:

Wed Dec 21 09:31:14 [conn22] insert agis.test.tp 30665551ms *********** oh my god ***********

That 30665551ms (if correct and in milliseconds as implied by 'ms') is 30665.551 seconds, or 511 minutes or 8.5 hours. The only thing that I could imagine holding up an insert for 8.5 hours is a lock that isn't being handled correctly. The next mystery would be what causes the lock to be released after 8 hours. I suppose that out-of-memory issues could cause lock handling code to misbehave, but that still leaves the question of how the lock was ever released later.

Comment by kobevaliant [ 22/Dec/11 ]

I guess memory shortage lead to such conditions. Write opration will be blocked because of something related to memory. I will continue to test with Windows System Resource Manager, and to see what will happen.

Comment by Tad Marshall [ 21/Dec/11 ]

Can you elaborate on the connection with SERVER-4478?

Comment by kobevaliant [ 21/Dec/11 ]

I find this problem is same as SERVER-4478.

Comment by kobevaliant [ 21/Dec/11 ]

I take some mms screenshot. (Is there any better way to let you know the server's status?)

mongodb hang now, the last lines of log:

Wed Dec 21 07:41:03 [conn67] end connection 127.0.0.1:3475
Wed Dec 21 07:41:03 [initandlisten] connection accepted from 127.0.0.1:3549 #68
Wed Dec 21 08:08:03 [conn68] end connection 127.0.0.1:3549
Wed Dec 21 08:08:03 [initandlisten] connection accepted from 127.0.0.1:3625 #69
Wed Dec 21 08:35:03 [conn69] end connection 127.0.0.1:3625
Wed Dec 21 08:35:03 [initandlisten] connection accepted from 127.0.0.1:3701 #70
Wed Dec 21 09:02:03 [conn70] end connection 127.0.0.1:3701
Wed Dec 21 09:02:03 [initandlisten] connection accepted from 127.0.0.1:3776 #71
Wed Dec 21 09:29:04 [conn71] end connection 127.0.0.1:3776
Wed Dec 21 09:29:04 [initandlisten] connection accepted from 127.0.0.1:3851 #72
Wed Dec 21 09:31:14 [conn22] insert agis.test.tp 30665551ms                         {color:red}*********** oh my god ***********{color}
Wed Dec 21 09:31:15 [conn26] query agis._tasks ntoreturn:1 scanAndOrder:1 resle:20 124ms
Wed Dec 21 09:31:15 [conn53] command admin.$cmd command: { listDatabases: 1 } noreturn:1 reslen:280 624ms
Wed Dec 21 09:31:16 [conn53] command agis.$cmd command: { dbstats: 1 } ntoretur:1 reslen:215 748ms
Wed Dec 21 09:31:16 [conn24] insert agis.test.tp 187ms
Wed Dec 21 09:31:39 [conn18] insert agis.test.tp 109ms
Wed Dec 21 09:31:47 [initandlisten] connection accepted from 127.0.0.1:3861 #73
Wed Dec 21 09:32:03 [conn24] insert agis.test._records 140ms
Wed Dec 21 09:32:07 [clientcursormon] mem (MB) res:2698 virt:40998 mapped:20423
Wed Dec 21 09:56:04 [conn72] end connection 127.0.0.1:3851
Wed Dec 21 09:56:04 [initandlisten] connection accepted from 127.0.0.1:3929 #74

last ping:

{
    "agentHostname": "WIN-H3KOGQJLJG2",
    "agentSessionKey": "1446db29f73520cf842417aaf7e2b705",
    "agentVersion": "1.3.7",
    "buildInfo": {
        "bits": 64,
        "debug": false,
        "gitVersion": "514b122d308928517f5841888ceaa4246a7f18e3",
        "maxBsonObjectSize": 16777216,
        "ok": 1,
        "sysInfo": "windows (6, 1, 7601, 2, 'Service Pack 1') BOOST_LIB_VERSION=1_42",
        "version": "2.0.2",
        "versionArray": [
            2,
            0,
            2,
            0
        ]
    },
    "cmdLineOpts": {
        "argv": [
            "mongodb-win32-x86_64-2.0.2\\bin\\mongod",
            "--dbpath",
            "data"
        ],
        "ok": 1,
        "parsed": {"dbpath": "data"}
    },
    "connPoolStats": {
        "createdByType": {},
        "hosts": {},
        "numAScopedConnection": 0,
        "numDBClientConnection": 0,
        "ok": 1,
        "replicaSets": {},
        "totalAvailable": 0,
        "totalCreated": 0
    },
    "duplicateHost": false,
    "getParameterAll": {
        "logLevel": 0,
        "notablescan": false,
        "ok": 1,
        "quiet": false,
        "replApplyBatchSize": 1,
        "syncdelay": 60
    },
    "host": "localhost",
    "isMaster": {
        "ismaster": true,
        "maxBsonObjectSize": 16777216,
        "ok": 1
    },
    "isSelf": {
        "id": {"$oid": "4ef04bf4844435d7d84157bf"},
        "ok": 1
    },
    "port": 27017,
    "serverStatus": {
        "asserts": {
            "msg": 0,
            "regular": 0,
            "rollovers": 0,
            "user": 0,
            "warning": 0
        },
        "backgroundFlushing": {
            "average_ms": 1231.3899803536347,
            "flushes": 1018,
            "last_finished": {"$date": "2011-12-21T01:46:59.503Z"},
            "last_ms": 15,
            "total_ms": 1253555
        },
        "connections": {
            "available": 19987,
            "current": 13
        },
        "cursors": {
            "clientCursors_size": 0,
            "timedOut": 0,
            "totalOpen": 0
        },
        "dur": {
            "commits": 11,
            "commitsInWriteLock": 0,
            "compression": 0.9999995971278438,
            "earlyCommits": 0,
            "journaledMB": 2.482176,
            "timeMs": {
                "dt": 3385,
                "prepLogBuffer": 0,
                "remapPrivateView": 0,
                "writeToDataFiles": 1747,
                "writeToJournal": 93
            },
            "writeToDataFilesMB": 3.035718
        },
        "extra_info": {
            "availPageFileMB": 585,
            "note": "fields vary by platform",
            "page_faults": 8880461,
            "ramMB": 4086,
            "totalPageFileMB": 16345,
            "usagePageFileMB": 15177
        },
        "globalLock": {
            "activeClients": {
                "readers": 2,
                "total": 11,
                "writers": 9
            },
            "currentQueue": {
                "readers": 3,
                "total": 11,
                "writers": 8
            },
            "lockTime": 5.2219391303E10,
            "ratio": 0.8538110543189176,
            "totalTime": 6.1160359823E10
        },
        "host": "WIN-H3KOGQJLJG2",
        "indexCounters": {"note": "not supported on this platform"},
        "localTime": {"$date": "2011-12-21T01:47:58.237Z"},
        "mem": {
            "bits": 64,
            "mapped": 20423,
            "mappedWithJournal": 40846,
            "resident": 2645,
            "supported": true,
            "virtual": 40998
        },
        "network": {
            "bytesIn": 1.5568615319E10,
            "bytesOut": 126747587,
            "numRequests": 2586500
        },
        "ok": 1,
        "opcounters": {
            "command": 1298883,
            "delete": 0,
            "getmore": 0,
            "insert": 1275255,
            "query": 12349,
            "update": 7
        },
        "process": "mongod",
        "uptime": 61161,
        "uptimeEstimate": 39178,
        "version": "2.0.2",
        "writeBacksQueued": false
    },
    "shardVersion": {
        "configServer": "",
        "global": {
            "$inc": 0,
            "$ts": 0
        },
        "inShardedMode": false,
        "mine": {
            "$inc": 0,
            "$ts": 0
        },
        "ok": 1
    },
    "startupWarnings": {
        "log": [],
        "ok": 1
    },
    "writeBacksQueued": {
        "hasOpsQueued": false,
        "ok": 1,
        "queues": {},
        "totalOpsQueued": 0
    }
}
 
> db.currentOp()
{
        "inprog" : [
                {
                        "opid" : 2586149,
                        "active" : false,
                        "lockType" : "read",
                        "waitingForLock" : true,
                        "op" : "query",
                        "ns" : "",
                        "query" : {
                                "$query" : {
 
                                }
                        },
                        "client" : "127.0.0.1:2434",
                        "desc" : "conn",
                        "connectionId" : 53,
                        "numYields" : 0
                },
                {
                        "opid" : 2586141,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1257",
                        "desc" : "conn",
                        "connectionId" : 18,
                        "numYields" : 0
                },
                {
                        "opid" : 2586142,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1259",
                        "desc" : "conn",
                        "connectionId" : 19,
                        "numYields" : 0
                },
                {
                        "opid" : 2586139,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1258",
                        "desc" : "conn",
                        "connectionId" : 20,
                        "numYields" : 0
                },
                {
                        "opid" : 2586146,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1261",
                        "desc" : "conn",
                        "connectionId" : 21,
                        "numYields" : 0
                },
                {
                        "opid" : 2586143,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1262",
                        "desc" : "conn",
                        "connectionId" : 22,
                        "numYields" : 0
                },
                {
                        "opid" : 2586138,
                        "active" : true,
                        "lockType" : "write",
                        "waitingForLock" : false,
                        "secs_running" : 1293,
                        "op" : "insert",
                        "ns" : "agis.test._records",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1263",
                        "desc" : "conn",
                        "connectionId" : 23,
                        "numYields" : 0
                },
                {
                        "opid" : 2586145,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1264",
                        "desc" : "conn",
                        "connectionId" : 24,
                        "numYields" : 0
                },
                {
                        "opid" : 2586150,
                        "active" : false,
                        "lockType" : "read",
                        "waitingForLock" : true,
                        "op" : "query",
                        "ns" : "",
                        "query" : {
                                "query" : {
                                        "status" : 1
                                },
                                "orderby" : {
                                        "priority" : -1
                                }
                        },
                        "client" : "192.168.1.163:1265",
                        "desc" : "conn",
                        "connectionId" : 25,
                        "numYields" : 0
                },
                {
                        "opid" : 2586140,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1266",
                        "desc" : "conn",
                        "connectionId" : 26,
                        "numYields" : 0
                },
                {
                        "opid" : 2586144,
                        "active" : false,
                        "lockType" : "write",
                        "waitingForLock" : true,
                        "op" : "insert",
                        "ns" : "",
                        "query" : {
 
                        },
                        "client" : "192.168.1.163:1267",
                        "desc" : "conn",
                        "connectionId" : 27,
                        "numYields" : 0
                }
        ]
}

Comment by kobevaliant [ 20/Dec/11 ]

I have installed mms but without munin, my system is windows. Later I will post the full stats.

Comment by Eliot Horowitz (Inactive) [ 20/Dec/11 ]

Can you instal mms with munin so we can see the full stats?

Comment by Eliot Horowitz (Inactive) [ 20/Dec/11 ]

Can you instal mms with munin so we can see the full stats?

Generated at Thu Feb 08 03:06:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.