[SERVER-12216] Database freezes during flushing mmaps, flushing takes over half minute
Created: 30/Dec/13 | Updated: 10/Dec/14 | Resolved: 12/Mar/14
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ilya | Assignee: | Unassigned |
| Resolution: | Duplicate | Votes: | 6 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Windows Server 2008 |
| Issue Links: | |
| Operating System: | Windows |
| Participants: | |
| Description |
|
Log says: |
| Comments |
| Comment by Daniel Pasette (Inactive) [ 12/Mar/14 ] |
|
Thanks for the update. |
| Comment by Mark Callaghan [ 11/Mar/14 ] |
|
That fixed the problem. The longest stall on my benchmark is almost 300 seconds with 2.4.9 and about 10 seconds with 2.6.0 rc0. |
| Comment by Daniel Pasette (Inactive) [ 10/Mar/14 ] |
|
In 2.5.5 we made a change that should alleviate the worst of the problem you're seeing. See: |
| Comment by Mark Callaghan [ 10/Mar/14 ] |
|
I suspect that dropping a database might also get stuck or delayed by msync, even when there are no dirty pages for that database. From browsing the source, I also think that Solaris and Windows builds use the exclusive lock more frequently and are therefore more likely to get stalls. |
| Comment by Mark Callaghan [ 08/Mar/14 ] |
|
From browsing, LockedFilesExclusive is used by
I wonder if #2 is also a potential stall if it were called while a long-running sync was in progress. |
| Comment by Mark Callaghan [ 08/Mar/14 ] |
|
I can reproduce minutes-long stalls about once per hour using the insert benchmark: http://www.tokutek.com/resources/benchmark-results/tokumx-benchmark-hdd/. My test server has a HW RAID card (fast fsync) and a disk array that does between 1000 and 2000 random IOPS. I use the default syncdelay of 60 seconds and MongoDB 2.4.9. The problem is that updating internal mongod metadata after adding a new file is blocked until the msyncs on all files are finished. The mongod error log has this pattern around the time my java/mongo client gets an exception from the 60 second socketwait timeout:
Stack traces from the time of the long wait show the problem:
MongoFile::_flushAll appears to hold LockMongoFilesShared while doing the msync for all files. MongoFile::created gets LockMongoFilesExclusive to finish adding a new file, so it will block until the msyncs are finished. I wonder whether other stalls lurk in the code from the use of LockMongoFiles{Shared/Exclusive}. |
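
The blocking pattern described in this comment can be reproduced in miniature. The sketch below is not MongoDB code: `SharedExclusiveLock`, `measure_exclusive_wait`, and the use of `time.sleep` as a stand-in for msync are all hypothetical names and assumptions chosen for illustration. It only shows that a thread asking for the exclusive lock waits for roughly the whole flush when the flusher holds the shared lock across every file.

```python
import threading
import time


class SharedExclusiveLock:
    """Minimal shared/exclusive lock for illustration (no fairness guarantees)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self):
        with self._cond:
            # Exclusive access needs every shared holder to be gone.
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


def measure_exclusive_wait(num_files=5, sync_seconds=0.05):
    """Return how long an exclusive acquire waits while a flush holds the shared lock."""
    lock = SharedExclusiveLock()
    flush_started = threading.Event()

    def flush_all():
        # Assumed analogue of a flush that keeps the shared lock for all files.
        lock.acquire_shared()
        flush_started.set()
        try:
            for _ in range(num_files):
                time.sleep(sync_seconds)  # stand-in for msync on one file
        finally:
            lock.release_shared()

    flusher = threading.Thread(target=flush_all)
    flusher.start()
    flush_started.wait()  # make sure the flush already owns the shared lock

    # Assumed analogue of registering a newly created file under the exclusive lock.
    start = time.monotonic()
    lock.acquire_exclusive()
    waited = time.monotonic() - start
    lock.release_exclusive()
    flusher.join()
    return waited


if __name__ == "__main__":
    print(f"exclusive acquire waited {measure_exclusive_wait():.3f}s")
```

In this toy model the wait grows with `num_files * sync_seconds`, which mirrors the report that the stall lasts as long as the msyncs on all files, not as long as the work the exclusive holder actually needs to do.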