Core Server / SERVER-12216

Database freezes while flushing mmaps; flushing takes over half a minute

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 2.4.8
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:
      Windows Server 2008
    • Operating System:
      Windows

      Description

      Log says:
      Mon Dec 30 17:12:08.065 [DataFileSync] flushing mmaps took 143988ms for 30 files
      (with syncdelay = 300; with the default syncdelay, flushing takes 30-40 sec).
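
      For reference, syncdelay controls how often mongod flushes memory-mapped files
      to disk (default 60 seconds). It can be set on the command line or in the
      2.4-era ini-style config file; the examples below mirror the 300-second value
      mentioned above (the dbpath is illustrative):

          # command line
          mongod --dbpath /data/db --syncdelay 300

          # mongod.conf
          syncdelay = 300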

        Issue Links

          Activity

          mdcallag Mark Callaghan added a comment -

          I reproduce minutes-long stalls about once per hour using the insert benchmark – http://www.tokutek.com/resources/benchmark-results/tokumx-benchmark-hdd/. My test server has a HW RAID card (fast fsync) and a disk array that does between 1000 and 2000 random IOPS. I use the default syncdelay of 60 seconds and MongoDB 2.4.9.

          The problem is that updating internal mongod metadata after adding a new file is blocked until the msyncs on all files are finished.

          The mongod error log has this pattern around the time my Java/mongo client gets an exception from the 60-second socket wait timeout:
          1) flushing maps took 60+ seconds
          2) followed immediately by "allocating new datafile"

          Sat Mar  8 10:00:57.466 [conn20] command iibench.$cmd command: { getlasterror: 1, fsync: true } ntoreturn:1 keyUpdates:0  reslen:79 259ms
          Sat Mar  8 10:00:57.772 [conn20] command iibench.$cmd command: { getlasterror: 1, fsync: true } ntoreturn:1 keyUpdates:0  reslen:79 211ms
          Sat Mar  8 10:01:57.897 [conn19] end connection 127.0.0.1:22258 (1 connection now open)
          Sat Mar  8 10:01:59.007 [initandlisten] connection accepted from 127.0.0.1:48491 #21 (2 connections now open)
          Sat Mar  8 10:02:02.800 [DataFileSync] flushing mmaps took 114360ms  for 26 files
          Sat Mar  8 10:02:03.719 [FileAllocator] allocating new datafile /data/mysql/m249/db/iibench.24, filling with zeroes...
          Sat Mar  8 10:02:03.720 [FileAllocator] done allocating datafile /data/mysql/m249/db/iibench.24, size: 2047MB,  took 0 secs
          Sat Mar  8 10:02:03.770 [conn20] insert iibench.purchases_index ninserted:1000 keyUpdates:0 locks(micros) w:65994240 65994ms
          Sat Mar  8 10:02:03.818 [conn20] end connection 127.0.0.1:22614 (1 connection now open)
          Sat Mar  8 10:02:03.942 [initandlisten] connection accepted from 127.0.0.1:42227 #22 (2 connections now open)
          Sat Mar  8 10:02:10.534 [conn22] insert iibench.purchases_index ninserted:1000 keyUpdates:0 locks(micros) w:102909 102ms
          Sat Mar  8 10:02:14.056 [conn22] command iibench.$cmd command: { getlasterror: 1, fsync: true } ntoreturn:1 keyUpdates:0  reslen:79 2923ms

          Stack traces from the time of the long wait show the problem:

          nanosleep,mongo::sleepmicros,mongo::RWLockRecursiveNongreedy::Lock,Exclusive,LockMongoFilesExclusive,mongo::MongoFile::created,mongo::MemoryMappedFile::MemoryMappedFile,mongo::MongoMMF::MongoMMF,MongoDataFile,mongo::Database::getFile,mongo::Database::addAFile,mongo::Database::suitableFile,mongo::Database::allocExtent,mongo::outOfSpace,mongo::allocateSpaceForANewRecord,mongo::DataFileMgr::insert,mongo::DataFileMgr::insertWithObjMod,mongo::checkAndInsert,mongo::insertMulti,mongo::receivedInsert,mongo::assembleResponse,mongo::MyMessageHandler::process,mongo::PortMessageServer::handleIncomingMsg
           
          msync,mongo::PosixFlushable::flush,mongo::MongoFile::_flushAll,mongo::MongoFile::flushAll,mongo::DataFileSync::run,mongo::BackgroundJob::jobBody,operator(),operator()<boost::_mfi::mf1<void,,operator(),boost::detail::thread_data<boost::_bi::bind_t<void,,boost::(anonymous

          MongoFile::_flushAll appears to hold LockMongoFilesShared while doing the msync for all files.

          MongoFile::created gets LockMongoFilesExclusive to finish adding a new file. That will block until the msyncs are finished.
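
          As a minimal illustration of the contention described above, here is a self-contained sketch that assumes a plain C++ reader/writer lock in place of LockMongoFiles{Shared/Exclusive}; the names flushAllFiles/registerNewFile and the timings are placeholders, not actual mongod code. A background flush that holds the shared lock for the whole pass forces any thread that needs the exclusive lock to wait out the entire flush:

            // Simplified model of the stall: the flusher holds the shared lock across
            // every msync(), so registering a new datafile (which needs the exclusive
            // lock) waits for the whole flush pass. Names and timings are illustrative.
            #include <chrono>
            #include <cstdio>
            #include <shared_mutex>
            #include <thread>
            #include <vector>

            std::shared_mutex filesLock;          // stands in for LockMongoFiles{Shared/Exclusive}
            std::vector<int> mappedFiles(26, 0);  // stands in for the open memory-mapped files

            // Analogous to MongoFile::_flushAll: shared lock held for the whole pass.
            void flushAllFiles() {
                std::shared_lock<std::shared_mutex> lk(filesLock);
                for (size_t i = 0; i < mappedFiles.size(); ++i) {
                    // Pretend each msync() is slow on a busy disk array.
                    std::this_thread::sleep_for(std::chrono::milliseconds(100));
                }
                std::printf("flush pass finished\n");
            }

            // Analogous to MongoFile::created: exclusive lock needed to register a new
            // datafile, so the insert that triggered the allocation stalls here.
            void registerNewFile() {
                auto start = std::chrono::steady_clock::now();
                std::unique_lock<std::shared_mutex> lk(filesLock);
                mappedFiles.push_back(0);
                auto waited = std::chrono::duration_cast<std::chrono::milliseconds>(
                    std::chrono::steady_clock::now() - start);
                std::printf("new file registered after waiting %lld ms\n",
                            static_cast<long long>(waited.count()));
            }

            int main() {
                std::thread flusher(flushAllFiles);
                std::this_thread::sleep_for(std::chrono::milliseconds(50));  // let the flush start
                std::thread writer(registerNewFile);
                flusher.join();
                writer.join();
            }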

          I wonder whether other stalls lurk in the code from the use of LockMongoFiles{Shared/Exclusive}.
          mdcallag Mark Callaghan added a comment -

          From browsing, LockMongoFilesExclusive is used by
          1) MongoMMF::close - didn't take time to figure out how this is used
          2) MongoFile::setFilename - called by MemoryMappedFile::map, maybe when files are opened.

          I wonder if #2 is also a potential stall if it were called while a long-running sync was in progress.

          mdcallag Mark Callaghan added a comment -

          I suspect that dropping a database might also get stuck/delayed by msync even when there are no dirty pages for that database.

          From browsing the source I also think that Solaris and Windows builds use the exclusive lock more frequently and are more likely to get stalls.

          pasette Dan Pasette added a comment -

          In 2.5.5 we made a change that should alleviate the worst of the problem you're seeing. See SERVER-7973; in short, the MongoFilesLock is no longer held during the msync system call.
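
          A rough sketch of the general shape of such a change, assuming the same reader/writer-lock model as in the earlier sketch (this is not the actual SERVER-7973 patch): hold the lock only long enough to snapshot the list of mapped files, then run the expensive syncs unlocked.

            // Sketch only: snapshot the file list under the shared lock, then msync()
            // without holding it, so writers needing the exclusive lock are not blocked
            // for the length of the whole flush. Names are placeholders.
            #include <memory>
            #include <shared_mutex>
            #include <vector>

            struct MappedFile {
                void sync() { /* msync() the mapping here */ }
            };

            std::shared_mutex filesLock;
            std::vector<std::shared_ptr<MappedFile>> mappedFiles;

            void flushAllFiles() {
                std::vector<std::shared_ptr<MappedFile>> snapshot;
                {
                    // Lock held only long enough to copy the list.
                    std::shared_lock<std::shared_mutex> lk(filesLock);
                    snapshot = mappedFiles;
                }
                // Expensive syncs run without the lock; shared_ptr keeps each file
                // alive even if it is removed from the list concurrently.
                for (auto& f : snapshot) {
                    f->sync();
                }
            }

            int main() {
                mappedFiles.push_back(std::make_shared<MappedFile>());
                flushAllFiles();
            }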

          mdcallag Mark Callaghan added a comment -

          That fixed the problem. The longest stall on my benchmark is almost 300 seconds with 2.4.9 and about 10 seconds with 2.6.0-rc0.

          pasette Dan Pasette added a comment -

          Thanks for the update.


             People

             • Votes: 6
             • Watchers: 18
