Core Server / SERVER-34188

CappedPositionLost errors in very large oplog

    • Type: Question
    • Resolution: Works as Designed
    • Priority: Minor - P4
    • Fix Version/s: None
    • Affects Version/s: 3.2.11
    • Component/s: Querying, WiredTiger
    • Labels: None
    • Environment:
      Debian
      Debian packages

      Hello,

      I am using the oplog collection to create incremental backups from a MongoDB replica set secondary.

      Sometimes, during busy periods on the replica set and with an undersized WiredTiger cache on the secondary, a very small percentage of the backups fail with the following error from the MongoDB server:

      29-Mar 06:01 + oplog_backup
      29-Mar 07:06 since filter: {'ts': {'$gt': Timestamp(1522278803, 94)}}
      29-Mar 07:06 Traceback (most recent call last):
      29-Mar 07:06   File "oplog_backup", line 99, in <module>
      29-Mar 07:06     main()
      29-Mar 07:06   File "oplog_backup", line 74, in main
      29-Mar 07:06     for doc in oplog.find(filter).sort("$natural"):
      29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 1114, in next
      29-Mar 07:06     if len(self.__data) or self._refresh():
      29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 1036, in _refresh
      29-Mar 07:06     self.__collation))
      29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 928, in __send_message
      29-Mar 07:06     helpers._check_command_response(doc['data'][0])
      29-Mar 07:06   File "/usr/lib/python2.7/dist-packages/pymongo/helpers.py", line 210, in _check_command_response
      29-Mar 07:06     raise OperationFailure(msg % errmsg, code, response)
      29-Mar 07:06 pymongo.errors.OperationFailure: Executor error during find command: CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(<InsertLongNumber>)
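
      The read pattern implied by this traceback is roughly the following (a minimal sketch, not the actual oplog_backup script; the connection string, the starting checkpoint and the output handling are assumptions, and only the filter and the $natural sort mirror the traceback):

      # Minimal sketch of the oplog read implied by the traceback above.
      # The URI, the checkpoint value and what is done with each document
      # are assumptions; only the filter and sort match the log.
      from pymongo import MongoClient
      from bson.timestamp import Timestamp

      client = MongoClient("mongodb://secondary.example:27017",
                           readPreference="secondaryPreferred")
      oplog = client.local["oplog.rs"]

      last_ts = Timestamp(1522278803, 94)   # checkpoint from the previous run
      query = {"ts": {"$gt": last_ts}}

      # Forward collection scan over the capped oplog. If the oplog wraps past
      # the cursor's saved position while the scan is yielded, the server kills
      # the cursor with CappedPositionLost, as seen in the traceback above.
      for doc in oplog.find(query).sort("$natural", 1):
          # ... write doc to the incremental backup here ...
          last_ts = doc["ts"]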
      

      From pymongo it appears that the CappedPositionLost error originates from the MongoDB secondary server.

      Log from mongod

      2018-03-29T07:06:19.873+0300 W QUERY    [conn492712] Could not restore RecordCursor for CollectionScan: local.oplog.rs
      2018-03-29T07:06:19.873+0300 E QUERY    [conn492712] Plan executor error during find command: DEAD, stats: { stage: "COLLSCAN", filter: { ts: { $gt: Timestamp 1522278803000|94 } }, nReturned: 0, executionTimeMillisEstimate: 1690, works: 1241191, advanced: 0, needTime: 1241190, needYield: 0, saveState: 9697, restoreState: 9697, isEOF: 0, invalidates: 0, direction: "forward", docsExamined: 1241189 }
      2018-03-29T07:06:19.874+0300 I COMMAND  [conn492712] command local.oplog.rs command: find { find: "oplog.rs", filter: { ts: { $gt: Timestamp 1522278803000|94 } }, sort: { $natural: 1 } } keyUpdates:0 writeConflicts:0 numYields:9697 reslen:256 locks:{ Global: { acquireCount: { r: 19396 }, acquireWaitCount: { r: 2519 }, timeAcquiringMicros: { r: 3889033515 } }, Database: { acquireCount: { r: 9698 } }, oplog: { acquireCount: { r: 9698 } } } protocol:op_query 3890891ms
      

      The replication info shows an oplog window covering roughly ten days, while each incremental snapshot spans from 20 minutes up to a couple of hours, so I do not see how a record can be missing from the oplog.

      configured oplog size:   9502.277587890625MB
      log length start to end: 989089secs (274.75hrs)
      oplog first event time:  Sun Mar 18 2018 05:08:54 GMT+0200 (EET)
      oplog last event time:   Thu Mar 29 2018 16:53:43 GMT+0300 (EEST)
      now:                     Thu Mar 29 2018 16:54:01 GMT+0300 (EEST)
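
      For reference, the same figures can be reproduced from pymongo by reading the first and last oplog entries (a sketch; the output above appears to come from db.printReplicationInfo() in the mongo shell, and the connection string below is an assumption):

      # Sketch: derive the oplog window shown above via pymongo.
      # The connection string is an assumption.
      from pymongo import MongoClient

      client = MongoClient("mongodb://secondary.example:27017",
                           readPreference="secondaryPreferred")
      oplog = client.local["oplog.rs"]

      first_ts = oplog.find_one(sort=[("$natural", 1)])["ts"]
      last_ts = oplog.find_one(sort=[("$natural", -1)])["ts"]

      window_secs = last_ts.time - first_ts.time
      print("oplog first event time:  %s" % first_ts.as_datetime())
      print("oplog last event time:   %s" % last_ts.as_datetime())
      print("log length start to end: %d secs (%.2f hrs)"
            % (window_secs, window_secs / 3600.0))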
      

      Is this normal behavior for mongod?

            Assignee:
            dmitry.agranat@mongodb.com Dmitry Agranat
            Reporter:
            vfxcode Stefanos Boglou
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved: