Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-34955

Primary stopped writing until we manually restarted the instance

    • Type: Icon: Bug Bug
    • Resolution: Incomplete
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.4.7
    • Component/s: None
    • None
    • Environment:
      ubuntu 14.04
    • None
    • 3
    • None
    • None
    • None
    • None
    • None
    • None

      Hello Team,

      For some weird reason, my mongo instance (PRIMARY) stops writing all of a sudden. although the find command works perfectly. Can you please tell me what could be the reason behind this. It starts working if we restart the instance. But I don't see that as a viable option.

       

      Below are the snippets of logs.

       

      1) PRIMARY:

      2018-05-09T14:44:34.805+0530 I - [conn240053] end connection 10.0.0.xx1:45868 (48 connections now open) <------ this is secondary 

      2018-05-09T14:44:37.818+0530 I - [conn239982] end connection 10.0.0.xx2:58920 (47 connections now open) <------ this is secondary

      2018-05-09T14:45:08.674+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx2:59035 #242865 (47 connections now open)
      2018-05-09T14:45:08.675+0530 I NETWORK [conn242865] received client metadata from 10.0.0.xx2:59035 conn242865: { driver:

      { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:45:08.678+0530 I ACCESS [conn242865] Successfully authenticated as principal __system on local

      2018-05-09T14:46:08.672+0530 I - [conn240585] end connection 10.0.0.xx2:58936 (47 connections now open)

       

      2) 1st Secondary

      2018-05-09T14:44:54.266+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017 <----- this is primary
      2018-05-09T14:44:54.276+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 10ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:45:54.268+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:49:48.877+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:49:48.898+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 21ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:50:54.712+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:51:27.964+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:51:27.980+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 16ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:52:27.965+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open

      2018-05-09T14:56:57.057+0530 I REPL [replication-125] Restarting oplog query due to error: StaleTerm: Replication term of this node was stale; retry query. Last fetched optime (with hash): { ts: Timestamp 1525857925000|2, t: 3 }[-2714852630922522331]. Restarts remaining: 3
      2018-05-09T14:56:57.057+0530 I REPL [replication-125] Scheduled new oplog query Fetcher source: ip-10-0-0-xxx:27017 database: local query: { find: "oplog.rs", filter: { ts:

      { $gte: Timestamp 1525857925000|2 }

      }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm:

      { $secondaryOk: true }

      } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 19776432 – target:ip-10-0-0-136:27017 db:local cmd:{ find: "oplog.rs", filter: { ts:

      { $gte: Timestamp 1525857925000|2 }

      }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
      2018-05-09T14:56:57.059+0530 I REPL [replication-126] Restarting oplog query due to error: StaleTerm: Replication term of this node was stale; retry query. Last fetched optime (with hash): { ts: Timestamp 1525857925000|2, t: 3 }[-2714852630922522331]. Restarts remaining: 2
      2018-05-09T14:56:57.060+0530 I REPL [replication-126] Scheduled new oplog query Fetcher source: ip-10-0-0-xxx:27017 database: local query: { find: "oplog.rs", filter: { ts:

      { $gte: Timestamp 1525857925000|2 }

      }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm:

      { $secondaryOk: true }

      } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 19776434 – target:ip-10-0-0-136:27017 db:local cmd:{ find: "oplog.rs", filter: { ts:

      { $gte: Timestamp 1525857925000|2 }

      }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
      2018-05-09T14:56:57.061+0530 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source's last applied OpTime { ts: Timestamp 1525857925000|2, t: 3 } is not greater than our last fetched OpTime { ts: Timestamp 1525857925000|2, t: 3 }. Choosing new sync source. <------- This is probably due to restart. Just mentioning

       

       

      3) 2nd Secondary

      2018-05-09T14:43:42.300+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017 <-------------- This is primary
      2018-05-09T14:43:42.312+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 12ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:45:01.608+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:45:32.459+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:45:32.470+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 11ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:46:32.463+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:50:09.273+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:50:09.287+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:51:09.274+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
      2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
      2018-05-09T14:56:00.361+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] VoteRequester(term 3 dry run) received a yes vote from ip-10-0-0-xx1:27017; response message: { term: 3, voteGranted: true, reason: "", ok: 1.0 }
      2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] dry election run succeeded, running for election
      2018-05-09T14:56:00.362+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] VoteRequester(term 4) received a yes vote from ip-10-0-0-xx1:27017; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0 }
      2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 4
      2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] transition to PRIMARY
      2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] Entering primary catch-up mode.
      2018-05-09T14:56:00.364+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host ip-10-0-0-xxx:27017 due to bad connection status; 2 connections to that host remain open
      2018-05-09T14:56:00.364+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
      2018-05-09T14:56:00.375+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:56:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:56:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 12ms (3 connections now open to ip-10-0-0-xxx:27017)
      2018-05-09T14:56:00.376+0530 I REPL [ReplicationExecutor] Caught up to the latest optime known via heartbeats after becoming primary.
      2018-05-09T14:56:00.377+0530 I REPL [ReplicationExecutor] Exited primary catch-up mode.
      2018-05-09T14:56:00.377+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host ip-10-0-0-xx:27017 due to bad connection status; 1 connections to that host remain open
      2018-05-09T14:56:00.377+0530 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
      2018-05-09T14:56:01.706+0530 I REPL [rsSync] transition to primary complete; database writes are now permitted
      2018-05-09T14:56:02.378+0530 I REPL [ReplicationExecutor] Member ip-10-0-0-xxx:27017 is now in state SECONDARY
      2018-05-09T14:56:02.593+0530 I NETWORK [thread1] connection accepted from 10.0.0.136:60041 #6132 (12 connections now open)
      2018-05-09T14:56:02.594+0530 I NETWORK [conn6132] received client metadata from 10.0.0.xxx:60041 conn6132: { driver:

      { name: "NetworkInterfaceASIO-Replication", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:56:02.601+0530 I ACCESS [conn6132] Successfully authenticated as principal __system on local
      2018-05-09T14:56:03.291+0530 I NETWORK [thread1] connection accepted from 10.0.0.136:60042 #6133 (13 connections now open)
      2018-05-09T14:56:03.292+0530 I NETWORK [conn6133] received client metadata from 10.0.0.xxx:60042 conn6133: { driver:

      { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:56:03.298+0530 I ACCESS [conn6133] Successfully authenticated as principal __system on local
      2018-05-09T14:56:03.307+0530 I NETWORK [thread1] connection accepted from 10.0.0.xxx:60043 #6134 (14 connections now open)
      2018-05-09T14:56:03.307+0530 I NETWORK [conn6134] received client metadata from 10.0.0.xxx:60043 conn6134: { driver:

      { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:56:03.314+0530 I ACCESS [conn6134] Successfully authenticated as principal __system on local
      2018-05-09T14:56:04.413+0530 I - [conn3289] end connection 10.0.0.xxx:53676 (14 connections now open)
      2018-05-09T14:56:06.981+0530 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to ip-10-0-0-xxx:27017: InvalidSyncSource: Sync source was cleared. Was ip-10-0-0-xxx:27017
      2018-05-09T14:56:07.391+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx1:35349 #6135 (14 connections now open)
      2018-05-09T14:56:07.391+0530 I NETWORK [conn6135] received client metadata from 10.0.0.xx1:35349 conn6135: { driver:

      { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:56:07.394+0530 I ACCESS [conn6135] Successfully authenticated as principal __system on local
      2018-05-09T14:56:07.396+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx1:35350 #6136 (15 connections now open)
      2018-05-09T14:56:07.396+0530 I NETWORK [conn6136] received client metadata from 10.0.0.xx1:35350 conn6136: { driver:

      { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }

      , os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
      2018-05-09T14:56:07.399+0530 I ACCESS [conn6136] Successfully authenticated as principal __system on local
      2018-05-09T14:57:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
      2018-05-09T14:57:00.377+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 1 connections to that host remain open

       

      Let me know if you need anything else.

       

      Regards,

      Ajit

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            ajit@wizely.in Ajit Hegde
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: