[SERVER-34955] Primary stopped writing until we manually restarted the instance Created: 11/May/18  Updated: 23/Jul/18  Resolved: 21/Jun/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ajit Hegde Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 14.04


Participants:

 Description   

Hello Team,

For some reason, my mongo instance (PRIMARY) suddenly stops accepting writes, although find commands continue to work perfectly. Can you please tell me what could be the reason behind this? Writes start working again if we restart the instance, but I don't see that as a viable long-term option.
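For reference, a minimal set of checks like the following (run from a shell the next time it happens) could confirm that the node still reports itself as PRIMARY and that writes, not reads, are what is blocked. Hostnames are masked the same way as in the logs below, the write_probe collection is just a placeholder, and the usual authentication options are omitted:

# What state does each member report, and who is PRIMARY?
mongo --host ip-10-0-0-xxx --port 27017 --eval 'rs.status().members.forEach(function (m) { print(m.name + " " + m.stateStr); })'

# Any long-running operations on the node that has stopped writing?
mongo --host ip-10-0-0-xxx --port 27017 --eval 'printjson(db.currentOp({ secs_running: { $gte: 5 } }))'

# A throwaway test insert; if this hangs while find() keeps working, writes really are blocked
mongo --host ip-10-0-0-xxx --port 27017 --eval 'printjson(db.getSiblingDB("test").write_probe.insert({ probe: new Date() }))'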

 

Below are snippets of the logs from each node.

 

1) PRIMARY:

2018-05-09T14:44:34.805+0530 I - [conn240053] end connection 10.0.0.xx1:45868 (48 connections now open) <------ this is secondary 

2018-05-09T14:44:37.818+0530 I - [conn239982] end connection 10.0.0.xx2:58920 (47 connections now open) <------ this is secondary

2018-05-09T14:45:08.674+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx2:59035 #242865 (47 connections now open)
2018-05-09T14:45:08.675+0530 I NETWORK [conn242865] received client metadata from 10.0.0.xx2:59035 conn242865: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:45:08.678+0530 I ACCESS [conn242865] Successfully authenticated as principal __system on local

2018-05-09T14:46:08.672+0530 I - [conn240585] end connection 10.0.0.xx2:58936 (47 connections now open)

 

2) 1st Secondary

2018-05-09T14:44:54.266+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017 <----- this is primary
2018-05-09T14:44:54.276+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 10ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:45:54.268+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:49:48.877+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:49:48.898+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 21ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:50:54.712+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:51:27.964+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:51:27.980+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 16ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:52:27.965+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open

2018-05-09T14:56:57.057+0530 I REPL [replication-125] Restarting oplog query due to error: StaleTerm: Replication term of this node was stale; retry query. Last fetched optime (with hash): { ts: Timestamp 1525857925000|2, t: 3 }[-2714852630922522331]. Restarts remaining: 3
2018-05-09T14:56:57.057+0530 I REPL [replication-125] Scheduled new oplog query Fetcher source: ip-10-0-0-xxx:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1525857925000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 19776432 -- target:ip-10-0-0-136:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1525857925000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2018-05-09T14:56:57.059+0530 I REPL [replication-126] Restarting oplog query due to error: StaleTerm: Replication term of this node was stale; retry query. Last fetched optime (with hash): { ts: Timestamp 1525857925000|2, t: 3 }[-2714852630922522331]. Restarts remaining: 2
2018-05-09T14:56:57.060+0530 I REPL [replication-126] Scheduled new oplog query Fetcher source: ip-10-0-0-xxx:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1525857925000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } query metadata: { $replData: 1, $oplogQueryData: 1, $ssm: { $secondaryOk: true } } active: 1 timeout: 65000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 19776434 -- target:ip-10-0-0-136:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp 1525857925000|2 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 60000, term: 4 } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: RetryPolicyImpl maxAttempts: 1 maxTimeMillis: -1ms
2018-05-09T14:56:57.061+0530 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: InvalidSyncSource: Sync source's last applied OpTime { ts: Timestamp 1525857925000|2, t: 3 } is not greater than our last fetched OpTime { ts: Timestamp 1525857925000|2, t: 3 }. Choosing new sync source. <------- This is probably due to restart. Just mentioning

 

 

3) 2nd Secondary

2018-05-09T14:43:42.300+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017 <-------------- This is primary
2018-05-09T14:43:42.312+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 12ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:45:01.608+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:45:32.459+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:45:32.470+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 11ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:46:32.463+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:50:09.273+0530 I ASIO [NetworkInterfaceASIO-RS-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:50:09.287+0530 I ASIO [NetworkInterfaceASIO-RS-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:51:09.274+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-05-09T14:56:00.361+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] VoteRequester(term 3 dry run) received a yes vote from ip-10-0-0-xx1:27017; response message: { term: 3, voteGranted: true, reason: "", ok: 1.0 }
2018-05-09T14:56:00.361+0530 I REPL [ReplicationExecutor] dry election run succeeded, running for election
2018-05-09T14:56:00.362+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] VoteRequester(term 4) received a yes vote from ip-10-0-0-xx1:27017; response message: { term: 4, voteGranted: true, reason: "", ok: 1.0 }
2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] election succeeded, assuming primary role in term 4
2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] transition to PRIMARY
2018-05-09T14:56:00.364+0530 I REPL [ReplicationExecutor] Entering primary catch-up mode.
2018-05-09T14:56:00.364+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending connection to host ip-10-0-0-xxx:27017 due to bad connection status; 2 connections to that host remain open
2018-05-09T14:56:00.364+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to ip-10-0-0-xxx:27017
2018-05-09T14:56:00.375+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:56:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 14ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:56:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to ip-10-0-0-xxx:27017, took 12ms (3 connections now open to ip-10-0-0-xxx:27017)
2018-05-09T14:56:00.376+0530 I REPL [ReplicationExecutor] Caught up to the latest optime known via heartbeats after becoming primary.
2018-05-09T14:56:00.377+0530 I REPL [ReplicationExecutor] Exited primary catch-up mode.
2018-05-09T14:56:00.377+0530 I ASIO [NetworkInterfaceASIO-RS-0] Ending connection to host ip-10-0-0-xx:27017 due to bad connection status; 1 connections to that host remain open
2018-05-09T14:56:00.377+0530 I REPL [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-05-09T14:56:01.706+0530 I REPL [rsSync] transition to primary complete; database writes are now permitted
2018-05-09T14:56:02.378+0530 I REPL [ReplicationExecutor] Member ip-10-0-0-xxx:27017 is now in state SECONDARY
2018-05-09T14:56:02.593+0530 I NETWORK [thread1] connection accepted from 10.0.0.136:60041 #6132 (12 connections now open)
2018-05-09T14:56:02.594+0530 I NETWORK [conn6132] received client metadata from 10.0.0.xxx:60041 conn6132: { driver: { name: "NetworkInterfaceASIO-Replication", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:56:02.601+0530 I ACCESS [conn6132] Successfully authenticated as principal __system on local
2018-05-09T14:56:03.291+0530 I NETWORK [thread1] connection accepted from 10.0.0.136:60042 #6133 (13 connections now open)
2018-05-09T14:56:03.292+0530 I NETWORK [conn6133] received client metadata from 10.0.0.xxx:60042 conn6133: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:56:03.298+0530 I ACCESS [conn6133] Successfully authenticated as principal __system on local
2018-05-09T14:56:03.307+0530 I NETWORK [thread1] connection accepted from 10.0.0.xxx:60043 #6134 (14 connections now open)
2018-05-09T14:56:03.307+0530 I NETWORK [conn6134] received client metadata from 10.0.0.xxx:60043 conn6134: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:56:03.314+0530 I ACCESS [conn6134] Successfully authenticated as principal __system on local
2018-05-09T14:56:04.413+0530 I - [conn3289] end connection 10.0.0.xxx:53676 (14 connections now open)
2018-05-09T14:56:06.981+0530 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to ip-10-0-0-xxx:27017: InvalidSyncSource: Sync source was cleared. Was ip-10-0-0-xxx:27017
2018-05-09T14:56:07.391+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx1:35349 #6135 (14 connections now open)
2018-05-09T14:56:07.391+0530 I NETWORK [conn6135] received client metadata from 10.0.0.xx1:35349 conn6135: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:56:07.394+0530 I ACCESS [conn6135] Successfully authenticated as principal __system on local
2018-05-09T14:56:07.396+0530 I NETWORK [thread1] connection accepted from 10.0.0.xx1:35350 #6136 (15 connections now open)
2018-05-09T14:56:07.396+0530 I NETWORK [conn6136] received client metadata from 10.0.0.xx1:35350 conn6136: { driver: { name: "NetworkInterfaceASIO-RS", version: "3.4.7" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "14.04" } }
2018-05-09T14:56:07.399+0530 I ACCESS [conn6136] Successfully authenticated as principal __system on local
2018-05-09T14:57:00.376+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 2 connections to that host remain open
2018-05-09T14:57:00.377+0530 I ASIO [NetworkInterfaceASIO-Replication-0] Ending idle connection to host ip-10-0-0-xxx:27017 because the pool meets constraints; 1 connections to that host remain open

 

Let me know if you need anything else.

 

Regards,

Ajit



 Comments   
Comment by Kelsey Schubert [ 21/Jun/18 ]

Hi ajit@wizely.in,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Kelsey

Comment by Kelsey Schubert [ 14/May/18 ]

Hi ajit@wizely.in,

Would you please upload the complete logs and an archive of the diagnostic.data directory for each node in the replica set? Since the logs you've provided show that a failover occurred, and to make sure we're on the same page, would you please also specify the hostname of the node you expected to accept writes, along with the start and end of the period during which it was unable to do so?
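
For reference, something along these lines on each node should produce suitable archives (the paths below are only examples; substitute the dbPath and log path from your own mongod configuration):

# Archive the FTDC metrics for this node; diagnostic.data lives under the configured storage.dbPath
tar czf "$(hostname)-diagnostic-data.tar.gz" -C /var/lib/mongodb diagnostic.data

# Compress the full mongod log covering the incident window (path depends on systemLog.path)
gzip -c /var/log/mongodb/mongod.log > "$(hostname)-mongod.log.gz"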

Thank you,
Kelsey
