[SERVER-60553] Secondary replicaset initial sync errors with "NotWritablePrimary: Not-primary error while processing 'find' operation on 'database_production' database via fire-and-forget command execution." Created: 08/Oct/21  Updated: 29/Oct/23  Resolved: 01/Nov/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.5, 5.0.3
Fix Version/s: 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Robert Beekman Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 1
Labels: connection-pool, executor, replica-set
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-70510 Avoid considering recovering nodes as... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch 2022-10-03, Service Arch 2022-10-17, Service Arch 2022-10-31, Service Arch 2022-11-14
Participants:
Case:

 Description   

We have a 3-node replicaset where I'm trying to replace a node with a new server. To do this I'd like to add the new server to the replicaset and remove an old one eventually.

 

I'm trying to add a member to this replicaset, but I see a ton of errors in the syncing replica member's log.

 

"NotWritablePrimary: Not-primary error while processing 'find' operation  on 'database_production' database via fire-and-forget command execution."

On the primary server, where the replica is syncing from I can see a ton of:

 

"Dropping unhealthy pooled connection","attr":{"hostAndPort":"main-mongo-ams1:27017"}}

 

The machines can connect to each other and it seems some data is being synced, albeit very slowly.

 

The primary runs 4.4.5 and the secondary 4.4.5, but I also tried 4.4.9 with the same results. 

 

Any idea what's going on? 



 Comments   
Comment by Amirsaman Memaripour [ 01/Nov/22 ]

Pushed a patch to temporarily suppress logs for failed fire-and-forget, mirroring operations on recovering secondaries. We can revert the changes upon completion of SERVER-70510. Closing this as Fixed since the concerning logs on recovering secondaries are addressed by the fix.

Comment by Githook User [ 31/Oct/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-60553 Temporarily suppress mirroring errors on recovering secondaries
Branch: master
https://github.com/mongodb/mongo/commit/1ac82c418b0ff6fc21734962e7e355677fea9634

Comment by Edwin Zhou [ 18/May/22 ]

Thank you for following up brooke.miller@mongodb.com and robert@appsignal.com. I'll keep this ticket open since it's a separate issue to WT-7570, and we're investigating the

NotWritablePrimary: Not-primary error while processing 'find' operation on 'database_production' database via fire-and-forget command execution.

error which seems to occur on 4.4 during initial sync.

Comment by Brooke Miller [ 16/May/22 ]

Thank you, robert@appsignal.com, for the update. edwin.zhou@mongodb.com, does it make sense to close this ticket since Robert is being assisted with a separate segfault issue in WT-7570?

Comment by Robert Beekman [ 13/May/22 ]

Hi @Brooke Miller,

 

We haven't seen it happen anymore, though we haven't done initial syncs in a while either. 

Coincidentally we did have a segfault this morning on a secondary, but It seems similar to this issue: https://jira.mongodb.org/browse/WT-7570

 

Best,

Robert

Comment by Brooke Miller [ 12/May/22 ]

Hi robert@appsignal.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the information requested by Edwin above?

Regards,

Brooke

Comment by Edwin Zhou [ 07/Mar/22 ]

Hi robert@appsignal.com,

I apologize for having missed your follow up. Thank you for letting us know you've experienced this issue again after replacing a disk. If this is still an issue for you or if you still have logs and diagnostics from the previous incident, please upload them through this upload portal.

Best,
Edwin

Comment by Robert Beekman [ 22/Nov/21 ]

Hi Edwin Zhou!

We just experienced the same issue while replacing the disk in a MongoDB server. During the replication from another server we see a ton of these errors in the log. If you can provide a new uploader link that's active I can upload the requested data.

Best,

Robert

Comment by Edwin Zhou [ 18/Nov/21 ]

Hi robert@appsignal.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 03/Nov/21 ]

Hi robert@appsignal.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory and upload them to this support uploader location?

Best,
Edwin

Comment by Edwin Zhou [ 18/Oct/21 ]

Hi robert@appsignal.com,

Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

Best,
Edwin

Comment by Robert Beekman [ 08/Oct/21 ]

Complete log of a connection that triggers this error:

 

Primary side:

 

{"t":{"$date":"2021-10-08T15:48:40.116+02:00"},"s":"I",  "c":"CONNPOOL", "id":22561,   "ctx":"MirrorMaestro-1229","msg":"Dropping unhealthy pooled connection","attr":{"hostAndPort":"ams2:27017"}}

Syncing secondary side:

 

{"t":{"$date":"2021-10-08T10:10:40.303+00:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn6846","msg":"client metadata","attr":{"remote":"main-mongo1:60034","client":"conn6846","doc":{"driver":{"name":"NetworkInterfaceTL","version":"4.4.5"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"}}}}
{"t":{"$date":"2021-10-08T10:10:40.303+00:00"},"s":"I",  "c":"ACCESS",   "id":20427,   "ctx":"conn6846","msg":"Authenticate","attr":{"db":"$external","command":"{ authenticate: \"1\", mechanism: \"MONGODB-X509\", db: \"$external\" }"}}
{"t":{"$date":"2021-10-08T10:10:40.303+00:00"},"s":"I",  "c":"ACCESS",   "id":20429,   "ctx":"conn6846","msg":"Successfully authenticated","attr":{"user":"CN=main-mongo1,O=production,L=Amsterdam,ST=Noord-Holland,C=NL","db":"$external","client":"main-mongo1:60034"}}
{"t":{"$date":"2021-10-08T10:10:40.306+00:00"},"s":"I",  "c":"NETWORK",  "id":22990,   "ctx":"conn6846","msg":"DBException handling request, closing client connection","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'find' operation  on 'production' database via fire-and-forget command execution."}}
{"t":{"$date":"2021-10-08T10:10:40.306+00:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6846","msg":"Connection ended","attr":{"remote":"main-mongo1:60034","connectionId":6846,"connectionCount":524}}

Generated at Thu Feb 08 05:50:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.