[SERVER-23390] Missing collection during replication causes shutdown Created: 29/Mar/16  Updated: 20/Dec/16  Resolved: 18/Nov/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-19768 Failed applyOps command does not crea... Closed
related to SERVER-17634 do not apply replicated insert operat... Closed
is related to SERVER-26741 "Fatal Assertion 16360" triggered by ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2016-11-21
Participants:
Linked BF Score: 0

 Description   

Several users have reported crashes with the message "[repl writer worker 10] writer worker caught exception: :: caused by :: 26 Failed to apply insert due to missing collection" when replicating a write.

This is likely fallout from the changes in SERVER-17634 that went into 3.2, which enforced the rule that the primary must replicate a createCollection command whenever a collection is created, whether explicitly or implicitly. Whenever this was not done, it was a bug.
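
To illustrate the rule described above, here is a minimal sketch of a strict applier that refuses to create a collection implicitly on insert. It is a toy in-memory model, not the server's implementation: the entry shape (`op`, `ns`, `doc`), the `apply_oplog_entry` function, and the `MissingCollectionError` class are all hypothetical names chosen for this example.

```python
class MissingCollectionError(Exception):
    """Raised when an insert targets a collection that does not exist."""


def apply_oplog_entry(collections, entry):
    """Apply one simplified (hypothetical) oplog entry to a dict of collections."""
    op, ns = entry["op"], entry["ns"]
    if op == "create":
        # The collection only comes into existence through an explicit create entry.
        collections.setdefault(ns, [])
    elif op == "insert":
        if ns not in collections:
            # Strict behavior: never create the collection implicitly on insert.
            raise MissingCollectionError(
                f"Failed to apply insert due to missing collection: {ns}"
            )
        collections[ns].append(entry["doc"])
    else:
        raise ValueError(f"unsupported op: {op}")


# A secondary that receives the create entry first applies the insert cleanly.
secondary = {}
apply_oplog_entry(secondary, {"op": "create", "ns": "test.users"})
apply_oplog_entry(secondary, {"op": "insert", "ns": "test.users", "doc": {"_id": 1}})
```

In this model, an insert that arrives without a preceding create entry raises instead of silently creating the collection, which corresponds to the failure reported in this ticket.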



 Comments   
Comment by Spencer Brody (Inactive) [ 18/Nov/16 ]

In every case where a user reported this and we were able to find a root cause, it turned out that they had dropped a collection while the node was running in standalone mode, so no oplog entry was recorded for the drop. We never found any evidence of an actual bug in replication leading to these errors.
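
The root cause described in this comment can be sketched with a toy simulation. This is an in-memory model under stated assumptions, not server code: the `replay` helper and the entry fields (`op`, `ns`, `doc`) are hypothetical, and the standalone drop is modeled as a local change that is never appended to the shared oplog.

```python
def replay(oplog, collections):
    """Replay simplified (hypothetical) oplog entries against a dict of collections.

    Returns the first entry that cannot be applied, or None if all entries apply.
    """
    for entry in oplog:
        ns = entry["ns"]
        if entry["op"] == "create":
            collections.setdefault(ns, [])
        elif entry["op"] == "drop":
            collections.pop(ns, None)
        elif entry["op"] == "insert":
            if ns not in collections:
                return entry  # insert into a missing collection cannot be applied
            collections[ns].append(entry["doc"])
    return None


# Both nodes start in sync by replaying the same oplog.
oplog = [
    {"op": "create", "ns": "test.users"},
    {"op": "insert", "ns": "test.users", "doc": {"_id": 1}},
]
primary, member = {}, {}
replay(oplog, primary)
replay(oplog, member)

# The member runs as a standalone and drops the collection.
# The drop changes local data only; no entry is added to the oplog.
member.pop("test.users")

# Back in the replica set, the member receives the next replicated insert.
new_entries = [{"op": "insert", "ns": "test.users", "doc": {"_id": 2}}]
primary_failure = replay(new_entries, primary)
member_failure = replay(new_entries, member)
```

Here `primary_failure` is `None`, while `member_failure` is the insert entry that could not be applied, because the member no longer has the collection and the oplog contains nothing that would explain or repair the difference.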

Comment by Eric Milkie [ 07/Apr/16 ]

The user case that triggered this ticket was caused by running a replica set member in standalone mode (by omitting --replSet) and then doing writes, which caused the nodes' data to get out of sync.

Comment by Scott Hernandez (Inactive) [ 29/Mar/16 ]

Please upload the logs and oplog (if possible) from incidents where this occurred.

Also, please include any manual actions taken and their effects. For example, were you able to restart the node and everything went back to normal, or was a wipe + resync done?
