[SERVER-35425] After a mapReduce, a NamespaceNotFound exception occurs on a secondary shard Created: 05/Jun/18 Updated: 12/Jul/18 Resolved: 12/Jul/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.6.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Rui Ribeiro | Assignee: | Esha Maharishi (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Operating System: | ALL |
| Sprint: | Repl 2018-06-18, Sharding 2018-07-02, Sharding 2018-07-16 |
| Participants: | |
| Description |
Can you help with the following problem? When I run a mapReduce, after it finishes I get the following exception:

2018-06-05T14:52:53.464+0000 F REPL [repl writer worker 3] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts
2018-06-05T14:52:53.464+0000 F - [repl writer worker 3] Fatal assertion 16359 NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): {

I saw that there are some Jira issues regarding this NamespaceNotFound exception, but all of them are already closed and I could not figure out what the solution was.
Thank you.
| Comments |
| Comment by Esha Maharishi (Inactive) [ 12/Jul/18 ] |
ruiribeiro, I am going to close this as a duplicate of
| Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ] |
I think it is most likely that inserts from a migration on M_20180602.tmp.uimsi_d_a3 (which is not one of the internal collections created by mapReduce) are colliding with the drop of the tmp.mr namespaces of mapReduces on the destination shard. In this case, you should be able to avoid this specific manifestation of the bug by turning off the balancer; a minimal sketch of doing that from a mongos shell follows this comment.
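The following is a rough sketch only, not from the ticket: sh.stopBalancer(), sh.getBalancerState(), and sh.disableBalancing() are standard mongo shell helpers, and the namespace passed to sh.disableBalancing() is the one reported in the logs above.

    // Stop the balancer for the whole cluster (run against a mongos).
    sh.stopBalancer()
    sh.getBalancerState()   // should now report false

    // Alternatively, disable balancing only for the affected collection.
    sh.disableBalancing("M_20180602.tmp.uimsi_d_a3")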
| Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ] |
ruiribeiro, you mentioned on
1) Can you list the exact mapReduce commands (specifically, including the input and output namespaces) used in this multi-step process? I ask because it seems like some of your user-created collections (including those created by specifying them as the out collection of a mapReduce) start with "tmp". This is confusing at first, because mapReduce also creates internal collections that start with "tmp" ("tmp.mr" and "tmp.mrs"). A generic example of the command shape that would be helpful is sketched after this comment.
2) Are you also running migrations on your user-created collection that starts with "tmp" (or do you have the balancer on)? I think you are hitting some variant of the bug in
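For illustration only: the actual map/reduce functions and namespaces in this deployment are unknown, so "events", the uimsi field, and "tmp_uimsi_results" below are placeholder names. The sketch just shows a mapReduce whose output namespace is stated explicitly.

    // Placeholder map/reduce functions; the real ones are whatever the application uses.
    var mapFn = function() { emit(this.uimsi, 1); };
    var reduceFn = function(key, values) { return Array.sum(values); };

    // Write the result into an explicitly named output collection and database.
    db.events.mapReduce(
        mapFn,
        reduceFn,
        { out: { replace: "tmp_uimsi_results", db: "M_20180602" } }
    );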
| Comment by Spencer Brody (Inactive) [ 18/Jun/18 ] |
Yeah, the node is definitely SECONDARY when the crash happens. Something weird from the logs is that a few seconds before the crash, a collection with the same UUID but a different namespace is dropped (a small shell sketch for mapping namespaces to their UUIDs follows this comment):
This does sound a lot like the description of
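A minimal sketch, assuming shell access to the affected shard, of listing the UUID behind each namespace in the M_20180602 database so that the UUID from the fassert message can be matched against a dropped or renamed collection; only the database name is taken from the logs above.

    // Print namespace -> UUID for every collection in the affected database (MongoDB 3.6+).
    db.getSiblingDB("M_20180602").getCollectionInfos().forEach(function(c) {
        print(c.name + " -> " + (c.info && c.info.uuid));
    });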
| Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ] |
spencer, can you confirm from the logs attached to the ticket that this node is in steady-state replication when the crash happens, not initial sync? (For a live node, one way to check this from the shell is sketched below; here the attached logs are the source of truth.)
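Not from the ticket, just a small sketch of checking member state from a mongo shell connected to the replica set; STARTUP2 would indicate initial sync, while SECONDARY indicates steady-state replication.

    // Print each member's replication state (e.g. PRIMARY, SECONDARY, STARTUP2).
    rs.status().members.forEach(function(m) {
        print(m.name + " : " + m.stateStr);
    });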
| Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ] |
Actually, that's weird: how could there even be a migration active on a temp collection?! A migration is tied to a particular collection...
| Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ] |
spencer, that makes sense, since these oplog entries were marked 'fromMigrate: true'. I think we should simply not include writes to temp collections in migrations. I am curious, though, why this only seems to trip an assert in 3.6, since this seems like a beginning-of-time bug.
| Comment by Spencer Brody (Inactive) [ 18/Jun/18 ] |
Swallowing the NamespaceNotFound error during oplog application only happens for initial sync and startup recovery. In steady state we should not need to ignore NamespaceNotFound errors, as they are not expected. This means that somehow there was an 'insert' oplog entry for this namespace without there being a 'create' oplog entry for it; one way to check the oplog for that is sketched after this comment.
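A hedged sketch of inspecting the oplog for the namespace reported earlier in the ticket, to see whether the migrated inserts were ever preceded by a create entry. The query fields (op, ns, fromMigrate, o.create) are standard oplog fields; the namespace itself is simply the one from the logs above.

    var oplog = db.getSiblingDB("local").oplog.rs;

    // Insert entries written on behalf of a chunk migration for the temp collection.
    oplog.find({ ns: "M_20180602.tmp.uimsi_d_a3", op: "i", fromMigrate: true }).limit(5);

    // Any 'create' command entry for the same collection (command entries use the db's $cmd namespace).
    oplog.find({ ns: "M_20180602.$cmd", op: "c", "o.create": "tmp.uimsi_d_a3" });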
| Comment by Esha Maharishi (Inactive) [ 12/Jun/18 ] |
I am going to move this to the repl backlog as Needs Triage.
| Comment by Rui Ribeiro [ 12/Jun/18 ] |
Thank you esha.maharishi for the help. Yes, the problem is happening on a secondary node, and that collection is the temporary collection that the mapReduce creates.
| Comment by Esha Maharishi (Inactive) [ 11/Jun/18 ] |
rribeiro, I took a quick look at the logs attached here based on your comment on
It looks like the node is a secondary, and the crash occurs during regular replset syncing (since the crash occurs in sync_tail.cpp).
It seems like two of the "repl writer worker" threads (10 and 15) encountered a NamespaceNotFound exception ("writer worker caught exception: NamespaceNotFound") on an insert oplog entry for the M_20180602.tmp.uimsi_d_a3 collection. (This collection should be one of the tmp collections used for writing the intermediate results of a mapReduce; I also notice that both oplog entries that triggered the exception had fromMigrate: true.) It looks like this was the fassert tripped (on v3.6.5) in multiSyncApply(), and that ignoring NamespaceNotFound in this catch block was added in this commit on v4.0. benety.goh, judah.schvimer, should that change be backported to 3.6?
| Comment by Rui Ribeiro [ 08/Jun/18 ] |
Hi, I just updated the mongod log of the shard where this fatal assertion happens. This problem impacted a client application, which was blocked from doing inserts after the failure, even though I have a replica set of two members and one arbiter.
| Comment by Kaloian Manassiev [ 06/Jun/18 ] |
Hi ruiribeiro,
Thank you for your report. Would it be possible to attach the complete mongod logs from the shard which experienced this fatal assertion?
Best regards,
| Comment by Kaloian Manassiev [ 06/Jun/18 ] |
janna.golden/esha.maharishi - is it possible that this is a manifestation of