[SERVER-35425] After a map reduce an exception NamespaceNotFound happens in a secondary shard Created: 05/Jun/18  Updated: 12/Jul/18  Resolved: 12/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rui Ribeiro Assignee: Esha Maharishi (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log    
Issue Links:
Duplicate
duplicates SERVER-33538 mapReduce "replace" on a sharded outp... Closed
Operating System: ALL
Sprint: Repl 2018-06-18, Sharding 2018-07-02, Sharding 2018-07-16
Participants:

 Description   

Can you help with the following problem?

When I run a map reduce, after it finishes I get the following exception:

2018-06-05T14:52:53.464+0000 F REPL     [repl writer worker 3] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts

2018-06-05T14:52:53.464+0000 F -        [repl writer worker 3] Fatal assertion 16359 NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): {

 

I saw that you have some Jira issues regarding this NamespaceNotFound exception, but they are all already closed and I could not figure out what the solution was.

 

Thank you



 Comments   
Comment by Esha Maharishi (Inactive) [ 12/Jul/18 ]

ruiribeiro, I am going to close this as a duplicate of SERVER-33538, but let me know if you are able to provide the additional info and want a more specific diagnosis.

Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ]

I think most likely, inserts from a migration on M_20180602.tmp.uimsi_d_a3 (which is not one of the internal collections created by mapReduce) are colliding with dropping the tmp.mr namespaces of mapReduces on the destination shard.

In this case, you should be able to avoid this specific manifestation of the bug by turning off the balancer.
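
A minimal sketch of doing that from a mongos shell (re-enable the balancer once the mapReduce chain has finished):

// Run these against a mongos, not against a shard's mongod directly.
sh.stopBalancer();       // disable balancing and wait for any in-flight round to finish
sh.getBalancerState();   // should now return false

// ... run the mapReduce jobs ...

sh.startBalancer();      // re-enable chunk balancing afterwards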

Comment by Esha Maharishi (Inactive) [ 10/Jul/18 ]

ruiribeiro, you mentioned on SERVER-33538 you are using a multi-mapReduce process:

I understand your solution, but I am not using the option out: "replace"; I use the option out: "reduce".

In my situation I do:

    Pick up 3 collections (sharded)
                 |
    Map Reduce (out: reduce, sharded: true)
                 |
    Output Collection (MR 1)
                 |
    Map Reduce (out: reduce, sharded: true)
                 |
    Output Collection (MR 2)
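
For reference only, a chain of the shape sketched above might look roughly like the following; every collection name and the map/reduce functions are hypothetical placeholders, which is why the exact commands are requested below.

// Hypothetical sketch of a two-step chained mapReduce with out: "reduce"
// into sharded output collections. All names and functions are placeholders.
var mapFn = function () { emit(this.key, this.value); };
var reduceFn = function (key, values) { return Array.sum(values); };

var mrDb = db.getSiblingDB("M_20180602");

// Step 1: three sharded input collections reduced into one sharded output.
["input_a", "input_b", "input_c"].forEach(function (coll) {
    mrDb[coll].mapReduce(mapFn, reduceFn,
        { out: { reduce: "mr1_out", sharded: true } });
});

// Step 2: the first output becomes the input of a second mapReduce.
mrDb.mr1_out.mapReduce(mapFn, reduceFn,
    { out: { reduce: "mr2_out", sharded: true } });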

1) Can you list the exact mapReduce commands (specifically, including the input and output namespaces) used in this multi-step process?

I ask because it seems like some of your user-created collections (including those created by specifying them as the "out" collection of a mapReduce) start with "tmp". This is confusing at first, because mapReduce also creates internal collections that start with "tmp" ("tmp.mr" and "tmp.mrs").

2) Are you also running migrations on your user-created collection that starts with "tmp" (or do you have the balancer on)?

I think you are hitting some variant of the bug in SERVER-33538, but am not sure exactly how it's manifesting.
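
For question 2, one rough way to check from a mongos is to look for recent chunk migrations of that namespace in the config database's changelog (a sketch; the namespace below is the one seen in the attached logs):

// Sketch: recent moveChunk entries touching the "tmp"-named user collection.
db.getSiblingDB("config").changelog.find({
    ns: "M_20180602.tmp.uimsi_d_a3",
    what: /^moveChunk/
}).sort({ time: -1 }).limit(10).pretty();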

Comment by Spencer Brody (Inactive) [ 18/Jun/18 ]

Yeah, the node is definitely SECONDARY when the crash happens.

Something weird from the logs is that a few seconds before the crash happens, a collection with the same UUID but a different namespace is dropped:

2018-06-05T14:45:03.169+0000 I COMMAND  [repl writer worker 4] CMD: drop M_20180602.tmp.mr.uimsi_d_6
2018-06-05T14:45:03.169+0000 I STORAGE  [repl writer worker 4] dropCollection: M_20180602.tmp.mr.uimsi_d_6 (16e14cce-7a36-4e3d-850c-0e630dedaea3) - renaming to drop-pending collection: M_20180602.system.drop.1528209903i23797t6.tmp.mr.uimsi_d_6 with drop optime { ts: Timestamp(1528209903, 23797), t: 6 }
2018-06-05T14:45:03.245+0000 I REPL     [replication-423] Completing collection drop for M_20180602.system.drop.1528209903i23797t6.tmp.mr.uimsi_d_6 with drop optime { ts: Timestamp(1528209903, 23797), t: 6 } (notification optime: { ts: Timestamp(1528209903, 23797), t: 6 })
2018-06-05T14:45:03.245+0000 I STORAGE  [replication-423] Finishing collection drop for M_20180602.system.drop.1528209903i23797t6.tmp.mr.uimsi_d_6 (16e14cce-7a36-4e3d-850c-0e630dedaea3).

2018-06-05T14:52:53.464+0000 F REPL     [repl writer worker 3] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 667), t: 6, h: 285159385916138329, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e783793e1a0491a44cd352be6608835596895d1da3e5197d2cf288f6e4b85a", vpmn: "TZAVC", hpmn: "KENSA" }, value: { msisdn: "2e045846194e8240ff95e78d3d8ad9358182f8c068cfd1b0d73711e18e4e19b1", imei: undefined, hsai_s: 14.0, hul_s: 13.0, hcl_s: 14.0, hcl_f: 13.0, hmtfw_s: 1.0, hmtfw_t: 2.0, hprn_s: 4.0 } } } on: { ts: Timestamp(1528210373, 667), t: 6, h: 285159385916138329, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e783793e1a0491a44cd352be6608835596895d1da3e5197d2cf288f6e4b85a", vpmn: "TZAVC", hpmn: "KENSA" }, value: { msisdn: "2e045846194e8240ff95e78d3d8ad9358182f8c068cfd1b0d73711e18e4e19b1", imei: undefined, hsai_s: 14.0, hul_s: 13.0, hcl_s: 14.0, hcl_f: 13.0, hmtfw_s: 1.0, hmtfw_t: 2.0, hprn_s: 4.0 } } }
2018-06-05T14:52:53.464+0000 F -        [repl writer worker 3] Fatal assertion 16359 NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 667), t: 6, h: 285159385916138329, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e783793e1a0491a44cd352be6608835596895d1da3e5197d2cf288f6e4b85a", vpmn: "TZAVC", hpmn: "KENSA" }, value: { msisdn: "2e045846194e8240ff95e78d3d8ad9358182f8c068cfd1b0d73711e18e4e19b1", imei: undefined, hsai_s: 14.0, hul_s: 13.0, hcl_s: 14.0, hcl_f: 13.0, hmtfw_s: 1.0, hmtfw_t: 2.0, hprn_s: 4.0 } } } at src/mongo/db/repl/sync_tail.cpp 1204
2018-06-05T14:52:53.464+0000 F -        [repl writer worker 3] 

This does sound a lot like the description of SERVER-33538. Sending this back to sharding to rethink whether it could be another manifestation of that bug.

Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ]

spencer, can you confirm from the logs attached to the ticket that this node is in steady state replication when the crash happens, not initial sync?

Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ]

Actually, that's weird, how could there even be a migration active on a temp collection?! A migration is tied to a particular collection...

Comment by Esha Maharishi (Inactive) [ 18/Jun/18 ]

spencer, that makes sense, since these oplog entries were marked 'fromMigrate: true'.

I think we should simply not include writes to temp collections in migrations.

I am curious though why this only seems to trip an assert in 3.6, since this seems like a beginning-of-time bug.

Comment by Spencer Brody (Inactive) [ 18/Jun/18 ]

Swallowing the NamespaceNotFound error during oplog application only happens for initial sync and startup recovery. In steady state, we should not need to ignore NamespaceNotFound errors as they are not expected. This means that somehow there was an 'insert' oplog entry for this namespace without there being a 'create' oplog entry for it.

Comment by Esha Maharishi (Inactive) [ 12/Jun/18 ]

I am going to move this to the repl backlog as Needs Triage.

Comment by Rui Ribeiro [ 12/Jun/18 ]

Thank you esha.maharishi for the help.

Yes, the problem is happening on a secondary node, and that collection is the temporary collection that the map reduce creates.
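
A quick sketch (using the database name from the logs) to list the "tmp"-prefixed namespaces and separate the internal mapReduce ones (tmp.mr.* / tmp.mrs.*) from user-created collections:

// Sketch: classify "tmp"-prefixed collections in M_20180602.
db.getSiblingDB("M_20180602").getCollectionNames()
    .filter(function (n) { return n.indexOf("tmp") === 0; })
    .forEach(function (n) {
        var internal = /^tmp\.mrs?\./.test(n);  // tmp.mr.* and tmp.mrs.* are created by mapReduce
        print((internal ? "internal mapReduce" : "user-created") + ": " + n);
    });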

 

Comment by Esha Maharishi (Inactive) [ 11/Jun/18 ]

rribeiro, I took a quick look at the logs attached here based on your comment on SERVER-33538.

It looks like the node is a secondary, and the crash occurs during regular replset syncing (since the crash occurs in sync_tail.cpp).

2018-06-05T14:53:39.508+0000 I REPL     [replexec-0] This node is cmodb889.abc.com:27018 in the config
2018-06-05T14:53:39.508+0000 I REPL     [replexec-0] transition to STARTUP2 from STARTUP
2018-06-05T14:53:39.509+0000 I REPL     [replexec-0] transition to RECOVERING from STARTUP2
2018-06-05T14:53:39.510+0000 I REPL     [rsSync] transition to SECONDARY from RECOVERING
...
2018-06-05T14:53:39.537+0000 I REPL     [replexec-2] Member cmodb881.abc.com:27018 is now in state PRIMARY
2018-06-05T14:53:39.541+0000 I REPL     [replexec-1] Member cmodb875.abc.com:27027 is now in state ARBITER
2018-06-05T14:53:39.583+0000 I REPL     [replexec-0] Member cmodb899.abc.com:27027 is now in state ARBITER
...
2018-06-05T14:53:40.510+0000 I REPL     [rsBackgroundSync] sync source candidate: cmodb881.abc.com:27018
2018-06-05T14:53:40.510+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Connecting to cmodb881.abc.com:27018
2018-06-05T14:53:40.529+0000 I ASIO     [NetworkInterfaceASIO-RS-0] Successfully connected to cmodb881.abc.com:27018, took 19ms (1 connections now open to cmodb881.abc.com:27018)
...
 
// NamespaceNotFound exception triggered by oplog entry with 'fromMigrate: true' for an insert ('op: "i"') into M_20180602.tmp.uimsi_d_a3
2018-06-05T14:53:40.871+0000 F REPL     [repl writer worker 10] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 684), t: 6, h: -5869453682661485531, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7c101c004f89624c28e364b783ee821361149287cc59d271c982feac1cc8f", vpmn: "SGPM1", hpmn: "CHNCT" }, value: { msisdn: undefined, imei: undefined, vsai_f: 2.0, vgsai_f: 2.0 } } } on: { ts: Timestamp(1528210373, 684), t: 6, h: -5869453682661485531, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7c101c004f89624c28e364b783ee821361149287cc59d271c982feac1cc8f", vpmn: "SGPM1", hpmn: "CHNCT" }, value: { msisdn: undefined, imei: undefined, vsai_f: 2.0, vgsai_f: 2.0 } } }
 
// NamespaceNotFound trips an fassert
2018-06-05T14:53:40.871+0000 F -        [repl writer worker 10] Fatal assertion 16359 NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 684), t: 6, h: -5869453682661485531, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7c101c004f89624c28e364b783ee821361149287cc59d271c982feac1cc8f", vpmn: "SGPM1", hpmn: "CHNCT" }, value: { msisdn: undefined, imei: undefined, vsai_f: 2.0, vgsai_f: 2.0 } } } at src/mongo/db/repl/sync_tail.cpp 1204
2018-06-05T14:53:40.871+0000 F REPL     [repl writer worker 15] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 697), t: 6, h: 140069530794967834, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7e2ad3736701706d1c233286baa456e2128dcb15066f622938dfb28060ac0", vpmn: "GBRVF", hpmn: "SWEIQ" }, value: { msisdn: "da2c302c05b2417c9aeb2b8cf3e196700ef6d9120cae1aa5538fb4fb59d8cb07", imei: undefined, hsai_s: 5.0, hgsai_s: 4.0, hul_s: 1.0, hgul_s: 1.0 } } } on: { ts: Timestamp(1528210373, 697), t: 6, h: 140069530794967834, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7e2ad3736701706d1c233286baa456e2128dcb15066f622938dfb28060ac0", vpmn: "GBRVF", hpmn: "SWEIQ" }, value: { msisdn: "da2c302c05b2417c9aeb2b8cf3e196700ef6d9120cae1aa5538fb4fb59d8cb07", imei: undefined, hsai_s: 5.0, hgsai_s: 4.0, hul_s: 1.0, hgul_s: 1.0 } } }
 
// same occurs on another 'repl writer worker' thread, for an oplog entry for an insert on the same collection
2018-06-05T14:53:40.871+0000 F REPL     [repl writer worker 15] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (16e14cce-7a36-4e3d-850c-0e630dedaea3): { ts: Timestamp(1528210373, 697), t: 6, h: 140069530794967834, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7e2ad3736701706d1c233286baa456e2128dcb15066f622938dfb28060ac0", vpmn: "GBRVF", hpmn: "SWEIQ" }, value: { msisdn: "da2c302c05b2417c9aeb2b8cf3e196700ef6d9120cae1aa5538fb4fb59d8cb07", imei: undefined, hsai_s: 5.0, hgsai_s: 4.0, hul_s: 1.0, hgul_s: 1.0 } } } on: { ts: Timestamp(1528210373, 697), t: 6, h: 140069530794967834, v: 2, op: "i", ns: "M_20180602.tmp.uimsi_d_a3", ui: UUID("16e14cce-7a36-4e3d-850c-0e630dedaea3"), fromMigrate: true, wall: new Date(1528210373457), o: { _id: { utc: new Date(1527897600000), imsi: "46e7e2ad3736701706d1c233286baa456e2128dcb15066f622938dfb28060ac0", vpmn: "GBRVF", hpmn: "SWEIQ" }, value: { msisdn: "da2c302c05b2417c9aeb2b8cf3e196700ef6d9120cae1aa5538fb4fb59d8cb07", imei: undefined, hsai_s: 5.0, hgsai_s: 4.0, hul_s: 1.0, hgul_s: 1.0 } } }
2018-06-05T14:53:40.871+0000 F -        [repl writer worker 15] 
 
***aborting after fassert() failure 

It seems like two of the "repl writer worker" threads (10 and 15) encountered a NamespaceNotFound exception ("writer worker caught exception: NamespaceNotFound") on an insert oplog entry for the M_20180602.tmp.uimsi_d_a3 collection.

(This collection should be one of the tmp collections used for writing the intermediate results of a mapReduce; I also notice that both oplog entries that triggered the exception had fromMigrate: true.)

It looks like this was the fassert tripped (on v3.6.5) in multiSyncApply(), and that in this commit on v4.0 for SERVER-32334, special handling was added to ignore the NamespaceNotFound error code. (Also see SERVER-24927, mentioned in the comment in the commit.)

benety.goh, judah.schvimer, should ignoring NamespaceNotFound in this catch block be backported to 3.6?

Comment by Rui Ribeiro [ 08/Jun/18 ]

Hi

I just uploaded the mongod log of the shard where this fatal assertion happened.

This problem impacted a client application, which was blocked from doing inserts after the failure.

This happened even though I have a replica set of two members and one arbiter.

Comment by Kaloian Manassiev [ 06/Jun/18 ]

Hi ruiribeiro,

Thank you for your report.

Would it be possible to attach the complete mongod logs from the shard which experienced this fatal assertion?

Best regards,
-Kal.

Comment by Kaloian Manassiev [ 06/Jun/18 ]

janna.golden/esha.maharishi - is it possible that this is a manifestation of SERVER-33538 or something related?
