[SERVER-61204] Secondary crashing due to namespace not found Created: 03/Nov/21  Updated: 01/Jun/22  Resolved: 06/Dec/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nitesh Vaidyanath Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

https://jira.mongodb.org/browse/SERVER-23390


Operating System: ALL
Participants:

 Description   

Hello, 

  We have had a couple of outages due to the error below. It only happens for the "data_profile_evaluation_status_logs" collection. I don't see a "createCollection" entry for this collection in the logs. 

2021-11-02T21:22:33.331Z F REPL     [repl writer worker 9] writer worker caught exception: NamespaceNotFound: Failed to apply operation due to missing collection (9e40471b-f2dc-4fcc-ad4b-909803130fdb): { ts: Timestamp(1635884553, 1625), t: 57, h: 5460987143494895826, v: 2, op: "i", ns: "xx.data_profile_evaluation_status_logs", ui: UUID("9e40471b-f2dc-4fcc-ad4b-909803130fdb"), wall: new Date(1635884553862), lsid: { id: UUID("6e7ec0b0-d44e-482f-9bdf-221341333532"), uid: BinData(0, 170E43ECFD8063491A7508F7918660CD321AE259BDC7D454560C5F6E04B636F9) }, txnNumber: 38, stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 }, o: {
 
 at src/mongo/db/repl/sync_tail.cpp 1266
2021-11-02T20:22:33.852Z F -        [repl writer worker 8]***aborting after fassert() failure
2021-11-02T21:07:12.589Z I CONTROL  [main] ***** SERVER RESTARTED *****
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] MongoDB starting : pid=483 port=27017 dbpath=/db/data 64-bit host=rs7-3
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] db version v3.6.16
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] git version: 204ab367a130a4fd2db1c54b02cd6a86e4e07f56
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2n  7 Dec 2017
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] allocator: tcmalloc
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] modules: none
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten] build environment:
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten]     distmod: ubuntu1604
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten]     distarch: x86_64
2021-11-02T21:07:12.620Z I CONTROL  [initandlisten]     target_arch: x86_64
 Comments   
Comment by Edwin Zhou [ 06/Dec/21 ]

Hi nvaidyanath@paloaltonetworks.com,

Thank you for following up. I will now close this ticket.

Best,
Edwin

Comment by Nitesh Vaidyanath [ 30/Nov/21 ]

I have deleted all the data and resynced, so I don't have the diagnostic data. We can close this ticket for now; I will reopen it if the issue reappears. 

Comment by Edwin Zhou [ 29/Nov/21 ]

Hi nvaidyanath@paloaltonetworks.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? The full mongod.log files will help us see the activity that leads up to this behavior and help us diagnose this issue.

Best,
Edwin

Comment by Edwin Zhou [ 15/Nov/21 ]

Hi nvaidyanath@paloaltonetworks.com,

Thank you for taking the time to upgrade to a later version of MongoDB and for updating us on this issue.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?
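For anyone following along, a rough sketch of the requested archiving step. The paths here are placeholders built in a temporary directory so the example runs end-to-end; on a real node, substitute your actual mongod.log path and $dbpath.

```python
import os
import tarfile
import tempfile

# Placeholder layout standing in for a real deployment; on a real node,
# point these at your actual mongod.log and $dbpath instead.
root = tempfile.mkdtemp()
dbpath = os.path.join(root, "db", "data")  # e.g. /db/data from the log above
diag = os.path.join(dbpath, "diagnostic.data")
os.makedirs(diag)
logfile = os.path.join(root, "mongod.log")
with open(logfile, "w") as f:
    f.write("sample log line\n")
with open(os.path.join(diag, "metrics.sample"), "wb") as f:
    f.write(b"\x00")

# Bundle the log file and the diagnostic.data directory into one archive.
archive = os.path.join(root, "mongodb-diagnostics.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(logfile, arcname="mongod.log")
    tar.add(diag, arcname="diagnostic.data")

with tarfile.open(archive) as tar:
    print(sorted(tar.getnames()))
# -> ['diagnostic.data', 'diagnostic.data/metrics.sample', 'mongod.log']
```

A plain `tar czf archive.tar.gz mongod.log $dbpath/diagnostic.data` from the command line accomplishes the same thing.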

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin

Comment by Abhilash Rathod [ 12/Nov/21 ]

Hello Team,
We are using the stable MongoDB 4.4.9 release in a four-node replica set: one primary, two secondaries, and one delayed node.
We are seeing the same error on a secondary MongoDB instance.
Our error log is below.

{"t":{"$date":"2021-11-11T10:00:26.261-08:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6952","msg":"Connection ended","attr":{"remote":"127.0.0.1:57556","connectionId":6952,"connectionCount":26}}
{"t":{"$date":"2021-11-11T10:00:26.261-08:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6953","msg":"Connection ended","attr":{"remote":"xx.xx.xx.xx:38144","connectionId":6953,"connectionCount":25}}
{"t":{"$date":"2021-11-11T10:00:27.583-08:00"},"s":"I",  "c":"NETWORK",  "id":22943,   "ctx":"listener","msg":"Connection accepted","attr":{"remote":"xx.xx.xx.xx:40716","connectionId":6954,"connectionCount":26}}
{"t":{"$date":"2021-11-11T10:00:27.583-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn6954","msg":"client metadata","attr":{"remote":"xx.xx.xx.xx:40716","client":"conn6954","doc":{"os":{"type":"linux","architecture":"amd64"},"driver":{"name":"mgo","version":"globalsign"}}}}
{"t":{"$date":"2021-11-11T10:00:27.593-08:00"},"s":"I",  "c":"ACCESS",   "id":20250,   "ctx":"conn6954","msg":"Authentication succeeded","attr":{"mechanism":"SCRAM-SHA-1","speculative":false,"principalName":"newrelic","authenticationDatabase":"admin","remote":"xx.xx.xx.xx:40716","extraInfo":{}}}
{"t":{"$date":"2021-11-11T10:00:27.753-08:00"},"s":"I",  "c":"NETWORK",  "id":22944,   "ctx":"conn6954","msg":"Connection ended","attr":{"remote":"xx.xx.xx.xx:40716","connectionId":6954,"connectionCount":25}}
{"t":{"$date":"2021-11-11T10:00:30.868-08:00"},"s":"F",  "c":"REPL",     "id":21238,   "ctx":"ReplWriterWorker-33","msg":"Writer worker caught exception","attr":{"error":"NamespaceNotFound: Failed to apply operation: { lsid: { id: UUID(\"82e67257-6cc3-4f85-b533-39d6db763148\"), uid: BinData(0, 26CF2A8F3157EDFB202FDC538950D84EFA3F537EE4A79935780F2075604FB508) }, txnNumber: 1, op: \"i\", ns: \"fpid.events\", ui: UUID(\"587beef0-6621-4298-994e-d472cb075ea7\"), o: { _id: ObjectId('618d5a3e4e14855a5232838a'), processed: 0.0, msg: \"did=+xxxxxxxx;voip_id=xxxx;action=test\", type: \"agent_flagged\", created_date: 1636653630.0 }, ts: Timestamp(1636653630, 3), t: 11, v: 2, wall: new Date(1636653630866), stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 } } :: caused by :: Unable to resolve 587beef0-6621-4298-994e-d472cb075ea7","oplogEntry":{"lsid":{"id":{"$uuid":"82e67257-6cc3-4f85-b533-39d6db763148"},"uid":{"$binary":{"base64":"Js8qjzFX7fsgL9xTiVDYTvo/U37kp5k1eA8gdWBPtQg=","subType":"0"}}},"txnNumber":1,"op":"i","ns":"fpid.events","ui":{"$uuid":"587beef0-6621-4298-994e-d472cb075ea7"},"o":{"_id":{"$oid":"618d5a3e4e14855a5232838a"},"processed":0.0,"msg":"did=+xxxxxxxx;voip_id=xxxx;action=test","type":"agent_flagged","created_date":1636653630.0},"ts":{"$timestamp":{"t":1636653630,"i":3}},"t":11,"v":2,"wall":{"$date":"2021-11-11T18:00:30.866Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}}}}
{"t":{"$date":"2021-11-11T10:00:30.868-08:00"},"s":"F",  "c":"REPL",     "id":21235,   "ctx":"OplogApplier-0","msg":"Failed to apply batch of operations","attr":{"numOperationsInBatch":1,"firstOperation":{"lsid":{"id":{"$uuid":"82e67257-6cc3-4f85-b533-39d6db763148"},"uid":{"$binary":{"base64":"Js8qjzFX7fsgL9xTiVDYTvo/U37kp5k1eA8gdWBPtQg=","subType":"0"}}},"txnNumber":1,"op":"i","ns":"fpid.events","ui":{"$uuid":"587beef0-6621-4298-994e-d472cb075ea7"},"o":{"_id":{"$oid":"618d5a3e4e14855a5232838a"},"processed":0.0,"msg":"did=+xxxxxxxx;voip_id=xxxx;action=test","type":"agent_flagged","created_date":1636653630.0},"ts":{"$timestamp":{"t":1636653630,"i":3}},"t":11,"v":2,"wall":{"$date":"2021-11-11T18:00:30.866Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"lastOperation":{"lsid":{"id":{"$uuid":"82e67257-6cc3-4f85-b533-39d6db763148"},"uid":{"$binary":{"base64":"Js8qjzFX7fsgL9xTiVDYTvo/U37kp5k1eA8gdWBPtQg=","subType":"0"}}},"txnNumber":1,"op":"i","ns":"fpid.events","ui":{"$uuid":"587beef0-6621-4298-994e-d472cb075ea7"},"o":{"_id":{"$oid":"618d5a3e4e14855a5232838a"},"processed":0.0,"msg":"did=+xxxxxxxx;voip_id=xxxx;action=test","type":"agent_flagged","created_date":1636653630.0},"ts":{"$timestamp":{"t":1636653630,"i":3}},"t":11,"v":2,"wall":{"$date":"2021-11-11T18:00:30.866Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"failedWriterThread":11,"error":"NamespaceNotFound: Failed to apply operation: { lsid: { id: UUID(\"82e67257-6cc3-4f85-b533-39d6db763148\"), uid: BinData(0, 26CF2A8F3157EDFB202FDC538950D84EFA3F537EE4A79935780F2075604FB508) }, txnNumber: 1, op: \"i\", ns: \"fpid.events\", ui: UUID(\"587beef0-6621-4298-994e-d472cb075ea7\"), o: { _id: ObjectId('618d5a3e4e14855a5232838a'), processed: 0.0, msg: \"did=+xxxxxxxx;voip_id=xxxx;action=test\", type: \"agent_flagged\", created_date: 1636653630.0 }, ts: Timestamp(1636653630, 3), t: 11, v: 2, wall: new Date(1636653630866), stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 } 
} :: caused by :: Unable to resolve 587beef0-6621-4298-994e-d472cb075ea7"}}
{"t":{"$date":"2021-11-11T10:00:30.868-08:00"},"s":"F",  "c":"-",        "id":23095,   "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"NamespaceNotFound: Failed to apply operation: { lsid: { id: UUID(\"82e67257-6cc3-4f85-b533-39d6db763148\"), uid: BinData(0, 26CF2A8F3157EDFB202FDC538950D84EFA3F537EE4A79935780F2075604FB508) }, txnNumber: 1, op: \"i\", ns: \"fpid.events\", ui: UUID(\"587beef0-6621-4298-994e-d472cb075ea7\"), o: { _id: ObjectId('618d5a3e4e14855a5232838a'), processed: 0.0, msg: \"did=+xxxxxxxx;voip_id=xxxx;action=test\", type: \"agent_flagged\", created_date: 1636653630.0 }, ts: Timestamp(1636653630, 3), t: 11, v: 2, wall: new Date(1636653630866), stmtId: 0, prevOpTime: { ts: Timestamp(0, 0), t: -1 } } :: caused by :: Unable to resolve 587beef0-6621-4298-994e-d472cb075ea7","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
{"t":{"$date":"2021-11-11T10:00:30.868-08:00"},"s":"F",  "c":"-",        "id":23096,   "ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}

After this error, we are no longer able to start the MongoDB service on this node; it fails again on every restart.
Could you please suggest what might cause this and how we can prevent it?
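As a side note, the structured (JSON) log format in 4.4 makes it easy to pull out exactly which namespace and collection UUID failed to resolve. A small sketch, using an abbreviated stand-in line shaped like the entries above (real log lines carry the full attr payload):

```python
import json

# Abbreviated stand-in for the "Writer worker caught exception" entry above.
log_line = (
    '{"t":{"$date":"2021-11-11T10:00:30.868-08:00"},"s":"F","c":"REPL",'
    '"id":21238,"ctx":"ReplWriterWorker-33",'
    '"msg":"Writer worker caught exception",'
    '"attr":{"error":"NamespaceNotFound: ...",'
    '"oplogEntry":{"op":"i","ns":"fpid.events",'
    '"ui":{"$uuid":"587beef0-6621-4298-994e-d472cb075ea7"}}}}'
)

entry = json.loads(log_line)
oplog = entry["attr"]["oplogEntry"]
ns, ui = oplog["ns"], oplog["ui"]["$uuid"]
print(ns, ui)  # fpid.events 587beef0-6621-4298-994e-d472cb075ea7
```

With the namespace and UUID in hand, one can check on an intact node whether a collection with that UUID still exists before deciding on a full resync.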


Comment by Dmitry Agranat [ 03/Nov/21 ]

Hi nvaidyanath@paloaltonetworks.com, as MongoDB 3.6 has reached EOL, could you try upgrading to one of the supported versions to see if the issue still occurs?

Comment by Nitesh Vaidyanath [ 03/Nov/21 ]

We have not converted any secondaries to standalones. Could a split-brain scenario cause this problem? We were running two nodes in AWS us-west-1a, both secondaries, and one node in us-west-1b, which was the primary. 

Comment by Nitesh Vaidyanath [ 03/Nov/21 ]

Why is the createCollection oplog entry missing? Every time this happens, we have to resync all the secondaries from scratch. 

Related issue https://jira.mongodb.org/browse/SERVER-61204

Generated at Thu Feb 08 05:51:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.