[SERVER-57431] MongoDB primary instance down
Created: 04/Jun/21  Updated: 23/Jul/21  Resolved: 10/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None
Type: Bug
Priority: Major - P3
Reporter: YS Mao
Assignee: Edwin Zhou
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Operating System: ALL

Description
The primary instance of the replica set went down and was unable to restart. Log from the restart attempt:

2021-06-04T11:27:54.148+0800 I CONTROL [main] ***** SERVER RESTARTED *****
... processManagement: { fork: true, pidFilePath: "/data3/pretreatment/pid/pretreatment.pid" }, replication: { enableMajorityReadConcern: false, oplogSizeMB: 819200, replSetName: "pretreatment" }, security: { authorization: "enabled", keyFile: "/data3/pretreatment/keyfile/securitypretreatment" }, storage: { dbPath: "/data3/pretreatment/dbdata", directoryPerDB: true, engine: "wiredTiger", journal: { commitIntervalMs: 50, enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 60.0, directoryForIndexes: true } } }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/logs/pretreatment/pretreatment.log", quiet: false } }
2021-06-04T11:31:48.976+0800 I STORAGE [initandlisten] Placing a marker at optime Jun 4 10:51:59:4315
...
2021-06-04T11:31:50.584+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
... os: { type: "Linux", name: "Oracle Linux Server release 7.8", architecture: "x86_64", version: "Kernel 4.14.35-1902.301.1.el7uek.x86_64" } }
...
2021-06-04T11:39:43.965+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1622532457, 3), t: 22 }
***aborting after fassert() failure
Comments

Comment by Edwin Zhou [ 10/Jun/21 ]
I look forward to hearing your follow-up.
Best,

Comment by YS Mao [ 10/Jun/21 ]
Yup, you are right Edwin, I must have checked the wrong config file yesterday. Thanks again for your analysis and advice; we will follow your recommendations and see if the same issue occurs again. The ticket can be closed.
Thanks,
YS

Comment by Edwin Zhou [ 09/Jun/21 ]
You're correct that net.wireObjectCheck is set to true by default. Given that, you may leave this parameter as it is. As for net.serviceExecutor: adaptive, this setting does appear in the configuration file you posted above.
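If it helps, you can confirm exactly which net.* options your mongod parsed at startup with the getCmdLineOpts admin command (a minimal sketch from the mongo shell):

// Show the options mongod actually parsed from its command line and config file.
// If net.serviceExecutor or net.wireObjectCheck were set, they appear under parsed.net.
db.adminCommand({ getCmdLineOpts: 1 }).parsed.net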
Best,

Comment by YS Mao [ 09/Jun/21 ]
Thanks Edwin for your detailed analysis and advice. The issue has not reproduced yet, and we plan to upgrade to 4.2 soon and to 4.4 later. I have a question regarding the recommendations below.

Regarding "net.serviceExecutor: adaptive" and "net.wireObjectCheck": we didn't set the first parameter in our configuration file, and for the second one the setting should be the same as the default value. From our config file:

net:
  ...

Here is the explanation for wireObjectCheck from the MongoDB online documentation:

net.wireObjectCheck
Type: boolean
Default: true
When true, the mongod or mongos instance validates all requests from clients upon receipt to prevent clients from inserting malformed or invalid BSON into a MongoDB database. For objects with a high degree of sub-document nesting, net.wireObjectCheck can have a small impact on performance.

So, shall we remove this parameter (net.wireObjectCheck) from the config file, or should we change the setting from true to false? And for the first one, "net.serviceExecutor: adaptive", how can we remove it since we didn't configure it? Below is the parameter list from the getParameter command:

pretreatment:SECONDARY> db.adminCommand( { getParameter : '*' } )
...
Best regards,
YS

Comment by Edwin Zhou [ 08/Jun/21 ]
After looking into the diagnostic data you provided, we noticed that your cluster configuration diverges from the default configuration in a number of ways, which makes it difficult to understand the root cause. We noticed substantial cache pressure that may have contributed to the failover on the primary, but it remains unclear why this happened. We're interested in whether this issue is reproducible given your current settings, and whether it still reproduces after upgrading to MongoDB v4.4, which has deprecated one of the configurations you used, or after returning the following configurations to their defaults:

net.serviceExecutor: adaptive
net.wireObjectCheck
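For context, the cache pressure and ticket availability we looked at can also be observed live with serverStatus (a minimal sketch from the mongo shell; field names as reported by a 4.0 mongod):

// Observe WiredTiger cache usage and concurrency-ticket availability at runtime.
var wt = db.serverStatus().wiredTiger;
print("cache bytes in use: " + wt.cache["bytes currently in the cache"]);
print("cache max bytes:    " + wt.cache["maximum bytes configured"]);
printjson(wt.concurrentTransactions); // read/write tickets: out, available, totalTickets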
Best,

Comment by YS Mao [ 08/Jun/21 ]
Thanks Edwin, we did a full resync after the issue occurred. Any idea what the root cause might be? According to the log files, the primary node stopped serving as primary, converted itself to a secondary automatically, tried to sync data from the newly elected primary for about 4-5 minutes, and then errored out and shut itself down. There was a high workload during that time, and the available write tickets dropped to zero. One more thing: we adjusted the read/write tickets from the default 128 to 512 on all nodes of this cluster a few days ago; each node has 56 CPU cores.
Regards,
YS
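PS: for reference, the ticket change can be applied at runtime roughly like this (a sketch, assuming setParameter was used; a runtime change like this does not persist across restarts):

// Raise WiredTiger concurrency tickets from the default 128 to 512 on each node.
db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 512 });
db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 512 });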
Comment by Edwin Zhou [ 07/Jun/21 ]
Thank you for providing the ftdc files and the log lines that cover this issue. Based on the log message
Best,

Comment by YS Mao [ 04/Jun/21 ]
Attached the log file and 2 diagnostic data files.

Comment by YS Mao [ 04/Jun/21 ]
mongo configuration file:

cat /etc/mongodpretreatment.cnf

Comment by YS Mao [ 04/Jun/21 ]
Error log during the primary instance's first failure:

...
2021-06-04T10:52:56.146+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:53:06.796+0800
...
2021-06-04T10:52:57.131+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1622775163, 2189), t: 35 }. source's GTE: { ts: Timestamp(1622775638, 1), t: 36 } hashes: (3943340773125100714/-6300946020427835533)
...
2021-06-04T10:56:34.137+0800 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for pretreatment/10.67.12.1:27077,10.67.12.2:27077,10.67.12.3:27077
... os: { type: "Windows_NT", name: "win32", architecture: "x64", version: "10.0.19041" }, platform: "Node.js v10.11.0, LE, mongodb-core: 3.2.7-2", application: { name: "NoSQLBoosterV5.2.12_1439.782" } }
...
***aborting after fassert() failure

Comment by YS Mao [ 04/Jun/21 ]
MongoDB shell version v4.0.13