[SERVER-57431] MongoDB primary instance down Created: 04/Jun/21  Updated: 23/Jul/21  Resolved: 10/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: YS Mao Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-06-07 at 2.25.15 PM.png     PNG File Screen Shot 2021-06-07 at 2.26.00 PM.png     File metrics.2021-06-04T01-47-58Z-00000     File metrics.2021-06-04T03-31-50Z-00000     File pretreatment.log.gz    
Issue Links:
Duplicate
is duplicated by SERVER-57336 Implement densification for numeric v... Closed
Operating System: ALL
Participants:

 Description   

The primary instance of the replica set went down and could not be restarted.

Log from the restart attempt:

2021-06-04T11:27:54.148+0800 I CONTROL [main] ***** SERVER RESTARTED *****
2021-06-04T11:27:54.157+0800 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] MongoDB starting : pid=46354 port=27077 dbpath=/data3/pretreatment/dbdata 64-bit host=wqdcsrv414.cn.prod
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] db version v4.0.13
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] git version: bda366f0b0e432ca143bc41da54d8732bd8d03c0
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] allocator: tcmalloc
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] modules: none
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] build environment:
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] distmod: rhel70
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] distarch: x86_64
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] target_arch: x86_64
2021-06-04T11:27:54.179+0800 I CONTROL [initandlisten] options: { config: "/etc/mongodpretreatment.cnf", net: { bindIp: "0.0.0.0", maxIncomingConnections: 10000, port: 27077, serviceExecutor: "adaptive", wireObjectCheck: true }, processManagement: { fork: true, pidFilePath: "/data3/pretreatment/pid/pretreatment.pid" }, replication: { enableMajorityReadConcern: false, oplogSizeMB: 819200, replSetName: "pretreatment" }, security: { authorization: "enabled", keyFile: "/data3/pretreatment/keyfile/securitypretreatment" }, storage: { dbPath: "/data3/pretreatment/dbdata", directoryPerDB: true, engine: "wiredTiger", journal: { commitIntervalMs: 50, enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 60.0, directoryForIndexes: true } } }, systemLog: { destination: "file", logAppend: true, logRotate: "rename", path: "/logs/pretreatment/pretreatment.log", quiet: false } }
2021-06-04T11:27:54.180+0800 W STORAGE [initandlisten] Detected unclean shutdown - /data3/pretreatment/dbdata/mongod.lock is not empty.
2021-06-04T11:27:54.181+0800 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2021-06-04T11:27:54.181+0800 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=61440M,cache_overflow=(file_max=0M),session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2021-06-04T11:27:54.839+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:839167][46354:0x7f2051264b80], txn-recover: Main recovery loop: starting at 547399/94917248 to 547400/256
2021-06-04T11:27:54.839+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:839519][46354:0x7f2051264b80], txn-recover: Recovering log 547399 through 547400
2021-06-04T11:27:54.860+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:860444][46354:0x7f2051264b80], file:local/collection/8-4992545715697342319.wt, txn-recover: Recovering log 547400 through 547400
2021-06-04T11:27:54.933+0800 I STORAGE [initandlisten] WiredTiger message [1622777274:933146][46354:0x7f2051264b80], file:local/collection/8-4992545715697342319.wt, txn-recover: Set global recovery timestamp: 0

2021-06-04T11:31:48.976+0800 I STORAGE [initandlisten] Placing a marker at optime Jun 4 10:51:59:4315
2021-06-04T11:31:49.069+0800 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data3/pretreatment/dbdata/diagnostic.data'
2021-06-04T11:31:49.074+0800 I REPL [initandlisten] Rollback ID is 3
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] Recovering from an unstable checkpoint (top of oplog: { ts: Timestamp(1622775163, 2189), t: 35 }, appliedThrough: { ts: Timestamp(1622775163, 2189), t: 35 })
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1622775163, 2189), t: 35 }, through the top of the oplog: { ts: Timestamp(1622775163, 2189), t: 35 }
2021-06-04T11:31:49.078+0800 I REPL [initandlisten] No oplog entries to apply for recovery. Start point is at the top of the oplog.
2021-06-04T11:31:49.115+0800 I CONTROL [LogicalSessionCacheRefresh] Sessions collection is not set up; waiting until next sessions refresh interval: Replication has not yet been configured
2021-06-04T11:31:49.115+0800 I EXECUTOR [initandlisten] No thread count configured for executor. Using number of cores / 2: 28
2021-06-04T11:31:49.115+0800 I EXECUTOR [worker-1] Started new database worker thread 1
2021-06-04T11:31:49.115+0800 I EXECUTOR [worker-2] Started new database worker thread 2
2021-06-04T11:31:49.115+0800 I CONTROL [LogicalSessionCacheReap] Failed to reap transaction table: NotYetInitialized: Replication has not yet been configured

...

2021-06-04T11:31:50.584+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
2021-06-04T11:31:50.584+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T11:31:50.588+0800 I REPL [rsBackgroundSync] Changed sync source from empty to 10.67.12.3:27077
2021-06-04T11:31:50.589+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T11:31:50.693+0800 I NETWORK [listener] connection accepted from 10.67.12.2:34152 #26 (13 connections now open)
2021-06-04T11:31:50.694+0800 I NETWORK [conn26] received client metadata from 10.67.12.2:34152 conn26: { driver: { name: "NetworkInterfaceTL", version: "4.0.13" }, os: { type: "Linux", name: "Oracle Linux Server release 7.8", architecture: "x86_64", version: "Kernel 4.14.35-1902.301.1.el7uek.x86_64" } }
2021-06-04T11:31:50.701+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1622775163, 2189), t: 35 }. source's GTE: { ts: Timestamp(1622775638, 1), t: 36 } hashes: (3943340773125100714/-6300946020427835533)
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method.
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] Canceling priority takeover callback
2021-06-04T11:31:50.702+0800 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 26
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 18
2021-06-04T11:31:50.702+0800 I NETWORK [conn17] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 127.0.0.1:38140 (connection id: 17)
2021-06-04T11:31:50.702+0800 I NETWORK [conn16] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 127.0.0.1:38138 (connection id: 16)
2021-06-04T11:31:50.702+0800 I NETWORK [conn14] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43794 (connection id: 14)
2021-06-04T11:31:50.702+0800 I NETWORK [rsBackgroundSync] Skip closing connection for connection # 12
2021-06-04T11:31:50.702+0800 I NETWORK [conn13] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43792 (connection id: 13)
2021-06-04T11:31:50.702+0800 I NETWORK [conn11] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43790 (connection id: 11)
2021-06-04T11:31:50.702+0800 I NETWORK [conn9] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43788 (connection id: 9)
2021-06-04T11:31:50.702+0800 I NETWORK [conn7] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.12.103:43780 (connection id: 7)

...


2021-06-04T11:39:43.965+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1622532457, 3), t: 22 }
2021-06-04T11:39:43.965+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2021-06-04T11:39:43.965+0800 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1542
2021-06-04T11:39:43.965+0800 F - [rsBackgroundSync]

***aborting after fassert() failure

 



 Comments   
Comment by Edwin Zhou [ 10/Jun/21 ]

Hi ys.mao@homecreditcfc.cn,

I look forward to hearing your follow up.

Best,
Edwin

Comment by YS Mao [ 10/Jun/21 ]

Yes, you are right Edwin, I must have checked the wrong config file yesterday.

Thanks again for your analysis and advice; we will follow your recommendations and see whether the same issue occurs again.

Ticket can be closed.

Thanks

YS

 

Comment by Edwin Zhou [ 09/Jun/21 ]

Hi ys.mao@homecreditcfc.cn,

You're correct that net.wireObjectCheck is set to true by default. Given that information, you may leave this parameter as it is.

For net.serviceExecutor: adaptive, this setting appears in the configuration file you posted above.

net:
  port: 27077 
  maxIncomingConnections: 10000 
  wireObjectCheck: true
  bindIp: 0.0.0.0
  serviceExecutor: adaptive  <-- this line here

Best,
Edwin

Comment by YS Mao [ 09/Jun/21 ]

Thanks Edwin for your detailed analysis and advice.

The issue has not reproduced yet, and we plan to upgrade to 4.2 soon and to 4.4 later.

I have a question regarding the recommendations below.

  • Remove net.serviceExecutor: adaptive. We believe this is possibly the culprit for the primary node failover. This setting was introduced to alleviate issues with large numbers of connections. It has since been deprecated in MongoDB v4.4.
  • Remove net.wireObjectCheck.

"net.serviceExecutor: adaptive.","net.wireObjectCheck", we didn't set the first parameter in our configuration file, for 2nd one, the setting should be as same as the default value

config file:

net:
  port: 28010
  maxIncomingConnections: 5000
  wireObjectCheck: true
  bindIp: 0.0.0.0

Here is the explanation of wireObjectCheck from the MongoDB online documentation:

net.wireObjectCheck

Type: boolean

Default: true

When true, the mongod or mongos instance validates all requests from clients upon receipt to prevent clients from inserting malformed or invalid BSON into a MongoDB database.

For objects with a high degree of sub-document nesting, net.wireObjectCheck can have a small impact on performance.

So, should we remove this parameter (net.wireObjectCheck) from the config file, or should we change the setting from true to false?

And for the first one, "net.serviceExecutor: adaptive": how can we remove it when we never configured it?

Below is the full parameter list from the getParameter command.

pretreatment:SECONDARY> db.adminCommand( { getParameter : '*' } )
{
"AsyncRequestsSenderUseBaton" : true,
"KeysRotationIntervalSec" : 7776000,
"ShardingTaskExecutorPoolHostTimeoutMS" : 300000,
"ShardingTaskExecutorPoolMaxConnecting" : 2,
"ShardingTaskExecutorPoolMaxSize" : -1,
"ShardingTaskExecutorPoolMinSize" : 1,
"ShardingTaskExecutorPoolRefreshRequirementMS" : 60000,
"ShardingTaskExecutorPoolRefreshTimeoutMS" : 20000,
"TransactionRecordMinimumLifetimeMinutes" : 30,
"adaptiveServiceExecutorIdlePctThreshold" : 60,
"adaptiveServiceExecutorMaxQueueLatencyMicros" : 500,
"adaptiveServiceExecutorRecursionLimit" : 8,
"adaptiveServiceExecutorReservedThreads" : 28,
"adaptiveServiceExecutorRunTimeJitterMillis" : 500,
"adaptiveServiceExecutorRunTimeMillis" : 5000,
"adaptiveServiceExecutorStuckThreadTimeoutMillis" : 250,
"allowRolesFromX509Certificates" : true,
"allowSecondaryReadsDuringBatchApplication" : true,
"authSchemaVersion" : 5,
"authenticationMechanisms" : [
"MONGODB-X509",
"SCRAM-SHA-1",
"SCRAM-SHA-256"
],
"bgSyncOplogFetcherBatchSize" : 13981010,
"clientCursorMonitorFrequencySecs" : 4,
"cloudFreeMonitoringEndpointURL" : "https://cloud.mongodb.com/freemonitoring/mongo",
"clusterAuthMode" : "keyFile",
"collectionClonerBatchSize" : -1,
"connPoolMaxConnsPerHost" : 200,
"connPoolMaxInUseConnsPerHost" : 2147483647,
"connPoolMaxShardedConnsPerHost" : 200,
"connPoolMaxShardedInUseConnsPerHost" : 2147483647,
"createRollbackDataFiles" : true,
"createTimestampSafeUniqueIndex" : false,
"cursorTimeoutMillis" : NumberLong(600000),
"debugCollectionUUIDs" : false,
"diagnosticDataCollectionDirectorySizeMB" : 200,
"diagnosticDataCollectionEnabled" : true,
"diagnosticDataCollectionFileSizeMB" : 10,
"diagnosticDataCollectionPeriodMillis" : 1000,
"diagnosticDataCollectionSamplesPerChunk" : 300,
"diagnosticDataCollectionSamplesPerInterimUpdate" : 10,
"disableJavaScriptJIT" : true,
"disableLogicalSessionCacheRefresh" : false,
"disableNonSSLConnectionLogging" : false,
"disabledSecureAllocatorDomains" : [ ],
"enableElectionHandoff" : true,
"enableInMemoryTransactions" : false,
"enableLocalhostAuthBypass" : true,
"enableTestCommands" : false,
"failIndexKeyTooLong" : true,
"featureCompatibilityVersion" :

{ "version" : "4.0" }

,
"forceRollbackViaRefetch" : false,
"globalConnPoolIdleTimeoutMinutes" : 2147483647,
"heapProfilingEnabled" : false,
"heapProfilingSampleIntervalBytes" : NumberLong(262144),
"honorSystemUmask" : false,
"initialSyncOplogBuffer" : "collection",
"initialSyncOplogBufferPeekCacheSize" : 10000,
"initialSyncOplogFetcherBatchSize" : 13981010,
"internalDocumentSourceCursorBatchSizeBytes" : 4194304,
"internalDocumentSourceLookupCacheSizeBytes" : 104857600,
"internalGeoNearQuery2DMaxCoveringCells" : 16,
"internalGeoPredicateQuery2DMaxCoveringCells" : 16,
"internalInsertMaxBatchSize" : 64,
"internalLookupStageIntermediateDocumentMaxSizeBytes" : NumberLong(104857600),
"internalProhibitShardOperationRetry" : false,
"internalQueryAlwaysMergeOnPrimaryShard" : false,
"internalQueryCacheEvictionRatio" : 10,
"internalQueryCacheFeedbacksStored" : 20,
"internalQueryCacheSize" : 5000,
"internalQueryEnumerationMaxIntersectPerAnd" : 3,
"internalQueryEnumerationMaxOrSolutions" : 10,
"internalQueryExecMaxBlockingSortBytes" : 33554432,
"internalQueryExecYieldIterations" : 128,
"internalQueryExecYieldPeriodMS" : 10,
"internalQueryFacetBufferSizeBytes" : 104857600,
"internalQueryForceIntersectionPlans" : false,
"internalQueryIgnoreUnknownJSONSchemaKeywords" : false,
"internalQueryMaxScansToExplode" : 200,
"internalQueryPlanEvaluationCollFraction" : 0.3,
"internalQueryPlanEvaluationMaxResults" : 101,
"internalQueryPlanEvaluationWorks" : 10000,
"internalQueryPlanOrChildrenIndependently" : true,
"internalQueryPlannerEnableHashIntersection" : false,
"internalQueryPlannerEnableIndexIntersection" : true,
"internalQueryPlannerGenerateCoveredWholeIndexScans" : false,
"internalQueryPlannerMaxIndexedSolutions" : 64,
"internalQueryProhibitBlockingMergeOnMongoS" : false,
"internalQueryProhibitMergingOnMongoS" : false,
"internalQueryS2GeoCoarsestLevel" : 0,
"internalQueryS2GeoFinestLevel" : 23,
"internalQueryS2GeoMaxCells" : 20,
"internalValidateFeaturesAsMaster" : true,
"javascriptProtection" : false,
"journalCommitInterval" : 50,
"jsHeapLimitMB" : 1100,
"localLogicalSessionTimeoutMinutes" : 30,
"logComponentVerbosity" : {
"verbosity" : 0,
"accessControl" :

{ "verbosity" : -1 }

,
"command" :

{ "verbosity" : -1 }

,
"control" :

{ "verbosity" : -1 }

,
"executor" :

{ "verbosity" : -1 }

,
"geo" :

{ "verbosity" : -1 }

,
"index" :

{ "verbosity" : -1 }

,
"network" : {
"verbosity" : -1,
"asio" :

{ "verbosity" : -1 }

,
"bridge" :

{ "verbosity" : -1 }

,
"connectionPool" :

{ "verbosity" : -1 }

},
"query" :

{ "verbosity" : -1 }

,
"replication" : {
"verbosity" : -1,
"heartbeats" :

{ "verbosity" : -1 }

,
"rollback" :

{ "verbosity" : -1 }

},
"sharding" : {
"verbosity" : -1,
"shardingCatalogRefresh" :

{ "verbosity" : -1 }

},
"storage" : {
"verbosity" : -1,
"recovery" :

{ "verbosity" : -1 }

,
"journal" :

{ "verbosity" : -1 }

},
"write" :

{ "verbosity" : -1 }

,
"ftdc" :

{ "verbosity" : -1 }

,
"tracking" :

{ "verbosity" : -1 }

,
"transaction" :

{ "verbosity" : -1 }

},
"logLevel" : 0,
"logicalSessionRefreshMillis" : 300000,
"maxAcceptableLogicalClockDriftSecs" : NumberLong(31536000),
"maxBSONDepth" : 200,
"maxIndexBuildMemoryUsageMegabytes" : 500,
"maxLogSizeKB" : 10,
"maxNumInitialSyncCollectionClonerCursors" : 1,
"maxSessions" : 1000000,
"maxSyncSourceLagSecs" : 30,
"maxTransactionLockRequestTimeoutMillis" : 5,
"migrateCloneInsertionBatchDelayMS" : 0,
"migrateCloneInsertionBatchSize" : 0,
"newCollectionsUsePowerOf2Sizes" : true,
"notablescan" : false,
"numInitialSyncAttempts" : 10,
"numInitialSyncCollectionCountAttempts" : 3,
"numInitialSyncCollectionFindAttempts" : 3,
"numInitialSyncConnectAttempts" : 10,
"numInitialSyncListCollectionsAttempts" : 3,
"numInitialSyncListDatabasesAttempts" : 3,
"numInitialSyncListIndexesAttempts" : 3,
"numInitialSyncOplogFindAttempts" : 3,
"opensslCipherConfig" : "",
"opensslDiffieHellmanParameters" : "",
"oplogFetcherInitialSyncMaxFetcherRestarts" : 10,
"oplogFetcherSteadyStateMaxFetcherRestarts" : 1,
"oplogInitialFindMaxSeconds" : 60,
"oplogRetriedFindMaxSeconds" : 2,
"orphanCleanupDelaySecs" : 900,
"periodicNoopIntervalSecs" : 10,
"priorityTakeoverFreshnessWindowSeconds" : 2,
"quiet" : false,
"rangeDeleterBatchDelayMS" : 20,
"rangeDeleterBatchSize" : 0,
"recoverFromOplogAsStandalone" : false,
"replBatchLimitBytes" : 104857600,
"replBatchLimitOperations" : 5000,
"replElectionTimeoutOffsetLimitFraction" : 0.15,
"replIndexPrefetch" : "all",
"replWriterThreadCount" : 16,
"reportOpWriteConcernCountersInServerStatus" : false,
"reservedServiceExecutorRecursionLimit" : 8,
"rollbackRemoteOplogQueryBatchSize" : 2000,
"rollbackTimeLimitSecs" : 86400,
"saslHostName" : "wqdcsrv414.cn.prod",
"saslServiceName" : "mongodb",
"saslauthdPath" : "",
"scramIterationCount" : 10000,
"scramSHA256IterationCount" : 15000,
"scriptingEngineInterruptIntervalMS" : 1000,
"shardedConnPoolIdleTimeoutMinutes" : 2147483647,
"skipCorruptDocumentsWhenCloning" : false,
"skipShardingConfigurationChecks" : false,
"sslMode" : "disabled",
"sslWithholdClientCertificate" : false,
"startupAuthSchemaValidation" : true,
"suppressNoTLSPeerCertificateWarning" : false,
"syncdelay" : 60,
"synchronousServiceExecutorRecursionLimit" : 8,
"taskExecutorPoolSize" : 1,
"tcmallocAggressiveMemoryDecommit" : 0,
"tcmallocEnableMarkThreadTemporarilyIdle" : false,
"tcmallocMaxTotalThreadCacheBytes" : NumberLong(1073741824),
"testingSnapshotBehaviorInIsolation" : false,
"traceExceptions" : false,
"traceWriteConflictExceptions" : false,
"transactionLifetimeLimitSeconds" : 60,
"ttlMonitorEnabled" : true,
"ttlMonitorSleepSecs" : 60,
"waitForSecondaryBeforeNoopWriteMS" : 10,
"waitForStepDownOnNonCommandShutdown" : true,
"wiredTigerConcurrentReadTransactions" : 256,
"wiredTigerConcurrentWriteTransactions" : 256,
"wiredTigerCursorCacheSize" : -100,
"wiredTigerEngineRuntimeConfig" : "",
"wiredTigerMaxCacheOverflowSizeGB" : 0,
"wiredTigerSessionCloseIdleTimeSecs" : 300,
"writePeriodicNoops" : true,
"ok" : 1,
"operationTime" : Timestamp(1623206657, 8),
"$clusterTime" : {
"clusterTime" : Timestamp(1623206657, 8),
"signature" :

{ "hash" : BinData(0,"RGhE9taFjsdy8lg9+J37lfYYcuU="), "keyId" : NumberLong("6924963697634854396") }

}
}
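
As a side note, besides getParameter, the active executor is also reported in serverStatus (assuming the v4.0 layout, where network.serviceExecutorTaskStats.executor reads "adaptive" or "passthrough"):

// Check which service executor this mongod is actually running with.
// With the posted config this should report "adaptive"; after removing
// net.serviceExecutor and restarting, it should report "passthrough".
pretreatment:SECONDARY> db.serverStatus().network.serviceExecutorTaskStats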

 

Best regards

YS

Comment by Edwin Zhou [ 08/Jun/21 ]

Hi ys.mao@homecreditcfc.cn,

After looking into the diagnostic data you provided, we noticed that your cluster configuration diverges from the default configuration in a number of ways which makes it difficult to understand the root cause.

We noticed that there is substantial cache pressure that may have contributed to the failover on the primary, but it remains unclear why this happened.

We're interested in whether this issue is reproducible given your current settings, and whether it still reproduces after upgrading to MongoDB v4.4 (which has deprecated one of the configurations you used) or after returning the following configurations to their defaults:

  • Remove net.serviceExecutor: adaptive. We believe this is possibly the culprit for the primary node failover. This setting was introduced to alleviate issues with large numbers of connections. It has since been deprecated in MongoDB v4.4.
  • Remove net.wireObjectCheck.
  • Return storage.journal.commitIntervalMs to 100 from 50.
  • Revert the read/write tickets back to 128.
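
Taken together, the file-based changes above would amount to net and storage sections like the following (a sketch based on the configuration file posted in this ticket, not a verified drop-in replacement; the read/write tickets are runtime parameters and are reverted separately):

```yaml
# Sketch of the recommended reverts, derived from the posted config.
# The serviceExecutor and wireObjectCheck lines are simply removed,
# so their defaults (synchronous executor, wireObjectCheck: true) apply.
net:
  port: 27077
  maxIncomingConnections: 10000
  bindIp: 0.0.0.0
storage:
  dbPath: /data3/pretreatment/dbdata
  directoryPerDB: true
  engine: wiredTiger
  journal:
    enabled: true
    commitIntervalMs: 100   # was 50
  wiredTiger:
    engineConfig:
      cacheSizeGB: 60
      directoryForIndexes: true
```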

Best,
Edwin

Comment by YS Mao [ 08/Jun/21 ]

Thanks Edwin, we did a full resync after the issue occurred.

Any idea what the root cause might be? According to the log files, the primary node stopped serving as primary, automatically converted itself to a secondary, tried to sync data from the newly elected primary for about 4-5 minutes, and then errored out and shut itself down.

There was a high workload during that time and the available write tickets dropped to zero. One more thing: we adjusted the read/write tickets from the default 128 to 512 on all nodes of this cluster a few days ago; each node has 56 CPU cores.
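
For reference, that ticket change can be reverted at runtime with setParameter, using the parameter names visible in the getParameter output elsewhere in this ticket (run against each node; no restart needed):

// Revert WiredTiger read/write tickets from 512 back to the 128 default.
db.adminCommand({ setParameter: 1, wiredTigerConcurrentReadTransactions: 128 })
db.adminCommand({ setParameter: 1, wiredTigerConcurrentWriteTransactions: 128 })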

Regards

YS 

Comment by Edwin Zhou [ 07/Jun/21 ]

Hi ys.mao@homecreditcfc.cn,

Thank you for providing the ftdc files and the log lines that cover this issue. Based on the log message

2021-06-04T10:56:54.909+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back

the ideal resolution is to perform a clean resync from an unaffected node.

Best,
Edwin

Comment by YS Mao [ 04/Jun/21 ]

Attached the log file and two diagnostic data files.

Comment by YS Mao [ 04/Jun/21 ]

mongo configuration file:

cat /etc/mongodpretreatment.cnf
storage:
  dbPath: /data3/pretreatment/dbdata
  journal:
    enabled: true
    commitIntervalMs: 50
  directoryPerDB: true
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 60
      directoryForIndexes: true
systemLog:
  quiet: false
  path: /logs/pretreatment/pretreatment.log
  destination: file
  logAppend: true
  logRotate: rename
processManagement:
  fork: true
  pidFilePath: /data3/pretreatment/pid/pretreatment.pid
net:
  port: 27077
  maxIncomingConnections: 10000
  wireObjectCheck: true
  bindIp: 0.0.0.0
  serviceExecutor: adaptive
security:
  keyFile: /data3/pretreatment/keyfile/securitypretreatment
  authorization: enabled
replication:
  oplogSizeMB: 819200
  replSetName: pretreatment
  enableMajorityReadConcern: false

Comment by YS Mao [ 04/Jun/21 ]

Error log from the primary instance's first failure:

...

2021-06-04T10:52:56.146+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:53:06.796+0800
2021-06-04T10:52:56.401+0800 I EXECUTOR [worker-2962] Thread was only executing tasks 59% over the last 5045ms. Exiting thread.
2021-06-04T10:52:56.433+0800 I EXECUTOR [worker-3013] Thread was only executing tasks 54% over the last 4561ms. Exiting thread.
2021-06-04T10:52:56.462+0800 I EXECUTOR [worker-3001] Thread was only executing tasks 53% over the last 4709ms. Exiting thread.
2021-06-04T10:52:56.473+0800 I EXECUTOR [worker-2949] Thread was only executing tasks 60% over the last 5250ms. Exiting thread.
2021-06-04T10:52:56.523+0800 I EXECUTOR [worker-2959] Thread was only executing tasks 58% over the last 5197ms. Exiting thread.
2021-06-04T10:52:56.525+0800 I EXECUTOR [worker-3019] Thread was only executing tasks 60% over the last 4592ms. Exiting thread.

...

2021-06-04T10:52:57.131+0800 I REPL [rsBackgroundSync] sync source candidate: 10.67.12.3:27077
2021-06-04T10:52:57.133+0800 I ASIO [RS] Connecting to 10.67.12.3:27077
2021-06-04T10:52:57.136+0800 I EXECUTOR [worker-2992] Thread was only executing tasks 56% over the last 5475ms. Exiting thread.
2021-06-04T10:52:57.141+0800 I EXECUTOR [worker-3014] Thread was only executing tasks 54% over the last 5259ms. Exiting thread.
2021-06-04T10:52:57.142+0800 I EXECUTOR [worker-3060] Thread was only executing tasks 49% over the last 4797ms. Exiting thread.
2021-06-04T10:52:57.144+0800 I REPL [rsBackgroundSync] Changed sync source from empty to 10.67.12.3:27077
2021-06-04T10:52:57.145+0800 I ASIO [RS] Connecting to 10.67.12.3:27077

2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last op time fetched: { ts: Timestamp(1622775163, 2189), t: 35 }. source's GTE: { ts: Timestamp(1622775638, 1), t: 36 } hashes: (3943340773125100714/-6300946020427835533)
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(1622775002, 3234), t: 35 }
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Rollback using the 'rollbackViaRefetch' method.
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] Canceling priority takeover callback
2021-06-04T10:52:57.237+0800 I REPL [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2021-06-04T10:52:57.237+0800 I NETWORK [conn54001871] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.231.177:52896 (connection id: 54001871)
2021-06-04T10:52:57.237+0800 I NETWORK [conn54001870] Error receiving request from client: CallbackCanceled: Callback was canceled. Ending connection from 10.67.231.177:52894 (connection id: 54001870)

...

2021-06-04T10:56:34.137+0800 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for pretreatment/10.67.12.1:27077,10.67.12.2:27077,10.67.12.3:27077
2021-06-04T10:56:34.602+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:56:45.262+0800
2021-06-04T10:56:42.362+0800 I NETWORK [listener] connection accepted from 10.30.218.4:58270 #54007208 (5214 connections now open)
2021-06-04T10:56:42.363+0800 I NETWORK [conn54007208] received client metadata from 10.30.218.4:58270 conn54007208: { driver: { name: "nodejs", version: "3.2.7-2" }, os: { type: "Windows_NT", name: "win32", architecture: "x64", version: "10.0.19041" }, platform: "Node.js v10.11.0, LE, mongodb-core: 3.2.7-2", application: { name: "NoSQLBoosterV5.2.12_1439.782" } }
2021-06-04T10:56:42.403+0800 I ACCESS [conn54007208] Successfully authenticated as principal root on admin from client 10.30.218.4:58270
2021-06-04T10:56:42.414+0800 I NETWORK [worker-148] end connection 10.30.218.4:58270 (5213 connections now open)
2021-06-04T10:56:45.262+0800 I REPL [replexec-3062] Canceling priority takeover callback
2021-06-04T10:56:45.262+0800 I REPL [replexec-3062] Not starting an election for a priority takeover, since we are not electable due to: Not standing for election because member is not currently a secondary; member is not caught up enough to the most up-to-date member to call for priority takeover - must be within 2 seconds (mask 0x408)
2021-06-04T10:56:46.602+0800 I REPL [replexec-3062] Scheduling priority takeover at 2021-06-04T10:56:58.059+0800
2021-06-04T10:56:49.228+0800 I CONTROL [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
2021-06-04T10:56:52.441+0800 I NETWORK [listener] connection accepted from 10.30.218.4:57038 #54007209 (5214 connections now open)
2021-06-04T10:56:52.442+0800 I NETWORK [conn54007209] received client metadata from 10.30.218.4:57038 conn54007209: { driver: { name: "nodejs", version: "3.2.7-2" }, os: { type: "Windows_NT", name: "win32", architecture: "x64", version: "10.0.19041" }, platform: "Node.js v10.11.0, LE, mongodb-core: 3.2.7-2", application: { name: "NoSQLBoosterV5.2.12_1439.782" } }
2021-06-04T10:56:52.478+0800 I ACCESS [conn54007209] Successfully authenticated as principal root on admin from client 10.30.218.4:57038
2021-06-04T10:56:52.489+0800 I NETWORK [worker-2910] end connection 10.30.218.4:57038 (5213 connections now open)
2021-06-04T10:56:53.423+0800 I REPL [rsBackgroundSync] Incremented the rollback ID to 3
2021-06-04T10:56:54.909+0800 I ROLLBACK [rsBackgroundSync] Rollback finished. The final minValid is: { ts: Timestamp(1622532457, 3), t: 22 }
2021-06-04T10:56:54.909+0800 F ROLLBACK [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back.
2021-06-04T10:56:54.909+0800 F - [rsBackgroundSync] Fatal Assertion 40507 at src/mongo/db/repl/rs_rollback.cpp 1542
2021-06-04T10:56:54.909+0800 F - [rsBackgroundSync]

***aborting after fassert() failure

Comment by YS Mao [ 04/Jun/21 ]

MongoDB shell version v4.0.13
connecting to: mongodb://127.0.0.1:27077/?authSource=admin&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("afb50b77-b459-42c8-8526-c64b046f5060") }
MongoDB server version: 4.0.13
pretreatment:PRIMARY> db.hostInfo().os

{ "type" : "Linux", "name" : "Oracle Linux Server release 7.8", "version" : "Kernel 4.14.35-1902.301.1.el7uek.x86_64" }
Generated at Thu Feb 08 05:41:49 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.