[SERVER-50137] MongoDB crashes with Invariant failure due to oplog entries generated in 3.4 Created: 06/Aug/20  Updated: 29/Oct/23  Resolved: 13/Aug/20

Status: Closed
Project: Core Server
Component/s: Storage, Upgrade/Downgrade
Affects Version/s: 4.4.0
Fix Version/s: 4.4.1, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Kin Wai Cheung Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: KP44
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

dvlp
oracle linux 6.9
running on kernel 2.6.32-696.el6.x86_64


Attachments: File config_mongod_log.gz     File metrics.2020-08.tgz    
Issue Links:
Backports
Problem/Incident
is caused by SERVER-45977 Add a new minRetentionHours field to ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Execution Team 2020-08-24
Participants:

 Description   

In SERVER-32027, MongoDB 3.6 began adding the "wall" field to all oplog entries. The presence of oplog entries lacking this field causes MongoDB 4.4.0 to crash on startup with:

{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"F",  "c":"-",        "id":23079,   *"ctx":"initandlisten","msg":"Invariant failure","attr": {"expr":"_oplogManagerCount > 0","file":"src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp","line":2467}} 

In summary: MongoDB 4.4 crashes on startup if any documents in the oplog were generated by versions earlier than 3.6.

If a rapid upgrade is truly necessary, this issue can be avoided by fully resyncing all nodes after upgrading from 3.4 but before upgrading to 4.4.

Note: Because sharded config server replica sets receive relatively few write operations, they can often hold older oplog entries than other replica sets.

To determine if a deployment is vulnerable, run the following to inspect the oldest document in the oplog:

use local
db.oplog.rs.findOne().wall

If this command returns a Date (any date), the deployment is not vulnerable:

ISODate("2020-08-10T17:41:26.248Z")

If this command returns no data, then upgrading to 4.4 will fail.
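
As an additional check (a sketch, not part of the original report), one can also query for any oplog entry that is missing the field; on an unaffected node this returns null. Note that this is an unindexed scan of the whole oplog, so it may take some time on large oplogs:

use local
db.oplog.rs.findOne({wall: {$exists: false}})  // null means every entry carries a "wall" field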

original description

Hello,

We have a sharded cluster with 1 member config replica set and 2 x 1 member shard replica sets with 1 mongos running in our dvlp env.

We have successfully upgraded the cluster from 4.0.9 to 4.2.8 following the document below:

https://docs.mongodb.com/manual/release-notes/4.2-upgrade-sharded-cluster

But when we started the upgrade process from 4.2.8 to 4.4, we came across the following issue.

https://docs.mongodb.com/manual/release-notes/4.4-upgrade-sharded-cluster/#upgrade-process

  • confirmed FCV is set to 4.2
  • disabled the balancer
  • stopped the config server (1 member) running on 4.2.8 binaries

While starting the config server on 4.4.0 binaries, it failed.

about to fork child process, waiting until server is ready for connections.
forked process: 17863
ERROR: child process failed, exited with 51
To see additional information in this output, start without the "--fork" option.

I looked in the mongod log for the config server and it errored out with an invariant failure.

Is there anything else we can look for to fix this?
 

{"t":{"$date":"2020-08-06T12:53:39.624+02:00"},"s":"W",  "c":"CONTROL",  "id":20698,   "ctx":"main","msg":"***** SERVER RESTARTED *****","tags":["startupWarnings"]}
{"t":{"$date":"2020-08-06T12:53:39.624+02:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-08-06T12:53:39.630+02:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2020-08-06T12:53:39.630+02:00"},"s":"I",  "c":"NETWORK",  "id":4648602, "ctx":"main","msg":"Implicit TCP FastOpen in use."}
{"t":{"$date":"2020-08-06T12:53:39.631+02:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":8961,"port":28017,"dbPath":"/ctr/trademarkrepository/configsvr/data","architecture":"64-bit","host":"deng-pl110"}}
{"t":{"$date":"2020-08-06T12:53:39.631+02:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.0","gitVersion":"563487e100c4215e2dce98d0af2a6a5a2d67c5cf","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":[],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2020-08-06T12:53:39.631+02:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr": {"os":{"name":"Oracle Linux Server release 6.9","version":"Kernel 2.6.32-696.el6.x86_64"}}}
{"t":{"$date":"2020-08-06T12:53:39.631+02:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr": {"options":{"config":"/ctr/trademarkrepository/configsvr/data/automation-mongod.conf","net":{"bindIp":"0.0.0.0","port":28017},"processManagement": {"fork":true},"replication": {"replSetName":"ConfigReplSet"},"sharding": {"clusterRole":"configsvr"},"storage": {"dbPath":"/ctr/trademarkrepository/configsvr/data","engine":"wiredTiger","wiredTiger":{"engineConfig":{"cacheSizeGB":4.0}}},"systemLog": {"destination":"file","logAppend":true,"logRotate":"reopen","path":"/ctr/trademarkrepository/configsvr/log/mongod.log"}}}} 
{"t":{"$date":"2020-08-06T12:53:39.634+02:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See [http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}|http://dochub.mongodb.org/core/prodnotes-filesystem%22,%22tags%22:[%22startupWarnings%22]%7D] 
{"t":{"$date":"2020-08-06T12:53:39.634+02:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr": {"config":"create,cache_size=4096M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}} 
{"t":{"$date":"2020-08-06T12:53:40.222+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:222207][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 8 through 9"}} 
{"t":{"$date":"2020-08-06T12:53:40.285+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:285073][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 9 through 9"}} 
{"t":{"$date":"2020-08-06T12:53:40.402+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:402276][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 8/384 to 9/256"}} 
{"t":{"$date":"2020-08-06T12:53:40.285+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:285073][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 9 through 9"}} 
{"t":{"$date":"2020-08-06T12:53:40.402+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:402276][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 8/384 to 9/256"}} 
{"t":{"$date":"2020-08-06T12:53:40.402+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:402678][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 8 through 9"}} 
{"t":{"$date":"2020-08-06T12:53:40.483+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:483038][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 9 through 9"}} 
{"t":{"$date":"2020-08-06T12:53:40.548+02:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr": {"message":"[1596711220:548195][8961:0x7fc5477bfb40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1596710764, 2)"}} 
{"t":{"$date":"2020-08-06T12:53:40.614+02:00"},"s":"I",  "c":"STORAGE",  "id":4795906, "ctx":"initandlisten","msg":"WiredTiger opened","attr": {"durationMillis":980}} 
{"t":{"$date":"2020-08-06T12:53:40.614+02:00"},"s":"I",  "c":"RECOVERY", "id":23987,   "ctx":"initandlisten","msg":"WiredTiger recoveryTimestamp","attr": {"recoveryTimestamp":{"$timestamp":{"t":1596710764,"i":2}}}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"I",  "c":"STORAGE",  "id":22383,   "ctx":"initandlisten","msg":"The size storer reports that the oplog contains","attr": {"numRecords":35593,"dataSize":5050117}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"I",  "c":"STORAGE",  "id":22386,   "ctx":"initandlisten","msg":"Sampling the oplog to determine where to place markers for truncation"} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"I",  "c":"STORAGE",  "id":22389,   "ctx":"initandlisten","msg":"Sampling from the oplog to determine where to place markers for truncation","attr": {"from":{"$timestamp":{"t":1539197546,"i":3}},"to": {"$timestamp":{"t":1539249669,"i":3}}}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"I",  "c":"STORAGE",  "id":22390,   "ctx":"initandlisten","msg":"Taking samples and assuming each oplog section contains","attr": {"numSamples":96,"containsNumRecords":3696,"containsNumBytes":524407}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"I",  "c":"STORAGE",  "id":22382,   "ctx":"initandlisten","msg":"WiredTiger record store oplog processing finished","attr": {"durationMillis":0}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"F",  "c":"-",        "id":23079,   *"ctx":"initandlisten","msg":"Invariant failure","attr": {"expr":"_oplogManagerCount > 0","file":"src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp","line":2467}} 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"F",  "c":"-",        "id":23080,   "ctx":"initandlisten","msg":"\n\n***aborting after invariant() failure\n\n"}* 
{"t":{"$date":"2020-08-06T12:53:40.616+02:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"initandlisten","msg":"Writing fatal message","attr": {"message":"Got signal: 6 (Aborted).\n"}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"initandlisten","msg":"BACKTRACE:  {bt}","attr": {"bt":{"backtrace":[{"a":"7FC54A497D91","b":"7FC5477DF000","o":"2CB8D91","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"}, {"a":"7FC54A499439","b":"7FC5477DF000","o":"2CBA439","s":"_ZN5mongo15printStackTraceEv","s+":"29"}, {"a":"7FC54A496C26","b":"7FC5477DF000","o":"2CB7C26","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}, {"a":"7FC545E277E0","b":"7FC545E18000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}, {"a":"7FC545AB64F5","b":"7FC545A84000","o":"324F5","s":"gsignal","s+":"35"}, {"a":"7FC545AB7CD5","b":"7FC545A84000","o":"33CD5","s":"abort","s+":"175"}, {"a":"7FC5486E1C55","b":"7FC5477DF000","o":"F02C55","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}, {"a":"7FC5483C9536","b":"7FC5477DF000","o":"BEA536","s":"_ZN5mongo18WiredTigerKVEngine16haltOplogManagerEv.cold.1904","s+":"18"}, {"a":"7FC54880211C","b":"7FC5477DF000","o":"102311C","s":"_ZN5mongo21WiredTigerRecordStoreD1Ev","s+":"2FC"}, {"a":"7FC54880868B","b":"7FC5477DF000","o":"102968B","s":"_ZN5mongo29StandardWiredTigerRecordStoreD0Ev","s+":"1B"}, {"a":"7FC5483C9CDB","b":"7FC5477DF000","o":"BEACDB","s":"_ZN5mongo18WiredTigerKVEngine21getGroupedRecordStoreEPNS_16OperationContextENS_10StringDataES3_RKNS_17CollectionOptionsENS_8KVPrefixE.cold.1921","s+":"57"}, {"a":"7FC549073A36","b":"7FC5477DF000","o":"1894A36","s":"_ZN5mongo17StorageEngineImpl15_initCollectionEPNS_16OperationContextENS_8RecordIdERKNS_15NamespaceStringEb","s+":"316"}, {"a":"7FC54907577D","b":"7FC5477DF000","o":"189677D","s":"_ZN5mongo17StorageEngineImpl11loadCatalogEPNS_16OperationContextE","s+":"90D"}, {"a":"7FC549079390","b":"7FC5477DF000","o":"189A390","s":"_ZN5mongo17StorageEngineImplC1EPNS_8KVEngineENS_20StorageEngineOptionsE","s+":"270"}, {"a":"7FC5487C2FE5","b":"7FC5477DF000","o":"FE3FE5","s":"_ZNK5mongo12_GLOBAL__N_117WiredTigerFactory6createERKNS_19StorageGlobalParamsEPKNS_21StorageEngineLockFileE","s+":"1A5"}, {"a":"7FC5490839AE","b":"7FC5477DF000","o":"18A49AE","s":"_ZN5mongo23initializeStorageEngineEPNS_14ServiceContextENS_22StorageEngineInitFlagsE","s+":"4CE"}, {"a":"7FC54877F577","b":"7FC5477DF000","o":"FA0577","s":"_ZN5mongo12_GLOBAL__N_114_initAndListenEPNS_14ServiceContextEi.isra.1409","s+":"3F7"}, {"a":"7FC548783600","b":"7FC5477DF000","o":"FA4600","s":"_ZN5mongo12_GLOBAL__N_111mongoDbMainEiPPcS2_","s+":"650"}, {"a":"7FC5486F2839","b":"7FC5477DF000","o":"F13839","s":"main","s+":"9"}, {"a":"7FC545AA2D20","b":"7FC545A84000","o":"1ED20","s":"__libc_start_main","s+":"100"}, {"a":"7FC54877EA01","b":"7FC5477DF000","o":"F9FA01","s":"_start","s+":"29"}],"processInfo": {"mongodbVersion":"4.4.0","gitVersion":"563487e100c4215e2dce98d0af2a6a5a2d67c5cf","compiledModules":[],"uname":{"sysname":"Linux","release":"2.6.32-696.el6.x86_64","version":"#1 SMP Wed Mar 22 14:27:19 PDT 2017","machine":"x86_64"},"somap":[ {"b":"7FC5477DF000","elfType":3,"buildId":"9A85C620AD954CD5836D247C61661AFDDF2FF039"}]}}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54A497D91","b":"7FC5477DF000","o":"2CB8D91","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54A499439","b":"7FC5477DF000","o":"2CBA439","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54A496C26","b":"7FC5477DF000","o":"2CB7C26","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}}
{"t": {"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54A499439","b":"7FC5477DF000","o":"2CBA439","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54A496C26","b":"7FC5477DF000","o":"2CB7C26","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}}  
{"t": {"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC545E277E0","b":"7FC545E18000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC545AB64F5","b":"7FC545A84000","o":"324F5","s":"gsignal","s+":"35"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC545AB7CD5","b":"7FC545A84000","o":"33CD5","s":"abort","s+":"175"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5486E1C55","b":"7FC5477DF000","o":"F02C55","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5483C9536","b":"7FC5477DF000","o":"BEA536","s":"_ZN5mongo18WiredTigerKVEngine16haltOplogManagerEv.cold.1904","s+":"18"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54880211C","b":"7FC5477DF000","o":"102311C","s":"_ZN5mongo21WiredTigerRecordStoreD1Ev","s+":"2FC"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54880868B","b":"7FC5477DF000","o":"102968B","s":"_ZN5mongo29StandardWiredTigerRecordStoreD0Ev","s+":"1B"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5483C9CDB","b":"7FC5477DF000","o":"BEACDB","s":"_ZN5mongo18WiredTigerKVEngine21getGroupedRecordStoreEPNS_16OperationContextENS_10StringDataES3_RKNS_17CollectionOptionsENS_8KVPrefixE.cold.1921","s+":"57"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC549073A36","b":"7FC5477DF000","o":"1894A36","s":"_ZN5mongo17StorageEngineImpl15_initCollectionEPNS_16OperationContextENS_8RecordIdERKNS_15NamespaceStringEb","s+":"316"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54907577D","b":"7FC5477DF000","o":"189677D","s":"_ZN5mongo17StorageEngineImpl11loadCatalogEPNS_16OperationContextE","s+":"90D"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC549079390","b":"7FC5477DF000","o":"189A390","s":"_ZN5mongo17StorageEngineImplC1EPNS_8KVEngineENS_20StorageEngineOptionsE","s+":"270"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5487C2FE5","b":"7FC5477DF000","o":"FE3FE5","s":"_ZNK5mongo12_GLOBAL__N_117WiredTigerFactory6createERKNS_19StorageGlobalParamsEPKNS_21StorageEngineLockFileE","s+":"1A5"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5490839AE","b":"7FC5477DF000","o":"18A49AE","s":"_ZN5mongo23initializeStorageEngineEPNS_14ServiceContextENS_22StorageEngineInitFlagsE","s+":"4CE"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54877F577","b":"7FC5477DF000","o":"FA0577","s":"_ZN5mongo12_GLOBAL__N_114_initAndListenEPNS_14ServiceContextEi.isra.1409","s+":"3F7"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC548783600","b":"7FC5477DF000","o":"FA4600","s":"_ZN5mongo12_GLOBAL__N_111mongoDbMainEiPPcS2_","s+":"650"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC5486F2839","b":"7FC5477DF000","o":"F13839","s":"main","s+":"9"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC545AA2D20","b":"7FC545A84000","o":"1ED20","s":"__libc_start_main","s+":"100"}}} 
{"t":{"$date":"2020-08-06T12:53:40.702+02:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"initandlisten","msg":"  Frame:  {frame}","attr": {"frame":{"a":"7FC54877EA01","b":"7FC5477DF000","o":"F9FA01","s":"_start","s+":"29"}}}  
 

 



 Comments   
Comment by Githook User [ 13/Aug/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-50137 Use timestamp if wall field not present when calculating oplog stones

(cherry picked from commit 5be60b26dd703f8360157d074de889d0495cab6a)
Branch: v4.4
https://github.com/mongodb/mongo/commit/df5b9daec28555e5fb091e32a821b53d2d92bafe

Comment by Githook User [ 13/Aug/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-50137 Use timestamp if wall field not present when calculating oplog stones
Branch: master
https://github.com/mongodb/mongo/commit/5be60b26dd703f8360157d074de889d0495cab6a

Comment by Kin Wai Cheung [ 12/Aug/20 ]

Hi Eric,

During the upgrade of our acceptance environment we had already resized our oplog.

We checked the wall field and it returned a date, and this date is the same date we get when we execute rs.printReplicationInfo():

CfgReplSet001:PRIMARY> use local
switched to db local
CfgReplSet001:PRIMARY> db.oplog.rs.findOne().wall
ISODate("2020-07-04T23:32:00.048Z")
CfgReplSet001:PRIMARY>

 

But we still faced the invariant failure, so just resizing the oplog didn't help us, and we had to restore and add a new config node for a resync so that it contained only new 4.2 oplog entries.

{"t":{"$date":"2020-08-12T07:49:20.167+00:00"},"s":"W",  "c":"CONTROL",  "id":20698,   "ctx":"main","msg":"***** SERVER RESTARTED *****","tags":["startupWarnings"]}
{"t":{"$date":"2020-08-12T07:49:20.172+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2020-08-12T07:49:20.179+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2020-08-12T07:49:20.179+00:00"},"s":"I",  "c":"NETWORK",  "id":4648602, "ctx":"main","msg":"Implicit TCP FastOpen in use."}
{"t":{"$date":"2020-08-12T07:49:20.180+00:00"},"s":"I",  "c":"STORAGE",  "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":26055,"port":27017,"dbPath":"/automatedwordsearchio/CTR/configsvr/data","architecture":"64-bit","host":"amgo-pl201"}}
{"t":{"$date":"2020-08-12T07:49:20.180+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.0","gitVersion":"563487e100c4215e2dce98d0af2a6a5a2d67c5cf","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":[],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2020-08-12T07:49:20.180+00:00"},"s":"I",  "c":"CONTROL",  "id":51765,   "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Oracle Linux Server release 6.9","version":"Kernel 2.6.32-696.el6.x86_64"}}}
{"t":{"$date":"2020-08-12T07:49:20.180+00:00"},"s":"I",  "c":"CONTROL",  "id":21951,   "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/automatedwordsearchio/CTR/configsvr/data/automation-mongod.conf","net":{"bindIp":"0.0.0.0","port":27017},"processManagement":{"fork":true},"replication":{"replSetName":"CfgReplSet001"},"sharding":{"clusterRole":"configsvr"},"storage":{"dbPath":"/automatedwordsearchio/CTR/configsvr/data","engine":"wiredTiger","wiredTiger":{"engineConfig":{"cacheSizeGB":4.0}}},"systemLog":{"destination":"file","logAppend":true,"logRotate":"reopen","path":"/automatedwordsearchio/CTR/configsvr/log/mongod.log"}}}}
{"t":{"$date":"2020-08-12T07:49:20.195+00:00"},"s":"I",  "c":"STORAGE",  "id":22297,   "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2020-08-12T07:49:20.195+00:00"},"s":"I",  "c":"STORAGE",  "id":22315,   "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=4096M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2020-08-12T07:49:21.089+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:89280][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 385 through 386"}}
{"t":{"$date":"2020-08-12T07:49:21.195+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:195226][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 386 through 386"}}
{"t":{"$date":"2020-08-12T07:49:21.370+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:370746][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 385/1488512 to 386/256"}}
{"t":{"$date":"2020-08-12T07:49:21.371+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:370999][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Creating the history store before applying log records. Likely recovering after anunclean shutdown on an earlier version"}}
{"t":{"$date":"2020-08-12T07:49:21.373+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:373612][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 385 through 386"}}
{"t":{"$date":"2020-08-12T07:49:21.461+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:461964][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 386 through 386"}}
{"t":{"$date":"2020-08-12T07:49:21.529+00:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1597218561:529819][26055:0x7f7fe2a85b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1597218457, 1)"}}
{"t":{"$date":"2020-08-12T07:49:21.534+00:00"},"s":"I",  "c":"STORAGE",  "id":4795906, "ctx":"initandlisten","msg":"WiredTiger opened","attr":{"durationMillis":1339}}
{"t":{"$date":"2020-08-12T07:49:21.534+00:00"},"s":"I",  "c":"RECOVERY", "id":23987,   "ctx":"initandlisten","msg":"WiredTiger recoveryTimestamp","attr":{"recoveryTimestamp":{"$timestamp":{"t":1597218457,"i":1}}}}
{"t":{"$date":"2020-08-12T07:49:21.539+00:00"},"s":"I",  "c":"STORAGE",  "id":22383,   "ctx":"initandlisten","msg":"The size storer reports that the oplog contains","attr":{"numRecords":42164,"dataSize":5024271}}
{"t":{"$date":"2020-08-12T07:49:21.539+00:00"},"s":"I",  "c":"STORAGE",  "id":22386,   "ctx":"initandlisten","msg":"Sampling the oplog to determine where to place markers for truncation"}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"I",  "c":"STORAGE",  "id":22389,   "ctx":"initandlisten","msg":"Sampling from the oplog to determine where to place markers for truncation","attr":{"from":{"$timestamp":{"t":1540356866,"i":2}},"to":{"$timestamp":{"t":1557734012,"i":1}}}}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"I",  "c":"STORAGE",  "id":22390,   "ctx":"initandlisten","msg":"Taking samples and assuming each oplog section contains","attr":{"numSamples":95,"containsNumRecords":4400,"containsNumBytes":524304}}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"I",  "c":"STORAGE",  "id":22382,   "ctx":"initandlisten","msg":"WiredTiger record store oplog processing finished","attr":{"durationMillis":1}}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"initandlisten","msg":"Invariant failure","attr":{"expr":"_oplogManagerCount > 0","file":"src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp","line":2467}}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"F",  "c":"-",        "id":23080,   "ctx":"initandlisten","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2020-08-12T07:49:21.540+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"initandlisten","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2020-08-12T07:49:21.735+00:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"initandlisten","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"7F7FE575DD91","b":"7F7FE2AA5000","o":"2CB8D91","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"},{"a":"7F7FE575F439","b":"7F7FE2AA5000","o":"2CBA439","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7F7FE575CC26","b":"7F7FE2AA5000","o":"2CB7C26","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7F7FE10ED7E0","b":"7F7FE10DE000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7F7FE0D7C4F5","b":"7F7FE0D4A000","o":"324F5","s":"gsignal","s+":"35"},{"a":"7F7FE0D7DCD5","b":"7F7FE0D4A000","o":"33CD5","s":"abort","s+":"175"},{"a":"7F7FE39A7C55","b":"7F7FE2AA5000","o":"F02C55","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"7F7FE368F536","b":"7F7FE2AA5000","o":"BEA536","s":"_ZN5mongo18WiredTigerKVEngine16haltOplogManagerEv.cold.1904","s+":"18"},{"a":"7F7FE3AC811C","b":"7F7FE2AA5000","o":"102311C","s":"_ZN5mongo21WiredTigerRecordStoreD1Ev","s+":"2FC"},{"a":"7F7FE3ACE68B","b":"7F7FE2AA5000","o":"102968B","s":"_ZN5mongo29StandardWiredTigerRecordStoreD0Ev","s+":"1B"},{"a":"7F7FE368FCDB","b":"7F7FE2AA5000","o":"BEACDB","s":"_ZN5mongo18WiredTigerKVEngine21getGroupedRecordStoreEPNS_16OperationContextENS_10StringDataES3_RKNS_17CollectionOptionsENS_8KVPrefixE.cold.1921","s+":"57"},{"a":"7F7FE4339A36","b":"7F7FE2AA5000","o":"1894A36","s":"_ZN5mongo17StorageEngineImpl15_initCollectionEPNS_16OperationContextENS_8RecordIdERKNS_15NamespaceStringEb","s+":"316"},{"a":"7F7FE433B77D","b":"7F7FE2AA5000","o":"189677D","s":"_ZN5mongo17StorageEngineImpl11loadCatalogEPNS_16OperationContextE","s+":"90D"},{"a":"7F7FE433F390","b":"7F7FE2AA5000","o":"189A390","s":"_ZN5mongo17StorageEngineImplC1EPNS_8KVEngineENS_20StorageEngineOptionsE","s+":"270"},{"a":"7F7FE3A88FE5","b":"7F7FE2AA5000","o":"FE3FE5","s":"_ZNK5mongo12_GLOBAL__N_117WiredTigerFactory6createERKNS_19StorageGlobalParamsEPKNS_21StorageEngineLockFileE","s+":"1A5"},{"a":"7F7FE43499AE","b":"7F7FE2AA5000","o":"18A49AE","s":"_ZN5mongo23initializeStorageEngineEPNS_14ServiceContextENS_22StorageEngineInitFlagsE","s+":"4CE"},{"a":"7F7FE3A45577","b":"7F7FE2AA5000","o":"FA0577","s":"_ZN5mongo12_GLOBAL__N_114_initAndListenEPNS_14ServiceContextEi.isra.1409","s+":"3F7"},{"a":"7F7FE3A49600","b":"7F7FE2AA5000","o":"FA4600","s":"_ZN5mongo12_GLOBAL__N_111mongoDbMainEiPPcS2_","s+":"650"},{"a":"7F7FE39B8839","b":"7F7FE2AA5000","o":"F13839","s":"main","s+":"9"},{"a":"7F7FE0D68D20","b":"7F7FE0D4A000","o":"1ED20","s":"__libc_start_main","s+":"100"},{"a":"7F7FE3A44A01","b":"7F7FE2AA5000","o":"F9FA01","s":"_start","s+":"29"}],"processInfo":{"mongodbVersion":"4.4.0","gitVersion":"563487e100c4215e2dce98d0af2a6a5a2d67c5cf","compiledModules":[],"uname":{"sysname":"Linux","release":"2.6.32-696.el6.x86_64","version":"#1 SMP Wed Mar 22 14:27:19 PDT 2017","machine":"x86_64"},"somap":[{"b":"7F7FE2AA5000","elfType":3,"buildId":"9A85C620AD954CD5836D247C61661AFDDF2FF039"}]}}}}

 

 

 

Comment by Eric Sedor [ 10/Aug/20 ]

Hi kinwai.cheung@clarivate.com, we're glad to hear the deployments have made it to 4.4 and will be working on a fix for this in future versions. Thanks for writing in.

Can you please file an additional ticket for your PeriodicShardedIndexConsistencyChecker question?

Comment by Kin Wai Cheung [ 10/Aug/20 ]

Additionally, it seems that since we upgraded to 4.2.8 / 4.4.0, our config server log records messages like the ones below every 10 minutes when the PeriodicShardedIndexConsistencyChecker runs.

These databases were dropped by dvlp, from a couple of years ago to more recently, afaik.

I'm finding entries in config.collections (config server) or config.cached.collections (data-bearing nodes) that reference already-dropped databases.
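
For reference, a minimal sketch (assuming a shell connected to the config server) of how the database names still referenced by the sharded collection metadata can be listed, for comparison against the databases that actually exist:

use config
// each _id in config.collections is a "<db>.<collection>" namespace string;
// extract the database names still referenced by sharded collection metadata
db.collections.distinct("_id").map(function(ns) { return ns.split(".")[0]; })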

Is this a known issue or should I create a new Jira for this? 

2020-08-09T04:01:01.807+0200 I  SHARDING [PeriodicShardedIndexConsistencyChecker] Checking consistency of sharded collection indexes across the cluster
2020-08-09T04:01:01.825+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database acc took 0 ms and failed :: caused by :: NamespaceNotFound: database acc not found
2020-08-09T04:01:01.830+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database lukes-test took 0 ms and failed :: caused by :: NamespaceNotFound: database lukes-test not found
2020-08-09T04:01:01.832+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_1492084538520 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_1492084538520 not found
2020-08-09T04:01:01.833+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_1492085358785 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_1492085358785 not found
2020-08-09T04:01:01.833+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_1492093702966 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_1492093702966 not found
2020-08-09T04:01:01.833+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_1492094907720 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_1492094907720 not found
2020-08-09T04:01:01.833+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_1492151781828 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_1492151781828 not found
 
...
2020-08-09T04:01:02.851+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-00-46 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-00-46 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-00-46 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-00-46 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-02-08 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-02-08 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-02-08 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-02-08 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-02-08 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-02-08 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-04-07 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-04-07 not found
2020-08-09T04:01:02.852+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-04-07 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-04-07 not found
2020-08-09T04:01:02.853+0200 I  SH_REFR  [ConfigServerCatalogCacheLoader-139] Refresh for database ctr-integration-tests_2020-08-07-00-04-07 took 0 ms and failed :: caused by :: NamespaceNotFound: database ctr-integration-tests_2020-08-07-00-04-07 not found
2020-08-09T04:01:02.853+0200 I  SHARDING [PeriodicShardedIndexConsistencyChecker] Found 0 collections with inconsistent indexes

Comment by Kin Wai Cheung [ 10/Aug/20 ]

Attached: config_mongod_log_20200810.gz

Hi all,

Even though the oplog first event time did show a recent date on our dvlp env:

on DVLP running 4.2.8

ConfigReplSet:PRIMARY> rs.printReplicationInfo()
configured oplog size: 5MB
log length start to end: 82986secs (23.05hrs)
oplog first event time: Thu Aug 06 2020 23:19:02 GMT+0200 (CEST)
oplog last event time: Fri Aug 07 2020 22:22:08 GMT+0200 (CEST)
now: Fri Aug 07 2020 22:22:10 GMT+0200 (CEST)

I did face the invariant issue again despite the above -> not sure if it was because the node had already been upgraded to 4.2.8 and reset the timestamp but still had 3.2/3.4 entries somewhere, or whether I messed something up. Log with verbosity 5 attached.

I restored the config server back to a previous state and used your workaround.

I added a new temp node and let it sync, so that it contained 4.2 oplog entries only. I made it primary, then removed the old node and added it back with an empty dbpath for another resync, and finally stepped down and removed the temp node again.

This allowed me to run the config server on 4.4 binaries, and afterwards the data-bearing shards and the router as well.

 

Comment by Eric Sedor [ 07/Aug/20 ]

Resizing the oplog is another way to remove older entries, yes.
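
For illustration only (a sketch; the 1000 MB value is an arbitrary example), the oplog of a member can be resized online with the replSetResizeOplog admin command; older entries are then truncated as the oplog exceeds the new maximum size:

db.adminCommand({replSetResizeOplog: 1, size: 1000})  // size is in megabytes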

Comment by Kin Wai Cheung [ 07/Aug/20 ]

Just checked the config servers on our higher environments.

We upgraded from 3.4 to 4.0.9 around May/June 2019.

And on our higher environments the oplog still contains events from before that date.

Also confirmed that our data-bearing nodes have a much more recent oplog first event time.

Does resizing the oplog for the config server, so that the first event time is after the 3.6 upgrade, also solve the issue?

ACC (1 member replica set)

CfgReplSet001:PRIMARY> rs.printReplicationInfo()
configured oplog size: 20361.590576171875MB
log length start to end: 56426261secs (15673.96hrs)
oplog first event time: Wed Oct 24 2018 20:12:52 GMT+0200 (CEST)
oplog last event time: Fri Aug 07 2020 22:10:33 GMT+0200 (CEST)
now: Fri Aug 07 2020 22:10:39 GMT+0200 (CEST)

 

PROD (3 member replica set)

configReplSet:PRIMARY> rs.printReplicationInfo()
configured oplog size: 51200MB
log length start to end: 111213801secs (30892.72hrs)
oplog first event time: Sat Jan 28 2017 16:26:26 GMT+0100 (CET)
oplog last event time: Fri Aug 07 2020 22:09:47 GMT+0200 (CEST)
now: Fri Aug 07 2020 22:09:48 GMT+0200 (CEST)

 

-----------------------

on DVLP running 4.2.8

ConfigReplSet:PRIMARY> rs.printReplicationInfo()
configured oplog size: 5MB
log length start to end: 82986secs (23.05hrs)
oplog first event time: Thu Aug 06 2020 23:19:02 GMT+0200 (CEST)
oplog last event time: Fri Aug 07 2020 22:22:08 GMT+0200 (CEST)
now: Fri Aug 07 2020 22:22:10 GMT+0200 (CEST)

Comment by Eric Sedor [ 07/Aug/20 ]

kinwai.cheung@clarivate.com, this would apply to any node. It is less likely that data-bearing nodes in the replica sets themselves will have entries that far back, but it is entirely possible. It's a question of how quickly the node is written to, and how large those updates are.

A quick way if you know the upgrade date from 3.4 is to use printReplicationInfo. As long as the first event time is after the upgrade date to 3.6 the node should be safe to upgrade to 4.4.

It is also possible to inspect the oplog directly. Let us know if the above method won't work for you or if the results leave you in doubt!
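
A sketch of inspecting it directly (run against the member in question); if the projected wall field does not appear in the output, the oldest entry predates the 3.6 format:

use local
db.oplog.rs.find({}, {ts: 1, wall: 1}).sort({$natural: 1}).limit(1)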

Comment by Kin Wai Cheung [ 07/Aug/20 ]

Thanks Eric,

I'll try this on Monday.

Does this also apply to our 2 shards, which are both single-member (primary-only) replica sets?

Also, how can I check myself whether my replica sets still have oplog entries generated by 3.4?

Since we want to upgrade our higher environments too.

Comment by Eric Sedor [ 07/Aug/20 ]

Hi kinwai.cheung@clarivate.com,

We believe that MongoDB 4.4 is crashing due to an issue handling oplog entries generated by MongoDB 3.4. This could happen if the deployment was running on version 3.4 relatively recently, and/or wasn't receiving many writes due to low sharding-related activity.

In this case, we can see from diagnostic data that the oplog has entries from as early as October 2018.

We are going to look into this further.

In the meantime as a workaround, we recommend adding a new node to the config server replica set, waiting for it to sync, and then promoting it to Primary. Then, you can remove the old node and try to upgrade the new node, which will only have 4.2 oplog entries.
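
A rough sketch of those steps in the shell, using hypothetical hostnames and ports (adjust to the actual config server replica set configuration):

// from the current config server primary
rs.add({host: "newcfg.example.net:27019"})   // add the freshly seeded member
rs.status()                                   // wait until it reaches SECONDARY
rs.stepDown()                                 // step down so an election can promote the new member
// then, connected to the new primary, remove the old member before upgrading
rs.remove("oldcfg.example.net:27019")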

If you still see trouble after that, can you please let us know and provide the same set of information (logs at high verbosity -vvvvv, and diagnostic data)?

Does this make sense? Let me know if I can answer any questions.

Sincerely,
Eric

Comment by Kin Wai Cheung [ 07/Aug/20 ]

I could explain this.

When starting the config server (1 node only) with 4.4 binaries I always hit the invariant failure.

Since I was not able to 'downgrade' or start it back on the previous 4.2.8 binaries due to an unsupported WiredTiger file version, I then tried things like repair with the 4.4 or 4.2.8 binaries, with and without the replica set option.

But it still didn't let me start up the config server successfully.

So you should have seen messages like:

    • WARNING: mongod started without --replSet yet document(s) are present in local.system.replset.

or 

2020-08-06T10:20:32.891+0200 F REPL [initandlisten] This instance has been repaired and may contain modified replicated data that would not match other replica set members. To see your repaired data, start mongod without the --replSet option. When you are finished recovering your data and would like to perform a complete re-sync, please refer to the documentation here: https://docs.mongodb.com/manual/tutorial/resync-replica-set-member/
2020-08-06T10:20:32.891+0200 F - [initandlisten] Fatal Assertion 50923 at src/mongo/db/repl/replication_coordinator_impl.cpp 529

After removing the document in local.system.replset on 4.2.8 binaries without the replica set option, I was then able to start up the config server successfully on 4.2.8 binaries with the replSet option enabled.
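
(For the record, a sketch of the removal described above, run against the standalone 4.2.8 instance; shown only to document the step taken, not as a general recommendation:)

use local
db.system.replset.remove({})  // drops the stored replica set configuration document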

---------------------------------------

Afterwards I tried several things without success.

So I restored the config server and started everything back on 4.2.8 binaries around 2020-08-06T14:41, as shown in the logs.

 

 

Comment by Bruce Lucas (Inactive) [ 07/Aug/20 ]

daniel.gottlieb here's a possible clue.

In the log file we see multiple restarts immediately after upgrading to 4.4 that hit the invariant. Then there is a restart with repair:true. After validating multiple collections without finding errors, it then hits the following fatal error and exits:

{"t":{"$date":"2020-08-06T08:01:19.396Z"},"s":"I","c":"STORAGE","id":21029,"ctx":"initandlisten","msg":"repairDatabase","attr":{"db":"config"}}
{"t":{"$date":"2020-08-06T08:01:19.405Z"},"s":"I","c":"STORAGE","id":21027,"ctx":"initandlisten","msg":"Repairing collection","attr":{"namespace":"config.version"}}
{"t":{"$date":"2020-08-06T08:01:19.405Z"},"s":"I","c":"STORAGE","id":22327,"ctx":"initandlisten","msg":"Verify succeeded. Not salvaging.","attr":{"uri":"table:collection-4-361705941156805836"}}
{"t":{"$date":"2020-08-06T08:01:19.406Z"},"s":"E","c":"STORAGE","id":20557,"ctx":"initandlisten","msg":"DBException in initAndListen, terminating","attr":{"error":"NotMasterNoSlaveOk: not master and slaveOk=false"}}

On the subsequent restart without repair it says

{"t":{"$date":"2020-08-06T08:02:47.792Z"},"s":"F","c":"STORAGE","id":50922,"ctx":"initandlisten","msg":"An incomplete repair has been detected! This is likely because a repair operation unexpectedly failed before completing. MongoDB will not start up again without --repair."}

from which I think we can deduce that the previous fatal NotMasterNoSlaveOk error was not a normal part of repair. Could this be the errant exception that is causing the invariant when we're running without repair, and do the logged details about the exception provide enough of a clue to understand the cause?

Comment by Kin Wai Cheung [ 06/Aug/20 ]

config_mongod_log.gz: log of the config server attached (log rotated at every 100 MB of size)

Comment by Daniel Gottlieb (Inactive) [ 06/Aug/20 ]

Could having a 1-member replica set for the config server and 1 member for each of the 2 shards, instead of 3 for each in a sharded cluster, be causing this issue? As I'm actually upgrading binaries on a primary only.

Maybe there's a connection, but this error is early enough in startup that the replication and sharding systems haven't been initialized yet. Any interaction there would have to be with how data files were written out when shutting down the 4.2.8 binary. At the risk of speculating, I would think that a topology of 2 nodes + arbiter or 3 nodes is more likely to create problems for restarting a binary than a single-node replica set/shard (where the majority point could be arbitrarily lagged).

Comment by Kin Wai Cheung [ 06/Aug/20 ]

@Bruce you mean the mongod log of the config server? Will attach it later today.

I'll also make another upgrade attempt from 4.2.8 to 4.4.0 on Monday.
But this time I'll shut down the router first, then both primary shards, then stop the primary config server, and afterwards start the config server with the 4.4 binary and verbosity level 5 just in case it fails again.

Previous upgrades from 4.0.9 to 4.2.8 and from 4.2.8 to 4.4.0 were done while the mongos and shards were up and running (with applications connecting).

Could having a 1-member replica set for the config server and 1 member for each of the 2 shards, instead of 3 for each in a sharded cluster, be causing this issue? As I'm actually upgrading binaries on a primary only.

Comment by Daniel Gottlieb (Inactive) [ 06/Aug/20 ]

Regarding the invariant being hit: when MongoDB starts up, it spins off a separate thread for updating oplog visibility. That thread comes into existence when the oplog collection is loaded and likewise goes out of existence when it's destructed. This invariant fires when we're trying to destruct the thread, but it hasn't been started.

The oplogManagerCount (unsurprisingly) gets incremented to 1 when startOplogManager is called. That method is called from "postConstructorInit" after the constructor succeeds.

However, if an exception is thrown prior to "starting the oplog manager" (perhaps within initializing oplog stones, which does WT reads), the code would get caught in the state we see – trying to destruct the WTRecordStore for the oplog collection before starting the oplog manager.

Comment by Daniel Gottlieb (Inactive) [ 06/Aug/20 ]

I imagine this is reproducing easily every time a 4.4.0 binary tries coming up? I'm about to follow up with a short description of what's happening, but in the meantime kinwai.cheung@clarivate.com can you run with higher verbosity logs (e.g. -vvvv)?

I believe an exception is being thrown while constructing the WTRecordStore for the oplog, but that's hitting some code that's not exception-safe, resulting in the unrelated invariant we see. I'm hoping that with the increased verbosity we'll be able to see what the original exception is.

Comment by Bruce Lucas (Inactive) [ 06/Aug/20 ]

Thanks kinwai.cheung@clarivate.com. Can you please also attach the complete log file?

Comment by Kin Wai Cheung [ 06/Aug/20 ]

Hi Eric,

I'm attaching the diagnostic logs from this month.

Please note that the upgrade only started today (9 am CEST).

I've already done some steps, like repairing with the 4.2.8/4.4.0 binaries, to get the config server running.

Since around 2:45 pm we have restored the data files on the config server to have it running on 4.2.8.

Starting it on the 4.4.0 binaries produces the error in this ticket, and subsequently starting it back on the 4.2.8 binaries generated this error:

unsupported WiredTiger file version: this build only supports versions up to 4, and the file is version 5: WT_ERROR: non-specific WiredTiger error

Comment by Eric Sedor [ 06/Aug/20 ]

Thanks for writing in kinwai.cheung@clarivate.com,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory for the config server? The contents are described here. Then, attach it to this ticket.

Gratefully,
Eric
