[SERVER-58050] mongod standalone sharded replicaset issue during upgrade 4.4.0 to 4.4.6 Created: 24/Jun/21  Updated: 28/Jul/21  Resolved: 28/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Kin Wai Cheung Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-06-30 at 9.20.40 AM.png    
Participants:

 Description   

Hi,

On our dev environment we run 4.4.0 binaries, consisting of:

1 single-member config server replica set

2 single-member shard replica sets

1 router (mongos)

I've disabled the balancer, stopped the config server, and started it with the 4.4.6 binaries.

This went fine. I then proceeded to stop shard0001 and start it with the 4.4.6 binaries.

At this moment it appears not to have made any progress for over 20 minutes.
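
For reference, the balancer is disabled from the mongos (router); a minimal sketch only, with a placeholder host and port and auth options omitted:

# run against the mongos, not a shard; add the usual -u/-p/--authenticationDatabase options, since the cluster uses a keyFile
mongo --host <router-host> --port <router-port> --eval 'sh.stopBalancer(); sh.getBalancerState()'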

Log of the mongod:

{"t":{"$date":"2021-06-24T11:10:30.195+02:00"},"s":"I", "c":"CONTROL", "id":20698, "ctx":"main","msg":"***** SERVER RESTARTED *****"}
{"t":{"$date":"2021-06-24T11:10:30.199+02:00"},"s":"I", "c":"CONTROL", "id":23285, "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2021-06-24T11:10:30.200+02:00"},"s":"W", "c":"ASIO", "id":22601, "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}
{"t":{"$date":"2021-06-24T11:10:30.201+02:00"},"s":"I", "c":"NETWORK", "id":4648602, "ctx":"main","msg":"Implicit TCP FastOpen in use."}
{"t":{"$date":"2021-06-24T11:10:30.300+02:00"},"s":"I", "c":"STORAGE", "id":4615611, "ctx":"initandlisten","msg":"MongoDB starting","attr":{"pid":43037,"port":28018,"dbPath":"/ctr/trademarkrepository/shard/data","architecture":"64-bit","host":"deng-pl111"}}
{"t":{"$date":"2021-06-24T11:10:30.300+02:00"},"s":"I", "c":"CONTROL", "id":23403, "ctx":"initandlisten","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.6","gitVersion":"72e66213c2c3eab37d9358d5e78ad7f5c1d0d0d7","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":[],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2021-06-24T11:10:30.300+02:00"},"s":"I", "c":"CONTROL", "id":51765, "ctx":"initandlisten","msg":"Operating System","attr":{"os":{"name":"Oracle Linux Server release 6.9","version":"Kernel 4.1.12-61.1.28.el6uek.x86_64"}}}
{"t":{"$date":"2021-06-24T11:10:30.300+02:00"},"s":"I", "c":"CONTROL", "id":21951, "ctx":"initandlisten","msg":"Options set by command line","attr":{"options":{"config":"/ctr/trademarkrepository/shard/data/automation-mongod.conf","net":{"bindIp":"0.0.0.0","port":28018},"processManagement":{"fork":true},"replication":{"replSetName":"shard0001"},"security":{"keyFile":"/ctr/trademarkrepository/keyfile_mongo_ctr_dvlp"},"setParameter":{"transactionLifetimeLimitSeconds":"600"},"sharding":{"clusterRole":"shardsvr"},"storage":{"dbPath":"/ctr/trademarkrepository/shard/data","engine":"wiredTiger","wiredTiger":{"engineConfig":{"cacheSizeGB":256.0,"directoryForIndexes":true}}},"systemLog":{"destination":"file","logAppend":true,"logRotate":"reopen","path":"/ctr/trademarkrepository/shard/log/mongod.log"}}}}
{"t":{"$date":"2021-06-24T11:10:30.301+02:00"},"s":"I", "c":"STORAGE", "id":22297, "ctx":"initandlisten","msg":"Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem","tags":["startupWarnings"]}
{"t":{"$date":"2021-06-24T11:10:30.301+02:00"},"s":"I", "c":"STORAGE", "id":22315, "ctx":"initandlisten","msg":"Opening WiredTiger","attr":{"config":"create,cache_size=262144M,session_max=33000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000,close_scan_interval=10,close_handle_minimum=250),statistics_log=(wait=0),verbose=[recovery_progress,checkpoint_progress,compact_progress],"}}
{"t":{"$date":"2021-06-24T11:10:30.897+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525830:897381][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 395 through 396"}}
{"t":{"$date":"2021-06-24T11:10:30.921+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525830:921098][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 396 through 396"}}
{"t":{"$date":"2021-06-24T11:10:31.139+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525831:139469][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Main recovery loop: starting at 395/62380416 to 396/256"}}
{"t":{"$date":"2021-06-24T11:10:31.140+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525831:140297][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 395 through 396"}}
{"t":{"$date":"2021-06-24T11:10:31.168+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525831:168326][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY_PROGRESS] Recovering log 396 through 396"}}
{"t":{"$date":"2021-06-24T11:10:31.227+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525831:227706][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global recovery timestamp: (1624525800, 1)"}}
{"t":{"$date":"2021-06-24T11:10:31.227+02:00"},"s":"I", "c":"STORAGE", "id":22430, "ctx":"initandlisten","msg":"WiredTiger message","attr":{"message":"[1624525831:227776][43037:0x7fe1d6387b40], txn-recover: [WT_VERB_RECOVERY | WT_VERB_RECOVERY_PROGRESS] Set global oldest timestamp: (0, 0)"}}

 

 



 Comments   
Comment by Eric Sedor [ 28/Jul/21 ]

Hi kinwai.cheung@clarivate.com, at this point I'll close the ticket, but please feel free to comment again or open a new ticket if you see such a long startup time again.

Comment by Eric Sedor [ 01/Jul/21 ]

Thanks kinwai.cheung@clarivate.com,

I'm going to keep this ticket open for now in case it comes up again soon. Feel free to update any time.

Comment by Kin Wai Cheung [ 01/Jul/21 ]

Sure,

At the moment the cluster has been upgraded to 4.4.6.

I did a clean stop afterwards and restarted without any long startup (as expected).

But I'll consider upgrading to 4.4.7 if we face a long startup again.

Comment by Eric Sedor [ 30/Jun/21 ]

Hi kinwai.cheung@clarivate.com,

That probably won't help and we recommend against any upgrade process that differs from the documented order of upgrade steps.

Instead, this looks like it is related to the startup time of a single node during the WiredTiger recovery process. I do see evidence in the diagnostic data that a relatively large number of data handles were activated (data-handle: connection data handles currently active) during the startup of shard0000 (see the attached screenshot).

This looks a lot like an issue that was improved by WT-7442 in MongoDB 4.4.5. Note that while the fix for WT-7442 reduced the number of data handles that need to be activated, it will not have eliminated the delays if that number is still large.
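
If useful, the current value of that statistic can be checked on a running node; this is a sketch only, reading the data-handle section of the wiredTiger block in serverStatus (add auth options as needed):

mongo --port 28018 --eval 'printjson(db.serverStatus().wiredTiger["data-handle"])'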

While working on WT-7442 we also opened WT-7452 to improve the logging when startup runs long. This improvement is landing in MongoDB 4.4.7, which is due to be released very soon.

Would you be willing to upgrade to 4.4.7 when it becomes available, and provide logs and diagnostic data for a long startup time on that version, so that we can investigate this issue with improved logging?

Comment by Kin Wai Cheung [ 29/Jun/21 ]

Can we avoid this by doing the following in a sharded cluster of single-member replica sets?

  • shut down the mongos
  • shut down every shard (2 shards, each a single-member replica set with a primary only, no secondary)
  • shut down the config server (a single-member replica set with a primary only, no secondary)

Then start with the new binaries (a rough command sketch for a single member follows the list below):

  • start config
  • start every shard
  • start mongos
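
A rough sketch of the stop/start step for a single member, shown for shard0001 with the config file path from the log above; the 4.4.6 binary location is a placeholder and auth options are omitted:

# cleanly shut down the member running the old binaries ({force: true} may be needed since the node is the only member and therefore the primary)
mongo --port 28018 --eval 'db.getSiblingDB("admin").shutdownServer()'
# start the same member with the 4.4.6 binaries and the unchanged config file
/path/to/mongodb-4.4.6/bin/mongod --config /ctr/trademarkrepository/shard/data/automation-mongod.conf
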
Comment by Kin Wai Cheung [ 29/Jun/21 ]

@Eric Sedor

The requested files have been uploaded.

Comment by Eric Sedor [ 28/Jun/21 ]

Hi kinwai.cheung@clarivate.com, we can take a look.

I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

For the node where you saw the problem, would you please archive (tar or zip) and upload to that link (a tar example is sketched after this list):

  • the mongod logs leading up to and following the upgrade to 4.4.6
  • the $dbpath/diagnostic.data directory (the contents are described here)
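
A sketch of how this could be archived, using the log path and dbPath from the mongod configuration above (the archive name is arbitrary):

# bundle the mongod log and the diagnostic.data directory for upload
tar -czf shard0001-upgrade-diagnostics.tar.gz /ctr/trademarkrepository/shard/log/mongod.log /ctr/trademarkrepository/shard/data/diagnostic.data
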
Comment by Kin Wai Cheung [ 24/Jun/21 ]

FYI: the mongod recovery finished after 1 hour and 5 minutes.

Is this normal behaviour, and will it occur during a 3-member replica set upgrade?

Comment by Kin Wai Cheung [ 24/Jun/21 ]

I see similar issues raised under:

https://jira.mongodb.org/browse/WT-7452

https://jira.mongodb.org/browse/SERVER-56222

 

Another question:

I don't seem to have encountered this when we upgraded from 4.0.0 to 4.2.8 to 4.4.0.

Is this because we run single-member replica sets, or will we encounter the same thing if we upgrade one member at a time in a 3-member replica set?

 

 
