[SERVER-56463] MongoDB cannot start after stop and reboot host Created: 29/Apr/21  Updated: 21/May/21  Resolved: 21/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.5
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Андрей Маклаков Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to WT-7426 Set write generation number when the ... Closed
Participants:

 Description   

4-node replica set; replication lag on one secondary grows over roughly 6 hours

Previously (on 4.4.2/4.4.4), stopping mongod and restarting the host worked as a workaround

Steps:

  1. Stop the mongod service:

sudo systemctl stop mongod

  2. Restart the host (see the shell sketch below).
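
For reference, the same sequence as a minimal shell sketch; the verification steps are additions for a systemd-managed service, not part of the original report:

sudo systemctl stop mongod
systemctl is-active mongod             # expect "inactive" before rebooting
journalctl -u mongod -n 20 --no-pager  # optionally confirm the shutdown was clean
sudo systemctl reboot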

 

After the host restart, the mongod service fails to start, producing the following log entries:

 

{"t":{"$date":"2021-04-29T10:06:47.728+03:00"},"s":"I",  "c":"FTDC",     "id":20625,   "ctx":"initandlisten","msg":"Initializing full-time diagnostic data capture","attr":{"dataDirectory":"/opt/mongodb/data/diagnostic.data"}}
{"t":{"$date":"2021-04-29T10:06:47.729+03:00"},"s":"I",  "c":"REPL",     "id":21529,   "ctx":"initandlisten","msg":"Initializing rollback ID","attr":{"rbid":13}}
{"t":{"$date":"2021-04-29T10:06:47.729+03:00"},"s":"I",  "c":"REPL",     "id":501401,  "ctx":"initandlisten","msg":"Incrementing the rollback ID after unclean shutdown"}
{"t":{"$date":"2021-04-29T10:06:47.729+03:00"},"s":"I",  "c":"REPL",     "id":21532,   "ctx":"initandlisten","msg":"Incremented the rollback ID","attr":{"rbid":14}}
{"t":{"$date":"2021-04-29T10:06:47.730+03:00"},"s":"I",  "c":"REPL",     "id":21544,   "ctx":"initandlisten","msg":"Recovering from stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1619665481,"i":5791}},"topOfOplog":{"ts":{"$timestamp":{"t":1619670282,"i":569}},"t":262},"appliedThrough":{"ts":{"$timestamp":{"t":1619665481,"i":5791}},"t":262},"oplogTruncateAfterPoint":{"$timestamp":{"t":0,"i":0}}}}
{"t":{"$date":"2021-04-29T10:06:47.730+03:00"},"s":"I",  "c":"REPL",     "id":21545,   "ctx":"initandlisten","msg":"Starting recovery oplog application at the stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1619665481,"i":5791}}}}
{"t":{"$date":"2021-04-29T10:06:47.730+03:00"},"s":"I",  "c":"REPL",     "id":21550,   "ctx":"initandlisten","msg":"Replaying stored operations from startPoint (inclusive) to endPoint (inclusive)","attr":{"startPoint":{"$timestamp":{"t":1619665481,"i":5791}},"endPoint":{"$timestamp":{"t":1619670282,"i":569}}}}
{"t":{"$date":"2021-04-29T10:06:48.011+03:00"},"s":"I",  "c":"FTDC",     "id":20631,   "ctx":"ftdc","msg":"Unclean full-time diagnostic data capture shutdown detected, found interim file, some metrics may have been lost","attr":{"error":{"code":0,"codeName":"OK"}}}
{"t":{"$date":"2021-04-29T10:07:38.482+03:00"},"s":"F",  "c":"REPL",     "id":21238,   "ctx":"ReplWriterWorker-14","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: { id: UUID(\"6fc79d14-fbfd-4dbb-9119-f4055647bd7d\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } } }: E11000 duplicate key error collection: config.transactions index: _id_ dup key: { _id: { id: UUID(\"6fc79d14-fbfd-4dbb-9119-f4055647bd7d\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) } }","oplogEntry":{"ts":{"$timestamp":{"t":1619665666,"i":9781}},"t":262,"v":2,"op":"u","ns":"config.transactions","wall":{"$date":"2021-04-29T03:07:46.794Z"},"fromMigrate":false,"o":{"_id":{"id":{"$uuid":"6fc79d14-fbfd-4dbb-9119-f4055647bd7d"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"txnNum":134438,"lastWriteOpTime":{"ts":{"$timestamp":{"t":1619665666,"i":9781}},"t":262},"lastWriteDate":{"$date":"2021-04-29T03:07:46.794Z"}},"o2":{"_id":{"id":{"$uuid":"6fc79d14-fbfd-4dbb-9119-f4055647bd7d"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}}},"b":true}}}
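
The fatal entry (severity "F") shows recovery oplog application failing with an E11000 DuplicateKey error on config.transactions. Purely as an illustration (not part of the original report), the conflicting session record could be inspected from a healthy replica-set member with the legacy mongo shell, reusing the session UUID from the log above:

mongo --quiet --eval 'db.getSiblingDB("config").transactions.find({ "_id.id": UUID("6fc79d14-fbfd-4dbb-9119-f4055647bd7d") }).pretty()'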

 

 

How can we make it work again?

 



 Comments   
Comment by Андрей Маклаков [ 04/May/21 ]

Thanks! 

Please close.

Comment by Eric Sedor [ 04/May/21 ]

No, our suggestion would be to downgrade all nodes in the replica set.

Comment by Андрей Маклаков [ 30/Apr/21 ]

But can I downgrade only the one problem node to 4.4.4, with no changes on the other nodes?

Comment by Eric Sedor [ 29/Apr/21 ]

Hello maklakov.andrew@gmail.com,

We are actively investigating WT-7426, which is fixed in the upcoming 4.4.6 release and has been known to manifest as DuplicateKey errors on config.transactions after unclean restarts.

If you are experiencing this error message on 4.4.5, our current recommendation is to downgrade to 4.4.4 and then perform an initial sync on the affected node.
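
A minimal sketch of that downgrade-and-resync procedure (not from the original thread), assuming a systemd-managed node whose dbPath is /opt/mongodb/data (inferred from the FTDC dataDirectory in the log above); the backup path and the mongod:mongod ownership are assumptions:

sudo systemctl stop mongod
# Install the 4.4.4 binaries by your usual method (package manager or tarball).
# Move the data files aside so the node performs an initial sync on startup.
sudo mv /opt/mongodb/data /opt/mongodb/data.bak
sudo mkdir -p /opt/mongodb/data
sudo chown mongod:mongod /opt/mongodb/data   # ownership is an assumption
sudo systemctl start mongod
# The node rejoins the replica set and begins an initial sync from another member.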

Eric
