[SERVER-60055] Mongo DB is not getting restarted in production Created: 17/Sep/21  Updated: 27/Oct/23  Resolved: 26/Sep/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Gaurav Kumar Assignee: Backlog - Triage Team
Resolution: Community Answered Votes: 0
Labels: bkp, buildbot, host-management
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS :- Red Hat Linux 8
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 48
RAM - 256 GB


Attachments: PNG File mongorepair.png    
Issue Links:
Backports
Assigned Teams:
Server Triage
Backport Requested:
v4.4
Participants:

 Description   

We are having 4.4 TB of data ( occupied in file system ) and we have created indexes over collections that is occupying data space of 750GB. 
For some reason, the server got restarted and when we tried to restart the mongo instance, it is not getting restarted. 
As per logs suggestion, we tried to repair and we are getting below error. 
Could someone help in getting the db up ? 

{"t":\{"$date":"2021-09-17T17:51:39.517+00:00"},"s":"F", "c":"-", "id":23081, "ctx":"initandlisten","msg":"Invariant failure","attr":\{"expr":"buildUUID","msg":"collection: dynamediation-reporting.DataBilledindex:octlbfpvd","file":"src/mongo/db/catalog/index_catalog_impl.cpp","line":133}}
{"t":\{"$date":"2021-09-17T17:51:39.517+00:00"},"s":"F", "c":"-", "id":23082, "ctx":"initandlisten",*"msg":"\n\n***aborting after invariant() failure\n\n"}*
{"t":\{"$date":"2021-09-17T17:51:39.517+00:00"},"s":"F", "c":"CONTROL", "id":4757800, "ctx":"initandlisten",*"msg":"Writing fatal message","*attr":\{"message":"Got signal: 6 (Aborted).\n"}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31431, "ctx":"initandlisten","msg":"BACKTRACE: \{bt}","attr":\{"bt":{"backtrace":[{"a":"55D395F2FCCA","b":"55D39319F000","o":"2D90CCA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1EA"},\{"a":"55D395F31749","b":"55D39319F000","o":"2D92749","s":"_ZN5mongo15printStackTraceEv","s+":"29"},\{"a":"55D395F2EAF6","b":"55D39319F000","o":"2D8FAF6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"},\{"a":"7F6A789CFD80","b":"7F6A789BD000","o":"12D80","s":"funlockfile","s+":"50"},\{"a":"7F6A7863093F","b":"7F6A785F9000","o":"3793F","s":"gsignal","s+":"10F"},\{"a":"7F6A7861AC95","b":"7F6A785F9000","o":"21C95","s":"abort","s+":"127"},\{"a":"55D3940CDCDE","b":"55D39319F000","o":"F2ECDE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"153"},\{"a":"55D393EDBA87","b":"55D39319F000","o":"D3CA87","s":"_ZN5mongo16IndexCatalogImpl4initEPNS_16OperationContextE.cold.1305","s+":"1B"},\{"a":"55D394A7F7F8","b":"55D39319F000","o":"18E07F8","s":"_ZN5mongo14CollectionImpl4initEPNS_16OperationContextE","s+":"328"},\{"a":"55D394A8BFE6","b":"55D39319F000","o":"18ECFE6","s":"_ZNK5mongo12DatabaseImpl4initEPNS_16OperationContextE","s+":"D6"},\{"a":"55D394A87618","b":"55D39319F000","o":"18E8618","s":"_ZN5mongo18DatabaseHolderImpl6openDbEPNS_16OperationContextENS_10StringDataEPb","s+":"308"},\{"a":"55D3945E7327","b":"55D39319F000","o":"1448327","s":"_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"247"},\{"a":"55D3945E20CE","b":"55D39319F000","o":"14430CE","s":"_ZN5mongo30repairDatabasesAndCheckVersionEPNS_16OperationContextE","s+":"34E"},\{"a":"55D394167292","b":"55D39319F000","o":"FC8292","s":"_ZN5mongo12_GLOBAL__N_114_initAndListenEPNS_14ServiceContextEi.isra.1583","s+":"762"},\{"a":"55D39416AD30","b":"55D39319F000","o":"FCBD30","s":"_ZN5mongo12_GLOBAL__N_111mongoDbMainEiPPcS2_","s+":"650"},\{"a":"55D3940DD9A9","b":"55D39319F000","o":"F3E9A9","s":"main","s+":"9"},\{"a":"7F6A7861C813","b":"7F6A785F9000","o":"23813","s":"__libc_start_main","s+":"F3"},\{"a":"55D39416635E","b":"55D39319F000","o":"FC735E","s":"_start","s+":"2E"}],"processInfo":\{"mongodbVersion":"4.4.6","gitVersion":"72e66213c2c3eab37d9358d5e78ad7f5c1d0d0d7","compiledModules":[],"uname":{"sysname":"Linux","release":"4.18.0-80.el8.x86_64","version":"#1 SMP Wed Mar 13 12:02:46 UTC 2019","machine":"x86_64"},"somap":[\{"b":"55D39319F000","elfType":3,"buildId":"264EA6EE10C77300B36F30AE15B879E4CE140A3C"},\{"b":"7F6A789BD000","path":"/lib64/libpthread.so.0","elfType":3,"buildId":"5326B8728FA01B7149DAC943100F1405533E76CE"},\{"b":"7F6A785F9000","path":"/lib64/libc.so.6","elfType":3,"buildId":"424987D36A35374FC6A1438711AFB5113A74EA81"}]}}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D395F2FCCA","b":"55D39319F000","o":"2D90CCA","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1EA"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D395F31749","b":"55D39319F000","o":"2D92749","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D395F2EAF6","b":"55D39319F000","o":"2D8FAF6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"7F6A789CFD80","b":"7F6A789BD000","o":"12D80","s":"funlockfile","s+":"50"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"7F6A7863093F","b":"7F6A785F9000","o":"3793F","s":"gsignal","s+":"10F"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"7F6A7861AC95","b":"7F6A785F9000","o":"21C95","s":"abort","s+":"127"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D3940CDCDE","b":"55D39319F000","o":"F2ECDE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"153"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D393EDBA87","b":"55D39319F000","o":"D3CA87","s":"_ZN5mongo16IndexCatalogImpl4initEPNS_16OperationContextE.cold.1305","s+":"1B"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D394A7F7F8","b":"55D39319F000","o":"18E07F8","s":"_ZN5mongo14CollectionImpl4initEPNS_16OperationContextE","s+":"328"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D394A8BFE6","b":"55D39319F000","o":"18ECFE6","s":"_ZNK5mongo12DatabaseImpl4initEPNS_16OperationContextE","s+":"D6"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D394A87618","b":"55D39319F000","o":"18E8618","s":"_ZN5mongo18DatabaseHolderImpl6openDbEPNS_16OperationContextENS_10StringDataEPb","s+":"308"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D3945E7327","b":"55D39319F000","o":"1448327","s":"_ZN5mongo14repairDatabaseEPNS_16OperationContextEPNS_13StorageEngineERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"247"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D3945E20CE","b":"55D39319F000","o":"14430CE","s":"_ZN5mongo30repairDatabasesAndCheckVersionEPNS_16OperationContextE","s+":"34E"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D394167292","b":"55D39319F000","o":"FC8292","s":"_ZN5mongo12_GLOBAL__N_114_initAndListenEPNS_14ServiceContextEi.isra.1583","s+":"762"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D39416AD30","b":"55D39319F000","o":"FCBD30","s":"_ZN5mongo12_GLOBAL__N_111mongoDbMainEiPPcS2_","s+":"650"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D3940DD9A9","b":"55D39319F000","o":"F3E9A9","s":"main","s+":"9"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"7F6A7861C813","b":"7F6A785F9000","o":"23813","s":"__libc_start_main","s+":"F3"}}}
{"t":\{"$date":"2021-09-17T17:51:39.556+00:00"},"s":"I", "c":"CONTROL", "id":31427, "ctx":"initandlisten","msg":" Frame: \{frame}","attr":\{"frame":{"a":"55D39416635E","b":"55D39319F000","o":"FC735E","s":"_start","s+":"2E"}}}

 



 Comments   
Comment by Dmitry Agranat [ 26/Sep/21 ]

Hi gaurav.kumar@aitechbay.com, based on the earlier provided logs, it does not seem like all the issues in your environment were addressed.

Unfortunately, MongoDB is not able to repair the corrupted data.

To avoid a problem like this in the future, it is our strong recommendation to:

Please reach out to our Consulting team (in the same way you did in the past) to make sure your environment is configured with our Production notes and try to manually repair data.

Regards,
Dima

Comment by Gaurav Kumar [ 23/Sep/21 ]

Hi Dmitry, 

All issues pointed out was cleared, after that we tried to repair the data, then we got the error as invariant failure. 
refer image. (file :- mongo-repair.log) tail -f -n 1000 mongod_repair.log

 

Please note that mongo is running, if we change the dbpath to some other directory. 

That's why we came to mongo team seeking for help. It would be of great help, if you could check this further. 

Comment by Dmitry Agranat [ 22/Sep/21 ]

Hi gaurav.kumar@aitechbay.com, the reported issue does not seem to be related to MongoDB but rather OS/environment-related. Unless all of these issues are addressed, MongoDB won't be able to start.

There are too many issues to list but I'll point out some examples:

Someone/something is periodically killing mongod process. The most devastating effect of this event is when you are killing mongod process repeatedly during recovery. Example:

{"t":{"$date":"2021-09-17T08:47:09.992+00:00"},"s":"I",  "c":"CONTROL",  "id":23377,   "ctx":"SignalHandler","msg":"Received signal","attr":{"signal":15,"error":"Terminated"}}
{"t":{"$date":"2021-09-17T08:47:09.992+00:00"},"s":"I",  "c":"CONTROL",  "id":23378,   "ctx":"SignalHandler","msg":"Signal was sent by kill(2)","attr":{"pid":1,"uid":0}}

It's unclear what configuration was used at the time when we were not able to start because of an unknown storage engine. Example:

{"t":{"$date":"2021-09-17T08:49:18.155+00:00"},"s":"E",  "c":"STORAGE",  "id":20557,   "ctx":"initandlisten","msg":"DBException in initAndListen, terminating","attr":{"error":"Location18656: Cannot start server with an unknown storage engine: "}}

Permission denied to the journal dbpath. Example:

{"t":{"$date":"2021-09-17T09:38:55.958+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":13,"message":"[1631871535:958806][16273:0x7f95b28b8000], connection: __posix_open_file, 808: /data/mongo/journal/WiredTigerLog.0000037494: handle-open: open: Permission denied"}}
{"t":{"$date":"2021-09-17T09:38:55.964+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":13,"message":"[1631871535:964217][16273:0x7f95b28b8000], connection: __posix_open_file, 808: /data/mongo/journal/WiredTigerLog.0000037494: handle-open: open: Permission denied"}}
{"t":{"$date":"2021-09-17T09:38:55.969+00:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":13,"message":"[1631871535:969205][16273:0x7f95b28b8000], connection: __posix_open_file, 808: /data/mongo/journal/WiredTigerLog.0000037494: handle-open: open: Permission denied"}}

Operation not permitted to unlock socket file. Example:

{"t":{"$date":"2021-09-17T10:03:47.630+00:00"},"s":"E",  "c":"NETWORK",  "id":23024,   "ctx":"initandlisten","msg":"Failed to unlink socket file","attr":{"path":"/tmp/mongodb-27017.sock","error":"Operation not permitted"}}

Another mongod process is running and we are unable to lock the lock file. Example:

{"t":{"$date":"2021-09-17T13:24:45.688+00:00"},"s":"D1", "c":"-",        "id":23074,   "ctx":"initandlisten","msg":"User assertion","attr":{"error":"DBPathInUse: Unable to lock the lock file: /data/mongo/mongod.lock (Resource temporarily unavailable). Another mongod instance is already running on the /data/mongo directory","file":"src/mongo/db/storage/storage_engine_init.cpp","line":216}}

Cannot write to pid file, permission denied. Example:

{"t":{"$date":"2021-09-17T13:27:57.962+00:00"},"s":"I",  "c":"CONTROL",  "id":23330,   "ctx":"main","msg":"ERROR: Cannot write pid file to {path_string}: {errAndStr_second}","attr":{"path_string":"/var/run/mongodb/mongod.pid","errAndStr_second":"Permission denied"}}

Read-only directory. Example:

{"t":{"$date":"2021-09-17T15:59:38.741+00:00"},"s":"W",  "c":"-",        "id":23075,   "ctx":"initandlisten","msg":"DBException thrown","attr":{"error":{"code":20,"codeName":"IllegalOperation","errmsg":"Attempted to create a lock file on a read-only directory: /data/mongo"}}}

There are some other examples but the bottom line is that currently, mongodb is repeatedly failing due to configuration/environment issues. And because it went through hundreds of such failed cycles, while being repeatedly killed in the middle of recovery, the integrity of the data is not clear at this point.

We can have one screen share session, if you need.

We do not provide screen sessions for the SERVER project.

With no backups, and since this is a standalone, our options to address this issue are limited (even after all the configuration/environment issues are addressed which is outside the scope of the SERVER project).

Comment by Gaurav Kumar [ 22/Sep/21 ]

Hi Dmitry, 

We have not created backup copy as the data is of 4.4 TB, but we have the complete data as it is.
Yes, you are correct, this is the standalone deployment, we have no replica set.
command used :- mongod --dbpath /data/mongo --repair

We can have one screen share session, if you need. 

Thanks 
Gaurav

Comment by Dmitry Agranat [ 22/Sep/21 ]

Hi gaurav.kumar@aitechbay.com, thank you for uploading the requesting information. I just want to make sure regarding the steps of the repair procedure that you've mentioned. Did you create a backup copy of the data files in the --dbpath before executing repair procedure?

If not, do you have a recent backup that was taken before the reported issue?
This is a standalone deployment (not a replica set), correct?

Comment by Gaurav Kumar [ 20/Sep/21 ]

Hi Dimtry, 
We have tried upgrading the version from "mongodbVersion":"4.4.6" to "mongodbVersion":"4.4.8" and tried to repair, but it is still the same. 
We are getting same error. 

Thanks, 
Gaurav

Comment by Gaurav Kumar [ 20/Sep/21 ]

Hi Dmitry Agranat, 

Any suggestion, what went wrong? 
The client is chasing us. 

Thanks 
Gaurav, 

Comment by Gaurav Kumar [ 19/Sep/21 ]

Thanks Dima for picking this issue. 
I have uploaded the complete logs and diagnostic.data directory as suggested. 
file name :- jira60055.tar
Please check and suggest any solution, as we are on third day of outage. 

With Regards,
Gaurav

Comment by Dmitry Agranat [ 19/Sep/21 ]

Hi gaurav.kumar@aitechbay.com,

For a start, we'll need the data covering the original incident, ideally covering some time before the server was restarted and until you have executed the repair command.

Would you please archive (tar or zip) the mongod.log files covering the incident and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Thanks,
Dima

Comment by Gaurav Kumar [ 18/Sep/21 ]

mongo -dbpath --repair is getting failed at this code : 

if (!ready) {
auto buildUUID = collection->getIndexBuildUUID(indexName);
invariant(buildUUID,
str::stream() << "collection: " << collection->ns() << "index:" << indexName);
// We intentionally do not drop or rebuild unfinished two-phase index builds before
// initializing the IndexCatalog when starting a replica set member in standalone mode.
// This is because the index build cannot complete until it receives a replicated commit
// or abort oplog entry.
if (replSetMemberInStandaloneMode)

{ // Indicate that this index is "frozen". It is not ready but is not currently in // progress either. These indexes may be dropped. auto flags = CreateIndexEntryFlags::kInitFromDisk | CreateIndexEntryFlags::kFrozen; IndexCatalogEntry* entry = createIndexEntry(opCtx, collection, std::move(descriptor), flags); fassert(31433, !entry->isReady(opCtx, collection)); }

else

{ // Initializing with unfinished indexes may occur during rollback or startup. auto flags = CreateIndexEntryFlags::kInitFromDisk; IndexCatalogEntry* entry = createIndexEntry(opCtx, collection, std::move(descriptor), flags); fassert(4505500, !entry->isReady(opCtx, collection)); }

What if, some index is corrupted ( like it didn't complete second phase and system got rebooted) , repair should throw error and remove from repair stack and allow remaining data to go through complete process. At least, other data would be safe and user would be able to get that restore the db back to normal. 

 

Comment by Gaurav Kumar [ 18/Sep/21 ]

Can someone help me on this? 

Generated at Thu Feb 08 05:48:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.