[SERVER-66038] MongoDB docker container unable to start properly. Created: 28/Apr/22 Updated: 01/Jul/22 Resolved: 01/Jul/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.6.18 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sarojini Jillalla | Assignee: | Chris Kelly |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi all,

We are using Graylog, Elasticsearch and MongoDB for logging and archiving. These applications run as docker containers with 3 replicas across 3 RHEL servers, and we are using MongoDB version 3.6.18.

I had previously created tickets for similar issues (https://jira.mongodb.org/browse/SERVER-61936, https://jira.mongodb.org/browse/SERVER-64629). At that time I restored the data from backup and could not try the repair command mentioned in those solutions. Those databases were smaller, about 1-2 GB each, and recovered easily once the data was restored from backup.

This time DB03 has the issue and does not come up properly. This database holds about 640 GB of data, so bringing the container up takes a very long time.

    [root@dcvsl126 sjillalla]# docker ps | grep mongo

[The docker ps output and the mongod configuration snippet pasted here were mangled by the issue tracker's macro renderer; only the CAFile and replication oplogSizeMB settings are recoverable from the configuration.]
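For orientation, a minimal sketch of what such a mongod configuration file typically contains in a setup like this. Only the CAFile and oplogSizeMB keys are visible in the original; every path, name, and value below is an assumption for illustration, not taken from the ticket:

    $ cat mongod.conf                         # assumed file name/location
    net:
      ssl:
        mode: requireSSL                      # assumed; the ticket only shows TLS is configured
        PEMKeyFile: /etc/ssl/mongodb.pem      # assumed path
        CAFile: /etc/ssl/mongodb-ca.pem       # assumed path
    replication:
      replSetName: graylog-rs                 # assumed replica set name
      oplogSizeMB: 10240                      # assumed value
    storage:
      dbPath: /data/db
      engine: wiredTiger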
When I tried the repair command on this DB, the log (attached as db03-repair-logs-in-temp-restored-folder.txt) showed:

    2022-04-27T14:15:01.663+0000 E STORAGE [initandlisten] WiredTiger error (-31802) [1651068901:663138][1:0x7f5e71a08a40], file:WiredTiger.wt, WT_CURSOR.next: __wt_block_read_off, 302: WiredTiger.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
    Raw: [1651068901:663138][1:0x7f5e71a08a40], file:WiredTiger.wt, WT_CURSOR.next: __wt_block_read_off, 302: WiredTiger.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
    ***aborting after fassert() failure
    2022-04-27T14:15:01.747+0000 F - [initandlisten] Got signal: 6 (Aborted).
    0x555f7ba26991 0x555f7ba25ba9 0x555f7ba2608d 0x7f5e703ec390 0x7f5e70046428 0x7f5e7004802a 0x555f7a127ea4 0x555f7a204366 0x555f7a275d29 0x555f7a0c328c 0x555f7a0c36ac 0x555f7a32c025 0x555f7a32c165 0x555f7a2a01b0 0x555f7a2a6c2a 0x555f7a2c11fd 0x555f7a3312b8 0x555f7a2e2d4c 0x555f7a28e721 0x555f7a28ecf1 0x555f7a219847 0x555f7a2162ef 0x555f7a1e5446 0x555f7a1c7fa8 0x555f7a3cf615 0x555f7a1a32fa 0x555f7a1a6b12 0x555f7a129b79 0x7f5e70031830 0x555f7a18e579

[The JSON backtrace and "somap" entries that followed were mangled by the issue tracker's macro renderer and are not recoverable; the surviving processInfo fields were "mongodbVersion" : "3.6.18" and "gitVersion" : "2005f25eed7ed88fa698d9b800fe536bb0410ba4".]
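For reference, a repair run like the one described above is typically invoked with the container stopped and mongod started directly against the data volume. A minimal sketch, in which the container name, host data path and image tag are illustrative assumptions rather than values from this ticket:

    # Stop the running container first so only one mongod touches the data files
    # (container name is an assumed example).
    docker stop mongodb-db03

    # Run --repair in a throwaway container against the same data volume.
    # /data/mongodb/db03 is an assumed host data path; mongo:3.6 is the official image line.
    docker run --rm -v /data/mongodb/db03:/data/db mongo:3.6 \
      mongod --repair --dbpath /data/db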
I restored the data from the backup again and tried to bring up the containers. This time the container exited after about 3.5 hours with the following log (attached as db03-logs-container-exit-after-3.5hrs.txt):

    2022-04-28T00:07:43.929+0000 E STORAGE [initandlisten] WiredTiger error (-31802) [1651104463:929120][1:0x7fe335501a40], file:WiredTiger.wt, WT_CURSOR.next: __wt_block_read_off, 302: WiredTiger.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
    Raw: [1651104463:929120][1:0x7fe335501a40], file:WiredTiger.wt, WT_CURSOR.next: __wt_block_read_off, 302: WiredTiger.wt: fatal read error: WT_ERROR: non-specific WiredTiger error
    ***aborting after fassert() failure
    2022-04-28T00:07:44.147+0000 F - [initandlisten] Got signal: 6 (Aborted).
    0x560124143991 0x560124142ba9 0x56012414308d 0x7fe333ee5390 0x7fe333b3f428 0x7fe333b4102a 0x560122844ea4 0x560122921366 0x560122992d29 0x5601227e028c 0x5601227e06ac 0x560122a49025 0x560122a49165 0x5601229bd1b0 0x5601229c3c2a 0x5601229de1fd 0x560122a4e2b8 0x5601229ffd4c 0x5601229ab721 0x5601229abcf1 0x560122936847 0x5601229332ef 0x560122902139 0x5601228e4fa8 0x560122aec615 0x5601228c02fa 0x5601228c3b12 0x560122846b79 0x7fe333b2a830 0x5601228ab579

[As above, the JSON backtrace and "somap" entries were mangled by the macro renderer; processInfo again reported "mongodbVersion" : "3.6.18" and "gitVersion" : "2005f25eed7ed88fa698d9b800fe536bb0410ba4".]

This looks like a data corruption issue, and I am not able to understand why the data is getting corrupted so frequently in our environment. It is having a very large impact on Graylog and, in turn, on the production applications. Please let me know how I can recover from this and bring the MongoDB container up and running properly. |
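When a member exits like this, the container status, exit code, and full mongod output are the most useful artifacts to collect before retrying anything. A minimal sketch, with an assumed container name:

    # List all mongo containers, including exited ones, with their status.
    docker ps -a | grep mongo

    # Record the exit code of the affected member (assumed container name).
    docker inspect --format '{{.State.ExitCode}}' mongodb-db03

    # Capture the complete mongod output to a file for attachment to the ticket.
    docker logs mongodb-db03 > db03-mongod.log 2>&1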
| Comments |
| Comment by Chris Kelly [ 01/Jul/22 ] |
|
We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket. |
| Comment by Chris Kelly [ 10/Jun/22 ] |
|
We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the requested information? Regards, |
| Comment by Chris Kelly [ 24/May/22 ] |
|
Thanks for your patience. If --repair fails, the ideal resolution is to perform a clean resync from an unaffected node, and I'd recommend that in your case (a resync sketch is appended at the end of this comment). However, if that doesn't succeed, please help us by providing the following: for each node in the replica set, spanning a time period that includes the incident, would you please archive (tar or zip) the following and upload them to the ticket:
To avoid problems like this in the future, it is our strong recommendation to:
Regards,
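A minimal sketch of the clean-resync approach mentioned above for a dockerized replica set member, following the standard resync-by-initial-sync procedure: the container name, host volume path, and backup location are assumptions, not values from this ticket, and the other two members should be confirmed healthy before emptying this one's data directory.

    # Stop the affected member (assumed container name) so no mongod touches its data files.
    docker stop mongodb-db03

    # Preserve the corrupted data files in case further analysis is needed (assumed host path).
    mv /data/mongodb/db03 /data/mongodb/db03.corrupt-$(date +%F)
    mkdir -p /data/mongodb/db03

    # Start the member again with an empty dbpath; as a replica set member it will
    # perform an initial sync from the healthy nodes.
    docker start mongodb-db03

    # From the mongo shell on the primary, rs.status() and rs.printSlaveReplicationInfo()
    # show the member's state and replication lag while the initial sync runs.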
|