[SERVER-37318] POD can't start in a kubernetes replicaset Created: 26/Sep/18 Updated: 27/Dec/18 Resolved: 25/Oct/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability, WiredTiger |
| Affects Version/s: | 4.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mauro Tintori | Assignee: | Kelsey Schubert |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Participants: |
| Description |
Hi, we have an environment made up of at least 3 preemptible nodes (max 5 nodes) running Kubernetes on Google Cloud Platform. On top of that we were running a MongoDB replica set made up of 3 pods using the version 3.6 Docker image. We are now testing the MongoDB 4.0.2 Docker image. Yesterday we ran a lot of tests without problems, also with different resources (CPU and memory limits) and different Kubernetes node types; these operations required updating, recreating and moving pods between node pools. All of yesterday's tests were fine. Today the first pod always goes into CrashLoopBackOff, while the other 2 pods are secondaries. In the logs we found this:

2018-09-26T08:09:00.028+0000 I CONTROL [main] Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'
, replication: { replSet: "xxx" }, storage: { mmapv1: { preallocDataFiles: false, smallFiles: true } } }
***aborting after fassert() failure

We can't access the pod, so we can't read the files or the database. How can we resolve our problem? Thank you, Mauro
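Since the pod crash-loops before a shell can be opened, here is a minimal sketch of one way to read the crashed container's logs and keep the pod alive long enough to inspect the dbpath. All names (StatefulSet "mongo", pod "mongo-0", dbpath /data/db) are assumptions - substitute the ones from your own deployment:

# Logs from the previously crashed container instance
kubectl logs mongo-0 --previous

# Temporarily override the container command so the pod stays up without starting mongod
# (JSON Patch "add" also replaces the value if a command is already set in the spec)
kubectl patch statefulset mongo --type=json \
  -p='[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["sleep","infinity"]}]'
kubectl delete pod mongo-0    # let the StatefulSet recreate the pod with the sleep command

# Inspect the data files on the persistent volume; revert the patch afterwards
kubectl exec -it mongo-0 -- ls -l /data/db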
| Comments |
| Comment by Joachim Aumann [ 27/Dec/18 ] |
I have the same issue. Described in this stackoverflow post: https://stackoverflow.com/questions/53835384/mongodb-statefulset-on-kubernetes-is-not-working-anymore-after-kubernetes-update
| Comment by Rakesh [ 22/Dec/18 ] |
I have two Kubernetes pods with volumes attached to them. I am trying to bring up a new set of pods with the same volumes mounted, but it fails with the above error message.
| Comment by Rakesh [ 22/Dec/18 ] |
2018-12-21T11:16:58.229+0000 I STORAGE [initandlisten] Starting OplogTruncaterThread local.oplog.rs
2018-12-21T11:16:58.229+0000 I STORAGE [initandlisten] The size storer reports that the oplog contains 2025121 records totaling to 604496199 bytes
2018-12-21T11:16:58.229+0000 I STORAGE [initandlisten] Sampling from the oplog between Dec 18 11:39:22:1 and Dec 21 05:36:47:1 to determine where to place markers for truncation
2018-12-21T11:16:58.229+0000 I STORAGE [initandlisten] Taking 355 samples and assuming that each section of oplog contains approximately 57012 records totaling to 17018013 bytes
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:40:54:18
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:46:41:36
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:49:47:2813
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:49:59:4034
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:50:13:2095
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:50:26:492
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:50:39:2504
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:50:52:3292
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 11:51:07:138
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 12:44:45:5
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 12:49:59:1376
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 12:50:14:1380
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 12:50:29:731
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 12:50:43:2800
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 14:00:06:712
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 18 14:00:14:7234
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 00:00:32:278
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 00:08:41:2867
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 00:09:00:524
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 00:09:18:898
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 00:09:36:2022
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 04:00:10:1282
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 19 19:00:05:26
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 00:08:05:2452
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 00:08:24:2759
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 00:08:43:2092
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 00:09:02:4478
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 00:15:15:1648
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 20 16:45:05:30
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:00:15:2940
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:08:57:1666
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:09:21:296
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:09:41:2525
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:10:04:709
2018-12-21T11:16:58.254+0000 I STORAGE [initandlisten] Placing a marker at optime Dec 21 00:15:14:1988
2018-12-21T11:16:58.268+0000 W STORAGE [initandlisten] Detected configuration for non-active storage engine mmapv1 when current storage engine is wiredTiger
2018-12-21T11:16:58.268+0000 I CONTROL [initandlisten]
2018-12-21T11:16:58.268+0000 I CONTROL [initandlisten] ** WARNING: Access control is not enabled for the database.
2018-12-21T11:16:58.268+0000 I CONTROL [initandlisten] ** Read and write access to data and configuration is unrestricted.
2018-12-21T11:16:58.268+0000 I CONTROL [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2018-12-21T11:16:58.268+0000 I CONTROL [initandlisten]
2018-12-21T11:16:58.880+0000 I FTDC [initandlisten] Initializing full-time diagnostic data capture with directory '/data/db/diagnostic.data'
2018-12-21T11:16:58.884+0000 I REPL [initandlisten] Rollback ID is 2
2018-12-21T11:16:58.885+0000 I REPL [initandlisten] Recovering from stable timestamp: Timestamp(1545370895, 1) (top of oplog: { ts: Timestamp(1545370607, 1), t: 8 }, appliedThrough: { ts: Timestamp(0, 0), t: -1 }, TruncateAfter: Timestamp(0, 0))
2018-12-21T11:16:58.885+0000 I REPL [initandlisten] Starting recovery oplog application at the stable timestamp: Timestamp(1545370895, 1)
2018-12-21T11:16:58.885+0000 F REPL [initandlisten] Applied op { : Timestamp(1545370895, 1) } not found. Top of oplog is { : Timestamp(1545370607, 1) }.
2018-12-21T11:16:58.887+0000 F - [initandlisten] Fatal Assertion 40313 at src/mongo/db/repl/replication_recovery.cpp 361
2018-12-21T11:16:58.887+0000 F - [initandlisten]
***aborting after fassert() failure
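The fatal assertion above shows that recovery wants to replay the oplog from the stable timestamp Timestamp(1545370895, 1), but the newest entry actually on disk is Timestamp(1545370607, 1), so the member cannot recover from its own data files. As a hedged diagnostic sketch (not a fix), one could start mongod against the same dbpath as a standalone, i.e. without --replSet, and check what the on-disk oplog really contains; the port, the dbpath and whether standalone startup succeeds at all are assumptions here:

# Start the node standalone against the same data files, on a throwaway port
mongod --dbpath /data/db --port 27018

# In another shell, print the newest oplog entry actually present on disk
mongo --port 27018 --eval 'printjson(db.getSiblingDB("local").oplog.rs.find().sort({$natural: -1}).limit(1).toArray())'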
| Comment by Rakesh [ 22/Dec/18 ] |
@Mauro - I am facing the same issue as well. How did you resolve it?
| Comment by Kelsey Schubert [ 25/Oct/18 ] |
Hi maurotintori, We haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards,
| Comment by Nick Brewer [ 28/Sep/18 ] |
maurotintori A resync is likely going to be the most effective way to get this node up and running again. From the log lines provided, it appears that the node may not have been shut down properly - was there an unexpected shutdown, or was the mongod process manually killed for some reason? -Nick
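The resync Nick refers to is essentially the documented "empty the data directory and let the member run an initial sync" procedure. A hedged sketch of how that might look for a StatefulSet member, again with hypothetical names (pod "mongo-0", PVC "datadir-mongo-0", dbpath /data/db) and assuming the remaining members hold a healthy copy of the data:

# With mongod prevented from starting (e.g. via the sleep-command patch shown earlier),
# clear the dbpath so the member performs an initial sync on its next start
kubectl exec mongo-0 -- sh -c 'rm -rf /data/db/*'

# Alternative: discard the volume entirely and let the StatefulSet recreate it
# (only if losing this copy of the data is acceptable)
# kubectl delete pvc datadir-mongo-0 && kubectl delete pod mongo-0

# Revert any command override; when mongod restarts with --replSet it rejoins the
# replica set and copies the data back from the other members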
| Comment by Mauro Tintori [ 28/Sep/18 ] |
Hi Nick, thank you. This is the file list I can see in the dbpath:
All tests were made using the same set of dbpath data, and we didn't try a resync. What do you suggest? Thank you, Mauro
| Comment by Nick Brewer [ 26/Sep/18 ] |
maurotintori Thanks for your report. I have a few questions:
-Nick