[SERVER-37473] mongo can't start in kubernetes after pod restart Created: 04/Oct/18  Updated: 16/Oct/18  Resolved: 12/Oct/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Henrique Barcellos Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File WiredTiger.turtle     File WiredTiger.wt     File mongod.SERVER-30709.strace    
Issue Links:
Related
related to SERVER-30709 mongo can't start in kubernetes when ... Closed
Operating System: ALL
Participants:

 Description   

I use mongo 4.0.3 and start mongod in Kubernetes with the following arguments:

mongod --replSet rs0 --bind_ip 0.0.0.0 --smallfiles --noprealloc

The volume contains old data copied from another instance, but I got the following error:

 

2018-10-04T12:39:36.525+0000 I STORAGE [initandlisten] Detected data files in /data/db created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2018-10-04T12:39:36.532+0000 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=7511M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),statistics_log=(wait=0),verbose=(recovery_progress),
2018-10-04T12:39:37.231+0000 E STORAGE [initandlisten] WiredTiger error (22) [1538656777:231135][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument Raw: [1538656777:231135][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument
2018-10-04T12:39:37.234+0000 E STORAGE [initandlisten] WiredTiger error (22) [1538656777:234941][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument Raw: [1538656777:234941][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument
2018-10-04T12:39:37.238+0000 E STORAGE [initandlisten] WiredTiger error (22) [1538656777:237994][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument Raw: [1538656777:237994][325:0x7fb5e255fa00], wiredtiger_open: __posix_open_file, 715: /data/db/WiredTiger: handle-open: open: Invalid argument
2018-10-04T12:39:37.238+0000 W STORAGE [initandlisten] Failed to start up WiredTiger under any compatibility version.
2018-10-04T12:39:37.238+0000 F - [initandlisten] Fatal Assertion 28561 at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 620
2018-10-04T12:39:37.239+0000 F - [initandlisten]
***aborting after fassert() failure

 

The strace output is attached.

 

The PersistentVolume is an NFS mount with the following flags:

rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.0.23.74,local_lock=none,addr=10.43.65.60

 



 Comments   
Comment by Henrique Barcellos [ 16/Oct/18 ]

What I posted is a temporary workaround. Since mongod starts and tries to create its lock files with open(2) flags that the filesystem does not support, I really think this is a bug that should be fixed.

mongod should run on any filesystem, even if that means handling flags the filesystem does not support.

Comment by Kelsey Schubert [ 12/Oct/18 ]

Thanks for the follow up, henrique.barcellos.

Please note that the SERVER project is for reporting bugs or feature suggestions for the MongoDB server. For MongoDB-related support discussion like this, please post on the mongodb-user group or Stack Overflow with the mongodb tag.

Kind regards,
Kelsey

Comment by Henrique Barcellos [ 09/Oct/18 ]

I have increased spec.template.spec.terminationGracePeriodSeconds to 120 seconds so the pod restarts gracefully, but the error still occurs. What I did to work around the problem is:

lifecycle:
  preStop:
    exec:
      command:
        - sh
        - -c
        - rm -rf /data/db/WiredTiger /data/db/WiredTiger.lock
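
For context, a minimal sketch of where both settings sit in a StatefulSet spec; the image, labels and the clean-shutdown preStop command are illustrative assumptions, not taken from this deployment (mongod --shutdown is Linux-only):

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongo                              # illustrative name
spec:
  serviceName: mongo
  replicas: 1
  selector:
    matchLabels:
      app: mongo
  template:
    metadata:
      labels:
        app: mongo
    spec:
      terminationGracePeriodSeconds: 120   # time allowed for mongod to flush and exit cleanly
      containers:
        - name: mongod
          image: mongo:4.0.3
          command: ["mongod", "--replSet", "rs0", "--bind_ip", "0.0.0.0"]
          lifecycle:
            preStop:
              exec:
                # illustrative alternative: ask mongod to shut down cleanly
                # instead of deleting the WiredTiger metadata and lock files
                command: ["mongod", "--shutdown", "--dbpath", "/data/db"]
          volumeMounts:
            - name: data
              mountPath: /data/db

A clean shutdown in preStop should leave the dbpath in a consistent state, which avoids relying on removing files that the storage engine manages itself.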
 
 

Comment by Nick Brewer [ 08/Oct/18 ]

henrique.barcellos After looking at the strace information and the other details provided in this ticket, we do not believe this is a bug on the MongoDB end - the error you're encountering simply indicates that a requested file is not accessible. I still suspect that this is a problem related to unexpected shutdown and subsequent reuse of the same dbpath, in particular due to the fact that this works once the lock file is deleted.

-Nick

Comment by Henrique Barcellos [ 08/Oct/18 ]

Yes, I just delete the pod and k8s recreates it. When the pod starts I get the error.

I'm using kubectl delete to delete the pod.

Comment by Nick Brewer [ 05/Oct/18 ]

henrique.barcellos I suspect that the mongod is not being shut down correctly, which is leaving the dbpath in a state that causes these errors - can you elaborate on the specific method (kubectl delete, for example) that you're using to delete the pod?

Thanks,
-Nick

Comment by Henrique Barcellos [ 05/Oct/18 ]

I have tried to use the following flags on NFS:

bg,nolock,noatime

Still no success.

With these flags, the NFS PersistentVolume is mounted with these options:

 

rw,noatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.0.23.74,local_lock=none,addr=10.43.65.60

I just created a StatefulSet. The first time it works because there are no locks; when I delete the pod, k8s re-creates it using the same NFS volume and it fails with the error in the issue description.
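
For reference, a minimal sketch of how flags such as bg, nolock and noatime can be requested on a statically defined NFS PersistentVolume through spec.mountOptions; the name, size and paths are illustrative, and with a dynamic provisioner like nfs-server-provisioner the equivalent setting would go on the StorageClass instead:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: mongo-data                 # illustrative name
spec:
  capacity:
    storage: 10Gi                  # illustrative size
  accessModes:
    - ReadWriteOnce
  mountOptions:                    # mount flags the kubelet passes for this volume
    - bg
    - nolock
    - noatime
  nfs:
    server: 10.43.65.60            # server taken from the mount output in this ticket
    path: /export/pvc-ff3b1d90-c812-11e8-ac84-005056aeaeaa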

 

Comment by Nick Brewer [ 04/Oct/18 ]

henrique.barcellos Sorry, I overlooked that line previously - it does appear that the mount command is using different fstab settings (rw, relatime) than what we recommend in our production notes: bg, nolock, and noatime. However I believe the use of local_lock=none covers the nolock requirement.

I'm glad to hear you were able to get it working. Was the data copied from another mongod while that instance was running?

Thanks,
-Nick

Comment by Henrique Barcellos [ 04/Oct/18 ]

Deleting WiredTiger and WiredTiger.lock and re-creating the pod makes mongod start successfully.

Maybe it is a permission problem with these files, or an NFS flag that is incompatible with some parameter used in open(2).

Comment by Henrique Barcellos [ 04/Oct/18 ]

The version is the same.
db version v4.0.3

Since the persistent volume is managed by a Kubernetes deployment (nfs-server-provisioner), there is nothing in fstab about this NFS mount.

I can provide the following info from the mount command (mount | grep pvc-ff3):

10.43.65.60:/export/pvc-ff3b1d90-c812-11e8-ac84-005056aeaeaa on /var/lib/kubelet/pods/42e64dd7-c813-11e8-ac84-005056aeaeaa/volumes/kubernetes.io~nfs/pvc-ff3b1d90-c812-11e8-ac84-005056aeaeaa type nfs4 (rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.0.23.74,local_lock=none,addr=10.43.65.60)

 

I've attached WiredTiger.wt and WiredTiger.turtle.

Comment by Nick Brewer [ 04/Oct/18 ]

henrique.barcellos Thanks for your report. I'd like to confirm a few details:

  • The MongoDB version that was used with the old data before it was moved
  • The /etc/fstab line for your NFS volume
  • The WiredTiger.wt and WiredTiger.turtle files from your dbpath

Thanks,
-Nick
