[SERVER-32199] Fatal Assertion 34433 Created: 07/Dec/17  Updated: 09/Mar/18  Resolved: 07/Dec/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Arnold Ligtvoet Assignee: Mark Agarunov
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: repair-SERVER-32199.tar.gz
Operating System: ALL
Participants:

 Description   

When I checked the logfiles on the primary server I noticed errors indicating that the 3rd server in the 3-server cluster could not be reached. When checking on that server I found mongod was not running, so I tried to restart the service and got a 34433 fatal assertion.

This server is running as an arbiter only, so I am a bit surprised to see this WiredTiger error on this machine.

The output of /var/log/mongodb/mongod.log:

/var/log/mongodb/mongod.log

2017-12-07T12:26:32.253+0100 I CONTROL  [main] ***** SERVER RESTARTED *****
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] MongoDB starting : pid=18337 port=27017 dbpath=/var/lib/mongodb 64-bit host=xxxxxxxxxxxxx
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] db version v3.4.7
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] git version: cf38c1b8a0a8dca4a11737581beafef4fe120bcd
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] allocator: tcmalloc
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] modules: none
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] build environment:
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten]     distmod: ubuntu1604
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten]     distarch: x86_64
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten]     target_arch: x86_64
2017-12-07T12:26:32.270+0100 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1", port: 27017 }, replication: { replSetName: "rsym" }, security: { keyFile: "/opt/mongo/mongo-keyfile" }, storage: { dbPath: "/var/lib/mongodb", journal: { enabled: false } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2017-12-07T12:26:32.270+0100 W -        [initandlisten] Detected unclean shutdown - /var/lib/mongodb/mongod.lock is not empty.
2017-12-07T12:26:32.296+0100 I -        [initandlisten] Detected data files in /var/lib/mongodb created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2017-12-07T12:26:32.296+0100 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2017-12-07T12:26:32.296+0100 I STORAGE  [initandlisten]
2017-12-07T12:26:32.296+0100 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2017-12-07T12:26:32.296+0100 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2017-12-07T12:26:32.296+0100 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=256M,session_max=20000,eviction=(threads_min=4,threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),,log=(enabled=false),
2017-12-07T12:26:32.311+0100 I -        [initandlisten] Fatal Assertion 34433 at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 796
2017-12-07T12:26:32.311+0100 I -        [initandlisten]
 
***aborting after fassert() failure

Running 'mongod --dbpath /var/lib/mongodb --repair' resulted in an aborted repair, but after restarting the mongo service the fatal assertion changed to 28578.
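
(Side note for anyone hitting the same sequence: the 28578 below is consistent with running --repair as root, which leaves the data files and the /tmp/mongodb-27017.sock socket owned by root. A sketch of a repair invocation that avoids this, assuming the Ubuntu package's mongodb service user and service name:)

# Stop the service first so the repair has exclusive access to the dbpath.
sudo systemctl stop mongod

# Run the repair as the service user, not root, so ownership of the data
# files and the socket file is preserved.
sudo -u mongodb mongod --dbpath /var/lib/mongodb --repair

sudo systemctl start mongod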

/var/log/mongodb/mongod.log after --repair

2017-12-07T12:44:58.954+0100 I CONTROL  [main] ***** SERVER RESTARTED *****
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] MongoDB starting : pid=19822 port=27017 dbpath=/var/lib/mongodb 64-bit host=xxxxxxxxxxxxx
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] db version v3.4.7
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] git version: cf38c1b8a0a8dca4a11737581beafef4fe120bcd
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.2g  1 Mar 2016
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] allocator: tcmalloc
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] modules: none
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] build environment:
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten]     distmod: ubuntu1604
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten]     distarch: x86_64
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten]     target_arch: x86_64
2017-12-07T12:44:58.971+0100 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { bindIp: "127.0.0.1", port: 27017 }, replication: { replSetName: "rsym" }, security: { keyFile: "/opt/mongo/mongo-keyfile" }, storage: { dbPath: "/var/lib/mongodb", journal: { enabled: false } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2017-12-07T12:44:58.971+0100 W -        [initandlisten] Detected unclean shutdown - /var/lib/mongodb/mongod.lock is not empty.
2017-12-07T12:44:58.996+0100 E NETWORK  [initandlisten] Failed to unlink socket file /tmp/mongodb-27017.sock Operation not permitted
2017-12-07T12:44:58.996+0100 I -        [initandlisten] Fatal Assertion 28578 at src/mongo/util/net/listen.cpp 195
2017-12-07T12:44:58.996+0100 I -        [initandlisten]
 
***aborting after fassert() failure

After removing the '/tmp/mongodb-27017.sock' file and chowning all files in '/var/lib/mongodb' to 'mongodb:mongodb', I am back to the 34433 error when restarting the service. I edited the logfile output above only to remove my external IP address.
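
For reference, the cleanup steps were roughly the following (a sketch, assuming the Ubuntu package's mongodb:mongodb service user):

# Remove the stale socket left behind by the root-owned repair run.
sudo rm /tmp/mongodb-27017.sock

# Return ownership of the data files to the service user.
sudo chown -R mongodb:mongodb /var/lib/mongodb

sudo systemctl restart mongod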



 Comments   
Comment by Mark Agarunov [ 08/Dec/17 ]

Hello aligtvoet,

I'm glad to hear you fixed the issue. While arbiters generally do not contain data, they will still attempt to load any data present in the dbpath, for example if the node held data in the past (it was a secondary or primary at some point), or if it carries an admin database from an upgrade from a version earlier than 3.2. The following log line suggests that this node is attempting to load data from the dbpath:

2017-12-07T12:26:32.296+0100 I -        [initandlisten] Detected data files in /var/lib/mongodb created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
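
If it's unclear whether old data files are present, a quick check along these lines may help (a sketch; paths from the logs above, authentication credentials omitted):

# Data-bearing nodes accumulate collection-*.wt and index-*.wt files;
# an arbiter that has never held data has only a few metadata files.
ls -l /var/lib/mongodb/*.wt

# Confirm the node's role from another, healthy member of the set:
mongo --eval 'rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr); })'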

Thanks,
Mark

Comment by Arnold Ligtvoet [ 08/Dec/17 ]

Hi Mark,

Thanks for all your help. In the end I did a full resync (which was quick given the role of the server). I'm still trying to figure out what happened to the server, as I see no evidence of corruption on the disks (can an arbiter even suffer from this, given that there is no dataset?). I did notice a reboot after which my /etc/hosts config was lost, so maybe that caused part of the problem with the initial sync failing.
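
For anyone else landing here, the resync on an arbiter boiled down to something like this (a sketch, assuming the paths and service name from this ticket):

sudo systemctl stop mongod

# Move the old dbpath aside rather than deleting it, in case the files
# are needed for further diagnosis.
sudo mv /var/lib/mongodb /var/lib/mongodb.corrupt
sudo install -d -o mongodb -g mongodb /var/lib/mongodb

sudo systemctl start mongod
# The node rejoins the replica set as an arbiter and only recreates its
# local metadata, so this completes almost immediately.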

Arnold.

Comment by Mark Agarunov [ 07/Dec/17 ]

Hello aligtvoet,

Unfortunately, this error indicates that there was corruption on the disk. In this situation, my best recommendation would be to resync the affected node or restore from a backup if possible.

Thanks,
Mark

Comment by Arnold Ligtvoet [ 07/Dec/17 ]

Hi Mark,
replaced the files and the error 34433 persists:
Fatal Assertion 34433 at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 796

Please advise which logfiles you need (just /var/log/mongodb/mongod.log?) and whether I can upload them using the same link as before.

In answer to your questions:
1. The server runs on an OpenStack KVM hypervisor with SSD storage. Storage is attached locally. No RAID or volume management, afaik.
2. badblocks -sv /dev/vda1 gives no errors
3. Yes, always 3.4.7
4. No
5. No
6. Erm, good question: none at the moment. This machine was only used as an arbiter in the cluster.
7. tune2fs -l /dev/vda1 reports the filesystem state as clean

I can see this happening in the logfiles (the ":27017" with the empty host is not my edit; it was already empty in the logfile):

2017-12-06T14:24:49.437+0200 I REPL     [replication-1] Starting initial sync (attempt 10 of 10)
2017-12-06T14:24:58.449+0200 I REPL     [replication-0] Initial sync attempt finishing up.
2017-12-06T14:24:58.449+0200 I REPL     [replication-0] Initial Sync Attempt Statistics: { failedInitialSyncAttempts: 9, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1508674999311), initialSyncAttempts: [ { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" }, { durationMillis: 0, status: "InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.", syncSource: ":27017" } ] }
2017-12-06T14:24:58.449+0200 E REPL     [replication-0] Initial sync attempt failed -- attempts left: 0 cause: InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync.
2017-12-06T14:24:58.449+0200 F REPL     [replication-0] The maximum number of retries have been exhausted for initial sync.
2017-12-06T14:24:58.453+0200 E REPL     [replication-0] Initial sync failed, shutting down now. Restart the server to attempt a new initial sync.
2017-12-06T14:24:58.453+0200 I -        [replication-0] Fatal assertion 40088 InitialSyncOplogSourceMissing: No valid sync source found in current replica set to do an initial sync. at src/mongo/db/repl/replication_coordinator_impl.cpp 633
2017-12-06T14:24:58.453+0200 I -        [replication-0] 
 
***aborting after fassert() failure

Thanks,
Arnold.

Comment by Mark Agarunov [ 07/Dec/17 ]

Hello aligtvoet,

Thank you for providing these files. I've attached a repair attempt of the files you provided (repair-SERVER-32199.tar.gz). Would you please extract these files, replace them in your $dbpath (a sketch of this step follows the list below), and let us know if that resolves the issue? If you are still seeing errors after replacing these files, please provide the complete logs from mongod so that we can investigate further. Additionally, if this issue persists, please provide the following information:

  1. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
  2. Would you please check the integrity of your disks?
  3. Has the database always been running this version of MongoDB? If not please describe the upgrade/downgrade cycles the database has been through.
  4. Have you manipulated (copied or moved) the underlying database files? If so, was mongod running?
  5. Have you ever restored this instance from backups?
  6. What method do you use to create backups?
  7. When was the underlying filesystem last checked and is it currently marked clean?
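
A sketch of the replace step mentioned above, assuming the attachment from this ticket, the dbpath from your logs, and that the tarball unpacks the repaired files at its top level:

sudo systemctl stop mongod

# Keep copies of the current metadata files before overwriting them.
sudo mkdir -p /root/wt-backup
sudo cp /var/lib/mongodb/WiredTiger.wt /var/lib/mongodb/WiredTiger.turtle /root/wt-backup/

# Extract the repaired files over the existing ones and fix ownership.
sudo tar xzf repair-SERVER-32199.tar.gz -C /var/lib/mongodb
sudo chown -R mongodb:mongodb /var/lib/mongodb

sudo systemctl start mongod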

Thanks,
Mark

Comment by Arnold Ligtvoet [ 07/Dec/17 ]

Thanks Mark for your quick replies. I have uploaded the files.

Comment by Mark Agarunov [ 07/Dec/17 ]

Hello aligtvoet,

I've created a secure upload portal so that you can send us these files privately. Please note, however, that these files don't contain any user data, only metadata for the WiredTiger engine.

Thanks,
Mark

Comment by Arnold Ligtvoet [ 07/Dec/17 ]

Hi Mark, thanks for removing the host name. Can I share these files without making them public, as I assume they contain a lot of sensitive info?

Comment by Mark Agarunov [ 07/Dec/17 ]

Hello aligtvoet,

Thank you for the report. I've removed the hostname from the output as requested. If you can provide the WiredTiger.wt and WiredTiger.turtle files we can attempt a repair of the database, but please keep in mind that this is not a guaranteed fix.
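
Collecting just those two files could look like this (a sketch; the archive name here is arbitrary, dbpath as in your logs):

# Package only the WiredTiger metadata files for upload.
cd /var/lib/mongodb
sudo tar czf /tmp/wt-metadata.tar.gz WiredTiger.wt WiredTiger.turtle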

Thanks,
Mark

Comment by Arnold Ligtvoet [ 07/Dec/17 ]

Can you please update the issue ASAP and remove the server name, as I cannot edit or close the issue myself? Also, the formatting of the panel is quite strange and I cannot fix that either.
