[SERVER-31340] Fatal Assertion 17441 at src\mongo\db\storage\mmap_v1\record_store_v1_base.cpp 282 Created: 06/Sep/17  Updated: 08/Jul/20  Resolved: 27/Nov/17

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 3.4.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Putra [X] Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 7


Attachments: Text File 2017-09-03 mongo-rep1 mongod repair crash.log     File mongod.2017-09-03T05-36-26.mdmp    
Issue Links:
Related
related to SERVER-30629 Fatal Assertion 17441 at src/mongo/db... Closed
Participants:
Case:

 Description   

Repairing a corrupt MongoDB database fails due to a crash. I have a 400GB MongoDB database consisting of 206 data files (mostly 1.99GB each). File no 198 seems to be corrupt/broken. The primary is using MMAPv1 as storage engine. To repair the database I tried the following:

  1. Deleted the secondary replica set member and replicated again via version 3.2.6 (production). This started creating WiredTiger file types, as desired. This led to a *crash* of MongoDB, which is where the corruption was discovered in the first place.
  2. Removed this large DB from Mongo to see if something else was broken. After removal of this large DB, all other DBs replicate fine and are fully accessible. This indicates the installation and setup are still fine.
  3. mongodump --host somehost:27017 --db large --out C:\db-mongo\large-backup\
     This led to a crash of MongoDB version 3.4.7 (development machine).
  4. mongod --repair --repairpath C:\db-mongo\large-repair\ --config "D:\db-mongo\mongo-db-repair.conf"
     This led to a crash while repairing "large.198" on MongoDB version 3.4.7 after running for 30 hours (the development machine, a 4GHz i7 with 32GB RAM, was unusable/frozen for anything else in that time).

mongo-db-repair.conf:

    dbpath          = F:\mongo-dbs
    logpath         = d:\db-mongo\logs\mongo-rep1.log
    port            = 27017
    logappend       = true
    directoryperdb  = true
    storageEngine   = mmapv1
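For reference, the ini-style options above map onto the YAML configuration format that newer MongoDB documentation uses; a hedged sketch of the equivalent file (same values as above):

```
storage:
  dbPath: F:\mongo-dbs
  directoryPerDB: true
  engine: mmapv1
systemLog:
  destination: file
  path: d:\db-mongo\logs\mongo-rep1.log
  logAppend: true
net:
  port: 27017
```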

The last entry in the log (see attached together with related mini dump) before printing the crash info is:
Fatal Assertion 17441 at src\mongo\db\storage\mmap_v1\record_store_v1_base.cpp 282

The plan is to get the replication going again and then update production to 3.4. Due to MongoDB crashing, this is not possible at present.



 Comments   
Comment by Kelsey Schubert [ 29/Sep/17 ]

Hi Daniel5,

Yes, I would recommend taking another file copy. This will give mongodump --repair the best chance of success. However, please note that we cannot guarantee that this option will succeed, and, if it does succeed, some manual intervention may be required to remove duplicated documents.

Kind regards,
Kelsey
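The manual de-duplication Kelsey mentions could be sketched roughly as follows, assuming the repaired collection were re-exported as JSON lines (e.g. via mongoexport); the file names and sample documents here are hypothetical, not from this ticket:

```shell
# Hedged sketch: mongodump --repair can recover the same document more than
# once. If the repaired collection is re-exported as JSON lines, exact
# duplicates can be dropped while keeping the first occurrence of each.
# Stand-in export with one duplicated document (hypothetical data):
printf '{"_id":1,"v":"a"}\n{"_id":2,"v":"b"}\n{"_id":1,"v":"a"}\n' > large-export.json
awk '!seen[$0]++' large-export.json > large-dedup.json   # keep first occurrence of each line
wc -l < large-dedup.json   # 2
```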

Comment by Daniel Putra [X] [ 27/Sep/17 ]

Hi Kelsey

Now the copy of the DB on my machine fails to start up. I guess this is related to the crash of the last repair process (see earlier in this case). I get the errors below in the log.

2017-09-27T09:42:24.734-0700 I CONTROL [main] Trying to start Windows service 'MongoDB'
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] MongoDB starting : pid=8580 port=27017 dbpath=F:\mongo-dbs 64-bit host=Daniel2016
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] targetMinOS: Windows 7/Windows Server 2008 R2
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] db version v3.4.7
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] git version: cf38c1b8a0a8dca4a11737581beafef4fe120bcd
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] OpenSSL version: OpenSSL 1.0.1u-fips 22 Sep 2016
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] allocator: tcmalloc
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] modules: none
2017-09-27T09:42:24.735-0700 I CONTROL [initandlisten] build environment:
2017-09-27T09:42:24.736-0700 I CONTROL [initandlisten] distmod: 2008plus-ssl
2017-09-27T09:42:24.736-0700 I CONTROL [initandlisten] distarch: x86_64
2017-09-27T09:42:24.736-0700 I CONTROL [initandlisten] target_arch: x86_64
2017-09-27T09:42:24.736-0700 I CONTROL [initandlisten] options: { config: "D:\db-mongo\mongo-dev.conf", net: { port: ... }, service: true, setParameter: { cursorTimeoutMillis: ... }, storage: { dbPath: ... }, systemLog: { destination: ... } }
2017-09-27T09:42:24.736-0700 W - [initandlisten] Detected unclean shutdown - F:\mongo-dbs\mongod.lock is not empty.
2017-09-27T09:42:24.741-0700 I STORAGE [initandlisten] **************
old lock file: F:\mongo-dbs\mongod.lock. probably means unclean shutdown,
but there are no journal files to recover.
this is likely human error or filesystem corruption.
please make sure that your journal directory is mounted.
found 3 dbs.
see: http://dochub.mongodb.org/core/repair for more information
*************
2017-09-27T09:42:24.742-0700 I STORAGE [initandlisten] exception in initAndListen: 12596 old lock file, terminating
2017-09-27T09:42:24.742-0700 I NETWORK [serviceStopWorker] shutdown: going to close listening sockets...
2017-09-27T09:42:24.742-0700 I NETWORK [serviceStopWorker] shutdown: going to flush diaglog...
2017-09-27T09:42:24.742-0700 I CONTROL [serviceStopWorker] now exiting

I now cannot run mongodump --repair as the DB fails to start up.
I tried mongod --repair but it froze my machine (spec further up) after 15 minutes.
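For the record, the documented way past the 12596 "old lock file" startup error is to remove the stale mongod.lock before attempting a repair; a minimal sketch using a stand-in dbpath (not the actual F:\mongo-dbs path from the log, and with the mongod invocation left as a comment):

```shell
# Hedged sketch of the standard recovery for error 12596 ("old lock file"):
# a non-empty mongod.lock left by an unclean shutdown, with no journal files
# to recover, must be removed before mongod will start.
# DBPATH is a stand-in directory; on the machine above it would be F:\mongo-dbs.
DBPATH=./mongo-dbs-demo
mkdir -p "$DBPATH"
echo "8580" > "$DBPATH/mongod.lock"      # simulate the stale, non-empty lock
if [ -s "$DBPATH/mongod.lock" ]; then
    rm "$DBPATH/mongod.lock"             # clear the stale lock file
fi
# mongod --repair --dbpath "$DBPATH"     # then run a repair (not executed here)
```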

I can take another file copy of the DB from the server, but this means shutting down production. As it takes quite some time to copy the 400GB, I can schedule this for the weekend to minimise impact. Should I do that, or is there another way?

Kind regards,
Daniel

Comment by Kelsey Schubert [ 27/Sep/17 ]

Hi Daniel5,

Please note that mongod --repair and mongodump --repair are different operations that utilize different repair algorithms. From the terminal you should see something like:

$ mongodump --repair
2017-09-27T10:51:32.266-0400	writing repair of admin.system.indexes to
2017-09-27T10:51:32.266-0400		repair cursor found 1 document in admin.system.indexes
2017-09-27T10:51:32.266-0400	done dumping admin.system.indexes (0 documents)
2017-09-27T10:51:32.266-0400	writing repair of admin.system.version to
2017-09-27T10:51:32.267-0400		repair cursor found 1 document in admin.system.version
2017-09-27T10:51:32.267-0400	done dumping admin.system.version (0 documents)
2017-09-27T10:51:32.267-0400	writing repair of test.foo to
2017-09-27T10:51:32.267-0400		repair cursor found 1 document in test.foo
2017-09-27T10:51:32.267-0400	done dumping test.foo (0 documents)

Thanks,
Kelsey

Comment by Daniel Putra [X] [ 27/Sep/17 ]

Hi Kelsey
I am not sure what you mean by the output.
I have attached the log file as well as the mini dump. Please let me know what else you require.

Kind regards,
Daniel

Comment by Kelsey Schubert [ 26/Sep/17 ]

Hi Daniel5,

Thank you for answering Mark's questions. Unfortunately, in cases like this it is very difficult to identify the root cause of the corruption. Would you please provide the output of mongodump --repair?

Kind regards,
Kelsey

Comment by Daniel Putra [X] [ 22/Sep/17 ]

Thank you, Mark, for your feedback and attention to this case.

Unfortunately this is the only DB we have. I had to delete the other replica set member a while ago, thinking it was no problem as I could just re-sync it. That is now unfortunately not possible due to the problem described in this case. I need a solution to get replication going again, even if it means a bit of data loss. At present we are also stuck on v3.2, as upgrading the DB is not possible either.

Please find the answers to your questions below.

  1. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? DP: Locally. Are the disks SSDs or HDDs? DP: 2TB Seagate SATA HDD. What kind of RAID and/or volume management system are you using? DP: No RAID, Linux CentOS 7 ext4. The box is sometimes unresponsive, in which case we switch it off completely and then reboot.
  2. Would you please check the integrity of your disks? DP: Done via badblocks, result is: Pass completed, 0 bad blocks found. (0/0/0 errors)
  3. Has the database always been running this version of MongoDB? If not please describe the upgrade/downgrade cycles the database has been through. DP: It started as a MongoDB 2.4 DB for a few years. It was then upgraded to 3.2 in 2016.
  4. Have you manipulated (copied or moved) the underlying database files? If so, was mongod running? DP: Yes, the DB was originally on another server. Before copying, mongod had been stopped. The data was copied onto an external USB3 hard drive and then shipped 1000 km via courier services. This was before the upgrade from v2.4.
  5. Have you ever restored this instance from backups? DP: No. Never made one besides copying the entire DB. We have changed this recently; we now execute mongodump. Unfortunately this is not possible for this DB, as described in this case.
  6. What method do you use to create backups? DP: Stop mongod and then copy all files. This is done max once a year if not only every two years. We rely on the replica sets being the backup.
  7. When was the underlying filesystem last checked and is it currently marked clean? DP: Never checked the filesystem. Let me know if you need something else.

I am very much looking forward to hearing from you.
Daniel

Comment by Mark Agarunov [ 21/Sep/17 ]

Hello Daniel5,

Thank you for the report. Unfortunately, this error indicates that there was corruption on the disk. In this situation, my best recommendation would be to resync the affected node or restore from a backup if possible.

To get an understanding of how this may have happened, I'd like to request some information:

  1. What kind of underlying storage mechanism are you using? Are the storage devices attached locally or over the network? Are the disks SSDs or HDDs? What kind of RAID and/or volume management system are you using?
  2. Would you please check the integrity of your disks?
  3. Has the database always been running this version of MongoDB? If not please describe the upgrade/downgrade cycles the database has been through.
  4. Have you manipulated (copied or moved) the underlying database files? If so, was mongod running?
  5. Have you ever restored this instance from backups?
  6. What method do you use to create backups?
  7. When was the underlying filesystem last checked and is it currently marked clean?

Thanks,
Mark

Generated at Thu Feb 08 04:26:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.