[SERVER-14238] Mongodb server crash after update of replica set member to 2.6 Created: 12/Jun/14  Updated: 10/Dec/14  Resolved: 30/Jun/14

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.6.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tudor Alexandru Assignee: Kaloian Manassiev
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-9856 No check for building identical backg... Closed
Related
Operating System: ALL
Participants:

 Description   

As per http://docs.mongodb.org/manual/release-notes/2.6-upgrade/ i have followed the instructions for upgradming a replica set to 2.6. I have shut down the server with
service stop mongodb
upgraded the mongod binary with apt-get
started the mongodb
this is the output:

2014-06-12T14:22:37.897+0000 ***** SERVER RESTARTED *****
2014-06-12T14:22:37.902+0000 [initandlisten] MongoDB starting : pid=39502 port=27017 dbpath=/mnt/datadrive/mongodb 64-bit host=vt-mongo-4
2014-06-12T14:22:37.902+0000 [initandlisten] db version v2.6.1
2014-06-12T14:22:37.902+0000 [initandlisten] git version: 4b95b086d2374bdcfcdf2249272fb552c9c726e8
2014-06-12T14:22:37.902+0000 [initandlisten] build info: Linux build14.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2014-06-12T14:22:37.902+0000 [initandlisten] allocator: tcmalloc
2014-06-12T14:22:37.903+0000 [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, replication: { replSet: "vt_replicaset2" }, storage: { dbPath: "/mnt/datadrive/mongodb" }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2014-06-12T14:22:37.998+0000 [initandlisten] journal dir=/mnt/datadrive/mongodb/journal
2014-06-12T14:22:37.998+0000 [initandlisten] recover begin
2014-06-12T14:22:37.998+0000 [initandlisten] info no lsn file in journal/ directory
2014-06-12T14:22:37.998+0000 [initandlisten] recover lsn: 0
2014-06-12T14:22:37.998+0000 [initandlisten] recover /mnt/datadrive/mongodb/journal/j._0
2014-06-12T14:22:37.999+0000 [initandlisten] recover cleaning up
2014-06-12T14:22:37.999+0000 [initandlisten] removeJournalFiles
2014-06-12T14:22:38.473+0000 [initandlisten] recover done
2014-06-12T14:22:38.473+0000 [initandlisten] preallocating a journal file /mnt/datadrive/mongodb/journal/prealloc.0
2014-06-12T14:23:54.880+0000 [initandlisten] vt Invariant failure save == _entries.find( descriptor->indexName() ) src/mongo/db/catalog/index_catalog.cpp 156
2014-06-12T14:23:54.901+0000 [initandlisten] vt 0x11c0e91 0x1163109 0x1145978 0x8d2640 0x8d2d5c 0x8aed8c 0x8ba88d 0x761c3e 0x76255f 0x766b1f 0x7673eb 0x767985 0x767ba9 0x7f11f600b76d 0x75eae9 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x1163109]
 /usr/bin/mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x148) [0x1145978]
 /usr/bin/mongod(_ZN5mongo12IndexCatalog24_setupInMemoryStructuresEPNS_15IndexDescriptorE+0x400) [0x8d2640]
 /usr/bin/mongod(_ZN5mongo12IndexCatalog4initEv+0x58c) [0x8d2d5c]
 /usr/bin/mongod(_ZN5mongo10CollectionC1ERKNS_10StringDataEPNS_16NamespaceDetailsEPNS_8DatabaseE+0x30c) [0x8aed8c]
 /usr/bin/mongod(_ZN5mongo8Database13getCollectionERKNS_10StringDataE+0x16d) [0x8ba88d]
 /usr/bin/mongod(_ZN5mongo17checkForIdIndexesEPNS_8DatabaseE+0x19e) [0x761c3e]
 /usr/bin/mongod() [0x76255f]
 /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x5df) [0x766b1f]
 /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1b) [0x7673eb]
 /usr/bin/mongod() [0x767985]
 /usr/bin/mongod(main+0x9) [0x767ba9]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f11f600b76d]
 /usr/bin/mongod() [0x75eae9]
2014-06-12T14:23:54.902+0000 [initandlisten] 
 
***aborting after invariant() failure
 
 
2014-06-12T14:23:54.922+0000 [initandlisten] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11c0e91 0x11c026e 0x7f11f60204a0 0x7f11f6020425 0x7f11f6023b8b 0x11459e5 0x8d2640 0x8d2d5c 0x8aed8c 0x8ba88d 0x761c3e 0x76255f 0x766b1f 0x7673eb 0x767985 0x767ba9 0x7f11f600b76d 0x75eae9 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11c0e91]
 /usr/bin/mongod() [0x11c026e]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f11f60204a0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f11f6020425]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f11f6023b8b]
 /usr/bin/mongod(_ZN5mongo15invariantFailedEPKcS1_j+0x1b5) [0x11459e5]
 /usr/bin/mongod(_ZN5mongo12IndexCatalog24_setupInMemoryStructuresEPNS_15IndexDescriptorE+0x400) [0x8d2640]
 /usr/bin/mongod(_ZN5mongo12IndexCatalog4initEv+0x58c) [0x8d2d5c]
 /usr/bin/mongod(_ZN5mongo10CollectionC1ERKNS_10StringDataEPNS_16NamespaceDetailsEPNS_8DatabaseE+0x30c) [0x8aed8c]
 /usr/bin/mongod(_ZN5mongo8Database13getCollectionERKNS_10StringDataE+0x16d) [0x8ba88d]
 /usr/bin/mongod(_ZN5mongo17checkForIdIndexesEPNS_8DatabaseE+0x19e) [0x761c3e]
 /usr/bin/mongod() [0x76255f]
 /usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x5df) [0x766b1f]
 /usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1b) [0x7673eb]
 /usr/bin/mongod() [0x767985]
 /usr/bin/mongod(main+0x9) [0x767ba9]
 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7f11f600b76d]
 /usr/bin/mongod() [0x75eae9]
^C

After reading this
http://t193327.db-mongodb-user.dbtalk.us/mongodb-upgrade-problem-t193327.html
we have rolled back to 2.4 and we have no problems.



 Comments   
Comment by Ramon Fernandez Marina [ 30/Jun/14 ]

tudor_alexandru, thanks for the update. I'm convinced the root cause of the problem was duplicate indexes as described in SERVER-9856. Since secondaries were not affected by this, promoting a secondary to primary and re-syncing does solve the issue. As explained in SERVER-9856, running db.repairDatabase() fixes the corruption created by duplicate indexes, so that would be another way to fix the issue.

Now that you're up and running again, you may want to consider validating your data to be absolutely sure there's no data corruption after the update (be aware that this is a very resource-intensive process and may have a performance impact on your MongoDB instance, so I'd recommend you run it on off-peak hours or, ideally, on a maintenance window).

Comment by Tudor Alexandru [ 30/Jun/14 ]

@kaloian, i have managed to fix this in the following way:
i tried to upgrade a different replica member, a couple of times, and it worked - i did a couple of (upgrade+downgrade+replicaset catch-up) cycles. It appears one of the states of the datafiles was ok. i forced that member to be primary.
i upgraded the second member and it worked 100%.
i upgraded the third member, but when i restarted that machine a day afterwards, it hit me with the same error(!!!), so i deleted all the replica sets which had 2.6 binaries over 2.4 data and resynced them. now everything works.
I don't think i have the old datafiles, but i have daily backups - do those help.
I believe i have a couple of duplicate indexes, but in the meantime i moved the replicaset to a sharding cluster, and many indexes have been deleted as mongo 2.6 does index intersection.

Comment by Ramon Fernandez Marina [ 28/Jun/14 ]

kajsa_a, what you're seeing is SERVER-9856, a bug in 2.4.4 that triggers data corruption when duplicate indexes are removed.

tudor_alexandru, can you please check if you have any duplicate indexes?

Comment by Kajsa Anderson [ 27/Jun/14 ]

I'm seeing the same error following upgrade from 2.4.4 to 2.6.3, though I see the database and collection name have been added to the error log.

After downgrading back to 2.4.4, I'm able to see many duplicate indexes - that is, db.<collection>.getIndexes() shows several pairs like:

	{ 
                "v" : 1,
		"key" : {"date" : 1},
		"ns" : "<db>.<collection>",
		"name" : "date_1",
		"background" : true
	},
	{
		"v" : 1,
		"key" : {"date" : 1},
		"ns" : "<db>.<collection>",
		"name" : "date_1",
		"background" : true
	},

When I try to drop this index, either by name or by key pattern, I get:

{
	"errmsg" : "exception: BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO",
	"code" : 10334,
	"ok" : 0
}

Comment by Kaloian Manassiev [ 26/Jun/14 ]

Hi Tudor,

The assertion, which you have hit most likely indicates that there are duplicate index entries in the namespace file (.ns file) of one of your databases. I would like to examine this file for possible inconsistencies and to get hints as to how this might have happened.

Would it be possible that you share with us just the namespace file (NOT the data files themselves) ? The log does not indicate, which exactly database is corrupted, but hopefully you don't have too many databases.

We could give you privately an scp share, where you can compress and securely upload the .ns file.

Let me know if that works for you.

-Kal.

Comment by Tudor Alexandru [ 17/Jun/14 ]

I ran db.UpgradeCheckAllDBS over the weekend and it showed no errors. It has a very long output as it seems to check every document.

Comment by Tudor Alexandru [ 13/Jun/14 ]

i shut down the replica with

use admin
db.shutdownserver()

and changed the binaries; i had the same result.
will run dbUpgradeCheckAllDBS over weekend and post results.

Comment by Tudor Alexandru [ 12/Jun/14 ]

Thu Jun 12 14:07:00.869 [conn106212] end connection 10.0.0.31:52049 (24 connections now open)
Thu Jun 12 14:07:00.870 [conn106296] end connection 10.0.0.14:51699 (24 connections now open)
Thu Jun 12 14:07:00.869 [conn106338] end connection 10.0.0.22:38765 (27 connections now open)
Thu Jun 12 14:07:00.938 [rsBackgroundSync] replSet sync source problem: 10278 dbclient error communicating with server: vt-mongo-6:27017
Thu Jun 12 14:07:01.037 [slaveTracking] ERROR: Client::shutdown not called: slaveTracking
Thu Jun 12 14:07:01.198 [DataFileSync] ERROR: Client::shutdown not called: DataFileSync
Thu Jun 12 14:07:03.297 [signalProcessingThread] shutdown: closing all files...
Thu Jun 12 14:07:05.018 [signalProcessingThread]                Progress: 53/93 56%     (File Closing Progress)
Thu Jun 12 14:07:05.838 [signalProcessingThread] closeAllFiles() finished
Thu Jun 12 14:07:05.838 [signalProcessingThread] journalCleanup...
Thu Jun 12 14:07:05.838 [signalProcessingThread] removeJournalFiles
Thu Jun 12 14:07:05.986 [signalProcessingThread] shutdown: removing fs lock...
Thu Jun 12 14:07:05.986 dbexit: really exiting now
 
 
***** SERVER RESTARTED *****
HERE WE TRIED TO UPGRADE
FELL BACK TO 2.4 after a while
 
Thu Jun 12 14:33:58.926 [initandlisten] MongoDB starting : pid=40308 port=27017 dbpath=/mnt/datadrive/mongodb 64-bit host=vt-mongo-4
Thu Jun 12 14:33:58.927 [initandlisten] db version v2.4.10
Thu Jun 12 14:33:58.927 [initandlisten] git version: e3d78955d181e475345ebd60053a4738a4c5268a
Thu Jun 12 14:33:58.927 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Thu Jun 12 14:33:58.927 [initandlisten] allocator: tcmalloc
Thu Jun 12 14:33:58.927 [initandlisten] options: { config: "/etc/mongodb.conf", dbpath: "/mnt/datadrive/mongodb", logappend: "true", logpath: "/var/log/mongodb/mongodb.log", replSet: "vt_replicaset2" }
Thu Jun 12 14:33:59.011 [initandlisten] journal dir=/mnt/datadrive/mongodb/journal
Thu Jun 12 14:33:59.012 [initandlisten] recover begin
Thu Jun 12 14:33:59.012 [initandlisten] info no lsn file in journal/ directory
Thu Jun 12 14:33:59.012 [initandlisten] recover lsn: 0
Thu Jun 12 14:33:59.012 [initandlisten] recover /mnt/datadrive/mongodb/journal/j._0
Thu Jun 12 14:33:59.012 [initandlisten] recover cleaning up
Thu Jun 12 14:33:59.012 [initandlisten] removeJournalFiles
Thu Jun 12 14:33:59.460 [initandlisten] recover done
Thu Jun 12 14:33:59.460 [initandlisten] preallocating a journal file /mnt/datadrive/mongodb/journal/prealloc.0
Thu Jun 12 14:34:33.087 [initandlisten] waiting for connections on port 27017
Thu Jun 12 14:34:33.087 [websvr] admin web console waiting for connections on port 28017
Thu Jun 12 14:34:33.096 [rsStart] replSet I am vt-mongo-4:27017

maybe we should have stopped the replica set with a different command?

vt_replicaset2:SECONDARY> db.upgradeCheckAllDBs()
Checking database admin
2014-06-12T15:32:26.580+0000 error: { "$err" : "not master and slaveOk=false", "code" : 13435 } at src/mongo/shell/query.js:131

if i run the command on the PRIMARY, it takes a lot of time..

Comment by Thomas Rueckstiess [ 12/Jun/14 ]

Hi Tudor,

Thanks for reporting. To further investigate, could you upload the log file of the server that crashed reaching back to before you shut down the server on 2.4 for the upgrade? We'd like to see if the server shut down cleanly at that point.

Are you also able to run the db.upgradeCheckAllDBs() shell helper to see if it picks up any issues? You don't need to upgrade the server again to run this check, just install a mongo shell version 2.6.1 somewhere (can be your laptop or another system) and connect to the server that crashed earlier. Then run:

use admin
db.upgradeCheckAllDBs()

and report back the output.

Thanks,
Thomas

Generated at Thu Feb 08 03:34:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.