[SERVER-27142] aborting after fassert() failure Created: 21/Nov/16  Updated: 19/Dec/16  Resolved: 01/Dec/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Bhaskar Mittal Assignee: Kelsey Schubert
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7200 use oplog as op buffer on secondaries Closed
Operating System: ALL
Participants:

 Description   

2016-11-16T03:15:17.815+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2016-11-16T03:15:18.931+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] MongoDB starting : pid=3657 port=27218 dbpath=/data0/mongo-config/data/ 64-bit host=nlhypheap041.navitaire.com
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] db version v3.2.5
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] git version: 34e65e5383f7ea1726332cb175b73077ec4a1b02
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] modules: enterprise
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] build environment:
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten]     distmod: rhel70
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten]     distarch: x86_64
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2016-11-16T03:15:18.941+0000 I CONTROL  [initandlisten] options: { net: { port: 27218 }, processManagement: { fork: true, pidFilePath: "/data0/mongo-config/logs/mongo.pid" }, replication: { replSet: "nl-config" }, sharding: { clusterRole: "configsvr" }, storage: { dbPath: "/data0/mongo-config/data/", directoryPerDB: true }, systemLog: { destination: "file", logAppend: true, logRotate: "reopen", path: "/data0/mongo-config/logs/mongo.log" } }
2016-11-16T03:15:18.967+0000 I -        [initandlisten] Detected data files in /data0/mongo-config/data/ created by the 'wiredTiger' storage engine, so setting the active storage engine to 'wiredTiger'.
2016-11-16T03:15:18.968+0000 W -        [initandlisten] Detected unclean shutdown - /data0/mongo-config/data/mongod.lock is not empty.
2016-11-16T03:15:18.968+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2016-11-16T03:15:18.968+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2016-11-16T03:15:19.312+0000 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2016-11-16T03:15:19.312+0000 I STORAGE  [initandlisten] The size storer reports that the oplog contains 20609664 records totaling to 4061406103 bytes
2016-11-16T03:15:19.312+0000 I STORAGE  [initandlisten] Sampling from the oplog between May 19 14:47:21:1 and Oct 25 02:57:50:1 to determine where to place markers for truncation
2016-11-16T03:15:19.312+0000 I STORAGE  [initandlisten] Taking 254 samples and assuming that each section of oplog contains approximately 809888 records totaling to 159599111 bytes
2016-11-16T03:15:19.368+0000 I STORAGE  [initandlisten] Placing a marker at optime Jun  7 11:19:24:3
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jun 13 02:15:58:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jun 18 17:01:32:3
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jun 24 11:54:17:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jun 30 07:43:00:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul  5 23:03:52:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul 11 11:09:09:4
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul 17 00:15:26:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Jul 23 21:23:32:3
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug  3 00:31:17:2
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug  8 16:23:12:4
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug 14 08:43:22:2
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug 19 10:08:03:2
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug 24 17:35:23:2
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Aug 29 23:49:29:3
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Sep  4 08:40:28:7
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Sep  9 19:07:06:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Sep 15 02:15:51:2
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Sep 20 17:42:11:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Sep 26 03:12:45:3
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Oct  1 10:07:27:4
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Oct  6 17:05:48:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Oct 11 19:52:32:5
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Oct 17 05:09:14:1
2016-11-16T03:15:19.369+0000 I STORAGE  [initandlisten] Placing a marker at optime Oct 22 11:59:40:1
2016-11-16T03:15:19.410+0000 I CONTROL  [initandlisten]
2016-11-16T03:15:19.410+0000 I CONTROL  [initandlisten] ** WARNING: Insecure configuration, access control is not enabled and no --bind_ip has been specified.
2016-11-16T03:15:19.410+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted,
2016-11-16T03:15:19.410+0000 I CONTROL  [initandlisten] **          and the server listens on all available network interfaces.
2016-11-16T03:15:19.410+0000 I CONTROL  [initandlisten]
2016-11-16T03:15:19.420+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/data0/mongo-config/data/diagnostic.data'
2016-11-16T03:15:19.420+0000 I NETWORK  [initandlisten] waiting for connections on port 27218
2016-11-16T03:15:19.420+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
2016-11-16T03:15:19.440+0000 I REPL     [ReplicationExecutor] New replica set config in use: { _id: "nl-config", version: 1, configsvr: true, protocolVersion: 1, members: [ { _id: 0, host: "nlhypseap041.navitaire.com:27218", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 4.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 1, host: "nlhypheap041.navitaire.com:27218", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 3.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "nlhypseap040.navitaire.com:27218", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 3, host: "nlhypheap040.navitaire.com:27218", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('573dd1f90c43e512f78a0eb1') } }
2016-11-16T03:15:19.440+0000 I REPL     [ReplicationExecutor] This node is nlhypheap041.navitaire.com:27218 in the config
2016-11-16T03:15:19.440+0000 I REPL     [ReplicationExecutor] transition to STARTUP2
2016-11-16T03:15:19.440+0000 I REPL     [ReplicationExecutor] Starting replication applier threads
2016-11-16T03:15:19.440+0000 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-11-16T03:15:19.445+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to nlhypheap040.navitaire.com:27218
2016-11-16T03:15:19.446+0000 I REPL     [ReplicationExecutor] Member nlhypheap040.navitaire.com:27218 is now in state SECONDARY
2016-11-16T03:15:19.446+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to nlhypseap041.navitaire.com:27218
2016-11-16T03:15:19.448+0000 I ASIO     [NetworkInterfaceASIO-Replication-0] Successfully connected to nlhypseap040.navitaire.com:27218
2016-11-16T03:15:19.448+0000 I REPL     [ReplicationExecutor] Member nlhypseap041.navitaire.com:27218 is now in state PRIMARY
2016-11-16T03:15:19.448+0000 I NETWORK  [initandlisten] connection accepted from 10.213.26.71:45776 #1 (1 connection now open)
2016-11-16T03:15:19.450+0000 I REPL     [ReplicationExecutor] Member nlhypseap040.navitaire.com:27218 is now in state SECONDARY
2016-11-16T03:15:20.130+0000 I NETWORK  [initandlisten] connection accepted from 10.213.26.70:53168 #2 (2 connections now open)
2016-11-16T03:15:20.340+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.71:35348 #3 (3 connections now open)
2016-11-16T03:15:21.110+0000 I NETWORK  [initandlisten] connection accepted from 10.213.26.70:53169 #4 (4 connections now open)
2016-11-16T03:15:21.239+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.70:41543 #5 (5 connections now open)
2016-11-16T03:15:22.237+0000 I NETWORK  [initandlisten] connection accepted from 10.213.26.71:45782 #6 (6 connections now open)
2016-11-16T03:15:24.249+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.70:41545 #7 (7 connections now open)
2016-11-16T03:15:24.249+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.70:41546 #8 (8 connections now open)
2016-11-16T03:15:24.349+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.71:35352 #9 (9 connections now open)
2016-11-16T03:15:24.350+0000 I NETWORK  [initandlisten] connection accepted from 10.212.26.71:35351 #10 (10 connections now open)
2016-11-16T03:15:25.442+0000 I REPL     [ReplicationExecutor] syncing from: nlhypheap040.navitaire.com:27218
2016-11-16T03:15:25.445+0000 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to nlhypheap040.navitaire.com:27218
2016-11-16T03:15:25.447+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to nlhypheap040.navitaire.com:27218
2016-11-16T03:15:25.449+0000 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 47, timestamp: Oct 25 02:57:50:1). source's GTE: (term: 48, timestamp: Oct 25 02:57:50:1) hashes: (502322552398015638/711015327143776815)
2016-11-16T03:15:25.449+0000 I -        [rsBackgroundSync] Fatal assertion 18750 UnrecoverableRollbackError: need to rollback, but in inconsistent state. minvalid: (term: 48, timestamp: Oct 25 02:57:50:3) > our last optime: (term: 47, timestamp: Oct 25 02:57:50:1)
2016-11-16T03:15:25.449+0000 I -        [rsBackgroundSync]
 
***aborting after fassert() failure
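The fatal assertion fires because the node's minValid (term 48, Oct 25 02:57:50:3) is ahead of its last applied optime (term 47, Oct 25 02:57:50:1), so a rollback cannot proceed safely. A minimal diagnostic sketch for confirming that state, assuming the 3.2-era local.replset.minvalid document and the usual local.oplog.rs collection, run from a mongo shell against the affected node started standalone (same dbpath, no --replSet):

// Diagnostic only, not a fix: compare minValid with the last applied oplog entry.
use local
db.replset.minvalid.findOne()                        // expected here: ts Oct 25 02:57:50:3, t: 48
db.oplog.rs.find().sort({ $natural: -1 }).limit(1)   // expected here: ts Oct 25 02:57:50:1, t: 47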



 Comments   
Comment by Bhaskar Mittal [ 15/Dec/16 ]

Hello Team,

We have 4 MongoDB servers in production, and this issue is stopping only one mongod process on a single server.

We are currently in a change freeze and would need another month to plan the recommended version upgrade on production. Can you share a workaround to resolve this issue? The client is questioning why only one mongod process on a single server would be affected while the other three servers are up and running.

Let me know if you require further information about any of the production MongoDB servers.

Regards,
Bhaskar Mittal

Comment by Kelsey Schubert [ 05/Dec/16 ]

Hi bhaskar.mittal@accenture.com,

As I've mentioned, the solution is to upgrade to MongoDB 3.2.11. You could also investigate whether there are new network conditions that are causing more failovers to take place.
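One rough way to see how recently a failover has occurred is sketched below (run from a mongo shell connected to any member of the set; electionDate is reported for the member currently in PRIMARY state):

// Show which member is primary and when it was last elected.
var s = rs.status();
s.members.forEach(function (m) {
    if (m.stateStr === "PRIMARY") {
        printjson({ host: m.name, electionDate: m.electionDate });
    }
});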

Kind regards,
Thomas

Comment by Bhaskar Mittal [ 02/Dec/16 ]

Hello Thomas,

To give some background, we have 4 MongoDB servers making up 2 different shards.
The configuration is similar to the one we have in lower environments, and it had been working fine on production for a long time.

It only started giving this error some time back.

Could we have a solution that does not require any configuration changes?

Thanks,
Bhaskar Mittal

Comment by Kelsey Schubert [ 01/Dec/16 ]

Hi bhaskar.mittal@accenture.com,

I would suggest setting the priority to the same value for all electable nodes. This will reduce the number of elections that occur, and consequently how often you encounter this issue. In addition, some users have seen benefit from setting chainingAllowed to false. However, even with these changes, you may still hit this bug.
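As a sketch, a reconfiguration along these lines (illustrative values; run against the primary and adapt to your own member list) equalizes the priorities of electable members and disables chaining:

cfg = rs.conf();
cfg.members.forEach(function (m) {
    if (!m.arbiterOnly && m.priority > 0) {   // leave arbiters / non-electable members at priority 0
        m.priority = 1;
    }
});
cfg.settings.chainingAllowed = false;
rs.reconfig(cfg);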

I strongly recommend upgrading to MongoDB 3.2.11 as soon as possible, which also contains a fix for SERVER-22970.

Kind regards,
Thomas

Comment by Bhaskar Mittal [ 01/Dec/16 ]

Hello Team,

Below is the requested config data:

MongoDB shell version: 3.2.5
connecting to: 10.213.26.71:27118/test
Server has startup warnings:
2016-11-24T03:23:47.232+0000 I CONTROL  [initandlisten]
2016-11-24T03:23:47.232+0000 I CONTROL  [initandlisten] ** WARNING: Insecure configuration, access control is not enabled and no --bind_ip has been specified.
2016-11-24T03:23:47.232+0000 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted,
2016-11-24T03:23:47.232+0000 I CONTROL  [initandlisten] **          and the server listens on all available network interfaces.
2016-11-24T03:23:47.232+0000 I CONTROL  [initandlisten]
MongoDB Enterprise nl-he:PRIMARY> use admin
switched to db admin
MongoDB Enterprise nl-he:PRIMARY> rs.config()
{
        "_id" : "nl-he",
        "version" : 1,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 0,
                        "host" : "nlhypheap041.navitaire.com:27118",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 2,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "nlhypseap041.navitaire.com:27118",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "nlhypheap040.navitaire.com:27118",
                        "arbiterOnly" : true,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 0,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {
 
                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                },
                "replicaSetId" : ObjectId("573dd1fa87c77b3fef58b092")
        }
}
MongoDB Enterprise nl-he:PRIMARY>

Comment by Kelsey Schubert [ 30/Nov/16 ]

You would need to execute this command directly against a mongod.
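For example (host and port taken from the rs.config() output elsewhere in this ticket; substitute whichever member you want to inspect), connect to the member's mongod port rather than the mongos:

# Connect to the member's mongod port, not the mongos:
mongo nlhypheap041.navitaire.com:27118
# then, in the mongo shell:
rs.conf()        // rs.config() is an alias for rs.conf()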

Comment by Bhaskar Mittal [ 30/Nov/16 ]

Hello Team,

How can we get the output of rs.config()?

Before running this command on production, I tried it on a lower environment and got the output below.

bash-4.2$ mongo 127.0.0.1:27017
MongoDB shell version: 3.2.5
connecting to: 127.0.0.1:27017/test
Server has startup warnings:
2016-11-09T04:22:46.100+0000 I CONTROL  [main]
2016-11-09T04:22:46.100+0000 I CONTROL  [main] ** WARNING: Insecure configuration, access control is not enabled and no --bind_ip has been specified.
2016-11-09T04:22:46.100+0000 I CONTROL  [main] **          Read and write access to data and configuration is unrestricted,
2016-11-09T04:22:46.100+0000 I CONTROL  [main] **          and the server listens on all available network interfaces.
2016-11-09T04:22:46.100+0000 I CONTROL  [main]
MongoDB Enterprise mongos> db
test
MongoDB Enterprise mongos> rs.config
function () {
    var resp = db._adminCommand({replSetGetConfig: 1});
    if (resp.ok && !(resp.errmsg) && resp.config)
        return resp.config;
    else if (resp.errmsg && resp.errmsg.startsWith("no such cmd"))
        return db.getSisterDB("local").system.replset.findOne();
    throw new Error("Could not retrieve replica set config: " + tojson(resp));
}
MongoDB Enterprise mongos>
MongoDB Enterprise mongos> rs.config()
2016-11-30T10:20:47.052+0000 E QUERY    [thread1] Error: error: { "$err" : "can't use 'local' database through mongos", "code" : 20 } :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
DBCommandCursor@src/mongo/shell/query.js:689:1
DBQuery.prototype._exec@src/mongo/shell/query.js:118:28
DBQuery.prototype.hasNext@src/mongo/shell/query.js:276:5
DBCollection.prototype.findOne@src/mongo/shell/collection.js:289:10
rs.conf@src/mongo/shell/utils.js:1195:16
@(shell):1:1

Comment by Kelsey Schubert [ 25/Nov/16 ]

Hi bhaskar.mittal@accenture.com,

If you could provide the output of rs.config(), we may be able to suggest changes to your configuration that would mitigate this issue. However, there is no workaround, and we strongly recommend upgrading to 3.2.11 if you are encountering this issue.

Kind regards,
Thomas

Comment by Bhaskar Mittal [ 22/Nov/16 ]

Hello,

Do we have any other workaround that would avoid upgrading to 3.2.11? This is currently an issue in our production environment.

Regards,
Bhaskar Mittal

Comment by Ramon Fernandez Marina [ 22/Nov/16 ]

bhaskar.mittal@accenture.com, can you please upgrade to 3.2.11 and try again? I think this is a case that we've fixed with SERVER-7200. You may need to resync your node too.
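If a resync is needed, the usual approach is to empty the member's data directory and let it initial-sync from the other members. A rough sketch, using the dbpath from this ticket and keeping a backup copy rather than deleting anything outright (check the official resync procedure for your version before running it):

# On the affected member only, with its mongod stopped cleanly:
mv /data0/mongo-config/data /data0/mongo-config/data.bak    # keep the old files until the resync completes
mkdir /data0/mongo-config/data                              # recreate with the same ownership/permissions
# Restart mongod with its usual options; it will perform an initial sync from the other members.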

Thanks,
Ramón.

Comment by Bhaskar Mittal [ 21/Nov/16 ]

Somewhere on Google I found the suggestion to "try to delete it, i.e. 'sudo rm /tmp/mongodb-27218.sock', and then restart the mongo process".

Will this work for me? What would be the impact if I delete this sock file? Will I lose any data?
