[SERVER-29124] Fatal Assertion 16360 Created: 11/May/17  Updated: 09/Feb/18  Resolved: 18/Jan/18

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Bob Lunney Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Run mongod with replication.
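
A minimal way to stand up a comparable environment (three-member replica set; the set name, hosts, ports, and paths below are illustrative assumptions, not taken from this report) is to start each mongod with --replSet and initiate the set from the shell:

  // Each mongod must be started with the same --replSet name, e.g.:
  //   mongod --replSet rs --port 27017 --dbpath /data/n0
  //   mongod --replSet rs --port 27018 --dbpath /data/n1
  //   mongod --replSet rs --port 27019 --dbpath /data/n2
  rs.initiate({
    _id: "rs",
    members: [
      { _id: 0, host: "localhost:27017" },
      { _id: 1, host: "localhost:27018" },
      { _id: 2, host: "localhost:27019" }
    ]
  });
  rs.status();  // should eventually show one PRIMARY and two SECONDARY members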

Participants:

 Description   

A replica set secondary crashed with the following assertion error:

2017-05-11T09:05:17.741-0400 F REPL     [repl writer worker 3] writer worker caught exception:  :: caused by :: 11000 E11000 duplicate key error <redacted>
2017-05-11T09:05:17.745-0400 I -        [repl writer worker 3] Fatal Assertion 16360
2017-05-11T09:05:17.745-0400 I -        [repl writer worker 3]
 
***aborting after fassert() failure

At the time, the primary was under load, servicing 3.6k reads and 132 writes per second. The other replica set secondary was being rebuilt when this one crashed.

There are multiple 'duplicate key' errors in the primary's log, where documents were rejected on insert, but none of them match the document the crashed secondary reported as a duplicate.
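
One way to check whether the surviving members actually hold identical data is to compare dbHash output from each member; a minimal sketch, where "mydb" and "mycoll" are placeholders for the affected (redacted) namespace:

  // Run while connected directly to each member in turn; matching hash values
  // for the same collection indicate the copies are identical.
  db.getSiblingDB("mydb").runCommand({ dbHash: 1, collections: ["mycoll"] });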

MongoDB 3.2.11 on Amazon Linux, version 2106.03
mongo setup: three node replica set, one primary, two secondaries



 Comments   
Comment by Bob Lunney [ 15/Sep/17 ]

Kelsey,

This issue has not recurred since the initial report. We have also not had a failover since then, but we're not looking forward to the inevitable, either.

Unfortunately, as a Mongo noob, I wasn't aware of the value of the diagnostics directory, and probably destroyed any chance of solving this mystery. I am aware that unique indexes on secondaries somehow rely on WiredTiger's MVCC mechanism, as dumps made from secondaries with mongodump will sometimes contain duplicate data that prevents unique index creation.
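
For what it's worth, a restored dump can be checked for such duplicates before attempting the unique index build; a minimal sketch, where "mydb", "mycoll", and "uniqueField" are placeholders for the affected namespace and the field backing the unique index:

  // Any group with count > 1 would make the unique index build fail.
  db.getSiblingDB("mydb").mycoll.aggregate([
    { $group: { _id: "$uniqueField", count: { $sum: 1 }, ids: { $push: "$_id" } } },
    { $match: { count: { $gt: 1 } } }
  ], { allowDiskUse: true });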

If there is anything else I can do to help please let me know. Otherwise I suggest closing the ticket as unsolvable, since the diagnostic data isn't available.

Thanks for your efforts!

Bob

Comment by Kelsey Schubert [ 15/Sep/17 ]

Hi blunney,

We've been working to understand what has happened here, but haven't had much success. Have you encountered this issue since the initial report?

Thanks,
Kelsey

Comment by Bob Lunney [ 12/May/17 ]

Thomas,

Thanks for your help.

I have uploaded:

  • rs0-mongo.log.gz
  • rs1-mongo.log.gz (see lines 1106 - 1110 for the error)
  • rs2-mongo.log.gz
  • rs2-diagnostic.data.tar.gz
  • rs2-indexes.txt

Sadly, I don't have the diagnostic.data files, nor the indexes for the affected collection from the secondaries at the time of the incident. We needed the secondaries back, so the data directory was purged and the secondaries resynced with the primary. I'll know better next time.

The failed secondary (rs1) was the primary prior to the fatal assertion error. The new primary (rs2) took over via an automatic failover. Just prior to the automatic failover, the other secondary (rs0) was shut down, its data directory purged, and restarted to resync it from the primary (rs1 at the time). Then the failover event occurred, rs2 was elected primary, rs1 crashed, and eventually rs0 began resyncing from rs2.

At this point we have rs2 in PRIMARY mode, rs0 in STARTUP2, and rs1 down, i.e. no secondary to fail over to. I'll let rs0 finish resyncing and transition to SECONDARY mode before resyncing rs1.
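
For reference, those member states can be confirmed from the shell on any reachable member; a minimal check, with no assumptions beyond a working shell connection:

  // Prints each member's host and current state (PRIMARY, SECONDARY, STARTUP2, ...).
  rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr);
  });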

Thanks for your help, and please let me know if there is any more information I can provide.

Comment by Kelsey Schubert [ 12/May/17 ]

Hi blunney,

I've created a secure upload portal where you can upload diagnostic files. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

To help us investigate this issue, would you please provide the following information?

  • An archive of the diagnostic.data directory of the primary
  • An archive of the diagnostic.data directory of the affected secondary
  • Complete log files of the primary
  • Complete log files of the secondary
  • Output of db.collection.getIndexes(), where collection is the affected collection, executed against the primary
  • Output of db.collection.getIndexes(), where collection is the affected collection, executed against the affected secondary (a minimal invocation is sketched below)
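
A minimal invocation for the last two items, with placeholder database and collection names:

  // "mydb" and "mycoll" are placeholders for the affected namespace.
  // Run once while connected to the primary and once against the affected secondary.
  db.getSiblingDB("mydb").mycoll.getIndexes();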

Thank you for your help,
Thomas

Comment by Bob Lunney [ 11/May/17 ]

Correction: MongoDB 3.2.11 on Amazon Linux, version 2016.03

Addition: Running WiredTiger as the storage engine.
