[SERVER-3316] Syncing a new replica in a replica set crashes the primary and leaves secondary in strange state
Created: 23/Jun/11  Updated: 12/Jul/16  Resolved: 24/Jun/11

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Mike K
Assignee: Scott Hernandez (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu Natty on EC2
Attachments: segfault.txt, uncommitted.txt
Operating System: ALL
Participants:

 Description   

Our setup is as follows:

2 shards of 3 machines each (1 primary, 1 secondary, 1 arbiter). Each shard holds about 35 GB of data and runs 1.8.1.

We lost a secondary today, so we are trying to resync a new secondary from scratch. Two things have each happened at least twice during this process:

1. The primary segfaults. This happened while the secondary was mid-sync, and also (three times) while the secondary was shut down and not communicating with the primary at all.
2. The secondary, once it has finished syncing and building its indexes, repeatedly logs "DR102 too much data written uncommitted" (the same error as SERVER-2737, but a different situation as far as I can tell).
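
To gauge how often the secondary logs this error, the DR102 lines in a log can be counted. The following is a minimal Python sketch, not part of the original report: the function name `count_dr102` and the sample lines are hypothetical, and it assumes only that each occurrence appears on its own line and contains the literal text "DR102".

```python
def count_dr102(lines):
    """Count lines that mention the DR102 error code."""
    return sum(1 for line in lines if "DR102" in line)


# Hypothetical sample lines for illustration only; real mongod log
# lines carry timestamps and other fields not reproduced here.
sample = [
    "example: DR102 too much data written uncommitted",
    "example: some unrelated message",
    "example: DR102 too much data written uncommitted",
]
print(count_dr102(sample))
```

Because the function accepts any iterable of strings, an open log file can be passed to it directly instead of the sample list.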

We've tried stopping all the mongods, removing the local files on the primary, starting it up, re-initializing its replica set, and then syncing again, but this led to the same results (we cleared all data off the secondary first, too).

I've attached the logs for both segfaults (one was running with verbose=false, the other with verbose=true) and a sample of the DR102 errors on the secondary.



 Comments   
Comment by Mike K [ 24/Jun/11 ]

Can confirm that 1.8.2 fixed the DR102 issue. The segfaults we saw on the primary may be related to some EC2 issues, so this issue can probably be closed for now; we'll keep an eye out.

Comment by Scott Hernandez (Inactive) [ 23/Jun/11 ]

Please upgrade to 1.8.2; it has fixes for many causes of the DR102 error, as well as other fixes.
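
As an illustration of the upgrade advice above, here is a minimal Python sketch that checks whether a given version string is older than 1.8.2. The helper names `parse_version` and `needs_upgrade` are hypothetical, and the sketch assumes plain dotted numeric versions such as "1.8.1"; a string with a suffix (for example a release-candidate tag) would raise `ValueError`.

```python
def parse_version(text):
    """Turn a dotted version string such as '1.8.1' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_upgrade(current, minimum="1.8.2"):
    """Return True if `current` is older than `minimum`."""
    return parse_version(current) < parse_version(minimum)


print(needs_upgrade("1.8.1"))  # True
print(needs_upgrade("1.8.2"))  # False
```

Comparing tuples of integers rather than raw strings avoids mis-ordering versions such as "1.10.0" and "1.8.2".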
