[SERVER-4159] Dataloss on sharded environment when one server in a replicaset goes down (ungracefully shuts down) Created: 27/Oct/11  Updated: 11/Jul/16  Resolved: 03/Dec/11

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: wouter alleweireldt Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Windows Server 2008 64-bit, sharded mongod (3 shards, each a replica set with 2 servers), 1 mongos on a separate server, C# driver


Operating System: Windows
Participants:

 Description   

When the primary of one shard in a sharded collection goes down, we lose some documents in safe mode (even with fsync=true) while inserting record by record (no batches).

We have built in some failover code that keeps retrying the insert until safe mode no longer throws an exception. Even with this setup, we still see some document loss.

These losses occur at two moments (we ran some tests to determine the cause):
1) the moment the primary goes down and a secondary needs to take over
2) the moment the old primary comes back online and is elected primary again in its replica set (the replica set status shows a moment when both servers are marked as secondary)

On a set of 50,000 records, we lose between 5 and 10 documents.

Enabling the safe-mode option that waits for replication is hard to use in our case: the insert would retry in an endless loop unless we extend our failover code to handle that case as well. We think this should be handled by the database itself rather than in application code...

Here's how we're inserting right now (code without fsync option):

// Acknowledged ("safe mode") writes, without the fsync option.
var safe = new SafeMode(true);
var opts = new MongoInsertOptions(tdCollection);
opts.SafeMode = safe;

for (int i = 0; i < 50000; i++)
{
    try
    {
        var td = new TestClass();
        td.Number = i;
        td.NumberAsString = i.ToString();
        td.Number2 = i * 2;

        // Failover handling: retry the insert until safe mode reports success.
        bool ok = false;
        while (!ok)
        {
            try
            {
                var result = tdCollection.Insert(td, opts);
                ok = result.Ok;
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex);
                ok = false;
            }
        }
        Console.WriteLine(i);
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex.Message);
    }
}
Console.WriteLine("Done writing 50000 records");

Is there something we overlooked? Or is this a bug?

Thanks in advance...



 Comments   
Comment by Spencer Brody (Inactive) [ 03/Dec/11 ]

I'm going to go ahead and resolve this issue. If you are still having problems with this feel free to re-open.

Comment by Spencer Brody (Inactive) [ 01/Nov/11 ]

The only way you can be sure that written data won't be lost on replica set failover is if you ensure that the write is propagated to a majority of nodes in the set before acknowledging it. As Eliot mentioned, this can be done using the w flag on inserts. You can specify the number of nodes to wait for the write to propagate to, or in 2.0 you can specify w='majority', which will ensure that the write goes to a majority of members in the set. If you're doing this, it is probably a good idea to set a wtimeout so that the operation won't hang indefinitely if there's a problem. Then you can have your application code retry some number of times before reporting the error back to the user.
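
The "retry some number of times" part of this advice could look roughly like the sketch below. It is deliberately independent of the driver: BoundedRetry and TryWrite are hypothetical names, and the attempt delegate stands in for a single insert issued with w=2 (or w='majority') and a wtimeout. The assumption is that the delegate returns true once the write is acknowledged and throws or returns false otherwise.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not part of the MongoDB C# driver.
static class BoundedRetry
{
    // Calls attempt() up to maxAttempts times and stops at the first success.
    // Returns true if some attempt was acknowledged, false if all of them failed,
    // so the caller can report the error instead of looping forever.
    public static bool TryWrite(Func<bool> attempt, int maxAttempts, TimeSpan delay,
                                Action<Exception> onError = null)
    {
        for (int n = 1; n <= maxAttempts; n++)
        {
            try
            {
                if (attempt())
                {
                    return true;
                }
            }
            catch (Exception ex)
            {
                // Assumed to cover failover and wtimeout errors raised by the insert.
                if (onError != null)
                {
                    onError(ex);
                }
            }

            // Pause before the next attempt to give the replica set time to elect a primary.
            if (n < maxAttempts && delay > TimeSpan.Zero)
            {
                Thread.Sleep(delay);
            }
        }
        return false;
    }
}
```

In the loop from the description, the inner while (!ok) block would become a single call such as BoundedRetry.TryWrite(() => tdCollection.Insert(td, opts).Ok, 5, TimeSpan.FromSeconds(1), Console.WriteLine), with the record logged as failed when it returns false. The attempt count and delay shown here are placeholders, not recommended values.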

Comment by Eliot Horowitz (Inactive) [ 27/Oct/11 ]

You should try adding w=2.
That will guarantee a write went through to a secondary.
fsync doesn't actually help much or at all in this case.

Comment by Eliot Horowitz (Inactive) [ 27/Oct/11 ]

Will the C# driver throw an exception if there is an error, or just return a result with ok and an error set?
You should be getting an error in the code when that happens.

Generated at Thu Feb 08 03:05:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.