Details
-
Bug
-
Resolution: Done
-
Major - P3
-
None
-
2.0.1
-
None
-
Windows Server 2008 64-bit, sharded mongod (3 shards, each a replica set with 2 servers), 1 mongos on a separate server, C# driver
-
Windows
Description
When the primary of one shard in a sharded collection goes down, we lose some documents in safe mode (even with fsync=true) while inserting record by record (no batches).
We have built in some failover code that keeps retrying the insert until safe mode no longer throws an exception. Even with this setup, we still see some document loss.
These losses occur at two moments (we ran some tests to try to determine the cause):
1) the moment the primary goes down and a secondary needs to take over
2) the moment the old primary comes back online and is voted primary again in its replica set (looking at the replica set status, there is a moment when both servers are marked as secondary)
On a set of 50,000 records, we lose somewhere between 5 and 10 documents.
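To relate each loss to one of the two moments above, it helps to know exactly which Number values are missing. The following is a minimal sketch, assuming the Number values can be read back from the collection into a sequence of integers. The helper name MissingNumbers and the sample data are hypothetical and stand in for a real query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Program
{
    // Returns the expected Number values (0 .. expectedCount - 1) that are
    // absent from `found`, in ascending order.
    static List<int> MissingNumbers(int expectedCount, IEnumerable<int> found)
    {
        return Enumerable.Range(0, expectedCount).Except(found).ToList();
    }

    static void Main()
    {
        // Stand-in for the Number values read back from the collection
        // (hypothetical data: 7, 8 and 15 were "lost").
        var found = Enumerable.Range(0, 20).Where(n => n != 7 && n != 8 && n != 15);

        var missing = MissingNumbers(20, found);
        Console.WriteLine("Missing {0}: {1}", missing.Count, string.Join(", ", missing));
    }
}
```

If the missing values cluster around the indexes being written at the time of the step-down and the re-election, that would support the two moments described above.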
Enabling the safe-mode option that waits for a replicated write is hard to use in our case: our retry loop would become endless unless we extend the failover code to handle that case as well. However, we think this should be handled by the database itself rather than in application code.
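For reference, the endless loop could be avoided by bounding the retries. This is only a sketch of that idea, not what we currently run: the BoundedRetry helper, the attempt limit and the delay are all hypothetical, and the flaky delegate stands in for an insert performed with a safe mode that waits for replication with a timeout.

```csharp
using System;
using System.Threading;

static class BoundedRetry
{
    // Runs `attempt` until it completes without throwing, at most
    // `maxAttempts` times. Returns true on success, false if every
    // attempt threw.
    public static bool Try(Action attempt, int maxAttempts, TimeSpan delay)
    {
        for (int n = 1; n <= maxAttempts; n++)
        {
            try
            {
                attempt();
                return true;
            }
            catch (Exception ex)
            {
                Console.WriteLine("Attempt {0} of {1} failed: {2}", n, maxAttempts, ex.Message);
                // Give the replica set time to elect a primary before retrying.
                if (n < maxAttempts) Thread.Sleep(delay);
            }
        }
        return false;
    }
}

static class Program
{
    static void Main()
    {
        int calls = 0;
        // Stand-in for the real insert (assumption): fails twice, then succeeds.
        Action flakyInsert = () =>
        {
            calls++;
            if (calls < 3) throw new InvalidOperationException("simulated failover");
        };

        bool ok = BoundedRetry.Try(flakyInsert, 5, TimeSpan.FromMilliseconds(10));
        Console.WriteLine(ok ? "inserted" : "gave up; record should be logged for replay");
    }
}
```

A record that still fails after the last attempt would have to be logged and replayed later, which is exactly the extra failover code we would prefer not to write.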
Here's how we're inserting right now (code without fsync option):
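var safe = new SafeMode(true);
var opts = new MongoInsertOptions(tdCollection);
opts.SafeMode = safe;
for (int i = 0; i < 50000; i++)
{
    try
    {
        var td = new TestClass();
        td.Number = i;
        td.NumberAsString = i.ToString();
        td.Number2 = i * 2;
        bool ok = false;
        // Keep retrying until the safe-mode insert no longer throws.
        while (!ok)
        {
            try
            {
                // The body of this try block is missing from the snippet;
                // an insert through tdCollection with opts is assumed.
                tdCollection.Insert(td, opts);
                ok = true;
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex);
                ok = false;
            }
        }
        Console.WriteLine(i); // argument missing from the snippet; printing i is an assumption
    }
    catch (Exception ex)
    {
        Console.WriteLine(ex); // handler body missing from the snippet; logging is an assumption
    }
}
Console.WriteLine("Done writing 50000 records");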
Is there something we overlooked? Or is this a bug?
Thanks in advance...