[SERVER-4159] Dataloss on sharded environment when one server in a replicaset goes down (ungracefully shuts down) Created: 27/Oct/11 Updated: 11/Jul/16 Resolved: 03/Dec/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | wouter alleweireldt | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Windows Server 2008 64-bit; sharded mongod (3 shards, each a replica set with 2 servers, so 3 replica sets in total); 1 mongos on a separate server; C# driver |
||
| Operating System: | Windows |
| Participants: |
| Description |
|
When the primary of one shard in a sharded collection goes down, we lose some documents even though we insert in safe mode (even with fsync=true), one record at a time (no batches).

We have built in some failover code that keeps retrying the insert until safe mode no longer throws an exception. Even with this setup, we still see some document loss. The losses occur at 2 moments (we ran some tests to try to determine the cause):

On a record set of 50,000 records, we lose somewhere between 5 and 10 documents.

Enabling the safe mode option that waits for a replicated write is hard to use in our case, because the insert would then retry in an endless loop unless we expand our failover code to catch this case as well. However, we think this should be handled by the database itself rather than in application code.

Here is how we are inserting right now (code without the fsync option):

var safe = new SafeMode(true); for (int i = 0; i < 50000; i++) var td = new TestClass(); catch (Exception ex) { Console.WriteLine(ex); ok = false; } } }

Is there something we overlooked, or is this a bug? Thanks in advance. |
| Comments |
| Comment by Spencer Brody (Inactive) [ 03/Dec/11 ] |
|
I'm going to go ahead and resolve this issue. If you are still having problems with this, feel free to re-open. |
| Comment by Spencer Brody (Inactive) [ 01/Nov/11 ] |
|
The only way you can be sure that written data won't be lost on replica set failover is if you ensure that the write is propagated to a majority of nodes in the set before acknowledging it. As Eliot mentioned, this can be done using the w flag on inserts. You can specify the number of nodes to wait for the write to propagate to, or in 2.0 you can specify w='majority', which will ensure that the write goes to a majority of members in the set. If you're doing this, it is probably a good idea to set a wtimeout so that the operation won't hang indefinitely if there's a problem. Then you can have your application code retry some number of times before reporting the error back to the user. |
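
The bounded retry described in this comment could look roughly like the sketch below. It is driver-agnostic and written in Python for illustration only: `insert_with_retry`, `WriteTimeout`, and `make_flaky_insert` are hypothetical names, and `insert_fn` stands in for whatever driver call performs the insert with w='majority' and a wtimeout and raises when the write is not acknowledged in time.

```python
import time


class WriteTimeout(Exception):
    """Hypothetical error: the write was not acknowledged by enough members in time."""


def insert_with_retry(insert_fn, doc, max_attempts=5, backoff_s=0.0):
    """Call insert_fn(doc) until it succeeds or max_attempts is exhausted.

    Assumption: insert_fn blocks until a majority of the replica set has
    acknowledged the write and raises on wtimeout or failover.
    Returns the number of attempts used; re-raises the last error on failure.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            insert_fn(doc)
            return attempt
        except Exception as exc:  # a real client would catch driver-specific errors
            last_error = exc
            time.sleep(backoff_s * attempt)  # linear backoff between attempts
    raise last_error


def make_flaky_insert(store, failures):
    """Build a fake insert that fails `failures` times, then writes to `store`."""
    remaining = {"n": failures}

    def insert(doc):
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise WriteTimeout("not acknowledged by a majority in time")
        store[doc["_id"]] = doc

    return insert


if __name__ == "__main__":
    store = {}
    used = insert_with_retry(make_flaky_insert(store, 2), {"_id": 1, "value": "a"})
    print("attempts used:", used, "stored ids:", list(store))
```

The cap on attempts is what avoids the endless loop mentioned in the description: once it is reached, the error goes back to the caller instead of being retried forever. A timed-out write may still have been applied, so giving each document a client-generated `_id` means a retry of a write that did succeed shows up as a duplicate-key error rather than a second copy.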
| Comment by Eliot Horowitz (Inactive) [ 27/Oct/11 ] |
|
You should try adding w=2. |
| Comment by Eliot Horowitz (Inactive) [ 27/Oct/11 ] |
|
Will the C# driver throw an exception if there is an error, or will it just return a result with an error set? |