[SERVER-10768] add proper support for SIGSTOP and SIGCONT (currently, on replica set primary can cause data loss)
Created: 13/Sep/13  Updated: 10/Dec/14  Resolved: 25/Feb/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: Yandong Mao
Assignee: Matt Dannenberg
Resolution: Duplicate
Votes: 1
Labels: replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Distributor ID: Ubuntu
Description: Ubuntu 12.04.2 LTS
Release: 12.04
Codename: precise


Attachments: File case.tar.gz    
Issue Links:
Duplicate
duplicates SERVER-9765 Two primaries should cause the earlie... Closed
Related
related to SERVER-10793 Start election if more than one prima... Closed
Operating System: Linux
Steps To Reproduce:

Untar the attachment and put all files under MONGODB. Then run MONGODB/my.py
to reproduce the bug. The script triggers the following sequence of events.

  • start a replica set of three replicas: S1, S2, S3.
    Say S1 is primary, S2/S3 are backups
  • insert x:1 into S1
  • Suspend (i.e. send the mongod process a SIGSTOP signal) S1
  • wait for S2/S3 to elect a new leader, say S2
  • insert x:2 (sent to S2) with w=2
  • Resume (i.e. send the mongod process a SIGCONT signal) S1
  • S2 and S3 roll back x:2! But x:2 is supposed to be durable!
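
The suspend and resume steps above can be sketched roughly as follows. This is not the my.py from the attachment; it is a minimal illustration that assumes a POSIX system, and the helper names (suspend, resume, pause_for) are hypothetical. A throwaway child process stands in for the S1 mongod; in a real run the PID of S1 would be passed instead, and the pause would need to last long enough for S2/S3 to elect a new primary.

```python
import os
import signal
import subprocess
import sys
import time


def suspend(pid: int) -> None:
    """Freeze the process; SIGSTOP cannot be caught or ignored by the target."""
    os.kill(pid, signal.SIGSTOP)


def resume(pid: int) -> None:
    """Let a previously stopped process continue running."""
    os.kill(pid, signal.SIGCONT)


def pause_for(pid: int, seconds: float) -> None:
    """Suspend the process, wait, then resume it even if the wait is interrupted."""
    suspend(pid)
    try:
        time.sleep(seconds)
    finally:
        resume(pid)


if __name__ == "__main__":
    # Stand-in for the S1 mongod process (assumption: any long-running child will do).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        pause_for(child.pid, 0.5)
        # The child was only paused, never killed, so it should still be alive.
        print("child still running:", child.poll() is None)
    finally:
        child.terminate()
        child.wait()
```

From the paused process's point of view no time passes between the two signals, which is why S1 can still believe it is primary when it resumes.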
Participants:

 Description   

I am not sure whether the following "problem" is considered unrealistic or whether it is a bug in MongoDB. The problem is that MongoDB may discard data that has been replicated to a majority of servers. These are terrible semantics (note that nothing crashes!).



 Comments   
Comment by Eric Milkie [ 25/Feb/14 ]

Solved by SERVER-9765

Comment by Eric Milkie [ 02/Dec/13 ]

This issue can also affect nodes that pause for other reasons besides active process suspension. For example, if the machine is so busy that scheduling threads becomes very slow, this same situation can occur.

Comment by Matt Dannenberg [ 16/Sep/13 ]

We were able to reproduce the issue thanks to your scripts. The issue pertains only to the use of SIGSTOP and SIGCONT, which is not supported. We believe we can add support for them for the next major release.
