[SERVER-18521] replica in STARTUP2 state cannot be stopped Created: 18/May/15 Updated: 09/Jun/15 Resolved: 09/Jun/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin |
| Affects Version/s: | 3.0.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alex Kotenko | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Once it has entered STARTUP2, a server cannot be stopped cleanly; it can only be killed via `kill -9 <pid>`. It will probably stop gracefully once it is through its STARTUP2 phase and into RECOVERING or SECONDARY, but STARTUP2 may be quite a lengthy process (days) if large databases are being replicated. |
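For reference, a minimal sketch of the shutdown attempts involved, assuming a mongod on the default port 27017; `<pid>` is a placeholder for the process id:

```sh
# Check the member's replica set state; a myState of 2 corresponds to STARTUP2.
mongo --port 27017 --eval 'printjson(rs.status().myState)'

# Attempt a clean shutdown; during initial sync this can block for a long time.
mongo --port 27017 admin --eval 'db.shutdownServer()'

# Equivalent: send SIGTERM to the mongod process and wait for it to exit.
kill <pid>

# The last resort described above: force-kill the process (no clean shutdown).
kill -9 <pid>
```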
| Comments |
| Comment by Scott Hernandez (Inactive) [ 09/Jun/15 ] |
|
Our current replication system ensures complete replication of all collections and databases by applying a single serial order of writes, via the oplog, across the whole replica set. A consequence is that a failure in any one collection can stop replication as a whole. In addition, a fatal failure in the storage system of a replica should only cause that single instance, not the whole replica set, to fail. The initial sync process when bringing up or adding a node can unfortunately expose data corruption in the source replica, which can be very problematic when there are only a small number of replicas.
I am closing this as a duplicate since we have these issues covered elsewhere. |
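As background on the oplog mentioned above: every replicated write is recorded in the capped `local.oplog.rs` collection and applied in the same serial order on every member. A small illustration, assuming the default port:

```sh
# Inspect the most recent oplog entry on a replica set member.
mongo --port 27017 local --eval 'printjson(db.getCollection("oplog.rs").find().sort({$natural: -1}).limit(1).next())'
```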
| Comment by Alex Kotenko [ 19/May/15 ] |
|
Not worth the hassle. I'm not that worried about file corruption as such; I'm much more bothered by the fact that a corrupted file is not handled in isolation, affecting only certain collections, but instead causes the whole system to crash, roll back and start over from scratch. That poses a serious problem for Mongo as a database system. |
| Comment by Scott Hernandez (Inactive) [ 19/May/15 ] |
|
It looks like some form of data corruption in the files. If you want to file a new server issue we can take a look, but it would require access to the files and the history of the replica, its hardware, storage and so on, to get an idea of what happened and when. For corruption like this it can be very hard to find the event in the OS/filesystem/server where things went bad, since you will not notice until the database process actually accesses the data. This is the same thing I've seen on many file servers, where infrequently accessed files are corrupted but nobody notices until backup/archival disaster-recovery testing. |
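One way to surface this kind of latent corruption before initial sync trips over it is to validate collections on the source member; a hedged sketch, where `mydb` and `mycoll` are placeholder names, and a full validation reads every document, so it can be slow on large collections:

```sh
# Run a full validation of one collection on the source member.
# "mydb" and "mycoll" are placeholders for the database and collection to check.
mongo --port 27017 mydb --eval 'printjson(db.getCollection("mycoll").validate(true))'
```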
| Comment by Alex Kotenko [ 19/May/15 ] |
|
OK, I'll get that reproduced and logged tomorrow. By the way, my secondary mongodb 3.0.3 instance has just crashed again (4th time) with an identical exception, see below:
This has happened multiple times during my current migration process. I handled it by simply dropping the problematic collections on the primary, though that's far from ideal, especially considering that each of these crashes means a full rollback and an initial sync restarted from scratch (3 TB, over 30 hours). I also wonder how this got into the old DB in the first place, since it has the same max BSON object size limit. I'm migrating from 2.6.7 to 3.0.3. |
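A rough sketch of the workaround described above, where `mydb` and `badColl` are placeholders for whichever collection the sync kept failing on; note that dropping it discards that collection's data:

```sh
# Drop the problematic collection on the primary before restarting the sync.
mongo --port 27017 mydb --eval 'db.getCollection("badColl").drop()'

# Then restart the initial sync on the secondary, e.g. by clearing its dbpath
# and restarting mongod so it re-clones everything from the primary.
```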
| Comment by Scott Hernandez (Inactive) [ 19/May/15 ] |
|
The default log level should be fine. The big parts of initial sync are logged by default. |
| Comment by Alex Kotenko [ 18/May/15 ] |
|
Shall I set the general verbosity level to maximum prior to reproducing this, or would you only need particular areas? |
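If more verbosity were wanted, it can be raised at runtime; a sketch assuming the replication component is the interesting one (the component argument to `db.setLogLevel()` is available in 3.0):

```sh
# Raise overall verbosity to 2 (0 is the default, 5 is the maximum).
mongo --port 27017 admin --eval 'db.setLogLevel(2)'

# Or raise only the replication component, keeping the rest of the log quiet.
mongo --port 27017 admin --eval 'db.setLogLevel(2, "replication")'
```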
| Comment by Scott Hernandez (Inactive) [ 18/May/15 ] |
|
Alex, I understand it can be frustrating when things are not responsive, and we are working across the board to make improvements. To help us understand what happened, can you please provide an example, or logs? Please include the logs from where you issue the shutdown, so we can see what is happening at that time and after. A list of events and the commands run would help as well.
It is expected that a failed initial sync must restart from the beginning and discard partial data, due to how replication and replica sources work. There are many places during initial sync with long-running operations which a clean shutdown will wait for; in particular, the collection data cloning or index building may wait for each db/collection to finish before shutting down. I can assure you that during initial sync (in STARTUP2) there are many places where a shutdown will abort the process and exit, but we are working to make shutdown more responsive while keeping our initial sync behavior consistent between versions. |
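When a shutdown appears to hang during initial sync, it can help to see which long-running operation it is waiting on before resorting to `kill -9`; a minimal sketch, with the log path as a placeholder:

```sh
# List in-progress operations, including internal ones, to see what a clean
# shutdown might be waiting for (e.g. a collection clone or an index build).
mongo --port 27017 admin --eval 'printjson(db.currentOp(true))'

# Capture the portion of the mongod log around the shutdown attempt;
# the path is a placeholder, adjust to match systemLog.path.
tail -n 200 /var/log/mongodb/mongod.log
```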