[SERVER-6959] 2.0.6 server crashed when movechunk failed because a config server was down
Created: 06/Sep/12  Updated: 15/Feb/13  Resolved: 10/Sep/12

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug
Priority: Minor - P4
Reporter: Mark N
Assignee: Spencer Brody (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.4, mongo 2.0.6. 8 single replica set servers, 3 config servers, multiple mongos


Operating System: ALL
Participants:

 Description   

We have 8 servers, each running as a single-member replica set. This is because we can lose the data at any time and it's okay; we just start over. It's a caching system.

3 config servers
8 single replica sets
multiple mongos (some were 1.8.5 and have since been upgraded to 2.0.7).

We were moving a config server to another location.
The move occurred in the middle of a movechunk.
The movechunk failed.
One of the data replica set servers crashed because of it.

Right around the crash, we had lots of these because of the config server that was offline.

Wed Sep 5 13:52:31 [conn31392] waiting till out of critical section
Wed Sep 5 13:52:31 [conn31392] waiting till out of critical section
Wed Sep 5 13:52:31 [conn31392] waiting till out of critical section

Then

Wed Sep 5 13:52:37 [conn31375] waiting till out of critical section
Wed Sep 5 13:52:37 [conn31383] waiting till out of critical section
Wed Sep 5 13:52:37 [conn27754] ERROR: moveChunk commit failed: version is at32299|1 instead of 32300|1
Wed Sep 5 13:52:37 [conn27754] ERROR: TERMINATING
Wed Sep 5 13:52:37 dbexit:
Wed Sep 5 13:52:37 [conn27754] shutdown: going to close listening sockets...
Wed Sep 5 13:52:37 [conn27754] closing listening socket: 6
Wed Sep 5 13:52:37 [conn27754] closing listening socket: 7
Wed Sep 5 13:52:37 [conn27754] closing listening socket: 9
Wed Sep 5 13:52:37 [conn27754] removing socket file: /tmp/mongodb-27017.sock
Wed Sep 5 13:52:37 [conn27754] shutdown: going to flush diaglog...
Wed Sep 5 13:52:37 [conn27754] shutdown: going to close sockets...
Wed Sep 5 13:52:37 [conn27754] shutdown: waiting for fs preallocator...
Wed Sep 5 13:52:37 [conn31369] waiting till out of critical section
Wed Sep 5 13:52:37 [conn1] end connection 127.0.0.1:54322
Wed Sep 5 13:52:37 [conn244] end connection 10.5.5.165:40494
Wed Sep 5 13:52:37 [conn243] end connection 10.5.5.165:40493
Wed Sep 5 13:52:37 [conn31337] waiting till out of critical section
Wed Sep 5 13:52:37 [conn31375] waiting till out of critical section
Wed Sep 5 13:52:37 [conn31362] end connection 10.5.5.121:39824
Wed Sep 5 13:52:37 [conn31337] waiting till out of critical section
Wed Sep 5 13:52:37 [initandlisten] now exiting
Wed Sep 5 13:52:37 dbexit: ; exiting immediately
Wed Sep 5 13:52:37 [conn30251] end connection 10.5.5.40:54631
Wed Sep 5 13:52:37 [conn27754] shutdown: lock for final commit...

***** SERVER RESTARTED *****

Wed Sep 5 14:07:45 [initandlisten] MongoDB starting : pid=17294 port=27017 dbpath=/var/lib/mongodb 64-bit host=jeroshard08
Wed Sep 5 14:07:45 [initandlisten] db version v2.0.6, pdfile version 4.5
Wed Sep 5 14:07:45 [initandlisten] git version: e1c0cbc25863f6356aa4e31375add7bb49fb05bc
Wed Sep 5 14:07:45 [initandlisten] build info: Linux ip-10-110-9-236 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Wed Sep 5 14:07:45 [initandlisten] options: { config: "/etc/mongodb.conf", dbpath: "/var/lib/mongodb", directoryperdb: "true", journal: "true", logappend: "true", logpath: "/var/log/mongodb/mongodb.log", replSet: "j-h", rest: "true" }

Wed Sep 5 14:07:45 [initandlisten] journal dir=/var/lib/mongodb/journal
Wed Sep 5 14:07:45 [initandlisten] recover begin

Journal recovery then ran and completed without problems.



 Comments   
Comment by Spencer Brody (Inactive) [ 10/Sep/12 ]

Yes, this is expected behavior when a config server fails at a certain point in the middle of a migration. The shard had updated its own state as if the migration had completed, but because the chunk data was never updated on the config server, the shard detects an inconsistent state and shuts down. When the shard comes back online after a restart, it reloads the chunk data from the config server and the migration is effectively reverted.
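
As a rough illustration of the mismatch described here, the following Python sketch pulls the two chunk versions out of the "moveChunk commit failed" line quoted in the description. The regular expression and the function name parse_commit_failure are assumptions made for this example and are not part of MongoDB; the pattern tolerates the missing space after "at" seen in the 2.0.6 log.

```python
import re

# Matches the commit-failure message as it appears in the log excerpt.
# \s* after "at" allows for the missing space in "at32299|1".
COMMIT_FAILED = re.compile(
    r"moveChunk commit failed: version is at\s*(\d+)\|(\d+) instead of (\d+)\|(\d+)"
)


def parse_commit_failure(line):
    """Return ((found_major, found_minor), (expected_major, expected_minor)),
    or None if the line is not a moveChunk commit failure."""
    match = COMMIT_FAILED.search(line)
    if match is None:
        return None
    found_major, found_minor, exp_major, exp_minor = (int(g) for g in match.groups())
    return (found_major, found_minor), (exp_major, exp_minor)


line = (
    "Wed Sep 5 13:52:37 [conn27754] ERROR: moveChunk commit failed: "
    "version is at32299|1 instead of 32300|1"
)
found, expected = parse_commit_failure(line)
print("config server has", found, "but shard expected", expected)
```

For the line from this ticket, the version found is 32299|1 and the version the shard expected after the commit is 32300|1, which is the inconsistency that triggers the shutdown.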

To avoid errors like this in the future, we recommend disabling the balancer before doing any maintenance on the config servers.
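
A minimal sketch of that recommendation in Python, assuming a PyMongo-style collection object for config.settings reached through a mongos. In MongoDB releases of this era the balancer is controlled by the "stopped" field of the document with _id "balancer" in that collection; the helper name set_balancer_stopped and the host in the usage comment are hypothetical, and a driver version compatible with the server is assumed.

```python
def set_balancer_stopped(settings, stopped):
    """Upsert the balancer document in config.settings.

    `settings` is expected to behave like a PyMongo collection
    (only update_one is used). stopped=True disables the balancer,
    stopped=False enables it again.
    """
    settings.update_one(
        {"_id": "balancer"},
        {"$set": {"stopped": bool(stopped)}},
        upsert=True,
    )


# Usage (assumption: a mongos reachable at a hypothetical address):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos-host:27017")
#   set_balancer_stopped(client["config"]["settings"], True)   # before maintenance
#   ... move or restart the config server ...
#   set_balancer_stopped(client["config"]["settings"], False)  # after maintenance
```

Setting the flag does not interrupt a migration that is already running, so it is safest to wait for any in-flight moveChunk to finish before touching the config servers.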
