Priority: Critical - P2
Affects Version/s: 3.0.0
Linux kernel 3.13.0-46-generic and 3.10.23-xxxx-std-ipv6-64
Hosting : OVH (France)
We had a replica set of three dedicated servers running Ubuntu 14.04. Everything was working fine on 2.6.8. We stopped one of them to reinstall Ubuntu (clean slate) and install MongoDB 3.0 with WiredTiger. The initialization went fine and replication was working well (the primary was still on 2.6.8). Using rs.slaveOk(), we could see that the data was up to date.
We then tried to promote it to primary by calling rs.freeze(600) on the other secondary and rs.stepDown() on the primary. After less than 10 seconds, mongod crashed. After we started mongod again, the server worked again as a secondary.
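For anyone trying to reproduce this, the failover attempt can be sketched for the mongo shell roughly as follows. This is a sketch under assumptions, not the exact commands we typed: the function name forceElection and its parameters are hypothetical, and it sends the replSetFreeze and replSetStepDown admin commands that rs.freeze() and rs.stepDown() wrap.

```javascript
// Sketch only. `otherSecondaryAdmin` and `primaryAdmin` are assumed to be
// handles on the "admin" database of each member, for example obtained in the
// mongo shell with new Mongo("host:port").getDB("admin").
function forceElection(otherSecondaryAdmin, primaryAdmin, freezeSecs, stepDownSecs) {
  // Same command that rs.freeze(freezeSecs) sends: this member will not seek
  // election for freezeSecs seconds.
  var frozen = otherSecondaryAdmin.runCommand({ replSetFreeze: freezeSecs });
  if (!frozen.ok) {
    throw new Error("replSetFreeze failed: " + JSON.stringify(frozen));
  }
  try {
    // Same command that rs.stepDown(stepDownSecs) sends: the primary steps
    // down and stays ineligible for stepDownSecs seconds.
    primaryAdmin.runCommand({ replSetStepDown: stepDownSecs });
  } catch (e) {
    // The stepping-down primary may close client connections, so an error
    // here does not necessarily mean that the step-down itself failed.
    return { frozen: frozen, stepDownError: String(e) };
  }
  return { frozen: frozen, stepDownError: null };
}
```

With two such handles, a call like forceElection(secondaryAdmin, primaryAdmin, 600, 60) would correspond to the freeze of 600 seconds described above; the 60 second step-down period is an assumed value.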
Since it is stated that each server should be on 2.6.8 or, preferably, 3.0, we tried upgrading the second secondary server, this time without reinstalling everything (we only updated the Ubuntu sources list and ran sudo apt-get update / sudo apt-get upgrade). The server was upgraded and restarted as a secondary without any issue. Queries could be run from the shell without problems.
Then we called rs.freeze(600) on the 3.0 WiredTiger secondary and rs.stepDown() on the primary, which was still on 2.6.8. Very quickly, the upgraded server crashed and the previous primary became primary again. The log was identical to the crash log on the first server.
After restarting the second 3.0 server as a secondary, we hoped that it might work if all the running servers in the replica set were on 3.0. So we called rs.freeze(600) on the first 3.0 secondary server (the one using WiredTiger) and shut down the primary by running sudo service mongod stop from the Ubuntu shell. The 3.0 server using MMAPv1 became primary and crashed after a few seconds. We then started the 2.6.8 server again and, after a very short rollback, it became primary again.
Here is an example of what we could see in the log on the 3.0 servers after the crash:
We can see in the log that some queries were executed successfully just before the crash, for example:
So now we have a primary on 2.6.8 and two secondaries on 3.0 that cannot become primary.