When a replica set is reconfigured (e.g. by forcing a member to become primary), the mongo driver may raise a Mongo::OperationFailure error with the message "10054: not master". This happens because the master has changed but the Mongo connection still points to the previous one. Reconnecting after this error seems to work, but only once the new primary has been elected (which can take some time).
While this could be handled by the application, it would make sense for mongoid to handle this error and attempt to reconnect. In fact, mongoid already does this in the Mongoid::Collections::Retry module, but it only rescues Mongo::ConnectionFailure. The only difference is that Mongo::OperationFailure can be raised with other error messages, meaning different kinds of errors, especially when using safe mode (you can check for it in here).
My first attempt to solve this would be to add another rescue like this:
    def retry_on_connection_failure
      retries = 0
      begin
        yield
      rescue Mongo::ConnectionFailure => ex
        retries += 1
        raise ex if retries > Mongoid.max_retries_on_connection_failure
        Kernel.sleep(0.5)
        retry
      rescue Mongo::OperationFailure => ex
        if ex.message =~ /not master/
          # master has changed, retrying to connect
          retries += 1
          raise ex if retries > Mongoid.max_retries_on_connection_failure
          Kernel.sleep(0.5)
          retry
        else
          # some other Mongo::OperationFailure error, re-raising it
          raise ex
        end
      end
    end
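To try the idea without a real replica set, here is a minimal, self-contained sketch of the same retry logic. It is only an illustration under a few assumptions: the two Mongo exception classes are stubbed locally (in a real app they come from the mongo gem), MAX_RETRIES is a hypothetical stand-in for Mongoid.max_retries_on_connection_failure, and retry_on_failover with its max_retries and delay parameters are names I made up for this example.

```ruby
# Stand-ins for the driver's exception classes so the sketch runs without
# a MongoDB connection. In a real app these come from the mongo gem.
module Mongo
  class ConnectionFailure < StandardError; end
  class OperationFailure < StandardError; end
end

# Hypothetical stand-in for Mongoid.max_retries_on_connection_failure.
MAX_RETRIES = 3

# Retries the block on connection failures and on "not master" operation
# failures. Any other Mongo::OperationFailure is re-raised immediately.
def retry_on_failover(max_retries: MAX_RETRIES, delay: 0.5)
  retries = 0
  begin
    yield
  rescue Mongo::ConnectionFailure, Mongo::OperationFailure => ex
    # Only "not master" operation failures are treated as transient.
    raise if ex.is_a?(Mongo::OperationFailure) && ex.message !~ /not master/
    retries += 1
    raise if retries > max_retries
    Kernel.sleep(delay)
    retry
  end
end

# Simulated failover: the first two calls fail as if the old primary
# had stepped down, the third one succeeds.
attempts = 0
result = retry_on_failover(delay: 0) do
  attempts += 1
  raise Mongo::OperationFailure, "10054: not master" if attempts < 3
  :ok
end
```

Both error types share one retry counter here, the same as in the method above, so a failover that produces a mix of connection failures and "not master" errors still gives up after the configured number of retries.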
Any suggestions on this topic?