[SERVER-18671] SecondaryPreferred can end up using unversioned connections Created: 27/May/15  Updated: 25/Jan/17  Resolved: 20/Jan/16

Status: Closed
Project: Core Server
Component/s: Admin, Sharding
Affects Version/s: 3.0.3
Fix Version/s: 3.0.10, 3.2.3, 3.3.1

Type: Bug Priority: Major - P3
Reporter: Marcin Lipiec Assignee: Randolph Tan
Resolution: Done Votes: 4
Labels: code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File mongos_startup_log     File repro.diff     File test.js    
Issue Links:
Related
is related to SERVER-22739 Sharding SecondaryPreferred read comm... Closed
is related to SERVER-15375 initShardVersion triggers inline RS r... Closed
Backwards Compatibility: Fully Compatible
Backport Completed:
Steps To Reproduce:

Run test.js after applying repro.diff. You should be able to see the log message near the end of the test.

Sprint: Sharding E (01/08/16), Sharding F (01/29/16), Sharding 11 (03/11/16)
Participants:

 Description   

When mongos tries to set up the shard version for the connection to be used for queries, it checks whether the primary is down with this:

https://github.com/mongodb/mongo/blob/r3.1.5/src/mongo/client/parallel.cpp#L574

bool connIsDown = rawConn->isFailed();

However, if you look at the implementation of isFailed:

return !_master || _master->isFailed();

It returns true when _master is not initialized (i.e., when the replica set connection has not yet talked to the primary), so a perfectly healthy connection is reported as failed. The reason this was fine in v2.6 is that mongos used to eagerly call setShardVersion on every connection it created, so by the time the above codepath is reached, _master is guaranteed to be set unless an error occurred. This is no longer true in v3.0, as SERVER-15375 removed the eager initialization.
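
To make the failure mode concrete, here is a minimal standalone sketch (an editorial illustration with hypothetical type names, not the actual server code): a freshly created replica set connection that has never contacted its primary reports itself as failed.

// Editorial sketch only; hypothetical names, not the mongos source.
#include <cassert>
#include <memory>

// Stand-in for a single-node connection.
struct NodeConnection {
    bool failed = false;
    bool isFailed() const { return failed; }
};

// Stand-in for DBClientReplicaSet with the quoted isFailed() logic.
struct ReplicaSetConnection {
    std::unique_ptr<NodeConnection> _master;  // populated lazily in v3.0

    bool isFailed() const { return !_master || _master->isFailed(); }
};

int main() {
    ReplicaSetConnection conn;            // has never talked to the primary
    assert(conn.isFailed());              // reported as "down" despite being healthy

    conn._master = std::make_unique<NodeConnection>();
    assert(!conn.isFailed());             // correct only once _master is set
}

In other words, bool connIsDown = rawConn->isFailed() evaluates to true for any connection whose lazy initialization has not run yet, which is exactly the state that became common once the eager setShardVersion call was removed.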

Original description from user:

We are following the procedure for upgrading a sharded MongoDB cluster from http://docs.mongodb.org/manual/release-notes/3.0-upgrade/#upgrade-a-sharded-cluster-to-3-0.

After upgrading one of our main mongoses from 2.6.9 to 3.0.3 we started seeing many messages like the following:

2015-05-27T11:27:46.436+0200 W NETWORK  [conn358] Primary for set3/mongo3:27018,mongo8:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.478+0200 W NETWORK  [conn312] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.500+0200 W NETWORK  [conn206] Primary for set2/mongo2:27018,mongo7:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.623+0200 W NETWORK  [conn355] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.688+0200 W NETWORK  [conn98] Primary for set4/mongo4:27018,mongo9:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.738+0200 W NETWORK  [conn469] Primary for set4/mongo4:27018,mongo9:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.816+0200 W NETWORK  [conn180] Primary for set4/mongo4:27018,mongo9:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.846+0200 W NETWORK  [conn288] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.909+0200 W NETWORK  [conn253] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:46.950+0200 W NETWORK  [conn103] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:47.016+0200 W NETWORK  [conn56] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:47.061+0200 W NETWORK  [conn36] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:47.105+0200 W NETWORK  [conn151] Primary for set3/mongo3:27018,mongo8:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:47.197+0200 W NETWORK  [conn138] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.
2015-05-27T11:27:47.337+0200 W NETWORK  [conn360] Primary for set5/mongo10:27018,mongo5:27018 was down before, bypassing setShardVersion. The local replica set view and targeting may be stale.

For now we have rolled back to 2.6.9. Should we continue upgrading the whole cluster, and will those messages be gone after that?



 Comments   
Comment by Githook User [ 23/Feb/16 ]

Author:

Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-18671 SecondaryPreferred can end up using unversioned connections

(cherry picked from commit 1d611a8c7ee346929a4186f524c21007ef7a279d)
Branch: v3.0
https://github.com/mongodb/mongo/commit/5c2737de7776e37d2fbf5259f4ccd3c2f5cf24fa

Comment by Quentin Schroeder [ 09/Feb/16 ]

We are also running into this issue and would love to see the fix backported to 3.0. I'll keep an eye on this ticket to see what decisions are made.

Comment by Ramon Fernandez Marina [ 08/Feb/16 ]

pperekalov, we're assessing whether a backport to 3.0 is doable safely. Any updates will be posted on this ticket.

Comment by Pavel Perekalov [ 05/Feb/16 ]

Can you please tell us when this issue will be backported to 3.0?

Comment by Githook User [ 29/Jan/16 ]

Author:

Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-18671 SecondaryPreferred can end up using unversioned connections
(cherry picked from commit 1d611a8c7ee346929a4186f524c21007ef7a279d)
Branch: v3.2
https://github.com/mongodb/mongo/commit/e9e372e1e2e48c1420c11af63f13b5ec227b4e8c

Comment by Githook User [ 20/Jan/16 ]

Author:

Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-18671 SecondaryPreferred can end up using unversioned connections
Branch: master
https://github.com/mongodb/mongo/commit/1d611a8c7ee346929a4186f524c21007ef7a279d

Comment by Ramon Fernandez Marina [ 02/Jul/15 ]

marcin.lipiec@nokaut.pl, this is to let you know that I've finally been able to reproduce these messages and we're investigating.

Comment by Ramon Fernandez Marina [ 20/Jun/15 ]

Thanks for the update marcin.lipiec@nokaut.pl. I'm still trying to reproduce this issue, but no luck so far. Have you tried removing the read preference to eliminate secondary reads?

Comment by Marcin Lipiec [ 18/Jun/15 ]

We also tried with version 3.0.4 of mongos, and the same messages constantly appeared in the logs.

Comment by Piotr Duda [ 08/Jun/15 ]

Any ideas about this issue?

Comment by Marcin Lipiec [ 01/Jun/15 ]

Ramon, are there any news/ideas?

Comment by Marcin Lipiec [ 28/May/15 ]

Actually we have two kinds of applications:

  • one set is set up to connect to primaries for updates, inserts, etc.
  • the other set is set up with secondaryPreferred for reads only

Comment by Ramon Fernandez Marina [ 27/May/15 ]

Thanks for uploading the logs marcin.lipiec@nokaut.pl. I assume that the connections to the mongos are not doing any operations that alter collection metadata, but can you elaborate on what kind of operations these are? In particular I'm interested in knowing if you have any read preference set, as this warning is dependent on a specific read preference.

Thanks,
Ramón.

Comment by Marcin Lipiec [ 27/May/15 ]

Yes, the warnings appeared on step 4. I have attached the logs you asked for to the issue.

Comment by Ramon Fernandez Marina [ 27/May/15 ]

marcin.lipiec@nokaut.pl, if I understand correctly you completed steps 1 to 3 successfully, but on step 4 you started seeing the warnings above, is that correct?

Also, can you please send us the full logs for the mongos you upgraded from the time it started running with 3.0.3 until it was rolled back to 2.6.9?

Thanks,
Ramón.

Comment by Marcin Lipiec [ 27/May/15 ]

We successfully upgraded the cluster's metadata with mongos --upgrade. Then we proceeded to upgrade our remaining mongos processes. Problems appeared after upgrading one of them: once that mongos was restarted, endpoints started connecting to the upgraded instance and those errors appeared. The errors were seen until we downgraded to 2.6.9 again (on version 3.0.3 they were consistently appearing in the logs).

We're sure that the network was available the whole time during the process.

Comment by Ramon Fernandez Marina [ 27/May/15 ]

marcin.lipiec@nokaut.pl, this warning means that the primary was temporarily unavailable and mongos decided to skip the shard version handshake (since it cannot do so without a primary) and proceed to talk to the secondary (this is only possible if the query/command has the right read preference). This means that reads can potentially be stale, depending on how up to date the secondaries are.

In your case the warning is printed for multiple shards, so this may indicate a temporary network problem in the machine running this 3.0.3 mongos.

Can you elaborate on which step of the upgrade process this warning appeared at, and whether there were any warnings/errors in your system log about network unavailability?

Thanks,
Ramón.
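
To visualize the behaviour described in the comments above, here is a small editorial sketch (hypothetical names and simplified control flow, not the mongos source): when the read preference permits secondary reads and the primary looks down, the setShardVersion handshake is skipped, the warning is emitted, and the read proceeds against a possibly stale secondary.

// Editorial sketch only; hypothetical names, not the mongos source.
#include <iostream>
#include <string>

enum class ReadPref { Primary, PrimaryPreferred, Secondary, SecondaryPreferred, Nearest };

// Assumption for this sketch: any non-primary read preference allows the
// handshake to be skipped, which is why the warning depends on the read
// preference in use.
bool allowsSecondaryReads(ReadPref pref) {
    return pref != ReadPref::Primary;
}

void targetShard(const std::string& setName, bool primaryLooksDown, ReadPref pref) {
    if (allowsSecondaryReads(pref) && primaryLooksDown) {
        std::cout << "W NETWORK Primary for " << setName
                  << " was down before, bypassing setShardVersion. "
                     "The local replica set view and targeting may be stale.\n";
        // The query continues against a secondary without a shard version
        // check, so the data it returns may lag behind the primary.
        return;
    }
    std::cout << "setShardVersion sent to the primary of " << setName << "\n";
}

int main() {
    targetShard("set3/mongo3:27018,mongo8:27018", /*primaryLooksDown=*/true,
                ReadPref::SecondaryPreferred);
    targetShard("set3/mongo3:27018,mongo8:27018", /*primaryLooksDown=*/false,
                ReadPref::SecondaryPreferred);
}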
