[SERVER-13570] mongos does not work if ReplicaSetMonitor no master found Created: 13/Apr/14  Updated: 26/Sep/14  Resolved: 04/Sep/14

Status: Closed
Project: Core Server
Component/s: Diagnostics, Networking, Querying
Affects Version/s: 2.2.2, 2.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrey Godin Assignee: Jacob Ribnik
Resolution: Incomplete Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongos.log    
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

Test configuration:
The sharded database video_user_data is spread across two replica sets.
ReplicaSet-1

video-test-mongodb-1:> rs.conf()
{
	"_id" : "video-test-mongodb-1",
	"version" : 3,
	"members" : [
		{
			"_id" : 0,
			"host" : "mongo01a.vd:27018"
		},
		{
			"_id" : 1,
			"host" : "mongo01b.vd:27018"
		},
		{
			"_id" : 2,
			"host" : "mongo01c.vd:27018"
		}
	]
}

ReplicaSet-2

video-test-mongodb-2:SECONDARY> rs.conf()
{
	"_id" : "video-test-mongodb-2",
	"version" : 3,
	"members" : [
		{
			"_id" : 0,
			"host" : "mongo02a.vd:27018"
		},
		{
			"_id" : 1,
			"host" : "mongo02b.vd:27018"
		},
		{
			"_id" : 2,
			"host" : "mongo02c.vd:27018"
		}
	]
}

A mongos is running in front of them.
The problem arises when one of the replica sets loses its PRIMARY.

A simple Python script for the test:

#!/usr/bin/env python
# -*- coding: UTF-8 -*-
 
import pymongo
 
conn = pymongo.Connection('mongo01b.vd:27017')
db = conn.video_user_data
coll = db.films
 
counter=0
for user in coll.find({},partial=True):
        counter+=1
print "%s" % counter
 

Normal run of the script:

mongo01b.vd:~# ./get.py  
96601

Run that shows the bug:

mongo01b.vd:~# time ./get.py  
Traceback (most recent call last):
  File "./get.py", line 16, in <module>
    for user in coll.find({},partial=True):
  File "/usr/lib/python2.6/dist-packages/pymongo/cursor.py", line 814, in next
    if len(self.__data) or self._refresh():
  File "/usr/lib/python2.6/dist-packages/pymongo/cursor.py", line 763, in _refresh
    self.__uuid_subtype))
  File "/usr/lib/python2.6/dist-packages/pymongo/cursor.py", line 720, in __send_message
    self.__uuid_subtype)
  File "/usr/lib/python2.6/dist-packages/pymongo/helpers.py", line 100, in _unpack_response
    error_object["$err"])
pymongo.errors.OperationFailure: database error: ReplicaSetMonitor no master found for set: video-test-mongodb-2
 
real	0m32.829s
user	0m0.056s
sys	0m0.000s

Questions:
1) Why does it take so long to detect the problem? During this time our synchronous backend, which is under heavy load, has its entire request queue blocked.
2) When one of the replica sets is lost, I expect to still be able to get data from the other replica set of the sharded cluster.

Am I perhaps using the driver incorrectly?
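
Regarding question 1, one possible client-side mitigation is to bound how long the backend waits for a single query, so that a stalled request cannot hold a worker for 30 seconds. The following is a minimal sketch under stated assumptions, not a description of how pymongo or mongos behave: `call_with_deadline` and `slow_query` are hypothetical names, the code uses only the Python 3 standard library, and the abandoned call keeps running in its worker thread until it returns on its own.

```python
import concurrent.futures
import time


def call_with_deadline(fn, seconds, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread and wait at most `seconds`.

    Raises concurrent.futures.TimeoutError if the deadline passes. The worker
    thread is not cancelled; it finishes in the background.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=seconds)
    finally:
        # Do not block on the (possibly still running) worker.
        pool.shutdown(wait=False)


def slow_query():
    """Hypothetical stand-in for a query that hangs on an unreachable shard."""
    time.sleep(0.3)
    return ["doc"]


try:
    call_with_deadline(slow_query, 0.05)
except concurrent.futures.TimeoutError:
    print("query abandoned after deadline")
```

This only frees the caller. Where the driver offers a socket-level timeout, that is usually the better tool, because it also gives up the underlying connection.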

MongoDB version:

mongodb=1:2.4.9.yandex1

Thanks!



 Comments   
Comment by Jacob Ribnik [ 04/Sep/14 ]

Hi Andrey,

I've closed this ticket as well as we haven't heard from you. Please feel free to reopen and continue the discussion in SUPPORT-991 at any time.

Jake

Comment by Andrey Godin [ 27/May/14 ]

Ping?
Is there something you expect from me?

Comment by Andrey Godin [ 02/May/14 ]

Can I help you?

Comment by Evgeniy Zaitsev [ 29/Apr/14 ]

We have checked this bug on MongoDB 2.6.
The bug still exists.

The test script uses the partial bit.
We expect that mongos MUST return data (partial=True), but in fact no data is returned.

1. All replica sets are available to mongos.
Everything is OK; we got 1000000 records.
%%
$ time ./test.py
Collection(Database(Connection('jkp-mrs03', 27017), u'test'), u'test_collection')
1000000

real 0m9.977s
user 0m7.032s
sys 0m0.136s
%%

2. The master of the 1st replica set is not reachable from mongos (iptables -j DROP).
1st request: no answer within 2 minutes
2nd request: no data returned

1st request:
%%
$ time ./test.py
Collection(Database(Connection('jkp-mrs03', 27017), u'test'), u'test_collection')
^CTraceback (most recent call last):
File "./test.py", line 12, in <module>
for user in coll.find({},partial=True):
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 904, in next
if len(self.__data) or self._refresh():
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 848, in _refresh
self.__uuid_subtype))
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 782, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 1042, in _send_message_with_response
response = self.__send_and_receive(message, sock_info)
File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 1020, in __send_and_receive
return self.__receive_message_on_socket(1, request_id, sock_info)
File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 1003, in __receive_message_on_socket
header = self.__receive_data_on_socket(16, sock_info)
File "/usr/lib/python2.7/dist-packages/pymongo/mongo_client.py", line 991, in __receive_data_on_socket
chunk = sock_info.sock.recv(length)
KeyboardInterrupt

real 2m10.274s
user 0m0.044s
sys 0m0.028s
%%

2nd request:
%%
$ time ./test.py
Collection(Database(Connection('jkp-mrs03', 27017), u'test'), u'test_collection')
Traceback (most recent call last):
File "./test.py", line 12, in <module>
for user in coll.find({},partial=True):
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 904, in next
if len(self.__data) or self._refresh():
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 848, in _refresh
self.__uuid_subtype))
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 800, in __send_message
self.__uuid_subtype)
File "/usr/lib/python2.7/dist-packages/pymongo/helpers.py", line 100, in _unpack_response
error_object["$err"])
pymongo.errors.OperationFailure: database error: ReplicaSetMonitor no master found for set: jkp_db_mongo-testing-default-1

real 0m0.130s
user 0m0.060s
sys 0m0.008s
%%

3. The master of the 1st replica set is not reachable from mongos (iptables -j DROP).
The master of the 2nd replica set is not reachable from mongos (iptables -j DROP).

No data returned.

%%
$ time ./test.py
Collection(Database(Connection('jkp-mrs03', 27017), u'test'), u'test_collection')
Traceback (most recent call last):
File "./test.py", line 12, in <module>
for user in coll.find({},partial=True):
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 904, in next
if len(self.__data) or self._refresh():
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 848, in _refresh
self.__uuid_subtype))
File "/usr/lib/python2.7/dist-packages/pymongo/cursor.py", line 800, in __send_message
self.__uuid_subtype)
File "/usr/lib/python2.7/dist-packages/pymongo/helpers.py", line 100, in _unpack_response
error_object["$err"])
pymongo.errors.OperationFailure: database error: ReplicaSetMonitor no master found for set: jkp_db_mongo-testing-default-2

real 0m4.107s
user 0m0.048s
sys 0m0.016s
%%
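
The behaviour expected from the partial bit in the scenarios above can be modelled with a small scatter-gather sketch. This is a toy model of the expectation only, not mongos's actual implementation: `ShardUnavailable`, `scatter_gather`, `unavailable`, and the shard names are hypothetical, and the code uses only the Python 3 standard library.

```python
class ShardUnavailable(Exception):
    """Raised by a hypothetical shard reader that has no reachable primary."""


def scatter_gather(shard_readers, partial=False):
    """Collect documents from every shard.

    shard_readers maps a shard name to a zero-argument callable that returns
    a list of documents. With partial=False the first unavailable shard aborts
    the whole query; with partial=True unavailable shards are skipped and
    their names are returned alongside the documents.
    """
    docs, skipped = [], []
    for name in sorted(shard_readers):
        try:
            docs.extend(shard_readers[name]())
        except ShardUnavailable:
            if not partial:
                raise
            skipped.append(name)
    return docs, skipped


def unavailable():
    """Hypothetical reader for a replica set that has lost its primary."""
    raise ShardUnavailable("no master found")


readers = {"rs-1": unavailable, "rs-2": lambda: [{"_id": 1}, {"_id": 2}]}
docs, skipped = scatter_gather(readers, partial=True)
print(len(docs), skipped)
```

Under this model, scenario 2 would return the documents held by the healthy replica set, and only scenario 3 would return nothing.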

Comment by Andrey Godin [ 28/Apr/14 ]

ping?

Comment by Andrey Godin [ 15/Apr/14 ]

Hi Thomas,

We use mongos because our backend works with a sharded MongoDB cluster.
mongo01b.vd:27017 is the mongos.
mongo01b.vd:27018 is a mongod node of the replica set.

The mongos logs from the time of the problem are attached to this ticket.

To reproduce the problem while working through mongos, run the following on the PRIMARY node of a replica set:

iptables -A INPUT -s mongo01b.vd -p tcp -m tcp --dport 27018 -j REJECT --reject-with icmp-port-unreachable
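
To confirm that a firewall rule like the one above has taken effect, the port can be probed from the mongos host. This is a minimal sketch, assuming Python 3 and only the standard library. `probe` is a hypothetical helper, and the mapping of outcomes to iptables targets describes typical behaviour rather than a guarantee: a REJECT rule usually makes the connection fail immediately, while a DROP rule usually makes it time out.

```python
import socket


def probe(host, port, timeout=2.0):
    """Classify a TCP port as 'open', 'refused', 'timeout' or 'error'.

    'refused' is what a closed port or a REJECT rule typically produces;
    'timeout' is what a DROP rule typically produces.
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.timeout:
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError:
        # Covers other failures, such as name resolution errors.
        return "error"
    sock.close()
    return "open"
```

Calling `probe(primary_host, 27018)` on the mongos host, where `primary_host` is the current PRIMARY, should return "open" before the rule is added and something else afterwards.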

Full sh.status() output when all PRIMARY nodes are available:

mongos> sh.status()
--- Sharding Status --- 
  sharding version: {
	"_id" : 1,
	"version" : 3,
	"minCompatibleVersion" : 3,
	"currentVersion" : 4,
	"clusterId" : ObjectId("515045432ceb043cdbb2184d")
}
  shards:
	{  "_id" : "video-test-mongodb-1",  "host" : "video-test-mongodb-1/mongo01a.vd.yandex.net:27018,mongo01b.vd.yandex.net:27018,mongo01c.vd.yandex.net:27018" }
	{  "_id" : "video-test-mongodb-2",  "host" : "video-test-mongodb-2/mongo02a.vd.yandex.net:27018,mongo02b.vd.yandex.net:27018,mongo02c.vd.yandex.net:27018" }
  databases:
	{  "_id" : "admin",  "partitioned" : false,  "primary" : "config" }
	{  "_id" : "video_meta",  "partitioned" : true,  "primary" : "video-test-mongodb-1" }
		video_meta.users
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	1
			{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong(134685) } on : video-test-mongodb-2 Timestamp(2, 0) 
			{ "_id" : NumberLong(134685) } -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 1) 
	{  "_id" : "video_live_apps",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }
	{  "_id" : "test",  "partitioned" : false,  "primary" : "video-test-mongodb-2" }
	{  "_id" : "users",  "partitioned" : false,  "primary" : "video-test-mongodb-2" }
	{  "_id" : "video_bazinga",  "partitioned" : false,  "primary" : "video-test-mongodb-2" }
	{  "_id" : "condig",  "partitioned" : false,  "primary" : "video-test-mongodb-2" }
	{  "_id" : "video_misc",  "partitioned" : false,  "primary" : "video-test-mongodb-2" }
	{  "_id" : "video_moderator",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }
	{  "_id" : "video_fod",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }
	{  "_id" : "video_user_data",  "partitioned" : true,  "primary" : "video-test-mongodb-1" }
		video_user_data.album_films
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	1
			{ "_id" : { "$minKey" : 1 } } -->> {
	"_id" : {
		"u" : NumberLong(92860164),
		"n" : 100000002,
		"fu" : NumberLong(92860164),
		"fn" : 1
	}
} on : video-test-mongodb-2 Timestamp(2, 0) 
			{
	"_id" : {
		"u" : NumberLong(92860164),
		"n" : 100000002,
		"fu" : NumberLong(92860164),
		"fn" : 1
	}
} -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 1) 
		video_user_data.albums
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	2
			{ "_id" : { "$minKey" : 1 } } -->> { "_id" : { "u" : NumberLong(134685), "n" : 100000001 } } on : video-test-mongodb-2 Timestamp(2, 0) 
			{ "_id" : { "u" : NumberLong(134685), "n" : 100000001 } } -->> { "_id" : { "u" : NumberLong("1130000000614322"), "n" : 100000001 } } on : video-test-mongodb-1 Timestamp(2, 2) 
			{ "_id" : { "u" : NumberLong("1130000000614322"), "n" : 100000001 } } -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 3) 
		video_user_data.film_albums
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	1
			{ "_id" : { "$minKey" : 1 } } -->> {
	"_id" : {
		"u" : NumberLong(92860164),
		"n" : 1,
		"au" : NumberLong(92860164),
		"an" : 100000002
	}
} on : video-test-mongodb-2 Timestamp(2, 0) 
			{
	"_id" : {
		"u" : NumberLong(92860164),
		"n" : 1,
		"au" : NumberLong(92860164),
		"an" : 100000002
	}
} -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 1) 
		video_user_data.films
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	11
				video-test-mongodb-1	10
			too many chunks to print, use verbose if you want to force print
		video_user_data.url_to_ext_film_id
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	2
			{ "_id" : { "$minKey" : 1 } } -->> { "_id" : "http://rutube.ru/tracks/43531.html" } on : video-test-mongodb-2 Timestamp(2, 0) 
			{ "_id" : "http://rutube.ru/tracks/59435.html" } -->> { "_id" : "http://www.qwey.ru/watch/78424-cvet-cheremuhi" } on : video-test-mongodb-1 Timestamp(2, 2) 
			{ "_id" : "http://www.qwey.ru/watch/78424-cvet-cheremuhi" } -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 3) 
		video_user_data.users
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	1
			{ "_id" : { "$minKey" : 1 } } -->> { "_id" : NumberLong(161207) } on : video-test-mongodb-2 Timestamp(2, 0) 
			{ "_id" : NumberLong(161207) } -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 1) 
		video_user_data.votes
			shard key: { "_id" : 1 }
			chunks:
				video-test-mongodb-2	1
				video-test-mongodb-1	1
			{ "_id" : { "$minKey" : 1 } } -->> {
	"_id" : {
		"u" : NumberLong("4611686018550228167"),
		"n" : 98434511,
		"v" : NumberLong(1549920)
	}
} on : video-test-mongodb-2 Timestamp(2, 0) 
			{
	"_id" : {
		"u" : NumberLong("4611686018550228167"),
		"n" : 98434511,
		"v" : NumberLong(1549920)
	}
} -->> { "_id" : { "$maxKey" : 1 } } on : video-test-mongodb-1 Timestamp(2, 1) 
	{  "_id" : "video_view_counts",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }
	{  "_id" : "video_storage",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }
	{  "_id" : "meta",  "partitioned" : false,  "primary" : "video-test-mongodb-1" }

Comment by Thomas Rueckstiess [ 14/Apr/14 ]

Hi Andrey,

Thanks for reporting this issue. To better understand what the problem is, I need to get some more information.

Your Python example shows a Connection to only one single node:

conn = pymongo.Connection('mongo01b.vd:27017')

Is mongo01b.vd:27017 a mongos? In the replica set config, all nodes are on port 27018.

Can you explain the steps involved to go from "normal" to "bug"? Did you kill the primaries of the replica sets?

Can you please provide the output of sh.status() from the mongos?

Thanks,
Thomas

Comment by Andrey Godin [ 13/Apr/14 ]

mongos> db.databases.find({'partitioned':true})
{ "_id" : "video_user_data", "partitioned" : true, "primary" : "video-test-mongodb-1" }
