|
Hi Yauhen,
How readily are you able to reproduce this?
I agree with James that this potentially relates to SERVER-8059/SERVER-8870, which would occur when you have multiple MongoS.
Would you be able to test running flushRouterConfig on each of your MongoS after the drop of a database and see if that stops the issue from re-occuring as you have seen?
Thanks,
David
|
|
We see this quite frequently: SERVER-8870, CS-8109 and to me there appear to be two ways to get the mongos into inconsistent states.
- After movePrimary many of the mongos will not have noticed a move
- When a database is dropped and recreated, some mongoses will reflect the correct primary shard for the database, and others will show the previous
flushRouterConfig appears to fix the split brain for us. So far I've been unable to reproduce it reliably outside our cluster.
|
|
Sorry for long post.
Finally I catch this thing on our dev environment.
I spend many time to exclude from logs not related information. Trying to save all logs.
Now, looks like it is not a 'movePrimary()' issue.
The setup on dev: 3 shard servers (without replicas) 3 routers (application on same host uses localhost for mongodb connection)
I found the time when database had been dropped and after 3 minutes had been recreated.
Note: nobody uses direct connect to mongod in our setup (app uses mongos on localhost)
|
Logs from First Router
|
# No related logs
|
|
Logs from Second Router
|
Sat Jan 11 15:27:30.031 [conn4021] DROP DATABASE: org_dev2_sys_quartz
|
Sat Jan 11 15:27:30.031 [conn4021] erased database org_dev2_sys_quartz from local registry
|
Sat Jan 11 15:27:30.033 [conn4021] DBConfig::dropDatabase: org_dev2_sys_quartz
|
Sat Jan 11 15:27:30.033 [conn4021] about to log metadata event: { _id: "SRV249A-2014-01-11T12:27:30-52d138b2262f902e0d2511b7", server: "SRV249A", clientAddr: "N/A", time: n
|
ew Date(1389443250033), what: "dropDatabase.start", ns: "org_dev2_sys_quartz", details: {} }
|
Sat Jan 11 15:27:30.315 [conn4021] DBConfig::dropDatabase: org_dev2_sys_quartz dropped sharded collections: 0
|
Sat Jan 11 15:27:30.340 [conn4021] about to log metadata event: { _id: "SRV249A-2014-01-11T12:27:30-52d138b2262f902e0d2511b8", server: "SRV249A", clientAddr: "N/A", time: n
|
ew Date(1389443250340), what: "dropDatabase", ns: "org_dev2_sys_quartz", details: {} }
|
Sat Jan 11 15:27:36.735 [conn4021] DROP DATABASE: org_dev2_sys_quartz
|
Sat Jan 11 15:27:36.735 [conn4021] erased database org_dev2_sys_quartz from local registry
|
Sat Jan 11 15:27:36.737 [conn4021] DBConfig::dropDatabase: org_dev2_sys_quartz
|
Sat Jan 11 15:27:36.738 [conn4021] about to log metadata event: { _id: "SRV249A-2014-01-11T12:27:36-52d138b8262f902e0d2511be", server: "SRV249A", clientAddr: "N/A", time: new Date(1389443256737), what: "dropDatabase.start", ns: "org_dev2_sys_quartz", details: {} }
|
Sat Jan 11 15:27:36.873 [conn4021] DBConfig::dropDatabase: org_dev2_sys_quartz dropped sharded collections: 0
|
Sat Jan 11 15:27:36.908 [conn4021] about to log metadata event: { _id: "SRV249A-2014-01-11T12:27:36-52d138b8262f902e0d2511bf", server: "SRV249A", clientAddr: "N/A", time: new Date(1389443256908), what: "dropDatabase", ns: "org_dev2_sys_quartz", details: {} }
|
Sat Jan 11 15:27:39.134 [Balancer] distributed lock 'balancer/SRV249A:27017:1388738934:1804289383' acquired, ts : 52d138ba262f902e0d2511c0
|
Sat Jan 11 15:27:39.203 [Balancer] distributed lock 'balancer/SRV249A:27017:1388738934:1804289383' unlocked.
|
|
Logs from Third Router
|
# No related logs
|
|
Log From Shard 001
|
Sat Jan 11 15:27:30.403 [conn1351] dropDatabase org_dev2_sys_quartz starting
|
Sat Jan 11 15:27:30.409 [conn1351] removeJournalFiles
|
Sat Jan 11 15:27:30.426 [conn1351] dropDatabase org_dev2_sys_quartz finished
|
...
|
# Somebody request to create this db? Not me :)
|
Sat Jan 11 15:30:02.414 [FileAllocator] allocating new datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.ns, filling with zeroes...
|
Sat Jan 11 15:30:02.849 [FileAllocator] done allocating datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.ns, size: 16MB, took 0.434 secs
|
Sat Jan 11 15:30:02.850 [FileAllocator] allocating new datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.0, filling with zeroes...
|
Sat Jan 11 15:30:02.854 [FileAllocator] done allocating datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.0, size: 64MB, took 0.003 secs
|
Sat Jan 11 15:30:02.854 [FileAllocator] allocating new datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.1, filling with zeroes...
|
Sat Jan 11 15:30:02.860 [conn776] build index org_dev2_sys_security.User { _id: 1 }
|
Sat Jan 11 15:30:02.863 [conn776] build index done. scanned 0 total records. 0.002 secs
|
Sat Jan 11 15:30:02.863 [conn776] info: creating collection org_dev2_sys_security.User on add index
|
Sat Jan 11 15:30:02.863 [conn776] build index org_dev2_sys_security.User { _e3sId: 1 }
|
Sat Jan 11 15:30:02.864 [FileAllocator] done allocating datafile /var/lib/mongodb-sh-001/org_dev2_sys_security.1, size: 128MB, took 0.002 secs
|
Sat Jan 11 15:30:02.865 [conn776] build index done. scanned 0 total records. 0.002 secs
|
Sat Jan 11 15:30:02.866 [conn776] insert org_dev2_sys_security.system.indexes ninserted:1 keyUpdates:0 locks(micros) w:451717 452ms
|
Sat Jan 11 15:30:02.877 [conn794] build index org_dev2_sys_security.User { securityId: 1 }
|
Sat Jan 11 15:30:02.878 [conn794] build index done. scanned 0 total records. 0.001 secs
|
Sat Jan 11 15:30:02.895 [conn777] build index org_dev2_sys_security.User { group: 1 }
|
Sat Jan 11 15:30:02.897 [conn777] build index done. scanned 0 total records. 0.002 secs
|
Sat Jan 11 15:30:02.904 [conn777] build index org_dev2_sys_security.User { dynamic.group: 1 }
|
Sat Jan 11 15:30:02.905 [conn777] build index done. scanned 0 total records. 0.001 secs
|
|
Logs from Shard 002
|
Sat Jan 11 15:27:37.565 [conn5496] dropDatabase org_dev2_sys_quartz starting
|
Sat Jan 11 15:27:37.570 [conn5496] removeJournalFiles
|
Sat Jan 11 15:27:37.597 [conn5496] dropDatabase org_dev2_sys_quartz finished
|
At "Sat Jan 11" I have only one record and shard is 002:
|
Putting database quartz on router 249a
|
Sat Jan 11 15:27:36.735 [conn4021] put [org_dev2_sys_quartz] on: sh002:sh002/SRV24A2:27018
|
Tue Jan 14 12:36:22.087 [conn4790] put [org_dev2_sys_quartz] on: sh002:sh002/SRV24A2:27018
|
Tue Jan 14 12:40:14.066 [conn4790] put [org_dev2_sys_quartz] on: sh002:sh002/SRV24A2:27018
|
|
Putting database quartz on router 248f
|
Fri Jan 17 17:33:24.491 [conn1253] put [org_dev2_sys_quartz] on: sh002:sh002/SRV24A2:27018
|
Fri Jan 10 16:10:26.491 [conn8] put [org_dev2_sys_quartz] on: sh001:sh001/SRV24A1:27018
|
Fri Jan 3 12:41:33.642 [conn59] put [org_dev2_sys_quartz] on: sh002:sh002/SRV24A2:27018
|
According logs, it has been created on `shard001` on the 10 of January. Then dropped and re-created at `shard002` on the 11 of January.
But the result: database has been marked as created on shard 002, but actually has been created on `shard 001`.
|