[SERVER-39167] Concurrent insert into nonexistent database often fails with "database not found" on 4.1.7 Created: 24/Jan/19  Updated: 06/Dec/22  Resolved: 13/May/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.1.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shane Harvey Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro-1725.py    
Issue Links:
Duplicate
duplicates SERVER-17397 Dropping a Database or Collection in ... Closed
Problem/Incident
causes PYTHON-1725 Test failure - test_threads.TestThrea... Closed
Assigned Teams:
Sharding
Operating System: ALL
Sprint: Sharding 2019-02-25
Participants:

 Description   

Found by a python test in PYTHON-1725.

Concurrently inserting into a collection when the database does not exist often fails on mongos 4.1.7 with "unable to initialize targeter for write op for collection auth_test_46.test :: caused by :: Database auth_test_46 not found :: caused by :: database auth_test_46 not found"

Here's a simple python repro: repro-1725.py. First start a sharded cluster (with or without auth) and then run the script:

$ python3.7 repro-1725.py
Exception in thread Thread-551:
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 917, in _bootstrap_inner
    self.run()
  File "repro-1725.py", line 33, in run
    self.coll.insert_one({'num': i})
  File "/Users/shane/git/mongo-python-driver/pymongo/collection.py", line 694, in insert_one
    session=session),
  File "/Users/shane/git/mongo-python-driver/pymongo/collection.py", line 608, in _insert
    bypass_doc_val, session)
  File "/Users/shane/git/mongo-python-driver/pymongo/collection.py", line 596, in _insert_one
    acknowledged, _insert_command, session)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1329, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/Users/shane/git/mongo-python-driver/pymongo/mongo_client.py", line 1282, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/Users/shane/git/mongo-python-driver/pymongo/collection.py", line 591, in _insert_command
    retryable_write=retryable_write)
  File "/Users/shane/git/mongo-python-driver/pymongo/pool.py", line 579, in command
    unacknowledged=unacknowledged)
  File "/Users/shane/git/mongo-python-driver/pymongo/network.py", line 150, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/Users/shane/git/mongo-python-driver/pymongo/helpers.py", line 155, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: unable to initialize targeter for write op for collection db_11.test :: caused by :: Database db_11 not found :: caused by :: database db_11 not found
 
Traceback (most recent call last):
  File "repro-1725.py", line 62, in <module>
    repro()
  File "repro-1725.py", line 56, in repro
    assert t.success
AssertionError

The failing command and response is:

23 Jan 19 16:23 -0800 (Connection: 104:657430200)  op_msg insert db_2 Request:{"sections":[{"payload":{"$clusterTime":{"clusterTime":{"$timestamp":{"t":1548289382,"i":191}},"signature":{"hash":{"$binary":"AI87y83wEE6V+ELmJaUtbleLSHY=","$type":"00"},"keyId":{"$numberLong":"6649842944649986060"}}},"$db":"db_2","$readPreference":{"mode":"primary"},"insert":"test","lsid":{"id":{"$binary":"srMbo3+9SlS8J+jwS2CMlg==","$type":"04"}},"ordered":true},"payloadType":0},{"payload":{"documents":[{"_id":{"$oid":"5c49056618a944e27153a1f5"},"num":0}],"identifier":"documents","size":45},"payloadType":1}]}
 
23 Jan 19 16:23 -0800 (Connection: 104:657430200) +118.636ms op_msg reply   Response:{"sections":[{"payload":{"$clusterTime":{"clusterTime":{"$timestamp":{"t":1548289382,"i":196}},"signature":{"hash":{"$binary":"AI87y83wEE6V+ELmJaUtbleLSHY=","$type":"00"},"keyId":{"$numberLong":"6649842944649986060"}}},"code":26,"codeName":"NamespaceNotFound","errmsg":"unable to initialize targeter for write op for collection db_2.test :: caused by :: Database db_2 not found :: caused by :: database db_2 not found","ok":0.0,"operationTime":{"$timestamp":{"t":1548289382,"i":194}}},"payloadType":0}]}

Edit: This is reproducible with and without auth enabled.



 Comments   
Comment by Kaloian Manassiev [ 13/May/19 ]

This is the same problem as movePrimary and drop and re-create database and is due to CRUD operations not doing database versioning, which we consciously decided not to do since it won't be necessary when we start tracking unsharded collections in the catalog. Closing it as duplicate of SERVER-17397 which contains more context.

Comment by Shane Harvey [ 25/Feb/19 ]

Looking at this again, I don't think this behavior just started in 4.1.7. I can reproduce the error (using the attached repro) on 4.1.6 and on all versions of 4.0 but not on 3.6.10.

Comment by Kaloian Manassiev [ 25/Feb/19 ]

janna.golden, Shane mentions that this bug started showing up after 4.1.7, but we have had the issues in the "Unify the Sharding Caching Behaviour" project for a while before that. Something must have changed in 4.2.

shane.harvey, can you confirm that these tests are not new and they work fine with older versions.

Comment by Janna Golden [ 25/Feb/19 ]

When creating a database, we first refresh the catalog cache and check for an existing db entry. . If we do not find one, we create the entry and then refresh the catalog cache again. If there are concurrent createDatabase commands for the same db, we can run into a scenario where the first thread does the first refresh, doesn't find an entry, and then inserts an entry for the db into the catalog cache. A second thread can start the first refresh before this first thread is done inserting the entry into the catalog cache. Then, the first thread will do the second refresh after it finishes inserting the entry, and join the second thread's first refresh. Since this refresh started before the first thread finished inserting into the catalog cache, we will not find an entry in the catalog and will fail with db not found.

I will add this ticket to the "Unify the Sharding Caching Behaviour" project.

Generated at Thu Feb 08 04:51:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.