Python Driver / PYTHON-509

Rare connection leak in Python 2.7.0 and older

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • 2.5.1
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Fully Compatible

      An application running Python 2.7.0 or older (that is, not 2.7.1 and up) can leak sockets if it continuously creates and destroys very large numbers of threads over a long period, calls end_request at least once, and calls start_request more often than end_request. Under those conditions a growing number of unreclaimable sockets accumulates in Pool._tid_to_sock, and the socket count will eventually exceed the process's ulimit or the server's.

      Background: When a thread calls start_request it's assigned a socket, which is put into a dict called _tid_to_sock, keyed by its thread id. We attach a weakref callback to an object stored in a threadlocal so that we know if the thread dies without calling end_request; if so, we remove its socket from _tid_to_sock and return it to the general pool. See "Knowing When a Python Thread Has Died" for more details on this technique.
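
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the driver's actual code: MiniPool, its idle list, its _refs dict, and the use of plain strings as "sockets" are all assumptions made for the example. Only the _tid_to_sock dict, the start_request / end_request names, and the vigil-in-a-threadlocal idea come from the report. The short polling loop after join() is also an assumption, added in case cleanup of a dead thread's locals lags slightly behind join().

```python
import threading
import time
import weakref


class Vigil(object):
    """Marker object stored only in its owning thread's threadlocal."""


class MiniPool(object):
    """Hypothetical stand-in for the driver's pool; sockets are plain strings."""

    def __init__(self):
        self._lock = threading.Lock()
        self._local = threading.local()
        self._tid_to_sock = {}  # thread id -> socket held for a request
        self._refs = {}         # thread id -> weakref, kept so its callback can fire
        self.idle = []          # sockets any thread may check out

    def start_request(self, sock):
        tid = threading.current_thread().ident

        def on_thread_died(ref):
            # Fires once the dead thread's locals, and so its vigil, are destroyed.
            with self._lock:
                self._refs.pop(tid, None)
                orphan = self._tid_to_sock.pop(tid, None)
                if orphan is not None:
                    self.idle.append(orphan)

        with self._lock:
            if tid in self._tid_to_sock:
                return  # this thread is already in a request
            vigil = Vigil()
            self._local.vigil = vigil  # the only strong reference after we return
            self._tid_to_sock[tid] = sock
            self._refs[tid] = weakref.ref(vigil, on_thread_died)

    def end_request(self):
        tid = threading.current_thread().ident
        with self._lock:
            self._refs.pop(tid, None)  # drop the weakref; no callback is needed
            sock = self._tid_to_sock.pop(tid, None)
            if sock is not None:
                self.idle.append(sock)


pool = MiniPool()


def worker(i):
    pool.start_request('sock-%d' % i)
    # The thread exits here without calling end_request.


threads = [threading.Thread(target=worker, args=(i,)) for i in range(20)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Assumption: allow a brief grace period for thread-local cleanup.
deadline = time.time() + 5
while pool._tid_to_sock and time.time() < deadline:
    time.sleep(0.01)
```

On an interpreter without the threadlocal bug, every socket should end up back in pool.idle and _tid_to_sock should be empty. The leak in this report corresponds to entries that stay in _tid_to_sock because their callbacks never run.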

      Threadlocals have a number of charming quirks rooted in Python issue 1868, which was fixed in Python 2.7.1. We thought we'd nailed them all, but a symptom remains. It can be reproduced with the following code:

      import threading
      import weakref
      
      nthreads = 10000
      ncallbacks = 0
      ncallbacks_lock = threading.Lock()
      local = threading.local()
      refs = set()
      
      class Vigil(object):
          pass
      
      def run():
          def on_thread_died(ref):
              global ncallbacks
              ncallbacks_lock.acquire()
              ncallbacks += 1
              ncallbacks_lock.release()
      
          local.vigil = vigil = Vigil()
          refs.add(weakref.ref(vigil, on_thread_died))
      
      threads = [threading.Thread(target=run)
                 for _ in range(nthreads)]
      for t in threads: t.start()
      for t in threads: t.join()
      getattr(local, 'c', None)  # Trigger cleanup in 2.7.0
      assert ncallbacks == nthreads, \
          'only %d callbacks run' % ncallbacks
      

      It appears that assigning to a threadlocal isn't threadsafe in Python <= 2.7.0. If contention is high enough--that is, if there are enough threads, and if some threads are calling end_request as others call start_request--the threadlocal can become corrupted, and some objects stored in it are never cleaned up after their threads die.

      Locking around the assignment fixes the problem:

      local_lock = threading.Lock()
      # ...
          local_lock.acquire()
          try:
              local.vigil = vigil = Vigil()
          finally:
              local_lock.release()
          refs.add(weakref.ref(vigil, on_thread_died))
      

      I expect a lock around threadlocal assignment in ThreadIdent.get() will fix the problem.
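
As a rough sketch of what that might look like, the class below guards the threadlocal assignment with a lock. LockedIdent is a hypothetical name and this is not the driver's actual ThreadIdent; the choice to return id(vigil) and to leave the initial read outside the lock are assumptions for illustration. This report only establishes that the assignment needs protection, so whether reads also need the lock on Python <= 2.7.0 is left open.

```python
import threading


class Vigil(object):
    """Marker object stored in a thread's threadlocal."""


class LockedIdent(object):
    """Hypothetical identifier helper with a lock around threadlocal assignment."""

    def __init__(self):
        self._local = threading.local()
        self._lock = threading.Lock()

    def get(self):
        try:
            vigil = self._local.vigil
        except AttributeError:
            vigil = Vigil()
            # Serialize only the assignment, which is the step the
            # reproduction script shows to be unsafe under contention.
            self._lock.acquire()
            try:
                self._local.vigil = vigil
            finally:
                self._lock.release()
        return id(vigil)


ident = LockedIdent()
first = ident.get()   # creates and stores this thread's vigil
second = ident.get()  # reuses the stored vigil, so no lock is taken
```

Because the lock is taken only on a thread's first call, later calls from the same thread add no contention.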

      Additionally, removing an unused access of ThreadIdent.get() in end_request will reduce contention.

            Assignee: A. Jesse Jiryu Davis (jesse@mongodb.com)
            Reporter: A. Jesse Jiryu Davis (jesse@mongodb.com)