Hi. I am seeing this issue (or a similar one) with the PHP driver updated to version 1.2.3, connected to a replica set running MongoDB 1.8.3, and I can reproduce it reliably. There are several "game" machines running nginx + php-cgi, all connected to my replica set. If I go to the primary and issue rs.stepDown(), or take it offline, the replica set reconfigures itself and elects a new primary. However, PHP code can no longer query the database: the logs that collect the exception show the "max number of retries exhausted, couldn't send query" message, and sometimes "couldn't get response header" as well.
I waited 10-15 minutes, and some queries never completed. I then ran a new test with APC disabled in php.ini, just to be sure, and got the same results. If I restart php-cgi on a given machine, all queries on that machine complete immediately and the error does not come back.
I believe the reason some queries complete without a restart after 10-15 minutes is that php-cgi respawns some of its child processes by itself, but this is difficult to confirm.
What I can reproduce reliably is that, after a stepDown or after mongod is stopped on the primary (simulating a failure), these exceptions pile up for several minutes. Again, restarting php-cgi immediately restores connectivity on that one machine; the others that are not restarted still cannot connect.
If there is anything I can try to help you diagnose this, please let me know. Also, if there is a way to "force" the persistent connection (I assume the driver maintains one behind the scenes) to be recycled when I get this exception, that would help us as well. Thanks.
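To make the last request concrete, here is a minimal sketch of the recycle-on-error pattern I have in mind. The helper name queryWithReconnect and its parameters are hypothetical and not part of the driver; it only assumes that the connect callback can return a fresh connection, which is exactly the part I do not know how to guarantee with persistent connections.

```php
<?php
// Hypothetical helper, not part of the MongoDB driver.
// Runs $query against a connection produced by $connect and asks for a
// new connection after every failure, up to $maxAttempts tries.
function queryWithReconnect($connect, $query, $maxAttempts = 3)
{
    if ($maxAttempts < 1) {
        throw new InvalidArgumentException('maxAttempts must be at least 1');
    }

    $lastError = null;
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        try {
            // Build the connection inside the loop so that a failed
            // attempt never reuses the handle that just failed.
            $conn = call_user_func($connect);
            return call_user_func($query, $conn);
        } catch (Exception $e) {
            // With the driver this would be a MongoCursorException or a
            // MongoConnectionException; keep it and try again.
            $lastError = $e;
        }
    }

    // Every attempt failed: surface the most recent error to the caller.
    throw $lastError;
}

// Assumed usage with the driver (host names, set name, database and
// collection are placeholders):
// $rows = queryWithReconnect(
//     function () { return new Mongo('mongodb://db1,db2', array('replicaSet' => 'myset')); },
//     function ($m) { return iterator_to_array($m->game->players->find()); }
// );
```

As far as I can tell, this only helps if constructing a new Mongo object after the exception really opens a new socket to the new primary instead of handing back the stale persistent connection, so it is a sketch of the behaviour I am asking for rather than a confirmed workaround.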