[SERVER-4359] Non-atomic reference counting in threaded aggregation. Created: 23/Nov/11 Updated: 11/Jul/16 Resolved: 29/Nov/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 2.1.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sergey Mitsyn | Assignee: | Chris Westin |
| Resolution: | Done | Votes: | 0 |
| Labels: | aggregation, concurrency, crash | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CentOS 5; probably any OS |
||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
With aggregation command, expressions use shared Value instances across multiple threads. Such shared instances may be got using eg. Value::getTrue(). As mongo::Value is reference counted with mongo::IntrusiveCounter, operations on counter are not atomic. With two or more simultaneous aggregations being run this (may) results in freeing of shared values that are being used and thus strange errors like double free detected by glibc or assertions in Value::coerceToBool() for boolean Value. . This 'bug' can be fixed by changing IntrusiveCounter's counter variable type to bson::AtomicUInt, which is ugly but works. Maybe the final fix shouldn't be so radical. |
| Comments |
| Comment by Chris Westin [ 29/Nov/11 ] |
|
Using atomic ints for this would be a heavy hammer. The aggregation pipelines are not normally expected to share any values between threads or different aggregations. In this case, it is the re-usable static values that are the problem. If we were to use atomic integers, it would cause processor cache lines to be flushed so that these reference counters could be incremented and decremented across processor cores when different aggregation pipelines are being executed by different threads on different cores; I suspect that would decrease overall throughput as these values get shuffled back and forth between cores. The original implementation used shared_ptr<>, which comes with a recipe for not modifying reference counts for static objects. Unfortunately, that got lost in translation to intrusive_ptr<> (a move which was made because profiling showed memory allocation of the sideband counters for shared_ptr<>). But that's the right principle. So I made the addRef() and release() methods on IntrusiveCounter abstract, and created a concrete implementation in IntrusiveCounterUnsigned. I also created a ValueStatic class that derives from Value, and which overrides addRef() and release() to do nothing. The static values are now initialized from that. This eliminates any need to worry about thread safety for the reference counts on the static objects, which never go away anyway. Future users of IntrusiveCounter may use atomic integers if they expect to share objects between threads. Before committing this change, I benchmarked it again to make sure the change of the reference counter maintenance from being inlined to being indirect function calls didn't affect things too much, and it barely showed up. |