Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Does anyone have data on how much it leaks in practice? Also, could maliciously crafted input result in a DoS vector for Boehm using applications?


I once worked on a project that used the Boehms collector for a long-running server process.

About once a year, we'd lose a week or two to tracking down a memory issue. While the root cause was never a GC bug, the GC would make it dramatically more difficult to figure out what the real problem was.

The closest we had to an actual GC bug was a leak involving a data structure containing a list of IP addresses that was incorrectly interpreted as containing pointers. It took forever to figure out what had happened, since the issue only appeared if you worked with a specific set of IP addresses--and even after we had a replication case, the GC made it extremely difficult to determine why memory was growing rapidly. The fix was simple once we identified the precise problem--flag that memory as "atomic" (i.e., guaranteed not to contain pointers)--but finding the precise data structure that was holding memory live was a nightmare.

We eventually ran into a slow memory growth issue that we couldn't figure out. After much debugging, one developer became fed up with the situation and tore the GC out. It took him about two weeks to produce a version of the codebase that leaked less than the GC-enabled one, and another couple weeks to eliminate virtually all memory leaks. We continued to find minor leaks for another several months, none of which were difficult to correct. (We used a debugging malloc library that flagged leaked memory on exit, so we always knew exactly where the leak had originated.)

Not only did the GC-less version leak less, it used about 40% less memory.

I would strongly recommend against using the Boehm collector for any long-running processes. The tradeoffs may be more acceptable in short-term processes where slow leakage over time is unimportant.


My story is similar to nield's; I worked on a project that used the Boehm-Weiser collector in a 32-bit process.

Once or twice a tear, we'd have an unbounded memory growth to trace. Most of the time, we were able to solve it by providing layouts for structures that mixed pointers and pointer-like data. The debug facility it provides for providing a random backtrace to a root was helpful in those cases.

However, more than once we had a leak that it was beyond our ability to trace, despite a significant amount of diagnostic work. We found some workarounds that were specific to our app, which helped.

In the end, the costs to us were high enough that we rewrote our codebase in C# (our software was Windows-only). It was pretty much a line-for-line conversion, but it worked and we no longer had any memory leaks (except for one, in .NET timers, that took only minutes to trace using readily-available diagnostic tools).

I do believe a 64-bit process using the Boehm collector would be much less likely to suffer these issues, but this was before x64 was prevalent in corporate server rooms.

On your second question: Crafted input can indeed result in significant memory retention, if you aren't careful to exclude user input from the GC and have large, cyclic structures to collect, on a 32-bit machine, with almost any text encoding.


Correct me if I'm wrong, but the GC isn't the only thing that frees memory, is it? You can still free it manually (and are supposed to), so those leaks would have happened without a GC too, and it can only help things.

I'm getting the impression that posts here imply you're better off without a GC, I'm not sure if that's the intention, but it strikes me as wrong.


Your parent post spoke about going from one GC to another and having the problems disappear. Hence, it only implies that changing GC can have a useful effect.

Your message does betray some confusion about garbage collection. The basic principle of GC is that you do not deallocate RAM manually (and, under most GCs, cannot), and programs written under it expect deallocation to happen automatically. You are not 'supposed to' free manually, as that would be wasted effort, and some standard configurations of the Boehm collector actually ignore all calls to free().

There is a very useful description of garbage collection at http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047... which includes and expands on the useful mental model that "garbage collection is simulating a computer with an infinite amount of memory".


This will vary between 32-bit and 64-bit environments. My understanding is that the vastly larger address in 64-bit environments greatly mitigates the chance that data will be mistaken as pointers.

As far as crafting input to match allocated addresses, that is an interesting idea. One could, at the very least, create data that lies within the heap's address range. The issue is, you'd need a lot of data to blacklist a significant amount of memory, and if that's the case why not just DoS with a ton of data?

In practice, a fragmented heap is probably a much bigger issue than actual leaks. As the article points out, a conservative GC cannot perform compaction.


I think the key thing that prevents the DOS is that you have to have a 64 bit value that points to the beginning of the allocated block of memory in order for that memory to be kept around -- so, if you malloc 16 bytes are are returned address 100, then a value that contains 101 will not confuse the GC. This has implications if you do some wonky things with only keeping pointers to internal structs and use pointer math to access their containers, but you probably shouldn't do that anyway =D


Boehm GC has configurable parameters in a header file; you can tell it that interior pointers should be supported.

Interior pointers can be a perfectly reasonable thing to have. For example, many, if not most, implementations of multiple inheritance rely on different values for the 'this' pointer depending on the type of 'this', where that type is somewhere up the inheritance graph. For a class C : public A, public B {}, the physical value of A * = new C and B * = new C will usually be different. The way this is often implemented is by having the addresses of the various vtables for different ancestors stored in the object data, and the conversion from the descendant class to one of the ancestor class returns an interior pointer to one of these vtable locations inside the object data.


You are right that I should have added by default.

>The way this is often implemented is by having the addresses of the various vtables for different ancestors stored in the object data, and the conversion from the descendant class to one of the ancestor class returns an interior pointer to one of these vtable locations inside the object data.

Suggests that a pointer value would still exist that would reference the beginning of the allocation block, stored in the vtable so it would still avoid erroneous collection.

Edit: but you are right that there are some valid reasons to have an interior pointer, which is why the GC has the friendly macros.


No; the vtable normally contains a read-only list of member function pointers, shared across all objects of that type. Objects need to be kept alive solely through the interior pointer to the (compiler-implemented) field (itself a pointer to a vtable) inside the object.

C++ terms:

    class C : public A, public B {};
Typical class layout, in C terms:

    struct A
    {
        // exists assuming virtual methods on A
        // the data this vtable points to is read-only but C++ ctor semantics requires
        // updating vtable pointer during inherited constructor calls
        A_vtable *vtable; 
        // A data
    };
    struct B
    {
        B_vtable *vtable;
    };
    struct C
    {
        C_vtable *vtable; // incorporates A's vtable
        // struct A a; data members of A but without vtable
        struct B b;
    };
A pointer to an instance of C, but typed as B * , will point at the structure contained in C; it will be an interior pointer. There's no reason to expect that there will be another live pointer of type C * pointing to the object; a pointer of type C * needs to be sufficient. Typecasting back down the hierarchy to C * again (with dynamic_cast) will need to adjust the pointer to the start of the object; this normally requires RTTI, and may be done by e.g. storing the offset of the vtable inside the object in the vtable itself (perhaps at a negative offset).

The above is all implementation dependent, of course, but what I've described above maps almost exactly to how Delphi implements COM interfaces (where A would be the base class and B would be a pure virtual class, or interface - no data members).


Ah, thank you for taking the time to point that out to me. Very interesting and informative!


On most binary builds of GC i have seen in the wild, interior pointers are enabled (and at least since GC 6.0, this is the default in GC's Makefile)


Storing binary data in memory and not explaining what it is to the Boehm GC with one of the macros they supply is the easiest way to fall down. Since "fairly arbitrary" binary data can be encoded as UTF-8 strings (many 32-bit binary codes cannot be a part of a UTF-8 string, but most can) you need to mark strings as non-pointer data too, which people are not always good about doing. If you do things correctly (that is, mark your blobs of arbitrary data properly) then it's reasonably good but of course most people don't and it runs counter to the idea that Boehm is a drop-in replacement.

This isn't Boehm's fault, specifically; any conservative GC would have similar trouble. You can't get completely away from cooperating with the GC and have good GC both.


I would certainly hope that nobody is blaming Boehm. He's provided a valuable tool that has some practical limitations, which are probably mostly due to the nature of conservative garbage collection.

However, it's worth noting that you can't easily mark/annotate all data structures, especially when large third-party libraries are involved, and it's not guaranteed to solve the problem. Implying otherwise might dangerously mislead people.

As an example, look at Mono, which gives the GC very precise information [1]. Even so, at least some projects suffer ever-increasing memory usage when run under it [2], which the Mono team are fixing by writing their own garbage collector.

[1] http://www.mono-project.com/Mono:Runtime#Garbage_Collection [2] http://www.mono-project.com/FAQ:_ASP.NET#Memory_Usage


I used it in a Scheme interpreter with no noticeable problem as my main personal hacking platform for years. OTOH, I didn't use it for long-lived servers, and Racket eventually migrated from it to a precise collector and IIRC the release notes mentioned occasional such problems as a factor. If they said how often or how severe, I missed it.


I have found that it more likely holds on to some pointer longer than it could rather than leaking memory. For instance, I've never known it to start eating up memory.


Holding onto a pointer that should be freed is leaking memory.


That is overly conservative. A perfect garbage collector that always runs in reasonable time is provably impossible. What sambeau said is that it often holds on to [an allocation] for too long, which is true to varying extent of all garbage collectors.

To leak memory is to fail to deallocate it at all.


"Leak" may be the right word for some configurations with conservative GC. A "leak-free" GC would presumably guarantee eventual collection of all dead objects, which is impossible with a conservative GC in theory, and in practice it's not especially difficult to get into a configuration with the Boehm GC where the lifetime of a dead object is infinite (usually because it doesn't move and some binary data that the GC thinks might be a pointer doesn't go away, because it's texture data or something).


And of course there are many valid programs where leaking memory is totally fine (and more performant): shell tools, cgis etc.

When I write any of these, the only time I free anything is when I've had to malloc something huge.


Interestin dichotomy: In a managed system, a "leak" is when a pointer points to a memory allocation too long. In an unmanaged system, a "leak" is when you have no pointers to a memory allocation. :)




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: