#8275: Deadlocking apps when opening documents -------------------------------------+---------------------------- Reporter: humdinger | Owner: bonefish Type: bug | Status: new Priority: normal | Milestone: R1/alpha4 Component: System/runtime_loader | Version: R1/Development Resolution: | Keywords: Blocked By: | Blocking: 8081 Has a Patch: 0 | Platform: All -------------------------------------+---------------------------- Comment (by bonefish): Replying to [comment:9 leavengood]: > Replying to [comment:6 bonefish]: > There are even TODOs in the runtime_loader's elf.cpp (added by you about 3 years ago) about improving the locking :) Yes, there's a TODO mentioning that locking should be improved. I may have added it, but possibly I just converted/reinterpreted a preexisting comment. IIRC we inherited the locking strategy from NewOS and there had already been a comment mentioning that it is suboptimal. Though, I believe that was only considering the performance aspect of the problem (making the runtime loader essentially single-threaded), not that this is actually a functional bug. > Is there any chance you or anyone else sufficiently skilled could fix this in the next few weeks? I don't think I'll have time to work on that anytime soon. I can't speak for anyone else, but in open source it is usually safe to assume that if you don't do it yourself nobody will. :-) > If not I could take a stab, but low-level stuff like this isn't my forte. While the runtime loader may be considered low-level, the abstract problem at hand isn't anything you couldn't just as well encounter in your multi- threaded application. > I'm taking a look at alpha4 marked issues, though I suppose this could be ignored for alpha4. It is annoying but it happens pretty infrequently in my testing. Yeah, well, that might also be depending on the test machine. What happens infrequently on your machine might happen relatively often on others and not at all on yet others. BTW, it should be pretty easy to implement a reliable test case, if that helps. In pseudo-code: {{{ static BLocker sLock; static struct Foo { Foo() { start new thread in thread_function(); sleep 1s; sLock.Lock(); sLock.Unlock(); } } sFoo; thread_function() { sLock.Lock(); call any runtime loader service, e.g. `get_image_symbol()`; sLock.Unlock(); } main() {} }}} The `Foo` constructor tries to lock `sLock` while holding the runtime loader lock. `thread_function()` tries to lock the runtime loader lock while holding `sLock`. In case you or someone else wants to tackle the issue, I'll try and give and overview of the situation: * There's a global list, `sLoadedImages` plus `sLoadedImageCount` in images.cpp, which contains objects (the loaded images, `image_t`). This list is constructed while loading the program, but it can change later on as add-ons or shared libraries are dynamically loaded and unloaded. * That's an image's lifetime: - The image is loaded, that is the ELF file header is processed, the segments are mapped into memory, an image object is created and added to the global list (`sLoadedImages`). - The image's dependencies are loaded. Like the first step, just for all shared libraries the first image depends on, recursively. There's a ref-counting in the image objects. A dependency is only loaded once. Its reference count is incremented for each other image that depends on it. - The image and the images it depends on are relocated (only the dependencies that haven't been relocated yet). This happens in topological order, though I think the order doesn't actually matter in this case. - The read-only segments of the images are remapped read-only (needed to be writable for relocation). - The image and its dependencies are initialized (only the not yet initialized ones). Initialization means, that the initialization function of the image is called, the one that calls static constructors and the like, i.e. this calls code in the image, outside the runtime loader. The initialization happens bottom up, dependencies before images depending on them. - The image object remains unchanged until the image is unloaded. Unloading an image only decrements its reference count. Unless it drops to 0, nothing else happens. Otherwise ... - The image is moved from `sLoadedImages` to `sDisposableImages`. - For the images it depends on the previous two steps and this one are executed. - For each image that ends up in `sDisposableImages` (i.e. has a ref count of 0) its deinitialization function is called (calling static destructors etc.), the image is unmapped, and the image object is destroyed. - After loading the program image and its dependencies the runtime loader is essentially a library, providing the following kinds of services to the program: - Dynamically loading/unloading an add-on/library. - Looking up a symbol by name in a given library or globally. - Reverse looking up a symbol by address. - Iterating through loaded images. - Iterating through the symbols of an image (actually looking one up by index). - Iterating through the direct dependencies of an image. - There's a single global runtime loader lock, which is held when the program is initially loaded and when any of the provided services is performed. In particular it is always held when calling an image initialization or deinitialization function. The latter is the main issue. Since an image must be fully initialized before it can be used, just dropping the lock before calling the initialization function is not possible. An image's initialization function must not be called before the initialization function of all dependencies have been called (and returned). Analogously the deinitialization function of an image must not be called before the deinitialization function of all images depending on it have been called (and returned). Consequently: - The single global lock approach doesn't work. A global lock for the image list is necessary -- it should become an innermost lock -- but additionally one or more locks per image are needed. - The image lists compiled when loading an image (the image plus its dependencies) and when unloading it (`sDisposableImages`) probably won't work anymore. I think a more dynamic approach is necessary -- namely depth first search with dynamic locking. The already existing ref-counting will help a lot, since it can make sure that image objects won't die while playing with them without holding the global lock. Note that while it looks like the images could be individually fully loaded (mapped, relocated, and initialized) in a bottom up manner, it is necessary to keep the phases separate. Relocation must wait until all images are mapped and in the search list for symbols, so that undefined library symbols (defined in the app) and symbol preemption are supported correctly. Analogously initialization must wait until all images have been relocated (since the initialization function may use an undefined or preempted symbol). - IIRC (needs checking), after putting an image object in the global list, the image object is basically immutable until the image is about to be unloaded. That could be made use of, e.g. by using a lockless strategy (for the members that is), read-write locking, or "init once" locking primitives. There are exceptions, though: - The `image_t::flags` field contains bits for relocation and initialization status, which change when/after relocating/initializing the image. One or more other bits may be used temporarily for algorithms. This needs to be considered/changed accordingly. - After `fork()` the image IDs of the child process differ from those of the parent. The (still old) ID stored in the image structures is updated lazily (`update_image_ids()`). In case this isn't clear already, this task is probably a bit of work (as in more than a few hours), though it will depend on how smoothly things go. -- Ticket URL: <http://dev.haiku-os.org/ticket/8275#comment:10> Haiku <http://dev.haiku-os.org> Haiku - the operating system.