[haiku-bugs] Re: [Haiku] #8275: Deadlocking apps when opening documents

  • From: "bonefish" <trac@xxxxxxxxxxxx>
  • Date: Mon, 25 Jun 2012 00:22:39 -0000

#8275: Deadlocking apps when opening documents
-------------------------------------+----------------------------
   Reporter:  humdinger              |      Owner:  bonefish
       Type:  bug                    |     Status:  new
   Priority:  normal                 |  Milestone:  R1/alpha4
  Component:  System/runtime_loader  |    Version:  R1/Development
 Resolution:                         |   Keywords:
 Blocked By:                         |   Blocking:  8081
Has a Patch:  0                      |   Platform:  All
-------------------------------------+----------------------------

Comment (by bonefish):

 Replying to [comment:9 leavengood]:
 > Replying to [comment:6 bonefish]:
 > There are even TODOs in the runtime_loader's elf.cpp (added by you about
 3 years ago) about improving the locking :)

 Yes, there's a TODO mentioning that locking should be improved. I may have
 added it, but possibly I just converted/reinterpreted a preexisting
 comment. IIRC we inherited the locking strategy from NewOS and there had
 already been a comment mentioning that it is suboptimal. Though, I believe
 that was only considering the performance aspect of the problem (making
 the runtime loader essentially single-threaded), not that this is actually
 a functional bug.

 > Is there any chance you or anyone else sufficiently skilled could fix
 this in the next few weeks?

 I don't think I'll have time to work on that anytime soon. I can't speak
 for anyone else, but in open source it is usually safe to assume that if
 you don't do it yourself nobody will. :-)

 > If not I could take a stab, but low-level stuff like this isn't my
 forte.

 While the runtime loader may be considered low-level, the abstract problem
 at hand isn't anything you couldn't just as well encounter in your multi-
 threaded application.

 > I'm taking a look at alpha4 marked issues, though I suppose this could
 be ignored for alpha4. It is annoying but it happens pretty infrequently
 in my testing.

 Yeah, well, that might also be depending on the test machine. What happens
 infrequently on your machine might happen relatively often on others and
 not at all on yet others.

 BTW, it should be pretty easy to implement a reliable test case, if that
 helps. In pseudo-code:
 {{{
     static BLocker sLock;

     static struct Foo {
         Foo()
         {
             start new thread in thread_function();
             sleep 1s;
             sLock.Lock();
             sLock.Unlock();
         }
     } sFoo;

     thread_function()
     {
         sLock.Lock();
         call any runtime loader service, e.g. `get_image_symbol()`;
         sLock.Unlock();
     }

     main() {}
 }}}
 The `Foo` constructor tries to lock `sLock` while holding the runtime
 loader lock. `thread_function()` tries to lock the runtime loader lock
 while holding `sLock`.

 In case you or someone else wants to tackle the issue, I'll try and give
 and overview of the situation:

  * There's a global list, `sLoadedImages` plus `sLoadedImageCount` in
 images.cpp, which contains objects (the loaded images, `image_t`). This
 list is constructed while loading the program, but it can change later on
 as add-ons or shared libraries are dynamically loaded and unloaded.
  * That's an image's lifetime:
    - The image is loaded, that is the ELF file header is processed, the
 segments are mapped into memory, an image object is created and added to
 the global list (`sLoadedImages`).
    - The image's dependencies are loaded. Like the first step, just for
 all shared libraries the first image depends on, recursively. There's a
 ref-counting in the image objects. A dependency is only loaded once. Its
 reference count is incremented for each other image that depends on it.
    - The image and the images it depends on are relocated (only the
 dependencies that haven't been relocated yet). This happens in topological
 order, though I think the order doesn't actually matter in this case.
    - The read-only segments of the images are remapped read-only (needed
 to be writable for relocation).
    - The image and its dependencies are initialized (only the not yet
 initialized ones). Initialization means, that the initialization function
 of the image is called, the one that calls static constructors and the
 like, i.e. this calls code in the image, outside the runtime loader. The
 initialization happens bottom up, dependencies before images depending on
 them.
    - The image object remains unchanged until the image is unloaded.
 Unloading an image only decrements its reference count. Unless it drops to
 0, nothing else happens. Otherwise ...
    - The image is moved from `sLoadedImages` to `sDisposableImages`.
    - For the images it depends on the previous two steps and this one are
 executed.
    - For each image that ends up in `sDisposableImages` (i.e. has a ref
 count of 0) its deinitialization function is called (calling static
 destructors etc.), the image is unmapped, and the image object is
 destroyed.
  - After loading the program image and its dependencies the runtime loader
 is essentially a library, providing the following kinds of services to the
 program:
    - Dynamically loading/unloading an add-on/library.
    - Looking up a symbol by name in a given library or globally.
    - Reverse looking up a symbol by address.
    - Iterating through loaded images.
    - Iterating through the symbols of an image (actually looking one up by
 index).
    - Iterating through the direct dependencies of an image.
  - There's a single global runtime loader lock, which is held when the
 program is initially loaded and when any of the provided services is
 performed. In particular it is always held when calling an image
 initialization or deinitialization function.

 The latter is the main issue. Since an image must be fully initialized
 before it can be used, just dropping the lock before calling the
 initialization function is not possible. An image's initialization
 function must not be called before the initialization function of all
 dependencies have been called (and returned). Analogously the
 deinitialization function of an image must not be called before the
 deinitialization function of all images depending on it have been called
 (and returned). Consequently:
  - The single global lock approach doesn't work. A global lock for the
 image list is necessary -- it should become an innermost lock -- but
 additionally one or more locks per image are needed.
  - The image lists compiled when loading an image (the image plus its
 dependencies) and when unloading it (`sDisposableImages`) probably won't
 work anymore. I think a more dynamic approach is necessary -- namely depth
 first search with dynamic locking. The already existing ref-counting will
 help a lot, since it can make sure that image objects won't die while
 playing with them without holding the global lock. Note that while it
 looks like the images could be individually fully loaded (mapped,
 relocated, and initialized) in a bottom up manner, it is necessary to keep
 the phases separate. Relocation must wait until all images are mapped and
 in the search list for symbols, so that undefined library symbols (defined
 in the app) and symbol preemption are supported correctly. Analogously
 initialization must wait until all images have been relocated (since the
 initialization function may use an undefined or preempted symbol).
  - IIRC (needs checking), after putting an image object in the global
 list, the image object is basically immutable until the image is about to
 be unloaded. That could be made use of, e.g. by using a lockless strategy
 (for the members that is), read-write locking, or "init once" locking
 primitives. There are exceptions, though:
    - The `image_t::flags` field contains bits for relocation and
 initialization status, which change when/after relocating/initializing the
 image. One or more other bits may be used temporarily for algorithms. This
 needs to be considered/changed accordingly.
    - After `fork()` the image IDs of the child process differ from those
 of the parent. The (still old) ID stored in the image structures is
 updated lazily (`update_image_ids()`).

 In case this isn't clear already, this task is probably a bit of work (as
 in more than a few hours), though it will depend on how smoothly things
 go.

-- 
Ticket URL: <http://dev.haiku-os.org/ticket/8275#comment:10>
Haiku <http://dev.haiku-os.org>
Haiku - the operating system.

Other related posts: