Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ci: backport fix for osx-* 6h timeouts #700

Commits on Oct 8, 2024

  1. simple-ipc: split async server initialization and running

    To start an async ipc server, you call ipc_server_run_async(). That
    initializes the ipc_server_data object, and starts all of the threads
    running, which may immediately start serving clients.
    
    This can create some awkward timing problems, though. In the fsmonitor
    daemon (the sole user of the simple-ipc system), we want to create the
    ipc server early in the process, which means we may start serving
    clients before the rest of the daemon is fully initialized.
    
    To solve this, let's break run_async() into two parts: an initialization
    which allocates all data and spawns the threads (without letting them
    run), and a start function which actually lets them begin work. Since we
    have two simple-ipc implementations, we have to handle this twice:
    
      - in ipc-unix-socket.c, we have a central listener thread which hands
        connections off to worker threads using a work_available mutex. We
        can hold that mutex after init, and release it when we're ready to
        start.
    
        We do need an extra "started" flag so that we know whether the main
        thread is holding the mutex or not (e.g., if we prematurely stop the
        server, we want to make sure all of the worker threads are released
        to hear about the shutdown).
    
      - in ipc-win32.c, we don't have a central mutex. So we'll introduce a
        new startup_barrier mutex, which we'll similarly hold until we're
        ready to let the threads proceed.
    
        We again need a "started" flag here to make sure that we release the
        barrier mutex when shutting down, so that the sub-threads can
        proceed to the finish.
    
    I've renamed the run_async() function to init_async() to make sure we
    catch all callers, since they'll now need to call the matching
    start_async().
    
    We could leave run_async() as a wrapper that does both, but there's not
    much point. There are only two callers, one of which is fsmonitor, which
    will want to actually do work between the two calls. And the other is
    just a test-tool wrapper.
    
    For now I've added the start_async() calls in fsmonitor where they would
    otherwise have happened, so there should be no behavior change with this
    patch.
    
    Signed-off-by: Jeff King <peff@peff.net>
    Acked-by: Koji Nakamaru <koji.nakamaru@gree.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    peff authored and gitster committed Oct 8, 2024
    Configuration menu
    Copy the full SHA
    766fce6 View commit details
    Browse the repository at this point in the history
  2. fsmonitor: initialize fs event listener before accepting clients

    There's a racy hang in fsmonitor on macOS that we sometimes see in CI.
    When we serve a client, what's supposed to happen is:
    
      1. The client thread calls with_lock__wait_for_cookie() in which we
         create a cookie file and then wait for a pthread_cond event
    
      2. The filesystem event listener sees the cookie file creation, does
         some internal book-keeping, and then triggers the pthread_cond.
    
    But there's a problem: we start the listener that accepts client threads
    before we start the fs event thread. So it's possible for us to accept a
    client which creates the cookie file and starts waiting before the fs
    event thread is initialized, and we miss those filesystem events
    entirely. That leaves the client thread hanging forever.
    
    In CI, the symptom is that t9210 (which is testing scalar, which always
    enables fsmonitor under the hood) may hang forever in "scalar clone". It
    is waiting on "git fetch" which is waiting on the fsmonitor daemon.
    
    The race happens more frequently under load, but you can trigger it
    predictably with a sleep like this, which delays the start of the fs
    event thread:
    
      --- a/compat/fsmonitor/fsm-listen-darwin.c
      +++ b/compat/fsmonitor/fsm-listen-darwin.c
      @@ -510,6 +510,7 @@ void fsm_listen__loop(struct fsmonitor_daemon_state *state)
              FSEventStreamSetDispatchQueue(data->stream, data->dq);
              data->stream_scheduled = 1;
    
      +       sleep(1);
              if (!FSEventStreamStart(data->stream)) {
                      error(_("Failed to start the FSEventStream"));
                      goto force_error_stop_without_loop;
    
    One solution might be to reverse the order of initialization: start the
    fs event thread before we start the thread listening for clients. But
    the fsmonitor code explicitly does it in the opposite direction. The fs
    event thread wants to refer to the ipc_server_data struct, so we need it
    to be initialized first.
    
    A further complication is that we need a signal from the fs event thread
    that it is actually ready and listening. And those details happen within
    backend-specific fsmonitor code, whereas the initialization is in the
    shared code.
    
    So instead, let's use the ipc_server init/start split added in the
    previous commit. The generic fsmonitor code will init the ipc_server but
    _not_ start it, leaving that to the backend specific code, which now
    needs to call ipc_server_start_async() at the right time.
    
    For macOS, that is right after we start the FSEventStream that you can
    see in the diff above.
    
    It's not clear to me if Windows suffers from the same problem (and we
    simply don't trigger it in CI), or if it is immune. Regardless, the
    obvious place to start accepting clients there is right after we've
    established the ReadDirectoryChanges watch.
    
    This makes the hangs go away in our macOS CI environment, even when
    compiled with the sleep() above.
    
    Helped-by: Koji Nakamaru <koji.nakamaru@gree.net>
    Signed-off-by: Jeff King <peff@peff.net>
    Acked-by: Koji Nakamaru <koji.nakamaru@gree.net>
    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    peff authored and gitster committed Oct 8, 2024
    Configuration menu
    Copy the full SHA
    51907f8 View commit details
    Browse the repository at this point in the history