Add experimental 'git survey' builtin #5174

Merged
merged 10 commits into from
Sep 26, 2024

Commits on Sep 26, 2024

  1. survey: stub in new experimental 'git-survey' command

    Start work on a new 'git survey' command to scan the repository
    for monorepo performance and scaling problems.  The goal is to
    measure the various known "dimensions of scale" and serve as a
    foundation for adding additional measurements as we learn more
    about Git monorepo scaling problems.
    
    The initial goal is to complement the scanning and analysis performed
    by the Go-based 'git-sizer' (https://github.com/github/git-sizer) tool.
    It is hoped that by creating a builtin command, we may be able to take
    advantage of internal Git data structures and code that is not
    accessible from Go to gain further insight into potential scaling
    problems.
    
    Co-authored-by: Derrick Stolee <stolee@gmail.com>
    Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    jeffhostetler and derrickstolee committed Sep 26, 2024
    3a8cd93
  2. survey: add command line opts to select references

    By default we will scan all references in "refs/heads/", "refs/tags/"
    and "refs/remotes/".
    
    Add command line opts to let the user ask for all refs or a subset of them
    and to include a detached HEAD.
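
The default selection described above can be sketched as a prefix filter. This is only an illustration in Python, not the builtin's C code: the function name `select_refs` and its parameters are hypothetical, and the actual command line option names are not given in this message.

```python
# Default ref namespaces named in the commit message.
DEFAULT_PREFIXES = ("refs/heads/", "refs/tags/", "refs/remotes/")


def select_refs(refnames, prefixes=DEFAULT_PREFIXES, include_detached_head=False):
    """Return the refs to scan, preserving input order.

    `prefixes` and `include_detached_head` are hypothetical stand-ins for
    the ref-selection options; they are not the real option names.
    """
    wanted = tuple(prefixes)
    selected = [name for name in refnames if name.startswith(wanted)]
    if include_detached_head and "HEAD" in refnames:
        selected.append("HEAD")
    return selected


if __name__ == "__main__":
    refs = ["HEAD", "refs/heads/main", "refs/notes/commits", "refs/tags/v1.0"]
    print(select_refs(refs))
```

With the defaults, a ref such as "refs/notes/commits" is skipped because it falls outside the three listed namespaces.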
    
    Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    jeffhostetler authored and derrickstolee committed Sep 26, 2024
    c08fa91
  3. survey: start pretty printing data in table form

    When 'git survey' provides information to the user, this will be presented
    in one of two formats: plaintext and JSON. The JSON implementation will be
    delayed until the functionality is complete for the plaintext format.
    
    The most important parts of the plaintext format are headers specifying the
    different sections of the report and tables providing concrete data.
    
    Create a custom table data structure that allows specifying a list of
    strings for the row values. When printing the table, check each column for
    the maximum width so we can create a table of the correct size from the
    start.
    
    The table structure is designed to be flexible to the different kinds of
    output that will be implemented in future changes.
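
As a rough sketch of the width computation described above (in Python rather than the builtin's C, with a hypothetical `format_table` helper), each column is sized to its widest cell before any line is printed. The right alignment and the separator style are assumptions modeled on the sample output quoted in the object count summary commit.

```python
def format_table(title, header, rows):
    """Render a plaintext table whose columns fit their widest cell.

    Hypothetical helper: alignment and separators are assumptions, not
    taken from the builtin's source.
    """
    table = [[str(cell) for cell in header]]
    table += [[str(cell) for cell in row] for row in rows]

    # One pass over all rows to find the maximum width of each column.
    widths = [max(len(row[i]) for row in table) for i in range(len(header))]

    def fmt(row):
        return " | ".join(cell.rjust(width) for cell, width in zip(row, widths))

    separator = "-+-".join("-" * width for width in widths)
    lines = [title, "=" * len(title), fmt(table[0]), separator]
    lines += [fmt(row) for row in table[1:]]
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_table("REACHABLE OBJECT SUMMARY",
                       ["Object Type", "Count"],
                       [["Tags", 1343], ["Commits", 179344]]))
```

Computing the widths up front means no row has to be reprinted when a later, wider value shows up.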
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    2c0755d
  4. survey: add object count summary

    At the moment, nothing is obvious about the reason for the use of the
    path-walk API, but this will become more prevalent in future iterations. For
    now, use the path-walk API to sum up the counts of each kind of object.
    
    For example, this is the reachable object summary output for my local repo:
    
    REACHABLE OBJECT SUMMARY
    ========================
    Object Type |  Count
    ------------+-------
           Tags |   1343
        Commits | 179344
          Trees | 314350
          Blobs | 184030
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    9e2f0af
  5. survey: summarize total sizes by object type

    Now that we have explored objects by count, we can expand that a bit more to
    summarize the data for the on-disk and inflated size of those objects. This
    information is helpful for diagnosing both why disk space (and perhaps
    clone or fetch times) is growing and why certain operations are slow:
    the inflated size of the abstract objects that must be processed is so
    large.
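
A similar per-type summary can be approximated outside the builtin. The sketch below is an assumption-laden stand-in, not how 'git survey' works: it parses lines in the shape produced by `git cat-file --batch-all-objects --batch-check='%(objecttype) %(objectsize) %(objectsize:disk)'`, which lists every object in the object store rather than only the reachable ones. The function name `summarize_sizes` is hypothetical.

```python
from collections import defaultdict


def summarize_sizes(lines):
    """Sum object count, inflated size and on-disk size per object type.

    Each input line is expected to look like "<type> <inflated> <disk>".
    """
    totals = defaultdict(lambda: {"count": 0, "inflated": 0, "disk": 0})
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines in the input
        objtype, inflated, disk = line.split()
        entry = totals[objtype]
        entry["count"] += 1
        entry["inflated"] += int(inflated)
        entry["disk"] += int(disk)
    return dict(totals)


if __name__ == "__main__":
    sample = ["commit 250 160", "tree 70 55", "blob 1200 400", "blob 800 30"]
    for objtype, entry in summarize_sizes(sample).items():
        print(objtype, entry)
```

A large gap between the inflated and on-disk totals for a type suggests that delta and zlib compression are doing a lot of work, while commands that read full contents still pay for the inflated size.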
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    947c2c5
  6. survey: show progress during object walk

    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    4e42826
  7. survey: add ability to track prioritized lists

    In future changes, we will make use of these methods. The intention is to
    keep track of the top contributors according to some metric. We don't want
    to store all of the entries and do a sort at the end, so track a
    constant-size table and remove rows that get pushed out depending on the
    chosen sorting algorithm.
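
One way to picture such a constant-size table is the Python sketch below. It is illustrative only: the class name `TopTable`, its methods, and the choice to rank by largest value (with ties broken by name) are assumptions, and the builtin's C data structure may differ.

```python
import bisect


class TopTable:
    """Keep only the `capacity` largest entries seen so far."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Stored as (-value, name) so ascending order means largest first.
        self._rows = []

    def offer(self, name, value):
        if self.capacity <= 0:
            return
        key = (-value, name)
        # A full table rejects anything that would rank last or lower.
        if len(self._rows) == self.capacity and key >= self._rows[-1]:
            return
        bisect.insort(self._rows, key)
        if len(self._rows) > self.capacity:
            self._rows.pop()  # push out the smallest remaining row

    def rows(self):
        return [(name, -negated) for negated, name in self._rows]


if __name__ == "__main__":
    table = TopTable(2)
    for name, value in [("a", 5), ("b", 9), ("c", 1), ("d", 7)]:
        table.offer(name, value)
    print(table.rows())
```

Memory use stays bounded by the capacity no matter how many entries are offered, which is the point of not collecting everything and sorting at the end.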
    
    Co-authored-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Jeff Hostetler <git@jeffhostetler.com>
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee and Jeff Hostetler committed Sep 26, 2024
    2a99b7c
  8. survey: add report of "largest" paths

    Since we are already walking our reachable objects using the path-walk API,
    let's now collect lists of the paths that contribute most to different
    metrics. Specifically, we care about
    
     * Number of versions.
     * Total size on disk.
     * Total inflated size (no delta or zlib compression).
    
    This information can be critical to discovering which parts of the
    repository are causing the most growth, especially on-disk size. Different
    packing strategies might help compress data more efficiently, but the total
    inflated size is a representation of the raw size of all snapshots of those
    paths. Even when stored efficiently on disk, that size represents how much
    information must be processed to complete a command such as 'git blame'.
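
The three metrics can be illustrated with a small Python sketch. It assumes each input record is one version of a path together with its on-disk and inflated sizes; the function name `summarize_paths` and the record shape are hypothetical. For brevity it sorts all paths at the end instead of maintaining a constant-size table.

```python
from collections import defaultdict


def summarize_paths(versions, top=3):
    """Rank paths by version count, total on-disk size and total inflated size.

    `versions` is an iterable of (path, disk_size, inflated_size) tuples,
    one tuple per version of a path.
    """
    stats = defaultdict(lambda: [0, 0, 0])  # [versions, disk, inflated]
    for path, disk, inflated in versions:
        entry = stats[path]
        entry[0] += 1
        entry[1] += disk
        entry[2] += inflated

    def top_by(index):
        # Largest metric first; ties are broken alphabetically by path.
        ranked = sorted(stats.items(), key=lambda kv: (-kv[1][index], kv[0]))
        return [(path, entry[index]) for path, entry in ranked[:top]]

    return {"versions": top_by(0), "disk": top_by(1), "inflated": top_by(2)}


if __name__ == "__main__":
    sample = [("a.txt", 10, 100), ("a.txt", 12, 110),
              ("big.bin", 500, 520), ("b.c", 5, 900)]
    print(summarize_paths(sample, top=2))
```

A path can rank low by on-disk size yet high by inflated size when it compresses well, which is why the two totals are reported separately.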
    
    Since the on-disk size is likely to be fragile, stop testing the exact
    output of 'git survey' and check that the correct set of headers is
    output.
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    af8bd64
  9. survey: add --top=<N> option and config

    The 'git survey' builtin provides several detail tables, such as "top
    files by on-disk size". The size of these tables currently defaults to
    100.
    
    Allow the user to specify this number via a new --top=<N> option or the
    new survey.top config key.
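
How the table size might be resolved can be sketched as follows. The precedence shown (the --top=<N> option over survey.top over the default of 100) follows the usual Git convention that the command line overrides config; it is an assumption here, not something this message states. The function name `resolve_top` is hypothetical.

```python
DEFAULT_TOP = 100  # default table size named in the commit message


def resolve_top(cli_value=None, config_value=None):
    """Pick the table size from --top=<N>, then survey.top, then the default.

    The ordering is an assumption based on common Git behavior.
    """
    for candidate in (cli_value, config_value):
        if candidate is not None:
            return int(candidate)
    return DEFAULT_TOP


if __name__ == "__main__":
    print(resolve_top())                                   # neither is set
    print(resolve_top(config_value="25"))                  # config only
    print(resolve_top(cli_value=10, config_value="25"))    # both are set
```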
    
    Signed-off-by: Derrick Stolee <stolee@gmail.com>
    derrickstolee committed Sep 26, 2024
    f18c0c2
  10. survey: clearly note the experimental nature in the output

    While this command is definitely something we _want_, chances are that
    upstreaming this will require substantial changes.
    
    We still want to be able to experiment with this before that, to focus
    on what we need out of this command: To assist with diagnosing issues
    with large repositories, as well as to help monitor the growth and
    the associated pain points of such repositories.
    
    To that end, we are about to integrate this command into
    `microsoft/git`, to get the tool into the hands of users who need it
    most, with the idea to iterate in close collaboration between these
    users and the developers familiar with Git's internals.
    
    However, we will definitely want to avoid letting anybody have the
    impression that this command, its exact inner workings, as well as its
    output format, are anywhere close to stable. To make that fact utterly
    clear (and thereby protect the freedom to iterate and innovate freely
    before upstreaming the command), let's mark its output as experimental
    in all-caps, as the first thing we do.
    
    Signed-off-by: Johannes Schindelin <johannes.schindelin@gmx.de>
    dscho authored and derrickstolee committed Sep 26, 2024
    d28dc5b