Realm deployment script #1652

Open
elliottslaughter wants to merge 64 commits into flexflow:master from elliottslaughter:realm-deploy
Collaborator

@elliottslaughter elliottslaughter commented Jun 3, 2026

Adds deployment script and capability for a generic, Realm-based model runner executable.

Miscellaneous other fixes:

  • Remove all submodules. We don't build them in Nix, and now the deployment script doesn't need them either.
  • Remove internal build and related variables from CMake. We're going to do this either from Nix or the deployment script now.
  • Directly reference CMake dependencies by their fully-qualified names to match CMake conventions. This one is a bit of a judgment call, but I put it into the original branch because it felt like adding aliases for everything was hiding some important CMake semantics (i.e., a fully-qualified CMake name vs a raw name can actually behave differently in a build).
  • Support for modern cuDNN, required to build on Sapling.
  • Fixes for random binaries that haven't made it upstream yet.
  • Add a CI mode for the deployment script so that we can be sure it's working (at least builds).

@elliottslaughter elliottslaughter requested a review from lockshaw June 3, 2026 21:38
@lockshaw lockshaw force-pushed the realm-deploy branch 3 times, most recently from 45a5a61 to 0197d0c Compare June 5, 2026 06:58
Collaborator Author

@elliottslaughter elliottslaughter left a comment

@elliottslaughter reviewed 59 files and all commit messages, and made 4 comments.
Reviewable status: all files reviewed, 4 unresolved discussions (waiting on lockshaw).


deploy.sh line 10 at r1 (raw file):

}

while [[ $# -gt 0 ]]; do

I wanted to make a couple of notes here because it looks like you're taking this in a different direction than what I intended.

This is NOT intended to be a one-size-fits-all, reusable deployment script. It is specifically intended to work on Sapling, and then I hacked it to also support CI so that we don't build something without test coverage. It is intended to be an example of how you could deploy on other machines without necessarily directly supporting any of them, because inevitably other HPC machines will look different, and it was a non-goal to account for every possible variation. I've worked on systems that have that as a goal, and it's doable, but it's also inherently brittle because essentially no HPC machines support CI and you will always be relying on human effort to keep any of it working.

The changes to make this more generic do appear to interfere with it working out of the box on Sapling, which was also a goal of the original.

We can decide to target a different set of goals but I want to be clear on what we're doing so that we build the right infrastructure for the job.


.github/workflows/deploy.yml line 16 at r1 (raw file):

        run: |
          apt-get update -qq
          apt-get install -y build-essential cmake curl gcc-${{ matrix.gcc }} g++-${{ matrix.gcc }} git libibverbs-dev mpich libmpich-dev python3 zlib1g-dev python3-venv

Not a huge deal but I was keeping this list sorted.


bin/run-model/src/run-model/main.cc line 110 at r1 (raw file):

      perform_all_passes_for_pcg_instance(
          /*instance=*/pcg_instance,
          /*profiling_settings=*/ProfilingSettings{0, 0},

Note to self that we need to run 1 real iteration or this won't do anything.


lib/local-execution/CMakeLists.txt line 16 at r1 (raw file):

    task-spec
    pcg
    deps::spdlog

Should we re-sort these dependency lines to be either alphabetical, or put dependencies last? The new prefix breaks the previous sort order.

Collaborator

lockshaw commented Jun 5, 2026

deploy.sh line 10 at r1 (raw file):

Yeah, I thought you had changed your mind on this because the script didn't have "sapling" in the name, so I was just going along with it. Put sapling in the name, and then go ahead and re-add the module loads (I'd do it, but it's probably easier for you to test). I think the require_cmd calls are still good for at least moving the errors up front, moving over to pip is still good for reducing the base, and the flags are just for making the usage a bit cleaner (though if it would be better to go back to GCC_VERSION instead of just pulling from CC/CXX, go for it). I think that's all the changes I made, but if there's anything else, let me know.
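
For reference, a minimal sketch of the two ideas above: checking for required tools up front, and choosing the compiler from CC with a GCC_VERSION fallback. The names `require_cmds` and `resolve_cc` are hypothetical and are not the helpers in deploy.sh; the fallback order is an assumption, not what the script currently does.

```shell
#!/usr/bin/env bash

# Hypothetical helper: report every missing command at once, so failures
# surface before any long-running build step starts.
# Returns 0 if all commands are found, 1 otherwise.
require_cmds() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "error: required command not found: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: pick the C compiler name.
# Assumed order: an explicit CC wins, then gcc-$GCC_VERSION if GCC_VERSION
# is set, then plain gcc.
resolve_cc() {
  if [[ -n "${CC:-}" ]]; then
    echo "$CC"
  elif [[ -n "${GCC_VERSION:-}" ]]; then
    echo "gcc-${GCC_VERSION}"
  else
    echo "gcc"
  fi
}
```

A script could call `require_cmds git cmake curl || exit 1` before doing any work, then export `CC="$(resolve_cc)"`. Supporting both variables like this would let Sapling keep GCC_VERSION while CI sets CC/CXX directly, if that split turns out to be useful.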
