Skip to content

[codex] Add Tinker checkpoint delete endpoint#1840

Open
j316chuck wants to merge 1 commit into
NovaSky-AI:mainfrom
j316chuck:codex/delete-checkpoint-endpoint
Open

[codex] Add Tinker checkpoint delete endpoint#1840
j316chuck wants to merge 1 commit into
NovaSky-AI:mainfrom
j316chuck:codex/delete-checkpoint-endpoint

Conversation

@j316chuck

@j316chuck j316chuck commented Jun 26, 2026

Copy link
Copy Markdown

Summary

  • Add DELETE /api/v1/training_runs/{unique_id}/checkpoints/{checkpoint_id} for Tinker checkpoint cleanup.
  • Delete the saved checkpoint artifact from checkpoints_base and remove the checkpoint DB row without unloading the live model.
  • Support SDK delete paths like tinker://run/weights/ckpt, direct checkpoint ids, and both training and sampler checkpoint storage layouts.
  • Update the Tinker agent docs with checkpoint deletion guidance.

Fixes #1834.

Root Cause

The local Tinker server had save/list/download checkpoint flows but no REST delete implementation behind the SDK delete helper, so saved checkpoint artifacts could accumulate on disk until local storage was exhausted.

Live Endpoint Example

I brought up the SkyRL Tinker API server from this branch on http://0.0.0.0:8000/, created a tiny LoRA training model, saved a checkpoint, then deleted it through the new endpoint-backed SDK helper.

endpoint=http://0.0.0.0:8000/
base_model=trl-internal-testing/tiny-Qwen3ForCausalLM
save_state_path=tinker://model_4b1374f9/weights/demo_delete_training
disk_path=/tmp/tmp5sxmfa5_/model_4b1374f9/demo_delete_training.tar.gz
exists_after_save=True
list_before_delete=['demo_delete_training']
delete_route=/api/v1/training_runs/model_4b1374f9/checkpoints/weights/demo_delete_training
exists_after_delete=False
list_after_delete=[]

Test Coverage

The delete-specific coverage now lives in tests/tinker/test_checkpoint_delete_api.py so it does not start the Tinker server subprocess or load a model. The tests seed an isolated SQLite DB plus tiny checkpoint artifacts under a temp checkpoints_base, then call the FastAPI route directly.

  • test_delete_checkpoint_removes_saved_artifact_and_list_entry
    • Seeds a completed training checkpoint DB row and a tiny directory-shaped training checkpoint artifact.
    • Deletes it through DELETE /api/v1/training_runs/{model_id}/checkpoints/weights/{checkpoint_id}.
    • Seeds a completed sampler checkpoint DB row and a tiny sampler checkpoint file.
    • Deletes it through DELETE /api/v1/training_runs/{model_id}/checkpoints/{checkpoint_id}.
    • Verifies deleted artifacts are gone from disk and gone from list_checkpoints.
  • test_delete_even_checkpoints_leaves_odd_checkpoints_listed
    • Creates tiny checkpoint artifacts and DB rows for checkpoints 1, 2, 3, 4, and 5.
    • Deletes only checkpoints 2 and 4.
    • Verifies 2 and 4 are gone from disk, 1, 3, and 5 remain on disk, and list_checkpoints returns only 1, 3, and 5.

Latest focused test run:

tests/tinker/test_checkpoint_delete_api.py::test_delete_checkpoint_removes_saved_artifact_and_list_entry PASSED [ 50%]
tests/tinker/test_checkpoint_delete_api.py::test_delete_even_checkpoints_leaves_odd_checkpoints_listed PASSED [100%]

2 passed, 1 warning in 4.45s
real 7.28

Validation

  • /usr/bin/time -p uv run --isolated --extra dev --extra tinker pytest tests/tinker/test_checkpoint_delete_api.py -v
  • uv run --isolated --extra dev --extra jax --extra tinker pytest tests/tinker/test_api.py::test_save_checkpoint_then_delete_checkpoint_removes_saved_file_and_keeps_model_loaded tests/tinker/test_api.py::test_unload_model -v before replacing the slow test with fast route-level coverage
  • uv run --isolated --extra dev --extra jax --extra tinker python - <<'PY' ... live endpoint save/delete demo
  • bash format.sh

@j316chuck j316chuck force-pushed the codex/delete-checkpoint-endpoint branch 2 times, most recently from 32086fe to 75fb667 Compare June 26, 2026 10:51
@j316chuck j316chuck force-pushed the codex/delete-checkpoint-endpoint branch from 75fb667 to 482f0f1 Compare June 26, 2026 10:56
@pcmoritz pcmoritz marked this pull request as ready for review June 30, 2026 12:03
@pcmoritz

Copy link
Copy Markdown
Collaborator

One thing I was thinking we should do differently is making the test part of test_api.py, I can take a pass at that.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new DELETE endpoint (/api/v1/training_runs/{unique_id}/checkpoints/{checkpoint_path}) to delete saved checkpoint archives from disk and remove their corresponding database entries. It includes helper functions for parsing checkpoint paths, locating files, and executing deletions, along with comprehensive unit tests. The review feedback suggests two improvements: wrapping directory deletion in a try-except block to prevent a potential TOCTOU race condition, and committing the database transaction before deleting the file on disk to ensure consistency and avoid dangling database references.

Comment thread skyrl/tinker/api.py
Comment on lines +1261 to +1265
if checkpoint_path.is_dir():
if hasattr(checkpoint_path, "rmtree"):
checkpoint_path.rmtree()
else:
shutil.rmtree(checkpoint_path)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

There is a potential TOCTOU (time-of-check to time-of-use) race condition here. If checkpoint_path is detected as a directory by is_dir(), but is deleted by another process or thread before shutil.rmtree is executed, shutil.rmtree will raise a FileNotFoundError and cause a 500 Internal Server Error. Wrapping the directory deletion in a try...except FileNotFoundError: block (similar to how unlink() is handled below) makes this operation more robust.

Suggested change
if checkpoint_path.is_dir():
if hasattr(checkpoint_path, "rmtree"):
checkpoint_path.rmtree()
else:
shutil.rmtree(checkpoint_path)
if checkpoint_path.is_dir():
try:
if hasattr(checkpoint_path, "rmtree"):
checkpoint_path.rmtree()
else:
shutil.rmtree(checkpoint_path)
except FileNotFoundError:
pass

Comment thread skyrl/tinker/api.py
Comment on lines +1385 to +1388
path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type)
await asyncio.to_thread(delete_checkpoint_file, path)
await session.delete(checkpoint_db)
await session.commit()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

It is safer to commit the database transaction deleting the checkpoint record before deleting the actual file on disk. If the file is deleted first but the database commit fails (e.g., due to a database connection issue or transaction conflict), the database and disk will be out of sync, leaving a dangling reference in the database to a non-existent file. Committing the database deletion first ensures that if the transaction fails, the file remains intact, and if it succeeds, any failure to delete the file on disk only results in an orphan file rather than a broken reference.

Suggested change
path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type)
await asyncio.to_thread(delete_checkpoint_file, path)
await session.delete(checkpoint_db)
await session.commit()
path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type)
await session.delete(checkpoint_db)
await session.commit()
await asyncio.to_thread(delete_checkpoint_file, path)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add delete checkpoint endpoint to prevent Tinker checkpoint disk exhaustion

2 participants