[codex] Add Tinker checkpoint delete endpoint#1840
Conversation
32086fe to
75fb667
Compare
75fb667 to
482f0f1
Compare
|
One thing I was thinking we should do differently is making the test part of |
There was a problem hiding this comment.
Code Review
This pull request introduces a new DELETE endpoint (/api/v1/training_runs/{unique_id}/checkpoints/{checkpoint_path}) to delete saved checkpoint archives from disk and remove their corresponding database entries. It includes helper functions for parsing checkpoint paths, locating files, and executing deletions, along with comprehensive unit tests. The review feedback suggests two improvements: wrapping directory deletion in a try-except block to prevent a potential TOCTOU race condition, and committing the database transaction before deleting the file on disk to ensure consistency and avoid dangling database references.
| if checkpoint_path.is_dir(): | ||
| if hasattr(checkpoint_path, "rmtree"): | ||
| checkpoint_path.rmtree() | ||
| else: | ||
| shutil.rmtree(checkpoint_path) |
There was a problem hiding this comment.
There is a potential TOCTOU (time-of-check to time-of-use) race condition here. If checkpoint_path is detected as a directory by is_dir(), but is deleted by another process or thread before shutil.rmtree is executed, shutil.rmtree will raise a FileNotFoundError and cause a 500 Internal Server Error. Wrapping the directory deletion in a try...except FileNotFoundError: block (similar to how unlink() is handled below) makes this operation more robust.
| if checkpoint_path.is_dir(): | |
| if hasattr(checkpoint_path, "rmtree"): | |
| checkpoint_path.rmtree() | |
| else: | |
| shutil.rmtree(checkpoint_path) | |
| if checkpoint_path.is_dir(): | |
| try: | |
| if hasattr(checkpoint_path, "rmtree"): | |
| checkpoint_path.rmtree() | |
| else: | |
| shutil.rmtree(checkpoint_path) | |
| except FileNotFoundError: | |
| pass |
| path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type) | ||
| await asyncio.to_thread(delete_checkpoint_file, path) | ||
| await session.delete(checkpoint_db) | ||
| await session.commit() |
There was a problem hiding this comment.
It is safer to commit the database transaction deleting the checkpoint record before deleting the actual file on disk. If the file is deleted first but the database commit fails (e.g., due to a database connection issue or transaction conflict), the database and disk will be out of sync, leaving a dangling reference in the database to a non-existent file. Committing the database deletion first ensures that if the transaction fails, the file remains intact, and if it succeeds, any failure to delete the file on disk only results in an orphan file rather than a broken reference.
| path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type) | |
| await asyncio.to_thread(delete_checkpoint_file, path) | |
| await session.delete(checkpoint_db) | |
| await session.commit() | |
| path = checkpoint_file_path(request, unique_id, checkpoint_id, resolved_checkpoint_type) | |
| await session.delete(checkpoint_db) | |
| await session.commit() | |
| await asyncio.to_thread(delete_checkpoint_file, path) |
Summary
DELETE /api/v1/training_runs/{unique_id}/checkpoints/{checkpoint_id}for Tinker checkpoint cleanup.checkpoints_baseand remove the checkpoint DB row without unloading the live model.tinker://run/weights/ckpt, direct checkpoint ids, and both training and sampler checkpoint storage layouts.Fixes #1834.
Root Cause
The local Tinker server had save/list/download checkpoint flows but no REST delete implementation behind the SDK delete helper, so saved checkpoint artifacts could accumulate on disk until local storage was exhausted.
Live Endpoint Example
I brought up the SkyRL Tinker API server from this branch on
http://0.0.0.0:8000/, created a tiny LoRA training model, saved a checkpoint, then deleted it through the new endpoint-backed SDK helper.Test Coverage
The delete-specific coverage now lives in
tests/tinker/test_checkpoint_delete_api.pyso it does not start the Tinker server subprocess or load a model. The tests seed an isolated SQLite DB plus tiny checkpoint artifacts under a tempcheckpoints_base, then call the FastAPI route directly.test_delete_checkpoint_removes_saved_artifact_and_list_entryDELETE /api/v1/training_runs/{model_id}/checkpoints/weights/{checkpoint_id}.DELETE /api/v1/training_runs/{model_id}/checkpoints/{checkpoint_id}.list_checkpoints.test_delete_even_checkpoints_leaves_odd_checkpoints_listed1,2,3,4, and5.2and4.2and4are gone from disk,1,3, and5remain on disk, andlist_checkpointsreturns only1,3, and5.Latest focused test run:
Validation
/usr/bin/time -p uv run --isolated --extra dev --extra tinker pytest tests/tinker/test_checkpoint_delete_api.py -vuv run --isolated --extra dev --extra jax --extra tinker pytest tests/tinker/test_api.py::test_save_checkpoint_then_delete_checkpoint_removes_saved_file_and_keeps_model_loaded tests/tinker/test_api.py::test_unload_model -vbefore replacing the slow test with fast route-level coverageuv run --isolated --extra dev --extra jax --extra tinker python - <<'PY' ...live endpoint save/delete demobash format.sh