Actor Pause/Resume flow#227
Conversation
…int configuration in atelet
68d9c85 to
0935cfb
Compare
Benjamin Elder (BenTheElder)
left a comment
There was a problem hiding this comment.
Mostly nits inline.
- Can we add coverage for this to the e2e tests? I think this should work on gvisor+kind and it would be nice to cover in CI. We could extend the demo counter e2e.
- Worker.node_name is only populated when the syncer re-mirrors a pod, so workers created before this rollout won't satisfy the node restriction until re-synced.
- I know the storage situation is temporary, but we have no GC for snapshots and these may fill local disk quickly. Maybe a TODO somewhere or a doc warning?
I have extended existing e2e demo test to test the PAUSE API.
Added a disclaimer in the PR description.
Added TODO in the code. |
| state := &PauseState{} | ||
|
|
||
| // Acquire lock and get the timeout context for the workflow | ||
| // Lock TTL is 7 seconds, with 2 seconds padding for workflow timeout |
There was a problem hiding this comment.
| // Lock TTL is 7 seconds, with 2 seconds padding for workflow timeout | |
| // Lock TTL is 30 seconds, with 2 seconds padding for workflow timeout |
There was a problem hiding this comment.
Julian Gutierrez Oschmann (@juli4n) I copied the comment and content of 30 sec from suspend workflow. The comment says "7" sec. Is it a comment incorrect or the actual timeout set to 30 sec instead of 7?
There was a problem hiding this comment.
It looks like the comment is outdated. It used to be 7 seconds, then it was bumped to 30. See https://github.com/agent-substrate/substrate/blame/808806db7f7b19304a44a491730004239110ef04/cmd/ateapi/internal/controlapi/workflow.go#L145.
… and check the counter result
Implements
PAUSEDstate for issue #119.The snapshot files are kept locally on node VM in a separate folder. At resume time, scheduler uses a node VM hint and picks up a worker from the same node where files were stored at suspend time.
The local file management solution is temporary and will be replaced once Michelle Au (@msau42) introduces a new component that is supposed to manage files on the node VM.
Breaking change
This PR introduces a breaking change in the Actor proto. All existing actor needs to be recreated, prior testing PAUSE functionality.