add eval capes to sdk#460
Conversation
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
edwinpav
left a comment
There was a problem hiding this comment.
Overall nice work!
Two main things:
- I'd make sure that the user-facing docs/descriptions are not overly complex. Not everyone will know or even care about how the function works behind the scenes, just care what are the params, what are the returns, and the feature that the method provides.
- If you want to deploy a new sdk version with these changes, two more files need to be changed and added to this pr:
-
CHANGELOG.mdshould be updated. The tag link that the CHANGELOG references will be created after this pr is merged into master. You'd add a new release with a new tag here: https://github.com/scaleapi/nucleus-python-client/releases. Feel free to ping for any questions! The process isn't super clear lol -
The sdk
versionundertool.poetryshould be updated inpyproject.toml
(see #457 as a reference pr)
-
| ) -> EvaluationV2: | ||
| """Create an Evaluation V2 job for a model run. | ||
|
|
||
| Starts a Temporal workflow that fills ``evaluation_match_v2``. Use |
There was a problem hiding this comment.
nit: i don't think infra specific information like Temporal workflow is necessary for public user-facing docs. We can just say async at the max.
| allowed_label_matches_id: Optional existing allowed-label-matches config id. | ||
|
|
||
| Returns: | ||
| :class:`EvaluationV2` loaded via ``GET /nucleus/evaluationsV2/:id``. |
There was a problem hiding this comment.
nit: is it necessary to specify the get endpoint that the return is using? Ideally we'd want to point users to use the sdk. For this case, probably over information for how the information is retrieved
| self.__dict__.update(updated.__dict__) | ||
| return self | ||
|
|
||
| def wait_for_completion( |
There was a problem hiding this comment.
Is this needed because this is not integrated with NucleusJobs? I thought this type of functionality comes built in for the other async functions (dedup async also uses temporal)
| The API uses REST endpoints ``/nucleus/modelRun/:id/evaluationsV2``, | ||
| ``/nucleus/evaluationsV2/:id``, ``/nucleus/evaluationsV2/:id/charts``, and | ||
| ``POST /nucleus/evaluationsV2/:id/examples``. |
There was a problem hiding this comment.
| requests_command=requests.delete, | ||
| return_raw_response=True, | ||
| ) | ||
| if resp.status_code != 204: |
There was a problem hiding this comment.
should we accept any 2xx code here?
| has_ground_truth: Optional[bool] = None | ||
| tide_background: Optional[bool] = None | ||
|
|
||
| def to_api_filters(self) -> Dict[str, Any]: |
There was a problem hiding this comment.
i think only top-level keys are converted to camelCase here, which is fine if that's the intention, just wanted to flag. if there are nested fields that are not singlewords in the future, they'll be sent in snake_case i think
| if "confidence_range" in d: | ||
| out["confidenceRange"] = d["confidence_range"] | ||
| if "iou_range" in d: | ||
| out["iouRange"] = d["iou_range"] | ||
| if "pred_labels" in d: | ||
| out["predLabels"] = d["pred_labels"] | ||
| if "gt_labels" in d: | ||
| out["gtLabels"] = d["gt_labels"] | ||
| if "item_metadata" in d: | ||
| out["itemMetadata"] = d["item_metadata"] | ||
| if "prediction_metadata" in d: | ||
| out["predictionMetadata"] = d["prediction_metadata"] | ||
| if "label_equality" in d: | ||
| out["labelEquality"] = d["label_equality"] | ||
| if "has_ground_truth" in d: | ||
| out["hasGroundTruth"] = d["has_ground_truth"] | ||
| if "tide_background" in d: | ||
| out["tideBackground"] = d["tide_background"] |
There was a problem hiding this comment.
nit: this is a bit verbose and long, consider using a map
_FILTER_KEY_MAP = {
"confidence_range": "confidenceRange",
"iou_range": "iouRange",
"pred_labels": "predLabels",
"gt_labels": "gtLabels",
"item_metadata": "itemMetadata",
"prediction_metadata": "predictionMetadata",
"label_equality": "labelEquality",
"has_ground_truth": "hasGroundTruth",
"tide_background": "tideBackground",
}
def to_api_filters(self) -> Dict[str, Any]:
"""Serialize to camelCase keys expected by the GraphQL / REST layer."""
d = self.dict(exclude_none=True)
return {self._FILTER_KEY_MAP[k]: v for k, v in d.items()}
| allowed_label_matches_id: Optional[str] = None | ||
| allowed_label_matches: Optional[List[AllowedLabelMatch]] = None | ||
| allowed_label_matches_name: Optional[str] = None | ||
| _client: Any = field(repr=False, default=None) |
There was a problem hiding this comment.
nit: shoudl the type be Optional["NucleusClient"]?
There was a problem hiding this comment.
not sure if intentional, but coverage missing for list_evaluations_v2, examples with filter args
| tideAttribution: TideAttribution | ||
|
|
||
|
|
||
| class EvaluationV2MatchExample(DictCompatibleModel): |
There was a problem hiding this comment.
nit: not sure if intentional, but the example keys use snake_case, while the char models above all use camelCase
resolves https://linear.app/scale-epd/issue/DE-7460
tests wont pass until https://github.com/scaleapi/scaleapi/pull/142963 is merged
Greptile Summary
EvaluationV2SDK support:NucleusClient.create_evaluation_v2,get_evaluation_v2, andlist_evaluations_v2, plus Pydantic DTOs for charts/examples payloads and a newtests/test_evaluation_v2.pywith mocked-HTTP coverage.EvaluationV2Statusenum is exported publicly butwait_for_completioncomparesself.statusagainst raw string literals instead of using the enum values, creating a silent drift risk.list_evaluations_v2silently returns[]when the API returns an unexpected non-list success body, which could mask malformed responses.Confidence Score: 5/5
Safe to merge; only P2 style/maintenance observations found.
No P0 or P1 bugs found. Both findings are P2: one is a style inconsistency (raw strings vs enum) and one is a defensive-code gap (silent empty list). Core create/poll/charts/examples flows are correctly implemented and well-tested.
nucleus/evaluation_v2.py (enum usage) and nucleus/init.py (list_evaluations_v2 silent fallback).
Important Files Changed
Sequence Diagram
sequenceDiagram participant User participant NucleusClient participant API User->>NucleusClient: create_evaluation_v2(model_run_id, ...) NucleusClient->>API: POST /modelRun/:id/evaluationsV2 API-->>NucleusClient: "{ evaluation_id }" NucleusClient->>API: GET /evaluationsV2/:id API-->>NucleusClient: EvaluationV2 payload NucleusClient-->>User: EvaluationV2 User->>NucleusClient: ev.wait_for_completion() loop Poll until terminal NucleusClient->>API: GET /evaluationsV2/:id API-->>NucleusClient: "{ status }" end NucleusClient-->>User: EvaluationV2 (succeeded) User->>NucleusClient: "ev.charts(iou_threshold=0.5)" NucleusClient->>API: "GET /evaluationsV2/:id/charts?iouThreshold=0.5" API-->>NucleusClient: EvaluationV2Charts JSON NucleusClient-->>User: EvaluationV2Charts User->>NucleusClient: "ev.examples(match_type=FP)" NucleusClient->>API: POST /evaluationsV2/:id/examples API-->>NucleusClient: EvaluationV2ExamplesPage JSON NucleusClient-->>User: EvaluationV2ExamplesPagePrompt To Fix All With AI
Reviews (4): Last reviewed commit: "fix p1" | Re-trigger Greptile