add eval capes to sdk by luke-e-schaefer · Pull Request #460 · scaleapi/nucleus-python-client

luke-e-schaefer · 2026-05-12T18:19:19Z

resolves https://linear.app/scale-epd/issue/DE-7460

tests wont pass until https://github.com/scaleapi/scaleapi/pull/142963 is merged

Greptile Summary

Adds EvaluationV2 SDK support: NucleusClient.create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2, plus Pydantic DTOs for charts/examples payloads and a new tests/test_evaluation_v2.py with mocked-HTTP coverage.
The EvaluationV2Status enum is exported publicly but wait_for_completion compares self.status against raw string literals instead of using the enum values, creating a silent drift risk.
list_evaluations_v2 silently returns [] when the API returns an unexpected non-list success body, which could mask malformed responses.

Confidence Score: 5/5

Safe to merge; only P2 style/maintenance observations found.

No P0 or P1 bugs found. Both findings are P2: one is a style inconsistency (raw strings vs enum) and one is a defensive-code gap (silent empty list). Core create/poll/charts/examples flows are correctly implemented and well-tested.

nucleus/evaluation_v2.py (enum usage) and nucleus/init.py (list_evaluations_v2 silent fallback).

Important Files Changed

Filename	Overview
nucleus/evaluation_v2.py	New EvaluationV2 dataclass + helpers; status polling uses raw string literals instead of the exported EvaluationV2Status enum.
nucleus/init.py	Adds create/get/list evaluation_v2 methods; list_evaluations_v2 silently returns [] on unexpected non-list responses.
nucleus/data_transfer_object/evaluation_v2.py	New Pydantic DTOs for evaluation V2 REST payloads; all nullable fields on EvaluationV2MatchExample are properly Optional.
tests/test_evaluation_v2.py	Unit tests with mocked HTTP layer covering create, get, charts, examples, wait_for_completion, and delete flows.
docs/index.rst	Adds Evaluations V2 documentation section with usage example and REST endpoint reference.

Sequence Diagram

sequenceDiagram
    participant User
    participant NucleusClient
    participant API

    User->>NucleusClient: create_evaluation_v2(model_run_id, ...)
    NucleusClient->>API: POST /modelRun/:id/evaluationsV2
    API-->>NucleusClient: "{ evaluation_id }"
    NucleusClient->>API: GET /evaluationsV2/:id
    API-->>NucleusClient: EvaluationV2 payload
    NucleusClient-->>User: EvaluationV2

    User->>NucleusClient: ev.wait_for_completion()
    loop Poll until terminal
        NucleusClient->>API: GET /evaluationsV2/:id
        API-->>NucleusClient: "{ status }"
    end
    NucleusClient-->>User: EvaluationV2 (succeeded)

    User->>NucleusClient: "ev.charts(iou_threshold=0.5)"
    NucleusClient->>API: "GET /evaluationsV2/:id/charts?iouThreshold=0.5"
    API-->>NucleusClient: EvaluationV2Charts JSON
    NucleusClient-->>User: EvaluationV2Charts

    User->>NucleusClient: "ev.examples(match_type=FP)"
    NucleusClient->>API: POST /evaluationsV2/:id/examples
    API-->>NucleusClient: EvaluationV2ExamplesPage JSON
    NucleusClient-->>User: EvaluationV2ExamplesPage

Prompt To Fix All With AI

Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nucleus/evaluation_v2.py:131-141
**`EvaluationV2Status` enum defined but unused in polling logic**

`EvaluationV2Status` is exported as part of the public API, but `wait_for_completion` compares `self.status` against raw string literals. If a new terminal state (e.g., `"timed_out"`) is added to the enum later, the polling set won't be updated unless someone remembers to change both places. Using `EvaluationV2Status` values (e.g., `EvaluationV2Status.FAILED.value`) makes the coupling explicit and lets type checkers catch drift.

### Issue 2 of 2
nucleus/__init__.py:940-945
**`list_evaluations_v2` silently returns `[]` on unexpected API shape**

`self.get(...)` raises `NucleusAPIError` for HTTP errors, so the `not isinstance(rows, list)` branch is reachable only when the server returns a non-list success body (e.g., a dict). In that case the method returns `[]` without surfacing any diagnostic, making it impossible to tell whether the run has no evaluations or whether the response was malformed. Raising instead of swallowing the unexpected payload would make the error actionable.

_{Reviews (4): Last reviewed commit: "fix p1" | Re-trigger Greptile}

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

edwinpav

Overall nice work!

Two main things:

I'd make sure that the user-facing docs/descriptions are not overly complex. Not everyone will know or even care about how the function works behind the scenes, just care what are the params, what are the returns, and the feature that the method provides.
If you want to deploy a new sdk version with these changes, two more files need to be changed and added to this pr:
1. CHANGELOG.md should be updated. The tag link that the CHANGELOG references will be created after this pr is merged into master. You'd add a new release with a new tag here: https://github.com/scaleapi/nucleus-python-client/releases. Feel free to ping for any questions! The process isn't super clear lol
2. The sdk version under tool.poetry should be updated in pyproject.toml
  (see #457 as a reference pr)

edwinpav · 2026-05-27T15:41:34Z

+    ) -> EvaluationV2:
+        """Create an Evaluation V2 job for a model run.
+
+        Starts a Temporal workflow that fills ``evaluation_match_v2``. Use


nit: i don't think infra specific information like Temporal workflow is necessary for public user-facing docs. We can just say async at the max.

edwinpav · 2026-05-27T15:43:13Z

+            allowed_label_matches_id: Optional existing allowed-label-matches config id.
+
+        Returns:
+            :class:`EvaluationV2` loaded via ``GET /nucleus/evaluationsV2/:id``.


nit: is it necessary to specify the get endpoint that the return is using? Ideally we'd want to point users to use the sdk. For this case, probably over information for how the information is retrieved

edwinpav · 2026-05-27T15:46:48Z

+        self.__dict__.update(updated.__dict__)
+        return self
+
+    def wait_for_completion(


Is this needed because this is not integrated with NucleusJobs? I thought this type of functionality comes built in for the other async functions (dedup async also uses temporal)

edwinpav · 2026-05-27T15:47:39Z

+The API uses REST endpoints ``/nucleus/modelRun/:id/evaluationsV2``,
+``/nucleus/evaluationsV2/:id``, ``/nucleus/evaluationsV2/:id/charts``, and
+``POST /nucleus/evaluationsV2/:id/examples``.


Same as https://github.com/scaleapi/nucleus-python-client/pull/460/changes#r3311980952

edwinpav · 2026-05-27T15:49:33Z

+            requests_command=requests.delete,
+            return_raw_response=True,
+        )
+        if resp.status_code != 204:


should we accept any 2xx code here?

edwinpav · 2026-05-27T16:00:56Z

+    has_ground_truth: Optional[bool] = None
+    tide_background: Optional[bool] = None
+
+    def to_api_filters(self) -> Dict[str, Any]:


i think only top-level keys are converted to camelCase here, which is fine if that's the intention, just wanted to flag. if there are nested fields that are not singlewords in the future, they'll be sent in snake_case i think

edwinpav · 2026-05-27T16:02:00Z

+        if "confidence_range" in d:
+            out["confidenceRange"] = d["confidence_range"]
+        if "iou_range" in d:
+            out["iouRange"] = d["iou_range"]
+        if "pred_labels" in d:
+            out["predLabels"] = d["pred_labels"]
+        if "gt_labels" in d:
+            out["gtLabels"] = d["gt_labels"]
+        if "item_metadata" in d:
+            out["itemMetadata"] = d["item_metadata"]
+        if "prediction_metadata" in d:
+            out["predictionMetadata"] = d["prediction_metadata"]
+        if "label_equality" in d:
+            out["labelEquality"] = d["label_equality"]
+        if "has_ground_truth" in d:
+            out["hasGroundTruth"] = d["has_ground_truth"]
+        if "tide_background" in d:
+            out["tideBackground"] = d["tide_background"]


nit: this is a bit verbose and long, consider using a map

_FILTER_KEY_MAP = { "confidence_range": "confidenceRange", "iou_range": "iouRange", "pred_labels": "predLabels", "gt_labels": "gtLabels", "item_metadata": "itemMetadata", "prediction_metadata": "predictionMetadata", "label_equality": "labelEquality", "has_ground_truth": "hasGroundTruth", "tide_background": "tideBackground", } def to_api_filters(self) -> Dict[str, Any]: """Serialize to camelCase keys expected by the GraphQL / REST layer.""" d = self.dict(exclude_none=True) return {self._FILTER_KEY_MAP[k]: v for k, v in d.items()}

edwinpav · 2026-05-27T16:02:44Z

+    allowed_label_matches_id: Optional[str] = None
+    allowed_label_matches: Optional[List[AllowedLabelMatch]] = None
+    allowed_label_matches_name: Optional[str] = None
+    _client: Any = field(repr=False, default=None)


nit: shoudl the type be Optional["NucleusClient"]?

edwinpav · 2026-05-27T16:04:04Z

not sure if intentional, but coverage missing for list_evaluations_v2, examples with filter args

edwinpav · 2026-05-27T16:04:48Z

+    tideAttribution: TideAttribution
+
+
+class EvaluationV2MatchExample(DictCompatibleModel):


nit: not sure if intentional, but the example keys use snake_case, while the char models above all use camelCase

add eval capes to sdk

4c6083e

luke-e-schaefer requested review from edwinpav and vinay553 May 12, 2026 18:19

luke-e-schaefer self-assigned this May 12, 2026

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated

Comment thread nucleus/__init__.py Outdated

luke-e-schaefer and others added 2 commits May 12, 2026 13:49

Apply suggestion from @greptile-apps[bot]

36f6b4a

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Apply suggestion from @greptile-apps[bot]

3caaf8d

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated

luke-e-schaefer and others added 3 commits May 12, 2026 14:03

run hooks

13a91b2

merge remote

cce066e

Update nucleus/data_transfer_object/evaluation_v2.py

aced4aa

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

Comment thread nucleus/data_transfer_object/evaluation_v2.py

fix p1

866ac71

edwinpav reviewed May 27, 2026

View reviewed changes

		tideAttribution: TideAttribution


		class EvaluationV2MatchExample(DictCompatibleModel):

Conversation

luke-e-schaefer commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Sequence Diagram

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

edwinpav left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

luke-e-schaefer commented May 12, 2026 •

edited

Loading