Skip to content

add eval capes to sdk#460

Open
luke-e-schaefer wants to merge 7 commits into
masterfrom
add-eval-capabilities
Open

add eval capes to sdk#460
luke-e-schaefer wants to merge 7 commits into
masterfrom
add-eval-capabilities

Conversation

@luke-e-schaefer
Copy link
Copy Markdown

@luke-e-schaefer luke-e-schaefer commented May 12, 2026

resolves https://linear.app/scale-epd/issue/DE-7460

tests wont pass until https://github.com/scaleapi/scaleapi/pull/142963 is merged

Greptile Summary

  • Adds EvaluationV2 SDK support: NucleusClient.create_evaluation_v2, get_evaluation_v2, and list_evaluations_v2, plus Pydantic DTOs for charts/examples payloads and a new tests/test_evaluation_v2.py with mocked-HTTP coverage.
  • The EvaluationV2Status enum is exported publicly but wait_for_completion compares self.status against raw string literals instead of using the enum values, creating a silent drift risk.
  • list_evaluations_v2 silently returns [] when the API returns an unexpected non-list success body, which could mask malformed responses.

Confidence Score: 5/5

Safe to merge; only P2 style/maintenance observations found.

No P0 or P1 bugs found. Both findings are P2: one is a style inconsistency (raw strings vs enum) and one is a defensive-code gap (silent empty list). Core create/poll/charts/examples flows are correctly implemented and well-tested.

nucleus/evaluation_v2.py (enum usage) and nucleus/init.py (list_evaluations_v2 silent fallback).

Important Files Changed

Filename Overview
nucleus/evaluation_v2.py New EvaluationV2 dataclass + helpers; status polling uses raw string literals instead of the exported EvaluationV2Status enum.
nucleus/init.py Adds create/get/list evaluation_v2 methods; list_evaluations_v2 silently returns [] on unexpected non-list responses.
nucleus/data_transfer_object/evaluation_v2.py New Pydantic DTOs for evaluation V2 REST payloads; all nullable fields on EvaluationV2MatchExample are properly Optional.
tests/test_evaluation_v2.py Unit tests with mocked HTTP layer covering create, get, charts, examples, wait_for_completion, and delete flows.
docs/index.rst Adds Evaluations V2 documentation section with usage example and REST endpoint reference.

Sequence Diagram

sequenceDiagram
    participant User
    participant NucleusClient
    participant API

    User->>NucleusClient: create_evaluation_v2(model_run_id, ...)
    NucleusClient->>API: POST /modelRun/:id/evaluationsV2
    API-->>NucleusClient: "{ evaluation_id }"
    NucleusClient->>API: GET /evaluationsV2/:id
    API-->>NucleusClient: EvaluationV2 payload
    NucleusClient-->>User: EvaluationV2

    User->>NucleusClient: ev.wait_for_completion()
    loop Poll until terminal
        NucleusClient->>API: GET /evaluationsV2/:id
        API-->>NucleusClient: "{ status }"
    end
    NucleusClient-->>User: EvaluationV2 (succeeded)

    User->>NucleusClient: "ev.charts(iou_threshold=0.5)"
    NucleusClient->>API: "GET /evaluationsV2/:id/charts?iouThreshold=0.5"
    API-->>NucleusClient: EvaluationV2Charts JSON
    NucleusClient-->>User: EvaluationV2Charts

    User->>NucleusClient: "ev.examples(match_type=FP)"
    NucleusClient->>API: POST /evaluationsV2/:id/examples
    API-->>NucleusClient: EvaluationV2ExamplesPage JSON
    NucleusClient-->>User: EvaluationV2ExamplesPage
Loading

Fix All in Cursor Fix All in Claude Code Fix All in Codex

Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
nucleus/evaluation_v2.py:131-141
**`EvaluationV2Status` enum defined but unused in polling logic**

`EvaluationV2Status` is exported as part of the public API, but `wait_for_completion` compares `self.status` against raw string literals. If a new terminal state (e.g., `"timed_out"`) is added to the enum later, the polling set won't be updated unless someone remembers to change both places. Using `EvaluationV2Status` values (e.g., `EvaluationV2Status.FAILED.value`) makes the coupling explicit and lets type checkers catch drift.

### Issue 2 of 2
nucleus/__init__.py:940-945
**`list_evaluations_v2` silently returns `[]` on unexpected API shape**

`self.get(...)` raises `NucleusAPIError` for HTTP errors, so the `not isinstance(rows, list)` branch is reachable only when the server returns a non-list success body (e.g., a dict). In that case the method returns `[]` without surfacing any diagnostic, making it impossible to tell whether the run has no evaluations or whether the response was malformed. Raising instead of swallowing the unexpected payload would make the error actionable.

Reviews (4): Last reviewed commit: "fix p1" | Re-trigger Greptile

@luke-e-schaefer luke-e-schaefer self-assigned this May 12, 2026
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
Comment thread nucleus/__init__.py Outdated
luke-e-schaefer and others added 2 commits May 12, 2026 13:49
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py Outdated
luke-e-schaefer and others added 3 commits May 12, 2026 14:03
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread nucleus/data_transfer_object/evaluation_v2.py
Copy link
Copy Markdown
Contributor

@edwinpav edwinpav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall nice work!

Two main things:

  1. I'd make sure that the user-facing docs/descriptions are not overly complex. Not everyone will know or even care about how the function works behind the scenes, just care what are the params, what are the returns, and the feature that the method provides.
  2. If you want to deploy a new sdk version with these changes, two more files need to be changed and added to this pr:
    1. CHANGELOG.md should be updated. The tag link that the CHANGELOG references will be created after this pr is merged into master. You'd add a new release with a new tag here: https://github.com/scaleapi/nucleus-python-client/releases. Feel free to ping for any questions! The process isn't super clear lol

    2. The sdk version under tool.poetry should be updated in pyproject.toml
      (see #457 as a reference pr)

Comment thread nucleus/__init__.py
) -> EvaluationV2:
"""Create an Evaluation V2 job for a model run.

Starts a Temporal workflow that fills ``evaluation_match_v2``. Use
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: i don't think infra specific information like Temporal workflow is necessary for public user-facing docs. We can just say async at the max.

Comment thread nucleus/__init__.py
allowed_label_matches_id: Optional existing allowed-label-matches config id.

Returns:
:class:`EvaluationV2` loaded via ``GET /nucleus/evaluationsV2/:id``.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: is it necessary to specify the get endpoint that the return is using? Ideally we'd want to point users to use the sdk. For this case, probably over information for how the information is retrieved

Comment thread nucleus/evaluation_v2.py
self.__dict__.update(updated.__dict__)
return self

def wait_for_completion(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed because this is not integrated with NucleusJobs? I thought this type of functionality comes built in for the other async functions (dedup async also uses temporal)

Comment thread docs/index.rst
Comment on lines +45 to +47
The API uses REST endpoints ``/nucleus/modelRun/:id/evaluationsV2``,
``/nucleus/evaluationsV2/:id``, ``/nucleus/evaluationsV2/:id/charts``, and
``POST /nucleus/evaluationsV2/:id/examples``.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread nucleus/evaluation_v2.py
requests_command=requests.delete,
return_raw_response=True,
)
if resp.status_code != 204:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we accept any 2xx code here?

has_ground_truth: Optional[bool] = None
tide_background: Optional[bool] = None

def to_api_filters(self) -> Dict[str, Any]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think only top-level keys are converted to camelCase here, which is fine if that's the intention, just wanted to flag. if there are nested fields that are not singlewords in the future, they'll be sent in snake_case i think

Comment on lines +37 to +54
if "confidence_range" in d:
out["confidenceRange"] = d["confidence_range"]
if "iou_range" in d:
out["iouRange"] = d["iou_range"]
if "pred_labels" in d:
out["predLabels"] = d["pred_labels"]
if "gt_labels" in d:
out["gtLabels"] = d["gt_labels"]
if "item_metadata" in d:
out["itemMetadata"] = d["item_metadata"]
if "prediction_metadata" in d:
out["predictionMetadata"] = d["prediction_metadata"]
if "label_equality" in d:
out["labelEquality"] = d["label_equality"]
if "has_ground_truth" in d:
out["hasGroundTruth"] = d["has_ground_truth"]
if "tide_background" in d:
out["tideBackground"] = d["tide_background"]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is a bit verbose and long, consider using a map

      _FILTER_KEY_MAP = {
          "confidence_range": "confidenceRange",
          "iou_range": "iouRange",
          "pred_labels": "predLabels",
          "gt_labels": "gtLabels",
          "item_metadata": "itemMetadata",
          "prediction_metadata": "predictionMetadata",
          "label_equality": "labelEquality",
          "has_ground_truth": "hasGroundTruth",
          "tide_background": "tideBackground",
      }

      def to_api_filters(self) -> Dict[str, Any]:
          """Serialize to camelCase keys expected by the GraphQL / REST layer."""
          d = self.dict(exclude_none=True)
          return {self._FILTER_KEY_MAP[k]: v for k, v in d.items()}

Comment thread nucleus/evaluation_v2.py
allowed_label_matches_id: Optional[str] = None
allowed_label_matches: Optional[List[AllowedLabelMatch]] = None
allowed_label_matches_name: Optional[str] = None
_client: Any = field(repr=False, default=None)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: shoudl the type be Optional["NucleusClient"]?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if intentional, but coverage missing for list_evaluations_v2, examples with filter args

tideAttribution: TideAttribution


class EvaluationV2MatchExample(DictCompatibleModel):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: not sure if intentional, but the example keys use snake_case, while the char models above all use camelCase

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants