Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
19 changes: 19 additions & 0 deletions cloud_pipelines_backend/instrumentation/error_normalization.py
Original file line number Diff line number Diff line change
Expand Up @@ -16,9 +16,14 @@
r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}", re.IGNORECASE
)
_LONG_ALNUM_ID_PATTERN = re.compile(r"\b[a-zA-Z0-9]{16,}\b")
# Matches any embedded JSON object or Python dict literal (starts with `{"` or `{'`).
# These are stripped from grouping strings because they contain highly variable
# runtime data (e.g. full Kubernetes pod specs) that would fragment error groups.
_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*", re.DOTALL)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern \{['"].* is overly greedy and will match from the first occurrence of {" or {' all the way to the end of the string, not just the JSON object itself. This means any text after a JSON object will also be stripped.

Example:

# Input: "Failed with {'error': 'x'} on server X"
# Current behavior: "Failed with {...}"  # "on server X" is lost
# Expected: "Failed with {...} on server X"

This contradicts the description which says it strips "JSON objects" (implying discrete chunks), not "everything after the first JSON object".

The pattern should be non-greedy and attempt to match balanced braces, for example:

_JSON_OBJECT_PATTERN = re.compile(r"\{['\"][^}]*\}", re.DOTALL)

Or use a non-greedy quantifier:

_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*?\}", re.DOTALL)

Note: Even these simpler fixes won't handle nested braces correctly, but would be better than the current greedy match.

Suggested change
_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*", re.DOTALL)
_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*?\}", re.DOTALL)

Spotted by Graphite

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regex pattern requires no whitespace between { and the quote character. Valid JSON and Python dict reprs often have whitespace like { "key": ...} or { 'key': ...} which will not match this pattern.

For example:

# Won't match: "Failed: { 'apiVersion': 'v1' }"  # space after {
# Will match: "Failed: {'apiVersion': 'v1'}"    # no space

Fix by allowing optional whitespace:

_JSON_OBJECT_PATTERN = re.compile(r"\{\s*['\"].*", re.DOTALL)

This is critical for Kubernetes pod specs which the description specifically mentions, as their serialization format may vary.

Suggested change
_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*", re.DOTALL)
_JSON_OBJECT_PATTERN = re.compile(r"\{\s*['\"].*", re.DOTALL)

Spotted by Graphite

Fix in Graphite


Is this helpful? React 👍 or 👎 to let us know.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This is an AI-generated code review comment.

_JSON_OBJECT_PATTERN greedily consumes everything after the first {\"/{' to end-of-string — worth documenting.

_JSON_OBJECT_PATTERN = re.compile(r"\{['\"].*", re.DOTALL)

With re.DOTALL and the greedy .*, this matches from the first {\"/{' to end-of-string. Trailing context after an embedded JSON literal is also dropped, so:

RuntimeError("config: {'a': 1} during step X")  →  RuntimeError: config: {...}
RuntimeError("config: {'a': 1} during step Y")  →  RuntimeError: config: {...}

These group together even though "step X" vs "step Y" is a meaningful distinguisher.

This is likely intentional (anything after a runtime-data dict in an error message is usually also variable), but the comment explains the why for the JSON itself, not the trailing-text trade-off. One line in the docstring would help the next reader reason about edge cases — something like "Note: trailing message text after the first {\"/{' is also stripped".

If the trade-off is not desired, scope the match more tightly with a non-greedy variant like \{['\"].*?\} (with caveats around nested braces).



def _strip_generic(*, message: str) -> str:
message = _JSON_OBJECT_PATTERN.sub("{...}", message)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should literally replace the _JSON_OBJECT_PATTERN with {...} and not just ... correct?

message = _OBJECT_REPR_PATTERN.sub("{object}", message)
message = _HEX_ADDRESS_PATTERN.sub("{addr}", message)
message = _UUID_PATTERN.sub("{uuid}", message)
Expand Down Expand Up @@ -85,13 +90,27 @@ def _normalize_orchestrator_error(*, exception: BaseException) -> str | None:
return f"OrchestratorError: {message}"


def _normalize_launcher_error(*, exception: BaseException) -> str | None:
try:
from ..launchers.interfaces import LauncherError

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering if it would make sense to do the import globally (at the top of the file) and have a global bool to track if LauncherError is available?

except ImportError:
return None
if not isinstance(exception, LauncherError):
return None
# Take only the verb phrase before the first colon to drop any embedded
# serialized data (e.g. the full Kubernetes pod spec appended after ": ").
head = str(exception).split(":", 1)[0].strip()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This is an AI-generated code review comment.

_normalize_launcher_error over-truncates and may be redundant.

head = str(exception).split(":", 1)[0].strip()
return f"LauncherError: {head}"

Splitting on the first colon collapses meaningfully distinct failures into one group. All of these become LauncherError: Failed to create pod:

  • "Failed to create pod: connection timeout to API server"
  • "Failed to create pod: name 'task-x' already exists"
  • "Failed to create pod: insufficient memory in node pool"
  • "Failed to create pod: {full pod spec}"

These are different operational problems and should group separately.

With the new _JSON_OBJECT_PATTERN already in _strip_generic, the generic fallback would produce LauncherError: Failed to create pod: {...} — keeping the verb phrase, marking the JSON-bearing suffix, and still grouping the spec-variable case. The specialized normalizer may not be needed at all.

The comment also calls it "the verb phrase," but for "creating pod: spec invalid: missing field 'name'" the head is just "creating pod" and the actual diagnostic is dropped.

Fix: either (a) drop _normalize_launcher_error entirely and let the generic path handle it now that _JSON_OBJECT_PATTERN exists, or (b) strip only the trailing JSON-y data and keep the rest:

message = _JSON_OBJECT_PATTERN.sub("{...}", str(exception))
return f"LauncherError: {message.strip()}"

return f"LauncherError: {head}"


def normalize_error_message(*, exception: BaseException) -> str:
"""Return a stable normalized string for error grouping."""
for normalizer in (
_normalize_k8s_api_exception,
_normalize_max_retry_error,
_normalize_unicode_decode_error,
_normalize_orchestrator_error,
_normalize_launcher_error,
):
result = normalizer(exception=exception)
if result is not None:
Expand Down
45 changes: 45 additions & 0 deletions tests/instrumentation/test_error_normalization.py
Original file line number Diff line number Diff line change
Expand Up @@ -184,6 +184,41 @@ def test_strips_object_repr(self):
)


class TestNormalizeLauncherError:
def _make_launcher_error(
self, message: str, cause: BaseException | None = None
) -> Exception:
try:
from cloud_pipelines_backend.launchers.interfaces import LauncherError
except ImportError:
pytest.skip("LauncherError not importable")
if cause:
try:
raise LauncherError(message) from cause
except LauncherError as exc:
return exc
return LauncherError(message)

def test_strips_pod_spec_json(self):
pod_spec = (
"{'apiVersion': 'v1', 'kind': 'Pod', 'metadata': {'name': 'task-abc-xyz'}}"
)
exc = self._make_launcher_error(f"Failed to create pod: {pod_spec}")
result = error_normalization.normalize_error_message(exception=exc)
assert result == "LauncherError: Failed to create pod"

def test_with_timeout_cause(self):
cause = TimeoutError("The read operation timed out")
exc = self._make_launcher_error("Failed to create pod: {big spec}", cause=cause)
result = error_normalization.normalize_error_message(exception=exc)
assert result == "LauncherError: Failed to create pod"

def test_no_colon_in_message(self):
exc = self._make_launcher_error("launch failed")
result = error_normalization.normalize_error_message(exception=exc)
assert result == "LauncherError: launch failed"


Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤖 This is an AI-generated code review comment.

Consider adding a test for multi-colon LauncherError messages.

The three new tests cover JSON suffix, cause chain, and no-colon. None covers the case where there is real diagnostic info after the first colon — exactly the scenario where the verb-phrase truncation silently degrades grouping quality (see the other comment on _normalize_launcher_error).

Adding:

def test_multiple_colons_in_message(self):
    exc = self._make_launcher_error("creating pod: spec invalid: missing field 'name'")
    result = error_normalization.normalize_error_message(exception=exc)
    assert result == "LauncherError: creating pod"  # or whatever the desired behavior is

…either pins the truncation as intentional or surfaces the regression so the policy gets revisited.

class TestFallback:
def test_strips_hex_address(self):
exc = ValueError("object at 0xdeadbeef failed")
Expand All @@ -204,3 +239,13 @@ def test_stable_message_unchanged(self):
exc = AttributeError("'NoneType' object has no attribute 'encode'")
result = error_normalization.normalize_error_message(exception=exc)
assert result == "AttributeError: 'NoneType' object has no attribute 'encode'"

def test_strips_json_object(self):
exc = RuntimeError("operation failed: {'key': 'value', 'nested': {'a': 1}}")
result = error_normalization.normalize_error_message(exception=exc)
assert result == "RuntimeError: operation failed: {...}"

def test_strips_json_object_double_quotes(self):
exc = RuntimeError('operation failed: {"key": "value"}')
result = error_normalization.normalize_error_message(exception=exc)
assert result == "RuntimeError: operation failed: {...}"
Loading