Skip to content

[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983

Open
xuzifu666 wants to merge 1 commit into
apache:mainfrom
xuzifu666:calcite-5101
Open

[CALCITE-5101] LISTAGG function with DISTINCT and ORDER BY fails#4983
xuzifu666 wants to merge 1 commit into
apache:mainfrom
xuzifu666:calcite-5101

Conversation

@xuzifu666
Copy link
Copy Markdown
Member

if (collations.size() == 1) {
RelFieldCollation collation = collations.get(0);
final int index = collation.getFieldIndex();
if (index < 0 || index >= rowType.getFieldList().size()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess these should never be reachable if the code generator is correct

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is just a defensive check, but it seems unnecessary now, so I deleted it.

}
}

if (!validCollation) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit confused: there is a sort order specified, yet it is ignored? That cannot be right.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two options:

  1. Complete Fix: add ORDER BY columns to the re-grouped input, requires architectural changes to aggregate planning,changes query semantics, potential impact on other parts.This is a large-scale refactoring;
  2. Graceful Degradation(current way): returns correct results without crashes, loses sorting information, but doesn't crash.This is a rare scenario:only occurs with LISTAGG(DISTINCT ...) WITHIN GROUP (ORDER BY non-group-column).

According the conditions I choose the second way. So do you think my current plan is reasonable?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is the result correct if you ignore the order?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a compromise solution to avoid crashes by downgrading the handling of such special statements (removing the order by clause). The main issue is that a thorough fix involves many aspects, with less consideration for ROI. If it becomes clear that a completely accurate order by result is required, I will attempt a fix with the lowest possible cost in the future.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the difference between "less acurate" and "wrong"
Either the result is correct, or it's not. I don't think we should take a fix which avoids crashes and produces wrong results.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will try to fix this issue completely later.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mihaibudiu I had fixed the issue, PTAL.

Copy link
Copy Markdown

@GoncaloCoutoDosSantos GoncaloCoutoDosSantos Jun 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, if you're going to change AggregateExpandDistinctAggregatesRule to fix this bug, you'll also should to update rewriteUsingGroupingSets, since the same issue can be triggered by a query like:

SELECT deptno,
       LISTAGG(DISTINCT ename) WITHIN GROUP (ORDER BY sal),
       LISTAGG(DISTINCT deptno) WITHIN GROUP (ORDER BY ename)
FROM emp
GROUP BY deptno;```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the very least, a test checking this should be added

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I had addressed this issue and add related test about it. @mihaibudiu @GoncaloCoutoDosSantos

@xuzifu666 xuzifu666 requested a review from mihaibudiu June 2, 2026 15:27
Copy link
Copy Markdown
Contributor

@mihaibudiu mihaibudiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reply to the other comment you have received as well.

if (aggCall.isDistinct()) {
bottomGroups.addAll(aggCall.getArgList());
// Also include ORDER BY columns from WITHIN GROUP
for (RelFieldCollation fc : aggCall.collation.getFieldCollations()) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a better approach.
Can this add the same column twice? Is that a problem?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this cause DISTINCT to be ignored in certain scenarios?

Consider the following example:

SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus)
FROM EMP
GROUP BY deptno

With this modification, the rule would rewrite the query as:

SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus)
FROM (
  SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus
)
GROUP BY deptno

This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.

PS: Please feel free to correct me or let me know if I am intervening inappropriately.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a better approach. Can this add the same column twice? Is that a problem?

That's should not a problem. bottomGroups is a NavigableSet<Integer> (TreeSet), which automatically removes duplicates. If the ORDER BY column is already in the GROUP BY or DISTINCT parameter, the Set will ignore duplicates. I've added a comment to clarify this. @mihaibudiu

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this cause DISTINCT to be ignored in certain scenarios?

Consider the following example:

SELECT deptno, SUM(DISTINCT sal) WITHIN GROUP (ORDER BY bonus)
FROM EMP
GROUP BY deptno

With this modification, the rule would rewrite the query as:

SELECT deptno, SUM(sal) WITHIN GROUP (ORDER BY bonus)
FROM (
  SELECT deptno, sal, bonus FROM EMP GROUP BY deptno, sal, bonus
)
GROUP BY deptno

This does not correctly enforce DISTINCT on sal: if two rows share the same sal value but have different bonus values, both survive the inner GROUP BY and sal ends up counted twice in the outer SUM, which violates the DISTINCT semantics.

PS: Please feel free to correct me or let me know if I am intervening inappropriately.

I think the question you raised is quite valuable, and I've already fixed it in a recent commit. You can take a look. @GoncaloCoutoDosSantos

int originalIdx = fc.getFieldIndex();
int newIdx = fullGroupSet.indexOf(originalIdx);
if (newIdx >= 0) {
remappedFCs.add(fc.withFieldIndex(newIdx));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you explain why is it ok for the result collation to have fewer elements than the input collation?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my view it's acceptable for result collation to have fewer elements because:

  1. fullGroupSet only contains columns that are part of some grouping set. If an ORDER BY column is not in fullGroupSet, it means that column is not in any grouping set.
  2. Columns not in any grouping set have inconsistent values within each group — different rows with the same group key may have different values for that column.
  3. Sorting by columns with inconsistent values is meaningless. It's safe to silently drop them from the collation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be in a comment in the code.
Can you also create a test case to cover this case and make sure it gives the same results as some other reference database?

Copy link
Copy Markdown
Member Author

@xuzifu666 xuzifu666 Jun 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I had add comment and add a test case cover this case.
I test it in postgresql and result is expected. link:https://onecompiler.com/postgresql/44rcggaqf

"deptno=20; total_distinct_salary=8000.0");
}

@Test void countDistinctWithOrderByNonGroupColumn() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am confused, where is the ORDER BY in this example?

Copy link
Copy Markdown
Member Author

@xuzifu666 xuzifu666 Jun 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated test case, Based on the PostgreSQL results https://onecompiler.com/postgresql/44rcggaqf, it looks like it's OK.

"deptno=20; total_distinct_salary=8000.0");
}

@Test void groupingSetsWithCollationReferencingOutsideGroupingSets() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see any ORDER BY in these last two SQL programs.
Where is the collation coming from?
Maybe you can show the plan having a SORT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added order by, and the corresponding PostgreSQL test is here: https://onecompiler.com/postgresql/44rcggaqf

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

@xuzifu666 xuzifu666 added the LGTM-will-merge-soon Overall PR looks OK. Only minor things left. label Jun 6, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

LGTM-will-merge-soon Overall PR looks OK. Only minor things left.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants