Skip to content

Conversation

@kanoshiou
Copy link
Contributor

@kanoshiou kanoshiou commented Oct 13, 2025

When the optimizer creates a LocalRelation using LocalSupplier.of() (which delegates to ImmediateLocalSupplier), the same page instance can be referenced and released multiple times during inline stats execution. This happens because the LocalRelation serves as a source for both sides of the InlineJoin - once for the left-hand side and again for the right-hand side, causing a double release error and connection closure.

This PR modifies the InlineJoin.firstSubPlan() method to ensure that all LocalRelation nodes use CopyingLocalSupplier (if they don't already) after extracting the subplan. This prevents double release issues when the same page is reused across executions.

Closes #135679

- Replace `LocalSupplier.of` with new `CopyingLocalSupplier` in `PruneColumns`
- Add `INLINE_STATS_AFTER_PRUNED_AGGREGATE` capability
- Add test cases for inline stats after aggregate pruning
@elasticsearchmachine elasticsearchmachine added needs:triage Requires assignment of a team area label v9.3.0 external-contributor Pull request authored by a developer outside the Elasticsearch team labels Oct 13, 2025
@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 13, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Oct 13, 2025
@bpintea bpintea requested review from astefan and bpintea October 13, 2025 11:23
@kanoshiou kanoshiou changed the title ESQL: Fix connection closed in inline stats after pruned aggregates ESQL: Fix double release in inline stats when LocalRelation is reused Oct 14, 2025
@bpintea
Copy link
Contributor

bpintea commented Oct 14, 2025

@kanoshiou thanks for this patch!

I'm however thinking if we shouldn't have another approach here: instead of patching the optimiser rules to use the CopyingLocalSupplier, shouldn't we rather patch theLocalRelation leaf that InlineJoin get to run on the RHS? (Possibly apply the fix only to InlineJoin#firstSubPlan.)
#135679 is INLINE STATS-specific and we could probably contain the fix to it. We could then also check if we'd need a CopyingLocalSupplier (but that's secondary).

@astefan, WDYT?

@astefan
Copy link
Contributor

astefan commented Oct 15, 2025

@kanoshiou thanks for this patch!

I'm however thinking if we shouldn't have another approach here: instead of patching the optimiser rules to use the CopyingLocalSupplier, shouldn't we rather patch theLocalRelation leaf that InlineJoin get to run on the RHS? (Possibly apply the fix only to InlineJoin#firstSubPlan.) #135679 is INLINE STATS-specific and we could probably contain the fix to it. We could then also check if we'd need a CopyingLocalSupplier (but that's secondary).

@astefan, WDYT?

@kanoshiou inline stats is different than any other command in that part of its workings live in EsqlSession. Whenever we add something that is specific to inline stats we try to limit its spread and make it as seamless as possible in the existing code.
This means that modifications that happen outside EsqlSession and InlineJoin itself should be limited. Your changes to optimizer rules that have nothing to do with inline stats (as opposed to PruneInlineJoinOnEmptyRightSide for example which is inline stats specific) should have a very good reason to be there.

Your changes to PruneColumns and PropagateEmptyRelation take the shotgun approach and are too invasive, they will also apply to non-inline stats use cases which is unnecessary. @bpintea's suggestion is something you should look into, instead.

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, look into @bpintea's suggestion for a less invasive and more targeted approach to this bug fix. Please, let us know if you are able to work on this bug fix in short time, otherwise we are ready to take it over. Thank you.

@kanoshiou
Copy link
Contributor Author

Thank you for the review and insightful suggestions, @bpintea @astefan! I'll update the branch today with a new logical optimizer rule.

@kanoshiou kanoshiou requested a review from astefan October 15, 2025 17:42
Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had something else in mind when I did my first review.
Using a CopyingLocalSupplier should happen in InlineJoin.firstSubPlan:

public static LogicalPlanTuple firstSubPlan(LogicalPlan optimizedPlan, Set<LocalRelation> subPlansResults) {
        final Holder<LogicalPlan> stubReplacedSubPlanHolder = new Holder<>();
        final Holder<LogicalPlan> originalSubPlanHolder = new Holder<>();
        // Collect the first inlinejoin (bottom up in the tree)
        optimizedPlan.forEachUp(InlineJoin.class, ij -> {
            // extract the right side of the plan and replace its source
            if (stubReplacedSubPlanHolder.get() == null) {
                if (ij.right().anyMatch(p -> p instanceof StubRelation)) {
                    stubReplacedSubPlanHolder.set(replaceStub(ij.left(), ij.right()));
                } else if (ij.right() instanceof LocalRelation relation
                    && (subPlansResults.isEmpty() || subPlansResults.contains(relation) == false)
                    || ij.right() instanceof LocalRelation == false && ij.right().anyMatch(p -> p instanceof LocalRelation)) {
                    // In case the plan was optimized further and the StubRelation was replaced with a LocalRelation
                    // or the right hand side became a LocalRelation alltogether, there is no need to replace the source of the
                    // right-hand side anymore.
                    stubReplacedSubPlanHolder.set(ij.right());
                    // TODO: INLINE STATS this is essentially an optimization similar to the one in PruneInlineJoinOnEmptyRightSide
                    // this further supports the idea of running the optimization step again after the substitutions (see EsqlSession
                    // executeSubPlan() method where we could run the optimizer after the results are replaced in place).
                }
                originalSubPlanHolder.set(ij.right());
            }
        });
        LogicalPlanTuple tuple = null;
        if (stubReplacedSubPlanHolder.get() != null) {
            var plan = stubReplacedSubPlanHolder.get();
            plan = plan.transformUp(LocalRelation.class, lr -> {
                if (lr.supplier() instanceof CopyingLocalSupplier == false) {
                    return new LocalRelation(lr.source(), lr.output(), new CopyingLocalSupplier(lr.supplier().get()));
                }
                return lr;
            });
            plan.setOptimized();
            tuple = new LogicalPlanTuple(plan, originalSubPlanHolder.get());
        }
        return tuple;
    }

@kanoshiou
Copy link
Contributor Author

Thank you for the thoughtful patch, @astefan — it really helped clarify the underlying issue!

It looks like your current approach addresses the problem outside the logical optimization phase, which seems like the right direction for tackling the root cause.

While reviewing your code, I was wondering if the CopyingLocalSupplier currently used in ReplaceRowAsLocalRelation could be replaced with an ImmediateLocalSupplier. Doing so might simplify the handling and reduce unnecessary duplication, assuming the reuse semantics remain intact.

@astefan
Copy link
Contributor

astefan commented Oct 16, 2025

Thank you for the thoughtful patch, @astefan — it really helped clarify the underlying issue!

It looks like your current approach addresses the problem outside the logical optimization phase, which seems like the right direction for tackling the root cause.

While reviewing your code, I was wondering if the CopyingLocalSupplier currently used in ReplaceRowAsLocalRelation could be replaced with an ImmediateLocalSupplier. Doing so might simplify the handling and reduce unnecessary duplication, assuming the reuse semantics remain intact.

I don't know. It needs to go through a round of testing to double check this.

@astefan astefan added >bug v9.2.1 auto-backport Automatically create backport pull requests when merged labels Oct 16, 2025
@astefan
Copy link
Contributor

astefan commented Oct 16, 2025

@elasticmachine test this please

@astefan astefan self-assigned this Oct 16, 2025
@kanoshiou kanoshiou requested a review from astefan October 17, 2025 04:28
@astefan
Copy link
Contributor

astefan commented Oct 17, 2025

@elasticmachine test this please

Copy link
Contributor

@astefan astefan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. CI is happy with the change in ReplaceRowAsLocalRelation, I am happy as well.

@astefan astefan force-pushed the fix-inline-stats-pruned-aggregates branch from 9eb00fc to e5444f3 Compare October 17, 2025 12:44
@astefan
Copy link
Contributor

astefan commented Oct 17, 2025

@elasticmachine test this please

@kanoshiou
Copy link
Contributor Author

Thanks for your review and sorry for the delay in replying to your comment and updating the code, @astefan! I learned a lot about ES|QL and its optimization nuances from your code and advice! Thanks again for your time and expertise!

Copy link
Contributor

@bpintea bpintea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM!

Comment on lines +4041 to +4044
| stats PRezTcny = median(network.cost), `network.cost` = count(network.cost)
| drop network.cost
| inline stats QaoRGYyf = count(*)
| drop `PRezTcny`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These queries are generated. This one in particular is relatively clean, but those below are randomised and those identifiers are eye pain (most `` quotations aren't needed either). Usually, they can also be shrunk down to a "minimal" query that reproduces the issue.
Ideally, we'd minimise the queries and use "human-friendly" identifiers, but at a minimum, maybe indicate their source by adding a comment with the originating issue.

Comment on lines 182 to 183
if (stubReplacedSubPlanHolder.get() != null) {
var plan = stubReplacedSubPlanHolder.get();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor:

Suggested change
if (stubReplacedSubPlanHolder.get() != null) {
var plan = stubReplacedSubPlanHolder.get();
var plan = stubReplacedSubPlanHolder.get();
if (plan != null) {

* And uses the left-hand side's source as its source:
* Aggregate[[],[MAX(x{r}#99,true[BOOLEAN]) AS y#102]]
* \_Limit[1000[INTEGER],false]
* \_LocalRelation[[x{r}#99],[IntVectorBlock[vector=ConstantIntVector[positions=1, value=1]]]]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This javadoc might require an update to be inline with the added paragraph, but this is minor, can remain as is.

@astefan astefan added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Oct 17, 2025
@astefan
Copy link
Contributor

astefan commented Oct 17, 2025

@elasticmachine test this please

@elasticsearchmachine elasticsearchmachine merged commit 968b623 into elastic:main Oct 17, 2025
35 checks passed
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
9.2 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 136467

astefan pushed a commit to astefan/elasticsearch that referenced this pull request Oct 21, 2025
…ed (elastic#136467)

When the optimizer creates a `LocalRelation` using `LocalSupplier.of()`
(which delegates to `ImmediateLocalSupplier`), the same page instance
can be referenced and released multiple times during inline stats
execution. This happens because the `LocalRelation` serves as a source
for both sides of the `InlineJoin` - once for the left-hand side and
again for the right-hand side, causing a double release error and
connection closure.

This PR modifies the `InlineJoin.firstSubPlan()` method to ensure that
all `LocalRelation` nodes use `CopyingLocalSupplier` (if they don't
already) after extracting the subplan. This prevents double release
issues when the same page is reused across executions.

Closes elastic#135679

(cherry picked from commit 968b623)
chrisparrinello pushed a commit to chrisparrinello/elasticsearch that referenced this pull request Oct 24, 2025
…ed (elastic#136467)

When the optimizer creates a `LocalRelation` using `LocalSupplier.of()`
(which delegates to `ImmediateLocalSupplier`), the same page instance
can be referenced and released multiple times during inline stats
execution. This happens because the `LocalRelation` serves as a source
for both sides of the `InlineJoin` - once for the left-hand side and
again for the right-hand side, causing a double release error and
connection closure.

This PR modifies the `InlineJoin.firstSubPlan()` method to ensure that
all `LocalRelation` nodes use `CopyingLocalSupplier` (if they don't
already) after extracting the subplan. This prevents double release
issues when the same page is reused across executions.

Closes elastic#135679
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/ES|QL AKA ESQL auto-backport Automatically create backport pull requests when merged auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug external-contributor Pull request authored by a developer outside the Elasticsearch team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.2.1 v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ES|QL: INLINE STATS leads to "can't build page out of released blocks"

5 participants