[ML] Cache Inference Endpoints #133135

Description

@prwhelan

Inference currently reloads the inference endpoint on every inference request: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/BaseTransportInferenceAction.java#L189

We can optimize this search out of the inference call because:

  1. these endpoints tend to be high read and low write
  2. there tends to be only a handful of regularly used endpoints

This would have a low impact on normal operations but could reduce load on search nodes during periods of high inference traffic.

Requirements:

  • Potentially use CacheBuilder (see the sketch after this list).
  • Create a cache with a configurable number of endpoints.
  • Create a configurable time-to-live that evicts endpoints from the cache once it expires.
  • Emit cluster state updates when an inference endpoint is updated or deleted, and listen for those updates to evict entries from the cache.
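A minimal sketch of what such a cache could look like, built on ES's org.elasticsearch.common.cache.CacheBuilder. The class name, setting keys, defaults, and the generic value type are placeholders for illustration, not a final design:

```java
import org.elasticsearch.common.cache.Cache;
import org.elasticsearch.common.cache.CacheBuilder;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

/**
 * Sketch only: caches inference endpoint configs by inference entity id.
 * Setting keys and defaults below are placeholders, not a proposal.
 */
public class InferenceEndpointCache<T> {

    // Hypothetical node settings for the two knobs in the requirements above.
    public static final Setting<Integer> MAX_ENDPOINTS = Setting.intSetting(
        "xpack.inference.endpoint_cache.max_size", 100, Setting.Property.NodeScope);
    public static final Setting<TimeValue> TTL = Setting.timeSetting(
        "xpack.inference.endpoint_cache.ttl", TimeValue.timeValueMinutes(15), Setting.Property.NodeScope);

    private final Cache<String, T> cache;

    public InferenceEndpointCache(int maxEndpoints, TimeValue ttl) {
        this.cache = CacheBuilder.<String, T>builder()
            .setMaximumWeight(maxEndpoints) // entries weigh 1 by default, so weight == entry count
            .setExpireAfterWrite(ttl)       // time-to-live eviction
            .build();
    }

    public T get(String inferenceEntityId) {
        return cache.get(inferenceEntityId); // null on miss or after expiry
    }

    public void put(String inferenceEntityId, T endpoint) {
        cache.put(inferenceEntityId, endpoint);
    }

    public void invalidate(String inferenceEntityId) {
        cache.invalidate(inferenceEntityId);
    }

    public void invalidateAll() {
        cache.invalidateAll();
    }
}
```

Note the choice of setExpireAfterWrite over setExpireAfterAccess: with expire-after-access, a constantly-read endpoint would never expire, so an out-of-band change could be missed indefinitely if the cluster state eviction path ever failed to fire.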

ES has a Cache class; it's used in the ML ModelLoadingService. That service loads the boosted tree models and has nothing to do with the NLP models or the Inference API.
There is potentially a ModelLoadingService on every node, and sometimes its cache needs to be invalidated. There is an action for exactly that which updates the cluster state; ModelLoadingService watches for a specific change to the cluster state metadata and then invalidates its cache. That is the same behavior we want for the ModelRegistry.
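Building on the cache sketch above, the eviction side could follow the same pattern: a ClusterStateListener registered with the ClusterService. This assumes endpoint updates and deletes will surface as cluster state metadata changes (which this issue still has to add), and the all-or-nothing invalidation is a deliberate simplification:

```java
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.service.ClusterService;

/**
 * Sketch only: evicts cached endpoints when the cluster state metadata
 * changes, mirroring how ModelLoadingService invalidates its cache.
 */
public class InferenceEndpointCacheInvalidator implements ClusterStateListener {

    private final InferenceEndpointCache<?> endpointCache;

    public InferenceEndpointCacheInvalidator(ClusterService clusterService, InferenceEndpointCache<?> endpointCache) {
        this.endpointCache = endpointCache;
        clusterService.addListener(this); // receive every cluster state update on this node
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Same guard ModelLoadingService uses: ignore updates that didn't touch metadata.
        if (event.metadataChanged() == false) {
            return;
        }
        // Placeholder: a real implementation would diff the inference-specific
        // custom metadata and evict only the updated or deleted endpoints;
        // this sketch simply clears the whole cache.
        endpointCache.invalidateAll();
    }
}
```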
