[ML] Cache Inference Endpoints #133135

Description

@prwhelan

Inference currently reloads the inference endpoint on every inference request: https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/inference/src/main/java/org/elasticsearch/xpack/inference/action/BaseTransportInferenceAction.java#L189

We can optimize this search out of the inference call because:

  1. these endpoints tend to be high read and low write
  2. there tends to be only a handful of regularly used endpoints

This would have a low impact on normal operations but could reduce load on search nodes during periods of high inference traffic.

Requirements:

  • Potentially use CacheBuilder (see the sketch after this list).
  • Create a cache with a configurable number of endpoints.
  • Create a configurable time-to-live that evicts endpoints from the cache once it expires.
  • Emit cluster state updates when an inference endpoint is updated or deleted, and listen for those updates to evict entries from the cache.
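A minimal sketch of what such a cache could look like, built on ES's org.elasticsearch.common.cache.CacheBuilder. The class name, setting keys, defaults, and the generic value type are placeholders for illustration, not a final design:

```java
import org.elasticsearch.common.cache.Cache;
import org.elasticsearch.common.cache.CacheBuilder;
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.core.TimeValue;

/**
 * Sketch only: caches inference endpoint configs by inference entity id.
 * Setting keys and defaults below are placeholders, not a proposal.
 */
public class InferenceEndpointCache<T> {

    // Hypothetical node settings for the two knobs in the requirements above.
    public static final Setting<Integer> MAX_ENDPOINTS = Setting.intSetting(
        "xpack.inference.endpoint_cache.max_size", 100, Setting.Property.NodeScope);
    public static final Setting<TimeValue> TTL = Setting.timeSetting(
        "xpack.inference.endpoint_cache.ttl", TimeValue.timeValueMinutes(15), Setting.Property.NodeScope);

    private final Cache<String, T> cache;

    public InferenceEndpointCache(int maxEndpoints, TimeValue ttl) {
        this.cache = CacheBuilder.<String, T>builder()
            .setMaximumWeight(maxEndpoints) // entries weigh 1 by default, so weight == entry count
            .setExpireAfterWrite(ttl)       // time-to-live eviction
            .build();
    }

    public T get(String inferenceEntityId) {
        return cache.get(inferenceEntityId); // null on miss or after expiry
    }

    public void put(String inferenceEntityId, T endpoint) {
        cache.put(inferenceEntityId, endpoint);
    }

    public void invalidate(String inferenceEntityId) {
        cache.invalidate(inferenceEntityId);
    }

    public void invalidateAll() {
        cache.invalidateAll();
    }
}
```

Note the choice of setExpireAfterWrite over setExpireAfterAccess: with expire-after-access, a constantly-read endpoint would never expire, so an out-of-band change could be missed indefinitely if the cluster state eviction path ever failed to fire.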

ES has a Cache class; it's used in the ML ModelLoadingService. That service loads the boosted tree models and has nothing to do with the NLP models or the Inference API.
There is potentially a ModelLoadingService on every node, and sometimes its cache needs to be invalidated. There is an action for exactly that which updates the cluster state; ModelLoadingService watches for a specific change to the cluster state metadata and then invalidates its cache. That is the same behavior we want for the ModelRegistry.
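Building on the cache sketch above, the eviction side could follow the same pattern: a ClusterStateListener registered with the ClusterService. This assumes endpoint updates and deletes will surface as cluster state metadata changes (which this issue still has to add), and the all-or-nothing invalidation is a deliberate simplification:

```java
import org.elasticsearch.cluster.ClusterChangedEvent;
import org.elasticsearch.cluster.ClusterStateListener;
import org.elasticsearch.cluster.service.ClusterService;

/**
 * Sketch only: evicts cached endpoints when the cluster state metadata
 * changes, mirroring how ModelLoadingService invalidates its cache.
 */
public class InferenceEndpointCacheInvalidator implements ClusterStateListener {

    private final InferenceEndpointCache<?> endpointCache;

    public InferenceEndpointCacheInvalidator(ClusterService clusterService, InferenceEndpointCache<?> endpointCache) {
        this.endpointCache = endpointCache;
        clusterService.addListener(this); // receive every cluster state update on this node
    }

    @Override
    public void clusterChanged(ClusterChangedEvent event) {
        // Same guard ModelLoadingService uses: ignore updates that didn't touch metadata.
        if (event.metadataChanged() == false) {
            return;
        }
        // Placeholder: a real implementation would diff the inference-specific
        // custom metadata and evict only the updated or deleted endpoints;
        // this sketch simply clears the whole cache.
        endpointCache.invalidateAll();
    }
}
```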
