Class TritonGrpcClient

java.lang.Object
com.gencior.triton.grpc.TritonGrpcClient
All Implemented Interfaces:
TritonClient, AutoCloseable

public class TritonGrpcClient extends Object implements TritonClient
gRPC-based implementation of the TritonClient for communicating with NVIDIA Triton Inference Server.

This class provides a high-performance client implementation that uses gRPC for synchronous and asynchronous communication with Triton. It handles all aspects of client-server interaction, including connection management, request timeout handling, and response parsing.

Features:

  • Synchronous Inference: Blocking inference requests via infer(String, List)
  • Asynchronous Inference: Non-blocking inference with CompletableFuture via inferAsync(String, List)
  • Server Monitoring: Health checks and availability queries
  • Model Management: Load/unload models, query metadata and statistics
  • Automatic Timeouts: Configurable per-request timeouts via TritonClientConfig
  • Error Handling: Graceful handling of gRPC errors with optional verbose logging

Usage Example:


 TritonClientConfig config = TritonClientConfig.builder()
     .url("localhost:8001")
     .defaultTimeoutMs(30000)
     .verbose(true)
     .build();

 TritonGrpcClient client = new TritonGrpcClient(config);
 try {
     // Check server health
     if (client.isServerReady()) {
         // Get model metadata
         TritonModelMetadata metadata = client.getModelMetadata("my_model");
         System.out.println("Model: " + metadata.getName());

         // Perform inference
         List<InferInput> inputs = Arrays.asList(...);
         InferResult result = client.infer("my_model", inputs);
         System.out.println("Output: " + result.getOutputAsString("output_0"));
     }
 } finally {
     client.close();
 }
 

Thread Safety:

This client is thread-safe and can be shared across multiple threads. The underlying gRPC channel handles concurrent requests efficiently.

Resource Management:

Always call close() to release the underlying gRPC channel and clean up resources. Consider using try-with-resources or try-finally blocks to guarantee cleanup.
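Because the client implements AutoCloseable, the try-finally pattern from the usage example above can also be written with try-with-resources. A minimal sketch (model and tensor names are placeholders):

```java
TritonClientConfig config = TritonClientConfig.builder()
        .url("localhost:8001")
        .defaultTimeoutMs(30000)
        .build();

// try-with-resources calls close() automatically, even if infer() throws
try (TritonGrpcClient client = new TritonGrpcClient(config)) {
    if (client.isServerReady()) {
        List<InferInput> inputs = Arrays.asList(/* prepared input tensors */);
        InferResult result = client.infer("my_model", inputs);
        System.out.println(result.getOutputAsString("output_0"));
    }
} // the gRPC channel is shut down here
```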

Since:
1.0.0
Author:
sachachoumiloff
  • Constructor Details

    • TritonGrpcClient

      public TritonGrpcClient(TritonClientConfig config)
      Creates a new TritonGrpcClient with the given configuration.

      Initializes a connection to the Triton server specified in the configuration. The underlying gRPC channel is created with plaintext (non-TLS) communication. TLS support can be added in future versions if needed.

      Parameters:
      config - the client configuration specifying server URL, timeout, and other options
      Throws:
      io.grpc.StatusRuntimeException - if the connection fails
  • Method Details

    • isServerLive

      public boolean isServerLive()
      Checks if the Triton server is alive.

      This is a lightweight health check that verifies the server process is running. A server can be live but not ready if it's still initializing.

      Specified by:
      isServerLive in interface TritonClient
      Returns:
      true if the server is alive, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • isServerReady

      public boolean isServerReady()
      Checks if the Triton server is ready to accept requests.

      A ready server has completed initialization and is prepared to handle inference requests. This should be checked before attempting to perform inference.

      Specified by:
      isServerReady in interface TritonClient
      Returns:
      true if the server is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
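At startup it is common to wait for readiness before sending traffic. A minimal polling sketch (the one-second interval and 30-attempt limit are arbitrary choices, not part of the API):

```java
// Poll until the server reports live and ready, or give up after ~30 s
boolean ready = false;
for (int attempt = 0; attempt < 30 && !ready; attempt++) {
    try {
        ready = client.isServerLive() && client.isServerReady();
    } catch (io.grpc.StatusRuntimeException e) {
        // server may not be reachable yet; treat as not ready and retry
    }
    if (!ready) {
        Thread.sleep(1000);
    }
}
if (!ready) {
    throw new IllegalStateException("Triton server did not become ready in time");
}
```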
    • isModelReady

      public boolean isModelReady(String modelId, String modelVersion)
      Checks if a specific model is ready to accept inference requests.
      Specified by:
      isModelReady in interface TritonClient
      Parameters:
      modelId - the name of the model to check
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      true if the model is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • isModelReady

      public boolean isModelReady(String modelId)
      Checks if a specific model is ready to accept inference requests.
      Specified by:
      isModelReady in interface TritonClient
      Parameters:
      modelId - the name of the model to check
      Returns:
      true if the model is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getServerMetadata

      public TritonServerMetadata getServerMetadata()
      Retrieves comprehensive metadata about the Triton server.

      Returns information including server name, version, and supported extensions.

      Specified by:
      getServerMetadata in interface TritonClient
      Returns:
      the server metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getModelMetadata

      public TritonModelMetadata getModelMetadata(String modelId, String modelVersion)
      Retrieves metadata about a specific model's inputs and outputs.

      The metadata includes tensor names, data types, and shapes for the model's inputs and outputs, which is essential for correctly formatting inference requests.

      Specified by:
      getModelMetadata in interface TritonClient
      Parameters:
      modelId - the name of the model
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      the model metadata including inputs and outputs schema
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
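The returned schema is typically used to build matching InferInput objects. A sketch of inspecting it first (only getName() is shown elsewhere in this documentation; the getInputs(), getDatatype(), and getShape() accessors are assumed here and should be checked against TritonModelMetadata):

```java
TritonModelMetadata metadata = client.getModelMetadata("my_model", null);
System.out.println("Model: " + metadata.getName());
// Print the expected input schema before constructing inference requests
for (var input : metadata.getInputs()) {            // accessor names assumed
    System.out.printf("input %s: type=%s shape=%s%n",
            input.getName(), input.getDatatype(), input.getShape());
}
```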
    • getModelConfig

      public TritonModelConfig getModelConfig(String modelId, String modelVersion)
      Retrieves runtime configuration information for a specific model.

      The configuration includes platform type, backend, runtime environment, batching capabilities, and model file mappings.

      Specified by:
      getModelConfig in interface TritonClient
      Parameters:
      modelId - the name of the model
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      the model runtime configuration
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
    • getModelConfig

      public TritonModelConfig getModelConfig(String modelId)
      Retrieves runtime configuration information for a specific model (latest version).
      Specified by:
      getModelConfig in interface TritonClient
      Parameters:
      modelId - the name of the model
      Returns:
      the model runtime configuration
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
    • getModelRepositoryIndex

      public TritonRepositoryIndex getModelRepositoryIndex()
      Retrieves the repository index containing all available models and their status.

      Returns a listing of all models in the repository, including their names, versions, availability status, and reasons for unavailability if applicable.

      Specified by:
      getModelRepositoryIndex in interface TritonClient
      Returns:
      the repository index with all models information
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • loadModel

      public void loadModel(String modelId)
      Requests the server to load a model.

      Asynchronously loads the specified model into memory. The model will become available for inference once loading completes. Check model readiness after calling this method.

      Specified by:
      loadModel in interface TritonClient
      Parameters:
      modelId - the name of the model to load
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
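Because loading is asynchronous, pairing loadModel(String) with a readiness poll is a common pattern. A sketch (the 500 ms interval and 20-attempt limit are arbitrary):

```java
client.loadModel("my_model");
// Loading happens server-side; poll until the model reports ready
boolean modelReady = false;
for (int i = 0; i < 20 && !modelReady; i++) {
    modelReady = client.isModelReady("my_model");
    if (!modelReady) {
        Thread.sleep(500);
    }
}
if (!modelReady) {
    throw new IllegalStateException("my_model failed to become ready");
}
```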
    • unLoadModel

      public void unLoadModel(String modelId)
      Requests the server to unload a model.

      Unloads the specified model from memory, freeing associated resources. The model will no longer be available for inference after this call completes.

      Specified by:
      unLoadModel in interface TritonClient
      Parameters:
      modelId - the name of the model to unload
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getInferenceStatistics

      public List<TritonModelStatistics> getInferenceStatistics(String modelId, String modelVersion)
      Retrieves comprehensive inference statistics for a model.

      Returns performance metrics including inference counts, timing statistics (queue time, compute time, etc.), memory usage, and response statistics. Can query all versions or a specific version.

      Specified by:
      getInferenceStatistics in interface TritonClient
      Parameters:
      modelId - the name of the model (can be null to get statistics for all models)
      modelVersion - the version of the model (can be null for all versions)
      Returns:
      a list of model statistics objects
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • infer

      public InferResult infer(String modelId, String modelVersion, List<InferInput> inputs, Map<String,GrpcService.InferParameter> customParameters)
      Performs a synchronous (blocking) inference request with custom parameters.

      This method blocks until the inference result is returned from the server or a timeout occurs. Timeout is controlled via TritonClientConfig.getDefaultTimeoutMs().

      Input Validation:

      All inputs must have raw content available. Inputs are validated to match the model's expected schema (names, data types, shapes) on the server side.

      Specified by:
      infer in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      modelVersion - the version of the model (can be null for latest version)
      inputs - list of input tensors with data prepared for the model
      customParameters - optional map of custom parameters to control inference behavior
      Returns:
      the inference result containing output tensors and response metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or times out
      TritonDataNotFoundException - if an input lacks raw content
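Custom parameters are passed as protobuf GrpcService.InferParameter values. A sketch of building the map (the "sequence_id"/"sequence_start" keys are illustrative; the setInt64Param()/setBoolParam() setters follow the usual protobuf builder convention and should be verified against the generated GrpcService classes):

```java
Map<String, GrpcService.InferParameter> params = new HashMap<>();
params.put("sequence_id",
        GrpcService.InferParameter.newBuilder().setInt64Param(42L).build());
params.put("sequence_start",
        GrpcService.InferParameter.newBuilder().setBoolParam(true).build());

// null modelVersion selects the latest version
InferResult result = client.infer("my_model", null, inputs, params);
```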
    • infer

      public InferResult infer(String modelId, List<InferInput> inputs)
      Performs a synchronous (blocking) inference request.

      This method blocks until the inference result is returned from the server or a timeout occurs. Inference is performed on the latest version of the model.

      Specified by:
      infer in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      inputs - list of input tensors with data prepared for the model
      Returns:
      the inference result containing output tensors and response metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or times out
      TritonDataNotFoundException - if an input lacks raw content
    • inferAsync

      public CompletableFuture<InferResult> inferAsync(String modelId, String modelVersion, List<InferInput> inputs, Map<String,GrpcService.InferParameter> customParameters)
      Performs an asynchronous (non-blocking) inference request with custom parameters.

      This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. The request is executed concurrently in the background. Use the returned future to handle the result or errors.

      Error Handling:

      Errors can occur during request construction (synchronously) or during server processing (asynchronously). The returned future will be completed exceptionally in case of errors.

      Example:

      
       CompletableFuture<InferResult> future = client.inferAsync(modelId, inputs);
       future.whenComplete((result, error) -> {
           if (error != null) {
               System.err.println("Inference failed: " + error.getMessage());
           } else {
               System.out.println("Result: " + result.getOutputAsString("output_0"));
           }
       });
       
      Specified by:
      inferAsync in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      modelVersion - the version of the model (can be null for latest version)
      inputs - list of input tensors with data prepared for the model
      customParameters - optional map of custom parameters to control inference behavior
      Returns:
      a CompletableFuture that will be completed with the inference result
    • inferAsync

      public CompletableFuture<InferResult> inferAsync(String modelId, List<InferInput> inputs)
      Performs an asynchronous (non-blocking) inference request.

      This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. Inference is performed on the latest version of the model.

      Specified by:
      inferAsync in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      inputs - list of input tensors with data prepared for the model
      Returns:
      a CompletableFuture that will be completed with the inference result
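The returned future composes with the standard CompletableFuture operators, for example to bound the wait or transform the output. A sketch (orTimeout requires Java 9+; the tensor name is a placeholder):

```java
CompletableFuture<String> output = client.inferAsync("my_model", inputs)
        .orTimeout(30, TimeUnit.SECONDS)                 // fail if no response in time
        .thenApply(result -> result.getOutputAsString("output_0"))
        .exceptionally(error -> {
            System.err.println("Inference failed: " + error.getMessage());
            return null;
        });
```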
    • close

      public void close() throws Exception
      Closes the client and releases the underlying gRPC channel.

      This method should be called when the client is no longer needed to free system resources. After calling close(), the client cannot be used for further requests.

Attempts to gracefully shut down the channel with a 5-second timeout. If shutdown does not complete within 5 seconds, the channel is forcefully terminated.

      Specified by:
      close in interface AutoCloseable
      Throws:
      Exception - if an error occurs during shutdown