Class TritonGrpcClient
- All Implemented Interfaces:
TritonClient, AutoCloseable
This class provides a high-performance client implementation using gRPC (gRPC Remote Procedure Call) for synchronous and asynchronous communication with Triton. It handles all aspects of client-server interaction including connection management, request timeout handling, and response parsing.
Features:
- Synchronous Inference: Blocking inference requests via infer(String, List)
- Asynchronous Inference: Non-blocking inference with CompletableFuture via inferAsync(String, List)
- Server Monitoring: Health checks and availability queries
- Model Management: Load/unload models, query metadata and statistics
- Automatic Timeouts: Configurable per-request timeouts via TritonClientConfig
- Error Handling: Graceful handling of gRPC errors with optional verbose logging
Usage Example:
TritonClientConfig config = TritonClientConfig.builder()
.url("localhost:8001")
.defaultTimeoutMs(30000)
.verbose(true)
.build();
TritonGrpcClient client = new TritonGrpcClient(config);
try {
    // Check server health
    if (client.isServerReady()) {
        // Get model metadata
        TritonModelMetadata metadata = client.getModelMetadata("my_model");
        System.out.println("Model: " + metadata.getName());

        // Perform inference
        List<InferInput> inputs = Arrays.asList(...);
        InferResult result = client.infer("my_model", inputs);
        System.out.println("Output: " + result.getOutputAsString("output_0"));
    }
} finally {
    client.close();
}
Thread Safety:
This client is thread-safe and can be shared across multiple threads. The underlying gRPC channel handles concurrent requests efficiently.
Resource Management:
Always call close() to properly release the underlying gRPC channel and clean up resources. Consider using try-with-resources or try-finally blocks to ensure cleanup.
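Since TritonGrpcClient implements AutoCloseable, the try-with-resources form handles this automatically. A minimal, self-contained sketch of the pattern; DemoClient below is a hypothetical stand-in for TritonGrpcClient so the example runs without a Triton server, and the point is only that close() is invoked when the try block exits, even via an exception.

```java
// Self-contained sketch of the try-with-resources cleanup pattern.
// DemoClient is a hypothetical stand-in for TritonGrpcClient; both
// implement AutoCloseable, so close() runs automatically on exit.
public class TryWithResourcesDemo {

    static class DemoClient implements AutoCloseable {
        private final StringBuilder log;
        DemoClient(StringBuilder log) { this.log = log; }
        boolean isServerReady() { return true; }      // stands in for the health check
        @Override public void close() { log.append("closed"); }
    }

    /** Runs the pattern once and returns a trace of what happened. */
    public static String run() {
        StringBuilder log = new StringBuilder();
        try (DemoClient client = new DemoClient(log)) {
            if (client.isServerReady()) {
                log.append("inferred;");              // stands in for client.infer(...)
            }
        }                                             // close() happens here
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints "inferred;closed"
    }
}
```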
- Since:
- 1.0.0
- Author:
- sachachoumiloff
-
Constructor Summary
Constructors
- TritonGrpcClient(TritonClientConfig config)
  Creates a new TritonGrpcClient with the given configuration.
-
Method Summary
- void close()
  Closes the client and releases the underlying gRPC channel.
- getInferenceStatistics(String modelId, String modelVersion)
  Retrieves comprehensive inference statistics for a model.
- getModelConfig(String modelId)
  Retrieves runtime configuration information for a specific model (latest version).
- getModelConfig(String modelId, String modelVersion)
  Retrieves runtime configuration information for a specific model.
- getModelMetadata(String modelId, String modelVersion)
  Retrieves metadata about a specific model's inputs and outputs.
- getModelRepositoryIndex()
  Retrieves the repository index containing all available models and their status.
- getServerMetadata()
  Retrieves comprehensive metadata about the Triton server.
- InferResult infer(String modelId, String modelVersion, List<InferInput> inputs, Map<String, GrpcService.InferParameter> customParameters)
  Performs a synchronous (blocking) inference request with custom parameters.
- InferResult infer(String modelId, List<InferInput> inputs)
  Performs a synchronous (blocking) inference request.
- CompletableFuture<InferResult> inferAsync(String modelId, String modelVersion, List<InferInput> inputs, Map<String, GrpcService.InferParameter> customParameters)
  Performs an asynchronous (non-blocking) inference request with custom parameters.
- CompletableFuture<InferResult> inferAsync(String modelId, List<InferInput> inputs)
  Performs an asynchronous (non-blocking) inference request.
- boolean isModelReady(String modelId)
  Checks if a specific model is ready to accept inference requests.
- boolean isModelReady(String modelId, String modelVersion)
  Checks if a specific model is ready to accept inference requests.
- boolean isServerLive()
  Checks if the Triton server is alive.
- boolean isServerReady()
  Checks if the Triton server is ready to accept requests.
- void loadModel(String modelId)
  Requests the server to load a model.
- void unLoadModel(String modelId)
  Requests the server to unload a model.
-
Constructor Details
-
TritonGrpcClient
Creates a new TritonGrpcClient with the given configuration. Initializes a connection to the Triton server specified in the configuration. The underlying gRPC channel is created with plaintext (non-TLS) communication; TLS support can be added in future versions if needed.
- Parameters:
config - the client configuration specifying server URL, timeout, and other options
- Throws:
io.grpc.StatusRuntimeException - if the connection fails
-
-
Method Details
-
isServerLive
public boolean isServerLive()
Checks if the Triton server is alive. This is a lightweight health check that verifies the server process is running. A server can be live but not ready if it is still initializing.
- Specified by:
isServerLive in interface TritonClient
- Returns:
- true if the server is alive, false otherwise
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
isServerReady
public boolean isServerReady()
Checks if the Triton server is ready to accept requests. A ready server has completed initialization and is prepared to handle inference requests. This should be checked before attempting to perform inference.
- Specified by:
isServerReady in interface TritonClient
- Returns:
- true if the server is ready, false otherwise
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
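Because a server can be live before it is ready, callers commonly poll this check until it passes. A self-contained polling-helper sketch; AwaitReady and its parameters are hypothetical, and in real use the BooleanSupplier would wrap client::isServerReady (or () -> client.isModelReady("my_model")), which is why the check is abstracted so the example runs without a server.

```java
// Hypothetical polling helper: waits until a readiness check passes or
// a deadline expires. In real use the BooleanSupplier would wrap
// client::isServerReady or a model-readiness check.
import java.util.function.BooleanSupplier;

public class AwaitReady {

    /** Polls check every intervalMs until it returns true or timeoutMs elapses. */
    public static boolean await(BooleanSupplier check, long timeoutMs, long intervalMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (check.getAsBoolean()) {
                return true;
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return check.getAsBoolean(); // one final attempt at the deadline
    }

    public static void main(String[] args) {
        // Simulate a server that becomes ready on the third poll.
        int[] polls = {0};
        boolean ready = await(() -> ++polls[0] >= 3, 1_000, 10);
        System.out.println(ready); // prints "true"
    }
}
```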
-
isModelReady
Checks if a specific model is ready to accept inference requests.
- Specified by:
isModelReady in interface TritonClient
- Parameters:
modelId - the name of the model to check
modelVersion - the version of the model (can be null for the latest version)
- Returns:
- true if the model is ready, false otherwise
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
isModelReady
Checks if a specific model is ready to accept inference requests.
- Specified by:
isModelReady in interface TritonClient
- Parameters:
modelId - the name of the model to check
- Returns:
- true if the model is ready, false otherwise
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
getServerMetadata
Retrieves comprehensive metadata about the Triton server. Returns information including server name, version, and supported extensions.
- Specified by:
getServerMetadata in interface TritonClient
- Returns:
- the server metadata
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
getModelMetadata
Retrieves metadata about a specific model's inputs and outputs. The metadata includes tensor names, data types, and shapes for the model's inputs and outputs, which is essential for correctly formatting inference requests.
- Specified by:
getModelMetadata in interface TritonClient
- Parameters:
modelId - the name of the model
modelVersion - the version of the model (can be null for the latest version)
- Returns:
- the model metadata including the inputs and outputs schema
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails or the model is not found
-
getModelConfig
Retrieves runtime configuration information for a specific model. The configuration includes platform type, backend, runtime environment, batching capabilities, and model file mappings.
- Specified by:
getModelConfig in interface TritonClient
- Parameters:
modelId - the name of the model
modelVersion - the version of the model (can be null for the latest version)
- Returns:
- the model runtime configuration
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails or the model is not found
-
getModelConfig
Retrieves runtime configuration information for a specific model (latest version).
- Specified by:
getModelConfig in interface TritonClient
- Parameters:
modelId - the name of the model
- Returns:
- the model runtime configuration
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails or the model is not found
-
getModelRepositoryIndex
Retrieves the repository index containing all available models and their status. Returns a listing of all models in the repository, including their names, versions, availability status, and reasons for unavailability where applicable.
- Specified by:
getModelRepositoryIndex in interface TritonClient
- Returns:
- the repository index with information on all models
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
loadModel
Requests the server to load a model. Asynchronously loads the specified model into memory. The model will become available for inference once loading completes. Check model readiness after calling this method.
- Specified by:
loadModel in interface TritonClient
- Parameters:
modelId - the name of the model to load
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
unLoadModel
Requests the server to unload a model. Unloads the specified model from memory, freeing associated resources. The model will no longer be available for inference after this call completes.
- Specified by:
unLoadModel in interface TritonClient
- Parameters:
modelId - the name of the model to unload
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
getInferenceStatistics
Retrieves comprehensive inference statistics for a model. Returns performance metrics including inference counts, timing statistics (queue time, compute time, etc.), memory usage, and response statistics. Can query all versions or a specific version.
- Specified by:
getInferenceStatistics in interface TritonClient
- Parameters:
modelId - the name of the model (can be null to get statistics for all models)
modelVersion - the version of the model (can be null for all versions)
- Returns:
- a list of model statistics objects
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails
-
infer
public InferResult infer(String modelId, String modelVersion, List<InferInput> inputs, Map<String, GrpcService.InferParameter> customParameters)
Performs a synchronous (blocking) inference request with custom parameters. This method blocks until the inference result is returned from the server or a timeout occurs. The timeout is controlled via TritonClientConfig.getDefaultTimeoutMs().
Input Validation:
All inputs must have raw content available. Inputs are validated to match the model's expected schema (names, data types, shapes) on the server side.
- Specified by:
infer in interface TritonClient
- Parameters:
modelId - the name of the model to run inference on
modelVersion - the version of the model (can be null for the latest version)
inputs - list of input tensors with data prepared for the model
customParameters - optional map of custom parameters to control inference behavior
- Returns:
- the inference result containing output tensors and response metadata
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails or times out
TritonDataNotFoundException - if an input lacks raw content
-
infer
Performs a synchronous (blocking) inference request. This method blocks until the inference result is returned from the server or a timeout occurs. Inference is performed on the latest version of the model.
- Specified by:
infer in interface TritonClient
- Parameters:
modelId - the name of the model to run inference on
inputs - list of input tensors with data prepared for the model
- Returns:
- the inference result containing output tensors and response metadata
- Throws:
io.grpc.StatusRuntimeException - if the gRPC call fails or times out
TritonDataNotFoundException - if an input lacks raw content
-
inferAsync
public CompletableFuture<InferResult> inferAsync(String modelId, String modelVersion, List<InferInput> inputs, Map<String, GrpcService.InferParameter> customParameters)
Performs an asynchronous (non-blocking) inference request with custom parameters. This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. The request is executed concurrently in the background; use the returned future to handle the result or errors.
Error Handling:
Errors can occur during request construction (synchronously) or during server processing (asynchronously). The returned future will be completed exceptionally in case of errors.
Example:
CompletableFuture<InferResult> future = client.inferAsync(modelId, inputs);
future.whenComplete((result, error) -> {
    if (error != null) {
        System.err.println("Inference failed: " + error.getMessage());
    } else {
        System.out.println("Result: " + result.getOutputAsString("output_0"));
    }
});
- Specified by:
inferAsync in interface TritonClient
- Parameters:
modelId - the name of the model to run inference on
modelVersion - the version of the model (can be null for the latest version)
inputs - list of input tensors with data prepared for the model
customParameters - optional map of custom parameters to control inference behavior
- Returns:
- a CompletableFuture that will be completed with the inference result
-
inferAsync
Performs an asynchronous (non-blocking) inference request. This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. Inference is performed on the latest version of the model.
- Specified by:
inferAsync in interface TritonClient
- Parameters:
modelId - the name of the model to run inference on
inputs - list of input tensors with data prepared for the model
- Returns:
- a CompletableFuture that will be completed with the inference result
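The returned future composes with the usual CompletableFuture combinators. A self-contained sketch of one handling pattern; the pre-completed String futures below are stand-ins for what inferAsync would return (so the example runs without a server), and the thenApply/exceptionally wiring is the same either way.

```java
// Self-contained sketch of handling an inferAsync-style future. The
// String futures stand in for CompletableFuture<InferResult>.
import java.util.concurrent.CompletableFuture;

public class AsyncHandlingDemo {

    /** Maps a successful result to a message and any failure to an error message. */
    public static String handle(CompletableFuture<String> future) {
        return future
                .thenApply(result -> "Result: " + result)
                // a failed prior stage arrives wrapped in a CompletionException
                .exceptionally(err -> "Inference failed: " + err.getCause().getMessage())
                .join();
    }

    public static void main(String[] args) {
        CompletableFuture<String> ok = CompletableFuture.completedFuture("output_0=42");
        System.out.println(handle(ok)); // prints "Result: output_0=42"

        CompletableFuture<String> failed = new CompletableFuture<>();
        failed.completeExceptionally(new RuntimeException("model not found"));
        System.out.println(handle(failed)); // prints "Inference failed: model not found"
    }
}
```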
-
close
Closes the client and releases the underlying gRPC channel. This method should be called when the client is no longer needed, to free system resources. After calling close(), the client cannot be used for further requests.
Attempts to gracefully shut down the channel with a 5-second timeout. If shutdown does not complete within 5 seconds, the channel is forcefully terminated.
- Specified by:
close in interface AutoCloseable
- Throws:
Exception - if an error occurs during shutdown
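The graceful-then-forceful sequence described above is the standard bounded-shutdown idiom. A self-contained sketch of the same pattern applied to an ExecutorService; the real client applies this sequence to its gRPC channel, which this sketch does not touch.

```java
// Self-contained sketch of the shutdown sequence close() describes:
// request shutdown, wait up to 5 seconds, then force termination.
// Demonstrated on an ExecutorService rather than a gRPC channel.
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class GracefulShutdownDemo {

    /** Returns true if the pool terminated gracefully within the timeout. */
    public static boolean shutdown(ExecutorService pool) {
        pool.shutdown();                                  // stop accepting new work
        try {
            if (!pool.awaitTermination(5, TimeUnit.SECONDS)) {
                pool.shutdownNow();                       // force: interrupt in-flight work
                return false;
            }
            return true;
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();           // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.submit(() -> System.out.println("task ran"));
        System.out.println(shutdown(pool)); // prints "true" (tasks finish well under 5s)
    }
}
```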
-