Class TritonGrpcClient

java.lang.Object
com.gencior.triton.grpc.TritonGrpcClient
All Implemented Interfaces:
TritonClient, AutoCloseable

public class TritonGrpcClient extends Object implements TritonClient
gRPC-based implementation of the TritonClient for communicating with NVIDIA Triton Inference Server.

This class provides a high-performance client implementation that uses gRPC for synchronous and asynchronous communication with Triton. It handles all aspects of client-server interaction, including connection management, request timeout handling, and response parsing.

Features:

  • Synchronous Inference: Blocking inference requests via infer(String, List)
  • Asynchronous Inference: Non-blocking inference with CompletableFuture via inferAsync(String, List)
  • Server Monitoring: Health checks and availability queries
  • Model Management: Load/unload models, query metadata and statistics
  • Automatic Timeouts: Configurable per-request timeouts via TritonClientConfig
  • Error Handling: Graceful handling of gRPC errors with optional verbose logging

Usage Example:


 TritonClientConfig config = TritonClientConfig.builder()
     .url("localhost:8001")
     .defaultTimeoutMs(30000)
     .verbose(true)
     .build();

 TritonGrpcClient client = new TritonGrpcClient(config);
 try {
     // Check server health
     if (client.isServerReady()) {
         // Get model metadata
         TritonModelMetadata metadata = client.getModelMetadata("my_model");
         System.out.println("Model: " + metadata.getName());

         // Perform inference
         List<InferInput> inputs = Arrays.asList(...);
         InferResult result = client.infer("my_model", inputs);
         System.out.println("Output: " + result.getOutputAsString("output_0"));
     }
 } finally {
     client.close();
 }
 

Thread Safety:

This client is thread-safe and can be shared across multiple threads. The underlying gRPC channel handles concurrent requests efficiently.

Resource Management:

Always call close() to release the underlying gRPC channel and clean up resources. Consider using try-with-resources or try-finally blocks to guarantee cleanup.
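Because the client implements AutoCloseable, the try-finally pattern from the usage example above can also be written with try-with-resources. A minimal sketch (model and tensor names are placeholders):

```java
TritonClientConfig config = TritonClientConfig.builder()
        .url("localhost:8001")
        .defaultTimeoutMs(30000)
        .build();

// try-with-resources calls close() automatically, even if infer() throws
try (TritonGrpcClient client = new TritonGrpcClient(config)) {
    if (client.isServerReady()) {
        List<InferInput> inputs = Arrays.asList(/* prepared input tensors */);
        InferResult result = client.infer("my_model", inputs);
        System.out.println(result.getOutputAsString("output_0"));
    }
} // the gRPC channel is shut down here
```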

Since:
1.0.0
Author:
sachachoumiloff
  • Constructor Details

    • TritonGrpcClient

      public TritonGrpcClient(TritonClientConfig config)
      Creates a new TritonGrpcClient with the given configuration.

      Initializes a connection to the Triton server specified in the configuration. The underlying gRPC channel is created with plaintext (non-TLS) communication. TLS support can be added in future versions if needed.

      Parameters:
      config - the client configuration specifying server URL, timeout, and other options
      Throws:
      io.grpc.StatusRuntimeException - if the connection fails
  • Method Details

    • isServerLive

      public boolean isServerLive()
      Checks if the Triton server is alive.

      This is a lightweight health check that verifies the server process is running. A server can be live but not ready if it's still initializing.

      Specified by:
      isServerLive in interface TritonClient
      Returns:
      true if the server is alive, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • isServerReady

      public boolean isServerReady()
      Checks if the Triton server is ready to accept requests.

      A ready server has completed initialization and is prepared to handle inference requests. This should be checked before attempting to perform inference.

      Specified by:
      isServerReady in interface TritonClient
      Returns:
      true if the server is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
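At startup it is common to wait for readiness before sending traffic. A minimal polling sketch (the one-second interval and 30-attempt limit are arbitrary choices, not part of the API):

```java
// Poll until the server reports live and ready, or give up after ~30 s
boolean ready = false;
for (int attempt = 0; attempt < 30 && !ready; attempt++) {
    try {
        ready = client.isServerLive() && client.isServerReady();
    } catch (io.grpc.StatusRuntimeException e) {
        // server may not be reachable yet; treat as not ready and retry
    }
    if (!ready) {
        Thread.sleep(1000);
    }
}
if (!ready) {
    throw new IllegalStateException("Triton server did not become ready in time");
}
```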
    • isModelReady

      public boolean isModelReady(String modelId, String modelVersion)
      Checks if a specific model is ready to accept inference requests.
      Specified by:
      isModelReady in interface TritonClient
      Parameters:
      modelId - the name of the model to check
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      true if the model is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • isModelReady

      public boolean isModelReady(String modelId)
      Checks if a specific model is ready to accept inference requests.
      Specified by:
      isModelReady in interface TritonClient
      Parameters:
      modelId - the name of the model to check
      Returns:
      true if the model is ready, false otherwise
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getServerMetadata

      public TritonServerMetadata getServerMetadata()
      Retrieves comprehensive metadata about the Triton server.

      Returns information including server name, version, and supported extensions.

      Specified by:
      getServerMetadata in interface TritonClient
      Returns:
      the server metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getModelMetadata

      public TritonModelMetadata getModelMetadata(String modelId, String modelVersion)
      Retrieves metadata about a specific model's inputs and outputs.

      The metadata includes tensor names, data types, and shapes for the model's inputs and outputs, which is essential for correctly formatting inference requests.

      Specified by:
      getModelMetadata in interface TritonClient
      Parameters:
      modelId - the name of the model
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      the model metadata including inputs and outputs schema
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
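The returned schema is typically used to build matching InferInput objects. A sketch of inspecting it first (only getName() is shown elsewhere in this documentation; the getInputs(), getDatatype(), and getShape() accessors are assumed here and should be checked against TritonModelMetadata):

```java
TritonModelMetadata metadata = client.getModelMetadata("my_model", null);
System.out.println("Model: " + metadata.getName());
// Print the expected input schema before constructing inference requests
for (var input : metadata.getInputs()) {            // accessor names assumed
    System.out.printf("input %s: type=%s shape=%s%n",
            input.getName(), input.getDatatype(), input.getShape());
}
```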
    • getModelConfig

      public TritonModelConfig getModelConfig(String modelId, String modelVersion)
      Retrieves runtime configuration information for a specific model.

      The configuration includes platform type, backend, runtime environment, batching capabilities, and model file mappings.

      Specified by:
      getModelConfig in interface TritonClient
      Parameters:
      modelId - the name of the model
      modelVersion - the version of the model (can be null for latest version)
      Returns:
      the model runtime configuration
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
    • getModelConfig

      public TritonModelConfig getModelConfig(String modelId)
      Retrieves runtime configuration information for a specific model (latest version).
      Specified by:
      getModelConfig in interface TritonClient
      Parameters:
      modelId - the name of the model
      Returns:
      the model runtime configuration
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or model not found
    • getModelRepositoryIndex

      public TritonRepositoryIndex getModelRepositoryIndex()
      Retrieves the repository index containing all available models and their status.

      Returns a listing of all models in the repository, including their names, versions, availability status, and reasons for unavailability if applicable.

      Specified by:
      getModelRepositoryIndex in interface TritonClient
      Returns:
      the repository index with all models information
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • loadModel

      public void loadModel(String modelId)
      Requests the server to load a model.

      Asynchronously loads the specified model into memory. The model will become available for inference once loading completes. Check model readiness after calling this method.

      Specified by:
      loadModel in interface TritonClient
      Parameters:
      modelId - the name of the model to load
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
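Because loading is asynchronous, pairing loadModel(String) with a readiness poll is a common pattern. A sketch (the 500 ms interval and 20-attempt limit are arbitrary):

```java
client.loadModel("my_model");
// Loading happens server-side; poll until the model reports ready
boolean modelReady = false;
for (int i = 0; i < 20 && !modelReady; i++) {
    modelReady = client.isModelReady("my_model");
    if (!modelReady) {
        Thread.sleep(500);
    }
}
if (!modelReady) {
    throw new IllegalStateException("my_model failed to become ready");
}
```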
    • unLoadModel

      public void unLoadModel(String modelId)
      Requests the server to unload a model.

      Unloads the specified model from memory, freeing associated resources. The model will no longer be available for inference after this call completes.

      Specified by:
      unLoadModel in interface TritonClient
      Parameters:
      modelId - the name of the model to unload
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • getInferenceStatistics

      public List<TritonModelStatistics> getInferenceStatistics(String modelId, String modelVersion)
      Retrieves comprehensive inference statistics for a model.

      Returns performance metrics including inference counts, timing statistics (queue time, compute time, etc.), memory usage, and response statistics. Can query all versions or a specific version.

      Specified by:
      getInferenceStatistics in interface TritonClient
      Parameters:
      modelId - the name of the model (can be null to get statistics for all models)
      modelVersion - the version of the model (can be null for all versions)
      Returns:
      a list of model statistics objects
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails
    • infer

      public InferResult infer(String modelId, String modelVersion, List<InferInput> inputs, Map<String,GrpcService.InferParameter> customParameters)
      Performs a synchronous (blocking) inference request with custom parameters.

      This method blocks until the inference result is returned from the server or a timeout occurs. Timeout is controlled via TritonClientConfig.getDefaultTimeoutMs().

      Input Validation:

      All inputs must have raw content available. Inputs are validated to match the model's expected schema (names, data types, shapes) on the server side.

      Specified by:
      infer in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      modelVersion - the version of the model (can be null for latest version)
      inputs - list of input tensors with data prepared for the model
      customParameters - optional map of custom parameters to control inference behavior
      Returns:
      the inference result containing output tensors and response metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or times out
      TritonDataNotFoundException - if an input lacks raw content
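Custom parameters are passed as protobuf GrpcService.InferParameter values. A sketch of building the map (the "sequence_id"/"sequence_start" keys are illustrative; the setInt64Param()/setBoolParam() setters follow the usual protobuf builder convention and should be verified against the generated GrpcService classes):

```java
Map<String, GrpcService.InferParameter> params = new HashMap<>();
params.put("sequence_id",
        GrpcService.InferParameter.newBuilder().setInt64Param(42L).build());
params.put("sequence_start",
        GrpcService.InferParameter.newBuilder().setBoolParam(true).build());

// null modelVersion selects the latest version
InferResult result = client.infer("my_model", null, inputs, params);
```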
    • infer

      public InferResult infer(String modelId, List<InferInput> inputs)
      Performs a synchronous (blocking) inference request.

      This method blocks until the inference result is returned from the server or a timeout occurs. Inference is performed on the latest version of the model.

      Specified by:
      infer in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      inputs - list of input tensors with data prepared for the model
      Returns:
      the inference result containing output tensors and response metadata
      Throws:
      io.grpc.StatusRuntimeException - if the gRPC call fails or times out
      TritonDataNotFoundException - if an input lacks raw content
    • inferAsync

      public CompletableFuture<InferResult> inferAsync(String modelId, String modelVersion, List<InferInput> inputs, Map<String,GrpcService.InferParameter> customParameters)
      Performs an asynchronous (non-blocking) inference request with custom parameters.

      This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. The request is executed concurrently in the background. Use the returned future to handle the result or errors.

      Error Handling:

      Errors can occur during request construction (synchronously) or during server processing (asynchronously). The returned future will be completed exceptionally in case of errors.

      Example:

      
       CompletableFuture<InferResult> future = client.inferAsync(modelId, inputs);
       future.whenComplete((result, error) -> {
           if (error != null) {
               System.err.println("Inference failed: " + error.getMessage());
           } else {
               System.out.println("Result: " + result.getOutputAsString("output_0"));
           }
       });
       
      Specified by:
      inferAsync in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      modelVersion - the version of the model (can be null for latest version)
      inputs - list of input tensors with data prepared for the model
      customParameters - optional map of custom parameters to control inference behavior
      Returns:
      a CompletableFuture that will be completed with the inference result
    • inferAsync

      public CompletableFuture<InferResult> inferAsync(String modelId, List<InferInput> inputs)
      Performs an asynchronous (non-blocking) inference request.

      This method returns immediately with a CompletableFuture that will be completed when the inference result is received from the server. Inference is performed on the latest version of the model.

      Specified by:
      inferAsync in interface TritonClient
      Parameters:
      modelId - the name of the model to run inference on
      inputs - list of input tensors with data prepared for the model
      Returns:
      a CompletableFuture that will be completed with the inference result
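The returned future composes with the standard CompletableFuture operators, for example to bound the wait or transform the output. A sketch (orTimeout requires Java 9+; the tensor name is a placeholder):

```java
CompletableFuture<String> output = client.inferAsync("my_model", inputs)
        .orTimeout(30, TimeUnit.SECONDS)                 // fail if no response in time
        .thenApply(result -> result.getOutputAsString("output_0"))
        .exceptionally(error -> {
            System.err.println("Inference failed: " + error.getMessage());
            return null;
        });
```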
    • close

      public void close() throws Exception
      Closes the client and releases the underlying gRPC channel.

      This method should be called when the client is no longer needed to free system resources. After calling close(), the client cannot be used for further requests.

Attempts to gracefully shut down the channel with a 5-second timeout. If shutdown does not complete within 5 seconds, the channel is forcefully terminated.

      Specified by:
      close in interface AutoCloseable
      Throws:
      Exception - if an error occurs during shutdown