Class TritonModelStatistics
This class aggregates performance metrics for a specific model, including inference counts, timing statistics, memory usage, and response statistics. It provides convenient methods to calculate derived metrics such as average latency, success rate, and memory usage.
This is an immutable object that wraps the gRPC message ModelStatistics.
Convenience Methods:
- getAverageComputeMs() - Calculate average inference computation time
- getSuccessRate() - Calculate percentage of successful inferences
- getBatchingEfficiency() - Calculate batching efficiency ratio
- getTotalGpuMemoryUsage() - Sum GPU memory usage across all instances
- Since:
- 1.0.0
- Author:
- sachachoumiloff
-
Method Summary
- static TritonModelStatistics fromProto
- double getAverageComputeMs() - Returns the average inference computation time in milliseconds.
- double getAverageQueueMs() - Returns the average queue wait time in milliseconds.
- double getBatchingEfficiency() - Returns the batching efficiency ratio.
- long getExecustionCount() - Returns the total number of execution batches for this model.
- long getInferenceCount() - Returns the total number of inferences executed by this model.
- long getLastInference() - Returns the timestamp of the last inference in microseconds (Unix epoch).
- String getModelIdentifier() - Returns a human-readable identifier for this model.
- String getName() - Returns the name of the model.
- double getSuccessRate() - Returns the success rate as a ratio between 0.0 and 1.0.
- long getTotalGpuMemoryUsage() - Returns the total GPU memory usage in bytes across all model instances.
- getTritonInferStatistics() - Returns detailed inference timing statistics.
- String getVersion() - Returns the version of the model.
- boolean hasCacheHits() - Checks if there have been any cache hits for this model.
-
Method Details
-
fromProto

Builds a TritonModelStatistics instance from the underlying gRPC ModelStatistics message.
-
getAverageComputeMs
public double getAverageComputeMs()

Returns the average inference computation time in milliseconds. This is calculated as total compute time divided by the number of inferences.
- Returns:
- the average computation time in milliseconds, or 0.0 if no inferences
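The calculation described above can be sketched as plain arithmetic. The counter names below (totalComputeNs, inferenceCount) are illustrative stand-ins for the wrapper's internal fields, assuming Triton's cumulative nanosecond duration counters:

```java
// Minimal sketch of the average-compute-time calculation described above.
// Assumes a cumulative compute duration in nanoseconds, as Triton reports it.
public class AverageComputeSketch {
    static double averageComputeMs(long totalComputeNs, long inferenceCount) {
        if (inferenceCount == 0) {
            return 0.0; // matches the documented "0.0 if no inferences" behavior
        }
        return (totalComputeNs / (double) inferenceCount) / 1_000_000.0; // ns -> ms
    }

    public static void main(String[] args) {
        // 5 inferences totaling 25 ms of compute time
        System.out.println(averageComputeMs(25_000_000L, 5)); // 5.0
    }
}
```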
-
getAverageQueueMs
public double getAverageQueueMs()

Returns the average queue wait time in milliseconds. This represents the average time requests spent waiting before being processed.
- Returns:
- the average queue wait time in milliseconds, or 0.0 if no inferences
-
getSuccessRate
public double getSuccessRate()

Returns the success rate as a ratio between 0.0 and 1.0. This is calculated as successful inferences divided by total inferences (successful + failed).
- Returns:
- the success rate (0.0 = all failed, 1.0 = all successful)
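The ratio above can be sketched as follows; the counter names (successCount, failCount) are hypothetical stand-ins for the wrapper's internal state:

```java
// Sketch of the success-rate calculation: successes / (successes + failures).
public class SuccessRateSketch {
    static double successRate(long successCount, long failCount) {
        long total = successCount + failCount;
        if (total == 0) {
            return 0.0; // no inferences recorded yet
        }
        return successCount / (double) total;
    }

    public static void main(String[] args) {
        // 9 successful inferences out of 10 total
        System.out.println(successRate(9, 1)); // 0.9
    }
}
```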
-
getBatchingEfficiency
public double getBatchingEfficiency()

Returns the batching efficiency ratio. This is calculated as total inferences divided by total execution batches. A higher ratio indicates better batching efficiency.
- Returns:
- the batching efficiency ratio (inferences per execution), or 0.0 if no executions
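The inferences-per-execution ratio can be sketched like this; the parameter names are illustrative assumptions:

```java
// Sketch of batching efficiency: total inferences / total execution batches.
// A value of 1.0 means no batching; higher values mean more requests per batch.
public class BatchingEfficiencySketch {
    static double batchingEfficiency(long inferenceCount, long executionCount) {
        if (executionCount == 0) {
            return 0.0; // documented behavior when nothing has executed
        }
        return inferenceCount / (double) executionCount;
    }

    public static void main(String[] args) {
        // 100 inferences served by 25 execution batches -> 4 inferences per batch
        System.out.println(batchingEfficiency(100, 25)); // 4.0
    }
}
```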
-
getTotalGpuMemoryUsage
public long getTotalGpuMemoryUsage()

Returns the total GPU memory usage in bytes across all model instances.
- Returns:
- total GPU memory usage in bytes
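The summation over instances can be sketched as below; the per-instance byte list is a hypothetical representation of the data the wrapper aggregates:

```java
import java.util.List;

// Sketch of summing GPU memory usage across model instances, in bytes.
public class GpuMemorySumSketch {
    static long totalGpuMemoryBytes(List<Long> perInstanceBytes) {
        long total = 0;
        for (long bytes : perInstanceBytes) {
            total += bytes;
        }
        return total;
    }

    public static void main(String[] args) {
        // two instances at 512 MiB each
        System.out.println(totalGpuMemoryBytes(List.of(536_870_912L, 536_870_912L))); // 1073741824
    }
}
```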
-
hasCacheHits
public boolean hasCacheHits()

Checks if there have been any cache hits for this model.
- Returns:
- true if at least one cache hit occurred, false otherwise
-
getModelIdentifier
public String getModelIdentifier()

Returns a human-readable identifier for this model. Format: "model_name (vX.Y.Z)"
- Returns:
- the model identifier string
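The documented "model_name (vX.Y.Z)" format can be reproduced with a simple format string; this is a sketch, not the class's actual implementation:

```java
// Sketch of the "model_name (vX.Y.Z)" identifier format described above.
public class ModelIdentifierSketch {
    static String modelIdentifier(String name, String version) {
        return String.format("%s (v%s)", name, version);
    }

    public static void main(String[] args) {
        System.out.println(modelIdentifier("resnet50", "1.2.0")); // resnet50 (v1.2.0)
    }
}
```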
-
getName
public String getName()

Returns the name of the model.
- Returns:
- the model name
-
getInferenceCount
public long getInferenceCount()

Returns the total number of inferences executed by this model.
- Returns:
- the total inference count
-
getVersion
public String getVersion()

Returns the version of the model.
- Returns:
- the model version string
-
getLastInference
public long getLastInference()

Returns the timestamp of the last inference in microseconds (Unix epoch).
- Returns:
- the last inference timestamp
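A microsecond Unix-epoch timestamp can be converted to a java.time.Instant by splitting it into seconds and a nanosecond adjustment; this is an illustrative consumption sketch:

```java
import java.time.Instant;

// Sketch: convert a microsecond Unix-epoch timestamp (as returned by
// getLastInference()) into a java.time.Instant.
public class LastInferenceSketch {
    static Instant fromEpochMicros(long micros) {
        return Instant.ofEpochSecond(micros / 1_000_000L,
                                     (micros % 1_000_000L) * 1_000L);
    }

    public static void main(String[] args) {
        System.out.println(fromEpochMicros(1_700_000_000_123_456L)); // 2023-11-14T22:13:20.123456Z
    }
}
```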
-
getExecustionCount
public long getExecustionCount()

Returns the total number of execution batches for this model.
- Returns:
- the execution count
-
getTritonInferStatistics
Returns detailed inference timing statistics.
- Returns:
- the inference statistics object containing compute, queue, and status timings
-