Classification Optimization
Sanity Check Charts:
- No Failed Requests: Verify that all requests were successful.
- Monotonic Series: Ensure that we tried enough VUs.
- Accuracy Distribution: Evaluate the consistency of accuracy across runs.
Cost Analysis Charts
- Best Image by Cost Savings: Identify which image is cheapest for each configuration and by how much.
- Cost vs Latency: Identify optimal configurations balancing cost and latency.
- Cost vs VUs & Batch: Analyze cost trends based on VUs and batch size.
Best Config
Metric | Value |
---|---|
1B_cost | $253.82 |
hw_type | nvidia-l4 |
image | default |
batch_size | 64 |
vus | 448 |
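For context, a figure like `1B_cost` can be derived from measured throughput and an hourly instance price. The sketch below is only an illustration: both the formula and the price are assumptions, not the exact inputs used to produce the table above.

```python
# Hedged sketch: estimate the cost of serving 1B requests from measured throughput.
# Both the formula and `hourly_price_usd` are illustrative assumptions, not the
# actual pricing inputs behind the numbers in this report.
def cost_per_billion_requests(throughput_req_per_sec: float, hourly_price_usd: float) -> float:
    hours_needed = 1_000_000_000 / throughput_req_per_sec / 3600  # hours to serve 1B requests
    return hours_needed * hourly_price_usd

# Example with made-up numbers: 500 req/s on a GPU billed at $1.00/hour.
print(f"${cost_per_billion_requests(500, 1.00):,.2f} per 1B requests")
```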
Failed Requests Check
If all requests were successful, the total count of failed requests should be 0.
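A minimal check, assuming the results live in a DataFrame with a `failed_requests` column (the column name and file path are assumptions):

```python
import pandas as pd

# One row per benchmark run; `failed_requests` is an assumed column name for the
# count of unsuccessful requests in that run, and the file path is hypothetical.
df = pd.read_csv("benchmark_results.csv")

total_failed = df["failed_requests"].sum()
print(f"Total failed requests across all runs: {total_failed}")  # expect 0
```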
Did we try enough VUs?
How do we know that we tried enough VUs? What if we had tried a higher number of VUs and throughput kept increasing? If that's the case, we would see a monotonically increasing relationship between VUs and throughput, and we would need to run more tests. Let's check this out!
We can check by:
- Grouping the data by `hw_type` and `batch_size` to match how we generated the experiments
- Sorting the data by `vus` within each group to ensure we get the data in the correct order for our check
- Checking for a monotonic increase in `throughput_req_per_sec` and flagging the groups whose throughput always increases as VUs increase
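A minimal pandas sketch of that check, assuming the results are loaded into a DataFrame with the column names used in this report:

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# Flag (hw_type, batch_size) groups whose throughput never stops increasing with VUs.
# Any group still increasing at the highest tested VU count suggests we should test higher.
suspicious_groups = []
for (hw, batch), group in df.groupby(["hw_type", "batch_size"]):
    ordered = group.sort_values("vus")
    if ordered["throughput_req_per_sec"].is_monotonic_increasing:
        suspicious_groups.append((hw, batch))

print("Groups that may need higher VU counts:", suspicious_groups)
```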
But how do we know? We can use the slider to check what would have happened if we had not tried past a certain VU count. Let's say we had stopped at 256 instead of our actual 1024; we would have left some potential on the table. We can simulate this by filtering our runs.
Verification
- Put the slider at `256` and see that there are a number of scenarios where we should have checked a higher VU count
- Put the slider at `1024` and verify that there are no scenarios shown
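The slider is interactive; a rough offline equivalent is to cap the VU range and rerun the same monotonicity check (same hypothetical DataFrame as above):

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

def still_increasing_at(runs: pd.DataFrame, max_vus: int) -> list:
    """Groups whose throughput is still monotonically increasing when we pretend
    we never tested beyond `max_vus`."""
    capped = runs[runs["vus"] <= max_vus]
    flagged = []
    for (hw, batch), group in capped.groupby(["hw_type", "batch_size"]):
        if group.sort_values("vus")["throughput_req_per_sec"].is_monotonic_increasing:
            flagged.append((hw, batch))
    return flagged

print(len(still_increasing_at(df, 256)))   # several scenarios expected
print(len(still_increasing_at(df, 1024)))  # should be 0
```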
Are we Accurate Enough?
We shouldn't expect to see significant changes in accuracy. We should see a fairly tight distribution, but there might be some deviation, since at lower VUs we won't get through as many of our 10_000 samples as we did at higher VUs.

Here we can see some deviation with a large z-score, but overall it isn't a big absolute deviation. These cases also occur when we have relatively low `total_requests`, which makes sense.

We should worry more if we see a major `absolute_deviation` with higher `total_requests`. We can see those values here:
hw_type | batch_size | vus | total_requests | accuracy_percentage | absolute_deviation | z_score |
---|---|---|---|---|---|---|
nvidia-l4 | 128 | 1024 | 10000 | 53.85 | 4.16 | -15.59 |
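The deviation columns above can be reproduced with a simple calculation; this sketch assumes the same hypothetical DataFrame and computes the z-score against the distribution over all runs, which may differ from the exact grouping used in the report.

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# Compare each run's accuracy against the distribution over all runs.
mean_acc = df["accuracy_percentage"].mean()
std_acc = df["accuracy_percentage"].std()
df["absolute_deviation"] = (df["accuracy_percentage"] - mean_acc).abs()
df["z_score"] = (df["accuracy_percentage"] - mean_acc) / std_acc

# Runs worth a closer look: noticeable absolute deviation despite a full request count.
# The thresholds here are illustrative, not the ones used to build the table above.
outliers = df[(df["absolute_deviation"] > 2) & (df["total_requests"] >= 10_000)]
print(outliers[["hw_type", "batch_size", "vus", "total_requests",
                "accuracy_percentage", "absolute_deviation", "z_score"]])
```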
Best Image by Cost Savings
Chart
- Color = best image for that `vus`/`batch_size`/GPU combination
- Size = % cost savings vs. the worst (most expensive) image in that group.
- Small dots don't mean much; large dots do.
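As a rough sketch of how the colors and bubble sizes might be derived (same hypothetical DataFrame, with `1B_cost` as the cost column):

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# For each (hw_type, vus, batch_size) group, find the cheapest image and the
# percentage it saves versus the most expensive image in that same group.
rows = []
for (hw, vus, batch), group in df.groupby(["hw_type", "vus", "batch_size"]):
    cheapest = group.loc[group["1B_cost"].idxmin()]
    worst_cost = group["1B_cost"].max()
    rows.append({
        "hw_type": hw, "vus": vus, "batch_size": batch,
        "best_image": cheapest["image"],  # drives the dot color
        "savings_pct": 100 * (worst_cost - cheapest["1B_cost"]) / worst_cost,  # dot size
    })

savings = pd.DataFrame(rows)
```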
Analysis
We can see that `trt-onnx` is quite a bit stronger on `nvidia-l4`. There are no significant red dots.
nvidia-l4
- `trt-onnx` (blue) dominates most points, indicating it's typically the cheaper choice
- At larger batch sizes (right side) and higher VUs (upper part of the chart), you often see big blue bubbles, suggesting `trt-onnx` can save a significant percentage versus `default`
- A few red points (i.e., `default` cheaper) appear at lower batch sizes, but they're less frequent and often show smaller savings differences
nvidia-t4
- There's more of a mix: some points favor `default` and others favor `trt-onnx`
- You can see some large red bubbles, meaning `default` can occasionally produce big savings under certain (VUs, batch_size) conditions
- However, `trt-onnx` is still cheaper in many scenarios, especially toward higher batch sizes
Takeaways
If you have the time/budget, it's better to analyze both images; you can see that they are close at times. But if you only have time/budget for one, at the current cost ratio consider the `nvidia-l4` in this case.
1B Requests Cost vs. Latency
This scatter plot visualizes the relationship between average latency (ms) and cost per billion requests per day for different hardware types (`hw_type`) and images (`image`).
How to Read the Chart:
- Point Symbols: Represent different hardware + image configurations.
- Color Gradient: Represents batch size, helping to see cost trends across different batch sizes.
- Hover Data: Displays additional details like VUs, batch size, and throughput per second.
Key Features:
- Global Minimum Cost (Red Star): Marks the configuration with the lowest cost.
- Pareto Front (Red Dashed Line + Points): Highlights the most efficient configurations, minimizing both cost and latency. These configurations offer the best trade-offs.
How to Use:
- Find the lowest-cost, low-latency configurations by looking at points near the bottom-left.
- Use the Pareto front to identify cost-effective configurations.
- Compare different hardware types and images to optimize your setup.
This visualization helps you select the configuration that best balances performance (low latency) and cost efficiency.
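For reference, the Pareto front itself is cheap to compute without a plotting library; this sketch assumes the same hypothetical DataFrame, with `avg_latency_ms` as an assumed name for the latency column:

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# A configuration is on the Pareto front if no other configuration has both
# lower latency and lower cost. Sweeping in order of increasing latency, a point
# is Pareto-optimal exactly when its cost beats every cost seen so far.
pareto_rows = []
for _, row in df.sort_values("avg_latency_ms").iterrows():
    if not pareto_rows or row["1B_cost"] < pareto_rows[-1]["1B_cost"]:
        pareto_rows.append(row)

pareto_front = pd.DataFrame(pareto_rows)
global_min = df.loc[df["1B_cost"].idxmin()]  # the red-star global minimum cost point
```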
Cost vs VUs and Batch Size Contour Plots
These contour plots visualize the cost per billion requests per day (`1B_cost`) as a function of VUs (Virtual Users) and batch size for different hardware configurations (`hw_type`) and image types (`image`).

There are real data points, but the values in between are interpolated.
How to Read the Charts:
- Color Gradient: Shows the cost levels, with darker colors representing higher costs and lighter colors representing lower costs.
- Contour Lines: Represent cost levels, helping identify cost-effective regions.
- White Dots: Represent real data points used to generate the interpolated surface.
- Red Stars: Highlight the lowest cost point in the dataset.
- Small Red Dots: Indicate the lowest cost for each batch size.
- Tightly Clustered Contour Lines: Indicate that cost changes rapidly with small adjustments to batch size or VUs.
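The in-between values can be filled in with something like `scipy.interpolate.griddata`; the sketch below shows the general approach and is not necessarily the exact method behind these charts:

```python
import numpy as np
import pandas as pd
from scipy.interpolate import griddata

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# One surface per (hw_type, image) pair; take nvidia-l4 + default as an example.
sub = df[(df["hw_type"] == "nvidia-l4") & (df["image"] == "default")]

vus_grid, batch_grid = np.meshgrid(
    np.linspace(sub["vus"].min(), sub["vus"].max(), 100),
    np.linspace(sub["batch_size"].min(), sub["batch_size"].max(), 100),
)

# Interpolate cost between the real (white-dot) measurements to get a smooth surface.
cost_grid = griddata(
    points=sub[["vus", "batch_size"]].to_numpy(),
    values=sub["1B_cost"].to_numpy(),
    xi=(vus_grid, batch_grid),
    method="cubic",
)
```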
How to Use:
- Identify the lowest cost configurations (red stars and dots).
- Observe how cost changes with batch size and VUs to optimize your setup.
- Compare different hardware types (`hw_type`) and images (`image`) to find the best-performing configuration.
Analysis
Overall we can see that `nvidia-t4`s are more expensive for this cost ratio and task. We should consider using the `nvidia-l4`.
GPU | Image | Batch Size | VUs | Min Cost |
---|---|---|---|---|
nvidia-t4 | trt-onnx | 512 | 48 | $611.07 |
nvidia-t4 | default | 32 | 32 | $622.81 |
nvidia-l4 | trt-onnx | 64 | 448 | $255.07 |
nvidia-l4 | default | 64 | 448 | $253.82 |
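A summary like the table above can be pulled out of the results with a single groupby (same hypothetical DataFrame as before):

```python
import pandas as pd

df = pd.read_csv("benchmark_results.csv")  # hypothetical results file, as above

# Lowest-cost run for every (hw_type, image) pair, with the settings that achieved it.
idx = df.groupby(["hw_type", "image"])["1B_cost"].idxmin()
summary = df.loc[idx, ["hw_type", "image", "batch_size", "vus", "1B_cost"]]
print(summary.sort_values("1B_cost"))
```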
We can see a clear winner with `nvidia-l4` over `nvidia-t4` at this cost ratio. But surprisingly we see `default` slightly outperform `trt-onnx`.

I think we should be careful not to overfit. These numbers can vary per run, but it's good to know that each image can be competitive.
nvidia-t4
- Here we can see that `trt-onnx` and `default` both perform similarly, but with `trt-onnx` having a slight edge. `trt-onnx` has a lower overall cost band (611–659) than `default` (623–676)
nvidia-l4
- `trt-onnx` has a broad area of relatively low cost and hits a very low floor (~255)
  - This is great since it shows that we get consistently good results!
- `default` can also dip into the mid-200s in certain spots, but it has bigger, more expensive areas, especially at lower VUs and batch sizes
  - This means we need to spend time optimizing it
Conclusion
If I have time, I might analyze the `nvidia-l4` with `trt-onnx` across some different runs. Despite being $1.25 more expensive per 1B requests, it's a safer, more consistent bet IMO.