
WaveBoost - Inference Kernel

Summary

WaveBoost is my personal repository for experimenting with inference-time optimizations. I implement individual CUDA kernels for the building blocks of LLM inference.


📊 Benchmarks

Attention Mechanisms Comparison

Performance comparison between Multi-Head Attention (MHA) and Grouped Query Attention (GQA):

(Figure: attention mechanisms latency comparison, MHA vs. GQA.)

GQA cuts KV-cache memory by sharing each key/value head across a group of query heads, and it stays competitive on latency because the kernel indexes the shared K/V heads directly instead of explicitly replicating them; the sketch below shows that indexing.
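The following is a minimal, illustrative CUDA program, not a kernel from this repository; the sizes and names (NUM_Q_HEADS, NUM_KV_HEADS, HEAD_DIM, gqa_scores) are assumptions for illustration. The GQA-specific part is a single line: each query head resolves to a shared K/V head via q_head / group_size, so K is read in place rather than copied per query head.

```cuda
// gqa_map.cu - illustrative sketch of GQA head indexing (not this repo's kernel).
// Each group of query heads reads the same K/V head, so K needs no replication.
#include <cstdio>
#include <cuda_runtime.h>

constexpr int NUM_Q_HEADS  = 8;
constexpr int NUM_KV_HEADS = 2;   // group_size = 4 query heads per KV head
constexpr int HEAD_DIM     = 64;

// One block per query head; threads cooperatively reduce one q . k dot product.
__global__ void gqa_scores(const float* Q, const float* K, float* out) {
    __shared__ float partial[HEAD_DIM];
    int q_head  = blockIdx.x;
    int kv_head = q_head / (NUM_Q_HEADS / NUM_KV_HEADS); // the GQA mapping
    int d       = threadIdx.x;
    partial[d] = Q[q_head * HEAD_DIM + d] * K[kv_head * HEAD_DIM + d];
    __syncthreads();
    // Standard shared-memory tree reduction over the head dimension.
    for (int s = HEAD_DIM / 2; s > 0; s >>= 1) {
        if (d < s) partial[d] += partial[d + s];
        __syncthreads();
    }
    if (d == 0) out[q_head] = partial[0];
}

int main() {
    float *Q, *K, *out;
    cudaMallocManaged(&Q,   NUM_Q_HEADS  * HEAD_DIM * sizeof(float));
    cudaMallocManaged(&K,   NUM_KV_HEADS * HEAD_DIM * sizeof(float));
    cudaMallocManaged(&out, NUM_Q_HEADS  * sizeof(float));
    for (int i = 0; i < NUM_Q_HEADS  * HEAD_DIM; ++i) Q[i] = 1.0f;
    for (int i = 0; i < NUM_KV_HEADS * HEAD_DIM; ++i) K[i] = 1.0f;

    gqa_scores<<<NUM_Q_HEADS, HEAD_DIM>>>(Q, K, out);
    cudaDeviceSynchronize();
    for (int h = 0; h < NUM_Q_HEADS; ++h)
        printf("q_head %d -> score %.0f\n", h, out[h]); // all-ones data: expect 64
    cudaFree(Q); cudaFree(K); cudaFree(out);
    return 0;
}
```

Setting NUM_KV_HEADS equal to NUM_Q_HEADS makes the group size 1 and recovers plain MHA, which is what makes a like-for-like latency comparison between the two mechanisms possible.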


📚 References

Flash Attention Papers

  • Flash Attention v1: Dao et al., 2022, "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness"
  • Flash Attention v2: Dao, 2023, "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning" (see the recurrence sketched below)
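Both papers compute exact attention by tiling over the key/value sequence and maintaining a running softmax, so the full score matrix never materializes in HBM. As a sketch of the standard recurrence (notation is mine, not this repository's), over KV tiles j = 1..T:

```latex
% Running-softmax recurrence over KV tiles j = 1..T.
% S_j: score tile Q K_j^T, m: running row max, \ell: running row sum.
\begin{aligned}
m_j    &= \max\bigl(m_{j-1},\ \mathrm{rowmax}(S_j)\bigr)\\
\ell_j &= e^{\,m_{j-1}-m_j}\,\ell_{j-1} + \mathrm{rowsum}\bigl(e^{\,S_j-m_j}\bigr)\\
O_j    &= e^{\,m_{j-1}-m_j}\,O_{j-1} + e^{\,S_j-m_j}\,V_j
\end{aligned}
```

The output is normalized once at the end as O_T / ℓ_T; v2's speedups come largely from better partitioning of these tiles across warps and thread blocks.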

CUDA Optimization Resources

