🔬 Focus: AI Infrastructure · CUDA Kernels · LLM Inference · HPC Systems
🌱 Currently: Building high-throughput inference pipelines and GPU-first systems
🤝 Open to: AI infrastructure, performance engineering, research collaboration, and open-source work
I build AI infrastructure and GPU-first high-performance systems with C++/CUDA, Python, and Go, with a focus on GPU operator optimization and high-performance systems engineering in practice.
- 🔥 GPU Kernel Engineering — CUDA/Triton kernels for FlashAttention, GEMM, quantization, and memory-aware operator design
- 🧠 AI Inference Systems — lightweight LLM runtimes, KV Cache, W8A16/FP8 quantization, and inference path optimization
- ⚡ High-Performance Computing — simulation, rendering, and image-processing pipelines tuned for throughput and scalability
- 🌐 Real-time Systems — RTC signaling, streaming applications, and digital human platforms with system-level integration
Currently: inference acceleration, kernel fusion, and end-to-end GPU system design.
Featured Projects — Start here for the quickest overview of my work in CUDA kernels, inference systems, HPC simulation, and production-facing applications.
If you want a quick read on my technical focus and representative work, start with the four projects below.
Best entry points for collaboration, hiring conversations, and technical review.
- Flagship CUDA kernel library covering GEMM, FlashAttention, Conv2D, SpMV, and FP8 quantization.
- Compact LLM inference engine focused on W8A16 quantization, KV Cache, and practical runtime design.
- Million-particle GPU simulation exploring direct N², Barnes-Hut, and CUDA-OpenGL interop.
- 3D digital human platform combining real-time rendering, interaction, and behavior control.
- Modern C++17/CUDA kernel library for elementwise ops, GEMM, FlashAttention, Conv2D, SpMV, and FP8 quantization.
- Stepwise CUDA SGEMM optimization from naive loops to Tensor Core kernels, reaching 40% of cuBLAS.
- Triton fusion kernels for RMSNorm+RoPE, Gated MLP, and FP8 GEMM with auto-tuning.
- CUDA kernel playground for FlashAttention, FP16/INT8 GEMM, and Tensor Core inference primitives.
- Lightweight LLM runtime with W8A16 quantization, KV Cache, and practical multi-sampling support.
- Educational CUDA inference engine with seven GEMM optimization stages, reaching 72% of cuBLAS.
- WebGPU micro inference engine implementing Conv2d, kernel fusion, Im2Col, and MNIST classification.
- Real-time multi-model vision stack combining YOLO, DETR, OWL-ViT, BLIP, and WebSocket streaming.
- CUDA ray tracer featuring Phong shading, path tracing, BVH acceleration, and warp-divergence tuning.
- Million-particle CUDA simulation covering direct N², Barnes-Hut, spatial hashing, and OpenGL interop.
- Real-time WebGPU fluid simulation with 10K particles, compute shaders, and visual trail effects.
- CUDA image-processing library covering convolution, morphology, geometric transforms, and pipeline stages.
- DAG-based heterogeneous image pipeline with multi-stream scheduling and pinned-memory pools.
- 3D digital human platform integrating real-time rendering, voice interaction, behavior control, and emotion FSM.
- Minimal WebRTC demo with Go signaling, room management, and peer-to-peer media delivery.
- End-to-end encrypted note sync with AES-256, mnemonic recovery, and real-time collaboration.
- Browser-based memory training app with N-back, spaced reinforcement, adaptive difficulty, and PWA support.
Background in communications and information engineering.
Engineering experience across medical imaging, RTC systems, and genomic-scale data workflows.
Reach out if you're building AI infrastructure, inference acceleration, GPU systems, or performance-critical tooling.
Open to technical collaboration, engineering roles, research discussions, and thoughtful open-source work.


