Location: Shenzhen (UTC +08:00)

LessUp/README.md

Focused on AI infrastructure, CUDA kernels, and high-performance systems engineering

🔬 Focus: AI Infrastructure · CUDA Kernels · LLM Inference · HPC Systems
🌱 Currently: Building high-throughput inference pipelines and GPU-first systems
🤝 Open to: AI infrastructure, performance engineering, research collaboration, and open-source collaboration








👨‍💻 About Me


I build AI infrastructure and GPU-first high-performance systems with C++/CUDA, Python, and Go, with a focus on GPU operator optimization and high-performance systems engineering in practice.

  • 🔥 GPU Kernel Engineering — CUDA/Triton kernels for FlashAttention, GEMM, quantization, and memory-aware operator design
  • 🧠 AI Inference Systems — lightweight LLM runtimes, KV Cache, W8A16/FP8 quantization, and inference path optimization
  • High-Performance Computing — simulation, rendering, and image-processing pipelines tuned for throughput and scalability
  • 🌐 Real-time Systems — RTC signaling, streaming applications, and digital human platforms with system-level integration

Current focus: inference acceleration, kernel fusion, and end-to-end GPU system design.



🚀 Selected Work

Featured Projects: Start here for the quickest overview of my work in CUDA kernels, inference systems, HPC simulation, and production-facing applications.
If you want to quickly gauge my technical focus and representative work, start with the four projects below.
Best entry points for collaboration, hiring conversations, and technical review.

Flagship CUDA kernel library covering GEMM, FlashAttention, Conv2D, SpMV, and FP8 quantization.

Compact LLM inference engine focused on W8A16 quantization, KV Cache, and practical runtime design.

Million-particle GPU simulation exploring direct N², Barnes-Hut, and CUDA-OpenGL interop.

3D digital human platform combining real-time rendering, interaction, and behavior control.

⚡ GPU Kernel Optimization

Modern C++17/CUDA kernel library for elementwise ops, GEMM, FlashAttention, Conv2D, SpMV, and FP8 quantization.

C++17 CUDA Tensor Core

Stepwise CUDA SGEMM optimization from naive loops to Tensor Core kernels, reaching 40% of cuBLAS.

CUDA WMMA Roofline
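As a rough illustration of the tiling idea that this kind of stepwise SGEMM optimization usually builds on, here is a minimal CPU-side sketch in Python. It is not code from the repository: the function names and the tile size are assumptions chosen for illustration, and the loops only mirror the access pattern a GPU kernel would use.

```python
def matmul_naive(a, b):
    """Reference triple-loop GEMM: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def matmul_tiled(a, b, tile=2):
    """Blocked GEMM over tile x tile sub-blocks (hypothetical tile size).

    On a GPU, each (i0, j0) output block would typically map to one thread
    block, with the A and B sub-blocks staged in shared memory for reuse.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                # Accumulate the partial product of one A tile and one B tile.
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same result; the tiled version only reorders the work so that each small block of inputs is reused several times before moving on.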

Triton fusion kernels for RMSNorm+RoPE, Gated MLP, and FP8 GEMM with auto-tuning.

Triton FP8 Python

CUDA kernel playground for FlashAttention, FP16/INT8 GEMM, and Tensor Core inference primitives.

CUDA PyTorch FlashAttention

🧠 AI Inference Engines

Lightweight LLM runtime with W8A16 quantization, KV Cache, and practical multi-sampling support.

CUDA C++17 INT8
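For readers unfamiliar with W8A16, the sketch below shows the general idea in plain Python: weights are stored as 8-bit integers with one scale per output row, while activations stay in floating point (FP16 on a GPU). This is an assumed, simplified scheme (symmetric, per-row scales) and not necessarily how the runtime implements it; both function names are hypothetical.

```python
def quantize_rows_int8(weights):
    """Symmetric per-row quantization: w is approximated by scale * q, q in [-127, 127]."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        # An all-zero row gets a dummy scale so the division below is safe.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def linear_w8a16(x, q_rows, scales):
    """y[o] = scale[o] * sum_i q[o][i] * x[i]; activations are never quantized."""
    return [s * sum(q * xi for q, xi in zip(row, x))
            for row, s in zip(q_rows, scales)]
```

Because each weight is rounded to the nearest step, the per-weight error is at most half a scale step, which bounds the output error by `0.5 * scale * sum(|x|)` for each row.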

Educational CUDA inference engine with seven GEMM optimization stages, reaching 72% of cuBLAS.

CUDA C++17 FP16

WebGPU micro inference engine implementing Conv2d, kernel fusion, Im2Col, and MNIST classification.

WebGPU TypeScript WGSL
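The Im2Col technique mentioned above turns a convolution into a matrix product by unrolling every input patch into a row. The following is a minimal single-channel sketch in Python under simplifying assumptions (stride 1, no padding); the function names are hypothetical and the project itself targets WebGPU, not Python.

```python
def im2col(image, kh, kw):
    """Unroll each kh x kw patch (stride 1, no padding) into one row."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_via_im2col(image, kernel):
    """Cross-correlation (the usual deep-learning 'convolution') as patch rows times a flat kernel."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    out_w = len(image[0]) - kw + 1
    flat_out = [sum(p * k for p, k in zip(patch, flat_kernel))
                for patch in im2col(image, kh, kw)]
    # Fold the flat result back into a 2D output map.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]
```

The trade-off is extra memory for the unrolled patches in exchange for a single dense product, which maps well onto GEMM-style compute kernels.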

Real-time multi-model vision stack combining YOLO, DETR, OWL-ViT, BLIP, and WebSocket streaming.

FastAPI YOLOv8 Docker

🎮 GPU Computing & Simulation

CUDA ray tracer featuring Phong shading, path tracing, BVH acceleration, and warp-divergence tuning.

CUDA Path Tracing BVH

Million-particle CUDA simulation covering direct N², Barnes-Hut, spatial hashing, and OpenGL interop.

CUDA OpenGL Barnes-Hut
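To make the "direct N²" baseline concrete, here is a small all-pairs gravity sketch in Python. It is an illustrative 2D version with an assumed softening term, not the project's CUDA kernel; the function name and parameters are hypothetical. Barnes-Hut and spatial hashing exist to avoid exactly this quadratic inner loop.

```python
import math


def accelerations_direct(positions, masses, g=1.0, softening=1e-3):
    """Return per-body 2D accelerations from all-pairs gravity, O(N^2) work."""
    n = len(positions)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        xi, yi = positions[i]
        for j in range(n):
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            # Softening keeps the force finite when two bodies nearly overlap.
            dist_sq = dx * dx + dy * dy + softening * softening
            inv_dist_cubed = 1.0 / (dist_sq * math.sqrt(dist_sq))
            acc[i][0] += g * masses[j] * dx * inv_dist_cubed
            acc[i][1] += g * masses[j] * dy * inv_dist_cubed
    return acc
```

On a GPU the outer loop would typically become one thread per body, which is why this method stays practical for moderately large particle counts despite its quadratic cost.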

Real-time WebGPU fluid simulation with 10K particles, compute shaders, and visual trail effects.

WebGPU TypeScript WGSL

CUDA image-processing library covering convolution, morphology, geometric transforms, and pipeline stages.

CUDA C++17 Image Processing

DAG-based heterogeneous image pipeline with multi-stream scheduling and pinned-memory pools.

CUDA C++17 DAG
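One common way to schedule a DAG-based pipeline is to group stages into "waves" whose members do not depend on each other, so each wave could be dispatched concurrently (for example, one CUDA stream per stage). The sketch below uses Python's standard `graphlib` module to show the idea; the stage names are invented for illustration and the repository's actual scheduler may work differently.

```python
from graphlib import TopologicalSorter


def schedule_waves(dependencies):
    """Group stages into waves; stages within one wave are mutually independent.

    `dependencies` maps each stage name to the set of stages it depends on.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical image pipeline: two independent stages fan out from decode.
pipeline = {
    "decode": set(),
    "resize": {"decode"},
    "denoise": {"decode"},
    "blend": {"resize", "denoise"},
}
waves = schedule_waves(pipeline)
```

Here `resize` and `denoise` land in the same wave, which is where multi-stream execution would pay off.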

🌐 Applications

3D digital human platform integrating real-time rendering, voice interaction, behavior control, and emotion FSM.

React Three.js TypeScript

Minimal WebRTC demo with Go signaling, room management, and peer-to-peer media delivery.

Go WebRTC Docker

End-to-end encrypted note sync with AES-256, mnemonic recovery, and real-time collaboration.

React Express Socket.IO

Browser-based memory training app with N-back, spaced reinforcement, adaptive difficulty, and PWA support.

JavaScript Tailwind PWA
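For context on the N-back task itself: a stimulus is a target when it matches the one shown n steps earlier. A minimal scoring sketch in Python follows; it is an assumption about how such a round could be scored, not the app's JavaScript implementation, and the function names are hypothetical.

```python
def n_back_targets(sequence, n):
    """Return indices i where sequence[i] equals the item shown n steps earlier."""
    return [i for i in range(n, len(sequence)) if sequence[i] == sequence[i - n]]


def score_round(sequence, n, responses):
    """Compare the indices the player flagged against the true targets."""
    targets = set(n_back_targets(sequence, n))
    flagged = set(responses)
    return {
        "hits": len(targets & flagged),
        "misses": len(targets - flagged),
        "false_alarms": len(flagged - targets),
    }
```

An adaptive-difficulty loop could, for instance, raise `n` after a round with few misses and false alarms and lower it otherwise.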


🎓 Background & Experience

🎓 Education

Xidian University

Background in communication and information engineering.

💼 Experience

Mindray · ZEGO · BGI

Engineering across medical imaging, RTC systems, and genomic-scale data workflows.


🛠️ Tech Stack

| Category | Technologies |
| --- | --- |
| Languages | (badge icons only) |
| AI & HPC | CUDA · Triton · cuBLAS · Tensor Core · WebGPU · Quantization |
| System & DevOps | Inference pipelines · Performance tuning |
| Web & Frontend | Real-time apps · Visualization |


📫 Collaboration & Contact

Reach out if you're building AI infrastructure, inference acceleration, GPU systems, or performance-critical tooling.
Open to technical collaboration, engineering roles, research discussions, and thoughtful open-source work.
Email   GitHub


Pinned

  1. bookmarks-cleaner (Public)

    Smart Bookmark Cleanup & Classification: Rules + ML + Optional LLM, Dedup, Title Cleanup & Multi-Format Export (Python CLI)

    Python · 2 stars

  2. awesome-cursorrules-zh (Public)

    💻✨ A collection of Cursor AI coding rules optimized for Chinese-speaking developers

    140 stars · 17 forks

  3. meta-human (Public)

    AI Digital Human Platform: Speech Recognition + LLM Chat + TTS + 3D Avatar (Next.js + Three.js)

    TypeScript · 12 stars · 4 forks

  4. webrtc (Public)

    Minimal WebRTC Demo (Go + Pion): WebSocket Signaling, Browser P2P Audio/Video Calls & Docker Deployment

    JavaScript · 1 star

  5. yolo-toys (Public)

    Multi-Model Real-Time Vision System: YOLO/DETR/OWL-ViT/BLIP Dynamic Switching, FastAPI + WebSocket Inference

    Python · 2 stars