Job Description
For our client, we are looking for an AI Infrastructure & Inference Engineer (m/f/d) with a focus on GPU & LLM.
Duration: 5.1.26
Workload: Full-time
Location: Remote
Responsibilities
• Design, implement, and optimize LLM and multimodal inference pipelines across multi-GPU, multi-node, and distributed environments.
• Build request-routing and load-balancing systems that deliver ultra-low-latency, high-throughput services.
• Develop auto-scaling and intelligent resource allocation to meet strict SLAs across multiple data centers.
• Make architectural trade-offs between latency, throughput, and cost efficiency for diverse workloads.
• Implement traffic shaping and multi-tenant orchestration for fair and reliable compute allocation.
• Collaborate with AI researchers, platform engineers, and ML practitioners to bring new model architectures to production.
• Automate system provisioning, deployment pipelines, and operational tasks using modern DevOps and MLOps practices.
• Monitor, profile, and benchmark system-level performance to maximize GPU utilization and uptime.
• Apply best practices in system security, observability (logging/metrics/tracing), and disaster recovery.
• Contribute to open-source ecosystems and internal tooling to push the boundaries of inference performance.
• Maintain comprehensive technical documentation and participate in continuous process improvements.
Required Skills
• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
• 5+ years of experience in high-performance computing, GPU infrastructure, or distributed systems.
• Deep understanding of multi-GPU orchestration, workload scheduling, and distributed architectures.
• Proficiency in programming (Python or a similar language) and in systems-automation scripting.
• Strong background in containerization (Docker), orchestration frameworks (Kubernetes), and CI/CD pipelines.
• Familiarity with observability tools such as Prometheus, Grafana, and OpenTelemetry.
• Strong understanding of OS-level performance (multi-threading, networking, memory management).
• Clear communication skills and the ability to work collaboratively across technical teams.
Preferred Skills
• Experience with NVIDIA DGX systems, NIM, TensorRT-LLM, or high-performance inference frameworks.
• Hands-on knowledge of CUDA, NCCL, Triton, MPI, NVLink, or InfiniBand networking.
• Experience deploying GPU clusters in both cloud and bare-metal environments.
• Familiarity with open-source inference ecosystems like SGLang, vLLM, or NVIDIA Dynamo.
• Knowledge of LLM optimization techniques for inference and fine-tuning acceleration.
• Understanding of enterprise security frameworks, compliance standards, and GDPR requirements.