Job title: 100% remote - AI Infrastructure & Inference Engineer (m/f/d), focus on GPU & LLM
Contract type: Interim / project consulting
Working-time model: Full-time
Payment interval: Hourly
Rate: Negotiable
Location: Remote
Job posted: 24-11-2025
Job ID: 61614
Name: Niklas Machens
Phone: +4915119501867
Email: niklas.machens@nemensis.de

Job description

For our client, we are looking for an AI Infrastructure & Inference Engineer (m/f/d) with a focus on GPU & LLM.
 
Duration: 5.1.26
Workload: Full-time
Location: Remote
 
• Design, implement, and optimize LLM and multimodal inference pipelines across multi-GPU, multi-node, and distributed environments.
• Build request routing and load balancing systems to ensure ultra-low latency, high-throughput services.
• Develop auto-scaling and intelligent resource allocation to meet strict SLAs across multiple data centers.
• Architect trade-offs between latency, throughput, and cost efficiency for diverse workloads.
• Implement traffic shaping and multi-tenant orchestration for fair and reliable compute allocation.
• Collaborate with AI researchers, platform engineers, and ML practitioners to bring new model architectures to production.
• Automate system provisioning, deployment pipelines, and operational tasks using modern DevOps and MLOps practices.
• Monitor, profile, and benchmark system-level performance for maximum GPU utilization and uptime.
• Apply best practices in system security, observability (logging/metrics/tracing), and disaster recovery.
• Contribute to open-source ecosystems and internal tooling to push the boundaries of inference performance.
• Maintain comprehensive technical documentation and participate in continuous process improvements.
 
Required Skills
• Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
• 5+ years of experience in high-performance computing, GPU infrastructure, or distributed systems.
• Deep understanding of multi-GPU orchestration, workload scheduling, and distributed architectures.
• Proficiency with programming (Python or similar language) and systems automation scripting.
• Strong background in containerization (Docker), orchestration frameworks (Kubernetes), and CI/CD pipelines.
• Familiarity with observability tools such as Prometheus, Grafana, and OpenTelemetry.
• Strong understanding of OS-level performance (multi-threading, networking, memory management).
• Clear communication skills and the ability to work collaboratively across technical teams.
 
Preferred Skills
• Experience with NVIDIA DGX systems, NIM, TensorRT-LLM, or high-performance inference frameworks.
• Hands-on knowledge of CUDA, NCCL, Triton, MPI, NVLink, or InfiniBand networking.
• Experience deploying GPU clusters in both cloud and bare-metal environments.
• Familiarity with open-source inference ecosystems like SGLang, vLLM, or NVIDIA Dynamo.
• Knowledge of LLM optimization techniques for inference and fine-tuning acceleration.
• Understanding of enterprise security frameworks, compliance standards, and GDPR requirements.