Project overview
About the Role

NexGen Cloud is a rapidly growing IaaS company focused on providing innovative cloud solutions and infrastructure services. Our GPU cloud infrastructure solutions accelerate development in industries such as Artificial Intelligence & Machine Learning, VFX & Rendering, Data Science & IoT, and Computer Aided Engineering & MDO. We are dedicated to helping our clients navigate the complexities of the digital world and achieve success through cutting-edge, scalable, secure, and affordable solutions. At the company's heart stands a group of talented, experienced, and motivated individuals who want to make a positive change and a lasting impact on the tech world.

As an Infrastructure Engineer, you will help design, deploy, and operate the systems that power our global GPU cloud. You'll bring deep expertise in Linux, networking, and automation to ensure our fleet is secure, scalable, and fast. This is a hands-on role, ideal for engineers who love building and optimizing performance-critical infrastructure and who want to have a major impact at a rapidly scaling company.

Key Responsibilities

Core Infrastructure
- Provision and manage Linux systems (Ubuntu-based) supporting GPU servers and backend services.
- Maintain system availability, conduct root cause analysis, and implement failover strategies.

Networking
- Design and manage high-speed, low-latency network infrastructure across data center environments.
- Configure firewalls, BGP, VLANs, VXLANs, and VPNs to support secure and scalable multi-tenant networking.
- Resolve network-related incidents impacting workloads or customer environments.

Automation & Scaling
- Build infrastructure-as-code with tools like Ansible for repeatable, scalable deployments.
- Automate GPU driver installs, system bootstrapping, and fleet-wide patching.
- Develop CI/CD workflows for infrastructure updates and configuration validation.

Cloud & Virtualization
- Support containerized workloads via Kubernetes or custom orchestration systems.
- Work with both bare-metal and virtualized GPU platforms using KVM or OpenStack-based environments.
- Integrate with public cloud APIs or hybrid infrastructure as needed.

Monitoring & Security
- Deploy and manage monitoring stacks (e.g., Prometheus, Grafana, ELK) to track system health and capacity.
- Implement hardening practices, access controls, and audit trails for infrastructure components.
- Support incident response and security investigations related to infrastructure.

Qualifications and Requirements
- 3–5 years of experience in Linux systems administration or infrastructure engineering.
- Strong networking knowledge: routing, switching, TCP/IP, DNS, DHCP, VLANs, BGP, VPN.
- Proficiency with scripting languages (Bash, Python) and automation tools (Ansible, Terraform).
- Hands-on experience with virtualization, containerization, and systems troubleshooting.
- Familiarity with monitoring and logging systems in a production environment.
- Strong focus on keeping good documentation.

Good to have:
- Prior experience at a GPU cloud provider, HPC environment, or similar high-performance setting.
- Exposure to NVIDIA GPU technologies and tooling (e.g., NVIDIA GPU Operator, CUDA Toolkit, DCGM).
- Experience with software-defined networking (SDN, OVS/OVN) and overlay networks (VXLAN, Calico).
- Experience with networking products from Arista, Cisco, MikroTik, and NVIDIA/Mellanox.
- Familiarity with OpenStack private cloud environments.
- Familiarity wit ... (Description has been truncated due to length limits)