Core Infrastructure & Orchestrator Engineer

via GoLance | 6 months ago | Web, Mobile & Software Dev | Remote

Project overview

Primary responsibility: Design and implement the backend + infra that runs Project Helix: services, APIs, queues, DBs, deployment, observability, and later migration to the Dell R7625.

Hard skills:

OS & Systems
  5+ years working with Linux (Ubuntu) in production.
  Strong with: systemd, journalctl, ssh, basic hardening.
  Filesystems, disk layout, RAID or at least mdadm/LVM basics.

Containers & Orchestration
  Deep hands-on experience with Docker and/or Podman:
    Writing multi-container setups (docker-compose.yml or Podman pods).
    Resource limits (CPU, memory, pids), namespaces.
  Comfortable with cgroups v2 and seccomp profiles at a practical level (not necessarily kernel dev, but knows how to apply hardened profiles).
  Ideally has run multi-node environments (Swarm / Nomad / custom scheduler), or has at least designed for deployment across multiple hosts.

Backend Engineering
  Strong in at least one server-side language: Go, Node.js/TypeScript, or Python (preferred for Helix control plane).
  Has built:
    REST or gRPC APIs.
    Job/worker systems (queue-based or custom scheduler).
    State machines for jobs (pending → running → failed → retry → done).
  Can read Java well enough to map existing Appium/Spring concepts (Session, ExecutionRequest, Device) into new services.

Datastores
  PostgreSQL:
    Schema design, indexing, migrations (Flyway, Liquibase, or built-in tools).
    Familiar with connection pooling (pgBouncer/pgpool).
  Redis:
    Using it for queues, locks, rate limiting, ephemeral state.
    Patterns: pub/sub or streams, distributed locks, expiring keys.

Distributed Systems Concepts
  Understands:
    Idempotency, retries, backoff.
    At-least-once vs at-most-once semantics.
    Event-driven architecture basics (events for session_started, session_finished, error_event, etc.).
  Can design:
    A central scheduler that assigns work to execution nodes.
    A session lifecycle state machine for 100k+ sessions/day.

Observability / DevOps
  Has set up logging & metrics in real systems:
    Prometheus + Grafana (or equivalent), basic alerting.
    Structured logs to file or ELK/OpenSearch.
  Knows how to expose:
    Health endpoints.
    Metrics for CPU/RAM per container, errors per script, success rates.

Networking / Proxies
  Not necessarily a network engineer, but:
    Understands basic routing, IPs, subnets, DNS, reverse proxies.
    Can integrate outgoing HTTP via proxies (SOCKS5/HTTP) at the app or container level.

What they deliver (Phase 1–2):
  VPS with:
    Docker/Podman
    Postgres
    Redis
    Prometheus + Grafana (or similar)
  A Helix Orchestrator service that:
    Accepts “run N clients with script X” jobs.
    Stores job/session state in Postgres.
    Manages workers on that VPS via containers.
  A deployment story: docker-compose (or Podman equivalent) that brings everything up/down.
  Config structured so later migration to the R7625 is just “add more nodes, same containers.”
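
For candidates who want a concrete picture of the job state machine named above (pending → running → failed → retry → done), here is a minimal Python sketch. It is illustrative only and not taken from the Helix codebase: the names JobState, TRANSITIONS, advance, and backoff_seconds are hypothetical, and the set of allowed transitions is an assumption, since the listing only names the states. The backoff helper shows plain exponential backoff with a cap; a real system would likely add jitter.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"
    RETRY = "retry"
    DONE = "done"


# Allowed transitions. This mapping is an assumption for illustration;
# the listing names the states but not the exact edges between them.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {JobState.DONE, JobState.FAILED},
    JobState.FAILED: {JobState.RETRY},
    JobState.RETRY: {JobState.RUNNING},
    JobState.DONE: set(),
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return target if the transition is allowed, else raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(
            f"illegal transition: {current.value} -> {target.value}"
        )
    return target


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff without jitter: base * 2**attempt, capped at cap."""
    return min(cap, base * (2 ** attempt))
```

Rejecting illegal transitions in one place keeps retries safe under at-least-once delivery: a duplicate "done" event for a job that is already done raises an error instead of silently rewriting state.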
