ML Systems Engineer Job at Evolve Group, Hayward, CA

  • Evolve Group
  • Hayward, CA

Job Description

About the Company - We are partnering with a cutting-edge AI research lab that is building foundation models from the ground up—across large language models (LLMs), image/video generation, and robotics. This is a high-intensity, hands-on environment for top-tier engineers who want to build state-of-the-art machine learning infrastructure at scale.

About the Role - We are seeking a Senior ML Infrastructure Engineer to design and build distributed training systems for large-scale AI models. This role is highly technical and requires deep expertise in ML infrastructure, distributed computing, and large-scale model training.

Responsibilities

  • Architect and optimize distributed training infrastructure for massive-scale AI models.
  • Set up and maintain multi-node, GPU-based training clusters (12+ nodes, 100+ GPUs).
  • Debug and optimize ML training performance (NCCL, CUDA, PyTorch pipeline optimization).
  • Implement and optimize data and model parallelism techniques (FSDP, DDP, DeepSpeed).
  • Develop infrastructure for efficient data sharding, sampling, and pipeline execution.
  • Build tooling to monitor cluster performance and diagnose failures (GKE/K8s, logging, and debugging tools).
  • Work closely with research teams to ensure infrastructure meets the needs of frontier AI model development.

Required Skills

  • Experience: 5+ years in ML infrastructure, ML systems engineering, or AI platform engineering.
  • Background: Proven experience at top AI research labs or companies working on large-scale AI models (e.g., OpenAI, DeepMind, Meta AI, NVIDIA, Anthropic).

Preferred Skills

  • Distributed Training: Multi-node training clusters, GPU compute optimization.
  • ML Frameworks: PyTorch, PyTorch Lightning.
  • Cluster Management: GKE/Kubernetes, cloud-based ML training setups.
  • Parallelism Techniques: Data/model parallelism, FSDP, DDP, DeepSpeed.
  • Debugging & Optimization: NCCL, CUDA, network optimizations for training stability.
  • Mindset & Culture Fit: Highly driven and mission-focused; thrives in a high-intensity startup environment. Excited to build ML infrastructure for training models from scratch (not just fine-tuning existing ones).

Why Join?

  • Work on cutting-edge AI research—building foundation models from scratch.
  • Join a small, elite team solving some of the hardest ML infrastructure challenges.
  • Have a direct impact on AI at scale, working alongside top researchers and engineers.
  • Competitive compensation and meaningful equity in a fast-growing AI lab.

This is an urgent hire, and we are reviewing candidates immediately. If you are an ML Infrastructure expert looking to work on groundbreaking AI research, apply now!

Job Tags

Immediate start
