The New Stack Makers

Keeping GPUs Ticking Like Clockwork

Information:

Synopsis

Clockwork began with a narrow goal—keeping clocks synchronized across servers—but soon realized that its precise latency measurements could reveal deeper data center networking issues. This insight led the company to build a hardware-agnostic monitoring and remediation platform capable of automatically routing around faults. Today, Clockwork’s technology is especially valuable for large GPU clusters used in training LLMs, where communication efficiency and reliability are critical. CEO Suresh Vasudevan explains that AI workloads are among the most demanding distributed applications ever, and Clockwork provides building blocks that improve visibility, performance and fault tolerance. Its flagship feature, FleetIQ, can reroute traffic around failing switches, preventing costly interruptions that might otherwise force teams to restart training from hours-old checkpoints. Although the company originated from Stanford research focused on clock synchronization for financial institutions, the team eventually recogni