The InfoQ Podcast

Oliver Gould About Architecting to Avoid and Recover from Failure

Synopsis

In this week’s podcast, Robert Blumen talks to Oliver Gould at QCon San Francisco 2016. Oliver is the CTO of Buoyant, where he leads open source development efforts. Prior to Buoyant, he was a Staff Infrastructure Engineer at Twitter, where he was technical lead on the Observability, Traffic, Configuration and Coordination teams.

Why listen to this podcast:
- Stratification allows applications to own their logic while libraries take care of the different mechanisms, such as service discovery and load balancing
- Cascading failures can’t be tested or protected against, so a fast time to recovery is important
- Having developers own their services with on-call mechanisms improves the reliability of the service; it’s best to optimise automatic restarts so problems can be addressed during normal working hours
- Post-mortem analysis of failures is important to improve run books or checklists and to share learning between teams
- Incremental roll out of features with feature flags or weighted routing provi