Borgmon also supports language templates. This macro-like system enables engineers to construct libraries of rules that can be reused. This functionality again reduces repetition, thus reducing the likelihood of bugs in the configuration.

Of course, any high-level programming environment creates the opportunity for complexity, so Borgmon provides a way to build extensive unit and regression tests by synthesizing time-series data, in order to ensure that the rules behave as the author thinks they do. The Production Monitoring team runs a continuous integration service that executes a suite of these tests, packages the configuration, and ships the configuration to all the Borgmon in production, which then validate the configuration before accepting it.

In the vast library of common templates that have been created, two classes of monitoring configuration have emerged. The first class simply codifies the emergent schema of variables exported from a given library of code, such that any user of the library can reuse the template of its varz. Such templates exist for the HTTP server library, memory allocation, the storage client library, and generic RPC services, among others. (While the varz interface declares no schema, the rule library associated with the code library ends up declaring a schema.)

The second class of library emerged as we built templates to manage the aggregation of data from a single-server task to the global service footprint. These libraries contain generic aggregation rules for exported variables that engineers can use to model the topology of their service.
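The idea of a reusable aggregation rule, checked against synthesized time-series data, can be sketched in a few lines. This is a minimal illustration in Python, not Borgmon's rule language: the `aggregate` and `synthesize` helpers, the sample shape, and the `datacenter`, `shard`, and `task` label names are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample shape: (labels, value), where labels is a dict.
# The label names used here are illustrative, not Borgmon's own.


def aggregate(samples, keep):
    """Sum values across all samples that agree on the labels in `keep`.

    Labels not listed in `keep` are dropped, so one generic rule can
    roll data up to any level of the service topology.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


def synthesize(datacenters, shards, tasks, rate):
    """Build synthetic per-task samples, one per (datacenter, shard, task)."""
    return [
        ({"datacenter": dc, "shard": str(s), "task": str(t)}, rate)
        for dc in datacenters
        for s in range(shards)
        for t in range(tasks)
    ]


# Synthesized input: 2 datacenters x 2 shards x 3 tasks, each task at 10.0.
samples = synthesize(["dc-a", "dc-b"], shards=2, tasks=3, rate=10.0)

# The same rule applied at three levels of the topology.
per_shard = aggregate(samples, keep=("datacenter", "shard"))
per_datacenter = aggregate(samples, keep=("datacenter",))
global_total = aggregate(samples, keep=())
```

Because the input is synthesized, the expected totals at each level are known in advance, which is what makes this kind of rule straightforward to cover with regression tests.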
For example, a service may provide a single global API, but be homed in many datacenters. Within each datacenter, the service is composed of several shards, and each shard is composed of several jobs with various numbers of tasks. An engineer can model this breakdown with Borgmon rules so that when debugging, subcomponents can be isolated from the rest of the system. These groupings typically follow the shared fate of components; e.g., individual tasks share fate due to configuration files, jobs in a shard share fate because they're homed in the same datacenter, and physical sites share fate due to networking.

Labeling conventions make such division possible: a Borgmon adds labels indicating the target's instance name and the shard and datacenter it occupies, which can be used to group and aggregate those time-series together.

Thus, we have multiple uses for labels on a time-series, though all are interchangeable:

• Labels that define breakdowns of the data itself (e.g., our HTTP response code on the http_responses variable)
• Labels that define the source of the data (e.g., the instance or job name)
• Labels that indicate the locality or aggregation of the data within the service as a whole (e.g., the zone label describing a physical location, a shard label describing a logical grouping of tasks)

The templated nature of these libraries allows flexibility in their use. The same tem‐