From Barry on WordPress
Data Centre Heatmap
28/02/2022
At Automattic, our systems team manages over 10,000 physical servers located across 30 data centers on 6 continents. As our compute density has increased from 24 CPU threads/RU in 2013 to 128 CPU threads/RU in 2022, the maximum thermal thresholds have decreased. Older, less powerful servers could operate with inlet air temperatures up to 42C (107.6F), while newer servers trigger CPU throttling at much lower temperatures of 35C-37C (95F-98.6F). Normal data center operating temperatures tend to be between 20C and 25C (68F-77F), but cooling failures are somewhat common (they even affect Google), so we have to monitor temperatures carefully.
We are big fans of Prometheus and Grafana, and for a few years, our temperature graphs have looked like this.
This graph shows the temperatures of some servers located in our data center in Johannesburg, South Africa over one week. The coloured lines represent individual servers, and the bold red line is the average temperature in the rack.
We get this data from each server's inlet temperature sensor using ipmitool. I thought it would be interesting to visualize this data a bit differently, and Grafana has a Heatmap graph type that makes it pretty easy.
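As a rough sketch of that collection step, the Python below parses one line of the kind printed by `ipmitool sdr type Temperature` and renders it as a sample in the Prometheus text format. The sensor name `Inlet Temp`, the sample line, and both helper names are assumptions for illustration. Sensor naming varies by vendor, and this is not a description of our actual exporter.

```python
import re

# Hypothetical sample line; real output and sensor names vary by vendor.
SAMPLE_LINE = "Inlet Temp       | 04h | ok  |  7.1 | 23 degrees C"


def parse_inlet_temp(line, sensor_name="Inlet Temp"):
    """Return the reading in Celsius, or None if the line is not the inlet sensor."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 5 or fields[0] != sensor_name:
        return None
    match = re.match(r"(-?\d+(?:\.\d+)?) degrees C$", fields[4])
    return float(match.group(1)) if match else None


def to_prometheus_sample(temp_c, location):
    """Format a reading as one line of the Prometheus text exposition format."""
    # Only the location label is shown here; other labels (such as dc) are omitted.
    return f'ipmi_inlet_temp{{location="{location}"}} {temp_c:g}'
```

Lines for other sensors return `None`, so the same function can be run over every line of the command's output.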
First, we simply want to graph the temperature by location for a given data center. In PromQL, this looks like:
avg by (location) (ipmi_inlet_temp{dc="$DC"})
location includes the rack identifier and the position in the rack. For example, a location of 101-10 would mean Rack 101, RU 10. We store this information in our data center asset management system (which is a colon-separated file), and it gets added as labels to all Prometheus metrics.
By choosing the Heatmap (New) graph type and configuring some basic graph options, Grafana allows us to create a graph which shows the same data as our original graph, but in a different and more useful way. We can easily see that the top of the rack is warmer than the bottom, which is to be expected since the cold air in this facility comes from the floor. We can also see that temperatures have increased slightly over the past week, which is not ideal, but they are not at dangerous levels.
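To make the location label concrete, here is a minimal Python sketch that splits a label such as 101-10 into its rack and RU parts and orders a set of labels from the top of each rack down. It assumes numeric rack identifiers and a single hyphen, which matches the example above but may not hold for every label. The function names are hypothetical.

```python
def parse_location(location):
    """Split a label such as "101-10" into (rack, rack_unit) integers."""
    rack, _, rack_unit = location.partition("-")
    return int(rack), int(rack_unit)


def top_down(locations):
    """Order locations by rack, then from the highest RU down to the lowest."""
    def sort_key(location):
        rack, rack_unit = parse_location(location)
        return (rack, -rack_unit)

    return sorted(locations, key=sort_key)
```

Sorting on the parsed numbers matters because a plain string sort would place 101-10 before 101-2.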
JNB
We can contrast this with a rack in Milan, Italy, where there was a cooling outage which caused the servers to operate beyond their intended temperature threshold for some time:
Milan, Italy
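An outage like this can also be summarized numerically. The sketch below is an illustration rather than part of our dashboards: it scans a list of (timestamp, temperature) samples and reports the periods spent above a threshold. The 35C default is the lower end of the throttling range mentioned earlier, and the function name is hypothetical.

```python
def excursions(samples, threshold_c=35.0):
    """Return (start, end) timestamp pairs for runs of samples above threshold_c.

    samples: list of (timestamp_seconds, temp_c) tuples, sorted by time.
    """
    runs = []
    start = None      # timestamp of the first hot sample in the current run
    last_hot = None   # timestamp of the most recent hot sample
    for timestamp, temp_c in samples:
        if temp_c > threshold_c:
            if start is None:
                start = timestamp
            last_hot = timestamp
        elif start is not None:
            runs.append((start, last_hot))
            start = None
    if start is not None:
        # The series ended while still above the threshold.
        runs.append((start, last_hot))
    return runs
```

Each returned pair gives the first and last sample seen above the threshold, so the length of an excursion is accurate to within one scrape interval.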
Using the same data and graph options, we can also easily create heat maps of entire rows of racks to visualize airflow management and identify areas for potential improvement. Here is a row of racks in a data center in Los Angeles with poor airflow management. We can see the racks at the end of the row suffer from increased temperatures due to air leakage from the hot aisle to the cold aisle.
Los Angeles
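The same comparison can be approximated outside Grafana. This Python sketch, which assumes readings keyed by the location label and a hypothetical 3C margin, averages the inlet temperature per rack and lists the racks that run noticeably warmer than the row as a whole. Any readings passed in are illustrative, not measurements from this facility.

```python
from collections import defaultdict
from statistics import mean


def rack_averages(readings):
    """Average inlet temperature per rack from {location: temp_c} readings."""
    by_rack = defaultdict(list)
    for location, temp_c in readings.items():
        rack = location.split("-")[0]  # "103-40" -> "103"
        by_rack[rack].append(temp_c)
    return {rack: mean(temps) for rack, temps in by_rack.items()}


def warmer_than_row(readings, margin_c=3.0):
    """Racks whose average exceeds the row-wide average by more than margin_c."""
    averages = rack_averages(readings)
    row_average = mean(list(averages.values()))
    return sorted(
        rack for rack, average in averages.items()
        if average - row_average > margin_c
    )
```

A rack flagged this way is only a starting point; the heat map is still what shows whether the extra heat sits at the top, the bottom, or the end of the row.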
This data can be contrasted with data from a set of racks in Amsterdam, which have much better airflow management.
This post shows how easy it is to create cool(!!) and useful heat maps using Grafana, Prometheus, and a little time. If this sort of stuff interests you, Automattic is hiring!