24 Feb 2022

Searching for lost metrics after updating kube-state-metrics

After I updated the Kubernetes cluster at work and Prometheus Operator, one of its components, kube-state-metrics, was updated to 2.3.0. Earlier, with versions 1.4.0 and 1.7.0, there was no problem, because 1.7.0 is backward compatible with 1.4.0. But starting with 2.0.0 some metrics were removed, and I needed to find out which ones, because some of them were used in alerts!

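For that I wrote three scripts, one per version. Each of them prints the metric names defined in the kube-state-metrics source code, grouped by source file: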

Get 1.4.0 metrics

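#! /bin/bash
# Print the metric names defined in kube-state-metrics 1.4.0, grouped by source file.
# Run from the root of a kube-state-metrics 1.4.0 source checkout.

PATH_TO_CODE="./pkg/collectors"
# In 1.4.0 the metric name is on the line right after "prometheus.NewDesc("
WORD_TO_FIND="prometheus.NewDesc("

for file in $(find "${PATH_TO_CODE}" -type f -name "*.go" -not -path "*_test*"); do
    # Numbers of the lines that follow each match (-F: fixed string, not a regex)
    list=$(grep -nF "${WORD_TO_FIND}" "$file" | awk -F':' '{print $1+1}')

    if [ -n "$list" ]; then
        # Print the path in the 1.7.0+ layout so the outputs of all scripts are comparable
        echo -e "\n$file" | sed 's/pkg\/collectors/internal\/store/'
        new_list=()
        for line in $list; do
            # First argument of NewDesc; xargs strips the surrounding quotes
            metric=$(awk "(NR == $line)" "$file" | xargs echo | awk -F ',' '{print $1}')
            # If the value contains "desc" it is a variable name: resolve it to the assigned string
            if [[ "$metric" =~ "desc" ]]; then
                metric=$(grep "$metric.*=" "$file" | awk -F "=" '{print $2}' | xargs echo)
            fi
            new_list+=($metric)
        done
        # Sort and de-duplicate the names found in this file
        IFS=$'\n' sorted=($(sort -u <<<"${new_list[*]}")); unset IFS

        for i in "${sorted[@]}"; do echo "$i"; done
    fi
done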

You can download the resulting list of metrics here.

Get 1.7.0 metrics

#! /bin/bash

PATH_TO_CODE="./internal/store"
WORD_TO_FIND="			Name: "

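# In 1.7.0 every metric is declared with an indented `Name: "kube_..."` field,
# so the metric name is on the matching line itself
for file in $(find "${PATH_TO_CODE}" -type f -name "*.go" -not -path "*_test*"); do
    list=$(grep -nF "${WORD_TO_FIND}" "$file" | awk -F':' '{print $1}')

    if [ -n "$list" ]; then
        echo -e "\n$file"
        new_list=()
        for line in $list; do
            # Strip quotes and indentation, cut at the first comma, keep the part after "Name: "
            metric=$(awk "(NR == $line)" "$file" | xargs echo | awk -F ',' '{print $1}' | awk -F ": " '{print $2}')
            # If the value contains "desc" it is a variable name: resolve it to the assigned string
            if [[ "$metric" =~ "desc" ]]; then
                metric=$(grep "$metric.*=" "$file" | awk -F "=" '{print $2}' | xargs echo)
            fi
            new_list+=($metric)
        done
        # Sort and de-duplicate the names found in this file
        IFS=$'\n' sorted=($(sort -u <<<"${new_list[*]}")); unset IFS
        for i in "${sorted[@]}"; do echo "$i"; done
    fi
done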

You can download the resulting list of metrics here.

Get 2.3.0 metrics

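#! /bin/bash
# Print the metric names defined in kube-state-metrics 2.3.0, grouped by source file.
# Run from the root of a kube-state-metrics 2.3.0 source checkout.

PATH_TO_CODE="./internal/store"
# In 2.3.0 the metric name is on the line right after "*generator.NewFamilyGenerator("
WORD_TO_FIND="*generator.NewFamilyGenerator("

for file in $(find "${PATH_TO_CODE}" -type f -name "*.go" -not -path "*_test*"); do
    # Numbers of the lines that follow each match (-F: fixed string, not a regex)
    list=$(grep -nF "${WORD_TO_FIND}" "$file" | awk -F':' '{print $1+1}')

    if [ -n "$list" ]; then
        echo -e "\n$file"
        new_list=()
        for line in $list; do
            # First argument of NewFamilyGenerator; xargs strips the surrounding quotes
            metric=$(awk "(NR == $line)" "$file" | xargs echo | awk -F ',' '{print $1}')
            # If the value contains "desc" it is a variable name: resolve it to the assigned string
            if [[ "$metric" =~ "desc" ]]; then
                metric=$(grep "$metric.*=" "$file" | awk -F "=" '{print $2}' | xargs echo)
            fi
            new_list+=($metric)
        done
        # Sort and de-duplicate the names found in this file
        IFS=$'\n' sorted=($(sort -u <<<"${new_list[*]}")); unset IFS
        for i in "${sorted[@]}"; do echo "$i"; done
    fi
done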

You can download the resulting list of metrics here.
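All three scripts use paths relative to the current directory, so each one has to run from the root of a kube-state-metrics source tree at the matching version. The following is a sketch of how that could be automated. It assumes the upstream repository at https://github.com/kubernetes/kube-state-metrics.git with release tags such as `v1.4.0`; the function name `collect_metrics` and the file names in the usage comments are hypothetical:

```shell
#!/usr/bin/env bash

# Assumption: upstream repository with release tags of the form vX.Y.Z
KSM_REPO="https://github.com/kubernetes/kube-state-metrics.git"

# Hypothetical helper: run one of the scripts above against a release tag
# and save its output. Usage: collect_metrics TAG SCRIPT OUTPUT_FILE
collect_metrics() {
    local tag="$1" script="$2" out="$3"
    local script_path src status
    # Absolute path to the script, because it is executed inside the checkout
    script_path="$(cd "$(dirname "$script")" && pwd)/$(basename "$script")"
    src="$(mktemp -d)"
    # A shallow clone of a single tag is enough to read the Go sources
    if ! git clone --quiet --depth 1 --branch "$tag" "$KSM_REPO" "$src"; then
        rm -rf "$src"
        return 1
    fi
    (cd "$src" && bash "$script_path") > "$out"
    status=$?
    rm -rf "$src"
    return "$status"
}

# Usage (hypothetical file names):
# collect_metrics v1.4.0 ./get-1.4.0-metrics.sh ./metrics-1.4.0.txt
# collect_metrics v1.7.0 ./get-1.7.0-metrics.sh ./metrics-1.7.0.txt
# collect_metrics v2.3.0 ./get-2.3.0-metrics.sh ./metrics-2.3.0.txt
```

The shallow clone keeps every checkout small; one full clone with a `git checkout` per tag would work just as well.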

Differences

As I said above, 1.7.0 is backward compatible with 1.4.0.

But 2.3.0 is not backward compatible with 1.4.0.
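One way to get a list like the one below is to compare the saved outputs of the scripts with `comm`. A minimal sketch, assuming the outputs were saved to files (the names `metrics-1.4.0.txt` and `metrics-2.3.0.txt` are hypothetical) and that every metric name starts with `kube_`; file-path lines and blank lines are skipped. The function name `removed_metrics` is also hypothetical:

```shell
#!/usr/bin/env bash

# Print metric names that appear in the first file but not in the second.
# Lines that are not plain metric names (file paths, blank lines) are ignored.
removed_metrics() {
    local old_file="$1" new_file="$2"
    local old_sorted new_sorted
    old_sorted="$(mktemp)"
    new_sorted="$(mktemp)"
    # comm needs both inputs sorted in the same collation order, so pin LC_ALL=C
    grep -E '^kube_[a-z0-9_]+$' "$old_file" | LC_ALL=C sort -u > "$old_sorted"
    grep -E '^kube_[a-z0-9_]+$' "$new_file" | LC_ALL=C sort -u > "$new_sorted"
    # -2 hides lines only in the new file, -3 hides lines present in both
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted"
    rm -f "$old_sorted" "$new_sorted"
}

# Usage (hypothetical file names):
# removed_metrics ./metrics-1.4.0.txt ./metrics-2.3.0.txt
# removed_metrics ./metrics-1.4.0.txt ./metrics-1.7.0.txt   # empty if 1.7.0 kept every 1.4.0 metric
```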

Metrics from 1.4.0 that were deleted in 2.3.0

kube_pod_container_resource_limits_cpu_cores
kube_pod_container_resource_limits_memory_bytes
kube_pod_container_resource_requests_cpu_cores
kube_pod_container_resource_requests_memory_bytes
kube_hpa_labels
kube_hpa_metadata_generation
kube_hpa_spec_max_replicas
kube_hpa_spec_min_replicas
kube_hpa_status_condition
kube_hpa_status_current_replicas
kube_hpa_status_desired_replicas
kube_daemonset_updated_number_scheduled
kube_node_status_allocatable_cpu_cores
kube_node_status_allocatable_memory_bytes
kube_node_status_allocatable_pods
kube_node_status_capacity_cpu_cores
kube_node_status_capacity_memory_bytes
kube_node_status_capacity_pods
kube_node_status_phase

To find alerts that still use these metrics, run grep in the directory that contains the alert rules:

grep -nr \
    -e kube_pod_container_resource_limits_cpu_cores \
    -e kube_pod_container_resource_limits_memory_bytes \
    -e kube_pod_container_resource_requests_cpu_cores \
    -e kube_pod_container_resource_requests_memory_bytes \
    -e kube_hpa_labels \
    -e kube_hpa_metadata_generation \
    -e kube_hpa_spec_max_replicas \
    -e kube_hpa_spec_min_replicas \
    -e kube_hpa_status_condition \
    -e kube_hpa_status_current_replicas \
    -e kube_hpa_status_desired_replicas \
    -e kube_daemonset_updated_number_scheduled \
    -e kube_node_status_allocatable_cpu_cores \
    -e kube_node_status_allocatable_memory_bytes \
    -e kube_node_status_allocatable_pods \
    -e kube_node_status_capacity_cpu_cores \
    -e kube_node_status_capacity_memory_bytes \
    -e kube_node_status_capacity_pods \
    -e kube_node_status_phase
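If the list above is saved to a file, one name per line (the name `deleted-metrics.txt` below is hypothetical), grep can read the patterns from that file instead of repeating `-e`. This sketch also adds `-F` (fixed strings) and `-w` (whole words), so a removed name is not reported when it is only the prefix of a longer metric name. The function name is hypothetical:

```shell
#!/usr/bin/env bash

# Search DIR (default: current directory) recursively for any metric name
# listed in PATTERN_FILE.
# -F: fixed strings, -w: whole words only, -f: read patterns from a file.
find_removed_metric_usages() {
    local pattern_file="$1" dir="${2:-.}"
    grep -rnwF -f "$pattern_file" "$dir"
}

# Usage (hypothetical paths):
# find_removed_metric_usages ./deleted-metrics.txt ./alerts
```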

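Download Grafana dashboards to check them

Grafana usually keeps dashboards in its database. If dashboard files exist, Grafana loads them at startup. Dashboards created in the UI exist only in the database, so to grep them they first have to be exported through the HTTP API: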

#! /bin/bash

SERVER="https://grafana.example.com"
LOGIN=login
PASS=password

# Load the list of dashboard slugs; type=dash-db skips folders
# curl -H "Content-Type: application/json" -s "${SERVER}/api/search" -u "${LOGIN}:${PASS}" | jq .
LIST=$(curl -H "Content-Type: application/json" -s "${SERVER}/api/search?type=dash-db" -u "${LOGIN}:${PASS}" | jq -r '.[].uri' | awk -F "/" '{print $2}')

# Save every dashboard to ./tmp/<slug>.json
mkdir -p ./tmp
for i in ${LIST}; do
    echo "$i"
    # ">" rather than ">>", so running the script again does not append a second copy
    curl -H "Content-Type: application/json" -s "${SERVER}/api/dashboards/db/${i}" -u "${LOGIN}:${PASS}" | jq . > "./tmp/${i}.json"
done

Based on serveradmin.ru.
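The script above uses the slug-based `/api/dashboards/db/<slug>` endpoint, which is deprecated in newer Grafana versions in favor of `/api/dashboards/uid/<uid>`. A sketch of the same export keyed by uid, assuming the `uid` field returned by `/api/search` and that `curl` and `jq` are installed; the function name `export_dashboards` and the output layout are hypothetical:

```shell
#!/usr/bin/env bash

# Hypothetical helper: export every dashboard to OUT_DIR/<uid>.json.
# Usage: export_dashboards SERVER LOGIN PASS OUT_DIR
export_dashboards() {
    local server="$1" login="$2" pass="$3" out_dir="$4"
    local uid
    mkdir -p "$out_dir"
    # type=dash-db limits the search to dashboards (folders are skipped)
    curl -s -u "${login}:${pass}" "${server}/api/search?type=dash-db" \
        | jq -r '.[].uid' \
        | while read -r uid; do
            echo "$uid"
            # The response wraps the model as {"dashboard": ..., "meta": ...};
            # keep only the dashboard model, which contains the queries
            curl -s -u "${login}:${pass}" "${server}/api/dashboards/uid/${uid}" \
                | jq '.dashboard' > "${out_dir}/${uid}.json"
        done
}

# Usage (credentials are placeholders):
# export_dashboards "https://grafana.example.com" login password ./tmp
```

Files are named by uid instead of slug, so a list of affected files contains uids; the human-readable name is in the `title` field of each file.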

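After that, the grep command defined above can be run on the downloaded dashboards:

grep -nr ... > ./grafana-missed-metrics.txt

Then print the names of the affected dashboards (the file name before the first colon, without the directory and the .json extension):

cut -d: -f1 ./grafana-missed-metrics.txt | sed -e 's|.*/||' -e 's|\.json$||' | sort -u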

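Fixing metrics

For example, this is how some of the removed metrics can be replaced. In each pair, the first line is the 1.x metric and the second is its 2.x equivalent: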

kube_pod_container_resource_requests_cpu_cores
kube_pod_container_resource_requests{resource="cpu"}

kube_pod_container_resource_limits_cpu_cores
kube_pod_container_resource_limits{resource="cpu"}

kube_pod_container_resource_requests_memory_bytes
kube_pod_container_resource_requests{resource="memory"}

kube_pod_container_resource_limits_memory_bytes
kube_pod_container_resource_limits{resource="memory"}
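When many alert rules need the four replacements above, they can be applied with `sed`. A minimal sketch, assuming the old metric name is either used bare or followed directly by a `{...}` selector; in the second case `resource="cpu"` or `resource="memory"` is prepended to the existing matchers. Other forms are not handled, and the function name is hypothetical. It is meant for YAML rule files: inside Grafana JSON the inserted quotes would need to be escaped as `\"`.

```shell
#!/usr/bin/env bash

# Rewrite the removed 1.x resource metrics into their 2.x form.
# Reads stdin, writes stdout. The "{" variants run first, so a name that
# already has a selector gets the resource matcher added inside it.
rewrite_resource_metrics() {
    sed -E \
        -e 's/kube_pod_container_resource_(requests|limits)_cpu_cores\{/kube_pod_container_resource_\1{resource="cpu",/g' \
        -e 's/kube_pod_container_resource_(requests|limits)_cpu_cores/kube_pod_container_resource_\1{resource="cpu"}/g' \
        -e 's/kube_pod_container_resource_(requests|limits)_memory_bytes\{/kube_pod_container_resource_\1{resource="memory",/g' \
        -e 's/kube_pod_container_resource_(requests|limits)_memory_bytes/kube_pod_container_resource_\1{resource="memory"}/g'
}

# Usage (hypothetical file names):
# rewrite_resource_metrics < rules.yaml > rules.new.yaml
```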