Stop Zabbix notification for nodes under zabbix-proxy when proxy service is down


Solution 1

One way to accomplish that is to monitor whether the Zabbix proxy is reachable, using the zabbix[proxy,<name>,lastaccess] internal item. The documentation on internal items suggests using the fuzzytime() trigger function to check the availability of your proxy:

{<zabbix>:zabbix[proxy,<name>,lastaccess].fuzzytime(2m)}=0

After that you can make your "host unreachable" triggers depend on this trigger, which will stop them from being activated when proxy is unreachable (see documentation on trigger dependencies for more information).
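To make the logic concrete, here is a small Python model of what the trigger and the dependency do. It is only a sketch, not Zabbix code: the function names are made up for illustration, and fuzzytime() is modelled as "1 if the timestamp is within the given number of seconds of the server clock, 0 otherwise".

```python
def fuzzytime(item_value: float, now: float, window_s: float) -> int:
    """Toy model of fuzzytime(): 1 if the timestamp stored in the item is
    within window_s seconds of the server time, otherwise 0."""
    return 1 if abs(now - item_value) <= window_s else 0


def proxy_unreachable(lastaccess: float, now: float, window_s: float = 120) -> bool:
    """Mirrors the trigger expression above: fuzzytime(2m) = 0."""
    return fuzzytime(lastaccess, now, window_s) == 0


def host_alert_allowed(host_problem: bool, lastaccess: float, now: float) -> bool:
    """Hypothetical helper: a host trigger that depends on the proxy trigger
    only produces an alert while the proxy trigger is not in a problem state."""
    return host_problem and not proxy_unreachable(lastaccess, now)
```

With a proxy last seen 10 minutes ago, host_alert_allowed(True, now - 600, now) is False, so only the proxy trigger itself would notify.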

Solution 2

the answer by asaveljevs is good. some notes on potential issues.

proxy will be detected as missing, and dependencies will work as expected. but let's assume a proxy is unreachable for an extended period of time - 30 minutes or more. let's assume that it keeps on collecting data. once the connection is restored, it sends the values in. at this point, server sees that proxy is available (the internal lastaccess item is updated), but a proxy always sends data sequentially (older values first). if the proxy had a large amount of values collected, it will take some time to send in the recent values. for the hosts behind the proxies you would have nodata() triggers on their availability, and these triggers would see that data is missing - and they would fire.
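to get a feeling for how long that catch-up window can be, here is a rough back-of-the-envelope model in python. all the rates are made-up assumptions for illustration, not measurements, and the model simply assumes the proxy drains its backlog at a constant rate while it keeps collecting new values.

```python
def catch_up_seconds(outage_s: float, collect_rate: float, send_rate: float) -> float:
    """Seconds after reconnect until the proxy has sent its backlog and the
    server starts receiving current values again.

    outage_s     -- how long the proxy could not reach the server
    collect_rate -- values per second the proxy keeps collecting (assumed)
    send_rate    -- values per second the proxy can upload (assumed)
    """
    if send_rate <= collect_rate:
        # The backlog never shrinks in this model.
        return float("inf")
    backlog = outage_s * collect_rate
    # While draining, new values still arrive, so the net rate is the difference.
    return backlog / (send_rate - collect_rate)


# Hypothetical numbers: 30 minute outage, 100 values/s collected, 1000 values/s sent.
delay = catch_up_seconds(1800, 100, 1000)
```

with these hypothetical numbers the server sees the proxy as available again for 200 seconds before any recent value arrives, which is longer than a nodata(2m) window.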

starting with zabbix 2.2, there is a new internal item, zabbix[proxy_history]. theoretically, it could be used to monitor how many unsent values the zabbix proxy has, with a trigger (t) that fires if that number is high. then all the host availability triggers would depend on trigger (t), and trigger (t) in turn would depend on the lastaccess trigger. that way, if the proxy suddenly disappears, we still have a dependency on lastaccess. if it comes back, we would notice the large history backlog and still not alert on anything... except that internal items are subject to the same proxy queue/older-first rules. by the time the server received the values showing a large proxy buffer/history, the triggers about hosts behind the proxy would already have fired.

so is there a solution? maybe.

the information about the proxy buffer can be extracted from the proxy database. our task is to get it to the server as soon as possible once the connection is restored. we have two options:

  • use an agent that talks directly to the server

a zabbix agent would collect the buffer/history size and push it directly to the server without going through the proxy cache/buffer. with a passive agent, we would completely lose buffer values during the downtime, and we would then depend on the item interval to get the first value in after the connection is back. with an active agent, we would be able to keep some amount of data (100 or 50 by default) during the downtime. it would probably introduce a tiny, tiny delay to send these values, though. by default the agent would try to send these values every 5 seconds or more often.

  • use zabbix_sender directly to the server

in this case we would be able to decide whether we care about values during the downtime. if we don't, it's simple: collect the values and just push them to the server, ignoring the failures. if we would like to get the value in as soon as possible, we would probably introduce some logic to send the value every 60 seconds, but if that fails, retry every 5 seconds or so. if we do care about values during the downtime, we would have to implement logic to store these values every interval seconds. if sending fails, always retry with older values first, but keep on collecting values (older values must be sent first so that events from the trigger against this item are not all messed up). compared to discarding values, this might introduce a tiny, tiny delay to get the latest value to the server.
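the store-and-retry variant could look roughly like the python sketch below. it is one possible shape, not a tested production script: the class and function names, the server name zabbix.example.com, the host name zabbixproxy and the item key proxy.backlog are all placeholders, and the item on the server is assumed to be a zabbix trapper item. the zabbix_sender call uses -T with -i - so that each value carries its own timestamp.

```python
import subprocess
from collections import deque
from typing import Callable, Deque, Tuple

Sample = Tuple[int, int]  # (unix timestamp, backlog size)


class BacklogForwarder:
    """Keeps samples in order and sends the oldest one first."""

    def __init__(self, send: Callable[[Sample], bool], max_buffered: int = 1000):
        self._send = send
        # With maxlen set, the oldest samples are dropped once the buffer is full.
        self._pending: Deque[Sample] = deque(maxlen=max_buffered)

    def record(self, timestamp: int, value: int) -> None:
        self._pending.append((timestamp, value))

    def flush(self) -> int:
        """Send pending samples oldest first; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def next_delay(self, normal_s: int = 60, retry_s: int = 5) -> int:
        """Retry quickly while something is still waiting to be sent."""
        return retry_s if self._pending else normal_s


def send_with_zabbix_sender(sample: Sample,
                            server: str = "zabbix.example.com",  # placeholder
                            host: str = "zabbixproxy",           # placeholder
                            key: str = "proxy.backlog") -> bool:  # placeholder
    """Push one timestamped value straight to the server via zabbix_sender."""
    timestamp, value = sample
    line = f'"{host}" {key} {timestamp} {value}\n'
    try:
        result = subprocess.run(
            ["zabbix_sender", "-z", server, "-T", "-i", "-"],
            input=line, text=True, capture_output=True,
        )
    except OSError:
        # zabbix_sender is not installed or could not be started.
        return False
    return result.returncode == 0
```

a cron job or a small loop would call record() with a fresh backlog value, then flush(), then sleep for next_delay() seconds. if values during the downtime do not matter, max_buffered=1 keeps only the latest sample.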

in all these cases there is probably a very small window for a race condition, which could potentially be eliminated with some clever tricks around triggers (maybe by requiring lastaccess to be recent and making sure that the last 3 values of it are all different, or something in that direction).

oh, and a potential query to obtain the history/buffer size on the proxy db (might not work with all supported databases, adapt as needed):

select ((select max(proxy_history.id) from proxy_history)-nextid) from ids where field_name='history_lastid';
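to try the query without touching a real proxy, here is a python sketch that runs it against an in-memory sqlite database. the two tables are a stripped-down stand-in with only the columns the query needs, not the real proxy schema, and treating "no rows" or NULL as a backlog of 0 is an assumption of this sketch.

```python
import sqlite3

BACKLOG_QUERY = """
select ((select max(proxy_history.id) from proxy_history) - nextid)
from ids
where field_name = 'history_lastid'
"""


def proxy_backlog(conn: sqlite3.Connection) -> int:
    """Number of values collected by the proxy but not yet sent to the server."""
    row = conn.execute(BACKLOG_QUERY).fetchone()
    if row is None or row[0] is None:
        # No bookkeeping row or empty history: treated as "nothing pending".
        return 0
    return max(row[0], 0)


def demo_connection() -> sqlite3.Connection:
    """Toy database: 10 collected values, the first 4 already sent."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table proxy_history (id integer primary key)")
    conn.execute("create table ids (field_name text, nextid integer)")
    conn.executemany("insert into proxy_history (id) values (?)",
                     [(i,) for i in range(1, 11)])
    conn.execute("insert into ids values ('history_lastid', 4)")
    return conn


backlog = proxy_backlog(demo_connection())  # 10 - 4 = 6 in the toy data
```

against a real proxy you would swap sqlite3 for the driver of the proxy's database and feed the result to the agent item or zabbix_sender described above.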

Author: A_01

Updated on September 18, 2022

Comments

  • A_01
    A_01 almost 2 years

    I have a zabbix-proxy with 12 nodes behind it. Right now, whenever the proxy service goes down, it sends an "unreachable" mail for all 12 nodes. I want to send mail only for the zabbix proxy, not for the nodes under that proxy.

    Updated: Now I am trying to have a single trigger that checks both conditions: 1) the Zabbix host has not been accessible for the past x minutes, and 2) the host is not sending any data to the proxy (the host is down).

    The trigger should fire only when the proxy is running and the node is down. I tried the expression below, but it is not working for me. Can someone please help me out with this?

    ({ip-10-4-1-17.ec2.internal:agent.ping.nodata(2m)}=1) & ({ip-10-4-1-17.ec2.internal:zabbix[proxy,zabbixproxy.dev-test.com,lastaccess].fuzzytime(120)}=1)

  • A_01
    A_01 about 10 years
    Actually, I don't want to create a new trigger, because we have a template set up that sends a notification mail if any of the triggers fails. So what I want to do is combine both conditions, like: ({ip-10-4-1-17.ec2.internal:agent.ping.nodata(2m)}=1) & ({ip-10-4-1-17.ec2.internal:zabbix[proxy,zabbixproxy.dev-test.com,lastaccess].fuzzytime(120)}=1) It should return 1 when both conditions are true: 1) the host has had no data for the past 2 minutes, and 2) the zabbix-proxy has not been accessible for the past 2 minutes. But it is not working for me. @asaveljevs, can you please help me understand what is wrong with it?
  • asaveljevs
    asaveljevs about 10 years
    Documentation on internal items says that zabbix[proxy,...] item is not supported by proxy. So in order for it to work, it has to be on a host monitored by Zabbix server directly. Otherwise, if it is on a host monitored by Zabbix proxy (like in your trigger), the item will be checked by the proxy and will fail.
  • A_01
    A_01 about 10 years
    Thanks, Richlv, for your response. I am trying to understand your points, and it would be great if you could share some articles or material that I can refer to, rather than spending time learning on my own how to implement it. Thanks
  • asaveljevs
    asaveljevs about 10 years
    In order to answer your question, could you please describe whether "ip-10-4-1-17.ec2.internal" host is monitored by server or by proxy? If it is monitored by server, then checking proxy availability in this trigger does not make sense. If it is monitored by proxy, then (as noted above) "zabbix[proxy,...]" item should be unsupported, and the trigger does not make sense either.
  • ik_zelf
    ik_zelf over 8 years
    @asaveljevs, what about simply allowing some time before clearing the alert after the proxy becomes available? A longer downtime would require a longer keep time .... (no idea how to do that ....)
  • asaveljevs
    asaveljevs over 8 years
    @ik_zelf, sorry, not sure I understand your question. Could you please elaborate a bit on the problem you are trying to solve and what you are aiming to achieve?
  • ik_zelf
    ik_zelf over 8 years
    like ( trigger.value = 0 & {<zabbix>:zabbix[proxy,<name>,lastaccess].fuzzytime(2m)}=0 ) or ( trigger.value = 1 & {<zabbix>:zabbix[proxy,<name>,lastaccess].fuzzytime(2m)}<>0 ) Some expression that prevents the immediate clearing of the trigger value and allows some time to get the proxy data loaded. Maybe lastaccess should be within the last minute before clearing....