18 October 2016 @ 06:20 pm
The importance of good examples in coding and configuration files  
My employer has chosen to use Datadog for some of its monitoring, and I have been having a really hard time getting simple process monitoring to work reliably. It turns out that the process.yaml syntax used by Datadog agents depends heavily on Python psutil calls, and there is quite a difference between single quotes (used in Datadog's examples) and double quotes (needed when searching for running processes where the unique string is in the middle of a very long line).

Datadog's Process check is documented pretty well on its Process check page, and the simple checks are easy and work right away. Checking for a running httpd or nginx process is trivial using the example, and the PID check works, though I am not sure how useful it is, as pretty much no one uses static PID assignment. What the examples need to include is an effective fuzzy search that picks a specific Node.js or Java servlet instance out of many possible running processes. The simple name search for 'java' is not very helpful, as I have as many as a dozen separate Java servers running on a host. Likewise, a simple name search for 'node' is useless, as I have as many as thirty Node.js servers running at a time. I spent far too many hours trying to get the exact name match to work until I discovered that switching to double quotes and setting the exact_match: False boolean option makes this fairly reliable. Given that running node and java is so common, why doesn't Datadog include examples of that?
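To make the difference between the two modes concrete, here is a rough Python sketch of how I understand the matching to behave. It is a simplified model and not the agent's actual code: I am assuming that an exact match compares the search string against the process name, while exact_match: False looks for the search string anywhere in the full command line. The function name find_matches, the hard-coded process table, and the other.js path are all made up for illustration.

```python
# Simplified model of the two matching modes; NOT the Datadog agent's real code.
# Each process is modelled as (name, cmdline), where cmdline is the argv list,
# similar to what psutil's Process.name() and Process.cmdline() return.

def find_matches(processes, search_strings, exact_match=True):
    """Return the processes that match any of the search strings."""
    matched = []
    for name, cmdline in processes:
        full_cmdline = " ".join(cmdline)
        for needle in search_strings:
            if exact_match:
                # Assumed behaviour: compare against the process name only.
                hit = (name == needle)
            else:
                # Assumed behaviour: substring search over the whole command line.
                hit = (needle in full_cmdline)
            if hit:
                matched.append((name, cmdline))
                break  # count each process at most once
    return matched


# Hypothetical process table (other.js is an invented second server).
processes = [
    ("node", ["node", "/full/path/to/nodejs/bin/mu/fuzzyblink.js"]),
    ("node", ["node", "/full/path/to/nodejs/bin/mu/other.js"]),
    ("java", ["java", "-ea",
              "-javaagent:/usr/share/dse/cassandra/lib/jamm-0.2.5.jar"]),
]

# A bare name search cannot tell the two node servers apart.
by_name = find_matches(processes, ["node"], exact_match=True)

# A substring search on the full command line isolates one instance.
by_cmdline = find_matches(
    processes,
    ["node /full/path/to/nodejs/bin/mu/fuzzyblink.js"],
    exact_match=False,
)
```

In this model the name search returns both node processes, while the command-line search returns only the fuzzyblink server, which is the behaviour I was after.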

Here are mine, /etc/dd-agent/conf.d/process.yaml contents:

init_config:

instances:
  - name: cassandra
    search_string: ["java -ea -javaagent:/usr/share/dse/cassandra/lib/jamm-0.2.5.jar"]
    exact_match: False
    ignore_denied_access: True

  - name: nodejs.mu.fuzzyblink
    search_string: ["node /full/path/to/nodejs/bin/mu/fuzzyblink.js"]
    exact_match: False
    ignore_denied_access: True


Run service datadog-agent restart ; sleep 8 ; service datadog-agent info to restart your Datadog agent and verify the syntax of your process.yaml file.

Now you can set up a process monitor alert through your Datadog cloud account and look for process:cassandra and process:nodejs.mu.fuzzyblink metrics coming in from the agent. The double quotes are the key.