Wednesday, June 24, 2015

The Night Shift

The TV series Night Shift? Nope, this is the Kronometrix appliance for night workers, operators, sleepless system administrators and the rest of us working late. An appliance? What do you mean?

To be serious about monitoring, you had better have an application running on a cloud platform, right? Public or private, whatever it is: Google, Amazon, Azure, or an internal network powered by OpenStack or VMware, preferably deployed over a large number of nodes, using large memory configurations and offering advanced dashboards to keep track of hundreds of metrics collected from machines, storage, network devices and applications, all available and ready to be ... read and digested by anyone.

The duty admin is confused: which dashboards are needed for basic performance monitoring and SLA, and which metrics should these dashboards include?

We took a simpler approach with Kronometrix. It can be deployed on a cloud computing platform if really needed, but it was designed to be:
  • easy to install, manage and administer, aiming for zero administration
  • ready for computer performance analysis, including essential performance metrics
  • self-maintained and automated: the majority of tasks are pre-configured and already set
  • simple to read and understand, using clear UI dashboards
  • available for operations people, ready for large screens, for example 51"

Night Mode

Lights off, please. A simple Ops Dashboard, designed to display vital performance indicators (CPU, memory, disk IO and NIC IO utilization) for a host over two time ranges, 5 and 30 minutes, along with the system run queue length. In the center of the same dashboard we display the same time series data as a chart, at different time resolutions. A very simple control lets us switch the dashboard to night monitoring mode.
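As a rough illustration of the two time ranges, the sketch below averages a utilization metric over the last 5 and 30 minutes. It is not Kronometrix code: the function names and the (timestamp, value) sample layout are assumptions made for this example.

```python
def window_mean(samples, now, span_seconds):
    """Mean of the values whose timestamp lies in (now - span_seconds, now]."""
    recent = [value for ts, value in samples if now - span_seconds < ts <= now]
    return sum(recent) / len(recent) if recent else None


def ops_summary(samples, now):
    """Summarize one metric over the two dashboard ranges (hypothetical layout)."""
    return {
        "5min": window_mean(samples, now, 5 * 60),
        "30min": window_mean(samples, now, 30 * 60),
    }


# CPU utilization samples as (seconds, percent) pairs.
cpu = [(60, 10.0), (1500, 20.0), (1700, 30.0), (1790, 50.0)]
print(ops_summary(cpu, now=1800))
```

Only the two newest samples fall inside the 5 minute range, while all four count toward the 30 minute one, which is why the two numbers on the dashboard can differ noticeably.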

Operational Dashboard, 5 and 30 minutes

Night Mode, Zoom-In

Operational Dashboard 5 and 30 minutes, zoom in

Kronometrix Advantages

  • top essential performance metrics, per host, included
  • clear and simple UI, with no confusing labels, charts or extra information
  • two time ranges, allowing operators and sysadmins to check current and past activity
  • possibility to drill down and zoom in, using the time series data charts
  • direct link to console events, alerts and thresholds
  • designed for large screens, e.g. 51", in night and day mode

Friday, June 12, 2015

One Tool to rule them all ...

Data Recording Module - Five default data recorders, written in Perl 5, responsible for collecting and reporting overall system utilization and queuing, per-CPU statistics, per-disk IO throughput and errors, per-NIC IO throughput and the computer system inventory data: sysrec, cpurec, diskrec, nicrec and hdwrec.

But why not have a single process, simple to start, stop and operate? Does it matter, anyway?

One tool cannot perform all jobs. We believe that.

We have assigned different tasks to different recorders, making it easy to update one data recorder without breaking the others. We also wanted to separate functionality based on the four main system resources: CPU, memory, disk and network.
We also think it is very important to have a description of the hardware and software running on the computer system, a sort of inventory, so that everybody easily understands what we are looking at. As a result, we ended up with the following data recorders:
  • sysrec: overall system CPU, MEM, DISK, NIC utilization and saturation
  • cpurec: per-CPU statistics
  • diskrec: per-disk statistics, throughput in KBytes and IOPS
  • nicrec: per-NIC statistics, throughput in KBytes and packets, along with link saturation and errors
  • hdwrec: hardware and software inventory, like number of CPUs, total physical RAM installed, number of disks


How about the system resources used by Kronometrix itself? This is the general footprint in terms of CPU, memory and disk:
  • CPU: on an idle host, all data recorders use less than 0.5%. On a busy system (95-100% utilized), the data recorders use up to 3%
  • Memory: all default data recorders use up to 64MB RAM, including the transport utility. Windows data recorders use up to 128MB RAM
  • Disk: the default installation, without raw data, uses up to 75MB of disk space, and the data recorders are not disk IO intensive applications

Keep it simple

We can add or change a data recorder within minutes. Having the data recorders written in Perl 5 lets us change or add functions very easily. We can also run recorders at different time resolutions, if needed. And in case we don't need certain functions, for example network traffic per NIC, we simply shut down nicrec without affecting the other recorders. So it's easy and simple.

Raw Data

A single recorder means lots of metrics to report. Say we had agent_one, a main data recorder which looks at the overall system, the CPUs, the disks, etc. The payload of each message would increase when running a single recorder. And since we want to store the data collected, we would then need to split it and analyse the data for CPUs, disks, etc. separately.
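To make that extra splitting step concrete, here is a minimal sketch of what the receiving side would have to do with one combined record from a single agent. The "resource.metric" key layout and the metric names are invented for this example and are not a Kronometrix format.

```python
def split_by_resource(record):
    """Group a flat 'resource.metric' record into one dict per resource."""
    groups = {}
    for key, value in record.items():
        # "cpu0.user" -> resource "cpu0", metric "user"
        resource, _, metric = key.partition(".")
        groups.setdefault(resource, {})[metric] = value
    return groups


# One combined message, as a hypothetical single agent might send it.
combined = {
    "cpu0.user": 12.5,
    "cpu0.sys": 3.0,
    "disk0.iops": 140,
    "nic0.rxkb": 88.2,
}
for resource, metrics in split_by_resource(combined).items():
    print(resource, metrics)
```

With one recorder per resource this step disappears, because each raw data file already holds a single kind of record.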

The Package

Once upon a time Kronometrix used no more than 1MB of disk space, and we were happy like that. But soon we discovered that people from the financial and banking sector were not happy about changing or installing new libraries on their systems to allow us to run Kronometrix. Worse, such sites usually have very strict requirements about which operating system packages may be installed and which may not.
So we needed to rethink and adopt another mechanism to deploy Kronometrix on such networks. We ended up shipping our own Perl distribution plus OpenSSL with Kronometrix. This way we are able to keep running without requiring any extra dependencies from customers.

One Tool to rule them all

Our approach is simple, easy and offers flexibility on different networks. The Kronometrix data recording package is automated for the majority of operating systems out there: FreeBSD, RedHat, CentOS, ClusterLinux, OpenSUSE, Debian, Ubuntu, Solaris, Windows.

Wednesday, June 10, 2015

Who are you? The story about DBus and cloning virtual machines

The Kronometrix Data Recording Module ships with a transport utility called sender, which ensures all raw data is shipped to one or many data analytics appliances over HTTP or HTTPS.

sender, a Perl 5 citizen, checks for raw data updates and relies on a host UUID from the operating system to deliver that raw data. If such a host UUID is found, sender uses it for the entire duration of its execution. If none is found, it generates a new one and stores it in its configuration file, kronometrix.json. The analytics appliance relies on each data source being unique and valid, properly checked or generated by the transport utility, sender.
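The "use it if found, otherwise generate and store" logic can be sketched in a few lines. This is an illustration in Python, not the actual Perl code of sender: the function name is made up, and storing the id under a top-level "dsid" key in kronometrix.json is an assumption.

```python
import json
import uuid
from pathlib import Path


def load_or_create_dsid(config_path, host_uuid=None):
    """Return the host UUID if the OS gave us one, else a persisted generated id."""
    if host_uuid:
        # The operating system provided an identifier: use it as-is.
        return host_uuid
    path = Path(config_path)
    config = json.loads(path.read_text()) if path.exists() else {}
    if not config.get("dsid"):
        # Nothing stored yet: generate a random UUID and remember it
        # (assumed "dsid" key, see the note above).
        config["dsid"] = str(uuid.uuid4())
        path.write_text(json.dumps(config, indent=2))
    return config["dsid"]
```

Persisting the generated value matters: without it, every restart would produce a new id and the appliance would see the same host as a brand new data source.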

But what happens if this does not really work?

Data Source Id

Kronometrix uses the concept of a data source id (DSID) to identify a data source point from which one or many data messages are received. For example, for IT computer performance data, a DSID identifies a computer system host, physical or virtual, connected or not to a TCP/IP network.

The DSID is obtained from operating system core functions. For example:
  • Linux: we speak to DBus and try to get it via the machine-id file
  • FreeBSD: we ask the sysctl interface for kern.hostuuid
  • otherwise: we compute one, using the UUID::Tiny Perl 5 module
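The lookup order above could be sketched as follows. This is a Python illustration rather than the Perl implementation: the function names are invented, the two machine-id paths are the usual systemd and D-Bus locations rather than something the post specifies, and a random UUID stands in for whatever UUID::Tiny is asked to compute.

```python
import platform
import subprocess
import uuid
from pathlib import Path

# Common machine-id locations (assumption: the post does not name the path).
MACHINE_ID_PATHS = ("/etc/machine-id", "/var/lib/dbus/machine-id")


def read_machine_id(paths=MACHINE_ID_PATHS):
    """Return the first non-empty machine-id found, or None."""
    for path in paths:
        try:
            value = Path(path).read_text().strip()
        except OSError:
            continue  # missing or unreadable file: try the next one
        if value:
            return value
    return None


def read_freebsd_hostuuid():
    """Ask sysctl for kern.hostuuid; return None if that fails."""
    try:
        out = subprocess.run(
            ["sysctl", "-n", "kern.hostuuid"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def resolve_host_id(system=None, paths=MACHINE_ID_PATHS):
    """Pick an identifier from the OS, falling back to a random UUID."""
    system = system or platform.system()
    host_id = None
    if system == "Linux":
        host_id = read_machine_id(paths)
    elif system == "FreeBSD":
        host_id = read_freebsd_hostuuid()
    return host_id or str(uuid.uuid4())
```

Note that nothing in this lookup can tell whether the value read from the machine-id file is actually unique across hosts; it only checks that a value exists.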

Who are you?

Working closely with one of our customers, we saw that they were not receiving data from a number of virtual machines where we had previously installed Kronometrix. Looking into this, we discovered that sender was producing the same data source id across a number of virtual machines:

"dsid" : "96d5b4a4-d0fa-54a8-ba74-14cc978041f1"

So to the Kronometrix Analytics Appliance all these hosts looked more or less identical, having the same DSID. Not good.

What's wrong?

As simple as that: we found out that one VM was used to clone other VMs, and by mistake the machine-id file was cloned as well. DBus on CentOS 6.x did produce a sane and valid machine-id, but then the VM configuration was taken and cloned, including the machine-id. In this respect, at the operating system level all hosts were identified as having the same host UUID. No software was reporting or complaining about this malfunction.

Our system was able to discover this trouble immediately, and we proposed a fix to the Operation Center group. Later the cloning procedure was fixed to ensure the machine-id on Linux is no longer cloned, and data was finally flying to our appliance. Nice and easy.
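A check for this kind of clash is easy to sketch. The code below is not how the appliance does it; it only assumes that each incoming message carries a "dsid" and a "hostname" field (both names are assumptions) and reports every DSID claimed by more than one host.

```python
from collections import defaultdict


def find_cloned_dsids(messages):
    """Map each DSID reported by more than one hostname to those hostnames."""
    hosts_by_dsid = defaultdict(set)
    for msg in messages:
        hosts_by_dsid[msg["dsid"]].add(msg["hostname"])
    return {
        dsid: sorted(hosts)
        for dsid, hosts in hosts_by_dsid.items()
        if len(hosts) > 1
    }


# Hypothetical messages: two cloned VMs share one DSID, a third host is fine.
messages = [
    {"dsid": "96d5b4a4-d0fa-54a8-ba74-14cc978041f1", "hostname": "vm-a"},
    {"dsid": "96d5b4a4-d0fa-54a8-ba74-14cc978041f1", "hostname": "vm-b"},
    {"dsid": "11111111-2222-4333-8444-555555555555", "hostname": "db-1"},
]
print(find_cloned_dsids(messages))
```

A host that legitimately changes its hostname would also show up here, so a report like this is a prompt to go and look, not proof of a cloned machine-id.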

No more clones :)