Hive gathering information


Tiered collection

First, both suexec and cron log every execution along with its statistics, and those statistics are passed to the cpustatsd socket.
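A minimal sketch of this first step, assuming a Unix datagram socket and a small JSON record per execution (the real cpustatsd wire format and socket path are not documented here, so both are illustrative):

```python
import json
import os
import socket
import tempfile

# Hypothetical socket path; the real cpustatsd location is an assumption.
SOCK_PATH = os.path.join(tempfile.mkdtemp(), "cpustatsd.sock")

# Stand-in for cpustatsd's listening end: a Unix datagram socket.
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(SOCK_PATH)

def report_execution(user, cpu_seconds):
    """What suexec/cron do after each execution: push one small
    record to the daemon instead of writing a log line to disk."""
    client = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    record = json.dumps({"user": user, "cpu": cpu_seconds}).encode()
    client.sendto(record, SOCK_PATH)
    client.close()

report_execution("alice", 0.42)
data, _ = server.recvfrom(4096)
print(json.loads(data))  # {'user': 'alice', 'cpu': 0.42}
```

The point of the socket hand-off is that the producers (suexec, cron) do almost no work per execution; all aggregation happens inside the daemon.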

Each hour, the information collected in cpustatsd's memory is pushed to the local PostgreSQL DB.
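The hourly flush can be sketched as follows. The counter and table names are illustrative, and sqlite3 stands in for the local PostgreSQL DB so the sketch is self-contained:

```python
import sqlite3
from collections import defaultdict

# In-memory per-user CPU counters cpustatsd would accumulate between flushes.
counters = defaultdict(float)
counters["alice"] += 0.5
counters["alice"] += 0.25
counters["bob"] += 1.25

# sqlite3 stands in for the local PostgreSQL DB in this sketch.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cpu_hourly (hour TEXT, user TEXT, cpu_seconds REAL)")

def flush(hour):
    """Hourly job: write the accumulated counters to the DB and reset them."""
    db.executemany(
        "INSERT INTO cpu_hourly VALUES (?, ?, ?)",
        [(hour, user, cpu) for user, cpu in counters.items()],
    )
    db.commit()
    counters.clear()

flush("2010-01-01 13:00")
rows = db.execute(
    "SELECT user, cpu_seconds FROM cpu_hourly ORDER BY user"
).fetchall()
print(rows)  # [('alice', 0.75), ('bob', 1.25)]
```

Because only pre-aggregated totals are written, one flush is a handful of inserts per hour rather than one row per execution.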

Then each day (by default at 00:30) a cron job collects the top 25 users from each server and submits the information to the central portal, if one is running for your servers. It also pulls summary CPU info for the last 24 hours.
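The top-N selection amounts to a per-user aggregate over the last 24 hours of hourly rows. A sketch, again with sqlite3 standing in for the per-server DB and an illustrative schema (the limit is 25 in production; 2 here so the cut-off is visible):

```python
import sqlite3

# Illustrative schema, not the real Hive one; sqlite3 stands in for PostgreSQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE cpu_hourly (hour TEXT, user TEXT, cpu_seconds REAL)")
db.executemany("INSERT INTO cpu_hourly VALUES (?, ?, ?)", [
    ("2010-01-01 10:00", "alice", 3.0),
    ("2010-01-01 11:00", "bob", 5.0),
    ("2010-01-01 12:00", "alice", 1.0),
])

# The daily cron job: sum CPU time per user over the collected hours
# and keep only the heaviest consumers for submission to the portal.
top = db.execute(
    "SELECT user, SUM(cpu_seconds) AS total FROM cpu_hourly "
    "GROUP BY user ORDER BY total DESC LIMIT 2"
).fetchall()
print(top)  # [('bob', 5.0), ('alice', 4.0)]
```

Only this short top-N list crosses the network, which is what keeps the central DB from having to absorb raw per-execution data.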

So to summarize:

socket -> cpustatsd -> local DB -> cron jobs -> Central DB

Why did we do it this way?

First we tried to parse the information from the logs once a day. Parsing is a very CPU- and I/O-intensive task which took a lot of time and resources, and this was a big problem for our servers.

Then we decided to parse the logs every hour, hoping to reduce the CPU and I/O overhead. This proved to be an even bigger issue than the first one.

Finally we decided to offload the parsing to a single remote server. That machine was simply overwhelmed by the amount of data it had to parse, and on top of the enormous overload, this was an additional economic setback.

So in the end, we designed cpustatsd, a system daemon which parses the information on the fly. Initially we pushed the information from its memory to the DB once a day. This was not fast enough for the management, so we decided to do it every hour.

On heavily used servers, anything more frequent than that would mean using more of the machine's resources to push the information into the DB. Since that is not optimal, and getting new information each hour seems fast enough, we have decided to stay with the functionality we are currently using.

So to summarize: we wanted a distributed/dispersed load that does not overload any single machine, while at the same time using as little CPU as possible and minimizing I/O.
