HOWTO


Introduction
------------
This HOWTO describes how to set up a loadbalanced, redundant and
distributed network monitoring system using op5 Monitor. Note that
Merlin is a part of op5 Monitor. Non-customers will have to adjust
paths etc used throughout this guide in order to be able to use it.
Replacing "op5 Monitor" with "nagios + merlin" will be a good start
for those of you venturing into the unknown without the aid of our
rather excellent support services. For those wishing to configure
only distributed network monitoring, or only loadbalanced or
redundant, this is still a good guide.

The guide will assume that we're installing two redundant and
loadbalanced master-servers ("yoda" and "obi1"), with three
poller servers, two of which are peered with each other. The single
poller will be designated "solo". The peered pollers will be "luke"
and "leya". We'll assume that each of these servers has its name in
the DNS and can be looked up that way. 'solo' will be responsible for
monitoring the hostgroup 'hyperdrive'. 'luke' and 'leya' will share
responsibility in monitoring the hostgroups 'theforce' and 'tattoine'.

With this setup, communications will go like this:
luke will talk with:
leya, obi1, yoda

leya will talk with:
luke, obi1, yoda

solo will talk with:
obi1, yoda

obi1 will talk with:
yoda, luke, leya, solo

yoda will talk with:
obi1, luke, leya, solo
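
The list above can be written down as a small lookup, which is handy
as input when working out firewall rules. This is only a sketch: the
function name 'talks_to' is made up for this example and is not part
of mon or Merlin.

```shell
# talks_to NODE: print the nodes NODE communicates with, as listed above.
# ('talks_to' is a hypothetical helper, not a mon command.)
talks_to() {
  case "$1" in
    luke) echo "leya obi1 yoda" ;;
    leya) echo "luke obi1 yoda" ;;
    solo) echo "obi1 yoda" ;;
    obi1) echo "yoda luke leya solo" ;;
    yoda) echo "obi1 luke leya solo" ;;
    *)    echo "unknown node: $1" >&2; return 1 ;;
  esac
}

# Print one "source destination" pair per line.
for node in yoda obi1 luke leya solo; do
  for dest in $(talks_to "$node"); do
    echo "$node $dest"
  done
done
```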


Preparations
------------
The following need to be in place for this HOWTO to be usable, but
how to obtain or set them up is outside the scope of this article.

* Make sure you have the passwords for the root accounts on all the
  servers intended to be part of the monitoring network. This will
  be needed in order to configure merlin.
* Open the firewalls for port 15551 (merlin's default port) and 22.
  Both ends will attempt to connect on port 15551, so it's ok if only
  one side of the intended connection can connect to the other. For
  port 22, it's a little bit more complicated and in order to get the
  full shebang of features both ends will need to be able to initiate
  connections with the other. It's possible to get away with not
  allowing pollers to initiate connections to the master server, but
  certain recovery operations will then not be possible.
* op5 Monitor needs to be installed on all systems intended to be part
  of the network.
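
A quick way to check the firewall openings listed above is to probe
both ports on every node. The following is a minimal sketch, assuming
a netcat that supports -z and -w; the helper names 'probe' and
'check_ports' are made up for this example. Keep in mind that port
15551 only answers once merlind is running on the other end, so a
failure there before the system is started isn't necessarily a
firewall problem.

```shell
# probe HOST PORT: succeed if a TCP connection can be opened.
# Assumption: the installed netcat supports -z (connect only) and
# -w (timeout in seconds).
probe() {
  nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# check_ports NODE...: report reachability of merlin (15551) and ssh (22)
# on each node. Returns non-zero if any probe failed.
check_ports() {
  failed=0
  for node in "$@"; do
    for port in 15551 22; do
      if probe "$node" "$port"; then
        echo "ok   $node:$port"
      else
        echo "FAIL $node:$port"
        failed=1
      fi
    done
  done
  return "$failed"
}

# Example, run from yoda:
#   check_ports obi1 luke leya solo
```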


Hello mon'!
-----------
Included in op5 Monitor is the 'mon' helper. mon is a nifty little
tool designed to help with configuring, managing and verifying the
status of a distributed Merlin installation.

Its usage is quite simple: just type 'mon' and you'll get a list of
all available categories and commands. Some commands lack a category
and are runnable all by themselves, such as 'stop', 'start' and
'restart', which take care of stopping, starting and restarting
monitor and merlin using the proper shutdown and startup sequences.

We'll soon see exactly how useful that little helper is.


Step 1 - Configure Merlin on one of the master systems
------------------------------------------------------
The aforementioned 'mon' helper has a 'node' category. This is useful
for manipulating configured nodes in merlin's configuration file,
among other things.
We'll start with configuring Merlin properly on 'yoda'. The commands
to do so will look like this (yes, there's a typo there):

  mon node add obi1 type=peer
  mon node add luke type=poller hostgroup=theforce,tattoine
  # type=poller is default, so we don't have to spell it out
  mon node add leya hostgroup=theforce,tattoine
  mon node add solo hostgroup=hyperride

The 'node' category also has a 'remove' command, so when we've noticed
the typo we made when adding the poller 'solo', we can fix that by
removing the faulty node and adding it again:

  mon node remove solo
  mon node add solo type=poller hostgroup=hyperdrive

You may verify that you've done things right in a couple of different
ways. 'mon node list' lists nodes. It accepts a --type= argument, so
if we want to list all pollers and peers, we can run:

  mon node list --type=poller,peer

This, in conjunction with 'mon node show', is excellent to use
when scripting.
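
As an illustration of such scripting, here is a minimal sketch that
dumps the settings of every node of a given type. It assumes that
'mon node list' prints node names separated by whitespace and that
'mon node show' accepts a node name as its argument; check both on
your installation first. The function name 'show_nodes' is made up
for this example.

```shell
# show_nodes TYPES: run 'mon node show' for every node of the given types.
# Assumptions: 'mon node list' prints whitespace-separated node names,
# and 'mon node show' takes a node name as argument.
show_nodes() {
  for node in $(mon node list --type="$1"); do
    echo "== $node"
    mon node show "$node"
  done
}

# Example:
#   show_nodes poller,peer
```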

The contents of the configuration file, which by default resides in
/opt/monitor/op5/merlin/merlin.conf, should now include something
like this:
        --8<--8<--8<--
        peer obi1 {
                address = obi1
                port = 15551
        }
        poller luke {
                address = luke
                port = 15551
                hostgroup = theforce,tattoine
        }
        poller leya {
                address = leya
                port = 15551
                hostgroup = theforce,tattoine
        }
        poller solo {
                address = solo
                port = 15551
                hostgroup = hyperdrive
        }
        --8<--8<--8<--


Step 2 - Distribute ssh-keys
----------------------------
The 'mon' helper has an sshkey category. The commands in it let you
push and fetch sshkeys from remote destinations.

  mon sshkey push --all

will append your ~/.ssh/id_rsa.pub file to the authorized_keys file
for the root and monitor user on all configured remote nodes.
If you don't have a public keyfile, one will be generated for you.
Please note that if you generate a keyfile yourself, it must not have
a passphrase, or configuration synchronization will fail to work.

So far we've set up one-way communication. 'yoda' can now talk to
all the other systems without having to use passwords. In order to
fetch all the keys from the remote systems, we'll use the following
command:

  mon sshkey fetch --all

This will fetch all the relevant keys into ~/.ssh/authorized_keys.
So every node can talk to 'yoda', and 'yoda' can talk to every
other node. That's great. But 'luke' and 'leya' need to be able
to talk to each other as well, and all the pollers need to be
able to talk to 'obi1'. Since we have all keys except our own in
~/.ssh/authorized_keys, we can simply amend it with the key we
generated earlier and distribute the resulting file to every node.
Or we can do what we just did for 'yoda' on all the other nodes,
and simply wait until we're done configuring merlin on all nodes
and then log in to them and run the 'mon sshkey push --all' and
'mon sshkey fetch --all' commands there too.

You can verify that this works by running:
  mon node ctrl -- 'echo hostname is $(hostname)'


Step 3 - Configure Merlin on the remote systems
-----------------------------------------------
I sneakily introduced the 'node ctrl' command in the last section.
This time we'll use it rather heavily, along with the 'node add'
command which will run on the remote systems.

First we add ourselves and 'obi1' as masters to all pollers:
  mon node ctrl --type=poller -- mon node add yoda type=master
  mon node ctrl --type=poller -- mon node add obi1 type=master

Now the poller 'solo' is actually done and configured already.

Then we add ourselves as a peer to all our peers (just 'obi1' really,
but in case you build larger networks, this will work better):
  mon node ctrl --type=peer -- mon node add yoda type=peer

Then we add all pollers to 'obi1':
  mon node ctrl obi1 -- mon node add luke hostgroup=theforce,tattoine
  mon node ctrl obi1 -- mon node add leya hostgroup=theforce,tattoine
  mon node ctrl obi1 -- mon node add solo hostgroup=hyperdrive

And finally we add luke and leya as peers to each other:
  mon node ctrl leya -- mon node add luke type=peer
  mon node ctrl luke -- mon node add leya type=peer

solo will have the following config:
        --8<--8<--8<--
        master yoda {
                address = yoda
                port = 15551
        }
        master obi1 {
                address = obi1
                port = 15551
        }
        --8<--8<--8<--

luke will have this in its config file:
        --8<--8<--8<--
        master yoda {
                address = yoda
                port = 15551
        }
        master obi1 {
                address = obi1
                port = 15551
        }
        peer leya {
                address = leya
                port = 15551
        }
        --8<--8<--8<--

leya will have this:
        --8<--8<--8<--
        master yoda {
                address = yoda
                port = 15551
        }
        master obi1 {
                address = obi1
                port = 15551
        }
        peer luke {
                address = luke
                port = 15551
        }
        --8<--8<--8<--

obi1 will have this:
        --8<--8<--8<--
        peer yoda {
                address = yoda
                port = 15551
        }
        poller luke {
                address = luke
                port = 15551
                hostgroup = theforce,tattoine
        }
        poller leya {
                address = leya
                port = 15551
                hostgroup = theforce,tattoine
        }
        poller solo {
                address = solo
                port = 15551
                hostgroup = hyperdrive
        }
        --8<--8<--8<--


Step 4 - Verifying configuration and ssh-key setup
--------------------------------------------------
Now that we have merlin configured properly on all our five nodes,
we can use a recursive version of the 'node ctrl' command to make
sure ssh works properly from all systems to all systems they need
to talk to. Try pasting this into the console:

  mon node ctrl -- \
    'echo On $(hostname); hostname | mon node ctrl -- \
          '\''from=$(cat); echo "@$(hostname) from $from"'\''' \
  | grep ^@

Hard to follow? I agree, but it should produce something like this:

If it does, that means ssh keys are properly installed, at least for
the root user(s). If the command seems to hang somewhere in the middle,
try rerunning it without the ending 'grep' statement. If that ends up
with a password prompt appearing, you'll need to revisit the sshkey
configuration and again make sure every node that should be able to
talk to other nodes can talk to the nodes it should be able to talk
to. Simple, eh?

(XXX; Test this and make sure it actually works like this)
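
If the recursive command hangs on a password prompt, it can be easier
to test one hop at a time. The sketch below uses plain OpenSSH rather
than mon: BatchMode=yes makes ssh fail instead of asking for a
password, and ConnectTimeout keeps an unreachable node from stalling
the loop. The function name 'check_ssh' is made up for this example,
and it only checks root logins from the machine it is run on, so
repeat it on every node that needs to initiate connections.

```shell
# check_ssh NODE...: try a non-interactive ssh login as root to each node.
# Returns non-zero if any login failed.
check_ssh() {
  bad=0
  for node in "$@"; do
    # 'true' is the remote command; stdin comes from /dev/null so ssh
    # never waits for input.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$node" true </dev/null; then
      echo "ok   $node"
    else
      echo "FAIL $node"
      bad=1
    fi
  done
  return "$bad"
}

# Example, run on luke:
#   check_ssh leya obi1 yoda
```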

Step 5 - Configuring Nagios
---------------------------
Handling object configuration is very much out of scope for this
article, but there are a few rules (most of them are actually more
like guidelines, but things will be confusing if you don't follow
them, so please do) one needs to adhere to in order for Merlin to
work properly.

* Each host that is a member of a hostgroup used to distribute work
  from a master to a poller should never be a member of a hostgroup
  that is used to distribute work to a different poller. In our
  case, that means that any host that is a member of either 'theforce'
  or 'tattoine' shouldn't also be a member of 'hyperdrive'.

* Two peers absolutely must have identical object configuration.
  This is due to the way loadbalancing works in Merlin. In our case
  that means that since 'luke' is responsible for 'theforce' and
  'tattoine', its peer 'leya' must also be responsible for exactly
  those two hostgroups, and no other.

That's basically it. It's possible to circumvent these rules, but
if you do, you're on your own. No tools currently exist to enforce
them, and Merlin won't complain if you suddenly add another poller
that's responsible for 'tattoine' and 'hyperdrive', even though
such a configuration obviously violates the above
rules.
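
Since nothing enforces the first rule, a small check of your own can
help. The following is only a sketch: it assumes you can produce a
list of "host hostgroup" pairs from your object configuration by some
other means (that input format is invented for this example), and the
helper names 'pollers_for' and 'find_conflicts' are hypothetical.

```shell
# pollers_for HOSTGROUP: print the poller set responsible for a hostgroup,
# or nothing if the hostgroup isn't used to distribute work.
pollers_for() {
  case "$1" in
    theforce|tattoine) echo "luke,leya" ;;
    hyperdrive)        echo "solo" ;;
    *)                 echo "" ;;
  esac
}

# find_conflicts: read "host hostgroup" pairs on stdin and print every
# host that ends up assigned to more than one poller set.
find_conflicts() {
  while read -r host group; do
    pset=$(pollers_for "$group")
    if [ -n "$pset" ]; then
      echo "$host $pset"
    fi
  done | sort -u |
    awk '{ seen[$1]++ } END { for (h in seen) if (seen[h] > 1) print h }' |
    sort
}
```

A host that is in both 'theforce' and 'tattoine' is fine, since both
go to the same pair of pollers. A host in 'tattoine' and 'hyperdrive'
gets reported.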


Step 6 - Synchronization configuration
--------------------------------------
Configuration synchronization will be a bit easier for you if you
move all of monitor's object configuration files to a cfg_dir
instead of using the default layout of mixing object configuration
with Nagios' main configuration file and other various stuff. This
is especially true for pollers and doesn't matter nearly as much
for masters.

The quick and easy way to set it up so that it works like that is
by running the following commands from 'yoda':

  dir=/opt/monitor/etc/oconf
  conf=/opt/monitor/etc/nagios.cfg
  mon node ctrl -- sed -i /^cfg_file=/d $conf
  mon node ctrl -- sed -i /^log_file=/acfg_dir=$dir $conf
  mon node ctrl -- mkdir -m 775 $dir
  mon node ctrl -- chown monitor:apache $dir

Now, if you run:
  mon node ctrl -- mon oconf hash

you should get a list of 'da39a3ee5e6b4b0d3255bfef95601890afd80709'
as output from all nodes. That means the pollers now have an empty
object configuration, which is just the way we like it since we'll
be pushing configuration from one of our two peered masters to all
the pollers.

("da 39 hash" is what you get from sha1 when you don't feed it any
input at all)
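
You can reproduce that value yourself. This small sketch assumes
sha1sum from GNU coreutils is installed; the helper name
'is_empty_oconf' is made up for this example.

```shell
# SHA-1 of empty input, which is the value an empty object configuration
# hashes to.
empty_sha1=$(printf '' | sha1sum | cut -d' ' -f1)
echo "$empty_sha1"

# is_empty_oconf HASH: succeed if HASH is the empty-input SHA-1.
is_empty_oconf() {
  [ "$1" = "$empty_sha1" ]
}
```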

In merlin, you can configure a script to run that takes care of
syncing configuration. This script should also restart monitor on
the receiving ends when it's done sending configuration. In the
Merlin world, this is handled by a single command that gets run
once when we detect that we have a newer configuration than any
of our peers or pollers.

  mon oconf push

takes no arguments at all. It does parse merlin.conf though and
creates complete configuration files for all the pollers, which
by default get sent to /opt/monitor/etc/oconf/from-master.cfg
on each respective poller, which is then restarted. Again by
default, it will also send the entire /opt/monitor/etc directory
to all its peers, using rsync --delete to make sure all systems
are fully synced. Currently though, only changes to the object
config trigger a full sync, so perhaps there's room for
improvement.

Config sync is configured either globally via an object_config
compound in the daemon section of the config file, or via those
same object_config compounds in each node if one wants to
override how one system syncs to another. It could look something
like this, for instance:

  daemon {
    object_config {
      # the command to run when one or more peers or
      # pollers have older configuration than we do
      push = mon oconf push

      # the command to run when one or more masters or
      # peers have a newer configuration than we do
      #pull = some random command
    }
  }

  peer obi1 {
    object_config {
      # the command to run when obi1 has older config than we do
      # overrides the global command
      push = rsync -aovtr --delete /opt/monitor/etc obi1:/opt/monitor
      # command to run when obi1 has newer config than we do
      #pull = some random command
    }
  }

Caveats:
* The 'pull' thing is highly untested and I'm unsure how it would
  work if one node tries to pull from another while that other node
  is pushing at the same time. Care should be taken to avoid such
  setups.
* The only supported scenario is to have the master with the most
  recently changed configuration push that config to its peers and
  pollers. This *will* create avalanche pushing if one uses peered
  pollers that in turn have pollers themselves, since all peered
  pollers that in turn have pollers will try to push at the same
  time. Due to this, more than 2 tiers is currently not supported
  officially, although it works just fine for everything else in our
  lab environment.
* Config pushing from master to poller requires the objects.cache
  file in order to split config for each poller. Since config pushes
  should always be initiated by a running Merlin anyways, this isn't
  much of a problem once you've done the first push and everything
  is up and running, but when first setting up the system it will
  be tricksy to get things to run smoothly.

The object_config compounds can contain whatever variables you
like without Merlin complaining about them, and

  mon node show

will show them as
  OCONF_PUSH=mon oconf push
  OCONF_WHATEVER_YOU_NAMED_YOUR_VARIABLE=somevalue

so you can quite easily add some other scripted solution to support
your needs. 'mon oconf push' happens to use two such private vars,
namely 'source' and 'dest'. 'source' is really only used when pushing
configuration to peers, and 'dest' is what we end up using as target
when pushing the configuration. So if you want your peer sync script
to only send the /opt/monitor/etc/oconf directory that we created
before, you can quite easily set that up by configuring your peer thus:

peer obi1 {
  address = obi1
  port = 15551
  type = peer

  object_config {
    push = mon oconf push
    source = /opt/monitor/etc/oconf
    dest = /opt/monitor/etc
  }
}
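
A script of your own can pick up such variables from the output of
'mon node show'. The following is a minimal sketch, assuming that the
command accepts a node name, prints one KEY=value pair per line and
exposes object_config variables as OCONF_ followed by the variable
name in upper case, as in the listing above. The function name
'oconf_var' is made up for this example.

```shell
# oconf_var NODE NAME: print the value of an object_config variable.
# Assumptions: 'mon node show NODE' prints KEY=value lines, and
# object_config variables appear as OCONF_<NAME> in upper case.
oconf_var() {
  key="OCONF_$(printf '%s' "$2" | tr '[:lower:]' '[:upper:]')"
  mon node show "$1" | sed -n "s/^$key=//p"
}

# Example:
#   oconf_var obi1 source
```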

The 'oconf push' command uses another command internally:
  mon oconf nodesplit

This you can run without interfering with anything. In our case, it
would print something like this:

  Created /var/cache/merlin/config/luke with 1154 objects for hostgroup
  list 'theforce,tattoine'
  Created /var/cache/merlin/config/leya with 1154 objects for hostgroup
  list 'theforce,tattoine'
  Created /var/cache/merlin/config/solo with 652 objects for hostgroup
  list 'hyperdrive'.

You can inspect the files thus created and see if they seem to fit your
criteria. Note that they will be rather large, since templates aren't
sent to the poller nodes.

Step 7 - Starting the distributed system
----------------------------------------
Once you've inspected the configuration and you like what you see,
it's time to activate it and get some monitoring going on. Run the
following sequence of commands when you're ready:

  mon restart; sleep 3; mon oconf push

This should send configuration to all the pollers and peers and then
attempt to restart monitor and merlin on those nodes. Pushing config
to masters is not yet supported, although scripting it wouldn't be
too hard for those who are interested. Do see the notes about 'pull'
above though.


Step 8 - Verifying that it works
--------------------------------
The first thing to do is to run:

  mon node status

It will quite quickly become apparent that this little helper is
awesome for finding problems in your merlin setup. It connects to
the database, grabs the currently running nodes and prints a lot
of information about them, such as if they're active, when they
last sent a program_status_data event, how many checks they're
doing and what their check latency is. If, for some reason,
one node has crashed or is otherwise unable to communicate with
the node you're looking from, you'll find this out quite quickly
using this little helper.

Filing a bugreport without including output from a run of this
program on all nodes is a hanging offense. You have been warned.


Step 9 - Finding out why it doesn't
-----------------------------------
These are general guidelines for troubleshooting certain issues
in Merlin. It involves digging through logfiles, running small helper
programs and generally just tinkering around trying to figure out
what happened, what's happening and "what will happen when I do this?"
Most of it is stuff that has come up during beta-testing or that has
been problematic in the past. If new common problems arise, I'll add
more recipes to this little guide.


Problem: Loadbalancing seems to have stopped working even though
         all my peers are ACTIVE and were last seen alive "3s ago"
Answer:
It can sometimes look like that if you check the output of

  mon node status

after having restarted Merlin or Monitor. Most of the time, it's
because the peers switched peer-ids and are now slowly taking
over each other's checks. If the problem isn't resolving itself
so that the number of checks performed by each peer is converging
on an equal split, something else has gone wrong and a more thorough
investigation is necessary.
Checking if all peers have the same configuration is the first step:

  mon node ctrl --type=peer -- mon oconf hash
  mon node ctrl --type=peer -- sha1sum /opt/monitor/var/objects.cache

If they do, you might have run into the intermittent error that
some users have seen with peers in a loadbalanced setup. Restarting
the affected systems usually restores them to good working order:

  mon node ctrl --type=peer -- mon restart
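
Comparing the hashes by eye gets tedious with more than a couple of
peers. Here is a minimal sketch that does the comparison for you,
assuming each hash shows up in the output as a bare 40-character
hexadecimal string. The function name 'same_config' is made up for
this example, and grep -o is an extension found in GNU and BSD grep.

```shell
# same_config: read hash output on stdin and succeed only if every
# 40-character SHA-1 found in it is identical (and at least one exists).
same_config() {
  count=$(grep -Eo '[0-9a-f]{40}' | sort -u | wc -l)
  [ "$count" -eq 1 ]
}

# Example; the local hash is included since 'node ctrl' only runs the
# command on the remote peers:
#   { mon oconf hash; mon node ctrl --type=peer -- mon oconf hash; } \
#     | same_config && echo "peers agree"
```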


Problem: Logfiles are flooded with messages about 'nulling OOB pointer'
Answer:
Merlin uses a highly efficient binary protocol to transfer events
between module and daemon and across the network to other nodes.
The way the codec works makes it not-really-but-almost impossible
to support network nodes with different wordsize and byte order.
That is, 32bit and 64bit systems can't talk to each other, and
servers running i386-type CPUs can't communicate with PowerPC or
other big-endian machines. PDP11 won't work with anything but other
PDP11s. They'll do just fine with each other though.

Since merlin-0.9.0-beta5, merlin detects when a node with different
wordsize, byte order or object structure version connects and warns
about such incompatibilities. grep the logfiles for 'FATAL.*compat'
and you should see if that's the problem.

If it isn't, and it's the module logfile which holds all the messages,
you've almost certainly hit a compatibility problem with Nagios, or
a concurrency issue related to threading. There shouldn't really be
any compatibility problems, since Merlin will unload itself if the
version of Nagios that loads it has a different object structure
version than we're expecting of it, but I suppose weirder things
have happened than a random malfunction in a piece of software.


Problem: Database isn't being updated
Answer:
Inserting events into the database is the job of the daemon.
Information about its problems can be found in the daemon
logfile, /opt/monitor/op5/merlin/daemon.log. If no "query failed"
messages can be found there, check the neb.log file to see if
it's sending anything, and look for disconnected peers and pollers
with:

  mon node status


Problem: 'mon node status' shows one or more nodes as 'INACTIVE'
Answer:
Check that merlin and monitor are running on the remote systems.

  mon node ctrl $node -- pidof monitor
  mon node ctrl $node -- pidof merlind
  mon node ctrl $node -- mon node status

If they're not, you've found the symptom, so check the logfiles
or look for corefiles on those systems.
If they are, try

  grep -i connect /opt/monitor/op5/merlin/logs/daemon.log

If you see a lot of connection attempts to the INACTIVE node and
no "connection refused", it's almost certainly a firewall issue.
If you see the "connection refused" thing, it's almost certainly
due to either merlind not running or misconfiguration.
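
The reasoning above can be sketched as a small filter. It assumes
that the relevant log lines mention the node's name, which I haven't
verified against the actual log format, and the function name
'diagnose_connect' is made up for this example.

```shell
# diagnose_connect NODE: read daemon.log lines on stdin and suggest a
# likely cause. Assumption: connection-related lines contain NODE's name.
diagnose_connect() {
  lines=$(grep -i "$1" | grep -i 'connect' || true)
  if [ -z "$lines" ]; then
    echo "no connection attempts logged for $1"
  elif printf '%s\n' "$lines" | grep -qi 'connection refused'; then
    echo "refused: merlind not running on $1, or misconfiguration"
  else
    echo "no refusals: suspect a firewall between here and $1"
  fi
}

# Example:
#   diagnose_connect solo < /opt/monitor/op5/merlin/logs/daemon.log
```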


Problem: I've found a corefile
Answer:
Goodie. Now do something useful with it, pretty, pretty please.

  file core

will tell you which command was run to create it, so then you can
run:

  # gdb -q /path/to/$offending_program core
  gdb> bt

If the corefile came from monitor, you'll need to run:

  mondebug core

instead, since it will otherwise include a lot of "unresolved symbols"
in the backtrace, which basically means that it's completely useless
and I would still have to re-do it. At least if the core was caused
by a module, which is something we need to know.

Send me the output of both the "file" command and the backtrace
that gdb prints, along with the corefile. This is basically everything
I do when I get a corefile anyways, but for me to be able to do it if
you send me only the corefile means I'll have to install the exact
same version of merlin, monitor and possibly a lot of system libraries
too. That's extremely cumbersome, so go the extra half-inch and grab a
backtrace while you're at it. Bugs will get fixed a billion times
faster if you do.

If the trace looks like this:
#0  0x0022c402 in ?? ()
#1  0x0062e116 in ?? ()
#2  0x0806f4fb in ?? ()
#3  0x080566cc in ?? ()

That means that either the stack has been overwritten (a bug can cause
this, and it's bad), or that there are only unresolved symbols in there.
Either way, it's fairly useless in that state, but I'll still want it
since any clue is better than no clue.
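
To get a quick idea of how usable a saved backtrace is, you can count
the frames that gdb couldn't resolve. This is just a sketch that
matches the "in ?? ()" frames shown above; the function name
'unresolved_frames' is made up for this example.

```shell
# unresolved_frames: read a gdb backtrace on stdin and print
# "<unresolved> <total>" frame counts.
unresolved_frames() {
  awk '/^#[0-9]+/ { total++; if ($0 ~ /in \?\? \(\)/) bad++ }
       END { printf "%d %d\n", bad, total }'
}

# Example:
#   unresolved_frames < backtrace.txt
```

If the two numbers are equal, the trace is in the fairly useless
state described above.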


Problem: 'mon node status' claims one node hasn't been alive for a
         very long time
Answer:
There should be a timestamp stating when it was last active. grep
for that timestamp in Merlin's logfiles and nagios.log. Start with
daemon.log on the system where you ran 'mon node status' and look
for disconnect messages. You'll have to check the logs on both the
systems to find the most likely cause.

To look through nagios.log you can use:

  mon log show --start=$when-20 --end=$when+20

although you'll have to calculate the start and end things manually,
since that command right there isn't valid shell or anything.

  mon log show --help

will provide more filtering options, since grep'ing can be tricky
without knowing what to look for.
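
The calculation can be left to the shell's arithmetic expansion.
A minimal sketch, assuming that the timestamp and the --start/--end
options are Unix timestamps in seconds; the function name
'log_window' is made up for this example.

```shell
# log_window WHEN [MARGIN]: print --start/--end arguments covering
# MARGIN seconds (default 20) on either side of WHEN.
# Assumption: WHEN is a Unix timestamp in seconds.
log_window() {
  when=$1
  margin=${2:-20}
  echo "--start=$((when - margin)) --end=$((when + margin))"
}

# Example:
#   mon log show $(log_window 1300000000)
```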

      660

      661

      662 Problem: Reports show wrong uptime/downtime/whatever

      663 Answer:

      664 In 99% of all cases this is due to missing data in the report_data

      665 table. It now resides in the merlin database as opposed to the

      666 monitor_reports database, where it used to be.

      667

      668 If you can find a particular period in the logs that happens to be

      669 broken it's not that hard to repair it, although doing so will take

      670 time and you have to shut down Monitor in order to pull it off.

      671 If the database is anything but huge, it's definitely easiest to

      672 just truncate the report_data table and recreate it from scratch.

      673 The following sequence of commands *should* take care of doing

      674 that, but it's been a while since I wrote them and I haven't got

      675 anything to test with.

      676

      677   mon stop

      678   mon node ctrl -- mon stop

      679   mon log import --fetch

      680

      681 The 'log' category of commands is also useful for importing data

      682 from remote sites. Snoop around a bit and see what you can find.

      683 'sortmerge' will thrash the disks quite a lot and use a ton of

      684 memory, but 'import' is the only one which can be potentially

      685 dangerous. Use the --no-sql option for a dry-run first if you're

      686 nervous.
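
If you'd rather see the whole sequence before anything is stopped, a
small wrapper can print the commands instead of running them. This is a
generic shell sketch and not part of Merlin: 'run' and DRY_RUN are
made-up names, and the three commands are simply the ones listed above.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints each command;
# set DRY_RUN=0 to execute them for real.
DRY_RUN=${DRY_RUN:-1}

# run CMD [ARGS...]: print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run mon stop
run mon node ctrl -- mon stop
run mon log import --fetch
```

This only previews the command lines. It doesn't replace the --no-sql
dry-run, which exercises the import itself.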

      687

      688 If the report_data table *is* huge and shutting down Monitor

      689 while repairs are under way is not an option, there may be other

      690 solutions to try but they are all situational and more or less

      691 voodooish.

      692

      693

      694 Problem: Merlin's config sync destroyed my configuration!

      695 Answer:

      696 If it happened on a poller system, that's by design. Nothing to

      697 do and nothing to try. Just re-do your work and make sure you

      698 don't do it in a file that Merlin will overwrite occasionally.

      699 If it was on a peer system, it might be possible to save it.

      700 Sneak a peek in /var/cache/merlin/backup and see if you can

      701 find your config files there.

      702

      703

      704 Problem: X happened and it's not a listed problem here

      705 Answer:

      706 Perhaps it's by design, and then again it might not be. If you

      707 think it's wrong, check the logfiles (all three of them) on

      708 all systems involved and look for anomalies. When that's done

      709 and you still haven't found anything, remain calm and write a

      710 concise report stating what you did, what you expected should

      711 happen and what happened. Feel free to include logfiles and

      712 stuff as well, since I'll almost certainly want to look at

      713 them myself.

      714

      715

      716 Problem: My peered pollers are acting up!

      717 Answer:

      718 It's possible that the pollers are trying to push their config

      719 to the master-server. The default sync command attempts to

      720 push configuration to all nodes at the same time, so one

      721 peered poller can end up pushing to the other poller. That

      722 resets the second poller, which then tries to push its own

      723 configuration to the master. Since it also pushes globally,

      724 and by default only to pollers and peers, its config lands on

      725 the first poller, which is then restarted, and so on in an

      726 endless loop.

      727 To fix it, it's usually enough to add an empty object_config

      728 section to all your master nodes, like so:

      729

      730         master yoda {

      731                 address = yoda

      732                 port = 15551

      733                 object_config {

      734                 }

      735         }

      736         master obi1 {

      737                 address = obi1

      738                 port = 15551

      739                 object_config {

      740                 }

      741         }

      742

      743 This removes the config sync command from the master nodes,

      744 so Merlin won't try to push configuration around.

      745

      746

      747 Problem: My extra files/plugins/whatever aren't being synced!

      748 Answer:

      749 In order to sync files outside of the object configuration one can add

      750 an extra compound to the node-configuration of the node one wants to

      751 sync paths to. The config should look something like this:

      752

      753 poller solo {

      754         hostgroups = hyperdrive

      755         address = solo

      756         port = 15551

      757         sync {

      758                 /path/to/file/to/sync = yes

      759                 /path/to/file2/to/sync = /path/on/remote/system

      760         }

      761 }

      762

      763 This will cause /path/to/file/to/sync to be sent to the same path on

      764 the remote system, and the file /path/to/file2/to/sync to be sent to

      765 /path/on/remote/system on the remote system. It should be possible to

      766 list directories and not only files, but that is untested.

      767

      768 Caveat 1: This is not well tested, but what sparse tests I've done

      769 worked out well.

      770 Caveat 2: Peers normally send all of /opt/monitor/etc to each other,

      771 so for those no extra configuration should be necessary in a normal

      772 setup.

      773 Caveat 3: Only the sha1 checksum of the object config, coupled with

      774 the timestamp of the same, is used to determine which file to sync

      775 where. There's no checking done to make sure a newer version of the

      776 file on the other end isn't being overwritten.

      777 Caveat 4: This will run as the same user as the merlin daemon. I

      778 have no idea if file ownership and permissions will be preserved.

      779

      780 If caveat 3 bites your ass, check in /var/cache/merlin/backups (or

      781 /var/cache/merlin/backup) for the original files.

      782

      783 This is sort-of supported as of merlin-1.1.7, but see caveat 1.

      784

      785

      786 Problem: After a restart, Ninja hangs/is empty for a loong time!

      787 Answer:

      788 If you have a large system, you should be using ocimp instead of

      789 import.php to run the initial import of your configuration. It's

      790 well over 10 times as fast and will cause the waiting period

      791 to be that much shorter. This will also most likely help if you're

      792 seeing an empty UI from time to time. It makes the largest difference

      793 in large environments, of course, but even smaller ones should benefit

      794 from it. ocimp was considered "stable beta" as of merlin-1.1.8-beta2.

      795

      796

      797 Problem: One of my nodes can't connect to the others!

      798 Answer:

      799 On the node that can't connect to other nodes you need to add a

      800 node-option to prevent it from attempting to connect to the nodes

      801 it can't connect to. This is useful if you know there is a firewall

      802 that you won't be allowing new connections through in one direction,

      803 and especially if you're using a tripwire-rigged firewall. An

      804 example config would look like this:

      805

      806   master behind-firewall-we-cant-connect-through {

      807     address = master.behind.firewall.example.com

      808     connect = no

      809   }

      810

      811 This will cause the merlin daemon on this node to never attempt to

      812 connect to master.behind.firewall.example.com. Normally, both nodes

      813 attempt to connect at startup and every so often so long as the

      814 connection is down.

      815 This feature was introduced in Merlin 1.1.0.

      816

      817

      818 Problem: My poller is monitoring a network behind a firewall which

      819          the master can't see through!

      820 Answer:

      821 There's a node-option you can set for poller nodes on the master

      822 that stops the master from taking over the poller's checks when

      823 the poller goes down. Here's what it would look like:

      824

      825   poller watching-network-behind-firewall-where-master-cant-see {

      826     address = admin-net-poller.example.com

      827     port = 15551

      828     takeover = no

      829   }

      830

      831 This feature was introduced in Merlin 0.9.1.

      832

      833

      834 Problem: I want feature X!

      835 Answer:

      836 I want icecream.

      837

      838

      839 Problem: Make feature Y work like this!

      840 Answer:

      841 Talk to Peter.