
Dell DRAC BBU auto-learn tests kill disk performance

We’ve had a bunch of new servers in place for around 3 months now. Until recently they had been working well and performing just fine.

Then, out of the blue, our monitoring started throwing alerts on seemingly random servers. Our queues were building up – basically, database performance had dropped dramatically and our processing scripts couldn’t stuff data into the DBs fast enough.

What could be causing it?

I took a quick glance at our monitoring, and various statistics confirmed the problem. The graphs clearly showed CPU Wait and Load Average climbing at around 14:30, while MySQL throughput and command counters dropped dramatically at the same time.

So, I’d found the symptoms; what could be the cause?

Digging around in the system logs, I found the following lines (emphasis is mine):

Apr 26 12:55:31 b008 Server Administrator: Storage Service EventID: 2176  The controller battery Learn cycle has started.:  Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2266  Controller log file entry: Battery is discharging:  Battery 0 Controller 0
Apr 26 12:56:36 b008 Server Administrator: Storage Service EventID: 2248  The controller battery is executing a Learn cycle.:  Battery 0 Controller 0
Apr 26 14:21:52 b008 Server Administrator: Storage Service EventID: 2278  The controller battery charge level is below a normal threshold.:  Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2188  The controller write policy has been changed to Write Through.:  Battery 0 Controller 0
Apr 26 14:21:53 b008 Server Administrator: Storage Service EventID: 2199  The virtual disk cache policy has changed.:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 14:55:06 b008 Server Administrator: Storage Service EventID: 2177  The controller battery Learn cycle has completed.:  Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2247  The controller battery is charging.:  Battery 0 Controller 0
Apr 26 14:55:21 b008 Server Administrator: Storage Service EventID: 2278  The controller battery charge level is below a normal threshold.:  Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2279  The controller battery charge level is operating within normal limits:  Battery 0 Controller 0
Apr 26 15:25:41 b008 Server Administrator: Storage Service EventID: 2189  The controller write policy has been changed to Write Back.:  Battery 0 Controller 0
Apr 26 15:25:42 b008 Server Administrator: Storage Service EventID: 2199  The virtual disk cache policy has changed.:  Virtual Disk 0 (Virtual Disk 0) Controller 0 (PERC 6/i Adapter)
Apr 26 18:50:26 b008 Server Administrator: Storage Service EventID: 2358  The battery charge cycle is complete.:  Battery 0 Controller 0
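
To pull just the write-policy transitions out of a log like this, a small filter along the following lines might help. It is only a sketch: the function name is made up for illustration, the /var/log/messages default is an assumption about where Server Administrator messages end up, and it relies on the line layout shown in the excerpt above (single-word hostname, event ID in the tenth field).

```shell
#!/bin/sh
# Hypothetical helper: print the timestamp and the new write policy for each
# policy change event. EventID 2188 = changed to Write Through and
# 2189 = changed to Write Back, as seen in the log excerpt above.
write_policy_changes() {
    # $1: log file to scan; /var/log/messages is an assumed default
    logfile="${1:-/var/log/messages}"
    # Field positions assume the layout from the excerpt:
    # "Mon DD HH:MM:SS host Server Administrator: Storage Service EventID: NNNN ..."
    awk '$9 == "EventID:" && ($10 == "2188" || $10 == "2189") {
        policy = ($10 == "2188") ? "Write Through" : "Write Back"
        print $1, $2, $3, policy
    }' "$logfile"
}
```

Run against the excerpt above, this would list the 14:21:53 switch to Write Through and the 15:25:41 switch back to Write Back, which bracket the slow period.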

Bingo!

The RAID controller was running a battery learn cycle. When the battery charge drops below a certain level, the controller switches its write policy from Write Back to Write Through, so every write has to reach the disks before it is acknowledged, and that kills disk performance. It seems this test is enabled by default and is configured to run every 90 days. Of course, Sod’s law dictates that it has to trigger in our busy period!

The Dell tools are not able to turn off this feature, but the LSI MegaCli tool can (Dell PERC 6/i controllers are re-badged LSI cards). I’ve run the following script on all servers (thanks to burr86 on ##infra-talk @ Freenode):

#!/bin/sh
# Disable the automatic BBU learn cycle on adapter 0.
TMPFILE=$(mktemp -p /tmp bbu.relearn.off.XXXXXXXXXX) || exit 1
# autoLearnMode=1 disables auto-learn; use 0 to re-enable it
echo "autoLearnMode=1" > "$TMPFILE"
MegaCli -AdpBbuCmd -SetBbuProperties "-f$TMPFILE" -a0
rm -f "$TMPFILE"
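
To confirm by hand that the change took effect, something like the following could be used. This is a minimal sketch: the function name is hypothetical, and the "Auto-Learn Mode: Disabled" string it looks for is the same one the Puppet check later in this post greps for.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the BBU properties read on stdin report
# auto-learn as disabled, fail otherwise.
autolearn_disabled() {
    grep -q "Auto-Learn Mode: Disabled"
}

# Usage on a real host, with the same MegaCli invocation as the Puppet check:
#   MegaCli -AdpBbuCmd -GetBbuProperties -a0 | autolearn_disabled \
#       && echo "auto-learn is off"
```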

I wrote a Puppet class to run this script on all nodes:

class megacli::check {
    exec{'perc_bbu_autolearn.sh':
        require => Class['megacli::install'],
        cwd    => '/tmp',
        path   => '/usr/local/bin:/bin',
        unless => 'MegaCli -AdpBbuCmd -GetBbuProperties -a0 | grep -q "Auto-Learn Mode: Disabled"'
    }
}

All that remains is to put a cron job in place that runs the battery learn cycle every three months at an off-peak time. The command to kick that off is:

omconfig storage battery action=startlearn controller=0 battery=0
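
One way that cron job might look is a small wrapper around the command above. This is a sketch under assumptions rather than what is actually deployed: the script path, the 03:00 schedule, the LEARN_CMD override and the "run" argument are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper, e.g. saved as /usr/local/sbin/bbu-learn.sh (made-up path).
# LEARN_CMD defaults to the omconfig command from this post and can be
# overridden in the environment, which allows a dry run with a harmless command.
LEARN_CMD="${LEARN_CMD:-omconfig storage battery action=startlearn controller=0 battery=0}"

start_learn_cycle() {
    echo "starting scheduled battery learn cycle"
    # $LEARN_CMD is deliberately unquoted so it splits into command + arguments.
    if $LEARN_CMD; then
        echo "learn cycle requested"
    else
        echo "learn cycle request failed" >&2
        return 1
    fi
}

# Only act when called with "run", so the file can be sourced safely.
if [ "${1:-}" = "run" ]; then
    start_learn_cycle
fi

# Example crontab entry (assumed schedule: 03:00 on the 1st of Jan, Apr, Jul, Oct):
#   0 3 1 1,4,7,10 * /usr/local/sbin/bbu-learn.sh run
```

Cron usually runs with a minimal PATH, so the full path to omconfig may need to go into LEARN_CMD.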
