Introduction

B1000 is a backup system created to fulfill the need for an easy and reliable backup mechanism. The main design goals were:

  • no central scheduler
  • use standard Unix tools as much as possible (for both the backup and restore process)
  • ability to back up to several storage servers
  • data may be pushed from clients to a central storage as well as pulled from clients by a central storage
  • highly automated and transparent centralized monitoring
  • parallel execution of pre- scripts and data copying, as well as parallel copies
  • consistent directory structure across all backups

General info

This chapter gives a general description of how B1000 works. Think of it as a gentle introduction to the Devil Himself. If you're a robot, you may prefer to start from the Technical info chapter. Or, if you think you're bright enough and prefer to learn by example, jump to the Examples section.

B1000 is in fact a sophisticated wrapper for rsync, written in Python, which uses ini-style configuration files. Each time B1000 is run, it takes the configuration file (/etc/b1000/b1000.cfg by default) and processes all jobs defined there. Optionally, you may specify which jobs to run (as program arguments).

There is no dedicated backup scheduler. We use the system's cron for that.
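Scheduling is therefore just an ordinary crontab entry. A minimal sketch, assuming B1000 is installed as /usr/local/bin/b1000 and a job named etc exists (both paths and names are illustrative assumptions):

# /etc/cron.d/b1000 -- hypothetical schedule: run the "etc" job on
# weekdays and Saturdays, and all configured jobs on Sundays
15 3 * * 1-6  root  /usr/local/bin/b1000 etc
15 3 * * 0    root  /usr/local/bin/b1000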

The idea of push and pull

There are two ways of doing backup with B1000:

  1. Push - this is how we back up trusted clients. Data is pushed from a client to central storage.
  2. Pull - this is how we back up untrusted clients. Data is pulled from the clients by the storage server. The idea is that untrusted clients never connect to a storage server from "the outside".

To set up a proper push or pull backup, one needs to create a job with adequate destinations and reports. This is covered in the detailed description of B1000 configuration later on.

For now, let's look at the most common scenarios for pull- and push-type backups.

Push

Push is the simple way of doing backups with B1000. It goes like this:

  • A job is run on a client by cron
  • Data is sent over rsync to a destination (storage server)
  • A report (information about the state of the backup, for monitoring) is written to a MySQL database

Pull

This is The Complicated Way Of Doing Backups. It requires setting up two jobs, passive and pull: one on the client and one on the storage server, respectively. This is how it works:

  • The passive job starts on a client and prepares data, if necessary (with a pre- script).
  • Monitoring data is written on the client to a local report file.
  • The job enters the "copying" state and waits until the data is pulled by another job, running on the storage server.
  • That other job, the pull job, starts in the meantime on the storage server and waits until the data is ready for copying (by periodically checking the remote report file). The report contents are also written to the central MySQL database, so monitoring can see what's going on.
  • When the data is ready, the storage server pulls it from the client and notifies the client when it's done pulling.
  • When the client sees that the storage server has finished pulling the data, it continues with the job, finishing the whole process with an optional post- script and some internal cleanup tasks.

All the communication between the client and the storage server is done using rsync. For that to work, the client needs to share the following with the storage server (using rsync; a minimal sketch follows this list):

  • (read-only) the directory where backup files are stored (this is either a "live" part of the filesystem or some directory "aside", where backup files are written by a pre- script)
  • (read-write) the directory with report files, which is also used for synchronization between the passive and pull jobs.
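A minimal rsyncd.conf sketch for the client side could look like this. The paths are illustrative assumptions; the module names (backup, reports) were chosen to match the passive/pull example at the end of this document:

# /etc/rsyncd.conf on the client (paths are hypothetical)

[backup]
    path = /srv/b1000/backup
    read only = yes
    comment = data prepared by the pre- script, pulled by the storage server

[reports]
    path = /srv/b1000/reports
    read only = no
    comment = report files, also used for passive/pull synchronization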

The idea behind monitoring

The idea behind B1000 monitoring is that information about all job runs ends up in one central database. This is easy for push-type jobs, which run on trusted hosts that may have direct access to such a database. But when data is pulled from an untrusted host, we need to pull the monitoring data the same way and then store it in the database. The job running on the storage server acts as a "monitoring proxy" in such a configuration.

The idea behind slave servers

A common practice is to offload production servers by doing backups on slave servers. B1000 knows that and lets the user configure a backup job with a few additional parameters which allow you to:

  • establish a logical relation between slave and master - monitoring needs this to know that by backing up, e.g., the MySQL instance running on port 3307 on host slave3, we in fact backed up the "payments" production database
  • check how much the slave was lagging behind the master when the backup started - this allows us to monitor whether the data in the backup is "fresh enough"

The idea behind instances

There is one more freaky concept in B1000: instances. We often have a configuration where one physical server holds more than one instance of a service, for example multiple instances of MySQL. As we are usually lazy bastards, we don't want to configure backup every time we configure a new instance. We want it to happen automagically. Enter instances! (NOTE: The truth is that we're not lazy, we just want to be sure that we do have a backup. And the best option is to make it happen automatically, without relying on forgetful humans.)

With B1000 you can configure backup for all MySQL instances in one job. You just need to use the instances option to execute a script that returns a list of all available instances on a host. B1000 then spawns one separate job per instance.
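Such an instance-listing script might look like the following sketch, which guesses running MySQL instances from mysqld port numbers. The detection method is an assumption; anything that prints one instance name per line will do:

#!/bin/sh
# Hypothetical sketch: print the port number of every running mysqld,
# one per line (a newline is a valid separator in instance lists).
ps axww | grep '[m]ysqld' | grep -o 'port=[0-9]*' | cut -d= -f2 | sort -u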

Instances play well with slave servers too, because this is how we roll.

Technical info

Installation

B1000 requires the MySQLdb Python module, which can be installed via the packaging system:

  • Debian: apt-get install python-mysqldb
  • FreeBSD: pkg_add -r py27-MySQLdb51

Command line options

The command line for B1000 looks as follows:

b1000 [options] [job[/instance]] ...

Available options are (none of them is required):

| Short option | Long option | Description |
| --- | --- | --- |
| -h | --help | Display help on command line syntax |
| -v | --version | Display program version |
| -f | --cfg | Use an alternative config file (the default is /etc/b1000/b1000.cfg) |
| -c | --copy | Retry jobs that failed while copying data |

If no job/instance is given in the program arguments, B1000 runs all configured jobs. Otherwise, it runs only the jobs specified by the user.

Rules for specifying jobs to run (examples follow this list):

  • Each argument contains a job name or a job/instance pair (see Instances for more details).
  • If only a job name is given, and there are instances configured for this job, the job is run for each instance.
  • If a job/instance pair is given, the job is run for the specified instance only.
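For example, using the job names that appear later in the Examples section (the instance name 3307 is hypothetical):

b1000                  # run all configured jobs
b1000 etc binlogs      # run two specific jobs
b1000 mysqldb          # run the mysqldb job for every configured instance
b1000 mysqldb/3307     # run the mysqldb job for instance 3307 only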

Configuration

B1000 configuration is ini-style. The configuration file consists of sections, each with its own set of required and allowed options. There are a few features in addition to "classic" ini-style configuration, described in detail in Option types.

The NAME part of a section header is arbitrary; the section types themselves are fixed. The following sections are allowed:

  • [global] - global options that configure various aspects of B1000 and also provide common configuration for jobs.
  • [job:JOB_NAME] - a job is the core resource in B1000 configuration. It defines a backup process for a certain set of data.
  • [dest:DESTINATION_NAME] - a destination is a definition of a place where a backup is stored.
  • [report:REPORT_NAME] - reports are places where B1000 stores information on backup progress. Reports are used for monitoring purposes only.

For each section (and sometimes even for a certain section type) there is a set of required and allowed keys. B1000 checks each section's keys for correctness and reports to the log file if something is not right.

Global section

The global section configures various aspects of B1000 and also provides common configuration for jobs. Options without a default value are required.

| Option | Default | Description |
| --- | --- | --- |
| log_format | %(asctime)-15s %(levelname)-7s [%(threadName)-10s] %(message)s | Format of a log file line (see the Python logging documentation for details) |
| log_level | INFO | Level of logs printed. One of: DEBUG, INFO, WARNING, ERROR, FATAL |
| log_file | /dev/stdout | Where to print logs |
| backup_dir | | Path to store intermediate backup files |
| scripts_dir | | Path where pre-, post- and other scripts used by B1000 are located. This is a shortcut, just for your convenience |
| status_dir | | Path where status files for failed jobs are stored |
| copy_retries | 3 | Number of retries for failed copy operations (if the destination type supports it) |
| copy_retry_min_sleep | 60 | Sleep between copy retries, in seconds (if the destination type supports it) |

Job section

A job is the core resource in B1000 configuration. It defines a backup process for a certain set of data. There are three so-called directions for jobs in B1000:

  • Push - pushes data from a client to a storage server; also runs the pre- and post- scripts
  • Passive - waits for a pull job to pull the data; also runs the pre- and post- scripts
  • Pull - pulls data from a client; doesn't run any pre- or post- scripts

Option types

Options in the configuration file are set with "key = value" syntax. But for job definitions there are three additional features for setting options.

Run-time options

Besides the options set explicitly by the user in the configuration file, B1000 sets additional run-time options in the context of each job/instance. These may be used (see Referencing options) anywhere in the same block, and are primarily used as arguments when calling the shell scripts that set the master_host, master_instance and data_age variables.

Options set by B1000 for each job at run-time:

| Option | Description |
| --- | --- |
| instance | The currently processed instance, as specified in instances |
| host | The hostname where the job is run |
| start_time | Timestamp of the job start in the form %Y-%m-%d-%H-%M-%S (useful for temporary directories, for example) |

Dynamic options

Dynamic options may be set to the output of an external script. Every time such an option is referenced, the script is run and its output is taken as the value of the key. Only a few options may be set dynamically (see the table of job options below). Dynamic options exist only for push jobs. The syntax for using dynamic options is:

key = !/path/to/executable/script

Referencing options

When setting the value of a key, the user may use the following syntax to reference the value of another option in the same block or, if not found there, in the global section:

key = $other_key

This sets the value of key to the value of other_key. This is useful when setting a dynamic variable using an external script, for example:

master_host = !/usr/lib/b1000/scripts/master.sh $instance

Or for using the same prefix path in multiple places, for example:

[global]
scripts_dir = /usr/local/lib/b1000/scripts
...
[job:system]
pre = $scripts_dir/system_pre.sh

Push jobs

Push is the most feature-rich type of job.

| Option | Dynamic | Description |
| --- | --- | --- |
| direction | | push |
| type | | full, sync |
| instances | + | List of instances to back up |
| report | | List of reports to write |
| dest | | List of destinations to copy the backup to |
| data_age | + | Age of the data being backed up (e.g. slave server delay) |
| master_host | + | Master (real production) host which holds the data we back up |
| master_instance | + | Master (real production) instance which holds the data we back up |
| pre | | Script to run before copying data |
| post | | Script to run after copying data |
| include | | Files and directories to include in the backup |
| exclude | | Files and directories to exclude from the backup |

Passive jobs

A passive job requires a twin pull job set up on the storage server to work correctly.

For passive jobs there are no include or exclude options, as this is something the pull job decides on.

Also, there is no instances option (nor the other instance-related ones). The reason is that the passive/pull configuration is currently intended only for simple cases. This may change in the future.

| Option | Description |
| --- | --- |
| direction | passive |
| report | List of reports to write |
| dest | List of destinations to copy the backup to |
| pre | Script to run before copying data |
| post | Script to run after copying data |

Pull jobs

A pull job requires a twin passive job set up on the client to work correctly.

| Option | Description |
| --- | --- |
| direction | pull |
| type | full, sync |
| report | List of reports to write |
| dest | List of destinations to copy the backup to |
| include | Files and directories to include in the backup |
| exclude | Files and directories to exclude from the backup |
| report_source | rsync location on the client side where to look for reports |
| report_poll_wait | How long to wait between remote report checks (in seconds) |
| report_poll_retries | How many times to check the remote report |

More on options

Notes on destination paths and job types

We wanted to achieve one specific goal with B1000: that all backups would end up in a simple, consistent directory structure. No more asking: "where is the backup of _____ stored?". That means we left only a little room for the user to play with destination directories.

There are two types of backup in B1000: full and sync. There is no incremental backup, as you might have expected. Although sync does that. Kinda.

  • full type jobs do just what you expect them to. Every time the job is run, it makes a full backup of all files configured with the include option. The backup is stored on the destination in a directory constructed the following way:

    $path/job_name/Y-M-D-DoW/host-instance-master_host-master_instance-Y-M-D-DoW-HH:MM:SS

  • sync type jobs synchronize the set of files and directories configured with the include option with the files on the destination. It's just a regular rsync. The backup is stored in the same destination directory every time, constructed the following way:

    $path/job_name/host-instance-master_host-master_instance

If any of instance, master_host, master_instance is absent for a job, it is omitted from the destination path together with the leading dash.

$path is an option specified in the destination definition. It is either a local directory (e.g. /srv/backup/) or a remote rsync module (e.g. rsync://backup-3/b1000), depending on where you want to store the resulting backup. But in the real world local directories make sense only for pull-type jobs. All push jobs should normally write backups to a remote rsync server.
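To make the templates concrete: a full backup made by the hypothetical job etc, run on host web1 on Sunday 2013-05-12 at 03:15:00, with no instance or master configured (so those fields and their dashes are dropped), would land in:

rsync://backup-1/b1000/etc/2013-05-12-Sun/web1-2013-05-12-Sun-03:15:00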

Report

Each job needs at least one report. Report lists consist of ELEMENTS separated by SEPARATORS, where:

  • ELEMENT is: [a-zA-Z][-_a-zA-Z0-9]+
  • SEPARATOR is: [ ,]+

Example:

report = main hd backup

Destination

Each job needs at least one destination. The syntax for destination lists is the same as for reports, but with two additional special modifiers:

  • / (slash) - joins two or more destinations and tells B1000 to randomize the order in which they are considered when copying data. The following example configures a job to copy data sequentially to three destinations, but in random order:

    dest = atm/tele/hd

  • & (and) - tells B1000 to start copying to a destination in the background and continue with the other destinations. The following example configures a job to copy data to hd in the background, then immediately proceed with tele, and when tele is done, copy the data to atm:

    dest = hd& tele atm

  • Combinations of the two modifiers are permitted. For example, to start copying to hd in the background, then proceed with tele and atm, sequentially, in random order, use:

    dest = hd& tele/atm

Instances

The syntax for specifying instance lists in the configuration file is similar to report lists: ELEMENTS separated by SEPARATORS. One addition is that when the list is filled in by an external script, a newline can be a separator too:

  • ELEMENT is: [a-zA-Z][-_a-zA-Z0-9]+
  • SEPARATOR is: [ \n,]+

How do you name an instance? You choose. It may be a port number, an instance name, or whatever allows you to identify an instance.

When B1000 finds a job in the configuration file that has the instances parameter set, instead of preparing a single job it spawns one job per instance. Each of these jobs gets an additional configuration parameter set at run-time: instance.

master_host, master_instance, data_age

If you make a backup of a production service, but instead of doing it on the production server you use a slave host, you may want to tell B1000 what the real master host/instance is. Monitoring needs this to know that a backup of a certain production service took place, but was done somewhere other than the real host.

Also, when doing a backup on a slave host, it may be crucial to know if it was lagging behind its master at the time.

These options may be set statically, but the real-world scenario is: you are doing backups on a slave host holding multiple instances of production services, e.g. MySQL slaves. In such a scenario the user may configure master_host, master_instance and data_age to execute scripts which take the run-time option $instance as an argument, check what the master host, master instance and data age are for that instance, and return the actual value.

With a job configured this way, you don't need to worry about backups of newly created or removed instances.
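A sketch of such a script for data_age, assuming instances are named after MySQL port numbers (the script name matches mysql_get_slave_delay.sh from the Examples section, but the implementation here is an assumption and omits credentials handling):

#!/bin/sh
# Hypothetical sketch: print the replication lag, in seconds, of the
# MySQL instance listening on port $1.
mysql --host=127.0.0.1 --port="$1" -e 'SHOW SLAVE STATUS\G' \
    | awk '/Seconds_Behind_Master/ {print $2}'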

pre, post

Each job consists of three stages:

  • pre- script
  • copying data
  • post- script

Pre- and post- scripts are any executables that are run locally on the client that runs the job (you may have only one pre- and one post- script per job). This is another place where the $instance run-time option comes in handy. You may, for example, have one pre- script for doing MySQL dumps which takes the instance name as an argument.

The exit code of a pre- or post- script is what matters to B1000. If a script exits with anything other than 0, B1000 treats it as an error. The output of pre- and post- scripts is written to the B1000 log file.
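A pre- script for the MySQL dump case could look like the following sketch (the name and arguments match the mysql_dump_instance.sh call in the Examples section; the dump method itself is an assumption):

#!/bin/sh
# Hypothetical sketch: dump the MySQL instance on port $1 into the
# directory $2. Any failure exits non-zero, which B1000 treats as a
# job error.
set -e
mkdir -p "$2"
mysqldump --host=127.0.0.1 --port="$1" --all-databases > "$2/dump.sql"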

include, exclude

These two options tell B1000 what to back up and what to skip. include is in fact the list of source arguments for the rsync call; exclude holds the values for a series of --exclude rsync options. For example:

include = /etc /var/spool/cron /usr/local/bin
exclude = *.tmp *~ *.log

will translate to something like: rsync ... --exclude=*.tmp --exclude=*~ --exclude=*.log /etc /var/spool/cron /usr/local/bin rsync://destination/...

Destination section

A destination is a definition of a place where a backup is stored. There are two types of destinations.

Active

This type of destination is invalid for passive jobs. It is meant to be used in jobs that actually copy data (push and pull).

| Option | Description |
| --- | --- |
| type | active - regular or in-place push backups |
| path | Where to store files (both local directories and rsync:// paths are allowed) |
| exclude | Files excluded from the copy |
| verbosity | How verbose the copying process should be (1-3, default 1) |

Passive

This type of destination is valid only for passive jobs. It doesn't copy data; it waits for a pull job to download it.

| Option | Description |
| --- | --- |
| type | passive - the job waits until another job pulls the data (valid only for passive jobs) |
| host | Hostname of the host which pulls the data (the storage server) |
| timeout | Timeout (in seconds) for the data being pulled |

Report section

Reports are places where B1000 stores information on backup progress. Reports are used for monitoring purposes only. There are two types of reports B1000 can write: mysql (a MySQL database) and file (a locally written file). Each one has different options. All of the options listed below are required.

Database

| Option | Description |
| --- | --- |
| type | mysql - for reports stored in a MySQL database |
| server | Database server |
| db | Database name |
| user | Database user |
| password | Database user password |

File

This kind of report is required for passive jobs.

| Option | Description |
| --- | --- |
| type | file - for reports stored in files |
| path | Directory where reports are written |

Examples

Push jobs

Common sections

Let's first define some common sections used by the example push jobs:

[global]
log_level = INFO
log_file = /var/log/b1000.log
backup_dir = /srv/backup
scripts_dir = /usr/local/lib/b1000/scripts
status_dir = /srv/b1000/statuses
copy_retries = 3
copy_retry_min_sleep = 60

This is our central database for b1000 monitoring:

[report:main]
type = mysql
server = db-b1000
user = b1000
password = AndrzejJestCudownyIPrzystojny
db = b1000

These are the storage servers that we use to store all backups:

[dest:tele]
type = active
path = rsync://backup-1/b1000
verbosity = 1


[dest:atm]
type = active
path = rsync://backup-2/b1000
verbosity = 1


[dest:hd]
type = active
path = rsync://backup-3/b1000
verbosity = 1

Simple backup of /etc

Back up the whole /etc by pushing it to tele and atm sequentially, in random order. Store monitoring data in the central database.

[job:etc]
direction = push
type = full
dest = tele/atm
report = main
include = /etc

Less simple backup of /etc

Same as above, but with tar/gzip.

[job:etc]
direction = push
type = full
dest = tele/atm
report = main
pre = tar czf $backup_dir/system_backup.tar.gz /etc
include = $backup_dir/system_backup.tar.gz
post = rm -f $backup_dir/system_backup.tar.gz

Backup of MySQL binlogs

  1. Just before starting the backup of MySQL binlogs, we want to switch to a new binlog, so the binlog set is as fresh as possible. We do it with a script run as the pre- script.
  2. We want to back up MySQL binlogs to one directory per server, all files in one directory. Thus, we do a sync backup.
  3. Also, we want to copy all binlogs to hd as soon as possible, and in the meantime to tele and atm.

Configuration for the above looks like this:

[job:binlogs]
direction = push
type = sync
dest = hd& tele/atm
report = main
pre = $scripts_dir/mysql_rotate_logs.sh
include = /var/lib/mysql/mysql-bin.*

Back up all MySQL instances on a slave host

This is a full-flavor push job with instances and all the bells and whistles. Seriously, intense stuff.

[job:mysqldb]
direction = push
type = full
dest = hd& tele/atm
report = main
instances = !$scripts_dir/mysql_get_instances.sh
data_age = !$scripts_dir/mysql_get_slave_delay.sh $instance
master_host = !$scripts_dir/mysql_get_master_host.sh $instance
master_instance = !$scripts_dir/mysql_get_master_instance.sh $instance
pre = $scripts_dir/mysql_dump_instance.sh $instance $backup_dir/$instance-$start_time
post = rm -rf $backup_dir/$instance-$start_time
include = $backup_dir/$instance-$start_time/*

Passive/Pull jobs

To pull a backup from a client we need to set up one (passive) job there, and one (pull) job on the storage server.

Client

[report:localfile]
type = file
path = /srv/b1000/reports

[dest:atm]
type = passive
host = backup-2
timeout = 10

[job:ext]
direction = passive
dest = atm
report = localfile
pre = $scripts_dir/ext_dump.sh

Notes:

  • The passive job needs to write its report to a local file
  • The directory with reports needs to be readable over rsync by the storage server
  • The backup job runs a pre- script which prepares the data for backup (in this case into some arbitrary directory, hardcoded in the pre- script)
  • This directory needs to be readable over rsync by the storage server
  • The job doesn't specify any include-s; its only job is to prepare the data and make it available to the storage server using rsync.

Storage

[report:main]
type = mysql
server = db-b1000
user = b1000
password = erjcgnhy3p4i8cg2ny34ilugfcnh
db = b1000

[dest:local]
type = active
path = /srv/stor1/b1000
verbosity = 1

[job:system]
direction = pull
type = full
dest = local
report_source = rsync://ext/reports
report_poll_wait = 3
report_poll_retries = 3
report = main
include = rsync://ext/backup
exclude = *.tmp *.log

Notes:

  • The pull job has its destination configured as a local directory. This should probably be a directory where other backups are stored too.
  • report_source and include are the two rsync resources made available on the untrusted host