B1000 is a backup system created to fulfill the need for an easy and reliable backup mechanism. The main design goals were:
- no central scheduler
- use standard Unix tools as much as possible (for both backup and restore process)
- possibility to do backup to several storages
- data may be pushed to a central storage from clients as well as pulled from clients to a central storage
- extremely automated and transparent centralized monitoring
- parallel execution of pre- scripts and data copying, as well as parallel copies
- consistent directory structure across all backups
This chapter gives a general description of how B1000 works. Think of it as a gentle introduction to the Devil Himself. If you're a robot, you may prefer to start from the Technical info chapter. Or, if you think you're bright enough and prefer to learn by example, jump to the Examples section.
B1000 is in fact a sophisticated wrapper for rsync, written in Python, which uses ini-style configuration files. Each time B1000 is run, it takes the configuration file (/etc/b1000/b1000.cfg by default) and processes all jobs defined there. Optionally, you may specify which jobs to run (as program arguments).
There is no dedicated backup scheduler. We use the system's cron for that.
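For example, a crontab entry on a client could look like this (the schedule and binary path are illustrative, not B1000 defaults):
# /etc/cron.d/b1000 - run all configured backup jobs at 02:15 every night
15 2 * * * root /usr/bin/b1000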
There are two ways of doing backup with B1000:
- Push - this is the way we do backup on trusted clients. Data is pushed from a client to central storage.
- Pull - this is how we do backup on untrusted clients. Data is pulled from clients by the storage server. The idea is that untrusted clients don't connect to a storage server from "the outside".
To set up a proper push or pull backup, one needs to create a job with adequate destinations and reports. This is covered in the detailed description of B1000 configuration later on.
For now, let's see the most common scenarios for pull- and push- type backups.
Push is the simple way of doing backups with B1000. It goes like this:
- Job is run on a client by cron
- Data is sent over rsync to a destination (storage server)
- A report (information about the state of the backup, for monitoring) is written to a MySQL database
This is The Complicated Way Of Doing Backups. It requires setting up two jobs: passive and pull, one on the client and one on the storage server, respectively. This is how it works:
- The passive job starts on a client and prepares data, if necessary (with a pre- script).
- Monitoring data is written on the client to a local report file.
- The job enters the "copying" state and waits until the data is pulled by another job, running on the storage server.
- That other job, the pull job, starts in the meantime on the storage server and waits until the data is ready for copying (by periodically checking the remote report file). Report contents are also written to the central MySQL database, so monitoring can see what's going on.
- When the data is ready, the storage server pulls it from the client and notifies the client when it's done pulling.
- When the client sees that the storage server finished pulling the data, it continues with the job, finishing the whole process with an optional post- script and some internal cleanup tasks.
All the communication between client and storage is done using rsync. For that to work, the client needs to share the following with the storage server (using rsync; see the sketch after this list):
- (read-only) directory where backup files are stored (this is either "live" part of the filesystem or some directory "aside", where backup files are written by a pre- script)
- (read-write) directory with report files, used also for synchronization between passive and pull jobs.
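A minimal rsyncd.conf sketch on the client side, assuming hypothetical module names and paths and a storage host called backup-2 (B1000 doesn't mandate any of these names; adjust to your setup):
# /etc/rsyncd.conf on the client (hypothetical)
[backup]
    # directory with the backup files, read-only for the storage server
    path = /srv/b1000/backup
    read only = yes
    hosts allow = backup-2

[reports]
    # report files, read-write so the pull job can signal back
    path = /srv/b1000/reports
    read only = no
    hosts allow = backup-2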
The idea behind B1000 monitoring is that information about all job runs ends up in one central database. This is easy for all push-type jobs, which run on trusted hosts that may have direct access to such a database. But when data is pulled from an untrusted host, we need to pull the monitoring data the same way and then store it in the database. The job running on the storage server acts as a "monitoring proxy" in such a configuration.
A common practice is to offload production servers by doing backups on slave servers. B1000 knows that and allows the user to configure a backup job with a few additional parameters which allow you to:
- establish a logical relation between slave and master - this is required by monitoring to know that by backing up e.g. the MySQL instance running on port 3307 on host slave3 we in fact backed up the "payments" production database
- check how much the slave was lagging behind the master when the backup started - this allows us to monitor if the data in the backup is "fresh enough"
There is one more freaky concept in B1000: instances. We often have a configuration where one physical server holds more than one instance of a service, for example multiple instances of MySQL. As we are usually lazy bastards, we don't want to configure backup every time we configure a new instance. We want it to happen automagically. Enter instances! (NOTE: The truth is that we're not lazy, we just want to be sure that we do have a backup. And the best option is to make it happen automatically, without relying on forgetful humans.)
With B1000 you can configure backup for all MySQL instances in one job. You just need to use the instances option to execute a script that returns a list of all available instances on a host. B1000 then spawns one separate job per instance.
Instances play well with slave servers too, cause this is how we roll.
B1000 requires the MySQLdb Python module, which can be installed via the packaging system:
- Debian:
apt-get install python-mysqldb
- FreeBSD:
pkg_add -r py27-MySQLdb51
The command line for B1000 looks as follows:
b1000 [options] [job[/instance]] ...
Available options are (none of them is required):
Short option | Long option | Description |
---|---|---|
-h | --help | Display help on command line syntax |
-v | --version | Display program version |
-f | --cfg | Use alternative config file (default is to use /etc/b1000/b1000.cfg) |
-c | --copy | Retry jobs that failed during copying data |
If no job/instance is given in the program arguments, B1000 runs all configured jobs. Otherwise, it runs only the jobs specified by the user.
Rules for specifying jobs to run:
- Each argument contains a job name or a job/instance pair (see instances to learn more).
- If only a job name is given, and there are instances configured for this job, the job is run for each instance.
- If a job/instance pair is given, the job is run for the specified instance only.
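For example (the job names come from the Examples section; the alternative config path is hypothetical):
# run all configured jobs using the default config file
b1000

# run the etc job, plus the mysqldb job for instance 3307 only,
# using an alternative config file
b1000 -f /etc/b1000/test.cfg etc mysqldb/3307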
B1000 configuration is ini-style. The configuration file consists of sections, each with its own set of required and allowed options. There are a few features additional to "classic" ini-style configuration, described in detail in option_types.
Section naming for B1000 configuration is not arbitrary. The following sections are allowed:
- [global] - global configuration options configure various aspects of B1000 and also provide common configuration for jobs.
- [job:JOB_NAME] - job is the core resource in B1000 configuration. It defines a backup process for a certain set of data.
- [dest:DESTINATION_NAME] - destination is a definition of a place where backup is stored.
- [report:REPORT_NAME] - reports are places where B1000 stores information on backup progress. Reports are used for monitoring purposes only.
For each section (and sometimes even for a certain section type) there is a set of required and allowed keys. B1000 checks each section's keys for correctness and reports to the logfile if something is not right.
The global section configures various aspects of B1000 and also provides common configuration for jobs. Options marked red are required.
Option | Default | Description |
---|---|---|
log_format | %(asctime)-15s %(levelname)-7s [%(threadName)-10s] %(message)s | Format of a log file line (Python logging format) |
log_level | INFO | Level of logs printed. One of: DEBUG, INFO, WARNING, ERROR, FATAL |
log_file | /dev/stdout | where to print logs |
backup_dir | – | path to store intermediate backup files |
scripts_dir | – | path where pre-, post- and other scripts used by b1000 are located. This is a shortcut, just for your convenience |
status_dir | – | path where status files for failed jobs are stored |
copy_retries | 3 | retries for failed copy operations (if destination type supports it) |
copy_retry_min_sleep | 60 | sleep between copy retries (in seconds) (if destination type supports it) |
Job is the core resource in B1000 configuration. It defines a backup process for a certain set of data. There are three so-called directions for jobs in B1000:
- Push - pushes data from client to storage server, also does the pre- and post- scripts
- Passive - waits for Pull job to pull data, also does pre- and post- scripts
- Pull - pulls data from a client, doesn't do any pre- nor post- scripts
Options in the configuration file are set with "key = value" syntax, but for job definitions there are three additional features for setting options.
Besides the options set explicitly by the user in a configuration file, B1000 sets a few additional run-time options in the context of each job/instance. Those may be referenced (see: referencing_options) anywhere in the same block, and are primarily used as arguments when calling the shell scripts that set the master_host, master_instance and data_age options.
Options set by B1000 for each job in run-time:
Option | Description |
---|---|
instance | currently processed instance, as specified in instances |
host | hostname where the job is run |
start_time | timestamp of a job start in form: %Y-%m-%d-%H-%M-%S (useful for temporary directories, for example) |
Dynamic options may be set to the output of an external script. Every time such an option is referenced, the script is run and its output is taken as the value for the key. Only a few options may be set dynamically (see the table with job options below). Dynamic options exist only for push jobs. The syntax for dynamic options is:
key = !/path/to/executable/script
When setting the value of a key, the user may use the following syntax to reference the value of another option in the same block or, if not found there, in the global section:
key = $other_key
This sets the value of key to the value of other_key. This is useful when setting a dynamic variable using an external script, for example:
master_host = !/usr/lib/b1000/scripts/master.sh $instance
Or for using the same prefix path in multiple places, for example:
[global]
scripts_dir = /usr/local/lib/b1000/scripts
...
[job:system]
pre = $scripts_dir/system_pre.sh
Push is the most feature-rich type of job.
Option | dynamic | Description |
---|---|---|
direction | – | push |
type | – | full, sync |
instances | + | List of instances to back up. |
report | – | List of reports to write |
dest | – | List of destinations to copy backup to |
data_age | + | age of the data being backed up (e.g. slave server delay) |
master_host | + | master (real production) host which holds the data we back up |
master_instance | + | master (real production) instance which holds the data we back up |
pre | – | script to run before copying data |
post | – | script to run after copying data |
include | – | files and directories to include in backup |
exclude | – | files and directories to exclude from backup |
A passive job requires a twin pull job set up on the storage server to work correctly.
For passive jobs there are no include or exclude options, as this is something the pull job decides on.
Also, there is no instances option (together with other instance-related ones). The reason for this is that passive/pull configuration is currently intended only for simple cases. This may change in the future.
Option | Description |
---|---|
direction | passive |
report | List of reports to write |
dest | List of destinations to copy backup to |
pre | script to run before copying data |
post | script to run after copying data |
A pull job requires a twin passive job set up on the client to work correctly.
Option | Description |
---|---|
direction | pull |
type | full, sync |
report | List of reports to write |
dest | List of destinations to copy backup to |
include | files and directories to include in backup |
exclude | files and directories to exclude from backup |
report_source | rsync location on client side where to look for reports |
report_poll_wait | how long to wait between remote report checks (in seconds) |
report_poll_retries | how many times to check remote report |
We wanted to achieve one specific goal with B1000: that all backups would end up in a simple, consistent directory structure. No more asking questions: "where is the backup of _____ stored?". That means we left only a little room for the user to play with destination directories.
There are two types of backup in B1000: full and sync. There is no incremental backup, as you might have expected. Although sync does that. Kinda.
- full type jobs do just what you expect them to. Every time the job is run, it makes a full backup of all files configured with the include option. The backup is stored on the destination in a directory constructed the following way:
$path/job_name/Y-M-D-DoW/host-instance-master_host-master_instance-Y-M-D-DoW-HH:MM:SS
- sync type jobs synchronize the set of files and directories configured with the include option with the files on the destination. It's just a regular rsync. The backup is stored on the destination in the same directory every time, constructed the following way:
$path/job_name/host-instance-master_host-master_instance
If any of: instance, master_host, master_instance is absent for a job, it is omitted in the destination path together with the leading dash.
$path is an option specified in the destination definition. This is either a local directory (e.g. /srv/backup/) or a remote rsync module (e.g. rsync://backup-3/b1000), depending on where you want to store the resulting backup. But in the real world, local directories make sense only for pull-type jobs. All push jobs should normally write backups to a remote rsync server.
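To illustrate both layouts, here are hypothetical resulting paths (the job, host, instance and master names are made up; the formats follow the templates above):
# full backup of job mysqldb, instance 3307, run on host slave3 for
# master payments-1/3306, started on Thursday 2013-07-04 at 02:15:00
rsync://backup-3/b1000/mysqldb/2013-07-04-Thu/slave3-3307-payments-1-3306-2013-07-04-Thu-02:15:00

# sync backup of the same job lands in the same directory every time
rsync://backup-3/b1000/mysqldb/slave3-3307-payments-1-3306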
Each job needs at least one report. A report list consists of ELEMENTS separated by SEPARATORS, where:
- ELEMENT is:
[a-zA-Z][-_a-zA-Z0-9]+
- SEPARATOR is:
[ ,]+
Example:
report = main hd backup
Each job needs at least one destination. The syntax for destination lists is the same as for reports, but with two additional special modifiers:
- / (slash) - joins two or more destinations and tells B1000 to randomize the order in which they are considered when copying data. The following example configures a job to copy data sequentially to three destinations, but in random order:
dest = atm/tele/hd
- & (and) - tells B1000 to start copying to a destination in the background and continue with the other destinations. The following example configures a job to copy data to hd in the background, immediately proceed with tele, and when tele is done, copy the data to atm:
dest = hd& tele atm
- Combinations of these two modifiers are permitted. For example, to start copying to hd in the background, then proceed with tele and atm sequentially, in random order, use:
dest = hd& tele/atm
The syntax for specifying instance lists in the configuration file is similar to report lists: ELEMENTS separated by SEPARATORS. One addition is that when the list is produced by an external script, a newline can be a separator too:
- ELEMENT is:
[a-zA-Z][-_a-zA-Z0-9]+
- SEPARATOR is:
[ \n,]+
How do you name an instance? You choose. It may be a port number, an instance name, or whatever allows you to identify an instance.
When B1000 finds a job in the configuration file that has the instances parameter set, instead of preparing a single job it spawns one job per instance. Each of these jobs gets an additional configuration parameter set at run-time: instance.
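An instances script can be as simple as printing one instance name per line. A hypothetical sketch that names MySQL instances after their ports (the config layout is an assumption, not something B1000 requires):
#!/bin/sh
# print one instance name (here: MySQL port) per line
for cfg in /etc/mysql/instances/*.cnf; do
    awk -F '= *' '/^port/ {print $2}' "$cfg"
done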
If you make a backup of a production service, but instead of doing it on the production server you use a slave host, you may want to inform B1000 what the real master host/instance is. This is needed by monitoring to know that a backup of a certain production service took place, but was done in some place other than the real host.
Also, in the case of doing backup on a slave host, it may be crucial to know if it was lagging behind its master at the time.
Those options may be set statically, but the real-world scenario is: you are doing backups on a slave host holding multiple instances of production services, e.g. MySQL slaves. In such a scenario the user may configure master_host, master_instance and data_age to execute scripts which take the run-time option $instance as an argument, check what the master host, master instance and data age for that given instance are, and return the actual value.
With a job configured in such a way, you don't need to worry about backups of newly created or removed instances.
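As a sketch, a data_age script for MySQL could look like this (the socket path is a hypothetical convention; B1000 only cares about the printed value):
#!/bin/sh
# print the slave delay (in seconds) for the instance given as $1
INSTANCE="$1"
mysql --socket="/var/run/mysqld/mysqld-$INSTANCE.sock" \
      -N -e 'SHOW SLAVE STATUS\G' |
    awk '/Seconds_Behind_Master/ {print $2}'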
Each job consists of three stages:
- pre- script
- copying data
- post- script
Pre- and post- scripts are any executables that are run locally on the client that runs a job (you may have only one pre- and one post- script per job). This is another place where the $instance run-time option comes in handy. You may, for example, have one pre- script for doing MySQL dumps which takes the instance name as an argument.
The exit code of a pre- or post- script is what matters to B1000. If a script exits with anything other than 0, B1000 treats it as an error. The output of pre- and post- scripts is written to the B1000 log file.
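For example, a minimal pre- script honoring this exit-code convention might look like the following (the dump location and socket path are hypothetical):
#!/bin/sh
set -e   # any failing command makes the script exit non-zero
INSTANCE="$1"
DEST="$2"
mkdir -p "$DEST"
# dump one MySQL instance; a failure here aborts the whole job
mysqldump --socket="/var/run/mysqld/mysqld-$INSTANCE.sock" \
    --all-databases > "$DEST/dump.sql"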
These two options tell B1000 what to back up and what to skip. include is in fact the first argument of the rsync call; exclude holds values for a series of --exclude rsync options. For example:
include = /etc /var/spool/cron /usr/local/bin
exclude = *.tmp *~ *.log
will translate to something like: rsync ... --exclude=*.tmp --exclude=*~ --exclude=*.log /etc /var/spool/cron /usr/local/bin rsync://destination/...
Destination is a definition of a place where backup is stored. There are two types of destinations.
This type of destination is invalid for passive jobs. It is meant to be used in jobs that actually store data (push and pull).
Option | Description |
---|---|
type | active - for jobs that actually store data (push and pull) |
path | where to store files (both local directories and rsync:// are allowed) |
exclude | files excluded from copy |
verbosity | how verbose the copying process should be (1-3, default 1) |
This type of destination is valid only for passive jobs. It doesn't copy data; it waits for the pull job to download it.
Option | Description |
---|---|
type | passive - job waits until another job pulls the data (valid only for "passive" jobs) |
host | hostname of a client which pulls data |
timeout | timeout (in seconds) for the data being pulled |
Reports are places where B1000 stores information on backup progress. Reports are used for monitoring purposes only. There are two types of report B1000 can write: mysql (MySQL database) and file (locally written file). Each one has different options. Options marked red are required.
Option | Description |
---|---|
type | mysql - for reports stored in MySQL database |
server | database server |
db | database name |
user | database user |
password | database user password |
This kind of report is required for passive jobs.
Option | Description |
---|---|
type | file - for reports stored in files |
path | directory where reports are written |
Common sections
Let's first define some common sections used by example push jobs:
[global]
log_level = INFO
log_file = /var/log/b1000.log
backup_dir = /srv/backup
scripts_dir = /usr/local/lib/b1000/scripts
status_dir = /srv/b1000/statuses
copy_retries = 3
copy_retry_min_sleep = 60
This is our central database for b1000 monitoring:
[report:main]
type = mysql
server = db-b1000
user = b1000
password = AndrzejJestCudownyIPrzystojny
db = b1000
Those are storage servers that we use to store all backups:
[dest:tele]
type = active
path = rsync://backup-1/b1000
verbosity = 1
[dest:atm]
type = active
path = rsync://backup-2/b1000
verbosity = 1
[dest:hd]
type = active
path = rsync://backup-3/b1000
verbosity = 1
Backup the whole /etc by pushing it to tele and atm sequentially, in random order. Store monitoring data in the central database.
[job:etc]
direction = push
type = full
dest = tele/atm
report = main
include = /etc
Same as above, but with tar/gzip.
[job:etc]
direction = push
type = full
dest = tele/atm
report = main
pre = tar czf $backup_dir/system_backup.tar.gz /etc
include = $backup_dir/system_backup.tar.gz
post = rm -f $backup_dir/system_backup.tar.gz
- Just before starting the backup of MySQL binlogs, we want to switch to a new binlog, so the binlog set is as fresh as possible. We do it with a script run as the pre script.
- We want to make a backup of MySQL binlogs to one directory per server, all files in one directory. Thus, we do a sync backup.
- Also, we want to copy all binlogs to hd as soon as possible, and in the meantime to tele and atm.
Configuration for the above looks like this:
[job:binlogs]
direction = push
type = sync
dest = hd& tele/atm
report = main
pre = $scripts_dir/mysql_rotate_logs.sh
include = /var/lib/mysql/mysql-bin.*
This is a full-flavor push job with instances and all the bells and whistles. Seriously, intense stuff.
[job:mysqldb]
direction = push
type = full
dest = hd& tele/atm
report = main
instances = $scripts_dir/mysql_get_instances.sh
data_age = $scripts_dir/mysql_get_slave_delay.sh $instance
master_host = $scripts_dir/mysql_get_master_host.sh $instance
master_instance = $scripts_dir/mysql_get_master_instance.sh $instance
pre = $scripts_dir/mysql_dump_instance.sh $instance $backup_dir/$instance-$start_time
post = rm -rf $backup_dir/$instance-$start_time
include = $backup_dir/$instance-$start_time/*
To pull a backup from a client we need to set up one (passive) job there, and one (pull) job on the storage server.
Client
[report:localfile]
type = file
path = /srv/b1000/reports
[dest:atm]
type = passive
host = backup-2
timeout = 10
[job:ext]
direction = passive
dest = atm
report = localfile
pre = $scripts_dir/ext_dump.sh
Notes:
- Passive job needs to write report to a local file
- Directory with reports needs to be readable over rsync by storage server
- Backup job runs pre-script which prepares data for backup (to some arbitrary directory in this case, hardcoded in the pre-script)
- This directory needs to be readable over rsync by storage server
- The job doesn't specify any include-s; its only job is to make the data available to the storage server using rsync.
Storage
[report:main]
type = mysql
server = db-b1000
user = b1000
password = erjcgnhy3p4i8cg2ny34ilugfcnh
db = b1000
[dest:local]
type = active
path = /srv/stor1/b1000
verbosity = 1
[job:system]
direction = pull
type = full
dest = local
report_source = rsync://ext/reports
report_poll_wait = 3
report_poll_retries = 3
report = main
include = rsync://ext/backup
exclude = *.tmp *.log
Notes:
- The pull job has its destination configured as a local directory. This should probably be a directory where other backups are stored too.
- report_source and include are the two rsync resources made available on the untrusted host.