
This document describes the information we keep in zookeeper, how it is generated, and how the python client uses it.

Keyspace / Shard / Tablet Data

Keyspace

Each keyspace is now a global zookeeper path, with sub-directories for its shards and action / actionlog. There is no data present (there used to be, but the vt tools never touched it).

$ zk ls /zk/global/vt/keyspaces/ruser
action
actionlog
shards

The path and sub-paths are created by 'vtctl CreateKeyspace'.

We use the action and actionlog paths for locking only, no process is actively watching these paths.
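For illustration, here is a minimal sketch of the classic zookeeper lock recipe applied to such an action path, using the go-zookeeper client. This is only an illustration of the idea, assuming a placeholder zookeeper address; the exact node names, payloads and protocol that vtctl uses are not described here.

package main

import (
    "fmt"
    "sort"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    // Hypothetical zookeeper address; adjust for your cell.
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    actionPath := "/zk/global/vt/keyspaces/ruser/action"

    // Create an ephemeral, sequential node under the action path.
    me, err := conn.Create(actionPath+"/lock-", []byte("reparent"),
        zk.FlagEphemeral|zk.FlagSequence, zk.WorldACL(zk.PermAll))
    if err != nil {
        panic(err)
    }

    // Classic recipe: we hold the lock if our node has the lowest sequence number.
    children, _, err := conn.Children(actionPath)
    if err != nil {
        panic(err)
    }
    sort.Strings(children)
    fmt.Println("lock holder:", actionPath+"/"+children[0], "me:", me)
}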

Shard

A shard is a global zookeeper path, with sub-directories for its action / actionlog, and replication graph.

$ zk ls /zk/global/vt/keyspaces/user/shards/10-20
action
actionlog
cell1-0000200278

We use the action and actionlog paths for locking only, no process is actively watching these paths.

A shard path is also a node with a Shard object:

type Shard struct {
    // There can be at most one master, but there may be none.
    MasterAlias TabletAlias

    // This must match the shard name based on our other conventions,
    // but it is helpful to have it decomposed here.
    KeyRange key.KeyRange

    // ServedTypes is a list of all the tablet types this shard will
    // serve. This is usually used with overlapping shards during
    // data shuffles like shard splitting.
    ServedTypes []TabletType

    // SourceShards is the list of shards we're replicating from,
    // using filtered replication.
    SourceShards []SourceShard

    // Cells is the list of cells that have tablets for this shard.
    // It is populated at InitTablet time when a tablet is added
    // in a cell that is not in the list yet.
    Cells []string
}

type SourceShard struct {
    // Uid is the unique ID for this SourceShard object.
    // It is for instance used as a unique index in blp_checkpoint
    // when storing the position. It should be unique within a
    // destination Shard, but not globally unique.
    Uid uint32

    // the source keyspace
    Keyspace string

    // the source shard
    Shard string

    // the source shard keyrange
    KeyRange key.KeyRange

    // we could add other filtering information, like table list, ...
}

$ zk cat /zk/global/vt/keyspaces/ruser/shards/10-20
{
  "MasterAlias": {
    "Cell": "nyc",
    "Uid": 200278
  },
  "KeyRange": {
    "Start": "10",
    "End": "20"
  },
 "Cells": [
    "cell1",
    "cell2"
 ]
}
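As a small sketch, the node data can be decoded with encoding/json; the types below are trimmed local copies of the ones above (the real code uses key.KeyRange and TabletType), kept just close enough to parse the example.

package main

import (
    "encoding/json"
    "fmt"
)

// Trimmed local copies of the types above, just enough to decode the node.
type TabletAlias struct {
    Cell string
    Uid  uint32
}

type KeyRange struct {
    Start string
    End   string
}

type Shard struct {
    MasterAlias TabletAlias
    KeyRange    KeyRange
    Cells       []string
}

func main() {
    data := []byte(`{
      "MasterAlias": {"Cell": "nyc", "Uid": 200278},
      "KeyRange": {"Start": "10", "End": "20"},
      "Cells": ["cell1", "cell2"]
    }`)

    var s Shard
    if err := json.Unmarshal(data, &s); err != nil {
        panic(err)
    }
    // Prints the master alias in the same cell-uid form used for tablet paths.
    fmt.Printf("master %s-%010d serves [%s, %s) in cells %v\n",
        s.MasterAlias.Cell, s.MasterAlias.Uid,
        s.KeyRange.Start, s.KeyRange.End, s.Cells)
}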

The shard path and sub-directories are created when the first tablet in that shard is created.

The Shard object is updated when a tablet is added in a cell that is not yet in its Cells list, or when the master changes.

Tablet

A tablet has a path in zookeeper, with its action / actionlog and pid file:

$ zk ls /zk/nyc/vt/tablets/0000200308
action
actionlog
pid

We use the action and actionlog paths for remote execution of actions. vttablet will watch that directory and launch a vtaction for every requested action.
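A minimal sketch of that watch loop with the go-zookeeper client is shown below; the paths match the example above, but the dispatch is reduced to a print statement and the real vttablet logic (spawning vtaction, handling errors and reconnects) is left out.

package main

import (
    "fmt"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    actionPath := "/zk/nyc/vt/tablets/0000200308/action"

    for {
        // ChildrenW returns the current action nodes and a channel that
        // fires when the set of children changes.
        children, _, events, err := conn.ChildrenW(actionPath)
        if err != nil {
            panic(err)
        }
        for _, child := range children {
            // A real vttablet would launch a vtaction process here.
            fmt.Println("pending action:", child)
        }
        <-events // block until the action directory changes
    }
}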

A tablet also has a node of type Tablet:

type Tablet struct {
        Cell        string      // the zk cell this tablet is assigned to (doesn't change)
        Uid         uint32      // the server id for this instance
        Parent      TabletAlias // the globally unique alias for our replication parent - zero if this is the global master
        Addr        string      // host:port for queryserver
        SecureAddr  string      // host:port for queryserver using encrypted connection
        MysqlAddr   string      // host:port for the mysql instance
        MysqlIpAddr string      // ip:port for the mysql instance - needed to match slaves with tablets and preferable to relying on reverse dns

        Keyspace string
        Shard    string
        Type     TabletType

        State TabletState

        // Normally the database name is implied by "vt_" + keyspace. I
        // really want to remove this but there are some databases that are
        // hard to rename.
        DbNameOverride string
        KeyRange       key.KeyRange
}

$ zk cat /zk/nyc/vt/tablets/0000200308
{
  "Cell": "nyc",
  "Uid": 200308,
  "Parent": {
    "Cell": "",
    "Uid": 0
  },
  "Addr": "nyc-db308.nyc.youtube.com:8101",
  "SecureAddr": "nyc-db308.nyc.youtube.com:8102",
  "MysqlAddr": "nyc-db308.nyc.youtube.com:3306",
  "MysqlIpAddr": "10.33.158.138:3306",
  "Keyspace": "",
  "Shard": "",
  "Type": "idle",
  "State": "ReadOnly",
  "DbNameOverride": "",
  "KeyRange": {
    "Start": "",
    "End": ""
  }
}

The Tablet object is created by 'vtctl InitTablet'. Up-to-date information (port numbers, ...) is maintained by the vttablet process. 'vtctl ChangeSlaveType' will also change the Tablet record.
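Conceptually, such an update is a read-modify-write of the tablet node; the sketch below shows that pattern with go-zookeeper, using the node version returned by Get to guard the Set. It is only an illustration, not the actual ChangeSlaveType code, and the Tablet type here is a trimmed local copy (a real update would decode the full record so the other fields are preserved).

package main

import (
    "encoding/json"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

// Trimmed local copy of the Tablet type, enough for this sketch only.
type Tablet struct {
    Cell string
    Uid  uint32
    Type string
}

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    tabletPath := "/zk/nyc/vt/tablets/0000200308"

    // Read the current record and remember the node version.
    data, stat, err := conn.Get(tabletPath)
    if err != nil {
        panic(err)
    }
    var t Tablet
    if err := json.Unmarshal(data, &t); err != nil {
        panic(err)
    }

    // Change the tablet type, e.g. idle -> spare.
    t.Type = "spare"
    newData, err := json.Marshal(&t)
    if err != nil {
        panic(err)
    }

    // The write fails if someone else modified the node in between.
    if _, err := conn.Set(tabletPath, newData, stat.Version); err != nil {
        panic(err)
    }
}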

Replication Graph

The data maintained by the vt tools is as follows:

* the master is stored in the Shard object
* the replication links are stored in each cell for each master/slave relationship (in the cell of the slave), in a ShardReplication object (sketched below)
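This page does not show the ShardReplication object itself; as a purely hypothetical illustration of what such a per-cell record could look like, one link per master/slave relationship might be modeled as below. The type and field names here are assumptions, not the actual definition.

// Hypothetical sketch only: the actual ShardReplication type is not
// documented on this page, and its real fields may differ.
type ReplicationLink struct {
    TabletAlias TabletAlias // the slave tablet, in this cell
    Parent      TabletAlias // its replication master
}

type ShardReplication struct {
    // One entry per master/slave relationship whose slave lives in this cell.
    ReplicationLinks []ReplicationLink
}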

Whenever the replication graph changes, we rebuild the serving graph for that cell.

Serving Graph

The serving graph for a shard is maintained in every cell that contains tablets for that shard. To get all the available keyspaces in a cell, just list the top-level cell serving graph directory:

$ zk ls /zk/nyc/vt/ns
lookup
user

The python client lists that directory at startup to find all the keyspaces.
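The equivalent lookup from any client is just a directory listing; a minimal Go sketch with go-zookeeper, assuming a placeholder zookeeper address:

package main

import (
    "fmt"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Each child of the cell's serving graph root is a keyspace name.
    keyspaces, _, err := conn.Children("/zk/nyc/vt/ns")
    if err != nil {
        panic(err)
    }
    fmt.Println("keyspaces:", keyspaces) // e.g. [lookup user]
}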

SrvKeyspace

The keyspace data is stored under /zk/<cell>/vt/ns/<keyspace>:

type SrvShard struct {
    // Copied from Shard
    KeyRange    key.KeyRange
    ServedTypes []TabletType

    // TabletTypes represents the list of types we have serving tablets
    // for, in this cell only.
    TabletTypes []TabletType

    // For atomic updates
    version int64
}

type SrvKeyspace struct {
    // Shards to use per type, only contains complete partitions.
    Partitions map[TabletType]*KeyspacePartition

    // This list will be deprecated as soon as Partitions is used.
    // List of non-overlapping shards sorted by range.
    Shards []SrvShard

    // List of available tablet types for this keyspace in this cell.
    // May not have a server for every shard, but we have some.
    TabletTypes []TabletType

    // For atomic updates
    version int64
}

type KeyspacePartition struct {
        // List of non-overlapping continuous shards sorted by range.
        Shards []SrvShard
}

$ zk cat /zk/nyc/vt/ns/rlookup
{
  "Shards": [
    {
      "KeyRange": {
        "Start": "",
        "End": ""
      }
    }
  ],
  "TabletTypes": [
    "master",
    "rdonly",
    "replica"
  ]
}
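Since Shards is a sorted list of non-overlapping ranges, resolving which shard owns a given keyspace id is a scan (or binary search) over those ranges. A hedged sketch of that lookup, with a simplified local SrvShard and treating an empty End as unbounded:

package main

import "fmt"

// Simplified local copies, enough to show the range lookup.
type KeyRange struct {
    Start string
    End   string
}

type SrvShard struct {
    KeyRange KeyRange
}

// findShard returns the index of the shard whose range contains keyspaceID,
// or -1 if none does. An empty End means "up to +infinity".
func findShard(shards []SrvShard, keyspaceID string) int {
    for i, s := range shards {
        if keyspaceID >= s.KeyRange.Start &&
            (s.KeyRange.End == "" || keyspaceID < s.KeyRange.End) {
            return i
        }
    }
    return -1
}

func main() {
    shards := []SrvShard{
        {KeyRange: KeyRange{Start: "", End: "10"}},
        {KeyRange: KeyRange{Start: "10", End: "20"}},
        {KeyRange: KeyRange{Start: "20", End: ""}},
    }
    fmt.Println(findShard(shards, "15")) // 1: falls in [10, 20)
}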

The only way to build this data is to run the following vtctl command:

$ vtctl RebuildKeyspaceGraph <keyspace>

When building a new data center, this command should be run for every keyspace.

Rebuilding a keyspace graph will:

* find all the shard names in the keyspace by looking at the children of /zk/global/vt/keyspaces/<keyspace>/shards
* rebuild the graph for every shard in the keyspace (see below)
* find the list of cells to update by collecting the cells for each tablet of the first shard
* compute, sanity-check and update the keyspace graph object in every cell

The python client reads these nodes to find the shard map (KeyRanges, TabletTypes, ...).

SrvShard

The shard data is stored under /zk/<cell>/vt/ns/<keyspace>/<shard>:

$ zk cat /zk/nyc/vt/ns/rlookup/0
{
  "KeyRange": {
    "Start": "",
    "End": ""
  }
}

We also have per serving type data under /zk/<cell>/vt/ns/<keyspace>/<shard>/<type>:

$ zk cat /zk/nyc/vt/ns/rlookup/0/master
{
  "entries": [
    {
      "uid": 200274,
      "host": "nyc-db274.nyc.youtube.com",
      "port": 0,
      "named_port_map": {
        "_mysql": 3306,
        "_vtocc": 8101,
        "_vts": 8102
      }
    }
  ]
}

The shard serving graph can be rebuilt using the 'vtctl RebuildShardGraph <keyspace>/<shard>' command. However, it is also triggered by any 'vtctl ChangeSlaveType' command, so running it by hand is usually not needed. For instance, when vt_bns_monitor takes servers in and out of serving state, it will rebuild the shard graph.

Note this will rebuild the serving graph for all cells, not just one cell.

Rebuilding a shard serving graph will:

* compute the data to write by looking at all the tablets from the replication graph
* write all the /zk/<cell>/vt/ns/<keyspace>/<shard>/<type> nodes everywhere
* delete any pre-existing /zk/<cell>/vt/ns/<keyspace>/<shard>/<type> node that is not in use any more
* compute and write all the /zk/<cell>/vt/ns/<keyspace>/<shard> nodes everywhere

The python client reads the per-type data nodes to find servers to talk to. When resolving user.10-20.master, it will try to read /zk/local/vt/ns/user/10-20/master.
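Putting it together, a client-side resolution builds the path from (cell, keyspace, shard, type), reads the node, decodes the entries and picks a port from named_port_map. A hedged Go sketch of that lookup, with trimmed local types matching the json shown above and a placeholder zookeeper address:

package main

import (
    "encoding/json"
    "fmt"
    "time"

    "github.com/samuel/go-zookeeper/zk"
)

// Trimmed endpoint types matching the per-type node shown above.
type EndPoint struct {
    Uid          uint32         `json:"uid"`
    Host         string         `json:"host"`
    Port         int            `json:"port"`
    NamedPortMap map[string]int `json:"named_port_map"`
}

type EndPoints struct {
    Entries []EndPoint `json:"entries"`
}

func main() {
    conn, _, err := zk.Connect([]string{"localhost:2181"}, 10*time.Second)
    if err != nil {
        panic(err)
    }
    defer conn.Close()

    // Resolve user.10-20.master in cell nyc.
    path := fmt.Sprintf("/zk/%s/vt/ns/%s/%s/%s", "nyc", "user", "10-20", "master")

    data, _, err := conn.Get(path)
    if err != nil {
        panic(err)
    }
    var eps EndPoints
    if err := json.Unmarshal(data, &eps); err != nil {
        panic(err)
    }
    for _, e := range eps.Entries {
        // _vtocc is the query service port in the example above.
        fmt.Printf("%s:%d\n", e.Host, e.NamedPortMap["_vtocc"])
    }
}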