Large Static Graphs

Giraph Out of Core

Presented by Diptanshu
V props, E props and topo kept in separate files.
Partitions can be loaded from disk to memory on the fly, and written back to disk.
If property changes and not topo, only deltas of property loaded from disk
Sticky partitions for random set of partitions to be retained in memory. LRU to keep recently used partitions in memory.
Messages batched and written to disk in vertex ID order. Multiple batches can be in different files.
Compute method called on vertices in same order. Cursor for each file moves sequentially forward, pointing to current/newer vertex being processed. Ensures single pass thru file.
Desiderata
A "unit" of a graph (partition/subgraph) can fit in memory. Others need not, can be persisted to disk.
State of this unit (topo and property) need to be persisted, loaded on demand. Subset of properties could be used in each supserstep/task phase, allowing only those to be loaded.
Messages for a unit persisted to disk between supersteps, loaded on-demand.
Useful for elasticity to offload partitions when not used in non-stationary graph algos.

Presented by Diptanshu
Each subgraph fits in mem, but not entire graph in distributed mem
If subgraph in memory, updated async in memory. Else write to disk. Messages read when loading that subgraph from disk.
Allows subgraphs to make async progress for k supersteps.
Inactive vertices need not be loaded into memory for a given superstep after being persisted to disk.
Desiderata
- Notion of active and inactive vertices to allow memory to be flushed out. Useful for elasticity?
TODO
Additional Reading: CCGRID 2016 paper by Safi for memory offloading [RD]
Additional Reading: KLA programming model, Giraph unchained (move forward by supersteps/nested ssteps) [RD]
Additional Reading: Look at asycing/sync+async programming models: GraphLab, GraphX? [DK]

Presented by Diptanshu
Edge owned only by single vertex. Undirected means two edges.
Direct mutation on edges owned by a vertex: Add, remove, convert from directed to undirected. Visible in same superstep.
Graph mutation requests for vertices/edges not owned by a vertex (add/remove vertex/edge, optionally with properties). Mutations are applied between two ssteps.
sending message to no-existed vertex can create it.
Mutation conflicts resolved using vertex change and vertex resolver.
Desiderata
Support for v, e, property mutations
Rebalancing partitions/subgraphs across (static/dynamic) set of machines between supersteps? Handling skews due to mutation?
TODO:
What is the specific order in which mutations happen? [DK]
Additional reading: Rebalancing of partitions (Mirzan) [RD]. Algos that have a lot of mutations (e.g. MSF) [DK]

Outedges can have custom implementations. Iterable also accessible thru different models, like ArrayList, HashMapEdgeList, ByteArrayList, etc.
Desiderata
- Support different DS for V, E, SG
TODO:
Additional reading: Other graph DS in shared and distributed memory -- performance, storage footprint [DK].