FDB Setup Guide
FoundationDB - Installation & Configuration
FoundationDB - Installation
Use the correct version when installing FoundationDB.
Full documentation on the usage of FoundationDB is available here: https://apple.github.io/foundationdb/index.html
The current stable version to use is 7.1.61.
Installation
mkdir actordb
cd actordb/
# Download FDB clients
wget "https://github.com/apple/foundationdb/releases/download/7.1.61/foundationdb-clients_7.1.61-1_amd64.deb"
# Download FDB server
wget "https://github.com/apple/foundationdb/releases/download/7.1.61/foundationdb-server_7.1.61-1_amd64.deb"
# Install both packages
sudo dpkg -i foundationdb-clients_7.1.61-1_amd64.deb
sudo dpkg -i foundationdb-server_7.1.61-1_amd64.deb
FoundationDB - Configuration
Configuration
Configuration of FoundationDB is done in multiple stages, node by node. Start by correctly configuring a single node.
Examples here assume the first node has the IP 192.168.100.10, the second 192.168.100.11, and the third 192.168.100.12.
Configure first node
Prepare the configuration
Start by editing /etc/foundationdb/fdb.cluster and changing the IP address to the IP of the server on which FoundationDB will accept connections.
After modification, the content of fdb.cluster should look similar to this:
In fdb.cluster, set nameofmynewcluster and the IP of the machine to your own values.
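A minimal sketch of the expected layout; the string after the colon is a random cluster ID generated at install time, shown here only as a placeholder:

```
nameofmynewcluster:aBcD1234eFgH5678@192.168.100.10:4500
```

The general format is description:ID@IP:PORT, with additional coordinator addresses appended as a comma-separated list.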
Then prepare the foundationdb.conf file in /etc/foundationdb.
Make sure to set public-address and listen-address accordingly. Example foundationdb.conf for the 192.168.100.10 machine:
## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file
[fdbmonitor]
user = foundationdb
group = foundationdb
[general]
restart-delay = 60
## by default, restart-backoff = restart-delay-reset-interval = restart-delay
# initial-restart-delay = 0
# restart-backoff = 60
# restart-delay-reset-interval = 60
cluster-file = /etc/foundationdb/fdb.cluster
# delete-envvars =
# kill-on-configuration-change = true
## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/sbin/fdbserver
# public-address = auto:$ID ############## SEE LINE BELOW ################
public-address = 192.168.100.10:$ID
# listen-address = public ############## SEE LINE BELOW ################
listen-address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# logsize = 10MiB
# maxlogssize = 100MiB
# machine-id =
# datacenter-id =
# class =
# memory = 8GiB
# storage-memory = 1GiB
# cache-memory = 2GiB
# metrics-cluster =
# metrics-prefix =
## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4500]
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb
# BACKUP AGENT CONFIGURATION ############## SEE LINE BELOW ################
[backup_agent.1]
When configuring a 3-node cluster where backups will be written to a physical drive on one of the nodes, only the node storing the backup should have [backup_agent.1] in its config. On all other nodes it must be commented out.
Finalize first node and run it
Enable the service and restart it with new configuration:
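Assuming a systemd-based distribution (the deb packages register a foundationdb service), a sketch:

```
sudo systemctl enable foundationdb
sudo systemctl restart foundationdb
```

On older init systems the equivalent is sudo service foundationdb restart.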
Configure the node through fdbcli command line interface:
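A sketch of the first-time initialization inside fdbcli, creating a new database with single redundancy and the ssd storage engine:

```
fdb> configure new single ssd
```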
This creates a working single FoundationDB node, which is the starting point of the cluster.
Configure other nodes (2,3,...)
- Copy the fdb.cluster from the first node to the second and third nodes.
- Configure foundationdb.conf as on the first node: insert the node's IP addresses and configure whether it is a backup node.
Enable the service and restart it with new configuration:
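As on the first node, assuming systemd:

```
sudo systemctl enable foundationdb
sudo systemctl restart foundationdb
```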
After running fdbcli and issuing status, you should see a working connection to the previous node.
Observing FDB cluster status
At any time we can run status details in fdbcli to observe cluster status:
fdb> status details
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 3
Usable Regions - 1
Cluster:
FoundationDB processes - 3
Zones - 3
Machines - 3
Memory availability - 3.3 GB per process on machine with least available
>>>>> (WARNING: 4.0 GB recommended) <<<<<
Fault Tolerance - 1 machines
Server time - 10/25/23 06:48:05
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 1.327 GB
Disk space used - 3.648 GB
Operating space:
Storage server - 34.8 GB free on most full server
Log server - 34.8 GB free on most full server
Workload:
Read rate - 17 Hz
Write rate - 0 Hz
Transactions started - 5 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
192.168.100.10:4500 ( 2% cpu; 4% machine; 0.000 Gbps; 1% disk IO; 4.0 GB / 6.6 GB RAM )
192.168.100.11:4500 ( 1% cpu; 1% machine; 0.000 Gbps; 1% disk IO; 1.7 GB / 3.3 GB RAM )
192.168.100.12:4500 ( 2% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 1.7 GB / 3.3 GB RAM )
Coordination servers:
192.168.100.10:4500 (reachable)
192.168.100.11:4500 (reachable)
192.168.100.12:4500 (reachable)
Client time: 10/25/23 06:48:05
fdb>
Configure cluster behaviour (replication factor) and coordinator nodes
When at least 3 nodes are running, we can switch to the double redundancy mode, which gives us tolerance of 1 failed node.
We want multiple coordinators in case of node failure; in our case we will use all 3 (consult with architects on how many coordinators to use).
The following example sets 192.168.100.10:4500, 192.168.100.11:4500 and 192.168.100.12:4500 as coordinator nodes:
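A sketch of both steps inside fdbcli, using the standard coordinators and configure commands:

```
fdb> coordinators 192.168.100.10:4500 192.168.100.11:4500 192.168.100.12:4500
fdb> configure double
```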
Post-Configuration Steps
Backup & Restore
References to backup & restore tools
Backup: https://apple.github.io/foundationdb/backups.html#fdbbackup-command-line-tool
Restore: https://apple.github.io/foundationdb/backups.html#fdbrestore-command-line-tool
Create a backup
Before backing up to a single node, make sure all cluster nodes except the one performing the backup have [backup_agent.1] disabled in their foundationdb.conf, like this:
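On the non-backup nodes only the section header is commented out; the [backup_agent] section itself stays in place:

```
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb
# [backup_agent.1]
```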
Example to create a backup in folder /tmp/backup-2023-10/:
fdbbackup start -w -d file:///tmp/backup-2023-10/
Restore from backup
Execute the following command to restore a cluster from a backup snapshot:
fdbrestore start --dest-cluster-file /etc/foundationdb/fdb.cluster -r file:///tmp/backup-2023-10/backup-2023-10-17-14-27-10.788233/
Ensure that the path includes the subfolder of the backup to be restored, like in the example above.
Sample backup script
#!/bin/bash
TODAY=$(date '+%Y-%m-%d')
KEEPBACKUPDAYS=3

# execute actordb backup
echo "performing actordb backup ..."
actordb backup --path /external/biocoded/backup/actordb/$TODAY/ --master 127.0.0.1:33306

# fdb backup
echo "performing foundationdb backup..."
mkdir -p /external/biocoded/backup/fdb/$TODAY/
chmod a+rw /external/biocoded/backup/fdb/$TODAY/
fdbbackup start -w -d file:///external/biocoded/backup/fdb/$TODAY/

# store time of backup
echo "$TODAY done." >> /external/biocoded/backup/backups.txt

# delete all backups older than $KEEPBACKUPDAYS days
find /external/biocoded/backup/* -type d -ctime +$KEEPBACKUPDAYS -exec rm -rf {} \;
Node IP changes
Initial state
- configured per docs
- 3 nodes, each is a coordinator
- e.g. 192.168.122.190, 192.168.122.32, 192.168.122.18
- It is assumed that no data changes occur while the IPs are being changed
Single IP change
This is essentially adding a new node while one machine is down, with the caveat that the data on the machine is preserved.
- Change the IP of a single machine (e.g. 192.168.122.190 -> 192.168.122.242)
- Change the public-address in the fdb configuration file
- Restart the fdb service
- The node should connect to the cluster and the database should be reinitializing
- The cluster status (i.e. fdbcli -> status details) will warn about an unreachable coordinator
- Change the coordinators to include the new IP, removing the old one
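The coordinator change can be done with the fdbcli coordinators command; a sketch with the new address swapped in for the old one (addresses from the example above):

```
fdb> coordinators 192.168.122.242:4500 192.168.122.32:4500 192.168.122.18:4500
```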
Change all IPs at once
- Either all nodes are shut down or the fdb service is stopped
- The new nodes are now e.g. 192.168.122.249, 192.168.122.185, 192.168.122.22
- Change the public-address in the configuration file on each node
- Starting the nodes or the fdb service at this point will result in no node being able to join the cluster, and fdbcli -> status will warn about no coordinator being reachable
- Manually edit the /etc/foundationdb/fdb.cluster file to contain the correct IP addresses
-- Initial cluster file state
# DO NOT EDIT!
# This file is auto-generated, it is not to be edited by hand
testcluster:AoWN336IupjxCSlITaImlBUdSAukNTru@192.168.122.18:4500,192.168.122.32:4500,192.168.122.190:4500
-- Cluster file state after changes
testcluster:AoWN336IupjxCSlITaImlBUdSAukNTru@192.168.122.22:4500,192.168.122.185:4500,192.168.122.249:4500
- After the changes to the cluster file, the fdb service can be re/started and the nodes should join the cluster with the database still present and in a healthy state
Exclude/Include nodes
Exclude
- Configuration limitations should be taken into account when doing this
- e.g. using a double configuration with only 2 available coordinators and then excluding another won't work: it makes the database unavailable (as per the docs for double mode), and the only way to restore the state is to include the machine again or add a new node
- It can lead to situations where "unsafe" recovery methods need to be used
- exclude <addresses>
- Excluded addresses should be removed from the coordinators beforehand
- Wait for command completion
- status should report the count of excluded machines
- At this point the fdb service on the machine can be stopped and the machine shut down
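A sketch of the exclude flow in fdbcli, using the third node from the earlier examples as the machine to remove (the coordinators command comes first since that node is a coordinator):

```
fdb> coordinators 192.168.100.10:4500 192.168.100.11:4500
fdb> exclude 192.168.100.12:4500
```

By default, exclude blocks until all data has been moved off the excluded address.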
Include
- include <address>
- The node should automatically be picked up if the fdb process is running
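A matching sketch of re-including the same address in fdbcli:

```
fdb> include 192.168.100.12:4500
```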
Attempted
- machine without the specified /var/lib/foundationdb/data/4500
- machine with the specified /var/lib/foundationdb/data/4500
- removing the third machine:
  - set only 2 coordinators
  - exclude the leftover machine & wait for completion
  - include the machine
Misc
- Excluding 1 of the 2 coordinators when configured in double mode caused issues, and the database seemed to be in a weird state (discrepancy in storage used, issues with adding nodes)
- While adding/removing nodes, despite the configuration stating double mode, no machines were reported for fault tolerance (most likely caused by the previous point)
- Solved by changing to a lower redundancy mode and back to the initial one