FDB Setup Guide
FoundationDB - Installation & Configuration
FoundationDB - Installation
Use the correct version when installing FoundationDB.
Full documentation on the usage of FoundationDB is available here: https://apple.github.io/foundationdb/index.html
The current stable version to use is 7.1.61.
Installation
mkdir actordb
cd actordb/
# Download FDB clients
wget "https://github.com/apple/foundationdb/releases/download/7.1.61/foundationdb-clients_7.1.61-1_amd64.deb"
# Download FDB server
wget "https://github.com/apple/foundationdb/releases/download/7.1.61/foundationdb-server_7.1.61-1_amd64.deb"
# Install both packages
sudo dpkg -i foundationdb-clients_7.1.61-1_amd64.deb
sudo dpkg -i foundationdb-server_7.1.61-1_amd64.deb
FoundationDB - Configuration
Configuration
Configuration of FoundationDB is done in multiple stages, node by node. Start by correctly configuring a single node.
Examples here assume the first node has the IP 192.168.100.10, the second 192.168.100.11, and the third 192.168.100.12.
Configure first node
Prepare the configuration
Start by editing /etc/foundationdb/fdb.cluster and changing the IP address to the IP of the server on which FoundationDB will accept connections.
After modification, the content of fdb.cluster should look similar to this:
In fdb.cluster, set nameofmynewcluster and the IP of the machine to your own values.
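A minimal sketch of the expected layout; the string after the colon is a random cluster ID generated at install time, shown here only as a placeholder:

```
nameofmynewcluster:aBcD1234eFgH5678@192.168.100.10:4500
```

The general format is description:ID@IP:PORT, with additional coordinator addresses appended as a comma-separated list.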
Then prepare the foundationdb.conf file in /etc/foundationdb.
Make sure to set public-address and listen-address accordingly. Example foundationdb.conf for the 192.168.100.10 machine:
## foundationdb.conf
##
## Configuration file for FoundationDB server processes
## Full documentation is available at
## https://apple.github.io/foundationdb/configuration.html#the-configuration-file
[fdbmonitor]
user = foundationdb
group = foundationdb
[general]
restart-delay = 60
## by default, restart-backoff = restart-delay-reset-interval = restart-delay
# initial-restart-delay = 0
# restart-backoff = 60
# restart-delay-reset-interval = 60
cluster-file = /etc/foundationdb/fdb.cluster
# delete-envvars =
# kill-on-configuration-change = true
## Default parameters for individual fdbserver processes
[fdbserver]
command = /usr/sbin/fdbserver
# public-address = auto:$ID ############## SEE LINE BELOW ################
public-address = 192.168.100.10:$ID
# listen-address = public ############## SEE LINE BELOW ################
listen-address = public
datadir = /var/lib/foundationdb/data/$ID
logdir = /var/log/foundationdb
# logsize = 10MiB
# maxlogssize = 100MiB
# machine-id =
# datacenter-id =
# class =
# memory = 8GiB
# storage-memory = 1GiB
# cache-memory = 2GiB
# metrics-cluster =
# metrics-prefix =
## An individual fdbserver process with id 4500
## Parameters set here override defaults from the [fdbserver] section
[fdbserver.4500]
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb
# BACKUP AGENT CONFIGURATION ############## SEE LINE BELOW ################
[backup_agent.1]
When configuring a 3-node cluster where backups will be written to a physical drive on one of the nodes, only the node storing the backup should have [backup_agent.1] in its config. On all other nodes it must be commented out.
Finalize first node and run it
Enable the service and restart it with new configuration:
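Assuming a systemd-based distribution (the deb packages register a foundationdb service), a sketch:

```
sudo systemctl enable foundationdb
sudo systemctl restart foundationdb
```

On older init systems the equivalent is sudo service foundationdb restart.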
Configure the node through fdbcli command line interface:
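A sketch of the first-time initialization inside fdbcli, creating a new database with single redundancy and the ssd storage engine:

```
fdb> configure new single ssd
```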
This creates a working single FoundationDB node, which is the starting point of the cluster.
Configure other nodes (2,3,...)
- Copy the fdb.cluster from the first node to the second and third nodes.
- Configure foundationdb.conf as on the first node: insert the node's IP addresses and configure whether it is a backup node.
Enable the service and restart it with new configuration:
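As on the first node, assuming systemd:

```
sudo systemctl enable foundationdb
sudo systemctl restart foundationdb
```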
After running fdbcli and issuing status, you should see a working connection to the previous node.
Observing FDB cluster status
At any time we can run status details in fdbcli to observe cluster status:
fdb> status details
Using cluster file `/etc/foundationdb/fdb.cluster'.
Configuration:
Redundancy mode - double
Storage engine - ssd-2
Coordinators - 3
Usable Regions - 1
Cluster:
FoundationDB processes - 3
Zones - 3
Machines - 3
Memory availability - 3.3 GB per process on machine with least available
>>>>> (WARNING: 4.0 GB recommended) <<<<<
Fault Tolerance - 1 machines
Server time - 10/25/23 06:48:05
Data:
Replication health - Healthy
Moving data - 0.000 GB
Sum of key-value sizes - 1.327 GB
Disk space used - 3.648 GB
Operating space:
Storage server - 34.8 GB free on most full server
Log server - 34.8 GB free on most full server
Workload:
Read rate - 17 Hz
Write rate - 0 Hz
Transactions started - 5 Hz
Transactions committed - 0 Hz
Conflict rate - 0 Hz
Backup and DR:
Running backups - 0
Running DRs - 0
Process performance details:
192.168.100.10:4500 ( 2% cpu; 4% machine; 0.000 Gbps; 1% disk IO; 4.0 GB / 6.6 GB RAM )
192.168.100.11:4500 ( 1% cpu; 1% machine; 0.000 Gbps; 1% disk IO; 1.7 GB / 3.3 GB RAM )
192.168.100.12:4500 ( 2% cpu; 1% machine; 0.000 Gbps; 0% disk IO; 1.7 GB / 3.3 GB RAM )
Coordination servers:
192.168.100.10:4500 (reachable)
192.168.100.11:4500 (reachable)
192.168.100.12:4500 (reachable)
Client time: 10/25/23 06:48:05
fdb>
Configure cluster behaviour (replication factor) and coordinator nodes
When at least 3 nodes are running, we can switch to the double redundancy mode, which gives us tolerance of 1 failed node.
We want multiple coordinators in case of node failure; in our case we will use all 3 (consult with architects on how many coordinators to use).
The following example sets 192.168.100.10:4500, 192.168.100.11:4500 and 192.168.100.12:4500 as coordinator nodes:
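A sketch of both steps inside fdbcli, using the standard coordinators and configure commands:

```
fdb> coordinators 192.168.100.10:4500 192.168.100.11:4500 192.168.100.12:4500
fdb> configure double
```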
Post-Configuration Steps
Backup & Restore
References to backup & restore tools
Backup: https://apple.github.io/foundationdb/backups.html#fdbbackup-command-line-tool
Restore: https://apple.github.io/foundationdb/backups.html#fdbrestore-command-line-tool
Create a backup
Before backing up to a single node, make sure all cluster nodes except the one performing the backup have [backup_agent.1] disabled in their foundationdb.conf, like this:
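On the non-backup nodes only the section header is commented out; the [backup_agent] section itself stays in place:

```
[backup_agent]
command = /usr/lib/foundationdb/backup_agent/backup_agent
logdir = /var/log/foundationdb
# [backup_agent.1]
```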
Example to create a backup in folder /tmp/backup-2023-10/:
fdbbackup start -w -d file:///tmp/backup-2023-10/
Restore from backup
Execute the following command to restore a cluster from a backup snapshot:
fdbrestore start --dest-cluster-file /etc/foundationdb/fdb.cluster -r file:///tmp/backup-2023-10/backup-2023-10-17-14-27-10.788233/
Ensure that the path includes the subfolder of the backup to be restored, like in the example above.
Sample backup script
#!/bin/bash
TODAY=$(date '+%Y-%m-%d')
KEEPBACKUPDAYS=3

# execute actordb backup
echo "performing actordb backup ..."
actordb backup --path /external/biocoded/backup/actordb/$TODAY/ --master 127.0.0.1:33306

# fdb backup
echo "performing foundationdb backup..."
mkdir -p /external/biocoded/backup/fdb/$TODAY/
chmod a+rw /external/biocoded/backup/fdb/$TODAY/
fdbbackup start -w -d file:///external/biocoded/backup/fdb/$TODAY/

# store time of backup
echo "$TODAY done." >> /external/biocoded/backup/backups.txt

# delete all backups older than $KEEPBACKUPDAYS days
find /external/biocoded/backup/* -type d -ctime +$KEEPBACKUPDAYS -exec rm -rf {} \;
Node IP changes
Initial state
- configured per docs
- 3 nodes, each is a coordinator
- e.g. 192.168.122.190, 192.168.122.32, 192.168.122.18
- It is assumed that no data changes occur while the IPs are being changed
Single IP change
This is essentially adding a new node while one machine is down, with the caveat that the data on the machine is preserved.
- Change the IP of a single machine (e.g. 192.168.122.190 -> 192.168.122.242)
- Change the public-address in the fdb configuration file
- Restart the fdb service
- The node should connect to the cluster and the database should be reinitializing
- The cluster status (i.e. fdbcli -> status details) will warn about an unreachable coordinator
- Change the coordinators to include the new IP, removing the old one
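The coordinator change can be done with the fdbcli coordinators command; a sketch with the new address swapped in for the old one (addresses from the example above):

```
fdb> coordinators 192.168.122.242:4500 192.168.122.32:4500 192.168.122.18:4500
```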
Change all IPs at once
- Either all nodes are shut down or the fdb service is stopped
- The new nodes are now e.g. 192.168.122.249, 192.168.122.185, 192.168.122.22
- Change the public-address in the configuration file on each node
- Starting the nodes or the fdb service at this point will result in no node being able to join the cluster, and fdbcli -> status will warn about no coordinator being reachable
- Manually edit the /etc/foundationdb/fdb.cluster file to contain the correct IP addresses
-- Initial cluster file state
# DO NOT EDIT!
# This file is auto-generated, it is not to be edited by hand
testcluster:AoWN336IupjxCSlITaImlBUdSAukNTru@192.168.122.18:4500,192.168.122.32:4500,192.168.122.190:4500
-- Cluster file state after changes
testcluster:AoWN336IupjxCSlITaImlBUdSAukNTru@192.168.122.22:4500,192.168.122.185:4500,192.168.122.249:4500
- After the changes to the cluster file, the fdb service can be re/started and the nodes should join the cluster with the database still present and in a healthy state
Exclude/Include nodes
Exclude
- Configuration limitations should be taken into account when doing this
- e.g. using a double configuration with only 2 available coordinators and then excluding another won't work: it makes the database unavailable (as per the docs for double mode), and the only way to restore the state is to include the machine again or add a new node
- It can lead to situations where "unsafe" recovery methods need to be used
- exclude <addresses>
- Excluded addresses should be removed from the coordinators beforehand
- Wait for command completion
- status should report the count of excluded machines
- At this point the fdb service on the machine can be stopped and the machine shut down
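A sketch of the exclude flow in fdbcli, using the third node from the earlier examples as the machine to remove (the coordinators command comes first since that node is a coordinator):

```
fdb> coordinators 192.168.100.10:4500 192.168.100.11:4500
fdb> exclude 192.168.100.12:4500
```

By default, exclude blocks until all data has been moved off the excluded address.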
Include
- include <address>
- The node should automatically be picked up if the fdb process is running
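A matching sketch of re-including the same address in fdbcli:

```
fdb> include 192.168.100.12:4500
```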
Attempted
- machine without the specified /var/lib/foundationdb/data/4500
- machine with the specified /var/lib/foundationdb/data/4500
- removing the third machine:
  - set only 2 coordinators
  - exclude the leftover machine & wait for completion
  - include the machine
Misc
- Excluding 1 of the 2 coordinators when configured in double mode caused issues, and the database seemed to be in a weird state (discrepancy in storage used, issues with adding nodes)
- While adding/removing nodes, despite the configuration stating double mode, no machines were reported for fault tolerance (most likely caused by the previous point)
- Solved by changing to a lower redundancy mode and back to the initial one