Running PostgreSQL using ZFS and AWS EBS

    The main reason to use PostgreSQL with ZFS (instead of ext4/xfs) is data compression. Using LZ4, you can achieve a 2-3x compression ratio, which means you need to write and read 2-3x less data. ZSTD offers even better compression at the expense of slightly higher CPU usage.

    The second reason is the Adaptive Replacement Cache (ARC). ARC is a page replacement algorithm with slightly better characteristics than the Linux page cache. Since it caches compressed blocks, you can also fit more data in the same amount of RAM.

    Basic ZFS setup

    First, you need to create a separate pool for PostgreSQL (here /dev/nvme1n1 is the EBS volume's NVMe device, the same one used in the ashift example later):
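
        # Create a pool named "pg"; autoexpand lets it grow when the EBS volume is resized.
        zpool create -o autoexpand=on pg /dev/nvme1n1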

    And two datasets for PostgreSQL data and the write-ahead log (WAL):

        # Move PostgreSQL files to a temp location.
        mv /var/lib/postgresql/14/main/pg_wal /tmp/pg_wal
        mv /var/lib/postgresql /tmp/postgresql

        # Create datasets.
        zfs create -o mountpoint=/var/lib/postgresql pg/data
        zfs create -o mountpoint=/var/lib/postgresql/14/main/pg_wal pg/wal-14

        # Move PostgreSQL files back, preserving ownership and permissions.
        cp -a /tmp/postgresql/* /var/lib/postgresql
        cp -a /tmp/pg_wal/* /var/lib/postgresql/14/main/pg_wal

        # Fix ownership and permissions on the new mountpoints.
        chown postgres:postgres /var/lib/postgresql /var/lib/postgresql/14/main/pg_wal
        chmod 0750 /var/lib/postgresql
        chmod 0750 /var/lib/postgresql/14/main/pg_wal

    ZFS config

    Consider starting with the following ZFS configuration and tune it as you learn more. This is a baseline sketch, not a universal recommendation; all of the properties below are standard OpenZFS settings:
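
        # Enable LZ4 compression.
        zfs set compression=lz4 pg

        # Keep the default record size (see the recordsize notes below).
        zfs set recordsize=128k pg

        # PostgreSQL does not need access time updates.
        zfs set atime=off pg

        # Store extended attributes in inodes instead of hidden files.
        zfs set xattr=sa pg

        # Keep the default logbias (see the logbias section below).
        zfs set logbias=latency pg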

    ZFS ARC size

    By default, ZFS uses 50% of RAM for the Adaptive Replacement Cache (ARC). Consider increasing ARC to 70-80% of RAM, but make sure to leave enough memory for PostgreSQL shared_buffers:

        # Set the ARC size to 1GB (an example value; size it for your instance).
        echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max

    To persist the ARC size across reboots, create /etc/modprobe.d/zfs.conf with the same value:
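
        # /etc/modprobe.d/zfs.conf
        options zfs zfs_arc_max=1073741824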

    recordsize is the size of the largest block of data that ZFS will read and write. ZFS compresses each block individually, and compression works better on larger blocks. Use the default recordsize=128k and decrease it to 32-64k if you need more TPS (transactions per second).

    • Larger recordsize means better compression, which improves performance if your queries read/write lots of data (tens of megabytes).
    • Smaller recordsize means more TPS.
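
    For example, to lower the record size on the data dataset (note that recordsize only applies to files written after the change):

        zfs set recordsize=32k pg/data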

    ARC and shared_buffers

    Since ARC caches compressed blocks, prefer it over PostgreSQL shared_buffers for caching hot data. But making shared_buffers too small will hurt write speed, so lower shared_buffers only as long as your write speed does not suffer too much, and leave the rest of the RAM for ARC.
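
    As a hypothetical starting point for an instance with 64GB of RAM, leaving most of the memory to ARC:

        # postgresql.conf (hypothetical 64GB instance)
        shared_buffers = 8GB
        # Tell the planner how much memory is available for caching, ARC included.
        effective_cache_size = 48GB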

    Disabling TOAST compression

    To avoid compressing data twice, you can disable PostgreSQL TOAST compression by setting column storage to EXTERNAL. But it does not make much difference, because:

    • LZ4 is extremely fast.
    • Both LZ4 and ZSTD have special logic to skip incompressible (or already compressed) parts of data.
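
    For example, assuming a hypothetical table measurements with a TOAST-able payload column:

        ALTER TABLE measurements ALTER COLUMN payload SET STORAGE EXTERNAL;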

    Alignment shift

    Use the default ashift value with Amazon Elastic Block Store and other cloud storage, because an EBS volume is not a single physical device but a logical volume that spans numerous distributed devices.

    But if you know the sector size of the drive, it is worth configuring ashift explicitly:

        # ashift is a power of two: ashift=12 means 2^12 = 4096-byte sectors.
        zpool create -o ashift=12 -o autoexpand=on pg /dev/nvme1n1

    Because ZFS always writes full blocks, you can disable full page writes in PostgreSQL via the full_page_writes = off setting.

    PostgreSQL block size and WAL size

    The default PostgreSQL block size is 8k, and it does not match the ZFS record size (128k by default). The result is that while PostgreSQL writes data in 8k blocks, ZFS has to work with 128k records (known as write amplification). You can improve the situation by increasing the PostgreSQL block size to 32k and the WAL block size to 64k. This requires re-compiling PostgreSQL and re-initializing the database; see the sketch after the list below.

    • Larger blocksize considerably improves performance of queries that read a lot of data (tens of megabytes). This effect is not specific to ZFS, and you can use larger block sizes with other filesystems as well.
    • Smaller blocksize means a higher transaction rate per second.
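
    A minimal sketch of such a build, assuming a PostgreSQL source checkout (both flags are standard configure options; the values are in kilobytes):

        ./configure --with-blocksize=32 --with-wal-blocksize=64
        make && sudo make install
        # The on-disk layout changes, so the cluster must be re-initialized.
        sudo -u postgres initdb -D /var/lib/postgresql/14/main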

    logbias

    Use the default logbias=latency.
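
    To verify the current value:

        zfs get logbias pg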

    Quote from @taratarabobara:

    logbias=throughput will fragment every. Single. Block. Written to your pool.

    Normally ZFS writes data and metadata near sequentially, so they can be read with a single read IOP later. Indirect syncs (logbias=throughput) cause metadata and data to be spaced apart, and data spaced apart from data. Fragmentation results, along with very poor pool IO merge.

    If you want to see this in action, do “zfs send dataset >/dev/null” while watching “zpool iostat -r 1” in another window. You will see many, many 4K reads that cannot be aggregated with anything else. This is the cost of indirect sync, and you pay it at every single read.

    It should only be used in very specific circumstances.

    ZFS snapshots

    If you are going to use ZFS snapshots, create a separate dataset for PostgreSQL WAL files. This way snapshots of your main dataset are smaller. Don't forget to back up WAL files separately so you can use Point-in-Time Recovery.
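
    For example, to snapshot only the main data dataset (dataset names match the setup above):

        zfs snapshot pg/data@nightly-$(date +%Y%m%d)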

    But usually it is easier and cheaper to store backups on S3 using a dedicated backup tool. Another popular option is EBS snapshots.