Everything in Git: Running a Trading Signal Platform on NixOS
By Jan-Lukas Pflaum & Steffen Krutzinna
A monorepo, declarative infrastructure, and one deployment command
Our entire production infrastructure deploys with one command:
clan machines update

Database configuration, workflow schedules, secrets, log shipping, observability—across every machine. No SSH sessions, no manual steps, no configuration (or documentation!) drift. Each server converges to whatever state is declared in our git repository.
As we're writing this, px dynamics is a few weeks old. We want to walk you through how we set up camp. The infrastructure took less time to build than it would take most startups (us included!) to properly configure an AWS account. We're not migrating from something else or "modernizing legacy systems." We started here.
A note on how we got here: we consciously decided to invest in solid infrastructure rather than "move fast and break things." We had help: we found a NixOS freelancer via HN who set up the initial foundation. We didn't figure all of this out ourselves. But since that initial setup, it's been just the two of us: a data engineer and a quant trader. Neither of us is an infrastructure specialist. Once the foundation was in place, we've been able to expand and modify it quickly on our own. The investment in learning Nix pays forward.
The Monorepo Foundation
First things first: Everything lives in one repository:
flake.nix # Entry point: dev environment, fleet inventory, Nix config
/machines/ # Server definitions and what runs on each
/sql/ # Database migrations (yoyo-managed)
/libs/r/pxr/ # R packages
/services/api/ # Customer-facing REST API (FastAPI)
/services/flows/ # Python workflows (Prefect)
/analytics/ # Jupyter notebooks

This matters. A change to a database schema, the Python code that writes to it, and the infrastructure that hosts it can be reviewed in one PR. Context is never lost because there's only one place to look.
The monorepo also enables something we didn't fully anticipate: when you work with Claude Code, it can read the entire infrastructure definition, understand database schemas, network topology, and service dependencies. It can suggest changes that account for the full context. This isn't possible when configuration lives in Terraform state, AWS consoles, and scattered repositories.
The Setup
Three Hetzner servers, all in Germany:
- Database server: Bare metal, 64GB RAM, dual 1TB NVMe in ZFS mirror, 16 cores
- Application server: Bare metal, 8GB RAM, observability stack
- API server: Cloud VM, 4GB RAM, customer-facing REST API with TLS
Total cost is under €100/month. That's PostgreSQL with TimescaleDB, a customer-facing API with rate limiting and authentication, a full Prometheus/Loki/Grafana observability stack, scheduled workflow execution, and offsite backups. The database server alone would cost several hundred dollars per month on a managed cloud service and we'd still need to manage the application layer ourselves.
All machines communicate over WireGuard using IPv6 ULA addresses. All internal services—PostgreSQL, Prometheus, Loki, Grafana—bind only to the VPN interface. The public internet sees SSH and the WireGuard endpoint, nothing else. WireGuard deserves a mention here: it's fast, simple, and we genuinely forget it's running.
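A sketch of what that binding looks like on the database server; the WireGuard interface name matches the configuration shown later, while the ULA address is illustrative:

{ lib, ... }:
{
  # PostgreSQL listens only on the machine's WireGuard ULA address,
  # never on a public interface (address is illustrative)
  services.postgresql.settings.listen_addresses = lib.mkForce "fdcc:aaaa::2";

  # Open the database port only on the WireGuard interface; the public
  # interfaces expose nothing beyond SSH and the WireGuard endpoint
  networking.firewall.interfaces.pxd.allowedTCPPorts = [ 5432 ];
}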
Why not a hyperscaler? No lock-in, no surprise bills, and no services deprecated with six months' notice. And critically for a data-driven company: no egress fees. We move a lot of data: market data ingestion, reading data for developing and retraining models, dashboards that refresh constantly. On AWS or GCP, egress alone would dwarf our current server costs. The servers are commodity hardware. If Hetzner disappeared tomorrow, we'd spin up on any provider with dedicated servers and deploy the same configuration. That's not hypothetical; it's the whole point of declarative infrastructure.
Why NixOS
NixOS solves specific problems that matter for our workload:
Reproducibility: The same configuration produces the same system. Not "should produce"—does produce. When we deploy, we know exactly what will run because it's declared in code.
Atomic updates with rollback: Deployments either succeed completely or don't happen. If something breaks, the previous generation is still there. We've rolled back in production, no biggie.
Single source of truth: No wondering whether the production database settings match what's in the wiki. The configuration is the documentation.
Shared development environments: Everyone works with the same centrally-managed versions of R, Python, and all dependencies. "Works on my machine" isn't a thing. Getting a new colleague fully set up takes five minutes: clone the repo, and direnv automatically loads (and later updates) the environment. The same environment runs locally and in production: same Jupyter kernels, same library versions, same everything. (A minimal sketch of this setup follows below.)
No Docker overhead: We don't need containers because Nix already provides isolation and reproducibility. Services run directly on the host, managed by systemd. We don't have to orchestrate containers, pick base images, or worry about layer caching. Rock-solid tools like systemd handle process supervision, logging, and resource limits, with virtually zero overhead.
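Here's that sketch: the shared environment boils down to a devShell in flake.nix plus a one-line .envrc. The package list is illustrative, not our exact set, and pkgs is assumed to come from the flake's nixpkgs input:

# flake.nix (excerpt): the environment direnv loads for everyone
devShells.x86_64-linux.default = pkgs.mkShell {
  packages = [
    pkgs.R
    (pkgs.python3.withPackages (ps: [ ps.jupyter ps.pandas ]))
    pkgs.postgresql_18  # psql client matching the server version
  ];
};

# .envrc at the repository root: one line, direnv (with nix-direnv) does the rest
use flake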
The result is that everything is fast. Queries return instantly. Deployments complete in seconds. We're not yet at dozens of terabytes of data, but bare metal servers, a well-tuned PostgreSQL, good networking, and the absence of abstraction layers add up. We don't see a need for Kubernetes. With strong bare metal machines and our workload, horizontal scaling isn't the constraint. Vertical scaling is simpler, and we're nowhere near the limits.
And really, these are the same tools running behind AWS and GCP anyway—PostgreSQL, systemd, Linux. We're just running them directly, without the abstraction layer and the markup.
The learning curve is real. Nix is its own programming language with its own idioms. The first weeks involved a lot of "why isn't this working?" and "what does this line do?". But once you internalize the model, infrastructure becomes genuinely tractable. Problems have answers you can find by reading code (and it turns out Claude has a pretty good understanding of Nix, too).
We use the Clan framework for multi-machine orchestration. It handles inventory management, secrets distribution, and WireGuard mesh networking. Clan sits on top of NixOS and provides opinions about how to structure a fleet. Those opinions happen to match what we wanted: machines defined in one place, secrets encrypted in git, VPN configuration derived from user definitions.
A Tour Through the Config
Fleet Inventory
The entire fleet is defined in flake.nix:
clan = {
meta.name = "pxd";
# Shared configuration applied to all machines
inventory.instances.shared = {
module.name = "importer";
roles.default.tags.all = { };
roles.default.extraModules = [ ./machines/shared.nix ];
};
# WireGuard VPN mesh
inventory.instances.pxd = {
module.name = "wireguard";
roles.controller.machines.gateway = { };
roles.controller.settings.endpoint = "xxx.pxdynamics.com";
roles.peer.machines.database = { };
};
};

Every machine gets the shared configuration. The WireGuard mesh is declared once. Adding a new server means adding it to the inventory and running clan machines update.
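For illustration, a hypothetical fourth machine (say, a dedicated worker) would join with one more peer entry mirroring the ones above, plus its own definition under /machines/:

# Hypothetical new machine joining the WireGuard mesh as a peer;
# the shared configuration already applies to it via roles.default.tags.all
inventory.instances.pxd.roles.peer.machines.worker = { };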
Database with Inline Documentation
PostgreSQL configuration becomes self-documenting:
services.postgresql = {
enable = true;
package = pkgs.postgresql_18;
# TimescaleDB for time-series workloads
extensions = ps: [ ps.timescaledb ];
settings.shared_preload_libraries = [ "timescaledb" ];
# Performance settings via PGTune (mixed workload, 48GB RAM, 12 CPUs, SSD)
settings.max_connections = 100;
settings.shared_buffers = "12GB"; # 25% of RAM
settings.effective_cache_size = "36GB"; # 75% of RAM
settings.work_mem = "56173kB"; # (RAM - shared_buffers) / (connections * 3)
settings.random_page_cost = 1.1; # SSD: nearly as fast as sequential
settings.effective_io_concurrency = 200;
settings.min_wal_size = "4GB";
settings.max_wal_size = "16GB";
};

The comments explain why, not what. Future us will know the reasoning behind each setting.
Secrets That Never Touch Disk
Secrets are managed through Clan's vars system and injected via systemd credentials:
clan.core.vars.generators.entsoe = {
prompts.api-key.description = "ENTSO-E Transparency Platform API Key";
prompts.api-key.type = "hidden";
prompts.api-key.persist = true;
};
systemd.services.prefect-worker = {
serviceConfig.LoadCredential = [
"entsoe-api-key:${config.clan.core.vars.generators.entsoe.files.api-key.path}"
];
environment.ENTSOE_API_KEY_FILE = "%d/entsoe-api-key";
};

The secret file path is passed to the service. The actual secret value lives encrypted in the repository (SOPS with age encryption) and only exists in memory at runtime. No plaintext secrets in environment variables, no secrets scattered across config files.
The Prefect Deployment Pattern
This is probably the most novel part of our setup. Prefect workflows are typically deployed by pointing at a git repository and branch. The worker pulls code at runtime.
We do something different:
writers.writeYAML "prefect.yaml" {
deployments = [
{
name = "entsoe-da-forecast";
entrypoint = "${./entsoe/entsoe_forecasts.py}:entsoe_da_forecast";
schedule.cron = "x x * * *";
schedule.timezone = "Europe/Berlin";
work_pool.name = "pxd";
}
];
}

That ${./entsoe/entsoe_forecasts.py} becomes a Nix store path. The file is copied into /nix/store/... during deployment, and the Prefect deployment points directly there.
No git pulls at runtime. No wondering which branch is deployed. No "works on my machine" because the worker runs exactly what the developer tested locally.
The flow code is deployed atomically with the rest of the system configuration. If you roll back the machine, you roll back the flows. If a new flow version has a bug, nixos-rebuild switch --rollback restores both the system and the workflow definitions.
This pattern also solves dependency management. The Python environment is built by Nix with pinned versions. The worker doesn't pip install anything—it runs a complete environment that was built and tested before deployment.
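That environment is itself just a Nix value the worker's service definition points at. A minimal sketch, with an illustrative package list; the pool name matches the deployment above:

{ pkgs, ... }:
let
  # Python environment for the Prefect worker, pinned by the flake's lock file
  # (package list is illustrative, not our exact set)
  prefectEnv = pkgs.python3.withPackages (ps: [
    ps.prefect
    ps.pandas
    ps.sqlalchemy
  ]);
in
{
  # The worker runs exactly this store path; nothing is pip-installed at runtime
  systemd.services.prefect-worker.serviceConfig.ExecStart =
    "${prefectEnv}/bin/prefect worker start --pool pxd";
}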
ZFS with Selective Snapshots
Not everything needs to be snapshotted:
disko.devices.zpool.tank = {
datasets = {
postgresql = {
type = "zfs_fs";
mountpoint = "/var/lib/postgresql";
options."com.sun:auto-snapshot" = "true"; # Data we care about
};
store = {
type = "zfs_fs";
mountpoint = "/nix";
# No auto-snapshot: Nix store is content-addressed and reproducible
};
log = {
type = "zfs_fs";
mountpoint = "/var/log";
# No auto-snapshot: Logs ship to Loki
};
};
};

The Nix store is content-addressed—any path can be rebuilt from the source. Logs ship to Loki for retention. Only the PostgreSQL data directory needs point-in-time snapshots.
Functional WireGuard Peers
Users are defined in a simple data structure:
# users.nix
{
skr5k = {
ssh = "ssh-ed25519 AAAA...";
wireguard = {
ip = "xxxx:xxxx:xx:xxxx::x";
publicKey = "xxxxx...";
};
};
}

The WireGuard configuration derives from this:
networking.wireguard.interfaces.pxd.peers = lib.pipe ../../users.nix [
import
lib.attrValues
(lib.filter (lib.hasAttr "wireguard"))
(map ({ wireguard, ... }: {
inherit (wireguard) publicKey;
allowedIPs = [ "${wireguard.ip}/128" ];
}))
];

Add a user to users.nix, deploy, they're on the VPN. Remove them, deploy, they're off. The same file controls SSH access to machines—users with an ssh key get root access. One file, two access systems, no synchronization bugs.
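The SSH side is the same pipe over the same file. A sketch of how that can look (our exact expression may differ slightly):

# Root SSH keys derived from the same users.nix, mirroring the pipe above
users.users.root.openssh.authorizedKeys.keys = lib.pipe ../../users.nix [
  import
  lib.attrValues
  (map ({ ssh, ... }: ssh))
];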
Observability Without Vendor Lock-in
The observability stack runs entirely on our infrastructure:
- Prometheus: Scrapes metrics from node exporters on both machines, plus PostgreSQL exporter on the database server
- Loki: Aggregates logs from systemd journals via Grafana Alloy agents
- Grafana: Dashboards, exploration, alerting—all pointed at local data sources
No Datadog. No CloudWatch. No per-seat pricing that scales with team size. The data stays on our servers, retention is controlled by us, and the query interface is the same whether we're debugging at 2 PM or 2 AM.
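Provisioning those data sources is itself a few lines of Nix. A sketch using the NixOS Grafana module's declarative provisioning; the Prometheus port and the app server hostname (which also appears in the Alloy config below) are illustrative:

services.grafana.provision.datasources.settings.datasources = [
  {
    name = "Prometheus";
    type = "prometheus";
    access = "proxy";
    url = "http://app-server.internal:9090";  # reachable only over WireGuard
  }
  {
    name = "Loki";
    type = "loki";
    access = "proxy";
    url = "http://app-server.internal:3100";
  }
];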
Everything is declared in Nix and communicates only over WireGuard:
services.alloy = {
enable = true;
configPath = pkgs.writeText "config.alloy" ''
loki.source.journal "systemd" {
forward_to = [loki.write.loki.receiver]
max_age = "12h"
labels = {
job = "systemd-journal",
host = "${config.networking.hostName}",
}
}
loki.write "loki" {
endpoint {
url = "http://app-server.internal:3100/loki/api/v1/push"
}
}
'';
};

Every machine runs Alloy and ships its journal to Loki. When something goes wrong, we query logs across all machines from one place. The host label lets us filter by machine. The unit label lets us filter by systemd service. Standard LogQL queries, no proprietary syntax to learn.
Prometheus scrape configs define what gets collected:
scrapeConfigs = [
{
job_name = "node";
static_configs = [{
targets = [ "db.internal:9100" "app.internal:9100" ];
labels.env = "production";
}];
}
{
job_name = "postgres";
static_configs = [{ targets = [ "db.internal:9187" ]; }];
}
];

CPU, memory, disk, network, PostgreSQL connections, query statistics. All collected every 15 seconds, retained for 30 days, queryable from Grafana. The configuration is the documentation of what we monitor.
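The exporter side is a few lines per machine. A sketch with an illustrative listen address; the ports are the exporters' defaults and match the scrape targets above:

# Node exporter on every machine, listening only on the WireGuard ULA address
# (illustrative address; IPv6 literals need brackets here)
services.prometheus.exporters.node = {
  enable = true;
  listenAddress = "[fdcc:aaaa::2]";
  port = 9100;
};

# PostgreSQL exporter on the database server, connecting over the local socket
services.prometheus.exporters.postgres = {
  enable = true;
  port = 9187;
  runAsLocalSuperUser = true;
};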
Backups That Actually Work
BorgBackup runs nightly to a Hetzner Storage Box:
inventory.instances.borgbackup = {
roles.client.machines.database = {
settings = {
startAt = "*-*-* 02:00:00"; # 2 AM UTC daily
destinations.storagebox.repo = "ssh://backup@storage.example.com:23/./repo";
};
};
};

One interesting part is handling SQLite correctly. Prefect uses SQLite for its state database. Backing up an active SQLite database can produce a corrupted backup if writes happen during the copy.
The solution uses SQLite's backup API:
clan.core.state.prefect = {
folders = [ "/var/lib/private/prefect-server" ];
preBackupScript = ''
${pkgs.sqlite}/bin/sqlite3 /var/lib/prefect-server/prefect.db \
".backup /var/lib/prefect-server/prefect-backup.db"
'';
postRestoreScript = ''
if [ -f /var/lib/prefect-server/prefect-backup.db ]; then
mv /var/lib/prefect-server/prefect-backup.db /var/lib/prefect-server/prefect.db
rm -f /var/lib/prefect-server/prefect.db-wal
rm -f /var/lib/prefect-server/prefect.db-shm
fi
'';
};

The backup captures a consistent snapshot even while Prefect is running. The restore procedure handles WAL files correctly. Details like this matter when you actually need to recover.
Working With Claude Code
The monorepo plus declarative infrastructure creates an unexpected capability: AI assistants can reason about the entire system.
When we work with Claude Code, it can:
- Read the complete infrastructure definition in machines/
- Understand how database schemas in sql/ relate to the Python flows in services/flows/
- See network topology, service dependencies, and firewall rules together
- Suggest changes that account for interactions across the stack
A concrete example: we ask Claude Code to "add a Python function for reading time series data from the database." It can read the database schema in /sql/ to understand the table structure, check existing patterns in /services/flows/ for how we query PostgreSQL, see the connection configuration in .pg_service.conf, and produce code that fits the codebase conventions. The result isn't generic—it's informed by the actual schema and existing patterns.
This works because everything is text in one place. There's no implicit knowledge locked in a web console or scattered across SaaS dashboards. The repository is the system. What's committed is what runs.
We also maintain skill guides in the repo for different domains—networking, secrets management, Grafana dashboards, Nix patterns. Claude Code reads these too, so it gives contextually appropriate advice whether we're debugging WireGuard or adding a new service.
We're not making grand claims about AI. But when your infrastructure is code, tools that understand code become tools that understand your infrastructure. That's a practical observation, not a prediction about the future.
Trade-offs and What's Next
This setup isn't without friction.
Package availability: NixOS doesn't always have the newest packages. Airflow 3 isn't readily available, and we didn't feel like investing heavily in packaging it ourselves. Here's the thing, though: what's well-maintained in Nix turns out to be a pretty good filter for quality tools. We tried Prefect instead and we're super happy with it.
Learning curve: Nix takes time to learn. The first weeks were slow and sometimes frustrating. We're still learning. When something goes wrong in a Nix derivation, error messages can be cryptic—when the formatter is unhappy, it looks like the end of the world! Sometimes figuring out the Nix way takes longer than just SSHing in and fixing it manually. But once you've solved a problem, it's solved forever, and the solution lives in the codebase, not in your head (from which it tends to fade). So the learnings compound.
Not everything is self-hosted yet: We use GitHub, Vercel for the website, and standard SaaS like Slack, Notion, and Google Workspace. It's not the right time to insource these—our focus must be on the platform, not on running email servers. But having the infrastructure foundation in place means we're only one services.mattermost.enable = true; away from changing that.
Future refactoring: Of course, we'll need to revisit decisions we are making now, but that's normal. The difference is that refactoring declarative infrastructure is safe: you can see exactly what changes, test it on a staging machine, and roll back if needed.
Closing Thoughts
Small teams can run serious infrastructure. The tools exist. NixOS is opinionated but delivers genuine reproducibility. The monorepo keeps context together. Clan handles the multi-machine orchestration that used to require dedicated platforms.
This setup creates leverage. One person can deploy a database change, update the workflows that use it, and ship new observability dashboards—all in one commit, one review, one deployment. That's not a workflow we designed—it's the natural result of keeping everything together and declarative.
The three pieces reinforce each other:
- The monorepo ensures context is never lost across concerns
- NixOS/Clan makes that context executable and reproducible
- AI-assisted development becomes practical when everything is readable text
None of these is revolutionary on its own. Together, though, they honestly feel a bit like a superpower: we have a platform on which we can build almost anything, quickly, reproducibly, and in an operationally sustainable way.
clan machines update

One command. Every machine converges. Now, back to the fun part: data and models.