How to Optimize y-cruncher for Record-Breaking Pi Computations
Breaking large Pi records with y-cruncher requires careful tuning across hardware, OS, and y-cruncher settings. This guide gives a step-by-step, prescriptive setup tuned for large-scale runs (multi-terabyte working sets). Assume a Linux server-class machine with many cores, large RAM, and plenty of fast storage.
1) Hardware choices (priorities)
| Component | Recommendation |
|---|---|
| CPU | Many physical cores with high single-thread IPC and large caches (e.g., AMD EPYC or modern Intel Xeon). Prefer high core count over extreme frequency for very large runs. |
| RAM | At least 2–4× the peak working set y-cruncher reports. ECC memory. Aim to avoid swapping at all costs. |
| Storage | NVMe SSDs in RAID-0 or high-performance single NVMe for scratch. For multi-TB runs consider RAID with controllers supporting high sustained throughput; use many parallel drives to increase I/O concurrency. |
| Network | Irrelevant for compute, but needed for downloads/monitoring. |
| Cooling / PSU | Stable power and thermal headroom to avoid throttling during multi-day runs. |
| Motherboard / NUMA | Prefer single-socket or NUMA-aware configuration; on multi-socket, plan NUMA allocation (see below). |
2) OS and kernel tuning (Linux)
- Use a recent, stable kernel; newer kernels generally handle sustained NVMe throughput better.
- Set the CPU frequency governor to “performance” on every core. A plain `echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` does not work, because a shell redirect cannot target a glob of files; loop over them instead (see the sketch after this list).
- Disable SMT/Hyper-Threading for consistent throughput when CPU-bound (test both).
- Turn off power-saving C-states that induce jitter in long runs.
- Enable hugepages to reduce TLB pressure when mapping large buffers: reserve them via vm.nr_hugepages (sysctl or /etc/sysctl.conf).
- Tune VM writeback settings so large write-backs drain steadily rather than in bursts:
  - vm.swappiness=0
  - vm.dirty_ratio and vm.dirty_background_ratio sized so background writeback keeps pace with scratch I/O.
- Use a filesystem that handles large files well (XFS, ext4 with large inode settings) and mount with appropriate options (noatime, nodiratime).
- For NUMA systems: enable node interleaving if you cannot manually NUMA-pin; otherwise use numactl to bind processes/memory to nodes to minimize cross-node traffic.
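A minimal sketch of the tuning above, run as root; the sysctl values are starting points to benchmark on your own hardware, not universal settings:

```bash
#!/bin/bash
# Sketch of the OS tuning above; run as root. Values are assumptions to
# validate on your own hardware, not universal settings.

# Set every core's frequency governor to "performance" (a loop, because a
# shell redirect cannot write to a glob of sysfs files).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Discourage swapping and smooth out writeback bursts.
sysctl -w vm.swappiness=0
sysctl -w vm.dirty_ratio=40              # assumed starting point
sysctl -w vm.dirty_background_ratio=10   # assumed starting point

# Reserve 2 MiB hugepages; the count (8 GiB here) is an assumption --
# size it to the buffers you intend to back with hugepages.
sysctl -w vm.nr_hugepages=4096
```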
3) y-cruncher binary and build choices
- Use the latest stable y-cruncher for your platform (download from the author’s site). y-cruncher is closed-source and ships as a bundle of prebuilt binaries, each tuned for a different instruction set (SSE, AVX2, AVX-512, ...).
- Let the launcher pick the binary matching your CPU, or select it manually; running a binary tuned below your CPU’s ISA leaves FFT/multiply throughput on the table.
- Check the startup banner to confirm which tuned binary was actually selected.
4) Memory and task decomposition
- Set y-cruncher’s memory limit slightly below available RAM so OS and filesystem caches have breathing room (e.g., if 256 GB RAM, give 230–240 GB).
- Set y-cruncher’s task decomposition (threads) roughly equal to the physical core count. For very large problems, over-decomposing up to 2× the logical core count can help; test both.
- For NUMA: use numactl --cpunodebind and --membind to pin y-cruncher to one node, or interleave memory across nodes, depending on observed behavior (see the example after this list). For best results on multi-socket machines, run one y-cruncher instance per NUMA node and coordinate the problem decomposition if feasible.
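A short example of both pinning styles; the node numbers and binary path are assumptions, so inspect your topology with numactl --hardware first:

```bash
# Show nodes, their CPUs, and per-node memory.
numactl --hardware

# Pin both CPUs and memory to NUMA node 0 (node number is an assumption).
numactl --cpunodebind=0 --membind=0 ./y-cruncher

# Alternative: interleave memory across all nodes when one node's RAM
# cannot hold the working set.
numactl --interleave=all ./y-cruncher
```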
5) FFT, I/O and scratch settings
- Let y-cruncher pick optimal FFT sizes for very large problems, but check logs for chosen algorithms (4-step FFT variants scale better).
- Use local fast NVMe for scratch space; give y-cruncher exclusive scratch if possible. Configure the scratch directory on the fastest device and ensure sufficient free space (working set + buffer).
- For extremely large runs, split scratch across multiple drives and use RAID or LVM to present a single high-throughput device (see the sketch after this list). Avoid relying on a single slow disk.
- Monitor and tune I/O parallelism: ensure underlying storage supports many concurrent queued requests (depth) and tune NVMe queue settings if needed.
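A sketch of a striped scratch volume built with mdadm; the device names, drive count, and mount point are assumptions, and RAID-0 offers zero redundancy, so treat the scratch as disposable:

```bash
# Stripe four NVMe drives into one scratch volume (device names are examples).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mkdir -p /mnt/scratch
mount -o noatime,nodiratime /dev/md0 /mnt/scratch

# Inspect queue depth and scheduler on a member drive; "none" is the
# usual scheduler for NVMe.
cat /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/scheduler
```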
6) Parallel framework and thread affinity
- Test the different parallel frameworks y-cruncher provides (e.g., Cilk, Push Pool, TBB, native) — performance varies by OS and CPU. On Linux, Push Pool or TBB variants often perform well; on Windows results can differ.
- Pin threads to cores to reduce scheduler jitter, using taskset or y-cruncher’s affinity options (example after this list). Avoid letting the OS freely migrate threads during large FFTs.
- If using hyperthreading, benchmark with HT on vs. off — some workloads benefit, others do not.
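A minimal pinning example; the CPU range assumes the first 64 logical CPUs map to distinct physical cores, which you should confirm with lscpu first:

```bash
lscpu -e                       # map logical CPUs to cores/sockets
taskset -c 0-63 ./y-cruncher   # restrict the process to CPUs 0-63 (assumed range)
```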
7) Verification, checkpointing, and reproducibility
- Enable verification passes (y-cruncher supports internal verification of results) — always do this for record attempts. It adds time but prevents silent errors.
- Use y-cruncher’s checkpointing options so an interrupted run can be resumed. Place checkpoints on the same high-performance scratch.
- Keep detailed logs, system telemetry (temperatures, frequencies), and checksums of the output to support verification (see the sketch below).
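For the checksums, something as simple as sha256sum works; the filename here is an assumption, since y-cruncher names output files by constant and format:

```bash
# Checksum the digit output for later verification (filename is an example).
sha256sum "Pi - Dec - Chudnovsky.txt" | tee pi-output.sha256
```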
8) Thermal and reliability considerations
- Long runs are sensitive to thermal throttling and memory errors. Monitor CPU temperatures and DIMM ECC events.
- Use ECC RAM and monitor SMART on drives (see the checks after this list). Replace any component showing errors before a record attempt.
- Consider redundant power and UPS to handle short outages.
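A few pre-run health checks, assuming the usual Linux tooling (edac-utils, smartmontools, nvme-cli, ipmitool) and example device paths:

```bash
edac-util --report=full     # per-DIMM ECC error counts (edac-utils)
smartctl -a /dev/nvme0      # SMART health for a drive (smartmontools)
nvme smart-log /dev/nvme0   # NVMe-native health log (nvme-cli)
ipmitool sensor             # board sensors: temperatures, fans, voltages
```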
9) Benchmarking and iterative tuning (recommended workflow)
- Small-scale validation run on the same hardware to confirm configuration and correctness.
- Medium-scale run (10–20% of target) to measure scaling, I/O, and memory behavior. Collect logs.
- Tune: change task decomposition, affinity, SMT, and scratch layout based on the medium run.
- Full run with verification and checkpointing enabled.
10) Concrete example flags and command (example)
- Memory: set via y-cruncher prompt (Memory = X GB).
- Threads: set to physical cores (e.g., Threads = 64).
- Scratch dir: specify high-speed mount when prompted or via config.
- Use numactl if needed: numactl --cpunodebind=0 --membind=0 ./y-cruncher
(Exact flags vary by y-cruncher version — consult the included guide and logs for syntax.)
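Putting the pieces together, a launch line like the following pins the process and captures its output; the node numbers and log path are assumptions, and y-cruncher itself remains interactive:

```bash
numactl --cpunodebind=0 --membind=0 ./y-cruncher 2>&1 | tee -a ~/ycruncher-run.log
```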
11) Monitoring and logging
- Continuously log y-cruncher output to file.
- Monitor CPU, memory, disk I/O, and temperatures (sar, iostat, nvme-cli, ipmitool); a simple telemetry loop is sketched below.
- Check periodic verification outputs to detect drift or errors early.
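A minimal telemetry loop, assuming sysstat and lm-sensors are installed; the 60-second interval and log path are arbitrary choices:

```bash
# Append a telemetry snapshot every minute while the run is active.
while true; do
    date >> telemetry.log
    sar -u 1 1 >> telemetry.log            # CPU utilization (sysstat)
    iostat -x 1 1 >> telemetry.log         # per-device I/O stats (sysstat)
    sensors >> telemetry.log 2>/dev/null   # temperatures (lm-sensors)
    sleep 60
done
```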
12) Final checklist before attempting a record run
- ECC memory and tested drives (SMART OK).
- Power and cooling verified for continuous operation.
- Kernel/OS tuned (performance governor, hugepages, swappiness).
- Scratch on NVMe/RAID with enough free space.
- y-cruncher binary optimized for CPU ISA.
- Thread/task decomposition chosen and affinity set.
- Verification and checkpointing enabled.
- Logs, telemetry, and recovery plan in place.
Appendix — quick tuning recipe (summary)
- Set performance CPU governor and disable C-states.
- Reserve ~5–10% RAM for OS; give remainder to y-cruncher.
- Use fast NVMe scratch; present as single high-throughput device.
- Set Threads ≈ physical cores; test 1.5–2× for over-decomposition on very large runs.
- Pin threads and use numactl on NUMA systems.
- Enable verification and checkpoints.
- Monitor temperatures and ECC/S.M.A.R.T. during the run.
Follow these steps, iterate based on empirical telemetry, and you’ll maximize the chance of achieving a stable, high-throughput y-cruncher Pi computation suitable for record attempts.