How to Optimize y-cruncher for Record-Breaking Pi Computations
Breaking large Pi records with y-cruncher requires careful tuning across hardware, OS, and y-cruncher settings. This guide gives a step-by-step, prescriptive setup tuned for large-scale runs (multi-terabyte working sets). Assume a Linux server-class machine with many cores, large RAM, and plenty of fast storage.
1) Hardware choices (priorities)
| Component | Recommendation |
|---|---|
| CPU | Many physical cores with high single-thread IPC and large caches (e.g., AMD EPYC or modern Intel Xeon). Prefer high core count over extreme frequency for very large runs. |
| RAM | At least 2–4× the peak working set y-cruncher reports. ECC memory. Aim to avoid swapping at all costs. |
| Storage | NVMe SSDs in RAID-0 or high-performance single NVMe for scratch. For multi-TB runs consider RAID with controllers supporting high sustained throughput; use many parallel drives to increase I/O concurrency. |
| Network | Irrelevant for compute, but needed for downloads/monitoring. |
| Cooling / PSU | Stable power and thermal headroom to avoid throttling during multi-day runs. |
| Motherboard / NUMA | Prefer single-socket or NUMA-aware configuration; on multi-socket, plan NUMA allocation (see below). |
2) OS and kernel tuning (Linux)
- Use a recent, stable kernel; newer kernels generally handle sustained NVMe throughput better.
- Set the CPU frequency governor to “performance” on every core. A plain `echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor` does not work, because a shell redirect cannot target a glob of files; loop over them instead (see the sketch after this list).
- Disable SMT/Hyper-Threading for consistent throughput when CPU-bound (test both).
- Turn off power-saving C-states that induce jitter in long runs.
- Enable hugepages to reduce TLB pressure when mapping large buffers: reserve them via vm.nr_hugepages (sysctl or /etc/sysctl.conf).
- Tune VM writeback settings so large write-backs drain steadily rather than in bursts:
  - vm.swappiness=0
  - vm.dirty_ratio and vm.dirty_background_ratio sized so background writeback keeps pace with scratch I/O.
- Use a filesystem that handles large files well (XFS, ext4 with large inode settings) and mount with appropriate options (noatime, nodiratime).
- For NUMA systems: enable node interleaving if you cannot manually NUMA-pin; otherwise use numactl to bind processes/memory to nodes to minimize cross-node traffic.
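A minimal sketch of the tuning above, run as root; the sysctl values are starting points to benchmark on your own hardware, not universal settings:

```bash
#!/bin/bash
# Sketch of the OS tuning above; run as root. Values are assumptions to
# validate on your own hardware, not universal settings.

# Set every core's frequency governor to "performance" (a loop, because a
# shell redirect cannot write to a glob of sysfs files).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Discourage swapping and smooth out writeback bursts.
sysctl -w vm.swappiness=0
sysctl -w vm.dirty_ratio=40              # assumed starting point
sysctl -w vm.dirty_background_ratio=10   # assumed starting point

# Reserve 2 MiB hugepages; the count (8 GiB here) is an assumption --
# size it to the buffers you intend to back with hugepages.
sysctl -w vm.nr_hugepages=4096
```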
3) y-cruncher binary and build choices
- Use the latest stable y-cruncher for your platform (download from the author’s site). y-cruncher is closed-source and ships as a bundle of prebuilt binaries, each tuned for a different instruction set (SSE, AVX2, AVX-512, ...).
- Let the launcher pick the binary matching your CPU, or select it manually; running a binary tuned below your CPU’s ISA leaves FFT/multiply throughput on the table.
- Check the startup banner to confirm which tuned binary was actually selected.
4) Memory and task decomposition
- Set y-cruncher’s memory limit slightly below available RAM so OS and filesystem caches have breathing room (e.g., if 256 GB RAM, give 230–240 GB).
- Set y-cruncher’s task decomposition (threads) roughly equal to the physical core count. For very large problems, over-decomposing up to 2× the logical core count can help; test both.
- For NUMA: use numactl --cpunodebind and --membind to pin y-cruncher to one node, or interleave memory across nodes, depending on observed behavior (see the example after this list). For best results on multi-socket machines, run one y-cruncher instance per NUMA node and coordinate the problem decomposition if feasible.
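A short example of both pinning styles; the node numbers and binary path are assumptions, so inspect your topology with numactl --hardware first:

```bash
# Show nodes, their CPUs, and per-node memory.
numactl --hardware

# Pin both CPUs and memory to NUMA node 0 (node number is an assumption).
numactl --cpunodebind=0 --membind=0 ./y-cruncher

# Alternative: interleave memory across all nodes when one node's RAM
# cannot hold the working set.
numactl --interleave=all ./y-cruncher
```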
5) FFT, I/O and scratch settings
- Let y-cruncher pick optimal FFT sizes for very large problems, but check logs for chosen algorithms (4-step FFT variants scale better).
- Use local fast NVMe for scratch space; give y-cruncher exclusive scratch if possible. Configure the scratch directory on the fastest device and ensure sufficient free space (working set + buffer).
- For extremely large runs, split scratch across multiple drives and use RAID or LVM to present a single high-throughput device (see the sketch after this list). Avoid relying on a single slow disk.
- Monitor and tune I/O parallelism: ensure underlying storage supports many concurrent queued requests (depth) and tune NVMe queue settings if needed.
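A sketch of a striped scratch volume built with mdadm; the device names, drive count, and mount point are assumptions, and RAID-0 offers zero redundancy, so treat the scratch as disposable:

```bash
# Stripe four NVMe drives into one scratch volume (device names are examples).
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
mkfs.xfs /dev/md0
mkdir -p /mnt/scratch
mount -o noatime,nodiratime /dev/md0 /mnt/scratch

# Inspect queue depth and scheduler on a member drive; "none" is the
# usual scheduler for NVMe.
cat /sys/block/nvme0n1/queue/nr_requests
cat /sys/block/nvme0n1/queue/scheduler
```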
6) Parallel framework and thread affinity
- Test the different parallel frameworks y-cruncher provides (e.g., Cilk, Push Pool, TBB, native) — performance varies by OS and CPU. On Linux, Push Pool or TBB variants often perform well; on Windows results can differ.
- Pin threads to cores to reduce scheduler jitter, using taskset or y-cruncher’s affinity options (example after this list). Avoid letting the OS freely migrate threads during large FFTs.
- If using hyperthreading, benchmark with HT on vs. off — some workloads benefit, others do not.
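A minimal pinning example; the CPU range assumes the first 64 logical CPUs map to distinct physical cores, which you should confirm with lscpu first:

```bash
lscpu -e                       # map logical CPUs to cores/sockets
taskset -c 0-63 ./y-cruncher   # restrict the process to CPUs 0-63 (assumed range)
```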
7) Verification, checkpointing, and reproducibility
- Enable verification passes (y-cruncher supports internal verification of results) — always do this for record attempts. It adds time but prevents silent errors.
- Use y-cruncher’s checkpointing options so an interrupted run can be resumed. Place checkpoints on the same high-performance scratch.
- Keep detailed logs, system telemetry (temperatures, frequencies), and checksums of the output to support verification (see the sketch below).
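For the checksums, something as simple as sha256sum works; the filename here is an assumption, since y-cruncher names output files by constant and format:

```bash
# Checksum the digit output for later verification (filename is an example).
sha256sum "Pi - Dec - Chudnovsky.txt" | tee pi-output.sha256
```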
8) Thermal and reliability considerations
- Long runs are sensitive to thermal throttling and memory errors. Monitor CPU temperatures and DIMM ECC events.
- Use ECC RAM and monitor SMART on drives (see the checks after this list). Replace any component showing errors before a record attempt.
- Consider redundant power and UPS to handle short outages.
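A few pre-run health checks, assuming the usual Linux tooling (edac-utils, smartmontools, nvme-cli, ipmitool) and example device paths:

```bash
edac-util --report=full     # per-DIMM ECC error counts (edac-utils)
smartctl -a /dev/nvme0      # SMART health for a drive (smartmontools)
nvme smart-log /dev/nvme0   # NVMe-native health log (nvme-cli)
ipmitool sensor             # board sensors: temperatures, fans, voltages
```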
9) Benchmarking and iterative tuning (recommended workflow)
- Small-scale validation run on the same hardware to confirm configuration and correctness.
- Medium-scale run (10–20% of target) to measure scaling, I/O, and memory behavior. Collect logs.
- Tune: change task decomposition, affinity, SMT, and scratch layout based on the medium run.
- Full run with verification and checkpointing enabled.
10) Concrete example flags and command (example)
- Memory: set via y-cruncher prompt (Memory = X GB).
- Threads: set to physical cores (e.g., Threads = 64).
- Scratch dir: specify high-speed mount when prompted or via config.
- Use numactl if needed: numactl --cpunodebind=0 --membind=0 ./y-cruncher
(Exact flags vary by y-cruncher version — consult the included guide and logs for syntax.)
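Putting the pieces together, a launch line like the following pins the process and captures its output; the node numbers and log path are assumptions, and y-cruncher itself remains interactive:

```bash
numactl --cpunodebind=0 --membind=0 ./y-cruncher 2>&1 | tee -a ~/ycruncher-run.log
```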
11) Monitoring and logging
- Continuously log y-cruncher output to file.
- Monitor CPU, memory, disk I/O, and temperatures (sar, iostat, nvme-cli, ipmitool); a simple telemetry loop is sketched below.
- Check periodic verification outputs to detect drift or errors early.
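A minimal telemetry loop, assuming sysstat and lm-sensors are installed; the 60-second interval and log path are arbitrary choices:

```bash
# Append a telemetry snapshot every minute while the run is active.
while true; do
    date >> telemetry.log
    sar -u 1 1 >> telemetry.log            # CPU utilization (sysstat)
    iostat -x 1 1 >> telemetry.log         # per-device I/O stats (sysstat)
    sensors >> telemetry.log 2>/dev/null   # temperatures (lm-sensors)
    sleep 60
done
```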
12) Final checklist before attempting a record run
- ECC memory and tested drives (SMART OK).
- Power and cooling verified for continuous operation.
- Kernel/OS tuned (performance governor, hugepages, swappiness).
- Scratch on NVMe/RAID with enough free space.
- y-cruncher binary optimized for CPU ISA.
- Thread/task decomposition chosen and affinity set.
- Verification and checkpointing enabled.
- Logs, telemetry, and recovery plan in place.
Appendix — quick tuning recipe (summary)
- Set performance CPU governor and disable C-states.
- Reserve ~5–10% RAM for OS; give remainder to y-cruncher.
- Use fast NVMe scratch; present as single high-throughput device.
- Set Threads ≈ physical cores; test 1.5–2× for over-decomposition on very large runs.
- Pin threads and use numactl on NUMA systems.
- Enable verification and checkpoints.
- Monitor temperatures and ECC/S.M.A.R.T. during the run.
Follow these steps, iterate based on empirical telemetry, and you’ll maximize the chance of achieving a stable, high-throughput y-cruncher Pi computation suitable for record attempts.