How to Optimize y-cruncher for Record-Breaking Pi Computations

Breaking large Pi records with y-cruncher requires careful tuning across hardware, OS, and y-cruncher settings. This guide gives a step-by-step, prescriptive setup tuned for large-scale runs (multi-terabyte working sets). Assume a Linux server-class machine with many cores, large RAM, and plenty of fast storage.

1) Hardware choices (priorities)

  • CPU: Many physical cores with high single-thread IPC and large caches (e.g., AMD EPYC or modern Intel Xeon). For very large runs, prefer high core count over extreme frequency.
  • RAM: At least 2–4× the peak working set y-cruncher reports, with ECC. Avoid swapping at all costs.
  • Storage: NVMe SSDs in RAID-0, or a single high-performance NVMe, for scratch. For multi-TB runs, consider RAID controllers with high sustained throughput; use many parallel drives to increase I/O concurrency.
  • Network: Irrelevant to the computation itself, but needed for downloads and monitoring.
  • Cooling / PSU: Stable power and thermal headroom to avoid throttling during multi-day runs.
  • Motherboard / NUMA: Prefer single-socket, or a NUMA-aware configuration; on multi-socket systems, plan NUMA allocation (see below).

2) OS and kernel tuning (Linux)

  • Use a recent, stable kernel (avoid enterprise kernels known to throttle background throughput).
  • Set the CPU frequency governor to "performance". Note that echo performance > /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor fails when the glob matches multiple files; loop over each scaling_governor file instead.
  • Disable SMT/Hyper-Threading for consistent throughput when CPU-bound (test both).
  • Turn off power-saving C-states that induce jitter in long runs.
  • Enable hugepages to reduce TLB pressure when mapping large buffers: reserve them via sysctl (vm.nr_hugepages) and persist the setting in /etc/sysctl.conf.
  • Tune VM and I/O write-back settings:
    • vm.swappiness=0
    • vm.dirty_ratio and vm.dirty_background_ratio tuned to allow large write-backs without bursts.
  • Use a filesystem that handles large files well (XFS, ext4 with large inode settings) and mount with appropriate options (noatime, nodiratime).
  • For NUMA systems: enable node interleaving if you cannot manually NUMA-pin; otherwise use numactl to bind processes/memory to nodes to minimize cross-node traffic.
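The kernel settings above can be applied with a short script; a minimal sketch, assuming a modern Linux kernel with cpufreq sysfs support (the dirty-ratio values, hugepage count, and /mnt/scratch mount point are illustrative placeholders to size for your own run):

```shell
#!/bin/bash
# Set every core to the "performance" governor (a glob after '>' fails
# with multiple matches, so loop over each policy file instead).
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done

# Keep the working set in RAM and smooth out write-back bursts.
sysctl -w vm.swappiness=0
sysctl -w vm.dirty_ratio=40
sysctl -w vm.dirty_background_ratio=10

# Reserve 2 MiB hugepages (102400 pages = ~200 GiB; size to your run).
sysctl -w vm.nr_hugepages=102400

# Remount the scratch filesystem without access-time updates.
mount -o remount,noatime,nodiratime /mnt/scratch
```

Run as root, and persist the sysctl values in /etc/sysctl.conf so they survive a reboot before a multi-day attempt.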

3) y-cruncher binary and build choices

  • Use the latest stable y-cruncher matching your platform (download from the author’s site). Use prebuilt binaries only if they match your CPU instruction set (AVX2/AVX512).
  • Prefer an AVX/AVX2/AVX512-enabled build that matches your CPU for faster FFT/multiply kernels.
  • If building from source, compile with the highest optimization flags appropriate for your compiler and CPU microarchitecture.

4) Memory and task decomposition

  • Set y-cruncher’s memory limit slightly below available RAM so OS and filesystem caches have breathing room (e.g., if 256 GB RAM, give 230–240 GB).
  • Set y-cruncher’s Task Decomposition (threads) roughly equal to the number of physical cores. For very large problems, over-decomposing to as much as 2× the logical core count can help; test both.
  • For NUMA: use numactl --cpunodebind and --membind to pin y-cruncher to one node, or interleave memory across nodes, depending on observed behavior. For best results on multi-socket systems, run a single y-cruncher instance per NUMA node and coordinate the problem decomposition if feasible.
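As a sketch of the two NUMA strategies (node numbers are illustrative; check your topology first):

```shell
# Inspect the NUMA topology: node count, CPUs per node, memory per node.
numactl --hardware

# Option A: one instance, memory interleaved across all nodes.
numactl --interleave=all ./y-cruncher

# Option B: one instance per node, each bound to its own CPUs and memory.
# (Only sensible if you can split the problem between the instances.)
numactl --cpunodebind=0 --membind=0 ./y-cruncher &
numactl --cpunodebind=1 --membind=1 ./y-cruncher &
wait
```

Benchmark both options on a medium-scale run before committing; which one wins depends heavily on the interconnect and the FFT sizes chosen.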

5) FFT, I/O and scratch settings

  • Let y-cruncher pick optimal FFT sizes for very large problems, but check logs for chosen algorithms (4-step FFT variants scale better).
  • Use local fast NVMe for scratch space; give y-cruncher exclusive scratch if possible. Configure the scratch directory on the fastest device and ensure sufficient free space (working set + buffer).
  • For extremely large runs, split scratch across multiple drives and use a RAID or LVM volume to present a single high-throughput device. Avoid relying on a single slow disk.
  • Monitor and tune I/O parallelism: ensure underlying storage supports many concurrent queued requests (depth) and tune NVMe queue settings if needed.
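A sketch of assembling a striped scratch volume from multiple NVMe drives (device names are placeholders; RAID-0 gives no redundancy, which is acceptable for scratch that can be regenerated):

```shell
# Stripe four NVMe drives into one high-throughput scratch device.
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

# XFS handles very large files and high concurrency well.
mkfs.xfs /dev/md0
mkdir -p /mnt/scratch
mount -o noatime,nodiratime /dev/md0 /mnt/scratch

# Check the request queue depth the assembled device advertises.
cat /sys/block/md0/queue/nr_requests
```

Verify sustained sequential throughput on the assembled volume (e.g., with fio) before the run; a single underperforming member drive caps the whole stripe.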

6) Parallel framework and thread affinity

  • Test the different parallel frameworks y-cruncher offers (the options vary by version and platform; examples include Cilk Plus, Push Pool, and simple thread spawning). Performance varies by OS and CPU: on Linux, Push Pool often performs well; on Windows, results can differ.
  • Pin threads to cores to reduce scheduler jitter. Use taskset or y-cruncher’s affinity options. Avoid letting the OS freely migrate threads on large FFTs.
  • If using hyperthreading, benchmark with HT on vs. off — some workloads benefit, others do not.
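Pinning can be done externally with taskset; a sketch assuming a hypothetical 64-core machine where cores 0–63 are physical and 64–127 are their SMT siblings:

```shell
# Launch pinned to the physical cores only (SMT siblings excluded).
taskset -c 0-63 ./y-cruncher

# Or re-pin an already-running process by name.
taskset -cp 0-63 "$(pgrep -x y-cruncher)"
```

Check your actual core-to-sibling mapping in /proc/cpuinfo or with lscpu before choosing a mask; sibling numbering differs between CPU vendors.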

7) Verification, checkpointing, and reproducibility

  • Enable verification passes (y-cruncher supports internal verification of results) — always do this for record attempts. It adds time but prevents silent errors.
  • Use y-cruncher’s checkpointing options so an interrupted run can be resumed. Place checkpoints on the same high-performance scratch.
  • Keep detailed logs, system telemetry (temperatures, frequencies), and checksums of the output to support verification.
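A sketch of capturing the console output and checksumming the result (the log path and digit-file glob are placeholders; y-cruncher's output filenames vary by version):

```shell
# Append console output to a log on the scratch volume while displaying it.
./y-cruncher 2>&1 | tee -a /mnt/scratch/run.log

# After the run, checksum the digit files so the result can be re-verified
# later and transferred without risk of silent corruption.
sha256sum *.txt > pi-digits.sha256
```

Store the checksums and logs on a separate device from the scratch volume, since the scratch RAID-0 has no redundancy.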

8) Thermal and reliability considerations

  • Long runs are sensitive to thermal throttling and memory errors. Monitor CPU temperatures and DIMM ECC events.
  • Use ECC RAM and monitor SMART on drives. Replace any component showing errors before a record attempt.
  • Consider redundant power and UPS to handle short outages.

9) Benchmarking and iterative tuning (recommended workflow)

  1. Small-scale validation run on the same hardware to confirm configuration and correctness.
  2. Medium-scale run (10–20% of target) to measure scaling, I/O, and memory behavior. Collect logs.
  3. Tune: change task decomposition, affinity, SMT, and scratch layout based on the medium run.
  4. Full run with verification and checkpointing enabled.

10) Concrete example flags and command (example)

  • Memory: set via y-cruncher prompt (Memory = X GB).
  • Threads: set to physical cores (e.g., Threads = 64).
  • Scratch dir: specify high-speed mount when prompted or via config.
  • Use numactl if needed: numactl --cpunodebind=0 --membind=0 ./y-cruncher

(Exact flags vary by y-cruncher version — consult the included guide and logs for syntax.)

11) Monitoring and logging

  • Continuously log y-cruncher output to file.
  • Monitor CPU, memory, disk I/O, and temperatures (sar, iostat, nvme-cli, ipmitool).
  • Check periodic verification outputs to detect drift or errors early.
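A minimal telemetry loop covering the points above; assumes sysstat, lm-sensors, and smartctl are installed, and the log path and drive name are placeholders:

```shell
#!/bin/bash
# Append one snapshot of CPU, memory, I/O, thermal, and drive-health
# telemetry per minute for post-run analysis.
LOG=/var/log/ycruncher-telemetry.log
while sleep 60; do
    {
        date '+%F %T'
        iostat -x 1 1             # per-device utilization and queue depth
        free -g                   # memory and swap usage in GiB
        sensors 2>/dev/null       # CPU/board temperatures (lm-sensors)
        smartctl -H /dev/nvme0n1  # drive health summary
    } >> "$LOG"
done
```

Run it in the background (or as a systemd service) for the duration of the attempt; a sudden drop in iostat throughput or a rising temperature trend is often the first warning of trouble.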

12) Final checklist before attempting a record run

  • ECC memory and tested drives (SMART OK).
  • Power and cooling verified for continuous operation.
  • Kernel/OS tuned (performance governor, hugepages, swappiness).
  • Scratch on NVMe/RAID with enough free space.
  • y-cruncher binary optimized for CPU ISA.
  • Thread/task decomposition chosen and affinity set.
  • Verification and checkpointing enabled.
  • Logs, telemetry, and recovery plan in place.

Appendix — quick tuning recipe (summary)

  • Set performance CPU governor and disable C-states.
  • Reserve ~5–10% RAM for OS; give remainder to y-cruncher.
  • Use fast NVMe scratch; present it as a single high-throughput device.
  • Set Threads ≈ physical cores; test 1.5–2× for over-decomposition on very large runs.
  • Pin threads and use numactl on NUMA systems.
  • Enable verification and checkpoints.
  • Monitor temperatures and ECC/S.M.A.R.T. during the run.

Follow these steps, iterate based on empirical telemetry, and you’ll maximize the chance of achieving a stable, high-throughput y-cruncher Pi computation suitable for record attempts.
