Insertion Sort Performance Analyzer

A systems-level performance analysis project examining how algorithm design, compiler optimizations, and memory behavior affect real-world execution time.

Overview

Insertion Sort Performance Analyzer is a CMPT 295 mini-project focused on understanding performance beyond Big-O notation. The project benchmarks multiple insertion sort implementations and analyzes how data movement strategies, compiler optimizations, branching behavior, and cache locality impact runtime on modern CPUs.

Rather than treating insertion sort as a purely theoretical algorithm, this project evaluates how low-level implementation details translate into measurable performance differences.

What Was Compared

  • Standard insertion sort using C-style arrays
  • Insertion sort using std::vector to evaluate abstraction overhead
  • Optimized insertion sort using binary search and memmove
  • std::sort as a baseline reference

Each implementation was tested on sorted, reverse-sorted, random, and duplicate-heavy data across a range of input sizes.
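The optimized variant listed above can be sketched as follows. This is a minimal illustration rather than the project's exact code (the function name `insertion_sort_opt` is an assumption): the insertion point is located with binary search via `std::upper_bound`, and the tail is shifted with a single `memmove` instead of an element-by-element loop.

```cpp
#include <algorithm>  // std::upper_bound
#include <cstddef>
#include <cstring>    // std::memmove

// Optimized insertion sort: binary search finds where a[i] belongs,
// then one bulk memmove shifts the tail right by a single slot.
void insertion_sort_opt(int* a, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        int key = a[i];
        // First element greater than key in the sorted prefix a[0..i)
        int* pos = std::upper_bound(a, a + i, key);
        std::size_t count = static_cast<std::size_t>((a + i) - pos);
        if (count > 0) {
            // One linear copy replaces 'count' compare-and-shift iterations
            std::memmove(pos + 1, pos, count * sizeof(int));
            *pos = key;
        }
    }
}
```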

Key Takeaways

  • Compiler optimizations (-O2, -O3) dramatically reduce instruction count and execution time.
  • Replacing branch-heavy loops with linear memory operations (memmove) significantly improves performance on modern CPUs.
  • Memory access patterns and branch predictability have a major impact on runtime, often more than algorithmic complexity alone.

Tech Stack

Backend

C++, Assembly

DevOps

Makefile, Valgrind

Challenges & Solutions

Challenge:

Understanding why algorithmically similar insertion sort implementations showed large performance differences in practice.

Solution:

Benchmarked each variant under identical conditions and analyzed compiler output, performance counters, and memory behavior to identify the impact of branching, data movement, and abstraction overhead.
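The "identical conditions" part of that methodology can be sketched with a small timing harness (an illustrative reconstruction; the helper name `bench_ms` is an assumption, not the project's actual code). Taking the input by value gives every variant its own fresh, byte-identical copy of the same data:

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Time one sorting routine on its own copy of the input, so every
// variant starts from identical data under identical conditions.
template <typename Sort>
double bench_ms(Sort sort, std::vector<int> data) {  // by value: fresh copy per run
    auto start = std::chrono::steady_clock::now();
    sort(data);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Each variant is then passed in as a callable, e.g. `bench_ms([](std::vector<int>& v){ std::sort(v.begin(), v.end()); }, input);`, and the whole harness is rebuilt once per optimization level (-O0, -O2, -O3) so timings stay comparable.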

Challenge:

Evaluating the real effect of compiler optimization flags beyond raw timing results.

Solution:

Compiled the same code with -O0, -O2, and -O3, then compared execution time, generated assembly, and profiling data to connect optimizations to measurable performance changes.

Challenge:

Identifying performance bottlenecks caused by memory access and branch prediction.

Solution:

Used profiling tools and cache analysis to study memory behavior and branch misses, which showed why replacing the branch-heavy shifting loop with linear memory operations (memmove) measurably improved performance.
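The inner-loop difference behind that finding can be illustrated with two equivalent shift routines (hypothetical helper names, for illustration only). Both insert `key` into the sorted prefix `a[0..i)`, but the first takes a data-dependent branch on every moved element, while the second, given an insertion point `pos` (e.g. from binary search), moves the whole tail in one linear operation:

```cpp
#include <cstddef>
#include <cstring>  // std::memmove

// Branch-heavy shift: one comparison-driven iteration per moved element.
void shift_branchy(int* a, std::size_t i, int key) {
    std::size_t j = i;
    while (j > 0 && a[j - 1] > key) {  // data-dependent branch each step
        a[j] = a[j - 1];
        --j;
    }
    a[j] = key;
}

// memmove-based shift: the insertion point 'pos' is already known,
// so the tail a[pos..i) moves right by one slot in a single bulk copy.
void shift_memmove(int* a, std::size_t pos, std::size_t i, int key) {
    std::memmove(a + pos + 1, a + pos, (i - pos) * sizeof(int));
    a[pos] = key;
}
```

On random data the branchy loop's comparison is hard to predict, so the linear copy tends to behave much better on modern out-of-order CPUs.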

Learning Outcomes

  • Gained practical experience benchmarking and analyzing performance at the systems level.
  • Learned how compiler optimizations, branching behavior, and cache locality influence runtime.
  • Developed confidence reading assembly output and profiling data to explain performance differences.