More fair comparison between algorithms #36

Open
facelessuser wants to merge 1 commit into color-js:main from facelessuser:consistent-returns

@facelessuser

@facelessuser facelessuser commented Jun 19, 2026

Contributor

Ensure that every gamut mapping method returns the color in the origin space consistently for comparison.

Also, ensure that oklch-cubic checks lightness when gamut mapping.

@facelessuser facelessuser requested a review from LeaVerou June 19, 2026 21:50
@netlify

netlify Bot commented Jun 19, 2026


Deploy Preview for color-apps ready!

Name Link
🔨 Latest commit 9b1550e
🔍 Latest deploy log https://app.netlify.com/projects/color-apps/deploys/6a36a1b81eadf50008d1c5da
😎 Deploy Preview https://deploy-preview-36--color-apps.netlify.app

@LeaVerou LeaVerou left a comment

Member

I think this should be part of the harness running these algorithms (see map.js), just like clipping the chroma to 0.4 which is now applied universally to all of them beforehand (I thought it was wasteful to start from whatever chroma since we know it won't be higher than that and some algos performed more poorly when significantly farther) and the clipping to P3 that happens to all of them after. Each algorithm should only include the bits that are different for that algorithm. We can (and do) measure these operations in the time taken, so where that code lives shouldn't affect the results.
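As a rough sketch of the layout described here, assuming coordinates are plain `[L, C, H]` arrays and each method exposes a `compute()` function: the names `runMethod`, `MAX_CHROMA`, and `clip01` are hypothetical, and the P3 conversion is passed in as `toP3` instead of being taken from any real library. It is not the code in map.js.

```javascript
// Hypothetical harness: shared steps live here, not in each method.
const MAX_CHROMA = 0.4; // pre-clamp applied to every method, per the comment above

// Clamp each RGB channel into [0, 1].
function clip01(rgb) {
  return rgb.map((c) => Math.min(1, Math.max(0, c)));
}

// method.compute: algorithm-specific part only, [L, C, H] -> [L, C, H]
// toP3: assumed converter from OkLCh coords to Display-P3 coords
function runMethod(method, [l, c, h], toP3) {
  const start = [l, Math.min(c, MAX_CHROMA), h]; // shared pre-step
  const mapped = method.compute(start);          // per-algorithm step
  return clip01(toP3(mapped));                   // shared post-step
}
```

With this shape, a new method only supplies `compute()` and cannot forget the shared steps.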

Though no handling is going to be perfectly fair, if a GMA operates on oklch colors it will be faster if fed oklch colors (like the benchmark does) and if another operates in a different space, it will be faster when starting from that space...

@facelessuser

Contributor Author

There is a lot of overhead that some algorithms are subject to that others are not. All of the algorithms currently return whatever is convenient for them (and some return what is not convenient for them).

The only thing I've done here is ensure the input origin color space is also the output. This ensures they are all operating under the same requirements: take the given input space and gamut map the color such that the output is in the same space, but within the specified gamut. Now they are all subject to the same overhead and are operating on the same rules. This makes for a fairer, apples-to-apples comparison.

I personally feel that having these methods operate under the same requirements is reasonable, but if this is not desired, a correction for lightness would still be needed.

@facelessuser

Contributor Author

I implemented the algorithm in ColorAide to see how it performs with overhead similar to the other approaches. The Cubic approach preserves chroma and hue better than RayTrace. The difference isn't visually noticeable, but in raw numbers it performs better.

Running this on an image with over a million pixels with constantly changing, super saturated Rec. 2020 colors, and gamut mapping them to sRGB, the Cubic approach had a ~62.8% speedup over raytrace.

Clip was still much faster, at least in our implementation. There are a variety of reasons why this could be. I assume this is likely because they are all sharing similar overhead.

The hue cache seems fine when you are repeating the same color, but at least with a highly saturated rainbow in an image, the hue cache didn't provide a huge difference. Since I'd need to hold a hue cache for every linear RGB gamut, I ended up removing it, but I do cache the creation of the LMS to RGB matrix as I found that to be noticeable; this way, I always have that for a given linear RGB.
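A minimal sketch of the per-gamut matrix caching mentioned here, written in JavaScript for consistency with the benchmark app, not in ColorAide's Python. `getLmsToRgb`, `matrixCache`, and the `build` callback are hypothetical names. How the matrix is actually derived is not shown and is assumed to be expensive enough to be worth memoizing.

```javascript
// Hypothetical per-gamut memoization of an expensive matrix.
const matrixCache = new Map();

// gamutId: e.g. "srgb-linear"; build: assumed function that derives the
// LMS-to-linear-RGB matrix for that gamut (not shown here).
function getLmsToRgb(gamutId, build) {
  let m = matrixCache.get(gamutId);
  if (m === undefined) {
    m = build(gamutId); // runs once per gamut
    matrixCache.set(gamutId, m);
  }
  return m;
}
```

Unlike a hue cache, this one has at most one entry per gamut, so it stays small and is hit on every color.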

If you need to reduce chroma in a linear RGB gamut using Oklab, it'd be hard to find a better approach it seems.

coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'raytrace'
Pixels: 1048576
> 100%
Completed in: 7.003060333 sec

coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'oklch-cubic'
Pixels: 1048576
> 100%
Completed in: 3.657383834 sec

coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'clip'
Pixels: 1048576
> 100%
Completed in: 1.127206125 sec

@LeaVerou

LeaVerou commented Jun 20, 2026

Member

> There is a lot of overhead that some algorithms are subject to that others are not. All of the algorithms currently return whatever is convenient for them (and some return what is not convenient for them).

> The only thing I've done here is ensure the input origin color space is also the output. This ensures they are all operating under the same requirements: take the given input space and gamut map the color such that the output is in the same space, but within the specified gamut. Now they are all subject to the same overhead and are operating on the same rules. This makes for a fairer, apples-to-apples comparison.

> I personally feel that having these methods operate under the same requirements is reasonable, but if this is not desired, a correction for lightness would still be needed.

To clarify, I'm fine having the conversion! I was just saying the code to do that should be centralized, not something each algorithm author needs to remember to do.

Also, fix OkLCh-Cubic's lightness issue
@facelessuser

Contributor Author

Updated this stuff in a centralized place.

Lightness should be handled in the Cubic approach now, and output normalization is done in one place. With or without this normalization, the Cubic approach kills it.
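One way centralized output normalization could look, as a sketch only: it assumes each method returns a `{ space, coords }` object and uses a tiny hand-written converter table limited to OkLCh and Oklab (the standard polar/rectangular relationship). `normalizeOutput` and `converters` are hypothetical names, not the ones used in this PR.

```javascript
// Hypothetical converters between the two spaces used in this sketch.
const converters = {
  "oklch->oklab": ([l, c, h]) => {
    const rad = (h * Math.PI) / 180;
    return [l, c * Math.cos(rad), c * Math.sin(rad)];
  },
  "oklab->oklch": ([l, a, b]) => {
    const h = (Math.atan2(b, a) * 180) / Math.PI;
    return [l, Math.hypot(a, b), (h + 360) % 360];
  },
};

// result: { space, coords } as returned by some method's compute()
function normalizeOutput(result, originSpace) {
  if (result.space === originSpace) return result; // nothing to do
  const convert = converters[`${result.space}->${originSpace}`];
  if (!convert) {
    throw new Error(`No converter from ${result.space} to ${originSpace}`);
  }
  return { space: originSpace, coords: convert(result.coords) };
}
```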

@lloydk

lloydk commented Jun 20, 2026


The fact that clip isn't the fastest algorithm in the benchmarking app probably means that the comparison isn't as fair as it could be.

I added oklch-cubic to my (unreleased) library and benchmarked the 35640 colors the gamut mapping benchmark app uses by default. I added the ability to disable the oklch-cubic cache and a raytrace algorithm that was optimized for P3 as my generic raytrace algorithm is unoptimized and needs some work.

I ran the benchmarks under Node and Bun. Clip was the fastest and oklch-cubic with cache enabled was second fastest. With cache disabled oklch-cubic was second slowest.

I'm a bit surprised that the cache didn't provide any benefit in coloraide. It would be interesting to see how caching vs no caching performs when benchmarking the same set of colors as this benchmarking app.

I'll play around with my implementation a bit more to see if I can improve the no cache performance.

Node
clk: ~4.98 GHz
cpu: AMD Ryzen 7 9800X3D 8-Core Processor
runtime: node 26.1.0 (x64-linux)

benchmark                                          avg (min … max) p75 / p99    (min … top 1%)
------------------------------------------------------------------ -------------------------------
oklch -> p3 gamut map (clip)                          2.31 ms/iter   2.30 ms  █
                                               (2.25 ms … 2.77 ms)   2.73 ms ▄█▃
                                           (530.77 kb …   1.96 mb) 564.63 kb ███▄▂▂▁▂▂▂▁▂▁▁▁▁▁▁▁▂▁
                                          4.09 ipc ( 95.14% cache)   4.16k branch misses
                                11.82M cycles  48.35M instructions 139.28k c-refs   6.77k c-misses

oklch -> p3 gamut map (oklch-cubic)                   5.81 ms/iter   5.82 ms  █▅
                                               (5.66 ms … 6.38 ms)   6.35 ms ▂██▂▅
                                           (  3.09 mb …   3.11 mb)   3.09 mb █████▇▅▄▃▂▁▁▃▄▄▁▁▂▄▁▂
                                          3.92 ipc ( 98.30% cache)  59.37k branch misses
                                29.63M cycles 116.02M instructions 866.39k c-refs  14.76k c-misses

oklch -> p3 gamut map (oklch-cubic, cache disabled)  16.02 ms/iter  16.05 ms  ▅ █▂
                                             (15.41 ms … 18.10 ms)  17.88 ms  █ ██▇
                                           (  5.39 mb …  37.39 mb)  12.77 mb ▇█▇███▇▄▇▁▁▄▄▁▄▁▁▁▁▁▄
                                          3.54 ipc ( 97.37% cache)  92.20k branch misses
                                80.53M cycles 284.86M instructions   2.65M c-refs  69.67k c-misses

oklch -> p3 gamut map (raytrace)                     14.81 ms/iter  15.01 ms    █
                                             (14.45 ms … 15.34 ms)  15.30 ms    █▃▃  ▃
                                           (550.35 kb … 610.97 kb) 558.61 kb ▆▆▁███▄▆█▁▄▁▆▆▁▄██▆▁▄
                                          4.28 ipc ( 95.25% cache)  88.69k branch misses
                                75.30M cycles 322.31M instructions 221.22k c-refs  10.50k c-misses

oklch -> p3 gamut map (raytraceP3Fast)               10.02 ms/iter  10.18 ms  █
                                              (9.75 ms … 10.84 ms)  10.73 ms  █▅▄
                                           (557.30 kb … 581.56 kb) 557.74 kb ▆████▄▁▂▂▅▅▂▂▄▁▂▅▄▂▁▂
                                          3.47 ipc ( 94.38% cache)  60.89k branch misses
                                51.07M cycles 177.41M instructions 169.47k c-refs   9.52k c-misses

oklch -> p3 gamut map (css)                          62.69 ms/iter  63.27 ms         █
                                             (61.12 ms … 64.60 ms)  64.47 ms ▅▅▅ ▅   █ ▅▅ ▅▅     ▅
                                           (  1.09 mb …   1.10 mb)   1.09 mb ███▁█▁▁▁█▁██▁██▁▁▁▁▁█
                                          3.88 ipc ( 97.75% cache) 331.59k branch misses
                               317.60M cycles   1.23G instructions 857.75k c-refs  19.30k c-misses

oklch -> p3 gamut map (bottosson lightness)           6.49 ms/iter   6.61 ms  █
                                               (6.25 ms … 7.40 ms)   7.29 ms ██▇▂
                                           (  4.34 mb …   4.37 mb)   4.34 mb █████▃▄▄▂▇▆▂▂▂▃▁▄▂▂▁▂
                                          3.92 ipc ( 95.97% cache)  15.67k branch misses
                                32.78M cycles 128.66M instructions 328.83k c-refs  13.27k c-misses

summary
  oklch -> p3 gamut map (clip)
   2.51x faster than oklch -> p3 gamut map (oklch-cubic)
   2.81x faster than oklch -> p3 gamut map (bottosson lightness)
   4.34x faster than oklch -> p3 gamut map (raytraceP3Fast)
   6.41x faster than oklch -> p3 gamut map (raytrace)
   6.93x faster than oklch -> p3 gamut map (oklch-cubic, cache disabled)
   27.12x faster than oklch -> p3 gamut map (css)
Bun
clk: ~4.89 GHz
cpu: AMD Ryzen 7 9800X3D 8-Core Processor
runtime: bun 1.3.14 (x64-linux)

benchmark                                          avg (min … max) p75 / p99    (min … top 1%)
------------------------------------------------------------------ -------------------------------
oklch -> p3 gamut map (clip)                          2.01 ms/iter   1.99 ms  █
                                               (1.91 ms … 2.68 ms)   2.53 ms  █▄
                                           (  0.00  b … 384.00 kb)   2.61 kb ███▇▃▂▂▂▂▂▁▁▁▂▂▂▃▂▁▁▁
                                          4.02 ipc ( 97.88% cache)   5.25k branch misses
                                10.10M cycles  40.63M instructions  82.69k c-refs   1.75k c-misses

oklch -> p3 gamut map (oklch-cubic)                   4.54 ms/iter   4.69 ms  █
                                               (4.30 ms … 5.58 ms)   5.52 ms ▃█▄
                                           (  0.00  b …   1.25 mb)  25.43 kb ████▆▄▄▆▆▅▂▁▂▃▁▁▁▂▁▂▂
                                          3.83 ipc ( 99.13% cache)  37.97k branch misses
                                22.96M cycles  88.02M instructions 673.90k c-refs   5.85k c-misses

oklch -> p3 gamut map (oklch-cubic, cache disabled)  15.43 ms/iter  16.02 ms  █▃▃█       ▃
                                             (14.41 ms … 17.20 ms)  17.09 ms ▂████  ▂  ▂ █ ▇▇
                                           (  0.00  b …  20.63 mb) 545.52 kb █████▁▆█▆▆█▆█▆██▁▁▁▆▆
                                          3.41 ipc ( 97.25% cache) 115.85k branch misses
                                73.84M cycles 251.50M instructions   2.26M c-refs  62.24k c-misses

oklch -> p3 gamut map (raytrace)                     15.11 ms/iter  15.27 ms  █
                                             (14.79 ms … 16.22 ms)  16.08 ms ▂█   ▂▂
                                           (  0.00  b …   3.13 mb) 194.91 kb ██▅▁▃██▃▆▃▆▃▁▁▃▁▁▁▁▁▃
                                          3.99 ipc ( 97.95% cache) 110.66k branch misses
                                77.03M cycles 307.19M instructions 270.48k c-refs   5.54k c-misses

oklch -> p3 gamut map (raytraceP3Fast)                9.36 ms/iter   9.51 ms   ▃█
                                              (9.09 ms … 10.50 ms)  10.09 ms ▂███
                                           (  0.00  b … 896.00 kb)  12.27 kb █████▁▁▃██▃▆▁▄▃▆▁▁▁▃▃
                                          3.64 ipc ( 97.57% cache)  72.86k branch misses
                                47.55M cycles 172.98M instructions 171.81k c-refs   4.18k c-misses

oklch -> p3 gamut map (css)                          61.04 ms/iter  61.20 ms    ██
                                             (60.59 ms … 62.01 ms)  61.86 ms ▅▅▅██▅    ▅▅        ▅
                                           (  0.00  b …   0.00  b)   0.00  b ██████▁▁▁▁██▁▁▁▁▁▁▁▁█
                                          3.65 ipc ( 96.91% cache) 362.30k branch misses
                               312.64M cycles   1.14G instructions 340.55k c-refs  10.53k c-misses

oklch -> p3 gamut map (bottosson lightness)           5.82 ms/iter   5.87 ms   █
                                               (5.65 ms … 6.46 ms)   6.24 ms  ▇██▆▃
                                           (  0.00  b …   1.00 mb)   9.93 kb ▆█████▄▆▄▄▅▁▂▁▅▂▄▆▁▁▃
                                          3.45 ipc ( 98.19% cache)  20.23k branch misses
                                29.74M cycles 102.71M instructions 184.68k c-refs   3.34k c-misses

summary
  oklch -> p3 gamut map (clip)
   2.26x faster than oklch -> p3 gamut map (oklch-cubic)
   2.9x faster than oklch -> p3 gamut map (bottosson lightness)
   4.66x faster than oklch -> p3 gamut map (raytraceP3Fast)
   7.52x faster than oklch -> p3 gamut map (raytrace)
   7.68x faster than oklch -> p3 gamut map (oklch-cubic, cache disabled)
   30.38x faster than oklch -> p3 gamut map (css)

@facelessuser

Contributor Author

I want to be clear: the cache did provide a benefit when the hue and gamut match exactly. My statement about what I tested was probably misleading.

The test I did didn't hit exact hues often enough, I guess. It was a map of highly saturated Rec. 2020 colors. While it varied in oklch lightness and hue, it wasn't generated from the oklch perspective, but from Rec. 2020. It produced a lot of hues that weren't exact.

I'm positive there are scenarios where it would help a lot, but I don't really want to keep a cache for every gamut: P3, sRGB, etc. All those caches add up. I'm handling more than the 3 CSS gamuts.

@facelessuser

Contributor Author

When I tested things, what I ended up doing was basically treating Rec. 2020 as an HSL space, adjusting lightness and hue. I then converted the colors back to Rec. 2020. For all the colors I used max saturation. This ensured that the hues didn't all align with perfect OkLCh hues. That's why I didn't get the cache improvements.
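The generation scheme described above might look roughly like the following. This is a guess at the shape, not the actual test script: it uses the standard HSL-to-RGB formula at saturation 1 and simply interprets the resulting triple as Rec. 2020 coordinates. `hslToRgb` and `saturatedPatches` are hypothetical names, and the step sizes are arbitrary.

```javascript
// Standard HSL -> RGB formula (h in degrees, s and l in [0, 1]).
// Interpreting the result as Rec. 2020 coordinates is an assumption of this sketch.
function hslToRgb(h, s, l) {
  const a = s * Math.min(l, 1 - l);
  const f = (n) => {
    const k = (n + h / 30) % 12;
    return l - a * Math.max(-1, Math.min(k - 3, 9 - k, 1));
  };
  return [f(0), f(8), f(4)];
}

// Deterministic patches: every hue step crossed with interior lightness steps,
// always at maximum saturation.
function* saturatedPatches(hueStep, lightSteps) {
  for (let h = 0; h < 360; h += hueStep) {
    for (let i = 1; i <= lightSteps; i++) {
      yield hslToRgb(h, 1, i / (lightSteps + 1));
    }
  }
}
```

Because the hue grid is regular in HSL and not in OkLCh, the OkLCh hue of these patches changes with lightness, which is why an exact-hue cache sees few repeats.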

If I were to add hue_cache back, I'd probably have it share the cache across all the gamuts using an LRU cache with some max size. If you were processing lots of colors in a specific space, you'd fill your cache with that space and get a speed improvement. I probably wouldn't want it to expand infinitely, caching every sub hue: 20.0000000001, 20.0000000002, etc. I think having some sort of compromise is reasonable.

The fact that not having the hue_cache doesn't bother me is a testament to the fact that it is super fast even without the cache. I find caching the linear LMS/RGB matrix more useful as I get the benefit on every hit.
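A bounded, shared hue cache along those lines could be sketched as follows. `LruCache` and `hueKey` are hypothetical names and the size limit is left to the caller. The sketch relies on JavaScript's `Map` iterating keys in insertion order.

```javascript
// Hypothetical bounded LRU cache shared by all gamuts.
// Map preserves insertion order, so the first key is the least recently used.
class LruCache {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.map = new Map();
  }

  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key); // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }

  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      this.map.delete(this.map.keys().next().value); // evict the oldest entry
    }
  }
}

// One key per (gamut, hue) pair, so a single cache serves every gamut.
const hueKey = (gamutId, hue) => `${gamutId}:${hue}`;
```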

@facelessuser

Contributor Author

I will further note that I've been evaluating it in other gamuts like sRGB and Rec. 2020 (not just P3). I realize this relation may hold better for P3 than the others, but so far it seems to work ok for the others as well, though a clip may be required. I haven't yet found cases where P3 needs clipping, but I have found some for sRGB.

@lloydk

lloydk commented Jun 20, 2026


I'm less convinced that it's faster than other methods without the cache. I think it depends on how well optimized the alternative methods are.

@facelessuser

Contributor Author

Yeah, I'm not convinced it's faster than clipping. I wanted to drop it into my library where there is known overhead and consistency. And I'm skeptical about it being faster than the LUT, but I have real data showing it outperforms Raytrace, though Raytrace is flexible enough to be used with non-Oklab perceptual spaces, so it still has its niche. Right now I'm trying to properly evaluate its usage outside P3, as that's a very narrow case.

@LeaVerou LeaVerou left a comment

Member

First, this discussion should not be happening in a PR, these are very valuable results y'all!

And yes, it doesn't make sense that anything would be faster than clip, lol. That's clearly an artifact of the benchmark methodology.

But thinking about this some more, I've been wondering what would be a fair comparison. Is it really fair if all start and end with xRGB[1]? If a certain algorithm is faster for xRGB colors whereas another one is faster for OKLCH colors, that should be reflected in the data, not hidden by artificial conversions. I think it's ok to convert the output color, since that has to become xRGB at some point, but the input color shouldn't be converted any more than is necessary by the conversion algorithm. Instead, we should have an option to customize how the colors to benchmark are generated, so that we can have separate comparisons for different color spaces. I can think of several different color generating schemes that would be useful to try, but we can start with one additional scheme that generates P3 colors guaranteed to be OOG. The easiest way would probably be to take the currently generated colors and convert them to P3, non-timed, before feeding them to the conversion algos.
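The "non-timed" conversion could be kept out of the measurement with something like the sketch below. `benchmark` and `prepare` are hypothetical names. It assumes a runtime that provides the global `performance.now()` (browsers and current Node versions do).

```javascript
// Hypothetical benchmark helper: input conversion happens outside the timed region.
function benchmark(compute, colors, prepare = (c) => c) {
  const inputs = colors.map(prepare); // e.g. OkLCh -> P3, not timed
  const results = new Array(inputs.length);

  const start = performance.now();
  for (let i = 0; i < inputs.length; i++) {
    results[i] = compute(inputs[i]); // only the mapping itself is timed
  }
  const elapsedMs = performance.now() - start;

  return { elapsedMs, results };
}
```

Passing a different `prepare` per run would give separate timings per input color space without touching the methods themselves.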

Approved as I can see the argument for having this in the meantime (and it includes the L fix).

As a bigger point, if we're measuring speed, none of these should be using the OOP API!

Footnotes

  1. Using xRGB as shorthand for "an RGB color space"

Comment thread gamut-mapping/methods/oklch-cubic.js
@facelessuser

facelessuser commented Jun 21, 2026

Contributor Author

Normalizing compute Requirements

> Is it really fair if all start and end with xRGB?

That wasn't really what I was suggesting. I was suggesting that they have similar requirements: for instance, if the color is in originX and we are gamut mapping to xRGB, the input and output color space for compute should be the same regardless of what it does internally. It's more that compute has a similar, expected requirement for all methods. Internally, they can do whatever. Currently, their output is inconsistent. I think the normalization ensures they all get a similar input already.

Benchmarking

Now that I better understand what the new algorithm is doing, I think the benchmark highly favors the new algorithm, even if that wasn't intentional. It generates colors in OkLCh, holding hue constant for a column and lightness constant for a row. The algorithm capitalizes on the constant hue case because it has the hue_cache. I think it is fine to show that it can excel in this case, but a fairer example would be to generate the colors in a way where this is not the case. When I tested it, I generated my colors very differently, which is why I didn't see the hue cache speed up, at least not to a degree that was noticeable.
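To make the effect concrete, here is a small sketch that builds a grid of the kind described (hue constant per column, lightness constant per row) and counts how often a cache keyed on exact hue would be hit. `oklchGrid` and `hueCacheHitRate` are hypothetical names, and the grid dimensions are arbitrary, not the ones the benchmark app uses.

```javascript
// Hypothetical OkLCh grid: hue constant per column, lightness constant per row.
// Assumes lSteps >= 2 so the lightness spacing is well defined.
function oklchGrid(lSteps, hSteps, chroma = 0.4) {
  const colors = [];
  for (let row = 0; row < lSteps; row++) {
    for (let col = 0; col < hSteps; col++) {
      colors.push([row / (lSteps - 1), chroma, (col * 360) / hSteps]);
    }
  }
  return colors;
}

// Fraction of lookups that a cache keyed on exact hue would serve from memory.
function hueCacheHitRate(colors) {
  const seen = new Set();
  let hits = 0;
  for (const [, , h] of colors) {
    if (seen.has(h)) hits++;
    else seen.add(h);
  }
  return hits / colors.length;
}
```

For a 10 by 36 grid, only the first row misses, so 324 of 360 lookups (90%) are hits. A set of colors whose hues never repeat exactly would score 0.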

The algorithm is fast, no denying that, regardless of the hue cache benefit; why not show that? I think it is indeed fairer to show it compared to the others in a test that shows what the algorithm does, not the cache, unless the whole approach is the cache, as it is with the LUT.

At the very least, there should be a way to benchmark the colors that doesn't have a high bias toward a specific approach.

Algorithm Changes

I think that OkLCh Cubic should clip to the gamut at the end. This requirement should be baked into the algorithm.

I've been running tests, and I can clearly hit cases where colors are not fully in gamut without that.

This shows the Input OkLCh color, what it looks like in Display-P3 before GMA, and then after GMA.

AssertionError: color(--oklch 0.60262 0.22481 203.86deg / 1) -> + color(display-p3 -0.37824 0.61506 0.72699 / 1) -> color(display-p3 -0.00008 0.58127 0.64186 / 1)

Even if these cases aren't prolific, they still occur and should be handled.
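A minimal sketch of such a final step, using the Display-P3 result from the assertion above as input. `inGamut` and `finalize` are hypothetical helpers, not part of the algorithm as published. The optional `epsilon` only shows how a tolerance would hide this kind of near miss.

```javascript
// Hypothetical final step for a method whose result can land slightly outside [0, 1].
function inGamut(rgb, epsilon = 0) {
  return rgb.every((c) => c >= -epsilon && c <= 1 + epsilon);
}

function finalize(rgb) {
  // Clip only when needed, so already in-gamut results are returned untouched.
  return inGamut(rgb) ? rgb : rgb.map((c) => Math.min(1, Math.max(0, c)));
}

// The Display-P3 result reported above: the red channel is just below zero.
const mapped = [-0.00008, 0.58127, 0.64186];
const clipped = finalize(mapped); // [0, 0.58127, 0.64186]
```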

@facelessuser

Contributor Author

> First, this discussion should not be happening in a PR, these are very valuable results y'all!

I was grappling with how this could be faster than clip, and even the LUT (on average), which spawned the whole digging into what is a "fair" test. Then I found the lightness bug, and I mainly wanted to find any flaws in the algorithm and get them patched before commenting further. I kind of also wanted to understand how well this applied to sRGB and Rec. 2020 (other CSS gamut map targets). It wasn't clear to me whether the observed relationship held true generally for all linear RGB spaces or just P3. I think once I feel more confident about what the algorithm really can and can't do, I'll comment more in the CSSWG thread. I have a habit of spouting my first gut impression, and then later walking back some of that once I understand things better 😅.

I've seen enough to know roughly where it falls compared to some approaches. It's less complicated than Björn's approach and avoids iterations where RayTrace doesn't, so it doesn't surprise me that it beats those. I know LUT isn't just looking up chroma in a table; there is some work that occurs, and maybe in some cases it can be slower, but I'd still think that on average, it would be faster. I don't have enough data right now.

@facelessuser

Contributor Author

While I think the cache does give a good performance boost in some situations, I do think an unbounded cache may not be a practical requirement. I imagine a bounded cache that drops the least hit entries in the cache when full may be more reasonable. Even if I end up implementing the cache (shared across gamuts or separate), I'd likely not implement them such that they'd be unbounded.

@LeaVerou

Member

I don't think the input and output spaces need to be the same. If one GMA is faster when run on oklch colors and another is faster when run on xRGB colors, that's useful data that should be known. It's not inconceivable that browsers may end up using different GMAs depending on the format, after all.

OTOH, since the color needs to be displayed on a screen in the end, I think it's ok for the output space to be xRGB, but I'd also be fine with them returning it in any format. That also applies to any final clipping: it will be done by the screen hardware anyway, so no strong opinion on whether it should be considered part of the algorithm or not. That said, for non-screen media (e.g. printing) you could end up side-stepping xRGB altogether and convert to an entirely different color space, so perhaps xRGB shouldn't be so privileged.

In general, I think instead of hiding each GMA's strengths and weaknesses, we should have multiple different ways of generating the colors via a URL parameter & corresponding form control.
I had locally prototyped a ?space parameter that sets the color space for the input by generating colors in oklch and then converting them before they are passed to the (normalized) GMAs but after seeing your comment I'm not sure that's the best approach, since they'd still have hue duplication (though presumably the conversion error would eliminate the cache benefit).
Perhaps we should have different generator functions that are used to generate the colors?
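One possible shape for pluggable generator functions, as a sketch: a registry keyed by name and selected through a URL parameter. The parameter name `generator`, the registry, and both generators are hypothetical. The two shown walk the same OkLCh grid and differ only in iteration order.

```javascript
// Hypothetical registry of deterministic patch generators.
const lightness = (i, steps) => (i + 0.5) / steps;
const hue = (j, steps) => (j * 360) / steps;

const generators = new Map([
  ["l-first", function* (lSteps, hSteps) {
    for (let i = 0; i < lSteps; i++)
      for (let j = 0; j < hSteps; j++)
        yield [lightness(i, lSteps), 0.4, hue(j, hSteps)];
  }],
  ["hue-first", function* (lSteps, hSteps) {
    for (let j = 0; j < hSteps; j++)
      for (let i = 0; i < lSteps; i++)
        yield [lightness(i, lSteps), 0.4, hue(j, hSteps)];
  }],
]);

// `generator` is an assumed parameter name, not one the app is known to support.
// Unknown or missing names fall back to "l-first".
function pickGenerator(search) {
  const name = new URLSearchParams(search).get("generator");
  return generators.get(name) ?? generators.get("l-first");
}
```

A page could then call `pickGenerator(location.search)` and feed the resulting colors to every method.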


As a bigger picture comment, I worry we may have lost sight of the end goal here, and have started treating this as a purely algorithmic problem.

The goal is to find a GMA that would work well for CSS use cases.
We're not developing a GMA for images or graphics. For entire graphics you need a perceptual rendering intent because relationships between colors matter more than absolute color qualities. This means converting even in-gamut colors, so none of these GMAs would be a good fit for this.
For CSS, we need a relative colorimetric rendering intent, i.e. in-gamut colors stay in gamut, and only OOG are affected.
In addition to the rendering intent, we need to ensure that contrasting color pairs remain contrasting, so large L and H shifts are not acceptable.
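These two requirements can be stated as code. The sketch below assumes RGB coordinates in [0, 1] for the gamut check and OkLCh `[L, C, H]` arrays for the shift measurement. `mapRelative`, `hueDelta`, and `shifts` are hypothetical names, and what counts as a "large" shift is left open.

```javascript
// Relative colorimetric intent: in-gamut colors pass through unchanged,
// only out-of-gamut colors are handed to the mapping function `gma`.
function mapRelative(rgb, gma) {
  const inGamut = rgb.every((c) => c >= 0 && c <= 1);
  return inGamut ? rgb : gma(rgb);
}

// Signed hue difference in degrees, wrapped into [-180, 180).
function hueDelta(h1, h2) {
  return ((((h2 - h1) % 360) + 540) % 360) - 180;
}

// How far a mapping moved a color, given OkLCh coords before and after.
function shifts(before, after) {
  return {
    dL: after[0] - before[0],
    dC: after[1] - before[1],
    dH: hueDelta(before[2], after[2]),
  };
}
```

Chroma reduction (`dC`) is expected from any of these methods. The point of tracking `dL` and `dH` is to flag mappings that could weaken the contrast between a pair of colors.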

Similarly, the choice of oklch for the generator was not to favor any particular algorithm (oklch-cubic is not the only GMA that uses oklch), but based on the human factors of the actual use case. In practice, that's when authors get OOG: when using polar formats to tweak visual components (e.g. to create color palettes). When using RGB color formats, you know when you're OOG, so it mainly comes up when authors use e.g. P3 and the color is displayed in an sRGB device. But for such a small distance, clipping is not such a huge problem anyway.

And once you're using oklch, generating representative patches by going over the range of L and H with a predefined step seemed like the obvious choice to cover the range of possibilities.

Don't get me wrong, I 100% agree we should have multiple ways to generate the patches. The more data, the better! But not all ways of generating colors are equally relevant to the end goal.
On that note, it would be productive if you could suggest a deterministic patch generation algorithm that you think is fair. It sounds like you have one in mind?

> The algorithm is fast, no denying that, regardless of the hue cache benefit; why not show that? I think it is indeed fairer to show it compared to the others in a test that shows what the algorithm does, not the cache, unless the whole approach is the cache, as it is with the LUT.

Because it felt wasteful to recompute a bunch of stuff that only depends on H and not L (and my hypothesis is that in real-world usage, hues would not vary that much, but that's just a hypothesis).

I didn't want to cap its number of entries as that seemed less fair (depending on how you generate colors, you'd get different results), but it could be capped to a certain number of digits of precision beyond which the C is the same anyway. I had computed that number before somewhere.

It may also be worth having a version with the cache and a version without it. I'll try to add one.
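Capping by precision could be sketched like this. `HUE_DIGITS` is a placeholder value, since the actual number of digits is not given in the thread. `hueKeyFor` and `cachedForHue` are hypothetical names.

```javascript
// Hypothetical hue-keyed cache whose keys are rounded to a fixed precision.
const HUE_DIGITS = 3; // assumed value; the thread does not state the actual number
const hueCache = new Map();

function hueKeyFor(hue) {
  const wrapped = ((hue % 360) + 360) % 360; // normalize into [0, 360)
  // Round, then wrap again so that 359.9999 maps to the same key as 0.
  const rounded = Number(wrapped.toFixed(HUE_DIGITS)) % 360;
  return rounded.toFixed(HUE_DIGITS);
}

// computeForHue: assumed function doing the hue-only part of the work.
function cachedForHue(hue, computeForHue) {
  const key = hueKeyFor(hue);
  if (!hueCache.has(key)) hueCache.set(key, computeForHue(Number(key)));
  return hueCache.get(key);
}
```

The cache can then hold at most 360 × 10^`HUE_DIGITS` entries, and which entries exist does not depend on the order in which colors arrive.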

> I think that OkLCh Cubic should clip to the gamut at the end. [...] I've been running tests, and I can clearly hit cases where colors are not fully in gamut without that.

From an architectural pov, we should decide if compute() should include all necessary normalizations or whether it should exclude them so they're guaranteed to only be done once. I've currently gone with the latter, but I can see a case for the former. It seems more error-prone (easy to author a new GMA without remembering to normalize) but also it makes the compute()s more self-contained. No strong opinion here, except that it should be decided once and applied universally.

@facelessuser

Contributor Author

> From an architectural pov, we should decide if compute() should include all necessary normalizations or whether it should exclude them so they're guaranteed to only be done once. I've currently gone with the latter, but I can see a case for the former. It seems more error-prone (easy to author a new GMA without remembering to normalize) but also it makes the compute()s more self-contained. No strong opinion here, except that it should be decided once and applied universally.

I somewhat disagree. It makes no sense to clip after clipping in a normalization. It doubles the overhead of clipping for no reason. If an algorithm doesn't require clipping after, it shouldn't clip, but if it does, it should be part of the algorithm. I think it makes sense to contain the real GMA logic. In the end though, I'll defer to what you want.

> Don't get me wrong, I 100% agree we should have multiple ways to generate the patches.

Yep, I think it's good to highlight an approach's strengths, but also its weaknesses.

> I didn't want to cap its number of entries as that seemed less fair

I don't think it unfair, but I'm also only offering it as a suggestion. I personally wouldn't implement it myself unbounded, but it's fine if people choose to. As someone who's worked in embedded programming, an unbounded cache simply doesn't sit well with me 🙂. I also think the algorithm is good with or without the cache.

@LeaVerou

Member

> > From an architectural pov, we should decide if compute() should include all necessary normalizations or whether it should exclude them so they're guaranteed to only be done once. I've currently gone with the latter, but I can see a case for the former. It seems more error-prone (easy to author a new GMA without remembering to normalize) but also it makes the compute()s more self-contained. No strong opinion here, except that it should be decided once and applied universally.

> I somewhat disagree. It makes no sense to clip after clipping in a normalization. It doubles the overhead of clipping for no reason. If an algorithm doesn't require clipping after, it shouldn't clip, but if it does, it should be part of the algorithm. I think it makes sense to contain the real GMA logic. In the end though, I'll defer to what you want.

I'm confused — you say you disagree then you go on to agree 😅

> > Don't get me wrong, I 100% agree we should have multiple ways to generate the patches.

> Yep, I think it's good to highlight an approach's strengths, but also its weaknesses.

Some suggestions about alternative ways to generate them? 🙂

> > I didn't want to cap its number of entries as that seemed less fair

> I don't think it unfair, but I'm also only offering it as a suggestion. I personally wouldn't implement it myself unbounded, but it's fine if people choose to. As someone who's worked in embedded programming, an unbounded cache simply doesn't sit well with me 🙂. I also think the algorithm is good with or without the cache.

I meant an LRU cache seemed unfair as e.g. you'd get different results if you generate the patches hue-first vs L-first, whereas limiting by precision doesn't depend on ordering.
But I agree I should add a version of the algo without the cache.

@lloydk

lloydk commented Jun 21, 2026


Yeah, I'm not convinced it's faster than clipping. I wanted to drop it into my library where there is known overhead and consistency. And I'm skeptical about being faster than the LUT, but I have real data showing it out performs Raytrace, though Raytrace is flexible to be used with non Oklab perceptual spaces, so it still has it's niche. Right now I'm trying to properly evaluate it's usage outside P3 as that's a very narrow case.

It's quite possible that the performance difference between each gamut mapping method is library-dependent. So for coloraide, non-cached oklch-cubic could very well be faster than raytrace while other libraries might show different results.

As a bigger picture comment, I worry we may have lost sight of the end goal here, and have started treating this as a purely algorithmic problem.

The goal is to find a GMA that would work well for CSS use cases. We're not developing a GMA for images or graphics. For entire graphics you need a perceptual rendering intent because relationships between colors matter more than absolute color qualities. This means converting even in-gamut colors, so none of these GMAs would be a good fit for this. For CSS, we need a relative colorimetric rendering intent, i.e. in-gamut colors stay in gamut, and only OOG are affected. In addition to the rendering intent, we need to ensure that contrasting color pairs remain contrasting, so large L and H shifts are not acceptable.

Isn't the primary CSS use case going to be mapping to sRGB or P3 or possibly Rec2020? I know at one point there was talk of mapping to Rec2020 and then clipping. If mapping to *RGB is the primary use case then I think the default output space for this benchmark should be *RGB.

I somewhat disagree. It makes no sense to clip after clipping in a normalization. It doubles the overhead of clipping for no reason. If an algorithm doesn't require clipping after, it shouldn't clip, but if it does, it should be part of the algorithm. I think it makes sense to contain the real GMA logic. In the end though, I'll defer to what you want.

I agree that the algorithms should contain the real GMA logic. Also, the current clip implementation isn't very efficient so algorithms that call toGamut({ method: "clip" }) will get punished unfairly compared to algorithms that don't clip. I think ideally we'd have a clip(Coords: coords) method on ColorSpace that clipped coords in place and avoided mapping over the ColorSpace.coords object. This could also potentially help the clip benchmark numbers.
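A minimal sketch of the in-place clip suggested above, assuming the coordinates are a plain array and that every channel shares the same [0, 1] range (as in RGB spaces such as P3). `clipInPlace` is a hypothetical name for illustration, not an existing Color.js method.

```javascript
// Hypothetical helper, not part of Color.js: clamp each channel in place
// so that no intermediate array or Color object is allocated.
function clipInPlace(coords, min = 0, max = 1) {
	for (let i = 0; i < coords.length; i++) {
		const value = coords[i];
		if (value < min) {
			coords[i] = min;
		}
		else if (value > max) {
			coords[i] = max;
		}
	}
	return coords;
}

const p3 = [1.08, -0.02, 0.5];
clipInPlace(p3);
console.log(p3); // [1, 0, 0.5]
```

Because the function mutates and returns the same array, a benchmark loop can reuse one buffer across calls, which keeps allocation cost out of the clip measurement.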

@facelessuser

facelessuser commented Jun 21, 2026

Contributor Author

It's quite possible that the performance difference between each gamut mapping method is library-dependent. So for coloraide, non-cached oklch-cubic could very well be faster than raytrace while other libraries might show different results.

No, cached is faster when hue is identical, I've tested this locally. The specific test I originally did generated maximally saturated values in Rec. 2020 at various lightness levels, which doesn't generate enough exact hue matches to show the additional gain. The cache only helps in this specific case to avoid hue data calculations.

@facelessuser

Contributor Author

This demonstrates cache works, even in ColorAide. First is cached, second example is not.

coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic', cache=True)"
20000 loops, best of 5: 12.4 usec per loop
coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic')"
20000 loops, best of 5: 17.3 usec per loop

Admittedly, performance will be library-dependent and language-dependent, but skipping the hue data calculation is universally faster. It only works, though, when the hue exactly matches an entry in the cache.
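To illustrate why the cache only pays off on exact hue matches, here is a minimal sketch of a hue-keyed memo. The body of `computeHueData` is a stand-in (a couple of trig calls), not the real per-hue setup, and the `misses` counter exists only for demonstration.

```javascript
// Stand-in for the expensive per-hue setup (trig, coefficients, root solves).
// The real computation is not reproduced here.
function computeHueData(hue) {
	const radians = (hue * Math.PI) / 180;
	return { cos: Math.cos(radians), sin: Math.sin(radians) };
}

const hueCache = new Map();
let misses = 0;

function cachedHueData(hue) {
	let data = hueCache.get(hue);
	if (data === undefined) {
		misses++;
		data = computeHueData(hue);
		hueCache.set(hue, data);
	}
	return data;
}

cachedHueData(320);
cachedHueData(320);      // exact match: served from the cache
cachedHueData(320.0001); // near miss: recomputed
console.log(misses);     // 2
```

A test set with few repeated hues (such as maximally saturated Rec. 2020 colors) hits the miss path almost every time, so the cached and uncached timings converge.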

@lloydk

lloydk commented Jun 21, 2026

This demonstrates cache works, even in ColorAide. First is cached, second example is not.

➜  coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic', cache=True)"
20000 loops, best of 5: 12.4 usec per loop
➜  coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic')"
20000 loops, best of 5: 17.3 usec per loop

Admittedly, performance will be library-dependent and language-dependent, but skipping the hue data calculation is universally faster. It only works, though, when the hue exactly matches an entry in the cache.

Are both times faster than raytrace?

@facelessuser

Contributor Author

Are both times faster than raytrace?

Yes, Ray Trace has to iterate 4 times; there is no way it can beat OkLCh Cubic in speed. Cubic isn't extremely complicated, and it doesn't iterate with color conversions. Ray Trace is more flexible for swapping out other perceptual spaces. But it can't beat the Cubic approach in raw speed.

Here's a test I coded up which plays to OkLCh Cubic's strengths when caching is enabled. Similar to what the Color.js benchmark does.

Tests are Ray Trace, Cubic, Cached Cubic, Clip. Ray Trace can't keep up.

coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "raytrace" -t oklch
Completed in: 33.679208042 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "oklch-cubic" -t oklch
Completed in: 18.136304 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m 'oklch-cubic:{"cache": true}' -t oklch
Completed in: 12.771840833 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "clip" -t oklch
Completed in: 6.156545917 sec

Now, here's one that generates maximum saturated colors in Rec. 2020. Here the hues won't play to Cached Cubic's strengths. Still faster than Ray Trace.

coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "raytrace" -t rec2020
Completed in: 36.545193125 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "raytrace" -t rec2020
Completed in: 20.638073625 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m 'oklch-cubic:{"cache": true}' -t rec2020
Completed in: 20.639316583 sec
coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "clip" -t rec2020
Completed in: 6.920228125 sec

@facelessuser

Contributor Author

I think at this point, a lightness fix is already on main, and there appears to not quite be an agreement on how to make the tests more consistent. I think I will abandon this review for now, though I think further conversation would be useful.

@lloydk

lloydk commented Jun 23, 2026

@facelessuser I just quickly clauded a no-cache version and indeed, the difference is tiny. Considering dropping the cache entirely, the space-time tradeoff doesn't seem worth it, even though it felt worth it to me before seeing the numbers. Do you see the same?

I asked Codex why I was seeing much bigger performance differences between cache vs no cache for my library and the benchmark app. Here's what Codex said (Spectre is my library):

Apologies for the horizontal scrolling, I'm too tired to reformat the output.

  I ran the current Spectre benchmark and got the same shape:

  clip                         2.31 ms/iter
  oklch-cubic                  5.56 ms/iter  = 2.41x clip
  oklch-cubic cache disabled  13.89 ms/iter  = 6.01x clip

  In Spectre, bench-p3-gma.mjs is a tight raw-Vec3 loop over prebuilt samples and reused output buffers. The grid is also 100% out of gamut at C=0.4, so every sample
  reaches the cubic solve path. The uncached path calls computeHueData() every time, which includes trig, coefficient setup, and six root-related solves per hue setup. The
  cached path is warmed before measurement, so it is basically Map.get() for hue data.

  In the app, the timed path includes much more wrapper cost: color.to("oklch").set(...), the mapper, then result.inGamut("p3") and possible clip in methods.js:45. With
  prebuilt Color objects I measured:

  app raw cached compute       1.473 us/call
  app raw no-cache compute     1.927 us/call
  app normalized cached        4.831 us/call
  app normalized no-cache      5.186 us/call

  So the cache delta is real in the app too, but after normalization it is only ~0.36 us on top of ~4.8 us. In Spectre, the cached generic path is only ~0.15 us/call in a
  direct probe, so the same kind of hue setup cost dominates and turns into a ~2.5x cached-vs-uncached ratio.

  Short version: Spectre removed most of the shared overhead, so the per-hue cache is exposed. The app measures through Color.js object conversion and post-normalization,
  so the cache benefit is diluted.

and another comment from Codex:

The absolute cache effect is similar. The ratio is different because app “raw compute” has a large fixed Color.js overhead.

  From the app probe:

  Color.to("oklch") only   1.459 us/call
  cached.compute           1.508 us/call
  nocache.compute          1.929 us/call

  So app cached compute is basically dominated by color.to("oklch"). The cached solver work after that is tiny, around 0.05 us/call in this probe. No-cache adds about 0.42
  us/call.

  From the Spectre benchmark you quoted/current run:

  oklch-cubic cached       5.56 ms / 35640 = 0.156 us/call
  oklch-cubic no-cache    13.89 ms / 35640 = 0.390 us/call
  delta                                           0.234 us/call

  So Spectre’s no-cache penalty is not bigger in absolute terms. It is actually smaller than the app probe’s absolute no-cache penalty. It just sits on top of a much
  smaller cached baseline.

  That is the crux:

  app:     1.508 -> 1.929 us = 1.28x, because ~1.46 us is shared Color.js conversion overhead
  Spectre: 0.156 -> 0.390 us = 2.50x, because the shared overhead has been stripped down

  The app’s compute() starts with color = color.to("oklch"), which always goes through Color.js conversion/object machinery. Spectre’s benchmark passes plain [L, C, H]
  arrays into a reusable output buffer. Once that overhead is removed, the hue-data cache becomes a much larger fraction of the total runtime.

So as mentioned earlier in this thread, don't use the OOP API for benchmarking. Switching to the function API should increase the delta between cache and no cache. Also, the more optimized a library is for performance, the bigger the delta between cache and no cache.

Note that I didn't verify any of the AI results but the analysis is more or less what I expected to see.

@LeaVerou

Member

I just pushed a change that goes back to a single cubic GMA with no cache.

As for the points above, I said it myself earlier that we should have never used the OOP API for this.
I'll ask Claude to convert all of them to the procedural API and see what happens.

@facelessuser

Contributor Author
clip                         2.31 ms/iter
oklch-cubic                  5.56 ms/iter  = 2.41x clip
oklch-cubic cache disabled  13.89 ms/iter  = 6.01x clip

That's more like what I expected! Just knowing what the algorithms are doing, clip should always be winning. Cache should also have a clear advantage when the test favors it. Running less code should take less time. Solving the cubic roots, even after skipping the hue data retrieval, should still have been more expensive than clip. oklch-cubic is fast, but not magic 🙃. I expect that, even with a more fair comparison, we'll still see it performing near the top.

@facelessuser

Contributor Author

Small note: the quadratic logic in the firstRoot function unnecessarily returns two roots for the double root case. If the discriminant is zero, only one needs to be returned. I guess you could throw out the duplicate root when the discriminant is really close to zero, like what is done in the Cubic logic.

Additionally, just like with the Cardano method for cubic roots, you can simplify the solution by dividing the coefficients by the first. I usually do this so I don't have to remember the more complicated equation that they drill into your heads when you're young.

	c /= b; d /= b;
	let m = -c * 0.5;
	let disc = m * m - d;
	if (disc > 0) {
		// Two real roots
		r0 = m + Math.sqrt(disc);
		r1 = m - Math.sqrt(disc);
	}
	else if (disc === 0) {
		// Double root
		r0 = m;
	}

While I like to use this approach, you don't have to use it. I thought I'd share it for fun, though. Here's a video on the approach: https://www.youtube.com/watch?v=MHXO86wKeDY.
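For reference, a self-contained sketch of the normalized approach above, assuming the quadratic is written as b·x² + c·x + d = 0 (the leftover terms of a cubic). The function name `solveQuadratic` and the choice to return an array are illustrative, not the signature used by firstRoot. The square root is computed once and reused for both roots.

```javascript
// Solve b*x^2 + c*x + d = 0 by first dividing through by b, which gives
// x^2 + p*x + q = 0 with roots m ± sqrt(m^2 - q), where m = -p / 2.
function solveQuadratic(b, c, d) {
	if (b === 0) {
		throw new RangeError("Not a quadratic: leading coefficient is zero");
	}
	const p = c / b;
	const q = d / b;
	const m = -p * 0.5;
	const disc = m * m - q;
	if (disc > 0) {
		// Two real roots; take the square root only once
		const s = Math.sqrt(disc);
		return [m + s, m - s];
	}
	if (disc === 0) {
		// Double root
		return [m];
	}
	// No real roots
	return [];
}

console.log(solveQuadratic(2, -6, 4)); // [2, 1]
console.log(solveQuadratic(1, -2, 1)); // [1]
```

The normalized discriminant m² − q equals the textbook c² − 4bd divided by 4b², so it always has the same sign and the same zero.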

My kid got in trouble using this method because his teacher claimed the discriminant was wrong, and while this approach produces a different discriminant, it is consistent with the approach and has the same relation to zero. She just didn't understand what he was doing 🙂 .

@facelessuser

facelessuser commented Jun 23, 2026

Contributor Author

To be fair, firstRoot evaluates all roots at the end, regardless, even if they are still infinity, so it probably doesn't really gain you anything if you account for double roots or not. I just looked at it as a "correctness" thing 🤷🏻.

EDIT: Don't know why I said this, you avoid doing two sqrt calls.

@lloydk

lloydk commented Jun 23, 2026

I asked Claude to change the benchmark program to measure each algorithm in one shot (locking up the UI). Before the changes, no cache seemed to outperform cached. After the changes, the no-cache version was about 1.13x slower. My guess is that there is some kind of measurement error that just happens to benefit the no-cache version. Also, the Spectre mitigation for performance.now() may have something to do with the measurement error. Here's what Claude had to say about Spectre mitigation: https://claude.ai/share/205394ae-0f57-40d8-b1e6-54cdc84fde23

The benchmark app does not appear to have cross-origin isolation enabled. This means performance.now() will use a less precise timer, making the benchmark results less accurate.
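One way to check this empirically is to look at the smallest step `performance.now()` ever reports. The sketch below is an illustration only: `estimateTimerResolution` is a made-up helper name, and the value it returns depends on the browser or runtime.

```javascript
// Estimate the effective resolution of performance.now() by recording the
// smallest non-zero step observed between consecutive readings.
function estimateTimerResolution(samples = 50) {
	let smallest = Infinity;
	for (let i = 0; i < samples; i++) {
		const start = performance.now();
		let next = performance.now();
		// Spin until the clock visibly advances
		while (next === start) {
			next = performance.now();
		}
		smallest = Math.min(smallest, next - start);
	}
	return smallest; // milliseconds
}

// `crossOriginIsolated` is a browser global; it is undefined in Node.
console.log("isolated:", globalThis.crossOriginIsolated === true);
console.log("smallest tick (ms):", estimateTimerResolution());
```

If the smallest tick is a sizeable fraction of the interval being timed, short measurements are quantized, and timing many calls in one shot (as described above) is the safer approach.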

@facelessuser

Contributor Author

The benchmark app does not appear to have cross-origin isolation enabled. This means performance.now() will use a less precise timer, making the benchmark results less accurate.

That's interesting. I didn't know that.

@LeaVerou

Member

Ok check out #41

@lloydk

lloydk commented Jun 24, 2026

Ok check out #41

Cross-origin isolation is enabled

@facelessuser

Contributor Author

Here is a more fair comparison in ColorAide. I've been comparing Cubic against a very generalized version of Ray Trace that handles any perceptual space (not just Oklab) and handles other generic cases as well. But when using an idealized case that handles just Oklab and just linear RGB spaces, these are the comparisons I get.

So, Cubic is still faster, as I suspected, but a lot closer than my earlier metrics showed.

coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace-oklch'
Colors: 250000
> 100%
Completed in: 4.91770125 sec
coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic'
Colors: 250000
> 100%
Completed in: 4.384709042 sec

For comparison to the original raytrace, with a lot of abstraction and generalization:

coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace'
Colors: 250000
> 100%
Completed in: 8.318023583 sec

@lloydk

lloydk commented Jun 26, 2026

I was playing around a bit benchmarking the various gamut mapping methods from the command line and hit my 5hr Claude limit so I figured I'd share a couple interesting findings:

  • An implementation of the Bjorn method using the okhsl* functions from the texel library is 4.1x faster.
  • On my machine, calling to(color, OKLCH) when the color is already in oklch takes 94ns. Calling an optimized oklchToClippedP3() function takes 78ns. So there's potentially a lot of library overhead that could skew the results. Also, the cost of normalizing the result may dwarf the cost of the actual algorithm (e.g. clip).

Here's a snippet of what Claude said:

An optimized OKLCh→P3 call (given coords) is ~78ns as the methods actually use it — here's the breakdown:

┌───────────────────────────────────────────────────────────────┬─────────┐
│                                                               │ ns/call │
├───────────────────────────────────────────────────────────────┼─────────┤
│ oklchToClippedP3() as called (returns {space, coords, alpha}) │ 78ns    │
├───────────────────────────────────────────────────────────────┼─────────┤
│ same math, written to a reused buffer (no allocation)         │ 69ns    │
├───────────────────────────────────────────────────────────────┼─────────┤
│ same math, no gamma (clamped linear only)                     │ 52ns    │
└───────────────────────────────────────────────────────────────┴─────────┘

So the ~78ns splits into:
- ~52ns — the linear conversion: cos/sin, OKLab→LMS matrix, 3 cubes, LMS→linear-P3 matrix, clamp (the trig + two 3×3 multiplies are most of this)
- ~17ns — the gamma encode (3× Math.pow)
- ~9ns — allocating the result object + coords array

The striking part is the comparison with the input side:

- Generic to(color, OKLCH) (no-op, no math): ~90ns
- Optimized OKLCh→P3 (full real conversion: trig + 2 matrices + gamma + allocation): ~78ns

The library's do-nothing same-space dispatch costs more than doing an entire real color-space conversion by hand

@facelessuser

Contributor Author

Yes, I actually implemented Bjorn last night locally as well. I also permanently put a fast path for Ray Trace in that is optimized for OkLCh as the perceptual space (the default); it still has a little extra overhead, but close enough to give a reasonable comparison.

I benchmarked them all (I should note I'm comparing non-cached Cubic), and I also found Bjorn faster.

If I'm being honest, I still favor Cubic over Bjorn due to Cubic's simplicity, and it's easier to integrate Cubic in a color library with weird gamuts like ProPhoto that break Bjorn's approach. But for raw speed and the ability to work with CSS gamuts, they are both fine, and Bjorn is faster. Bjorn is optimized for 32-bit, but that seems to be what browsers are using anyway.

coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'clip'
Colors: 250000
> 100%
Completed in: 1.328450042 sec
coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'bjorn'
Colors: 250000
> 100%
Completed in: 3.14876375 sec
coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic'
Colors: 250000
> 100%
Completed in: 4.26570175 sec
coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace'
Colors: 250000
> 100%
Completed in: 5.079669584 sec

Cubic (Cached) is, of course, faster than Bjorn, which I expected.

coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic:{"cache": true}'
Colors: 250000
> 100%
Completed in: 3.040707875 sec

One thing I haven't really tested for is overcorrection. IIRC, Bjorn's approach could overcorrect chroma sometimes (clipping at the end handles any under correction). I haven't tested Cubic for any overcorrection, and I'm not sure if any of that matters, but I'll probably test it for the sake of my curiosity.

@facelessuser

Contributor Author

I was playing around a bit benchmarking the various gamut mapping methods from the command line and hit my 5hr Claude limit so I figured I'd share a couple interesting findings:

An implementation of the Bjorn method using the okhsl* functions from the texel library is 4.1x faster.
on my machine calling to(color, OKLCH) when the color is already in oklch takes 94ns. Calling an optimized oklchToClippedP3() function takes 78ns. So there's potentially a lot of library overhead that could skew the results. Also, the cost of normalizing the result may dwarf the cost of the actual algorithm (e.g. clip).

Yeah, it really does seem like API overhead in Color.js makes it hard to get a clear reading on speed.

@facelessuser

Contributor Author

Yeah, Bjorn does overcorrect. Cubic only slightly overcorrects, nothing I'd worry too much about. Ray Trace shows no notable overcorrection.

Bjorn Max OC =     0.0009773300725007816
Cubic Max OC =     1.4139043158500897e-08
Ray Trace Max OC = 3.155697925194545e-14

@facelessuser

facelessuser commented Jun 26, 2026

Contributor Author

More intense testing shows worst case overcorrection as:

Bjorn OC    =  0.012629226051878122
Cubic OC    =  3.324680198190488e-05
RayTrace OC =  5.522471369090454e-14
Clip OC     =  5.145939230288832e-14

I think Cubic is generally more accurate than Bjorn's, even if it's a little slower.

EDIT: I never mentioned what OC is measuring. A gamut-mapped color should have one channel that is either 0 or 1 if not overcorrected. That means it is on the gamut surface and not below. The OC value takes the min and max channels and compares them to 0 and 1, respectively. It then takes the smallest value (which is closest to its boundary) and uses that as the OC value. Then we just track the biggest value that occurred from all the colors tested. This is only applied if the original color was out of gamut.
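The OC measurement described above can be sketched roughly as follows. This is a reading of the description, not the actual test script: the function names and the `{ wasOutOfGamut, mapped }` record shape are assumptions, and `mapped` is taken to be the gamut-mapped color as RGB channels in the target space.

```javascript
// Distance of one mapped RGB color from the gamut surface: compare the
// lowest channel to 0 and the highest to 1, and keep the smaller gap.
function overcorrection(rgb) {
	const lowest = Math.min(...rgb);
	const highest = Math.max(...rgb);
	return Math.min(Math.abs(lowest - 0), Math.abs(1 - highest));
}

// Track the worst case over every color that started out of gamut.
function maxOvercorrection(results) {
	let worst = 0;
	for (const { wasOutOfGamut, mapped } of results) {
		if (wasOutOfGamut) {
			worst = Math.max(worst, overcorrection(mapped));
		}
	}
	return worst;
}

const worst = maxOvercorrection([
	{ wasOutOfGamut: true, mapped: [1, 0.5, 0.2] },    // on the surface: OC is 0
	{ wasOutOfGamut: true, mapped: [0.9, 0.5, 0.25] }, // about 0.1 inside the surface
	{ wasOutOfGamut: false, mapped: [0.5, 0.5, 0.5] }, // ignored: was already in gamut
]);
console.log(worst); // approximately 0.1
```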

@lloydk

lloydk commented Jun 27, 2026

Reading the comment for the cachedHueData() function I got the impression that there would be a maximum 3600 entries in the cache (0-360 in steps of 0.1) but that doesn't seem to be the case. From Codex:

You’re right. The original comment is misleading.

For hues in the normal 0..360 range, toPrecision(H, 4) does not mean fixed ~0.1deg buckets everywhere. It is 4 significant digits:

  • 0..1: 0.0001deg buckets
  • 1..10: 0.001deg buckets
  • 10..100: 0.01deg buckets
  • 100..360: 0.1deg buckets

So within 0..360, the original cache can hold roughly 30k distinct keys, not 3600. It is still bounded if inputs are already normalized into 0..360, but the comment’s
“collapses hues within ~0.1°” is only true for three-digit hues.

It is also not fully bounded for arbitrary hue values because the original does not normalize modulo 360. Hues like 0, 360, 720, 1080, etc. become separate cache keys
even though they are equivalent directions. So the safer statement is: it reduces arbitrary float precision, but it does not canonicalize equivalent hues, and its
effective bucket size varies by magnitude.

Is the code in this pull request correct or is the cache supposed to be capped at 3600 with each entry a multiple of 0.1?

@facelessuser

Contributor Author

@lloydk, I think you meant to post here: #41?

But yes, I think the current toPrecision logic does not store data per h.x since it is doing 4 digits, not 4 significant digits as it says. Nor is it doing one decimal place like what it seems to imply with the 0.1 statement.

The way it is now, 4 digits, it will store the most precise hue data between 0 - 1, with the least precise data between 100 - 360.

It should be noted that clamping hue to 0.1 does decrease lightness and hue preservation and overcorrection as you lose resolution. This may be acceptable, but it is something to note.
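The bucket sizes discussed above are easy to check with the native `Number.prototype.toPrecision`, which also keeps significant digits. This sketch assumes the library's `toPrecision(H, 4)` helper rounds the same way; `precisionKey` and `normalizeHue` are illustrative names.

```javascript
// Native toPrecision keeps significant digits, so the bucket width
// depends on the magnitude of the hue.
function precisionKey(hue) {
	return hue.toPrecision(4);
}

// Map any hue onto [0, 360) so equivalent directions compare equal.
function normalizeHue(hue) {
	return ((hue % 360) + 360) % 360;
}

console.log(precisionKey(5.4321));  // "5.432" (0.001 degree buckets)
console.log(precisionKey(54.321));  // "54.32" (0.01 degree buckets)
console.log(precisionKey(123.456)); // "123.5" (0.1 degree buckets)

// Equivalent directions only share a key after normalization.
console.log(precisionKey(720) === precisionKey(0));                             // false
console.log(precisionKey(normalizeHue(720)) === precisionKey(normalizeHue(0))); // true
```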

@lloydk

lloydk commented Jun 27, 2026

@lloydk, I think you meant to post here: #41?

Yeah, too many tabs open 😄

I'm creating a standalone repository that benchmarks oklch-cubic cached and edge seeker using optimized conversion and helper functions. The gamut mapping algorithms that have caches are very fast so I think that's the most realistic way to compare them.

I'm going to use a cache of 3600 hues at a step of 0.1 for oklch-cubic for now as the call to toPrecision() used to generate the cache key in the current implementation is fairly slow.

@facelessuser

Contributor Author

I'm creating a standalone repository that benchmarks oklch-cubic cached and edge seeker using optimized conversion and helper functions. The gamut mapping algorithms that have caches are very fast so I think that's the most realistic way to compare them.

Yep, makes sense.

I'm going to use a cache of 3600 hues at a step of 0.1 for oklch-cubic for now as the call to toPrecision() used to generate the cache key in the current implementation is fairly slow.

Yeah, I think what you want is just to call num.toFixed(1). toPrecision(num, 4) has a specific use case, and this just isn't the right place for it.
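As an alternative to a string key from `toFixed(1)`, a fixed 0.1° step can also be mapped straight to an integer slot, which gives exactly 3600 entries and avoids string formatting altogether. This is only a sketch of that idea; `hueIndex`, `STEPS_PER_DEGREE`, and `BUCKETS` are illustrative names.

```javascript
const STEPS_PER_DEGREE = 10;            // 0.1 degree buckets
const BUCKETS = 360 * STEPS_PER_DEGREE; // 3600 slots

function hueIndex(hue) {
	// Normalize into [0, 360) so 0, 360, 720, ... share one slot
	const normalized = ((hue % 360) + 360) % 360;
	// Rounding can land exactly on 3600 (e.g. 359.96), so wrap back to 0
	return Math.round(normalized * STEPS_PER_DEGREE) % BUCKETS;
}

console.log(hueIndex(123.44)); // 1234
console.log(hueIndex(359.96)); // 0
console.log(hueIndex(720));    // 0
```

The index can address a preallocated array of hue data directly. As noted earlier in the thread, snapping hues to 0.1° trades some accuracy for the bounded table.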

@facelessuser

Contributor Author

Yeah, if the option comes down to cached vs cached, it would be cool to see a fair comparison of the EdgeSeeker vs Cached Cubic. The good thing about Cubic is that if you don't want a cache, it's still pretty fast without the cache.

@lloydk

lloydk commented Jun 27, 2026

Yeah, if the option comes down to cached vs cached, it would be cool to see a fair comparison of the EdgeSeeker vs Cached Cubic. The good thing about Cubic is that if you don't want a cache, it's still pretty fast without the cache.

Assuming implementations highly optimized for specific gamuts I think the order is bjorn, raytrace, uncached cubic.

Also, cubic calls a lot of transcendental math functions, so for languages like Rust and C++ there's less code that can be optimized and you're relying on library implementations of those math functions.

I have a Rust implementation of both algorithms and the differences are surprising.

@lloydk

lloydk commented Jun 27, 2026

Here's a repository that benchmarks oklch-cubic (cached) and edge seeker using highly optimized color conversion and helper functions that should be almost entirely allocation free. All of the functions and benchmarks use coord arrays instead of color objects.

The data for the benchmark is the same default data that the benchmark app in this repository uses.

The benchmarks optionally check if the coords are in gamut, perform the gamut mapping algorithm and then convert the coords to p3. Clip just converts to p3 and clamps the values.

There are rust versions of the benchmarks in the rust directory of the repository.

Benchmarking is hard and the results depend on the JavaScript runtime (Node vs Bun), CPU architecture (e.g. ARM vs Intel), implementation language (e.g. JavaScript vs Rust) and probably a bunch of other stuff I'm forgetting.

Also, keep in mind that these are microbenchmarks and performance could be different in a real world application.
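For readers who want to reproduce this style of measurement, here is a minimal sketch of a "median ns/call over N passes" loop on plain arrays. It is not the code from the benchmark repository; `medianNsPerCall` is a hypothetical helper, and it assumes `fn` takes one input and returns a number.

```javascript
// Time `fn` over every input for several passes and report the median
// nanoseconds per call. `sink` accumulates results so they stay observable.
function medianNsPerCall(fn, inputs, passes = 25) {
	const perCall = [];
	let sink = 0;
	for (let pass = 0; pass < passes; pass++) {
		const start = performance.now();
		for (let i = 0; i < inputs.length; i++) {
			sink += fn(inputs[i]);
		}
		const elapsedMs = performance.now() - start;
		perCall.push((elapsedMs * 1e6) / inputs.length);
	}
	perCall.sort((a, b) => a - b);
	const mid = perCall.length >> 1;
	const median = perCall.length % 2
		? perCall[mid]
		: (perCall[mid - 1] + perCall[mid]) / 2;
	return { median, sink };
}

const hueInputs = Float64Array.from({ length: 36000 }, (_, i) => (i % 360) + 0.5);
const result = medianNsPerCall((h) => Math.cos((h * Math.PI) / 180), hueInputs);
console.log(result.median.toFixed(2), "ns/call");
```

Using the median rather than the mean keeps one slow pass (a GC pause or JIT tier change) from dominating the reported number.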

Here are the results on my Intel 9800X3D

Node

summary
  clip
   1.26x faster than oklch-cubic (cached)
   1.92x faster than oklch-cubic (cached, in-gamut check)
   2.16x faster than edge-seeker
   2.86x faster than edge-seeker (in-gamut check)

Bun

summary
  clip
   1.55x faster than oklch-cubic (cached)
   2.02x faster than oklch-cubic (cached, in-gamut check)
   3.15x faster than edge-seeker
   3.56x faster than edge-seeker (in-gamut check)

Rust

scalar Rust (median ns/call over 25 grid passes, fastest to slowest):
  clip                                  17.95 ns/call  (1.00× fastest)
  edge-seeker                           32.71 ns/call  (1.82× fastest)
  oklch-cubic (cached)                  46.56 ns/call  (2.59× fastest)
  edge-seeker (in-gamut check)          48.88 ns/call  (2.72× fastest)
  oklch-cubic (cached, in-gamut check)  58.37 ns/call  (3.25× fastest)

I was a bit surprised to see edge-seeker ahead of oklch-cubic in the Rust benchmark and the results on my Mac laptop were even more surprising...

@lloydk

lloydk commented Jun 27, 2026

Results on a MacBook Air M2

Node

summary
  clip
   1.1x faster than oklch-cubic (cached)
   1.74x faster than oklch-cubic (cached, in-gamut check)
   2.15x faster than edge-seeker
   2.85x faster than edge-seeker (in-gamut check)

Bun

summary
  clip
   1.59x faster than oklch-cubic (cached)
   2.03x faster than oklch-cubic (cached, in-gamut check)
   3.68x faster than edge-seeker
   4.18x faster than edge-seeker (in-gamut check)

Rust

scalar Rust (median ns/call over 25 grid passes, fastest to slowest):
  clip                                  12.62 ns/call  (1.00× fastest)
  oklch-cubic (cached)                  30.62 ns/call  (2.43× fastest)
  oklch-cubic (cached, in-gamut check)  44.35 ns/call  (3.52× fastest)
  edge-seeker                           58.87 ns/call  (4.67× fastest)
  edge-seeker (in-gamut check)          76.23 ns/call  (6.04× fastest)

@facelessuser

Contributor Author

This is probably the best data we have yet. It's nice to see EdgeSeeker against Cubic. I think it is a reminder that EdgeSeeker is doing more than you'd initially think. It has to look up the items in the table, lerp them, and make some curvature calculations. I had thought it was just looking up the value and lerping; I forgot about the curvature stuff.

@lloydk

lloydk commented Jun 27, 2026

This is probably the best data we have yet. It's nice to see EdgeSeeker against Cubic. I think it is a reminder that EdgeSeeker is doing more than you'd initially think. It has to look up the items in the table, lerp them, and make some curvature calculations. I had thought it was just looking up the value and lerping; I forgot about the curvature stuff.

The one caveat with the edge seeker numbers is that the lookup isn't optimized or allocation free. I had completely ignored edge seeker until yesterday because I assumed it was going to be the fastest, so I didn't look to see what kind of optimizations could be made.

I'll spend some time this weekend optimizing the existing edge seeker lookup to see if I can speed it up. If the table and lookup could be replaced by the same 3600 0.1 step hue array that oklch-cubic uses plus some kind of lerp (or maybe no lerp) then it might end up being as fast or faster than oklch-cubic. I'm not sure how that would affect the accuracy of the results and I don't know what kind of analysis would be needed to determine if the results would be acceptable.

@lloydk

lloydk commented Jun 27, 2026

I added an additional benchmark to my benchmark repository that randomizes the order of the data (using a seed so it's repeatable). There's a modest but noticeable difference in benchmark times in the JavaScript benchmarks. In the Rust benchmarks the differences are quite dramatic, especially with edge seeker (32.33ns vs 75.22ns).

── grid (H = 0..359 step 1, repeated per L) ── (median ns/call over 25 passes, fastest to slowest):
  clip                                  19.93 ns/call  (1.00× fastest)
  edge-seeker                           32.33 ns/call  (1.62× fastest)
  oklch-cubic (cached)                  46.11 ns/call  (2.31× fastest)
  edge-seeker (in-gamut check)          49.62 ns/call  (2.49× fastest)
  oklch-cubic (cached, in-gamut check)  57.53 ns/call  (2.89× fastest)

── random (stratified/jittered fractional H + L) ── (median ns/call over 25 passes, fastest to slowest):
  clip                                  29.21 ns/call  (1.00× fastest)
  oklch-cubic (cached)                  61.51 ns/call  (2.11× fastest)
  edge-seeker                           75.22 ns/call  (2.58× fastest)
  oklch-cubic (cached, in-gamut check)  77.34 ns/call  (2.65× fastest)
  edge-seeker (in-gamut check)          89.81 ns/call  (3.08× fastest)

Claude's analysis:

Why edge-seeker shows it the most

Its hot path is a ~10-iteration binary search over the 710-entry LUT, each iteration a data-dependent branch (hue < lut[mid] → left/right):

  • Grid: only 360 distinct hues, repeated 99× in order → the branch predictor memorizes every path → near-zero mispredicts.
  • Shuffled random: every search is a fresh unpredictable path → a mispredict on most of the ~10 iterations. ~10 × ~15–20 cycles ÷ ~5 GHz ≈ 30–40 ns — which matches the ~43 ns gap almost exactly.

So I think a fair benchmark should not iterate over its data in a predictable order unless that's what you want to measure.
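
One way to randomize the order repeatably is a seeded PRNG plus a Fisher-Yates shuffle. The sketch below is an assumption about how this could be done, not the code in the benchmark repository; it uses a mulberry32-style generator (a small, widely used 32-bit PRNG), and `shuffleInPlace` is a hypothetical helper name.

```javascript
// mulberry32-style seeded PRNG: returns a function producing floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the supplied random source.
function shuffleInPlace(items, rand) {
  for (let i = items.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [items[i], items[j]] = [items[j], items[i]];
  }
  return items;
}

// Usage: the same seed always yields the same order, so runs stay comparable.
const hues = Array.from({ length: 360 }, (_, h) => h);
shuffleInPlace(hues, mulberry32(12345));
```

Shuffling once before timing starts keeps the cost of the shuffle itself out of the measurement.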

lloydk commented Jun 28, 2026

I added an additional lookup table to bypass edge seeker's binary search, and this new version is now faster than oklch-cubic in most cases in my GMA benchmark.
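
The indexed version's code is not shown in this thread, so the following is only a guess at the general technique: a second table that maps each whole degree of hue to a starting position in the sorted LUT, followed by a short forward scan. `buildHueIndex` and `findSegmentIndexed` are hypothetical names, and the sketch assumes hues in [0, 360) and a LUT sorted in ascending order.

```javascript
// Hypothetical sketch: replace the binary search with a per-degree index.
// index[b] is the last LUT position whose hue is <= b degrees (or 0 if none).
function buildHueIndex(lutHues) {
  const index = new Uint16Array(360);
  let i = 0;
  for (let b = 0; b < 360; b++) {
    while (i + 1 < lutHues.length && lutHues[i + 1] <= b) i++;
    index[b] = i;
  }
  return index;
}

// Jump to the bucket's starting position, then scan forward within the bucket.
// Returns the index of the last entry whose hue is <= the query hue.
function findSegmentIndexed(lutHues, index, hue) {
  let i = index[Math.floor(hue)];
  while (i + 1 < lutHues.length && lutHues[i + 1] <= hue) i++;
  return i;
}
```

With roughly two LUT entries per degree, the forward scan is usually only a step or two, which trades the ~10 unpredictable branches of the binary search for one table read and a short loop.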

Here are the results on my Intel machine:

Node

summary
  clip
   1.28x faster than oklch-cubic (cached)
   1.38x faster than edge-seeker (indexed)
   1.99x faster than edge-seeker

summary
  clip (random hues)
   1.27x faster than edge-seeker (indexed) (random hues)
   1.46x faster than oklch-cubic (cached) (random hues)
   2.06x faster than edge-seeker (random hues)

Bun

  clip
   1.45x faster than oklch-cubic (cached)
   1.47x faster than edge-seeker (indexed)
   2.43x faster than edge-seeker

summary
  clip (random hues)
   1.37x faster than edge-seeker (indexed) (random hues)
   1.73x faster than oklch-cubic (cached) (random hues)
   2.63x faster than edge-seeker (random hues)

Rust

── grid (H = 0..359 step 1, repeated per L) ── (median ns/call over 25 passes, fastest to slowest):
  clip                   19.39 ns/call  (1.00× fastest)
  edge-seeker (indexed)  28.64 ns/call  (1.48× fastest)
  edge-seeker            31.83 ns/call  (1.64× fastest)
  oklch-cubic (cached)   53.74 ns/call  (2.77× fastest)

── random (stratified/jittered fractional H + L) ── (median ns/call over 25 passes, fastest to slowest):
  clip                   28.37 ns/call  (1.00× fastest)
  edge-seeker (indexed)  41.60 ns/call  (1.47× fastest)
  oklch-cubic (cached)   66.05 ns/call  (2.33× fastest)
  edge-seeker            75.20 ns/call  (2.65× fastest)

lloydk commented Jun 28, 2026

I'm going to pause my work on the benchmarks for a bit, as I'd like to save some of my weekly AI limit for other work. Eventually I'll add raytrace, bjorn, and oklch-cubic (no cache) to my benchmark, as I'm curious to see how the Rust implementations perform.

One thing I'd like to see is whether there's a way to eliminate the in-gamut check before running the oklch-cubic and edge seeker algorithms. If there's a way to just run the algorithms (without changing in-gamut colors), maybe by sacrificing a little bit of accuracy, I think that could be a big win.

@facelessuser

Contributor Author

If you just check the lightness and see whether the chroma is already less than the calculated max chroma, you don't have to check the gamut. Checking the gamut is likely the better choice when the color is already in gamut, but if skipping the check is faster on average, then bypassing it makes sense.
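
A minimal sketch of that shortcut, assuming the algorithm can compute a boundary chroma for a given lightness and hue. `mapChroma` and the `maxChroma(l, h)` callback are hypothetical names, and sending out-of-range lightness to black or white is an assumption about what "check the lightness" means here.

```javascript
// Hypothetical sketch: skip the explicit in-gamut test and rely on the
// computed max chroma instead. `color` is an OKLCh-like { l, c, h } object.
function mapChroma(color, maxChroma) {
  const { l, c, h } = color;
  // Assumed lightness handling: at or beyond the extremes there is no chroma.
  if (l <= 0) return { l: 0, c: 0, h };
  if (l >= 1) return { l: 1, c: 0, h };
  const limit = maxChroma(l, h);
  // Chroma already within the boundary: return the color unchanged.
  // Otherwise reduce chroma to the boundary, keeping lightness and hue.
  return c <= limit ? color : { l, c: limit, h };
}
```

Colors inside the boundary come back as the same object, so the only extra cost for them is the `maxChroma` call that the mapping would have needed anyway.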
