More fair comparison between algorithms #36
LeaVerou
left a comment
I think this should be part of the harness running these algorithms (see map.js), just like clipping the chroma to 0.4 which is now applied universally to all of them beforehand (I thought it was wasteful to start from whatever chroma since we know it won't be higher than that and some algos performed more poorly when significantly farther) and the clipping to P3 that happens to all of them after. Each algorithm should only include the bits that are different for that algorithm. We can (and do) measure these operations in the time taken, so where that code lives shouldn't affect the results.
Though no handling is going to be perfectly fair: if a GMA operates on oklch colors it will be faster if fed oklch colors (like the benchmark does), and if another operates in a different space, it will be faster when starting from that space...
|
There is a lot of overhead that some algorithms are subject to that others are not. All of the algorithms currently return whatever is convenient for them (and some what is not convenient for them). The only thing I've done here is ensure the input origin color space is also the output. This ensures they are all operating under the same requirements. Take the given input space and gamut map the color such that the output is the same space, but within the specified gamut. Now they are all subject to the same overhead and are operating on the same rules. This makes a fairer comparison and provides an apples-to-apples comparison. I personally feel that having these methods operate under the same requirements is reasonable, but if this is not desired, a correction for lightness would still be needed. |
|
Implementing the algorithm in ColorAide, to see how it performs with similar overhead as other approaches, the Cubic approach preserves chroma and hue better than RayTrace. It's not visually noticeable, but in raw numbers, it performs better. Running this on an image with over a million pixels with constantly changing, super saturated Rec. 2020 colors, and gamut mapping them to sRGB, the Cubic approach had a ~62.8% speedup over raytrace. Clip was still much faster, at least in our implementation. There are a variety of reasons why this could be. I assume this is likely because they are all sharing similar overhead. The hue cache seems fine when you are repeating the same color, but at least with a highly saturated rainbow in an image, the hue cache didn't provide a huge difference. Since I'd need to hold a hue cache for every linear RGB gamut, I ended up removing it, but I do cache the creation of the LMS to RGB matrix as I found that to be noticeable; this way, I always have that for a given linear RGB. If you need to reduce chroma in a linear RGB gamut using Oklab, it'd be hard to find a better approach it seems.
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'raytrace'
Pixels: 1048576
> 100%
Completed in: 7.003060333 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'oklch-cubic'
Pixels: 1048576
> 100%
Completed in: 3.657383834 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/apply_gamut.py -i test.jpg -o test-map4.jpg --gamut rec2020 --gmap 'clip'
Pixels: 1048576
> 100%
Completed in: 1.127206125 sec |
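For anyone comparing these timings, "speedup" can be expressed in several ways that give quite different numbers. The sketch below uses only the three timings printed above; the helper names are made up for illustration. The ~62.8% figure quoted above appears to correspond to the symmetric percent difference (that reading is an assumption); the same two runs are a 1.91x ratio, or 47.8% less wall time.

```javascript
// Timings copied from the three runs above (seconds for 1,048,576 pixels).
const timings = {
  raytrace: 7.003060333,
  "oklch-cubic": 3.657383834,
  clip: 1.127206125,
};

// How many times faster the fast run is than the slow run.
function speedupRatio(slow, fast) {
  return slow / fast;
}

// Fraction of the slow run's wall time that the fast run saves.
function timeSaved(slow, fast) {
  return (slow - fast) / slow;
}

// Symmetric percent difference: the gap relative to the mean of both timings.
function percentDifference(a, b) {
  return Math.abs(a - b) / ((a + b) / 2);
}

const slow = timings.raytrace;
const fast = timings["oklch-cubic"];
console.log(speedupRatio(slow, fast).toFixed(2));              // 1.91
console.log((timeSaved(slow, fast) * 100).toFixed(1));         // 47.8
console.log((percentDifference(slow, fast) * 100).toFixed(1)); // 62.8
```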
To clarify, I'm fine having the conversion! I was just saying the code to do that should be centralized, not something each algorithm author needs to remember to do. |
Also, fix OkLCh-Cubic's lightness issue
98ce882 to
9b1550e
Compare
|
Updated this stuff in a centralized place. Lightness should be handled in the Cubic approach now, and output normalization is done in one place. With or without this normalization, the Cubic approach kills it. |
|
The fact that clip isn't the fastest algorithm in the benchmarking app probably means that the comparison isn't as fair as it could be. I added oklch-cubic to my (unreleased) library and benchmarked the 35640 colors the gamut mapping benchmark app uses by default. I added the ability to disable the oklch-cubic cache and a raytrace algorithm that was optimized for P3, as my generic raytrace algorithm is unoptimized and needs some work. I ran the benchmarks under Node and Bun. Clip was the fastest and oklch-cubic with cache enabled was second fastest. With cache disabled oklch-cubic was second slowest. I'm a bit surprised that the cache didn't provide any benefit in coloraide. It would be interesting to see how caching vs no caching performs when benchmarking the same set of colors as this benchmarking app. I'll play around with my implementation a bit more to see if I can improve the no cache performance. |
|
I want to be clear, the cache did provide a benefit when your hue and gamut exactly match. My statement of what it tested is probably misleading. The test I did didn't hit exact hue matches often enough. It was a map of highly saturated Rec. 2020 colors. While it varied in oklch lightness and hue, it wasn't generated from the oklch perspective, but from Rec. 2020. It produced a lot of hues that weren't exact. I'm positive there are scenarios it would help a lot, but I'm not really wanting to keep a cache for every gamut: P3, sRGB, etc. All those caches add up. I'm handling more than the 3 CSS gamuts. |
|
When I tested things, what I ended up doing is basically treating Rec. 2020 as an HSL space, adjusting lightness and hue. I then converted them back to Rec. 2020. For all the colors I used max saturation. This ensured that the hues didn't all align with perfect OkLCh hues. That's why I didn't get the cache improvements. If I were to add hue_cache back, I'd probably have it share the cache across all the gamuts using an LRU cache with some max size. If you were processing lots of colors in a specific space, you'd fill your cache with that space and get a speed improvement. I probably wouldn't want it to expand infinitely, caching every sub hue: 20.0000000001, 20.0000000002, etc. I think having some sort of compromise is reasonable. The fact that not having the hue_cache doesn't bother me is a testament to the fact that it is super fast even without the cache. I find caching the linear LMS/RGB matrix more useful as I get the benefit on every hit. |
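The shared, size-capped LRU cache described above could look roughly like the sketch below. This is not ColorAide's or Color.js's code: `HueCache`, its `get` signature, and the `computeHueData` callback are all hypothetical names, and the eviction relies only on the fact that a JavaScript `Map` iterates in insertion order.

```javascript
// Hypothetical bounded LRU cache shared by every gamut (illustrative names only).
class HueCache {
  constructor(maxSize = 1024) {
    this.maxSize = maxSize;
    this.entries = new Map(); // Map keeps insertion order, oldest first
  }

  // computeHueData is a caller-supplied stand-in for the per-hue setup work.
  get(gamut, hue, computeHueData) {
    const key = `${gamut}:${hue}`;
    if (this.entries.has(key)) {
      // Hit: re-insert so this key becomes the most recently used.
      const value = this.entries.get(key);
      this.entries.delete(key);
      this.entries.set(key, value);
      return value;
    }
    // Miss: compute, store, and evict the least recently used entry if full.
    const value = computeHueData(gamut, hue);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      this.entries.delete(this.entries.keys().next().value);
    }
    return value;
  }
}
```

With a cap like this, processing many colors in one gamut fills the cache with that gamut's hues, while a stream of never-repeating hues (20.0000000001, 20.0000000002, ...) can only ever occupy `maxSize` slots.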
|
I will further note that I've been evaluating it in other gamuts like sRGB and Rec. 2020 (not just P3). I realize this relation may hold better for P3 than the others, but so far it seems to work ok for the others as well, though a clip may be required. I haven't yet found cases where P3 needs clipping, but I have found some for sRGB. |
|
I'm less convinced that it's faster than other methods without the cache. I think it depends on how well optimized the alternative methods are. |
|
Yeah, I'm not convinced it's faster than clipping. I wanted to drop it into my library where there is known overhead and consistency. And I'm skeptical about it being faster than the LUT, but I have real data showing it outperforms Raytrace, though Raytrace is flexible enough to be used with non-Oklab perceptual spaces, so it still has its niche. Right now I'm trying to properly evaluate its usage outside P3 as that's a very narrow case. |
LeaVerou
left a comment
First, this discussion should not be happening in a PR, these are very valuable results y'all!
And yes, it doesn't make sense that anything would be faster than clip, lol. That's clearly an artifact of the benchmark methodology.
But thinking about this some more, I've been wondering what would be a fair comparison. Is it really fair if all start and end with xRGB[1]? If a certain algorithm is faster for xRGB colors whereas another one is faster for OKLCH colors, that should be reflected in the data, not hidden by artificial conversions. I think it's ok to convert the output color, since that has to become xRGB at some point, but the input color shouldn't be converted any more than is necessary by the conversion algorithm. Instead we should have an option to customize how the colors to benchmark are generated, so that we can have separate comparisons for different color spaces. I can think of several different color generating schemes that would be useful to try, but we can start from one additional one that generates P3 colors guaranteed to be OOG. The easiest way would probably be to take the current generated colors and convert them to P3, non-timed, before feeding them to the conversion algos.
Approved as I can see the argument for having this in the meantime (and it includes the L fix).
As a bigger point, if we're measuring speed, none of these should be using the OOP API!
Footnotes
1. Using xRGB as shorthand for "an RGB color space"
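The suggestion above to convert the generated colors before feeding them to the algorithms, outside the timed region, can be sketched as a small harness. Everything here is an assumption for illustration: `timeMapper`, `prepare`, and `mapper` are hypothetical names, not functions from the benchmark app or Color.js, and the mapper is expected to take and return plain coordinate arrays.

```javascript
// Hypothetical harness: `mapper` and `prepare` are caller-supplied stand-ins.
function timeMapper(mapper, colors, { prepare = (c) => c, passes = 5 } = {}) {
  // Untimed: convert every input into the space this comparison starts from.
  const inputs = colors.map(prepare);

  let best = Infinity;
  let sink = 0; // consume results so the work cannot be optimized away
  for (let pass = 0; pass < passes; pass++) {
    const start = performance.now();
    for (const input of inputs) {
      sink += mapper(input)[0];
    }
    best = Math.min(best, performance.now() - start);
  }
  return { bestMs: best, sink };
}

// Usage with a stand-in "clip" mapper that clamps each coordinate to [0, 1].
const clip = (coords) => coords.map((v) => Math.min(1, Math.max(0, v)));
const sample = [[1.2, 0.5, -0.1], [0.3, 0.3, 0.3]];
console.log(timeMapper(clip, sample).bestMs >= 0); // true
```

Swapping in a different `prepare` per run gives separate comparisons per starting color space without charging any algorithm for the setup conversion.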
Normalizing
|
I was grappling with how this could be faster than clip, and even the LUT (on average), which spawned the whole digging into what is a "fair" test. Then I found the lightness bug, and I mainly wanted to find any flaws in the algorithm and get them patched before commenting further. I kind of also wanted to understand how well this applied to sRGB and Rec. 2020 (other CSS gamut map targets). It wasn't clear to me whether the observed relationship held true generally for all linear RGB spaces or just P3. I think once I feel more confident about what the algorithm really can and can't do, I'll comment more in the CSSWG thread. I have a habit of spouting my first gut impression, and then later walking back some of that once I understand things better 😅. I've seen enough to know roughly where it falls compared to some approaches. It's less complicated than Björn's approach and avoids iterations where RayTrace doesn't, so it doesn't surprise me that it beats those. I know LUT isn't just looking up chroma in a table; there is some work that occurs, and maybe in some cases it can be slower, but I'd still think that on average, it would be faster. I don't have enough data right now. |
|
While I think the cache does give a good performance boost in some situations, I do think an unbounded cache may not be a practical requirement. I imagine a bounded cache that drops the least hit entries in the cache when full may be more reasonable. Even if I end up implementing the cache (shared across gamuts or separate), I'd likely not implement them such that they'd be unbounded. |
|
I don't think the input and output spaces need to be the same. If one GMA is faster when run on oklch colors and another is faster when run on xRGB colors, that's useful data that should be known. It's not inconceivable that browsers may end up using different GMAs depending on the format, after all. OTOH, since the color needs to be displayed on a screen in the end, I think it's ok for the output space to be xRGB, but I'd also be fine with them returning it in any format. That also applies to any final clipping: it will be done by the screen hardware anyway, so no strong opinion on whether it should be considered part of the algorithm or not. That said, for non-screen media (e.g. printing) you could end up side-stepping xRGB altogether and convert to an entirely different color space, so perhaps xRGB shouldn't be so privileged. In general, I think instead of hiding each GMA's strengths and weaknesses, we should have multiple different ways of generating the colors via a URL parameter & corresponding form control.
As a bigger picture comment, I worry we may have lost sight of the end goal here, and have started treating this as a purely algorithmic problem. The goal is to find a GMA that would work well for CSS use cases. Similarly, the choice of oklch for the generator was not to favor any particular algorithm (oklch-cubic is not the only GMA that uses oklch), but based on the human factors of the actual use case. In practice, that's when authors get OOG: when using polar formats to tweak visual components (e.g. to create color palettes). When using RGB color formats, you know when you're OOG, so it mainly comes up when authors use e.g. P3 and the color is displayed in an sRGB device. But for such a small distance, clipping is not such a huge problem anyway. And once you're using oklch, generating representative patches by going over the range of L and H with a predefined step seemed like the obvious choice to cover the range of possibilities.
Don't get me wrong, I 100% agree we should have multiple ways to generate the patches. The more data, the better! But not all ways of generating colors are equally relevant to the end goal.
Because it felt wasteful to recompute a bunch of stuff that only depends on H and not L (and my hypothesis is that in real-world usage, hues would not vary that much, but that's just a hypothesis). I didn't want to cap its number of entries as that seemed less fair (depending on how you generate colors you'd get different results), but it could be capped to a certain number of digits of precision beyond which the C is the same anyway; I had computed that number before somewhere. It may also be worth having a version with the cache and a version without it. I'll try to add one.
From an architectural pov, we should decide if |
I somewhat disagree. It makes no sense to clip after clipping in a normalization. It doubles the overhead of clipping for no reason. If an algorithm doesn't require clipping after, it shouldn't clip, but if it does, it should be part of the algorithm. I think it makes sense to contain the real GMA logic. In the end though, I'll defer to what you want.
Yep, I think it's good to highlight an approach's strengths, but also its weaknesses.
I don't think it's unfair, but I'm also only offering it as a suggestion. I personally wouldn't implement it unbounded myself, but it's fine if people choose to. As someone who's worked in embedded programming, an unbounded cache simply doesn't sit well with me 🙂. I also think the algorithm is good with or without the cache. |
I'm confused — you say you disagree then you go on to agree 😅
Some suggestions about alternative ways to generate them? 🙂
I meant an LRU cache seemed unfair as e.g. you'd get different results if you generate the patches hue-first vs L-first. Whereas limiting by precision doesn't depend on ordering. |
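To illustrate the ordering point: if the cache key is the hue rounded to a fixed precision, the number of hue setups depends only on which hues occur, not on the order they arrive in. The sketch below is hypothetical (`hueKey` and `countHueSetups` are made-up names, and the cached value is a placeholder for real hue data); it counts setups for the same grid visited hue-first and lightness-first.

```javascript
// Hypothetical sketch: none of these names come from the benchmark app.
function hueKey(hue, decimals = 1) {
  const scale = 10 ** decimals;
  const slots = 360 * scale; // upper bound on the number of cache entries
  return ((Math.round(hue * scale) % slots) + slots) % slots; // wrap into [0, slots)
}

function countHueSetups(colors, decimals = 1) {
  const cache = new Map();
  let setups = 0;
  for (const [, , h] of colors) {
    const key = hueKey(h, decimals);
    if (!cache.has(key)) {
      cache.set(key, { hue: key / 10 ** decimals }); // placeholder for real hue data
      setups++;
    }
  }
  return setups;
}

// The same L/H grid at C = 0.4, generated in two different orders.
const hueFirst = [];
const lightnessFirst = [];
for (let h = 0; h < 360; h += 1) {
  for (let l = 0.1; l < 1; l += 0.1) hueFirst.push([l, 0.4, h]);
}
for (let l = 0.1; l < 1; l += 0.1) {
  for (let h = 0; h < 360; h += 1) lightnessFirst.push([l, 0.4, h]);
}
console.log(countHueSetups(hueFirst), countHueSetups(lightnessFirst)); // 360 360
```

A size-capped LRU, by contrast, can hit on every repeat in one ordering and miss on every repeat in the other once the grid has more distinct hues than the cap.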
It's quite possible that the performance difference between each gamut mapping method is library dependent. So for coloraide non cached oklch-cubic could very well be faster than raytrace while other libraries might show different results.
Isn't the primary CSS use case going to be mapping to sRGB or P3 or possibly Rec2020? I know at one point there was talk of mapping to Rec2020 and then clipping. If mapping to *RGB is the primary use case then I think the default output space for this benchmark should be *RGB.
I agree that the algorithms should contain the real GMA logic. Also, the current clip implementation isn't very efficient so algorithms that call |
No, cached is faster when hue is identical, I've tested this locally. The specific test I originally did generated Max saturated values in Rec. 2020 with various lightness which doesn't generate enough exact hue matches to show the additional gain. The cache only helps in this specific case to avoid hue data calculations. |
|
This demonstrates that the cache works, even in ColorAide. The first example is cached, the second is not.
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic', cache=True)"
20000 loops, best of 5: 12.4 usec per loop
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 -m timeit -s "from coloraide.everything import ColorAll as Color; c = Color('oklch(80% 0.3 320)')" "c.clone().fit('display-p3', method='oklch-cubic')"
20000 loops, best of 5: 17.3 usec per loop
Admittedly, performance will be library-dependent and language-dependent, but skipping the hue data calculation is universally faster. It only works, though, when the hue exactly matches an entry in the cache. |
Are both times faster than raytrace? |
Yes, Ray Trace has to iterate 4 times; there is no way it can beat OkLCh Cubic in speed. Cubic isn't extremely complicated, and it doesn't iterate with color conversions. Ray Trace is more flexible for swapping out other perceptual spaces. But it can't beat the Cubic approach in raw speed. Here's a test I coded up which plays to OkLCh Cubic's strengths when caching is enabled. Similar to what the Color.js benchmark does. Tests are Ray Trace, Cubic, Cached Cubic, Clip. Ray Trace can't keep up.
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "raytrace" -t oklch
Completed in: 33.679208042 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "oklch-cubic" -t oklch
Completed in: 18.136304 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m 'oklch-cubic:{"cache": true}' -t oklch
Completed in: 12.771840833 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "clip" -t oklch
Completed in: 6.156545917 sec
Now, here's one that generates maximum saturated colors in Rec. 2020. Here the hues won't play to Cached Cubic's strengths. Still faster than Ray Trace.
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "raytrace" -t rec2020
Completed in: 36.545193125 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "oklch-cubic" -t rec2020
Completed in: 20.638073625 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m 'oklch-cubic:{"cache": true}' -t rec2020
Completed in: 20.639316583 sec
➜ coloraide git:(feature/fit-oklch-cubic) ✗ python3 tools/benchmark_gma.py -g display-p3 -m "clip" -t rec2020
Completed in: 6.920228125 sec |
|
I think at this point, a lightness fix is already on main, and there appears to not quite be an agreement on how to make the tests more consistent. I think I will abandon this review for now, though I think further conversation would be useful. |
I asked Codex why I was seeing much bigger performance differences between cache vs no cache in my library than in the benchmark app. Here's what Codex said (Spectre is my library). Apologies for the horizontal scrolling, I'm too tired to reformat the output.
I ran the current Spectre benchmark and got the same shape:
clip 2.31 ms/iter
oklch-cubic 5.56 ms/iter = 2.41x clip
oklch-cubic cache disabled 13.89 ms/iter = 6.01x clip
In Spectre, bench-p3-gma.mjs is a tight raw-Vec3 loop over prebuilt samples and reused output buffers. The grid is also 100% out of gamut at C=0.4, so every sample
reaches the cubic solve path. The uncached path calls computeHueData() every time, which includes trig, coefficient setup, and six root-related solves per hue setup. The
cached path is warmed before measurement, so it is basically Map.get() for hue data.
In the app, the timed path includes much more wrapper cost: color.to("oklch").set(...), the mapper, then result.inGamut("p3") and possible clip in methods.js:45. With
prebuilt Color objects I measured:
app raw cached compute 1.473 us/call
app raw no-cache compute 1.927 us/call
app normalized cached 4.831 us/call
app normalized no-cache 5.186 us/call
So the cache delta is real in the app too, but after normalization it is only ~0.36 us on top of ~4.8 us. In Spectre, the cached generic path is only ~0.15 us/call in a
direct probe, so the same kind of hue setup cost dominates and turns into a ~2.5x cached-vs-uncached ratio.
Short version: Spectre removed most of the shared overhead, so the per-hue cache is exposed. The app measures through Color.js object conversion and post-normalization,
so the cache benefit is diluted.
and another comment from Codex: The absolute cache effect is similar. The ratio is different because app “raw compute” has a large fixed Color.js overhead.
From the app probe:
Color.to("oklch") only 1.459 us/call
cached.compute 1.508 us/call
nocache.compute 1.929 us/call
So app cached compute is basically dominated by color.to("oklch"). The cached solver work after that is tiny, around 0.05 us/call in this probe. No-cache adds about 0.42
us/call.
From the Spectre benchmark you quoted/current run:
oklch-cubic no-cache 13.89 ms / 35640 = 0.390 us/call
delta 0.234 us/call
So Spectre’s no-cache penalty is not bigger in absolute terms. It is actually smaller than the app probe’s absolute no-cache penalty. It just sits on top of a much
smaller cached baseline.
That is the crux:
app: 1.508 -> 1.929 us = 1.28x, because ~1.46 us is shared Color.js conversion overhead
Spectre: 0.156 -> 0.390 us = 2.50x, because the shared overhead has been stripped down
The app’s compute() starts with color = color.to("oklch"), which always goes through Color.js conversion/object machinery. Spectre’s benchmark passes plain [L, C, H]
arrays into a reusable output buffer. Once that overhead is removed, the hue-data cache becomes a much larger fraction of the total runtime.
So as mentioned earlier in this thread, don't use the OOP api for benchmarking. Switching to the function api should increase the delta between cache and no cache. Also, the more optimized a library is for performance the bigger the delta between cache and no cache. Note that I didn't verify any of the AI results but the analysis is more or less what I expected to see. |
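The per-call figures quoted above can be rechecked with a few lines of arithmetic. The sketch below uses only numbers that appear in the quoted output (35640 colors per iteration, 5.56 and 13.89 ms/iter for Spectre, 1.508 and 1.929 us/call for the app probe); the variable names are illustrative.

```javascript
const COLORS_PER_ITER = 35640; // default sample count quoted in this thread

// Convert a ms-per-iteration timing into microseconds per mapped color.
const usPerCall = (msPerIter) => (msPerIter * 1000) / COLORS_PER_ITER;

const spectreCached = usPerCall(5.56);    // oklch-cubic, cache enabled
const spectreNoCache = usPerCall(13.89);  // oklch-cubic, cache disabled
console.log(spectreCached.toFixed(3), spectreNoCache.toFixed(3)); // 0.156 0.390
console.log((spectreNoCache - spectreCached).toFixed(3));         // 0.234
console.log((spectreNoCache / spectreCached).toFixed(2));         // 2.50

// App probe: a similar absolute delta sits on top of a much larger baseline.
const appCached = 1.508;
const appNoCache = 1.929;
console.log((appNoCache - appCached).toFixed(3)); // 0.421
console.log((appNoCache / appCached).toFixed(2)); // 1.28
```

The absolute penalty is in the same range in both cases (about 0.23 vs 0.42 us/call); the ratio differs because the app's baseline includes roughly 1.46 us of shared conversion overhead.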
|
I just pushed a change that goes back to a single cubic GMA with no cache. As for the points above, I said it myself earlier that we should have never used the OOP API for this. |
That's more like what I expected! Just knowing what the algorithms are doing, clip should always be winning. Cache should also have a clear advantage when the test favors it. Doing less code should equal less time. Solving the cubic roots, even after skipping the hue data retrieval, should have still been more expensive than clip. |
|
Small note: the quadratic logic in the firstRoot function unnecessarily returns two roots for the double root case. If the discriminant is zero, only one needs to be returned. I guess you could throw out double if really close to zero, like what is done in the Cubic logic. Additionally, just like with the Cardano method for cubic roots, you can simplify the solution by dividing the coefficients by the first. I usually do this so I don't have to remember the more complicated equation that they drill into your heads when you're young.
c /= b; d /= b;
let m = -c * 0.5;
let disc = m * m - d;
if (disc > 0) {
    // Two real roots
    r0 = m + Math.sqrt(disc);
    r1 = m - Math.sqrt(disc);
}
else if (disc === 0) {
    // Double root
    r0 = m;
}
While I like to use this approach, you don't have to use it. I thought I'd share it for fun, though. Here's a video on the approach: https://www.youtube.com/watch?v=MHXO86wKeDY. My kid got in trouble using this method because his teacher claimed the discriminant was wrong, and while this approach produces a different discriminant, it is consistent with the approach and has the same relation to zero. She just didn't understand what he was doing 🙂 . |
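For reference, here is a self-contained version of the normalized quadratic described above. It is a sketch, not the PR's firstRoot: `solveQuadratic` and the `eps` tolerance for treating a near-zero discriminant as a double root are assumptions. For b·x² + c·x + d = 0, the reduced discriminant m² − d (after dividing by b) equals (c² − 4bd) / (4b²), so it always has the same sign as the textbook discriminant.

```javascript
// Hypothetical helper (not the PR's firstRoot): real roots of b*x^2 + c*x + d = 0, b != 0.
function solveQuadratic(b, c, d, eps = 1e-12) {
  // Normalize to x^2 + c*x + d = 0.
  c /= b;
  d /= b;
  const m = -c * 0.5;       // midpoint between the two roots
  const disc = m * m - d;   // reduced discriminant, same sign as c^2 - 4*b*d

  if (Math.abs(disc) <= eps) {
    return [m];             // double root: one value, no sqrt needed
  }
  if (disc < 0) {
    return [];              // no real roots
  }
  const s = Math.sqrt(disc); // computed once and reused for both roots
  return [m - s, m + s];
}

console.log(solveQuadratic(1, -3, 2)); // [ 1, 2 ]
console.log(solveQuadratic(2, -4, 2)); // [ 1 ]
console.log(solveQuadratic(1, 0, 1));  // []
```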
|
To be fair, firstRoot evaluates all roots at the end, regardless, even if they are still infinity, so it probably doesn't really gain you anything if you account for double roots or not. I just looked at it as a "correctness" thing 🤷🏻. EDIT: Don't know why I said this, you avoid doing two sqrt calls. |
|
I asked Claude to change the benchmark program to measure each algorithm in one shot (locking up the ui). Before the changes no cache seemed to outperform cached. After the changes the no cache version was about 1.13x slower. My guess is that there is some kind of measurement error that just happens to benefit the no cache version. Also the Spectre mitigation for The benchmark app does not appear to have cross-origin isolation enabled. This means |
That's interesting. I didn't know that. |
|
Ok check out #41 |
Cross-origin isolation is enabled |
|
Here is a more fair comparison in ColorAide. I've been comparing Cubic against a very generalized version of Ray Trace that handles any perceptual space (other than Oklab) and handles other generic cases as well. But when using an idealized case that handles just Oklab and just linear RGB spaces, these are the comparisons I get. So, Cubic is still faster, as I suspected, but a lot closer than my earlier metrics showed.
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace-oklch'
Colors: 250000
> 100%
Completed in: 4.91770125 sec
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic'
Colors: 250000
> 100%
Completed in: 4.384709042 sec
For comparison to the original raytrace, with a lot of abstraction and generalization:
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace'
Colors: 250000
> 100%
Completed in: 8.318023583 sec |
|
I was playing around a bit benchmarking the various gamut mapping methods from the command line and hit my 5hr Claude limit so I figured I'd share a couple interesting findings:
Here's a snippet of what Claude said: An optimized OKLCh→P3 call (given coords) is ~78ns as the methods actually use it — here's the breakdown:
┌───────────────────────────────────────────────────────────────┬─────────┐
│ │ ns/call │
├───────────────────────────────────────────────────────────────┼─────────┤
│ oklchToClippedP3() as called (returns {space, coords, alpha}) │ 78ns │
├───────────────────────────────────────────────────────────────┼─────────┤
│ same math, written to a reused buffer (no allocation) │ 69ns │
├───────────────────────────────────────────────────────────────┼─────────┤
│ same math, no gamma (clamped linear only) │ 52ns │
└───────────────────────────────────────────────────────────────┴─────────┘
So the ~78ns splits into:
- ~52ns — the linear conversion: cos/sin, OKLab→LMS matrix, 3 cubes, LMS→linear-P3 matrix, clamp (the trig + two 3×3 multiplies are most of this)
- ~17ns — the gamma encode (3× Math.pow)
- ~9ns — allocating the result object + coords array
The striking part is the comparison with the input side:
- Generic to(color, OKLCH) (no-op, no math): ~90ns
- Optimized OKLCh→P3 (full real conversion: trig + 2 matrices + gamma + allocation): ~78ns
The library's do-nothing same-space dispatch costs more than doing an entire real color-space conversion by hand
|
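To make the "written to a reused buffer" row in the breakdown above concrete, here is a hand-written conversion in the same spirit. It is a sketch with loud caveats: it targets sRGB rather than P3 (the OKLab-to-linear-sRGB coefficients are the ones published by Björn Ottosson; the P3 matrix is not reproduced here), and `oklchToClippedSrgb` is a made-up name, not the `oklchToClippedP3()` that was measured.

```javascript
const clamp01 = (v) => Math.min(1, Math.max(0, v));

// sRGB transfer function (gamma encode) for a linear value in [0, 1].
const encodeSrgb = (v) =>
  v <= 0.0031308 ? 12.92 * v : 1.055 * Math.pow(v, 1 / 2.4) - 0.055;

// Sketch only: OKLCh -> clamped, gamma-encoded sRGB, written into `out`.
function oklchToClippedSrgb(L, C, H, out = [0, 0, 0]) {
  const h = (H * Math.PI) / 180;
  const a = C * Math.cos(h);
  const b = C * Math.sin(h);

  // OKLab -> nonlinear LMS, then cube to get linear LMS.
  const l = (L + 0.3963377774 * a + 0.2158037573 * b) ** 3;
  const m = (L - 0.1055613458 * a - 0.0638541728 * b) ** 3;
  const s = (L - 0.0894841775 * a - 1.2914855480 * b) ** 3;

  // Linear LMS -> linear sRGB, clamp to the gamut cube, then gamma encode.
  out[0] = encodeSrgb(clamp01(4.0767416621 * l - 3.3077115913 * m + 0.2309699292 * s));
  out[1] = encodeSrgb(clamp01(-1.2684380046 * l + 2.6097574011 * m - 0.3413193965 * s));
  out[2] = encodeSrgb(clamp01(-0.0041960863 * l - 0.7034186147 * m + 1.7076147010 * s));
  return out;
}

// Reusing one buffer avoids allocating a result array per call.
const buffer = [0, 0, 0];
oklchToClippedSrgb(0.7, 0.4, 150, buffer);
console.log(buffer.every((v) => v >= 0 && v <= 1)); // true
```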
|
Yes, I actually implemented Bjorn last night locally as well. I also permanently put a fast path for Ray Trace in that is optimized for OkLCh as the perceptual space (the default); it still has a little extra overhead, but close enough to give a reasonable comparison. I benchmarked them all (I should note I'm comparing non-cached Cubic), and I also found Bjorn faster. If I'm being honest, I still favor Cubic over Bjorn due to Cubic's simplicity, and it's easier to integrate Cubic in a color library with weird gamuts like ProPhoto that break Bjorn's approach. But for raw speed and the ability to work with CSS gamuts, they are both fine, and Bjorn is faster. Bjorn is optimized for 32-bit, but that seems to be what browsers are using anyway.
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'clip'
Colors: 250000
> 100%
Completed in: 1.328450042 sec
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'bjorn'
Colors: 250000
> 100%
Completed in: 3.14876375 sec
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic'
Colors: 250000
> 100%
Completed in: 4.26570175 sec
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'raytrace'
Colors: 250000
> 100%
Completed in: 5.079669584 sec
Cubic (Cached) is, of course, faster than Bjorn, which I expected.
➜ coloraide git:(main) ✗ python3 tools/benchmark_gma.py -t oklch -m 'oklch-cubic:{"cache": true}'
Colors: 250000
> 100%
Completed in: 3.040707875 sec
One thing I haven't really tested for is overcorrection. IIRC, Bjorn's approach could overcorrect chroma sometimes (clipping at the end handles any under correction). I haven't tested Cubic for any overcorrection, and I'm not sure if any of that matters, but I'll probably test it for the sake of my curiosity. |
Yeah, it really does seem like API overhead in Color.js makes it hard to get a clear reading on speed. |
|
Yeah, Bjorn does overcorrect. Cubic only slightly overcorrects, nothing I'd worry too much about. Ray Trace has no notable overcorrection. |
|
More intense testing shows worst case overcorrection as: I think Cubic is generally more accurate than Bjorn's, even if it's a little slower. EDIT: I never mentioned what OC is measuring. A gamut-mapped color should have one channel that is either 0 or 1 if not overcorrected. That means it is on the gamut surface and not below. The OC value takes the min and max channels and compares them to 0 and 1, respectively. It then takes the smallest value (which is closest to its boundary) and uses that as the OC value. Then we just track the biggest value that occurred from all the colors tested. This is only applied if the original color was out of gamut. |
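The OC measurement described in the EDIT above can be sketched as follows. This is a reading of that description, not the code that produced the numbers: `overcorrection`, `worstOvercorrection`, and the `inGamut` callback are hypothetical names, and coordinates are assumed to be RGB values in the target gamut.

```javascript
// Distance of a mapped color from the gamut surface: 0 means some channel sits on 0 or 1.
function overcorrection(mapped) {
  const aboveZero = Math.min(...mapped);     // how far the lowest channel is from 0
  const belowOne = 1 - Math.max(...mapped);  // how far the highest channel is from 1
  return Math.min(aboveZero, belowOne);      // the channel closest to its boundary
}

// Track the largest OC over all colors that were out of gamut before mapping.
function worstOvercorrection(pairs, inGamut) {
  let worst = 0;
  for (const { original, mapped } of pairs) {
    if (inGamut(original)) continue; // in-gamut inputs are not expected to touch the surface
    worst = Math.max(worst, overcorrection(mapped));
  }
  return worst;
}

// Usage with a stand-in gamut check on RGB coordinates.
const inUnitCube = (rgb) => rgb.every((v) => v >= 0 && v <= 1);
const pairs = [
  { original: [1.2, 0.5, 0.2], mapped: [1.0, 0.52, 0.25] },  // on the surface
  { original: [1.1, 0.4, -0.1], mapped: [0.98, 0.4, 0.03] }, // pulled slightly too far in
  { original: [0.5, 0.5, 0.5], mapped: [0.5, 0.5, 0.5] },    // ignored: already in gamut
];
console.log(worstOvercorrection(pairs, inUnitCube).toFixed(2)); // 0.02
```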
|
Reading the comment for the
Is the code in this pull request correct or is the cache supposed to be capped at 3600 with each entry a multiple of 0.1? |
|
@lloydk, I think you meant to post here: #41? But yes, I think the current The way it is now, 4 digits, it will store the most precise hue data between 0 - 1, with the least precise data between 100 - 360. It should be noted that clamping hue to 0.1 does decrease lightness and hue preservation and overcorrection as you lose resolution. This may be acceptable, but it is something to note. |
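The resolution trade-off mentioned above can be seen by comparing two ways of keying a hue cache. This sketch assumes that "4 digits" means four significant digits, as `Number.prototype.toPrecision(4)` would give; that is an assumption, since the code being discussed is not quoted here. The function names are illustrative.

```javascript
// Assumption: "4 digits" = four significant digits, so resolution shrinks as hue grows.
const bySignificantDigits = (hue) => Number(hue.toPrecision(4));

// Fixed 0.1 step: uniform resolution and at most 3600 distinct slots.
const byFixedStep = (hue) => (Math.round(hue * 10) % 3600) / 10;

for (const hue of [0.123456, 12.3456, 123.456]) {
  console.log(hue, bySignificantDigits(hue), byFixedStep(hue));
}
// 0.123456 0.1235 0.1
// 12.3456 12.35 12.3
// 123.456 123.5 123.5
```

Under that assumption, hues below 1 keep four decimal places while hues above 100 keep only one, whereas the fixed step treats the whole hue circle uniformly and bounds the table at 3600 entries.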
|
Yeah, too many tabs open 😄 I'm creating a standalone repository that benchmarks oklch-cubic cached and edge seeker using optimized conversion and helper functions. The gamut mapping algorithms that have caches are very fast so I think that's the most realistic way to compare them. I'm going to use a cache of 3600 hues at a step of 0.1 for oklch-cubic for now as the call to |
Yep, makes sense.
Yeah, I think what you want is just to call |
|
Yeah, if the option comes down to cached vs cached, it would be cool to see a fair comparison of the EdgeSeeker vs Cached Cubic. The good thing about Cubic is that if you don't want a cache, it's still pretty fast without the cache. |
Assuming implementations highly optimized for specific gamuts I think the order is bjorn, raytrace, uncached cubic. Also, cubic calls a lot of transcendental math functions so for languages like rust and c++ there's less code that can be optimized and you're relying on library implementations of those math functions. I have a rust implementation of both algorithms and the differences are surprising. |
|
Here's a repository that benchmarks oklch-cubic (cached) and edge seeker using highly optimized color conversion and helper functions that should be almost entirely allocation free. All of the functions and benchmarks use coord arrays instead of color objects. The data for the benchmark is the same default data that the benchmark app in this repository uses. The benchmarks optionally check if the coords are in gamut, perform the gamut mapping algorithm and then convert the coords to p3. Clip just converts to p3 and clamps the values. There are rust versions of the benchmarks in the
Benchmarking is hard and the results depend on the Javascript runtime (Node vs Bun), CPU architecture (e.g. ARM vs Intel), implementation language (e.g. Javascript vs Rust) and probably a bunch of other stuff I'm forgetting. Also, keep in mind that these are microbenchmarks and performance could be different in a real world application.
Here are the results on my Intel 9800X3D
Node summary
clip
1.26x faster than oklch-cubic (cached)
1.92x faster than oklch-cubic (cached, in-gamut check)
2.16x faster than edge-seeker
2.86x faster than edge-seeker (in-gamut check)
Bun summary
clip
1.55x faster than oklch-cubic (cached)
2.02x faster than oklch-cubic (cached, in-gamut check)
3.15x faster than edge-seeker
3.56x faster than edge-seeker (in-gamut check)
Rust
scalar Rust (median ns/call over 25 grid passes, fastest to slowest):
clip 17.95 ns/call (1.00× fastest)
edge-seeker 32.71 ns/call (1.82× fastest)
oklch-cubic (cached) 46.56 ns/call (2.59× fastest)
edge-seeker (in-gamut check) 48.88 ns/call (2.72× fastest)
oklch-cubic (cached, in-gamut check) 58.37 ns/call (3.25× fastest)
I was a bit surprised to see edge-seeker ahead of oklch-cubic in the Rust benchmark and the results on my Mac laptop were even more surprising... |
|
Results on a MacBook Air M2

Node
summary
  clip
    1.1x faster than oklch-cubic (cached)
    1.74x faster than oklch-cubic (cached, in-gamut check)
    2.15x faster than edge-seeker
    2.85x faster than edge-seeker (in-gamut check)

Bun
summary
  clip
    1.59x faster than oklch-cubic (cached)
    2.03x faster than oklch-cubic (cached, in-gamut check)
    3.68x faster than edge-seeker
    4.18x faster than edge-seeker (in-gamut check)

Rust
scalar Rust (median ns/call over 25 grid passes, fastest to slowest):
  clip                                  12.62 ns/call  (1.00× fastest)
  oklch-cubic (cached)                  30.62 ns/call  (2.43× fastest)
  oklch-cubic (cached, in-gamut check)  44.35 ns/call  (3.52× fastest)
  edge-seeker                           58.87 ns/call  (4.67× fastest)
  edge-seeker (in-gamut check)          76.23 ns/call  (6.04× fastest) |
|
This is probably the best data we have yet. It's nice to see EdgeSeeker against Cubic. I think it is a reminder that EdgeSeeker is doing more than you'd initially think. It has to look up the items in the table, lerp them, and make some curvature calculations. I had thought it was just looking up the value and lerping; I forgot about the curvature stuff. |
The one caveat with the edge seeker numbers is that the lookup isn't optimized or allocation free. I had completely ignored edge seeker until yesterday because I assumed it was going to be the fastest, so I didn't look to see what kind of optimizations could be made. I'll spend some time this weekend optimizing the existing edge seeker lookup to see if I can speed it up. If the table and lookup could be replaced by the same 3600-entry, 0.1° step hue array that oklch-cubic uses, plus some kind of lerp (or maybe no lerp), then it might end up being as fast as or faster than oklch-cubic. I'm not sure how that would affect the accuracy of the results, and I don't know what kind of analysis would be needed to determine if the results would be acceptable. |
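To make the "3600-entry hue array plus lerp" idea concrete, here is a rough sketch. It is not the oklch-cubic or edge seeker code; `buildHueTable`, `lookupLerp` and the `valueAtHue` callback are hypothetical names, and what gets stored per hue (cusp chroma, a curve parameter, etc.) is left open.

```javascript
const STEPS = 3600; // one entry per 0.1 degree of hue

// Precompute a value for every 0.1 degree step. `valueAtHue` is a
// hypothetical callback (hue in degrees -> number).
function buildHueTable(valueAtHue) {
  const table = new Float64Array(STEPS);
  for (let i = 0; i < STEPS; i++) table[i] = valueAtHue(i / 10);
  return table;
}

// Direct index into the table, then linear interpolation between the two
// neighbouring entries. No search is needed.
function lookupLerp(table, hue) {
  const h = ((hue % 360) + 360) % 360; // wrap into [0, 360)
  const pos = h * 10;
  const i = Math.floor(pos);
  const t = pos - i;
  const a = table[i % STEPS];
  const b = table[(i + 1) % STEPS]; // wraps from the last entry back to hue 0
  return a + (b - a) * t;
}
```

Dropping the lerp would just mean returning `table[Math.round(pos) % STEPS]`. Whether either variant is accurate enough is exactly the open question above; this sketch says nothing about that.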
|
I added an additional benchmark to my benchmark repository that randomizes the order of the data (using a seed so it's repeatable). There's a modest but noticeable difference in benchmark times in the JavaScript benchmarks. In the Rust benchmarks the differences are quite dramatic, especially with edge seeker (32.33ns vs 75.22ns).

── grid (H = 0..359 step 1, repeated per L) ──
(median ns/call over 25 passes, fastest to slowest):
  clip                                  19.93 ns/call  (1.00× fastest)
  edge-seeker                           32.33 ns/call  (1.62× fastest)
  oklch-cubic (cached)                  46.11 ns/call  (2.31× fastest)
  edge-seeker (in-gamut check)          49.62 ns/call  (2.49× fastest)
  oklch-cubic (cached, in-gamut check)  57.53 ns/call  (2.89× fastest)

── random (stratified/jittered fractional H + L) ──
(median ns/call over 25 passes, fastest to slowest):
  clip                                  29.21 ns/call  (1.00× fastest)
  oklch-cubic (cached)                  61.51 ns/call  (2.11× fastest)
  edge-seeker                           75.22 ns/call  (2.58× fastest)
  oklch-cubic (cached, in-gamut check)  77.34 ns/call  (2.65× fastest)
  edge-seeker (in-gamut check)          89.81 ns/call  (3.08× fastest)

Claude's analysis:
So I think a fair benchmark should not iterate over its data in a predictable order unless that's what you want to measure. |
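As an illustration of "randomized but repeatable", here is one way to shuffle benchmark data with a seed in JavaScript. This is only a sketch and not necessarily what the benchmark repository does (its random set is described as stratified/jittered, which this is not). It uses the small mulberry32 PRNG with a Fisher-Yates shuffle; `seededShuffle` is a made-up name.

```javascript
// mulberry32: a small seeded 32-bit PRNG that returns floats in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle driven by the seeded PRNG. Returns a new array and
// leaves the input untouched, so the grid-ordered data can still be benchmarked.
function seededShuffle(items, seed) {
  const out = items.slice();
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}
```

Running the same algorithm on both `data` and `seededShuffle(data, 42)` gives the grid-vs-random comparison, and the fixed seed keeps the random run identical between invocations.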
|
I added an additional lookup table to bypass the binary search of edge seeker and this new version is now faster than oklch-cubic in most cases in my gma benchmark. Here are the results on my Intel machine:

Node
summary
  clip
    1.28x faster than oklch-cubic (cached)
    1.38x faster than edge-seeker (indexed)
    1.99x faster than edge-seeker
summary
  clip (random hues)
    1.27x faster than edge-seeker (indexed) (random hues)
    1.46x faster than oklch-cubic (cached) (random hues)
    2.06x faster than edge-seeker (random hues)

Bun
  clip
    1.45x faster than oklch-cubic (cached)
    1.47x faster than edge-seeker (indexed)
    2.43x faster than edge-seeker
summary
  clip (random hues)
    1.37x faster than edge-seeker (indexed) (random hues)
    1.73x faster than oklch-cubic (cached) (random hues)
    2.63x faster than edge-seeker (random hues)

Rust
── grid (H = 0..359 step 1, repeated per L) ──
(median ns/call over 25 passes, fastest to slowest):
  clip                   19.39 ns/call  (1.00× fastest)
  edge-seeker (indexed)  28.64 ns/call  (1.48× fastest)
  edge-seeker            31.83 ns/call  (1.64× fastest)
  oklch-cubic (cached)   53.74 ns/call  (2.77× fastest)

── random (stratified/jittered fractional H + L) ──
(median ns/call over 25 passes, fastest to slowest):
  clip                   28.37 ns/call  (1.00× fastest)
  edge-seeker (indexed)  41.60 ns/call  (1.47× fastest)
  oklch-cubic (cached)   66.05 ns/call  (2.33× fastest)
  edge-seeker            75.20 ns/call  (2.65× fastest) |
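For readers unfamiliar with the trick, here is a generic sketch of how an index table can bypass a binary search over sorted hue breakpoints. It is not the actual edge seeker (indexed) code; `binarySearchSegment`, `buildIndex`, `indexedSegment` and the 360-bucket size are assumptions made for illustration.

```javascript
// Baseline: index of the last breakpoint <= h in a sorted array of hues.
function binarySearchSegment(hues, h) {
  let lo = 0;
  let hi = hues.length - 1;
  while (lo < hi) {
    const mid = (lo + hi + 1) >> 1;
    if (hues[mid] <= h) lo = mid;
    else hi = mid - 1;
  }
  return lo;
}

// Precompute, for the start of each hue bucket, which segment it falls in.
function buildIndex(hues, buckets = 360) {
  const index = new Uint16Array(buckets);
  const width = 360 / buckets;
  for (let b = 0; b < buckets; b++) {
    index[b] = binarySearchSegment(hues, b * width);
  }
  return index;
}

// O(1) jump to the bucket's segment, then a short forward scan in case a
// breakpoint falls inside the bucket. Expects h in [0, 360).
function indexedSegment(hues, index, h) {
  const width = 360 / index.length;
  const bucket = Math.min(index.length - 1, Math.floor(h / width));
  let i = index[bucket];
  while (i + 1 < hues.length && hues[i + 1] <= h) i++;
  return i;
}
```

The index costs a little memory per gamut, but it replaces the data-dependent branches of the binary search with one array read and, usually, zero or one scan steps. That would be consistent with the indexed version gaining the most on the random-hue runs.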
|
I'm going to pause my work on the benchmarks for a bit as I'd like to save some of my weekly AI limit for other work. Eventually I'll add raytrace, bjorn and oklch-cubic (no cache) to my benchmark, as I'm curious to see how the Rust implementations perform. One thing I'd like to see is if there's a way to eliminate in-gamut checking before running the oklch-cubic and edge seeker algorithms. If there's a way to just run the algorithms (without changing in-gamut colors), maybe by sacrificing a little bit of accuracy, I think that could be a big win. |
|
If you just check the lightness and see if the chroma is already less than the calculated max chroma, you don't have to check the gamut. Checking the gamut is likely better if you are already in the gamut, but if not, and it is faster on average to skip the check, then it makes sense that it might be useful to bypass it. |
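A sketch of that shortcut, under the assumption that the algorithm can produce a maximum in-gamut chroma for a given lightness and hue. `reduceChroma` and `maxChromaFor` are hypothetical names, the toy boundary below is not a real gamut, and mapping out-of-range lightness to black or white is my assumption about what "check the lightness" means here.

```javascript
// `maxChromaFor(l, h)` is a hypothetical callback returning the largest
// in-gamut chroma for that lightness and hue (however the GMA computes it).
function reduceChroma([l, c, h], maxChromaFor) {
  // Lightness check first: at or beyond black/white there is no chroma left.
  if (l <= 0) return [0, 0, h];
  if (l >= 1) return [1, 0, h];
  const cMax = maxChromaFor(l, h);
  // Chroma already within the boundary: the color is returned unchanged,
  // without ever converting to RGB to test the gamut.
  return c <= cMax ? [l, c, h] : [l, cMax, h];
}

// Toy boundary for demonstration only: a triangle peaking at L = 0.5.
const toyMaxChroma = (l, _h) => 0.2 * (1 - Math.abs(2 * l - 1));
```

With the toy boundary, `reduceChroma([0.5, 0.1, 30], toyMaxChroma)` comes back unchanged while `reduceChroma([0.5, 0.4, 30], toyMaxChroma)` has its chroma reduced to the boundary. The cost is that in-gamut colors now pay for the max chroma calculation, which is the trade-off described above.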
Ensure that every gamut mapping method returns the color in the origin space consistently for comparison.
Also, ensure that oklch-cubic checks lightness when gamut mapping.