AndyNemmity · 2026-05-10T20:02:31 1778443351

I really do run a/b tests. I really do test, and validate.

I do not believe me giving you that information is honest. If I do, I am pretending that you will get the same experience.

Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.

It is not honest to me to give you confidence in it, when no one can be confident in it.

AndyNemmity · 2026-05-10T16:01:27 1778428887

Define obviously validation? What is the signal that tells you one is reasonable vs another?

I find the only way to do that is to look at it; if it passes a visual check, try it, and then a/b test whether it's any better than going without it.

theptip · 2026-05-10T16:30:42 1778430642

Some sort of eval. Eg TermBench, implemented in Harbor.

It’s an insane amount of effort to build shareable, reusable, comprehensive evals, hence why almost all skills are stuck in the “vibes” phase.

That said I think it’s quite easy to skim/intuit these sorts of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly you can construct a cheap personal eval that is a lot easier to maintain, and use it to compare a new skill/plugin. Something like “please write a paper on <my personal unpublished thesis>” is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple of times and watching the failure modes.
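A cheap personal eval of that kind could look roughly like the sketch below. It is only an illustration: `fake_vanilla`, `fake_with_skill`, and `has_references` are hypothetical stand-ins for real agent calls and a real pass/fail check, and none of it comes from TermBench or Harbor.

```python
from typing import Callable, Sequence

Runner = Callable[[str], str]  # prompt -> model output
Check = Callable[[str], bool]  # output -> did it pass?


def pass_rate(run: Runner, check: Check, prompts: Sequence[str], trials: int = 3) -> float:
    """Fraction of (prompt, trial) runs whose output passes the check."""
    outcomes = [check(run(prompt)) for prompt in prompts for _ in range(trials)]
    return sum(outcomes) / len(outcomes)


def compare(vanilla: Runner, with_skill: Runner, check: Check,
            prompts: Sequence[str], trials: int = 3) -> dict[str, float]:
    """Run both setups on the same prompts and report their pass rates."""
    return {
        "vanilla": pass_rate(vanilla, check, prompts, trials),
        "with_skill": pass_rate(with_skill, check, prompts, trials),
    }


# Hypothetical stand-ins: replace these with real calls to your agent.
def fake_vanilla(prompt: str) -> str:
    return f"Draft for: {prompt}"


def fake_with_skill(prompt: str) -> str:
    return f"Draft for: {prompt}\n\nReferences:\n[1] placeholder"


# Hypothetical check: encode whatever failure mode you keep seeing.
def has_references(output: str) -> bool:
    return "References:" in output


if __name__ == "__main__":
    prompts = ["please write a paper on <my personal unpublished thesis>"]
    print(compare(fake_vanilla, fake_with_skill, has_references, prompts))
```

Running each prompt several times matters because a real agent is not deterministic; the stand-ins here are, so they only show the shape of the harness.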

AndyNemmity · 2026-05-10T16:38:40 1778431120

Yeah, I think we're in a phase honestly where you shouldn't use anyone else's skills, and you should instead point your stuff at a repo with skills, have it really read it, and then ask what of value there is to potentially rewrite in your style based on your preferences.

I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.

So I tell people, please don't use it. Just point your claude code at it, and see if there's anything useful for you.

bisonbear · 2026-05-11T02:38:05 1778467085

Agree, it's impossible to tell if someone else's workflow works with your codebase without actually trying it, which takes time/tokens. I've been thinking about how to make running quick, directional evals easier / more efficient to give more confidence in using / developing skills. Basically, how do we go from vibes to data?

apwheele · 2026-05-10T16:23:54 1778430234

So yes a/b broadly speaking is what I was saying (test cases and can show it is actually better).

Even in this repo, just the "b" showcase, showing the outputs as is (with no clear documentation of how those were generated; is it headless in a CI pipeline somewhere?), is not good: https://github.com/Imbad0202/academic-research-skills/tree/m....

AndyNemmity · 2026-05-10T16:31:09 1778430669

I run a lot of a/b testing. But I'm not sure showing it actually communicates all that much. Since these are nondeterministic systems, even showing you an a/b test from when I made the decision a month ago doesn't really mean a whole lot.

I agree we need clearer indications of value, but I don't quite understand how to legitimately do that in a fair and honest way.
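One way to put a number on how little a small a/b run says is a confidence interval on the win rate. The sketch below uses the Wilson score interval; it is not something described in this thread, and it assumes each trial is an independent win/loss for the "b" variant with ties discarded.

```python
import math


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a win rate (z = 1.96)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    for wins, trials in [(7, 10), (70, 100)]:
        low, high = wilson_interval(wins, trials)
        print(f"{wins}/{trials}: {low:.3f} to {high:.3f}")
```

With 7 wins in 10 trials the interval still contains 0.5, so the result is compatible with no real difference; at 70 wins in 100 it no longer does. It says nothing about whether the same result would hold on a different model or with a different CLAUDE.md.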

AndyNemmity · 2026-05-10T03:04:46 1778382286

I watched all the AlphaGo games live, and I've watched analysis of so many AlphaGo games.

I think one of the particulars about Go is how hard the player base took it. Far harder than chess did. Far harder than StarCraft did (although arguably AlphaStar wasn't even that good strategy-wise; it was just better mechanically, even with the limits placed on it. Almost no one has adopted any of AlphaStar's strategy).

Lee Sedol in particular was crushed by the experience.

Others found optimism and opportunity in it.

I don't think extrapolating the Go experience is all that useful across the board, although it does have some value, and perspective, and it was a fantastic article I enjoyed reading.

Games have cheating, because cheating is easier than getting better.

Before AI, there was rampant cheating. In Magic: The Gathering, it's shuffle cheating, or holding out cards, or whatever.

The easier it is to cheat, the more cheaters there are, especially when, as with Go or chess AI, it's trivial to do and easy to get away with.

Same with map hacking in Starcraft.

I don't know. I don't have any fully formed thoughts here, except that I think extrapolating the experience in this way is vastly overstating its generalized impacts.

But I also could be very wrong. We are talking about predictions. No one can predict anything.

Predictions say more about you, and your perspective, than they do about reality.

But great read, enjoyed thinking about it all.

AndyNemmity · 2026-05-08T20:02:30 1778270550

For sure, this is the pattern I use.

And I wish I could make it even more deterministic. Maybe I can, but it can also be a bit challenging to sort out.

AndyNemmity · 2026-05-08T20:01:18 1778270478

Yeah, that's how I do skills. If I can make a script, I do. Everything that can be deterministic should be. https://github.com/notque/vexjoy-agent

AndyNemmity · 2026-05-08T19:59:48 1778270388

That's what my system does. It uses a workflow if one already exists; if not, it creates one on the fly from the primitives.

https://github.com/notque/vexjoy-agent

I would prefer that be deterministic though. This thread has me considering what if anything I can do to make it forced. Like, I could do it with hooks, but that's not elegant at all.
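A minimal sketch of that lookup-then-fallback shape, with the on-the-fly step swapped for a fixed default so the whole path is deterministic. Everything here is hypothetical: the `WORKFLOWS` registry, the primitive names, and `resolve` are illustrations, not taken from vexjoy-agent.

```python
# Hypothetical registry: workflow name -> ordered primitive names.
WORKFLOWS: dict[str, list[str]] = {
    "code-review": ["read", "critique", "summarize"],
}

# Hypothetical default used when no stored workflow matches.
DEFAULT_STEPS: list[str] = ["plan", "execute", "verify"]


def resolve(task_kind: str) -> tuple[list[str], bool]:
    """Return (steps, reused): the stored workflow if one exists, else the default."""
    if task_kind in WORKFLOWS:
        return list(WORKFLOWS[task_kind]), True
    return list(DEFAULT_STEPS), False


if __name__ == "__main__":
    print(resolve("code-review"))
    print(resolve("something-new"))
```

The trade-off is that a fixed default cannot adapt to the task the way a model-composed workflow can; it only guarantees the same input always takes the same path.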

AndyNemmity · 2026-05-05T01:41:48 1777945308

Yeah, I Blind A/B test everything, and a lot.

But I don't expect anyone to ever use my stuff. It's complicated as hell. But it's for me, and it works without me having to remotely think about the complexity.

I love that.

AndyNemmity · 2026-05-05T01:22:19 1777944139

This is why I created the /do router, to route to all skills. I also have anti-rationalization, progressive context discovery, etc.

I only make it for me, so it's a bit complex and targeted towards me, and what I do, but it's pretty easy to adjust things.

https://github.com/notque/vexjoy-agent

Working on reading through Agent Skills, it seems we've converged on a lot of the same points, and I've never seen it, so trying to get an understanding of it.

Edit 1: I don't like all the commands. I just rely on a single router to automatically decide what I want, and that feels like the most reasonable way to me to communicate with it.

I don't want to remember things. And that's the way for me to scale the number of skills and activities. I don't have to think about them.

Edit 2: We have very different routers.

https://github.com/addyosmani/agent-skills/blob/f504276d8e07...

vs

https://github.com/notque/vexjoy-agent/blob/main/skills/do/S...

I personally wouldn't call theirs an intelligent router. They are dancing between a few different skills. We have extremely different setups there.

But of course, I'm using way more context to get it done. I'm even sending it out to Haiku to build the route choices.

I choose to use tokens to make things better for myself, not everyone would make the same choice, so I certainly see why they are using a few skills, and composing them.

Edit 3: This is much easier for a user to wrap their head around because there's much less.

I am only focused on the best improvements I can make that show value for my use cases. This is straightforward to reason about.

This seems like a nice way to get the best concepts for people trying to understand them. I commend them for a clean, simple approach.

Edit 4: Yeah, I think there are some things I can learn from them which is always good.

I especially like simple decisions like collapsing the install details for each harness in the readme.

I'm going to read over the entire thing and look for opportunities to improve my stuff.

We are all working together, learning, testing, building, trying to find the best way to implement things.

AndyNemmity · 2026-05-02T22:45:52 1777761952

An LLM cannot be conscious.

A submarine cannot swim.

bastawhiz · 2026-05-02T23:07:57 1777763277

Then the author should be more careful with their choice of words.

AndyNemmity · 2026-05-01T15:30:22 1777649422

Good for you, but I'm not sure what you're showing! :)