Some sort of eval, e.g. TermBench, implemented in Harbor.
It’s an insane amount of effort to build shareable, reusable, comprehensive evals, which is why almost all skills are stuck in the “vibes” phase.
That said, I think it’s quite easy to skim/intuit these sorts of skills and do horizontal gene transfer into your own vibes-based system. If you use the skills regularly, you can construct a cheap personal eval that is a lot easier to maintain, and use it to compare a new skill/plugin. Something like “please write a paper on <my personal unpublished thesis>” is a good starting point here. You get a good feel for whether a skill is better than vanilla by running it a couple of times and watching the failure modes.
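A cheap personal eval along those lines could be sketched roughly like this. Everything here is an assumption for illustration: `Case`, `pass_rate`, and the `run` callable are hypothetical names, and `run` stands in for however you actually invoke your agent with or without the skill.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    """One personal eval case: a prompt plus a check you trust."""
    prompt: str
    check: Callable[[str], bool]  # True if the output is acceptable to you


def pass_rate(run: Callable[[str], str], cases: list[Case], trials: int = 3) -> float:
    """Run every case `trials` times and return the fraction of passing runs.

    `run` is a hypothetical hook: it takes a prompt and returns the agent's
    output. Repeating each case matters because the outputs are not
    deterministic, so a single run tells you very little.
    """
    passed = 0
    total = 0
    for case in cases:
        for _ in range(trials):
            total += 1
            if case.check(run(case.prompt)):
                passed += 1
    return passed / total if total else 0.0
```

You would call `pass_rate` once with a vanilla runner and once with the skill enabled, on the same cases, and compare the two numbers. The checks can be as crude as "does the draft contain a references section", since the point is a directional signal, not a benchmark.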
Yeah, I think we're in a phase, honestly, where you shouldn't use anyone else's skills. Instead, point your stuff at a repo with skills, have it really read it, and then ask what in there is worth rewriting in your style, based on your preferences.
I have a complex setup with a lot of things based around what I do. I don't know how anyone could reasonably get their head around any of it. It's a research project in itself.
So I tell people, please don't use it. Just point your Claude Code at it, and see if there's anything useful for you.
Agree, it's impossible to tell if someone else's workflow works with your codebase without actually trying it, which takes time/tokens. I've been thinking about how to make running quick, directional evals easier / more efficient to give more confidence in using / developing skills. Basically, how do we go from vibes to data?
I run a lot of A/B testing. But I'm not sure showing it actually communicates all that much. Since these are non-deterministic systems, even showing you an A/B test from when I made the decision a month ago doesn't really mean a whole lot.
I agree we need clearer indications of value. I just don't quite understand how to legitimately do that in a fair and honest way.
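One way to at least be upfront about the noise in an A/B run is to report an interval instead of a single number. A minimal sketch, assuming each trial is recorded as 1 (pass) or 0 (fail); `bootstrap_diff` is a hypothetical helper, and a percentile bootstrap is just one reasonable choice here, not something anyone in this thread said they use.

```python
import random


def bootstrap_diff(a: list[int], b: list[int],
                   n_resamples: int = 2000, seed: int = 0) -> tuple[float, float, float]:
    """Return (observed, low, high) for the pass-rate difference b - a.

    `a` and `b` are lists of 0/1 trial outcomes for the two variants.
    `low` and `high` are the 2.5th and 97.5th percentiles of the
    difference over bootstrap resamples (sampling with replacement).
    """
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    diffs = []
    for _ in range(n_resamples):
        ra = rng.choices(a, k=len(a))  # resample variant A's trials
        rb = rng.choices(b, k=len(b))  # resample variant B's trials
        diffs.append(sum(rb) / len(rb) - sum(ra) / len(ra))
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

If the interval comfortably includes zero, the honest summary is "I can't tell these apart yet". It still says nothing about whether the result transfers to a different model or a different setup, which is the harder problem.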
I watched all the AlphaGo games live, and I've watched analysis of so many AlphaGo games.
I think one of the particulars about Go is how hard the player base took it. Far harder than chess did. Far harder than StarCraft did (although arguably AlphaStar wasn't even that good strategy-wise; it was just better mechanically, even with the limits placed on it. Almost no one has adopted any of AlphaStar's strategy).
Lee Sedol in particular was crushed by the experience.
Others found optimism and opportunity in it.
I don't think extrapolating the Go experience is all that useful across the board, although it does have some value, and perspective, and it was a fantastic article I enjoyed reading.
Games have cheating, because cheating is easier than getting better.
Before AI, there was rampant cheating. In Magic: The Gathering, it's shuffle cheating, or holding out cards, or whatever.
The easier it is to cheat, the more cheaters you get: if you can get away with it, or if, like with Go or chess AI, it's trivial to do and easy not to get caught.
Same with map hacking in StarCraft.
I don't know. I don't have any fully formed thoughts here, except that I think extrapolating the experience in this way vastly overstates its generalized impact.
But I also could be very wrong. We are talking about predictions. No one can predict anything.
Predictions say more about you, and your perspective, than they do about reality.
I would prefer that be deterministic though. This thread has me considering what, if anything, I can do to force it. Like, I could do it with hooks, but that's not elegant at all.
But I don't expect anyone to ever use my stuff. It's complicated as hell. But it's for me, and it works without me having to remotely think about the complexity.
I'm working on reading through Agent Skills. It seems we've converged on a lot of the same points, and I'd never seen it before, so I'm trying to get an understanding of it.
Edit 1: I don't like all the commands. I just rely on a single router to automatically decide what I want, and that feels like the most reasonable way to me to communicate with it.
I don't want to remember things. And that's the way for me to scale the number of skills and activities. I don't have to think about them.
I personally wouldn't call theirs an intelligent router. They are dancing between a few different skills. We have extremely different setups there.
But of course, I'm using way more context to get it done. I'm even sending it out to Haiku to build the route choices.
I choose to spend tokens to make things better for myself. Not everyone would make the same choice, so I certainly see why they are using a few skills and composing them.
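To make the single-router idea concrete, here is a rough sketch. It is not my actual setup: `Skill`, `keyword_score`, and `route` are hypothetical names, and the keyword-overlap scorer is a deliberately dumb stand-in for the place where a call to a small model would go.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


def keyword_score(request: str, skill: Skill) -> int:
    """Stand-in scorer: count words shared by the request and the description.

    In a real router this is where a cheap model would rank the skills.
    """
    request_words = set(request.lower().split())
    skill_words = set(skill.description.lower().split())
    return len(request_words & skill_words)


def route(request: str, skills: list[Skill],
          score: Callable[[str, Skill], int] = keyword_score,
          min_score: int = 1) -> Optional[Skill]:
    """Pick the best-scoring skill, or None if nothing scores high enough."""
    best = max(skills, key=lambda s: score(request, s), default=None)
    if best is None or score(request, best) < min_score:
        return None
    return best
```

The user only ever talks to `route`, so adding a skill means adding a description, not a new command to remember. Swapping `score` for a model call costs more tokens per request, which is exactly the trade-off described above.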
Edit 3: This is much easier for a user to wrap their head around because there's much less of it.
I am only focused on the best improvements I can make that show value for my use cases. This is straightforward to reason about.
This seems like a nice way to get the best concepts for people trying to understand them. I commend them for a clean, simple approach.
Edit 4: Yeah, I think there are some things I can learn from them, which is always good.
I especially like simple decisions like collapsing the install details for each harness in the readme.
I'm going to read over the entire thing and look for opportunities to improve my stuff.
We are all working together, learning, testing, building, trying to find the best way to implement things.
I do not believe me giving you that information is honest. If I do, I am pretending that you will get the same experience.
Maybe you're using a different model. Maybe you have stuff in your CLAUDE.md that will break it.
It is not honest to me to give you confidence in it, when no one can be confident in it.