Agent-EvalKit automates the evaluation of AI agents through structured test-case generation, observability instrumentation, and combined code and LLM-based metrics directly in the development environment.
Arbor enables AI-driven research through systematic hypothesis management and achieved an average of 2.5x higher improvements than existing code models on six test tasks.