{"id":6381,"date":"2026-07-01T13:17:32","date_gmt":"2026-07-01T06:17:32","guid":{"rendered":"https:\/\/daiilynews.cu.ma\/?p=6381"},"modified":"2026-07-01T13:17:32","modified_gmt":"2026-07-01T06:17:32","slug":"benchmarking-ai-agents-for-enterprise-java-framework-migration","status":"publish","type":"post","link":"https:\/\/daiilynews.cu.ma\/?p=6381","title":{"rendered":"Benchmarking AI Agents for Enterprise Java Framework Migration"},"content":{"rendered":"<p>Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities.<br \/>\nRecent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains:<br \/>\nCan AI agents reliably modernize real-world enterprise applications?<br \/>\nExisting software engineering benchmarks have demonstrated impressive progress in bug fixing and code generation, but framework migration presents a fundamentally different challenge. 
Success requires not only translating code, but also preserving behavior, adapting build systems, and navigating runtime dependencies.<br \/>\nTo address this gap, we introduce ScarfBench (Self-Contained Application Refactoring Benchmark), an open benchmark for evaluating AI agents on cross-framework migration tasks in Enterprise Java.<br \/>\nScarfBench focuses on migrations across three major Java ecosystems:<\/p>\n<p>Spring<br \/>\nJakarta EE<br \/>\nQuarkus<\/p>\n<p>Unlike traditional benchmarks that compare generated code against reference implementations, ScarfBench evaluates whether migrated applications actually build, deploy, and preserve behavior.<\/p>\n<p>\t\tWhy Migration Is Hard<\/p>\n<p>Framework migration is much more than replacing annotations.<br \/>\nA simple repository migration can require changes across dependency injection, persistence configuration, queries, and framework descriptors. Small mistakes in any of these pieces can prevent successful deployment.<\/p>\n<p>Figure: Spring \u2192 Jakarta Migration Example<\/p>\n<p>Framework migration requires translating framework semantics, not just source code.<\/p>\n<p>\t\tIntroducing ScarfBench<\/p>\n<p>ScarfBench provides a systematic way to evaluate AI agents on enterprise Java framework migration tasks.<br \/>\nApplications are required to:<\/p>\n<p>Build successfully.<br \/>\nDeploy correctly.<br \/>\nPass behavioral validation.<\/p>\n<p>This provides a much more realistic measure of modernization quality.<\/p>\n<p>\t\tBenchmark at a Glance<\/p>\n<table>\n<thead>\n<tr><th>Metric<\/th><th>Value<\/th><\/tr>\n<\/thead>\n<tbody>\n<tr><td>Applications<\/td><td>34<\/td><\/tr>\n<tr><td>Framework implementations<\/td><td>102<\/td><\/tr>\n<tr><td>Migration tasks<\/td><td>204<\/td><\/tr>\n<tr><td>Lines of code<\/td><td>~151K<\/td><\/tr>\n<tr><td>Source and test files<\/td><td>~2,000<\/td><\/tr>\n<tr><td>Expert-written tests<\/td><td>1,331<\/td><\/tr>\n<\/tbody>\n<\/table>\n<p>ScarfBench includes both focused migration tasks and whole-application migrations.<\/p>\n<p>Figure: ScarfBench Construction Pipeline<\/p>\n<p>Starting from a JSR-based enterprise Java taxonomy, expert 
migrations create verified implementations across Spring, Jakarta EE, and Quarkus.<\/p>\n<p>\t\tHow Do Frontier Agents Perform?<\/p>\n<p>We evaluated several state-of-the-art coding agents on ScarfBench.<br \/>\nDespite strong performance on traditional software engineering benchmarks, framework migration remains difficult. Success rates vary considerably across framework pairs, and whole-application migrations remain particularly challenging.<\/p>\n<p>Figure: Current Leaderboard<\/p>\n<p>Even the strongest current agents achieve less than 10% behavioral success, illustrating the gap between generating compilable code and preserving application behavior.<\/p>\n<p>Figure: Compile \u2192 Deploy \u2192 Test Progression<\/p>\n<p>Compile success consistently exceeds deploy success, which in turn exceeds behavioral success. Build success alone significantly overestimates migration quality.<\/p>\n<p>Figure: Migration Outcomes by Target Framework<\/p>\n<p>Migration difficulty depends strongly on the target framework, with Jakarta EE proving particularly challenging.<\/p>\n<p>\t\tWhat We Learned About AI Agents for Java Modernization<\/p>\n<p>Beyond measuring success rates, ScarfBench helps us understand how agents behave during modernization.<\/p>\n<p>\t\tCan Agents Reliably Tell When a Migration Is Complete?<\/p>\n<p>A migrated application is only useful if it actually builds and runs.<br \/>\nWe therefore compared agent-reported outcomes against independent build verification.<\/p>\n<p>\t\tFinding: Agents Are Overconfident<\/p>\n<p>Claude Code reported successful builds for 29 out of 30 whole applications.<br \/>\nOnly 22 of those 29 applications actually built successfully, so 7 of the reported successes were wrong.<br \/>\nMeanwhile, the single application classified as failed by the agent ultimately built correctly.<br \/>\nThis suggests that agent self-assessment should not be treated as a reliable signal of migration completion.<br \/>\nIndependent build and test validation remains essential.<\/p>\n<p>\t\tHow Do 
Agents Navigate Application Dependencies?<\/p>\n<p>Framework migrations rarely affect a single file or layer.<br \/>\nChanges in configuration, services, databases, and web components often cascade across the application.<\/p>\n<p>\t\tFinding: Migration Is Iterative Rather Than Linear<\/p>\n<p>The most frequently visited layers were:<\/p>\n<p>Configuration<br \/>\nWeb<br \/>\nDatabase<br \/>\nService<\/p>\n<p>Common transitions included:<\/p>\n<p>Configuration \u2194 Web<br \/>\nService \u2194 Database<\/p>\n<p>This suggests that migration is an iterative dependency-resolution process rather than a simple source-to-source transformation.<\/p>\n<p>\t\tWhere Do Agents Spend Most of Their Effort?<\/p>\n<p>We used layer revisit frequency as a proxy for migration effort. Layers that required repeated visits typically involved debugging, dependency resolution, or framework adaptation.<\/p>\n<p>\t\tFinding: Configuration Dominates Migration Effort<\/p>\n<p>Rather than proceeding linearly, agents repeatedly returned to configuration-related artifacts while resolving framework differences and dependency issues.<\/p>\n<p>\t\tWhat Challenges Are Not About Code Transformation?<\/p>\n<p>Not every migration issue originates from source code.<\/p>\n<p>\t\tFinding: Environment and Tooling Matter<\/p>\n<p>Agents frequently struggled with environmental issues, including:<\/p>\n<p>Docker cache inconsistencies<br \/>\nPort connectivity problems<br \/>\nMaven wrapper and build tooling issues<\/p>\n<p>These operational concerns often delayed validation even when the source-code migration itself was largely complete.<\/p>\n<p>Figure: Failure Mode Distribution<\/p>\n<p>Modernization failures span build systems, deployment environments, dependency injection, databases, endpoints, assertions, and infrastructure.<\/p>\n<p>\t\tKey Takeaway<\/p>\n<p>The biggest challenge in framework modernization is not translating Java code.<br \/>\nIt is managing the web of dependencies across 
configuration, infrastructure, and runtime environments.<br \/>\nWhile frontier agents can automate substantial portions of the migration process, reliable validation and architectural reasoning remain critical for achieving successful outcomes.<br \/>\nScarfBench helps expose these challenges and provides a standardized way to measure progress toward truly autonomous application modernization.<\/p>\n<p>\t\tExplore ScarfBench<\/p>\n<p>ScarfBench is designed as an open resource for researchers and practitioners.<br \/>\nResources include:<\/p>\n<p>Benchmark dataset<br \/>\nEvaluation infrastructure<br \/>\nPublic leaderboard<br \/>\nDocumentation<br \/>\nOpen-source code<\/p>\n<p>Researchers can compare agent architectures and techniques. Practitioners can use ScarfBench to evaluate modernization solutions before deploying them in production environments.<\/p>\n<p>\t\tWebsite<\/p>\n<p>https:\/\/scarfbench.info<\/p>\n<p>\t\tDataset<\/p>\n<p>https:\/\/huggingface.co\/datasets\/ibm-research\/ScarfBench<\/p>\n<p>\t\tSpace<\/p>\n<p>https:\/\/huggingface.co\/spaces\/ibm-research\/ScarfBench<\/p>\n<p>\t\tGitHub Repository<\/p>\n<p>https:\/\/github.com\/scarfbench\/scarfbench<\/p>\n<p>\t\tLeaderboard<\/p>\n<p>https:\/\/scarfbench.info\/leaderboard<\/p>\n<p>\t\tPaper<\/p>\n<p>https:\/\/arxiv.org\/abs\/2605.06754<br \/>\nFramework migration remains one of the largest unsolved problems in AI-assisted software engineering. 
We hope ScarfBench helps the community measure progress and accelerate the next generation of AI-assisted application modernization.<br \/>\nWe invite researchers, practitioners, and framework communities to evaluate their agents, contribute new migration scenarios, and help advance the state of the art.<br \/>\n<br \/><br \/>\n<br \/><a href=\"https:\/\/huggingface.co\/blog\/ibm-research\/scarfbench\">Source link<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u2b50 Star ScarfBench on GitHub Modernizing enterprise applications is one of the largest and most expensive software engineering activities organizations undertake. Teams migrate applications across frameworks to improve maintainability, cloud readiness, developer productivity, and access to modern capabilities. Recent advances in coding agents have sparked excitement around AI-assisted modernization. But an important question remains: Can [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":6382,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[676],"tags":[],"class_list":["post-6381","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-tech-ai"],"_links":{"self":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6381","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6381"}],"version-history":[{"count":0,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=\/wp\/v2\/posts\/6381\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/daii
lynews.cu.ma\/index.php?rest_route=\/wp\/v2\/media\/6382"}],"wp:attachment":[{"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6381"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6381"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/daiilynews.cu.ma\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6381"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}