FlipWalker Game Benchmark

This page presents a local-LLM benchmark result: a gravity-flip puzzle game. The game itself is playable below, and the page also covers the prompt used for generation, the generation process, and personal takeaways.

Playable game

Controls: press Space to start and flip gravity.
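As a rough sketch of that one-button scheme (illustrative only, not the generated game's actual code; `started` and `gravityDir` are hypothetical names), the Space key might be wired up like this:

```js
// Hypothetical sketch of the one-button control scheme described above.
let started = false;   // the game begins on the first Space press
let gravityDir = 1;    // +1 = gravity pulls down, -1 = gravity pulls up

document.addEventListener('keydown', (e) => {
  if (e.code !== 'Space') return;
  e.preventDefault();            // keep Space from scrolling the page
  if (!started) {
    started = true;              // first press: start the run
  } else {
    gravityDir = -gravityDir;    // every later press: flip gravity
  }
});
```

A single flag plus a sign variable is enough here because the design allows exactly one input: the same key both starts the run and flips gravity.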

Prompt used for generation

English prompt

Create a small vanilla JavaScript puzzle game where a character automatically walks back and forth, the player can only flip gravity, and the goal is to get a key and reach a door. Design a well-crafted level with multiple obstacles that require precise gravity flips—both to reach the key and to reach the door. The level should have a clear intended solution path that feels satisfying to solve, with the layout carefully tuned so the character's walk timing and gravity flips align correctly. Test the game by simulating the solution step-by-step, verify the level is completable as intended, and fix any issues that prevent completion.

Japanese version used in this run

キャラクターが自動的に左右に往復し、プレイヤーは重力を反転させるだけの小さなバニラ JavaScript パズルゲームを作成せよ。目標は鍵を取得して扉に到達することである。精密な重力反転を要求する障害物を複数配置した、よく設計されたレベルを作れ。鍵への到達にも扉への到達にも重力反転が必要である。キャラクターの歩行タイミングと重力反転が噛み合うよう丁寧に調整された、明確な解法経路を持ち、解いたときに達成感を感じられるレベルにせよ。ソリューションをステップごとにシミュレートしてゲームをテストし、レベルが意図通りにクリアできることを確認し、クリアを妨げる問題があれば修正せよ。
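To make the mechanics the prompt asks for concrete, here is a minimal sketch of the core update loop, assuming a toy one-platform level. Every name and constant (`WALK_SPEED`, `near`, `level.floorY`, and so on) is a hypothetical illustration, not the code the model actually generated:

```js
// Hypothetical core loop for the mechanics the prompt describes.
const WALK_SPEED = 2;    // px per frame; the character never stops walking
const GRAVITY = 0.5;     // px per frame^2; sign follows the gravity direction

let gravityDir = 1;      // +1 = down, -1 = up (flipped by the player)

const player = { x: 20, y: 0, vy: 0, facing: 1, hasKey: false };
const level = {
  width: 320,
  floorY: 160,                      // simplistic single-platform "level"
  key:  { x: 150, y: 150, r: 10 },
  door: { x: 300, y: 150, r: 10 },
  cleared: false,
};

const near = (a, b) => Math.hypot(a.x - b.x, a.y - b.y) < b.r;

function update() {
  // Auto-walk: move forward, turn around at the level edges.
  player.x += WALK_SPEED * player.facing;
  if (player.x < 0 || player.x > level.width) player.facing *= -1;

  // Gravity: accelerate toward the current "down", land on floor or ceiling.
  player.vy += GRAVITY * gravityDir;
  player.y += player.vy;
  if (gravityDir === 1 && player.y >= level.floorY) { player.y = level.floorY; player.vy = 0; }
  if (gravityDir === -1 && player.y <= 0)           { player.y = 0;            player.vy = 0; }

  // Goals: collect the key first, then reach the door.
  if (!player.hasKey && near(player, level.key)) player.hasKey = true;
  if (player.hasKey && near(player, level.door)) level.cleared = true;
}
```

Calling `update()` from a `requestAnimationFrame` loop would drive the game. A real level needs multiple obstacles tuned against the fixed walk speed, which is exactly what the prompt's simulate-and-fix step is meant to verify.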

Personal take

Today, cloud models such as Claude, Codex, and Gemini are advancing rapidly, and many benchmarks evaluate their coding and agent capabilities from different angles. Scores still matter, but this year I personally place more weight on the quality of the SDLC (software development lifecycle) itself.

In my usual process, I execute the skills one by one by hand, following SDD + TDD. For this benchmark, I turned those skills into an automated workflow pipeline and used it to build the FlipWalker game with only a local LLM.

My view is that when the SDLC process is reasonably capable and well-structured as a workflow, even a local LLM can produce software of practical quality. The final game is fairly simple, but it is playable.

Impressions

I do not usually rely on fully automated workflows, so this run was instructive. Executing the prepared workflow revealed that some skills have partially overlapping responsibilities. In that sense alone, running this benchmark through a workflow was well worth doing.

Please try the benchmark with your own workflows as well. One improvement I am considering is design coverage: my current setup is heavily focused on program generation and includes almost no design-oriented skills.