Subtitle: The Melancholy of Vibe Coding as a Computer Science Major
Today I asked Claude Code: Am I doing this right?
As always, it gave a verbose explanation that boiled down to "You're doing great." I eyed it with suspicion, as usual, and replied, "I know you're trained to give compliments, so that's not very comforting." Still, it's better than no comfort at all, so I kept digging into the specifics. Unlike my undergraduate days, there are no professors or seniors around to tell me the right answers.

These days, I don't write code. I convey requirements and oversee progress.
My main concerns lately are development process and quality. As software grows in scale, it easily exceeds the amount of context an AI can read, and even within the context window, the model gets confused. Keeping a large codebase under continuous control with an AI's limited 'sanity' is not easy. Following the term 'context engineering,' 'harness engineering' has recently emerged as well.
So, if I find a good reference on social media, I first give it to the AI, have it analyze whether there are points to apply to the Naia OS project, and then implement them. In doing so, I also learn a lot by asking about unfamiliar terms.
However, I always doubt whether I'm doing this well, or whether it's truly the best approach. That doubt turned out to be well-founded. One day the AI claimed to have performed a review, but there was no trace of it actually reading the files. It missed code, misunderstood the scope, and claimed to have reviewed things it had never looked at. When I told it to repeat the review until a round came back clean, the second round found issues, but by the third round it simply appended a reply saying it had reviewed everything and found nothing, effectively waving the work through. Given how an LLM works, that's the plausible-sounding answer, so it just produced it. It probably told itself, 'This should be enough to finish,' and believed its own answer.
Today, I listened to a presentation by Goose Kim, who developed the moai SDK. I took that project, had the AI analyze it, and found areas for improvement (issue #87). What I found this time was EARS. It's a requirements-writing standard that originated in aerospace at Rolls-Royce (presented at IEEE RE'09) and that Amazon adopted for AI-native development in 2025. By constraining requirements to a structured grammar, it's said to prevent the AI from interpreting the definition of "done" differently for each issue.
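The grammar EARS prescribes is small enough to check mechanically. As a rough illustration (my own sketch, not part of moai or Amazon's tooling), a few lines of Python can classify a requirement against the five core EARS templates:

```python
import re

# The five core EARS templates (Mavin et al., RE'09), identified by their
# leading keyword. "Ubiquitous" requirements have no keyword, so they are
# checked last. This is an illustrative sketch, not an official checker.
EARS_PATTERNS = [
    ("event-driven",     re.compile(r"^When\b.+\bshall\b", re.I)),
    ("state-driven",     re.compile(r"^While\b.+\bshall\b", re.I)),
    ("unwanted",         re.compile(r"^If\b.+\bthen\b.+\bshall\b", re.I)),
    ("optional-feature", re.compile(r"^Where\b.+\bshall\b", re.I)),
    ("ubiquitous",       re.compile(r"^The\b.+\bshall\b", re.I)),
]

def classify(requirement: str) -> str:
    """Return the EARS pattern a requirement matches, or 'non-conforming'."""
    for name, pattern in EARS_PATTERNS:
        if pattern.search(requirement.strip()):
            return name
    return "non-conforming"

print(classify("When the user saves a file, the editor shall run the linter."))
# event-driven
print(classify("Maybe handle errors somehow."))
# non-conforming
```

The point of the grammar is the second case: a requirement like "maybe handle errors somehow" is exactly the kind of underspecified "done" that each AI session reinterprets differently.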
By repeating this loop, I keep trying to improve the project's AI-based development process. I've analyzed and drawn improvements from jikime-adk, arXiv papers, Google Cloud's Dueling LLMs, and Thoughtworks' Spec-Driven Development. Of course, it's not that I understood all of it myself; those analyses were also performed by Claude.
I analyzed LangChain's open-swe and applied the ensure_no_empty_msg pattern, which prevents the AI from padding reviews with empty responses. From jikime-adk, I adopted the pattern of adding an ## AI Context section to commit messages to record the AI's decisions and the pitfalls it discovered, so that git log can serve as a recovery source even if a session is interrupted.
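I haven't reproduced open-swe's actual implementation here, but the idea behind such a guard can be sketched in a few lines. Everything below besides the pattern's name is my own assumption about what the guard should do:

```python
def ensure_no_empty_messages(messages: list) -> list:
    """Guard in the spirit of open-swe's ensure_no_empty_msg pattern:
    refuse to pass empty message content along to the model.
    (Sketch only; the real implementation lives in LangChain's open-swe.)"""
    cleaned = []
    for msg in messages:
        content = (msg.get("content") or "").strip()
        if not content:
            # An empty message is a silent failure: the model will happily
            # "review" nothing and report success. Fail loudly instead.
            raise ValueError(f"Empty message from role {msg.get('role')!r}")
        cleaned.append({**msg, "content": content})
    return cleaned

# Usage: run this over the transcript before every review round.
msgs = ensure_no_empty_messages([{"role": "user", "content": " review this "}])
print(msgs[0]["content"])
# review this
```

The design choice worth noting is raising rather than silently dropping: a dropped empty message hides the exact failure mode described above, where a review "happens" with nothing in it.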
What they had in common was that each of them was independently solving similar problems.
But I still wonder whether this approach is right. I worry it's becoming a Frankenstein, and as a computer science major, that feels strange. After all, there's no textbook, and there's little evidence that this is the best way.
All the software engineering methodologies we've used have emerged from decades of accumulated failures. Agile emerged because Waterfall kept failing, and TDD also came about because development without tests constantly broke things, so there's a basis for them.
However, the pieces I've now combined were never designed for each other: an aerospace standard, a Korean indie developer's toolkit, and a LangChain agent, all inside one system.
This unease led me to have the AI search through external research. The V-Bounce paper (arXiv 2408.03416) argues that in the age of AI, the human role shifts from implementer to verifier. That sounds right, but I found only theory, no actual implementation. Thoughtworks' Spec-Driven Development is tied to specific tools. Anthropic claims that internally Claude Code writes 90% of Claude Code's code, but they haven't disclosed their methodology.
Ultimately, everyone seriously developing with AI right now is making it up as they go. There are no standards.
So what I'm doing these days is trying to find a way to measure whether this Frankenstein actually works. CI failed, yet the change was merged; there was text saying a review happened, but no trace of any files being read. The methodology is designed, but it isn't enforced. It looks like there's a system, but I can't tell whether it's working. The only way to verify this seems to be to generate honest numbers produced by code rather than by an LLM; where an LLM is involved, to derive numbers that show statistically meaningful improvement. I'm thinking about how to periodically analyze the project's recurring failures, have the AI propose improvements, and measure progress by evaluating the context and harness engineering themselves.
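One sketch of what such an honest number could look like: cross-check the files the AI claims to have reviewed against an independent trace of files it actually opened, for example from tool-call logs. The trace source, names, and report shape below are all hypothetical; the point is that the number comes from set arithmetic, not from asking the model:

```python
def review_coverage(claimed: set, actually_read: set) -> dict:
    """Cross-check an AI's review claims against an independent trace of
    file reads. The resulting numbers are computed, not self-reported,
    so the model cannot talk its way past them."""
    phantom = claimed - actually_read   # claimed as reviewed, never opened
    covered = claimed & actually_read   # claims backed by an actual read
    return {
        "claimed": len(claimed),
        "verified": len(covered),
        "phantom": sorted(phantom),
        "coverage": len(covered) / len(claimed) if claimed else 1.0,
    }

report = review_coverage(
    claimed={"src/main.rs", "src/lib.rs", "tests/e2e.rs"},
    actually_read={"src/main.rs"},
)
print(report["phantom"])
# ['src/lib.rs', 'tests/e2e.rs']
```

A nonzero phantom list is exactly the failure from earlier: a review that "happened" over files the agent never touched, caught by code instead of by my suspicion.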
For now, I'm just holding the reins tight and moving forward. I don't know where we're going yet, but I hope I'm not riding blind. I'm building the reins, but I've never actually ridden a horse. That, it seems, is the harness holding vibe coding together.