Kaiju No. 8 2, episode 10


Last show standing, and since it started late, it’s got one more episode to go. [sigh; it's only 11 episodes, so things are going to be rushed next week]

This week, all hell breaks loose, kicking off an episode of Back-Seat Kaiju!, in which our cute glasses-wearing Operations gal’s sanity is tested. Basically, everyone gets to debut their new kaiju skinsuits, with Tsuntail exploding out of the gate in her mom’s hand-me-downs, Vice-Captain playing tsukkomi to #10’s boke, and Super Sidekick… not appearing this week. Good thing we’ve got two captains on deck.

Verdict: one more episode, and I have no idea what they plan to end it on. I have a hunch Naughty Number Nine has been built up too much as a long-term antagonist, so maybe they’ll just Save The Country For Now and try for a third season? Ratings apparently support the idea.

(time for Tsuntail to go axe-crazy!)

Diffusion Image Evaluation Checklist

  1. Count the limbs.
  2. Count the fingers.
  3. Count the toes.
  4. Side-check the thumbs.
  5. Side-check the big toes.
  6. Count the navels.
  7. Check for crazy eyes.
  8. Check for consistent skin tones.
  9. Check skin texture.
  10. Check for texture patterns in background.
  11. Watch out for overtrained default clothing.
  12. Check all four edges for cut-off objects.
  13. Check for giants and tinyfolk.
  14. [update!] Check for melting.

Now you can apply standard advice about composition, like rule-of-thirds, division of negative space, not cropping people at joints, etc. If there’s anything even vaguely naughty about the pictures you’re generating, you’ll also want to check the apparent age of all human figures…

GenAI coding: the definition of insanity

devstral-small-2507:

It was quick, and friendly, and the code was half the size of the working one I got from gpt-oss-20b, and it ran the first time. It didn’t do anything, of course, and I spent N passes nit-picking every place where it did something stupid or simply ignored the spec, but as I blew past 24KB of context, my check-in comments started to get snarky. After all, I’d only just got it to finally display the bottom-bar buttons onscreen, without scrolling, and it still refused to actually make them work as buttons, or display the correct contents.

But it was quick, and friendly! Worst thing was that it actually styled the web GUI nicely; it just couldn’t make it meet the spec. I finally had to give up, because it started “fixing” bugs without changing a single line of code.


codestral-22b:

Did not want to write code. First pass, it just restated the specs with less detail, and said “you could write it like this”. Erased that response, tried again, got a high-level overview of code it thought I should write. Third try, and finally it’s writing code, using a completely different set of Python libraries from any of the others.

The code is incomplete. Try again, telling it to follow the spec and write a web GUI instead of Tcl/Tk, and it chooses a library I’ve never heard of that sounds like a product placement, writes partial code, and then says it can’t write the full app as requested, because that “would require quite a bit of code and is beyond the scope of this text-based environment.”

I shit you not.

gemma-3-27b:

Considers this a substantial project, and the very first thing it decides to do is split it up into multiple files. It all ended up in one file anyway, but it also used gradio for the GUI, which is much heavier and slower to start up than what the others picked. Then it couldn’t read the ranking file, since it skipped the part of the spec where only one field was required. Then it got weird, telling me to fix a bug by replacing one line of code, while warning that “you might encounter similar issues with other event handlers if you’re passing non-block arguments as arguments”.

Who is it talking to, here? I’m not the one writing the code. I scolded it and told it to do its job. It groveled a bit, and wrote out a new version that was still riddled with the same type of bug. Basically, it wrote the same type of bug in half a dozen places, but only fixed one of them each time through. An hour in, and I still haven’t even seen the GUI. I just keep pasting Python stack traces into the chat so it can fix one bug at a time.

“Thank you for your incredible patience and persistence in identifying these subtle errors.” Bitch, please; you can’t write valid code for the GUI library you chose yourself. It then completed the incompetent-junior-Indian-contractor experience by groveling profusely, apologizing for underestimating the complexity of Gradio, and saying “I suggest exploring alternative GUI libraries or seeking assistance from more experienced developers”.

I shit you not.
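For the record, the bug class gemma kept re-typing (if I’m decoding “non-block arguments” correctly) is that Gradio event handlers want Blocks components in their inputs and outputs lists; plain Python values have to be wrapped in gr.State. A minimal sketch of the correct pattern, every name my own invention rather than anything from the actual app:

```python
# Minimal Gradio Blocks sketch of the rule gemma kept tripping over:
# event-handler inputs/outputs must be components (or gr.State),
# never raw Python values. All names here are hypothetical.
import gradio as gr

def next_image(index):
    # Stand-in for "advance to the next image".
    return index + 1, f"image #{index + 1}"

with gr.Blocks() as demo:
    index = gr.State(0)                 # wrap mutable values in gr.State...
    label = gr.Textbox(label="Current image")
    btn = gr.Button("Next")
    # ...because click() takes components, not the raw int:
    btn.click(next_image, inputs=[index], outputs=[index, label])
    # btn.click(next_image, inputs=[0], ...)  # <- the bug: not a component

demo.launch()
```

Pass the raw value instead of the component and Gradio rejects it rather than silently coping, which appears to be exactly what it re-wrote in half a dozen places.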

gpt-oss-20b with the improved spec

The final version was ugly but fully functional, so what happens if I take all those revisions into account, start up a brand-new chat, and ask the new junior contractor to do what the last one did, only better and with less back-and-forth?

It spent a long time ‘thinking’ about it before just giving up without writing any code. And when I say long, I mean just under an hour. It wrote nearly 128KB of ‘thinking’, a churning vat of horseshit; watching it scroll by was like having sanity loss packaged up as a Star-Wars-credits screensaver.

I erased all traces of this insane monologue, set the ‘Reasoning Effort’ to low, and tried again. It said it couldn’t help with my request. I erased that answer, asked it again, and it immediately started cranking out code. Guess I got the guy who went to the good diploma mill on the third try.

The code was syntactically valid. The GUI loaded, but didn’t display any images, throwing a stack trace instead. When I pasted in the stack trace, it spat out something irrelevant (model turds?). Telling it to fix the bug, however, prodded it into doing the needful and writing out a new version of the code. At this point, it ran and was partially functional. The buttons didn’t work, you couldn’t rank/flag images, and you had to manually refresh the page to navigate to a new image, but unlike the competition, it actually made progress.

It is not, however, converging to a fully-functional solution faster than the original spec did with the same model. So, all the things “we” learned from the first experience and incorporated into the spec resulted only in an improvement in my ability to rapidly write fix-it prompts.
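For scale, the app everyone kept failing to converge on is not a big ask. Here’s a bare sketch of the shape I read the spec as describing: serve images out of a directory, rank or flag each one from a bottom bar, and advance automatically instead of making you refresh. Flask rather than anything the models picked, links standing in for proper buttons, and every path, score range, and the one-line ranking format are my assumptions, not the spec’s:

```python
# Bare-bones sketch of the app shape described above: browse a directory
# of images, rank/flag each one, append results to a ranking file.
# Every name and format here is assumed, not taken from the real spec.
from pathlib import Path
from flask import (Flask, redirect, render_template_string,
                   send_from_directory, url_for)

IMAGE_DIR = Path("images")        # hypothetical image directory
RANK_FILE = Path("rankings.txt")  # hypothetical one-line-per-image log
app = Flask(__name__)
images = sorted(p.name for p in IMAGE_DIR.glob("*.png"))

PAGE = """<!doctype html>
<title>ranker</title>
<img src="{{ url_for('raw', name=name) }}" style="max-height:85vh">
<div style="position:fixed;bottom:0;width:100%">  <!-- the bottom bar -->
  {% for r in range(1, 6) %}
    <a href="{{ url_for('rank', idx=idx, score=r) }}">{{ r }}</a>
  {% endfor %}
  <a href="{{ url_for('rank', idx=idx, score=0) }}">flag</a>
</div>"""

@app.route("/")
@app.route("/<int:idx>")
def show(idx=0):
    return render_template_string(PAGE, idx=idx, name=images[idx])

@app.route("/raw/<name>")
def raw(name):
    return send_from_directory(IMAGE_DIR, name)

@app.route("/rank/<int:idx>/<int:score>")
def rank(idx, score):
    with RANK_FILE.open("a") as f:
        f.write(f"{images[idx]}\t{score}\n")  # one tab-separated line
    # advance to the next image instead of making the user refresh
    return redirect(url_for("show", idx=(idx + 1) % len(images)))

if __name__ == "__main__":
    app.run(debug=True)
```

Whatever the real spec’s ranking format is, the shape doesn’t change: a template, three routes, and an append.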

“But what if we crossed the streams?”

That first try with gpt-oss produced an ugly, fully functional app. Devstral wrote a good-looking, mostly functional app but just couldn’t get it to meet the spec. So, what happens if we feed the revised spec and the devstral code into gpt-oss-20b?

The first two answers were, “I’m sorry, I can’t help with that”. Deleting and asking a third time finally got it off its ass: it wrote a brief paragraph about fixing the code, but needed another explicit “fix the code” prompt before it actually did anything. This is a very strange experience, but at least it doesn’t have a 17-hour delay between each iteration.

First pass? Code ran but couldn’t display any images, and the buttons were missing. Second pass? Slight improvement over original, but it can’t figure out where the code deviates from the spec, so it still has almost all of the old bugs. Sigh.

It’s a real uphill struggle.

“But what about ChatGPT?”, I hear you cry

7 seconds of ‘thinking’, followed by typing out 700+ lines of code faster than the local models, with pretty much the same basic structure as the rest. It ran, auto-opened the web browser, displayed a blank GUI, and dropped a 404 error trying to load an image. So, y’know, same-old-same-old. I grabbed the error message out of the Javascript console and pasted it in, along with the offending line of code it identified.

It identified a one-line fix, but asked if I wanted it to type out a complete new version so I could just paste it in. I said yes, and the new version had a lot of changes, including deleting the entire HTML template so that it loaded a blank page. It went from 782 lines to 288.

When confronted, it figured that out, once again asked if I wanted the complete script typed out, and just randomly cut it off in the middle of line 452. Once I told it that, it typed out the final tested (how?) version, and was so confident in its success that it offered me a variety of enhancements, including Dockerizing it.

This was not the final version of the script. It was converging to full functionality, and it never just broke everything again, but I still had to do a full QA pass after every iteration. It took several more tries to get it right, with me arguing about its analysis of the zoom problem, then scolding it for going down a rat-hole, which finally got it to figure out the problem, this time for sure.

But before it would write out the answer, it demanded $20/month for ChatGPT Pro. Or I could wait four hours, or allow a lesser model to attempt to finish what their top-of-the-line model had taken so long to get right.

Reminder: I cancelled Pro because I do not enjoy getting scolded for violating the constantly-changing censorship rules.

I’m sure it’s sheer coincidence that it asked for money after it left the code in a state where basic functionality was newly-broken, but everything else finally worked. I could manually apply the solution they described, but I’d have to back out to a version before it went down the rat-hole, so either scroll waaaaaay back up or revert it in source control. You did use source control to store every version of the code it wrote for you, right?

(when you’re losing billions of dollars a year, ya gotta catch ’em all!)
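About that source-control question: if the answer was no, the habit costs one reflexive command per iteration. A trivial sketch, assuming nothing fancier than git in the working directory:

```python
# Snapshot the LLM's latest code dump so "revert it in source control"
# is one command instead of a scroll through chat history. Assumes git.
import subprocess
import sys

def snapshot(message: str) -> None:
    subprocess.run(["git", "add", "-A"], check=True)
    # This raises if nothing actually changed, which is also information:
    # the model "fixed" the bug without touching a line.
    subprocess.run(["git", "commit", "-m", message], check=True)

if __name__ == "__main__":
    snapshot(sys.argv[1] if len(sys.argv) > 1 else "LLM iteration")
```

Run it after pasting in each new version; git log --oneline gives you the whole sorry history, and checking out an earlier hash undoes the rat-hole.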

In the end, it was slightly better than the local LLMs, mostly in being faster to iterate. It is still an incredibly hostile experience for anyone who doesn’t have a firm grasp of the problem, plenty of debugging experience, and the confidence to tell the “AI” that it’s aggressively solving the wrong problem.

To wash away the taste of all of these experiments, I’ll leave you with a link to a lengthy gallery of pictures of NMB48’s astonishingly delicious Yuzuha Hongo.

(this is one of the rare surviving Asian-cheesecake sites that is not completely covered in ads and predatory Javascript; shields up, still, just in case)

