Somewhat controversial for Lemmy, but maybe I'll start adding generated captions to comics I post that don't already have them. It'd be a good chance for me to try out some self-hosting. Here's what qwen3 did for this comic:

A caption reads, "WHO SAYS TV THEME SONGS AREN'T POETRY?" A person with red hair and a serious expression stands at a microphone, with speech bubbles displaying "BATMAN. BATMAN. BATMAN.", "...NA.", and "NA NA NA." while gesturing with one hand.

I don't think the thinking adds much lol:

Okay, let's tackle this. The user wants a concise prose description of the comic panel. First, I need to look at all the elements.

The top has a yellow banner with text: "WHO SAYS TV THEME SONGS AREN'T POETRY?" So that's a caption. Then there's a person with red hair, wearing a dark suit, standing in front of a microphone. The background is dark green. The person is speaking into the mic, and there are speech bubbles. One bubble says "BATMAN. BATMAN. BATMAN." Another has "...NA." and another "NA NA NA." Also, the person is making a hand gesture with their right hand.

I need to describe the scene, actions, expressions, and all readable text. The person's expression looks serious or focused. The text in the banner is a caption, so I'll attribute that as "a caption reads." The speech bubbles should be attributed to the person since they're the only one speaking.

Check the rules: no analysis, just description. Use present tense. Include all readable text in quotes. Don't mention humor or tone. Keep it 1-4 sentences.

Let me structure it. Start with the caption, then the person's action, their speech bubbles. Make sure to mention the microphone and hand gesture.

Wait, the user said "a caption reads" for the banner. So: "A caption reads, 'WHO SAYS TV THEME SONGS AREN'T POETRY?'" Then the person speaking into the mic with the speech bubbles. The speech bubbles are part of their dialogue. So the person says the Batman lines.

Check if all text is included. The banner, the three speech bubbles. The hand gesture is part of the action. The background is dark green, but maybe not necessary unless it's important. The rules say "describe the visible scene," so maybe mention the dark green background. Wait, the example given didn't mention background color, but the user's instructions say "describe the visible scene." Hmm. The example had "a man in a suit stands behind a counter..." so maybe include relevant details. The background here is dark green, but maybe it's not critical. Let me check the example again. The example included "a sign above the counter reads..." so maybe the background color isn't necessary unless it's part of the scene. Since the problem says "visible scene," maybe the dark green is part of the setting. But the user's example didn't mention color, so maybe it's okay to omit. Wait, the user's instructions say "describe the visible scene, the characters’ actions, facial expressions, and any readable text." So if the background is a solid color, maybe it's not necessary. Let me focus on the key elements: the person, the microphone, the speech bubbles, the caption.

So the transcript should be: A caption reads, "WHO SAYS TV THEME SONGS AREN'T POETRY?" A person with red hair stands at a microphone, saying "BATMAN. BATMAN. BATMAN." and "NA NA NA." while making a hand gesture, with a speech bubble containing "...NA."

Wait, the speech bubbles: one is "BATMAN..." another is "...NA." and another "NA NA NA." So need to list all. Let me check the image again. The three speech bubbles: the first one is "BATMAN. BATMAN. BATMAN." Then a smaller one with "...NA." and another "NA NA NA." So the person is saying those lines. The hand gesture is part of their action.

So the description would be: A caption reads, "WHO SAYS TV THEME SONGS AREN'T POETRY?" A person with red hair stands at a microphone, speaking the words "BATMAN. BATMAN. BATMAN." and "NA NA NA." while making a hand gesture, with a speech bubble containing "...NA."

Wait, but the speech bubbles are separate. Maybe structure it as: A caption reads, "WHO SAYS TV THEME SONGS AREN'T POETRY?" A person with red hair stands at a microphone, with speech bubbles reading "BATMAN. BATMAN. BATMAN.", "...NA.", and "NA NA NA." The person gestures with one hand.

Yes, that's concise. Check the rules: include all readable text in quotes. The order is important. The caption first, then the speech bubbles. The person's action (standing at mic, gesturing). No analysis, just description. Present tense. No character name since it's not given.

Let me make sure it's 1-4 sentences. Maybe two sentences. Let's see: "A caption reads, 'WHO SAYS TV THEME SONGS AREN'T POETRY?' A person with red hair stands at a microphone, with speech bubbles displaying 'BATMAN. BATMAN. BATMAN.', '...NA.', and 'NA NA NA.' while gesturing with one hand."

Yes, that's three sentences. Wait, the example had two sentences. Maybe combine. But the user says 1-4. This should be okay. Check if all text is included. Yes. The hand gesture is mentioned. The person's expression: the description says "facial expressions" so need to mention that. The person's expression: the original image shows a serious or neutral expression. So "a serious expression" should be included.

So revised: A caption reads, "WHO SAYS TV THEME SONGS AREN'T POETRY?" A person with red hair and a serious expression stands at a microphone, with speech bubbles displaying "BATMAN. BATMAN. BATMAN.", "...NA.", and "NA NA NA." while gesturing with one hand.

That's three sentences. Includes all elements. The facial expression is "serious" as per the image. The user said to describe facial expressions. The person's expression is part of the visible scene. So that's correct.

Check if "serious" is accurate. The image shows the person with a neutral or slightly concerned look, but maybe "serious" is appropriate. Since the user says to describe visible elements, and the expression is part of that, yes.

Final check: all text included, in order. Caption first, then the speech bubbles. Actions: standing at mic, gesturing. Facial expression. No extra analysis. Yes.
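For what it's worth, most of the "check the rules" steps in that trace are mechanical and could be done after the fact instead of in the thinking. Here is a rough sketch, assuming the rules are the ones the trace mentions (all readable text quoted verbatim, 1-4 sentences). The function name `check_caption` and the sentence-splitting heuristic are my own illustration, not anything qwen3 or the prompt actually uses:

```python
import re

def check_caption(caption: str, required_text: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the caption passes."""
    problems = []

    # Rule: every readable string from the panel must appear verbatim.
    for text in required_text:
        if text not in caption:
            problems.append(f"missing text: {text!r}")

    # Rule: 1-4 sentences. Collapse each quoted span to a placeholder first,
    # keeping only its final punctuation, so periods inside speech bubbles
    # are not counted as sentence breaks.
    masked = re.sub(r'"[^"]*?([.!?]?)"', lambda m: "q" + m.group(1), caption)
    # Heuristic: a sentence ends at . ! or ? followed by whitespace and a capital.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", masked.strip()) if s]
    if not 1 <= len(sentences) <= 4:
        problems.append(f"expected 1-4 sentences, found {len(sentences)}")

    return problems


caption = (
    'A caption reads, "WHO SAYS TV THEME SONGS AREN\'T POETRY?" '
    "A person with red hair and a serious expression stands at a microphone, "
    'with speech bubbles displaying "BATMAN. BATMAN. BATMAN.", "...NA.", '
    'and "NA NA NA." while gesturing with one hand.'
)
panel_text = [
    "WHO SAYS TV THEME SONGS AREN'T POETRY?",
    "BATMAN. BATMAN. BATMAN.",
    "...NA.",
    "NA NA NA.",
]
print(check_caption(caption, panel_text))  # []
```

The sentence splitter is deliberately crude (it would miss a sentence that starts with a quote), so treat it as a sanity check rather than a real validator.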

southsamurai@sh.itjust.works 1 point 1 week ago

That's pretty neat :)