ChatGPT, Gemini, and Claude Compete in Multimodal Image Understanding

Key Points

ChatGPT delivers structured, reliable inventories of visual content.
Gemini provides highly detailed, context‑rich descriptions with precise text recognition.
Claude offers narrative‑style overviews that add creative flair but may include imaginative guesses.
All three models correctly identified the major objects in the Times Square photo, Michelangelo’s “Last Judgment,” and the cluttered room.
Gemini stood out for its ability to describe spatial relationships and avoid hallucinations.
ChatGPT avoided misnaming individual figures in the complex artwork.
Claude highlighted artistic themes such as the controversy over nudity in the painting.

Overview of the Multimodal Test

The evaluation pitted three prominent AI chat models, ChatGPT, Gemini, and Claude, against a set of visually demanding images. Each picture posed a different challenge: a neon‑lit Times Square filled with signage and movement, Michelangelo’s “Last Judgment” with its intricate crowd of figures, and a messy room cluttered with cables, papers, and assorted objects. The goal was to see how each system parsed visual information, identified objects, read embedded text, and articulated spatial relationships without inventing details.
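The article does not describe a formal scoring procedure, but the idea of checking identified objects against what is actually in the image can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the function name score_inventory, the hand-labeled reference list, and the use of exact, case-insensitive matching, which is a deliberate simplification of how a human reviewer would judge a description.

```python
def score_inventory(reported, reference):
    """Compare a model's reported objects with a hand-labeled reference list.

    Hypothetical helper: matching is exact after lowercasing and trimming,
    so synonyms ("cords" vs. "cables") would count as misses.
    """
    reported_set = {item.strip().lower() for item in reported}
    reference_set = {item.strip().lower() for item in reference}

    found = reported_set & reference_set      # correctly identified objects
    invented = reported_set - reference_set   # reported but not in the image
    missed = reference_set - reported_set     # in the image but not reported

    recall = len(found) / len(reference_set) if reference_set else 0.0
    return {
        "found": sorted(found),
        "invented": sorted(invented),
        "missed": sorted(missed),
        "recall": recall,
    }


# Hypothetical labels for the cluttered-room image (not taken from the test).
reference = ["cables", "binders", "manuals", "printed sheets"]
reported = ["Cables", "binders", "envelopes"]

result = score_inventory(reported, reference)
print(result)
```

With these made-up labels, the sketch reports "envelopes" as an invented item and a recall of 0.5, which mirrors the kind of mistake the review attributes to a model that guesses at objects it cannot clearly see.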

Performance on the Times Square Image

ChatGPT produced a structured list, noting major signs for shows and brands, the hot‑dog cart, yellow cabs, buses, pedestrians, and street markings. It also quoted visible text on the signs and offered a brief comment on the overall energy of the scene. Gemini went deeper, describing the green glow from a sign reflecting on nearby surfaces and the staggered diagonal crosswalk pattern, and identifying the bus as an MTA vehicle while flagging text it could not read. Claude took a more narrative approach, describing the scene as a vibrant nighttime photograph and highlighting its iconic energy, while correctly identifying major signs and colors.

Interpretation of Michelangelo’s “Last Judgment”

ChatGPT described the central Christ figure surrounded by clusters of angels, resurrected bodies, and demons, carefully avoiding false names for specific characters. Gemini provided an art‑historian‑style analysis, outlining the radial composition, concentric arcs, and the directional motion of figures, while staying grounded in recognized symbols. Claude emphasized the controversy over the work’s nudity, identified Christ and Mary, and contrasted the upward movement of the saved with the downward turmoil of the damned, delivering a concise but vivid overview.

Analysis of the Cluttered Indoor Room

In the chaotic room, ChatGPT listed items from left to right, recognizing tangled cords, binders, manuals, and various devices, though occasionally using vague labels like “a small device.” Gemini broke the scene into fine‑grained details, noting colors, shapes, and lighting, and even speculating that the room served as an administrative space. Claude offered a summarized inventory, correctly naming many objects but occasionally inferring items not clearly visible, such as describing a stack of printed sheets as envelopes.

Strengths and Weaknesses Across Models

ChatGPT demonstrated careful, reliable enumeration and avoided hallucinations, making it a solid choice for users who need clear, structured outputs. Gemini excelled at detailed, context‑rich descriptions, precise text recognition, and nuanced spatial reasoning, positioning it as the most precise visual interpreter among the three. Claude’s narrative style added creative flair, but occasional imaginative guesses showed a trade‑off between storytelling and strict accuracy.

Conclusion and Guidance for Users

The side‑by‑side test reveals distinct personalities among the models. Gemini’s meticulous attention to detail and grounding in observable facts make it the top recommendation for tasks demanding high visual fidelity. ChatGPT offers a dependable, straightforward inventory suitable for quick reference, while Claude provides a more literary perspective that may appeal to users who value expressive summaries. The right choice depends on whether precision, reliability, or creative narration is the priority.

Source: techradar.com