I'm looking for a captioning model that can describe a group of images in a single sentence. Alternatively, I need a way to conceptually average a group of images before feeding that "concept" (presumably a feature vector) into a regular captioning model.
Why?
For LoRA training evaluation. It would be useful to test the trained generation model on a prompt that fits the dataset as a whole, instead of picking captions of individual images or trying to work out what they have in common. This would also make it possible to produce a single negative prompt to test how the model behaves on out-of-scope prompts.
What I've done so far: I've modified an existing CLIP+BLIP interrogator to work with image sets (it can also produce negatives). CLIP captioning lets me average the image features before using them to pick the best caption, but its captions are far less accurate than those produced by BLIP, which only works on single images. I need a model that takes feature vectors as input, like CLIP does, so I can preprocess (average) them first.
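For context, here's a minimal sketch of the averaging step I mean, using the Hugging Face transformers CLIP wrappers (not my actual interrogator code): the image paths and the candidate caption list are placeholders standing in for the dataset and the interrogator's caption bank.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image_paths = ["img1.png", "img2.png", "img3.png"]                    # placeholder dataset
candidate_captions = ["a photo of a cat", "a watercolor landscape"]   # placeholder caption bank

with torch.no_grad():
    # Encode every image, L2-normalize, then average into a single "concept" vector.
    images = [Image.open(p).convert("RGB") for p in image_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    concept = image_feats.mean(dim=0)
    concept = concept / concept.norm()

    # Rank candidate captions by cosine similarity against the averaged vector.
    text_inputs = processor(text=candidate_captions, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = text_feats @ concept
    print(candidate_captions[scores.argmax().item()])
```

This works fine as a ranking step, but it only selects from a fixed pool of caption fragments, which is why the results are so much weaker than BLIP's generated captions. What I'm after is a generative captioner whose decoder conditions on a feature vector I can average in the same way.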