I'm looking for a captioning model that can describe a group of images in a single sentence. Alternatively, I need a way to conceptually average a group of images before feeding that "concept" (presumably a feature vector) into a regular captioning model.
Why?
For LoRA training evaluation. It would be useful to test the trained generation model on a prompt that fits the dataset as a whole, instead of picking captions of individual images or trying to work out what they have in common. This would also make it possible to produce a single negative prompt to test how the model behaves on out-of-scope prompts.
What I've done so far: I've modified an existing CLIP+BLIP interrogator to work with image sets (it can also produce negatives). CLIP captioning lets me average the image features before using them to pick the best caption, but its captions are far less accurate than those produced by BLIP, which only works on single images. I need a model that takes feature vectors as input, like CLIP does, so I can preprocess (average) them first.
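For context, here's a minimal sketch of the averaging step I mean, using the Hugging Face transformers CLIP wrappers (not my actual interrogator code): the image paths and the candidate caption list are placeholders standing in for the dataset and the interrogator's caption bank.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image_paths = ["img1.png", "img2.png", "img3.png"]                    # placeholder dataset
candidate_captions = ["a photo of a cat", "a watercolor landscape"]   # placeholder caption bank

with torch.no_grad():
    # Encode every image, L2-normalize, then average into a single "concept" vector.
    images = [Image.open(p).convert("RGB") for p in image_paths]
    image_inputs = processor(images=images, return_tensors="pt")
    image_feats = model.get_image_features(**image_inputs)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    concept = image_feats.mean(dim=0)
    concept = concept / concept.norm()

    # Rank candidate captions by cosine similarity against the averaged vector.
    text_inputs = processor(text=candidate_captions, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**text_inputs)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    scores = text_feats @ concept
    print(candidate_captions[scores.argmax().item()])
```

This works fine as a ranking step, but it only selects from a fixed pool of caption fragments, which is why the results are so much weaker than BLIP's generated captions. What I'm after is a generative captioner whose decoder conditions on a feature vector I can average in the same way.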