submitted 10 months ago* (last edited 10 months ago) by Even_Adder@lemmy.dbzer0.com to c/fosai@lemmy.world

Abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos that mutually enhance each other. Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits. Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.
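For anyone who wants a feel for the "alignment before projection" idea, here is a toy sketch in plain Python. It is not the authors' code. It assumes the image and video features already live in one aligned space, and it stands in a single linear map for the shared projection. All names, dimensions, and the random stand-in features are hypothetical.

```python
import random

D_VIS = 4  # hypothetical width of the aligned visual feature space
D_LLM = 6  # hypothetical width of the LLM token-embedding space


def make_shared_projection(d_in, d_out, seed=0):
    """Build one weight matrix (d_in x d_out) used for BOTH modalities."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) / d_in ** 0.5 for _ in range(d_out)]
            for _ in range(d_in)]


def project(features, weight):
    """Map a list of visual token vectors (width d_in) to width d_out."""
    d_out = len(weight[0])
    return [
        [sum(x * weight[i][j] for i, x in enumerate(token)) for j in range(d_out)]
        for token in features
    ]


def build_llm_input(visual_tokens, text_tokens):
    """Prepend projected visual tokens to the text token embeddings."""
    return visual_tokens + text_tokens


def random_tokens(count, width, rng):
    """Random stand-in for encoder or embedding output (assumption, not real data)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(count)]


rng = random.Random(1)
weight = make_shared_projection(D_VIS, D_LLM)

image_features = random_tokens(4, D_VIS, rng)      # 4 patches from one image
video_features = random_tokens(2 * 4, D_VIS, rng)  # 2 frames x 4 patches
text_embeddings = random_tokens(5, D_LLM, rng)     # 5 text tokens

# The same projection handles both modalities because their features are
# assumed to be aligned before this step.
image_input = build_llm_input(project(image_features, weight), text_embeddings)
video_input = build_llm_input(project(video_features, weight), text_embeddings)

print(len(image_input), len(video_input), len(image_input[0]))
```

The point of the sketch is only that one projection serves both inputs, so the language model sees image and video tokens in the same embedding space. The real model's encoders, projection layers, and training recipe are described in the paper and repository linked below.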

Paper: https://arxiv.org/abs/2311.10122

Code: https://github.com/PKU-YuanGroup/Video-LLaVA

Demo: https://huggingface.co/spaces/LanguageBind/Video-LLaVA

this post was submitted on 21 Nov 2023