ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Wenjun Huang¹, Jiakai Pan¹, Jiahao Tang¹, Yanyu Ding²,

Yifei Xing³, Yuhe Wang¹, Zhengzhuo Wang¹, Jianguo Hu¹*

¹Sun Yat-sen University

²Dongguan University of Technology

³University of the Chinese Academy of Sciences


Abstract

Multimodal Large Language Models (MLLMs) have attracted much attention for their versatility. However, traditional Transformer architectures incur significant overhead because their computational complexity is quadratic in sequence length. To address this issue, we introduce ML-Mamba, a multimodal language model that uses the recent and efficient Mamba-2 model for inference. Mamba-2 is known for its linear scalability and fast processing of long sequences. We replace the Transformer-based backbone with a pre-trained Mamba-2 model and explore methods for integrating 2D visual selective scanning mechanisms into multimodal learning, while also trying various visual encoders and Mamba-2 model variants. Our extensive experiments on various multimodal benchmarks demonstrate the competitive performance of ML-Mamba and highlight the potential of state space models in multimodal tasks. Our contributions and findings are: (1) We empirically explore the application of 2D visual selective scanning in multimodal learning and propose the Mamba-2 Scan Connector (MSC) to enhance representational capabilities; (2) ML-Mamba achieves performance comparable to state-of-the-art methods such as TinyLLaVA and MobileVLM v2 through its linear sequential modeling, while offering faster inference speed; (3) Compared to multimodal models utilizing Mamba-1, the Mamba-2-based ML-Mamba exhibits superior inference performance and effectiveness.

ML-Mamba

The architecture of ML-Mamba consists of four main components: a pre-trained visual encoder, a randomly initialized multimodal connector called the Mamba-2 Scan Connector (MSC), a multi-layer perceptron (MLP) projector, and a pre-trained large language model (Mamba-2 LLM). Given an image as input, visual features are first extracted by the visual encoder. The extracted sequence of visual features is then fed into the MSC, whose output is mapped into the LLM's input space by the MLP projector. The projector's output vectors are then combined with the tokenized text query and fed into the Mamba-2 LLM, which generates the corresponding response.
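The data flow described above can be sketched at the level of tensor shapes. This is a minimal illustration, not the authors' implementation: every function name, the dimensions (`d_vis`, `d_llm`, `vocab`), the random placeholder weights, the two-layer ReLU projector, and the choice to place visual tokens before text tokens are assumptions made only for this example. The MSC is represented by an identity stand-in.

```python
import random

def rand_matrix(rows, cols, rng):
    """Random placeholder values (stand-in for trained weights or features)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def linear(x, w):
    """Multiply each row of x (n x d_in) by w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def visual_encoder(num_patches, d_vis, rng):
    # Stand-in for the pre-trained visual encoder: one feature vector per patch.
    return rand_matrix(num_patches, d_vis, rng)

def msc(features):
    # Stand-in for the MSC: an identity map that preserves the sequence shape.
    return [row[:] for row in features]

def mlp_projector(features, w1, w2):
    # Hypothetical two-layer MLP with ReLU (depth and activation are assumptions).
    hidden = [[max(0.0, v) for v in row] for row in linear(features, w1)]
    return linear(hidden, w2)

def build_llm_input(num_patches, text_token_ids, d_vis=8, d_llm=16, vocab=100, seed=0):
    rng = random.Random(seed)
    vis = msc(visual_encoder(num_patches, d_vis, rng))
    w1 = rand_matrix(d_vis, d_llm, rng)
    w2 = rand_matrix(d_llm, d_llm, rng)
    vis_tokens = mlp_projector(vis, w1, w2)
    embed = rand_matrix(vocab, d_llm, rng)  # placeholder text embedding table
    txt_tokens = [embed[t] for t in text_token_ids]
    # Assumption: visual tokens are placed before the text tokens.
    return vis_tokens + txt_tokens

sequence = build_llm_input(num_patches=4, text_token_ids=[5, 17, 42])
print(len(sequence), len(sequence[0]))  # 7 16
```

The resulting sequence of 4 visual tokens and 3 text tokens, all of width `d_llm`, is what a language model backbone would consume in this sketch.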

Multimodal Connector

Multimodal connectors sit between the visual features and the language model and integrate visual and linguistic information. In this study, we explore a novel multimodal connector architecture, the Mamba-2 Scan Connector (MSC), which targets the lack of a clear causal ordering in visual data. The core of the MSC module is a combination of the two-dimensional Mamba-2 visual selective scanning (MVSS) module and the SwiGLU module. We integrate this module into the multimodal connector of the ML-Mamba multimodal learning framework. Specifically, we study three variants of multimodal connectors:

MLP: a three-layer multi-layer perceptron (MLP) that aligns the visual and text features.
MSC-MLP (Basic): an MSC module without the SwiGLU module, intended to improve the processing of two-dimensional non-causal visual information, followed by an MLP that aligns the visual and text features.
MSC-MLP (Advanced): an MSC module that includes the SwiGLU module, followed by an MLP.
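For readers unfamiliar with SwiGLU, the sketch below shows the standard gated form, in which a Swish-activated projection gates a second linear projection before an output projection. The weight names (`w_gate`, `w_up`, `w_down`), the stage labels in `CONNECTOR_VARIANTS`, and the ordering of MVSS before SwiGLU inside the MSC are assumptions for illustration; the source does not specify these details.

```python
import math

def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def vecmat(x, w):
    """Multiply a vector x (d_in) by a matrix w (d_in x d_out)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]

def swiglu(x, w_gate, w_up, w_down):
    """Standard SwiGLU: (swish(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = [swish(v) for v in vecmat(x, w_gate)]
    up = vecmat(x, w_up)
    return vecmat([g * u for g, u in zip(gate, up)], w_down)

# Hypothetical stage lists for the three connector variants described above.
CONNECTOR_VARIANTS = {
    "mlp": ["mlp"],
    "msc_mlp_basic": ["mvss", "mlp"],
    "msc_mlp_advanced": ["mvss", "swiglu", "mlp"],
}

identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu([1.0, 0.0], identity, identity, identity)
print(out)  # first entry equals swish(1.0); second entry is 0.0
```

With identity weights, the gate leaves only the first coordinate active, which makes the element-wise gating easy to see.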

2D scanning mechanisms

The MSC module bridges the gap between the 1D sequential processing typical of state space models (SSMs) and 2D non-causal visual information by introducing two 2D scanning mechanisms:

Bidirectional-Scan Mechanism (BSM): scans the complementary features of the image in both the forward and backward directions, capturing a broader context without increasing computational complexity.
Cross-Scan Mechanism (CSM): unfolds the image patch features into sequences along rows and columns and scans them in four directions (diagonally across the image).
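The two scanning mechanisms can be illustrated by the patch orderings they produce. The sketch below is written under stated assumptions: the four CSM orders are taken to be row-major, column-major, and their reverses (a common cross-scan layout), and the per-direction outputs are merged by summation. The function names are hypothetical, and the sequence model applied to each ordering is omitted.

```python
def bidirectional_scan(h, w):
    """BSM sketch: forward (row-major) order and its reverse."""
    forward = [r * w + c for r in range(h) for c in range(w)]
    return [forward, forward[::-1]]

def cross_scan(h, w):
    """CSM sketch: row-major, column-major, and their reverses (assumed layout)."""
    row_major = [r * w + c for r in range(h) for c in range(w)]
    col_major = [r * w + c for c in range(w) for r in range(h)]
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def merge_scans(scanned, orders):
    """Scatter each scanned sequence back to patch positions and sum (assumption)."""
    merged = [0.0] * len(orders[0])
    for seq, order in zip(scanned, orders):
        for pos, patch in enumerate(order):
            merged[patch] += seq[pos]
    return merged

# A 2 x 3 grid of patches, indexed 0..5 in row-major order.
orders = cross_scan(2, 3)
print(orders[0])  # [0, 1, 2, 3, 4, 5]
print(orders[1])  # [0, 3, 1, 4, 2, 5]

# With an identity "model", merging four scans returns 4x the input features.
feats = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
scanned = [[feats[i] for i in order] for order in orders]
print(merge_scans(scanned, orders))  # [40.0, 80.0, 120.0, 160.0, 200.0, 240.0]
```

Each ordering gives a 1D sequence model a different causal view of the same patches, so every patch can receive context from patches on all sides once the directions are merged.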

Figure: Mamba-2 Scan Connector (BSM, with SwiGLU) and Mamba-2 Scan Connector (CSM, with SwiGLU).

Examples of ML-Mamba chat

BibTeX

@misc{huang2024mlmamba,
      title={ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2},
      author={Wenjun Huang and Jianguo Hu},
      year={2024},
      eprint={2407.19832},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2407.19832},
}