Despite their great potential, deep learning-based models are not yet trustworthy enough to warrant adoption in clinical practice, and research on the interpretability and explainability of deep learning is attracting considerable attention. The Multilayer Convolutional Sparse Coding (ML-CSC) data model provides a model-based explanation of convolutional neural networks (CNNs). In this article, we extend the ML-CSC framework to multimodal data for medical image segmentation and propose a merged joint feature extraction ML-CSC model. This work generalizes and improves upon our previous model by deriving a more elegant approach that merges feature extraction and convolutional sparse coding in a unified framework. A segmentation study on a multimodal magnetic resonance imaging (MRI) dataset confirms the effectiveness of the proposed approach. We also provide an interpretability study of the model parameters involved.