Key Achievement: Progressive PHM reparameterization reduces an 8.29B VLM to 5.38B parameters with 48% faster inference and 27% fewer FLOPs, while maintaining comparable captioning quality — accepted at WACV 2026.
Multimodal large language models (MLLMs) have become foundational to modern AI, enabling joint processing of text and vision. Yet state-of-the-art models contain billions of parameters, demanding immense computational resources for deployment. This research introduces a progressive reparameterization strategy that physically compresses these models by replacing dense feed-forward network blocks with compact Parameterized Hypercomplex Multiplication (PHM) layers — reducing parameters and FLOPs while preserving strong multimodal alignment.
The Deployment Problem
Existing efficiency methods face fundamental limitations: PEFT approaches like LoRA and QLoRA reduce fine-tuning overhead but leave the physical parameter count unchanged. Structural methods such as sparse attention improve scalability but are not yet deeply integrated into multimodal pipelines. Knowledge distillation aligns large teachers with smaller students but struggles with multimodal supervision. This research unifies all three approaches into a single cohesive framework that achieves genuine model compression.
Core Innovation: Hypercomplex Compression
PHM Layer Substitution
Dense FFN blocks are replaced with PHM layers that factorize large matrices via Kronecker products of fixed 2x2 hypercomplex bases with small learnable matrices, reducing per-layer parameters by a factor of 2 (B=2) or 1.33 (B=3).
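A minimal NumPy sketch of this factorization (variable names and the choice of bases are illustrative, not the paper's code): a dense weight is rebuilt as a sum of B Kronecker products of fixed 2x2 bases with small learnable factors, so parameter count drops from d_out*d_in to B*(d_out*d_in/4), i.e. a 4/B compression factor.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, B = 8, 8, 2

# Fixed 2x2 hypercomplex-style bases (assumed here: identity and a rotation)
H = [np.array([[1.0, 0.0], [0.0, 1.0]]),
     np.array([[0.0, -1.0], [1.0, 0.0]])][:B]

# Small learnable factors, each (d_out/2 x d_in/2)
S = [rng.standard_normal((d_out // 2, d_in // 2)) for _ in range(B)]

# W_phm = sum_i  H_i  kron  S_i  reconstructs a full (d_out x d_in) weight
W_phm = sum(np.kron(H_i, S_i) for H_i, S_i in zip(H, S))

x = rng.standard_normal(d_in)
y = W_phm @ x  # forward pass of the compressed layer

dense_params = d_out * d_in
phm_params = sum(s.size for s in S)
print(dense_params, phm_params)  # 64 32 -> 2x compression for B=2
```

With B=3 the same construction stores 48 parameters instead of 64, matching the 1.33x factor in the table below.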
Residual Fade-In Schedule
A smooth interpolation schedule ensures the network starts as the dense teacher and gradually hands over computation to PHM — eliminating instability during the dense-to-hypercomplex transition.
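A sketch of such a convex blend (the exact schedule shape is an assumption; the paper only specifies a smooth interpolation): at step 0 the block is purely dense, and by the final step computation has fully handed over to PHM.

```python
import math

def fade_alpha(step: int, total_steps: int) -> float:
    """Cosine ramp from 0 to 1; any smooth monotone schedule would do."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * t))

def blended_forward(x, dense_fn, phm_fn, step, total_steps):
    # Convex combination: y = (1 - alpha) * dense(x) + alpha * phm(x)
    a = fade_alpha(step, total_steps)
    return (1.0 - a) * dense_fn(x) + a * phm_fn(x)

# Toy check: a "dense" branch that doubles and a "PHM" branch that triples
print(blended_forward(1.0, lambda v: 2 * v, lambda v: 3 * v, 0, 100))    # 2.0
print(blended_forward(1.0, lambda v: 2 * v, lambda v: 3 * v, 100, 100))  # 3.0
```

Because the blend is convex, the output stays within the range spanned by the two branches at every step, which is what prevents abrupt shifts during the transition.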
Dense-to-PHM Distillation
Three complementary objectives: cross-entropy loss for accuracy, knowledge distillation from the dense teacher, and a reconstruction penalty keeping PHM operators functionally equivalent to their dense counterparts.
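The combined objective can be sketched as follows (loss weights and the KD temperature are illustrative hyperparameters, not values from the paper):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cross_entropy(student_logits, label):
    # Task loss on ground-truth tokens
    return -np.log(softmax(student_logits)[label])

def kd_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on temperature-softened distributions
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float(np.sum(p * (np.log(p) - np.log(q)))) * T * T

def recon_penalty(W_dense, W_phm):
    # Keeps the PHM operator close to its dense counterpart (Frobenius)
    return float(np.sum((W_dense - W_phm) ** 2))

student = np.array([2.0, 0.5, -1.0])
teacher = np.array([2.5, 0.2, -1.2])
W_d = np.eye(2)
W_p = 0.9 * np.eye(2)

# L = L_ce + lam_kd * L_kd + lam_rec * ||W_dense - W_phm||_F^2
loss = (cross_entropy(student, label=0)
        + 0.5 * kd_loss(student, teacher)
        + 0.1 * recon_penalty(W_d, W_p))
```

Each term pulls on a different failure mode: cross-entropy keeps task accuracy, KD transfers the teacher's output distribution, and the reconstruction penalty keeps the compressed weights functionally close to the originals.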
LoRA on Attention
Attention projections (Q/K/V) are adapted with lightweight LoRA modules while heavy FFN blocks undergo PHM compression — combining representational adaptation with structural compactness.
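A minimal LoRA sketch for one attention projection (names and rank are illustrative): the frozen weight gains a trainable low-rank update B @ A, adding only r*(d_in + d_out) parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2  # model dim and LoRA rank (r << d)

W_q = rng.standard_normal((d, d))        # frozen pretrained Q projection
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B_lora = np.zeros((d, r))                # trainable up-projection, zero init

def lora_forward(x, scaling=1.0):
    # output = W_q x + scaling * B (A x); zero-initialized B means the
    # adapted layer starts exactly at the pretrained behavior
    return W_q @ x + scaling * (B_lora @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(lora_forward(x), W_q @ x)  # identical to frozen layer at init

trainable = A.size + B_lora.size  # 64 params vs 256 in the dense projection
```

This is why the two techniques compose cleanly: LoRA adapts the attention pathway with a handful of trainable parameters, while the parameter-heavy FFN blocks are where PHM compression pays off.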
Compression Results on Qwen2.5-VL 7B
PHM Capacity and Compression Trade-Off
The PHM factorization replaces a dense matrix W_l with a structured sum of Kronecker products. The basis count B controls the compression-expressivity trade-off:
| Configuration | Compression Factor | Applied To |
|---|---|---|
| Dense (baseline) | 1x | All layers |
| PHM, B=2 | 2x | Lower language layers |
| PHM, B=3 (selective) | 1.33x | Top K language layers (highest Fisher sensitivity) |
A sensitivity-guided capacity assignment allocates B=3 to the top K language transformer layers where logit sensitivity is highest, and B=2 elsewhere — achieving a favorable loss-per-parameter trade-off across the full transformer stack.
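The assignment rule itself is simple to sketch (the scores below are toy numbers; the paper derives them from Fisher/logit sensitivity per layer): the top-K most sensitive layers get the lighter B=3 compression, everything else gets B=2.

```python
def assign_basis_counts(sensitivity, top_k):
    """Give B=3 to the top_k most sensitive layers, B=2 to the rest."""
    order = sorted(range(len(sensitivity)),
                   key=lambda i: sensitivity[i], reverse=True)
    high = set(order[:top_k])
    return [3 if i in high else 2 for i in range(len(sensitivity))]

fisher = [0.12, 0.40, 0.05, 0.33, 0.08]      # toy per-layer sensitivity
print(assign_basis_counts(fisher, top_k=2))  # -> [2, 3, 2, 3, 2]
```

The effect is that aggressive 2x compression is reserved for layers whose outputs are least sensitive to weight perturbation, while sensitive layers keep more capacity at a modest 1.33x.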
Three-Phase Training Pipeline
Dense-to-PHM Initialization
PHM parameters are initialized by projecting pretrained dense weights into the PHM subspace via least-squares, ensuring the initial approximation remains close to the dense operator in Frobenius norm.
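Since the bases are fixed, this projection is an ordinary linear least-squares problem in the small factors. A tiny-dimension sketch (bases and sizes are illustrative): build the design matrix whose columns are vectorized Kronecker products, then solve for the factors minimizing the Frobenius gap.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                                 # dense weight is d x d
m = d // 2                                            # each factor is m x m
H = [np.eye(2), np.array([[0.0, -1.0], [1.0, 0.0]])]  # fixed 2x2 bases (assumed)

W_dense = rng.standard_normal((d, d))

# Columns of the design matrix: vec(kron(H_i, E_jk)) for each entry of S_i
cols = []
for H_i in H:
    for j in range(m):
        for k in range(m):
            E = np.zeros((m, m))
            E[j, k] = 1.0
            cols.append(np.kron(H_i, E).ravel())
M = np.stack(cols, axis=1)                            # (d*d) x (B*m*m)

# Least-squares projection of the dense weight into the PHM subspace
s, *_ = np.linalg.lstsq(M, W_dense.ravel(), rcond=None)
S = s.reshape(len(H), m, m)

W_phm = sum(np.kron(H_i, S_i) for H_i, S_i in zip(H, S))
err = np.linalg.norm(W_dense - W_phm)                 # Frobenius gap after init
```

Because this is an orthogonal projection onto the span of the Kronecker terms, the residual can never exceed the norm of the original weight, which is what keeps the initial PHM operator close to the dense one.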
Residual Transition Training
Three objectives guide training simultaneously: label-smoothed cross-entropy, temperature-scaled KD from the dense teacher, and a reconstruction penalty ensuring functional equivalence throughout the transition.
PHM-Only Fine-Tuning
After dense layers are fully removed, the PHM-only model undergoes brief fine-tuning to consolidate compression gains and maximize task performance without the dense scaffold.
Key Contributions
- Unified Compression Framework: First approach combining PHM reparameterization, LoRA adaptation, and knowledge distillation in a single multimodal compression pipeline.
- Stabilization Mechanisms: Convex-blended residual transitions with reconstruction-based alignment enable stable dense-to-PHM conversion without performance collapse.
- Selective Capacity Assignment: Fisher-sensitivity-guided basis count allocation achieves the best loss-per-parameter trade-off across the transformer stack.
Implications for AI-Driven Systems
Efficient multimodal models unlock new capabilities for AI systems that process both visual data (charts, documents, images) and textual information simultaneously. The PHM framework's ability to achieve near-50% inference speedups while preserving output quality is directly relevant to latency-sensitive applications — enabling sophisticated multimodal reasoning on resource-constrained infrastructure at a fraction of the computational cost.
Open Source: Code is publicly available at github.com/milab-nsu/PHM, enabling reproducible research and community adoption.