Key Achievement: Bidirectional reasoning supervision enables 3B parameter models to surpass label-only fine-tuned 70B models on multilingual financial sustainability classification — published at EMNLP 2025 Industry Track, Suzhou, China.
Large Language Models have demonstrated remarkable capabilities across NLP tasks, yet their effectiveness in the multilingual financial domain remains underexplored. This research tackles financial sustainability classification across four diverse languages — English, Hindi, Bengali, and Telugu — with Bengali and Telugu representing low-resource settings where annotated data is scarce. A novel bidirectional reasoning fine-tuning approach is introduced that integrates both positive and negative rationales alongside classification labels, consistently outperforming all baseline methods while enabling smaller models to match significantly larger ones.
The Challenge of Multilingual Financial NLP
Financial markets are inherently global, yet most financial NLP research focuses on high-resource languages like English. Stakeholders in multilingual regions such as South Asia face delays and inaccuracies when analyzing financial reports in local languages, leading to missed risks and suboptimal investment decisions. Extending sustainability classification to low-resource languages like Bengali and Telugu is critical for equitable global financial access and risk assessment.
Three Fine-Tuning Strategies Compared
Labels Only (Baseline)
Traditional fine-tuning trains LLMs solely on classification labels using cross-entropy loss, offering no explanatory reasoning for decisions.
Unidirectional Reasoning
Extends label-only training by adding a positive rationale explaining why a statement is classified as sustainable or unsustainable.
Bidirectional Reasoning (Ours)
Trains with both positive reasons (why the label applies) and negative reasons (why the opposite does not apply), creating a contrastive supervision framework.
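The contrastive supervision above amounts to training the model to emit the label together with both rationales as one target sequence. A minimal sketch of that target construction — the template wording and field names are illustrative assumptions, not the paper's exact format:

```python
def build_target(label: str, pos_reason: str, neg_reason: str) -> str:
    """Compose the supervised output string for bidirectional fine-tuning.

    The model learns to emit the label, a positive rationale (why the
    label applies), and a negative rationale (why the opposite label
    does not). The template here is an illustrative assumption.
    """
    return (
        f"Label: {label}\n"
        f"Positive reason: {pos_reason}\n"
        f"Negative reason: {neg_reason}"
    )

example = build_target(
    "sustainable",
    "The statement commits to a measurable emissions-reduction target.",
    "The claim is tied to audited figures, so it is not unsubstantiated.",
)
```

The label-only baseline would keep just the first line of this target; the unidirectional variant keeps the first two.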
Performance Highlights
Benchmark Results — English Language
The bidirectional reasoning approach was evaluated across LLaMA-3.2 (3B), LLaMA-3.1 (8B), LLaMA-3.1 (70B), and the Qwen-2.5 family. Accuracy and F1 on the English financial sustainability dataset:
| Model | Fine-Tuning Method | Accuracy (%) | F1 (%) |
|---|---|---|---|
| LLaMA-3.2 (3B) | Labels Only | 94.71 | 95.16 |
| LLaMA-3.2 (3B) | Unidirectional Reason | 94.71 | 95.12 |
| LLaMA-3.2 (3B) | Bidirectional Reasons (Ours) | 96.92 | 97.17 |
| LLaMA-3.1 (70B) | Labels Only | 93.83 | 94.26 |
| LLaMA-3.1 (70B) | Bidirectional Reasons (Ours) | 96.48 | 96.80 |
The 3B bidirectional model (F1: 97.17%) surpasses the label-only 70B model (F1: 94.26%) — demonstrating that structured reasoning supervision can substitute for a more than 20× difference in parameter count.
Three-Stage Pipeline
Automated Reason Generation
GPT-4o automatically generates both positive and negative rationales for each training statement across all four languages, eliminating the need for costly human annotation.
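A minimal sketch of this rationale-generation step as a prompt builder. The prompt wording and the `rationale_prompt` helper are assumptions for illustration; the actual GPT-4o call (e.g. via the OpenAI chat completions API) and response parsing are left as a comment:

```python
def rationale_prompt(statement: str, label: str, language: str) -> str:
    """Build a prompt requesting both rationales for a labeled statement.

    Prompt wording is an illustrative assumption, not the paper's template.
    """
    opposite = "unsustainable" if label == "sustainable" else "sustainable"
    return (
        f"The following financial statement (in {language}) is labeled "
        f"'{label}':\n\n{statement}\n\n"
        f"1. Positive reason: explain why the label '{label}' applies.\n"
        f"2. Negative reason: explain why the label '{opposite}' does not apply.\n"
        f"Answer in {language}."
    )

# In the pipeline, this prompt would be sent to GPT-4o and the two
# rationales parsed from the reply for every training statement.
prompt = rationale_prompt(
    "The firm pledges net-zero operations by 2035.", "sustainable", "English"
)
```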
PEFT Fine-Tuning with LoRA
Models are fine-tuned using LoRA (rank 64, alpha 16) with bidirectional supervision, minimizing cross-entropy loss jointly over the classification label, the positive reason R+, and the negative reason R-.
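The joint objective can be illustrated with a toy token-level computation. Assuming standard causal-LM fine-tuning — mean negative log-likelihood over all target tokens, with the label, R+, and R- concatenated into one target sequence — a sketch looks like:

```python
import math


def joint_nll(token_probs: dict) -> float:
    """Mean negative log-likelihood over label, R+, and R- tokens.

    `token_probs` maps each target segment to the model's probabilities
    for its gold tokens; segments are concatenated so a single
    cross-entropy objective supervises all three (toy illustration).
    """
    all_probs = (
        token_probs["label"] + token_probs["positive"] + token_probs["negative"]
    )
    return -sum(math.log(p) for p in all_probs) / len(all_probs)


loss = joint_nll({
    "label": [0.9],          # P(gold label token)
    "positive": [0.8, 0.7],  # P(gold tokens of R+)
    "negative": [0.6, 0.5],  # P(gold tokens of R-)
})
```

In practice this loss is minimized over the LoRA adapter parameters only, leaving the base model weights frozen.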
Multilingual Evaluation
Models are assessed across English, Hindi, Bengali, and Telugu on financial sustainability classification, covering high- to low-resource language settings.
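Per-language evaluation reduces to accuracy and F1 over binary predictions. A self-contained sketch — the label strings are assumptions:

```python
def accuracy_f1(gold: list, pred: list, positive: str = "sustainable"):
    """Accuracy and F1 for binary sustainability classification."""
    assert len(gold) == len(pred)
    correct = sum(g == p for g, p in zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return correct / len(gold), f1


# Toy example: one false negative out of four predictions.
acc, f1 = accuracy_f1(
    ["sustainable", "unsustainable", "sustainable", "unsustainable"],
    ["sustainable", "unsustainable", "unsustainable", "unsustainable"],
)
```

Running this function once per language yields the per-language accuracy and F1 figures reported in the benchmark table.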
Generalization Beyond Finance
To validate robustness across domains, experiments were conducted on hate speech (ETHOS dataset) and ethics classification (DFAR dataset). The bidirectional reasoning approach consistently outperformed alternatives in both accuracy and F1 score, confirming its generalizability beyond the financial context.
Key Contributions
- Multilingual Coverage: Advances financial sustainability classification to include low-resource languages Bengali and Telugu alongside Hindi and English.
- Bidirectional Reasoning Framework: A novel contrastive fine-tuning method supervising LLMs with both positive and negative rationales, improving classification performance and decision interpretability.
- Efficient Deployment: Combined with PEFT and LoRA, the approach enables small 3B models to match or outperform 70B models fine-tuned with conventional label-only methods.