MSc. Thesis Defense: Kerem Aydın, Domain generalized remote sensing scene captioning via country-level geographic information

Thesis title:
Domain generalized remote sensing scene captioning via country-level geographic information

Kerem Aydın

Data Science, MSc. Thesis, July 2025

Date & Time:

2 July 2025 - 11:30 AM

Place: FENS L061

Jury Members:
Prof. Erchan Aptoula (Thesis Advisor), Assist. Prof. Dilara Keküllüoğlu,
Assoc. Prof. Alp Ertürk

Keywords:

Scene Captioning, Domain Adaptation, Remote Sensing, Open Vocabulary Classification

Abstract

This thesis addresses the critical challenge of domain generalization in scene captioning of remote sensing images. Due to substantial geographic variability across regions, vision models often struggle to generalize effectively to unseen domains, limiting their practical deployment. To tackle this issue, a novel approach is proposed that enriches model input prompts with country-level geographic metadata—specifically, detailed textual descriptions extracted from Wikipedia. This semantic geographic context enables models to better adapt to new regions without requiring labeled data from target domains, serving as a form of weak supervision that mitigates domain shift. For rigorous evaluation, a carefully curated subset of the large-scale SkyScript dataset was constructed, focusing on frequent scene classes and split so that training occurs solely on European countries, while testing covers global data. This design simulates realistic cross-country domain shifts in remote sensing. Experiments were performed on two state-of-the-art vision-language architectures, LLaVA 1.5 and LLaMA 3.2, comparing baseline models without geographic metadata to enhanced models incorporating Wikipedia-derived geographic context. All models were fine-tuned with parameter-efficient adaptation techniques and evaluated primarily using accuracy. The findings demonstrate consistent and meaningful performance improvements when geographic metadata is integrated, particularly in countries and regions underrepresented in training data. Qualitative analysis further reveals that models with geographic context generate richer, more region-specific, and semantically grounded scene descriptions compared to baselines. 
Key contributions of this thesis include: the creation of a geographically split subset of SkyScript enabling systematic study of geographic domain shifts; the development of a novel method leveraging external, structured geographic knowledge to enhance model geographic awareness; and extensive empirical validation across multiple architectures showing that incorporating geographic metadata is an effective strategy for improving cross-domain generalization in multimodal remote sensing. These contributions advance both theoretical understanding and practical applications of domain generalization in vision-language tasks, supporting more robust deployment of remote sensing models across diverse geographic landscapes.
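
The two core ideas in the abstract, a geographic train/test split and prompts enriched with country-level text, can be illustrated with a minimal sketch. It is not the thesis implementation: the `Sample` fields, the `held_out` flag, the country set, the placeholder descriptions, and the prompt wording are all assumptions made for illustration. The abstract states only that training uses European countries, that testing covers global data, and that the country descriptions come from Wikipedia.

```python
from dataclasses import dataclass

# Hypothetical training-domain countries; the abstract does not list them.
EUROPEAN_COUNTRIES = {"Germany", "France", "Italy", "Spain"}

# Placeholders standing in for Wikipedia-derived country descriptions.
COUNTRY_CONTEXT = {
    "Germany": "<Wikipedia-derived description of Germany>",
    "Kenya": "<Wikipedia-derived description of Kenya>",
}

# Hypothetical instruction text; the actual prompt wording is not given.
QUESTION = "Describe the scene shown in this remote sensing image."


@dataclass
class Sample:
    image_path: str
    country: str
    scene_class: str
    held_out: bool = False  # assumed flag marking samples reserved for testing


def split_by_geography(samples):
    """Train only on non-held-out European samples; test on everything else."""
    train, test = [], []
    for sample in samples:
        if sample.country in EUROPEAN_COUNTRIES and not sample.held_out:
            train.append(sample)
        else:
            test.append(sample)
    return train, test


def build_prompt(country, with_context=True):
    """Return the baseline prompt, or one prefixed with country-level context."""
    context = COUNTRY_CONTEXT.get(country) if with_context else None
    if context is None:
        # Baseline setting, or no description available for this country.
        return QUESTION
    return f"Country: {country}\nBackground: {context}\n\n{QUESTION}"


if __name__ == "__main__":
    data = [
        Sample("a.png", "Germany", "farmland"),
        Sample("b.png", "Germany", "farmland", held_out=True),
        Sample("c.png", "Kenya", "farmland"),
    ]
    train, test = split_by_geography(data)
    print(len(train), "training sample(s),", len(test), "test sample(s)")
    print(build_prompt("Kenya"))
```

In this sketch the baseline and enriched settings differ only in the text prompt, which mirrors the comparison described in the abstract: the image input and the fine-tuning procedure stay the same, and only the presence of geographic context changes.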