Research
My research sits at the intersection of vision, language, and speech — building systems that see, hear, and reason together.
Supporting production ML workflows for Falcon-H, TII's large-scale multi-modal model. This work spans internet-scale data acquisition and cleaning, and the generation of model-alignment datasets.
Designed and contributed to VisRes Bench, a benchmark for systematically evaluating the visual reasoning capabilities of vision-language models across diverse and challenging tasks.
Investigated fundamental failure modes of VLMs on visually obvious reasoning tasks, culminating in a co-authored paper accepted at ICCV 2025.
Designed and built SVLA, a unified assistant capable of jointly processing speech, images, and text in a single end-to-end framework.
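To illustrate the general idea of a single end-to-end model over speech, images, and text (this is a minimal sketch with hypothetical module names and sizes, not the actual SVLA code), each modality can be projected into a shared token space and processed by one backbone:

```python
# Minimal sketch of a unified speech-vision-language forward pass.
# All encoders, dimensions, and names are illustrative assumptions.
import torch
import torch.nn as nn

class UnifiedAssistant(nn.Module):
    def __init__(self, d_model=256, vocab_size=1000):
        super().__init__()
        # Modality-specific projections into a shared embedding space.
        self.speech_proj = nn.Linear(80, d_model)    # e.g. 80-dim log-mel frames
        self.image_proj = nn.Linear(512, d_model)    # e.g. patch-level visual features
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # One transformer reasons jointly over the concatenated sequence.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # next-token prediction

    def forward(self, speech_feats, image_feats, text_ids):
        tokens = torch.cat([
            self.speech_proj(speech_feats),
            self.image_proj(image_feats),
            self.text_embed(text_ids),
        ], dim=1)
        return self.lm_head(self.backbone(tokens))

model = UnifiedAssistant()
logits = model(torch.randn(1, 50, 80),              # 50 speech frames
               torch.randn(1, 16, 512),             # 16 image patch features
               torch.randint(0, 1000, (1, 12)))     # 12 text tokens
print(logits.shape)  # (1, 78, 1000)
```

The point of the single-backbone design is that cross-modal attention happens in one place, rather than stitching separate ASR, vision, and language models together.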
Conducted competitive and academic VQA research, achieving top global rankings and publishing benchmark datasets and survey papers.
Designed a voice-based interface for smart home control using a speech recognition system combined with a text-to-CDQL encoder-decoder RNN, achieving 93% accuracy and 0.02% WER.
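A minimal sketch of the text-to-CDQL step: a toy GRU encoder-decoder that maps a transcribed voice command to query tokens. Vocabulary sizes, names, and the toy setup are illustrative assumptions, not the deployed system.

```python
# Toy GRU encoder-decoder mapping an ASR transcript to CDQL-style query tokens.
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab=200, tgt_vocab=120, d=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the transcript; its final hidden state seeds the decoder.
        _, h = self.encoder(self.src_embed(src_ids))
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_out)  # per-step logits over CDQL tokens

model = Seq2Seq()
# e.g. src = "turn on the living room lights", tgt = the corresponding query tokens
src = torch.randint(0, 200, (1, 8))
tgt = torch.randint(0, 120, (1, 15))
print(model(src, tgt).shape)  # (1, 15, 120)
```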
Studied adversarial robustness of ASR systems for mission-critical applications (autonomous vehicles, IoT, personal assistants). Key outputs: