Research Experience

My research sits at the intersection of vision, language, and speech — building systems that see, hear, and reason together.

Multi-modal AI & Large-Scale Data Engineering
Jan 2025 – Present
Technology Innovation Institute (TII) · Engineering Consultant

Supporting production ML workflows for Falcon-H — TII's large-scale multi-modal model. My work spans internet-scale data acquisition, cleaning, and model alignment dataset generation.

  • Architected ETL pipelines to crawl, deduplicate, and normalize internet-scale multi-modal datasets.
  • Processed 3M+ PDFs via OCR, layout parsing, and CV-based structured text extraction.
  • Synthesized large-scale SFT datasets using GPT-4, Gemini, Claude, and Qwen for Falcon-H alignment.
  • Unified data from 10+ agent platforms into training-ready multi-modal corpora with quality-scoring pipelines.
  • Built STEM-VQA evaluation pipelines covering charts, equations, and scientific plots.
  • Trained VLMs on GCP and AWS across distributed Slurm clusters, optimizing GPU utilization.
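Deduplication at internet scale usually starts with cheap exact-match hashing before any fuzzier matching. As a minimal sketch (not the Falcon-H pipeline itself; the `normalize` and `dedupe` helpers here are illustrative names), hash-based exact dedup looks like this:

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies collide.
    return " ".join(text.lower().split())

def dedupe(records):
    """Drop exact duplicates by hashing normalized text.

    A minimal sketch: production pipelines typically add MinHash/LSH
    on top of this to catch near-duplicates as well.
    """
    seen, unique = set(), []
    for rec in records:
        digest = hashlib.sha256(normalize(rec).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(rec)
    return unique
```

Hashing the normalized form keeps memory bounded by one digest per unique document, which is what makes the approach viable at crawl scale.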
LLM Alignment · VLMs · OCR · ETL · STEM-VQA · GCP · AWS · Slurm

VisRes Bench: Evaluating Visual Reasoning of VLMs
2025 – 2026
Technology Innovation Institute (TII) · CVPR 2026

Designed and contributed to VisRes Bench, a benchmark for systematically evaluating the visual reasoning capabilities of vision-language models across diverse and challenging tasks.

  • Developed evaluation protocols and dataset curation pipelines targeting multi-step visual reasoning.
  • Benchmarked a wide range of state-of-the-art VLMs, revealing systematic failure modes.
  • Accepted at CVPR 2026 (Brigitta T., Dahou Y., Huynh N. D., et al.).
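VLM benchmark scoring commonly reduces to normalized answer matching. The snippet below is a generic sketch of that scoring rule, not VisRes Bench's actual protocol (which may use stricter or task-specific matching); `normalize_answer` and `exact_match_accuracy` are illustrative names:

```python
def normalize_answer(ans: str) -> str:
    # Lowercase and collapse whitespace before comparison.
    return " ".join(ans.lower().strip().split())

def exact_match_accuracy(predictions, references):
    """Exact-match accuracy after light normalization: a minimal,
    generic VQA-style scoring rule."""
    assert len(predictions) == len(references)
    hits = sum(normalize_answer(p) == normalize_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(predictions)
```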
CVPR 2026 · VLM Benchmark · Visual Reasoning · Evaluation

Vision-Language Models Can't See the Obvious
Apr 2024 – Jan 2025
Technology Innovation Institute (TII) · ICCV 2025

Investigated fundamental failure modes of VLMs on visually obvious reasoning tasks, contributing to a co-authored paper accepted at ICCV 2025.

  • Designed probing tasks that expose systematic blind spots in leading VLMs.
  • Developed and integrated ASR, VQA, OCR, and LLM inference components into unified end-to-end evaluation pipelines.
ICCV 2025 · VLM Evaluation · ASR · Multi-modal Reasoning

SVLA: A Unified Speech-Vision-Language Assistant
Apr 2024 – Jan 2025
Technology Innovation Institute (TII) · arXiv:2503.24164

Designed and built SVLA, a unified assistant capable of jointly processing speech, images, and text in a single end-to-end framework.

  • Integrated ASR, VQA, OCR, and LLM inference components into a cohesive multi-modal pipeline.
  • Enabled the model to take audio and visual inputs simultaneously and generate coherent language responses.
  • Published as a preprint on arXiv (arXiv:2503.24164, 2025).
Speech · Vision · Language · ASR · Multi-modal · arXiv

Visual Question Answering Research
Mar 2022 – Oct 2022
Deakin University · Research Assistant

Conducted competitive and academic VQA research, achieving top global rankings and publishing benchmark datasets and survey papers.

  • Ranked Top 9 worldwide in the Toloka VQA Challenge (WSDM Cup 2023).
  • Achieved Top 7 globally in the STOIC2021 COVID Detection Challenge (3D DenseNet on CT scans).
  • Published SimpsonsVQA (arXiv:2410.22648) — a domain-specific VQA dataset and benchmark.
  • Authored a comprehensive VQA survey (arXiv:2501.03939).
  • Founded a university-wide AI competition, growing the campus ML community.
VQA · Computer Vision · 3D CNN · Benchmarking

Speech-to-CDQL: Smart Home Voice Control
Nov 2021 – Mar 2022
Deakin University · MSc Thesis Research

Designed a voice-based interface for smart home control that combines a speech recognition system with a text-to-CDQL encoder-decoder RNN, achieving 93% accuracy and a 0.02% word error rate (WER).

  • Implemented three encoder-decoder architectures: basic, Bahdanau attention, and Luong attention.
  • Applied Finite Automata to improve query generation accuracy.
  • Published as an IEEE MDM Best Demo Paper (Jarvis system).
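The difference between the two attention variants above is only in how alignment scores are computed before the softmax. A minimal NumPy sketch (function names and weight shapes are illustrative, not the thesis code):

```python
import numpy as np

def luong_dot_scores(query, keys):
    # Multiplicative (Luong) attention: score_i = q . k_i
    # query: (d,), keys: (T, d) -> scores: (T,)
    return keys @ query

def bahdanau_scores(query, keys, W_q, W_k, v):
    # Additive (Bahdanau) attention: score_i = v . tanh(W_q q + W_k k_i)
    # W_q, W_k: (h, d), v: (h,) -> scores: (T,)
    return np.tanh(keys @ W_k.T + query @ W_q.T) @ v

def attention_weights(scores):
    # Numerically stable softmax over source positions.
    e = np.exp(scores - scores.max())
    return e / e.sum()
```

In both cases the weights are used to form a context vector as a weighted sum of encoder states; Bahdanau's extra learned projection makes it more expressive when encoder and decoder dimensions differ.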
ASR · Seq2Seq · IoT · IEEE MDM

Adversarial Attacks on Speech Recognition
Jun 2021 – Sep 2021
Deakin University · SIT Research Program

Studied adversarial robustness of ASR systems for mission-critical applications (autonomous vehicles, IoT, personal assistants). Key outputs:

  • Authored a survey on adversarial attacks on speech recognition (arXiv:2202.10594).
  • Developed a CTC-based ASR prototype published in the official Keras example library.
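At inference time, a CTC-based ASR model's frame-wise outputs are typically decoded greedily: take the best token per frame, collapse repeats, then drop blanks. A minimal sketch of that decoding rule (not the published Keras example itself):

```python
def ctc_greedy_decode(logits, blank=0):
    """Greedy CTC decoding: best token per timestep, collapse
    consecutive repeats, then remove blank tokens.

    logits: list of per-frame score lists; returns a token-id sequence.
    """
    # Best-scoring token index at each timestep.
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    out, prev = [], None
    for tok in path:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out
```

This is why CTC can emit genuinely repeated characters only when a blank separates them in the alignment path, a property adversarial attacks on ASR often exploit.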
ASR · Adversarial Robustness · CTC · Keras