Publications

2025
  • Hasan Iqbal, Nazmul Karim, Umar Khalid, Azib Farooq, Zichun Zhong, Chen Chen and Jing Hua. SPF-4D: A Progressive Sampling Framework for View Consistent 4D Editing. arXiv 2025 (Conference)
    Instruction-guided generative models, especially those using text-to-image (T2I) and text-to-video (T2V) diffusion frameworks, have advanced the field of content editing in recent years. To extend these capabilities to 4D scene editing, we introduce SPF-4D, a framework designed to maintain both temporal and view consistency while editing dynamic 3D scenes. SPF-4D achieves this by leveraging progressive noise sampling during the forward diffusion phase and refining latents iteratively in the reverse diffusion phase. For temporal coherence, we design a correlated Gaussian noise structure that links frames over time, allowing each frame to depend meaningfully on prior frames. Additionally, to ensure spatial consistency across views, we implement a cross-view noise model, which uses shared and independent noise components to balance commonalities and distinct details among different views. To further enhance spatial coherence, SPF-4D incorporates view-consistent iterative refinement, embedding view-aware information into the denoising process to ensure aligned edits across frames and views. Our approach enables high-quality 4D editing without relying on external models, addressing key challenges in previous methods. Through extensive evaluation on multiple benchmarks and multiple editing aspects (e.g. style transfer, multi-attribute editing, local editing, etc.), we show the effectiveness of our proposed method. A toy code sketch of the correlated noise sampling follows the BibTeX entry below.
    @misc{iqbal2025spf4d,
          title={SPF-4D: A Progressive Sampling Framework for View Consistent 4D Editing},
          author={Hasan Iqbal and Nazmul Karim and Umar Khalid and Azib Farooq and Zichun Zhong and Chen Chen and Jing Hua},
          year={2025},
          note={arXiv preprint}
    }
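    The snippet below is a minimal, unofficial sketch of the noise structure described in the SPF-4D abstract above: an AR(1)-style blend correlates each frame's Gaussian noise with the previous frame's, and each view mixes a shared component with an independent draw. The function name and the mixing weights gamma and alpha are illustrative assumptions, not the paper's notation.

    # Unofficial sketch: correlated Gaussian noise across frames and views.
    # gamma and alpha are hypothetical mixing weights chosen so variance stays 1.
    import numpy as np

    def sample_structured_noise(num_frames, num_views, shape, gamma=0.9, alpha=0.5, seed=0):
        """Return noise of shape (num_frames, num_views, *shape) with unit variance."""
        rng = np.random.default_rng(seed)
        noise = np.empty((num_frames, num_views, *shape))
        prev_shared = None
        for t in range(num_frames):
            # Shared component: correlated with the previous frame's shared noise.
            fresh = rng.standard_normal(shape)
            shared = fresh if prev_shared is None else (
                np.sqrt(gamma) * prev_shared + np.sqrt(1.0 - gamma) * fresh)
            prev_shared = shared
            for v in range(num_views):
                # Each view mixes the shared component with its own independent draw,
                # so views agree on coarse structure but keep distinct details.
                indep = rng.standard_normal(shape)
                noise[t, v] = np.sqrt(alpha) * shared + np.sqrt(1.0 - alpha) * indep
        return noise

    eps = sample_structured_noise(num_frames=8, num_views=4, shape=(16, 16))
    print(eps.shape, round(float(eps.std()), 3))  # stays close to 1 by construction

    Because the mixing weights are square roots of convex coefficients, every sample keeps unit variance, which is what lets such structured noise plug into a standard forward diffusion process.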
  • Umar Khalid*, Hasan Iqbal*, Azib Farooq, Nazanin Rahnavard, Jing Hua and Chen Chen (*Equal Contribution). EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing. arXiv 2025 (Conference)
    Editing complex visual content based on ambiguous instructions remains a challenging problem in vision-language modeling. While existing models can contextualize content, they often struggle to grasp the underlying intent within a reference image or scene, leading to misaligned edits. We introduce the Editing Vision-Language Model (EVLM), a system designed to interpret such instructions in conjunction with reference visuals, producing precise and context-aware editing prompts. Leveraging Chain-of-Thought (CoT) reasoning and the KL-Divergence Target Optimization (KTO) alignment technique, EVLM captures subjective editing preferences without requiring binary labels. Fine-tuned on a dataset of 30,000 CoT examples, with rationale paths rated by human evaluators, EVLM demonstrates substantial improvements in alignment with human intentions. Experiments across image, video, 3D, and 4D editing tasks show that EVLM generates coherent, high-quality instructions, supporting a scalable framework for complex vision-language applications.
    @misc{khalid2024evlmselfreflectivemultimodalreasoning,
          title={EVLM: Self-Reflective Multimodal Reasoning for Cross-Dimensional Visual Editing}, 
          author={Umar Khalid and Hasan Iqbal and Azib Farooq and Nazanin Rahnavard and Jing Hua and Chen Chen},
          year={2024},
          eprint={2412.10566},
          archivePrefix={arXiv},
          primaryClass={cs.CV},
          url={https://arxiv.org/abs/2412.10566}, 
    }
2024
  • Umar Khalid*, Hasan Iqbal*, Nazmul Karim, Muhammad Tayyab, Jing Hua and Chen Chen (*Equal contribution). LatentEditor: Text Driven Local Editing of 3D Scenes. ECCV 2024 (Conference)
    While neural fields have made significant strides in view synthesis and scene reconstruction, editing them poses a formidable challenge due to their implicit encoding of geometry and texture information from multi-view inputs. In this paper, we introduce LatentEditor, an innovative framework designed to empower users with the ability to perform precise and locally controlled editing of neural fields using text prompts. Leveraging denoising diffusion models, we successfully embed real-world scenes into the latent space, resulting in a faster and more adaptable NeRF backbone for editing compared to traditional methods. To enhance editing precision, we introduce a delta score to calculate the 2D mask in the latent space that serves as a guide for local modifications while preserving irrelevant regions. Our novel pixel-level scoring approach harnesses the power of InstructPix2Pix (IP2P) to discern the disparity between IP2P conditional and unconditional noise predictions in the latent space. The edited latents conditioned on the 2D masks are then iteratively updated in the training set to achieve 3D local editing. Our approach achieves faster editing speeds and superior output quality compared to existing 3D editing models, bridging the gap between textual instructions and high-quality 3D scene editing in latent space. We show the superiority of our approach on four benchmark 3D datasets: LLFF, IN2N, NeRFStudio and NeRF-Art. Project website: https://latenteditor.github.io/ A toy code sketch of the delta-score masking idea follows the BibTeX entry below.
    @inproceedings{khalid2025latenteditor,
      title={LatentEditor: text driven local editing of 3D scenes},
      author={Khalid, Umar and Iqbal, Hasan and Karim, Nazmul and Tayyab, Muhammad and Hua, Jing and Chen, Chen},
      booktitle={European Conference on Computer Vision},
      pages={364--380},
      year={2025},
      organization={Springer}
    }
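    The snippet below is a minimal, hedged sketch of the delta-score idea from the LatentEditor abstract above: the disagreement between a text-conditioned and an unconditional noise prediction is turned into a 2D editing mask. predict_noise is a hypothetical callable standing in for an InstructPix2Pix-style model; the thresholding and normalization are illustrative choices, not the paper's exact formulation.

    # Unofficial sketch: derive an edit mask from conditional vs. unconditional
    # noise predictions in latent space. `predict_noise` is a hypothetical stand-in.
    import torch

    def delta_edit_mask(latent, predict_noise, prompt, threshold=0.5):
        """Return a binary HxW mask marking latent regions the edit prompt affects."""
        eps_cond = predict_noise(latent, prompt)    # text-conditioned prediction
        eps_uncond = predict_noise(latent, None)    # unconditional prediction
        # Per-location magnitude of the disagreement, averaged over latent channels.
        delta = (eps_cond - eps_uncond).abs().mean(dim=0)
        delta = (delta - delta.min()) / (delta.max() - delta.min() + 1e-8)
        return (delta > threshold).float()          # 1 = region to edit, 0 = preserve

    # Toy usage with a random latent and a fake model standing in for IP2P.
    latent = torch.randn(4, 64, 64)
    fake_model = lambda z, p: z * (1.0 if p is None else 1.1) + 0.01 * torch.randn_like(z)
    mask = delta_edit_mask(latent, fake_model, "make it autumn")
    print(mask.shape, float(mask.mean()))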
  • Nazmul Karim*, Hasan Iqbal*, Umar Khalid, Chen Chen and Jing Hua (*Equal contribution). Free-Editor: Zero-shot Text-driven 3D Scene Editing. ECCV 2024 (Conference)
    Text-to-Image (T2I) diffusion models have recently gained traction for their versatility and user-friendliness in 2D content generation and editing. However, training a diffusion model specifically for 3D scene editing is challenging due to the scarcity of large-scale datasets. Currently, editing 3D scenes necessitates either retraining the model to accommodate various 3D edits or developing specific methods tailored to each unique editing type. Moreover, state-of-the-art (SOTA) techniques require multiple synchronized edited images from the same scene to enable effective scene editing. Given the current limitations of T2I models, achieving consistent editing effects across multiple images remains difficult, leading to multi-view inconsistency in editing. This inconsistency undermines the performance of 3D scene editing when these images are utilized. In this study, we introduce a novel, training-free 3D scene editing technique called Free-Editor, which enables users to edit 3D scenes without the need for model retraining during the testing phase. Our method effectively addresses the issue of multi-view style inconsistency found in SOTA methods through the implementation of a single-view editing scheme. Specifically, we demonstrate that editing a particular 3D scene can be achieved by modifying only a single view. To facilitate this, we present an Edit Transformer that ensures intra-view consistency and inter-view style transfer using self-view and cross-view attention mechanisms, respectively. By eliminating the need for model retraining and multi-view editing, our approach significantly reduces editing time and memory resource requirements, achieving runtimes approximately 20 times faster than SOTA methods. We have performed extensive experiments on various benchmark datasets, showcasing the diverse editing capabilities of our proposed technique. Project website: https://free-editor.github.io/ A toy code sketch of the self-view/cross-view attention idea follows the BibTeX entry below.
    @inproceedings{karim2025free,
      title={Free-editor: zero-shot text-driven 3D scene editing},
      author={Karim, Nazmul and Iqbal, Hasan and Khalid, Umar and Chen, Chen and Hua, Jing},
      booktitle={European Conference on Computer Vision},
      pages={436--453},
      year={2025},
      organization={Springer}
    }
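    The snippet below is a minimal, hedged sketch of the attention split named in the Free-Editor abstract above: self-view attention keeps each target view internally consistent, while cross-view attention pulls edit style from the tokens of the single edited starting view. The block layout, dimensions, and class names are assumptions for illustration, not the released architecture.

    # Unofficial sketch: one transformer block combining self-view and cross-view attention.
    import torch
    import torch.nn as nn

    class EditBlock(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, target_tokens, edited_tokens):
            # Self-view attention: intra-view consistency within each target view.
            x = target_tokens
            q = self.norm1(x)
            x = x + self.self_attn(q, q, q)[0]
            # Cross-view attention: inter-view style transfer from the edited view.
            x = x + self.cross_attn(self.norm2(x), edited_tokens, edited_tokens)[0]
            return x

    block = EditBlock()
    target = torch.randn(2, 196, 256)  # tokens of 2 target views
    edited = torch.randn(2, 196, 256)  # tokens of the single edited view, repeated per target
    print(block(target, edited).shape)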
  • Umar Khalid*, Hasan Iqbal*, Nazmul Karim, Azib Farooq, Chen Chen and Jing Hua (*Equal contribution). 3DEgo: 3D Editing on the Go! ECCV 2024 (Conference)
    We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed noise blender module for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset. Project website: https://3dego.github.io/
    @inproceedings{khalid20253dego,
      title={3DEgo: 3D Editing on the Go!},
      author={Khalid, Umar and Iqbal, Hasan and Farooq, Azib and Hua, Jing and Chen, Chen},
      booktitle={European Conference on Computer Vision},
      pages={73--89},
      year={2025},
      organization={Springer}
    }
2023
  • Hasan Iqbal, Umar Khalid, Chen Chen, Jing Hua. Unsupervised Anomaly Detection in Medical Images Using Masked Diffusion Model. 14th International Conference on Machine Learning in Medical Imaging (MLMI) 2023 (Conference)
    It can be challenging to identify brain MRI anomalies using supervised deep-learning techniques due to anatomical heterogeneity and the requirement for pixel-level labeling. Unsupervised anomaly detection approaches provide an alternative solution by relying only on sample-level labels of healthy brains to generate a desired representation to identify abnormalities at the pixel level. Although generative models are crucial for generating such anatomically consistent representations of healthy brains, accurately generating the intricate anatomy of the human brain remains a challenge. In this study, we present a method called the masked-denoising diffusion probabilistic model (mDDPM), which introduces masking-based regularization to reframe the generation task of diffusion models. Specifically, we introduce Masked Image Modeling (MIM) and Masked Frequency Modeling (MFM) in our self-supervised approach, enabling models to learn visual representations from unlabeled data. To the best of our knowledge, this is the first attempt to apply MFM in denoising diffusion probabilistic models (DDPMs) for medical applications. We evaluate our approach on datasets containing tumors and multiple sclerosis lesions and demonstrate the superior performance of our unsupervised method compared to existing fully/weakly supervised baselines. Project website: https://mddpm.github.io/ A toy code sketch of the two masking schemes follows the BibTeX entry below.
    @inproceedings{iqbal2023unsupervised,
      title={Unsupervised anomaly detection in medical images using masked diffusion model},
      author={Iqbal, Hasan and Khalid, Umar and Chen, Chen and Hua, Jing},
      booktitle={International Workshop on Machine Learning in Medical Imaging},
      pages={372--381},
      year={2023},
      organization={Springer}
    }
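    The snippet below is a minimal, hedged sketch of the two masking schemes named in the abstract above: Masked Image Modeling zeroes random patches in pixel space, and Masked Frequency Modeling removes a band in the Fourier domain. Patch size, mask ratio, and the frequency radius are illustrative assumptions, not the paper's settings.

    # Unofficial sketch: MIM-style patch masking and MFM-style frequency masking
    # applied to an image tensor before it enters DDPM training.
    import torch

    def mask_image_patches(img, patch=8, ratio=0.3):
        """Zero out a random subset of patches (Masked Image Modeling style)."""
        c, h, w = img.shape
        keep = torch.rand(h // patch, w // patch) > ratio                  # True = keep patch
        mask = keep.repeat_interleave(patch, 0).repeat_interleave(patch, 1)
        return img * mask                                                  # broadcast over channels

    def mask_low_frequencies(img, radius=4):
        """Remove a low-frequency band (Masked Frequency Modeling style)."""
        c, h, w = img.shape
        spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
        yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
        dist = ((yy - h // 2) ** 2 + (xx - w // 2) ** 2).float().sqrt()
        spec = spec * (dist > radius).float()                              # keep high frequencies only
        return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

    x = torch.rand(1, 64, 64)  # stand-in for a brain MRI slice
    print(mask_image_patches(x).shape, mask_low_frequencies(x).shape)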
  • Umar Khalid, Hasan Iqbal, Saeed Vahidian, Jing Hua, Chen Chen. CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2023 (Conference)
    Human-robot interaction (HRI) is a rapidly growing field that encompasses social and industrial applications. Machine learning plays a vital role in industrial HRI by enhancing the adaptability and autonomy of robots in complex environments. However, data privacy is a crucial concern in the interaction between humans and robots, as companies need to protect sensitive data while machine learning algorithms require access to large datasets. Federated Learning (FL) offers a solution by enabling the distributed training of models without sharing raw data. Despite extensive research on FL for tasks such as natural language processing (NLP) and image classification, the question of how to use FL for HRI remains an open research problem. The traditional FL approach involves transmitting large neural network parameter matrices between the server and clients, which can lead to high communication costs and often becomes a bottleneck in FL. This paper proposes a communication-efficient FL framework for human-robot interaction (CEFHRI) to address the challenges of data heterogeneity and communication costs. The framework leverages pre-trained models and introduces a trainable spatiotemporal adapter for video understanding tasks in HRI. Experimental results on three human-robot interaction benchmark datasets (HRI30, InHARD, and COIN) demonstrate the superiority of CEFHRI over full fine-tuning in terms of communication costs. The proposed methodology provides a secure and efficient approach to HRI federated learning, particularly in industrial environments with data privacy concerns and limited communication bandwidth. Our code is available at https://github.com/umarkhalidAI/CEFHRI-Efficient-Federated-Learning. A toy code sketch of the adapter-only parameter exchange follows the BibTeX entry below.
    @inproceedings{khalid2023cefhri,
      title={CEFHRI: A Communication Efficient Federated Learning Framework for Recognizing Industrial Human-Robot Interaction},
      author={Khalid, Umar and Iqbal, Hasan and Vahidian, Saeed and Hua, Jing and Chen, Chen},
      booktitle={2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)},
      pages={10141--10148},
      year={2023},
      organization={IEEE}
    }
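    The snippet below is a minimal, hedged sketch of the communication saving described in the abstract above: clients keep a frozen pre-trained backbone, and only the small adapter (and head) weights are averaged by the server each round. The layer sizes, class names, and the use of plain federated averaging are assumptions for illustration, not the CEFHRI codebase.

    # Unofficial sketch: federated averaging over adapter parameters only.
    import torch
    import torch.nn as nn

    class AdapterModel(nn.Module):
        def __init__(self, dim=512, bottleneck=32, num_classes=30):
            super().__init__()
            self.backbone = nn.Linear(dim, dim)             # stand-in for a frozen video backbone
            self.backbone.requires_grad_(False)
            self.adapter = nn.Sequential(                   # small trainable bottleneck adapter
                nn.Linear(dim, bottleneck), nn.ReLU(), nn.Linear(bottleneck, dim))
            self.head = nn.Linear(dim, num_classes)

        def forward(self, x):
            feats = self.backbone(x)
            return self.head(feats + self.adapter(feats))   # residual adapter

    def trainable_state(model):
        # Only adapter + head leave the client; the backbone never moves.
        return {k: v for k, v in model.state_dict().items()
                if k.startswith(("adapter", "head"))}

    def federated_average(states):
        return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

    clients = [AdapterModel() for _ in range(3)]
    # ... each client would train locally on its own HRI videos here ...
    global_update = federated_average([trainable_state(m) for m in clients])
    for m in clients:
        m.load_state_dict(global_update, strict=False)      # broadcast the averaged adapters
    print(sum(v.numel() for v in global_update.values()), "parameters exchanged per round")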
2021
  • Hasan Iqbal, Seemab Latif, Yukang Yan, Chun Yu, Yuanchun Shi. Reducing arm fatigue in virtual reality by introducing 3D-spatial offset. IEEE Access 2021 (Journal)
    Arm fatigue is an important factor affecting user experience in Virtual Reality (VR). In this work, we propose ProxyHand and StickHand, two virtual hand techniques that address this issue. Using ProxyHand or StickHand, users can flexibly adjust the 3D-spatial offset between the physical hand and its virtual representation. This allows users to keep their arms in a comfortable posture (vertically down) even when manipulating objects in locations that would require lifting the arms under the default interaction method. ProxyHand and StickHand share the same underlying concept, namely introducing a 3D-spatial offset between the physical hand and its virtual representation in VR; however, they respond differently to the user's hand movements because of their different working mechanisms. A question arises as to whether the 3D-spatial offset negatively impacts hand control, since the directness of interaction is violated. To investigate this, we conducted user studies in which users performed object translation, rotation and hybrid tasks. ProxyHand and StickHand can be used in combination in some scenarios to maximize the positive impact on user experience in VR, which raises the question of finding the best combination of these virtual hands to reduce arm fatigue. For this purpose, we first combined both virtual hands by allowing users to switch manually between ProxyHand and StickHand, and we then used machine learning to switch between the two virtual hands automatically. Results showed that introducing a 3D-spatial offset largely reduced arm fatigue while offering performance equal to the default interaction method across all three tasks: translation, rotation and hybrid. Users preferred using ProxyHand and StickHand when interacting in the VR environment for longer periods of time. A toy code sketch of the 3D-spatial offset mapping follows the BibTeX entry below.
    @article{iqbal2021reducing,
      title={Reducing arm fatigue in virtual reality by introducing 3D-spatial offset},
      author={Iqbal, Hasan and Latif, Seemab and Yan, Yukang and Yu, Chun and Shi, Yuanchun},
      journal={IEEE Access},
      volume={9},
      pages={64085--64104},
      year={2021},
      publisher={IEEE}
    }
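    The snippet below is a minimal, hedged sketch of the core idea in the abstract above: the virtual hand tracks the physical hand but is displaced by a 3D-spatial offset, so raised targets can be reached with the arm held low. The coordinate convention and the offset value are made-up illustrations, not the study's implementation.

    # Unofficial sketch: mapping a tracked physical hand pose to an offset virtual hand.
    from dataclasses import dataclass

    @dataclass
    class OffsetHand:
        offset: tuple = (0.0, 0.4, 0.2)   # hypothetical offset in metres: raise and push forward

        def virtual_pose(self, physical_pos, physical_rot):
            """Return the virtual hand pose for a tracked physical pose."""
            x, y, z = physical_pos
            dx, dy, dz = self.offset
            # ProxyHand-style behaviour: translation is offset, rotation passes through unchanged.
            return (x + dx, y + dy, z + dz), physical_rot

    hand = OffsetHand()
    pos, rot = hand.virtual_pose(physical_pos=(0.1, 0.9, 0.3), physical_rot=(0.0, 0.0, 0.0, 1.0))
    print(pos, rot)  # the virtual hand appears higher and farther forward than the physical hand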
2018
  • Yukang Yan, Chun Yu, Xiaojuan Ma, Shuai Huang, Hasan Iqbal, Yuanchun Shi. Eyes-Free Target Acquisition in Interaction Space around the Body for Virtual Reality. Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI) 2018 (Conference)
    Eyes-free target acquisition is a basic and important human ability to interact with the surrounding physical world, relying on the sense of space and proprioception. In this research, we leverage this ability to improve interaction in virtual reality (VR), by allowing users to acquire a virtual object without looking at it. We expect this eyes-free approach can effectively reduce head movements and focus changes, so as to speed up the interaction and alleviate fatigue and VR sickness. We conduct three lab studies to progressively investigate the feasibility and usability of eyes-free target acquisition in VR. Results show that, compared with the eyes-engaged manner, the eyes-free approach is significantly faster, provides satisfying accuracy, and introduces less fatigue and sickness; most participants (13/16) prefer this approach. We also measure the accuracy of motion control and evaluate subjective experience of users when acquiring targets at different locations around the body. Based on the results, we make suggestions on designing appropriate target layout and discuss several design issues for eyes-free target acquisition in VR.
    @inproceedings{yan2018eyes,
      title={Eyes-free target acquisition in interaction space around the body for virtual reality},
      author={Yan, Yukang and Yu, Chun and Ma, Xiaojuan and Huang, Shuai and Iqbal, Hasan and Shi, Yuanchun},
      booktitle={Proceedings of the 2018 CHI conference on human factors in computing systems},
      pages={1--13},
      year={2018}
    }