Advancing Multi-Modal Machine Learning in Smart Environments: Integrating Visual, Auditory, and Sensorial Data for Context-Aware Human–AI Interaction
DOI: https://doi.org/10.63075/44prhp17

Abstract
As smart environments proliferate, there is a growing need for intelligent systems that can interpret human behaviour from multiple types of data. This study examines advances in multi-modal machine learning (ML) that allow machines to combine visual, auditory, and sensor inputs when interacting with humans across contexts. Its purpose is to compare the performance of unimodal, early-fusion, and transformer-based architectures in a smart-environment setting. A mixed-methods design was adopted: quantitatively, the models were trained on both audio-visual event data and sensor-based activity data; qualitatively, the researcher observed users as they interacted with devices in a laboratory smart-home environment. The transformer architecture employed a cross-modal attention mechanism to ensure semantic and temporal alignment of the different inputs. Each model was evaluated on accuracy, precision, recall, F1-score, latency, and user experience. The transformer-based model outperformed the others on all metrics, achieving an F1-score of 89.5% and the lowest latency at 95 milliseconds. The differences were statistically significant according to ANOVA and Tukey's HSD tests (p < .001). Most users reported that the transformer-based system inspired greater trust and satisfaction. Integrating multi-modal data into transformer models therefore substantially improves the responsiveness and intelligence of smart environments. Future research should develop models that perform well in open environments, run efficiently on edge devices, and comply with ethical guidelines.
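The abstract describes a cross-modal attention mechanism that fuses visual, auditory, and sensor embeddings inside the transformer architecture. The paper's actual implementation is not reproduced here; the following is a minimal sketch of that idea, assuming a PyTorch-style multi-head attention applied over the concatenated modality token sequences. All module names, embedding dimensions, sequence lengths, and the number of activity classes are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: module names, dimensions, and class count are assumed.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses visual, audio, and sensor embeddings with cross-modal attention."""
    def __init__(self, d_model=256, n_heads=4, n_classes=10):
        super().__init__()
        # Each modality token attends to the concatenated sequence of all modalities,
        # providing semantic alignment; temporal alignment is assumed to be handled
        # upstream by synchronised windowing of the input streams.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual, audio, sensor):
        # Each input: (batch, seq_len, d_model) embeddings for one modality.
        tokens = torch.cat([visual, audio, sensor], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # cross-modal attention
        fused = self.norm(fused + tokens)              # residual connection + norm
        pooled = fused.mean(dim=1)                     # pool over the joint sequence
        return self.classifier(pooled)                 # activity-class logits

# Example forward pass with random embeddings (hypothetical shapes)
model = CrossModalFusion()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 8, 256), torch.randn(2, 4, 256))
print(logits.shape)  # torch.Size([2, 10])
```

In this sketch the three modality streams are treated as one joint token sequence so that attention weights can flow across modalities; an early-fusion baseline would instead concatenate raw features before a single encoder, and a unimodal baseline would process one stream alone.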