Advancing Multi-Modal Machine Learning in Smart Environments: Integrating Visual, Auditory, and Sensorial Data for Context-Aware Human–AI Interaction
DOI: https://doi.org/10.63075/44prhp17

Abstract
As smart environments proliferate, there is a growing need for intelligent systems that can interpret human behaviour from multiple types of data. This study examines advances in multi-modal machine learning (ML) that allow machines to combine visual, auditory, and sensor inputs when interacting with humans across contexts. Its purpose is to compare the performance of unimodal, early-fusion, and transformer-based architectures in a smart-environment setting. A mixed-methods design was adopted: quantitatively, the models were trained on both audio-visual event data and sensor-based activity data; qualitatively, the researcher observed users as they interacted with devices in a laboratory smart-home environment. The transformer architecture employed a cross-modal attention mechanism to ensure semantic and temporal alignment of the different inputs. Each model was evaluated on accuracy, precision, recall, F1-score, latency, and user experience. The transformer-based model outperformed the others on all metrics, achieving an F1-score of 89.5% and the lowest latency at 95 milliseconds. The differences were statistically significant according to ANOVA and Tukey's HSD tests (p < .001). Most users reported that the transformer-based system inspired greater trust and satisfaction. Integrating multi-modal data into transformer models therefore substantially improves the responsiveness and intelligence of smart environments. Future research should develop models that perform well in open environments, run efficiently on edge devices, and comply with ethical guidelines.
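The abstract describes a cross-modal attention mechanism that fuses visual, auditory, and sensor embeddings inside the transformer architecture. The paper's actual implementation is not reproduced here; the following is a minimal sketch of that idea, assuming a PyTorch-style multi-head attention applied over the concatenated modality token sequences. All module names, embedding dimensions, sequence lengths, and the number of activity classes are illustrative assumptions, not the authors' code.

```python
# Illustrative sketch only: module names, dimensions, and class count are assumed.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuses visual, audio, and sensor embeddings with cross-modal attention."""
    def __init__(self, d_model=256, n_heads=4, n_classes=10):
        super().__init__()
        # Each modality token attends to the concatenated sequence of all modalities,
        # providing semantic alignment; temporal alignment is assumed to be handled
        # upstream by synchronised windowing of the input streams.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, visual, audio, sensor):
        # Each input: (batch, seq_len, d_model) embeddings for one modality.
        tokens = torch.cat([visual, audio, sensor], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)   # cross-modal attention
        fused = self.norm(fused + tokens)              # residual connection + norm
        pooled = fused.mean(dim=1)                     # pool over the joint sequence
        return self.classifier(pooled)                 # activity-class logits

# Example forward pass with random embeddings (hypothetical shapes)
model = CrossModalFusion()
logits = model(torch.randn(2, 8, 256), torch.randn(2, 8, 256), torch.randn(2, 4, 256))
print(logits.shape)  # torch.Size([2, 10])
```

In this sketch the three modality streams are treated as one joint token sequence so that attention weights can flow across modalities; an early-fusion baseline would instead concatenate raw features before a single encoder, and a unimodal baseline would process one stream alone.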