Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask

Jingyu Xiao, Tsinghua Shenzhen International Graduate School & Peng Cheng Laboratory, Shenzhen, China (jy-xiao21@mails.tsinghua.edu.cn); Zhiyao Xu, Xi'an University of Electronic Science and Technology, Xi'an, China (21009200843@stu.xidian.edu.cn); Qingsong Zou, Tsinghua Shenzhen International Graduate School & Peng Cheng Laboratory, Shenzhen, China (zouqs21@mails.tsinghua.edu.cn); Qing Li, Peng Cheng Laboratory, Shenzhen, China (liq@pcl.ac.cn); Dan Zhao, Peng Cheng Laboratory, Shenzhen, China (zhaod01@pcl.ac.cn); Dong Fang, Tencent, Shenzhen, China (victordfang@tencent.com); Ruoyu Li, Tsinghua Shenzhen International Graduate School, Shenzhen, China (liry19@mails.tsinghua.edu.cn); Wenxin Tang, Tsinghua Shenzhen International Graduate School, Shenzhen, China (vinsontang2126@gmail.com); Kang Li, Tsinghua Shenzhen International Graduate School, Shenzhen, China (lk26603878@gmail.com); Xudong Zuo, Tsinghua Shenzhen International Graduate School, Shenzhen, China (zuoxd20@mails.tsinghua.edu.cn); Penghui Hu, Tsinghua University, Beijing, China (huph22@mails.tsinghua.edu.cn); Yong Jiang, Tsinghua Shenzhen International Graduate School & Peng Cheng Laboratory, Shenzhen, China (jiangy@sz.tsinghua.edu.cn); Zixuan Weng, Beijing Jiaotong University, Beijing, China (20722027@bjtu.edu.cn); Michael R. Lyu, The Chinese University of Hong Kong, Hong Kong, China (lyu@cse.cuhk.edu.hk)


Abstract.

Smart homes, powered by the Internet of Things, offer great convenience but also pose security concerns due to abnormal behaviors, such as improper operations by users and potential attacks from malicious attackers. Several behavior modeling methods have been proposed to identify abnormal behaviors and mitigate potential risks. However, their performance often falls short because they do not effectively learn less frequent behaviors, consider temporal context, or account for the impact of noise in human behaviors. In this paper, we propose SmartGuard, an autoencoder-based unsupervised user behavior anomaly detection framework. First, we design a Loss-guided Dynamic Mask Strategy (LDMS) to encourage the model to learn less frequent behaviors, which are often overlooked during learning. Second, we propose a Three-level Time-aware Position Embedding (TTPE) to incorporate temporal information into positional embedding to detect temporal context anomalies. Third, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) that assigns different weights to routine behaviors and noise behaviors to mitigate the interference of noise behaviors during inference. Comprehensive experiments on three datasets with ten types of anomaly behaviors demonstrate that SmartGuard consistently outperforms state-of-the-art baselines and also offers highly interpretable results.

User Behavior Modeling, Anomaly Detection, Transformer.

copyright: acmlicensed; journalyear: 2024; doi: 10.1145/3637528.3671708; conference: the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, July 14-18, 2024, Barcelona, Spain; isbn: 978-1-4503-XXXX-X/18/06; ccs: Security and privacy / Human and societal aspects of security and privacy

1. Introduction

The rapid growth of IoT solutions has led to an unprecedented increase in smart devices within homes, expected to reach approximately 5 billion by 2025 (Lueth, 2018). However, abnormal behaviors pose substantial security risks within smart homes. These abnormal behaviors usually originate from two primary sources. First, improper operations by users can cause abnormal behaviors, such as inadvertently activating the air conditioner's cooling mode during winter or forgetting to close a water valve. Second, malicious attackers can exploit vulnerabilities within IoT devices and platforms, taking unauthorized control of these devices. For example, hackers can compromise IoT platforms, allowing them to disable security cameras and manipulate home automation systems, creating opportunities for burglary. These security concerns emphasize the urgency of robust behavioral modeling methods and enhanced security measures to safeguard smart home environments.

Deep learning has been employed across various domains to mine correlations between behaviors for modeling user behavior sequences (Tang et al., 2022, 2023; Li et al., 2024). DeepMove (Feng et al., 2018) leverages RNNs to model both long and short-term mobility patterns of users for human mobility prediction. To capture the dynamics of user behaviors, SASRec (Kang and McAuley, 2018) proposes a self-attention based model to achieve sequential recommendation. More recent efforts (Chen et al., 2019; Sun et al., 2019; de Souza Pereira Moreira et al., 2021) primarily focus on transformer-based models for their superior ability to handle sequential behavior data.

However, these models cannot be directly applied to our scenario because of the following three challenges of user behavior modeling in smart homes.

[Figure 1: occurrences and reconstruction losses of different behaviors on the AN dataset]

First, the occurrence frequencies of different user behaviors may be imbalanced, leading to challenges in learning the semantics of these behaviors. This user behavior imbalance can be attributed to individuals' living habits. For example, cook-related behaviors (e.g., using the microwave and oven) of office workers may be infrequent, because they dine at their workplace on weekdays and only cook on weekends. On the other hand, some daily behaviors of the same users, like turning on lights and watching TV, can be more frequent. Behavior imbalance complicates the learning process for models: some behaviors, which occur frequently in similar contexts, can be easily inferred, while others that rarely appear or manifest in diverse contexts can be more challenging to infer. We train an autoencoder model on the AN dataset (shown in Table 1) and record the occurrences and reconstruction loss of different behaviors. As shown in Figure 1, as the number of occurrences of a behavior decreases, its reconstruction loss tends to increase.

Second, temporal context, e.g., the timing and duration of user behaviors, plays a significant role in abnormal behavior detection but is overlooked by existing solutions. For example, turning on the cooling mode of the air conditioner is abnormal in winter but normal in summer. Showering for 30-40 minutes is normal, but exceeding 2 hours suggests a user accident. Ignoring timing information hinders the identification of abnormal behavior patterns. As shown in Figure 2, sequence 1 represents a user's normal laundry-related behaviors. Sequences 2 and 3 follow the same order as sequence 1. However, in sequence 2, the water valve is opened at 2 o'clock at night; in sequence 3, the duration between opening and closing the water valve is excessively long. Therefore, these two sequences should be identified as abnormal behaviors, possibly conducted by attackers intending to induce water leakage.

[Figure 2: a normal laundry-related behavior sequence and two temporally abnormal variants]

Third, arbitrary intents and passive device actions can cause noise behaviors in user behavior sequences, which interfere with the model's inference. Figure 3 shows noise behaviors in a behavior sequence related to a user's behaviors after getting up. The user performs some routine behaviors like "turn on the bed light", "open the curtains", "switch off the air conditioner", "open the refrigerator", "close the refrigerator" and "switch on the oven". However, there are also some sporadic actions that are not tightly related to the behavior sequence, including 1) active behaviors, e.g., suddenly deciding to "turn on the network audio" to listen to music; 2) passive behaviors from devices, e.g., the "self-refresh" of the air purifier. These noise behaviors may also occur in other sequences with varying patterns. They introduce uncertainty that can disrupt the learning process and lead the model to misclassify sequences containing noise behaviors as anomalies. Therefore, treating noise behaviors on par with normal behaviors could potentially harm the model's performance, leading to increased losses.

[Figure 3: routine and noise behaviors in a user's after-getting-up behavior sequence]

In this paper, we propose SmartGuard to solve the above challenges. SmartGuard is an autoencoder-based architecture that learns to reconstruct normal behavior sequences during training and identifies behavior sequences with high reconstruction loss as anomalies. Firstly, we devise a Loss-guided Dynamic Mask Strategy (LDMS) to promote the model's learning of infrequent, hard-to-learn behaviors. Secondly, we introduce a Three-level Time-aware Position Embedding (TTPE) to integrate temporal information into positional embedding for detecting temporal context anomalies. Lastly, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) to assign distinct weights to routine behaviors and noise behaviors, thereby mitigating the impact of noise behaviors. Our code is released on GitHub: https://github.com/xjywhu/SmartGuard. Our contributions can be summarized as follows:

  • We design LDMS to mask the behaviors with high reconstruction loss, thus encouraging the model to learn these hard-to-learn behaviors.

  • We propose TTPE to simultaneously consider the order-level, moment-level and duration-level information of user behaviors.

  • We design NWRL to treat noisy behaviors and normal behaviors differently for learning robust behavior representations.

2. Related Work

2.1. User Behavior Modeling in Smart Homes

Some works propose to model user behavior (i.e., user-device interaction) based on deep learning. (Gu et al., 2020) uses an event transition graph to model IoT context and detect anomalies. In (Wang et al., 2023), the authors build a device interaction graph to learn the device state transition relationships caused by user actions. (Fu et al., 2021) detects anomalies through correlational analysis of device actions and the physical environment. (Srinivasan et al., 2008) infers user behavior through readings from various sensors installed in the user's home. IoTBeholder (Zou et al., 2023) utilizes an attention-based LSTM to predict user behavior from history sequences. SmartSense (Jeon et al., 2022) leverages a query-based transformer to model contextual information of user behavior sequences. DeepUDI (Xiao et al., 2023a) and SmartUDI (Xiao et al., 2023b) use relational gated graph neural networks, capsule neural networks and contrastive learning to model users' routines, intents and multi-level periodicities. However, the above methods aim at accurately predicting a user's next behavior; they cannot be applied to abnormal behavior detection.

2.2. Attacks and Defenses in Smart Homes

An increasing number of attack vectors have been identified in smart homes in recent years. In addition to cyber attacks, it is also concerning that IoT devices are often closely associated with the user's physical environment and have the ability to alter it. In this context, automation introduces more serious security risks. Prior research has revealed that adversaries can leak personal information and gain physical access to the home (Jia et al., 2017; Celik et al., 2018). In (Fernandes et al., 2016), a spoofing attack is employed to exploit automation rules and trigger unexpected device actions. (Chi et al., 2022; Fu et al., 2022) apply delay-based attacks to disrupt cross-platform IoT information exchanges, resulting in unexpected interactions and leaving IoT devices and smart homes in an insecure state. This series of attacks aims at causing smart home devices to exhibit unexpected actions, thereby posing significant security threats. Therefore, designing an effective mechanism to detect such attacks is necessary. 6thSense (Sikder et al., 2017) utilizes Naive Bayes to detect malicious behavior associated with sensors in smart homes. Aegis (Sikder et al., 2019) utilizes a Markov Chain to detect malicious behaviors. ARGUS (Rieger et al., 2023) designs an autoencoder based on Gated Recurrent Units (GRU) to detect infiltration attacks. However, these methods ignore behavior imbalance, temporal information and noise behaviors.

3. Problem Formulation

Let $\mathcal{D}$ denote a set of devices, $\mathcal{C}$ denote a set of device controls and $\mathcal{S}$ denote a set of behavior sequences.

Definition 1.

(Behavior) A behavior $b=(t,d,c)$ is a 3-tuple consisting of a timestamp $t$, a device $d\in\mathcal{D}$ and a device control $c\in\mathcal{C}$.

For example, behavior b = (2022-08-04 18:30, air conditioner, air conditioner:switch on) describes the behavior "switch on the air conditioner" at 18:30 on 2022-08-04.

Definition 2.

(Behavior Sequence) A behavior sequence $s=[b_{1},b_{2},\cdots,b_{n}]\in\mathcal{S}$ is a list of behaviors ordered by their timestamps, where $n$ is the length of $s$.

We define the User Behavior Sequence (UBS) anomaly detection problem as follows.

Problem 1.

(UBS Anomaly Detection) Given a behavior sequence $s$, determine whether $s$ is an anomalous event or a normal event.
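To make the formulation concrete, the following sketch (ours, purely illustrative) represents behaviors and sequences as plain Python data structures matching Definitions 1 and 2:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# A behavior b = (t, d, c): timestamp, device, device control (Definition 1).
@dataclass
class Behavior:
    t: datetime   # timestamp of the behavior
    d: str        # device identifier, d in D
    c: str        # device control, c in C

# A behavior sequence s = [b_1, ..., b_n], ordered by timestamp (Definition 2).
def make_sequence(behaviors: List[Behavior]) -> List[Behavior]:
    return sorted(behaviors, key=lambda b: b.t)

s = make_sequence([
    Behavior(datetime(2022, 8, 4, 18, 30), "air conditioner", "air conditioner:switch on"),
    Behavior(datetime(2022, 8, 4, 18, 25), "curtain", "curtain:open"),
])
assert s[0].c == "curtain:open"  # earliest behavior comes first
```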

[Figure 4: examples of the four types of abnormal behaviors: (a) SD, (b) MD, (c) DM, (d) DD]

In this paper, we consider four types of abnormal behaviors:

  • (SD) Single Device context anomaly (Figure 4(a)), defined as unusually high-frequency operations on a single device, e.g., frequently switching a light on and off to break it.

  • (MD) Multiple Devices context anomaly (Figure 4(b)), defined as the simultaneous occurrence of behaviors on multiple devices that are not supposed to occur in the same sequence, e.g., turning off the camera and opening the window for burglary.

  • (DM) Device control-Moment context anomaly (Figure 4(c)), defined as a device control occurring at an inappropriate time, e.g., turning on the cooling mode of an air conditioner in winter, potentially causing the user to catch a cold.

  • (DD) Device control-Duration context anomaly (Figure 4(d)), defined as device controls that last for an inappropriate duration, e.g., leaving a water valve open for 3 hours in a flood attack.

[Figure 5: overview of the SmartGuard framework]

4. Methodology

4.1. Solution Overview

To achieve accurate user behavior sequence anomaly detection in smart homes, we propose SmartGuard, depicted in Figure 5. The workflow of SmartGuard can be summarized as follows. During training, the Loss-guided Dynamic Mask Strategy (§4.2) is initially employed to mask hard-to-learn behaviors based on the loss vector $\mathcal{L}_{\text{vec}}$ from the previous epoch. Subsequently, the Three-level Time-aware Positional Encoder (§4.3.1) is applied to capture order-level, moment-level, and duration-level temporal information of the behaviors, producing the positional embedding $\overline{PE}$. This embedding is then added to the device control embedding $h_{c}$ to form the behavior embedding $\mathbf{h}$. Finally, $\mathbf{h}$ is fed into an $L$-layer attention-based encoder and decoder to extract contextual information for reconstructing the source sequence. During the inference phase, the Noise-aware Weighted Reconstruction Loss (§4.4) is utilized to assign different weights to various behaviors, determined by the loss vector from the training dataset, resulting in the final reconstruction loss $score$. If the $score$ surpasses the threshold $th$, SmartGuard triggers an alarm.

4.2. Loss-guided Dynamic Mask Strategy

Autoencoders (Zhai et al., 2018), which take complete data instances as input and aim to reconstruct the entire input, are widely used in anomaly detection. Different from traditional autoencoders, masked autoencoders randomly mask a portion of the input data, encode the partially-masked data and aim to reconstruct the masked tokens. By introducing a more meaningful self-supervised task, masked autoencoders have recently excelled in image representation learning (He et al., 2022). However, reconstruction tasks without masking or with random masking are sub-optimal in our scenario because they do not emphasize the learning of hard-to-learn behaviors that occur rarely.

[Figure 6: (a) reconstruction loss and (b) loss variance of different behaviors during training on the SP dataset under the three mask options]

We conduct experiments to verify the performance of autoencoders trained with three mask options: 1) w/o mask: no mask strategy is used and the objective is to reconstruct the input; 2) random mask: randomly masking behaviors at every epoch and reconstructing the masked behaviors; 3) top-$k$ loss mask: masking the top $k$ behaviors with the highest reconstruction loss and reconstructing them. We set the mask ratio to 20% for the latter two. Figure 6 shows the trends of the reconstruction loss and its variance across behaviors during training on the SP dataset (described in Table 1). First, as shown in Figure 6(a), the model without mask converges fastest, whereas the loss of the models with masks fluctuates. The model without mask can simultaneously learn all behaviors, facilitating rapid convergence; in contrast, a mask strategy only encourages the model to focus on the masked behaviors, which may hinder initial-stage convergence. Second, the model with the top-$k$ loss mask strategy shows the lowest variance towards the end of training, as shown in Figure 6(b), because this strategy effectively encourages the model to learn hard-to-learn behaviors (i.e., behaviors with high reconstruction loss), thereby reducing the variance of behavior reconstruction losses.

In this paper, we design a Loss-guided Dynamic Mask Strategy. Intuitively, at the beginning of training, we encourage the model to learn a relatively easy task to accelerate convergence, i.e., behavior sequence reconstruction without mask. After training $N$ epochs without mask, we adopt the top-$k$ loss mask strategy to encourage the model to learn the masked behaviors with high reconstruction loss. We continuously track the model's reconstruction loss of different behaviors by updating a loss vector in each epoch, which guides the mask strategy in the next epoch. In epoch $ep$, the loss vector $\mathcal{L}^{ep}_{\text{vec}}$ is calculated as:

(1)  $\mathcal{L}^{ep}_{\text{vec}}=\left\{\ell_{1},\ell_{2},\ldots,\ell_{c},\ldots,\ell_{|\mathcal{C}|}\right\},\quad c\in\mathcal{C},$

(2)  $\ell_{c}=\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}\ell^{i}_{c},$

where $n_{c}$ is the number of times the device control $c$ occurs in epoch $ep$, and $\ell_{c}$ is the average reconstruction loss of device control $c$. In epoch $ep+1$, the mask vector for a behavior sequence sample $s=[b_{1},b_{2},\cdots,b_{n}]$ is obtained as:

(3)  $mask(i)=\begin{cases}1,&\text{if }i\in sorted\_index[:\lfloor n\cdot r\rfloor]\\0,&\text{if }i\notin sorted\_index[:\lfloor n\cdot r\rfloor]\end{cases},\quad i\in[1,n],$

(4)  $sorted\_index=\operatorname{argsort}\left(\left\{\mathcal{L}^{ep}_{\text{vec}}(b_{1}),\mathcal{L}^{ep}_{\text{vec}}(b_{2}),\ldots,\mathcal{L}^{ep}_{\text{vec}}(b_{n})\right\}\right),$

where $\operatorname{argsort}$ returns the indices of the elements sorted in descending order, $r\in[0,1]$ is the mask ratio, and $n$ is the length of behavior sequence $s$.
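The following NumPy sketch illustrates Equations 1-4; the dictionary of per-occurrence losses and the integer control indices are illustrative assumptions of ours:

```python
import numpy as np

def update_loss_vector(losses_per_control: dict, num_controls: int) -> np.ndarray:
    """Eq. (1)-(2): average reconstruction loss per device control in this epoch."""
    loss_vec = np.zeros(num_controls)
    for c, losses in losses_per_control.items():  # c indexes device controls
        loss_vec[c] = np.mean(losses)             # l_c = (1/n_c) * sum_i l_c^i
    return loss_vec

def ldms_mask(seq_controls: list, loss_vec: np.ndarray, r: float) -> np.ndarray:
    """Eq. (3)-(4): mask the floor(n*r) behaviors with the highest loss."""
    n = len(seq_controls)
    seq_losses = loss_vec[seq_controls]       # L_vec(b_i) for each behavior
    sorted_index = np.argsort(-seq_losses)    # descending order by loss
    mask = np.zeros(n, dtype=int)
    mask[sorted_index[: int(n * r)]] = 1      # 1 = masked, to be reconstructed
    return mask

# Toy usage: control 2 has the highest average loss, so it is masked first.
loss_vec = update_loss_vector({0: [0.1, 0.2], 1: [0.5], 2: [1.3, 0.9]}, num_controls=3)
print(ldms_mask([0, 2, 1, 0, 2], loss_vec, r=0.4))  # -> [0 1 0 0 1]
```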

4.3. Autoencoder with Temporal Information

4.3.1. Three-level Time-aware Positional Encoder

The temporal information in user behavior sequence data primarily resides in the timing of control behaviors, which can be examined from two perspectives: the absolute timing of each individual control behavior, and the relative timing gap between control actions on the same device. On the one hand, the relative timing gap between control actions on the same device reflects how long the device stays in a specific state and the user's operation frequency. On the other hand, user behaviors are usually time-regulated, and the functionality a device carries can determine the absolute timing at which users operate it. For example, users usually operate lights in the morning and evening, and operate the microwave and the oven at meal times. Since certain operations frequently take place nearly simultaneously, we also consider the order of behaviors to provide a more comprehensive characterization of behaviors that occur successively. Therefore, we incorporate three types of temporal information into our model. (1) Order-level temporal information: we use an integer $order\in[0,n-1]$ to denote the order-level information of a behavior, where $n$ is the length of the behavior sequence $s$. (2) Moment-level temporal information: we represent the moment as hour of day $hour$ and day of week $day$ based on the behavior's timestamp. (3) Duration-level temporal information: the duration for behavior $b$ is calculated as:

(5)  $duration_{b}=t(b_{next})-t(b),$

where $b$ and $b_{next}$ are behaviors on the same device, $b_{next}$ is the first behavior after $b$ that operates on the device, and $t(b)$ represents the occurrence time of behavior $b$.
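As a concrete illustration, here is a minimal sketch of this duration extraction, assuming durations are measured in seconds and default to 0 when a device has no later behavior in the sequence (a convention not specified above):

```python
from datetime import datetime

def durations(seq):
    """seq: list of (timestamp, device) tuples ordered by time.
    Returns duration_b = t(b_next) - t(b) in seconds, where b_next is the
    next behavior on the same device; 0.0 if no such behavior exists."""
    out = []
    for i, (t, d) in enumerate(seq):
        nxt = next((t2 for t2, d2 in seq[i + 1:] if d2 == d), None)
        out.append((nxt - t).total_seconds() if nxt else 0.0)
    return out

seq = [
    (datetime(2022, 8, 4, 18, 0), "water valve"),   # open
    (datetime(2022, 8, 4, 18, 5), "light"),
    (datetime(2022, 8, 4, 18, 40), "water valve"),  # close -> 40 min duration
]
print(durations(seq))  # [2400.0, 0.0, 0.0]
```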

Then, the positional embedding is calculated as:

(6)  $\overline{PE}=w_{order}\cdot PE(order)+w_{hour}\cdot PE(hour)+w_{day}\cdot PE(day)+w_{dur}\cdot PE(duration),$

where $w_{order}$, $w_{hour}$, $w_{day}$ and $w_{dur}$ are learnable weights, and $PE(\cdot)$ is a positional encoding function (Vaswani et al., 2017) defined as:

(7)  $PE_{(\cdot,2i)}=\sin\left(\cdot/10000^{2i/d}\right),\qquad PE_{(\cdot,2i+1)}=\cos\left(\cdot/10000^{2i/d}\right),$

where $i$ denotes the $i$-th dimension of the positional embedding and $d$ is the dimension of the temporal embedding.

To learn the representation $h_{c}$ for device control $c\in\mathcal{C}$, we first encode device control $c$ into a low-dimensional latent space through the device control encoder, i.e., an embedding layer. Finally, we add the positional embedding to the device control embedding as follows to get the behavior embedding:

(8)  $\mathbf{h}=\overline{PE}+h_{c}.$
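To make Equations 6-8 concrete, here is a minimal PyTorch sketch, under the assumptions (ours) that a single sinusoidal table is shared across the four temporal signals and that duration is already discretized into integer buckets; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

def sinusoidal_pe(positions: torch.Tensor, d: int) -> torch.Tensor:
    """Eq. (7): PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(...)."""
    i = torch.arange(d // 2, dtype=torch.float)
    div = torch.pow(10000.0, 2 * i / d)
    angles = positions.float().unsqueeze(-1) / div                        # (n, d/2)
    return torch.stack((angles.sin(), angles.cos()), dim=-1).flatten(-2)  # (n, d)

class TTPE(nn.Module):
    def __init__(self, num_controls: int, d: int):
        super().__init__()
        self.control_emb = nn.Embedding(num_controls, d)  # device control encoder
        self.w = nn.Parameter(torch.ones(4))              # w_order, w_hour, w_day, w_dur
        self.d = d

    def forward(self, controls, order, hour, day, duration):
        pe = (self.w[0] * sinusoidal_pe(order, self.d)
              + self.w[1] * sinusoidal_pe(hour, self.d)
              + self.w[2] * sinusoidal_pe(day, self.d)
              + self.w[3] * sinusoidal_pe(duration, self.d))  # Eq. (6)
        return pe + self.control_emb(controls)               # Eq. (8): h = PE + h_c

ttpe = TTPE(num_controls=141, d=64)
h = ttpe(torch.tensor([3, 7]), torch.tensor([0, 1]),
         torch.tensor([8, 8]), torch.tensor([2, 2]), torch.tensor([0, 15]))
print(h.shape)  # torch.Size([2, 64])
```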

4.3.2. Sequence Encoder

To learn the sequence embedding, we employ a transformer encoder (Vaswani et al., 2017) consisting of a multi-head attention layer, residual connections and a position-wise feed-forward network (FNN). Given an input behavior representation $\mathbf{h}$, the self-attention layer can effectively mine the global semantic information of the behavior sequence context by learning query $\mathrm{Q}$, key $\mathrm{K}$ and value $\mathrm{V}$ matrices, which are calculated as:

(9)  $\mathrm{Q}=\mathbf{h}\mathrm{W}^{Q},\quad\mathrm{K}=\mathbf{h}\mathrm{W}^{K},\quad\mathrm{V}=\mathbf{h}\mathrm{W}^{V},$

where $\mathrm{W}^{Q},\mathrm{W}^{K},\mathrm{W}^{V}$ are the transformation matrices. The attention output $\mathbf{A}$ is computed by:

(10)  $\mathbf{A}=\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$

where $d_{k}$ is the dimension of $K$. Multi-head attention is applied to improve the stability of the learning process and achieve higher performance. Then, the position-wise feed-forward network (FNN) and residual connections are adopted:

(11)  $\mathbf{\overline{h}}=\operatorname{Trans}(\mathbf{h})=\mathbf{h}+\mathbf{Ah}+\mathrm{FNN}(\mathbf{h}+\mathbf{Ah}),$

where $\operatorname{Trans}(\cdot)$ is the transformer and $\operatorname{FNN}(\cdot)$ is a 2-layer position-wise feed-forward network (Vaswani et al., 2017).
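A single-head PyTorch sketch of Equations 9-11 follows; it reads $\mathbf{A}\mathbf{h}$ in Equation 11 as the attention output $\operatorname{softmax}(QK^{T}/\sqrt{d_k})V$, omits multi-head attention and layer normalization for brevity, and the 4x hidden width of the FNN is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncoderLayer(nn.Module):
    """One layer of Eq. (9)-(11): h_bar = h + attn(h) + FNN(h + attn(h))."""
    def __init__(self, d: int):
        super().__init__()
        self.Wq, self.Wk, self.Wv = (nn.Linear(d, d, bias=False) for _ in range(3))
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))

    def forward(self, h):                                    # h: (n, d)
        Q, K, V = self.Wq(h), self.Wk(h), self.Wv(h)         # Eq. (9)
        scores = F.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)  # Eq. (10)
        attended = h + scores @ V                            # h + A.h
        return attended + self.ffn(attended)                 # Eq. (11)

layer = EncoderLayer(d=64)
print(layer(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```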

4.3.3. Sequence Decoder

The decoder has the same architecture as the encoder. We input $\mathbf{\overline{h}}$ into the decoder to reconstruct the entire sequence; the probabilities of the target device controls are calculated as:

(12)  $\mathbf{\widetilde{h_{i}}}=\operatorname{decoder}\left(\mathbf{\overline{h_{i}}}\right),$

(13)  $\hat{\mathbf{y_{i}}}=\operatorname{softmax}\left(\mathbf{W}_{h}\mathbf{\widetilde{h_{i}}}\right),$

where $\hat{\mathbf{y_{i}}}$ is the predicted probability distribution of the $i$-th device control, $\mathbf{W}_{h}\in\mathbb{R}^{|\mathcal{C}|\times len(h)}$ is a learnable transformation matrix, $|\mathcal{C}|$ is the number of device controls, and $len(h)$ is the length of $h$.
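A short sketch of the reconstruction head in Equations 12-13; since the paper only states that the decoder mirrors the encoder, a standard transformer encoder layer is used here as a stand-in for the architecture of §4.3.2:

```python
import torch
import torch.nn as nn

d, num_controls = 64, 141
# Stand-in decoder with the same architecture family as the encoder.
decoder = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
W_h = nn.Linear(d, num_controls, bias=False)  # W_h in R^{|C| x len(h)}

h_bar = torch.randn(1, 10, d)                 # encoder output for a length-10 sequence
h_tilde = decoder(h_bar)                      # Eq. (12): h~ = decoder(h_bar)
y_hat = torch.softmax(W_h(h_tilde), dim=-1)   # Eq. (13): per-position control probabilities
print(y_hat.shape)                            # torch.Size([1, 10, 141])
```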

4.3.4. Objective Function

We optimize the model to minimize the average reconstruction loss measured by cross-entropy loss:

(14)  $\mathcal{L}_{rec}=\begin{cases}-\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum^{|s|}_{i=1}\mathbf{y}_{i}\log\hat{\mathbf{y}}_{i},&\text{if }epoch\leq N\\-\frac{1}{|\mathcal{S}|}\sum_{s\in\mathcal{S}}\sum^{|s|}_{i=1}mask_{s}(i)\,\mathbf{y}_{i}\log\hat{\mathbf{y}}_{i},&\text{if }epoch>N\end{cases}$

where $\mathcal{S}$ is the set of behavior sequences, $|s|$ is the length of sequence $s$, $\mathbf{y}_{i}$ is the one-hot vector of the ground-truth label, $mask_{s}$ is the mask vector for sequence $s$, and $N$ is the number of training epochs without mask.
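A sketch of Equation 14 for a single sequence; normalizing the masked branch by the number of masked positions (rather than summing them) is an implementation assumption of ours:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(logits, targets, mask, epoch, N):
    """Eq. (14). logits: (n, |C|); targets: (n,) ground-truth control ids;
    mask: (n,) LDMS mask (1 = masked behavior)."""
    per_pos = F.cross_entropy(logits, targets, reduction="none")  # -y_i log(y_hat_i)
    if epoch <= N:                 # warm-up phase: reconstruct the full sequence
        return per_pos.mean()
    # masked phase: only masked positions contribute to the loss
    return (mask * per_pos).sum() / mask.sum().clamp(min=1)

logits = torch.randn(5, 141)
targets = torch.randint(0, 141, (5,))
mask = torch.tensor([0.0, 1.0, 0.0, 0.0, 1.0])
print(reconstruction_loss(logits, targets, mask, epoch=10, N=5))
```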

4.4. Noise-aware Weighted Reconstruction Loss

Although LDMS encourages the model to focus on learning behaviors with high reconstruction losses, it remains challenging to reconstruct noise behaviors due to their inherent uncertainty. The significant reconstruction loss associated with noise behaviors can overshadow other aspects during anomaly detection, potentially leading to the misclassification of normal sequences containing noise behaviors as anomalies.

To eliminate the interference of noise behaviors, we propose a Noise-aware Weighted Reconstruction Loss as the anomaly score. After training, we obtain the final loss vector:

(15)  $\mathcal{L}_{\text{vec}}=\left\{\ell_{1},\ell_{2},\ldots,\ell_{c},\ldots,\ell_{|\mathcal{C}|}\right\},\quad c\in\mathcal{C},$

which is converted into the corresponding weight vector:

(16)  $\mathcal{W}_{vec}=\left\{w_{1},w_{2},\ldots,w_{c},\ldots,w_{|\mathcal{C}|}\right\},\quad w_{k}\in(0,1),$

by the following equation:

(17)  $\mathcal{W}_{vec}=\operatorname{sigmoid}\left(-\frac{\operatorname{relu}\left(\mathcal{L}_{vec}-\mathbb{E}\left(\mathcal{L}_{vec}\right)\right)}{\sqrt{\operatorname{Var}\left(\mathcal{L}_{vec}\right)}\cdot\mu}\right),$

where $\mu$ is a coefficient to adjust the input of the sigmoid function, and $\mathbb{E}$ and $\operatorname{Var}$ calculate the expectation and variance of the loss distribution, respectively. The ReLU function ensures that behaviors with losses less than $\mathbb{E}(\mathcal{L}_{vec})$ (routine behaviors) are equally weighted. The sigmoid function assigns small weights to behaviors with high losses (potential noise behaviors). For each behavior $b_{i}$ in a sequence $s=\{b_{1},b_{2},\cdots,b_{n}\}$, we compute the weight $p_{i}$ as follows:

(18)  $p_{i}=\frac{\mathcal{W}_{vec}(b_{i})}{\sum_{j=1}^{n}\mathcal{W}_{vec}(b_{j})}.$
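The NumPy sketch below follows Equations 16-18: controls with at-or-below-average loss get weight sigmoid(0) = 0.5 and so, after the per-sequence normalization of Equation 18, end up weighted equally, while high-loss controls (likely noise) are sharply down-weighted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weight_vector(loss_vec: np.ndarray, mu: float = 0.1) -> np.ndarray:
    """Eq. (16)-(17): down-weight device controls with above-average loss."""
    centered = np.maximum(loss_vec - loss_vec.mean(), 0.0)      # relu(L - E[L])
    return sigmoid(-centered / (np.sqrt(loss_vec.var()) * mu))  # weights in (0, 1)

def behavior_weights(seq_controls: list, w_vec: np.ndarray) -> np.ndarray:
    """Eq. (18): normalize the weights within the sequence."""
    w = w_vec[seq_controls]
    return w / w.sum()

loss_vec = np.array([0.1, 0.2, 0.15, 2.5])  # control 3 looks like a noise behavior
w_vec = weight_vector(loss_vec)
print(behavior_weights([0, 1, 3], w_vec))   # control 3 gets a near-zero weight
```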

Then, we can get the anomaly score of $s$ as the weighted sum of the reconstruction losses of the behaviors in $s$:

(19)  $score(s)=-\frac{1}{|s|}\sum^{|s|}_{i=1}p_{i}\,\mathbf{y}_{i}\log\hat{\mathbf{y}}_{i}.$

SmartGuard can infer whether a behavior sequence $s_{i}$ is normal or abnormal based on the anomaly score:

(20)  $s_{i}=\begin{cases}\text{Normal},&\text{if }score(s_{i})\leq th\\\text{Abnormal},&\text{if }score(s_{i})>th\end{cases}$

where $th$ is the anomaly threshold. We take the 95% quantile of the reconstruction loss distribution on the validation set as $th$.
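Finally, a sketch of Equations 19-20 together with the threshold calibration just described; the gamma-distributed validation scores are a stand-in for scores computed on real validation sequences:

```python
import numpy as np

def anomaly_score(per_pos_ce: np.ndarray, p: np.ndarray) -> float:
    """Eq. (19): score(s) = (1/|s|) * sum_i p_i * CE_i."""
    return float(np.mean(p * per_pos_ce))

def calibrate_threshold(val_scores: np.ndarray, q: float = 0.95) -> float:
    """th = 95% quantile of scores on the (normal) validation set."""
    return float(np.quantile(val_scores, q))

def is_abnormal(score: float, th: float) -> bool:
    """Eq. (20): flag the sequence when its score exceeds the threshold."""
    return score > th

val_scores = np.random.default_rng(0).gamma(2.0, 0.05, size=1000)  # stand-in scores
th = calibrate_threshold(val_scores)
s = anomaly_score(np.array([0.1, 0.2, 3.0]), np.array([0.4, 0.4, 0.2]))
print(s, th, is_abnormal(s, th))
```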

5. Experiments

In this section, we conduct comprehensive experiments on three real-world datasets to answer the following key questions:

  • RQ1. Performance. Compared with other methods, does SmartGuard achieve better anomaly detection performance?

  • RQ2. Ablation study. How will model performance change if we remove key modules of SmartGuard?

  • RQ3. Parameter study. How do key parameters affect the performance of SmartGuard?

  • RQ4. Interpretability study. Can SmartGuard give reasonable explanations for the detection results?

  • RQ5. Embedding space analysis. Does SmartGuard successfully learn useful embeddings of behaviors and correct correlations between device controls and time?

5.1. Experimental Setup

5.1.1. Datasets

We train SmartGuard on three real-world datasets consisting of only normal samples: two (FR/SP) from public datasets (https://github.com/snudatalab/SmartSense) and one anonymous dataset (AN) collected by ourselves. The dataset descriptions are shown in Table 1. All datasets are split into training, validation and testing sets with a ratio of 7:1:2. To evaluate the performance of SmartGuard, we construct ten categories of abnormal behaviors as shown in Table 2 and insert them among normal behaviors to simulate real anomaly scenarios.

Table 1. Dataset statistics.

Name | Time period (Y-M-D) | Size | # Devices | # Device controls
AN | 2022-07-31 ~ 2022-08-31 | 1,765 | 36 | 141
FR | 2022-02-27 ~ 2022-03-25 | 4,423 | 33 | 222
SP | 2022-02-28 ~ 2022-03-30 | 15,665 | 34 | 234

Table 2. Ten categories of abnormal behaviors.

Anomaly | Type
Light flickering | SD
Camera flickering | SD
TV flickering | SD
Open the window while the smart lock is locked | MD
Close the camera while the smart lock is locked | MD
Open the air conditioner's cool mode in winter | DM
Open the window at midnight | DM
Open the water valve at midnight | DM
Shower for a long time | DD
Microwave runs for a long time | DD

5.1.2. Baselines

We compare SmartGuard with existing general unsupervised anomaly detection methods and unsupervised abnormal behavior detection methods in smart homes:

  • Local Outlier Factor (LOF) (Cheng et al., 2019) calculates the density ratio between each sample and its neighbors to detect anomalies.

  • Isolation Forest (IF) (Liu et al., 2008) builds binary trees; instances with short average path lengths are detected as anomalies.

  • 6thSense (Sikder et al., 2017) utilizes Naive Bayes to detect malicious behavior associated with sensors in smart homes.

  • Aegis (Sikder et al., 2019) utilizes a Markov Chain-based machine learning technique to detect malicious behavior in smart homes.

  • OCSVM (Amraoui and Zouari, 2021) builds a One-Class Support Vector Machine model to prevent malicious control of smart home systems.

  • Autoencoder (Chen et al., 2018) learns to reconstruct normal data and then uses the reconstruction error to determine whether the input data is abnormal.

  • ARGUS (Rieger et al., 2023) designs an autoencoder based on Gated Recurrent Units (GRU) to detect IoT infiltration attacks.

  • Transformer Autoencoder (TransAE) (Vaswani et al., 2017) uses the self-attention mechanism in the encoder and decoder to achieve context-aware anomaly detection.

5.1.3. Evaluation metrics

We use common metrics such as False Positive Rate, False Negative Rate, Recall, and F1-Score to evaluate the performance of SmartGuard.

5.1.4. Complexity analysis

Suppose the embedding size is $d$ and the behavior sequence length is $n$. The computational complexity of SmartGuard is mainly due to the self-attention layer and the feed-forward network, i.e., $O(n^{2}d+nd^{2})$. The dominant term is typically $O(n^{2}d)$ from the self-attention layer. Inference with SmartGuard takes only 0.0145s, which shows that it can detect abnormal behaviors in real time.

Table 3. Anomaly detection performance of SmartGuard and all baselines.

Dataset | Type | Metric | LOF | IF | 6thSense | Aegis | OCSVM | Autoencoder | ARGUS | TransAE | SmartGuard
AN | SD | Recall | 0.0275 | 0.4105 | 0.4680 | 0.2902 | 0.5399 | 0.9832 | 0.9858 | 0.9882 | 0.9986
AN | SD | F1 Score | 0.0519 | 0.4972 | 0.5196 | 0.3672 | 0.5862 | 0.9915 | 0.9928 | 0.9908 | 0.9967
AN | MD | Recall | 0.0745 | 0.4039 | 0.5941 | 0.4431 | 0.6039 | 0.5156 | 0.5666 | 0.6216 | 0.9745
AN | MD | F1 Score | 0.1357 | 0.4824 | 0.6215 | 0.4718 | 0.6553 | 0.6692 | 0.7135 | 0.7557 | 0.9832
AN | DM | Recall | 0.0784 | 0.4373 | 0.3745 | 0.5647 | 0.3510 | 0.5196 | 0.5313 | 0.6078 | 0.9961
AN | DM | F1 Score | 0.1418 | 0.5174 | 0.4817 | 0.5647 | 0.4257 | 0.6725 | 0.6843 | 0.7452 | 0.9941
AN | DD | Recall | 0.0961 | 0.3451 | 0.1980 | 0.7804 | 0.4961 | 0.5137 | 0.5117 | 0.5294 | 0.9980
AN | DD | F1 Score | 0.1713 | 0.4282 | 0.3108 | 0.7044 | 0.5967 | 0.6675 | 0.6675 | 0.6818 | 0.9951
FR | SD | Recall | 0.3541 | 0.2444 | 0.2907 | 0.3915 | 0.5918 | 0.9816 | 0.9796 | 0.9864 | 0.9979
FR | SD | F1 Score | 0.4804 | 0.3655 | 0.4167 | 0.4542 | 0.6612 | 0.9907 | 0.9897 | 0.9921 | 0.9932
FR | MD | Recall | 0.4275 | 0.2980 | 0.6567 | 0.7098 | 0.4384 | 0.9726 | 0.9875 | 0.9782 | 0.9984
FR | MD | F1 Score | 0.5192 | 0.4230 | 0.6092 | 0.3827 | 0.5534 | 0.9861 | 0.9783 | 0.9874 | 0.9907
FR | DM | Recall | 0.3825 | 0.3191 | 0.5461 | 0.7619 | 0.3920 | 0.4952 | 0.6676 | 0.6529 | 0.9985
FR | DM | F1 Score | 0.4830 | 0.4494 | 0.6124 | 0.6822 | 0.4940 | 0.6508 | 0.7867 | 0.7779 | 0.9912
FR | DD | Recall | 0.3572 | 0.1850 | 0.5358 | 0.9743 | 0.6267 | 0.4397 | 0.7329 | 0.6098 | 0.9981
FR | DD | F1 Score | 0.4375 | 0.2806 | 0.5880 | 0.4481 | 0.6422 | 0.6013 | 0.8382 | 0.7479 | 0.9921
SP | SD | Recall | 0.2197 | 0.2643 | 0.6979 | 0.1618 | 0.5332 | 0.9824 | 0.9795 | 0.9172 | 0.9862
SP | SD | F1 Score | 0.3350 | 0.3857 | 0.7248 | 0.2164 | 0.6155 | 0.9911 | 0.9896 | 0.9489 | 0.9831
SP | MD | Recall | 0.2786 | 0.3399 | 0.6317 | 0.7445 | 0.3840 | 0.5645 | 0.9696 | 0.9936 | 0.9961
SP | MD | F1 Score | 0.3916 | 0.4632 | 0.6440 | 0.6636 | 0.5026 | 0.7095 | 0.9845 | 0.9866 | 0.9830
SP | DM | Recall | 0.2780 | 0.3465 | 0.6080 | 0.8121 | 0.5351 | 0.3074 | 0.5297 | 0.5451 | 0.9198
SP | DM | F1 Score | 0.4112 | 0.4918 | 0.6935 | 0.7758 | 0.6341 | 0.4649 | 0.6847 | 0.6962 | 0.9498
SP | DD | Recall | 0.2109 | 0.1763 | 0.5449 | 0.8001 | 0.8293 | 0.6455 | 0.6455 | 0.6456 | 0.9961
SP | DD | F1 Score | 0.3052 | 0.2627 | 0.6343 | 0.6545 | 0.7311 | 0.7685 | 0.7658 | 0.7653 | 0.9788

5.2. Performance Comparison (RQ1)

We use grid search to adjust the parameters of SmartGuard and report the overall performance of SmartGuard and all baselines in Table 3. Bold values indicate the optimal performance among all schemes, and underlined values indicate the second-best performance. First, SmartGuard outperforms all competitors in most cases. This is because SmartGuard simultaneously considers temporal information, behavior imbalance and noise behaviors. Second, SmartGuard significantly improves the performance on DM- and DD-type anomaly detection. We ascribe this superiority to TTPE's effective mining of the temporal information of behaviors. Third, LOF, IF and 6thSense show the worst performance. Aegis and OCSVM outperform LOF, IF and 6thSense, which benefits from the Markov Chain's modeling of behavior transitions and the SVM's powerful kernel function. The Autoencoder outperforms the traditional models because of its stronger sequence modeling capability, and ARGUS outperforms the Autoencoder because of the stronger sequence modeling capability of the GRU. By exploiting the transformer to mine contextual information, TransAE achieves better performance than all other baselines, but is still inferior to our proposed scheme.

5.3. Ablation Study (RQ2)

Table 4. Ablation study (F1 Score) with different combinations of LDMS, TTPE and NWRL (Y = included, X = removed).

Variant | LDMS | TTPE | NWRL | SD | MD | DM | DD
C0 | X | X | X | 0.9908 | 0.7557 | 0.7452 | 0.6818
C1 | Y | Y | X | 0.9877 | 0.9708 | 0.9767 | 0.9817
C2 | Y | X | Y | 0.9883 | 0.8716 | 0.8783 | 0.8799
C3 | X | Y | Y | 0.9902 | 0.9766 | 0.9835 | 0.9855
C4 | Y | Y | Y | 0.9967 | 0.9832 | 0.9941 | 0.9951

SmartGuard mainly consists of three components: the Loss-guided Dynamic Mask Strategy (LDMS), the Three-level Time-aware Position Embedding (TTPE) and the Noise-aware Weighted Reconstruction Loss (NWRL). To investigate the effectiveness of each component, we implement 5 variants of SmartGuard for the ablation study ($C_0$-$C_4$). Y denotes adding the corresponding component, and X denotes removing it. $C_4$ is SmartGuard with all three components. As shown in Table 4, each component of SmartGuard has a positive impact on the results. The combination of all components brings the best results, which are much better than using any subset of the three components.

5.4. Parameter Study (RQ3)

5.4.1. The mask ratio $r$ and the training steps $N$ without mask

Figure 7 illustrates that SmartGuard achieves the optimal performance when $r=0.4$ and $N=5$. The parameter $r$ (Equation 3) determines the difficulty of the model's learning task. A smaller $r$ fails to effectively encourage the model to learn hard-to-learn behaviors, while a larger $r$ increases the learning burden on the model, consequently diminishing performance. As for the training steps without mask, a smaller $N$ hinders the model from converging effectively at the beginning stage, whereas a larger $N$ impedes the model's ability to learn hard-to-learn behaviors towards the end, resulting in degraded performance.

[Figure 7: anomaly detection performance under different mask ratios r and different numbers of warm-up epochs N]

5.4.2. The coefficient $\mu$ of the Noise-aware Weighted Reconstruction Loss

The parameter $\mu$ (Equation 17) controls the weights assigned to potential noise behaviors: a smaller $\mu$ results in a smaller weight for noise behaviors, while a larger $\mu$ leads to a greater weight. As illustrated in Figure 8(a), the False Positive Rate gradually decreases as $\mu$ decreases, benefiting from the reduced loss weight assigned to noise behaviors. However, as depicted in Figure 8(b), the False Negative Rate slightly increases as $\mu$ decreases. When $\mu=0.1$, SmartGuard achieves a balance, minimizing both the False Positive Rate and the False Negative Rate.

[Figure 8: (a) False Positive Rate and (b) False Negative Rate under different values of μ]

5.4.3. The embedding size $d$

We tune the embedding size for time and device control, ranging from 8 to 512. As depicted in Figure 9(a), an initial increase in the embedding dimension results in a notable performance improvement, which is attributed to the larger dimensionality enabling the behavior embedding to capture more comprehensive contextual information, thereby furnishing valuable representations for the other modules of SmartGuard. Nevertheless, excessively large sizes (e.g., > 256) can lead to performance degradation due to over-fitting.

[Figure 9: F1-Score under (a) different embedding sizes and (b) different numbers of encoder/decoder layers]

5.4.4. The number of layers $L$ of the encoder and decoder

Figure 9(b) shows the performance of SmartGuard with different numbers of layers. As $L$ increases, the F1-Score first increases and then decreases, reaching the optimal value at 3 layers: fewer layers lead to under-fitting, while too many layers lead to over-fitting.

5.5. Case Study (RQ4)

To assess the interpretability of SmartGuard, we select a behavior sequence from the test set of the AN dataset and visualize its attention weights and reconstruction loss. As illustrated in Figure 10, the user initiated a sequence of actions: turning off the TV, stopping the sweeper, closing the curtains, switching off the bed light, and locking the smart lock before going to sleep. Subsequently, an attacker took control of the IoT devices, turning off the camera and opening the window for potential theft. Examining Figure 10(a), we observe that the attention weights between behaviors $b_6$, $b_7$, $b_8$ and the other behaviors in the sequence are relatively small. This suggests that $b_6$, $b_7$ and $b_8$ lack contextual relevance to the other behaviors and are likely abnormal. Turning to Figure 10(b), the reconstruction losses for behaviors $b_6$, $b_7$, and $b_8$ are notably high. SmartGuard identifies these anomalies in the sequence, triggering an immediate alarm.

[Figure 10: (a) attention weights and (b) reconstruction losses for the case-study sequence]

5.6. Embedding Space Analysis (RQ5)

We visualize the similarity between device control embeddings and time embeddings (i.e., hour, day and duration embeddings) to analyze whether the model effectively learns the relationships between behaviors. As shown in Figure 11(a), opening the curtains usually occurs between 6-9 and 9-12 o'clock because users usually get up during this period, while closing the curtains generally occurs between 21-24 o'clock because users usually go to bed during this period. The dishwasher usually runs between 12-15 and 18-21 o'clock, which means that the user has lunch and dinner during these periods and then washes the dishes. As shown in Figure 11(b), users generally watch TV and do laundry on Saturdays and Sundays. As shown in Figure 11(c), users usually take a bath for about 1-2 hours; bath times longer than this may indicate that an abnormality occurs.

[Figure 11: similarity between device control embeddings and (a) hour, (b) day and (c) duration embeddings]

6. Conclusion

In this paper, we introduce SmartGuard for unsupervised user behavior anomaly detection. We first devise a Loss-guided Dynamic Mask Strategy (LDMS) to encourage the model to learn less frequent behaviors that are often overlooked during the learning process. Additionally, we introduce Three-level Time-aware Position Embedding (TTPE) to integrate temporal information into positional embedding, allowing for the detection of temporal context anomalies. Furthermore, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) to assign distinct weights to routine behaviors and noise behaviors, thereby mitigating the impact of noise. Comprehensive experiments conducted on three datasets encompassing ten types of anomaly behaviors demonstrate that SmartGuard consistently outperforms state-of-the-art baselines while delivering highly interpretable results.

Acknowledgements.

We thank the anonymous reviewers for their constructive feedback and comments. This work is supported by the Major Key Project of PCL under grant No. PCL2023A06-4, the National Key Research and Development Program of China under grant No. 2022YFB3105000, and the Shenzhen Key Lab of Software Defined Networking under grant No. ZDSYS20140509172959989.

References

  • Amraoui and Zouari (2021) Noureddine Amraoui and Belhassen Zouari. 2021. An ML behavior-based security control for smart home systems. In Risks and Security of Internet and Systems: 15th International Conference, CRiSIS 2020, Paris, France, November 4–6, 2020, Revised Selected Papers 15. Springer, 117–130.
  • Celik et al. (2018) Z. Berkay Celik, Leonardo Babun, Amit Kumar Sikder, Hidayet Aksu, Gang Tan, Patrick D. McDaniel, and A. Selcuk Uluagac. 2018. Sensitive Information Tracking in Commodity IoT. In 27th USENIX Security Symposium, USENIX Security 2018, Baltimore, MD, USA, August 15-17, 2018, William Enck and Adrienne Porter Felt (Eds.). USENIX Association, 1687–1704. https://www.usenix.org/conference/usenixsecurity18/presentation/celik
  • Chen et al. (2019) Qiwei Chen, Huan Zhao, Wei Li, Pipei Huang, and Wenwu Ou. 2019. Behavior sequence transformer for e-commerce recommendation in Alibaba. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data. 1–4.
  • Chen et al. (2018) Zhaomin Chen, Chai Kiat Yeo, Bu Sung Lee, and Chiew Tong Lau. 2018. Autoencoder-based network anomaly detection. In 2018 Wireless Telecommunications Symposium (WTS). IEEE, 1–5.
  • Cheng et al. (2019) Zhangyu Cheng, Chengming Zou, and Jianwei Dong. 2019. Outlier detection using isolation forest and local outlier factor. In Proceedings of the Conference on Research in Adaptive and Convergent Systems. 161–168.
  • Chi et al. (2022) Haotian Chi, Chenglong Fu, Qiang Zeng, and Xiaojiang Du. 2022. Delay Wreaks Havoc on Your Smart Home: Delay-based Automation Interference Attacks. In 43rd IEEE Symposium on Security and Privacy, SP 2022, San Francisco, CA, USA, May 22-26, 2022. IEEE, 285–302. https://doi.org/10.1109/SP46214.2022.9833620
  • de Souza Pereira Moreira et al. (2021) Gabriel de Souza Pereira Moreira, Sara Rabhi, Jeong Min Lee, Ronay Ak, and Even Oldridge. 2021. Transformers4Rec: Bridging the gap between NLP and sequential/session-based recommendation. In Proceedings of the 15th ACM Conference on Recommender Systems (RecSys). 143–153.
  • Feng et al. (2018) Jie Feng, Yong Li, Chao Zhang, Funing Sun, Fanchao Meng, Ang Guo, and Depeng Jin. 2018. DeepMove: Predicting Human Mobility with Attentional Recurrent Networks. In Proceedings of the 2018 World Wide Web Conference (Lyon, France) (WWW '18). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1459–1468. https://doi.org/10.1145/3178876.3186058
  • Fernandes et al. (2016) Earlence Fernandes, Jaeyeon Jung, and Atul Prakash. 2016. Security Analysis of Emerging Smart Home Applications. In Proceedings of the IEEE Symposium on Security and Privacy, SP 2016, San Jose, CA, USA.
  • Fu et al. (2022) Chenglong Fu, Qiang Zeng, Haotian Chi, Xiaojiang Du, and Siva Likitha Valluru. 2022. IoT Phantom-Delay Attacks: Demystifying and Exploiting IoT Timeout Behaviors. In 52nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2022, Baltimore, MD, USA, June 27-30, 2022. IEEE, 428–440. https://doi.org/10.1109/DSN53405.2022.00050
  • Fu et al. (2021) Chenglong Fu, Qiang Zeng, and Xiaojiang Du. 2021. HAWatcher: Semantics-Aware Anomaly Detection for Appified Smart Homes. In 30th USENIX Security Symposium, USENIX Security 2021, August 11-13, 2021, Michael D. Bailey and Rachel Greenstadt (Eds.). USENIX Association, 4223–4240. https://www.usenix.org/conference/usenixsecurity21/presentation/fu-chenglong
  • Gu et al. (2020) Tianbo Gu, Zheng Fang, Allaukik Abhishek, Hao Fu, Pengfei Hu, and Prasant Mohapatra. 2020. IoTGaze: IoT Security Enforcement via Wireless Context Analysis. In 39th IEEE Conference on Computer Communications, INFOCOM 2020, Toronto, ON, Canada, July 6-9, 2020. IEEE, 884–893. https://doi.org/10.1109/INFOCOM41043.2020.9155459
  • He et al. (2022) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. 2022. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16000–16009.
  • Jeon et al. (2022) Hyunsik Jeon, Jongjin Kim, Hoyoung Yoon, Jaeri Lee, and U Kang. 2022. Accurate action recommendation for smart home via two-level encoders and commonsense knowledge. In Proceedings of the 31st ACM International Conference on Information and Knowledge Management (CIKM). 832–841.
  • Jia et al. (2017) Yunhan Jack Jia, Qi Alfred Chen, Shiqi Wang, Amir Rahmati, Earlence Fernandes, Zhuoqing Morley Mao, and Atul Prakash. 2017. ContexIoT: Towards Providing Contextual Integrity to Appified IoT Platforms. In 24th Annual Network and Distributed System Security Symposium, NDSS 2017, San Diego, California, USA, February 26 - March 1, 2017. The Internet Society. https://www.ndss-symposium.org/ndss2017/ndss-2017-programme/contexlot-towards-providing-contextual-integrity-appified-iot-platforms/
  • Kang and McAuley (2018) Wang-Cheng Kang and Julian McAuley. 2018. Self-attentive sequential recommendation. In 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 197–206.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Li et al. (2024) Fan Li, Xu Si, Shisong Tang, Dingmin Wang, Kunyan Han, Bing Han, Guorui Zhou, Yang Song, and Hechang Chen. 2024. Contextual Distillation Model for Diversified Recommendation. arXiv preprint arXiv:2406.09021 (2024). https://arxiv.org/abs/2406.09021
  • Liu et al. (2008) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008. Isolation forest. In 2008 Eighth IEEE International Conference on Data Mining. IEEE, 413–422.
  • Lueth (2018) Knud Lasse Lueth. 2018. State of the IoT 2018: Number of IoT devices now at 7B – Market accelerating. https://iot-analytics.com/state-of-the-iot-update-q1-q2-2018-number-of-iot-devices-now-7b/.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NIPS) 32 (2019).
  • Rieger et al. (2023) Phillip Rieger, Marco Chilese, Reham Mohamed, Markus Miettinen, Hossein Fereidooni, and Ahmad-Reza Sadeghi. 2023. ARGUS: Context-Based Detection of Stealthy IoT Infiltration Attacks. In Proceedings of the 32nd USENIX Conference on Security Symposium (Anaheim, CA, USA) (SEC '23). USENIX Association, USA, Article 241, 18 pages.
  • Sikder et al. (2017) Amit Kumar Sikder, Hidayet Aksu, and A. Selcuk Uluagac. 2017. 6thSense: A context-aware sensor-based attack detector for smart devices. In 26th USENIX Security Symposium (USENIX Security 17). 397–414.
  • Sikder et al. (2019) Amit Kumar Sikder, Leonardo Babun, Hidayet Aksu, and A. Selcuk Uluagac. 2019. Aegis: A Context-Aware Security Framework for Smart Home Systems. In Proceedings of the 35th Annual Computer Security Applications Conference (San Juan, Puerto Rico, USA) (ACSAC '19). Association for Computing Machinery, New York, NY, USA, 28–41. https://doi.org/10.1145/3359789.3359840
  • Srinivasan et al. (2008) Vijay Srinivasan, John A. Stankovic, and Kamin Whitehouse. 2008. Protecting your daily in-home activity information from a wireless snooping attack. In UbiComp 2008: Ubiquitous Computing, 10th International Conference, UbiComp 2008, Seoul, Korea, September 21-24, 2008, Proceedings (ACM International Conference Proceeding Series, Vol. 344), Hee Yong Youn and We-Duke Cho (Eds.). ACM, 202–211. https://doi.org/10.1145/1409635.1409663
  • Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM). 1441–1450.
  • Tang et al. (2022) Shisong Tang, Qing Li, Xiaoteng Ma, Ci Gao, Dingmin Wang, Yong Jiang, Qian Ma, Aoyang Zhang, and Hechang Chen. 2022. Knowledge-based temporal fusion network for interpretable online video popularity prediction. In Proceedings of the ACM Web Conference 2022. 2879–2887.
  • Tang et al. (2023) Shisong Tang, Qing Li, Dingmin Wang, Ci Gao, Wentao Xiao, Dan Zhao, Yong Jiang, Qian Ma, and Aoyang Zhang. 2023. Counterfactual Video Recommendation for Duration Debiasing. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4894–4903.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems (NIPS) 30 (2017).
  • Wang et al. (2023) Jincheng Wang, Zhuohua Li, Mingshen Sun, Bin Yuan, and John C.S. Lui. 2023. IoT Anomaly Detection Via Device Interaction Graph. In 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2023, Porto, Portugal, June 27-30, 2023. IEEE, 494–507. https://doi.org/10.1109/DSN58367.2023.00053
  • Xiao et al. (2023a) Jingyu Xiao, Qingsong Zou, Qing Li, Dan Zhao, Kang Li, Wenxin Tang, Runjie Zhou, and Yong Jiang. 2023a. User Device Interaction Prediction via Relational Gated Graph Attention Network and Intent-aware Encoder. In Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems (AAMAS). 1634–1642.
  • Xiao et al. (2023b) Jingyu Xiao, Qingsong Zou, Qing Li, Dan Zhao, Kang Li, Zixuan Weng, Ruoyu Li, and Yong Jiang. 2023b. I Know Your Intent: Graph-enhanced Intent-aware User Device Interaction Prediction via Contrastive Learning. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT/UbiComp) 7, 3 (2023), 1–28.
  • Zhai et al. (2018) Junhai Zhai, Sufang Zhang, Junfen Chen, and Qiang He. 2018. Autoencoder and its various variants. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC). IEEE, 415–419.
  • Zou et al. (2023) Qingsong Zou, Qing Li, Ruoyu Li, Yucheng Huang, Gareth Tyson, Jingyu Xiao, and Yong Jiang. 2023. IoTBeholder: A Privacy Snooping Attack on User Habitual Behaviors from Smart Home Wi-Fi Traffic. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT/UbiComp) 7, 1 (2023), 1–26.

Appendix A Appendices

A.1. Notations

Key notations used in the paper and their definitions are summarized in Table 5.

Table 5. Key notations.
$d$, $\mathcal{D}$: a device / the set of devices
$c$, $\mathcal{C}$: a device control / the set of device controls
$s$, $\mathcal{S}$: a sequence / the set of sequences
$n$: the length of a sequence
$b$: a behavior
$t$: the timestamp of a behavior
$hour$: the hour of day of a behavior
$day$: the day of week of a behavior
$duration$: the duration of a behavior
$PE$: the positional embedding function
$PE(order)$, $w_{order}$: the order embedding and its weight
$PE(hour)$, $w_{hour}$: the hour embedding and its weight
$PE(day)$, $w_{day}$: the day embedding and its weight
$PE(duration)$, $w_{dur}$: the duration embedding and its weight
$\overline{PE}$: the integrated positional embedding
$h_c$: the device control embedding
$\mathbf{h}$: the behavior embedding
$\mathcal{L}_{rec}$: the reconstruction loss
$\ell_i$, $\mathcal{L}_{vec}$: the loss of the $i$-th behavior / the loss vector
$w_i$, $\mathcal{W}_{vec}$: the weight of the $i$-th behavior / the weight vector
$p_i$: the normalized weight of the $i$-th behavior
$mask$: the mask vector
$score(s)$: the anomaly score of sequence $s$
$th$: the anomaly threshold

A.2. Device information of different datasets

The AN, FR, and SP datasets contain 36, 33, and 34 devices, respectively, as shown in Table 6, Table 7, and Table 8.

Table 6. Devices in the AN dataset (No. Device).
0 AC; 1 heater; 2 dehumidifier; 3 humidifier_1; 4 fan; 5 standheater; 6 aircleaner; 7 humidifier_2; 8 desklight; 9 bedlight_1; 10 camera; 11 sweeper; 12 LED; 13 locker; 14 bathheater; 15 water_cooler; 16 curtains; 17 outlet; 18 audio; 19 plug; 20 bulb_2; 21 soundbox_1; 22 soundbox_2; 23 refrigerator; 24 projector; 25 washing_machine; 26 kettle; 27 dishwasher; 28 bulb_1; 29 TV; 30 pet_feeder; 31 hair_dryer; 32 window_cleaner; 33 bedlight_2; 34 bedlight_3; 35 cooler
Table 7. Devices in the FR dataset (No. Device).
0 AirConditioner; 1 AirPurifier; 2 Blind; 3 Camera; 4 ClothingCareMachine; 5 Computer; 6 ContactSensor; 7 CurbPowerMeter; 8 Dishwasher; 9 Dryer; 10 Elevator; 11 Fan; 12 GarageDoor; 13 Light; 14 Microwave; 15 MotionSensor; 16 NetworkAudio; 17 None; 18 Other; 19 Oven; 20 PresenceSensor; 21 Projector; 22 Refrigerator; 23 RemoteController; 24 RobotCleaner; 25 Siren; 26 SmartLock; 27 SmartPlug; 28 Switch; 29 Television; 30 Thermostat; 31 Washer; 32 WaterValve
Table 8. Devices in the SP dataset (No. Device).
0 AirConditioner; 1 AirPurifier; 2 Blind; 3 Camera; 4 ClothingCareMachine; 5 Computer; 6 ContactSensor; 7 CurbPowerMeter; 8 Dishwasher; 9 Dryer; 10 Elevator; 11 Fan; 12 GarageDoor; 13 Light; 14 Microwave; 15 MotionSensor; 16 NetworkAudio; 17 None; 18 Other; 19 Oven; 20 PresenceSensor; 21 Projector; 22 Refrigerator; 23 RemoteController; 24 RobotCleaner; 25 SetTop; 26 Siren; 27 SmartLock; 28 SmartPlug; 29 Switch; 30 Television; 31 Thermostat; 32 Washer; 33 WaterValve

A.3. Data collection

Testbed and Participants. To create a practical and viable smart home model, we implemented our experimental platform within an apartment setting to gather usage data for various devices, forming our smart home user behavior dataset (AN). Three volunteers were recruited to simulate the typical daily activities of a standard family, assuming the roles of an adult male, an adult female, and a child. The experimental platform comprises a comprehensive selection of 36 popular market-available devices, detailed in Table 6, with their deployment illustrated in Figure 12.

Figure 12. Deployment of the 36 devices on the testbed.

Normal Behavior Collection. We enlisted volunteers to reside in the apartment and encouraged them to use the equipment in accordance with their individual habits. Throughout the designated period of occupancy, we refrained from actively or directly intervening in the users' behavior; however, users consistently logged their activities. Following the conclusion of the data collection phase, we reviewed the device usage logs via the smart home app and combined these logs with the users' behavior records to compile a comprehensive user behavior dataset. To mitigate potential biases arising from acclimating to a new living environment, participants were required to inhabit the experimental setting for a minimum of two weeks before the formal commencement of data collection. All users possessed comprehensive knowledge of the IoT devices and applications in use. After check-in, control of all devices was relinquished to the users, who were informed in advance that their device usage would subsequently be reviewed and analyzed by our team.

Anomaly Behavior Injection. We insert the abnormal behaviors listed in Table 2 into normal behavior sequences to construct abnormal behavior sequences. The abnormal behavior sequences and the normal behavior sequences together form the test dataset. Examples of anomaly behavior sequences are shown in Figure 13.

Figure 13. Examples of anomaly behavior sequences.
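For concreteness, a minimal sketch of the injection step is given below. Behaviors are represented here as illustrative (device_control, hour, day, duration) tuples; the anomaly behaviors and the splice position are hypothetical examples, not the contents of Table 2.

```python
import random

def inject_anomaly(normal_seq, anomaly_behaviors, position=None):
    """Splice attacker/misuse behaviors into a normal sequence.

    Returns the abnormal sequence and the insertion position, so the
    injected steps can be labeled for evaluation.
    """
    seq = list(normal_seq)
    pos = random.randrange(len(seq) + 1) if position is None else position
    return seq[:pos] + list(anomaly_behaviors) + seq[pos:], pos

# Hypothetical example: bedtime routine followed by an injected attack.
normal = [("tv_off", 22, 5, 1), ("sweeper_stop", 22, 5, 1), ("lock_close", 23, 5, 1)]
attack = [("camera_off", 23, 5, 1), ("window_open", 23, 5, 30)]
abnormal, pos = inject_anomaly(normal, attack, position=len(normal))
```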

A.4. Detailed experimental settings

All models (including baselines and SmartGuard) are implemented in PyTorch (Paszke et al., 2019) and run on a GeForce RTX 3090 Ti graphics card. All models are trained with the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.001. We train SmartGuard to minimize $\mathcal{L}_{rec}$ in Equation (14). During training, we monitor the reconstruction loss and stop training if there is no performance improvement on the validation set within 10 steps. For the model hyperparameters of SmartGuard, we set the batch size to 512, and the initial weights of TTPE are $w_{order}=0.1$, $w_{hour}=0.4$, $w_{day}=0.4$, and $w_{duration}=0.7$. For the mask ratio and the mask step, we search in $\{0.2, 0.4, 0.6, 0.8\}$ and $\{3, 4, 5, 6\}$, respectively. We choose the number of encoder and decoder layers from $\{1, 2, 3, 4\}$ and the embedding size from $\{8, 16, 32, 64, 128, 256, 512\}$.
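These settings translate into a standard early-stopped training loop. The following runnable sketch mirrors them (Adam, learning rate 0.001, batch size 512, patience of 10 validation checks without improvement); the tiny autoencoder and random data are stand-ins for SmartGuard and the behavior datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Stand-in autoencoder and data; replace with SmartGuard and real sequences.
model = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 32))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
train_x = torch.randn(2048, 32)
val_x = torch.randn(512, 32)
loader = DataLoader(train_x, batch_size=512, shuffle=True)

best_val, patience, bad = float("inf"), 10, 0
for epoch in range(100):
    model.train()
    for batch in loader:
        optimizer.zero_grad()
        loss = ((model(batch) - batch) ** 2).mean()   # reconstruction loss
        loss.backward()
        optimizer.step()
    model.eval()
    with torch.no_grad():
        val = ((model(val_x) - val_x) ** 2).mean().item()
    if val < best_val - 1e-6:
        best_val, bad = val, 0
    else:
        bad += 1
        if bad >= patience:   # stop: no improvement in 10 validation checks
            break
```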

A.5. Mask strategy deep dive

To verify the effectiveness of LDMS, we compare it with the three baselines (w/o mask, random mask, and top-$k$ loss mask) mentioned in Section 4.2. As illustrated in Figure 14(a), LDMS consistently outperforms all other mask strategies across four types of anomalies. The results in Figure 14(b) further show that LDMS exhibits the smallest variance in reconstruction loss throughout the training process, demonstrating that SmartGuard learns both easy-to-learn and hard-to-learn behaviors well. We also plot the loss distribution under different mask strategies. As shown in Figure 15, LDMS yields the smallest reconstruction loss and variance, demonstrating that our mask strategy better learns hard-to-learn behaviors. Even after applying LDMS, we can still observe behaviors with high reconstruction loss (indicated by the red dashed arrow), which are likely noise behaviors; it is therefore necessary to assign small weights to these noise behaviors during anomaly detection, to avoid identifying normal sequences containing noise as abnormal.
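To make the contrast with top-$k$ loss masking concrete, the sketch below samples mask positions with probability proportional to each behavior's current reconstruction loss, so hard-to-learn behaviors are masked more often while easy ones are still masked occasionally. This follows the idea of LDMS; the exact normalization and update schedule in the paper may differ.

```python
import torch

def topk_loss_mask(step_losses, mask_ratio):
    """Deterministic baseline: always mask the highest-loss positions."""
    k = max(1, int(mask_ratio * step_losses.numel()))
    return step_losses.topk(k).indices

def loss_guided_mask(step_losses, mask_ratio):
    """Loss-guided sampling: mask probability proportional to loss."""
    k = max(1, int(mask_ratio * step_losses.numel()))
    probs = step_losses / step_losses.sum()            # normalized weights
    return torch.multinomial(probs, k, replacement=False)

losses = torch.tensor([0.1, 0.9, 0.2, 1.5, 0.3, 0.8])
print(topk_loss_mask(losses, 0.4))      # always the 2 highest-loss steps
print(loss_guided_mask(losses, 0.4))    # stochastic, loss-weighted choice
```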

Figure 14. Comparison of mask strategies: (a) detection performance across four anomaly types; (b) variance of reconstruction loss during training.
Figure 15. Loss distribution under different mask strategies.