Modeling and Discovering Occupancy Patterns in Sensor Networks using Latent Dirichlet Allocation

Federico Castanedo1, Hamid Aghajan2, and Richard Kleihorst3

1 Computer Science Department, University Carlos III of Madrid
[email protected]
2 Department of Electrical Engineering, Stanford University
[email protected]
3 Vito & Ghent University
[email protected]
Abstract. This paper presents a novel way to perform probabilistic modeling of occupancy patterns from a sensor network. The approach is based on the Latent Dirichlet Allocation (LDA) model. The application of the LDA model is shown using a real dataset of occupancy logs from the sensor network of a modern office building. LDA is a generative and unsupervised probabilistic model for collections of discrete data. Continuous sequences of binary sensor readings are segmented in order to build a discrete dataset (bags-of-words). These bags-of-words are then used to train the model with a fixed number of topics, also known as routines. Preliminary results show that the LDA model successfully finds latent topics over all rooms and therefore obtains the dominant occupancy patterns, or routines, in the sensor network.
Keywords: Probabilistic modeling, Sensor networks, Latent topics.
1 Introduction
The main objective of this paper is to show a way to perform probabilistic modeling of occupancy patterns from a sensor network without having any a priori or ground-truth information about the behaviors, thus following an unsupervised technique. More specifically, this paper presents an approach based on the Latent Dirichlet Allocation (LDA) model [1] for modeling and discovering occupancy patterns in an office environment using a sensor network. LDA is a generative and unsupervised machine learning probabilistic model for collections of discrete data. In a generative machine learning algorithm, the model adjusts its parameters so that it can reproduce the underlying data, i.e., the model fits the provided data. The LDA model is one of the hierarchical Bayesian text models that have been proposed in the research community. It overcomes some of the limitations that have been reported for probabilistic Latent Semantic Indexing (pLSI) [2], such as the
over-fitting problems. Since Blei's original paper [1], LDA has been successfully applied to text corpora [3] and to classification tasks on video sequences [4]. In this work, the LDA model is applied to sequences of occupancy logs provided by a sensor network in a modern office building. The main goal is to automatically discover long-term behavior patterns, or routines, in the logged data. One distinguishing aspect of this work, compared to other approaches, is that we only employ occupancy sensors that provide binary information indicating whether a room is occupied or not. Thus, the technique can be easily implemented and deployed, and it also respects privacy concerns. This paper covers the complete process in two phases. First, it is necessary to obtain a good representation of the long-term activity of a specific room. Second, in a learning phase, we answer the question of how to find the most common patterns given the training data. In the next section some related works are described. Then, Section 3 illustrates how to build the representation of the long-term activity in the sensor network. Section 4 focuses on how to generate the LDA model and reviews it. The results of some experiments are shown in Section 5. Finally, Section 6 concludes the paper.
2 Related works
With the recent advances in sensor networks it becomes easier to obtain and process the provided data. There is a large body of research in the computer vision community aiming to model the behavior of the people being monitored. However, that research is outside the scope of this work, since we are not dealing with visual sensors. Similar approaches in the literature covering occupancy data in an office environment can be found in the works employing the Mitsubishi Electric Research Lab (MERL) dataset [5]. The MERL dataset is a public dataset which provides around 30 million sensor readings from a network of 200 sensors in an office environment, recorded over about one year at the MERL building. In [6], the authors use the MERL dataset with the aim of modeling three different social features: (1) visiting another person, (2) attending meetings with another person and (3) traveling with another person. They employed information-theoretic measures (entropy) and graph cuts to obtain these patterns from the MERL dataset. To extract relationships from the occupancy data, they model pairwise statistics over the dataset. Their goal was to identify potentially important individuals within the organization. In [7], the authors reviewed the existing techniques for the discovery of temporal patterns in sensor data and proposed a modified T-Pattern algorithm [8]. This algorithm was tested on the MERL dataset and the presented results outperformed Lempel-Ziv compression-based methods. The work of [9] also introduces temporal information into the activity recognition problem, but using temporal evidence theory. Their framework was tested on a smart-home dataset.
An interesting article about consumer patterns is presented in [10]. The authors employed a wireless local area network (WLAN) to obtain the data from a shopping environment. Their results show how consumer behavior varies depending on the time of day. In their work, they divided consumer behavior into three segments: (1) 8:00 to 12:00, (2) 12:00 to 16:00 and (3) 16:00 to 20:00, and present the tracking results of each segment on the shop ground plane. Our work goes further, since it automatically discovers the occupancy patterns in the sensor data. Farrahi and Gatica-Perez [11] present an interesting work in the mobile domain, also using the LDA model to discover routines. The method they used for generating the bags-of-words before training the model inspired the approach presented in our work. Their results also support the advantage of using LDA for discovering routines in large amounts of data.
3 Building the representation of the long-term behavior model
The long-term-behavior model is constructed using terms which are commonly employed in the information retrieval community: words, documents, topics and corpora. In the context of this work, "words" refers to occupancy patterns or location sequences over some time interval. A "document" refers to a complete day of words (location sequences) for a specific room. The "corpora" is the full data obtained from the sensors, that is, the set of all documents. Finally, the "topics" are equivalent to the routines, and the main task is to automatically infer them. So, the aim is to learn both what the topics are and which documents employ them in which proportions. Each possible word w of the vocabulary v is generated by taking intervals of 30 minutes, each consisting of three sequences of 10 minutes of occupancy over each room. The whole weekday is divided into 9 day segments following this scheme:

1. from 00:00 to 06:00, interval 1.
2. from 06:00 to 07:00, interval 2.
3. from 07:00 to 09:00, interval 3.
4. from 09:00 to 11:00, interval 4.
5. from 11:00 to 14:00, interval 5.
6. from 14:00 to 17:00, interval 6.
7. from 17:00 to 19:00, interval 7.
8. from 19:00 to 21:00, interval 8.
9. from 21:00 to 00:00, interval 9.
So, the word-generation approach of the vocabulary follows this structure: a combination of 3 sequences of I (someone in the room) or O (empty room) symbols, plus the day-segment number (one of the previous 9), plus the room number. For instance, the word OOO1120 refers to 30 minutes of no occupancy at time interval 1 (from 00:00 to 6:00) for room number 120; and the word IOI3019 refers to an occupancy pattern of room 19 at time interval 3 (from 7:00 to 9:00) in which someone is in the room for 10 minutes, leaves for ten minutes and comes back inside for at least another 10 minutes. In the dataset employed for the experiments the vocabulary is composed of 9720 words, since there are 2^3 = 8 combinations times 9 segments times 135 rooms.
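To make the encoding concrete, the following minimal Python sketch (our own illustration, not the authors' code; the helper names are hypothetical) builds vocabulary words according to the scheme above:

# Day segments as (start_hour, end_hour, segment_id); see the list in Section 3.
SEGMENTS = [(0, 6, 1), (6, 7, 2), (7, 9, 3), (9, 11, 4), (11, 14, 5),
            (14, 17, 6), (17, 19, 7), (19, 21, 8), (21, 24, 9)]

def segment_of(hour):
    """Return the day-segment id (1-9) for a given hour of the day."""
    for start, end, seg in SEGMENTS:
        if start <= hour < end:
            return seg
    raise ValueError(f"invalid hour: {hour}")

def make_word(readings, hour, room):
    """Encode three 10-minute binary occupancy readings as a vocabulary word,
    e.g. (0, 0, 0) at 01:00 in room 120 -> 'OOO1120'."""
    symbols = "".join("I" if r else "O" for r in readings)
    return f"{symbols}{segment_of(hour)}{room:03d}"

print(make_word((0, 0, 0), hour=1, room=120))  # OOO1120
print(make_word((1, 0, 1), hour=8, room=19))   # IOI3019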
Fig. 1. Graphical representation of the LDA model: each node represents a random variable, the edges denote possible dependences between them, and the plates denote replicated structure. The LDA model only observes the words (shaded w random variable) of each document.
4 Generating the LDA model
In LDA, the word is the basic unit of discrete data and is a member of the vocabulary. More formally, in LDA each document is represented as a distribution over topics and each topic is defined as a distribution over words; see figure 1 for a graphical representation of the LDA model, which shows its conditional independence assumptions. LDA represents documents as random mixtures over latent topics, where each topic is characterized by a distribution over words, and assumes the following generative process for each document in a corpus:

1. Choose $N \sim \mathrm{Poisson}(\xi)$.
2. Choose $\theta \sim \mathrm{Dirichlet}(\alpha)$.
3. For each of the $N$ words $w_n$:
   (a) Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   (b) Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

The dimensionality $K$ of the Dirichlet distribution (and thus the dimensionality of the topic variable $z$) is assumed known and fixed. Also, the word probabilities are parametrized by a $K \times V$ matrix $\beta$, where $\beta_{ij} = p(w^j = 1 \mid z^i = 1)$, which is a quantity to be estimated.
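The generative process can be written down directly. The following sketch samples a synthetic document from the model; the values of xi, alpha and the topic-word prior are assumptions for illustration, not values from the paper:

import numpy as np

rng = np.random.default_rng(0)
V, K = 9720, 90                    # vocabulary size and number of topics
xi, alpha, eta = 40.0, 0.1, 0.01   # Poisson and Dirichlet priors (assumed)
beta = rng.dirichlet(np.full(V, eta), size=K)   # K x V topic-word matrix

def generate_document():
    """Sample one document following the generative process above."""
    n_words = rng.poisson(xi)                  # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(np.full(K, alpha))   # 2. theta ~ Dirichlet(alpha)
    words = []
    for _ in range(n_words):
        z = rng.choice(K, p=theta)             # 3a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])           # 3b. w_n ~ p(w | z_n, beta)
        words.append(w)
    return words

print(generate_document()[:10])   # first word indices of a sampled document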
A Dirichlet distribution is a distribution over the K-dimensional probability simplex (i.e., positive vectors that sum to one):

$$\Delta_K = \left\{ (\theta_1, \ldots, \theta_K) : \theta_k > 0, \; \sum_{i=1}^{K} \theta_i = 1 \right\} \qquad (1)$$

The vector $(\theta_1, \ldots, \theta_K)$ is said to be Dirichlet distributed,

$$(\theta_1, \ldots, \theta_K) \sim \mathrm{Dirichlet}(\alpha_1, \ldots, \alpha_K) \qquad (2)$$

with parameters $(\alpha_1, \ldots, \alpha_K)$, if it has the following probability density:

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_K^{\alpha_K - 1} \qquad (3)$$

where the parameter $\alpha$ is a $K$-vector with components $\alpha_i > 0$ and $\Gamma(x)$ is the Gamma function. Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of $N$ topics $z$, and a set of $N$ words $w$ is given by:

$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \qquad (4)$$
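As a quick numerical sanity check of equation (3), the density can be evaluated by hand and compared with SciPy's implementation (an illustrative check with made-up values, not part of the original pipeline):

import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])     # example Dirichlet parameters
theta = np.array([0.2, 0.3, 0.5])     # a point on the probability simplex

# Equation (3): Gamma(sum(alpha)) / prod(Gamma(alpha)) * prod(theta^(alpha-1))
manual = gamma(alpha.sum()) / gamma(alpha).prod() * np.prod(theta ** (alpha - 1))
print(np.isclose(manual, dirichlet.pdf(theta, alpha)))   # True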
In order to obtain the parameters of equation (4), it is necessary to employ approximate inference techniques, since the distribution is intractable to compute exactly due to the coupling between $\theta$ and $\beta$. Different approximate inference algorithms can be employed, such as Gibbs sampling, Laplace approximation, variational approximation and Markov Chain Monte Carlo [12]. For a good comparison of the different inference algorithms, see [13]. We followed the approach detailed in [1] and used a variational Expectation-Maximization inference algorithm to find the parameters $\alpha$, $\theta$ and $\beta$. These parameters provide information about the underlying data (corpus). The parameter $\alpha$ indicates how semantically diverse documents are (in our case, daily occupancy patterns of each room), with lower $\alpha$ values indicating increased diversity. The parameters $\beta_{1:K}$ provide information about how similar the different topics are, since they give the log-likelihood of $p(\mathrm{word} \mid \mathrm{topic})$ for each one of the $K$ topics. Once the LDA model is trained, it is possible to infer the multinomial distributions $\theta$, which provide the probability of each topic in a new document. These distributions can then be used for classifying new documents or measuring similarities between documents.
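As an illustration of this training step, the sketch below uses scikit-learn's LatentDirichletAllocation (a variational Bayes implementation standing in for the variational EM code of [1]). The count matrix X and the vocabulary are random placeholders here; in the paper they come from the bag-of-words construction of Section 3, with K = 90 topics:

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
vocabulary = [f"w{i}" for i in range(100)]          # placeholder vocabulary
X = rng.poisson(1.0, size=(50, len(vocabulary)))    # placeholder word counts

lda = LatentDirichletAllocation(n_components=10, learning_method="batch",
                                max_iter=20, random_state=0)
doc_topics = lda.fit_transform(X)   # theta: per-document topic proportions

# Print the 5 most probable words of each topic, as reported in Table 1.
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-5:][::-1]
    print(f"Topic {k}:", [vocabulary[i] for i in top])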
5 Experimental Results
In this section, we present preliminary results of the model described above. The data for our experiment was obtained by monitoring the Innotek building in Geel,
Belgium. This three-floor building hosts start-up companies and contains offices with different uses. In the building we find Innotek's own administrative offices, such as the reception, the secretarial office, the management office, the building maintenance office, the canteen and restroom facilities. Also, several rooms are rented out to start-up companies, usually 1 to 4 rooms per company. These rooms mainly function as office space to meet prospective customers or guests, so they show a different character of use compared to regularly occupied offices. The top floor of the Innotek building houses some laboratories where people have to be present to follow up experiments. These spaces also show occupancy at night, during weekends and on holidays. The occupancy data is captured by centrally mounted, high-quality PIR-based sensors that are part of the Philips automatic lighting system. Detection of motion triggers the lights, which remain switched on for at least 10 minutes. If no motion is detected in this period, the lights turn off again. Innotek has arranged for these sensors to be connected to the bus of their Johnson Controls heating management system, which enables their use for room temperature control as well. The climate management system sets a room to "hibernation mode" if no occupancy has been detected for a certain time interval. Once occupancy is detected, the room climate goes to the comfort level as set locally per room by the inhabitants. New office buildings in Flanders will have to incorporate smart building controls like this in the very near future. As the Johnson Controls sensor bus can be read out at a central point, we have been able to log the data of most rooms of the building on a timescale accurate to one minute. The events are based on integration of detections over the 1-minute interval and indicate occupancy with a high level of confidence. The plans of the three floors of the Innotek building are shown in figure 2. We worked on the logs of the Innotek dataset for the period from March 12, 2010 to August 28, 2010. The log file follows this structure:

1. 0 01 0 10 3 12 6 13 37
2. 0 01 1 10 3 12 6 14 33
3. 0 01 0 10 3 12 6 14 48
4. 0 01 1 10 3 12 6 15 5
5. ...
The first line means that sensor 1 of floor 0 reports no occupancy (0 in the third column) on 12 March 2010, a day of type 6 (meaning it was a Friday), at time 13:37. In the second line, the same sensor reports occupancy (value 1 in the third column) at time 14:33; line three shows that the room was empty again at 14:48, and so on. With this information (only binary occupancy at a 1-minute frequency) we construct the documents, one document for each room and day, which are sequences of words over 30-minute intervals. The frequency patterns can be seen more clearly in a video, which plots the obtained frequencies on the building ground plane, posted online at http://fcastanedo.com/wp-content/uploads/2010/11/foo-candidate-30min.avi.
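A minimal parser for this log layout could look as follows (a sketch inferred from the field description above; the function name and the year/month/day field order are our own reading of the format):

from collections import defaultdict

def parse_log(path):
    """Map (floor, sensor, year, month, day) -> list of (hour, minute, occupied)."""
    events = defaultdict(list)
    with open(path) as f:
        for line in f:
            floor, sensor, value, year, month, day, day_type, hour, minute = \
                (int(tok) for tok in line.split())
            events[(floor, sensor, year, month, day)].append(
                (hour, minute, bool(value)))
    return events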
A document is formed using the frequencies of the words which compose it, following a scheme similar to tf-idf [14]. Therefore, the original data is transformed into another representation which follows this structure:

1. 9 0:34 8:4 16:10 24:10 32:16 40:16 48:10 56:10 64:16
2. 23 0:34 8:4 16:2 17:2 18:1 19:1 20:1 23:3 31:10 39:16 40:7 41:1 43:1 44:1 46:1 47:5 50:1 52:1 53:1 54:1 55:6 56:10 64:16
3. ...

Each line is called a document, where the first value (M) gives the number of different words in the document, followed by M pairs I:J, where I indicates the index of the word in the vocabulary V and J the frequency count of that word in the document. Remember that in our context a document represents the occupancy pattern of a room over one day. In these experiments we skip the weekend days (Saturday and Sunday) since, as can be seen in figure 3, they do not provide meaningful information. Also, we are more interested in the weekday patterns due to the underlying nature of the data. The words generated from segment 1 (***1***) and from segment 9 (***9***) are not taken into account in these experiments, since there are many of them and we believe they do not provide useful information. Such words are known in the natural language processing domain as stop-words. More specifically, the dataset used for training the model is composed of 22 weeks and 5300 documents, with 7840 different terms and 66577 total words over all the documents. Table 1 shows some of the 90 obtained topics with the 5 most probable words of each topic. Topics 7, 10, 15, 24, 52, 53, 74 and 76 provide a similar occupancy pattern for rooms 70, 103, 69, 8, 9, 96, 7 and 11, respectively. For instance, in rooms 7, 8 and 9, which correspond to rooms 0.08, 0.09 and 0.10 of the first floor, the words III6 and OOO8 are the most probable ones (with a combined probability above 0.3), meaning that these rooms are occupied between 14:00 and 17:00 and empty between 19:00 and 21:00. Topic 36 indicates that when room 7 is empty from 19:00 to 21:00, room 54 is also empty from 6:00 to 7:00. Finally, topic 11 shows a correlation in which room 23 is occupied from 9:00 to 11:00 and empty from 17:00 to 21:00 (OOO7023 and OOO8023).
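For reference, a sparse document line in the format above can be produced from a list of word indices with a few lines of Python (our own sketch; the layout matches the document format of Blei's lda-c implementation):

from collections import Counter

def to_sparse_line(word_indices):
    """E.g. [0]*34 + [8]*4 -> '2 0:34 8:4'."""
    counts = Counter(word_indices)
    pairs = " ".join(f"{i}:{c}" for i, c in sorted(counts.items()))
    return f"{len(counts)} {pairs}"

print(to_sparse_line([0]*34 + [8]*4))   # 2 0:34 8:4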
6 Conclusions and future work
The proposed method for modeling the occupancy behaviour of an office building is based on the "bag-of-words" assumption, so each word in a document follows the exchangeability assumption. The exchangeability assumption means that the words are conditionally independent and identically distributed with respect to a latent parameter (the topics). A probabilistic model is powerful enough to handle the uncertainty of the underlying data. The main advantage of the LDA model over other techniques is that it allows obtaining a probability distribution over the words of each topic, following an unsupervised scheme.
Table 1. Some obtained topics and the 5 most probable words for each topic, with their p(w|topic) (k = 90)

Topic 7:  III6070 (0.210557), OOO8070 (0.123193), IIO7070 (0.011962), OII3070 (0.011367), IOO7070 (0.004305)
Topic 10: III6103 (0.211401), OOO8103 (0.131730), IOO7103 (0.022237), IIO7103 (0.013220), III7103 (0.013220)
Topic 15: III6069 (0.178530), OOO8069 (0.132126), OII3069 (0.021666), IOO7069 (0.021666), IIO7069 (0.013787)
Topic 24: III6008 (0.199488), OOO8008 (0.128993), IIO7008 (0.014108), III8008 (0.012204), IOI7008 (0.011843)
Topic 52: III6009 (0.219808), OOO8009 (0.118008), IOO7009 (0.013609), IIO7009 (0.005343), IIO3009 (0.003309)
Topic 53: III6096 (0.176400), OOO8096 (0.154432), IIO7096 (0.148823), OII3096 (0.030838), IOO7096 (0.027737)
Topic 74: III6007 (0.204948), OOO8007 (0.131598), IIO7007 (0.016975), III8007 (0.015717), IIO8007 (0.015717)
Topic 76: III6011 (0.213525), OOO8011 (0.101672), III7011 (0.013434), OOI8011 (0.007506), III8011 (0.006711)
Topic 36: OOO8007 (0.209088), OOO2054 (0.107636), III5069 (0.042809), OOO5102 (0.012842), III6069 (0.004283)
Topic 11: III4023 (0.231541), OOO7023 (0.193013), OOO8023 (0.017457), IOO6023 (0.016907), IOO7023 (0.012438)

Fig. 2. Plan of the three-floor Innotek building.
The obtained results show that the LDA model is powerful enough to detect complex patterns over a long-term dataset. In the future we would like to obtain more relevant results for the building occupancy environment. To stimulate similar work using this dataset, we plan to provide the data and the developed vocabulary to researchers interested in running further analyses.
Acknowledgments. The first author would like to thank the 2010 UC3M grant Ayudas de movilidad del programa propio de investigación and projects CICYT TIN2008-06742-C02-02/TSI, CICYT TEC2008-06732-C02-02/TEC and CAM CONTEXTS 2009/TIC1485 for partially supporting this research. The authors want to thank Erik Degroof and Luc Peeters of Innotek Belgium for their cooperation and the use of their dataset.
Fig. 3. Overview of the occupancy frequencies for all the rooms over the whole dataset. Each cell corresponds to a 10-minute interval. Rows indicate specific rooms and columns the time evolution.
References

1. D. Blei, A. Ng, and M. Jordan, "Latent Dirichlet allocation," The Journal of Machine Learning Research, vol. 3, pp. 993–1022, 2003.
2. T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 1999, pp. 50–57.
3. T. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp. 5228–5235, 2004.
4. J. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision, vol. 79, no. 3, pp. 299–318, 2008.
5. C. R. Wren, Y. A. Ivanov, D. Leigh, and J. Westhues, "The MERL motion detector dataset," in Proceedings of the 2007 Workshop on Massive Datasets, ser. MD '07. New York, NY, USA: ACM, 2007, pp. 10–14. [Online]. Available: http://doi.acm.org/10.1145/1352922.1352926
6. C. Connolly, J. Burns, and H. Bui, "Recovering social networks from massive track datasets," in IEEE Workshop on Applications of Computer Vision, WACV 2008. IEEE, 2008, pp. 1–8.
7. A. Salah, E. Pauwels, R. Tavenard, and T. Gevers, "T-Patterns revisited: Mining for temporal patterns in sensor data," Sensors, vol. 10, no. 8, pp. 7496–7513, 2010.
8. M. Magnusson, "Discovering hidden time patterns in behavior: T-patterns and their detection," Behavior Research Methods, Instruments, and Computers, vol. 32, no. 1, pp. 93–110, 2000.
9. S. McKeever, J. Ye, L. Coyle, C. Bleakley, and S. Dobson, "Activity recognition using temporal evidence theory," Journal of Ambient Intelligence and Smart Environments, vol. 2, no. 3, pp. 253–269, 2010.
10. P. Skogster, V. Uotila, and L. Ojala, "From mornings to evenings: is there variation in shopping behaviour between different hours of the day?" International Journal of Consumer Studies, vol. 32, no. 1, pp. 65–74, 2008.
11. K. Farrahi and D. Gatica-Perez, "Discovering routines from large-scale human locations using probabilistic topic models," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 1, 2011.
12. M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul, "An introduction to variational methods for graphical models," Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
13. A. Asuncion, M. Welling, P. Smyth, and Y. W. Teh, "On smoothing and inference for topic models," in Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, 2009.
14. G. Salton and M. McGill, Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.