{"title": "Infinite Hidden Semi-Markov Modulated Interaction Point Process", "book": "Advances in Neural Information Processing Systems", "page_first": 3900, "page_last": 3908, "abstract": "The correlation between events is ubiquitous and important for temporal events modelling. In many cases, the correlation exists between not only events' emitted observations, but also their arrival times. State space models (e.g., hidden Markov model) and stochastic interaction point process models (e.g., Hawkes process) have been studied extensively yet separately for the two types of correlations in the past. In this paper, we propose a Bayesian nonparametric approach that considers both types of correlations via unifying and generalizing hidden semi-Markov model and interaction point process model. The proposed approach can simultaneously model both the observations and arrival times of temporal events, and determine the number of latent states from data. A Metropolis-within-particle-Gibbs sampler with ancestor resampling is developed for efficient posterior inference. The approach is tested on both synthetic and real-world data with promising outcomes.", "full_text": "Infinite Hidden Semi-Markov Modulated Interaction Point Process\n\nPeng Lin§†, Bang Zhang§, Ting Guo§, Yang Wang§, Fang Chen§\n\n§Data61 CSIRO, Australian Technology Park, 13 Garden Street, Eveleigh NSW 2015, Australia\n†School of Computer Science and Engineering, The University of New South Wales, Australia\n\n{peng.lin, bang.zhang, ting.guo, yang.wang, fang.chen}@data61.csiro.au\n\nAbstract\n\nThe correlation between events is ubiquitous and important for temporal events modelling. In many cases, the correlation exists between not only events' emitted observations, but also their arrival times. 
State space models (e.g., hidden Markov model) and stochastic interaction point process models (e.g., Hawkes process) have been studied extensively yet separately for the two types of correlations in the past. In this paper, we propose a Bayesian nonparametric approach that considers both types of correlations via unifying and generalizing the hidden semi-Markov model and interaction point process model. The proposed approach can simultaneously model both the observations and arrival times of temporal events, and automatically determine the number of latent states from data. A Metropolis-within-particle-Gibbs sampler with ancestor resampling is developed for efficient posterior inference. The approach is tested on both synthetic and real-world data with promising outcomes.\n\n1 Introduction\n\nTemporal events modelling is a classic machine learning problem that has drawn enormous research attention for decades. It has wide applications in many areas, such as financial modelling, social events analysis, and seismological and epidemiological forecasting. An event is often associated with an arrival time and an observation, e.g., a scalar or vector. For example, a trading event in a financial market has a trading time and a trading price. A message in a social network has a posting time and a sequence of words. A main task of temporal events modelling is to capture the underlying events correlation and use it to make predictions for future events' observations and/or arrival times.\n\nThe correlation between events' observations can be readily found in many real-world cases in which an event's observation is influenced by its predecessors' observations. For example, the price of a trading event is impacted by former trading prices. The content of a new social message is affected by the contents of the previous messages. 
State space model (SSM), e.g., the hidden Markov model (HMM) [16], is one of the most prevalent frameworks that consider such correlation. It models the correlation via latent state dependency. Each event in the HMM is associated with a latent state that can emit an observation. A latent state is independent of all but the most recent state, i.e., Markovianity. Hence, a future event observation can be predicted based on the observed events and the inferred mechanisms of emission and transition.\n\nDespite its popularity, the HMM lacks the flexibility to model event arrival time. It only allows fixed inter-arrival times. The duration of a type of state follows a geometric distribution with its self-transition probability as the parameter, due to the strict Markovian constraint. The hidden semi-Markov model (HSMM) [14, 21] was developed to allow non-geometric state durations. It is an extension of the HMM that allows the underlying state transition process to be a semi-Markov chain with a variable duration time for each state.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nIn addition to the HMM components, the HSMM models the duration of a state as a random variable, and a state can emit a sequence of observations.\n\nThe HSMM allows the flexibility of variable inter-arrival times, but it does not consider events' correlation on arrival times. In many real-world applications, one event can trigger the occurrences of others in the near future. For instance, earthquakes and epidemics are diffusible events, i.e., one can cause the occurrences of others. Trading events in financial markets arrive in clusters. Information propagation in social networks shows contagious and clustering characteristics. All these events exhibit interaction characteristics in terms of arrival times. The likelihood of an event's arrival time is affected by the previous events' arrival times. 
Stochastic interaction point process (IPP), e.g., the Hawkes process [6], is a widely adopted framework for capturing such arrival time correlation. It models the correlation via a conditional intensity function that depicts the event intensity depending on all the previous events' arrival times. However, unlike the SSMs, it lacks the capability of modelling events' latent states and their interactions.\n\nIt is clearly desirable in real-world applications to have both arrival time correlation and observation correlation considered in a unified manner, so that we can estimate both when and how events will appear. Inspired by the merits of SSMs and IPPs, we propose a novel Bayesian nonparametric approach that unifies and generalizes SSMs and IPPs via a latent semi-Markov state chain with an infinitely countable number of states. The latent states govern both the observation emission and the new event triggering mechanism. An efficient sampling method is developed within the framework of particle Markov chain Monte Carlo (PMCMC) [1] for the posterior inference of the proposed model.\n\nWe first review closely related techniques in Section 2, and give the description of the proposed model in Section 3. Then Section 4 presents the inference algorithm. In Section 5, we show the results of the empirical studies on both synthetic and real-world data. Conclusions are drawn in Section 6.\n\n2 Preliminaries\n\nIn this section, we review the techniques that are closely related to the proposed method, namely the hidden (semi-)Markov model, its Bayesian nonparametric extension, and the Hawkes process.\n\n2.1 Hidden (Semi-)Markov Model\n\nThe HMM [16] is one of the most popular approaches for temporal event modelling. It utilizes a sequence of latent states with the Markovian property to model the dynamics of temporal events. 
Each event in the HMM is associated with a latent state that determines the event's observation via an emission probability distribution. The state of an event is independent of all but its most recent predecessor's state (i.e., Markovianity), following a transition probability distribution. The HMM consists of 4 components: (1) an initial state probability distribution, (2) a finite latent state space, (3) a state transition matrix, and (4) an emission probability distribution. As a result, the inference for the HMM involves inferring: (1) the initial state probability distribution, (2) the sequence of the latent states, (3) the state transition matrix, and (4) the emission probability distribution.\n\nThe HMM has proven to be an excellent general framework for modelling sequential data, but it has two significant drawbacks: (1) The durations of events (or the inter-arrival times between events) are fixed to a common value, and the state duration distributions are restricted to a geometric form. Such a setting lacks the flexibility needed for real-world applications. (2) The size of the latent state space in the HMM must be set a priori instead of being learned from data.\n\nThe hidden semi-Markov model (HSMM) [14, 21] is a popular extension to the HMM, which tries to mitigate the first drawback of the HMM. It allows latent states to have variable durations, thereby forming a semi-Markov chain. It reduces to the HMM when durations follow a geometric distribution. In addition to the 4 components of the HMM, the HSMM has a state duration probability distribution. As a result, the inference procedure for the HSMM also involves the inference of the duration probability distribution. 
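To make drawback (1) concrete, the following minimal Python sketch (ours, not the paper's code) contrasts the geometric duration distribution an HMM implies with an arbitrary duration pmf as used by an HSMM; the HSMM pmf values are purely illustrative:

```python
# In an HMM, a state with self-transition probability p is occupied for
# exactly d steps with probability p^(d-1) * (1 - p): a geometric law.

def hmm_duration_pmf(p: float, d: int) -> float:
    """P(duration = d) implied by self-transition probability p."""
    return p ** (d - 1) * (1.0 - p)

def expected_duration(p: float) -> float:
    """Mean of the geometric duration: 1 / (1 - p)."""
    return 1.0 / (1.0 - p)

# An HSMM instead draws the duration from an arbitrary pmf, e.g. this
# hypothetical distribution concentrated around 4 steps, which no choice
# of self-transition probability can reproduce:
hsmm_duration_pmf = {3: 0.2, 4: 0.6, 5: 0.2}
```

For instance, `expected_duration(0.8)` is 5.0, but the geometric pmf always puts its largest mass on d = 1, unlike the concentrated HSMM pmf above.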
It is worth noting that the interaction between events in terms of event arrival time is neglected by both the HMM and the HSMM.\n\n2.2 Hierarchical Dirichlet Process Prior for State Transition\n\nThe recent development in Bayesian nonparametrics helps address the second drawback of the HMM. Here, we briefly review the Hierarchical Dirichlet Process HMM (HDP-HMM). Let (Θ, B) be a measurable space and G0 be a probability measure on it. A Dirichlet process (DP) G is a distribution of a random probability measure over the measurable space (Θ, B). For any finite measurable partition (A1, ..., Ar) of Θ, the random vector (G(A1), ..., G(Ar)) follows a finite Dirichlet distribution parameterized by (α0 G0(A1), ..., α0 G0(Ar)), where α0 is a positive real number.\n\nThe HDP is defined based on the DP for modelling grouped data. It is a distribution over a collection of random probability measures over the measurable space (Θ, B). Each one of these random probability measures Gk is associated with a group. A global random probability measure G0, itself distributed as a DP with concentration parameter γ and base probability measure H, is used as a mean measure. Because the HMM can be treated as a set of mixture models in a dynamic manner, each of which corresponds to a value of the current state, the HDP becomes a natural choice as the prior over the state transitions [2, 18]. The generative HDP-HMM model can be summarized as:\n\nβ | γ ∼ GEM(γ), π_k | α0, β ∼ DP(α0, β), θ_k | λ, H ∼ H(λ), s_n | s_{n−1}, (π_k)_{k=1}^∞ ∼ π_{s_{n−1}}, y_n | s_n, (θ_k)_{k=1}^∞ ∼ F(θ_{s_n}). (1)\n\nGEM denotes the stick-breaking process. The variable sequence s_n indicates the latent state sequence, and y_n represents the observation. 
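The GEM stick-breaking prior used in the HDP-HMM above can be sketched in a few lines of Python (an illustrative sketch of the construction, not the paper's code; the function name is ours):

```python
import random

def gem_sticks(gamma: float, k: int, rng: random.Random) -> list:
    """First k weights of a GEM(gamma) draw: break a unit-length stick,
    taking a Beta(1, gamma) fraction of whatever remains at each step."""
    weights, remaining = [], 1.0
    for _ in range(k):
        frac = rng.betavariate(1.0, gamma)  # beta'_l ~ Beta(1, gamma)
        weights.append(frac * remaining)    # beta_l = beta'_l * prod_{j<l}(1 - beta'_j)
        remaining *= 1.0 - frac
    return weights
```

Smaller γ breaks off larger fractions early, concentrating mass on a few states; this is how the nonparametric prior keeps the effective number of states finite while leaving it unbounded in principle.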
The HDP acts as a prior over the infinite transition matrices. Each π_k is a draw from a DP; it depicts the transition distribution from state k. The probability measures from which the π_k's are drawn are parameterized by the same discrete base measure β. θ parameterizes the emission distribution F. Usually H is set to be conjugate to F to simplify inference. γ controls the base measure β's degree of concentration. α0 plays the role of governing the variability of the prior mean measure across the rows of the transition matrix.\n\nBecause the HDP prior doesn't distinguish self-transitions from transitions to other states, it is vulnerable to unnecessarily frequent switching of states and to redundant states. Thus, [5] proposed the sticky HDP-HMM, which includes a self-transition bias parameter in the state transition measure, π_k ∼ DP(α0 + κ, (α0 β + κ δ_k)/(α0 + κ)), where κ controls the stickiness of the transition matrix.\n\n2.3 Hawkes Process\n\nStochastic point process [3] is a rich family of models that are designed for tackling a variety of temporal event modelling problems. A stochastic point process can be defined via its conditional intensity function, which provides an equivalent representation as a counting process for temporal events. Given N(t) denoting the number of events occurred in the time interval [0, t) and τ_t indicating the arrival times of the temporal events before t, the intensity for a time point t conditioned on the arrival times of all the previous events is defined as:\n\nλ(t | τ_t) = lim_{Δt→0} E[N(t + Δt) − N(t) | τ_t] / Δt. (2)\n\nIt is worth noting that we do not consider edge effects in this paper, hence no events exist before time 0. A variety of point processes has been developed with distinct functional forms of intensity for various modelling purposes. 
Interaction point process (IPP) [4] considers point interactions with an intensity function dependent on historical events. The Hawkes process [7, 6] is one of the most popular and flexible IPPs. Its conditional intensity has the following functional form:\n\nλ(t) = μ(t) + Σ_{t_n < t} ψ_n(t − t_n). (3)\n\nFor n > 1, t_n is drawn from an inhomogeneous Poisson process whose conditional intensity function is defined as: μ + Σ_{i=1}^{n−1} ψ_{s_i}(t − t_i). As defined before, ψ_{s_i}(·) indicates the triggering kernel of a former point i whose latent state is s_i. The state s_n of the point is drawn following the guidance of the HDP prior as in the HDP-HMM. The emitted observation y_n is generated from the emission probability distribution F(·) parameterized by θ_{s_n}, which is determined by the state s_n.\n\n4 Posterior Inference for the iHSMM-IPP\n\nIn this section, we describe the inference method for the proposed iHSMM-IPP model. Despite its flexibility, the proposed iHSMM-IPP model faces three challenges for efficient posterior inference: (1) the strongly correlated nature of its temporal dynamics, (2) the non-Markovianity introduced by the event triggering mechanism, and (3) the infinite dimensional state transition. The traditional sampling methods for high dimensional probability distributions, e.g., MCMC and sequential Monte Carlo (SMC), are unreliable when highly correlated variables are updated independently, which can be the case for the iHSMM-IPP model. So we develop the inference algorithm within the framework of particle MCMC (PMCMC), a family of inferential methods recently developed in [1]. The key idea of PMCMC is to use SMC to construct a proposal kernel for an MCMC sampler. It not only improves over traditional MCMC methods but also makes Bayesian inference feasible for a large class of statistical models. 
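The state-modulated conditional intensity μ + Σ_{i<n} ψ_{s_i}(t − t_i) described in the model above can be evaluated directly from the event history; the sketch below is ours, and the exponential kernel form ψ_s(d) = a_s · exp(−b_s · d) is an assumption for illustration (the surviving text does not fix the kernel family):

```python
import math

def conditional_intensity(t, event_times, event_states, mu, kernels):
    """Evaluate mu + sum over past events of psi_{s_i}(t - t_i), where
    kernels[s] = (a_s, b_s) gives a hypothetical exponential kernel
    psi_s(d) = a_s * exp(-b_s * d) for latent state s."""
    lam = mu  # constant background rate (an assumption; the paper allows mu(t))
    for t_i, s_i in zip(event_times, event_states):
        if t_i < t:  # only strictly earlier events contribute
            a, b = kernels[s_i]
            lam += a * math.exp(-b * (t - t_i))
    return lam
```

With no past events the intensity reduces to the background rate, and each event's excitation decays back toward it, which is the clustering behaviour the text attributes to Hawkes-type processes.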
For tackling the non-Markovianity, the ancestor resampling scheme [13] is incorporated into our inference algorithm. Like existing forward-backward sampling methods, ancestor resampling uses backward sampling to improve the mixing of PMCMC. However, it achieves the same effect in a single forward sweep instead of using separate forward and backward sweeps. More importantly, it provides an effective way of sampling for non-Markovian SSMs.\n\nGiven a sequence of N events, {y_n, t_n}_{n=1}^N, the inference algorithm needs to sample the hidden state sequence {s_n}_{n=1}^N, the emission distribution parameters θ_{1:K}, the background event intensity μ, the triggering kernel parameters ψ_{1:K} (we omit ρ and use ψ_{1:K} instead of ψ_{ρ_{1:K}} for notation simplicity, as before), the transition matrix π_{1:K}, and the HDP parameters (α0, γ, κ, β). We use K to represent the number of active states and Ω to indicate the set of variables excluding the latent state sequence, i.e., Ω = {α0, β, γ, κ, μ, θ_{1:K}, ψ_{1:K}, π_{1:K}}. Only major variables are listed, and Ω may also include other variables, such as the probability of the initial latent state. At a high level, all the variables are updated iteratively using a particle Gibbs (PG) sampler. A conditional SMC is performed as a proposal kernel for updating the latent state sequence in each PG iteration. An ancestor resampling scheme is adopted in the conditional SMC for handling the non-Markovianity caused by the triggering mechanism. Metropolis sampling is used in each PG iteration to update the background event intensity μ and the triggering kernel parameters ψ_{1:K}. The remaining variables in Ω can readily be sampled by following the scheme in [5, 18]. The proposal distribution q_Ω(·) in the conditional SMC can be set by following [19]. 
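The conditional SMC steps that follow repeatedly draw indices from a categorical distribution over normalized particle weights (ancestor indices a ∼ Cat(· | W^{1:J})); a minimal inverse-cdf sketch of that primitive, with our own naming:

```python
import random

def sample_index(weights, rng: random.Random) -> int:
    """Draw j ~ Cat(W^1, ..., W^J) from normalized weights by cdf inversion."""
    u, cumulative = rng.random(), 0.0
    for j, w in enumerate(weights):
        cumulative += w
        if u <= cumulative:
            return j
    return len(weights) - 1  # guard against floating-point round-off
```

In a conditional SMC sweep, all particles except the one carrying the conditioned path would have their ancestors drawn this way.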
The PG sampler is given in the following:\n\nStep 1: Initialization: i = 0; set Ω(0), s_{1:N}(0), B_{1:N}(0).\nStep 2: For iteration i ≥ 1:\n(a) Sample Ω(i) ∼ p{· | y_{1:N}, t_{1:N}, s_{1:N}(i−1)}.\n(b) Run a conditional SMC algorithm targeting p_{Ω(i)}(s_{1:N} | y_{1:N}, t_{1:N}), conditional on s_{1:N}(i−1) and B_{1:N}(i−1).\n(c) Sample s_{1:N}(i) ∼ p̂_{Ω(i)}(· | y_{1:N}, t_{1:N}).\n\nWe use B_{1:N} to represent the ancestral lineage of the prespecified state path s_{1:N} and p̂_{Ω(i)}(· | y_{1:N}) to represent the particle approximation of p_{Ω(i)}(· | y_{1:N}). The details of the conditional SMC algorithm are given in the following. It is worth noting that the conditioned latent state path is only updated via the ancestor resampling.\n\nStep 1: Let s_{1:N} = {s_1^{B_1}, s_2^{B_2}, ..., s_N^{B_N}} denote the path that is associated with the ancestral lineage B_{1:N}.\nStep 2: For n = 1:\n(a) For j ≠ B_1, sample s_1^j ∼ q_Ω(· | y_1), j ∈ [1, ..., J]. (J denotes the number of particles.)\n(b) Compute weights w_1(s_1^j) = p(s_1^j) F(y_1 | s_1^j) / q_Ω(s_1^j | y_1) and normalize them: W_1^j = w_1(s_1^j) / Σ_{m=1}^J w_1(s_1^m). (We use p(s_1^j) to represent the probability of the initial latent state and q_Ω(s_1^j | y_1) to represent the proposal distribution conditional on the variable set Ω.)\nStep 3: For n = 2, ..., N:\n(a) For j ≠ B_n, sample the ancestor index of particle j: a_{n−1}^j ∼ Cat(· | W_{n−1}^{1:J}).\n(b) For j ≠ B_n, sample s_n^j ∼ q_Ω(· | y_n, s_{n−1}^{a_{n−1}^j}). 
If s_n^j = K + 1, then create a new state using the stick-breaking construction for the HDP:\n(i) Sample a new transition probability π_{K+1} ∼ Dir(α0 β).\n(ii) Use the stick-breaking construction to expand β ← [β, β_{K+1}]: β'_{K+1} ∼ Beta(1, γ), β_{K+1} = β'_{K+1} Π_{l=1}^K (1 − β'_l).\n(iii) Expand the transition probability vectors π_k to include transitions to state K + 1 via the HDP stick-breaking construction: π_k ← [π_{k,1}, ..., π_{k,K+1}], ∀k ∈ [1, K + 1], where π'_{k,K+1} ∼ Beta(α0 β_{K+1}, α0 (1 − Σ_{l=1}^{K+1} β_l)), π_{k,K+1} = π'_{k,K+1} Π_{l=1}^K (1 − π'_{k,l}).\n(iv) Sample parameters for a new emission probability and triggering kernel: θ_{K+1} ∼ H and ψ_{K+1} ∼ H'.\n(d) Perform ancestor resampling for the conditioned state path. Compute the ancestor weights w̃_{n−1|N}^{p,j} via Eq. 7 and Eq. 8 and resample a_n^{B_n} as p(a_n^{B_n} = j) ∝ w̃_{n−1|N}^{p,j}.\n(e) Compute and normalize the particle weights: w_n(s_n^j) = π(s_n^j | s_{n−1}^{a_{n−1}^j}) F(y_n | s_n^j) / q_Ω(s_n^j | s_{n−1}^{a_{n−1}^j}, y_n), W_n(s_n^j) = w_n(s_n^j) / Σ_{j=1}^J w_n(s_n^j).\n\n4.1 Metropolis Sampling for Background Intensity and Triggering Kernel\n\nFor the inference of the background intensity μ and the parameters of the triggering kernels ψ_k in step 2 (a) of the PG sampler, Metropolis sampling is used. 
As described in [3], the conditional likelihood of the occurrences of a sequence of events in an IPP can be expressed as:\n\nL ≜ p(t_{1:N} | μ, ψ_{1:K}) = (Π_{n=1}^N λ(t_n)) exp(−∫_0^T λ(t) dt).\n\nWe describe the Metropolis update for ψ_k; a similar update can be derived for μ. The normal distribution with the current value of ψ_k as mean is used as the proposal distribution. The proposed candidate ψ*_k will be accepted with the probability A(ψ*_k, ψ_k) = min(1, p̂(ψ*_k)/p̂(ψ_k)). The ratio can be computed as:\n\np̂(ψ*_k)/p̂(ψ_k) = p(ψ*_k)/p(ψ_k) · p(t_{1:N} | ψ*_k, rest)/p(t_{1:N} | ψ_k, rest) = p(ψ*_k)/p(ψ_k) · exp( Σ_{u∈[1,N]} (Ψ_{s_u}(T − t_u) − Ψ*_{s_u}(T − t_u)) ) · s_u(t_n − t_u)
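The random-walk Metropolis update described above (a Gaussian proposal centred at the current value, accepted with probability min(1, ratio)) can be sketched generically; this is our illustration, not the paper's code, and `log_post` is a placeholder standing in for the log of the unnormalized target p̂ (prior term plus point-process likelihood term):

```python
import math
import random

def metropolis_step(current: float, log_post, step: float,
                    rng: random.Random) -> float:
    """One random-walk Metropolis update with a normal proposal.

    Accepts the candidate with probability min(1, exp(log ratio)),
    working in log space for numerical stability."""
    candidate = rng.gauss(current, step)
    log_ratio = log_post(candidate) - log_post(current)
    if math.log(rng.random()) < min(0.0, log_ratio):
        return candidate  # accept the proposed value
    return current        # reject: keep the current value
```

In the sampler above, one such step per PG iteration would be applied to μ and to each kernel parameter ψ_k, with the acceptance ratio evaluated via the likelihood expression just given.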