This post introduces the Graph Markov Network (GMN), a structure for spatial-temporal data forecasting with missing values.

Traffic forecasting is a classical task in traffic management and plays an important role in intelligent transportation systems. However, since traffic data are mostly collected by traffic sensors or probe vehicles, sensor failures and the absence of probe vehicles inevitably result in missing values for some links in the collected raw data of the traffic network.

A traffic network normally consists of multiple roadway links. The traffic forecasting task aims to predict future traffic states of all (road) links or sensor stations in the traffic network based on historical traffic state data. The collected spatial-temporal traffic state data of a traffic network with $S$ links/sensor stations can be characterized as a $T$-step sequence $[x_1, x_2, ..., x_t, ..., x_T] \in \mathbb{R}^{T \times S}$, in which $x_t \in \mathbb{R}^{S}$ denotes the traffic states of all $S$ links at the $t$-th step. The traffic state of the $s$-th link at time $t$ is represented by $x_t^s$.

Here, the superscript of a traffic state represents the spatial dimension and the subscript denotes the temporal dimension.
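
As a quick illustration of this notation, here is a minimal numpy sketch with hypothetical sizes (the values are synthetic, not from the post):

```python
import numpy as np

# A T-step sequence of traffic states for S links, stored as a T x S array.
T, S = 12, 5                     # hypothetical sizes: 12 time steps, 5 links
x = np.random.rand(T, S)         # x[t] is the state vector x_t of all S links at step t

x_t = x[3]                       # traffic states of all links at step t = 3
x_t_s = x[3, 1]                  # x_t^s: state of the s-th link (s = 1) at step t = 3
```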

Problem Definition

Traffic Forecasting

The short-term traffic forecasting problem can be formulated as follows: based on $T$-step historical traffic state data, learn a function $F(\cdot)$ to generate the traffic state at the next time step (single-step forecasting):

$$F([x_1, x_2, ..., x_T]) = [x_{T+1}]$$

Problem definition 1

Learning Traffic Network as a Graph

Since the traffic network is composed of road links and intersections, it is intuitive to consider the traffic network as an undirected graph consisting of vertices and edges. The graph can be denoted as $\mathcal{G}=(\mathcal{V}, \mathcal{E}, \mathcal{A})$:

  • $\mathcal{V}$: a set of vertices $\mathcal{V}=\{v_1, ..., v_S\}$.
  • $\mathcal{E}$: a set of edges $\mathcal{E}$ connecting the vertices.
  • $\mathcal{A} \in \mathbb{R}^{S \times S}$: a symmetric (typically sparse) adjacency matrix with binary elements, in which $\mathcal{A}_{i,j}$ denotes the connectedness between vertices $v_i$ and $v_j$.
    • To characterize the vertices' self-connectedness, a self-connection adjacency matrix is denoted as $\mathbf{A} = \mathcal{A} + I$, i.e. $\mathbf{A}_{i,i}=1$, implying each vertex in the graph is self-connected. Here, $I \in \mathbb{R}^{S \times S}$ is an identity matrix.
    • Based on $\mathcal{A}$, a graph degree matrix $\mathcal{D} \in \mathbb{R}^{S \times S}$ describing the number of edges attached to each vertex can be obtained by $\mathcal{D}_{i,i} = \sum_{j=1}^S \mathcal{A}_{i,j}$.
    • The connectedness of the graph vertices can also be encoded by the Laplacian matrix $\mathcal{L}$, which is essential for spectral graph analysis. The combinatorial Laplacian matrix is defined as $\mathcal{L} = \mathcal{D} - \mathcal{A}$, and the normalized definition is $\mathcal{L} = I - \mathcal{D}^{-1/2}\mathcal{A}\mathcal{D}^{-1/2}$. Since $\mathcal{L}$ is a symmetric positive semi-definite matrix, it can be diagonalized as $\mathcal{L}=U\Lambda U^T$ by its eigenvector matrix $U=[u_0, u_1, ..., u_{S-1}] \in \mathbb{R}^{S \times S}$, where $\Lambda = \operatorname{diag}(\lambda_0, \lambda_1, ..., \lambda_{S-1}) \in \mathbb{R}^{S \times S}$ is the corresponding diagonal eigenvalue matrix satisfying $\mathcal{L}u_i = \lambda_i u_i$. (A short numerical sketch of these matrices follows this list.)
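
Below is a minimal numpy sketch that builds the self-connected matrix $\mathbf{A}$, the degree matrix $\mathcal{D}$, and both Laplacians for a small hypothetical 4-vertex ring graph (the graph itself is illustrative, not from the original post):

```python
import numpy as np

# Hypothetical 4-vertex ring graph, given as a binary adjacency matrix.
A_cal = np.array([[0, 1, 0, 1],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)

I = np.eye(4)
A_self = A_cal + I                              # self-connection adjacency matrix A = A_cal + I
D = np.diag(A_cal.sum(axis=1))                  # degree matrix: D_ii = sum_j A_cal_ij
L_comb = D - A_cal                              # combinatorial Laplacian L = D - A_cal
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
L_norm = I - D_inv_sqrt @ A_cal @ D_inv_sqrt    # normalized Laplacian L = I - D^(-1/2) A_cal D^(-1/2)

# L is symmetric positive semi-definite, so it diagonalizes as L = U Lambda U^T.
eigvals, U = np.linalg.eigh(L_norm)             # Lambda = diag(eigvals); columns of U are the u_i
```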

Problem definition 2

Given the graph representation of the traffic network, the traffic forecasting problem can be reformulated as

$$F({\color{#C00000}{\mathcal{G}}}, [x_1, x_2, ..., x_T]) = [x_{T+1}]$$

Traffic Forecasting with Missing Values

Traffic state data can be collected by multiple types of traffic sensors or probe vehicles. When traffic sensors fail or no probe vehicles traverse a road link, the collected traffic state data may have missing values. We use a sequence of masking vectors $[m_1, m_2, ..., m_T] \in \mathbb{R}^{T \times S}$, where $m_t \in \mathbb{R}^{S}$, to indicate the positions of the missing values in the traffic state sequence $[x_1, x_2, ..., x_T]$. If $x_t^s$ is observed, $m_t^s = 1$; otherwise, $m_t^s = 0$.
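
As a small sketch of how the masking vectors can be derived in practice, assuming missing readings are stored as NaN in the raw data (this storage convention is an assumption, not stated in the post):

```python
import numpy as np

# Hypothetical raw data: 2 time steps, 3 links, with two missing readings (NaN).
x_raw = np.array([[0.52, np.nan, 0.61],
                  [np.nan, 0.48, 0.66]])

m = (~np.isnan(x_raw)).astype(float)    # m_t^s = 1 if x_t^s is observed, otherwise 0
x = np.nan_to_num(x_raw, nan=0.0)       # store missing entries as 0, so x ⊙ m = x
```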

Taking the missing values into consideration, we can formulate the traffic forecasting problem as follows

$$F({\color{#C00000}{\mathcal{G}}}, [x_1, x_2, ..., x_T], {\color{#0070C0}{[m_1, m_2, ..., m_T]}}) = [x_{T+1}]$$

Problem definition 3

Graph Markov Process

A traffic network is a dynamic system, and the states of all links keep varying as a result of vehicle movements in the system. We assume the transitions of traffic states have two properties:

Markov Property: The future state of the traffic network $x_{t+1}$ depends only upon the present state $x_t$, not on the sequence of states that preceded it. Taking $X_1, X_2, ..., X_{t+1}$ as random variables with the Markov property and $x_1, x_2, ..., x_{t+1}$ as the observed traffic states, the Markov process can be formulated in conditional probability form as

$$\Pr(X_{t+1} = x_{t+1} \mid X_1 = x_1, X_2 = x_2, ..., X_t = x_t) = \Pr(X_{t+1} = x_{t+1} \mid X_t = x_t)$$

where $\Pr(\cdot)$ denotes the probability. It can also be formulated in vector form as

$$x_{t+1} = {\color{#0070C0}{P_t}} x_t$$

where $P_t \in \mathbb{R}^{S \times S}$ is the transition matrix and $(P_t)_{i,j}$ measures how much influence $x_t^j$ has on forming the state $x_{t+1}^i$.

The time interval $\Delta t$ between two consecutive time steps of traffic states affects the transition process. Thus, a decay parameter $\gamma \in (0,1)$ is introduced to represent the temporal impact on the transition process:

$$x_{t+1} = \gamma {\color{#0070C0}{P_t}} x_t$$

Graph Localization Property: We assume the traffic state of a link $u$ is mostly influenced by itself and its neighboring links. This property can be formulated in conditional probability form as

$$\Pr(X_{t+1} = x_{t+1}^u \mid X_t = x_t) = \Pr(X_{t+1} = x_{t+1}^u \mid X_t = x_t^v, v \in \mathcal{N}(u))$$

where the superscripts $u$ and $v$ are the indices of graph links (road links), and $\mathcal{N}(u)$ denotes the set consisting of the one-hop neighboring links of link $u$ and link $u$ itself. Its vector form can easily be represented by

$$x_{t+1} = {\color{#C00000}{\mathbf{A}}} x_t$$

Graph Markov Process: By incorporating the Markov property and the graph localization property, the traffic state transition process is defined as a graph Markov process (GMP) formulated in the vector form as

$$x_{t+1} = \gamma ({\color{#C00000}{\mathbf{A}}} \odot {\color{#0070C0}{P_t}}) x_t$$

where $\odot$ is the Hadamard (element-wise) product operator, such that $(\mathbf{A} \odot P_t)_{ij} = \mathbf{A}_{ij} \times (P_t)_{ij}$.
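
Here is a minimal numpy sketch of one GMP step with hypothetical values: the Hadamard product with $\mathbf{A}$ zeroes out transition entries between non-adjacent links, so each $x_{t+1}^i$ is formed only from link $i$ itself and its one-hop neighbors.

```python
import numpy as np

S = 4
gamma = 0.9                                      # hypothetical decay parameter
A_self = np.array([[1, 1, 0, 1],
                   [1, 1, 1, 0],
                   [0, 1, 1, 1],
                   [1, 0, 1, 1]], dtype=float)   # self-connected adjacency matrix A
P_t = np.random.rand(S, S)                       # dense transition matrix P_t
x_t = np.random.rand(S)                          # current traffic state

H_t = A_self * P_t                               # A ⊙ P_t: keep only local transitions
x_next = gamma * H_t @ x_t                       # x_{t+1} = γ (A ⊙ P_t) x_t
```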

Handling Missing Values in Graph Markov Process

As we assume the traffic state transition process follows the graph Markov process, the future traffic state can be inferred from the present state. If there are missing values in the present state, we can infer the missing values from previous states.

A completed state is denoted by $\tilde{x}_t$, in which all missing values are filled based on historical data. The completed state consists of two parts, the observed state values and the inferred state values:

$$\tilde{x}_t = \underbrace{x_t \odot m_t}_{\text{observed values}} + \underbrace{\tilde{x}_t \odot (1-m_t)}_{\text{inferred missing values}}$$

where $\tilde{x}_t \odot (1-m_t)$ is the inferred part. Because $x_t \odot m_t = x_t$ (missing entries of $x_t$ are stored as zeros),

$$\tilde{x}_t = {\color{blue}{x_t + \tilde{x}_t \odot (1-m_t)}}$$
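
A small numerical check of this decomposition, using hypothetical values and assuming missing entries of $x_t$ are stored as zeros:

```python
import numpy as np

x_t       = np.array([0.5, 0.0, 0.7])   # observed state; link 1 is missing (stored as 0)
m_t       = np.array([1.0, 0.0, 1.0])   # mask: link 1 is unobserved
x_tilde_t = np.array([0.5, 0.6, 0.7])   # completed state with the missing value filled in

# x̃_t = x_t ⊙ m_t + x̃_t ⊙ (1 - m_t), and since x_t ⊙ m_t = x_t:
assert np.allclose(x_tilde_t, x_t * m_t + x_tilde_t * (1 - m_t))
assert np.allclose(x_tilde_t, x_t + x_tilde_t * (1 - m_t))
```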

Since the transition of completed states follows the graph Markov process, $\tilde{x}_{t+1} = \gamma (\mathbf{A} \odot P_t) \tilde{x}_t$. For simplicity, the graph Markov transition matrix is denoted by $H_t$, i.e. $H_t = \mathbf{A} \odot P_t$. Hence, the transition process of completed states can be represented as

$$\tilde{x}_{t+1} = \gamma H_t \tilde{x}_t$$

Applying the expansion of $\tilde{x}_t$, the transition process becomes

$$
\begin{aligned}
\tilde{x}_{t+1} &= \gamma H_t ({\color{blue}{x_t + {\color{red}{\tilde{x}_t}} \odot (1-m_t)}}) \quad \scriptstyle{\text{; iteratively apply } \tilde{x}_t \text{ to itself}} \\
&= \gamma H_t ({\color{blue}{x_t + {\color{red}{\gamma H_{t-1} (x_{t-1} + \tilde{x}_{t-1} \odot (1-m_{t-1}))}} \odot (1-m_t)}}) \\
&= \gamma H_t x_t + \gamma^2 H_t H_{t-1} (x_{t-1} \odot (1-m_t)) + \gamma^2 H_t H_{t-1} (\tilde{x}_{t-1} \odot (1-m_{t-1}) \odot (1-m_t))
\end{aligned}
$$

After iteratively applying $n$ steps of previous states, from $x_{t-(n-1)}$ to $x_t$, $\tilde{x}_{t+1}$ can be described as

$$
\begin{aligned}
\tilde{x}_{t+1} &= \gamma H_t x_t \\
&+ \gamma^2 H_t H_{t-1} (x_{t-1} \odot (1-m_t)) \\
&+ \gamma^3 H_t H_{t-1} H_{t-2} (x_{t-2} \odot (1-m_{t-1}) \odot (1-m_t)) + \cdots \\
&+ \gamma^{n} H_t \cdots H_{t-(n-1)} (x_{t-(n-1)} \odot (1-m_{t-(n-2)}) \odot \cdots \odot (1-m_t)) \\
&+ \gamma^{n} H_t \cdots H_{t-(n-1)} (\tilde{x}_{t-(n-1)} \odot (1-m_{t-(n-1)}) \odot \cdots \odot (1-m_t))
\end{aligned}
$$

The $n$ historical steps of states can be written in summation form as

$$
\begin{aligned}
\tilde{x}_{t+1} &= \sum_{i=0}^{n-1} \gamma^{i+1} \Big(\prod_{j=0}^{i} H_{t-j}\Big) \Big(x_{t-i} \odot \bigodot_{j=0}^{i-1} (1-m_{t-j})\Big) \\
&+ \gamma^{n} H_t \cdots H_{t-(n-1)} (\tilde{x}_{t-(n-1)} \odot (1-m_{t-(n-1)}) \odot \cdots \odot (1-m_t))
\end{aligned}
$$

where $\sum$, $\prod$, and $\bigodot$ are the summation, matrix product, and Hadamard product operators, respectively. For simplicity, we denote the term containing $\tilde{x}_{t-(n-1)}$ as $\mathcal{O}(\tilde{x}_{t-(n-1)})$, and the GMP of the completed states can be represented by

$$\tilde{x}_{t+1} = \sum_{i=0}^{n-1} \gamma^{i+1} \Big(\prod_{j=0}^{i} H_{t-j}\Big) \Big(x_{t-i} \odot \bigodot_{j=0}^{i-1} (1-m_{t-j})\Big) + \mathcal{O}(\tilde{x}_{t-(n-1)})$$

When $n$ is large enough, we consider $\mathcal{O}(\tilde{x}_{t-(n-1)})$ to be a negligible term, since its coefficient $\gamma^{n}$ decays toward zero for $\gamma \in (0,1)$.
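
The truncated GMP above can be sketched in a few lines of numpy, assuming the transition matrices $H_{t-j}$ are given and the $\mathcal{O}(\tilde{x}_{t-(n-1)})$ term is dropped (function and variable names are illustrative, not from the original code):

```python
import numpy as np

def gmp_infer_next(xs, ms, Hs, gamma):
    """Truncated graph Markov process: infer x_{t+1} from the last n states
    xs = [x_{t-n+1}, ..., x_t], their masks ms (same order, missing entries of
    xs stored as 0), and transition matrices Hs = [H_{t-n+1}, ..., H_t]."""
    n, S = len(xs), xs[-1].shape[0]
    x_next = np.zeros(S)
    for i in range(n):                            # the i-th term uses x_{t-i}
        H_prod = np.eye(S)
        for j in range(i + 1):                    # product H_t H_{t-1} ... H_{t-i}
            H_prod = H_prod @ Hs[n - 1 - j]
        miss = np.ones(S)
        for j in range(i):                        # Hadamard product of (1 - m_{t-j}), j = 0..i-1
            miss = miss * (1.0 - ms[n - 1 - j])
        x_next += gamma ** (i + 1) * (H_prod @ (xs[n - 1 - i] * miss))
    return x_next

# Hypothetical usage with n = 3 steps:
# x_hat = gmp_infer_next([x_tm2, x_tm1, x_t], [m_tm2, m_tm1, m_t], [H_tm2, H_tm1, H_t], gamma=0.9)
```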

The graph Markov process can be illustrated by the following figure. The gray-colored nodes in the left sub-figure denote the nodes with missing values. The vectors on the right side represent the traffic states. The traffic states at time $t$ are numbered to match the graph and the vector. The future states (in red) can be inferred from their neighbors in the previous steps.

GMP

Graph Markov Network

The graph Markov network (GMN) is designed based on the graph Markov process, which contains $n$ transition weight matrices. The product of these matrices, $\prod_{j=0}^{i} H_{t-j} = \prod_{j=0}^{i} (\mathbf{A} \odot P_{t-j})$, measures the contribution of $x_{t-i}$ to generating $\tilde{x}_{t+1}$. To reduce matrix product operations while keeping the learning capability of the GMP, $\prod_{j=0}^{i} (\mathbf{A} \odot P_{t-j})$ is simplified into $\mathbf{A}^{i+1} \odot W_{i+1}$, where $W_{i+1} \in \mathbb{R}^{S \times S}$ is a weight matrix. In this way, $\mathbf{A}^{i+1} \odot W_{i+1}$ directly measures the contribution of $x_{t-i}$ to generating $\tilde{x}_{t+1}$ and skips the intermediate state transition processes. Further, the GMP still has $n$ weight matrices ($\{\mathbf{A}^1 \odot W_1, ..., \mathbf{A}^n \odot W_n\}$), and thus the learning capability in terms of the number of parameters does not change. The benefit of this simplification is that the GMP avoids $\frac{n(n-1)}{2}$ multiplications between two $S \times S$ matrices in total. The $\mathcal{O}(\tilde{x}_{t-(n-1)})$ term is considered small enough and omitted for simplicity.

Based on the GMP and the aforementioned simplification, the Graph Markov Network for traffic forecasting with missing values can be defined as

$$\hat{x}_{t+1} = \sum_{i=0}^{n-1} \gamma^{i+1} (\mathbf{A}^{i+1} \odot W_{i+1}) \Big(x_{t-i} \odot \bigodot_{j=0}^{i-1} (1-m_{t-j})\Big)$$

where $\hat{x}_{t+1}$ is the predicted traffic state for the future time step $t+1$ and $\{W_1, ..., W_n\}$ are the model's weight matrices, which are learned and updated during the training process.
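
To make the formulation concrete, here is a minimal PyTorch sketch of a GMN forward pass implied by the equation above. It is an illustration under assumed tensor shapes and a toy adjacency matrix, not the authors' published code (see the Code section below for that):

```python
import torch
import torch.nn as nn

class GraphMarkovNetwork(nn.Module):
    """Sketch of the GMN: x̂_{t+1} = Σ_i γ^{i+1} (A^{i+1} ⊙ W_{i+1}) (x_{t-i} ⊙ ⨀_j (1 - m_{t-j}))."""

    def __init__(self, A, n_steps, gamma=0.9):
        super().__init__()
        S = A.shape[0]
        self.n = n_steps
        self.gamma = gamma
        # Precompute the fixed adjacency powers A^1, ..., A^n of the self-connected matrix A.
        powers = [torch.linalg.matrix_power(A, i + 1) for i in range(n_steps)]
        self.register_buffer('A_pows', torch.stack(powers))         # (n, S, S)
        self.W = nn.Parameter(torch.randn(n_steps, S, S) * 0.01)    # learnable W_1, ..., W_n

    def forward(self, x, m):
        # x, m: (batch, n, S), ordered from oldest x_{t-n+1} to newest x_t;
        # missing entries of x are stored as 0 and marked by m = 0.
        batch, n, S = x.shape
        out = torch.zeros(batch, S, device=x.device)
        for i in range(self.n):                        # i = 0 uses x_t, i = n-1 uses x_{t-n+1}
            x_i = x[:, n - 1 - i, :]
            miss = torch.ones(batch, S, device=x.device)
            for j in range(i):                         # ⨀_{j=0}^{i-1} (1 - m_{t-j})
                miss = miss * (1.0 - m[:, n - 1 - j, :])
            H_i = self.A_pows[i] * self.W[i]           # A^{i+1} ⊙ W_{i+1}
            out = out + self.gamma ** (i + 1) * (x_i * miss) @ H_i.T
        return out                                     # predicted x̂_{t+1}, shape (batch, S)

# Hypothetical usage: S = 10 links, n = 4 historical input steps.
adj = (torch.rand(10, 10) > 0.7).float()
A = torch.clamp(adj + adj.T + torch.eye(10), max=1.0)  # symmetric, self-connected toy adjacency
model = GraphMarkovNetwork(A, n_steps=4, gamma=0.9)
x_hist = torch.rand(2, 4, 10)                          # (batch, n, S) traffic states
m_hist = (torch.rand(2, 4, 10) > 0.2).float()          # masks: 1 = observed, 0 = missing
x_hat = model(x_hist * m_hist, m_hist)                 # predicted x_{t+1}, shape (2, 10)
```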

The structure of the graph Markov network is shown below. Here, $H_{t-j} = \mathbf{A}^j \odot W_j$.

GMN

Code

The code for implementing the GMN is published in this GitHub Repo.

Reference

For more details of the model and experimental results, please refer to

[1] Zhiyong Cui, et al. Graph markov network for traffic forecasting with missing data. Transportation Research Part C: Emerging Technologies, 2020. [arXiv]