LaneGCN Paper Notes

Learning Lane Graph Representations for Motion Forecasting

Paper: https://arxiv.org/abs/2007.13732
Slides (PPT): https://www.cs.toronto.edu/~byang/slides/LaneGCN.pdf

Lane Graph + Actor Map:

  • construct a lane graph from vectorized map data to preserve the map structure and avoid information loss

  • LaneGCN:

    • extends graph convolutions with multiple adjacency matrices and along-lane dilation

      • to capture the complex topology and long-range dependencies of the lane graph.
    • exploits a fusion network consisting of four types of interactions: actor-to-lane, lane-to-actor, actor-to-actor, and lane-to-lane.

      • represents both actors and lanes as nodes in the graph, uses a 1D CNN and LaneGCN to extract the features for the actor and lane nodes respectively, and then exploits spatial attention and another LaneGCN to model the four types of interactions.

NN Architecture

Difference between VectorNet and LaneGCN:

  • VectorNet uses vanilla graph networks with undirected full connections; LaneGCN uses a connected lane graph following the map topology and proposes task-specific multi-type and dilated graph operators.
  • VectorNet uses polyline-level nodes for interactions; LaneGCN uses polyline segments as map nodes to capture a higher resolution.

Model_Layout

ActorNet

Each trajectory is represented as a sequence of displacements $\{\Delta p_{-(T-1)}, \ldots, \Delta p_{-1}, \Delta p_0\}$, where $\Delta p_t$ is the 2D displacement from time step $t-1$ to $t$, and $T$ is the trajectory size.

For trajectories with sizes smaller than $T$, we pad them with zeros. We add a binary $1 \times T$ mask to indicate whether the element at each step is padded or not, and concatenate it with the trajectory tensor, resulting in an input tensor of size $3 \times T$.

A 1D CNN is used to process the trajectory input for its effectiveness in extracting multi-scale features and its efficiency in parallel computing. The output of ActorNet is a temporal feature map, whose element at $t = 0$ is used as the actor feature. The network has 3 groups/scales of 1D convolutions.

Each group consists of 2 residual blocks, with the stride of the first block as 2. We then use a Feature Pyramid Network (FPN) to fuse the multi-scale features, and apply another residual block to obtain the output tensor. For all layers, the convolution kernel size is 3 and the number of output channels is 128. Layer Normalization and the Rectified Linear Unit (ReLU) are used after each convolution.
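A minimal PyTorch sketch of this design (my own simplification: a single 128-channel width throughout, stride 1 in the first group, and GroupNorm standing in for the normalization; the official implementation differs in such details):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Res1d(nn.Module):
    """1D residual block: two kernel-3 convolutions plus a (possibly strided) skip."""
    def __init__(self, n_in, n_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv1d(n_in, n_out, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv1d(n_out, n_out, 3, padding=1)
        # GroupNorm(1, C) normalizes across channels, standing in for LayerNorm
        self.norm1 = nn.GroupNorm(1, n_out)
        self.norm2 = nn.GroupNorm(1, n_out)
        self.skip = (nn.Conv1d(n_in, n_out, 1, stride=stride)
                     if stride != 1 or n_in != n_out else nn.Identity())

    def forward(self, x):
        out = F.relu(self.norm1(self.conv1(x)))
        out = self.norm2(self.conv2(out))
        return F.relu(out + self.skip(x))

class ActorNet(nn.Module):
    """Three groups of 1D convolutions at scales 1, 1/2, 1/4, fused top-down (FPN)."""
    def __init__(self, n_in=3, n=128):
        super().__init__()
        self.groups = nn.ModuleList([
            nn.Sequential(Res1d(n_in, n, 1), Res1d(n, n)),
            nn.Sequential(Res1d(n, n, 2), Res1d(n, n)),
            nn.Sequential(Res1d(n, n, 2), Res1d(n, n)),
        ])
        self.laterals = nn.ModuleList([nn.Conv1d(n, n, 1) for _ in range(3)])
        self.output = Res1d(n, n)

    def forward(self, x):                          # x: (M, 3, T)
        feats = []
        for group in self.groups:
            x = group(x)
            feats.append(x)
        out = self.laterals[-1](feats[-1])         # top-down multi-scale fusion
        for i in range(len(feats) - 2, -1, -1):
            out = F.interpolate(out, scale_factor=2, mode='linear', align_corners=False)
            out = out + self.laterals[i](feats[i])
        out = self.output(out)                     # temporal feature map (M, 128, T)
        return out[:, :, -1]                       # feature at t = 0, i.e. the last step

actors = ActorNet()(torch.randn(13, 3, 20))        # -> (13, 128)
```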

MapNet

General Architecture:

  • part 1: building a lane graph from vectorized map data;
  • part 2: applying our novel LaneGCN to the lane graph to output the map features.

Map Data:

In this paper, we adopt a simple form of vectorized map data as our representation of HD maps. Specifically, the map data is represented as a set of lanes and their connectivity. Each lane contains a centerline, i.e., a sequence of 2D BEV points, which are arranged following the lane direction (see Fig. 3, top). For any two lanes which are directly reachable, 4 types of connections are given: predecessor, successor, left neighbour and right neighbour.

Lane Graph Construction:

We first define a lane node as the straight line segment formed by any two consecutive points (grey circles in Fig. 3) of a centerline. The location of a lane node is the averaged coordinates of its two end points. Following the connections between lane centerlines, we also derive 4 connectivity types for the lane nodes, i.e., predecessor, successor, left neighbour and right neighbour.

We denote the lane nodes with $V \in \mathbb{R}^{N \times 2}$, where $N$ is the number of lane nodes and the $i$-th row of $V$ is the BEV coordinates of the $i$-th node. We represent the connectivity with 4 adjacency matrices $\{A_i\}_{i \in \{\text{pre}, \text{suc}, \text{left}, \text{right}\}}$, with $A_i \in \mathbb{R}^{N \times N}$.

We denote $A_{i,j,k}$ as the element in the $j$-th row and $k$-th column of $A_i$. Then $A_{i,j,k} = 1$ if node $k$ is an $i$-type neighbour of node $j$.
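As a concrete toy illustration (the graph below is made up, not taken from a map), the adjacency matrices can be built from connectivity lists like so:

```python
import numpy as np

# Toy lane graph: 4 lane nodes along a single lane, 0 -> 1 -> 2 -> 3.
num_nodes = 4

def adjacency(edges, n):
    """A[j, k] = 1 iff node k is the given type of neighbour of node j."""
    A = np.zeros((n, n))
    for j, k in edges:
        A[j, k] = 1.0
    return A

A_suc = adjacency([(0, 1), (1, 2), (2, 3)], num_nodes)  # k follows j
A_pre = adjacency([(1, 0), (2, 1), (3, 2)], num_nodes)  # k precedes j

print(A_suc)  # identity shifted one step to the upper right (superdiagonal)
print(A_pre)  # identity shifted one step to the lower left (subdiagonal)
```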

LaneConv Operator:

Node Feature: Each lane node corresponds to a straight line segment of a centerline. To encode all the lane node information, we need to take into account both the shape (size and orientation) and the location (the coordinates of the center) of the corresponding line segment. We parameterize the node feature as follows,

$$x_i = \text{MLP}_{\text{shape}}(v_i^{\text{end}} - v_i^{\text{start}}) + \text{MLP}_{\text{loc}}(v_i) \tag{1}$$

where $\text{MLP}$ indicates a multi-layer perceptron and the two subscripts refer to shape and location, respectively. $v_i$ is the location of the $i$-th lane node, i.e., the center between the two end points; $v_i^{\text{start}}$ and $v_i^{\text{end}}$ are the BEV coordinates of node $i$'s starting and ending points; and $x_i$ is the $i$-th row of the node feature matrix $X$, denoting the input feature of the $i$-th lane node.
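A minimal sketch of this node-feature encoder; the MLP depths and widths are assumptions, since the text only fixes the inputs and the output feature:

```python
import torch
import torch.nn as nn

class LaneNodeFeature(nn.Module):
    """x_i = MLP_shape(v_end - v_start) + MLP_loc(v_center), Eq. (1)."""
    def __init__(self, n=128):
        super().__init__()
        self.mlp_shape = nn.Sequential(nn.Linear(2, n), nn.ReLU(), nn.Linear(n, n))
        self.mlp_loc = nn.Sequential(nn.Linear(2, n), nn.ReLU(), nn.Linear(n, n))

    def forward(self, v_start, v_end):            # (N, 2) BEV coordinates each
        v_center = 0.5 * (v_start + v_end)        # node location v_i
        return self.mlp_shape(v_end - v_start) + self.mlp_loc(v_center)  # X: (N, n)
```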

LaneConv: To aggregate the topology information of the lane graph at a larger scale, we design the following LaneConv operator:

$$Y = XW_0 + \sum_{i \in \{\text{pre}, \text{suc}, \text{left}, \text{right}\}} A_i X W_i, \tag{2}$$

where $A_i$ and $W_i$ are the adjacency and weight matrices corresponding to the $i$-th connection type, respectively. Since we order the lane nodes from the start to the end of the lane, $A_{\text{suc}}$ and $A_{\text{pre}}$ are matrices obtained by shifting the identity matrix (diagonal of 1s) one step towards the upper right (non-zero superdiagonal) and lower left (non-zero subdiagonal), respectively. $A_{\text{suc}}$ and $A_{\text{pre}}$ can propagate information from the forward and backward neighbours, whereas $A_{\text{left}}$ and $A_{\text{right}}$ allow information to flow from the cross-lane neighbours. It is not hard to see that LaneConv builds on top of the general graph convolution and encodes more geometric (e.g., connection type/direction) information. As shown in the paper's experiments, this improves over the vanilla graph convolution.
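Eq. (2) in dense PyTorch, as a sketch (the official implementation aggregates over sparse edge lists rather than dense $N \times N$ matrices):

```python
import torch
import torch.nn as nn

class LaneConv(nn.Module):
    """Y = X W_0 + sum_i A_i X W_i over the four connection types, Eq. (2)."""
    def __init__(self, n=128):
        super().__init__()
        self.w0 = nn.Linear(n, n, bias=False)
        self.w = nn.ModuleDict({k: nn.Linear(n, n, bias=False)
                                for k in ('pre', 'suc', 'left', 'right')})

    def forward(self, x, adj):                    # x: (N, n); adj: dict of (N, N) tensors
        y = self.w0(x)
        for k, A in adj.items():                  # propagate along each connection type
            y = y + A @ self.w[k](x)
        return y
```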

Dilated LaneConv:

Functionality: the model needs to capture long-range dependencies along the lane direction for accurate prediction.

The $k$-dilation LaneConv operator is defined as follows:

$$Y = XW_0 + A_{\text{pre}}^k X W_{\text{pre},k} + A_{\text{suc}}^k X W_{\text{suc},k} \tag{3}$$

where $A_{\text{pre}}^k$ is the $k$-th matrix power of $A_{\text{pre}}$. This allows us to directly propagate information along the lane for $k$ steps, with $k$ a hyperparameter. Since $A_{\text{pre}}^k$ is highly sparse, one can efficiently compute it using sparse matrix multiplication. Note that the dilated LaneConv is only used for predecessor and successor, as the long-range dependency is mostly along the lane direction.
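A quick numpy illustration of why the matrix power realizes $k$-step along-lane propagation (toy matrix, dense for readability):

```python
import numpy as np

A_suc = np.eye(6, k=1)                      # toy along-lane successor adjacency
A_suc_4 = np.linalg.matrix_power(A_suc, 4)  # 4-step reachability
print(np.nonzero(A_suc_4))                  # node j now "sees" the node 4 steps ahead
# In practice the adjacency is stored sparse (e.g. scipy.sparse.csr_matrix) and the
# powers are computed with sparse matrix multiplication.
```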

LaneGCN:

With Eq. (2) and Eq. (3), we get a multi-scale LaneConv operator with $C$ dilation sizes as follows:

$$Y = XW_0 + \sum_{i \in \{\text{left}, \text{right}\}} A_i X W_i + \sum_{c=1}^{C} \left( A_{\text{pre}}^{k_c} X W_{\text{pre},k_c} + A_{\text{suc}}^{k_c} X W_{\text{suc},k_c} \right), \tag{4}$$

where $k_c$ is the $c$-th dilation size. We denote this multi-scale layer as $\text{LaneConv}(k_1, \cdots, k_C)$.
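A sketch of one multi-scale layer; the dilation sizes shown are illustrative, and the full MapNet stacks several such layers with residual connections:

```python
import torch
import torch.nn as nn

class MultiScaleLaneConv(nn.Module):
    """LaneConv(k_1, ..., k_C): left/right at scale 1, pre/suc at C dilation sizes, Eq. (4)."""
    def __init__(self, n=128, dilations=(1, 2, 4, 8, 16, 32)):
        super().__init__()
        self.dilations = dilations
        self.w0 = nn.Linear(n, n, bias=False)
        self.w_side = nn.ModuleDict({k: nn.Linear(n, n, bias=False)
                                     for k in ('left', 'right')})
        self.w_pre = nn.ModuleList([nn.Linear(n, n, bias=False) for _ in dilations])
        self.w_suc = nn.ModuleList([nn.Linear(n, n, bias=False) for _ in dilations])

    def forward(self, x, adj):
        # adj['pre'][c] / adj['suc'][c] hold the precomputed k_c-th matrix powers
        y = self.w0(x)
        for k in ('left', 'right'):
            y = y + adj[k] @ self.w_side[k](x)
        for c in range(len(self.dilations)):
            y = y + adj['pre'][c] @ self.w_pre[c](x)
            y = y + adj['suc'][c] @ self.w_suc[c](x)
        return y
```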

Four types of fusion modules:

  • A2L: introduces real-time traffic information to lane nodes, such as blockage or usage of the lanes.
  • L2L: updates lane node features by propagating the traffic information over the lane graph. -> LaneGCN
  • L2A: fuses updated map features with real-time traffic information back to the actors.
  • A2A: handles the interactions between actors and produces the output actor features, which are then used by the prediction header for motion forecasting.

We implement L2L using another LaneGCN, which has the same architecture as the one used in MapNet (see Section 3.2 of the paper). In the following we describe the other three modules in detail. We exploit a spatial attention layer for A2L, L2A and A2A; the attention layer applies to each of the three modules in the same way. Taking A2L as an example, given an actor node $i$, we aggregate the features from its context lane nodes $j$ as follows:

$$y_i = x_i W_0 + \sum_j \phi\left(\text{concat}(x_i, \Delta_{i,j}, x_j) W_1\right) W_2, \tag{5}$$

with $x_i$ the feature of the $i$-th node, $W$ a weight matrix, $\phi$ the composition of layer normalization and ReLU, and $\Delta_{i,j} = \text{MLP}(v_j - v_i)$, where $v$ denotes the node location.
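A sketch of Eq. (5); `edges` is a hypothetical precomputed list of (i, j) context pairs, since the selection of context nodes (distance thresholds per module) is not covered here:

```python
import torch
import torch.nn as nn

class Att(nn.Module):
    """Spatial attention layer of Eq. (5); shared in form by A2L, L2A and A2A."""
    def __init__(self, n=128):
        super().__init__()
        self.dist = nn.Sequential(nn.Linear(2, n), nn.ReLU(), nn.Linear(n, n))  # Delta_ij
        self.w0 = nn.Linear(n, n, bias=False)
        self.w1 = nn.Linear(3 * n, n, bias=False)
        self.phi = nn.Sequential(nn.LayerNorm(n), nn.ReLU())  # layer norm + ReLU
        self.w2 = nn.Linear(n, n, bias=False)

    def forward(self, x_dst, v_dst, x_src, v_src, edges):
        """edges: (E, 2) long tensor of (i, j) pairs; for A2L, dst = lanes, src = actors."""
        i, j = edges[:, 0], edges[:, 1]
        delta = self.dist(v_src[j] - v_dst[i])                           # MLP(v_j - v_i)
        msg = self.w2(self.phi(self.w1(torch.cat([x_dst[i], delta, x_src[j]], dim=1))))
        agg = torch.zeros_like(x_dst).index_add(0, i, msg)               # sum over context j
        return self.w0(x_dst) + agg
```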

Taking the fused actor features as input, a multi-modal prediction header outputs the final motion forecasts. For each actor, it predicts $K$ possible future trajectories and their confidence scores.

The header has two branches, a regression branch to predict the trajectory of each mode and a classification branch to predict the confidence score of each mode.

For the $m$-th actor, we apply a residual block and a linear layer in the regression branch to regress the $K$ sequences of BEV coordinates:

$$O_{m,\text{reg}} = \{(p_{m,1}^k, p_{m,2}^k, \ldots, p_{m,T}^k)\}_{k \in [0, K-1]}$$

where $p_{m,i}^k$ is the predicted BEV coordinates of the $m$-th actor for the $k$-th mode at the $i$-th time step. For the classification branch, we apply an MLP to $p_{m,T}^k - p_{m,0}$ to get $K$ distance embeddings. We then concatenate each distance embedding with the actor feature, and apply a residual block and a linear layer to output $K$ confidence scores, $O_{m,\text{cls}} = (c_{m,0}, c_{m,1}, \ldots, c_{m,K-1})$.
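A sketch of the two-branch header; $K = 6$ and $T = 30$ match the Argoverse setting, and plain MLPs stand in for the residual-block-plus-linear heads described above:

```python
import torch
import torch.nn as nn

class PredHeader(nn.Module):
    """Multi-modal header: K trajectories and K confidence scores per actor."""
    def __init__(self, n=128, K=6, T=30):
        super().__init__()
        self.K, self.T = K, T
        self.reg = nn.ModuleList([                 # regression branch, one head per mode
            nn.Sequential(nn.Linear(n, n), nn.ReLU(), nn.Linear(n, 2 * T))
            for _ in range(K)
        ])
        self.dist_embed = nn.Sequential(nn.Linear(2, n), nn.ReLU())
        self.cls = nn.Sequential(nn.Linear(2 * n, n), nn.ReLU(), nn.Linear(n, 1))

    def forward(self, actors):                     # actors: (M, n)
        M = actors.shape[0]
        reg = torch.stack([h(actors).view(M, self.T, 2) for h in self.reg], dim=1)
        # p_{m,T}^k - p_{m,0}: in actor-centric coordinates p_{m,0} is the origin,
        # so the predicted endpoint itself gives the distance embedding
        dist = self.dist_embed(reg[:, :, -1])                       # (M, K, n)
        ctx = actors.unsqueeze(1).expand(-1, self.K, -1)            # (M, K, n)
        cls = self.cls(torch.cat([dist, ctx], dim=-1)).squeeze(-1)  # (M, K)
        return reg, cls                                             # (M, K, T, 2), (M, K)
```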

We use the sum of the classification and regression losses to train the model:

$$L = L_{\text{cls}} + \alpha L_{\text{reg}},$$

where $\alpha = 1.0$.

For classification, we use the max-margin loss:

$$L_{\text{cls}} = \frac{1}{M(K-1)} \sum_{m=1}^{M} \sum_{k \neq \hat{k}} \max\left(0,\; c_{m,k} + \epsilon - c_{m,\hat{k}}\right) \tag{6}$$

where $\epsilon$ is the margin and $M$ is the total number of actors. For regression, we apply the smooth $\ell_1$ loss on all predicted time steps:

$$L_{\text{reg}} = \frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} \text{reg}\left(p_{m,t}^{\hat{k}} - p_{m,t}^{*}\right) \tag{7}$$

where $p_{m,t}^{*}$ is the ground-truth BEV coordinates at time step $t$, $\text{reg}(x) = \sum_i d(x_i)$, $x_i$ is the $i$-th element of $x$, and $d(x_i)$ is the smooth $\ell_1$ loss defined as:

$$d(x_i) = \begin{cases} 0.5 x_i^2 & \text{if } \|x_i\| < 1, \\ \|x_i\| - 0.5 & \text{otherwise,} \end{cases} \tag{8}$$

where $\|x_i\|$ denotes the $\ell_1$ norm of $x_i$.
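Putting Eqs. (6)-(8) together, a sketch of the full training loss; treating the mode whose endpoint is closest to the ground-truth endpoint as $\hat{k}$, and the margin value 0.2, are my assumptions here:

```python
import torch
import torch.nn.functional as F

def lanegcn_loss(reg, cls, gt, eps=0.2, alpha=1.0):
    """reg: (M, K, T, 2) predictions, cls: (M, K) scores, gt: (M, T, 2) ground truth."""
    M, K, T, _ = reg.shape
    # \hat{k}: the mode whose final position is closest to the ground-truth endpoint
    fde = torch.norm(reg[:, :, -1] - gt[:, -1].unsqueeze(1), dim=-1)  # (M, K)
    k_hat = fde.argmin(dim=1)                                         # (M,)
    # Eq. (6): max-margin loss against the positive mode \hat{k}
    c_hat = cls.gather(1, k_hat.unsqueeze(1))                         # (M, 1)
    margin = (cls + eps - c_hat).clamp(min=0)
    margin = margin.scatter(1, k_hat.unsqueeze(1), 0.0)               # exclude k = \hat{k}
    l_cls = margin.sum() / (M * (K - 1))
    # Eqs. (7)-(8): smooth-l1 on the best mode over all time steps
    best = reg[torch.arange(M), k_hat]                                # (M, T, 2)
    l_reg = F.smooth_l1_loss(best, gt, reduction='sum') / (M * T)
    return l_cls + alpha * l_reg
```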

Take the official 2645.csv sequence from the dataset as an example.

agent node:

  • data['city']: city name
  • data['trajs'] = [agt_traj] + ctx_trajs: trajectory points (agent + context vehicles)
  • data['steps'] = [agt_step] + ctx_steps: indices of the points in the raw data
  • data['feats'] = feats: (13 × 20 × 3) the 20 observed trajectory steps, plus one dimension flagging whether each point exists
  • data['ctrs'] = ctrs: (13 × 2) center points
  • data['orig'] = orig: the AGENT's current position
  • data['theta'] = theta: the AGENT's rotation angle
  • data['rot'] = rot: (2 × 2) rotation matrix (see the sketch after this list)
  • data['gt_preds'] = gt_preds: (13 × 30 × 2) ground-truth trajectories for the next 30 frames
  • data['has_preds'] = has_preds: (13 × 30) flags indicating whether each of the next 30 trajectory points exists
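For reference, a sketch of the AGENT-centric normalization implied by orig, theta and rot (the numeric values are made up, and the exact angle convention is whatever the preprocessing code defines):

```python
import numpy as np

theta = np.pi / 3                    # example heading angle (made-up value)
orig = np.array([100.0, 50.0])       # AGENT position at the current frame
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])

# world-frame points -> AGENT-centric frame: translate by orig, then rotate
pts_world = np.array([[101.0, 51.0], [102.0, 52.5]])
pts_agent = (rot @ (pts_world - orig).T).T
```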

lane node:

  • graph['ctrs'] = np.concatenate(ctrs, 0): center coordinates of the lane nodes
  • graph['num_nodes'] = num_nodes: number of lane nodes
  • graph['feats'] = np.concatenate(feats, 0): direction vectors of the lane nodes
  • graph['turn'] = np.concatenate(turn, 0): turn flags of the lane nodes
  • graph['control'] = np.concatenate(control, 0): has_traffic_control flags of the lane nodes
  • graph['intersect'] = np.concatenate(intersect, 0): is_intersection flags of the lane nodes
  • graph['pre'] = [pre]: pre['u'] and pre['v'], where v is the pre of u; these describe relations between lane nodes
  • graph['suc'] = [suc]: suc['u'] and suc['v'], where v is the suc of u; these describe relations between lane nodes
  • graph['lane_idcs'] = lane_idcs: the lane index of each lane node, laid out as
    0 0 0 ... 0
    1 1 1 ... 1
        ...
    83 83 83 ... 83
  • graph['pre_pairs'] = pre_pairs: pairs describe relations between lanes (not lane nodes)
  • graph['suc_pairs'] = suc_pairs: pairs describe relations between lanes
  • graph['left_pairs'] = left_pairs: pairs describe relations between lanes
  • graph['right_pairs'] = right_pairs: pairs describe relations between lanes
    • for pre['u'] and pre['v'], v is the pre of u
    • for suc['u'] and suc['v'], v is the suc of u
    • for left['u'] and left['v'], v is the left of u
    • for right['u'] and right['v'], v is the right of u

Net Structure

  • ActorNet input: M x 3 x 20, output: M x 128 (the temporal feature map is M x 128 x 20; the feature at the last time step is taken as the actor feature)

Explanation:

  • MapNet: features of node v are aggregated onto the center node u; input: N x 4, output: N x 128

  • A2M input: N x 128 output: N x 128

  • M2M input: N x 128 output: N x 128

  • M2A input: N x 128 output: M x 128

  • A2A input: M x 128 output: M x 128

  • Prediction Header: input M x 128

    • MLP Regression
    • MLP Classification

ref link: https://zhuanlan.zhihu.com/p/447129428
