1. Massachusetts Institute of Technology (MIT)

2. National Institute of Informatics, Tokyo

2018

presented by Albert M Orozco Camacho

**Neighborhood aggregation** has become a crucial component of representation
learning on graphs, with the rise of *graph neural networks*.

This procedure extracts high-level features for each node via a
*message passing* scheme.

Traditional GCNs typically achieve state-of-the-art performance only with 2-layer models; deeper models fail to take advantage of access to more information.

Networks with diverse subgraph structures (such as node hubs) lead GCNs to learn node relations inconsistently.

$$ h_v^{(l)} = \text{RELU}\left(W_l \cdot \frac{1}{\tilde{\deg}(v)} \sum_{u \in \tilde{N}(v)} h_u^{(l-1)}\right) $$

$$ h_v^{(l)} = \text{COMBINE}\left(h_v^{(l-1)}, h_{N(v)}^{(l)}\right) $$

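The mean-aggregation update above can be sketched in plain Python (a minimal sketch, not the paper's implementation; `adj`, `h`, and `W` are hypothetical names, with `adj` already including self-loops so that it plays the role of $\tilde{N}(v)$):

```python
def relu(x):
    return [max(0.0, xi) for xi in x]

def gcn_layer(adj, h, W):
    """One mean-aggregation GCN layer over self-augmented neighborhoods.

    adj: dict node -> list of neighbors, including the node itself (N~(v))
    h:   dict node -> feature vector at layer l-1 (list of floats)
    W:   weight matrix W_l as a list of rows, applied to the neighborhood mean
    """
    out = {}
    for v, nbrs in adj.items():
        deg = len(nbrs)  # deg~(v), counting the self-loop
        mean = [sum(h[u][i] for u in nbrs) / deg for i in range(len(h[v]))]
        out[v] = relu([sum(W[r][i] * mean[i] for i in range(len(mean)))
                       for r in range(len(W))])
    return out
```

With an identity weight matrix, each node's new representation is simply the mean of its self-augmented neighborhood, which makes the smoothing behavior of repeated layers easy to see.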

each layer increases the size of the influence distribution by aggregating neighborhoods from the previous layer ⬆️;

combines, at the last layer, some of the previous layers' representations independently for each node;

intermediate representations are said to **jump** to the last layer.
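The jumping mechanism above can be sketched as collecting every layer's representation and combining them per node only at the end (a minimal sketch; `layer_fns` and `aggregate` are hypothetical placeholders, not the paper's API):

```python
def jk_forward(h0, layer_fns, aggregate):
    """Run k aggregation layers, keep all intermediate representations,
    and combine them per node only at the final (jumping) step.

    h0:        dict node -> input feature vector
    layer_fns: list of k functions, each one neighborhood-aggregation layer
    aggregate: layer aggregator applied to a node's list of representations
    """
    reps = [h0]
    h = h0
    for fn in layer_fns:
        h = fn(h)        # neighborhood aggregation at layer l
        reps.append(h)   # this intermediate representation "jumps" to the end
    # layer-wise aggregation: each node sees all k+1 of its representations
    return {v: aggregate([r[v] for r in reps]) for v in h0}
```

Because the choice of `aggregate` is independent of the per-layer updates, the same forward pass supports concatenation, max-pooling, or attention-based layer aggregators.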

**CONCATENATION**
$$[h_v^{(1)},\ldots,h_v^{(k)}]$$

**MAX-POOLING**. Select the most informative layer for each feature coordinate: element-wise $\max\left(h_v^{(1)},\ldots,h_v^{(k)}\right)$.

**LSTM-ATTENTION**. Feed $h_v^{(1)},\ldots,h_v^{(k)}$ into a bi-directional
LSTM to generate forward and backward features $f_v^{(l)}$ and $b_v^{(l)}$ for each
layer $l$; then compute an attention score per layer for each node by combining
these features.

The influence score $I(x, y)$ for any $x, y \in V$ under a $k$-layer JK-Net with layer-wise max-pooling is equivalent in expectation to a mixture of $0,\ldots,k$-step random walk distributions on $\tilde{G}$ at $y$ starting at $x$, the coefficients of which depend on the values of the layer features $h_x^{(l)}$.
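The random-walk view is concrete enough to compute: the $l$-step distribution at $y$ starting from $x$ comes from repeatedly spreading probability mass over neighbors (a sketch on an adjacency-list graph; self-loops are included to mirror the augmented graph $\tilde{G}$):

```python
def random_walk_dist(adj, start, steps):
    """Distribution of a simple random walk on adj after `steps` steps,
    started at `start`.

    adj: dict node -> list of neighbors (with self-loops, as in G~)
    """
    p = {v: 0.0 for v in adj}
    p[start] = 1.0
    for _ in range(steps):
        nxt = {v: 0.0 for v in adj}
        for v, mass in p.items():
            if mass:
                share = mass / len(adj[v])  # uniform step over neighbors
                for u in adj[v]:
                    nxt[u] += share
        p = nxt
    return p
```

Mixing such distributions over $0,\ldots,k$ steps with node-dependent coefficients is exactly the extra flexibility the theorem attributes to layer-wise max-pooling.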

**Goal**: Provide a representation learning scheme that generalizes
better across a diverse variety of network structures than the scheme proposed for GCNs.

**Problem**: Denser subgraphs may cause aggregation schemes to converge,
in expectation, to biased random walk distributions. ☹

**Solution**: JK-Nets aggregate and leverage information from more than
one hidden layer. 😁

JK-Nets with the LSTM-attention aggregator outperform the non-adaptive models GraphSAGE, GAT, and JK-Nets with the concatenation aggregator.

https://github.com/ShinKyuY/Representation_Learning_on_Graphs_with_Jumping_Knowledge_Networks

**Future work**: Explore other layer aggregators, and study the effect of combining various layer-wise and node-wise aggregators on different types of graph structures.

How can sequence modelling by itself impact the task of layer aggregation?

Are there *smarter* ways to keep track of node/community correlations within a network?