
The Data Daily

An overview of graph neural networks for anomaly detection in e-commerce


In Part 1 of this blogpost, we introduced the terminology and touched on the benefit of using network and graph information toward more efficient anomaly detection in e-commerce. Graph neural network (GNN) methods combine graph analysis with deep learning techniques, which have proven very beneficial in several real-world applications. In this article, we start with a review of a simple GNN method, followed by a few more complex ones. Finally, we look at the challenges of applying GNNs to dynamic environments and of scaling them to large graphs.

The graph convolutional network (GCN) is an extension of the convolutional neural network (CNN), one of the most popular deep learning methods, which has proved very successful at sophisticated problems in image, video, and voice/speech processing applications. Note that a CNN normally operates on input data of fixed size (such as a voice sequence of 0.5 seconds, or a 512x512 image) sampled on a regular 1D or 2D grid, where there is spatial locality and a natural ordering. Graph or network data, however, have more complex structure, arbitrary node neighborhood sizes, and no fixed node ordering. GCN (see Defferrard, et al. 2016, and Kipf and Welling, 2017) is one of the first methods developed for GNN analysis.

Traditional GCNs operate by converting the graph structure, i.e., the node neighborhood/adjacency information, into a sparse binary matrix called the adjacency matrix, A, and then using A as a collection or gather operation to select and average one-hop neighborhoods. Let's assume that each node of the graph is described by a numerical feature vector of length M, and that all node feature vectors (which could include various attributes describing each node) are collected into a matrix X of size N × M, where N is the number of nodes in the graph.
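To make the gather-and-average idea concrete, here is a minimal numpy sketch of one GCN layer. The symmetric normalization with self-loops follows the standard GCN formulation; the weight matrix W and the toy graph are hypothetical stand-ins for learned parameters and real data.

```python
import numpy as np

def gcn_layer(A, X, W):
    """One GCN layer: symmetrically normalize A (with self-loops),
    average each node's 1-hop neighborhood, then transform and apply ReLU."""
    N = A.shape[0]
    A_hat = A + np.eye(N)                      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)  # D^(-1/2) (A+I) D^(-1/2)
    return np.maximum(A_norm @ X @ W, 0.0)     # aggregate, transform, ReLU

# Toy graph: 4 nodes in a path 0-1-2-3, with M = 2 features per node
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))                    # N x M feature matrix
W = rng.normal(size=(2, 3))                    # maps M=2 features to 3
H = gcn_layer(A, X, W)                         # new node embeddings, shape (4, 3)
```

After one such layer, each row of H mixes a node's own features with those of its immediate neighbors.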

GCN's main limitation is that it assumes a homogeneous graph, meaning that all edges are taken to be of the same type across the graph, which is not the case for an e-commerce network.

Note that repeating the aforementioned steps propagates network information across the graph's nodes, which progressively changes the node features. Once the final network embedding is obtained, i.e., after i steps (or i GCN layers), once the node embeddings H(i) are calculated, detecting anomalies in nodes or edges can be performed with traditional ML anomaly detection techniques (Figures 1 and 2). We could apply a fully connected (FC) or multilayer perceptron (MLP) layer over the learned graph features H(i) to generate the final decision. The number of steps, i, which determines the desired neighborhood depth, is a design parameter of choice.
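The pipeline above can be sketched end to end: stack i GCN layers to get H(i), then score nodes with a downstream detector. The scoring step here (distance from the mean embedding) is only a hypothetical stand-in for whatever anomaly detector or FC/MLP head one would actually train.

```python
import numpy as np

def normalize(A):
    # Symmetric normalization with self-loops: D^(-1/2) (A + I) D^(-1/2)
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)

def embed(A, X, weights):
    # i stacked GCN layers: each step mixes in one more hop of neighborhood
    H, A_norm = X, normalize(A)
    for W in weights:
        H = np.maximum(A_norm @ H @ W, 0.0)
    return H

def anomaly_scores(H):
    # Toy stand-in for a learned detector: distance of each node's
    # embedding from the mean embedding (larger = more anomalous)
    return np.linalg.norm(H - H.mean(axis=0), axis=1)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 2))
weights = [rng.normal(size=(2, 4)), rng.normal(size=(4, 3))]  # i = 2 layers
scores = anomaly_scores(embed(A, X, weights))                 # one score per node
```

In practice the weights come from training, and the scoring head would be a trained FC/MLP layer rather than a fixed distance.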

There are several revisions of GCN, including the simple graph convolution (SGC), the relational GCN (R-GCN), and the graph attention network (GAT). The R-GCN method allows different node relationships, i.e., the edges can have different types.

As an extension of GCN, the MixHop method by Abu-El-Haija, et al. (2019) incorporates multi-scale neighborhoods into the GCN filter by repeatedly mixing feature representations of neighbors at several scales. Each MixHop layer uses 0-, 1-, 2-, …, N-hop neighborhoods and learns to aggregate across all these hops simultaneously (see Figures 3 and 4).
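A minimal sketch of the multi-hop mixing idea: propagate features through successive powers of the normalized adjacency, give each power its own weight matrix, and concatenate the results. The toy graph and random weights are illustrative only.

```python
import numpy as np

def mixhop_layer(A_norm, X, weights):
    """One MixHop-style layer: the k-th term propagates features k hops
    (A_norm^k @ X) with its own weight matrix; results are concatenated."""
    outputs, H = [], X
    for W in weights:          # weights[k] handles the k-hop term
        outputs.append(np.maximum(H @ W, 0.0))
        H = A_norm @ H         # move to the next power of A_norm
    return np.concatenate(outputs, axis=1)

# Toy usage: 4 nodes, 2 features, powers k = 0, 1, 2
rng = np.random.default_rng(0)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_hat = A + np.eye(4)
d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
A_norm = A_hat * np.outer(d_inv_sqrt, d_inv_sqrt)   # symmetric normalization
X = rng.normal(size=(4, 2))
weights = [rng.normal(size=(2, 2)) for _ in range(3)]
H_out = mixhop_layer(A_norm, X, weights)            # shape (4, 6): 3 hops x 2 dims
```

Because each hop keeps its own columns in the output, the layer can weight near and far neighborhoods differently instead of blending them into one average.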

The message passing neural network (MPNN) (Gilmer, et al., 2017) is a generalization of GCN. It can handle heterogeneous graphs, allowing the use of edge features as well as node features, and it provides control over when and how messages are passed.

Considering the notion of message passing, which is now the most popular approach in graph analysis, we argue that a GNN captures node dependence and graph interactions by passing messages between the nodes of the graph.

In MPNN, each node has an embedding or state (or hidden state, denoted by h). Starting from the initial embeddings, a message is generated along each edge of the graph:

See Figure 5. Here we used the summation operator as the aggregation function, but there are other options as well.
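One round of this message/aggregate/update cycle can be sketched as below, using summation as the aggregation function, as in the text. The identity message function and additive update are hypothetical placeholders for the small learned networks a real MPNN would use.

```python
import numpy as np

def mpnn_step(edges, h, message_fn, update_fn):
    """One round of message passing: generate a message along each
    directed edge, sum the incoming messages per node, then update
    each node's hidden state."""
    dim = next(iter(h.values())).shape
    m = {v: np.zeros(dim) for v in h}        # aggregated incoming messages
    for (u, v) in edges:                     # message from u to v
        m[v] = m[v] + message_fn(h[u], h[v])
    return {v: update_fn(h[v], m[v]) for v in h}

# Toy graph 0-1-2 with both edge directions listed explicitly
edges = [(0, 1), (1, 0), (1, 2), (2, 1)]
h = {0: np.array([1.0, 0.0]),
     1: np.array([0.0, 1.0]),
     2: np.array([1.0, 1.0])}
h_new = mpnn_step(edges, h,
                  message_fn=lambda hu, hv: hu,       # pass the sender's state
                  update_fn=lambda hv, mv: hv + mv)   # fold messages into state
```

Node 1 receives messages from both neighbors, so its new state sums its own state with theirs; repeating the step spreads information further across the graph.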

Note that the readout function R(.), as written here, must be invariant to permutations of the node ordering.

Many methods can be described as realizations of the MPNN model, for example the popular gated graph neural network (GGNN) model by Li, et al. (2016), which uses a gated recurrent unit (GRU) as its update function.

To perform graph embedding, i.e., to learn the graph representation, it might seem necessary to use all the nodes and edges of the input graph; however, this is not always feasible due to computational cost. A practical way to scale graph embedding to large graphs is "neighborhood sampling" (or "neighbor sampling"). The basic idea is that for every node, instead of using the entire neighborhood (i.e., all connected k-hop neighbors), we sample a subset of it to perform the aggregation needed for the embedding. For example, the DeepWalk method by Perozzi, et al. (2014) uses truncated random walks and then encodes them with a neural network to obtain node embeddings. The walk length as well as the number of walks per node are fixed and limited, essentially exploring the local graph information around each node. At each node along the walk, the method picks the next step uniformly (i.e., with equal probability) from the neighbors. The GraphSAGE method proposed by Hamilton, et al. (2017) and the presentation titled "mining and learning with graphs at scale" by Mirrokni, et al. (2020) are suggested for further reading (see Figure 6).
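The truncated uniform walk at the heart of DeepWalk-style sampling is a few lines of code. This sketch only produces the walks; encoding them into embeddings (e.g., with a skip-gram model) is a separate training step not shown here.

```python
import random

def random_walk(adj, start, walk_length, rng):
    """Truncated random walk: at each node, pick the next step
    uniformly at random from its neighbors (DeepWalk-style)."""
    walk = [start]
    for _ in range(walk_length - 1):
        nbrs = adj[walk[-1]]
        if not nbrs:           # dead end: stop the walk early
            break
        walk.append(rng.choice(nbrs))
    return walk

# Toy path graph 0-1-2-3 as an adjacency-list dict
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
rng = random.Random(42)
walks = [random_walk(adj, v, walk_length=5, rng=rng)
         for v in adj for _ in range(10)]   # 10 fixed-length walks per node
```

Each walk is a short node sequence that explores the local graph around its start node; the collection of walks becomes the training corpus for the embedding model.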

Another method, called LINE (Tang, et al., 2015), preserves not only first-order relations (observed tie strength) but also second-order relations (shared neighborhood structures of the vertices). The Node2Vec method (Grover and Leskovec, 2016) combines two different sampling strategies for vertices: breadth-first sampling and depth-first sampling. For very large networks, the graph itself can be built using methods like Grale (Halcrow, et al., 2020).
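Node2Vec's blend of breadth-first and depth-first exploration comes from a biased walk with two parameters, p and q. A sketch of one biased step, following the standard second-order weighting (1/p for returning to the previous node, 1 for staying near it, 1/q for moving away), with a hypothetical toy graph:

```python
import random

def node2vec_step(adj, prev, cur, p, q, rng):
    """One biased Node2Vec step from `cur`, given the previous node:
    returning to `prev` is weighted 1/p (BFS-like when p is small),
    moving to a neighbor of `prev` is weighted 1, and moving farther
    away is weighted 1/q (DFS-like when q is small)."""
    nbrs = adj[cur]
    weights = []
    for x in nbrs:
        if x == prev:
            weights.append(1.0 / p)     # step back toward where we came from
        elif x in adj[prev]:
            weights.append(1.0)         # stay within prev's neighborhood
        else:
            weights.append(1.0 / q)     # explore outward
    return rng.choices(nbrs, weights=weights, k=1)[0]

adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
rng = random.Random(7)
nxt = node2vec_step(adj, prev=0, cur=1, p=0.25, q=4.0, rng=rng)  # a neighbor of 1
```

Tuning p and q trades off local (structural) versus global (community) exploration in the resulting walks.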

Using the adjacency matrix of the entire graph quickly runs into memory issues as the graph grows. There are methods to scale up to real-life graphs (which can be on the order of a billion nodes), e.g., by looking at approximate neighborhoods and by performing calculations on subgraphs (graph patches sampled from the main graph).

PageRank is a method for rating the importance of web pages using the link structure of the web; it gives a global ranking of pages based on their positions in the web graph. A personalized variant, Personalized PageRank (PPR), see Andersen, et al. (2006), has been shown to be a good alternative to explicit message passing, and it can be used to incorporate a node's multi-hop neighborhood information. Information propagation based on PPR corresponds to many neighborhood aggregation layers, where with each layer the node influence decays exponentially. However, since it performs an expensive version of power iteration, it is not readily scalable to large graphs.
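The PPR vector of a node can be computed by the power iteration mentioned above: with probability alpha the walk teleports back to the source node, otherwise it follows a random edge. A small numpy sketch on a toy graph (the graph and alpha value are illustrative):

```python
import numpy as np

def personalized_pagerank(A, source, alpha=0.15, iters=100):
    """PPR via power iteration: pi = alpha * e_source + (1 - alpha) * pi @ P,
    where P is the row-stochastic random-walk transition matrix."""
    N = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)   # row-normalize the adjacency
    e = np.zeros(N)
    e[source] = 1.0                        # teleport distribution
    pi = e.copy()
    for _ in range(iters):
        pi = alpha * e + (1 - alpha) * pi @ P
    return pi

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
pi = personalized_pagerank(A, source=0)    # probability mass decays with hops
```

The resulting vector concentrates mass near the source and decays with distance, which is exactly the exponentially decaying multi-hop influence described above.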

To avoid computationally expensive aggregations at runtime, the PPRGo method (Bojchevski, et al., 2020) pre-computes the aggregations (the PPR vectors), thus saving computation at inference time (see Mirrokni, et al. (2020) as well). The steps are:

PPRGo aggregates the most important nodes in the N-hop neighborhood using only one hop of computation.
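The runtime side of this idea reduces to a single weighted sum per node over a pre-computed, truncated PPR vector. In this sketch the top-k PPR weights are hypothetical hard-coded values standing in for the offline pre-computation, and H stands in for per-node (transformed) features:

```python
import numpy as np

def pprgo_aggregate(ppr_topk, H):
    """PPRGo-style inference: each node's embedding is a PPR-weighted
    sum of the features of its top-k most important nodes, using
    PPR scores pre-computed offline (one hop of computation)."""
    out = np.zeros_like(H)
    for v, pairs in ppr_topk.items():
        for u, weight in pairs:        # (important node, its PPR score)
            out[v] += weight * H[u]
    return out

H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
ppr_topk = {0: [(0, 0.6), (1, 0.3), (2, 0.1)],   # hypothetical truncated PPR
            1: [(1, 0.7), (0, 0.3)],
            2: [(2, 0.8), (1, 0.2)]}
Z = pprgo_aggregate(ppr_topk, H)       # multi-hop information in one pass
```

Because the expensive part (computing and truncating the PPR vectors) happens offline, inference cost no longer grows with the number of propagation layers.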

In an e-commerce problem, the graph structure as well as the node attributes change over time, and there is useful information in how the events evolve. If we ignore this temporal information and only look at a static snapshot of the graph whenever we want to learn or infer at a point in time, we may lose efficiency and obtain inferior performance, not to mention that the entire computation is repeated each time.

In general, dynamic models can inherently make predictions at any point in time (such as classifying the current label of a node) by predicting events sequentially, and they support node feature changes as well as the addition or deletion of nodes and edges.

For learning and inference methods on dynamic graphs, see "temporal graph networks (TGN) for deep learning on dynamic graphs" by Rossi, et al. (2020), and the survey by Kazemi, et al. (2020), which describes various methods from an encoder-decoder perspective. In that view, an encoder model generates temporal node embeddings, while the decoder is a task-dependent model, for example a neural network that takes the temporal embeddings of neighboring nodes as input to perform node classification.

There are several other methods available for embedding dynamic graphs, including the work of Lei et al. (2020); the Dyngraph2vec embedding method of Goyal et al. (2019), which uses a recurrent neural network (RNN) to capture temporal information and learns the embedding with an auto-encoder model; and the NetWalk embedding method of Yu et al. (2018).

There are several related graph-based works on fraud or abuse detection. Li, et al. (2020) published a work on using GCN for understanding and classifying financial data. The work by Xu, et al. (2021) addresses consumer loan fraud detection using GCN. Rao, et al. (2020) presented a method comprised of a fraud detector part and an explainer part, which generates a graphical visualization (showing fraud vs. legitimate transactions) from heterogeneous graph data as well as the detector parameters. Wang, et al. (2021) presented a joint modeling of graph and time information for large-scale graphs: the time information and sequential events for every node are encoded using an RNN model, and the per-node outputs of the RNN are then propagated through the graph by a GNN, which employs the network information in the graph to make the final abuse detection decision, or to perform label prediction for unlabeled users.
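The RNN-then-GNN pattern described for Wang, et al. (2021) can be sketched in miniature: fold each node's time-ordered events into a state with a tiny recurrent cell, then mix those states over the graph. Everything here (the single tanh cell, the mean-neighbor propagation, the random data) is a simplified, hypothetical stand-in for the trained models in the paper.

```python
import numpy as np

def encode_events(events, Wx, Wh):
    """Tiny recurrent cell: fold a node's time-ordered event features
    into one hidden state, h_t = tanh(Wx x_t + Wh h_{t-1})."""
    h = np.zeros(Wh.shape[0])
    for x in events:                         # events in time order
        h = np.tanh(Wx @ x + Wh @ h)
    return h

def propagate(A, H):
    """One round of mean-neighbor aggregation (with self-loops) over
    the graph, applied to the per-node temporal encodings."""
    A_hat = A + np.eye(A.shape[0])
    return (A_hat / A_hat.sum(axis=1, keepdims=True)) @ H

rng = np.random.default_rng(1)
Wx, Wh = rng.normal(size=(3, 2)), rng.normal(size=(3, 3))
node_events = {0: rng.normal(size=(5, 2)),   # node 0: 5 events, 2 features each
               1: rng.normal(size=(2, 2)),
               2: rng.normal(size=(4, 2))}
H = np.stack([encode_events(node_events[v], Wx, Wh) for v in range(3)])
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)      # toy graph 0-1-2
Z = propagate(A, H)      # temporal embeddings mixed with 1-hop neighbors
```

A final classifier over Z (not shown) would then produce the per-node abuse decision.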

In Parts 1 and 2 of this blogpost, we briefly reviewed the value of network information among the entities involved in e-commerce transactions and explained how to model them as graphs. We then introduced several popular GNN and graph embedding methods. These are emerging techniques for efficient extraction of network information from connected data. However, there are several challenges in deploying such methods in practice, including how to properly design and represent an e-commerce graph given its dynamic structure and the time-based nature of its events, how to efficiently perform inference on new nodes or unseen data, and how to scale both learning and inference to large-scale graphs. These challenges require further research and development work, which is beyond the scope of this blogpost.

I would like to thank Henry Chen, Priya Venkat and James Tang (of Walmart) for reading the draft and providing their thoughtful comments, which helped to improve this post.
