Exploring the Fundamental Principles and Various Approaches of Neural Architecture Search (NAS)

Neural Architecture Search (NAS) automates the design of neural networks’ topology to achieve the best performance on a specific task. The goal is to design the architecture using limited resources and minimal human intervention. Following the work of Ren et. al, this article discusses a general framework for NAS. At its core, NAS is a search algorithm that operates on the search space of possible network topologies, consisting of predefined operations (e.g.convolutional layers, recurrent, pooling, fully connected etc.) and their connections. A controller then chooses a list of possible candidate architectures from the search space, which are trained and ranked based on their performance on the validation test. The ranking is used to readjust the search and obtain new candidates. The process iterates until reaching a certain condition and provides the optimal architecture. The optimal architecture is evaluated on the test set.

nas-framework

In general, the landscape of NAS algorithms is quite confusing. The most popular categorization characterizes NAS based on three major components: (a) the search space, (b) the search strategy (involving the type of controller and the evaluation of the candidates), and (c) the performance evaluation technique. A good resource for more details on this is a review by Esken et.al. Also feel free to refer to an extra resource provided by Lillian Weng. Recent approaches combine the search strategy with the evaluation step, making it hard to distinguish algorithms between them. For that reason, this article will explore NAS based solely on the search strategy. As we progress, we will examine different search spaces and evaluation techniques. Note that many implementations experiment with different types of search strategies, so the following categorization is not always strict.

Search strategy refers to the methodology used to search for the optimal architecture in the search space. NAS algorithms can be classified by their search strategy into 5 main areas: Random search, Reinforcement learning, Evolutionary algorithms, Sequential model-based optimization, and Gradient optimization.

The most naive approach is obviously random search, which is often used as a baseline. Here a valid architecture is chosen at random with no learning involved whatsoever.

Reinforcement learning

NAS can be very elegantly formulated as an RL problem. The agent’s action is the generation of a neural architecture while the agent’s reward is the performance evaluation. The action space is, of course, the search space. As a result, different RL methods can be used to solve the problem.

nas-rl

Early works of NAS (NAS-RL, NASNet) used a recurrent neural network (RNN) as a policy network (controller). The RNN is responsible for generating candidate architectures. The architecture is then trained and evaluated on the validation set. The parameters of the RNN controller are optimized in order to maximize the expected validation accuracy, using policy gradients techniques such as REINFORCE and Proximal Policy Optimization (PPO).

rnn-controller

Similarly, ENAS uses an RNN controller trained with policy gradients. Notably, it is one of the first works that effectively share parameters among architectures. The intuition is that the architectures can be viewed as part of a large graph, an approach that has been used extensively as we will see below. ENAS training is performed in two alternating steps: (a) the RNN controller is trained with REINFORCE and (b) the shared parameters are trained in typical gradient descent form.

Another successful effort called MetaQNN uses Q Learning with an e-greedy exploration mechanism and experience replay.

Before we proceed, let’s open a parenthesis and discuss the search space.

Global search space

NAS-RL is looking for all possible combinations of operations, resulting in a huge and very expensive search space. It tried to combine the operations to form chain-structured (a.k.a sequential) networks. The search space is parameterized by (a) the number of layers, (b) the type of each operation, and (c) the hyperparameters of each operation (e.g kernel size, number of filters).

Later on, skip connections were also added to the mix, allowing multi-branch architectures such as ResNet or DenseNet-like topologies.

Modular search space

To solve the global space problem, cell-based approaches were proposed to “modularize” the search space by mixing different blocks of layers called modules. NASNet is the most popular algorithm in that category. NASNet learns only two kinds of modules or “cells”: a normal cell that performs feature extraction and a reduction cell that downsamples the input. The final architecture is built by stacking these cells in a predefined way.

modular-search-space

The cell-based approach is widely used in other works such as ENAS. But modularization via cells is not the only alternative. FPNAS emphasizes block diversity by alternatively optimizing blocks while keeping other blocks fixed. FBNet utilizes a layer-wise search space, allowing each searchable layer in the network to choose a different block from the layer-wise search space.

Evolutionary algorithms

Genetic Algorithms (GA) offer an alternative way to optimize the network architecture. Evolutionary algorithms start with a population of models. In each step, some models are sampled and “reproduce” to generate offsprings by applying mutations to them. Mutations can be local operations such as the addition of a layer or the modification of a hyperparameter. After training, the offspring models are evaluated and added back to the population. The process repeats itself until a certain condition is met.

nas-ea

GeNet proposes an encoding method to represent each architecture as a fixed-length binary string, which will be used by a standard genetic algorithm. AmoebaNet uses the tournament selection evolutionary algorithm, a modification known as regularized evolution. The difference is that the “age” (steps of the GA) of each model is also taken into consideration by favoring the younger ones. Note that the NASNet search space is used here as well.

Sequential model-based optimization

In sequential model-based optimization, we can view NAS as a sequential process that builds increasingly complex networks iteratively. A surrogate model evaluates all candidate modules (or cells) and selects some promising candidate modules. It then evaluates the performance of the generated network on a validation set and updates itself based on that performance. Through iteration, the model gradually expands and reaches the desired performance.

nas-sequential

Progressive NAS (PNAS) starts with the NASNet search space and searches the space in a progressive order. Each cell can contain up to 5 predefined blocks. It starts by generating, training, and evaluating all possible cells with only 1 block. It then expands to cells with 2 blocks. This time, instead of training all cells, a surrogate model (an RNN) predicts the performance of each architecture. Note that the surrogate model is trained based on the validation performance of 1-block cells. The process continues up to 5-block cells. DPP-Net follows a very similar approach with PNAS, but it also accounts for the device it’s executed on. Given the device the search is performed, constraints are set based on its memory size and similar hardware characteristics. Models that do not meet the constraints are either removed from the candidate list or optimized using Pareto optimality.

dpp-net

Gradient optimization and one-shot approaches

Gradient optimization methods use, in most cases, a one-shot model. A one-shot model, also known as supermodel or supernet, is usually a single large network that contains all possible operations in the search space and is used to generate the weights for other candidate networks. After training the model, it can be used to sample “sub-architectures” and compare them on the validation set. In a way, parameter sharing is utilized to the maximum extent. The one-shot network is usually trained with gradient descent. But how can we run gradient-based methods on discrete search spaces?

Continuous search space

Both global and modular search spaces assume that we have a discrete set of possible solutions. An exciting idea is to transform the space into a continuous and differentiable form. In this direction, DARTS relaxes the NASNet cell-based space in a continuous way by representing each cell as a node in a Directed Acyclic Graph (DAG). Every DAG is formed as a sequential connection of N nodes and has two input nodes and one output node. The relaxation is accomplished by formulating the discrete choice of operations as a softmax – a continuous choice of probabilities.

darts

DARTS treats NAS as a bi-level optimization problem because it jointly trains the architecture parameters and network weights with gradient descent. Similarly, Stochastic NAS (SNAS) searches space is a set of one-hot random variables from a fully factorizable joint distribution. This search space is made differentiable by relaxing the architecture distribution with concrete distribution, thus enabling the use of gradient optimization. NAO maps the discrete search space to a continuously embedded encoding, obtains the optimal embedded encoding using gradient optimization, and then uses a decoder to convert the optimal continuous representation into the final architecture.

One-shot approaches

Both DARTS and NAO are one-shot approaches because they use a supermodel to deduce the optimal architectures. SMASH trains an auxiliary model called HyperNet instead of training all possible candidates, reducing the search space even further. ProxylessNAS proposes a path-level pruning perspective and a novel idea of binarizing the architecture parameters to keep only one active path at a time. AlphaNet further improves the training of the supernet by using in-place knowledge distillation (KD) and an alpha-divergence approach. BossNAS uses a unique self-supervised representation learning scheme called ensemble bootstrapping. NAT uses transfer learning in the context of NAS, transferring an existing supernet into a task-specific supernet while searching for architectures that solve multiple objectives.

proxyless-nas

* Disclosure: Please note that some of the links above might be affiliate links, and at no additional cost to you, we will earn a commission if you decide to make a purchase after clicking through.

Conclusion

NAS research is still in its infancy. Benchmarking is also not a trivial endeavor, with the existing standard benchmarks such as ImageNet, NAS-Bench201, CIFAR10, but more work is needed in this direction. Open-source libraries are barely touching NAS as a problem. This article aims to serve as an introduction and will hopefully motivate more practitioners to pursue NAS

Latest articles

Related articles