
Multilayer perceptrons (MLP) in Computer Vision

Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI

Summary

Convolutional neural networks (CNNs) still dominate today’s computer vision. Recently, however, networks based on transformer blocks have also been applied to typical computer vision tasks such as object classification, detection, and segmentation, attaining state-of-the-art results on standard benchmark datasets.

However, these vision transformers (ViTs) are usually pre-trained on extremely large datasets and may consist of billions of parameters, requiring teraflops of computing power. Furthermore, the self-attention mechanism inherent to classical transformers has a computational cost that grows quadratically with the number of input tokens.

To mitigate some of the problems posed by ViTs, a new type of network based solely on multilayer perceptrons (MLPs) has recently been proposed. These vision MLPs (V-MLPs) dispense with classical self-attention but still achieve global processing through their fully connected layers.

In this blog post, we review the V-MLP literature, compare V-MLPs to CNNs and ViTs, and attempt to extract the ingredients that really matter for efficient and accurate deep learning-based computer vision.

Introduction

In computer vision, CNNs have been the de facto standard networks for years. Early CNNs, like AlexNet [1] and VGGNet [2], consisted of a stack of convolutional layers, ultimately terminating in several large fully connected layers used for classification. Later, networks were made progressively more efficient by reducing the size of the classifying fully connected layers using global average pooling [3]. Furthermore, these more efficient networks, among other adjustments, reduce the spatial size of convolutional kernels [4, 5], employ bottleneck layers and depthwise convolutions [5, 6], and use compound scaling of the depth, width and resolution of the network [7].
These architectural improvements, together with several improved training methods [8] and larger datasets, have led to highly efficient and accurate CNNs for computer vision.

Despite their tremendous success, CNNs have their limitations. For example, their small kernels (e.g., 3×3) give rise to small receptive fields in the early layers of the network. This means that information processing in early convolutional layers is local and often insufficient to capture an object’s shape for classification, detection, segmentation, etc. This problem can be mitigated using deeper networks, increased strides, pooling layers, dilated convolutions, skip connections, etc., but these solutions either lose information or increase the computational cost. Another limitation of CNNs stems from the inductive bias induced by the weight sharing across the spatial dimensions of the input. Such weight sharing is modeled after early sensory cortices in the brain and is hence well adapted to efficiently capture natural image statistics. However, it also limits the model’s capacity and restricts the tasks to which CNNs can be applied.

Recently, there has been much research to solve the problems posed by CNNs by employing transformer blocks to encode and decode visual information. These so-called Vision Transformers (ViTs) are inspired by the success of transformer networks in Natural Language Processing (NLP) [9] and rely on global self-attention to encode global visual information in the early layers of the network. The original ViT was isotropic (it maintains an equal-resolution-and-size representation across layers), permutation invariant, based entirely on fully connected layers, and reliant on global self-attention [10].
As such, the ViT solved the above-mentioned problems related to CNNs by providing larger (dynamic) receptive fields in a network with less inductive bias.

This is exciting research, but it soon became clear that the ViT was hard to train, not competitive with CNNs when trained on relatively small datasets (e.g., ImageNet-1K, abbreviated IN-1K below [11]), and computationally complex as a result of the quadratic complexity of self-attention. Consequently, further studies sought to facilitate training. One approach was using network distillation [12]. Another was to insert CNNs at the early stages of the network [13]. Further attempts to improve ViTs re-introduced inductive biases found in CNNs (e.g., using local self-attention [14] and hierarchical/pyramidal network structures [15]). There were also efforts to replace dot-product QKV self-attention with alternatives (e.g., [16]). With these modifications now in place, vision transformers can compete with CNNs with respect to computational efficiency and accuracy, even when trained on relatively small datasets (see the blog post by Bert Moons for more discussion on ViTs).

Vision MLPs

Notwithstanding the success of recent vision transformers, several studies demonstrate that models building solely on multilayer perceptrons (MLPs), so-called vision MLPs (V-MLPs), can achieve surprisingly good results on typical computer vision tasks like object classification, detection and segmentation. These models aim for global spatial processing, but without the computationally complex self-attention. At the same time, these models are easy to scale (high model capacity) and seek to retain a model structure with low inductive bias, which makes them applicable to a wide range of tasks [17].

Like ViTs, the V-MLPs first decompose the images into non-overlapping patches, called tokens, which form the input into a V-MLP block.
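The decomposition into non-overlapping patches can be illustrated with a small sketch. It is only a toy under stated assumptions: the function name `patchify`, the nested-list image layout (height × width × channels) and the 4×4 example image are made up for illustration, and real models typically follow this step with a learned linear projection of each patch.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into non-overlapping patches.

    Returns a list of tokens; each token is one flattened patch of length
    patch * patch * C. Assumes H and W are divisible by `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            token = []
            for row in range(top, top + patch):
                for col in range(left, left + patch):
                    token.extend(image[row][col])  # append all channels of this pixel
            tokens.append(token)
    return tokens


# Hypothetical toy input: a 4x4 single-channel image whose pixel value encodes its position.
toy_image = [[[4 * r + c] for c in range(4)] for r in range(4)]
toy_tokens = patchify(toy_image, patch=2)
print(len(toy_tokens), toy_tokens[0])  # 4 [0, 1, 4, 5]
```

With a 2×2 patch size the 4×4 image yields four tokens. The number of tokens grows with the square of the image side length, which is why the cost of global token mixing is sensitive to input resolution.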
A typical V-MLP block consists of a spatial MLP (token mixer) and a channel MLP (channel mixer), interleaved with (layer) normalization and complemented with residual connections. This is illustrated in Figure 1.

Table 1. Overview of some V-MLPs. For each V-MLP, we present the accuracy of the largest reported model that is trained on IN-1K only.

Here the spatial MLP captures the global correlations between tokens, while the channel MLP combines information across features. This can be formulated as follows:

Y = spatialMLP(LN(X)) + X
Z = channelMLP(LN(Y)) + Y

Here X is a matrix containing the tokens, Y consists of intermediate features, LN denotes layer normalization, and Z is the output feature of the block. In these equations, spatialMLP and channelMLP can be any nonlinear function represented by some type of MLP with an activation function (e.g., GELU).

In practice, the channelMLP is often implemented by one or more 1×1 convolutions, and most of the innovation found in different studies lies in the structure of the spatialMLP submodule. And here’s where history repeats itself. Where ViTs started as isotropic models with global spatial processing (e.g., ViT [10] or DeiT [12]), V-MLPs did so too (e.g., MLP-Mixer [17] or ResMLP [18]). Where recent ViTs improved their accuracy and performance on visual tasks by adhering to a hierarchical structure with local spatial processing (e.g., Swin Transformer [14] or NesT [19]), recent V-MLPs do so too (e.g., Hire-MLP [20] or S^2-MLPv2 [21]). These modifications made the models more computationally efficient (fewer parameters and FLOPs), easier to train and more accurate, especially when trained on relatively small datasets. Hence, over time both ViTs and V-MLPs re-introduced the inductive biases well known from CNNs.

Due to their fully connected nature, V-MLPs are not permutation invariant and thus do not necessitate the type of positional encoding frequently used in ViTs.
However, one important drawback of pure V-MLPs is the fixed input resolution required for the spatialMLP submodule. This makes transfer to downstream tasks, such as object detection and segmentation, difficult. To mitigate this problem, some researchers have inserted convolutional layers or, similarly, bicubic interpolation layers, into the V-MLP (e.g., ConvMLP [22] or RaftMLP [23]). Of course, to some degree, this defeats the purpose of V-MLPs. Other studies have attempted to solve this problem using MLPs only (e.g., [20, 21, 30]), but the data-shuffling needed to formulate the problem as an MLP results in an operation that is very similar or even equivalent to some form of (grouped) convolution.

See Table 1 for an overview of different V-MLPs. Note how some of the V-MLP models are very competitive with (or better than) state-of-the-art CNNs, e.g. ConvNeXt-B with 89M parameters, 45G FLOPs and 83.5% accuracy [28].

What matters?

It is important to note that the high-level structure of V-MLPs is not new. Depthwise-separable convolutions, for example, as used in MobileNets [6], consist of a depthwise convolution (spatial mixer) and a pointwise 1×1 convolution (channel mixer). Furthermore, the standard transformer block comprises a self-attention layer (spatial mixer) and a pointwise MLP (channel mixer). This suggests that the good performance and accuracy obtained with these models result at least partly from the high-level structure of layers used inside V-MLPs and related models. Specifically, this structure consists of (1) the use of non-overlapping spatial patch embeddings as inputs, (2) some combination of independent spatial (with large enough spatial kernels) and channel processing, (3) some interleaved normalization, and (4) residual connections. Recently, such a block structure has been dubbed “MetaFormer” ([24], Figure 2), referring to the high-level structure of the block, rather than the particular implementation of its subcomponents.
Some evidence for this hypothesis comes from [27], who used a simple isotropic purely convolutional model, called “ConvMixer,” that takes non-overlapping patch embeddings as inputs. Given an equal parameter budget, their model shows improved accuracy compared to standard ResNets and DeiT. A more thorough analysis of this hypothesis was performed in “A ConvNet for the 2020s” [28], which systematically examined the impact of block elements 1-4, finding a purely convolutional model reaching SOTA performance on ImageNet, even when trained on IN-1K alone.

Figure 2. a. V-MLP, b. Transformer and c. MetaFormer. Adapted from [24].
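To make the block structure discussed in this post concrete, the two equations Y = spatialMLP(LN(X)) + X and Z = channelMLP(LN(Y)) + Y can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation of any of the cited models: all function names are hypothetical, biases and learnable normalization parameters are omitted, each MLP has a single hidden layer with GELU, and the weights are random.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


def transpose(x):
    return [list(col) for col in zip(*x)]


def gelu(x):
    """Exact GELU applied elementwise."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in x]


def mlp(x, w1, w2):
    """Two-layer perceptron acting on the last dimension of x (no biases)."""
    return matmul(gelu(matmul(x, w1)), w2)


def add(a, b):
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def vmlp_block(x, token_w1, token_w2, chan_w1, chan_w2):
    """Apply one V-MLP block to x of shape (tokens, channels)."""
    # Y = spatialMLP(LN(X)) + X: mix across tokens by working on the transpose.
    y = add(transpose(mlp(transpose(layer_norm(x)), token_w1, token_w2)), x)
    # Z = channelMLP(LN(Y)) + Y: mix across channels, token by token.
    return add(mlp(layer_norm(y), chan_w1, chan_w2), y)


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
tokens, channels, hidden = 4, 3, 8  # toy sizes chosen for illustration
x = random_matrix(tokens, channels, rng, scale=1.0)
z = vmlp_block(
    x,
    random_matrix(tokens, hidden, rng), random_matrix(hidden, tokens, rng),
    random_matrix(channels, hidden, rng), random_matrix(hidden, channels, rng),
)
print(len(z), len(z[0]))  # same shape as the input: 4 3
```

Note that the first token-mixing weight matrix has one row per token. This is the fixed input resolution constraint mentioned above: changing the number of patches changes the shape of the weights. Swapping the token-mixing step for self-attention or a depthwise convolution while keeping the normalization, residual connections and channel MLP gives the other members of the MetaFormer family.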

Related products:AI Software

Interview with Torsten Hoefler, Axelera AI’s Scientific Advisor

Evangelos Eleftheriou | CTO at AXELERA AI

Our CTO had a chat with Torsten Hoefler to get to know our new scientific advisor better.

Evangelos: Could you please introduce yourself and your field of expertise?

Torsten: My background is in High-Performance Computing on Supercomputers. I worked on large-scale supercomputers, networks, and the Message Passing Interface specification. More recently, my main research interests are in the areas of learning systems and their applications, especially in the climate simulation area.

E: Where is the focus of your research interests at the moment?

T: I try to understand how to improve the efficiency of deep learning systems (both inference and training), ranging from the smallest portable devices to the largest supercomputers. I especially like the application of such techniques for predicting the weather or future climate scenarios.

E: What do you see as the greatest challenges in data-centric computing in the current hardware and software landscape?

T: We need a fundamental shift of thinking, starting from algorithms, where we teach and reason about operational complexity. We need to seriously start thinking about data movement. From this algorithmic base, the data-centric view needs to percolate into programming systems and architectures. On the architecture side, we need to understand the fundamental limitations to create models to guide algorithm engineering. Then, we need to unify all of this into a convenient programming system.

E: Could you please explain the general concept of DaCe, as a generic data-centric programming framework?

T: DaCe is our attempt to capture data-centric thinking in a programming system that takes code written in Python (and other languages) and represents it as a data-centric graph. Performance engineers can then work conveniently on this representation to improve the mapping to specific devices. This ensures the highest performance.

E: DaCe also has extensions for Machine Learning (DaCeML). Where do those help? Could in-memory computing accelerators in general benefit from such a framework, and how?

T: DaCeML supports the Open Neural Network Exchange (ONNX) format and PyTorch through the ONNX exporter. It offers inference as well as training support at the highest performance using data-centric optimizations. In-memory computing accelerators can be a target for DaCe: depending on their offered semantics, a performance engineer could identify pieces of the dataflow graph to be mapped to such accelerators.

E: In which new application domains do you see data-centric computing playing a major role in the future?

T: I would assume all computations where performance or energy consumption is important, ranging from scientific simulations to machine learning and from small handheld devices to large-scale supercomputers.

E: What is your advice to young researchers in the field of data-centric optimization?

T: Learn about I/O complexity!

As Scientific Advisor, Torsten Hoefler advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Torsten’s work, please visit his biography page.

Related products:Company

Transformers in Computer Vision

Bert Moons | Director – System Architecture at AXELERA AI

Summary

Convolutional Neural Networks (CNNs) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViTs) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been asserted. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs in ImageNet validation, especially when (1) trained from scratch without distillation, (2) in the lower-accuracy <80% regime, and (3) for lower network complexities optimized for Edge devices.

Convolutional Neural Networks

Convolutional Neural Networks (CNNs) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet [1] on the ImageNet [2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet [3] and RegNet [4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet [5] or EfficientNet [6]. Typically, such networks are benchmarked and compared by training them on small images on the ImageNet data set. After this pretraining, they can be used for applications outside of image classification such as object detection, panoptic vision, semantic segmentation, or other specialized tasks.
This can be done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network to the appropriate data set and application.

A typical ResNet-style CNN is given in Figure 1-1 and Figure 1-4 (a). Typically, such networks have several features:

- They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions with building a large receptive field.
- Training is stabilized by using batch normalization and residual connections.
- Feature maps are built hierarchically by gradually reducing the spatial dimensions (W, H), finally downscaling them by a factor of 32.
- Feature maps are built pyramidally, by increasing the embedding dimensions of the layers from tens of channels in the first layers to thousands in the last.

Figure 1-1: Illustration of ResNet34 [3]

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS) [7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline [8].
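The hierarchical and pyramidal feature maps listed above can be made tangible with a short calculation. The sketch below is an illustration under stated assumptions rather than a model definition: the helper `feature_pyramid` is hypothetical, it assumes a 224×224 input, a stem that reduces resolution by a factor of 4, and the four stage widths commonly quoted for ResNet34 (64, 128, 256, 512), with each later stage halving the spatial size.

```python
def feature_pyramid(input_size, stem_stride, stage_channels):
    """Return (channels, spatial_size) per stage for a ResNet-style backbone.

    Assumes the stem reduces resolution by `stem_stride` and every stage after
    the first halves the spatial size again.
    """
    size = input_size // stem_stride
    shapes = []
    for index, channels in enumerate(stage_channels):
        if index > 0:
            size //= 2  # each later stage starts with a stride-2 layer
        shapes.append((channels, size))
    return shapes


resnet34_like = feature_pyramid(224, stem_stride=4, stage_channels=[64, 128, 256, 512])
for channels, size in resnet34_like:
    print(f"{channels:4d} channels at {size}x{size}")
print("total downscaling:", 224 // resnet34_like[-1][1])  # 32
```

The spatial size shrinks from 56×56 to 7×7 while the channel count grows from 64 to 512, which is the 32× downscaling and pyramidal widening described in the list.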

Related products: AI Software, Technology

An Interview with Marian Verhelst, Axelera AI’s Scientific Advisor

Fabrizio Del Maffeo | CEO at AXELERA AI

I met Marian Verhelst in the summer of 2019 and she immediately intrigued me with her passion and competence for computing architecture design. We immediately started a collaboration, and today she’s here with us sharing her insights on the future of computing.

Fabrizio: Can you please introduce yourself, your experience and your field of study?

Marian: My name is Marian Verhelst, and I am a professor at the MICAS lab of KU Leuven [i]. I studied electrical engineering and received my PhD in microelectronics in 2008. After completing my studies, I joined Intel Labs in Portland, Oregon, USA, and worked as a research scientist. I then became a professor at KU Leuven in 2012, focusing on efficient processing architectures for embedded sensor processing and machine learning. My lab regularly tapes out processor chips using innovative and advanced technologies. I am also active in international initiatives, organising IC conferences such as ISSCC, DATE, ESSCIRC, AICAS and more. I also serve as the Director of the tinyML Foundation. Most recently, I was honoured to receive the André Mischke YAE Prize [ii] for Science and Policy, and I have been shortlisted for the 2021 Belgium Inspiring Fifty list [iii].

F: What is the focus of your most recent research?

M: My research currently focuses on three areas. First, I am looking at implementing an efficient processor chip for embedded DNN workloads. Our latest tape-out, the Diana chip, combines a digital AI accelerator with an analogue compute-in-memory AI accelerator in a common RISC-V-based processing system. This allows the host processor to offload neural network layers to the most suitable accelerator core, depending on parallelisation opportunities and precision needs. We plan to present this chip at ISSCC 2022 [iv].

The second research area is improving the efficiency of designing and programming such processors.
We developed a new framework called the ZigZag framework [v], which enables rapid design space exploration of processor architectures and algorithm-to-processor mapping schedules for a suite of ML workloads.

My last research area is exploring processor architectures for beyond-NN workloads. Neural networks on their own cannot sufficiently perform complex reasoning, planning or perception tasks. They must be complemented with probabilistic and logic-based reasoning models. However, these models do not map well onto CPUs, GPUs, or NPUs. We are starting to develop processors and compilers for such emerging ML workloads in my lab.

F: There are different approaches and trends in new computing designs for artificial intelligence workloads: increasing the number of computing cores from a few to tens, thousands or even hundreds of thousands of small, efficient cores, as well as near-memory processing, computing-in-memory, or in-memory computing. What is your opinion about these architectures? What do you think is the most promising approach? Are there any other promising architecture developments?

M: Having seen the substantial divergence in ML algorithmic workloads and the general trends in the processor architecture field, I am a firm believer in very heterogeneous multi-core solutions. This means that future processing systems will have a large number of cores with very different natures. Eventually, such cores will include (digital) in- or near-memory processing cores, coarse-grain reconfigurable systolic arrays and more traditional flexible SIMD cores. Of course, the challenge is to build compilers and mappers that can grasp all opportunities from such heterogeneous and widely parallel fabrics. To ensure excellent efficiency and memory capabilities, it will be especially important to exploit the cores in a streaming fashion, where one core immediately consumes the data produced by another.

F: Computing design researchers are working on low-power and ultra-low-power designs, using metrics such as TOPS/W as a key performance indicator and low-precision networks trained mainly on small datasets. However, we also see neural network research increasingly focusing on large networks, particularly transformer networks that are gaining traction in field deployment and seem to deliver very promising results. How can we reconcile these trends? How far are we from running these networks at the edge? What kind of architecture do you think can make this happen?

M: There will always be people working to improve energy efficiency for the edge and people pushing for throughput across the stack. The latter typically starts in the data centre but gradually trickles down to the edge, where improved technology and architectures enable better performance. It is never a story of choosing one option over another. Over the past years, developers have introduced increasingly distributed solutions, dividing the workload between the edge and the data centre. The vital aspect of these solutions is that they need to work with scalable processor architectures. Developers can deploy these architectures with a smaller core count at the extreme edge and scale up to larger core numbers for the edge and a massive core count for the data centre. This will require processing architectures and memory systems that rely on a mesh-type distributed processor fabric, rather than being centrally controlled by a single host.

F: How do you see the future of computing architecture for the data centre? Will it be dominated by standard computing, GPU, heterogeneous computing, or something else?

M: As I noted earlier, I believe we will see an increasing amount of heterogeneity in the field. The data centre will host a wide variety of processors and differently-natured accelerator arrays to cover the widely different workloads in the most efficient manner possible.
As a hardware architect, the exciting and still open challenge is what library of (configurable) processing tiles can cover all workloads of interest. Most intriguing is that, due to the slow nature of hardware development, this processor library should cover not only the algorithms we know of today but also those that researchers will develop in the years to come.

As Scientific Advisor, Marian Verhelst advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Marian’s work, please visit her biography page.

References

[i] https://www.esat.kuleuven.be/micas/
[ii] https://yacadeuro.org/fifth-edition-of-the-annual-andre-mischke-yae-prize-awarded-to-marian-verhelst/
[iii] https://belgium.inspiringfifty.org/
[iv] https://www.isscc.org/program-overview
[v] https://github.com/ZigZag-Project/zigzag

Related products:Company

What’s Next for Data Processing? A Closer Look at In-Memory Computing

Evangelos Eleftheriou | CTO at AXELERA AI

Technology is progressing at an incredible pace and no technology is moving faster than Artificial Intelligence (AI). Indeed, we are on the cusp of an AI revolution which is already reshaping our lives. One can use AI technologies to automate or augment humans, with applications including autonomous driving, advances in sensory perception and the acceleration of scientific discovery using machine learning. In the past five years, AI has become synonymous with Deep Learning (DL), another area seeing fast and dramatic progress. We are at a point where Deep Neural Networks (DNNs) for image and speech recognition can provide accuracy on par with, or even better than, that achieved by the human brain.

Most of the fundamental algorithmic developments around DL go back decades. However, the recent success has stemmed from the availability of large amounts of data and immense computing power for training neural networks. From around 2010, the exponential increase of single-precision floating point operations offered by Graphics Processing Units (GPUs) ran in parallel to the explosion of neural network sizes and computational requirements. Specifically, the amount of compute used in the largest AI training runs has doubled every 3.5 months during the last decade. At the same time, the size of state-of-the-art models increased from 26M weights for ResNet-50 to 1.5B for GPT-2. This phenomenal increase in model size is reflected directly in the cost of training such complex models. For example, the cost of training the bidirectional transformer network BERT, for Natural Language Processing applications, is estimated at $61,000, whereas training XLNet, which outperformed BERT, costs about nine times as much.
However, a major concern is not only the cost associated with the substantial energy consumption needed to train complex networks but also the significant environmental impact incurred in the form of CO2 emissions.

As the world looks to reduce carbon emissions, there is an even greater need for higher performance with lower power consumption. This is true not only for AI applications in the data center, but also at the Edge, which is where we expect the next revolution to take place. AI at the Edge refers to processing of data where it is collected, as opposed to requiring data to be moved to separate processing centers. There is a wealth of applications at the Edge: AI for mobile devices, including authentication, speech recognition, and mixed/augmented reality; AI for embedded processing for IoT devices, including smart cities and homes, or embedded processing for prosthetics, wearables, and personalized healthcare; and AI for real-time video analytics for autonomous navigation and control. However, these embedded applications are all energy and memory constrained, meaning energy efficiency matters even more at the Edge. The end of Moore’s law and of Dennard scaling is compounding these challenges. Thus, there are compelling motivations to explore novel computing architectures with inspiration from the most efficient computer on the planet, the human brain.

Traditional Computing Systems: Current State of Play

Traditional digital computing systems, based on the von Neumann architecture, consist of separate processing and memory units. Therefore, performing computations typically results in a significant amount of data being moved back and forth between the physically separated memory and processing units. This data movement costs latency and energy and creates an inherent performance bottleneck.
The latency associated with the growing disparity between the speed of memory and processing units, commonly known as the memory wall, is one example of a crucial performance bottleneck for a variety of AI workloads. Similarly, the energy cost associated with shuttling data represents another key challenge for computing systems that are severely power limited due to cooling constraints, as well as for the plethora of battery-operated mobile devices. In general, the energy cost of multiplying two numbers is orders of magnitude lower than that of accessing numbers from memory. Therefore, it is clear to AI developers that there is a need to explore novel computing architectures that provide better collocation of processing and memory subsystems. One suggested concept in this area is near-memory computing, which aims to reduce the physical distance and time needed to access memory. This approach heavily leverages recent advances made in die stacking and new technologies such as the hybrid memory cube (HMC) and high bandwidth memory (HBM).

In-Memory Computing: A Radical New Approach

In-memory computing is a radically different approach to data processing, in which certain computational tasks are performed in place in the memory itself (Sebastian 2020). This is achieved by organizing the memory as a crossbar array and by exploiting the physical attributes of the memory devices. The peripheral circuitry and the control logic play a key role in creating what we call an in-memory computing (IMC) unit or computational memory unit (CMU). In addition to overcoming the latency and energy issues associated with data movement, in-memory computing has the potential to significantly improve the computational time complexity associated with certain computational tasks.
This is primarily a result of the massive parallelism created by a dense array of millions of memory devices simultaneously performing computations.

For instance, crossbar arrays of such memory devices can be used to store a matrix and perform matrix-vector multiplications (MVMs) at constant O(1) time complexity without intermediate movement of data. The efficient matrix-vector multiplication via in-memory computing is very attractive for training and inference of deep neural networks, particularly for inference applications at the Edge where high energy efficiency is critical. In fact, matrix-vector multiplications constitute 70-90% of all deep learning operations. Thus, applications requiring numerous AI components, such as computer vision, natural language processing, reasoning and autonomous driving, can explore this new technology in new and innovative ways. Novel dedicated hardware with massive on-chip memory, part of which is enhanced with in-memory computation capabilities, could lead to very efficient training and inference engines for ultra-large neural networks comprising potentially billions of synaptic weights.

The core technology of IMC is memory. In general, there are two classes of memory devices. The conventional one, in which information is stored in the presence or absence of charge, includes dynamic random-access memory (DRAM), static random-access memory (SRAM) and Flash memory. There is also an emerging class of memory devices, in which information is stored in terms of the atomic arrangements within nanoscale volumes of materials, as opposed to charge on a capacitor. Generally speaking, one atomic configuration corresponds to one logic state, and the other corresponds to another logic state. These differences in atomic configuration manifest as a change in resistance, and thus these devices are collectively called resistive memory devices or memristors.
Traditional and emerging memory technologies can perform a range of in-memory logic and arithmetic operations. In addition, SRAM, Flash and all memristive memories can also be used for MVM operations.

The most important characteristics of a memory device are its read and write times, that is, how fast a device can store and retrieve information. Equally important characteristics are the cycling endurance, which refers to the number of times a memory device can be switched from one state to the other, the energy required to store information in a memory cell, and the size of the memory cell. Table 1 compares the traditional DRAM, SRAM and NOR Flash with the most popular emerging resistive-memory technologies, such as spin-transfer torque RAM (STT-RAM), phase-change memory (PCM) and resistive RAM (ReRAM).

Table 1 – Comparing different memory technologies. Sources: (B. Li 2019), (Marinella 2013)
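As a rough illustration of how a crossbar of resistive devices can carry out the matrix-vector multiplication discussed in this post, consider the idealized model below. It rests on assumptions that are not spelled out in the text: each device is treated as a pure conductance, input values are applied as row voltages, Ohm’s law gives the current through each device, and Kirchhoff’s current law sums the currents on each column wire. Wire resistance, device noise, converter precision and the encoding of signed weights are all ignored, and the function name `crossbar_mvm` is made up for this sketch.

```python
def crossbar_mvm(conductances, voltages):
    """Idealized crossbar: row i is driven with voltages[i]; column j sums currents.

    Each device contributes G * V (Ohm's law) and the contributions on a column
    wire add up (Kirchhoff's current law), so currents[j] = sum_i G[i][j] * V[i].
    """
    n_cols = len(conductances[0])
    currents = [0.0] * n_cols
    # In hardware all devices conduct at the same time; the loops only emulate this.
    for g_row, v in zip(conductances, voltages):
        for j, g in enumerate(g_row):
            currents[j] += g * v
    return currents


# Hypothetical 2x2 array, conductances and voltages in arbitrary units.
stored_matrix = [[1.0, 2.0],
                 [3.0, 4.0]]
input_vector = [0.5, 1.0]
print(crossbar_mvm(stored_matrix, input_vector))  # [3.5, 5.0]
```

In software the two nested loops cost one multiply-accumulate per device. In the physical array every device contributes its current simultaneously, which is the intuition behind the constant-time MVM claim, and the stored matrix never leaves the memory.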

Related products:Technology

Ten questions with Axelera AI’s Scientific Advisor Luca Benini

Fabrizio Del Maffeo | CEO at AXELERA AI

Professor Luca Benini is one of the foremost authorities on computer architecture, embedded systems, digital integrated circuits, and machine learning hardware. We’re honored to count him as one of our scientific advisors. Prof. Benini kindly agreed to answer a few questions for our followers on his research and the future of artificial intelligence.

For our readers who are unfamiliar with your work, can you give us a brief summary of your career?

I am the chair of Digital Circuits and Systems at ETHZ, and I am a full professor at the Università di Bologna. I received a PhD from Stanford University, and I have been a visiting professor at Stanford University, IMEC, and EPFL. I also served as chief architect at STMicroelectronics France.

My research interests are in energy-efficient parallel computing systems, smart sensing micro-systems and machine learning hardware. I’ve published more than 1,000 peer-reviewed papers and five books.

I am a Fellow of the IEEE and of the ACM, and a member of the Academia Europaea. I’m the recipient of the 2016 IEEE CAS Mac Van Valkenburg Award, the 2019 IEEE TCAD Donald O. Pederson Best Paper Award, and the ACM/IEEE A. Richard Newton Award 2020.

Which research subjects are you exploring?

I am extremely interested in energy-efficient hardware for machine learning and data-intensive computing. More specifically, I am passionate about exploring the trade-off between efficiency and flexibility. While everybody is aware of the fact that you can enormously boost efficiency with super-specialization, a super-specialized architecture will be narrow and short-lived, so we need flexibility.

Artificial Intelligence requires a new computing paradigm and new data-driven architectures with high parallelisation.
Can you share with us what you think the most promising directions are and what kind of new applications they can unleash?I believe that the most impactful innovations are those that improve efficiency without over-specialization. For instance, using low bit-width representations reduces energy, but you need to have “transprecision,” i.e., the capability to dynamically adjust numerical precision. Otherwise, you won’t be accurate enough on many inference/training tasks, and then your scope of application may narrow down too much.Another high-impact direction is related to minimising switching activity across the board. For instance, systolic arrays are very scalable (local communication patterns) but have huge switching activity related to local register storage. In-memory computing cores can do better than systolic arrays, but they are not a panacea. In general, we need to design architectures where we reduce the cost related to moving data in time and space. Can you share more with us about the tradeoffs and benefits of analog computing versus digital computing and where they can work together?Analog computing is a niche, but a very important one. Ultimately, we can implement multiply-accumulate arrays very efficiently with analog computation, possibly beating digital logic, but it’s a tough fight. You need to do everything right (from interface and core computation circuits to precision selection to size).The critical point is to design the analog computing arrays in a way that can be easily ported to different technology targets without complete manual redesign. I view an analog computing core as a large-scale “special function unit” that needs to be efficiently interfaced with a digital architecture. So, it’s a “digital on top” design, with some key analog cores, that can win.Our sector has a prevailing opinion that Moore’s Law is dead. 
Do you agree, and how can we increase computing density?The “traditional” Moore’s Law is dead, but scaling is fully alive and kicking through a number of different technologies — 2.5D, 3D die stacking, monolithic 3D, heterogeneous 3D, new electron devices, optical devices, quantum devices and more. This used to be called “More-than-Moore,” but I think it’s now really the cornerstone of scaling compute density – the ultimate goal. You are a very important contributor to the RISC-V community with your PULP platform, widely used in research and commercial applications. Why and when did you start the project, and how do you see it evolving in the next ten years?I started PULP because I was convinced that the traditional closed-source computing IP market, and even more proprietary ISAs, were stifling innovation in many ways. I wanted to create a new innovation ecosystem where research could be more impactful and startups could more easily be created and succeed. I think I was right. Now the avalanche is in motion. I am sure that the open hardware and open ISA revolution will continue in the next ten years and change the business ecosystem, starting from more fragmented markets (e.g., IoT, Industrial) and then percolating to more consolidated markets (mobile, cloud). Can Europe play a leading role in the worldwide RISC-V community?The EU can play a leading role. All the leading EU companies in the semiconductor business are actively exploring RISC-V, not just startups and academia. Of course, adoption will come in waves, but I think that some of the markets where the EU has strong leadership (automotive, IoT) are ripe for RISC-V solutions — as opposed to markets where the USA and Asia lead, such as mobile phones and servers which are much more consolidated. There is huge potential for the European industry in leveraging RISC-V. 
What is the position of European universities and research centres versus American and Chinese in computing technologies – is there a gap, and how can the public sector help?There is a gap, but it’s not quality; it’s in quantity. The number of researchers in computer architecture, VLSI, analog and digital circuits and systems in the EU is small in relation to USA and Asia. Unfortunately, these “demographic factors” take time to change. So really, the challenge is on academics to increase the throughput. Industry can play a role, too – for instance, leading companies can help found “innovation hubs” across Europe to increase our research footprint.Companies can also help make Europe more attractive for jobs. Now that smart remote working is mainstream, people are not forced to move elsewhere. Good students in — for example — Italian or Spanish universities interested in semiconductors can find great jobs without moving. I am not saying that moving is bad, but if there are choices that do not imply moving away, more people will be attracted to these semiconductor companies and roles. Is the European Chips Act powerful enough to change the trajectory of Europe within the global semiconductor ecosystem?It helps, but it’s not enough. There is no way to pump enough public money to make an EU behemoth at the scale of TSMC. But, if this money is well spent, it can “change the derivative” and create the conditions for much faster growth. Over the last decade, European semiconductor companies didn’t bring any cutting-edge computing technology to market. Is this changing, and do you think European startups can play a role in this change?I think that some large EU companies are, by nature, “competitive followers,” so disruptive innovation is not their preferred approach, even though of course there are exceptions. The movement will come from startups, if they can attract the growth and funding of the larger companies. 
The emergence of a few European unicorns, as opposed to many small startups that just survive, will help Europe strengthen its position in the semiconductor market.
