Keep up to date with the latest news, information and updates from the Axelera team
Evangelos Eleftheriou | CTO at AXELERA AI

Our CTO had a chat with Torsten Hoefler to scratch the surface and get to know our new scientific advisor better.

Evangelos: Could you please introduce yourself and your field of expertise?

Torsten: My background is in High-Performance Computing on supercomputers. I worked on large-scale supercomputers, networks, and the Message Passing Interface specification. More recently, my main research interests are in the areas of learning systems and their applications, especially in climate simulation.

E: Where is the focus of your research interests currently?

T: I try to understand how to improve the efficiency of deep learning systems (both inference and training), ranging from the smallest portable devices to the largest supercomputers. I especially like the application of such techniques to predicting the weather or future climate scenarios.

E: What do you see as the greatest challenges in data-centric computing in the current hardware and software landscape?

T: We need a fundamental shift of thinking, starting from algorithms, where we teach and reason about operational complexity. We need to seriously start thinking about data movement. From this algorithmic base, the data-centric view needs to percolate into programming systems and architectures. On the architecture side, we need to understand the fundamental limitations to create models to guide algorithm engineering. Then, we need to unify this all into a convenient programming system.

E: Could you please explain the general concept of DaCe as a generic data-centric programming framework?

T: DaCe is our attempt to capture data-centric thinking in a programming system that takes code written in Python (and other languages) and represents it as a data-centric graph. Performance engineers can then work conveniently on this representation to improve the mapping to specific devices. This ensures the highest performance.

E: DaCe also has extensions for Machine Learning (DaCeML). Where do those help? Could in-memory computing accelerators in general benefit from such a framework, and how?

T: DaCeML supports the Open Neural Network Exchange (ONNX) format and PyTorch through the ONNX exporter. It offers inference as well as training support at the highest performance using data-centric optimizations. In-memory computing accelerators can be a target for DaCe: depending on the semantics they offer, a performance engineer could identify pieces of the dataflow graph to be mapped to such accelerators.

E: In which new application domains do you see data-centric computing playing a major role in the future?

T: I would assume all computations where performance or energy consumption is important, ranging from scientific simulations to machine learning and from small handheld devices to large-scale supercomputers.

E: What is your advice to young researchers in the field of data-centric optimization?

T: Learn about I/O complexity!

As Scientific Advisor, Torsten Hoefler advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Torsten’s work, please visit his biography page.
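The advice in the interview to think about data movement and I/O complexity can be made concrete with a small experiment. The sketch below is not DaCe code and is not taken from the interview: it is a toy model, assuming a fully associative LRU cache that holds a fixed number of matrix elements, and it counts how many elements must be fetched from slow memory for a naive and a tiled matrix multiplication. The names `LRUCache`, `naive_matmul_misses` and `tiled_matmul_misses`, as well as the sizes used, are illustrative choices.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative cache that only counts misses (hypothetical model)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.misses = 0

    def touch(self, key):
        if key in self.lines:
            self.lines.move_to_end(key)  # hit: mark as most recently used
            return
        self.misses += 1  # miss: one element moved from slow memory
        self.lines[key] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used element


def naive_matmul_misses(n, capacity):
    """Count misses for the textbook i-j-k loop order of C = A * B."""
    cache = LRUCache(capacity)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                cache.touch(("C", i, j))
                cache.touch(("A", i, k))
                cache.touch(("B", k, j))
    return cache.misses


def tiled_matmul_misses(n, tile, capacity):
    """Count misses when the same operations are regrouped into tile x tile blocks."""
    cache = LRUCache(capacity)
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            cache.touch(("C", i, j))
                            cache.touch(("A", i, k))
                            cache.touch(("B", k, j))
    return cache.misses


if __name__ == "__main__":
    n, tile, capacity = 16, 4, 64  # illustrative sizes only
    print("naive misses:", naive_matmul_misses(n, capacity))
    print("tiled misses:", tiled_matmul_misses(n, tile, capacity))
```

Both variants perform exactly the same multiply-accumulate operations, so their operational complexity is identical; only the order of memory accesses, and therefore the amount of data moved, differs. That difference is what a data-centric view tries to expose and optimize.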
Bert Moons | Director – System Architecture at AXELERA AI

Summary

Convolutional Neural Networks (CNN) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViT) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, in which embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been asserted. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs in ImageNet validation, especially (1) when trained from scratch without distillation, (2) in the lower-accuracy (<80%) regime, and (3) for lower network complexities optimized for Edge devices.

Convolutional Neural Networks

Convolutional Neural Networks (CNN) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] and RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications outside of image classification such as object detection, panoptic vision, semantic segmentation, or other specialized tasks.
This can be done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network to the appropriate data set and application.

A typical ResNet-style CNN is given in Figure 1-1 and Figure 1-4 (a). Typically, such networks have several features:

They interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions with building a large receptive field.
Training is stabilized by using batch-normalization and residual connections.
Feature maps are built hierarchically by gradually reducing the spatial dimensions (W,H), finally downscaling them by a factor of 32x.
Feature maps are built pyramidally, by increasing the embedding dimensions of the layers from the range of 10 channels in the first layers to 1000s in the last.

Figure 1-1: Illustration of ResNet34 [3]

Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baseline[8].
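The hierarchical and pyramidal structure of ResNet-style networks described above can be illustrated with a few lines of arithmetic. The sketch below walks through the stage layout of ResNet-34 as published in the ResNet paper [3] and tracks the spatial size and channel count after each stage. The 224×224 input size is the usual ImageNet setting and is an assumption here, as are the helper names `feature_map_shapes` and `total_downscale`.

```python
# Stage layout of ResNet-34 as (name, stride, output_channels).
RESNET34_STAGES = [
    ("conv1 7x7", 2, 64),
    ("max pool", 2, 64),
    ("stage 1 (3 blocks)", 1, 64),
    ("stage 2 (4 blocks)", 2, 128),
    ("stage 3 (6 blocks)", 2, 256),
    ("stage 4 (3 blocks)", 2, 512),
]


def feature_map_shapes(input_size=224, stages=RESNET34_STAGES):
    """Return (name, height_and_width, channels) after every stage."""
    size = input_size
    shapes = []
    for name, stride, channels in stages:
        size //= stride  # assumes padding keeps the size divisible by the stride
        shapes.append((name, size, channels))
    return shapes


def total_downscale(stages=RESNET34_STAGES):
    """Multiply all strides to get the overall spatial reduction factor."""
    factor = 1
    for _, stride, _ in stages:
        factor *= stride
    return factor


if __name__ == "__main__":
    for name, size, channels in feature_map_shapes():
        print(f"{name:20s} {size:3d} x {size:3d} x {channels}")
    print("total downscale:", total_downscale())
```

Under these assumptions the spatial size shrinks from 224 to 7 (a factor of 32) while the channel count grows from 64 to 512, which is the hierarchical and pyramidal pattern listed above.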
Fabrizio Del Maffeo | CEO at AXELERA AI

I met Marian Verhelst in the summer of 2019, and she immediately struck me with her passion and competence for computing architecture design. We immediately started a collaboration, and today she’s here with us sharing her insights on the future of computing.

Fabrizio: Can you please introduce yourself, your experience and your field of study?

Marian: My name is Marian Verhelst, and I am a professor at the MICAS lab of KU Leuven[i]. I studied electrical engineering and received my PhD in microelectronics in 2008. After completing my studies, I joined Intel Labs in Portland, Oregon, USA, and worked as a research scientist. I then became a professor at KU Leuven in 2012, focusing on efficient processing architectures for embedded sensor processing and machine learning. My lab regularly tapes out processor chips using innovative and advanced technologies. I am also active in international initiatives, organising IC conferences such as ISSCC, DATE, ESSCIRC, AICAS and more. I also serve as the Director of the tinyML Foundation. Most recently, I was honoured to receive the André Mischke YAE Prize[ii] for Science and Policy, and I have been shortlisted for the 2021 Belgium Inspiring Fifty list[iii].

F: What is the focus of your most recent research?

M: My research currently focuses on three areas. First, I am looking at implementing an efficient processor chip for embedded DNN workloads. Our latest tape-out, the Diana chip, combines a digital AI accelerator with an analogue compute-in-memory AI accelerator in a common RISC-V-based processing system. This allows the host processor to offload neural network layers to the most suitable accelerator core, depending on parallelisation opportunities and precision needs. We plan to present this chip at ISSCC 2022[iv].

The second research area is improving the efficiency of designing and programming such processors.
We developed a new framework called the ZigZag framework[v], which enables rapid design space exploration of processor architectures and algorithm-to-processor mapping schedules for a suite of ML workloads.

My last research area is exploring processor architectures for beyond-NN workloads. Neural networks on their own cannot sufficiently perform complex reasoning, planning or perception tasks. They must be complemented with probabilistic and logic-based reasoning models. However, these models do not map well onto CPUs, GPUs or NPUs. We are starting to develop processors and compilers for such emerging ML workloads in my lab.

F: There are different approaches and trends in new computing designs for artificial intelligence workloads: increasing the number of computing cores from a few to tens, thousands or even hundreds of thousands of small, efficient cores, as well as near-memory processing, computing-in-memory, or in-memory computing. What is your opinion about these architectures? What do you think is the most promising approach? Are there any other promising architecture developments?

M: Having seen the substantial divergence in ML algorithmic workloads and the general trends in the processor architecture field, I am a firm believer in very heterogeneous multi-core solutions. This means that future processing systems will have a large number of cores with very different natures. Eventually, such cores will include (digital) in- or near-memory processing cores, coarse-grain reconfigurable systolic arrays and more traditional flexible SIMD cores. Of course, the challenge is to build compilers and mappers that can grasp all opportunities from such heterogeneous and widely parallel fabrics. To ensure excellent efficiency and memory capabilities, it will be especially important to exploit the cores in a streaming fashion, where one core immediately consumes the data produced by another.
F: Computing design researchers are working on low-power and ultra-low-power designs, using metrics such as TOPS/W as a key performance indicator and low-precision networks trained mainly on small datasets. However, we also see neural network research increasingly focusing on large networks, particularly transformer networks that are gaining traction in field deployment and seem to deliver very promising results. How can we reconcile these trends? How far are we from running these networks at the edge? What kind of architecture do you think can make this happen?

M: There will always be people working to improve energy efficiency for the edge and people pushing for throughput across the stack. The latter typically starts in the data centre but gradually trickles down to the edge, where improved technology and architectures enable better performance. It is never a story of choosing one option over another. Over the past years, developers have introduced increasingly distributed solutions, dividing the workload between the edge and the data centre. The vital aspect of these solutions is that they need to work with scalable processor architectures. Developers can deploy these architectures with a smaller core count at the extreme edge and scale up to larger core numbers for the edge and a massive core count for the data centre. This will require processing architectures and memory systems that rely on a mesh-type distributed processor fabric, rather than being centrally controlled by a single host.

F: How do you see the future of computing architecture for the data centre? Will it be dominated by standard computing, GPU, heterogeneous computing, or something else?

M: As I noted earlier, I believe we will see an increasing amount of heterogeneity in the field. The data centre will host a wide variety of processors and differently-natured accelerator arrays to cover the widely different workloads in the most efficient manner possible.
As a hardware architect, the exciting and still open challenge is what library of (configurable) processing tiles can cover all workloads of interest. Most intriguing is that, due to the slow nature of hardware development, this processor library should cover not only the algorithms we know of today but also those that researchers will develop in the years to come.

As Scientific Advisor, Marian Verhelst advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Marian’s work, please visit her biography page.

References
[i] https://www.esat.kuleuven.be/micas/
[ii] https://yacadeuro.org/fifth-edition-of-the-annual-andre-mischke-yae-prize-awarded-to-marian-verhelst/
[iii] https://belgium.inspiringfifty.org/
[iv] https://www.isscc.org/program-overview
[v] https://github.com/ZigZag-Project/zigzag
Evangelos Eleftheriou | CTO at AXELERA AI

Technology is progressing at an incredible pace and no technology is moving faster than Artificial Intelligence (AI). Indeed, we are on the cusp of an AI revolution which is already reshaping our lives. One can use AI technologies to automate or augment humans, with applications including autonomous driving, advances in sensory perception and the acceleration of scientific discovery using machine learning. In the past five years, AI has become synonymous with Deep Learning (DL), another area seeing fast and dramatic progress. We are at a point where Deep Neural Networks (DNNs) for image and speech recognition can provide accuracy on par with, or even better than, that achieved by the human brain.

Most of the fundamental algorithmic developments around DL go back decades. However, the recent success has stemmed from the availability of large amounts of data and immense computing power for training neural networks. From around 2010, the exponential increase of single-precision floating point operations offered by Graphics Processing Units (GPUs) ran in parallel to the explosion of neural network sizes and computational requirements. Specifically, the amount of compute used in the largest AI training runs has doubled every 3.5 months during the last decade. At the same time, the size of state-of-the-art models increased from 26M weights for ResNet-50 to 1.5B for GPT-2. This phenomenal increase in model size is reflected directly in the cost of training such complex models. For example, the cost of training the bidirectional transformer network BERT, for Natural Language Processing applications, is estimated at $61,000, whereas training XLNet, which outperformed BERT, costs about nine times as much.
However, a major concern is not only the cost associated with the substantial energy consumption needed to train complex networks but also the significant environmental impact incurred in the form of CO2 emissions.

As the world looks to reduce carbon emissions, there is an even greater need for higher performance with lower power consumption. This is true not only for AI applications in the data center, but also at the Edge, which is where we expect the next revolution to take place. AI at the Edge refers to processing data where it is collected, as opposed to requiring data to be moved to separate processing centers. There is a wealth of applications at the Edge: AI for mobile devices, including authentication, speech recognition, and mixed/augmented reality; AI for embedded processing for IoT devices, including smart cities and homes, or embedded processing for prosthetics, wearables, and personalized healthcare; as well as AI for real-time video analytics for autonomous navigation and control. However, these embedded applications are all energy and memory constrained, meaning energy efficiency matters even more at the Edge. The end of Moore’s law and of Dennard scaling is compounding these challenges. Thus, there are compelling motivations to explore novel computing architectures with inspiration from the most efficient computer on the planet, the human brain.

Traditional Computing Systems: Current State of Play

Traditional digital computing systems, based on the von Neumann architecture, consist of separate processing and memory units. Therefore, performing computations typically results in a significant amount of data being moved back and forth between the physically separated memory and processing units. This data movement costs latency and energy and creates an inherent performance bottleneck.
The latency associated with the growing disparity between the speed of memory and processing units, commonly known as the memory wall, is one example of a crucial performance bottleneck for a variety of AI workloads. Similarly, the energy cost associated with shuttling data represents another key challenge for computing systems that are severely power limited due to cooling constraints as well as for the plethora of battery-operated mobile devices. In general, the energy cost of multiplying two numbers is orders of magnitude lower than that of accessing numbers from memory. Therefore, it is clear to AI developers that there is a need to explore novel computing architectures that provide better collocation of processing and memory subsystems. One suggested concept in this area is near-memory computing, which aims to reduce the physical distance and time needed to access memory. This approach heavily leverages recent advances made in die stacking and new technologies such as the hybrid memory cube (HMC) and high bandwidth memory (HBM).

In-Memory Computing: A Radical New Approach

In-memory computing is a radically different approach to data processing, in which certain computational tasks are performed in place in the memory itself (Sebastian 2020). This is achieved by organizing the memory as a crossbar array and by exploiting the physical attributes of the memory devices. The peripheral circuitry and the control logic play a key role in creating what we call an in-memory computing (IMC) unit or computational memory unit (CMU). In addition to overcoming the latency and energy issues associated with data movement, in-memory computing has the potential to significantly improve the computational time complexity associated with certain computational tasks.
This is primarily a result of the massive parallelism created by a dense array of millions of memory devices simultaneously performing computations.

For instance, crossbar arrays of such memory devices can be used to store a matrix and perform matrix-vector multiplications (MVMs) at constant O(1) time complexity without intermediate movement of data. The efficient matrix-vector multiplication via in-memory computing is very attractive for training and inference of deep neural networks, particularly for inference applications at the Edge where high energy efficiency is critical. In fact, matrix-vector multiplications constitute 70-90% of all deep learning operations. Thus, applications requiring numerous AI components, such as computer vision, natural language processing, reasoning and autonomous driving, can explore this new technology in new and innovative ways. Novel dedicated hardware with massive on-chip memory, part of which is enhanced with in-memory computation capabilities, could lead to very efficient training and inference engines for ultra-large neural networks comprising potentially billions of synaptic weights.

The core technology of IMC is memory. In general, there are two classes of memory devices. The conventional one, in which information is stored in the presence or absence of charge, includes dynamic random-access memory (DRAM), static random-access memory (SRAM) and Flash memory. There is also an emerging class of memory devices, in which information is stored in terms of the atomic arrangements within nanoscale volumes of materials, as opposed to charge on a capacitor. Generally speaking, one atomic configuration corresponds to one logic state, and the other corresponds to another logic state. These differences in atomic configuration manifest as a change in resistance, and thus these devices are collectively called resistive memory devices or memristors.
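To make the crossbar MVM described above more tangible, here is a minimal numerical sketch of an idealized resistive crossbar. It assumes the commonly used model in which each device contributes a current equal to its conductance times the applied row voltage (Ohm's law) and each column wire sums the currents of its devices (Kirchhoff's current law). Signed weights are handled with a differential pair of conductance arrays, which is one common scheme and not necessarily the one used in any particular product. The function names and the values of `w_max` and `g_max` are illustrative assumptions, and device noise, wire resistance and converter precision are ignored.

```python
def crossbar_currents(conductances, voltages):
    """Column currents of an idealized crossbar: I[j] = sum_i G[i][j] * V[i]."""
    if len(conductances) != len(voltages):
        raise ValueError("one input voltage is needed per crossbar row")
    currents = [0.0] * len(conductances[0])
    for row, v in zip(conductances, voltages):
        for j, g in enumerate(row):
            currents[j] += g * v  # device current (Ohm), summed per column (Kirchhoff)
    return currents


def split_weights(weights, w_max, g_max):
    """Map signed weights onto two non-negative conductance arrays (G+, G-)."""
    scale = g_max / w_max
    g_plus = [[max(w, 0.0) * scale for w in row] for row in weights]
    g_minus = [[max(-w, 0.0) * scale for w in row] for row in weights]
    return g_plus, g_minus


def crossbar_mvm(weights, inputs, w_max=4.0, g_max=1e-4):
    """Compute y[j] = sum_i W[i][j] * x[i] with a differential crossbar pair.

    w_max (largest weight magnitude) and g_max (largest device conductance,
    in siemens) are illustrative assumptions, not values from the article.
    """
    g_plus, g_minus = split_weights(weights, w_max, g_max)
    i_plus = crossbar_currents(g_plus, inputs)
    i_minus = crossbar_currents(g_minus, inputs)
    return [(p - m) * w_max / g_max for p, m in zip(i_plus, i_minus)]


if __name__ == "__main__":
    W = [[1.0, -2.0], [0.5, 4.0]]  # rows are inputs, columns are outputs
    print(crossbar_mvm(W, [2.0, 1.0]))
```

The nested loops here simulate sequentially what the physical array does in parallel: in hardware, all devices conduct at once and each column current is available after a single read step, which is where the O(1) time complexity mentioned above comes from.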
Traditional and emerging memory technologies can perform a range of in-memory logic and arithmetic operations. In addition, SRAM, Flash and all memristive memories can also be used for MVM operations.

The most important characteristics of a memory device are its read and write times, that is, how fast a device can store and retrieve information. Equally important characteristics are the cycling endurance, which refers to the number of times a memory device can be switched from one state to the other, the energy required to store information in a memory cell, and the size of the memory cell. Table 1 compares the traditional DRAM, SRAM and NOR Flash with the most popular emerging resistive-memory technologies, such as spin-transfer torque RAM (STT-RAM), phase-change memory (PCM) and resistive RAM (ReRAM).

Table 1 – Comparing different memory technologies. Sources: (B. Li 2019), (Marinella 2013)
Fabrizio Del Maffeo | CEO at AXELERA AI

Professor Luca Benini is one of the foremost authorities on computer architecture, embedded systems, digital integrated circuits, and machine learning hardware. We’re honored to count him as one of our scientific advisors. Prof. Benini kindly agreed to answer a few questions for our followers on his research and the future of artificial intelligence.

For our readers who are unfamiliar with your work, can you give us a brief summary of your career?

I am the chair of Digital Circuits and Systems at ETHZ, and I am a full professor at the Università di Bologna. I received a PhD from Stanford University, and I have been a visiting professor at Stanford University, IMEC and EPFL. I also served as chief architect at STMicroelectronics France.

My research interests are in energy-efficient parallel computing systems, smart sensing micro-systems and machine learning hardware. I’ve published more than 1,000 peer-reviewed papers and five books.

I am a Fellow of the IEEE and of the ACM, and a member of the Academia Europaea. I’m the recipient of the 2016 IEEE CAS Mac Van Valkenburg Award, the 2019 IEEE TCAD Donald O. Pederson Best Paper Award, and the ACM/IEEE A. Richard Newton Award 2020.

Which research subjects are you exploring?

I am extremely interested in energy-efficient hardware for machine learning and data-intensive computing. More specifically, I am passionate about exploring the trade-off between efficiency and flexibility. While everybody is aware of the fact that you can enormously boost efficiency with super-specialization, a super-specialized architecture will be narrow and short-lived, so we need flexibility.

Artificial Intelligence requires a new computing paradigm and new data-driven architectures with high parallelisation.
Can you share with us what you think the most promising directions are and what kind of new applications they can unleash?

I believe that the most impactful innovations are those that improve efficiency without over-specialization. For instance, using low bit-width representations reduces energy, but you need to have “transprecision,” i.e., the capability to dynamically adjust numerical precision. Otherwise, you won’t be accurate enough on many inference/training tasks, and then your scope of application may narrow down too much.

Another high-impact direction is related to minimising switching activity across the board. For instance, systolic arrays are very scalable (local communication patterns) but have huge switching activity related to local register storage. In-memory computing cores can do better than systolic arrays, but they are not a panacea. In general, we need to design architectures where we reduce the cost related to moving data in time and space.

Can you share more with us about the tradeoffs and benefits of analog computing versus digital computing and where they can work together?

Analog computing is a niche, but a very important one. Ultimately, we can implement multiply-accumulate arrays very efficiently with analog computation, possibly beating digital logic, but it’s a tough fight. You need to do everything right (from interface and core computation circuits to precision selection to size).

The critical point is to design the analog computing arrays in a way that can be easily ported to different technology targets without complete manual redesign. I view an analog computing core as a large-scale “special function unit” that needs to be efficiently interfaced with a digital architecture. So, it’s a “digital on top” design, with some key analog cores, that can win.

Our sector has a prevailing opinion that Moore’s Law is dead.
Do you agree, and how can we increase computing density?

The “traditional” Moore’s Law is dead, but scaling is fully alive and kicking through a number of different technologies: 2.5D, 3D die stacking, monolithic 3D, heterogeneous 3D, new electron devices, optical devices, quantum devices and more. This used to be called “More-than-Moore,” but I think it’s now really the cornerstone of scaling compute density, which is the ultimate goal.

You are a very important contributor to the RISC-V community with your PULP platform, widely used in research and commercial applications. Why and when did you start the project, and how do you see it evolving in the next ten years?

I started PULP because I was convinced that the traditional closed-source computing IP market, and even more proprietary ISAs, were stifling innovation in many ways. I wanted to create a new innovation ecosystem where research could be more impactful and startups could more easily be created and succeed. I think I was right. Now the avalanche is in motion. I am sure that the open hardware and open ISA revolution will continue in the next ten years and change the business ecosystem, starting from more fragmented markets (e.g., IoT, Industrial) and then percolating to more consolidated markets (mobile, cloud).

Can Europe play a leading role in the worldwide RISC-V community?

The EU can play a leading role. All the leading EU companies in the semiconductor business are actively exploring RISC-V, not just startups and academia. Of course, adoption will come in waves, but I think that some of the markets where the EU has strong leadership (automotive, IoT) are ripe for RISC-V solutions, as opposed to markets where the USA and Asia lead, such as mobile phones and servers, which are much more consolidated. There is huge potential for the European industry in leveraging RISC-V.
What is the position of European universities and research centres versus American and Chinese ones in computing technologies? Is there a gap, and how can the public sector help?

There is a gap, but it’s not in quality; it’s in quantity. The number of researchers in computer architecture, VLSI, analog and digital circuits and systems in the EU is small relative to the USA and Asia. Unfortunately, these “demographic factors” take time to change. So really, the challenge is on academics to increase the throughput. Industry can play a role, too. For instance, leading companies can help found “innovation hubs” across Europe to increase our research footprint.

Companies can also help make Europe more attractive for jobs. Now that smart remote working is mainstream, people are not forced to move elsewhere. Good students at, for example, Italian or Spanish universities who are interested in semiconductors can find great jobs without moving. I am not saying that moving is bad, but if there are choices that do not imply moving away, more people will be attracted to these semiconductor companies and roles.

Is the European Chips Act powerful enough to change the trajectory of Europe within the global semiconductor ecosystem?

It helps, but it’s not enough. There is no way to pump enough public money to make an EU behemoth at the scale of TSMC. But, if this money is well spent, it can “change the derivative” and create the conditions for much faster growth.

Over the last decade, European semiconductor companies didn’t bring any cutting-edge computing technology to market. Is this changing, and do you think European startups can play a role in this change?

I think that some large EU companies are, by nature, “competitive followers,” so disruptive innovation is not their preferred approach, even though of course there are exceptions. The movement will come from startups, if they can attract the growth and funding of the larger companies.
The emergence of a few European unicorns, as opposed to many small startups that just survive, will help Europe strengthen its position in the semiconductor market.