The world has been chaotic lately: market swings, tariffs, companies being acquired (Kinara by NXP), other companies refusing acquisition (Furiosa allegedly declined an $800M buyout from Meta), DeepSeek making everyone question the future of closed AI models and model scaling, while OpenAI's Sam Altman committed to releasing an "open-weight" AI model this summer, something no one thought would happen. As I reflected on all of this, and on Axelera's place in it, I found myself incredibly grateful. Despite everything happening around us, our team has been focusing on what we can control: solving customer problems, building world-class technology, and partnering with some of the world's best technology providers. There is so much to be thankful for, and I want to share some of these reasons with you.

We have seen amazing progress towards our vision of bringing artificial intelligence to everyone, to truly democratize what could be the most revolutionary technology we have seen in our lifetime. I am proud to share the latest milestones as we continue to drive innovation, growth, and the democratization of AI. The milestones can be divided into five main categories:

1. Product & Commercial traction
2. Funding
3. Partnerships
4. Democratization of our technology
5. Supporting Europe's audacious sovereign initiatives

Product & Commercial traction

Our first products, the Metis AIPU and the Voyager SDK, are both shipping to customers in production, and companies or developers wanting to test can now purchase from our very own webstore and get the hardware in their hands in a matter of days. We also published the Voyager SDK on GitHub and launched our Model Zoo.

We recently launched our AI single-board computer, which integrates a powerful Rockchip RK3588 and our Metis™ AI processing unit on the same board: you can get a high-end edge AI computer with 210 TOPS of AI computing power in a standard mini-ITX form factor! You can pre-order at a special discounted price. We are also preparing for the launch of a powerful PCIe card with four Metis™ AI processing units. Stay tuned for the pre-order announcement.

Customers like duagon joined us at Embedded World to showcase how our Metis product, paired with their industrial solutions, can help improve railway operations. We are continuing to partner with the channel ecosystem too: we are proud to be working with Azulle and Arcobel, and we look forward to more announcements in this space soon.

Order directly from the Axelera AI store

Funding

Axelera AI recently secured up to €61.6 million in funding from the EuroHPC Joint Undertaking and member states to develop Titania™, a high-performance, energy-efficient, and scalable AI inference chiplet. This funding, part of the Digital Autonomy with RISC-V for Europe (DARE) project, will support the development of a very high-performance, high-efficiency chiplet based on Axelera AI's innovative Digital In-Memory Computing (D-IMC) architecture, which enables near-linear scalability from the edge to the cloud.

We were also selected for the EIC Step Up program, which will provide investments of between €10 million and €30 million per company, aiming to leverage private co-investment and achieve financing rounds of €50 to €150 million or more.
What an honor to be selected from the 34 applicants to receive this additional equity investment! With this new funding, Axelera AI has raised over $225 million in just three years, further solidifying our position as a leader in AI hardware acceleration technology.

Axelera AI secures grant to develop a scalable AI chiplet

Partnerships

Between Embedded World, ISC West, and CES, we have been proud to showcase our work with partners like SeeChange, Lenovo, Dell, Aetina, Advantech, Seco, duagon, and many others!

We recently demonstrated running real-time analytics on an 8K video stream. When we took this demo to ISC West, I was reminded of how our industry-standard form-factor approach is helping make integration simple for customers. The quick story: our demo systems were caught in US customs coming from Amsterdam (not for tariffs 😉), so the team purchased a gaming PC off the shelf in Las Vegas, borrowed some 8K cameras from our friends at Axis, and was able to showcase the powerful capabilities of Metis! For more details, please see the blog.

At Embedded World in Nuremberg, duagon and Axelera AI showcased two pre-series products with integrated Metis™ AIPUs, highlighting the potential of high-performance AI applications at the embedded level. The AIPU modules, which leverage digital in-memory computing, deliver exceptional performance per watt for inference workloads, making them ideal for parallel processing of visual information. The two products, the Box PC BL74A and the CompactPCI card G506A, are now entering the test phase and will be available as standard products with embedded AI capabilities once certified.

Another highlight is the recent announcement with the European Space Agency (ESA) to bring our Metis inference acceleration platforms to space, supporting the ESA's mission to protect the planet, explore the universe, and strengthen European autonomy. The partnership will enable the ESA to leverage Axelera's sovereign technology and long-term availability to deliver high-performance, low-power AI capabilities in space, supporting missions that may last for years or even decades. This collaboration marks an exciting step forward in the intersection of AI and space exploration, empowering scientists to unravel some of the universe's biggest mysteries.

We were also delighted to see Thales in France showcase what they are doing with Metis as they work to keep humans in the loop of their AI solutions.

Metis in space

Democratization of AI with Axelera

Axelera AI has furthered its strategy to democratize artificial intelligence everywhere through a strategic partnership with Arduino, the global leader in open-source hardware and software. The collaboration combines Axelera AI's Metis™ AI Platform with Arduino's Portenta to provide customers with easy-to-use hardware and software solutions for AI innovation. This partnership enables users to dictate their own AI journey with tools that deliver high performance and usability at a fraction of the cost and power of other solutions available today.

Arduino is one of the world's largest providers of hardware and software for developers, with a massive user base of over 30 million registered users worldwide. This community of innovators has created an astonishing 100,000+ projects on the Arduino platform, ranging from simple circuits to complex robotics, AI, and IoT devices. Additionally, Arduino has sold over 20 million boards, a testament to the platform's popularity and versatility.
This partnership will truly democratize our technology.

Additionally, Axelera's technology is now also available for purchase online. Yes, we opened our very own online store. What does a full-fledged Axelera customer experience look like? Customers can download our Voyager SDK and read the software documentation. They can buy our boards and systems with a single click, and then get support and documentation while interacting with other developers right here on the Axelera AI Community. To date we have shipped to a dozen countries, and we look forward to hearing how these developers use their Metis systems!

Metis and Portenta, together

Supporting Europe's Technology Sovereignty

In addition to partnering with EuroHPC and the DARE project, Axelera has been invited to participate in a number of incredibly important discussions to advance the European technology landscape. We are proud to participate and share our expertise with the government of France through the AI Action Summit, with the Netherlands at the State of Dutch Tech, and across Europe through the D9+. Most recently, I was honored and humbled to participate in the launch of the AI Continent Action Plan in Brussels. It was a great discussion, full of valuable insights that make me genuinely optimistic about the future of European technology. A few key takeaways:

- With the first AI Factories already operational, and the launch of the AI Gigafactories call for interest, Europe is showing that when we work together, we can move fast and decisively.
- Across many parts of the AI value chain, Europe has players delivering cutting-edge solutions. With stronger collaboration and vertical integration, from hardware to applications, European AI can win in key sectors.
- European leaders increasingly recognize the importance of stimulating demand for "made in Europe" products, both in public procurement and beyond.
- Europe has plenty of capital. Now is the time to unleash it and direct it towards strategic sectors like AI and deep tech.
- The European defense market is becoming a strategic growth driver for our deep tech sector.
- Many EuroHPC Joint Undertaking (EuroHPC JU) centers are open to testing and deploying experimental European hardware and software solutions, helping our deep tech startups scale. They are also giving startups simplified access to their infrastructure, which is strategic for competing with well-funded overseas companies.
- A true European single market is essential to build the first European trillion-euro company.

The combination of all of these discussions makes me incredibly excited. The current global turmoil has served as a wake-up call for Europe, and we are standing united to be successful.

Summary

In a recent all-company meeting, I told employees that "the best is yet to come." As I look back on the past period, I'm pleased to say that we've successfully set up our chess pieces on the board. We've made strategic moves to position ourselves for success, and our team is now well-equipped to make the most of the opportunities ahead. Just as a chess player must carefully plan their opening moves to set themselves up for checkmate, we've taken the time to lay the groundwork for our future growth and success. With our pieces in place, we're now poised to make the moves that will take us to the next level.

It's still day one, and the best is yet to come!
What happens when you take one of the most demanding computer vision models, push it to 8K resolution, and run it live on the edge, right in the middle of the world's biggest security tech trade show? At ISC West 2025, that's exactly what we did. Here's how.

Just a few months ago at CES I was speaking with a nationwide retailer in the process of upgrading its store camera systems to 4K. We discussed its AI strategy, given that many state-of-the-art models are still trained at HD resolutions or less. The reasons are that training at higher resolutions is more computationally expensive, and labelling high-resolution datasets is resource intensive. Yet the point of inferencing at high resolution is often simply to increase the distance from the camera at which objects can be detected, rather than anything related to the model's fundamental accuracy.

How We Cracked High-Resolution Inferencing at the Edge

With conventional inferencing, high-resolution video is downsized to the native model input resolution, producing a loss of information prior to inferencing. Tiling techniques such as SAHI mitigate this loss by subdividing each input image into a grid, running the model on each tile individually, and then reconstructing all detections with respect to their position in the original image.

A key Axelera AI differentiator is the ability to rapidly and efficiently scale up inferencing by using multiple cores and chips. So we decided to showcase the popular, yet computationally demanding, YOLOv8l model running on Metis with an IP camera at 8K resolution. Developing this capability, not just as a demo but as a general-purpose SDK feature that anyone can use with their own models, is technically quite challenging. YOLOv8l is a large model with 43.7M parameters, compared to the industry benchmark SSD-MobileNetv2, which has 3.4M parameters; it can take over 100 times longer to run, even before tiling. YOLOv8l's native input size is 640x640, which subdivides an input video at 8K resolution (7680x4320) into a grid of 12x7 tiles, allowing for some vertical overlap. In addition, to ensure accurate detection of objects spanning multiple tiles, the original, downsized image is also used as a model input, for a total of 85 parallel streams.

This processing must be split efficiently end-to-end between the host processor and the Metis accelerator, with the host preparing vast amounts of camera data for inferencing. These tasks include:

- Color conversion
- Scaling
- Letterboxing
- Tensor conversion
- Post-processing

The last of these applies algorithms such as non-maximum suppression to the output, removing duplicate detections between tiles.

Live feed from the 8K camera with real-time object detection. A lot of it!

Real-World Deployment, Real-World Results

For the ISC West demo, we placed an 8K camera four meters above our booth (a huge thank you to Axis, who loaned us a Q1809-LE 8K bullet camera when ours got stuck in US customs). From there, it could survey a section of the convention center floor, accurately reflecting how these cameras are actually being deployed in venues, stadiums, airports and more.

At this distance we found the optimal tile size to be 1280x1280, with each tile capturing a range of people and objects on the show floor. Using an Intel Core i9-based PC and two Metis cards, we were able to detect objects accurately with YOLOv8l at a rate of 23 frames per second. This equates to processing around 300 1280x1280 tiles per second.
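To make the tiling arithmetic concrete, here is a small back-of-the-envelope sketch in plain Python. It is independent of the Voyager SDK (the helper function is an illustrative assumption, not SDK code) and simply reproduces the grid sizes quoted above.

```python
import math

def tile_grid(frame_w, frame_h, tile):
    """Minimum number of (possibly overlapping) square tiles per axis needed
    to cover a frame; when an axis is not an exact multiple of the tile size,
    the extra tile overlaps its neighbours so the grid still fits the frame."""
    def axis_count(length):
        return max(1, math.ceil(length / tile))
    return axis_count(frame_w), axis_count(frame_h)

# 8K frame with YOLOv8l's native 640x640 input:
# 7680 / 640 = 12 columns exactly, 4320 / 640 = 6.75 -> 7 rows with
# some vertical overlap so the last row still lies inside the frame.
cols, rows = tile_grid(7680, 4320, 640)
streams = cols * rows + 1  # plus the downsized full frame for large objects
print(cols, rows, streams)  # -> 12 7 85
```

Changing the tile size simply changes the grid dimensions, and the number of parallel streams the pipeline must sustain follows directly from it.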
Moving to the smaller but still very capable YOLOv8s model enables the same levels of performance using only a single Metis device, while our upcoming 4-chip PCIe card enables the highest levels of accuracy and performance. The ability to easily change models and parameters using our Voyager SDK, simply by modifying YAML configuration files, makes it incredibly easy to build powerful systems for processing multiple high-definition camera streams at low latency and high frame rates. It's also a testament to the flexibility that has come from building the Voyager SDK from the ground up to deliver ease of use and high performance at the same time, all within a single development environment. Building our SDK foundations to enable this degree of flexibility was not, however, such an easy problem to solve.

Axelera AI Metis in action

A Challenge in the Making

My journey with heterogeneous computing began a decade ago at GPU IP supplier Imagination Technologies, where I worked with mobile OEMs trying to repurpose their GPUs for emerging compute workloads in the Android market. At that time, the use of GPU computing in application processors was still in its infancy. Apple, as Imagination's lead customer, was developing in-house features for the iPhone, and a fragmented Android ecosystem struggled to find compelling use cases in phones and tablets. Through various industry collaborations we did eventually achieve a few Android deployments, most notably (or perhaps notoriously) real-time camera "beautification." But my feeling at the time was that embedded GPU computing was failing to reach its full potential, in part due to the difficulties of programming heterogeneous systems-on-chip using low-level APIs.

Fast forward to the last couple of years and we're now in the midst of a major industry shift: the transition of compute from cloud data centers to edge devices, bringing end-user benefits such as reduced latency, enhanced privacy and lower costs. With AI playing an increasingly important role in products everywhere, many companies are looking to incorporate AI accelerators into their edge products.

Many of these products are already designed around host processors with embedded GPUs that offer impressive image processing capabilities. However, when it comes to integrating these capabilities within end-to-end AI pipelines, the state of the industry has unfortunately moved towards proprietary solutions. Apple has orphaned OpenCL in favour of its Metal API, which, albeit very capable, is proprietary to Apple platforms. NVIDIA's ecosystem is firmly rooted in CUDA, also a proprietary API. At the same time, open APIs such as Khronos's Vulkan have not yet delivered on their promise to evolve from graphics-centric APIs into ones that unify compute-based kernels.

Against this backdrop, we set out to develop an SDK that makes it easy to integrate Metis AI accelerators with a wide range of host processors, while maximizing end-to-end performance by leveraging the image-acceleration APIs available on these hosts. This challenge was just one part, albeit an important part, of the broader Axelera AI vision to make artificial intelligence accessible to everyone.

This is the 8K camera in position at the Axelera AI ISC West booth

First Make it Easy to Use

We started by making it easy for developers to express their complete AI inferencing pipeline in a single YAML configuration file.
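To give a flavour of what such a declarative description contains, here is a minimal sketch. For illustration it is written as a Python dictionary rather than in the SDK's actual YAML schema; every key and value below is an assumption invented for this example, not real Voyager syntax, and the cascaded person-then-weapon arrangement mirrors the scenario described in the next paragraph.

```python
# Hypothetical, simplified stand-in for a declarative pipeline description.
# The real Voyager SDK uses YAML files with its own schema; none of the
# field names below are taken from that schema.
pipeline = {
    "name": "person-then-weapon",
    "input": {"source": "rtsp://camera.local/stream", "tile-size": 640},
    "preprocess": ["color-convert", "scale", "letterbox", "to-tensor"],
    "models": [
        {"id": "person-detector", "weights": "person.onnx"},
        # The second model consumes regions produced by the first one.
        {"id": "weapon-detector", "weights": "weapon.onnx",
         "input-from": "person-detector"},
    ],
    "postprocess": ["decode-boxes", "nms"],
}

def validate(p):
    """Tiny illustrative check: cascaded models must reference known stages."""
    ids = {m["id"] for m in p["models"]}
    for m in p["models"]:
        upstream = m.get("input-from")
        assert upstream is None or upstream in ids, f"unknown model: {upstream}"

validate(pipeline)
```

The value of keeping the description declarative is that the same definition can be retargeted (different host, different tile size, different weights) without touching application code.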
Pipelines are described declaratively, including all preprocessing and post-processing elements, and can optionally combine multiple models in parallel or in sequence so that, for example, the output of a person detector becomes the input to a secondary weapon detector. We created YAML files for every model in our model zoo, using weights trained on default industry-standard datasets, and we made it easy to customize these models with pre-trained weights and datasets. We then developed a pipeline builder that automatically converts these YAML definitions into functionally equivalent low-level implementations for a range of target platforms. We designed high-level Python and C/C++ application integration APIs that enable developers to dynamically configure these pipelines at runtime with mixtures of different video sources, formats and tile sizes. At the application level, developers simply iterate to obtain images and inference metadata, which can then be analyzed and visualized using Voyager application libraries. The Voyager SDK provides a single environment in which pipeline development and evaluation can proceed hand-in-hand with application development, from product ideation all the way through to production.

Then Optimize, Relentlessly

Working closely with early access customers, we prioritized the optimizations that mattered most to their use cases, such as adding support for Intel Core hosts with VAAPI-accelerated graphics and Rockchip ARM platforms with OpenCL-accelerated Mali GPUs. Integrating these compute APIs with other hardware, such as video decoders and the Metis PCIe driver, required careful consideration of various low-level issues, such as alignment requirements when allocating memory, and an understanding of which API interoperability extensions were supported most efficiently and reliably by the different hardware vendors. This knowledge was codified into our pipeline builder so that it can construct efficient zero-copy pipelines that pass only pointers between elements. Unnecessarily copying even a single buffer can substantially degrade performance on bandwidth-constrained devices, so a lot of time was spent considering optimal approaches for different combinations of tasks and devices. With the core framework in place, we added optimization passes to the pipeline builder that fuse together different combinations of pre-processing tasks on the same device. This eliminates the unnecessary generation of intermediate pipeline data that isn't required by the application, saving additional memory bandwidth. Over time the pipeline builder has matured into a product that can generate near-optimal implementations of many complex pipelines to meet the demands of real-world applications, and we're excited to make it available to our broader community.
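As a rough illustration of why fusing pre-processing steps pays off, consider the generic numpy sketch below. It is not Voyager pipeline-builder output and the stages are deliberately simplified; the point is only that the unfused version writes out a full-size intermediate buffer after every stage, while the fused version produces the final tensor directly.

```python
import numpy as np

H, W = 2160, 3840                                  # one 4K frame
frame = np.random.randint(0, 256, (H, W, 3), dtype=np.uint8)

def unfused(frame):
    """Each stage materialises a full-size intermediate buffer."""
    rgb = frame[:, :, ::-1].copy()                 # colour conversion (BGR -> RGB)
    norm = rgb.astype(np.float32) / 255.0          # dtype conversion + scaling
    return np.ascontiguousarray(norm.transpose(2, 0, 1))  # HWC -> CHW layout

def fused(frame):
    """Same result, but only the final buffer is written out in full."""
    out = np.empty((3, H, W), dtype=np.float32)
    for c in range(3):                             # one pass per output plane
        out[c] = frame[:, :, 2 - c] / 255.0
    return out

assert np.allclose(unfused(frame), fused(frame))
# For this single frame the unfused path writes roughly 25 MB (uint8 copy)
# plus 100 MB (float32 buffer) of intermediate data before the final output;
# at tens of frames per second that traffic adds up quickly on a
# bandwidth-constrained host.
```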
Surveying the competitive landscape with our high-resolution 8K inferencing capabilities

Where We're Headed Next

The first public release of the Voyager SDK is a major milestone on a journey that offers many exciting opportunities. As part of this first release we've also opened up the lower-level APIs on which the pipeline builder and application integration APIs are built. These include a Khronos-inspired low-level AxRuntime API (available today in C/C++ and Python), which provides full control over all hardware resources used to build end-to-end pipelines. There's also a mid-level AxInferenceNet API (available today in C/C++, with Python to follow) that allows direct control of model execution from within an application (distinct from our highest-level API, which fully abstracts pipelines into objects generating images and metadata). We're excited to see how developers make use of these APIs and how they would like to see them further improved. Feel free to share any such feature requests.

As developers continue to push the boundaries of what's possible with AI, Axelera AI continues to innovate and ensure the broadest adoption of our products. For example, developers working with high-resolution cameras often need to manage large amounts of data within their application, from capturing and recording video in real time, to scanning complex scenes over time to track, identify and analyze objects, to detecting key events in dynamic, real-world environments. These are fundamentally difficult problems to solve, but by building tools that keep raising the abstraction level at which developers can create applications, I believe Axelera AI is perfectly positioned to deliver on the promise of making AI accessible to everyone.
The importance of software for unlocking the value of artificial intelligence cannot be overstated: the most powerful hardware in the world is just a paperweight without a usable software stack. At Axelera AI we've built the Voyager Software Development Kit (SDK) to give developers and ML engineers a simple solution for developing and deploying AI. Today, we're excited to share that the Voyager SDK is now publicly available on our GitHub page.

Introduction

The Voyager SDK is an end-to-end integrated software stack for Axelera AI's inference platform, designed for performance, efficiency and ease of use. It enables developers to deploy pre-trained machine learning (ML) models and to construct end-to-end optimized application pipelines quickly and easily. Whether you have trained your own model weights, are using an open-source model with pre-trained weights, or want to build on one of the models offered in our model zoo, Voyager provides an effortless path to deploy and evaluate a model on Axelera AI's hardware platforms, to build an inference pipeline around it, and to integrate the pipeline into your application logic.

The Voyager SDK offers a development environment in which the developer deploys a model, measures its accuracy and performance, and integrates it into an application pipeline. It also offers a runtime environment (i.e., a runtime stack) that offloads the execution of the pipeline when the application runs on an edge system. The two environments are logically separate but can also co-exist on the same system.

What is included in the Voyager SDK?

We distribute the latest version of the Voyager SDK as a GitHub repository which, among other content, offers the following:

- An automated installer for the core binary packages of the SDK, including native packages and Python wheels. For certain packages, such as the Linux kernel driver, native source code is available as well. The installer can be used to install the developer environment, the runtime environment or both.
- Source code for the AI pipeline builder, image acceleration libraries, GStreamer plug-ins, inference server and model evaluation infrastructure.
- Comprehensive documentation to support developers using our platform. Our documentation covers general topics such as installation, getting started and performance benchmarking, as well as tutorials on more specific topics such as deployment of custom model weights. Additional documents specific to various host platforms, along with upgrade instructions, will be available in our customer portal.
- A model zoo of optimized models, including dozens of models for tasks such as image classification, object detection, semantic segmentation, instance segmentation and keypoint detection. As we optimize the performance and accuracy of new models, our model zoo will expand continuously with additional models and use cases.
- Multiple sample pipelines and applications that exemplify the use of our stack and can help streamline development and speed up time-to-market.

Performance

Recently we published a blog post on the performance and accuracy achieved with benchmark computer vision models on Metis AI processing units. Those tests were run using the software we are releasing today, so you can reproduce those benchmarks for yourself and, more importantly, take advantage of the performance of Metis in your own applications. For the latest benchmarks and performance numbers, please visit here.
Why Now?

Axelera AI was founded on the principle that everyone should have access to leading-edge inference capabilities. We believe openness is the best way to empower developers, and we are thrilled to have reached this milestone. For over six months our customers have been using Voyager, providing us feedback and helping shape our roadmap. We look forward to broadening both the access and the feedback through our online community.

Interested in Getting Involved?

Our team is committed to fostering a collaborative environment by encouraging open-source contributions. Developers can submit new pipelines or improvements via pull requests, which will be reviewed and potentially integrated into the repository. With this approach we aspire to enhance the quality and reach of our SDK and to build a vibrant community of contributors.

Make sure you're signed up here at the Axelera AI community to discuss your projects, ask questions, and support your fellow developers. Still don't have a Metis inference accelerator? Get one today.
Evangelos Eleftheriou | CTO at AXELERA AI

It is with great enthusiasm and a sense of humble pride that I share a pivotal development in the realm of AI and high-performance computing (HPC). Axelera AI, as part of the esteemed EuroHPC Joint Undertaking (JU) DARE consortium, has embarked on an ambitious journey to revolutionize AI inference technology with our groundbreaking chiplet architecture, Titania.

Unveiling Titania

Titania represents a synthesis of our foundational principles: high performance, low power consumption, and unparalleled scalability. This innovative AI inference chiplet is a testament to the ingenuity and dedication of our team, who have tirelessly worked to bring this vision to life. Built on our proprietary Digital In-Memory Computing (D-IMC) architecture, Titania offers near-linear scalability from the edge to the cloud, marking a significant leap forward in AI computing efficiency.

"Our Digital In-Memory Computing (D-IMC) technology leverages a future-proof, scalable multi-AI-core architecture, ensuring unparalleled adaptability and efficiency. Enhanced with proprietary RISC-V vector extensions, this versatile mixed-precision platform is engineered to excel across diverse AI workloads. Uniquely, our architecture facilitates scaling from the edge to the cloud, streamlining expansion and optimizing performance in ways that traditional cloud-to-edge approaches cannot. We are setting a new standard for AI infrastructure, making true scalability a tangible reality."
Evangelos Eleftheriou, CTO and Co-Founder, Axelera AI

The Significance of Titania

Why is Titania so crucial for the future of AI? The answer lies in its design and in the pressing needs of our rapidly evolving industry. As AI applications become more sophisticated, models get bigger and compute demands seem endless, the technology industry owes it to the world to bring more efficient, scalable, and cost-effective inference to market. Titania is engineered to meet these demands, providing server-grade performance with the energy efficiency required at the edge. This balance is essential for applications ranging from weather prediction and industrial automation to security monitoring and advanced Large Language Models with multimodal capabilities.

Collaboration and Support

The development of Titania is made possible through the generous support of the EuroHPC JU and the DARE consortium, which have allocated €240 million in funding, of which Axelera AI will receive up to €61.6 million. This support underscores the importance of fostering European innovation and technological sovereignty in the HPC ecosystem. It also aligns perfectly with Axelera AI's mission to bring state-of-the-art AI capabilities to a broader range of applications and industries.

Our Technological Edge

At the core of Titania's capabilities is our D-IMC technology, integrated with cutting-edge RISC-V vector extensions. This combination ensures that our chiplet can handle diverse AI workloads with remarkable efficiency and adaptability. The scalable multi-AI-core architecture sets a new standard for AI infrastructure and streamlines the expansion process, making true scalability a tangible reality.

Looking Ahead

Our journey with Titania is just beginning. We anticipate the first systems powered by Titania to be available by 2027, supporting a vast array of use cases and demonstrating the profound impact of this technology.
As we forge ahead, we remain committed to our core values and dedicated to delivering cutting-edge solutions that address the AI industry's most pressing challenges.

A Heartfelt Thank You

I extend my deepest gratitude to the EuroHPC JU, the DARE consortium, and our incredible team at Axelera AI. Their unwavering support and commitment have been instrumental in driving this project forward. Together, we are setting a new benchmark for AI inference technology and paving the way for a future where AI capabilities are more accessible, efficient, and impactful than ever before.

Titania is more than just a chiplet; it represents the culmination of years of research, innovation, and collaboration. It embodies our vision for the future of AI and our dedication to pushing the boundaries of what is possible. I invite you to join us on this exciting journey as we continue to explore new frontiers in AI and HPC.
Manuel Botija | Head of Product at AXELERA AI
Ioannis Koltsidas | VP AI Software at AXELERA AI

Exec Summary: The Metis AI Processing Unit (AIPU) is an inference-optimized accelerator for the edge. We are proud to showcase up to a 5x performance boost over competitive accelerators in terms of raw inference performance for key families of computer vision neural networks, along with state-of-the-art accuracy. As significant as 5x is, though, we believe the best measurement of performance is application-level performance, which is a much better proxy for what the user will actually experience. For example, if the AIPU can infer that a cat is a cat at 900 fps, but post-processing slows things down so much that the user only sees 20 fps, the 900 fps is nearly useless. Thanks to our easy-to-use Voyager™ SDK, which optimizes the entire data pipeline, we also showcase that Axelera AI's application performance brings world-class speed to computer vision applications.

Three years ago we started Axelera AI with a singular mission: to empower everyone with the best performance for AI inference. Since then, we have taped out three chips, built the Voyager SDK, and are fulfilling that promise. Today we are pleased to release the latest performance benchmarks based on the upcoming public release of our Voyager SDK, which will be available via our GitHub repo in March. All of our measurements were performed on our own products, within our labs. Competitor data has been taken from their own published sources, as noted below.

Multiple Metis AI processing units. The AI chips that accelerate deep learning at the edge

Performance Results: Metis vs. Competition

When compared to other AI accelerators, Metis consistently outperforms in key benchmarks such as Ultralytics YOLO models. The chart and table below show the frames per second (FPS) processed by Metis, compared to the throughput of other AI accelerators. These are just some of the benchmarks we have tested, and a few of the more than 50 models available for immediate use within our Model Zoo. Software is extremely important to us at Axelera AI and we invest significant resources in ensuring we are always improving. We continue to add optimized models and capabilities to ease development and integration within AI pipelines.

Having the highest performance only matters if users can trust the accuracy of the inference being performed. We are thrilled to say that, thanks to the mixed-precision architecture of Metis and the quantization capabilities of our SDK, the achieved accuracy is state of the art. In the following table we list the accuracy measured for various models when running on a machine with full numerical precision (32-bit floating-point arithmetic, a.k.a. FP32) and compare it with the accuracy of the same models running on Metis after being quantized by the Voyager SDK. As you can see, the accuracy reduction with Metis is negligible in many cases. Our software team continues to work on optimizations and will keep delivering updates in our public releases.

Voyager SDK

Without a robust and easy-to-use software stack, AI hardware is useless. There, we said it! So, to ensure developers can get the most out of our performance-leading hardware, we built the Voyager™ SDK, which facilitates the development of high-performance applications. Developers can build their computer vision pipelines using a straightforward, high-level declarative language: YAML.
A Voyager pipeline may include one or more neural networks along with their associated pre- and post-processing tasks, which can include complex image processing operations. The SDK automatically compiles, optimizes, and deploys the entire pipeline. While the neural network runs on the Metis AI Processing Unit (AIPU), the SDK also generates code for the non-neural operations of the pipeline, such as image preprocessing and inference post-processing, to take advantage of the hardware acceleration offered by the host CPU, integrated GPU or media accelerator.

Additionally, thanks to the architecture of our chip, the developer can choose how to allocate Digital In-Memory Compute (D-IMC) cores to the application: if there are multiple models, they can be loaded onto the cores in parallel, or they can be cascaded; the decision is yours. This means that if you have a very compute-heavy model and want to dedicate three of the four cores to it, you may. Likewise, if you have four models you want to run in parallel, that is also possible.

Application-level performance

Running a computer vision application is much more than just running inference. At Axelera AI we believe it's important to understand the realized performance: how long it takes to get the answer a user is looking for, measured end to end. The Axelera AI Voyager SDK helps optimize the entire data pipeline, including the parts that run on the host CPU or embedded GPU. Why does this matter? Both the developer and the user get a better experience: the SDK handles the work for the developer, and the user gets faster results. As the table shows, the Voyager SDK delivers the raw inference performance all the way to the end-to-end application: by optimizing the execution of non-neural operations in the computer vision pipeline, we ensure that the application can take full advantage of the unmatched capabilities of Metis.

The Voyager SDK is compatible with a variety of host architectures and platforms to accommodate different application environments. Additionally, the SDK allows embedding a pipeline into an inference service, providing various preconfigured solutions for use cases ranging from fully embedded applications to distributed processing of multiple 4K streams.

State-of-the-Art Digital In-Memory Computing

Why is Metis so powerful? One of the key innovations that sets Metis apart from its competition is its use of Digital In-Memory Computing (D-IMC) technology. D-IMC allows for the simultaneous processing and storage of data within memory cells, enabling extremely high-throughput, power-efficient matrix-vector multiplication. This approach is particularly beneficial for AI workloads, which require high-speed data access and intensive computation, and Metis delivers all of this with an average power consumption below 10 watts!
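To make the relationship between quantization and matrix-vector multiplication a little more tangible, here is a generic numpy sketch of symmetric 8-bit post-training quantization of one layer's weights. It has nothing to do with the internals of the Voyager quantizer; it only illustrates, on synthetic data, why the accuracy cost of low-precision weights can stay small.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)  # one layer's weights
x = rng.standard_normal(512).astype(np.float32)         # one activation vector

# Symmetric per-tensor int8 quantization of the weights.
scale = np.abs(W).max() / 127.0
W_q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)

# Matrix-vector multiplication is the core operation a D-IMC array accelerates.
y_ref = W @ x                                  # full-precision reference
y_q = (W_q.astype(np.float32) * scale) @ x     # same layer with int8 weights

rel_err = np.linalg.norm(y_ref - y_q) / np.linalg.norm(y_ref)
print(f"relative output error from int8 weights: {rel_err:.2%}")  # roughly 1% here
```

Real quantizers do considerably better than this naive per-tensor scheme (per-channel scales, calibration data, mixed precision where needed), which is why the accuracy gap reported above can be negligible.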
Manuel Mohr | Staff Software Engineer at AXELERA AI

Open standards enable developers to more easily harness the power of AI accelerators, especially in heterogeneous computing. Here you can read in detail why and how we implemented OpenCL on Metis using oneAPI.

The Necessity of Dedicated AI Hardware Accelerators

AI applications have an endless hunger for computational power. Currently, increasing model sizes and cranking up the number of parameters has not quite yet reached the point of diminishing returns; the ever-growing models still yield better performance than their predecessors. At the same time, new areas of application for AI tools are being explored and discovered almost daily. Hence, building dedicated AI hardware accelerators is extremely attractive. In some situations it is even a necessity, as it enables running more powerful AI applications while using less energy on cheaper hardware.

Welcome to the Hardware Jungle

Such specialized accelerator hardware poses great challenges to software developers, as it instantly transforms a regular computer into a heterogeneous supercomputer in which the accelerator is distinctly different from the host processor. Moreover, each accelerator is different in its own way and wants to be programmed appropriately to actually reap the potential performance and efficiency benefits. In his 2011 article*, Herb Sutter heralded this age with the words "welcome to the hardware jungle". And since he wrote that article, a thick jungle it has indeed become, with multiple specialized hardware accelerators now commonplace across all device categories, ranging from low-end phones to high-end servers. So what's the machete that developers can use to make their way through this jungle without getting lost?

Why Custom Accelerator Interfaces Are a Bad Idea

The answer lies in the choice of a suitable programming interface for those accelerators. Creating a custom interface that is completely tailored to a new accelerator's silicon could let a developer exploit every little feature that the hardware has to offer to achieve maximum performance. However, upon closer inspection, this is a bad idea for a variety of reasons. Firstly, while there might be the possibility of achieving peak performance with a custom interface, it would require expertise that is already hard to come by for existing devices and even rarer for new devices. The necessary developer training is time-intensive and costly. Even more importantly, using a different bespoke interface to program each accelerator can also result in vendor lock-in if the created software completely relies on such a custom interface, making it highly challenging and significantly more expensive to switch to a different hardware accelerator. The choice of programming interface is thus crucial not only from a technical perspective, but also from a business standpoint. At Axelera, we therefore believe that the answer to the question of how to best bushwhack through the accelerator jungle is to embrace open standards, such as OpenCL* and SYCL*.

Open Standards for Open Interaction

OpenCL and SYCL are open standards defined by the Khronos Group.
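For readers who have not written OpenCL before, the snippet below shows what portable host code looks like in practice: a generic vector addition using the pyopencl bindings. Nothing in it is Metis-specific; the same host code runs unchanged on any device that ships a conformant OpenCL implementation, which is exactly the portability argument this post makes.

```python
import numpy as np
import pyopencl as cl

a = np.random.rand(1024).astype(np.float32)
b = np.random.rand(1024).astype(np.float32)

# Pick any available OpenCL platform/device; the code below does not care
# whether that is a CPU, a GPU, or a dedicated accelerator.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

mf = cl.mem_flags
a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)

# The compute kernel is written in OpenCL C and compiled at runtime by
# whichever vendor implementation backs the chosen device.
program = cl.Program(ctx, """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""").build()

program.vadd(queue, a.shape, None, a_buf, b_buf, out_buf)

result = np.empty_like(a)
cl.enqueue_copy(queue, result, out_buf)
assert np.allclose(result, a + b)
```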
Together, OpenCL and SYCL define an application programming interface (API) for interacting with all kinds of devices, as well as programming languages for implementing compute kernels to run on these devices. SYCL additionally provides high-level programming concepts for heterogeneous computing architectures, together with the ability to maintain code for host and device inside a shared source file.

But providing a standard-conformant implementation of such open standards poses a daunting challenge for creators of new hardware accelerators. The OpenCL API consists of more than 100 functions, and OpenCL C specifies over 10,000 built-in functions that compute kernels can use. It would be great if these open standards were also accompanied by high-quality open-source implementations that are easy to port to new silicon. Fortunately, in the case of OpenCL and SYCL, this is indeed the case.

Increased Developer Productivity

Open standards such as OpenCL and SYCL promise portability across different hardware devices and also foster collaboration and code reuse. After all, it suddenly becomes possible and worthwhile to create optimized libraries that are usable across many devices, which ultimately increases developer productivity. Axelera is a member of the UXL Foundation*, a group that governs optimized libraries implemented using SYCL. These libraries are compatible with this software stack, offering math and AI operations through standard APIs.

Conquering the Jungle with the oneAPI Construction Kit

The open-source oneAPI Construction Kit from Codeplay is a collection of high-quality implementations of open standards, such as OpenCL and Vulkan Compute, that are designed from the ground up to be easily portable to new hardware targets. We want to share our experiences using the Construction Kit to unlock OpenCL and SYCL for our Metis AI Processing Unit (AIPU)*.

Prerequisites for deployment

In order to enable porting an existing OpenCL implementation to a new device, two prerequisites must be fulfilled:

- There must be a compiler backend able to generate code for the device's compute units. As the oneAPI Construction Kit, like virtually all OpenCL implementations, is based on the LLVM compiler framework, in this case this means having an LLVM code generator backend for the target instruction set architecture (ISA). As our Metis AIPU's compute units are based on the RISC-V ISA, we could just use the RISC-V backend that's part of the upstream LLVM distribution to get us started. If an accelerator uses a non-standard ISA, an adapted version of LLVM with a custom backend can of course be used with the Construction Kit as well.
- There must be some way for low-level interaction with the device, to perform actions like reading or writing device memory, or triggering the execution of a newly loaded piece of machine code. As we already supported another API before looking into OpenCL, such a fundamental library was already in place. In our case, it was a kernel driver exposing the minimal needed functionality to user space (essentially handling interrupts and providing access to device memory), accompanied by a very thin user-space library wrapping those exposed primitives.

Implementing the HAL

With these prerequisites met, we started following the Construction Kit's documentation*. The first thing to do is to implement what the Construction Kit calls the "hardware abstraction layer" (HAL).
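Before walking through our port, it may help to picture what such a hardware abstraction layer boils down to. The sketch below is only a mental model, written in Python for brevity; the Construction Kit's actual HAL is a C++ interface and its real function names and signatures differ.

```python
from abc import ABC, abstractmethod

class DeviceHAL(ABC):
    """Illustrative stand-in for a minimal accelerator HAL: memory
    management, data movement, and program/kernel handling."""

    @abstractmethod
    def mem_alloc(self, size: int) -> int: ...          # returns a device address
    @abstractmethod
    def mem_free(self, addr: int) -> None: ...

    @abstractmethod
    def mem_read(self, addr: int, size: int) -> bytes: ...
    @abstractmethod
    def mem_write(self, addr: int, data: bytes) -> None: ...

    @abstractmethod
    def program_load(self, binary: bytes) -> int: ...    # returns a program handle
    @abstractmethod
    def program_free(self, handle: int) -> None: ...

    @abstractmethod
    def kernel_find(self, handle: int, name: str) -> int: ...
    @abstractmethod
    def kernel_exec(self, kernel: int, args: bytes, ndrange: tuple) -> None: ...
```

Everything else in an OpenCL stack (queues, events, the compiler front end) is layered on top of primitives of roughly this shape, which is why the device-specific porting effort can stay so small.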
The HAL comprises a minimal interface that covers the second item of the list above and consists of just eight functions: allocating/freeing device memory, reading/writing device memory, loading/freeing programs on the device, and finding/executing a kernel contained in an already loaded program. In order to avoid having to deal with the full complexity of OpenCL from the get-go, a smaller helper library called "clik" is provided by the Construction Kit to help implement the HAL. This library is essentially a severely stripped-down version of OpenCL, with some especially complex parts, such as online kernel compilation, completely absent. Hence, the clik library serves as a stepping stone for getting the HAL implemented function by function, and it provides matching test cases to ensure that the HAL implementation fulfills the contract expected by the Construction Kit. After all tests pass, this scaffolding can be removed, and the resulting HAL implementation can be used to bring up a full OpenCL implementation.

In our case, implementing the HAL was straightforward. The tests enabled a quick development cycle: more tests started passing every time new functionality was added, and failing tests pointed out where the HAL implementation didn't yet meet the Construction Kit's expectations. In total, it took about two weeks of full-time work by one developer without prior Construction Kit knowledge to go from starting the work to passing all clik tests.

Configuring a complete OpenCL stack

After gaining confidence that the Metis HAL implementation was functional, we could continue with the next step and bring up a complete OpenCL stack*. This, too, was surprisingly quick, taking roughly another two person-weeks of developer time. The Construction Kit again provides an extensive unit test suite, whose tests can be used to guide development by pointing out specific areas that aren't working yet.

Testing our Metis OpenCL implementation

All bring-up work was initially performed using an internal simulator environment, but after passing all tests there, we could quickly move to working on actual silicon (see reference 8). As the first real-world litmus test for our Metis OpenCL implementation, we picked an OpenCL C kernel that is currently used for preprocessing as part of our production vision pipeline. By default, the kernel is offloaded to the host's GPU. However, with Metis now being a possible offloading target for OpenCL workloads as well, we pointed the existing host application at our Metis OpenCL library and gave it a try. We were very happy to see that, without any modifications to the host application, we were able to run the vision pipeline while offloading the computations to Metis instead of the host GPU. In total, with the transition to actual silicon taking another week of developer time, it took us around five person-weeks of development effort to go from having no OpenCL support to having a prototype implementation capable of offloading an existing OpenCL C kernel used in a production setting to our accelerator. Hence, in our experience, OpenCL and the oneAPI Construction Kit fully delivered on their promises of easy portability and avoiding vendor lock-in.

Opening up Possibilities

Having a functional OpenCL implementation is also an important building block that opens up many other possibilities.
OpenCL can be used as a backend for the DPC++ SYCL implementation*, which enables a more modern, single-source style of programming accelerators. Even more importantly, a SYCL implementation makes it possible to tap into the wider SYCL ecosystem. This includes optimized libraries, such as portBLAS* providing linear algebra routines and portDNN* providing neural-network-related routines, but also brings the potential to support the UXL Foundation libraries, including oneMKL*, oneDPL*, and oneDNN*. Alongside these libraries, it also includes tools like SYCLomatic*, which assists with migrating existing CUDA codebases to SYCL and thus offers an important migration path to escape vendor lock-in.

Why oneAPI Simplifies AI Accelerator Implementation

The best way to bushwhack through the accelerator jungle and enable heterogeneous computing is to embrace open standards. Open standards play a crucial role in the evolution and adoption of heterogeneous computing by addressing some of the fundamental challenges associated with developing for diverse hardware architectures. They provide standardized programming models and APIs, such as oneAPI, that allow software to communicate with various hardware components, including CPUs, GPUs, DSPs, and FPGAs, irrespective of the vendor. Overall, we found the oneAPI Construction Kit to be key for unlocking access to open standards.

Through the use of oneAPI, the integration of AI accelerators can be significantly simplified and made more efficient and future-proof. That's because oneAPI enables seamless, hardware-agnostic interoperation between tools and libraries. This accelerates the development process and ensures that applications can leverage the latest advancements in AI hardware and software technologies and remain compatible with future hardware innovations, reducing the need for costly rewrites or optimizations. At Axelera AI, we are excited to continue on this path.

*OpenCL and the OpenCL logo are trademarks of Apple Inc., used by permission by Khronos.
*SYCL and the SYCL logo are trademarks of the Khronos Group Inc.

References

1. H. Sutter, "Welcome to the Jungle," 2011. [Online]. Available: https://herbsutter.com/welcome-to-the-jungle/
2. The Khronos Group, "OpenCL Overview." [Online]. Available: https://www.khronos.org/opencl/
3. The Khronos Group, "SYCL Overview." [Online]. Available: https://www.khronos.org/sycl/
4. UXL Foundation, "UXL Foundation: Unified Acceleration." [Online]. Available: https://uxlfoundation.org/
5. Axelera AI, "Metis AIPU Product Page." [Online]. Available: https://www.axelera.ai/metis-aipu
6. Codeplay Software Ltd, "Guide: Creating a new HAL." [Online]. Available: https://developer.codeplay.com/products/oneapi/construction-kit/3.0.0/guides/overview/tutorials/creating-a-new-hal
7. Codeplay Software Ltd, "Guide: Creating a new ComputeMux Target." [Online]. Available: https://developer.codeplay.com/products/oneapi/construction-kit/3.0.0/guides/overview/tutorials/creating-a-new-mux-target
8. Axelera AI, "First Customers Receive World's Most Powerful Edge AI Solutions from Axelera AI," 12 September 2023. [Online]. Available: https://www.axelera.ai/news/first-customers-receive-worlds-most-powerful-edge-ai-solutions-from-axelera-ai
9. Intel Corporation, "Intel® oneAPI DPC++/C++ Compiler." [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-compiler.html
10. Codeplay Software Ltd, "portBLAS: Basic Linear Algebra Subroutines using SYCL." [Online]. Available: https://github.com/codeplaysoftware/portBLAS
11. Codeplay Software Ltd, "portDNN: neural network acceleration library using SYCL." [Online]. Available: https://github.com/codeplaysoftware/portDNN
12. Intel Corporation, "Intel® oneAPI Math Kernel Library (oneMKL)." [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl.html
13. Intel Corporation, "Intel® oneAPI DPC++ Library." [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/dpc-library.html
14. Intel Corporation, "Intel® oneAPI Deep Neural Network Library." [Online]. Available: https://www.intel.com/content/www/us/en/developer/tools/oneapi/onednn.html
15. Intel Corporation, "SYCLomatic: CUDA to SYCL migration tool." [Online]. Available: https://github.com/oneapi-src/SYCLomatic
Manuel Botija | Head of Product at AXELERA AI

Quality control and defect inspection are vital processes in manufacturing, ensuring that products meet stringent standards and are free from defects. AI-driven vision inspection systems (integral to quality control 4.0) have revolutionized these processes, providing high accuracy and efficiency in identifying defects that human inspectors might miss. These vision inspection systems use advanced imaging technologies and machine learning algorithms to automatically detect and classify defects, ensuring consistent quality across various industries. Discover how vision inspection system manufacturers can tackle the challenges associated with applying machine learning in quality control:

- Choosing the right AI-based algorithms
- Adapting to the uniqueness of each manufacturing line
- Deploying fast and secure inference, anywhere
- Scaling up and keeping up

Example applications of vision inspection systems

- Automotive Industry: In the automotive sector, vision-based systems are used to inspect components such as engine parts, brake systems, and body panels. These systems utilize high-resolution cameras and image processing algorithms to detect defects like cracks, deformations, and surface irregularities that could compromise vehicle safety and performance.
- Electronics Manufacturing: Automated Optical Inspection (AOI) systems are widely used in the electronics industry to inspect printed circuit boards (PCBs). These systems capture high-resolution images of PCBs and use pattern recognition algorithms to identify defects such as missing components, soldering issues, and misalignments, ensuring that only functional electronics are shipped to customers.
- Textile Industry: Vision inspection systems in the textile industry scan fabrics to identify defects such as holes, stains, and color inconsistencies. These systems use cameras and image processing software to continuously monitor the fabric during production, ensuring high-quality textiles are produced without manual inspection.
- Food and Beverage Production: In the food and beverage industry, vision systems are used to inspect products for contamination and packaging defects. For example, x-ray and infrared imaging technologies can detect foreign objects in packaged foods, while high-speed cameras ensure that labels and seals are correctly applied.
- Pharmaceuticals: Vision inspection systems in the pharmaceutical industry ensure that tablets, capsules, and vials are free from defects. These systems use cameras and specialized lighting to inspect for cracks, chips, and discoloration, ensuring that only safe and effective medications reach consumers.

These examples illustrate the significant role of vision inspection systems in enhancing product quality and safety across various manufacturing sectors. By leveraging advanced imaging and machine learning technologies, these systems provide manufacturers with reliable and efficient tools to maintain high standards and improve operational efficiency.

Deep learning has changed quality control for the better

Deep learning has revolutionized defect inspection and quality control in manufacturing by providing unprecedented accuracy, speed, and adaptability. Traditional approaches that use hand-crafted algorithms in vision systems for manufacturing, while effective, have limitations in their ability to learn and adapt to new types of product defects and variations.
Deep learning overcomes these limitations by leveraging vast amounts of data to train neural networks that can identify and classify defects with high precision. Deep learning models, particularly Convolutional Neural Networks (CNNs), have significantly improved the accuracy of defect detection. These models can automatically learn complex features from images, enabling them to detect even the smallest and most subtle defects. For instance, deep learning models can identify micro-cracks in semiconductor wafers, which is crucial for the electronics industry, where even minor defects can lead to significant product failures. Thus, deep learning can significantly improve automated inspection. But the improvements of machine learning in quality control come with some challenges.

Challenges and Opportunities of Machine Learning in Quality Control

1. Choosing the right AI-based algorithms for a given problem

The diversity of AI-based algorithms available for automated inspection poses a significant challenge for manufacturers. Different algorithms, based on classification, localization, and segmentation, apply to different problems. Choosing the right algorithm requires understanding the specific requirements of the defect recognition task at hand, which varies significantly across different manufacturing sectors.

- Classification algorithms identify defects within a set of predefined classes (like classifying defective and non-defective microchips in the electronics industry) using CNN models such as ResNet, which automatically learn and extract hierarchical features from images to recognize complex patterns.
- Localization algorithms detect the presence and position of defects within the image (like locating surface cracks on automotive parts). Models from the YOLO family can detect and localize multiple defects in real time, making them suitable for high-speed manufacturing lines.
- Segmentation algorithms precisely delineate the boundaries of defects within an image (for example, segmenting defects on textile surfaces to identify the exact areas of flaws). U-Net is a popular CNN architecture for segmentation tasks. It excels whenever detailed localization of features within images is required, thanks to its encoder-decoder structure, which allows it to capture fine details and provide pixel-level segmentation.
- Anomaly detection identifies deviations from the norm without predefined defect classes (for example, detecting unusual wear patterns on machinery parts that indicate potential failures). Variational Autoencoders (VAEs) are one example of an architecture used for anomaly detection: they model the normal appearance of components and flag deviations from that learned distribution as defects (a minimal sketch of this reconstruction-error idea follows this list).
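As a toy illustration of the reconstruction-error idea behind such anomaly detectors, the sketch below replaces the VAE with plain PCA on synthetic feature vectors: it learns a low-dimensional model of "normal" parts and flags samples that reconstruct poorly. It is a deliberately simplified stand-in, not production inspection code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "good part" measurements: 64-dimensional feature vectors that lie
# close to a 5-dimensional subspace, plus a little measurement noise.
basis = rng.standard_normal((64, 5))
normal = rng.standard_normal((500, 5)) @ basis.T + 0.05 * rng.standard_normal((500, 64))

# "Train" on normal samples only: learn the principal subspace.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:5]                      # top-5 principal directions

def reconstruction_error(x):
    """Project onto the learned 'normal' subspace and measure what is lost."""
    centered = x - mean
    recon = centered @ components.T @ components
    return np.linalg.norm(centered - recon, axis=-1)

threshold = np.percentile(reconstruction_error(normal), 99)

good = rng.standard_normal((1, 5)) @ basis.T + 0.05 * rng.standard_normal((1, 64))
defect = good + 2.0 * rng.standard_normal((1, 64))   # deviates from the subspace

print(reconstruction_error(good) < threshold)     # expected: [ True ]
print(reconstruction_error(defect) > threshold)   # expected: [ True ]
```

A real system would learn the normal manifold with an autoencoder or VAE on images rather than PCA on synthetic vectors, but the decision rule, thresholding the reconstruction error, is the same.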
This architectural flexibility helps us provide a comprehensive and ever-growing toolkit for defect inspection as part of our Model Zoo. Our Metis AI platform already supports a wide range of state-of-the-art models and delivers best-in-class performance across many of them, which allows deploying real-time vision inspection systems with multiple high-resolution, high-frame-rate cameras.

2. Adapting to the uniqueness of each manufacturing line
The fragmentation of problems across different manufacturing lines presents a significant challenge in automated inspection. Each line has unique requirements, materials, and processes, making it difficult to find a one-size-fits-all solution. Consequently, there is a need to adapt models specifically to each problem, often with limited data available for training. Here are the main approaches to solving this challenge:
Retraining: training a pre-existing model on new data specific to the manufacturing line. This process can be time-consuming and data-intensive but ensures that the model is tailored to the specific defects and characteristics of the production line.
Fine-tuning: a less resource-intensive method where a pre-trained model is slightly adjusted using a smaller, task-specific dataset. This approach is particularly useful when the available data is limited.
Model-Agnostic Meta-Learning (MAML): a meta-learning technique where a model is trained on a variety of tasks so that it can quickly adapt to new tasks with minimal data. This approach is beneficial in environments where new types of defects frequently arise.
Zero-shot learning: allows a model to recognize defects it has never seen before by leveraging knowledge from similar tasks or utilizing descriptive labels. This method is highly advantageous in scenarios with very limited or no defect data.

How Axelera AI addresses this challenge
Axelera AI addresses the challenge of adapting quality control models to the uniqueness of each manufacturing line by providing a robust and flexible solution that minimizes complexity. Our technology does not get in the way of training: models that have been trained in a hardware-agnostic fashion can be compiled and run on our inference hardware seamlessly and without degradation in accuracy. We provide quantization libraries that automatically handle the optimization of models post-training. This enables efficient deployment without compromising performance and ensures that manufacturers can quickly implement tailored AI solutions to address specific defect inspection needs.

3. Deploying fast and secure inference, anywhere
Deploying AI models that enhance vision systems for manufacturing involves several critical challenges, and Axelera AI makes sure they are addressed in order to ensure operational efficiency, privacy, and confidentiality. Manufacturing environments often consist of diverse hardware systems from various vendors running different operating systems: a manufacturing line might use a mix of Windows, Linux, and custom real-time operating systems across machines from vendors like Dell, HP, and Lenovo. Axelera AI's acceleration platform Metis is available as PCIe or M.2 modules and can be integrated into many hardware solutions, supporting a wide range of operating systems. This ensures compatibility with existing heterogeneous hardware setups. Manufacturing systems also often have pre-existing software architectures that may be custom-built or rely on widely used libraries like GStreamer.
Integrating AI models into these architectures without disrupting existing workflows and processes is achieved thanks to Axelera AI's Voyager SDK. The SDK provides both low-level APIs and pipelines based on popular frameworks like GStreamer, allowing easy integration into pre-existing software architectures. This flexibility ensures that AI models can be deployed without significant modifications to existing systems. AI inference also needs low latency and high pixel throughput to meet the demands of real-time defect detection, especially when multiple high-resolution cameras are used or when the manufacturing process operates at a high frame rate (fps). Metis delivers datacenter-grade performance at the edge, ensuring the low latency and high throughput necessary for real-time defect detection; this capability is crucial for maintaining operational efficiency in high-speed manufacturing lines with multiple high-resolution cameras. Finally, manufacturing data often includes sensitive information that cannot leave the premises due to privacy and confidentiality concerns. Axelera AI's on-premises AI solution addresses these concerns by ensuring that all data processing occurs within the manufacturing facility.
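As a rough illustration of what GStreamer-based integration can look like, the sketch below uses the standard GStreamer Python bindings to pull decoded frames into Python, where an inference call would be made. This is a generic GStreamer pattern, not the Voyager SDK API; the run_inference() call is a hypothetical placeholder for whatever inference entry point your deployment provides.

```python
# Generic GStreamer integration sketch (not the Voyager SDK API).
# videotestsrc stands in for a factory camera; appsink hands raw RGB frames to
# Python, where run_inference() would be a hypothetical call into your model.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

pipeline = Gst.parse_launch(
    "videotestsrc num-buffers=100 ! videoconvert ! video/x-raw,format=RGB "
    "! appsink name=sink emit-signals=true max-buffers=1 drop=true"
)
sink = pipeline.get_by_name("sink")

def on_new_sample(appsink):
    sample = appsink.emit("pull-sample")
    buf = sample.get_buffer()
    ok, info = buf.map(Gst.MapFlags.READ)
    if ok:
        frame_bytes = info.data          # one raw RGB frame
        # run_inference(frame_bytes)     # hypothetical inference entry point
        buf.unmap(info)
    return Gst.FlowReturn.OK

sink.connect("new-sample", on_new_sample)
pipeline.set_state(Gst.State.PLAYING)
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```

Because the pipeline is described as a chain of interchangeable elements, an existing camera source or display sink can usually be kept as-is while only the inference stage is added.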
Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI

To create a high-performing and highly energy-efficient AI processing unit (AIPU) that eliminates the need for extensive model retraining, our engineers took a radically different approach to data processing. Through unique quantization methods and a proprietary system architecture, Axelera is able to offer the most powerful AI accelerator for the edge you can buy today. In this blog, you can read all about our unique quantization techniques.

Industry-leading performance and usability
Our Metis acceleration hardware leads the industry because of our unique combination of advanced technologies. This is how our sophisticated quantization flow methodology enables Metis' high performance and efficiency.
Metis is very user-friendly, not least because of the quantization techniques that are applied. Axelera AI uses Post-Training Quantization (PTQ) techniques. These techniques do not require the user to perform any retraining of the model, which would be time-, compute- and cost-intensive. Instead, PTQ can be performed quickly, automatically, and with very little data.
Metis is also fast, energy-efficient and cost-effective. This results from innovative hardware design, like digital in-memory computing and RISC-V, but also from the efficiency of the algorithms running on it. Our efficient digital in-memory computing works hand in hand with quantization of the AI algorithms. The quantization process casts the numerical format of the AI algorithm elements into a more efficient format, compatible with Metis. For this, Axelera AI has developed an accurate, fast and easy-to-use quantization technique.

Accuracy drop at INT8 (deviation from FP32 accuracy):
ResNet-34: -0.1%
ResNet-50v1.5: -0.1%
SSD-MobileNetV1: -0.3%
YOLOv5s-ReLU: -0.9%

Highly accurate quantization technique
In combination with the mixed-precision arithmetic of the Axelera Metis AIPU, our AI accelerators can deliver an accuracy practically indistinguishable from a reference 32-bit floating-point model. As an example, the Metis AIPU can run the ResNet-50v1.5 neural network at a full processing speed of 3,200 frames per second with a relative accuracy of 99.9%.

Technical details of our post-training quantization method
To reach high performance, AI accelerators often deploy 8-bit integer processing for the most compute-intensive parts of neural network calculations instead of using 32-bit floating-point arithmetic. To do so, the data needs to be quantized from 32-bit to 8-bit.
The Post-Training Quantization (PTQ) technique begins with the user providing around one hundred images. These images are processed through the full-precision model while detailed statistics are collected. Once this process is complete, the gathered statistics are used to compute quantization parameters, which are then applied to quantize the weights and activations to INT8 and other precisions in both hardware and software.
Additionally, the quantization technique modifies the compute graph to enhance quantization accuracy. This may involve operator folding and fusion, as well as reordering graph nodes.

Our radically different approach to data processing
From the outset, we designed our quantization method with two primary goals in mind: the first is achieving high efficiency, the second is high accuracy.
Our quantized models typically maintain accuracy comparable to full-precision models. To ensure this high accuracy, we begin with a comprehensive understanding of our hardware, as the quantization techniques employed depend on the specific hardware in use. Additionally, we utilize various statistical and graph optimization techniques, many of which were developed in-house.

Compatible with Various Neural Networks
By employing a generic quantization flow methodology, our systems can be applied to a wide variety of neural networks while minimizing accuracy loss. Our quantization scheme and hardware allow developers to efficiently deploy an extremely wide variety of operators. This means that Axelera AI's hardware and quantization methods can support many different types of neural network architectures and applications.
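The basic idea behind such a post-training flow can be shown in a few lines. The toy sketch below performs symmetric min/max calibration to INT8 for a single tensor using NumPy; it is a simplified illustration of the general PTQ principle described above, not Axelera's hardware-aware quantizer, and all shapes and values are made up for the example.

```python
# Toy post-training quantization sketch: calibrate a scale from sample data,
# quantize to INT8, then dequantize to estimate the error (illustration only).
import numpy as np

def calibrate_scale(calibration_batches):
    """Symmetric per-tensor scale from the observed dynamic range."""
    max_abs = max(float(np.abs(b).max()) for b in calibration_batches)
    return max_abs / 127.0  # map the observed range onto [-127, 127]

def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def dequantize(q, scale):
    return q.astype(np.float32) * scale

# Roughly one hundred calibration samples, as in the flow described above
# (random stand-ins here instead of real activations).
rng = np.random.default_rng(0)
calib = [rng.normal(size=(16, 56, 56)).astype(np.float32) for _ in range(100)]
scale = calibrate_scale(calib)

x = rng.normal(size=(16, 56, 56)).astype(np.float32)
err = np.abs(dequantize(quantize(x, scale), scale) - x).mean()
print(f"scale={scale:.5f}, mean absolute quantization error={err:.5f}")
```

Real flows additionally use richer statistics than a simple min/max, choose scales per channel, and rewrite the compute graph (folding and fusing operators) so that the quantized model matches the FP32 reference as closely as possible.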
Access control is a fundamental element in safeguarding both physical and digital environments. Integrating vision AI has significantly advanced access control systems, offering a level of automation and intelligence previously unattainable, especially for biometric access control systems. Yet the challenge remains: how can we speed up verification without compromising accuracy? More specifically, how do we reduce false positives and negatives?
This blog explores the current and future state of AI access control, the pivotal role of verification speed, and a method to increase verification speed without increasing false positives in security.

The Evolution of Vision AI in Access Control
At present, vision AI applications in (biometric) access control systems are primarily used for identification and verification, and sometimes for motion detection and behavior analysis. With technologies like facial recognition, object detection and anomaly detection, we've moved from reactive to proactive security measures. Looking ahead, we envision a more sophisticated integration of AI in access control, where adaptive learning algorithms can predict potential security breaches before they occur, and personalized access protocols cater to the unique security requirements of individual users or entities.

The Critical Importance of Speed in Verification
In today's fast-paced world, rapid verification in access control is not just a convenience; it's a necessity. Delays in access verification can lead to bottlenecks in high-traffic environments, disrupt operations, and degrade the user experience. More critically, the speed at which individuals can be verified and granted access can be a matter of life and death. Limited processing speed and performance headroom in the equipment used can also increase the risk of missed detections of people or objects, because the system cannot afford more advanced and more reliable image processing, such as the latest neural networks (for example YOLOv8), picking the best picture from several, alignment, and real-time matching.

Why Accuracy Matters Too
Every millisecond saved in the verification process enhances the user experience and operational efficiency. However, every incorrect decision made by the system, be it a false positive or a false negative, undermines trust in the security framework and can cause delays itself. High-traffic environments, such as airports, commercial buildings, and public events, require a solution that combines high-speed, high-accuracy verification to maintain security without disrupting the flow of movement. The goal, therefore, is a verification process that is not only fast but also reduces false positives and false negatives in security to the absolute minimum.

The Challenge with Current AI Accelerators
Current AI accelerators have made significant strides in improving the efficiency of running vision AI models. However, they often face a trade-off between speed and accuracy, as they commonly deploy 8-bit integer inference arithmetic instead of 32-bit floating-point full precision. High verification speeds can sometimes result in increased false positives and negatives, as the security and surveillance systems may not spend enough time analyzing the data to make accurate decisions. This is particularly problematic in access control, where errors can either compromise security by allowing unauthorized access or hinder operations by denying access to legitimate users.
Therefore, eliminating false negatives and false positives in the machine learning used for automatic identification is important. Fortunately, Axelera AI has solved the challenge of reducing the precision of the mathematical computations without any practical accuracy loss, eliminating the additional false positives that vision AI accelerators can otherwise introduce into security processes.

"The exceptional performance and accuracy of the Axelera AI acceleration platform have significantly fueled our collaborative efforts. Its unmatched performance-to-price ratio, surpassing traditional GPUs and dedicated AI processing units, has been critical in our selection process. We are confident that leveraging their state-of-the-art YOLO performances will empower us to tackle new challenges in our current and future video analysis applications."
Alexandre Perez, R&D Director at XXII

How We Accelerated Vision AI Applications Without Accuracy Loss
To address the challenges outlined above, our engineers took a radically different approach to data processing. By combining Axelera's proprietary digital in-memory computing technology (D-IMC) with a unique post-training quantization method, Axelera has created the Metis AIPU, the most powerful AI accelerator for the edge you can buy today. Its unmatched efficiency and accuracy redefine the standard for AI access control. The technology ensures that vision AI models run with the same accuracy as on PCs or GPUs (FP32 equivalent), but at significantly lower cost and power consumption, while delivering the highest level of accuracy to minimize false positives and negatives. It can make biometric access control systems not only efficient but also highly reliable.
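To illustrate the false-positive/false-negative trade-off at the heart of this discussion, here is a small, self-contained sketch that compares biometric embeddings with cosine similarity against a decision threshold. The embeddings are random stand-ins rather than outputs of a real face-recognition model, and the 0.6 threshold is an arbitrary illustrative value.

```python
# Illustrative threshold-based biometric verification (not a product API).
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(probe, enrolled, threshold=0.6):
    """Accept when the similarity between embeddings exceeds the threshold.

    A higher threshold lowers false accepts (unauthorized entry) but raises
    false rejects (legitimate users denied); it is tuned on validation data.
    """
    return cosine_similarity(probe, enrolled) >= threshold

rng = np.random.default_rng(0)
enrolled = rng.normal(size=512)                       # stored template
same_person = enrolled + 0.1 * rng.normal(size=512)   # small intra-class variation
impostor = rng.normal(size=512)                       # unrelated identity

print(verify(same_person, enrolled))   # expected: True  (genuine match)
print(verify(impostor, enrolled))      # expected: False (rejected)
```

The better the model preserves accuracy at inference time, the wider the gap between genuine and impostor similarity scores, so the threshold can be kept strict without slowing legitimate users down.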
Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI

At this year's World Economic Forum in Davos, the spotlight was firmly placed on artificial intelligence (AI), reflecting its growing importance across various sectors. The discussions not only highlighted AI's expansive role but also emphasized the evolving trend of edge computing driven by specialized hardware accelerators. The topic captivated the forum for several days due to its impact on scaling AI applications, the accelerating pace of technological advancements, and the democratization of AI through open-source models. Among the people at the center of the debates and on-stage discussions were Yann LeCun, Kai-Fu Lee, Daphne Koller, Andrew Ng, and Aidan Gomez, who contributed deep insights into the potential and direction of AI growth. Here are some deeper insights into these topics, offering a glimpse into the future shaped by AI and edge computing.

AI's Ubiquity in Davos Discussions
AI dominated discussions in Davos, underscoring its critical role in both posing challenges and offering solutions. This ranged from ethical considerations and privacy concerns to AI's potential in enhancing safety and efficiency in industries such as surveillance, healthcare, finance, and manufacturing.

Strategic Imperative of AI Adoption
There was a consensus on the need for comprehensive AI strategies within the next five years. This goes beyond merely adopting AI technologies; it involves integrating AI into core business processes, understanding its impact on customer engagement, and rethinking how AI can drive innovation and competitive advantage.

AI as a Collaborative Partner
AI was widely recognized as a collaborator that augments human capabilities. This concept extends to various sectors, from creative industries using AI for design and content generation to legal and medical fields where AI assists in analysis and diagnostics, enhancing the expertise of professionals.

The Need for AI Fluency
A recurring theme was the importance of AI literacy in the workforce. This means not just understanding AI but being adept at leveraging AI tools for decision-making, problem-solving, and innovation. It highlights the need for continuous learning and upskilling in the age of AI.

AI and Productivity: A Symbiotic Relationship
Discussions also focused on AI's role in boosting productivity, especially in the context of aging populations and slower economic growth. AI's ability to automate complex tasks and analyze large data sets can drive efficiency, leading to job creation in AI development, management, and maintenance.

AI as a Catalyst for Scientific Discovery
AI's potential to revolutionize scientific research was a prominent topic. From drug discovery and climate modeling to exploring new materials, AI's ability to process vast amounts of data and identify patterns can significantly accelerate scientific breakthroughs.

The Open Source AI Debate
The role of open-source AI was acknowledged as vital in democratizing access to AI technologies. However, concerns were raised about the safety and ethical use of AI, emphasizing the need for robust governance frameworks to manage these open-source resources responsibly.

AGI: A Work in Progress
Artificial General Intelligence (AGI) was discussed as an emerging yet influential area. While current AI systems excel in specific tasks, the pursuit of AGI aims at creating more versatile, human-like intelligence, marking a significant leap in AI capabilities.
While today's AI systems exhibit increasing levels of generality, there is a clear need for further advancement to enhance their overall applicability. Despite the growing sophistication of AI, it notably lacks certain core aspects intrinsic to human intelligence. Key among these are the abilities to learn from a limited number of examples and to achieve visual grounding. Intriguingly, these areas are currently at the forefront of AI research, sparking considerable interest and anticipation for significant progress in the coming year.

Tailored AI
While 2023 was the year of general large language models, 2024 will be the year of customized experiences. For consumers, OpenAI has just released its AI store with millions of customized models serving specific purposes. In the business-to-business market, companies will start deploying custom models, tailored to specific applications and fine-tuned with proprietary data, preserving privacy, security and intellectual property.

AI at the Edge: The Future of Digital Interactions
A key foresight from Davos was the move towards processing data at the edge, in proximity to the user, facilitated by hardware accelerators. This approach is crucial for real-time processing and response, essential for applications ranging from Industry 4.0 and autonomous vehicles to smart cities, where delays in data processing can have critical implications.

The Axelera AI Revolution
As Europe's largest player in the AI acceleration space, we are pioneering this shift towards edge-centric AI. Our focus on developing cutting-edge hardware accelerators is pivotal in bringing the power of AI closer to where data is generated, reducing latency, enforcing data privacy, and enhancing efficiency. This is not just about advancing technology; it's about reshaping how we interact with and benefit from AI in our daily lives. As we lead this charge, Axelera AI remains committed to innovating and driving forward a future where AI is more accessible, efficient, and integrated into the fabric of our evolving digital world.
Evangelos Eleftheriou | CTO at AXELERA AI

The Metis AI Platform is a one-of-a-kind holistic hardware and software solution establishing best-in-class performance, efficiency, and ease of use for AI inferencing of computer vision workloads at the Edge. It encompasses the recently taped-out high-performance Metis AI Processing Unit (AIPU) chip, designed in 12nm CMOS, and the comprehensive Voyager Software Development Kit (SDK).

Axelera's Metis AIPU
Axelera's Metis AIPU is equipped with four homogeneous AI cores built for complete neural network inference acceleration. Each AI core is self-sufficient and can execute all layers of a standard neural network without external interactions. The four AI cores can either collaborate on a workload to boost throughput, operate on the same neural network in parallel to reduce latency, or concurrently process different neural networks required by the application.
The AI core is a RISC-V-controlled dataflow engine delivering up to 53.5 TOPS of AI processing power. It features several high-throughput datapaths to provide balanced performance over a vast range of layers and to address the heterogeneous nature of modern neural network workloads. The total throughput of Axelera's four-core Metis AIPU can reach 214 TOPS at a compute density of 6.65 TOPS/mm².
At the heart of each AI core is a massive in-memory-computing-based matrix-vector multiplier to accelerate matrix operations, and thereby convolutions, offering an unprecedentedly high energy efficiency of 15 TOPS/W. These matrix-vector multiplications constitute 70-90% of all deep learning operations. In-memory computing is a radically different approach to data processing, in which crossbar arrays of memory devices are used to store a matrix and perform matrix-vector multiplications at constant O(1) time complexity without intermediate movement of data. Matrix-vector multiplication via in-memory computing is therefore extremely efficient for deep learning workloads.

Accuracy and noise immunity with D-IMC
Axelera AI has fundamentally changed the architecture of "compute-in-place" by introducing an SRAM-based digital in-memory computing (D-IMC) engine. In contrast to analog in-memory computing approaches, Axelera's D-IMC design is immune to the noise and memory non-idealities that affect the precision of analog matrix-vector operations as well as the deterministic nature and repeatability of the matrix-vector multiplication results. Our D-IMC supports INT8 activations and weights, but the accumulation maintains full precision at INT32, which enables state-of-the-art FP32 iso-accuracy for a wide range of applications without the need for retraining.
The D-IMC engine of the matrix-vector multiplier is a handcrafted full-custom design that interleaves the weight storage and the compute units in an extremely dense fashion. Besides saving energy by not moving weights, energy consumption is further reduced by a custom adder with minimized interconnect parasitics, with balanced delay paths to avoid energy-consuming glitches, and with judicious pipelining to provide high compute throughput at low supply voltage. Although the matrix-vector multiplier supports a large matrix size, energy efficiency stays high even at low utilization thanks to both activity and clock gating.
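As a purely numerical illustration of the scheme described above (not of the D-IMC hardware itself), the sketch below multiplies INT8 weights and activations while accumulating in INT32, then rescales the result to floating point; a 512-element dot product stays far below the INT32 range, which is why no precision is lost during accumulation. The matrix size and scale values are illustrative assumptions.

```python
# Numerical sketch of INT8 operands with INT32 accumulation (illustration only).
import numpy as np

rows, cols = 512, 512                      # matrix sized like one in-memory array
rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=(rows, cols), dtype=np.int8)
activations = rng.integers(-128, 128, size=cols, dtype=np.int8)

# Accumulate in INT32: the worst-case dot product is 512 * 128 * 128 = 8,388,608,
# far below the INT32 limit, so the accumulation itself introduces no error.
acc = weights.astype(np.int32) @ activations.astype(np.int32)

# Per-tensor scales from quantization map the integer result back to float.
scale_w, scale_a = 0.02, 0.05              # illustrative scale values
output_fp32 = acc.astype(np.float32) * (scale_w * scale_a)
print(acc.dtype, output_fp32[:4])
```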
Note that the matrix coefficients can be written to the D-IMC engine in the background without stalling the computations.
In addition to the D-IMC-based matrix-vector multiplier, each AI core features a unit for block-sparse diagonal matrix operations to provide balanced performance for layers such as depth-wise convolution, pooling and rescaling, which have a high IO-to-compute ratio compared to normal matrix-vector multiplications. Lastly, to address element-wise vector operations and other non-matrix-based operations, including activation function computations, a stream vector unit is provided. This unit can operate on floating-point numbers to address the increased numerical precision requirements of those functions.
Providing massive compute power is only one consideration. Having high-throughput, high-capacity memory close to the compute elements is equally important for good overall performance and power efficiency: besides 1 MiB of computational memory in the matrix-vector multiplier that can be accessed at several tens of terabits per second, each core features 4 MiB of L1 memory that can be accessed by multiple streams concurrently with an aggregated bandwidth of multiple terabits per second. The combination of these two memories offers a total of 5 MiB of tightly coupled high-speed memory within a single AI core.

A fully integrated SoC
The four AI cores are integrated into a System-on-Chip (SoC) comprising a RISC-V control core, PCIe, LPDDR4x, an embedded Root of Trust, an at-speed crypto engine and large on-chip SRAM, all connected via a high-speed, packetized Network-on-Chip (NoC). First, the application-class RISC-V control core, running a real-time operating system, is responsible for booting the chip, interfacing with external peripherals and orchestrating collaboration between AI cores. Second, PCIe provides a high-speed link to an external host for offloading full neural network applications to the Metis AIPU. Finally, the NoC connects the AI cores to a multi-level shared memory hierarchy with 32 MiB of on-chip L2 SRAM and multiple GiB of LPDDR4 SDRAM, ultimately connecting more than 52 MiB of on-chip high-speed memory if the memories of the AI cores are included. The NoC splits control and data transfers and is further optimized to minimize contention when multiple data managers (AI core, RISC-V core, or external host) simultaneously access the AI cores and the higher-level memories in the memory hierarchy. As such, it offers more than a terabit per second of aggregated bandwidth to the shared memories, ensuring the AI cores will not stall in highly congested multi-core scenarios.
By pairing the massive compute capabilities provided by our D-IMC technology with an advanced memory subsystem and a flexible control scheme, the Metis AIPU chip can handle multiple demanding complete neural network tasks in parallel, with unparalleled energy efficiency.

Axelera's Voyager SDK
The Voyager SDK provides an end-to-end integrated framework for Edge application development. It is built in-house with a focus on user experience, performance, and efficiency. In its first release, the SDK is optimized specifically for the development of computer vision applications for the Edge and enables developers to adopt and use the Metis AI platform for these use cases quickly and easily.
Voyager takes users along the entire development process without requiring them to understand the internals of the Metis AIPU or to have expertise in deep learning: developers can start from turnkey pipelines for state-of-the-art models, customize these models to their particular application domain, deploy to Metis-enabled devices and evaluate performance and accuracy with one-click simplicity.
As part of Axelera's Metis AI platform, developers are provided access to the Axelera Model Zoo, which is accessible on the web and via cloud APIs. The Model Zoo offers state-of-the-art neural networks and turnkey application pipelines for a wide variety of use cases such as image classification, object detection, segmentation, key-point detection and face recognition. Developers can also import their own pre-trained models with ease: Axelera's toolchain automatically quantizes and compiles models that have been trained in many different ML frameworks, such as PyTorch and TensorFlow, and generates code that runs on the Metis AIPU with industry-leading performance and efficiency.
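As an illustration of the kind of hardware-agnostic artifact such a toolchain can ingest, the sketch below exports a pretrained PyTorch model to ONNX using standard PyTorch APIs. It is not the Voyager SDK interface; the output file name and the choice of ResNet-50 are arbitrary choices for the example.

```python
# Export a pretrained PyTorch model to a framework-neutral ONNX file, the kind
# of portable artifact an inference toolchain can then quantize and compile.
# (Plain PyTorch; not the Voyager SDK API.)
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
dummy_input = torch.randn(1, 3, 224, 224)        # one 224x224 RGB image

torch.onnx.export(
    model,
    dummy_input,
    "resnet50.onnx",                             # arbitrary output path
    input_names=["images"],
    output_names=["logits"],
    opset_version=17,
    dynamic_axes={"images": {0: "batch"}},       # allow variable batch size
)
print("exported resnet50.onnx")
```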
Florian Zaruba | Technical CPU Lead at AXELERA AI

Abstract – Recently, Transformer-based models have led to significant breakthroughs in several forms of generative AI. They are key in both increasingly powerful text-to-image models, such as DALL-E or Stable Diffusion, and language and instruction-following models, such as ChatGPT or Stanford's Alpaca. Today, such networks are typically executed on GPU-based compute infrastructure in the cloud because of their massive model sizes and high memory and bandwidth requirements. Bringing real-time generative transformers to edge devices would greatly expand their applicability. To this end, this article discusses bottlenecks in transformer inference for generative AI at the Edge.

Figure 1: Encoder/Decoder stacks, Dot-Product and Multi-Head attention. Images are taken from "Attention Is All You Need" [1].

Introducing Transformers
Transformer models, first introduced in the 2017 research paper "Attention Is All You Need" [1], have been firmly established as the state-of-the-art approach in both sequence modeling problems, such as language processing, and in image processing [2]. The architecture is based solely on attention mechanisms, as opposed to Recurrent or Convolutional Neural Networks. Compared to recurrent networks, this makes transformers much faster to train, as model execution can be parallelized rather than serialized. Compared to convolutional neural networks, the attention mechanism increases modelling capacity.
A transformer model typically contains an Encoder and a Decoder stack, see Figure 1. Here, the encoder maps an input sequence of tokens, such as words or embedded pixels, onto a sequence of intermediate feature representations. The decoder uses this learned intermediate feature representation to generate an output sequence, one token at a time. The encoder stack consists of N identical layers, split into multi-head attention, normalization, element-wise addition, and fully connected feed-forward sublayers. The decoder stack differs in that it inserts a second multi-head cross-attention sublayer, performing attention over the output of the encoder stack as well as over the newly generated output tokens. Figure 1 illustrates this typical Encoder/Decoder setup, as well as the concept of multi-head dot-product attention. We refer the reader to [1] for a detailed discussion.
Since their introduction in 2017, transformer topologies and network architectures have largely remained the same, increasing their functionality through better training on more complex data rather than through architectural changes. The architecture shown in Figure 1 is now used mostly unchanged in state-of-the-art Large Language Models such as ChatGPT [3], Falcon [4], Guanaco [5], Llama [6] or Alpaca [7]. The quality of these models varies depending on how they are trained and on their size, as illustrated on the Hugging Face leaderboard [8]. Smaller models contain fewer layers (lower N) and have lower embedding dimensions (smaller E). State-of-the-art large language models now contain between 7 and 65 billion parameters in their feedforward connections.

Challenges in Transformer Inference
Inferring transformer models on an Edge device is challenging due to their large computational complexity, large model size and massive memory requirements.
On top of that, computational and memory requirements can be badly balanced in a modern AI accelerator, which typically focuses on implementing many cheap parallel compute units while having limited memory capacity and bandwidth available due to physical, size and cost constraints. However, transformers are often memory-capacity and memory-bandwidth bound, as discussed below.
Transformer models can primarily be used in three ways: (1) encoder-only, typically in classification tasks, (2) decoder-only, typically in language modeling, and (3) encoder-decoder, typically in machine translation. In the decoder-only case, the encoder is removed, input tokens are fed directly to the decoder, and there is no cross-attention module. It is especially the execution of the decoder mode that is challenging, but even encoding can come at a high computational cost.

Figure 2: Number of operations required in a decoder consuming S tokens and generating S tokens. (left) Without caching optimizations, (right) with caching optimizations.
Figure 3: Demonstration of the KV caching mechanism. Figure courtesy of NVIDIA [9].

A. Computational Cost
The computational cost of transformers is extremely high, as discussed in the survey by Yi Tay and colleagues [10]. The authors show that the number of computations in a Transformer can be dominated by the multi-head self-attention module, whose complexity scales quadratically with the sequence length s. This is particularly challenging in vision transformers, where the sequence length scales with the number of pixels in an image, and when trying to interpret or generate large portions of text, with potentially thousands of words or tokens. This is illustrated in Figure 2, showing the number of operations required in a decoder-only transformer for various embedding sizes E, sequence lengths S, and numbers of layers N. Figure 2 (left) shows the number of operations required without caching optimizations, Figure 2 (right) the number of operations with caching optimizations, see below. Note that the sequence length mostly dominates the computational cost, due to the quadratic dependency on sequence length in self-attention.
In KV caching, intermediate data is cached and reused rather than recomputed. Instead of recomputing the full key and value matrices in every iteration of the decoding process, some intermediate feature maps (the Key and Value matrices) can be cached and reused in the next iteration, see Figure 3 (a minimal code sketch of this mechanism appears at the end of this article). This caching mechanism reduces the computational complexity of decoding by exchanging computation for data transfers, essentially lowering the application's arithmetic intensity even further. The memory footprint of this KV cache can be massive, with up to terabytes of required memory capacity for relatively small sequence lengths in a state-of-the-art LLM.
A large body of research focuses on reducing this computational complexity. Here, complexity is not reduced by KV caching, but either (1) by finding ways to break the quadratic complexity of self-attention through subsampling or downsampling the field of view, or (2) by creating different types of sparse models that can be conditionally executed. See Figure 4 or the survey by Tay and colleagues [10] for a full overview of recent techniques in efficient transformer design. Notable works such as Linformer [11] and Performer [12] manage to reduce the complexity from O(s^2) to O(s) at a limited accuracy cost.
Other works, such as GLAM [13], keep the O(s^2) complexity but reduce the computational cost by introducing various forms of sparsity. Though these works do reduce the computational complexity of transformers, especially at large sequence lengths, their overall success has been limited, and they are not yet used in the latest state-of-the-art ChatGPT-like models.
Another mainstream approach used to reduce transformers' computational cost and memory footprint is to aggressively quantize both the intermediate features and the weights, often down to 8 or 4 bits, without losing accuracy [14].

Figure 4: Overview of efficient transformer models [10]
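To make the KV-caching idea discussed above concrete, here is a minimal single-head decoder-attention sketch with a growing key/value cache. The embedding size, random projection weights and toy inputs are illustrative assumptions; real models add multi-head projections, batching and masking, but the principle of projecting only the newest token and attending over cached keys and values is the same.

```python
# Minimal KV-caching sketch for autoregressive decoding (single head, no batching).
# Without the cache, the keys and values of all previous tokens would be
# recomputed at every step; with it, each step projects only the newest token.
import numpy as np

E = 64                                    # embedding size (illustrative)
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.normal(scale=E ** -0.5, size=(E, E)) for _ in range(3))

k_cache, v_cache = [], []                 # grow by one entry per generated token

def decode_step(x_t):
    """One decoding step for the newest token embedding x_t (shape [E])."""
    q = x_t @ W_q
    k_cache.append(x_t @ W_k)             # cache instead of recomputing past K/V
    v_cache.append(x_t @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # [t, E] each
    scores = (K @ q) / np.sqrt(E)                 # attention over cached positions
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V                            # attended output, shape [E]

for _ in range(8):                        # eight decoding steps with random tokens
    out = decode_step(rng.normal(size=E))
print(len(k_cache), out.shape)            # cache holds 8 entries; output is (64,)
```

The trade-off described above is visible here: each step projects only a single new token instead of re-projecting every previous one, but the cache grows linearly with the sequence length (and, in real models, with the number of layers and heads), which is exactly where the memory-capacity and bandwidth pressure comes from.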
Introducing Stephen Owen, Axelera AI Advisor

Axelera AI has recently welcomed Stephen Owen as an Advisor. Stephen is a highly experienced executive with over 16 years of board-level experience in an S&P Top 500 semiconductor company. He has extensive knowledge in global leadership, organizational management, sales, and marketing. Stephen successfully led a global team of over 1,800 employees, consistently delivering exceptional results that significantly boosted the company's performance over several years. He formerly served as the Global Marketing and Sales EVP at NXP Semiconductors. Read or watch the interview to learn more about his insights and experiences.

You're a seasoned executive with extensive experience in the high-tech industry. What sparked your interest in the AI sector and led you to join Axelera AI as an advisor?
In my previous roles over the last few years, we began to explore AI, machine learning, and the use of algorithms to improve a variety of different areas. Administrative tasks, for instance, but what's more interesting is the application in vehicles—machine learning aids with infrastructure, secure edge, secure IoT.
Also, in the sales department, we tried to understand how customers engage with us as a tech company. We aimed to create a database that could interact in such a way that we could serve not only 500 direct customers but also 50,000 indirectly online. It's a massive challenge—you can't possibly staff enough people for that. You need to move that interaction online, and to provide meaningful responses, you have to employ AI. So you build this database of questions and answers and refine the system with keyword searches, among other tools. And that's what we built upon.
Technically, we've applied various innovations related to microprocessors. This is where my interest was piqued, especially when I began to connect with the people at Axelera AI, which then led me to look beyond just processing to AI and cloud services and so on.

From your perspective, what are the most compelling opportunities and challenges facing startups in the AI solutions space today? And how is Axelera AI positioned?
The biggest challenge for many companies in the AI space, such as Axelera AI, is deciding what to focus on first. AI is a pretty generic term, and there are so many different directions you could go in.
Axelera AI is focusing on the imaging space, which has a huge market and significant potential. Then, there are the machine learning opportunities, such as predictive maintenance programs. The medical space is also an exciting opportunity, intersecting with imaging. It involves looking at mammography, X-rays, scans, and using AI to detect tumors. Predictive maintenance for hospitals, as well as in robotics and industrial environments, can save millions of dollars. Predicting breakdowns well in advance saves a considerable amount of time. On the imaging front, the possibilities are endless for how it can be used, from identification in a commercial or retail environment to distinguishing between bad and good actors. The potential extends even beyond that.
For companies like Axelera AI, the opportunities are indeed fantastic. But the key is really about focus. Don't try to do everything for everyone. Pick your markets.
Axelera AI is doing just that, concentrating on imaging.

How do you envision the transition from cloud to edge computing altering the broader technological landscape?
For over a decade, we've anticipated the proliferation of IoT nodes, and now we're seeing it come to fruition, especially during the COVID period, with an uptick in devices connecting from homes and offices. This increase has pressured infrastructure to manage higher data rates, something traditional data centers alone can't handle. The evolving solution involves super powerful data centers performing the heavy computing, then distributing that data back to local environments. The capability of local nodes—devices at the edge—has greatly improved. They can now handle more complex tasks, including their own AI calculations, reducing the latency issues associated with data constantly moving to and from the central data centers.
A crucial element in this transition is addressing security. Secure edge computing is essential for protecting the vast amount of data generated at the edge and in the cloud. We're going to see a shift towards more localized system operations, which is where the real work happens and is most needed.
In the automotive industry, for instance, it's impractical to rely on data centers far removed from the action. Processing needs to happen on-site, rapidly and efficiently, to recognize and respond to situations. The same principle applies to retail and other industries—speed and local processing are of the essence.

With the increasing shift towards edge computing and the strain on data centers, how do you foresee companies adapting their infrastructure strategies to optimize performance and efficiency in this evolving landscape?
The primary shift will be to delineate the two domains: the data center and the edge computing sectors. Introducing more potent accelerators into the mix is where organizations like Axelera AI become pivotal. With high-end computing platforms and AI accelerators integrated into systems, edge devices will gain significantly more computational power. This will enable them to handle more complex computations and use cases that were previously unattainable.
This evolution implies that companies will need to collaborate to create a cohesive system. They'll need to work in unison to run AI models and think strategically about constructing infrastructure that functions as a unified entity. That's one of the formidable challenges today. Traditionally, companies have operated their own infrastructures and solutions in isolation, but there's a gradual shift happening. As the necessity for cooperation becomes more apparent, startups are recognizing this need. Increasingly, larger corporations are also starting to acknowledge this trend and are beginning to transition as well.

So you're saying it's a shift towards more of an ecosystem?
Of course, the movement is indeed towards a more ecosystem-oriented direction. This is exemplified by the alliance of consumer electronics companies with chip manufacturers in creating the Matter standard. Matter enhances plug-and-play capabilities, making it easier for devices to work together out of the box, and even competitors are providing solutions that are interoperable. This ecosystem approach, designed for consumer benefit, is also essential in larger-scale systems such as edge computing, security, and AI.
These sectors require companies to collaborate, creating models and systems that can operate together efficiently across various platforms, whether it's RISC-V or ARM-based systems, or accelerators designed for specific architectures. Ultimately, the AI software that powers these systems must be neutral and flexible, able to operate across different hardware environments. This neutrality ensures that AI can be a versatile tool, capable of being implemented in multiple ecosystems, serving the broader purpose of enhancing consumer convenience and experience.

What emerging trends within AI and edge computing excite you the most?
It's a big list. My focus, particularly in the medical arena, is on advancing women's health. It's a comprehensive field where significant progress is needed and where I believe we can make meaningful advancements.
In the realm of industrial automation, we're witnessing a shift towards what are known as 'dark factories.' These are spaces where fewer people are needed, and the factories can operate in the dark because the machines and robotic systems take over. This allows us to reallocate human resources to other tasks and increase overall efficiency.
Furthermore, automotive and mobility represent another area ripe for innovation. With the amount of traffic congestion globally, there's an enormous amount of carbon fuel wasted. If we can develop an infrastructure that interlinks homes, cars, trains, buses, scooters, and more, and streamline it with better data, we can save significantly. Not only in terms of fuel but also by making our cities safer and more efficient places to live. It's about connecting everything.
Yeah. But it's about utilizing that connection, taking advantage of the fact that there is such a big connection of so many systems, and then concentrating on how to make those systems and ecosystems work together and get the maximum efficiency out of it.

How do you think open-source architectures like RISC-V are influencing the dynamics of the AI semiconductor industry?
RISC-V is making quite an impact on the AI semiconductor industry. It has been a refreshing development, allowing many startups to expand their businesses more rapidly. Additionally, it has offered consumers and customers a new perspective, a different way to look at their computing needs.
Alternatives are incredibly important in this market. From what I've seen in my experience with bringing products to market, customers can become concerned when there is only a single system available without alternative options. With RISC-V's expansion, more semiconductor companies are considering providing solutions that include both RISC-V and ARM. This indicates that RISC-V is poised to have a significant future alongside ARM.
Companies like Axelera AI are taking advantage of this opportunity. They are focusing on RISC-V and ensuring that their software can operate in both the RISC-V and ARM ecosystems. This adaptability is a major advantage that can help make customers feel more at ease. It also presents a great opportunity for Axelera AI to attract new customers.

Despite the perception of fragmentation, do you see a burgeoning market for AI at the edge, especially compared to cloud-based solutions?
It really depends on whether you're considering pure software applications or those that are tied to real-world, local functionalities.
Take, for instance, camera-based or imaging-based systems where Axelera AI is likely focused—these are inherently at the edge.Data predominantly will end up being sent back to the data center, but a substantial amount of processing occurs at the edge. Immediate reactions and responses, whether in security or retail environments, are handled locally and in real-time.We’re observing a proliferation of systems at the edge, which is crucial. More companies are moving towards edge computing, realizing that their systems become more efficient and interactive, with the ability to perform transactions and interactions much faster, all thanks to secure edge computing. AI is instrumental in accelerating this, as data doesn’t just go back to the data centers. A lot of the work happens at the edge, where information can be consolidated, and machine learning models can be further refined before being pushed back to the edge.So, while there’s a valid place for cloud-based solutions, the applications at the edge are numerous. Many companies, when they direct their focus appropriately, will likely see their businesses grow.
Florian Zaruba | Technical CPU Lead at AXELERA AI

"RISC-V is inevitable" – it has become the mantra of RISC-V, and it's true. But before we see why that is, let's step back and discuss what RISC-V is and why we should care.
Back in the early days, computers were large and bulky, taking up entire rooms and making a lot of noise. It's incredible to think about how much technology has advanced over the years. Nowadays, we're surrounded by microprocessors that make our lives easier without us even realizing it. For example, when we use our bank cards to pay at the supermarket, we're actually using a full computer that encrypts, signs and verifies our transactions. And our phones contain dozens and dozens of processors that make all the conveniences of modern life possible.

What is RISC-V?
RISC-V (Reduced Instruction Set Computer) is an open-source Instruction Set Architecture (ISA). You may have heard of other well-known ISAs such as x86, ARM, Power, and MIPS.
The RISC-V ISA is the common, standardized language between the processor (hardware) and the programs running on it (software). With RISC-V, the processor itself can only do very basic things, such as (conditionally) adding or subtracting two numbers and deciding what to do based on those arithmetic outcomes. It can also repeat those arithmetic operations several times until a certain condition is reached. In fact, there are only 37 instructions in the base instruction set. But these are the very basic ingredients needed to describe essentially any problem to the processor. We call those commands instructions.
By defining the exact meaning (semantics) and spelling (syntax) of those instructions, you can start to build a common understanding between the processor and the software running on it. You can think of the ISA as a dictionary. The RISC-V ISA is free for anyone to access. This is incredibly important, because today's software has become so complex that no single company can manage it alone, and a common standard ensures interoperability.

RISC-V came to the right place, by the right people, at the right time
David Patterson, a visionary in the field of computer architecture, spearheaded the development of Reduced Instruction Set Computers (RISC) at Berkeley. The project went through five iterations, which ultimately led to the creation of the RISC-V architecture. Initially, RISC-V mainly piqued academic curiosity. However, the availability of open-source processor implementations, coupled with rising geopolitical tensions, quickly drew industrial interest as well. From that point, the project experienced a snowball effect, gaining momentum and broader adoption.

Open source breeds innovation
The primary advantage of RISC-V lies in its licensing flexibility. Although the specifications for other ISAs are publicly available, legal constraints prevent you from implementing and selling processor hardware unless you have obtained a license for the ISA specification. With RISC-V, this limitation is removed, offering greater freedom for innovation and commercialization.
The restrictions that most ISAs carry limit their use by smaller innovators. For example, the widely used x86 ISA allows only a handful of licensees. Arm, another well-known provider of processor IP, has an entirely different business model: it sells processor IP that can be licensed and integrated into your product.
Only the biggest players in the market are able to own an architectural license that allows them to implement their own processors. Most notable of these is Apple, which switched its entire product line from Intel x86 processors to its own Arm implementation. Their investment in this endeavor proved to be a wise decision both technically and economically, as it enabled them to distinguish their product line even further and surpass their rivals by a significant margin.
In the past, designing your own processor while still tapping into an expansive software ecosystem was thought to be reserved solely for large corporations. However, that's not entirely true. Alternatives like OpenRISC existed, but they weren't widely adopted because their software ecosystems were underdeveloped, and they were often considered hobbyist projects.
Another notable benefit of RISC-V is its built-in modularity, which contrasts with ARM's more rigid structure. The RISC-V architecture is designed from the ground up to be modular, allowing for easier customization and innovation. While ARM tightly controls and protects its ISA specification, making it difficult to introduce your own specialized features, RISC-V offers the flexibility to add custom extensions and your "secret sauce." This is particularly advantageous if you aim to target specific market segments and differentiate your product from competitors.
Remember those 37 instructions I mentioned? It turns out this is only the very base instruction set. For most applications, you would likely want to include some more specialized instructions, such as floating-point support, atomic memory operations, or maybe even scalable vector extensions. You don't need to, but you can. In the same way, your company could add its own instructions if they are beneficial for the application in mind. This means you can still leverage the entire software ecosystem that has been and is being built, and only add value on top instead of reinventing the wheel. So let's have a look at some of these advantages.

How can RISC-V help start-ups design their own chips?
For me, the two most important aspects for a hardware-producing start-up are time-to-market and freedom to innovate. Most silicon/hardware start-ups innovate and try to provide value-add in their niche. The processor is a necessary, boring commodity and usually not the main star of the product. This is good and the way it should be. Otherwise, we would all be selling the same processors. RISC-V is an absolute game changer for that.
Do you want to quickly try out an integration idea? Prototype something on an FPGA? Before RISC-V, this either meant you needed to use some FPGA vendor's custom ISA (which you couldn't send to manufacturing for your final product) or start negotiations with the processor vendor of your choice. Negotiations are hard for a small silicon start-up. With RISC-V you can grab a core from the internet, do your trials, and start building your software stack. Once you are happy with the prototype and want to move forward, there is a plethora of RISC-V IP vendors to choose from. Choice and market competition are a good thing.
Expanding on the points raised earlier, the freedom to innovate with RISC-V comes without the risk of architectural lock-in. This has two advantages. First, you are free to switch IP vendors: your software stack will still work the same because you adhered to an open standard.
Second, you can provide your own extensions on top of the main RISC-V standards, so the company can deliver very concentrated value-add without the need to maintain or develop commodity components. Let's have a look at what we've done at Axelera.
At the heart of our Metis AI Processing Unit there are four AI Cores. Each AI core provides a 512×512 matrix-vector multiplication (MVM) in-memory compute array and a vector datapath that operates on streams of data. To provide a generic solution that can keep up with the fast evolution in the field of neural networks, we kept the datapath control as flexible as possible. We aim to push as much as possible into low-level driver software, where we can innovate, correct, and adapt throughout the product's life cycle. Therefore, each AI Core has a dedicated application-class RISC-V core that has full control over the datapath unit. A fifth system controller provides SoC management and workload dispatch to the individual AI cores.
We confidently chose the RISC-V ISA because we could start virtual prototyping and building our software stack on open and free software long before we had entered any IP negotiations. Furthermore, we knew that the IP landscape would provide us with sufficient competitive choices for our needs. In roughly a year, we went from concept/FPGA prototype to tape-out, and just a couple of days after receiving the chips back, we could run the first end-to-end neural networks on our lab benches. Taking the limited resources and this aggressive time schedule into account, it was clear that we would need to buy proven IP without long negotiations to kick-start our development. Choosing to proceed with RISC-V ensured that we could continue to innovate in future generations and further distinguish our offerings.
So, is it all love, peace, and harmony? Not exactly. Like anything that's still a work in progress, RISC-V has its own set of hiccups and curveballs. An ISA is mainly about instructions, but a processor is part of a bigger system that includes memory, peripherals, and other processors. This broader setup, often called a platform, isn't covered in the main ISA spec. The RISC-V Foundation aims to standardize these aspects too. However, details matter. Things like atomic handling in multi-hart systems may only be standardized for specific platforms. Also, some IP providers might have legacy elements because they developed solutions before RISC-V standards were set. Standards are scattered, so you have to keep an eye out. And given the rapid growth, it's tough to stay updated on all the ongoing activities.

Looking ahead
Remember those room-filling mainframes and how technology has continuously shrunk down to fit our pockets and wrists? Just like those evolutionary leaps, RISC-V aims to be a cornerstone in the next phase of computational progress. In an ideal future, RISC-V will be as "boring" as any well-established technology—boring in the sense that it's reliable, stable, and so embedded in our daily lives that we take it for granted. The long-term vision is for software to run seamlessly on RISC-V-based processors without users having to worry about compatibility issues.
In the short term, however, there's still much to be done. While RISC-V is showing promise in deeply embedded systems like smartcards and IoT devices, achieving the same level of support and efficiency as ARM and x86 in more user-centric applications—like your phone or computer—is an ongoing project.
Industry trends, as evidenced by companies like SiFive and Ventana and initiatives like the RISE project, suggest a competitive and collaborative future for RISC-V. From a research standpoint, its open nature makes it a fertile ground for innovation across the entire hardware-to-software stack.There’s a palpable sense of momentum behind RISC-V, with heavyweight stakeholders and top-tier companies pooling their resources and expertise. Far from slowing down, the ecosystem around RISC-V is only gaining speed, setting it on a course to become as ubiquitous and “boring” as the microprocessors that have come to define modern life.So, looking at where things are going, ignoring RISC-V would be like missing the boat when we switched from those giant mainframes to desktop PCs. It’s a game-changer, and you really don’t want to be left behind.
Cristian-Gavril Olar | Director of Systems Software at AXELERA AI After 20 years of calling me to fix their email and antivirus settings, my parents started calling my 13-year-old daughter for help instead. Technology seems to have been nagging rather than helping grandparents, but the advent of cheap computing gives fundamental reasons to change this balance. Figure 1 – A few population pyramid examples. There are places with significant population shrinking and places with population growth, but in the last few years, the global trend is toward population shrinkage. Source: https://www.populationpyramid.net
Economic Growth in a Shifting Population Landscape
In the past century, we have seen around 10% market growth [1], reshaping the foundation of our civilization around the assumption of continuous economic growth. Whether it’s investment funds, pension plans, or governments calculating bond rates, all have factored in the premise of growth. Until recently, an important driver of this growth was the increasing global population. A larger population equates to more brain and muscle power, leading to greater productive capacity. However, in recent years, the world has witnessed a decline in population growth, albeit unevenly across different regions. Some see this decline as ominous, as famously raised by Elon Musk two years ago when claiming civilization could crumble [2]. On the more analytical side of things, this shift has drawn attention due to its potential impact on various aspects, notably retirement plans and workload stress. The population decrease is driving the trend toward fewer workers supporting a higher number of retirees, along with a greater number of dependents and fewer skilled workers. It has led some countries to make legislative changes to move the retirement age upwards [3]. Increasing productivity, therefore, is becoming a crucial goal in sustaining economic growth, albeit with potential consequences such as increased workload stress [4].
Working Smarter with Cheap Computing
To counteract these challenges, the focus must shift towards increasing productivity without human overload. In other words, we need to add more human-equivalent capacity. Historically, automation has played a key role in boosting productivity. However, the recent decrease in computational costs suggests a potentially profound transformation on the horizon. The promise of almost free computer power has significant implications for productivity growth. The ESP32-CAM from Espressif, a low-cost development kit capable of computer vision applications, perfectly exemplifies the advent of cheap computing. In fact, in an article presenting it five years ago, the author remarked that computing [with the ESP32] is not just cheap but essentially free [5]. This seemingly trivial device’s potential is immense, setting a precedent for affordable compute power that creates value from virtually nothing. When I learned about Espressif, I was working at Intel in the technology department building the MyriadX chip. I tried out the ESP32 in my spare time because it had some similarities to Myriad: it could use a camera, do some level of image processing, and claimed to do computer vision applications. A couple of weekends later, I was able to build a few interesting applications and was very excited. But then I looked at machine learning applications on the ESP32, and the performance there was really limiting my ideas.
Even so, that comment from the article stuck in my mind: a platform that makes compute essentially free. This revelation came back into focus when, during an internal brainstorming session, Fabrizio Del Maffeo, our CEO at Axelera, commented that we were essentially getting Tera Operations Per Second (TOPS) for free with In-Memory Computing. At the time the comment was made, we were still waiting for Metis, Axelera’s first commercially available chip, to come from the factory into the lab. The comment still felt theoretical. Figure 2 – Areas which were traditionally empty are starting to get filled up with architectures from different applications. Cornami adds encryption power, Axelera adds Artificial Intelligence power.
Unlocking the Power of Artificial Intelligence at Low Costs
Things changed just a few days later when Metis came into the lab. Incidentally, Metis’ early power-up coincided with the weeks before and after Easter. In Romania, where I am based, a word-for-word translation of these weeks would be “The Great Week” and “The Enlightened Week”, which seemed fitting: Metis came alive in the Great Week and started running neural networks in the Enlightened Week. To introduce Metis to the world, we decided to use the MIT Lincoln Laboratory [6] report, which charts current compute trends, together with a live demonstration. The Metis AIPU was not in the chart when the 2022 trends were published, but we plotted it live by running internal max-TOPS programs on the In-Memory Computing engines. This chart doesn’t show the compute cost, but all of this comes in at around $200. Now, the human brain’s compute power is often likened to around 100 trillion calculations (also called “operations”) per second [7]. The program used above was an artificial one: it loaded the In-Memory Computing part of Metis with an artificial, neural-net-like workload that carries some load/store overhead compared with an ideal pure-calculation program, and measured the AIPU In-Memory Computing power used for this. It reaches a practical 200 TOPS result, so the equivalent of two human brains. There is a lot more work to do before this matches how a human brain is actually used, but what really struck me was the price of such brain power. Such calculation capacity has existed for quite some time now. Still, the computing platforms needed for it have gone through wild price swings so far and have generally stayed in the thousands of dollars. But $100 for the capacity of a human brain is eye-opening and has significant transformation potential. Currently, the lowest nominal GDP per capita of any country in the world is $249. With the compute capacity of two human brains now priced at less than half of that, we are witnessing the first point in history where the compute equivalent of human productivity is clearly available anywhere in the world, creating the ability to democratize AI to even the poorest of countries. To look at the practical implications of this capacity, we can turn to a recent analysis showing the potential to double productivity in China, one of the countries already facing some of the most serious population declines and pinning its hopes on automation [8]. Doubling productivity could more than compensate for an eventual productivity decline caused by population shrinkage.
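The back-of-the-envelope arithmetic behind those claims is simple enough to write down; the numbers below are the ones quoted in this post (approximate platform cost, practical TOPS, and the commonly cited brain estimate), treated as rough assumptions rather than new measurements.

```python
# Figures quoted above, treated as rough assumptions rather than measurements
metis_practical_tops = 200   # practical result of the internal max-TOPS program
brain_tops_estimate = 100    # ~100 trillion operations/second often quoted for a human brain [7]
platform_cost_usd = 200      # approximate cost mentioned above

brain_equivalents = metis_practical_tops / brain_tops_estimate
cost_per_brain_equivalent = platform_cost_usd / brain_equivalents
print(f"{brain_equivalents:.0f} brain-equivalents at ${cost_per_brain_equivalent:.0f} each")
# -> 2 brain-equivalents at $100 each
```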
Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI Yesterday was Labor Day, a day dedicated to celebrating the achievements and perseverance of the workforce. Now we find ourselves on the cusp of a new era where artificial intelligence (AI) is poised to transform the labor market. The dawn of this technological revolution holds both promise and peril, and as we gather to honor the labor and perseverance of preceding generations, we, as an AI company, find it crucial to ponder the implications of AI for the future of work. Let us delve into the profound impact generative AI may have on employment, exploring both the opportunities it creates and the challenges it presents, as we strive to shape a world where human ingenuity and machine intelligence can coexist harmoniously. What is generative AI?Moravec’s Paradox is a principle in artificial intelligence that highlights the observation that tasks that are relatively easy for humans to perform can be quite difficult for machines, while tasks that are hard for humans can be relatively simple for machines. Previously, activities like playing chess and complex arithmetic calculations were effortlessly handled by computers, while object recognition, language understanding, and walking remained elusive for machines. However, with the emergence of deep learning, AI systems have made remarkable advancements in tackling tasks that were once considered human domain, such as image recognition and natural language processing. Although not all seemingly “easy” tasks have been conquered by AI, the advancements in deep learning over the past decade have undeniably brought machines closer to mastering human-centric tasks. This holds especially true for generative models.Generative models represent a class of machine learning algorithms that surpass the constraints of decision boundaries, enabling the capacity to model the intrinsic properties of data distributions. As such, generative models overcome the limitations inherent in discriminative models. In contrast to discriminative models, which prioritize delineating decision boundaries between data distributions—like discerning between images of dogs and cats—generative models encapsulate the inherent structures and patterns within the data distributions themselves.The capacity of generative models to capture the intrinsic structure of data is far from trivial. The complexities of data distributions, such as images and text, are vast and high-dimensional, encompassing rare examples that cannot be overlooked (i.e., long tails). However, recent advancements in deep learning techniques have served as a catalyst, unlocking a plethora of opportunities in this domain. Propelled by large datasets (e.g., the internet), sophisticated models (e.g., transformer blocks), increased computing resources (e.g., dedicated AI accelerators), and innovative learning techniques (e.g., reinforcement learning from human feedback), generative models are now making waves across the digital landscape.Today, generative models astound us with their capacity to generate hyper-realistic images through technologies like those employed by Stable Diffusion [1], DALL-E [2], and Midjourney [3]. Meanwhile, their linguistic siblings, such as GPTs [4] and LLaMA [5], produce human-like text that defies expectations. 
Video creation is revolutionized with Video LDM [6], and speech synthesis achieves new heights of authenticity [7].Large Language Models (LLMs) like ChatGPT are particularly impressive, demonstrating proficiency in a wide array of tasks, including text summarization, general question-answering, music composition, code writing, mathematical problem-solving, and even understanding human intentions (theory of mind) [8]. Groundbreaking innovations like Auto-GPT [9] and BabyAGI [10] push the envelope further, imbuing LLMs with self-prompting, memory capabilities, browsing, and critical reasoning. Unlike traditional chatbots, auto-GPT and BabyAGI may operate with minimal human intervention, edging us ever closer to the domain of Artificial General Intelligence (AGI).As the rapid evolution of generative AI reshapes our world, the impact of these transformative techniques on the labor market emerges as a significant consideration. Both the positive and negative consequences merit our attention as we contemplate how generative AI will redefine our professional landscape and what this revolution means for the future of work. How will generative AI reshape our professional landscape? Impact of Generative AI on the Labor MarketThe concept of machines encroaching upon our jobs is far from novel. Throughout history, innovations such as steam-powered machines, computers, and robots have simultaneously captivated and alarmed us with their potential impact on the labor market. In the near future, we may well equate AI’s role in the fourth industrial revolution to that of the steam engine in the first. Should AI progress maintain its current trajectory, its substantial influence on employment is all but certain. Consequently, identifying the occupations most vulnerable to AI disruption is crucial for adapting to and capitalizing on this technology.OpenAI, the very organization responsible for ChatGPT, has explored the possible effects of LLMs on the U.S. labor market. They discovered that roughly 80% of jobs could have at least 10% of their tasks impacted by LLMs [11]. Moreover, 19% of jobs might experience 50% of their tasks being affected. As anticipated, the occupations facing the greatest impact are those heavily dependent on writing and computer programming. A corroborating study by Goldman Sachs reveals that approximately two-thirds of occupations are susceptible to some degree of AI disruption, with a quarter of jobs potentially having up to half their workload replaced [12].The study also forecasts that administrative and legal jobs will be among those most significantly affected by AI [Figure 2].In contrast to previous waves of automation, manual occupations in sectors like manufacturing, construction, agriculture, and mining are predicted to be less affected by this emerging technology. This is primarily due to the current disparity in advancement between data-driven AI and robotics. The OpenAI study also anticipates a lesser impact on scientific occupations and jobs demanding critical thinking. Nevertheless, recent research involving GPT3.5 and GPT4 has already demonstrated how LLMs could potentially streamline and expedite scientific endeavors [14].General-purpose technologies, such as printing and the steam engine, are typified by their widespread diffusion, ongoing enhancement, and the generation of complementary innovations [11]. The aforementioned studies suggest that AI will be a general-purpose technology. 
Importantly, such technologies also present considerable opportunities for growth and development.
Douglas Watt | Director of AI Application Engineering at AXELERA AI Ioannis Koltsidas | VP AI Software at AXELERA AI Machine learning frameworks such as PyTorch and TensorFlow are the de facto tools that AI developers use to train models and develop AI applications because of the powerful capabilities they provide. In this article, we introduce the Voyager SDK, which developers can use to deploy such applications to the Metis AIPU quickly, effortlessly and with high performance.
What is different at the Edge?
Machine learning frameworks are designed around the use of 32-bit floating-point data, which has the precision needed to train models using standard backpropagation techniques. Models are often trained in the data center using powerful but expensive, energy-inefficient GPUs, and in the past these models were often used directly for inferencing on the same hardware. However, this class of hardware is no longer needed to achieve high inference accuracy, and today’s challenge is how to efficiently deploy these models to lower-cost, power-constrained devices operating at the network edge. A complete AI application involves a pipeline of multiple tasks. For example, a computer vision application typically combines a deep learning model that operates on tensor data with various pre- and post-processing tasks that operate on non-tensor data such as pixels, labels and key points. The latter, also referred to as non-neural tasks, prepare data for input to the deep learning model. Examples include scaling an image to the model’s input resolution and encoding the image to the required tensor format. Non-neural tasks are also used to interpret the predicted output tensors, for example generating an array of bounding boxes. For ease of development, most models are implemented and trained in high-level languages such as Python. However, most inference devices rely on low-level embedded programming to achieve the requisite performance. The core deep learning model is usually defined within the tight constraints of the ML framework, which enables the use of quantization tools to optimize and compile the model to run as native assembly on the target AI accelerator. The non-neural tasks are often more general-purpose in their design, and their optimal location may vary from one platform to the next. In the example above, preprocessing elements can be offloaded to an embedded media accelerator, and visualization elements reimplemented as OpenGL kernels on an embedded GPU. Furthermore, combining these heterogeneous components efficiently requires the use of a low-level language such as C++ and libraries that enable efficient buffer sharing and synchronization between devices. Many application developers are not familiar with low-level system design, and thus providing developers with easy-to-use pipeline deployment tools is a prerequisite for the mass adoption of new Edge AI hardware accelerators in the market.
Simplifying AI development for the Edge
The Voyager SDK offers a fast and easy way for developers to build powerful and high-performance applications for Axelera AI’s Metis AI platform. Developers describe their end-to-end pipelines declaratively, in a simple YAML configuration file, which can include one or more deep learning models along with multiple non-neural pre- and post-processing elements.
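Before looking at how the SDK deploys such a pipeline, here is a small, dependency-light Python sketch of the kind of non-neural tasks mentioned above: scaling and encoding an image into a tensor, and decoding raw detections into bounding boxes. It is a generic illustration of the concept, not Voyager SDK code, and all names and shapes are arbitrary.

```python
import numpy as np

def preprocess(frame: np.ndarray, size: int = 640) -> np.ndarray:
    """Non-neural pre-processing: resize to the model's input resolution and
    encode as a normalized NCHW float tensor (nearest-neighbour resize keeps
    the sketch dependency-free)."""
    h, w, _ = frame.shape
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    tensor = frame[ys][:, xs].astype(np.float32) / 255.0
    return tensor.transpose(2, 0, 1)[None]          # shape (1, 3, size, size)

def postprocess(raw_detections: np.ndarray, conf_thr: float = 0.5) -> list:
    """Non-neural post-processing: keep (x1, y1, x2, y2) boxes whose score
    exceeds a confidence threshold."""
    return [tuple(d[:4]) for d in raw_detections if d[4] >= conf_thr]

frame = np.random.randint(0, 255, (1080, 1920, 3), dtype=np.uint8)
print(preprocess(frame).shape)                      # (1, 3, 640, 640)
print(postprocess(np.array([[10, 10, 50, 80, 0.9], [0, 0, 5, 5, 0.2]])))
```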
The SDK toolchain automatically compiles and deploys the models in the pipeline for the Metis AI platform and allocates pre and post processing components to available computing elements on the host such as the CPU, embedded GPU or media accelerator. The compiled pipeline can then be used directly as a first-class object from Python or C++ application code as an “inference input/output stream”.
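As a purely hypothetical sketch of what consuming a compiled pipeline as an “inference input/output stream” could look like from Python: the names below (InferenceStream, Detection, the YAML file name, the RTSP URL) are invented for illustration and are not the actual Voyager SDK API.

```python
# Hypothetical illustration only; not the real Voyager SDK API.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    score: float
    box: tuple  # (x1, y1, x2, y2)

class InferenceStream:
    """Stand-in for a compiled pipeline exposed as a first-class stream object."""
    def __init__(self, pipeline_yaml: str, source: str):
        self.pipeline_yaml = pipeline_yaml
        self.source = source
    def __iter__(self):
        # A real stream would yield per-frame results produced by the deployed
        # pipeline; here we fabricate a single detection to show the usage pattern.
        yield Detection(label="person", score=0.91, box=(120, 40, 380, 620))

stream = InferenceStream("pipeline.yaml", source="rtsp://camera.local/stream")
for result in stream:
    print(result.label, result.score, result.box)
```

The point of the pattern is that application code simply iterates over ready-to-use results, while the compiled pipeline handles model execution and the placement of pre- and post-processing behind the scenes.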
Interview with Jonathan Ballon, Chairman of Axelera AI
Jonathan Ballon has recently joined Axelera AI as Chairman of the Board. Coming from leadership roles in some of the world’s most recognizable companies, such as Cisco Systems, General Electric and Intel, Jonathan brings deep entrepreneurial and operational expertise to the company. To commemorate Jonathan joining our team, we hosted an in-depth interview to learn more about his background. Jonathan, thank you for joining us today. Before moving into why you joined Axelera AI, could you please tell us something about yourself? I have lived and worked in Silicon Valley for all of my career, almost 25 years. There’s a theme to my interests which aligns with where I’ve spent my time over that period, which is really around understanding this world that we live in. First, with startups that were looking at better dissemination of data and helping people to gain access to information, which then traversed into a career at Cisco for a decade looking at how we move data and information around the world. It’s about helping to build and deploy advanced applications of Internet technology, first in the Enterprise and subsequently in the cloud, which has proven to be a great leveling agent for the world. For the past 10 years I have focused on what happens after that fabric is in place: how do we start gaining access to the data in the world that we live in? And so the last several years have really been focused on distributed computing environments – what we now call the edge, which is all of this infrastructure out in the physical world we all inhabit. It’s in our hospitals, our factories, our cities, and making the most of the data generated at the edge. If we can better understand the dynamic systems we inhabit and participate in, we can better understand how things work so we can improve the human condition. You are a seasoned executive who worked in several Fortune 500 companies. There are plenty of start-ups developing AI solutions, so why did you decide to join the board of Axelera AI? When I think about my career, I’ve worked in both start-ups and large companies, and each of them is equally valuable in helping to drive a pervasive adoption of technology. And so typically what you see is lots of innovation happening in startups because they move faster and have less bureaucracy. They’re typically smaller, and as a result you can get very close to the application and the use case. However, it’s typically large companies that have the ability to drive pervasive adoption of technology at scale, so you really need both experiences to drive pervasive adoption of new technology. And when I think about my journey, it’s really been focused on three primary areas. The first is how do we get adoption of technology, the life cycle of innovation, and what does it take to get something from being novel or interesting across that chasm into mass-market adoption. That has been a key focus of mine for years. Secondly, it’s around access. We want a world in which we don’t have the “haves and have-nots”. We want there to be an equal distribution of the benefits of technology. So how do you drive almost a democratization of access to technology? That’s done through economics, global reach and scale, as well as providing tools to accumulate the value. Lastly, it’s really about the application of this technology.
Not just the application of technology for novel use cases, but really the understanding of how technology can be applied in a way that’s good business and has an economic value proposition for the end user. Looking at Axelera AI, there are a couple of things that attracted me to the company. It starts with people. I’ve known the CEO Fabrizio for years. He’s an incredibly charismatic and passionate leader, but importantly, he not only understands the technology, he understands how to drive an economic value proposition, how to drive scale in the market commercially. That’s important, because it is not just about great technology. It’s about how we get technology scaled through the ecosystem and how it becomes available on a global basis in a fair and equitable way. I think Fabrizio and the rest of the team really encompass all of those things. Secondly, it’s about the market. I’m deeply passionate about what’s happening at the edge. We’re in the very early days of the movement of computing and computing architectures from being focused on cloud computing to traversing out to the edge, as most inference happens there. This distributed computing architecture is emerging where the definition of the network is expanding through every device, and the value created as a result is immense. This shift will require purpose-built technology that factors this in, which Axelera AI has. The third reason is about the people underwriting the company. Deep tech can be exceptionally hard. Having investors that really understand those dynamics and have the ability to represent the customer, the use cases and the core fundamental technology development cycles really makes a difference. Axelera AI was incubated by Bitfury, the largest Bitcoin miner outside of China and a groundbreaking crypto technology provider. IMEC provides not only investment but access to some of the core fundamental intellectual property that fuels the technology roadmap. And Innovation Industries, a European deep tech VC fund, brings extensive industry experience and deep technical and operational knowledge to the table. How do you see computing evolving in this data-driven era? And what will be the impact of in-memory computing and RISC-V? It starts with the customer and the use case. In the past 10-12 years the technology has really been focused on the economies of scale that are gained through the cloud computing environment, which is relatively unconstrained. You have unconstrained compute and unconstrained energy, all operating in a supervised, safe environment. What we’re now seeing is this movement towards deep learning and accessing information out at the edge. However, you can’t take these unconstrained models that have been built for the cloud and deploy them at the edge. The volume or velocity of data can’t be supported physically or economically by existing networks. In many instances the use cases require real-time processing. Take autonomous driving. It obviously doesn’t make sense, and would be very dangerous, to send data back and forth to the cloud in real time. So we need architectures that can support local inference with much-reduced and optimized neural network algorithms. But we also need computing architectures that can not only support the data movement, but do so with a much lower energy requirement and computing footprint. The bottleneck with data velocity and computing architectures is the data bus, which brings the data back and forth between memory and the CPU.
So the ability to put memory in the CPU itself dramatically reduces the amount of energy that is required in order to perform those calculations, but also increases the speed. From an architectural point of view this is very important, both in the cloud and on the edge, which is why we’re seeing such movement towards in-memory computing. Historically we’ve had two architectures, x86 and ARM, which the industry has been focused on for decades. These have been closed, proprietary instruction sets; in the case of ARM you can license it, but that value accrues to one company. Now we have a third new platform that’s emerging with RISC-V. What’s great about RISC-V is that you now have an open architecture and an open instruction set that allows for innovation to happen in a frictionless way. I think this is going to be a really pervasive and widely adopted architecture. How do you see the AI semiconductor industry consolidating in the coming years? And if you see consolidation, how many startups will stay independent 5 years from now? History is always a good indicator of the future. If you think back to what happened in the early 90s, we had a relatively small number of semiconductor companies that were vertically integrated. They had everything from R&D all the way through production and manufacturing operations. That model was broken, and so you started seeing the creation of foundries, solely focusing on scale manufacturing, allowing the innovation to take place with a much lower barrier to entry – and that’s continued to this day. In fact, we’re seeing a resurgence of that right now, with more foundries being stood up to support the growth in demand and application-specific designs. This is creating more innovation because we have now given companies of any size the ability to design and build a novel computing architecture. What will likely happen is you’ll see some of these computing architectures gain purchase in the marketplace through adoption at scale and become part of a broader set of capabilities in the ecosystem. Other chip startups will simply go away through some combination of technical inferiority, inability to manufacture, failure to commercialize or simply running out of cash. In Silicon Valley alone, there are around 4000 new start-ups (of all kinds) created yearly and only about 10% of them are ever successful. I have to imagine the ratio for silicon startups aligns with that metric in a best-case scenario. Some investors say that the AI semiconductor market has peaked and it’s now getting ready for a severe adjustment. Others are saying that this slowdown is more temporary and that we really haven’t reached the peak because AI is really still at that infancy phase and there is a need for new technologies and solutions. How do you see that playing out? We’re in the early days, so I don’t think anything’s peaked. In fact, if you look at historic R&D spend in semiconductors, it has increased every year for the past 50 years. What you’re seeing now with AI is the need for purpose-built architectures, not dissimilar to the trend a few decades ago around demand for graphics architectures. Over the past decade we’ve witnessed a huge amount of growth and innovation happening in training, particularly for data centers and cloud. What we haven’t seen yet is scale and adoption of applications at the edge.
That is the next frontier. And if you look at data as the indicator, more data will be created in the next year than over the last 10 years combined, with 75% of that data being generated out in the physical world, such as factories, hospitals and cities. Currently that data is moving back to the cloud, but that won’t continue. You’re not only going to see computing architectures, but also system architectures where inference, training and storage will happen as close to the sensor as possible. This goes back to the use cases around real-time compute, the ability for systems to not just be smart, but to be intelligent. The difference between the two is that smart can think, while intelligent can learn. So we’ve currently got devices and systems that can be smart about whatever function they’re performing, but we’re moving towards a set of intelligent systems, and in that inference-training flywheel we’re not going to be moving the world’s data back into the cloud for training. A lot is going to happen close to the source of the data. This will change computing and communication networks and the algorithms that we write, because they’re going to need to be more constrained for a lower footprint in terms of energy and costs, out at the edge. Many incumbent companies and well-funded startups are battling to win the AI cloud computing market while relatively few companies are developing solutions for AI at the edge. Some market experts and investors think that the market opportunity at the edge is still pretty small (compared to cloud) and that the edge market is way too fragmented to be efficiently served. What do you think about that? Do you see an opportunity? Well, I think the cloud is not going away and will continue to grow and be innovative. However, because the data is being generated at the edge and needs to be analysed, processed, moved and stored at the edge, there is an untapped market for all of these layers. So you’re going to see a tremendous amount of growth as a result of that. So from an AI point of view we’ll go from narrow towards more broad use cases, eventually moving towards General AI, looking cross-domain and applying the insights that become available when you are able to harness data from multiple sources. It creates a step function that will be available to us as we move through this journey over the next 10 to 20 years. What is the impact of data if you look at the driving factors for an AI company to succeed at the edge? I think it’s a factor. It’s about having the ability to access data, and to do that in a cost-effective and energy-efficient way. You should also factor in all kinds of other characteristics of the physical world that don’t exist in a controlled cloud environment. Things like physical security and temperature control, which you don’t always have. Systems are often operational 24/7, so there is not always an opportunity for reboots, software updates and redundancy. You also have the physical world creating hot, harsh and extreme environments with dust, temperature changes, vibrations and such. So the edge is very different – and depending on the use case, you may need to be operating in real time, measured in milliseconds.
If you look at robotics, for example, having zero latency in the control system is critical for precision and safety. It introduces a whole new set of challenges, which you need to be prepared for. Over the last few years we have seen a big change in the global market, with the US government trying to onshore the semiconductor supply chain with large government support (the CHIPS Act), the European Union trying to relaunch the local semiconductor ecosystem with the EU Chips Act, China making impressive investments in AI semiconductors and fuelling internal demand, and finally Taiwan and Korea pushing to strengthen their position in the market. How do you see this evolving in the coming 10 years? I think a lot of people see this as a retreat from globalization, where they’re starting to insource and localize a lot of capability in order to protect national interests and security, but that’s not the reality of the market that we’re in. When you get down to raw materials as well as the sophisticated equipment necessary for production at scale, there really is no country today that can completely vertically integrate and be successful in semiconductors. It requires a global community. For example, TSMC, the largest semiconductor foundry in the world, receives raw material supply and equipment from all over the world. It’s not as simple as having an advanced factory and a trained workforce operating at scale. It really requires a global supply chain of materials and technical innovation. I think what we’re seeing now is a political acknowledgement and recognition of how fundamental silicon is to the success of any nation state, in terms of national security interests but also the health and well-being of its citizens across every industry. The supply shortage that we’ve seen in semiconductors over the last several years has made that painfully obvious. I’m very enthusiastic that we’re now seeing national investment programs, subsidies and other benefits in order to support the growth necessary. This really needs to be a public/private collaboration in order to supply the fundamental building blocks of innovation for the world. I don’t see it being a retreat from globalization at all. I see it really as a shoring up of capabilities and the creation of capacity to support an ever-increasing demand for computing. What are the top 3 edge markets which you expect to be disrupted by artificial intelligence? And what kind of applications will be the most impactful on our daily lives? Over the past seven years there has been a focus on natural language processing – the ability to control the human-machine interface using voice. We see that in our homes with any voice assistant, and also in a healthcare environment or an industrial setting where you have a worker that needs to be able to use both of their hands, but now may control a system using just their voice. Over that same period we’ve seen better-than-human accuracy in image analytics. It started with being able to identify a cat or a dog in a photograph, and is now moving towards being able to analyse very dense and complex medical images. Being able to translate the terabytes of data in one of those images to find anomalies better by applying deep learning algorithms – faster and with more accuracy than one of the world’s most sophisticated radiologists – is providing a huge benefit. Not only to overworked radiologists, but also towards better health outcomes as a result.
Because not only can we now derive insight from the image, we can apply that to other datasets: fusing together not just a single diagnosis of what’s happening in that particular image, but also applying it to population health records to look for insights into what caused those anomalies in the first place. We’re seeing those same image analytics applied to video in real time, from object detection and object tracking to facial recognition. This is now at a point where we can understand not only the image, but who is in the image, how they are feeling and what they are doing – starting to perform behavioral analysis right in video images. It’s the ultimate sensor, because you can see what’s happening in the world. We’ll start to take in other sensor types for sound, smell and vibration and apply all of these things together, moving towards more of a generalized AI where we start looking across domains and data sets, getting a robust understanding of the world we live in. I see this moving towards ‘what do we do about it’, being able to predict better and starting to allow some degrees of autonomy. I see this journey going from understanding what the data is telling us, to having it make a recommendation of what we should do in that scenario (but still requiring a human to take action), towards full autonomy. And that autonomy can be a car making a decision to swerve or brake. It could be in robotics – where you have an unsupervised robotic system – performing tasks and learning from a dynamic environment. All of these things will start to pervade our lives, in the process allowing humans to move to a higher order of value creation and skill set. A lot of things that are historically mundane can be automated, things that are dangerous can be automated, or things that are dirty can be automated. All of these things allow the human experience to improve once again.
Evangelos Eleftheriou | CTO at AXELERA AI Our CTO and Co-Founder Evangelos Eleftheriou, presented at the ESSCIRC – ESSDERC 2022 event about in-memory computing for deep-learning acceleration.In-memory computing (IMC) is a novel computing paradigm, where certain computational tasks are performed in the memory itself using analog or mixed signal computation techniques.In his presentation he shares a broad overview of the recent progress of IMC for accelerating deep learning workloads, highlighting the strengths and weaknesses of the various approaches.Learn all about it in his presentation. DOWNLOAD THE PRESENTATION
Fabrizio Del Maffeo | CEO at AXELERA AI Introducing Axelera AI’s New Advisor, Andreas HanssonAndreas Hansson joined Axelera AI as an advisor last month. Andreas is an angel investor in several start-ups and serves on the board of several public companies. He will advise us on technology, market trends, and computing and artificial intelligence investment opportunities. To commemorate Andreas joining our team, we hosted a short interview to learn more about his background. Andreas, thank you for joining us today. Before we jump into your career and accomplishments, can you tell us a bit more about you personally?As a kid, I was always encouraged to be curious and inquisitive, and it has been a constant theme throughout my life. I spent much of my childhood taking things apart and making new things. I think it was this curiosity that sparked my interest in technology from a young age. I loved learning how something worked, building on it, or creating something different. To a large extent, it’s still what I love doing most today. That curiosity has taken you to many great places. Why did you move from research into investment?Thank you. Research is hugely exciting, and I enjoy the thrill of expanding my horizons with new technologies and innovations. Sometimes it can get detached from reality, though – it’s possible to get too focused on technology for technology’s sake. Getting more involved in the business decisions guiding the research and M&A activities grounded me in the purpose of all that research. I started to see that investment is a natural progression, and I love that it allows me to dive into all the aspects of a business. It’s a great place to be for a full-circle view. You worked for two worldwide leaders in two completely different fields: first Arm, the biggest IP company in the world, and then Softbank, the largest VC in the world. What are the most important lessons you learned in these two experiences?Arm taught me the value of partnership. The company’s astonishing success comes from, and still relies on, trust within the ecosystem. That trust and partnered work permeate the whole organisation. As a result, Arm is very collaborative, both internally and externally, and for me, it was a fantastic learning platform with tons of support.One of my key takeaways from SoftBank was the power of thinking big and asking, “what if…?” It lit up the same inquisitive nature I had as a child. In some of my previous roles in engineering, I found myself getting a little too pragmatic and level-headed – important in some cases but stunting in others. Within SoftBank and the Vision Fund, I was surrounded by people pushing the envelope and truly thinking outside the box. More and more startups are trying to enter the computing and AI semiconductor markets, proposing new architectures which always claim to be way more efficient and powerful than the incumbents. What is your opinion about this? Is there any secret sauce to succeed in this market?Computing is permeating everything in our lives and is ever-evolving to deliver the right power/performance trade-off for each use case. For the same reasons, we are also seeing more changes in how computing systems are built, with novel architectures, technologies, manufacturing methods, etc. These developments present fantastic opportunities for startups to innovate and show what is possible. I actually think there are not enough semiconductor startups and also not enough semiconductor-focused DeepTech VCs. 
After years of large investment rounds, it seems like the venture capital market is undergoing a correction. What is your opinion about this? What is the outlook for the coming 24 months?VC activity is merely reflecting what’s happening in the markets broadly. I’m not surprised that priorities are shifting as everyone is working out what the world will look like going forward. While it will likely be a more challenging environment, and valuation expectations will come down, the next 24 months should ultimately present good investment opportunities for VCs. What do you suggest to early-stage startups to do in this uncertain time when raising money?The best thing startups can do is stay on top of their spending. If possible, secure 18-24 months of runway. Consider prioritising profitability instead of growth, and at the very least, work out a route to positive unit economics. You recently departed from Softbank for a new great adventure – what is that?Yes! I’m launching 2Q Ventures, a dedicated quantum computing fund in partnership with my stellar team. While I enjoy late-stage investment and my public-company board work, I’ve stayed really passionate about frontier technology, and helping visionaries transform the world. 2Q Ventures gives me a framework for doing exactly that while accelerating development and building up an ecosystem. Quantum computing is an exciting field. When do you think quantum computing technology will become accessible to enterprises? Which market sector do you expect will be an early adopter of commercial quantum computing?Excitingly, enterprises can already access quantum computers on the cloud through services like Amazon Braket. However, due to limited scale and relatively high noise rates, quantum computers don’t have a commercial advantage yet. That said, the progress is incredible. We also see signs of a virtuous circle, similar to machine learning in the mid 2000s, with technology progress leading to more investment, which in turn is getting more people involved, helping broaden the talent pool and seed new startups in the field, which in turn accelerates the next generation of achievements. It’s the perfect recipe for acceleration over the next few years. I wouldn’t be surprised if we see a true commercial advantage in areas like quantum simulation in the same time frame.
Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI Martino Dazzi | Algorithm and Quantization Researcher at AXELERA AI We are delighted to share the slideshow “Insights and Trends of Machine Learning for Computer Vision”, recently presented at several conferences by our Head of Machine Learning, Bram-Ernst Verhoef, and our Algorithm and Quantization Researcher, Martino Dazzi. Learn all about it in their presentation. DOWNLOAD THE PRESENTATION
Bram Verhoef | Director of Customer Engineering & Success at AXELERA AI SummaryConvolutional neural networks (CNNs) still dominate today’s computer vision. Recently, however, networks based on transformer blocks have also been applied to typical computer vision tasks such as object classification, detection, and segmentation, attaining state-of-the-art results on standard benchmark datasets.However, these vision-transformers (ViTs) are usually pre-trained on extremely large datasets and may consist of billions of parameters, requiring teraflops of computing power. Furthermore, the self-attention mechanism inherent to classical transformers builds on quadratically complex computations.To mitigate some of the problems posed by ViTs, a new type of network based solely on multilayer perceptrons (MLPs), has recently been proposed. These vision-MLPs (V-MLP) shrug off classical self-attention but still achieve global processing through their fully connected layers.In this blog post, we review the V-MLP literature, compare V-MLPs to CNNs and ViTs, and attempt to extract the ingredients that really matter for efficient and accurate deep learning-based computer vision.IntroductionIn computer vision, CNNs have been the de facto standard networks for years. Early CNNs, like AlexNet [1] and VGGNet [2], consisted of a stack of convolutional layers, ultimately terminating in several large fully connected layers used for classification. Later, networks were made progressively more efficient by reducing the size of the classifying fully connected layers using global average pooling [3]. Furthermore these more efficient networks, among other adjustments, reduce the spatial size of convolutional kernels [4, 5], employ bottleneck layers and depthwise convolutions [5, 6], and use compound scaling of the depth, width and resolution of the network [7]. These architectural improvements, together with several improved training methods [8] and larger datasets have led to highly efficient and accurate CNNs for computer vision.Despite their tremendous success, CNNs have their limitations. For example, their small kernels (e.g., 3×3) give rise to small receptive fields in the early layers of the network. This means that information processing in early convolutional layers is local and often insufficient to capture an object’s shape for classification, detection, segmentation, etc. This problem can be mitigated using deeper networks, increased strides, pooling layers, dilated convolutions, skip connections, etc., but these solutions either lose information or increase the computational cost. Another limitation of CNNs stems from the inductive bias induced by the weight sharing across the spatial dimensions of the input. Such weight sharing is modeled after early sensory cortices in the brain and (hence) is well adapted to efficiently capture natural image statistics. However, it also limits the model’s capacity and restricts the tasks to which CNNs can be applied.Recently, there has been much research to solve the problems posed by CNNs by employing transformer blocks to encode and decode visual information. These so-called Vision Transformers (ViTs) are inspired by the success of transformer networks in Natural Language Processing (NLP) [9] and rely on global self-attention to encode global visual information in the early layers of the network. 
The original ViT was isotropic (it maintains an equal-resolution-and-size representation across layers), permutation invariant, based entirely on fully connected layers and relying on global self-attention [10]. As such, the ViT solved the above-mentioned problems related to CNNs by providing larger (dynamic) receptive fields in a network with less inductive bias. This is exciting research, but it soon became clear that the ViT was hard to train, not competitive with CNNs when trained on relatively small datasets (e.g., IM-1K, [11]), and computationally complex as a result of the quadratic complexity of self-attention. Consequently, further studies sought to facilitate training. One approach was using network distillation [12]. Another was to insert CNNs at the early stages of the network [13]. Further attempts to improve ViTs re-introduced inductive biases found in CNNs (e.g., using local self-attention [14] and hierarchical/pyramidal network structures [15]). There were also efforts to replace dot-product QKV-self-attention with alternatives [e.g. 16]. With these modifications now in place, vision transformers can compete with CNNs with respect to computational efficiency and accuracy, even when trained on relatively small datasets [see this blog post by Bert Moons for more discussion on ViTs].
Vision MLPs
Notwithstanding the success of recent vision transformers, several studies demonstrate that models building solely on multilayer perceptrons (MLPs) — so-called vision MLPs (V-MLPs) — can achieve surprisingly good results on typical computer vision tasks like object classification, detection and segmentation. These models aim for global spatial processing, but without the computationally complex self-attention. At the same time, these models are easy to scale (high model capacity) and seek to retain a model structure with low inductive bias, which makes them applicable to a wide range of tasks [17]. Like ViTs, the V-MLPs first decompose the images into non-overlapping patches, called tokens, which form the input to a V-MLP block. A typical V-MLP block consists of a spatial MLP (token mixer) and a channel MLP (channel mixer), interleaved by (layer) normalization and complemented with residual connections. This is illustrated in Figure 1.
Table 1. Overview of some V-MLPs. For each V-MLP, we present the accuracy of the largest reported model that is trained on IM-1K only.
Here the spatial MLP captures the global correlations between tokens, while the channel MLP combines information across features. This can be formulated as follows:
Y = spatialMLP(LN(X)) + X
Z = channelMLP(LN(Y)) + Y
Here X is a matrix containing the tokens, Y consists of intermediate features, LN denotes layer normalization, and Z is the output feature of the block. In these equations, spatialMLP and channelMLP can be any nonlinear function represented by some type of MLP with activation function (e.g. GeLU). In practice, the channelMLP is often implemented by one or more 1×1 convolutions, and most of the innovation found in different studies lies in the structure of the spatialMLP submodule. And here’s where history repeats itself. Where ViTs started as isotropic models with global spatial processing (e.g., ViT [10] or DeiT [12]), V-MLPs did so too (e.g., MLP-Mixer [17] or ResMLP [18]).
Where recent ViTs improved their accuracy and performance on visual tasks by adhering to a hierarchical structure with local spatial processing (e.g., Swin-transformer [14] or NesT [19]), recent V-MLPs do so too (e.g., Hire-MLP [20] or S^2-MLPv2 [21]). These modifications made the models more computationally efficient (fewer parameters and FLOPs), easier to train and more accurate, especially when trained on relatively small datasets. Hence, over time both ViTs and V-MLPs re-introduced the inductive biases well known from CNNs. Due to their fully connected nature, V-MLPs are not permutation invariant and thus do not necessitate the type of positional encoding frequently used in ViTs. However, one important drawback of pure V-MLPs is the fixed input resolution required by the spatialMLP submodule. This makes transfer to downstream tasks, such as object detection and segmentation, difficult. To mitigate this problem, some researchers have inserted convolutional layers or, similarly, bicubic interpolation layers into the V-MLP (e.g., ConvMLP [22] or RaftMLP [23]). Of course, to some degree, this defeats the purpose of V-MLPs. Other studies have attempted to solve this problem using MLPs only (e.g., [20, 21, 30]), but the data-shuffling needed to formulate the problem as an MLP results in an operation that is very similar or even equivalent to some form of (grouped) convolution. See Table 1 for an overview of different V-MLPs. Note how some of the V-MLP models are very competitive with (or better than) state-of-the-art CNNs, e.g. ConvNeXt-B with 89M parameters, 45G FLOPs and 83.5% accuracy [28].
What matters?
It is important to note that the high-level structure of V-MLPs is not new. Depthwise-separable convolutions, for example, as used in MobileNets [6], consist of a depthwise convolution (spatial mixer) and a pointwise 1×1 convolution (channel mixer). Furthermore, the standard transformer block comprises a self-attention layer (spatial mixer) and a pointwise MLP (channel mixer). This suggests that the good performance and accuracy obtained with these models result at least partly from the high-level structure of layers used inside V-MLPs and related models: specifically, (1) the use of non-overlapping spatial patch embeddings as inputs, (2) some combination of independent spatial (with large enough spatial kernels) and channel processing, (3) some interleaved normalization, and (4) residual connections. Recently, such a block structure has been dubbed “Metaformer” ([24], Figure 2), referring to the high-level structure of the block rather than the particular implementation of its subcomponents. Some evidence for this hypothesis comes from [27], who used a simple isotropic, purely convolutional model, called “ConvMixer,” that takes non-overlapping patch embeddings as inputs. Given an equal parameter budget, their model shows improved accuracy compared to standard ResNets and DeiT. A more thorough analysis of this hypothesis was performed by “A ConvNet for the 2020s” [28], which systematically examined the impact of block elements 1-4, finding a purely convolutional model reaching SOTA performance on ImageNet, even when trained on IN-1K alone.
Figure 2. a. V-MLP, b. Transformer and c. MetaFormer. Adapted from [24].
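To make the token-mixer/channel-mixer structure discussed above concrete, here is a minimal PyTorch sketch of an MLP-Mixer-style V-MLP block implementing Y = spatialMLP(LN(X)) + X and Z = channelMLP(LN(Y)) + Y; the hidden sizes and token counts are arbitrary and chosen only for illustration.

```python
import torch
import torch.nn as nn

class VMLPBlock(nn.Module):
    """MLP-Mixer-style V-MLP block: a spatial (token-mixing) MLP followed by a
    channel MLP, each preceded by LayerNorm and wrapped in a residual connection."""
    def __init__(self, num_tokens: int, dim: int, token_hidden: int = 256, channel_hidden: int = 1024):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.spatial_mlp = nn.Sequential(            # mixes information across tokens
            nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(            # mixes information across channels
            nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, tokens, dim)
        y = self.norm1(x).transpose(1, 2)                 # operate along the token axis
        x = x + self.spatial_mlp(y).transpose(1, 2)       # Y = spatialMLP(LN(X)) + X
        return x + self.channel_mlp(self.norm2(x))        # Z = channelMLP(LN(Y)) + Y

tokens = torch.randn(2, 196, 512)   # e.g. 14x14 patches with a 512-dim embedding
print(VMLPBlock(num_tokens=196, dim=512)(tokens).shape)   # torch.Size([2, 196, 512])
```

Swapping the spatial MLP for self-attention gives a standard transformer block, and swapping it for a (depthwise) convolution gives a ConvMixer-like block, which is exactly the MetaFormer observation made above.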
Evangelos Eleftheriou | CTO at AXELERA AI Our CTO had a chat with Torsten Hoefler to scratch the surface and get to know better our new scientific advisor.Evangelos: Could you please introduce yourself and your field of expertise?Torsten: My background is in High-Performance Computing on Supercomputers. I worked on large-scale supercomputers, networks, and the Message Passing Interface specification. More recently, my main research interests are in the areas of learning systems and applications of them, especially in the climate simulation area. E: Where is currently the focus of your research interests?T: I try to understand how to improve the efficiency of deep learning systems (both inference and training) ranging from smallest portable devices to largest supercomputers. I especially like the application of such techniques for predicting the weather or future climate scenarios. E: What do you see as the greatest challenges in data-centric computing in current hardware and software landscape?T: We need a fundamental shift of thinking – starting from algorithms, where we teach and reason about operational complexity. We need to seriously start thinking about data movement. From this algorithmic base, the data-centric view needs to percolate into programming systems and architectures. On the architecture side, we need to understand the fundamental limitations to create models to guide algorithm engineering. Then, we need to unify this all into a convenient programming system. E: Could you please explain the general concept of DaCe, as a generic data-centric programming framework?T: DaCe is our attempt to capture data-centric thinking in a programming system that takes Python (and others) codes and represents them as a data-centric graph representation. Performance engineers can then work conveniently on this representation to improve the mapping to specific devices. This ensures highest performance. E: DaCe has also extensions for Machine Learning (DaCeML). Where do those help? Could in general in-memory computing accelerators benefit by such a framework and how?T: DaCeML supports the Open Neural Network Exchange (ONNX) format and PyTorch through the ONNX exporter. It offers inference as well as training support at highest performance using data-centric optimizations. In-memory computing accelerators can be a target for DaCe – depending on their offered semantics, a performance engineer could identify pieces of the dataflow graph to be mapped to such accelerators. E: In which new application domains do you see data-centric computing playing a major role in the future?T: I would assume all computations where performance or energy consumption is important – ranging from scientific simulations to machine learning and from small handheld devices to large-scale supercomputers. E: What is your advice to young researchers in the field of data-centric optimization?T: Learn about I/O complexity! As Scientific Advisor, Torsten Hoefler advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Torsten’s work, please visit his biography page.
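Since DaCeML is described above as supporting PyTorch through the ONNX exporter, here is a minimal example of that export step using PyTorch's standard torch.onnx.export; the choice of model and file name is arbitrary, and how the resulting ONNX graph is consumed and optimized afterwards is left to the framework.

```python
import torch
import torchvision

# Export a standard torchvision model to ONNX, the interchange format DaCeML consumes.
model = torchvision.models.resnet18(weights=None).eval()
dummy_input = torch.randn(1, 3, 224, 224)   # example input defining the graph's shapes

torch.onnx.export(
    model, dummy_input, "resnet18.onnx",
    opset_version=13,
    input_names=["input"], output_names=["logits"],
)
print("wrote resnet18.onnx")
```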
Bert Moons | Director – System Architecture at AXELERA AI
Summary
Convolutional Neural Networks (CNN) have been dominant in Computer Vision applications for over a decade. Today, they are being outperformed and replaced by Vision Transformers (ViT) with a higher learning capacity. The fastest ViTs are essentially a CNN/Transformer hybrid, combining the best of both worlds: (A) CNN-inspired hierarchical and pyramidal feature maps, where embedding dimensions increase and spatial dimensions decrease throughout the network, are combined with local receptive fields to reduce model complexity, while (B) Transformer-inspired self-attention increases modeling capacity and leads to higher accuracies. Even though ViTs outperform CNNs in specific cases, their dominance has not yet been established. We illustrate and conclude that SotA CNNs are still on par with, or better than, ViTs on ImageNet validation, especially when (1) trained from scratch without distillation, (2) in the lower-accuracy <80% regime, and (3) at lower network complexities optimized for Edge devices.
Convolutional Neural Networks
Convolutional Neural Networks (CNN) have been the dominant Neural Network architectures in Computer Vision for almost a decade, after the breakthrough performance of AlexNet[1] on the ImageNet[2] image classification challenge. From this baseline architecture, CNNs have evolved into variations of bottlenecked architectures with residual connections, such as ResNet[3] and RegNet[4], or into more lightweight networks optimized for mobile contexts using grouped convolutions and inverted bottlenecks, such as MobileNet[5] or EfficientNet[6]. Typically, such networks are benchmarked and compared by training them on small images from the ImageNet data set. After this pretraining, they can be used for applications outside of image classification, such as object detection, panoptic vision, semantic segmentation, or other specialized tasks. This can be done by using them as a backbone in an end-to-end application-specific Neural Network and finetuning the resulting network on the appropriate data set and application. A typical ResNet-style CNN is given in Figure 1-1 and Figure 1-4 (a). Typically, such networks share several features: they interleave or stack 1×1 and k×k convolutions to balance the cost of convolutions with building a large receptive field; training is stabilized by using batch normalization and residual connections; feature maps are built hierarchically by gradually reducing the spatial dimensions (W,H), finally downscaling them by a factor of 32x; and feature maps are built pyramidally, by increasing the embedding dimensions of the layers from the range of 10 channels in the first layers to 1000s in the last.
Figure 1-1: Illustration of ResNet34 [3]
Within these broader families of backbone networks, researchers have developed a set of techniques known as Neural Architecture Search (NAS)[7] to optimize the exact parametrizations of these networks. Hardware-Aware NAS methods automatically optimize a network’s latency while maximizing accuracy, by efficiently searching over its architectural parameters such as the number of layers, the number of channels within each layer, kernel sizes, activation functions and so on. So far, due to high training costs, these methods have failed to invent radically new architectures for Computer Vision. They mostly generate networks within the ResNet/MobileNet hybrid families, leading to only modest improvements of 10-20% over their hand-designed baselines[8].
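As a quick illustration of the bottlenecked, residual structure described above (interleaved 1×1 and 3×3 convolutions, batch normalization and a residual connection), here is a minimal PyTorch sketch; the channel counts are arbitrary examples, not the exact ResNet34 configuration shown in Figure 1-1.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """ResNet-style bottleneck: 1x1 reduce -> 3x3 spatial -> 1x1 expand, with
    batch normalization and a residual connection stabilizing training."""
    def __init__(self, channels: int, bottleneck: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(x + self.body(x))   # residual connection

x = torch.randn(1, 256, 56, 56)
print(BottleneckBlock(channels=256, bottleneck=64)(x).shape)   # torch.Size([1, 256, 56, 56])
```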
Fabrizio Del Maffeo | CEO at AXELERA AI

I met Marian Verhelst in the summer of 2019, and she immediately struck me with her passion for and expertise in computing architecture design. We started collaborating right away, and today she is here with us, sharing her insights on the future of computing.

Fabrizio: Can you please introduce yourself, your experience and your field of study?

Marian: My name is Marian Verhelst, and I am a professor at the MICAS lab of KU Leuven[i]. I studied electrical engineering and received my PhD in microelectronics in 2008. After completing my studies, I joined Intel Labs in Portland, Oregon, USA, and worked as a research scientist. I then became a professor at KU Leuven in 2012, focusing on efficient processing architectures for embedded sensor processing and machine learning. My lab regularly tapes out processor chips using innovative and advanced technologies. I am also active in international initiatives, organising IC conferences such as ISSCC, DATE, ESSCIRC, AICAS and more. I also serve as the Director of the tinyML Foundation. Most recently, I was honoured to receive the André Mischke YAE Prize[ii] for Science and Policy, and I have been shortlisted for the 2021 Belgium Inspiring Fifty list[iii].

F: What is the focus of your most recent research?

M: My research currently focuses on three areas. First, I am looking at implementing an efficient processor chip for embedded DNN workloads. Our latest tape-out, the Diana chip, combines a digital AI accelerator with an analogue compute-in-memory AI accelerator in a common RISC-V-based processing system. This allows the host processor to offload neural network layers to the most suitable accelerator core, depending on parallelisation opportunities and precision needs. We plan to present this chip at ISSCC 2022[iv].

The second research area is improving the efficiency of designing and programming such processors. We developed a new framework called ZigZag[v], which enables rapid design space exploration of processor architectures and algorithm-to-processor mapping schedules for a suite of ML workloads.

My last research area is exploring processor architectures for beyond-NN workloads. Neural networks on their own cannot sufficiently perform complex reasoning, planning or perception tasks. They must be complemented with probabilistic and logic-based reasoning models. However, these models do not map well onto CPUs, GPUs, or NPUs. We are starting to develop processors and compilers for such emerging ML workloads in my lab.

F: There are different approaches and trends in new computing designs for artificial intelligence workloads: increasing the number of computing cores from a few to tens, thousands or even hundreds of thousands of small, efficient cores, as well as near-memory processing, computing-in-memory, or in-memory computing. What is your opinion about these architectures? What do you think is the most promising approach? Are there any other promising architecture developments?

M: Having seen the substantial divergence in ML algorithmic workloads and the general trends in the processor architecture field, I am a firm believer in very heterogeneous multi-core solutions. This means that future processing systems will have a large number of cores of very different natures. Eventually, such cores will include (digital) in- or near-memory processing cores, coarse-grained reconfigurable systolic arrays and more traditional flexible SIMD cores.
Of course, the challenge is to build compilers and mappers that can grasp all the opportunities offered by such heterogeneous and widely parallel fabrics. To ensure excellent efficiency and memory capabilities, it will be especially important to exploit the cores in a streaming fashion, where one core immediately consumes the data produced by another.

F: Computing design researchers are working on low-power and ultra-low-power designs, using metrics such as TOPS/W as a key performance indicator and low-precision networks trained mainly on small datasets. However, we also see neural network research increasingly focusing on large networks, particularly transformer networks, which are gaining traction in field deployment and seem to deliver very promising results. How can we reconcile these trends? How far are we from running these networks at the edge? What kind of architecture do you think can make this happen?

M: There will always be people working to improve energy efficiency for the edge and people pushing for throughput across the stack. The latter typically starts in the data centre but gradually trickles down to the edge, where improved technology and architectures enable better performance. It is never a story of choosing one option over another. Over the past years, developers have introduced increasingly distributed solutions, dividing the workload between the edge and the data centre. The vital aspect of these solutions is that they need to work with scalable processor architectures. Developers can deploy these architectures with a smaller core count at the extreme edge and scale up to larger core counts for the edge and a massive core count for the data centre. This will require processing architectures and memory systems that rely on a mesh-type distributed processor fabric, rather than being centrally controlled by a single host.

F: How do you see the future of computing architecture for the data centre? Will it be dominated by standard computing, GPUs, heterogeneous computing, or something else?

M: As I noted earlier, I believe we will see an increasing amount of heterogeneity in the field. The data centre will host a wide variety of processors and differently-natured accelerator arrays to cover widely different workloads in the most efficient manner possible. As a hardware architect, the exciting and still open challenge is what library of (configurable) processing tiles can cover all workloads of interest. Most intriguing is that, due to the slow nature of hardware development, this processor library should cover not only the algorithms we know of today but also those that researchers will develop in the years to come.

As Scientific Advisor, Marian Verhelst advises the Axelera AI Team on the scientific aspects of its research and development. To learn more about Marian's work, please visit her biography page.

References
[i] https://www.esat.kuleuven.be/micas/
[ii] https://yacadeuro.org/fifth-edition-of-the-annual-andre-mischke-yae-prize-awarded-to-marian-verhelst/
[iii] https://belgium.inspiringfifty.org/
[iv] https://www.isscc.org/program-overview
[v] https://github.com/ZigZag-Project/zigzag